This is the third post in Hashing – Online Classes.
Please check the previous articles Hashing in detail – Part One and Hashing in detail – Part Two. In the first two articles we established a better understanding of the data structure which can be used to store objects using hashing, we also learnt about handling collisions and chaining. In “Hashing in detail – Part Three” post we will extend the discussion and wrap up basic concepts of hashing.
Purpose of the article
In the article Hashing in detail – Part 3 we will learn another collision resolution method called Open Addressing and we will also discuss probing strategies in detail.
As we learnt about the linear chaining there is a need for a linked list and there is an advantage and a disadvantage of using it. The advantage is that your table never gets exhausted, because you add upon to the linked list when there is a collision, the disadvantage is that with growing sizes of various linked lists, it becomes tough to add/search elements.
Hence, we look for a better option called Open Addressing, in this technique we would not store the objects in the linked list. If the hash function generates a hash code which maps to a already occupied slot then we hash it with a different function, and keep doing this until we get a vacant slot.
Also, few may argue that even this is a cumbersome and ineffective way, I agree, it may prove to be in-effective only if we do not put a check on , the number of times we have to re-calculate a new hash to insert one object. To make it a effective method, it is generally considered good to allow a maximum of four or five re-calculations. If still we meet a collision, we must stop and re-hash the whole table.
Procedure for Open Addressing
The procedure is very simple, we probe the table systematically until we find a vacant slot, it would be better if I explain it with an example.
We have a different kind of hash function, something similar to the below:
public int hashCode(Car c, int probeSequence)
return (10 * probeSequence) + (31 * c.hetRegistrationNumber() / 100);
So it is similar to our previous hash function except for the fact that it adds up the hash code with the probe sequence. Now lets see how this works:
Consider the table below which has few cells populated with hash codes, now suppose we want to store a car object with hash code with registration number 111. We always start with a probeSequence 0 and move forward if a new probe is required. So we calculate the hash code for the car with registration number 111 and it comes out to be 34. Now we try to map it with the index in table and find that it maps to index 3 which is all ready populated.
H(111, 0) = 34.
Now we increase the probeSequence by 1 and re-calculate the hash code like below:
H(111, 1) = 44. which means that we can map it to index = 44/10 = 4, but the index 4 is also occupied, hence we increase the probeSequence by 1 and it calculates the below hash code:
H(111, 2) = 54. which means that we can map it to index = 54/10 = 5, and that is empty so we put the hash code in the KEY column. We say that the car object took two probes for insertion.
This is how we insert into hash tables using open addressing. If you see, then the subsequent probe maps to an index which is 1 greater than the previous index, this is called linear probing.
Consider performing a search on the hash table, it will use the same probe sequence as it used while insertion. If we are trying to search the car we just inserted, we will find H(111,0) and get 34 and hence we check at the index 3, when we compare the car at index 3 it doesn’t have the registration number 111. So we increase the probeSequence by 1 and find calculate another hash code by H(111, 1) = 44 which maps to index 4 but the car at index 4 doesn’t have the registration number 111. We try again, with a higher probeSequence and we re calculate the hash code by H(111, 2) = 54 which maps to index 5 and the car at index 5 has the registration number 111, and out search is over.
The big problem
What possibly could go wrong? What if we allow deletion from the hash table as well? Assume that we have already completed the above steps, now we removed the car at index 3, which makes the index 3 vacant.
If I want to find the car with registration number 111, I calculate the hash code using H(111, 0) = 34. It maps to index 3 which is vacant, how do I know if this is an unsuccessful search (the car is not present in the hash table) or the car might be present at a different location due to a higher probeSequence?
We can’t tell this with 100% guarantee, hence we need a solution. The proposed method is to keep a track of the highest probeSequence which was used during insertion, and we do not conclude an unsuccessful search until we have completed number of probes equal to the highest probeSequence.
This means, if various elements were inserted in the hash table each with probe p1, p2 … pn such that 0 <= p1 < p2 < ... < pn then while searching any element in the hash table, we cannot conclude if the element is present in the hash table or not, until we do pn probes.
Also, there is another problem with linear probing, we can see cluster of keys getting accumulated in a particular section of the table, which will make insertions much harder when the generated hash codes are in a small range of number. It will un-necessarily force to use more number of probes even if another section of the table is completely empty. This is also called linear clustering.
To resolve the issues with linear probing and to get rid of the linear clusters, quadratic probing is considered as an alternative, In this case, the hash function generates a hash code which instead of mapping with the subsequent index, maps with an index which is i greater than the previously mapped index, where i is the probeSequence.
For e.g. if H(Key1 , 0) maps to an index 4 in the table, then H(Key1, 1) maps to (4+1) = 5 , H(Key1, 2) maps to (5+2) = 7 , H(Key1, 3) maps to (7+3) = 10 and so on. Yes it will also end up in a different kind of clustering called quadratic clustering. But still it is far better than the linear clustering and hence a better way of doing it.
All the above methods are improving over the previous ones, for further improvement in hashing we can look upon the method called Double hashing, which precisely uses two hash functions and the calculation may look like the below:
H(Key, probeSequence) = h2(Key) + probeSequence * H2(Key) modulus m.
If you closely observe, then we can easily make this clear that for each key only h2(Key) and H2(Key) has to be calculated just once, after which we can just replace it and multiply by the probeSequence, no matter what is the probeSequence.
In the above equation we generally pick m as a power of 2 (for explanation on the power of two, you car read the previous articles) and prefer H2(Key) to be odd.
A minimum of one probe is necessary to search an element, without having a single probe we can say nothing about the existence or non-existence of the element in the hash table. Now, what is the probability of getting a collision if there are N elements in M slots?
The answer is N/M, because there is just N elements so N slots are filled and chances of getting into those N slots out of M slots is N/M, so let us say that P1 = N/M.
Hence, if collision occurs we need to do a second probe. What is the probability of a collision while doing a second probe?
The answer is (N-1) / (M-1), because it will not probe to the same slot again and there are N-1 slots occupied , other than the present one. P2 = (N-1) / (M-1)
Similarly the probability of third probe leading to collision is P3 = (N-2) / (M-2) and so on..
Note that (N-i)/(M-i) < N/M [Reason : N < M hence subtracting the same number from both of them makes the ratio much smaller than the ratio N/M. So expected number of probes for a given search would be : = 1 + (N/M)[ 1 + (N-1)/(M-1) [ 1 + (N-2) / (M-2) [ 1 + .....) ] ] ] ] <= 1 + α [ 1 + α [ 1 + α [ 1 + .....] ] ] ] ] <= 1 + ( α * α ) + ( α * α * α ) + ..... <= 1 / (1-α ). [which is sum of geometric series, considering α to be less than one] So expected number of probes depends on α which is our load factor.
Effects of load factor on number of expected probes to search or insert an element
If α is a very small number, 0.1 or may be 0.01 which means the table has many empty slots because N/M is a small number where N is the number of elements and M is total number of slots.
Expected number of probes would be 1 / ( 1 – 0.1) = 1 / 0.9 which is nearly equal to 1. So one probe is needed to search or insert.
If α is 0.5, which means the table is half filled or half empty,
Expected number of probes would be 1 / ( 1 – 0.5) = 1 / 0.5 , which is equal to an average of two probes for each insert or search.
If α is 0.9, which means the table is 90% full, something like maximum utilization (which is not necessarily the best thing)
Expected number of probes would be 1 / ( 1 – 0.9) = 1 / 0.1, which is equal to an average of ten probes for each insert or search.
Here we wrap our Basics of Hashing in detail article, we might re-visit this topic to understand some more advanced concept at some point in time.
We learnt that α or the load factor plays a really important role in having a good hashing setup, never fill your tables to more than 50 or 60 percent, it will always give better results.
In java the load factor is around 0.75 which means that a maximum of four probes is needed for any search or insert in the java HashTable.
Thank you and happy learning. Stay connected and stay Subscribed