Hashing in detail – Part Two

Introduction

This is the second post in Hashing – Online Classes.

This is in continuation of the article Hashing in detail – Part One, in which we understood the meaning of the terms hash, hash code and hashing. We established a better understanding of the data structure that can be used to store objects using hashing.
Also, we found that storing, searching and fetching objects from a hash table is almost a constant time operation, and hence it is an effective mechanism.

Purpose of the article

The article Hashing in detail – Part Two focuses on the challenges we face while employing hashing and visits the widely accepted solutions to them. We will also look into common usages of hashing in real world applications, analyze the effectiveness of different hash functions, and try to draw a line between a good and a not-so-good hash function. We may need to extend this into a third post covering advanced concepts in hashing.

What Next?

Continuing the discussion from the previous post, recall the hashCode function we used to generate hash codes for each Car object.
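As a sketch, a simple function that reproduces every hash code listed below is to scale the registration number by 31/100 using integer arithmetic. This is a hypothetical reconstruction, not necessarily the original code from Part One; the class and method names are illustrative:

```java
// Hypothetical reconstruction (an assumption, not the original code):
// floor(registrationNumber * 31 / 100) reproduces the hash codes below.
class Car {
    final int registrationNumber;
    final String color;

    Car(int registrationNumber, String color) {
        this.registrationNumber = registrationNumber;
        this.color = color;
    }

    // Assumed hash function: pure integer arithmetic, no rounding surprises.
    int carHashCode() {
        return registrationNumber * 31 / 100;
    }

    // Index into the 10-slot table, as used below: integer division by 10.
    int tableIndex() {
        return carHashCode() / 10;
    }
}
```

For example, `new Car(54, "Red").carHashCode()` yields 16, matching Car_1 below.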

Below are the hash codes for the first five cars,
Car_1 : Registration Number = 054, Color = Red , HashCode = 16
Car_2 : Registration Number = 074, Color = Blue , HashCode = 22
Car_3 : Registration Number = 100, Color = Green, HashCode = 31
Car_4 : Registration Number = 151, Color = White, HashCode = 46
Car_5 : Registration Number = 180, Color = Black, HashCode = 55

Now let us add some more cars,
Car_6 : Registration Number = 200, Color = Red , HashCode = 62
Car_7 : Registration Number = 250, Color = Blue , HashCode = 77
Car_8 : Registration Number = 160, Color = Green, HashCode = 49
Car_9 : Registration Number = 111, Color = White, HashCode = 34

As we can see, the hash codes are 62, 77, 49 and 34, and the respective indices where we must store these hash codes in the hash table (using integer division by 10) are 62/10 = 6, 77/10 = 7, 49/10 = 4 and 34/10 = 3.

Now we try to add these cars to the hash table as per our logic: the hash code for Car_6 goes to index 6 and Car_7 goes to index 7, but Car_8 should go to index 4, which is already occupied by the hash code of Car_4.

This situation is called a collision. We cannot store the hash code for Car_8 in the hash table because the slot is not empty, so we need to find a way to handle this situation.

Collision Handling

Below are a few ways to handle collisions. We will spend a good amount of time discussing them, because they are very important for understanding the advanced concepts of hashing that will come in later posts.

1. Chaining

For each index of the table we can maintain a linked list, to which we add a node whenever there is a collision at that index. So, if three cars map to index 1, the linked list at index 1 will have three nodes.
If we employ this technique then how do we search and fetch?

Once the cars are stored in the hash table using chaining, we follow the same technique at the time of searching or fetching: find the hash code, divide it by 10 to get the index, and look at that index. If there is more than one node, traverse the nodes and compare the properties of the car you are searching for with the car present at each node (this is essentially a linear search through the linked list at that index).
The diagram below describes the hash table in detail:
[Figure: hash table with chaining – each occupied KEY slot points to a linked list of Car entries]
If we closely examine the hash table, we can see that at each index, whenever there is an entry in the KEY column, there is a linked list (with one or more nodes) associated with the VALUE column.

For example, the value for index 3 is a linked list with two nodes, and the value for index 1 is a linked list with one node. Each node of a linked list stores information about the actual storage of the respective Car object.
This whole process is called chaining.
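The steps above can be sketched in a few lines. This is a minimal illustration, assuming the 10-slot table and the registration-number hash codes used earlier; to keep it self-contained the keys are plain registration numbers, and all names are illustrative:

```java
import java.util.LinkedList;

// Minimal chaining sketch: each slot holds a linked list, and lookup
// is a linear search within the chain at one computed index.
class ChainedHashTable {
    private final LinkedList<Integer>[] slots;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        slots = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            slots[i] = new LinkedList<>();
        }
    }

    // Assumed hash function from earlier: floor(reg * 31 / 100).
    private int hashCodeFor(int reg) {
        return reg * 31 / 100;
    }

    // Index rule from the article: integer division of the hash code by 10.
    private int indexFor(int reg) {
        return hashCodeFor(reg) / 10;
    }

    void put(int reg) {
        slots[indexFor(reg)].add(reg);   // a collision simply extends the chain
    }

    // Linear search within the chain at the computed index.
    boolean contains(int reg) {
        for (int stored : slots[indexFor(reg)]) {
            if (stored == reg) return true;
        }
        return false;
    }
}
```

With this sketch, Car_4 (151, hash 46) and Car_8 (160, hash 49) both land at index 4 and simply coexist in the same chain.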

Analysis for Chaining

What can eventually go wrong in any hashing mechanism?
The answer is very simple: the hash function is the key to hashing, and everything revolves around it. Now let us assume that we have a hash function which always returns the same value, no matter what parameter we pass to it.

In this case all the hash codes would be the same, and they would all map to a single index in the table. The linked list associated with that index would keep growing, and we would end up storing all the elements in a single linked list. This is as crude as the scenario we discussed at the beginning of the article Hashing in detail – Part One.

The hash table would more or less look like below:
[Figure: hash table where every car has hash code 31 and chains at index 3]

All the cars have the same hash code, i.e. 31, and they all map to the same index, which is 3. The linked list at index 3 grows to the size of the total number of cars, so all the cars are stored at the same index.

If we want to fetch the car object with registration number 111, we find the hash code for 111, which is 31. The index for hash code 31 is 31/10 = 3, so we look at index 3 in the hash table, pick each node, get the object from it, and compare the object’s registration number to 111. If it matches we return the object, else we keep moving forward till we reach the end of the list. A very inefficient mechanism, I must say. This is the worst that can happen, so let’s call it the worst case scenario.

What would be a better case?

A better or average case would be when each key is equally likely to hash to any slot in the table, independent of where the other keys are hashed.
In this case there would be linked lists associated with almost every index of the hash table; we can say that the hash codes are evenly distributed, and we can exploit the best of hashing.

In the above discussion we concluded that the choice of a hash function is very important. Our hash function should generate hash codes in a sufficiently random manner, which means that for different sets of parameters supplied to it, it must generate different hash codes.

How big should our hash table be?

A very common question which arises here is: how big should our hash table be? Is it really required to bother about the size of the table? Is there a way to find it out? The next section is all about finding answers to these questions.

Well, if we want our searches in a hash table to be fast, we certainly do not want the individual linked list at each index to grow beyond a certain limit (if it does, search degenerates into an inefficient linear search). So we need to keep a check on the lengths of the linked lists, which narrows the problem down to making the table sufficiently large so that it has enough room for all the elements we want to store.

Does this mean we need to decide in advance how many elements we need to store?
The answer to this question is a Yes and a No and I reserve this discussion for a later time.
We must have a table with at least as many rows as the number of elements we plan to store. Secondly, we need a good hash function which theoretically guarantees an even distribution of hash codes across the indexes. Now is the right time to introduce a new term: the load factor.

Load Factor: This is defined as the ratio of the total number of keys stored to the total number of slots in the hash table. So if there are M slots and n keys to be stored, then the load factor α = n / M. Examined closely, α can also be described as the average number of keys per slot.

Two cases arise: α > 1 and α < 1.

If α > 1, the average length of the linked list associated with one index of the table is α. To search for an element in the hash table, we spend a constant amount of time calculating the hash code and finding the mapped index, and then we search through a linked list of average length α. Hence the total running time of a search is Θ(1 + α).

If α < 1, searching for an element in the hash table takes constant time, because the linked list at each index has at most one node on average.

You may have noticed that throughout this article I said that searches in a hash table take almost constant time; I never claimed that they take exactly constant time. This is deliberate: the running time of a search actually depends on α, it is not just constant. So let us make a statement in bold: “Hash tables do not guarantee constant time searches; it totally depends on α.”
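To put illustrative numbers on this (the values below are assumptions, not from the article): with M = 10 slots and n = 25 keys, α = 2.5, so an average search touches about 1 + α = 3.5 entries:

```java
// Illustrative arithmetic for the load factor alpha = n / M and the
// Theta(1 + alpha) search cost; the numbers used are assumptions.
class LoadFactorDemo {
    static double loadFactor(int n, int M) {
        return (double) n / M;
    }

    static double averageSearchCost(int n, int M) {
        // one constant-time hash step, plus a chain scan of average length alpha
        return 1 + loadFactor(n, M);
    }
}
```

For n = 25 and M = 10 this gives α = 2.5 and an average search cost of 3.5.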

We will conclude this article with one more discussion: how to choose a hash function. Please read the next section.

How to choose a good hash function

Below are the expectations from a good hash function:
1) It should distribute keys uniformly into slots.
2) Regularity in key distribution should not affect uniformity.

Our hash function can implement division logic and always return a remainder, so we can use h(k) = k % z, where z has to be chosen carefully.

A few tips:
We must not pick a z with a small divisor d. For example, if d = 2 (or any other even number) and all the keys happen to be even, then the slots with odd indexes will remain vacant forever, and there will be an extra load on the even indexes.

Also, we must not choose z = 2^r, because then the generated hash code will not depend on all the bits of the key: k % 2^r simply returns the last r bits of the number. So if a lot of keys share the same low-order bits, they will again hash to slots with the same index.
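This pitfall can be seen directly. In the sketch below (the keys and r = 4 are illustrative), any two keys that share their last four bits always collide:

```java
// Sketch: with z = 2^r, the remainder keeps only the key's low r bits,
// so keys sharing those bits always land in the same slot.
class PowerOfTwoPitfall {
    static int hash(int key, int r) {
        return key % (1 << r);   // identical to key & ((1 << r) - 1)
    }
}
```

For example, 180 (binary 10110100) and 52 (binary 00110100) share their low 4 bits, so with z = 16 both hash to 4.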

So always try a z which is not very close to a power of 2 or 10. Moreover, division takes a lot more computation than addition or multiplication on most processors, so we can instead employ a hash function which uses multiplication logic.

For example, h(k) = (A * k mod 2^w) >> (w - r) can be a good hash function, where w is the word length of the processor (e.g. 32-bit, 64-bit): we compute the one-word product A * k, which is automatically reduced mod 2^w, and keep its top r bits as the slot index. Here A should be an odd number with 2^(w-1) < A < 2^w, not close to either of those powers of 2. Also, if possible keep w at 32 or below, because most 32-bit processors will do this processing faster on a single 32-bit word.
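This is convenient to sketch in Java, where int arithmetic is already performed mod 2^w with w = 32. The constant A below (0x9E3779B9, odd and not close to a power of 2) is an assumed example value, not one given in the article:

```java
// Sketch of the multiplication method with w = 32:
// h(k) = (A * k mod 2^w) >> (w - r), keeping the top r bits.
class MultiplicativeHash {
    static final int A = 0x9E3779B9;   // odd; between 2^31 and 2^32 as unsigned

    // Assumes 1 <= r <= 31 (Java shifts by 32 would be a no-op).
    static int hash(int k, int r) {
        // Java int multiplication is naturally mod 2^32; the unsigned
        // shift then extracts the top r bits as the slot index.
        return (A * k) >>> (32 - r);
    }
}
```

With r = 4 this yields an index in the range 0..15, suitable for a table of 2^4 slots.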

Conclusion

We will wrap up this article here and continue in the next post with another method of resolving collisions: open addressing.
Stay connected and stay subscribed!

  • Byte me

    one question.. in the hash function h (k) = (A * k % 2 raised to [w-r]) what is the reason behind the word length of a processor (w) ??

    • dharam

      Hi,

      There is no hard and fast rule for this, but a point worth noting is that taking the modulus of any number by 2^x returns its last x bits.

      If your computer’s word length is w and you use something greater than w in the power of 2, the calculation might extend to two words, and that will take more time, because it is easier to do a one-word calculation than a two-word calculation.

      • Byte me

        oh yeah, clever… thanks for the explanation.
