Hashing in detail – Part One

Introduction

This is the first post in Hashing – Online Classes.

I always wanted to write this article after writing couple of articles on sorting algorithms. It doesn’t make any difference, but of course it helps to establish a flow to the reader. There is a lot to understand and learn about hashing and I do not consider it good to throw Hashing on someone as the first article.
Please read any of my articles in the Sorting Series before reading further, so that a mindset is properly developed.

Purpose of the Article

The article Hashing in detail – Part One might not end up in just one article, so we will try to understand the meaning of the word hashing in this article, we will try to find the reason why hashing is important and we will build some concepts to make our understanding about hashing better. We will also try to find out the running time of operations like (storing, searching and fetching) on a hash table. If the length of the article permits, we might dwell into advanced hashing concept like chaining, linear probing, quadratic probing and bi-quadratic probing etc.

What is Hashing?

Let us suppose that we have thousand objects; we want to store them for future reference. All these objects can be assumed to be of same kind, i.e. either they are instances of the same class, extend the same abstract class or may be implement the same interface.

So, what are the options for us to store them?

We can store them in a continuous memory location (arrays) and every time we want to fetch/search an object with given set of properties, we start from the beginning of the array and keep matching each object of the array with the properties of the object we need to fetch.
The same would happen, no matter what basic data structure we choose, if we tend to choose a linked list, we still need to start at the head of the linked list and traverse till we find the element (for a successful hit) or till the end of the linked list.

This seems to be a perfect approach, I did not see a flaw in the beginning, definitely when I want to search something in an array or a linked list, I have to examine all the elements to get a hit. Now think of a case where we have 100,000 elements and the element we want to find lies in the end of the linked list. We might end up spending a lot of comparisons, which will be approximately equal to 100,000.

Does it sound correct now?

Every time we want to fetch an element there is a possibility of getting it either in the first comparison or in the 100,000th comparison or some un-predictable time. This might work good if we are using a super computer, but again there would be a time when the number of elements would be too huge to iterate every time if we need search operation very frequently.

**Note : This is a deviation from the topic, but I want to seed the idea of a different approach which mostly works at least four times faster than sequential searching. This is called move to front heuristic (MTF) and it can be easily implemented in a linked list.

The idea behind MTF: It is assumed that a localized search environment will regularly have the same set of search queries, which means if I am trying to search something then it is very much possible to search similar things again and again. Hence the MTF handles this scenario by placing the recently searched items in the beginning of the list. More on this in another article.

And here the traditional storage approach fails. We need to find something which can be more efficient and more responsive in terms of turnaround time for a search, add or delete.

Why is hashing important?

Now that we have identified the problem, we can try to think for a solution. Before we move further, let’s spend some time in understanding hash. Hash or hash code, basically is a number which is calculated using a function called hash function. Nothing peculiar about the hash word, you can use any word to describe it but it is famous with the name hash. So if you want to convey it to someone, it’s better to stick with the same coined term.
Hence, hash function is a simple/complex mathematical function which will accept few parameters, apply some mathematical logic on the parameters and returns a number which is also called the hash code. We will put one more condition on the hash function which is quite important; it should always return the same hash code for the same set of parameters.

Let’s try to solve our problem which we described above after we have known the hash code and hash function. The idea is to use a table with two columns ( KEY and VALUE ), where in for each row the KEY contains the hash code of an object and VALUE contains a pointer to the memory location where the object is stored (in a more casual way we say, the VALUE contains the information about the object’s location). Also, we have some mechanism to map the indexes of the table with the hash codes generate by the hashCode method.
With this setup if we try to store some objects, we will first invoke the hashCode method to generate a hash code for the object and then put the key in the KEY column at some index in the table (using our mapping logic) and the VALUE column will contain the information about where the object is stored.

The below illustration might explain it better, let us say, our hashCode function is

This function will return an integer which is ( 31 / 100 ) times the car’s registration number. Now let’s have five Cars which are to be stored in the hash table for future references.

Car_1 : Registration Number = 054, Color = Red
Car_2 : Registration Number = 074, Color = Blue
Car_3 : Registration Number = 100, Color = Green
Car_4 : Registration Number = 151, Color = White
Car_5 : Registration Number = 180, Color = Black
Let’s calculate the hash code for all the cars and then store it in the table.
Car_1 : HashCode = 16
Car_2 : HashCode = 22
Car_3 : HashCode = 31
Car_4 : HashCode = 46
Car_5 : HashCode = 55

We employ a very simple logic to map the indexes of the table with the hash code. We divide the hash code by 10 and whatever the result is, we store the hash code at that index.
For e.g. the hash code for Car_1 is 16, so we store it at the index 16/10 = 1 [integer division], similar for Car_2 it would be 22/10 = 2 and so on.
The hash table would look something like below:
hash_table

Let’s do the same exercise of fetching one of the objects from the hash table. Suppose we want to fetch the car with registration number 151. We follow the same process which we followed while storing the Car object:
1) Calculate the hash code for 151, it would be equal to 151*31/100 = 46.
2) Calculate the index for the hash code 46, the index would be 46/10 = 4.
3) Look at index 4 of our hash table and get the value column which contains the pointer4.
4) So we can know the location of the object and retrieve it by simple memory access.

How much time the retrieval took and is it dependent on the size of the list of objects we stored?
The answer to the question is, it is theoretically independent of the size of the list of objects, and it took almost constant time to retrieve the object. This is any day better than the traditional approach of storing objects in the array or linked list if we want frequent retrievals.

Conclusion

In this article we learnt the basics of hashing and the importance of hash code and using hashing as a technique to store objects. We also learnt that the retrieval from hash tables is better than the retrievals from the linked list or arrays.

In the next article we will learn more about hashing techniques, collisions and collision handling.

For more details please check Wiki
Stay connected and stay Subscribed