Building an autocomplete system using Trie

Introduction

Few days back I wrote a post about the Trie data structure. This post is a continuation to it, and intends to introduce some practical problems which Trie solves. An autocomplete system is a perfect use case.

Many of the websites use this feature to help there users with suggestions & autocomplete. Also, in this post we will define a Trie interface and then work through this problem statement.

An autocomplete system using Trie

As we know from the previous text that a Trie is a tree like data structure which stores words such that the search for a word is proportional to the length of the word. Imagine a situation where a user is typing a word and he is mid way through. If you want to display all possible suggestions which he could have meant, then you need to find all the  words which starts with that prefix.

This is one class of problems which Trie is meant to solve. When the user starts typing, we initialize a character buffer which serves as a prefix. And every time there is a new character we append it to the prefix and search our Trie for the prefix. If the prefix exists, we return all the words followed by that.

Implementing  Trie

For simplicity, we will consider our alphabet to be from [A-Z] all caps. Also our dictionary is a list of few cities in India. We will ensure that there is no space or special character and the valid words contain characters from the alphabet.

Trie class

The features we are planning to implement are

  • addWord(String word) – given a word, add it to the Trie.
  • wordListForPrefix(String prefix) – given a prefix, return the list of words starting from the prefix.

The class definition will look like below:

To support this class definition, we need a TrieNode data structure as shown below:

This data structure is very important.

  • Notice the ALPHABET_SIZE which is 26 , that means it will support characters from a-z
  • The boolean property terminal, which marks that a word ends at this node.
  • The label, stores the character stored at that node.
  • If this node is a terminal node, then the node also stores the word which ends here.

Here is our sample Trie

tries

Adding a word to the Trie

Here is the algorithm to add a word to the Trie

  • Start from the root of the Trie
  • For the first character in the word, find if there exist a child node to the root containing that character.
  • If no node is present, then add a node with the first character of the word, and subsequently add child nodes for the following characters.
    • For e.g. If AGRA was the first word to be added, we add the node with A as a child of the root and subsequent children for G, R and A
  • If the node is found, move to the next character in the word and the child node in the Trie.
  • Repeat this process until either the word is exhausted or we reach to a null child node.
    • For e.g. In an attempt to add AJMER after adding AGRA, we start from the root, and discover that the node with label A exists. Where A is the first character in AJMER.
    • Then we move to the node A and try to find J in its children list. But we do not find J, instead if we try to fetch a node with J which is the child of node A, we will get a null node.
  • If reached a null child and there is still part of the word remaining, just add it child nodes  one after the other. SO J becomes the child of A, M gets to be the child of J and so on.

Source Code

Fetching word list for a prefix

The algorithm to fetch the word list is almost similar to the one for adding a word

  • Traverse through the Trie to find the node which stores the last character of the prefix.
  • Two cases may occur
    • Either the prefix gets exhausted
    • We reach a null node.
  • If we reach a null node then there are no words with this prefix so return
  • If the prefix gets exhausted at a node TN, then the whole sub tree of TN qualifies for the result.
  • Print all the paths to all leaves from this node. (You can use any technique to print all paths from a given node, here we will use a queue.

Here is a pictorial representation for the same.

Source Code

Autosuggesting

To make the autosuggest work, we need to add all the words from the dictionary to the Trie. You can do that by repeatedly invoking the addWord method. Once the Trie is fully built, we need to support auto suggest whenever a user types something.

Take the input, convert it into String and invoke the wordsByPrefix method. Show all the words returned from this method.

Summary

This is a very efficient method used for auto suggesting. The auxiliary space requirement is constant as we do not use any extra space. The run time complexity of adding a word is proportional to the length of the word to be added. We can safely assume that a word is not more than 25 characters, so it is constant time work. Or if you want to say it L as the max length of the word, then the addWord runtime is O(L).

Adding N words in the Trie would take O(LN) time.

Finding all words by prefix will take O(P) * O(S) where S is the sum of lengths of all suffixes returned and P is the length of the prefix supplied.

  • Ram Parashar

    We can save some space by using Suffix Trees instead of Trie.

  • Aurélie Fakambi

    Hello,

    Thanks for this post.

    Is it complicated to extend it to 52 letters ? Meaning that I would like to include lower case too.

    Thanks

    • Its very simple, just use ALPHABET_SIZE = 52 and take care of temp.children[cArray[index] – ‘A’]; in a case sensitive manner

  • Siddhartha Mishra

    Not working for Strings with spaces , shows array index out of bound exception. Tried to modify ALPHABET_SIZE but it will affect the overall indexing , if possible provide appropriate solution for this .

    • Yes of course! It won’t work for strings with spaces. Reason : Tries are there to store word, the moment you start using spaces, they are no more words, but phrases.

      If an attempt is made to include phrases, then you loose the advantage of Tries. The data structure is used to store a dictionary, and search word in near constant time (assuming that an average word is no more than 15 characters and there are as many as 100,000 words in the English dictionary)

      • Siddhartha Mishra

        No man , it can even made to work for special characters and exceptions and using two data structures dequeue and list is not justified. Trying to optimise it will post the solution soon.

        • Yes sure! that definitely is possible. Its just that this is a beginners post and hence was kept simple to consider a word which can only contain characters from the English alphabet..

          Thanks for your effort. I would be happy to incorporate your gist link in the post once you are done optimizing..

          Regards