Tries for Text Processing

Introduction

Now that we have talked so much about efficient ways of pattern finding using Boyer Moore and KMP algorithm. Let us try and understand other aspects of Text Processing as well.

This post will mostly focus on using tries for text processing, storing text efficiently and also discuss about data structures which may make more sense when we talk about storage and retrieval of text based on some given criteria.

Here are some real world problems which often demand quicker answers:

  • Ever wondered why you are able to search a word in a dictionary so fast? How do you think one would design a dictionary in a program? At most how many comparisons you do before successfully finding or unsuccessfully rejecting a word from the dictionary?
  • In a large document, I want to search all the words which start with a given prefix. May be replace all forms of a given verb with another verb. For e.g. in the document I might find that want is not the proper word in all its form, perhaps the word need makes more sense. In such a scenario, I would like to replace the prefix want with need.

Tries – Understanding the Data Structure

So here is how a Trie would more or less look like. It is advisable to talk about a Trie in context of an alphabet, because in a Standard Trie, the operations would depend on the size of the alphabet. For demonstrations in this post we will consider the English Alphabet for our Trie.

Tries for Text Processing

Trie data structure

The trie basically is an extension over the trees. You must be wondering why the name trie. No it is not related to the word tree for its nomenclature, it comes from the word retrieval. This data structure is mostly used for the purpose of building a dictionary which in turn is used for retrieving words, hence the name.

Let us say that our alphabet is represented by the Greek symbol ∑ and the size of the alphabet (number of characters in the alphabet) is given by |∑| . Also, let us say that we have a collection S which contains |S| strings. For simplicity let us define the convention where |S| = n and |∑| = d. Then the trie must hold the below properties:

  • The root node of the trie is an empty node.
  • The leaf nodes of the trie are nill nodes, which mark the end of one valid string. This ensures that if a string is also a substring of another string, they must be independently traceable.
  • All other nodes of the trie would contain one character each and the character must be within the range of our chosen alphabet.
  • The number of children in each node of the trie could range from 0 to d.
  • One path from the root to the leaf gives one valid word from the set S.
  • If an internal node has more than one children then it is a good practice to order them in their canonical ordering.

A fancy thing which is obvious from the above diagram is that, we can easily find all prefixes of length 0 to x to a given word where the length of the word is x. So, you can also call this a prefix tree, of course after certain modifications.

Note : You must understand that Tries independently would not prove much worth alone except for implementing a dictionary. For advanced text processing, we must use them with some other primary data structure which can store the complete document.

Here is a possible node structure for a node in the Trie.

Drawbacks of Tries

The above diagram for the Trie has a lot of nodes with just one child. This will usually happen in a Trie for any language containing sufficiently large number of characters. The total number of nodes in this Trie is proportional to the sum of total length of all the words we need to store. so we can say that the space required is O(kn), where k is the average length over the set S of strings we have to store in the Trie.

Also, talking about searching a word in the Trie, at each level we have to do d comparisons to arrive at the correct character and then we have to perform this for k levels where the length of the word to be searched is k.

Hence average running time of a search is O(dk).

Compressed Trie

Due to the above drawback, we can choose to create our Trie in a different way. We can alternatively shrink such nodes and combine them into one node which represents the suffix of the particular string. Here is an example.
Tries for Text Processing

This one looks much more shallow compared to the last one and it significantly reduces the number of nodes required to store the same set of strings. Also, we can probably remove the nill nodes if the Trie is smartly created. It can be mathematically proven that the upper bound on the number of nodes required for this Trie is of the order |S|. It depends on a lot of probability, but it is certainly a lot less than O(dk).

The following node structure will be sufficient for what we just discussed:

This can be called as a Compressed Trie.

Conclusion

Another interesting case about the Trie would be a binary system, where the alphabet contains only two characters 0 and 1. This Trie certainly is a Binary Tree and we may use it for some wonderful purpose later.

We will also talk about the applications of Compressed Tries and other text processing techniques in the coming posts.

Don’t forget to subscribe to TechieMe to get updates on latest posts.