Text Processing

Count Word Frequency in Stream

Introduction Continuing our discussions on Tries, here is another practical problem which is used in many text processing scenarios. Counting word frequency in stream translates to a scenario where you are given a list of words (dictionary). Now, there is an infinite stream of characters coming in. Your task is to print the frequency of words appearing in the stream. To know more about the basics of building a Trie, please visit my previous post Example Problem for counting word frequency in stream Let us say we have a dictionary which contains the following words aca cat hell...
Read More

Building an autocomplete system using Trie

Introduction Few days back I wrote a post about the Trie data structure. This post is a continuation to it, and intends to introduce some practical problems which Trie solves. An autocomplete system is a perfect use case. Many of the websites use this feature to help there users with suggestions & autocomplete. Also, in this post we will define a Trie interface and then work through this problem statement. An autocomplete system using Trie As we know from the previous text that a Trie is a tree like data structure which stores words such that the search for a word is proportional to ...
Read More

Huffman Decoding

Introduction Huffman coding is an encoding mechanism by which a variable length code word is assigned to each fixed length input character that is purely based on their frequency of occurrence of the character in the text to be encoded. It is common to assign shorter codes to more frequent characters and the less frequent characters are assigned longer code words. A code word is a binary string (a sequence of zeroes and ones). A Huffman tree is made for an input string and characters are decoded based on their position in the tree. The decoding process is as follows: We start from t...
Read More

Tries for Text Processing

Introduction Now that we have talked so much about efficient ways of pattern finding using Boyer Moore and KMP algorithm. Let us try and understand other aspects of Text Processing as well. This post will mostly focus on using tries for text processing, storing text efficiently and also discuss about data structures which may make more sense when we talk about storage and retrieval of text based on some given criteria. Here are some real world problems which often demand quicker answers: Ever wondered why you are able to search a word in a dictionary so fast? How do you think one w...
Read More

Pattern Finding – KMP Algorithm

Introduction Now that you are fully equipped with the Boyer Moore Algorithm and have a notion of Pattern Finding. I would suggest you to get deeper into pattern finding. You can read about the benefits of pattern finding in my previous post about Boyer Moore. This post will try to make you familiar with all the thought processes which you can use to exploit the known properties of texts and patterns using KMP Algorithm. However, Boyer Moore seems to be slightly intuitive, this one is a real geeky way of finding patterns. Knuth Morris Pratt Yes! this is what KMP stands for, you can learn ...
Read More