Count Word Frequency in Stream

Introduction

Continuing our discussions on Tries, here is another practical problem which is used in many text processing scenarios. Counting word frequency in stream translates to a scenario where you are given a list of words (dictionary). Now, there is an infinite stream of characters coming in. Your task is to print the frequency of words appearing in the stream.

To know more about the basics of building a Trie, please visit my previous post

Example Problem for counting word frequency in stream

Let us say we have a dictionary which contains the following words

  • aca
  • cat
  • hell
  • hello
  • lock
  • world

Also, here is the beginning of the infinite stream acacathellockword and characters keep on pouring in. For this example, we expect our program to produce the following result:

  • aca  – 2
  • cat – 1
  • hell – 1
  • hello – 1
  • lock – 1

Approach the problem

We can solve this problem in different ways. Here as a continuation of the Tries concept, we can use the data structure to solve this problem. Here is the basic idea.

  • Build a Trie from the given dictionary
  • Create a map to store word frequency
  • Define an empty queue Q
  • For each incoming character c
    • Add the root node of the Trie in Q.
    • Until Q is empty, remove a node from the queue say it N,
      • Check if N is a terminal node i.e. some word ends at N.
      • If N is a terminal node, add the word in a map or increment the count of the word in the map (if the word is already in the map)
      • Find the child node of N which has the character label c. If any, add it to a temporary queue R.
    • Once Q is empty copy all the elements from the temporary queue  R to the queue Q
    • Repeat the above process till the stream ends.
    • Anytime you print the map, you will find the frequency of all the words appeared till this point

Here is the Trie for the above dictionary

Word Frequency in Stream

The terminal flag is not marked here just for clarity of the diagram, but in our program, all these leaf nodes will have a terminal flag as true. Also the dictionary contains the world hell, so the second l will also have the terminal flag marked as true.

Why does this algorithm work?

Why do we add the root node in the queue every time?

For each character in the stream, there is a possibility that a word might start from the character. Hence, adding the root node will allow us to remove the root node from the queue further down in the algorithm step and allow us to test for direct children of root, for this character.

Why do we store all the nodes in the queue?

For a given character, it is important to know that which all words contain it in the chain. For example at the second occurrence of l , it is a part of two words here,  hello and lock. So, it is important to store two different instances of the node l, one which has a child o and ends in hello and other with a child o which is a part of the word lock.

So, next time if we get the character we will be able to fetch both the instances of nodes containing o. This will help us keeping a track of all possible words which may be present.

Isn’t it a overkill to store the nodes repetitively?

Not exactly, if the node is not leading to any valid word, in the next iteration it is anyways removed from the queue. Also, for every character, we only store the number of nodes in the previous iteration + a root node.

Source Code

Here is the code for counting/printing the words as they are formed by the incoming characters. The rest of the code for building a Trie (inserting the words from the dictionary) remains same as the previous post

Analysis

Let us find out the space complexity first.

The algorithm uses two queues to store the possible number of nodes, which can result into valid words. Ho many valid words can be there in the worst case. It would be the length of the stream, where each character till now is a possible candidate for a valid word in the dictionary. This might seem to be a huge number, because the stream is infinite.

But how many words can be there with more than 20 characters?

In a practical scenario and a non alien dictionary the space required won’t be more than 20, which means it is constant O(1)

Now the time complexity

The time complexity is simple to define. Here is an argument, if the character stream is infinite, then the outer loop will run for ever.

But that is not what we are measuring. Let us consider that in the middle of the stream when the queue size is at its maximum (which is 20, the lengthiest valid word in our dictionary). At this point, the inner loop runs for the length of the queue (which we proved to be of constant size). Hence, the loop runs for constant time. This also means that testing for a incoming character will take constant running time O(1).

Hence, for the complete stream, the run time is proportional to O(N), where N is the length of the stream.