Huffman Decoding

Introduction

Huffman coding is an encoding scheme that assigns a variable-length code word to each fixed-length input character, based purely on how frequently that character occurs in the text to be encoded.

More frequent characters are assigned shorter code words, and less frequent characters are assigned longer ones. A code word is a binary string (a sequence of zeroes and ones).

A Huffman tree is built for the input string, and each character is encoded and decoded based on its position in the tree. The code word for a character is formed as follows:

  • We start from the root of the binary tree and search for the character.
  • A ‘0’ is added to the code word when we move left in the binary tree.
  • A ‘1’ is added to the code word when we move right in the binary tree.

Each leaf node contains an input character, and that character is assigned the code formed by the sequence of 0s and 1s on the path from the root to that leaf.
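The left-is-0, right-is-1 rule can be sketched in a few lines of Python. This is only an illustrative sketch: the Node class, the '*' marker for internal nodes, and the three-character tree are assumptions made for the example, not the tree used in this post.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    char: str                        # '*' marks an internal node (assumed marker)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_codes(node: Node, prefix: str = "",
                table: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Collect the code word of every leaf by walking the tree from the root."""
    if table is None:
        table = {}
    if node.left is None and node.right is None:
        # Leaf: the path taken so far is this character's code word.
        table[node.char] = prefix
        return table
    if node.left is not None:
        build_codes(node.left, prefix + "0", table)   # moving left adds a 0
    if node.right is not None:
        build_codes(node.right, prefix + "1", table)  # moving right adds a 1
    return table


# Hypothetical tree: A is the left child of the root, B and C sit one level deeper.
tree = Node("*", Node("A"), Node("*", Node("B"), Node("C")))
print(build_codes(tree))  # {'A': '0', 'B': '10', 'C': '11'}
```

The character nearest the root (A) gets the shortest code word, which is where a frequent character would be placed.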

Problem Statement

The task at hand is to perform Huffman decoding, i.e. to decode a given code word into the characters it encodes, using the given Huffman Tree. All the internal nodes of the Huffman Tree contain a special character which is not present in the actual input string.

[Figure: Huffman Decoding]

Right above is a Huffman Tree for a string where A appears thrice, both E and T appear twice, and B, M and S appear once each. Following are the Huffman codes for each of the characters:

  • A – 01
  • B – 000
  • E – 10
  • I – 11
  • S – 100
  • T – 101
  • M – 111

One important rule about Huffman codes is that no code word is a prefix of another code word. In other words, Huffman encoding is a prefix-free encoding. This follows from the tree itself: every character is located at a leaf, so the path to one character can never pass through the node of another character.
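The prefix-free property is easy to test for any code table. The sketch below is a minimal check in Python; the function name is_prefix_free and both sample tables are hypothetical and are not taken from the tree in this post.

```python
from typing import Dict


def is_prefix_free(codes: Dict[str, str]) -> bool:
    """Return True if no code word is a prefix of another code word."""
    words = sorted(codes.values())
    # In sorted order, a word that is a prefix of any other word is
    # immediately followed by a word that starts with it, so it is
    # enough to compare neighbours.
    return all(not later.startswith(earlier)
               for earlier, later in zip(words, words[1:]))


good = {"A": "0", "B": "10", "C": "11"}   # hypothetical prefix-free table
bad = {"A": "0", "B": "01", "C": "11"}    # '0' is a prefix of '01'

print(is_prefix_free(good))  # True
print(is_prefix_free(bad))   # False
```

A table that fails this check cannot come from a tree whose characters all sit at leaves, and a bit string encoded with it may decode in more than one way.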

Approach

Now we need to find the decoded string for a code word such as 01100110110110. Let us first do it manually, and then write an algorithm and code.

Because we know that it is a prefix-free encoding, we can start from the beginning of the encoded string and traverse the tree left or right as we encounter zeroes and ones. Whenever we end up at a leaf, we print its character and reset our pointer to the root of the tree. Pretty simple, right?

Let's try that out for our tree.

[Figure: Huffman Decoding]

Source Code
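The decoding loop described in the Approach section can be sketched as follows. This is a minimal Python sketch under stated assumptions, not a definitive implementation: the Node class, the '*' marker for internal nodes, and the small three-character tree are all hypothetical, and the tree is assumed to have at least two leaves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    char: str                        # '*' marks an internal node (assumed marker)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def decode(root: Node, bits: str) -> str:
    """Decode a string of '0'/'1' characters by walking the Huffman tree."""
    decoded = []
    node = root
    for bit in bits:
        if bit == "0":
            node = node.left         # a 0 means move left
        elif bit == "1":
            node = node.right        # a 1 means move right
        else:
            raise ValueError(f"not a bit: {bit!r}")
        if node is None:
            raise ValueError("bit string does not match the tree")
        if node.left is None and node.right is None:
            decoded.append(node.char)  # reached a leaf: emit its character
            node = root                # and restart from the root
    if node is not root:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(decoded)


# Hypothetical tree with codes A = 0, B = 10, C = 11.
tree = Node("*", Node("A"), Node("*", Node("B"), Node("C")))
print(decode(tree, "010110"))  # ABCA
```

The two ValueError checks are a design choice for the sketch: they reject input that contains non-bit characters or stops part-way through a code word instead of silently dropping it.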

Analysis

For every character it decodes, the algorithm starts from the root and ends up at a leaf, consuming one bit of the bit string (code word) per step. For a balanced tree, the cost of decoding one character is proportional to the height of the tree, which is O(logN), where N is the number of leaves (or unique characters in the input string).

The total running time is proportional to the length (L) of the code word, because each bit moves the pointer down exactly one edge of the tree. If the code word is pretty big then we can safely say that L >> N. In such a scenario we can treat the O(logN) cost per character as a low order term and claim that the running time of the algorithm is linear in L.

For example:

Let us say that we have an input text with 32 unique characters and we want to encode it. This means our Huffman tree has 32 leaves, which for a balanced tree gives a height of 5.
Also, if the text to encode is 5000 characters long, then the upper bound on the code word length would be 5000 * 5 = 25000. The height, 5, is significantly smaller than 25000, so it can be ignored and the running time can be treated as proportional to 25000, i.e. O(L).
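One way to convince yourself of the linear bound is to count how many edges the decoder follows. The sketch below does that for a hypothetical three-character tree; the Node class, the '*' marker, and the function name decode_with_steps are assumptions for illustration, and the input is assumed to be a valid bit string for the tree.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    char: str                        # '*' marks an internal node (assumed marker)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def decode_with_steps(root: Node, bits: str) -> Tuple[str, int]:
    """Decode a valid bit string and count the tree edges followed."""
    decoded = []
    steps = 0
    node = root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        steps += 1                   # exactly one edge per input bit
        if node.left is None and node.right is None:
            decoded.append(node.char)
            node = root
    return "".join(decoded), steps


# Hypothetical tree with codes A = 0, B = 10, C = 11.
tree = Node("*", Node("A"), Node("*", Node("B"), Node("C")))
bits = "010110"
text, steps = decode_with_steps(tree, bits)
print(text, steps, len(bits))  # ABCA 6 6

# Upper bound from the example above: 5000 characters, at most 5 bits each.
print(5000 * 5)  # 25000
```

The step count always equals the number of bits, whatever the shape of the tree, which is why the running time is stated in terms of L.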

Conclusion

Huffman Encoding/Decoding is effective and in the next post we will learn about creating a balanced Huffman Tree.

Thanks for reading…