Pattern Finding – Boyer Moore

Introduction

String matching problems are important part of many Programming Contests and also find applications in real world problems. Imagine you are asked to implement a text editor.

What are few of the most common operations in a text editor?

  • Copy and Paste.
  • Search and Replace.

The performance of your text editor will completely depend on how efficiently it can search and/or replace patterns in a huge text.

Basics of Pattern Finding

In a given pattern matching problem, we have two strings one is called the text T and the other is called the pattern P. The task is to find the pattern P in the text T and return the first index of P in T.

There are various ways of finding it and this seems to be a simple problem.
So, it starts with comparing the characters at index j = 0 in P with index i+j in T where i = 0, if the corresponding characters match, we move forward to the next character and so on. If the characters do not match then we increase i and start from the index j = 0 of P and the index i+j of T.

By doing this we either exhaust the text T or we eventually find P in T. Here is a simple program which does so.

This is the most basic idea of finding patterns and the good news is, it works every time. But wait! is it really good?

Do we know the cost we paid in terms of running time for this program?

In the worst case scenario we might end up comparing all the characters of P for for each P.length characters of T. That precisely means that we will be doing P.length times T.length comparisons.

If the length of T is N and the length of P is M then the running time would be O(M*N). Now consider that the pattern we are searching is almost of the same length as the text and all characters but the last one matches. We will end up having O(N2) comparisons.

This also means that in the worst case the running time would be O(N2) if you do not put some checks.


Boyer Moore – A better way of pattern finding

What would be better than the above approach?
And the answer is less number of comparisons. The basic idea is to do as less comparisons as possible and eliminate the unnecessary character comparisons.

Let us consider the strings below:
Text = I LIKE TECHIEME
Pattern = TECH

Let us just align our pattern P with the text T such that the first character of each of P and T align with each other.
Observe that the last character of P is aligned to the forth character of T. But the characters H and I which lie in the same column do not match.
Boyer Moore
By doing this one comparison what all we can deduce?

  1. There are exactly three characters in both the strings before the characters under comparison.
  2. And because the last character do not match, there is no way that we can find our P in T before this point even if the first three characters match.
  3. This means, we can skip comparing the first three characters.

Great! We just saved three comparison.
What else can we do?

Let us now see if the non matching character I in the text T is present in the pattern P. (We can make this a O(1) process if some pre-processing is done with the help of some extra space).
We found that I is not present in P. So, what can we deduce by this?

  1. This means that there is no possible alignments where we can slide the Pattern P to the right and align the I in T with a possible existence of I in P
  2. This means, we can skip the first four characters in T and realign our pattern to the fifth character in T and re do the above tests. Below is a diagram.

textandpattern1

The current alignment puts the character T from the text string and H from the pattern string in the same column. Again, they do not match. So, there is a possibility that we can eliminate the comparison of first three characters.

But wait, let us check if the character T in the text also exists in the pattern. Yes! the character T occurs at three spaces from the end in pattern P.
textandpattern2
This means, if we align the character T in both the strings, we might have a possibility of finding the pattern P in the text T.
So, let us slide the pattern P by three spaces to the right, and align the matching character.
Boyre Moore

Again, repeat the same steps:

  1. Compare the forth character of pattern P with the character H in text T.
  2. The characters match, so we now compare the third character C in both the strings.
  3. The characters match, so we now compare the second character E in both the strings.
  4. The characters match, so we now compare the first character T in both the strings.

Now there is no more characters left in pattern P, so that means we found our pattern in the text string. Had it been the case where we exhaust the text string T and the last character still doesn’t match then we can return false or -1.

The bad thing is that the post is yet not over ๐Ÿ™
But to my surprise! the good thing is that the above process is called the Boyre Moore Algorithm for Pattern Finding and you know that now. ๐Ÿ™‚

Few notes

There are two concepts which are used for this algorithm:

  • Looking Glass Heuristic : This says that while testing a possible placement of P against T, start from the end of P and move backward.
  • Character Jump Heuristic : This says that in a comparison if a mismatch occurs at index i for a character c’ in T and index j of pattern P and c’ doesn’t occur in P then slide the pattern completely past T[i]. If c’ is contained in P then slide P until an occurrence of c’ in P gets aligned with T[i].

We talked about lookup for the non matching character of text T in the patter P. Yes we need a lookup table of size P.length. This is the auxiliary space required for the algorithm.

I also talked about pre-processing, in the above section. That pre processing is required to populate the lookup table.

Another point worth noting : There might be repeated characters in the pattern, so in case of non matching characters C’ in text T if C’ exists multiple times in P we always need to look at the first occurrence of C’ character from the end of P, to decide the space by which we need to slide the pattern to the right.

Deciding the number of steps to slide is slightly tricky. Here are two cases which surfaces around:

  1. The character c’ appears in pattern P at an index k such that k < j. In this case we are sure that if we slide right we will end up aligning c’ in both the strings.
  2. The character c’ appears in pattern P at an index k such that k > j. In this case we can never align by sliding partially, we need to shift the pattern completely to the right.
Building the table

The idea behind the table is very simple, it stores an integer the last index of each unique character.
Let us check the above text and pattern and populate the table. Here is a list of questions we will ask and populate the table.

If we see a T in the text then the value of T in the table would be zero, the last index at which T appears.
If we see a E in the text then the value of E in the table would be one, the last index at which E appears.
If we see a C in the text then the value of C in the table would be two, the last index at which C appears.
If we see a H in the text then the value of H in the table would be three, the last index at which H appears.

For all other characters which do not occur in P, we can add a -1.
table

You can populate the table with other ways, which might make your code more readable and easy to understand (Refer CLR or any other book on Pattern Matching). The basic idea is to find the number of spaces to slide for in O(1) time.

Source Code – Boyre Moore

As usual, you can download the complete source code from Github Repository for TechieMe. Here is the important part of the code.



Analysis of Boyre Moore Algorithm

The lookup table is one auxiliary space which is used. Depending on what data structure we use and what is the alphabet for the language the size may vary.

If we are using the Map data structure, then the auxiliary space required is same as the size off unique characters in the pattern which is O(M) in case of a pattern of length M consisting of all unique characters.

If we are using an array where characters can be used as indices, we need the array of length equal to the size of the alphabet. For the known English Alphabet with all symbols, the array of length 255 is good enough. Which still is O(1)

The time complexity is O(M) for building the lookup table where M is the size of the pattern.
The time complexity for matching in the worst case would again be O(N*M) same as the naive approach for strings where all but the first character matches.

An example of text which achieves worst case running time is as follows:
Text : bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Pattern : abbbbbbbbbbbbbb

However, this algorithm in English text is able to skip many characters during the sliding or character jump heuristic. It is really not a good idea to use it for Binary Strings.

Conclusion

I tried to explain it in the simplest way possible, but there can be still some confusion. In case you have any, please write it in the comment section. I will try to make it much more simpler.

This algorithm is really better than the Naive approach we discussed in the beginning. We will discuss few more pattern finding algorithms like KMP in coming posts.

Donโ€™t forget to subscribe to TechieMe to get updates on latest posts.