Pattern Finding – KMP Algorithm

Introduction

Now that you are fully equipped with the Boyer Moore Algorithm and have a notion of Pattern Finding. I would suggest you to get deeper into pattern finding. You can read about the benefits of pattern finding in my previous post about Boyer Moore.

This post will try to make you familiar with all the thought processes which you can use to exploit the known properties of texts and patterns using KMP Algorithm. However, Boyer Moore seems to be slightly intuitive, this one is a real geeky way of finding patterns.

Knuth Morris Pratt

Yes! this is what KMP stands for, you can learn about the history and those who published it here on the Wikipedia. Let us recap something we learnt in the previous post.

Recap from Boyer Moore

We used the looking glass heuristic and character jump heuristic to find a possible placement of a character from the Text T in the pattern P. In case we found it, we used to slide the pattern to align it with the matching character. In case we didn’t find it, we use to throw away this failure information and start over again.

However, a software engineer or a vivid learner must not throw away information, no matter how useless it looks like. Yes! information is precious and it is tough to make sense out of information which seems not so useful.

If you can do this, you might end up getting a linear time algorithm for Pattern Finding.

Failure Function

Well, did we just read failure information above? So what is this failure information or function?

A Failure Function is expected to tell us the proper number of spaces by which we need to shift pattern P to the largest extent possible.

To be more precise, we calculate the failure function f(k) as the length of the longest prefix of P that is also a suffix of P[1..k]. Also, f(0) = 0. Before we proceed further and create more confusions, here is an example of the table generated by the failure function.
KMP Algorithm
Pattern P : abcabc, is the pattern under consideration, which we need to find in a relatively larger text T.

Take a deep breath before reading further.

As per our definition, we find the following values:

  • f(0) = 0
  • f(1) is the length of longest prefix of P which is also a suffix of P[1], here P[1] is b, and the answer would be zero because the longest prefix that is also a suffix of b could have been b itself and the length would have been 1, but b does not appear as a prefix in P and hence the length is 0.
  • f(2) is the length of longest prefix of P which is also a suffix of P[1..2], here P[1..2] is bc, and the answer would be zero because the longest prefix that is also the suffix of bc could have been bc itself and the length would have been 2, but bc does not appear as a prefix in P, also c does not appear as a prefix and hence length 0.
  • f(3) is the length of longest prefix of P which is also a suffix of P[1..3], here P[1..3] is bca, and the answer would be one because the longest prefix that is also the suffix of bca could have been bca itself and the length would have been 3, but bca does not appear as a prefix in P, neither does ca but a does. Hence the length is 1.
  • f(4) is the length of longest prefix of P which is also a suffix of P[1..4], here P[1..4] is bcab, and the answer would be two because the longest prefix that is also the suffix of bcab could have been bcab itself and the length would have been 4, but bcab does not appear as a prefix in P, neither does cab but ab does appear. Hence the length is 2.
  • f(5) is the length of longest prefix of P which is also a suffix of P[1..5], here P[1..5] is bcabc, and the answer would be three because the longest prefix that is also the suffix of bcabc could have been bcabc itself and the length would have been 5, but bcabc does not appear as a prefix in P, neither does cabc but abc does appear. Hence the length is 3.

Congratulations! for making out safely from the deep sea above. You deserve a pat on the back award :).

What exactly we did above? More importantly, Why ?

Let us answer the ‘Why’ first. The ‘What’ will make sense as you go on!
Note : This algorithm does not use the looking glass heuristic, so we would not compare the characters from the last as we did in Boyer Moore.

So, when we start comparing T[i] with P[j] where 0 < i < T.length and 0 < j < P.length , the following cases may occur :

  1. T[i] will match P[j]
  2. T[i] will not match P[j]. Which in turn will have two cases
    1. We are at the first character of P.
    2. We made some progress and we have already matched couple of characters in P before this mismatch.

The first case is very simple, T[i] == P[j] hence increment i and j and compare the next character.
The second case is slightly tricky.
Case A, is straight forward, you know this is the first character of P and it didn’t match, so simply increment only i and start comparing with the first character of P again.
Case B, happens if we have successfully matched j-1 characters in P with corresponding characters in T and now the jth one mismatched.

The first thing we do is to check the previous character that is j-1th and consult our failure function for some advice. The advice we get from the failure function is that the longest prefix in P for the character sequence P[1..j-1] is of a certain length k.

Which precisely means that k characters exist as prefix in P and hence it will always match with character sequence T[i-k, i-1] and hence this time we can start comparing from index i in T and k in P.

I know I am not good at explaining things, and now I will take the aid of a diagram to make things clear
KMP Algorithm

Now that we have an image, may be it will help. Let us start comparing from the beginning.

  • Index 0 – 4 of T and P fall in case 1 above, and we just keep increasing the indices i and j. Note that i is index in T and j is index in P
  • Index 5 in T and P mismatch and it falls in case 2(B), as we have made progress into P and this is not the first character of P that we are comparing.
  • Let us see if any prefix exists for P[1..4], which can be given by the failure function f(4) and its value is 2.
  • The value 2 means that we can directly align index 2 of P with index i of T. Something like below:

KMP Algorithm

Why? Because we know that the two characters appear in the same sequence in the start off P as they are at indices j-2 and j-1. So if T[i-2] and T[i-1] matched with two characters P[j-2] and P[j-1] respectively, then they will definitely match with P[0] and P[1]. Hence, no need to compare the first two character again.

Note that we didn’t start comparing from the beginning of pattern P this time, we conveniently skipped the first k characters (here 2) index (0 to k-1). This is a great advantage, we saved k comparisons.

Dry Run

Let us just quickly do a dry run for the rest of the part. Considering we reached at a stage which is shown in the second image. we start comparing from index j = 2 of P and index i = 5 of T.

Here we find that the characters do not match, but still we are not at the first character of P, so this is case 2(B) and we need to consult the failure function for f(j-1) = f(1) which is zero.

This means we can align the index 0 of P with index i of T. Here is how the new state will look like.
kmp3

Now, we compare the index i = 5 of T with index j = 0 of P. It didn’t match again, and this is case 2 (A). This means we just started matching and we failed, so let us slide our pattern T by just one step, which is equivalent to increasing the index i. Now the state looks as below:
kmp4

Now, the comparison will result into a success. and we can return i - P.length

Source Code – Pattern Finding KMP Algorithm

Analysis

Of course we have used an auxiliary space to store the failure function, this space might exceed the space required for the Boyer Moore algorithm, the space here is O(m) where m is the size of the pattern to be found.

The running time is divided into two, finding the failure function and matching.
The loop depends on i and the bound to i is m. In each iteration there is a possibility that i will increase by 1. In case i is not incremented, j moves by certain amount. The maximum j can move is i spaces. Hence, the upper bound on the number of iterations would be less than or equal to 2 times m. Asymptotically that is equal to O(m).

On similar lines, the findPattern method will have an asymptotic running time of O(m+n).

Conclusion

The algorithm, looks tricky but is beautiful. It gives a tighter bound on the running time which is asymptotically O(m+n). I might have rushed in the analysis part, or if you are un clear about the code of Failure function. I would be happy to write a follow up post on this, if you can comment below.

Don’t forget to subscribe to TechieMe to get updates on latest posts.