Dynamic Programming – Longest Common Subsequence

Second Approach

  1. Let the length of the longest common subsequence and let us say it is LS
  2. Extend the algorithm such that it finds the longest common subsequence using the LS itself.
What strategy should we employ?

Let me break down the problem to a smaller problem, consider the two strings in the below diagram.
sequenecs
One of the longest common subsequence of the above two strings is B C A B as listed above.

Now look at the strings below which merely are prefixes of the above strings:
sequences2
If I list down a longest common subsequence for the below strings I will get B C.

Point worth noting is that the longest common subsequence of the prefix strings, is a prefix of the longest common subsequence of the original strings. If this is a confusing line then I will put it in a simpler way.

The LCS(Longest common subsequence) of the strings in image 2 is B C which is a prefix of the LCS of the strings in image 1 i.e. B C A B

I am just trying to chop the problem down into smaller pieces.

Let us define few terms here.
X[1..i] is a prefix of X[1..M], and Y[1..j] is a prefix of Y[1..N]

Let us also define the length of longest common subsequence as Li,j which is equal to LCS of X[1..i] and Y[1..j].
Then by definition Lm,n would be equal to LCS of X[1..M] and Y[1..N].

Now assume that we are in the middle of the process where we have travelled to index i-1 in X and j-1 in Y and the length of the longest common subsequence according to our definition would be Li-1,j-1.

Now we compare the next characters X[i] and Y[j], there can be two cases:

  1. If both of them match then Li,j would be Li-1,j-1 + 1, which is the one greater than the length of the last longest common subsequence.
  2. If both of them don’t match then Li,j would be maximum of (Li,j-1 and Li-1,j), which is maximum of the length of the two longest common subsequences till now.

The statements above might not be easy to digest, so we will try and prove the statement so that we have a clear understanding.

Consider the strings below:
Case 1: if X[i] = Y[j]

Let us assume that Z[1..k] = LCS(X[1..i],Y[1..j]) , which means that the longest common subsequence of a prefix of X (upto index i) and prefix of Y (up to index j) is of length k and is represented by the sequence Z[1..k].

We can also say that Li,j is k, where Li,j is the longest sub sequence of X[1..i] and Y[1..j]. This also produces the result that Z[k] = X[i] = Y[j] because the Case 1 demands the X[i] to be equal to Y[j] and we defined Z[1..k] in a way that it is the longest common subsequence, so the last character of Z must be equal to X[i] or Y[j].

Case 2: if X[i] is not equal to Y[j] then it means that either X[i] or Y[j] is not a part of the longest common subsequence. So we cannot say much about Z[1..k] but we can definitely say that Z[1..k-1] is the longest common sub sequence of X[1..i-1], Y[1..j-1].

This is true because we are just considering the Z string to contain k-1 characters and hence there is no way by which there can be more than one common character between X[i-1], Y[j-1] and X[i], Y[j]

If this is still confusing, please write a comment and I will explain it more in detail.

The whole thing we discussed in last few 10-15 lines is just to come to an agreement on the statement that I made previously, which is “the longest common subsequence of the prefix strings is a prefix of the longest common subsequence of the original strings.” We can re-frame it to say that if Z is a LCS of X and Y then any prefix of Z is an LCS of a prefix of X and a prefix of Y.

Using this we can define a behavior of such kind of problems which says that, “An optimal solution to a problem contains and optimal solution to a subproblem.”

Now let us use the result of the above discussion to write a pseudo code to calculate the length of the longest common subsequence.
LCS
What will be the worst case scenario for this code?
It is pretty obvious that the worst case would be when X[i] != Y[j] for all value of i and j. I would like to elaborate this using a recursion tree.

Let us consider solving this when X is of length 4 and Y is of length 3, then we start at the root with (4,3) and the tree looks like the below in each subsequent level. every time we evaluate with i-1, j and i,j-1. The tree will grow till at least one argument in each of the leaves is not zero.
p3
What is the height of this tree?
It seems that it has to be M+N to accommodate all the cases and also, this is a binary tree hence at each level total number of nodes will increase by a power of 2, i.e. exponentially.

This seems to be almost the same as the Brute Force algorithm which we discussed in the beginning. Similar to the Brute force this also has a exponential running time.

I can’t stop noticing that in the tree most of the nodes contain same values. For e.g there are multiple nodes with values (2,2) and (3,1) etc. So we can apply some optimization here, also we can state that “the recursive solution contains few distinct sub problems repeated more than once.”

So, how many distinct sub problems are there, if most of them are repeated?

The answer would be M*N, because each unique sub problem in the tree is a function of i and j, and there are M distinct i and N distinct j, so total unique sub problems would be a product of M * N.

Hence, if we can employ some technique to remember what we already calculated and have a mechanism to fetch and return the previously calculate value, we might be saving lots of repetitive work.

We can actually maintain a map of i,j and the value Li,j, fetch and return Li,j if i and j are passed in to the above code. if there is nothing in the map then only we have to calculate it.

If we achieve it then the running time becomes proportional to the total number of unique problems. Hence, it will improve significantly because if a repeated sub problem occurs we simply return the already calculated value from the map.

So we can say that the running time to find the length of the Longest Common subsequence would be T(M,N) = O(M*N). Also, lets see the total space required, it would be the space required to save the already calculated solutions for the unique problems which is again proportional to M*N. Hence Space = O(M*N).

It is fairly visible that we started from the top of the tree where the values of i and j were maximum i.e. M and N respectively. So this can be termed as a top down approach, as we started from the max numbers and reduced it down to i = 0 and j = 0.

We can do the same thing in a better way if we can do it in a bottom up approach, because there might be many similar problems where we do not like to use recursion and we have the base case values handy. For e.g. in most of the cases while solving an equation in i and j we already know the result for i = 0 and j = 0, or i = 1 and j = 1, and it is fairly complicated to predict the result of i = m and j = n at the start of the calculation, so we must strive to use a bottom up approach to solve such complicated problems.