Dynamic Programming – Longest Common Subsequence

Introduction

This is the first post in Dynamic Programming – Online Classes.

Dynamic programming have been a very interesting topic and technique to solve problems which exhibit a specific set of behaviors. We will try to understand the behaviors and the way it can be implemented to solve complex problem without requiring a mathematical approach. We will try to simplify it to make it clearly understandable and in later articles we will pick up two or three problem statements which can be easily solved using Dynamic Programming.

Purpose of the article

The purpose of this article is to enable the reader to analyze the complex programming problems, try to understand and employ the appropriate solution and help in learning dynamic programming. This may stretch to 2-3 articles to understand all the aspects of dynamic programming.

The Longest Common Subsequence Problem

This is a famous problem in the field of Biology to find the matching trends between two genes and many more application and it is also asked in most of the good interviews. Before describing this problem in detail, I would like to explicitly point down the difference between a substring and a subsequence.

Difference between Substring and Subsequence

A substring is a continuous sequence of characters from a string where as a subsequence is not necessarily a continuous sequence but maintains an order or characters.

Let us understand this with an example, where we have the below string

String 1: A L P H A N U M E R I C
We will list down few substrings and subsequences as below:
Substrings

  1. A L P H A
  2. N U M E R I C
  3. P H A N U
  4. M E R I

Subsequences

  1. A L P H A
  2. N M E R C
  3. A P H A U R I C

Above is a small list of substrings and subsequences. There can be many more, pointing it again, in a subsequence the order is important. We cannot say that A L N U C R is a subsequence.

Now with a better understanding of the differences, the problem statement becomes clearer.

Problem Statement

Given two strings of characters X[1…m] and Y[1…n], our job is to find a longest subsequence common to both. Point worth noting is that there may be more than one longest common subsequence in the given two strings.

For e.g. let us consider the below strings:
X : A B C B D A B
Y : B D C A B A

Let us try to list down the possible longest common subsequences in both the strings above.

1) B D A B
2) B C A B
3) B C A B
4) B C B A

First Approach

What exactly we did there? To simplify it, we chose one of the strings ( mostly we choose the string smaller in size) and started from the beginning of that string and kept comparing the character which comes in order to the characters in the second string. If they do not match and we reach the end of the second string, we skip the character and if it matches we move to the next character in both the strings.

This is called the Brute Force technique, it checks every subsequence of X[1…N] to see if it is also a subsequence of Y[1…M].. Hence, it involves finding all the possible subsequences of the first string. Then evaluates if any of the subsequence is also a subsequence of the second string.

Analyzing the Brute Force method

This is a very crude way of doing it, and also a time consuming one.
Part 1: Can we find out what is the total number of subsequences possible for a string of length N. The answer is very simple, its 2N. This is because every character in the string has a equal chance of appearing or not appearing in the subsequence. Hence at each character we have 2 options.

Now if there are only two characters in a string, then total number of subsequences would be four as below:

  • An empty subsequence
  • A subsequence containing the first character.
  • A subsequence containing the second character.
  • A subsequence containing both the characters.

Part 2: What is the time taken to find out if a given subsequence is also a subsequence of another string? It will take time proportional to the length of the second string in which we want to check. Let us say that the length of the string is M.

From part 1 and part 2 it is clear that the time taken to find out the longest common subsequence would be of the order M * 2N We can also say that the running time T = O(M * 2N)

Which is an exponential running time, assuming a string contains 1000 characters, the order of total number of subsequences would be 21000. Now this is tough to calculate or analyze. Hence, we need to find some other better performing approach.

We would try to simplify the problem so that we can solve it in polynomial time and not exponential time.

  • Byte me

    learned couple of new things…dint get complete clarity though…may be because i read smthing on Dynamic programming for the first time….will read this article again after couple of days…may be i need sm time to let it sink…

    • dharam

      Well true, I remember the days when I started learning this programming technique, I used to write a lot of things and strategies to find out the repeating sub problems. More over the LCS itself is slightly complicated, In the next article I will take a simpler program to demonstrate the dynamic programming.

  • Praveenkumar

    I am now pasting the code for this one is this correct??

    #include

    using namespace std;
    int maxm(int a,int b)
    {
    return a>b?a:b;
    }

    int main()
    {
    string s1,s2;
    cin>>s1>>s2;
    int m=s1.length()+1,n=s2.length()+1;
    int a[m][n];
    for(int i=0;i<=m;i++)
    {
    a[i][0]=0;
    }
    for(int j=0;j<n;j++)
    {
    a[0][j]=0;
    }
    for(int i=1;i<m;i++)
    {
    for(int j=1;j<n;j++)
    {
    if(s1.at(i-1)==s2.at(j-1))
    {
    a[i][j]=a[i-1][j-1]+1;
    }
    else
    {
    a[i][j]=maxm(a[i-1][j],a[i][j-1])+1;
    }
    }
    }
    cout<<a[m-1][n-1]<<endl;
    return 0;
    }

    • dharam

      Seems to be correct…thanks for the code. I would really appreciate if people can use this and help themselves

  • dharam

    Guys, I missed the table when the article was first published. I have added it now, the post should make more sense now 🙂

  • amit

    If both of them don’t match then Li,j would be maximum of (Li,j-1 and Li-1,j), which is maximum of the length of the two longest common subsequences till now.

    can you please explain this ?
    I am not looking into the code , want to understand the concept completely then will code myself

    • dharam

      Hi Amit,
      I am really sorry that you didn’t understand this derivation. Here I try to make it simpler, hope this helps. If not, just let me know I will send you a detailed analysis with step by step derivation.

      Index : 0 1 2 3 4 5 6 7 8
      String1: A B C A B C T Y S
      String2: B C A C B T S

      Think this way, how did we arrive to an i index in string1 and a j index in string 2. It is a result of either of the below two cases:
      Case1: If the last comparison resulted in a match of characters at i-1 and j-1, so we increased the length by 1 and incremented I and j. Remember it only makes sense to increase both when the characters at i-1 and j-1 matches.

      Case2: If the last comparison didn’t result in a successful, then there were two situations:
      i) Either we increased i-1 to i as a result of last failed mismatch, or
      ii) We increased j-1 to j for the same reason.

      So there will be two LCS at that point, one as a result of comparing i-1 and j and the other as a result of comparing i to j-1.

      • Amit

        Thanks Dharam.

        I read the article once again and understood the concept .
        Excellent .

  • topoden

    In the pseudo code above, was
    Li,j=LCS(X, Y, i-1, j-1)
    supposed to be
    Li,j=LCS(X, Y, i-1, j-1) + 1
    ?

    • dharam

      Yes! absolutely correct.