Dynamic Programming – Longest Common Subsequence

Introduction

This is the first post in Dynamic Programming – Online Classes.

Dynamic programming have been a very interesting topic and technique to solve problems which exhibit a specific set of behaviors. We will try to understand the behaviors and the way it can be implemented to solve complex problem without requiring a mathematical approach. We will try to simplify it to make it clearly understandable and in later articles we will pick up two or three problem statements which can be easily solved using Dynamic Programming.

Purpose of the article

The purpose of this article is to enable the reader to analyze the complex programming problems, try to understand and employ the appropriate solution and help in learning dynamic programming. This may stretch to 2-3 articles to understand all the aspects of dynamic programming.

The Longest Common Subsequence Problem

This is a famous problem in the field of Biology to find the matching trends between two genes and many more application and it is also asked in most of the good interviews. Before describing this problem in detail, I would like to explicitly point down the difference between a substring and a subsequence.

Difference between Substring and Subsequence

A substring is a continuous sequence of characters from a string where as a subsequence is not necessarily a continuous sequence but maintains an order or characters.

Let us understand this with an example, where we have the below string

String 1: A L P H A N U M E R I C
We will list down few substrings and subsequences as below:
Substrings

  1. A L P H A
  2. N U M E R I C
  3. P H A N U
  4. M E R I

Subsequences

  1. A L P H A
  2. N M E R C
  3. A P H A U R I C

Above is a small list of substrings and subsequences. There can be many more, pointing it again, in a subsequence the order is important. We cannot say that A L N U C R is a subsequence.

Now with a better understanding of the differences, the problem statement becomes clearer.

Problem Statement

Given two strings of characters X[1…m] and Y[1…n], our job is to find a longest subsequence common to both. Point worth noting is that there may be more than one longest common subsequence in the given two strings.

For e.g. let us consider the below strings:
X : A B C B D A B
Y : B D C A B A

Let us try to list down the possible longest common subsequences in both the strings above.

1) B D A B
2) B C A B
3) B C A B
4) B C B A

First Approach

What exactly we did there? To simplify it, we chose one of the strings ( mostly we choose the string smaller in size) and started from the beginning of that string and kept comparing the character which comes in order to the characters in the second string. If they do not match and we reach the end of the second string, we skip the character and if it matches we move to the next character in both the strings.

This is called the Brute Force technique, it checks every subsequence of X[1…N] to see if it is also a subsequence of Y[1…M].. Hence, it involves finding all the possible subsequences of the first string. Then evaluates if any of the subsequence is also a subsequence of the second string.

Analyzing the Brute Force method

This is a very crude way of doing it, and also a time consuming one.
Part 1: Can we find out what is the total number of subsequences possible for a string of length N. The answer is very simple, its 2N. This is because every character in the string has a equal chance of appearing or not appearing in the subsequence. Hence at each character we have 2 options.

Now if there are only two characters in a string, then total number of subsequences would be four as below:

  • An empty subsequence
  • A subsequence containing the first character.
  • A subsequence containing the second character.
  • A subsequence containing both the characters.

Part 2: What is the time taken to find out if a given subsequence is also a subsequence of another string? It will take time proportional to the length of the second string in which we want to check. Let us say that the length of the string is M.

From part 1 and part 2 it is clear that the time taken to find out the longest common subsequence would be of the order M * 2N We can also say that the running time T = O(M * 2N)

Which is an exponential running time, assuming a string contains 1000 characters, the order of total number of subsequences would be 21000. Now this is tough to calculate or analyze. Hence, we need to find some other better performing approach.

We would try to simplify the problem so that we can solve it in polynomial time and not exponential time.