Minimum Edit Distance

How to create the Minimum Edit Distance Table

It is quite interesting to understand the significance of each cell in previous table. So, we are now confident enough to find and report the Minimum Edit Distance of the two strings (or their sub strings), given the tables.

Which brings me to the next point in our discussion. How to build the table?

In the previous section we learnt to populate the first row and first column. Let us look at it again.

The first grey column represents P.

  • The cell containing <empty> in the grey column means that the String P is empty at that point.
  • The cell containing H in the grey column means the String P has only one character i.e. H
  • The cell containing E means the String P has two characters “HE”
  • Similarly the cell containing O means the String P has five characters “HELLO”

We can do the same for the grey row (which in turn represents the string Q)

  • The cell containing H in the grey row  means Q has only one character i.e. H
  • Similarly the cell containing W in the grey row means, Q has six characters “HOLLOW”

With this newly acquired understanding of the table, we know that the values in the first green row represent the various minimum edit distances for a P(which is an empty string) and all sub strings of Q(starting from empty string, H, HO, HOL, HOLL, HOLLO, HOLLOW)

Hint: We also know that to make anything out of an empty string, we need to insert characters in to it. So, the values in the first rows are basically number of characters inserted into P to get the corresponding Q

In a similar way, we know that making an empty string from any given string P requires deletion of characters from the string P. Hence, the first green row contains the number of deletion at different instances of P to get an empty string Q.

Simply put the row values represent INSERTIONS, where as the column values represents DELETION

Building the table

With the basic understanding of first row and column. Let us try to populate the row 1.

As per our understanding, the value of cell(1,1) is the number of operations it takes to transform P(0..1) to Q(0..1). Here P(0..1) means “H” in the row 1 and Q(0..1) means “H” in column 1.

  • But as we know that both the “H”s are equal, that means we do not require any special operation to transform H to H. Hence, the value zero.
  • Now let us look at cell (1, 2), which is number of operations to transform P(0..1) to Q(0..2). As P(1) is not equal to Q(2), we may have to choose an operation, now as we learnt that moving along the rows inserts the corresponding character (O in this case) into Q. Hence the minimum edit distance of P(0..1) and Q(0..2) is 1.
  • Not the cell(1,3), is to transform P(0..1) to Q(0..3). Or “H” to “HOL”. Simply put it will take 2 operations as we can see the two strings are different by two characters and insertions at the end of P will make it equal. But how do we derive it mathematically.

When we plan to populate cell(i,j), the expectation is that all the cells (m,n) are populated where 0<=m<=i and 0<=n<=j.

Now, we are looking at cell(i, j) and we know that the value of the cell will give us the minimum edit distance of P(0..i), Q(0..j). There are three possibilities with characters P(i) and Q(j).

  • Either P(i) equals to Q(j), which means, we literally don’t need an extra operation to match the characters and hence the minimum edit distance of P(i-1), Q(j-1) will be the minimum edit distance of P(i), Q(j). In simple terms, the values of cell(i,j) will be same as the value of cell[i-1][[j-1]
  • P(i) not equals Q(j). In this case, the value of cell(i,j) depends on one of the following choices we exercise :
    • Insert Q(j) in P at position i – this will add an insertion cost to the minimum distance P(i), Q(j-1) or cell(i, j-1)
    • Delete Q(j) – this will add a deletion cost to the minimum distance of P(i-1), Q(j) or cell(i-1, j)
    • Replace P(i) with Q(j) – this will add a replacement cost to the minimum distance of P(i-1), Q(j-1) or cell (i-1, j-1)
  • What ever results in the minimum of these three edit is the minimum distance P(i), Q(j) or cell(i,j).

In our example, we consider the insertion, deletion and replacement cost to be one. You can change the values to get more realistic answers.

Now with the above established understanding lets try to populate our table for P (“HELLO”), Q(“HOLLOW”)

We already know the green row and column values.

  • Now let us look at first orange cell(1,1). As P(1) is equal to Q(1), the value cell(1,1) would be same as the minimum edit distance cell(0,0).
  • Looking at cell(1,2). P(1) does not match Q(2). Hence, the value cell(1,2) is the minimum of
    • cell(0,2) + deletionCost, or
    • cell(1,1)+ insertionCost], or
    • cell(0,1) + replacementcost
  • This would translate to minimum of (3, 1, 2), which is 1.
  • Similarly, the cell(1,3). P(1) does not match Q(3). Hence the minimum value of
    • cell(0,3) + deletionCost, or
    • cell(1,2) + insertionCost, or
    • cell(0,2) + replacementCost
  • This would translate to minimum of (4, 2, 3), which is 2.

This is how we can populate the complete table.