Monday, October 4, 2010

Something on the edit distance (Levenshtein distance)

The edit distance intuitively measures the “distance” between two strings in terms of some operations considered elementary (e.g., adding or removing a character). We could say that the edit distance between two words w1 and w2 is the minimum number of such operations necessary to transform w1 into w2.

The more mathematically inclined may notice this is actually a metric in the mathematical sense. Let $d : \Sigma^* \times \Sigma^* \to \mathbb{N} \subset \mathbb{R}$ be the edit distance function over the alphabet Σ. It is trivial to prove that, for all u and v, $d(u,v) = 0 \iff u = v$ and $d(u,v) = d(v,u)$. The proof that the triangle inequality ($d(u,v) \le d(u,w) + d(w,v)$, with $w \in \Sigma^*$) holds is slightly trickier and would have been provided at the end of the post, if your author were not a very lazy person (though he reserves the right to write it down some time in the future).
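(For the impatient, the gist is simple enough: any sequence of operations turning u into w, followed by any sequence turning w into v, is itself a sequence turning u into v. Taking optimal sequences on both legs gives a script of length $d(u,w) + d(w,v)$ from u to v, and since $d(u,v)$ is the minimum over all such scripts, $d(u,v) \le d(u,w) + d(w,v)$. Turning this sketch into a real proof is left to the promised future post.)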

There are multiple edit distances. The most popular is the Levenshtein distance. In fact, it is so popular that it is called the “edit distance” tout court. The elementary operations considered by the Levenshtein distance are:
  1. Add a character
  2. Remove a character
  3. Substitute a character
Essentially, since we want the minimum number of operations, we should try all possible paths of modifications from the string w1 to w2. Luckily enough, an upper bound is easy to estimate. If len(w1) is n and len(w2) is m, we know for sure that we can remove all n characters from w1 and add the m characters of w2. That is to say, we can automatically discard all the paths of operations longer than m + n. Luckily enough, we can build algorithms much more efficient than that. The most popular one, a classic dynamic programming algorithm, can be expressed as:

package string_distance.levenshtein;

public class Distance {
    static public int distance(String u, String v) {
        int n = u.length();
        int m = v.length();

        // distance[i][j] holds the Levenshtein distance
        // between the prefixes u[0:i] and v[0:j].
        int[][] distance = new int[n+1][m+1];

        // Base cases: transforming a prefix into the empty
        // string (or vice versa) costs one operation per character.
        for(int i = 0; i <= n; ++i) {
            distance[i][0] = i;
        }
        for(int j = 0; j <= m; ++j) {
            distance[0][j] = j;
        }

        for(int i = 1; i <= n; ++i) {
            for(int j = 1; j <= m; ++j) {
                if(u.charAt(i-1) == v.charAt(j-1)) {
                    // Last characters match: no extra operation needed.
                    distance[i][j] = distance[i-1][j-1];
                } else {
                    // One operation (deletion, insertion or substitution)
                    // on top of the cheapest smaller subproblem.
                    distance[i][j] = 1 + minimum(
                            distance[i-1][j],
                            distance[i][j-1],
                            distance[i-1][j-1]
                    );
                }
            }
        }

        return distance[n][m];
    }

    private static int minimum(int i, int j, int k) {
        return Math.min(Math.min(i,j), k);
    }
}
It is quite important to understand the basic idea. The algorithm uses an (n+1)×(m+1) matrix (provided that len(w1) = n and len(w2) = m) to store intermediate results. In an imperative setting, this data structure comes particularly handy and natural. In order to simplify the notation, we use Python slicing notation, e.g., s[0:4] is the substring of s starting at position 0 and ending before position 4 (that is to say, len(s[a:b]) = b-a). Essentially, the (i,j) cell of the matrix holds the Levenshtein distance between w1[0:i] and w2[0:j]. This is why the matrix has n+1 rows and m+1 columns: how many prefixes (substrings starting at 0) does a string of length l have? l+1, because the empty string and the full string are both prefixes. Also notice that the indexes i and j used throughout the algorithm do not refer to the “i-th character”, but to the i-th prefix.
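For example, here is the full matrix for w1 = “cat” and w2 = “cut” (a small example of my own, easy to check by hand). Rows are indexed by prefixes of “cat”, columns by prefixes of “cut”; the first row and column are the base cases discussed below, and the bottom-right cell is the final distance, 1 (a single substitution of ‘a’ with ‘u’):

       c  u  t
    0  1  2  3
 c  1  0  1  2
 a  2  1  1  2
 t  3  2  2  1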

The first two loops consider the case when one of the two substrings is empty (s[0:0] is always the empty string). In this case, the edit distance is the length of the non-empty string. Intuitively, the first loop means that the edit distance of w1[0:i] with the empty substring of w2 is i, because we have to delete i characters; the second one means that the edit distance of the empty substring of w1 with w2[0:j] is j, because we have to add j characters.
The main part of the algorithm is constituted by the two nested for loops. They dominate the cost of the algorithm, which always runs in O(nm), which means that it is essentially quadratic in both space and time in the size of the longest string. In other words, it is quite impractical to run the algorithm as described using large strings as input.
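As an aside, the space cost at least is easy to cut down: each row of the matrix depends only on the row immediately above it, so keeping two rows around suffices. A minimal sketch of this well-known refinement, in plain Python, just to fix the idea (it is not the algorithm presented above, and the time cost stays O(nm)):

def lev_two_rows(u, v):
    # previous is the row for u[0:i-1], current the row for u[0:i].
    previous = list(range(len(v) + 1))
    for i, cu in enumerate(u, start=1):
        current = [i]
        for j, cv in enumerate(v, start=1):
            if cu == cv:
                current.append(previous[j - 1])
            else:
                current.append(1 + min(previous[j],       # deletion
                                       current[j - 1],    # insertion
                                       previous[j - 1]))  # substitution
        previous = current
    return previous[-1]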
Back to the nested loops: for each possible pair of i and j, if the (i-1)-th character and the (j-1)-th character are the same, we can simply report the value computed for the (i-1)-th and the (j-1)-th prefixes. That is to say, if u[i-1] = v[j-1], then the edit distance between u[0:i] and v[0:j] is by definition the same as the edit distance between u[0:i-1] and v[0:j-1]. Suppose it is not the case. We know that u[i-1] != v[j-1] and we know the edit distances between the pairs (u[0:i-1], v[0:j]), (u[0:i], v[0:j-1]), (u[0:i-1], v[0:j-1]). We say that the edit distance between u[0:i] and v[0:j] is 1 plus the minimum value among the edit distances between the aforementioned pairs. What do we mean?
Suppose that the minimum edit distance is between u[0:i-1] and v[0:j]. Then it means that u[0:i-1] and v[0:j] are more similar than the other possibilities. And we know that if we get from u[0:i-1] to v[0:j] with k operations, then we can get from u[0:i] to v[0:j] with k+1 operations, because we can “delete” the last character from u[0:i] and then it is like starting from u[0:i-1]. The same idea applies in the other cases, but there the additional operation is an addition (of the last character of v[0:j]) or a substitution (we substitute u[i-1] with v[j-1]).
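In other words, the nested loops just compute, bottom-up, the recurrence

$$d[i][j] = \begin{cases} d[i-1][j-1] & \text{if } u[i-1] = v[j-1] \\ 1 + \min\left(d[i-1][j],\; d[i][j-1],\; d[i-1][j-1]\right) & \text{otherwise} \end{cases}$$

which is exactly what the code above implements.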

This is basically the idea. I have to say that I felt very uncomfortable with the indexing schema. I am used to half-open intervals when looping. That is to say, I usually write loops where indexes range over $j \in [a, b) \cap \mathbb{Z}$ (with a and b integers, of course), as for example for(int j = a; j < b; ++j).

This indexing stuff is not a minor matter, as off-by-one errors are very frequent and only thorough unit testing or demanding formal verification can provide us with sufficient confidence in the code itself[0].

Perhaps the code could have been clearer if I had used the convention that i and j are the indexes in the string (thus using i+1 and j+1 to access the matrix). However, not many presentations use this schema and I felt that this could cause confusion when studying the code and comparing it to different implementations. A third possibility would have been to loop over the Java strings and manually increment the indexes. But for-each is not applicable to Java strings, thus I would have had to create more ad-hoc code (which is what I would have done in real-world code, at least until proved too inefficient).

Luckily enough, in Python I was able to express the code in a more elegant way using the enumerate built-in. I am quite fond of that one! I also decided to use numpy arrays to implement the matrix. An alternative would have been creating a custom Matrix class (perhaps implemented with a linearized list, with a list of lists or with a dictionary). However, this would have been far less efficient and more tedious to write. Notice how two for loops have been replaced with matrix operations (and more to come… ;) ).

import numpy as np

def fast_lev(a, b):
    m = len(a)
    n = len(b)
    # distance[i, j] holds the Levenshtein distance
    # between a[0:i] and b[0:j].
    distance = np.zeros((m+1, n+1), dtype=int)
    # Base cases, filled with vectorized operations
    # instead of explicit loops.
    distance[0, :] = np.arange(n+1)
    distance[:, 0] = np.arange(m+1)

    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                distance[i, j] = distance[i-1, j-1]
            else:
                distance[i, j] = 1 + min(
                    distance[i-1, j],    # deletion
                    distance[i, j-1],    # insertion
                    distance[i-1, j-1]   # substitution
                )
    return distance[m, n]
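A quick sanity check with the classic example (substitute ‘k’ with ‘s’, substitute ‘e’ with ‘i’, append ‘g’):

print(fast_lev("kitten", "sitting"))   # prints 3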

[0] Andrew Koenig, “C Traps and Pitfalls”, Addison-Wesley Professional.
