Name
  • String
  • Edit distance
Edit

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other. The phrase edit distance is often used to refer specifically to Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is closely related to pairwise string alignments.

Definition

Mathematically, the Levenshtein distance between two strings a, b is given by \operatorname{lev}_{a,b}(|a|,|b|) where

\qquad\operatorname{lev}_{a,b}(i,j) = \begin{cases}
  \max(i,j) &, \min(i,j)=0 \\
  \min \begin{cases}
          \operatorname{lev}_{a,b}(i-1,j) + 1 \\
          \operatorname{lev}_{a,b}(i,j-1) + 1 \\
          \operatorname{lev}_{a,b}(i-1,j-1) + [a_i \neq b_j]
       \end{cases} &, \text{ else}
\end{cases}

Note that the first element in the minimum corresponds to deletion (from a to b), the second to insertion and the third to match or mismatch, depending on whether the respective symbols are the same.


Example

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end).


Upper and lower bounds

The Levenshtein distance has several simple upper and lower bounds that are useful in applications which are applied with many of them and compare them. These include:

  • It is always at least the difference of the sizes of the two strings.
  • It is at most the length of the longer string.
  • It is zero if and only if the strings are equal.
  • If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
  • The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

C++