Bounds and Estimates on the Average Edit Distance
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let e_k(n) denote the average edit distance between random, independent strings of n characters from an alphabet of size k. For k ≥ 2, it is an open problem how to efficiently compute the exact value of α_k(n) = e_k(n)/n, as well as of α_k = lim_{n→∞} α_k(n), a limit known to exist. This paper shows that α_k(n) - Q(n) ≤ α_k ≤ α_k(n), for a specific Q(n) = Θ(√(log n / n)), a result which implies that α_k is computable. The exact computation of α_k(n) is explored, leading to an algorithm running in time T = 𝒪(n^2 k min(3^n, k^n)), a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how α_k(n) can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of n up to, say, a quarter million. Correspondingly, 99.9% confidence intervals of width approximately 10^-2 are obtained for α_k. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound β_k^* to α_k, such that lim_{k→∞} β_k^* = 1. In general, β_k^* ≤ α_k ≤ 1 - 1/k; for k greater than a few dozen, computing β_k^* is much faster than generating good statistical estimates with confidence intervals of width 1 - 1/k - β_k^*. The techniques developed in the paper yield improvements on most previously published numerical values, as well as results for alphabet sizes and string lengths not reported before.
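To illustrate the kind of estimation described above, the sketch below is a minimal Monte Carlo estimator of α_k(n): it samples random string pairs, averages their (normalized) edit distance, and attaches a confidence half-width via the standard bounded-differences argument behind McDiarmid's inequality (changing one input character changes the edit distance by at most 1). This is an illustrative assumption-laden sketch, not the paper's exact procedure; the function names, parameters, and the simple 𝒪(n²) Levenshtein routine are choices made here for brevity, and would be far too slow for n near a quarter million.

```python
import math
import random


def edit_distance(s, t):
    """Plain Levenshtein DP: O(|s|*|t|) time, O(|t|) space."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution / match
        prev = curr
    return prev[-1]


def estimate_alpha(k, n, samples, delta=1e-3, rng=random):
    """Monte Carlo estimate of alpha_k(n) = e_k(n)/n.

    The half-width comes from McDiarmid's inequality: the normalized
    average is a function of 2*n*samples i.i.d. characters, each with
    bounded difference 1/(n*samples), giving
        P(|estimate - alpha_k(n)| >= eps) <= 2 * exp(-eps^2 * n * samples).
    """
    total = 0
    for _ in range(samples):
        s = [rng.randrange(k) for _ in range(n)]
        t = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(s, t)
    estimate = total / (samples * n)
    half_width = math.sqrt(math.log(2 / delta) / (n * samples))
    return estimate, half_width


if __name__ == "__main__":
    # Small n and sample count, purely for illustration.
    est, hw = estimate_alpha(k=4, n=300, samples=50)
    print(f"alpha_4(300) ~ {est:.4f} +/- {hw:.4f} (99.9% confidence)")
```

Note that the half-width shrinks as 1/√(n·samples), which is why, as the abstract points out, long strings allow tight intervals with relatively few samples; the bottleneck is then the cost of each edit-distance computation.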