Bounds and Estimates on the Average Edit Distance

11/13/2022
by Gianfranco Bilardi, et al.

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let e_k(n) denote the average edit distance between random, independent strings of n characters from an alphabet of size k. For k ≥ 2, it is an open problem how to efficiently compute the exact value of α_k(n) = e_k(n)/n, as well as that of α_k = lim_{n→∞} α_k(n), a limit known to exist. This paper shows that α_k(n) - Q(n) ≤ α_k ≤ α_k(n), for a specific Q(n) = Θ(√(log n / n)), a result which implies that α_k is computable. The exact computation of α_k(n) is explored, leading to an algorithm running in time T = 𝒪(n^2 k min(3^n, k^n)), a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how α_k(n) can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of n up to, say, a quarter million. Correspondingly, 99.9% confidence intervals of width approximately 10^-2 are obtained for α_k. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound β_k^* to α_k, such that lim_{k→∞} β_k^* = 1. In general, β_k^* ≤ α_k ≤ 1 - 1/k; for k greater than a few dozen, computing β_k^* is much faster than generating good statistical estimates with confidence intervals of width 1 - 1/k - β_k^*. The techniques developed in the paper yield improvements on most previously published numerical values, as well as results for alphabet sizes and string lengths not reported before.
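As an illustration of the statistical approach mentioned in the abstract, the following is a minimal Python sketch of a Monte Carlo estimate of α_k(n) with a McDiarmid-style confidence half-width. It is not the paper's implementation: the function names `edit_distance` and `estimate_alpha`, the parameters `num_pairs`, `delta`, and `seed`, and the uniform sampling of characters are all assumptions made for this example. The half-width follows from the standard form of McDiarmid's inequality, using the fact that changing one of the 2·n·num_pairs independent characters changes the averaged, normalized distance by at most 1/(n·num_pairs); the paper's own analysis may use sharper constants.

```python
import math
import random


def edit_distance(x, y):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


def estimate_alpha(n, k, num_pairs, delta=0.001, seed=0):
    """Estimate alpha_k(n) from num_pairs random string pairs.

    Returns (estimate, half_width) such that, by McDiarmid's inequality,
    |estimate - alpha_k(n)| <= half_width with probability >= 1 - delta.
    Assumption: characters are drawn uniformly and independently.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_pairs):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    estimate = total / (num_pairs * n)
    # P(|est - E[est]| >= eps) <= 2 * exp(-eps^2 * n * num_pairs)
    half_width = math.sqrt(math.log(2 / delta) / (n * num_pairs))
    return estimate, half_width


if __name__ == "__main__":
    est, hw = estimate_alpha(n=50, k=4, num_pairs=20)
    print(f"alpha_4(50) is approximately {est:.3f} +/- {hw:.3f}")
```

The interval produced here is for α_k(n), not for the limit α_k; turning it into an interval for α_k requires the explicit Q(n) from the paper, which the abstract does not state. The quadratic-time dynamic program is also only practical for small n in pure Python, so the string lengths mentioned in the abstract would need a much faster distance routine.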
