An Algorithmic Bridge Between Hamming and Levenshtein Distances

11/22/2022
by Elazar Goldenberg, et al.
The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we make substitutions cheaper than indels: each substitution costs 1/a for a parameter a ≥ 1, while each indel keeps unit cost. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ϵ)-approximation of ED_a(X,Y), given strings X, Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ϵ > 0; then, our algorithm takes Õ(n/a + ak^3) time. Whenever a is not Õ(1) and k is small enough, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, that is, an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ(nk_I/k_S + k_S · k_I^3)-time (1,1+ϵ)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S / k_I). These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ϵ)-approximation in sublinear time, even for a favorable choice of k.
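To make the definition of ED_a concrete, the following is a minimal sketch of the textbook quadratic-time dynamic program, with indels costing 1 and substitutions costing 1/a. It is not the exact or approximation algorithm from the paper, only a baseline that illustrates the cost model. The function name `ed_a` is hypothetical, and the sketch assumes an integer a so that results can be kept exact with `Fraction`.

```python
from fractions import Fraction


def ed_a(x: str, y: str, a: int) -> Fraction:
    """Baseline O(|x|*|y|) DP for ED_a: indel cost 1, substitution cost 1/a.

    Illustrative only; assumes an integer parameter a >= 1.
    """
    if a < 1:
        raise ValueError("a must be at least 1")
    sub_cost = Fraction(1, a)
    # prev[j] holds the cost of aligning the processed prefix of x with y[:j].
    prev = [Fraction(j) for j in range(len(y) + 1)]
    for i, cx in enumerate(x, start=1):
        cur = [Fraction(i)]  # aligning x[:i] with the empty prefix: i deletions
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,                                 # delete cx
                cur[j - 1] + 1,                              # insert cy
                prev[j - 1] + (0 if cx == cy else sub_cost), # match or substitute
            ))
        prev = cur
    return prev[-1]
```

For example, with X = "abc" and Y = "bcd", a = 1 gives the classical edit distance 2 (one deletion and one insertion), whereas a = 4 gives 3/4, because three cheap substitutions now beat two indels. This is the shift toward Hamming-like behavior that grows as a increases.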

research
07/24/2020

Sublinear-Time Algorithms for Computing Embedding Gap Edit Distance

In this paper, we design new sublinear-time algorithms for solving the g...
research
10/02/2019

Sublinear Algorithms for Gap Edit Distance

The edit distance is a way of quantifying how similar two strings are to...
research
11/12/2017

Longest Alignment with Edits in Data Streams

Analyzing patterns in data streams generated by network traffic, sensor ...
research
06/15/2023

On the k-Hamming and k-Edit Distances

In this paper we consider the weighted k-Hamming and k-Edit distances, t...
research
10/02/2019

Approximating the Geometric Edit Distance

Edit distance is a measurement of similarity between two sequences such ...
research
01/01/2020

Approximating Text-to-Pattern Hamming Distances

We revisit a fundamental problem in string matching: given a pattern of ...
research
07/28/2020

A Simple Sublinear Algorithm for Gap Edit Distance

We study the problem of estimating the edit distance between two n-chara...
