How Compression and Approximation Affect Efficiency in String Distance Measures

12/10/2021
by Arun Ganesh, et al.

Real-world data often comes in compressed form. Analyzing compressed data directly (without decompressing it) can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and try to quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compression schemes. For two strings of total length N and total compressed size n, it is known that the edit distance and a longest common subsequence (LCS) can be computed exactly in time Õ(nN), as opposed to O(N^2) for the uncompressed setting. Many applications need to align multiple sequences simultaneously, and the fastest known exact algorithms for median edit distance and LCS of k strings run in O(N^k) time. This naturally raises the question of whether compression can help to reduce the running time significantly for k ≥ 3, perhaps to O(N^{k/2} n^{k/2}) or O(N n^{k-1}). Unfortunately, we show lower bounds that rule out any improvement beyond Ω(N^{k-1} n) time for any of these problems assuming the Strong Exponential Time Hypothesis. At the same time, we show that approximation and compression together can be surprisingly effective. We develop an Õ(N^{k/2} n^{k/2})-time FPTAS for the median edit distance of k sequences. In comparison, no O(N^{k-Ω(1)})-time PTAS is known for the median edit distance problem in the uncompressed setting. For two strings, we get an Õ(N^{2/3} n^{4/3})-time FPTAS for both edit distance and LCS. In contrast, for uncompressed strings, there is not even a subquadratic algorithm for LCS that has less than a polynomial gap in the approximation factor. Building on the insight from our approximation algorithms, we also obtain results for many distance measures including the edit, Hamming, and shift distances.
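To make the two size parameters concrete, the following Python sketch illustrates a grammar-compressed string (a straight-line program in which every rule is either a single character or a pair of earlier rules) alongside the textbook quadratic dynamic program for edit distance on decompressed strings. It is only an illustration of the setting, not the algorithms of the paper: the names `power_grammar`, `expanded_length`, `expand`, and `edit_distance` are hypothetical helpers introduced here, and the edit distance routine is the standard uncompressed O(N^2) baseline.

```python
from typing import Dict, Tuple, Union

# Assumed representation: a rule is a single terminal character
# or a pair of names of other rules (a straight-line program).
Rule = Union[str, Tuple[str, str]]
Grammar = Dict[str, Rule]


def power_grammar(ch: str, k: int) -> Grammar:
    """Grammar with k + 1 rules X0..Xk, where Xk derives ch repeated 2**k times."""
    grammar: Grammar = {"X0": ch}
    for i in range(1, k + 1):
        grammar[f"X{i}"] = (f"X{i - 1}", f"X{i - 1}")
    return grammar


def expanded_length(grammar: Grammar, start: str) -> int:
    """Length N of the derived string, computed without decompressing it."""
    memo: Dict[str, int] = {}

    def length(name: str) -> int:
        if name not in memo:
            rule = grammar[name]
            if isinstance(rule, str):
                memo[name] = 1
            else:
                left, right = rule
                memo[name] = length(left) + length(right)
        return memo[name]

    return length(start)


def expand(grammar: Grammar, start: str) -> str:
    """Fully decompress the string derived from `start` (costs O(N) space)."""
    memo: Dict[str, str] = {}

    def text(name: str) -> str:
        if name not in memo:
            rule = grammar[name]
            if isinstance(rule, str):
                memo[name] = rule
            else:
                left, right = rule
                memo[name] = text(left) + text(right)
        return memo[name]

    return text(start)


def edit_distance(a: str, b: str) -> int:
    """Textbook O(|a| * |b|) dynamic program, keeping only two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    g = power_grammar("a", 10)
    print(len(g), expanded_length(g, "X10"))                 # 11 1024
    print(edit_distance(expand(g, "X3"), expand(g, "X2")))   # 4
```

In this toy case the compressed size is n = 11 rules while the derived string has length N = 1024, which is the kind of gap the running times above are measured against. The baseline here decompresses both inputs before comparing them, so it gains nothing from compressibility.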

research
02/08/2023

Weighted Edit Distance Computation: Strings, Trees and Dyck

Given two strings of length n over alphabet Σ, and an upper bound k on t...
research
04/10/2019

Reducing approximate Longest Common Subsequence to approximate Edit Distance

Given a pair of strings, the problems of computing their Longest Common ...
research
07/14/2023

Approximating Edit Distance in the Fully Dynamic Model

The edit distance is a fundamental measure of sequence similarity, defin...
research
12/06/2021

On Complexity of 1-Center in Various Metrics

We consider the classic 1-center problem: Given a set P of n points in a...
research
02/19/2020

Space Efficient Deterministic Approximation of String Measures

We study approximation algorithms for the following three string measure...
research
03/04/2020

Pivot Selection for Median String Problem

The Median String Problem is W[1]-Hard under the Levenshtein distance, t...
research
12/04/2019

Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string

Strings are a natural representation of biological data such as DNA, RNA...
