Approximating LCS and Alignment Distance over Multiple Sequences

10/24/2021
by Debarati Das, et al.

We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence, LCS) or minimizes the number of unaligned symbols (the alignment distance, AD). Multiple sequence alignment is a well-studied problem in bioinformatics and is used to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of the LCS or AD of m sequences, each of length n, requires Θ(n^m) time unless the Strong Exponential Time Hypothesis is false. In this paper, we provide several results on approximating the LCS and AD of multiple sequences. If the LCS of m sequences, each of length n, is λn for some λ ∈ [0,1], then in Õ_m(n^(⌊m/2⌋+1)) time we can return a common subsequence of length at least λ^2 n/(2+ϵ) for any constant ϵ > 0. It is possible to approximate the AD within a factor of two in Õ_m(n^⌈m/2⌉) time. However, going below a 2-approximation requires breaking the triangle inequality barrier, which is a major challenge in this area: no such algorithm with a running time of O(n^(αm)) for any α < 1 is known. If the AD is θn, we design an algorithm that approximates the AD within a factor of (2 - 3θ/16 + ϵ) in Õ_m(n^(⌊m/2⌋+2)) time. Thus, if θ is a constant, we get a below-2 approximation in Õ_m(n^(⌊m/2⌋+2)) time. Moreover, we show that if just one of the m sequences is (p,B)-pseudorandom, then we get a below-2 approximation in Õ_m(nB^(m-1) + n^(⌊m/2⌋+3)) time irrespective of θ.
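To make the two objectives concrete, here is a minimal sketch of the exact baseline that the abstract's results are measured against. It is the textbook dynamic program over all m-tuples of prefix lengths (roughly n^m states), not any of the paper's approximation algorithms. The function names `multi_lcs` and `alignment_distance` are hypothetical. The sketch also assumes that the alignment distance counts every symbol left out of the common subsequence, so that AD equals the total length minus m times the LCS; the abstract does not spell out this formula.

```python
from functools import lru_cache


def multi_lcs(seqs):
    """Length of the longest common subsequence of all sequences in `seqs`.

    Exact textbook DP with one state per tuple of prefix lengths, so it is
    only practical for tiny inputs (about n^m states for m sequences).
    """
    seqs = tuple(seqs)
    m = len(seqs)

    @lru_cache(maxsize=None)
    def solve(idx):
        # idx[k] is the length of the prefix of seqs[k] still under consideration.
        if any(i == 0 for i in idx):
            return 0
        last = [s[i - 1] for s, i in zip(seqs, idx)]
        if all(c == last[0] for c in last):
            # Every prefix ends in the same symbol, so aligning it is optimal.
            return 1 + solve(tuple(i - 1 for i in idx))
        # Otherwise at least one prefix's last symbol is unaligned: try each.
        return max(
            solve(idx[:k] + (idx[k] - 1,) + idx[k + 1:]) for k in range(m)
        )

    return solve(tuple(len(s) for s in seqs))


def alignment_distance(seqs):
    """Number of unaligned symbols, ASSUMING AD = total length - m * LCS."""
    seqs = tuple(seqs)
    return sum(len(s) for s in seqs) - len(seqs) * multi_lcs(seqs)
```

For example, `multi_lcs(("abcde", "ace", "xaxcxe"))` finds the common subsequence "ace", and under the stated assumption the corresponding alignment distance is 14 - 3·3 = 5.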


