A Linear-Time n^0.4-Approximation for Longest Common Subsequence

06/15/2021
by   Karl Bringmann, et al.
Universität Saarland

We consider the classic problem of computing the Longest Common Subsequence (LCS) of two strings of length n. While a simple quadratic algorithm has been known for the problem for more than 40 years, no faster algorithm has been found despite an extensive effort. The lack of progress on the problem has recently been explained by Abboud, Backurs, and Vassilevska Williams [FOCS'15] and Bringmann and Künnemann [FOCS'15], who proved that there is no subquadratic algorithm unless the Strong Exponential Time Hypothesis fails. This has led the community to look for subquadratic approximation algorithms for the problem. Yet, unlike the edit distance problem, for which a constant-factor approximation in almost-linear time is known, very little progress has been made on LCS, making it a notoriously difficult problem also in the realm of approximation. For the general setting, only a naive O(n^{ε/2})-approximation algorithm with running time Õ(n^{2-ε}) has been known, for any constant 0 < ε ≤ 1. Recently, a breakthrough result by Hajiaghayi, Seddighin, Seddighin, and Sun [SODA'19] provided a linear-time algorithm that yields an O(n^{0.497956})-approximation in expectation, improving upon the naive O(√n)-approximation for the first time. In this paper, we provide an algorithm that in time O(n^{2-ε}) computes an Õ(n^{2ε/5})-approximation with high probability, for any 0 < ε ≤ 1. Our result (1) gives an Õ(n^{0.4})-approximation in linear time, improving upon the bound of Hajiaghayi, Seddighin, Seddighin, and Sun, (2) provides an algorithm whose approximation scales with any subquadratic running time O(n^{2-ε}), improving upon the naive bound of O(n^{ε/2}) for any ε, and (3) succeeds with high probability instead of only in expectation.


1 Introduction

The longest common subsequence (LCS) of two strings x and y is the longest string that appears as a subsequence of both strings. The length of the LCS of x and y, which we denote by L(x, y), is one of the most fundamental measures of similarity between two strings and has drawn significant interest over the last five decades, see, e.g., [35, 6, 26, 27, 30, 32, 10, 31, 11, 36, 22, 16, 28, 2, 19, 4, 20, 3, 33, 34, 24]. On strings of length n, the LCS problem can be solved exactly in quadratic time using a classical dynamic programming approach [35]. Despite an extensive line of research, the quadratic running time has been improved only by logarithmic factors [30]. This lack of progress is explained by a recent result showing that any truly subquadratic algorithm for LCS would falsify the Strong Exponential Time Hypothesis (SETH); this has been proven independently by Abboud et al. [2] and by Bringmann and Künnemann [19]. Further work in this direction shows that even a large polylogarithmic speedup for LCS would have surprising consequences [4, 3]. For the closely related edit distance the situation is similar: the classic quadratic running time can be improved by logarithmic factors, but any truly subquadratic algorithm would falsify SETH [12].
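For concreteness, here is a minimal Python sketch of the classical quadratic-time dynamic program mentioned above (the textbook algorithm, not this paper's contribution):

```python
def lcs_length(x: str, y: str) -> int:
    """Classical O(|x|*|y|) dynamic program for the LCS length L(x, y)."""
    n, m = len(x), len(y)
    # dp[j] = LCS length of x[:i] and y[:j] for the current row i.
    dp = [0] * (m + 1)
    for i in range(1, n + 1):
        prev_diag = 0  # value of dp at (i-1, j-1)
        for j in range(1, m + 1):
            prev_row = dp[j]  # value of dp at (i-1, j)
            if x[i - 1] == y[j - 1]:
                dp[j] = prev_diag + 1
            else:
                dp[j] = max(dp[j], dp[j - 1])
            prev_diag = prev_row
    return dp[m]

# Example: the LCS of "bananas" and "katana" is "aana" (length 4).
assert lcs_length("bananas", "katana") == 4
```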

These strong hardness results naturally raise the question whether LCS or edit distance can be efficiently approximated, that is, whether an algorithm running in truly subquadratic time O(n^{2-ε}), for some constant ε > 0, can produce a good approximation in the worst case. In the last two decades, significant progress has been made towards designing efficient approximation algorithms for edit distance [14, 13, 15, 9, 7, 21, 23, 29, 17]; the latest achievement is a constant-factor approximation in almost-linear time [8] (by almost-linear we mean time O(n^{1+δ}) for a constant δ > 0 that can be chosen arbitrarily small).

For LCS the picture is much more frustrating. The LCS problem has a simple O(n^{ε/2})-approximation algorithm with running time Õ(n^{2-ε}) for any constant 0 < ε ≤ 1, and it has a trivial |Σ|-approximation algorithm with running time O(n) for strings over alphabet Σ. Yet, improving upon these naive bounds has evaded the community until very recently, making LCS a notoriously hard problem to approximate. In 2019, Rubinstein et al. [33] presented a subquadratic-time algorithm whose approximation ratio depends on λ, where λ is the ratio of the string length to the length of the optimal LCS. For binary alphabet, Rubinstein and Song [34] recently improved upon the trivial 2-approximation. In the general case (where λ and the alphabet size are arbitrary), the naive O(√n)-approximation in near-linear time (by near-linear we mean time Õ(n), where Õ hides polylogarithmic factors in n) was recently beaten by Hajiaghayi et al. [24], who designed a linear-time algorithm that computes an O(n^{0.497956})-approximation in expectation. (While the SODA proceedings version of [24] claimed a high-probability bound, the newer corrected arXiv version [25] only claims that the algorithm outputs an O(n^{0.497956})-approximation in expectation. Personal communication with the authors confirms that the result indeed holds only in expectation, see also Remark 14.) Nonetheless, the gap between the upper bound provided by Hajiaghayi et al. [24] and the recent results on hardness of approximation [1, 5] remains huge.

1.1 Our Contribution

We present a randomized Õ(n^{0.4})-approximation for LCS running in linear time O(n), where the approximation guarantee holds with high probability (we say that an event happens with high probability (w.h.p.) if it has probability at least 1 − n^{-c}, where the constant c can be chosen in advance). More generally, we obtain a tradeoff between approximation guarantee and running time: for any 0 < ε ≤ 1 we achieve approximation ratio Õ(n^{2ε/5}) in time O(n^{2-ε}). Formally we prove the following:

Theorem 1.

There is a randomized algorithm that, given strings x, y of length n and a time budget T ≥ n, with high probability computes a multiplicative Õ((n²/T)^{2/5})-approximation of the length of the LCS of x and y in time O(T).

The improvement over the state of the art can be summarized as follows:

  1. An improved approximation ratio for the linear-time regime: from O(n^{0.497956}) [24] to Õ(n^{0.4});

  2. The first algorithm which improves upon the naive O(√n) bound with high probability;

  3. A generalization to running time O(n^{2-ε}), breaking the naive O(n^{ε/2}) approximation ratio for any constant 0 < ε ≤ 1.

2 Technical Overview

We combine classic exact algorithms for LCS with different subsampling strategies to develop several algorithms that work in different regimes of the problem. A combination of these algorithms then yields the full approximation algorithm.

Our Algorithm 1 covers the regime of a short LCS, i.e., when the LCS has length below a threshold depending on the running time budget. In this regime, we decrease the length of one of the strings by subsampling. This naturally allows us to run classic exact algorithms for LCS on the subsampled string (which now has significantly smaller size) and the other, original string, while not deteriorating the LCS between the two strings too much.

For the remaining parts of the algorithm, the strings x and y are split into substrings x_1, …, x_{√T} and y_1, …, y_{√T} of length n/√T each, where T denotes the total running time budget. For any block (i, j) we write L_{i,j} for the length of the LCS of x_i and y_j. We call a set {(i_1, j_1), …, (i_k, j_k)} with i_1 < … < i_k and j_1 < … < j_k a block sequence. Since we can assume the LCS of x and y to be long, it follows that there exists a good "block-aligned LCS"; more precisely, there exists a block sequence B with large LCS sum ∑_{(i,j) ∈ B} L_{i,j}.

Now, a natural approach is to compute estimates ℓ_{i,j} ≤ L_{i,j} for all blocks (i, j) and to determine the maximum sum ∑_{(i,j) ∈ B} ℓ_{i,j} over all block sequences B. Once we have the estimates ℓ_{i,j}, this maximum can be computed by dynamic programming in time proportional to the number of blocks, which is O(T) for our choice of the block length. In the following we describe three different strategies to compute the estimates ℓ_{i,j}. The major difficulty is that on average we can only afford time Õ(1) per block to compute an estimate ℓ_{i,j}.
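To illustrate the block-sequence step, the following Python sketch (with a hypothetical input matrix of estimates) computes the maximum sum of estimates over all block sequences by dynamic programming:

```python
def best_block_sequence(ell):
    """Maximum sum of ell[i][j] over blocks chosen with strictly increasing
    row and column indices (a "block sequence"), via dynamic programming.
    ell: 2D list, ell[i][j] = estimate for block (i, j)."""
    B = len(ell)
    # dp[i][j] = best sum using only blocks (i', j') with i' <= i and j' <= j.
    dp = [[0] * (B + 1) for _ in range(B + 1)]
    for i in range(1, B + 1):
        for j in range(1, B + 1):
            take = dp[i - 1][j - 1] + ell[i - 1][j - 1]  # end the sequence with block (i, j)
            dp[i][j] = max(take, dp[i - 1][j], dp[i][j - 1])
    return dp[B][B]

# Tiny 3x3 example: picking blocks (0,0) and (2,2) gives 5 + 7 = 12.
example = [[5, 0, 0],
           [0, 0, 0],
           [0, 0, 7]]
assert best_block_sequence(example) == 12
```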

The first strategy focuses on matching pairs. A matching pair of strings x and y is a pair of indices (i, j) such that x[i] = y[j]. We write M(x, y) for the number of matching pairs of the strings x and y. Our Algorithm 2 works well if some block sequence has a large total number of matching pairs ∑_{(i,j) ∈ B} M(x_i, y_j). Here the key observation (Lemma 7) is that for each block (i, j) there exists a symbol that occurs at least M(x_i, y_j)/(|x_i| + |y_j|) times in both x_i and y_j. If M(x_i, y_j) is large, matching this symbol provides a good approximation for L_{i,j}. Unfortunately, since we can afford only running time Õ(1) per block, finding a frequent symbol is difficult. We develop as a new tool an algorithm that w.h.p. finds a frequent symbol in each block with an above-average number of matching pairs, see Lemma 8.

For our remaining two strategies we can assume the optimal LCS L to be large and the matching-pair count to be small (i.e., every block sequence has a small total number of matching pairs). In our Algorithm 3, we analyze the case where L is very large. Here we pick some diagonal and run our basic approximation algorithm on each block along the diagonal. Since there are O(√T) diagonals, an above-average diagonal has a total LCS of Ω(L/√T). If L is large then this provides a good estimate of the LCS. The main difficulty is how to find an above-average diagonal. A random diagonal has a good LCS sum in expectation, but not necessarily with good probability. Our solution is a non-uniform sampling: we first test random blocks until we find a block with large LCS, and then choose the diagonal containing this seed block. This sampling yields an above-average diagonal with good probability.
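The seed-block idea can be sketched as follows (an illustrative sketch; the oracle `estimate_lcs`, the threshold, and the retry count are hypothetical placeholders for the subroutines and parameters developed later):

```python
import random

def sample_diagonal(num_blocks, estimate_lcs, threshold, max_tries=100):
    """Non-uniform diagonal sampling: look for a 'seed' block whose estimated
    LCS exceeds `threshold`, then sum the estimates along its diagonal.
    Falls back to a uniformly random diagonal if no seed is found."""
    offset = None
    for _ in range(max_tries):
        i, j = random.randrange(num_blocks), random.randrange(num_blocks)
        if estimate_lcs(i, j) >= threshold:
            offset = j - i                      # diagonal through the seed block
            break
    if offset is None:
        offset = random.randrange(-(num_blocks - 1), num_blocks)
    return sum(estimate_lcs(i, i + offset)
               for i in range(num_blocks) if 0 <= i + offset < num_blocks)
```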

Recall that there always exists a block sequence with large LCS sum (see Lemma 11). The idea of our Algorithm 4 is to focus on a uniformly random subset of all blocks, where each block is picked with probability p. Then on each picked block we can spend more time (specifically, a factor 1/p more time) to compute an estimate ℓ_{i,j}. Moreover, we still find an Ω(p)-fraction of the LCS sum ∑_{(i,j) ∈ B} L_{i,j}. We analyze this algorithm in terms of L and M (the choice of p depends on these two parameters) and show that it works well in the regimes complementary to Algorithms 1–3.

Comparison with the Previous Approach of Hajiaghayi et al. [24]

The general approach of splitting x and y into blocks and performing dynamic programming over block estimates was introduced by Hajiaghayi et al. [24]. Moreover, our Algorithm 1 has essentially the same guarantees as [24, Algorithm 1], but ours is a simple combination of generic parts that we reuse in our later algorithms, thus simplifying the overall algorithm.

Our Algorithm 2 follows the same idea as [24, Algorithm 3], in that we want to find a frequent symbol in x_i and y_j and match only this symbol to obtain an estimate ℓ_{i,j}. Hajiaghayi et al. find a frequent symbol by picking a random symbol in each block; in expectation this symbol appears sufficiently often in x_i and y_j. In order to obtain with-high-probability guarantees, we need to develop a new tool for finding frequent symbols not only in expectation but even with high probability, see Lemma 8 and Remark 14.

The remainder of the approach differs significantly; our Algorithms 3 and 4 are very different compared to [24, Algorithms 2 and 4]. In the following we discuss their ideas. In [24, Algorithm 2], they argue about the alphabet size, splitting the alphabet into frequent and infrequent letters. For infrequent letters the total number of matching pairs is small, so augmenting a classic exact algorithm by subsampling works well. Therefore, they can assume that every letter is frequent and thus the alphabet size is small. We avoid this line of reasoning. Finally, [24, Algorithm 4] is their most involved algorithm. Assuming that their other algorithms have failed to produce a sufficiently good approximation, they show that each part x_i and y_j can be turned into a semi-permutation by a little subsampling. Then, by leveraging Dilworth's theorem and Turán's theorem, they show that most blocks have a large LCS; this can be seen as a triangle inequality for LCS and is their most novel contribution. This results in a highly non-trivial algorithm making clever use of combinatorial machinery.

We show that these ideas can be completely avoided, by instead relying on classic algorithms based on matching pairs augmented by subsampling. Specifically, we replace their combinatorial machinery by our Algorithms 3 and 4 described above (recall that Algorithm 3 considers a non-uniformly sampled random diagonal while Algorithm 4 subsamples the set of blocks to be able to spend more time per block). We stress that our solution completely avoids the concept of semi-permutation or any heavy combinatorial machinery as used in [24, Algorithm 4], while providing a significantly improved approximation guarantee.

Organization of the Paper.

Section 3 introduces notation and a classical algorithm by Hunt and Szymanski. In Section 4 we present our new tools, in particular for finding frequent symbols. Section 5 contains our main algorithm, split into four parts that are presented in Sections 5.1, 5.3, 5.4, and 5.5, and combined in Section 5.6. In the appendix, for completeness we sketch the algorithm by Hunt and Szymanski (Appendix A) and we present pseudocode for all our algorithms (Appendix B).

3 Preliminaries

For n ∈ ℕ we write [n] := {1, 2, …, n}. By the notation Õ(·) and Ω̃(·) we hide factors of the form polylog(n). We use "with high probability" (w.h.p.) to denote probabilities of the form 1 − n^{-c}, where the constant c can be chosen in advance.

String Notation.

A string x over alphabet Σ is a finite sequence of letters in Σ. We denote its length by |x| and its i-th letter by x[i]. We also denote by x[i..j] the substring consisting of letters x[i], x[i+1], …, x[j]. For any indices i_1 < i_2 < … < i_k, the string x[i_1] x[i_2] ⋯ x[i_k] forms a subsequence of x. For strings x, y we denote by L(x, y) the length of the longest common subsequence of x and y. In this paper we study the problem of approximating L(x, y) for given strings x, y of length n. We focus on computing the length L(x, y); however, our algorithms can be easily adapted to also reconstruct a common subsequence attaining the output length. If x, y are clear from the context, we may abbreviate L(x, y) by L. Throughout the paper we assume that the alphabet is Σ ⊆ [2n] (this is without loss of generality after an O(n log n)-time preprocessing).

Matching Pairs.

For a symbol σ ∈ Σ, we denote the number of times that σ appears in x by #_x(σ), and call this the frequency of σ in x. For strings x and y, a matching pair is a pair (i, j) with x[i] = y[j]. We denote the number of matching pairs of x and y by M(x, y). If x, y are clear from the context, we may abbreviate M(x, y) by M. Observe that M(x, y) = ∑_{σ ∈ Σ} #_x(σ) · #_y(σ). Using this equation we can compute M(x, y) in time O(|x| + |y|).
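As a quick illustration of this identity (a minimal sketch, not taken from the paper):

```python
from collections import Counter

def matching_pairs(x: str, y: str) -> int:
    """Number of index pairs (i, j) with x[i] == y[j], in O(|x| + |y|) time."""
    freq_x, freq_y = Counter(x), Counter(y)
    return sum(cnt * freq_y[sym] for sym, cnt in freq_x.items())

# "aab" vs "aba": symbol 'a' contributes 2*2 = 4 pairs, 'b' contributes 1*1 = 1.
assert matching_pairs("aab", "aba") == 5
```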

Hunt and Szymanski [27] solved the LCS problem in time Õ(n + M). More precisely, their algorithm can be viewed as having a preprocessing phase that only reads y and runs in time Õ(|y|), and a query phase that reads x and the preprocessed y and takes time Õ(|x| + M(x, y)).

Theorem 2 (Hunt and Szymanski [27]).

We can preprocess a string y in time Õ(|y|). Given a string x and a preprocessed string y, we can compute their LCS length L(x, y) in time Õ(|x| + M(x, y)).

For convenience, we provide a proof sketch of their theorem in Appendix A.
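For intuition, here is a compact Python sketch of the Hunt–Szymanski approach (a simplified illustration based on the standard reduction to longest increasing subsequence; the paper's own presentation is in Appendix A):

```python
from bisect import bisect_left
from collections import defaultdict

def preprocess(y):
    """Preprocessing phase: record the positions of each symbol of y."""
    pos = defaultdict(list)
    for j, sym in enumerate(y):
        pos[sym].append(j)
    return pos

def lcs_hunt_szymanski(x, pos):
    """Query phase: LCS length of x and the preprocessed y in time
    O((|x| + M) log n), where M is the number of matching pairs."""
    # thresholds[k] = smallest end position in y of a common subsequence of length k+1
    thresholds = []
    for sym in x:
        # scan the y-positions of sym in decreasing order so that each
        # character of x is matched at most once per subsequence length
        for j in reversed(pos.get(sym, ())):
            k = bisect_left(thresholds, j)
            if k == len(thresholds):
                thresholds.append(j)
            else:
                thresholds[k] = j
    return len(thresholds)

pos_y = preprocess("katana")
assert lcs_hunt_szymanski("bananas", pos_y) == 4
```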

4 New Basic Tools

4.1 Basic Approximation Algorithm

Throughout this section we abbreviate L := L(x, y) and M := M(x, y). We start with the basic approximation algorithm that is central to our approach; most of our later algorithms use this as a subroutine. This algorithm subsamples the string x and then runs Hunt and Szymanski's algorithm (Theorem 2).

Lemma 3 (Basic Approximation Algorithm).

Let γ ≥ 1. We can preprocess y in time Õ(|y|). Given x, the preprocessed string y, and γ, in expected time Õ(1 + (|x| + M)/γ) we can compute a value λ ≤ L that w.h.p. satisfies λ ≥ L/(2γ) − O(log n).

Proof.

In the preprocessing phase, we run the preprocessing of Theorem 2 on y.

Fix a sufficiently large constant c. If γ ≤ c, then in the query phase we simply run Theorem 2, solving LCS exactly in time Õ(|x| + M).

Otherwise, denote by x' a random subsequence of x, where each letter is removed independently with probability 1 − p (i.e., kept with probability p) for p := c/γ. Note that p ≤ 1 by our assumption on γ. We can sample x' in expected time O(1 + |x|/γ), since the difference from one kept letter to the next is geometrically distributed, and geometric random variates can be sampled in expected time O(1), see, e.g., [18]. Note that this subsampling yields E[|x'|] = p · |x| and E[M(x', y)] = p · M.

In the query phase, we sample x' and then run the query phase of Theorem 2 on x' and the preprocessed y. This runs in time Õ(|x'| + M(x', y)), which is Õ(1 + (|x| + M)/γ) in expectation.

Finally, consider a fixed LCS of x and y, namely z = x[i_1] x[i_2] ⋯ x[i_L] for indices i_1 < … < i_L. Each letter x[i_ℓ] survives the subsampling to x' with probability p. Therefore, we can bound L(x', y) from below by a binomial random variable B with parameters L and p (the correct terminology is that L(x', y) statistically dominates B). Since B is a sum of independent {0,1}-variables, multiplicative Chernoff applies and yields Pr[B < E[B]/2] ≤ exp(−E[B]/8). If E[B] = pL ≥ 8c' ln n for a suitable constant c', then this failure probability is at most n^{−c'}, and thus w.h.p. λ ≥ B ≥ pL/2 ≥ L/(2γ). Otherwise, if pL < 8c' ln n, then L/(2γ) ≤ pL/2 = O(log n) and we can only bound λ ≥ 0 ≥ L/(2γ) − O(log n). In both cases, we have λ ≥ L/(2γ) − O(log n) with high probability. ∎
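A hedged sketch of this basic approximation algorithm, reusing the `preprocess` and `lcs_hunt_szymanski` sketches from Section 3 (the constant c and the sampling details are illustrative assumptions):

```python
import math, random

def subsample(x, p):
    """Keep each letter of x independently with probability p, generated via
    geometric jumps so that the expected time is O(1 + p*|x|)."""
    if p >= 1.0:
        return list(x)
    kept, i, log_q = [], -1, math.log(1.0 - p)
    while True:
        u = 1.0 - random.random()                  # uniform in (0, 1]
        i += 1 + int(math.log(u) / log_q)          # geometric gap to the next kept index
        if i >= len(x):
            return kept
        kept.append(x[i])

def basic_approximation(x, pos_y, gamma, c=2.0):
    """Sketch of the basic approximation algorithm: subsample x at rate ~c/gamma,
    then run the Hunt-Szymanski query on the subsampled string."""
    return lcs_hunt_szymanski(subsample(x, min(1.0, c / gamma)), pos_y)
```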

The above lemma behaves poorly if L = O(γ log n), due to the additive "−O(log n)" term in the approximation guarantee. We next show that this can be avoided, at the cost of increasing the running time by an additive O(|x| + |y|).

Lemma 4 (Generalised Basic Approximation Algorithm).

Given x, y, and γ ≥ 1, in expected time Õ(|x| + |y| + M/γ) we can compute a value λ ≤ L that w.h.p. satisfies λ ≥ L/Õ(γ).

Proof.

We run the basic approximation algorithm from Lemma 3 (including its preprocessing), which computes a value λ ≤ L. Additionally, we compute the number of matching pairs M in time O(|x| + |y|). If M ≥ 1, then there exists a matching pair, which yields a common subsequence of length 1. Therefore, if M ≥ 1 we set λ := max(λ, 1).

In the proof of Lemma 3 we showed that if L ≥ Cγ log n (for a suitable constant C) then w.h.p. we have λ ≥ L/(2γ). We now argue differently in the case L < Cγ log n. If L = 0, then λ ≥ 0 = L and we are done. If L ≥ 1, then there must exist at least one matching pair, so M ≥ 1, so the second part of our algorithm yields λ ≥ 1 ≥ L/(Cγ log n). Hence, in all cases w.h.p. we have λ ≥ L/Õ(γ). ∎

We now turn towards the problem of deciding, for given x, y, and a number t, whether L ≥ t. To this end, we repeatedly call the basic approximation algorithm with geometrically decreasing approximation ratio γ. Note that with decreasing approximation ratio we get a better approximation guarantee at the cost of a higher running time. The idea is that if the LCS is much shorter than the threshold t, then already a coarse approximation ratio γ allows us to detect that L < t. This yields a running time bound depending on the gap between L and t.

Lemma 5 (Basic Decision Algorithm).

Let t ≥ 1. We can preprocess y in time Õ(|y|). Given x, the preprocessed y, and the number t, in expected time Õ(1 + (|x| + M) · (L + log n)/t) we can w.h.p. correctly decide whether L ≥ t. Our algorithm has no false positives (and w.h.p. no false negatives).

Proof.

In the preprocessing phase, we run the preprocessing of Lemma 3. In the query phase, we repeatedly call the query phase of Lemma 3, with geometrically decreasing values of γ:

  1. Preprocessing: Run the preprocessing of Lemma 3.

  2. For γ = t, t/2, t/4, … (halving γ until a decision is returned):

    1. Run the query phase of Lemma 3 with parameter γ to obtain an estimate λ.

    2. If λ ≥ t: return "L ≥ t".

    3. If λ < t/(2γ) − c · log n (where c · log n is the additive slack from Lemma 3): return "L < t".

Let us first argue correctness. Since Lemma 3 computes a common subsequence of x and y, we have λ ≤ L. Thus, if λ ≥ t, we correctly infer L ≥ t. Moreover, w.h.p. λ satisfies λ ≥ L/(2γ) − c · log n. Therefore, if λ < t/(2γ) − c · log n, we can infer L < t, and this decision is correct with high probability. Finally, once γ ≤ t/(2(t + c · log n)), the threshold t/(2γ) − c · log n is at least t (and Lemma 3 computes L exactly for such small γ), so in the last iteration one of λ ≥ t or λ < t/(2γ) − c · log n must hold, and the algorithm indeed returns a decision.

The expected time of the query phase of Lemma 3 is Õ(1 + (|x| + M)/γ). Since γ decreases geometrically, the total expected time of our algorithm is dominated by the last call.

If L ≥ t, then w.h.p. the last call happens at the latest for γ = Θ(L/(t + log n)), since for such γ the estimate satisfies λ ≥ L/(2γ) − O(log n) ≥ t. This yields running time Õ(1 + (|x| + M) · (t + log n)/L) = Õ(1 + (|x| + M) · (L + log n)/t).

If L < t, note that for any γ < t/(2(L + c · log n)) we have λ ≤ L < t/(2γ) − c · log n, and thus we return "L < t". Because we decrease γ by a factor 2 in each iteration, the last call satisfies γ = Ω(t/(L + log n)), and hence the expected running time is Õ(1 + (|x| + M) · (L + log n)/t). In both cases we can bound the expected running time by the claimed Õ(1 + (|x| + M) · (L + log n)/t). ∎
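A sketch of the decision loop, reusing the `basic_approximation` sketch above (the thresholds mirror the analysis only schematically; `c_log` stands in for the O(log n) additive slack):

```python
def decide_lcs_at_least(x, pos_y, t, c_log=10.0):
    """Decide whether the LCS of x and the preprocessed y is at least t,
    refining the approximation ratio gamma geometrically."""
    gamma = float(t)
    while True:
        lam = basic_approximation(x, pos_y, gamma)
        if lam >= t:
            return True                      # sound: lam never exceeds the LCS
        if lam < t / (2.0 * gamma) - c_log:
            return False                     # w.h.p. sound by the guarantee of Lemma 3
        gamma /= 2.0                         # spend more time on a finer estimate
```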

4.2 Approximating the Number of Matching Pairs

Recall that for given strings x, y of length n the number of matching pairs M(x, y) can be computed in time O(n), which is linear in the input size. However, later in the paper we will split x into substrings x_1, …, x_k and y into substrings y_1, …, y_k, each of length n/k, and we will need estimates of the numbers of matching pairs M(x_i, y_j). In this setting, the input size is still O(n) (the total length of all strings x_i and y_j) and the output size is O(k²) (all numbers M(x_i, y_j)), but we are not aware of any algorithm computing the numbers M(x_i, y_j) in near-linear time in the input plus output size O(n + k²). (In fact, one can show conditional lower bounds from Boolean matrix multiplication that rule out near-linear time for computing all M(x_i, y_j) unless the exponent of matrix multiplication is 2.) Therefore, we devise an approximation algorithm for estimating the number of matching pairs.

Lemma 6.

For i, j ∈ [k] write M_{i,j} := M(x_i, y_j) and M := ∑_{i,j ∈ [k]} M_{i,j}. Given the strings x_1, …, x_k, y_1, …, y_k and a number ρ ≥ 1, we can compute values m̃_{i,j} that w.h.p. approximate M_{i,j} up to a constant factor and an additive error of O(ρ), in total expected time Õ(n + M/ρ).

This yields a near-linear-time constant-factor approximation of all above-average M_{i,j}: by setting ρ := M/k², in expected time Õ(n + k²) we obtain a constant-factor approximation of all values M_{i,j} with M_{i,j} = Ω(M/k²).

Proof.

The algorithm works as follows (a concrete instantiation is sketched in code after the list).

  1. Graph Construction: Build a three-layered graph on vertex set A ∪ B ∪ C, where A has a node for every string x_i, C has a node for every string y_j, and B has a node for every symbol σ ∈ Σ that occurs in the input. Put an edge from x_i to σ iff σ occurs in x_i. Similarly, put an edge from σ to y_j iff σ occurs in y_j. Note that all frequencies and thus all edges of this graph can be computed in total time O(n). For i and j, we denote by N(x_i) ∩ N(y_j) their common neighbors. Note that any common neighbor σ represents all matching pairs of symbol σ in x_i and y_j, and the number of these matching pairs is #_{x_i}(σ) · #_{y_j}(σ).

  2. Subsampling: We sample a subset B' ⊆ B by removing each node independently with probability 1 − p, where the sampling rate p is chosen inversely proportional to ρ.

  3. Determine Common Neighbors: For each surviving node σ ∈ B', enumerate all pairs of neighbors x_i and y_j. For each such 2-path (x_i, σ, y_j), add σ to an initially empty set S_{i,j}. This step computes the sets S_{i,j} in time proportional to their total size.

  4. Output: Return the values m̃_{i,j}, obtained by rescaling (by 1/p) the number of matching pairs represented by the surviving common neighbors in S_{i,j}.
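For intuition, one concrete way to instantiate this scheme is sketched below (an illustrative choice of sampling rate and rescaling, not necessarily the paper's exact construction):

```python
import random
from collections import Counter

def estimate_matching_pairs(xs, ys, rho, reps=9):
    """Illustrative subsampled estimator for all block-wise matching-pair
    counts M(x_i, y_j): keep each symbol with probability ~1/rho, enumerate
    the 2-paths x_i - symbol - y_j of surviving symbols, and rescale."""
    freq_x = [Counter(x) for x in xs]
    freq_y = [Counter(y) for y in ys]
    symbols = set().union(*freq_x, *freq_y)
    p = min(1.0, 1.0 / rho)
    runs = []
    for _ in range(reps):
        est = Counter()
        for s in symbols:
            if random.random() >= p:
                continue                       # symbol removed by the subsampling
            rows = [i for i, f in enumerate(freq_x) if f[s] > 0]
            cols = [j for j, f in enumerate(freq_y) if f[s] > 0]
            for i in rows:                     # all 2-paths through symbol s
                for j in cols:
                    est[(i, j)] += freq_x[i][s] * freq_y[j][s] / p
        runs.append(est)
    keys = {k for run in runs for k in run}
    return {k: sorted(run[k] for run in runs)[reps // 2] for k in keys}
```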

Correctness:

To analyze this algorithm, recall that the common neighbors of x_i and y_j together represent exactly the M_{i,j} matching pairs of x_i and y_j. It therefore suffices to show that the rescaled contribution m̃_{i,j} of the surviving common neighbors is close to M_{i,j}. Using one Bernoulli random variable per common neighbor to express whether it survives the subsampling, we can write m̃_{i,j} as a linear combination of independent Bernoulli random variables. Its expected value is M_{i,j}, so by Markov's inequality, with probability at least 3/4 the value m̃_{i,j} does not overestimate M_{i,j} by more than a constant factor. Since m̃_{i,j} is a linear combination of independent Bernoulli random variables, we can also easily express its variance; plugging in the choice of the sampling rate p and applying Chebyshev's inequality, with probability at least 3/4 the value m̃_{i,j} does not underestimate M_{i,j} by more than a constant factor and an additive error of O(ρ). In case M_{i,j} = Ω(ρ) this yields a constant-factor approximation; otherwise, in case M_{i,j} = O(ρ), we can only use the trivial bound m̃_{i,j} ≥ 0 ≥ M_{i,j} − O(ρ).

Hence, each of the two inequalities (the upper and the lower bound on m̃_{i,j}) individually holds with probability at least 3/4. Finally, we boost the success probability by repeating the above algorithm Θ(log n) times and returning for each (i, j) the median of all computed values m̃_{i,j}.

Running Time:

Steps 1 and 2 can be easily seen to run in time O(n). Steps 3 and 4 run in time proportional to the total size of all sets S_{i,j}, which we claim to be at most O(M/ρ) in expectation. Over the Θ(log n) repetitions, we obtain a total expected running time of Õ(n + M/ρ). (We remark that here we consider a succinct output format, where only the non-zero numbers m̃_{i,j} are listed; otherwise, additional time of O(k²) is required to output all numbers m̃_{i,j}.)

It remains to prove the claimed bound of O(M/ρ). Since the number of blocks x_i containing a symbol σ is at most #_x(σ), and similarly for the blocks y_j, the number of 2-paths through σ is at most #_x(σ) · #_y(σ). From the definition of the sampling rate p = O(1/ρ), we therefore infer that σ contributes at most O(#_x(σ) · #_y(σ)/ρ) to the expected total size of all sets S_{i,j}. Summing over all symbols yields the bound O(∑_σ #_x(σ) · #_y(σ)/ρ) = O(M/ρ). ∎

4.3 Single Symbol Approximation Algorithm

For strings x, y that have a large number of matching pairs M(x, y), some symbol must appear often in x and in y. This yields a common subsequence using (several repetitions of) a single alphabet symbol.

Lemma 7 (Cf. Lemma 6.6.(ii) in [20] or Algorithm 3 in [24]).

For any strings x, y there exists a symbol σ ∈ Σ that appears at least M(x, y)/(|x| + |y|) times in x and in y. Therefore, in time O(|x| + |y|) we can compute a common subsequence of x, y of length at least M(x, y)/(|x| + |y|). In particular, we can compute a value λ that satisfies L(x, y) ≥ λ ≥ M(x, y)/(|x| + |y|).

Proof.

Let k be maximal such that some symbol appears at least k times in x and at least k times in y. Since no symbol appears more than k times in both x and y, we have min(#_x(σ), #_y(σ)) ≤ k for every σ ∈ Σ. We can thus bound

M(x, y) = ∑_{σ ∈ Σ} #_x(σ) · #_y(σ) ≤ ∑_{σ ∈ Σ} k · (#_x(σ) + #_y(σ)) ≤ k · (|x| + |y|),

since the frequencies #_x(σ) sum up to at most |x|, and similarly for y. It follows that k ≥ M(x, y)/(|x| + |y|). Computing k, and a symbol attaining k, in time O(|x| + |y|) is straightforward. ∎
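A minimal sketch of this single-symbol bound:

```python
from collections import Counter

def single_symbol_lcs(x: str, y: str):
    """Common subsequence using a single repeated symbol: the symbol
    maximizing min(#_x(sigma), #_y(sigma)), in O(|x| + |y|) time."""
    fx, fy = Counter(x), Counter(y)
    sym, k = max(((s, min(c, fy[s])) for s, c in fx.items()),
                 key=lambda t: t[1], default=(None, 0))
    return (sym * k if sym else ""), k

# "aabbb" vs "ababab": 'b' occurs 3 times in both, so "bbb" is a common subsequence.
assert single_symbol_lcs("aabbb", "ababab") == ("bbb", 3)
```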

We devise a variant of Lemma 7 in the following setting. For strings x_1, …, x_k and y_1, …, y_k, we write n for their total length, M_{i,j} := M(x_i, y_j), and M := ∑_{i,j ∈ [k]} M_{i,j}. We want to find for each block (i, j) a frequent symbol in x_i and y_j, or equivalently we want to find a common subsequence of x_i and y_j using a single alphabet symbol. Similarly to Lemma 6, we relax Lemma 7 to obtain a fast running time.

Lemma 8.

Given x_1, …, x_k, y_1, …, y_k and any ρ ≥ 1, we can compute for each (i, j) ∈ [k]² a number λ_{i,j} such that w.h.p. λ_{i,j} ≤ L(x_i, y_j), and λ_{i,j} = Ω(M_{i,j}/(|x_i| + |y_j|)) for every block with M_{i,j} = Ω(ρ). The algorithm runs in total expected time Õ(n + M/ρ).

Proof.

We run the same algorithm as in Lemma 6, except that in Step 4, for each (i, j) with non-empty set S_{i,j} we let λ_{i,j} be the maximum of min(#_{x_i}(σ), #_{y_j}(σ)) over all σ ∈ S_{i,j}. For each empty set S_{i,j}, we implicitly set λ_{i,j} := 0, i.e., we output a sparse representation of all non-zero values λ_{i,j}.

The running time analysis is the same as in Lemma 6.

For the upper bound on λ_{i,j}, note that the maximizing symbol σ appears at least min(#_{x_i}(σ), #_{y_j}(σ)) = λ_{i,j} times in x_i and at least λ_{i,j} times in y_j, so there is a common subsequence of x_i and y_j of length at least λ_{i,j}. Thus, we have λ_{i,j} ≤ L(x_i, y_j).

For the lower bound on λ_{i,j}, fix (i, j) and order the common neighbors of x_i and y_j in ascending order of min(#_{x_i}(σ), #_{y_j}(σ)). Since M_{i,j} = ∑_σ #_{x_i}(σ) · #_{y_j}(σ), and since each symbol σ contributes at most min(#_{x_i}(σ), #_{y_j}(σ)) · (#_{x_i}(σ) + #_{y_j}(σ)) to this sum while the frequencies sum up to at most |x_i| + |y_j| (as in the proof of Lemma 6), the symbols σ with min(#_{x_i}(σ), #_{y_j}(σ)) = Ω(M_{i,j}/(|x_i| + |y_j|)) account for a constant fraction of M_{i,j}. Call this set of symbols R; by our ordering, if any node of R survives the subsampling, then the computed value λ_{i,j} is at least Ω(M_{i,j}/(|x_i| + |y_j|)). It thus remains to analyze the probability that at least one node of R survives.

If some node of R is kept with probability 1, then this happens with probability 1. Otherwise, we write the number Z of surviving nodes of R as a sum of independent Bernoulli random variables. Since Z is a sum of independent {0,1}-variables, multiplicative Chernoff applies and shows that Z ≥ 1 with good probability whenever M_{i,j} is sufficiently large compared to ρ. Otherwise, in case M_{i,j} = O(ρ), we can only use the trivial bound λ_{i,j} ≥ 0, which is consistent with the claimed guarantee. In any case, the claimed lower bound holds with constant probability. Similar to the proof of Lemma 6, we run Θ(log n) independent repetitions of this algorithm and return for each (i, j) the maximum of all computed values λ_{i,j}, to boost the success probability and finish the proof. ∎

5 Main Algorithm

In this section we prove Theorem 1. First we show that Theorem 9 implies Theorem 1, and then in the remainder of this section we prove Theorem 9.

Theorem 9 (Main Result, Relaxation).

Given strings x, y of length n and a time budget T ≥ n, in expected time Õ(T) we can compute a number λ such that λ ≤ L(x, y) and w.h.p. λ ≥ L(x, y)/Õ((n²/T)^{2/5}).

Recall Theorem 1:

There is a randomized algorithm that, given strings x, y of length n and a time budget T ≥ n, with high probability computes a multiplicative Õ((n²/T)^{2/5})-approximation of the length of the LCS of x and y in time O(T).

Proof of Theorem 1 assuming Theorem 9.

Note that the difference between Theorems 1 and 9 is that the latter allows expected running time and has an additional slack of logarithmic factors in the running time.

In order to remove the expected running time, we abort the algorithm from Theorem 9 after Õ(T) time steps. By Markov's inequality, we can choose the hidden constants and logfactors such that the probability of aborting is at most 1/2. We boost the success probability of this adapted algorithm by running Θ(log n) independent repetitions and returning the maximum over all computed values λ. This yields an Õ((n²/T)^{2/5})-approximation with high probability in time Õ(T).

To remove the logfactors in the running time, as the first step in our algorithm we subsample the given strings x, y, keeping each symbol independently with probability 1/polylog(n) (for a suitable polylogarithmic factor), resulting in subsampled strings x̃, ỹ. Since any common subsequence of x̃, ỹ is also a common subsequence of x, y, the estimate λ that we compute for x̃, ỹ satisfies λ ≤ L(x, y). Moreover, if L(x, y) ≥ polylog(n), then by a Chernoff bound with high probability we have L(x̃, ỹ) ≥ L(x, y)/polylog(n), so that an approximation on x̃, ỹ also yields an approximation on x, y, losing only logarithmic factors that are absorbed by the Õ(·)-notation. Otherwise, if L(x, y) < polylog(n), then in order to compute an Õ(1)-approximation it suffices to compute an LCS of length 1, which is just a matching pair and can be found in time O(n) (assuming that the alphabet is Σ ⊆ [2n]).

This yields an algorithm that computes a value λ ≤ L(x, y) such that w.h.p. λ ≥ L(x, y)/Õ((n²/T)^{2/5}). The algorithm runs in time O(T), and this running time bound holds deterministically, i.e., with probability 1. Hence, we proved Theorem 1. ∎
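The amplification step of this proof can be sketched as follows (`run_once` is a hypothetical wrapper around the Theorem 9 algorithm that aborts itself after the given budget):

```python
import math

def boost_whp(run_once, time_budget, n):
    """Run the (aborted) expected-time algorithm Theta(log n) times and return
    the maximum estimate.  run_once(budget) returns the estimate, or None if
    the run exceeded `budget` steps and was aborted."""
    best = 0
    for _ in range(max(1, math.ceil(3 * math.log2(max(2, n))))):
        lam = run_once(time_budget)            # aborts with probability <= 1/2
        if lam is not None:
            best = max(best, lam)              # estimates never exceed the true LCS
    return best
```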

It remains to prove Theorem 9. Our algorithm is a combination of four methods that work well in different regimes of the problem, see Sections 5.1, 5.3, 5.4, and 5.5. We will combine these methods in Section 5.6.

5.1 Algorithm 1: Small L

Algorithm 1 works well if the LCS is short. It yields the following result.

Theorem 10 (Algorithm 1).

We can compute in expected time Õ(n) an estimate λ ≤ L(x, y) that w.h.p. satisfies λ = Ω̃(√L).

Proof.

Our Algorithm 1 works as follows.

  1. Run Lemma 7 on x and y.

  2. Run Lemma 4 on x and y with γ := max(1, M/n).

  3. Output the larger of the two common subsequence lengths computed in Steps 1 and 2.

Running Time:

Step 1 runs in time O(n). Step 2 runs in expected time Õ(n + M/γ). Since γ ≥ max(1, M/n) we have M/γ ≤ n, so the expected running time is Õ(n).

Upper Bound:

Steps 1 and 2 compute common subsequences, so the computed estimate satisfies λ ≤ L.

Approximation Guarantee:

Note that Step 1 guarantees λ ≥ M/(2n) and Step 2 guarantees w.h.p. λ ≥ L/Õ(γ). If M ≤ n then γ = 1 and λ = L, so we solved the problem exactly. Otherwise we have M > n and