The longest common subsequence (LCS) of two strings and is the longest string that appears as a subsequence of both strings. The length of the LCS of and , which we denote by , is one of the most fundamental measures of similarity between two strings and has drawn significant interest in the last five decades, see, e.g., [35, 6, 26, 27, 30, 32, 10, 31, 11, 36, 22, 16, 28, 2, 19, 4, 20, 3, 33, 34, 24]. On strings of length , the LCS problem can be solved exactly in quadratic time using a classical dynamic programming approach . Despite an extensive line of research, the quadratic running time has been improved only by logarithmic factors . This lack of progress is explained by a recent result showing that any truly subquadratic algorithm for LCS would falsify the Strong Exponential Time Hypothesis (SETH); this has been proven independently by Abboud et al.  and by Bringmann and Künnemann . Further work in this direction shows that even a high polylogarithmic speedup for LCS would have surprising consequences [4, 3]. For the closely related edit distance the situation is similar, as the classic quadratic running time can be improved by logarithmic factors, but any truly subquadratic algorithm would falsify SETH .
These strong hardness results naturally bring up the question whether LCS or edit distance can be efficiently approximated (namely, whether an algorithm with truly subquadratic time for any constant can produce a good approximation in the worst case). In the last two decades, significant progress has been made towards designing efficient approximation algorithms for edit distance [14, 13, 15, 9, 7, 21, 23, 29, 17]; the latest achievement is a constant-factor approximation in almost-linear time . (By almost-linear we mean time for a constant that can be chosen arbitrarily small.)
For LCS the picture is much more frustrating. The LCS problem has a simple -approximation algorithm with running time for any constant , and it has a trivial -approximation algorithm with running time for strings over alphabet . Yet, improving upon these naive bounds has evaded the community until very recently, making LCS a notoriously hard problem to approximate. In 2019, Rubinstein et al.  presented a subquadratic-time -approximation, where is the ratio of the string length to the length of the optimal LCS. For binary alphabet, Rubinstein and Song  recently improved the -approximation. In the general case (where and the alphabet size are arbitrary), the naive -approximation in near-linear time (by near-linear we mean time , where hides polylogarithmic factors in ) was recently beaten by Hajiaghayi et al. , who designed a linear-time algorithm that computes an -approximation in expectation. (While the SODA proceedings version of  claimed a high probability bound, the newer corrected arXiv version  only claims that the algorithm outputs an -approximation in expectation. Personal communications with the authors confirm that the result indeed holds only in expectation, see also Remark 14.) Nonetheless, the gap between the upper bound provided by Hajiaghayi et al.  and the recent results on hardness of approximation [1, 5] remains huge.
1.1 Our Contribution
We present a randomized -approximation for LCS running in linear time , where the approximation guarantee holds with high probability. (We say that an event happens with high probability (w.h.p.) if it has probability at least , where the constant can be chosen in advance.) More generally, we obtain a tradeoff between approximation guarantee and running time: For any we achieve approximation ratio in time . Formally we prove the following:
Theorem 1.
There is a randomized algorithm that, given strings of length and a time budget , with high probability computes a multiplicative -approximation of the length of the LCS of and in time .
The improvement over the state of the art can be summarized as follows:
An improved approximation ratio for the linear time regime: from  to ;
The first algorithm which improves upon the naive bound with high probability;
A generalization to running time , breaking the naive approximation ratio in general.
2 Technical Overview
We combine classic exact algorithms for LCS with different subsampling strategies to develop several algorithms that work in different regimes of the problem. A combination of these algorithms then yields the full approximation algorithm.
Our Algorithm 1 covers the regime of short LCS, i.e., when the LCS has length at most for an appropriate constant depending on the running time budget. In this regime, we decrease the length of the string by subsampling. This naturally allows us to run classic exact algorithms for LCS on the subsampled string (which now has significantly smaller size) and the original string , while not deteriorating the LCS between the two strings too much.
For the remaining parts of the algorithm, the strings and are split into substrings and of length where denotes the total running time budget. For any block we write for the length of the LCS of and . We call a set with and a block sequence. Since we can assume the LCS of and to be long, it follows that there exists a good “block-aligned LCS”, more precisely there exists a block sequence with large LCS sum .
Now, a natural approach is to compute estimates for all blocks and to determine the maximum sum over all block sequences . Once we have estimates , the maximum can be computed by dynamic programming in time , which is for our choice of . In the following we describe three different strategies to compute estimates . The major difficulty is that on average per block we can only afford time to compute an estimate .
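The dynamic program over block estimates mentioned above can be sketched as follows. This is only an illustration under stated assumptions: a block sequence is taken to be strictly increasing in both block indices, the estimates are given as a dense matrix `est` (a hypothetical name, with `est[i][j]` the estimate for the pair of the i-th block of one string and the j-th block of the other), and all estimates are non-negative.

```python
def best_block_sequence(est):
    """Maximum total estimate over block sequences.

    Assumption: a block sequence (i_1, j_1), ..., (i_k, j_k) is strictly
    increasing in both coordinates, and est[i][j] >= 0 for all blocks.
    """
    rows = len(est)
    cols = len(est[0]) if est else 0
    # best[i][j] = best value using only the first i row blocks and
    # the first j column blocks.
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            best[i][j] = max(
                best[i - 1][j],                          # skip row block i
                best[i][j - 1],                          # skip column block j
                best[i - 1][j - 1] + est[i - 1][j - 1],  # use block (i, j)
            )
    return best[rows][cols]
```

The table has one entry per block pair, so the running time is proportional to the number of blocks, matching the shape of the dynamic program described in the text.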
The first strategy focuses on matching pairs. A matching pair of strings is a pair of indices such that . We write for the number of matching pairs of the strings and . Our Algorithm 2 works well if some block sequence has a large total number of matching pairs . Here the key observation (Lemma 7) is that for each block there exists a symbol that occurs at least times in both and . If is large, matching this symbol provides a good approximation for . Unfortunately, since we can afford only running time per block, finding a frequent symbol is difficult. We develop as a new tool an algorithm that w.h.p. finds a frequent symbol in each block with an above-average number of matching pairs, see Lemma 8.
For our remaining two strategies we can assume the optimal LCS to be large and to be small (i.e., every block sequence has a small total number of matching pairs). In our Algorithm 3, we analyze the case where is large. Here we pick some diagonal and run our basic approximation algorithm on each block along the diagonal. Since there are diagonals, an above-average diagonal has a total LCS of . If is large then this provides a good estimation of the LCS. The main difficulty is how to find an above-average diagonal. A random diagonal has a good LCS sum in expectation, but not necessarily with good probability. Our solution is a non-uniform sampling, where we first test random blocks until we find a block with large LCS, and then choose the diagonal containing this seed block. This sampling yields an above-average diagonal with good probability.
Recall that there always exists a block sequence with large LCS sum (see Lemma 11). The idea of our Algorithm 4 is to focus on a uniformly random subset of all blocks, where each block is picked with probability . Then on each picked block we can spend more time (specifically time ) to compute an estimate . Moreover, we still find a -fraction of . We analyze this algorithm in terms of and (the choice of depends on these two parameters) and show that it works well in the complementary regimes of Algorithms 1-3.
Comparison with the Previous Approach of Hajiaghayi et al. 
The general approach of splitting and into blocks and performing dynamic programming over estimates was introduced by Hajiaghayi et al. . Moreover, our Algorithm 1 has essentially the same guarantees as [24, Algorithm 1], but ours is a simple combination of generic parts that we reuse in our later algorithms, thus simplifying the overall algorithm.
Our Algorithm 2 follows the same idea as [24, Algorithm 3], in that we want to find a frequent symbol in and and match only this symbol to obtain an estimate . Hajiaghayi et al. find a frequent symbol by picking a random symbol in each block ; in expectation appears at least times in and . In order to obtain with high probability guarantees, we need to develop a new tool for finding frequent symbols not only in expectation but even with high probability, see Lemma 8 and Remark 14.
The remainder of the approach differs significantly; our Algorithms 3 and 4 are very different compared to [24, Algorithms 2 and 4]. In the following we discuss their ideas. In [24, Algorithm 2], they argue about the alphabet size, splitting the alphabet into frequent and infrequent letters. For infrequent letters the total number of matching pairs is small, so augmenting a classic exact algorithm by subsampling works well. Therefore, they can assume that every letter is frequent and thus the alphabet size is small. We avoid this line of reasoning. Finally, [24, Algorithm 4] is their most involved algorithm. Assuming that their other algorithms have failed to produce a sufficiently good approximation, they show that each part and can be turned into a semi-permutation by a little subsampling. Then by leveraging Dilworth’s theorem and Turán’s theorem they show that most blocks have an LCS length of at least ; this can be seen as a triangle inequality for LCS and is their most novel contribution. This results in a highly non-trivial algorithm making clever use of combinatorial machinery.
We show that these ideas can be completely avoided, by instead relying on classic algorithms based on matching pairs augmented by subsampling. Specifically, we replace their combinatorial machinery by our Algorithms 3 and 4 described above (recall that Algorithm 3 considers a non-uniformly sampled random diagonal while Algorithm 4 subsamples the set of blocks to be able to spend more time per block). We stress that our solution completely avoids the concept of semi-permutation or any heavy combinatorial machinery as used in [24, Algorithm 4], while providing a significantly improved approximation guarantee.
Organization of the Paper.
Section 3 introduces notation and a classical algorithm by Hunt and Szymanski. In Section 4 we present our new tools, in particular for finding frequent symbols. Section 5 contains our main algorithm, split into four parts that are presented in Sections 5.1, 5.3, 5.4, and 5.5, and combined in Section 5.6. In the appendix, for completeness we sketch the algorithm by Hunt and Szymanski (Appendix A) and we present pseudocode for all our algorithms (Appendix B).
For we write . By the notation and we hide factors of the form . We use “with high probability” (w.h.p.) to denote probabilities of the form , where the constant can be chosen in advance.
A string over alphabet is a finite sequence of letters in . We denote its length by and its -th letter by . We also denote by the substring consisting of letters . For any indices the string forms a subsequence of . For strings we denote by the length of the longest common subsequence of and . In this paper we study the problem of approximating for given strings of length . We focus on the length , however, our algorithms can be easily adapted to also reconstruct a subsequence attaining the output length. If are clear from the context, we may replace by . Throughout the paper we assume that the alphabet is (this is without loss of generality after a -time preprocessing).
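For reference, the classical quadratic-time dynamic program for the LCS length can be sketched as follows. The function name `lcs_length` is illustrative and not taken from the paper; the sketch keeps only two rows of the table, so it uses linear space.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y.

    Classical dynamic program, time O(|x| * |y|), space O(|y|).
    """
    prev = [0] * (len(y) + 1)  # row for the prefix of x processed so far
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the best solution for the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last letter of one of the two prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4, attained for instance by the common subsequence `BCBA`.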
For a symbol , we denote the number of times that appears in by , and call this the frequency of in . For strings and , a matching pair is a pair with . We denote the number of matching pairs by . If are clear from the context, we may replace by . Observe that . Using this equation we can compute in time .
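Counting matching pairs through symbol frequencies can be sketched as follows. The sketch assumes the identity that the number of matching pairs equals the sum, over all symbols, of the product of the symbol's frequencies in the two strings; the function name `matching_pairs` is illustrative.

```python
from collections import Counter


def matching_pairs(x, y):
    """Number of index pairs (i, j) with x[i] == y[j].

    Uses one pass over each string to count frequencies, then sums the
    products of the frequencies symbol by symbol.
    """
    freq_x = Counter(x)
    freq_y = Counter(y)
    # A Counter returns 0 for symbols it has not seen.
    return sum(count * freq_y[symbol] for symbol, count in freq_x.items())
```

With hashing this takes time linear in the total length of the two strings, compared to the quadratic time of enumerating all index pairs.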
Hunt and Szymanski  solved the LCS problem in time . More precisely, their algorithm can be viewed as having a preprocessing phase that only reads and runs in time , and a query phase that reads and and takes time .
Theorem 2 (Hunt and Szymanski ).
We can preprocess a string in time . Given a string and a preprocessed string , we can compute their LCS in time .
For convenience, we provide a proof sketch of their theorem in Appendix A.
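A textbook variant of the Hunt and Szymanski approach can be sketched as follows, split into a preprocessing phase that reads only one string and a query phase. This is a minimal sketch and not necessarily the exact formulation of Appendix A: the names `preprocess` and `lcs_query` are hypothetical, and which of the two strings is preprocessed is an assumption of the sketch.

```python
from bisect import bisect_left
from collections import defaultdict


def preprocess(y):
    """Map each symbol to the list of its positions in y, in decreasing order."""
    occurrences = defaultdict(list)
    for j in range(len(y) - 1, -1, -1):
        occurrences[y[j]].append(j)
    return occurrences


def lcs_query(x, occurrences):
    """LCS length of x and the preprocessed string.

    thresh[k] is the smallest position in y at which a common subsequence
    of length k + 1 (with the prefix of x read so far) can end.
    """
    thresh = []
    for a in x:
        # Positions are visited in decreasing order so that one letter of x
        # is never matched twice within the same chain.
        for j in occurrences.get(a, ()):
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)
            else:
                thresh[k] = j
    return len(thresh)
```

In this sketch the query performs one binary search per matching pair, so its running time grows with the number of matching pairs rather than with the product of the string lengths.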
4 New Basic Tools
4.1 Basic Approximation Algorithm
Throughout this section we abbreviate and . We start with the basic approximation algorithm that is central to our approach; most of our later algorithms use this as a subroutine. This algorithm subsamples the string and then runs Hunt and Szymanski’s algorithm (Theorem 2).
Lemma 3 (Basic Approximation Algorithm).
Let . We can preprocess in time . Given , the preprocessed string , and , in expected time we can compute a value that w.h.p. satisfies .
In the preprocessing phase, we run the preprocessing of Theorem 2 on .
Fix a constant . If , then in the query phase we simply run Theorem 2, solving LCS exactly in time .
Otherwise, denote by a random subsequence of , where each letter is removed independently with probability (i.e., kept with probability ) for . Note that by our assumption on . We can sample in expected time , since the distance from one unremoved letter to the next is geometrically distributed, and geometric random variates can be sampled in expected time, see, e.g., . Note that this subsampling yields and .
In the query phase, we sample and then run the query phase of Theorem 2 on and . This runs in time , which is in expectation.
Finally, consider a fixed LCS of and , namely for some and . Each letter survives the subsampling to with probability . Therefore, we can bound from below by a binomial random variable (the correct terminology is that statistically dominates ). Since is a sum of independent -variables, the multiplicative Chernoff bound applies and yields . If then and , and thus . Otherwise, if , then we can only bound . In both cases, we have with high probability. ∎
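The subsampling step with geometric jumps can be sketched as follows. This is an illustration only: the keep probability `p` is passed in directly instead of being derived from the approximation ratio, a plain quadratic dynamic program (`lcs_exact`) stands in for the algorithm of Theorem 2, and all function names are hypothetical.

```python
import math
import random


def subsample(x, p, rng):
    """Keep each letter of x independently with probability p.

    Jumps from one kept letter to the next using geometric gaps, so the
    expected number of loop iterations is about p * len(x) + 1.
    """
    if p >= 1.0:
        return x
    if p <= 0.0:
        return ""
    log_q = math.log1p(-p)  # log of the removal probability 1 - p
    kept = []
    i = -1
    while True:
        u = 1.0 - rng.random()             # uniform in (0, 1]
        i += 1 + int(math.log(u) / log_q)  # geometric number of removed letters
        if i >= len(x):
            break
        kept.append(x[i])
    return "".join(kept)


def lcs_exact(x, y):
    """Quadratic-time LCS length; stand-in for an exact subroutine."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def approx_lcs(x, y, p, seed=0):
    """LCS length of a random subsample of x against y.

    The result is the length of a common subsequence of x and y, so it
    never exceeds the true LCS length.
    """
    rng = random.Random(seed)
    return lcs_exact(subsample(x, p, rng), y)
```

The estimate is not rescaled by `1 / p`: as in the text, the output is the length of an actual common subsequence, so the error is one-sided.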
The above lemma behaves poorly if , due to the “” in the approximation guarantee. We next show that this can be avoided, at the cost of increasing the running time by an additive .
Lemma 4 (Generalised Basic Approximation Algorithm).
Given and , in expected time we can compute a value that w.h.p. satisfies .
We run the basic approximation algorithm from Lemma 3, which computes a value . Additionally, we compute the number of matching pairs in time . If , then there exists a matching pair, which yields a common subsequence of length 1. Therefore, if we set .
In the proof of Lemma 3 we showed that if then w.h.p. we have . We now argue differently in the case . If , then and we are done. If , then there must exist at least one matching pair, so , so the second part of our algorithm yields . Hence, in all cases w.h.p. we have . ∎
We now turn towards the problem of deciding for given and whether . To this end, we repeatedly call the basic approximation algorithm with geometrically decreasing approximation ratio . Note that with decreasing approximation ratio we get a better approximation guarantee at the cost of higher running time. The idea is that if the LCS is much shorter than the threshold , then already approximation ratio allows us to detect that . This yields a running time bound depending on the gap .
Lemma 5 (Basic Decision Algorithm).
Let . We can preprocess in time . Given , the preprocessed , and a number , in expected time we can w.h.p. correctly decide whether . Our algorithm has no false positives (and w.h.p. no false negatives).
Let us first argue correctness. Since Lemma 3 computes a common subsequence of , we have . Thus, if , we correctly infer . Moreover, w.h.p. satisfies . Therefore, if , we can infer , and this decision is correct with high probability. Finally, in the last iteration (where ), we have , and thus one of or must hold, so the algorithm indeed returns a decision.
The expected time of the query phase of Lemma 3 is . Since decreases geometrically, the total expected time of our algorithm is dominated by the last call.
If , the last call is at the latest for . This yields running time .
If , note that for any we have , and thus we return “”. Because we decrease by a factor 2 in each iteration, the last call satisfies . Hence, the expected running time is . If then this time bound simplifies to . If , then also , and the time bound becomes . In both cases we can bound the expected running time by the claimed , since . ∎
4.2 Approximating the Number of Matching Pairs
Recall that for given strings of length the number of matching pairs can be computed in time , which is linear in the input size. However, later in the paper we will split into substrings and into substrings , each of length , and we will need estimates of the numbers of matching pairs . In this setting, the input size is still (the total length of all strings and ) and the output size is (all numbers ), but we are not aware of any algorithm computing the numbers in near-linear time in the input plus output size . (In fact, one can show conditional lower bounds from Boolean matrix multiplication that rule out near-linear time for computing all ’s unless the exponent of matrix multiplication is .) Therefore, we devise an approximation algorithm for estimating the number of matching pairs.
Lemma 6.
For write and . Given the strings and a number , we can compute values that w.h.p. satisfy , in total expected time .
This yields a near-linear-time constant-factor approximation of all above-average : By setting , in expected time we obtain a constant-factor approximation of all values with .
The algorithm works as follows.
1. Graph Construction: Build a three-layered graph on vertex set , where has a node for every string , has a node for every string , and has a node for any and . Put an edge from to iff . Similarly, put an edge from to iff . Note that all frequencies and thus all edges of this graph can be computed in total time . For and , we denote by their common neighbors. Note that any represents all matching pairs of symbol in and , and the number of these matching pairs is .
2. Subsampling: We sample a subset by removing each node independently with probability , where .
3. Determine Common Neighbors: For each enumerate all pairs of neighbors and . For each such 2-path, add to an initially empty set . This step computes the sets in time proportional to their total size.
4. Output: Return the values .
To analyze this algorithm, we consider the numbers . Observe that we have , since each corresponds to at least and at most matching pairs of and . It therefore suffices to show that is close to . Using Bernoulli random variables to express whether survives the subsampling, we write
This yields an expected value of , so by Markov’s inequality we obtain with probability at least . Since is a linear combination of independent Bernoulli random variables, we can also easily express its variance as
We now use the definition of to bound
This yields . We now use Chebychev’s inequality on and to obtain
In case , we have and hence . Otherwise, in case , we can only use the trivial .
Hence, each inequality and individually holds with probability at least . Finally, we boost the success probability by repeating the above algorithm times and returning for each the median of all computed values .
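The median-based boosting step can be illustrated in isolation. In this sketch, `estimator` is a hypothetical callable that performs one independent run and returns a number. If each run lies in the desired range with probability bounded above one half, then the median of logarithmically many runs lies in that range with high probability, by a Chernoff bound on the number of bad runs.

```python
import random
from statistics import median


def boost_by_median(estimator, repetitions, seed=0):
    """Run `estimator` independently `repetitions` times and return the median.

    `estimator` takes a random.Random instance and returns one estimate.
    The median is insensitive to a minority of runs that are far too
    large or far too small.
    """
    rng = random.Random(seed)
    return median(estimator(rng) for _ in range(repetitions))
```

The median is used here because the estimate can err in both directions; for a one-sided estimate, taking the maximum over the repetitions would serve the same purpose.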
Steps 1 and 2 can be easily seen to run in time . Steps 3 and 4 run in time proportional to the total size of all sets , which we claim to be at most in expectation. Over repetitions, we obtain a total expected running time of . (We remark that here we consider a succinct output format, where only the non-zero numbers are listed; otherwise additional time of is required to output the numbers .)
It remains to prove the claimed bound of . Since , from the definition of we infer . Therefore,
4.3 Single Symbol Approximation Algorithm
For strings that have a large number of matching pairs , some symbol must appear often in and in . This yields a common subsequence using (several repetitions of) a single alphabet symbol.
Lemma 7.
For any there exists a symbol that appears at least times in and in . Therefore, in time we can compute a common subsequence of of length at least . In particular, we can compute a value that satisfies .
Let be maximal such that some symbol appears at least times in and at least times in . Let for . Since no symbol appears more than times in and in , we have . We can thus bound
since the frequencies sum up to at most , and similarly for . It follows that . Computing , and a symbol attaining , in time is straightforward. ∎
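The single-symbol estimate can be sketched as follows: for every symbol take the smaller of its two frequencies, and return the best symbol. The names are illustrative. The exact bound of the lemma is not reproduced here; the comment in the code only records the elementary inequality that the returned value times the total length of the two strings is at least the number of matching pairs, which follows from bounding each product of frequencies by the minimum times the sum.

```python
from collections import Counter


def single_symbol_lcs(x, y):
    """Best common subsequence of x and y that repeats a single symbol.

    Returns (symbol, k), where the symbol occurs at least k times in both
    strings and k is as large as possible; (None, 0) if there is no
    common symbol.

    Elementary bound: k * (len(x) + len(y)) >= number of matching pairs,
    because each product of frequencies is at most min * (sum of the two).
    """
    freq_x = Counter(x)
    freq_y = Counter(y)
    best_symbol, best = None, 0
    for symbol, count in freq_x.items():
        k = min(count, freq_y[symbol])
        if k > best:
            best_symbol, best = symbol, k
    return best_symbol, best
```

For example, `single_symbol_lcs("aabbb", "bbaab")` returns `("b", 3)`: the symbol `b` occurs three times in each string, so `bbb` is a common subsequence.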
We devise a variant of Lemma 7 in the following setting. For strings , we write , and . We want to find for each block a frequent symbol in and , or equivalently we want to find a common subsequence of and using a single alphabet symbol. Similarly to Lemma 6, we relax Lemma 7 to obtain a fast running time.
Lemma 8.
Given and any , we can compute for each a number such that w.h.p. . The algorithm runs in total expected time .
We run the same algorithm as in Lemma 6, except that in Step 4 for each with non-empty set we let be the maximum of over all . For each empty set , we implicitly set , i.e., we output a sparse representation of all non-zero values .
The running time analysis is the same as in Lemma 6.
For the upper bound on , since appears at least times in and at least times in , there is a common subsequence of and of length at least . Thus, we have .
For the lower bound on , fix and order the tuples in ascending order of , obtaining an ordering . For we let and . Recall that , and observe that we can pick with
Then we have
Note that for any the symbol appears at least times in or in , and thus the sum on the right hand side is at most . Rearranging, this yields , where we used as in the proof of Lemma 6. In particular, due to our ordering we have for any :
Consider the number of nodes in surviving the subsampling, i.e., . If , then some node in survived, and thus by (2) the computed value is at least . It thus remains to analyze .
In case some has , we have with probability 1. Otherwise all have and thus . In this case, we write as a sum of independent Bernoulli random variates in the form . In particular,
Since is a sum of independent -variables, multiplicative Chernoff applies and yields . We thus obtain
In case , we obtain , and thus we have with probability at least . Otherwise, in case , we can only use the trivial bound . In any case, we have with probability at least . Similar to the proof of Lemma 6, we run independent repetitions of this algorithm and return for each the maximum of all computed values , to boost the success probability and finish the proof. ∎
5 Main Algorithm
Theorem 9 (Main Result, Relaxation).
Given strings of length and a time budget , in expected time we can compute a number such that and w.h.p. .
Recall Theorem 1:
In order to remove the expected running time, we abort the algorithm from Theorem 9 after time steps. By Markov’s inequality, we can choose the hidden constants and log factors such that the probability of aborting is at most . We boost the success probability of this adapted algorithm by running independent repetitions and returning the maximum over all computed values . This yields an -approximation with high probability in time .
To remove the log factors in the running time, as the first step in our algorithm we subsample the given strings , keeping each symbol independently with probability , resulting in subsampled strings . Since any common subsequence of is also a common subsequence of , the estimate that we compute for satisfies . Moreover, if then by a Chernoff bound with high probability we have , so that an -approximation on also yields an -approximation on . Otherwise, if , then in order to compute a -approximation it suffices to compute an LCS of length 1, which is just a matching pair and can be found in time (assuming that the alphabet is ).
This yields an algorithm that computes a value such that w.h.p. . The algorithm runs in time , and this running time bound holds deterministically, i.e., with probability 1. Hence, we proved Theorem 1. ∎
It remains to prove Theorem 9. Our algorithm is a combination of four methods that work well in different regimes of the problem, see Sections 5.1, 5.3, 5.4, and 5.5. We will combine these methods in Section 5.6.
5.1 Algorithm 1: Small
Algorithm 1 works well if the LCS is short. It yields the following result.
Theorem 10 (Algorithm 1).
We can compute in expected time an estimate that w.h.p. satisfies .
Step 1 runs in time . Step 2 runs in expected time . Since we have , so the expected running time is .
Steps 1 and 2 compute common subsequences, so the computed estimate satisfies .
Note that Step 1 guarantees and Step 2 guarantees w.h.p. . If then and , so we solved the problem exactly. Otherwise we have and