Improved Sublinear-Time Edit Distance for Preprocessed Strings

04/29/2022
by   Karl Bringmann, et al.

We study the problem of approximating the edit distance of two strings in sublinear time, in a setting where one or both string(s) are preprocessed, as initiated by Goldenberg, Rubinstein, Saha (STOC '20). Specifically, in the (k, K)-gap edit distance problem, the goal is to distinguish whether the edit distance of two strings is at most k or at least K. We obtain the following results: * After preprocessing one string in time n^1+o(1), we can solve (k, k · n^o(1))-gap edit distance in time (n/k + k) · n^o(1). * After preprocessing both strings separately in time n^1+o(1), we can solve (k, k · n^o(1))-gap edit distance in time k · n^o(1). Both results improve upon some previously best known result, with respect to either the gap or the query time or the preprocessing time. Our algorithms build on the framework by Andoni, Krauthgamer and Onak (FOCS '10) and the recent sublinear-time algorithm by Bringmann, Cassis, Fischer and Nakos (STOC '22). We replace many complicated parts in their algorithm by faster and simpler solutions which exploit the preprocessing.



1 Introduction

The edit distance (also known as Levenshtein distance) is a fundamental measure of similarity between strings. It has numerous applications in several fields such as information retrieval, computational biology and text processing. Given strings X and Y, their edit distance, denoted by ED(X, Y), is defined as the minimum number of character insertions, deletions and substitutions needed to transform X into Y.

A textbook dynamic programming algorithm computes the edit distance of two strings of length n in time O(n^2). Popular conjectures such as the Strong Exponential Time Hypothesis imply that this algorithm is essentially optimal, as there is no strongly subquadratic-time algorithm [BackursI15, AbboudBW15, BringmannK15, AbboudHWW16]. As quadratic-time algorithms are impractical for some applications involving enormous strings (such as DNA sequences), a long line of research developed progressively better and faster approximation algorithms [BarYossefJKK04, BatuES06, OstrovskyR07, AndoniO12, AndoniKO10, ChakrabortyDGKS20, KouckyS20, BrakensiekR20]. The current best guarantee in near-linear time is an algorithm by Andoni and Nosatzki [AndoniN20] computing a constant-factor approximation in time n^{1+ε} for any fixed ε > 0.
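For concreteness, the textbook dynamic program looks as follows. This is a standard reference implementation in Python (with the common row-rolling space optimization), not part of the paper's contribution:

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook O(|x|*|y|) dynamic program for edit distance."""
    n, m = len(x), len(y)
    # dp[j] holds ED(x[:i], y[:j]) for the current row i.
    dp = list(range(m + 1))
    for i in range(1, n + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            # prev_diag is the (i-1, j-1) entry; dp[j] still holds (i-1, j).
            prev_diag, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                          dp[j - 1] + 1,     # insertion
                                          prev_diag + cost)  # substitution/match
    return dp[m]

assert edit_distance("kitten", "sitting") == 3
```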

Another more recent line of research studies edit distance in the sublinear-time setting. Here the goal is to approximate the edit distance without reading the entire input strings. More formally, in the (k, K)-gap edit distance problem the goal is to distinguish whether the edit distance between X and Y is at most k or greater than K. The performance of gap algorithms is typically measured in terms of the string length n and the gap parameters k and K. This problem has been studied in several works [BatuEKMR03, AndoniO12, GoldenbergKS19, BrakensiekCR20, KociumakaS20], most of which focus on the (k, k^2)-gap problem. Currently, there are two incomparable best known results: A recent result by Goldenberg, Kociumaka, Krauthgamer and Saha [GoldenbergKKS21] established a non-adaptive algorithm for the (k, k^2)-gap problem (we write Õ(·) to hide polylogarithmic factors in n). Another recent result by Bringmann, Cassis, Fischer and Nakos [BringmannCFN22] reduces the gap to k · n^{o(1)} and solves the (k, k · n^{o(1)})-gap problem in time (n/k + k^2) · n^{o(1)} (we write n^{o(1)} to hide subpolynomial factors in n). See Table 1 for a more detailed comparison.

| Source | Preprocessing | Gap | Preprocessing time | Query time |
| --- | --- | --- | --- | --- |
| Goldenberg, Krauthgamer, Saha [GoldenbergKS19] | no preprocessing | | | |
| Kociumaka, Saha [KociumakaS20] | no preprocessing | | | |
| Brakensiek, Charikar, Rubinstein [BrakensiekCR20] | no preprocessing | | | |
| Bringmann, Cassis, Fischer, Nakos [BringmannCFN22] | no preprocessing | | | |
| Goldenberg, Kociumaka, Krauthgamer, Saha [GoldenbergKKS21] | no preprocessing | | | |
| Bringmann, Cassis, Fischer, Nakos [BringmannCFN22] | no preprocessing | | | |
| Goldenberg, Rubinstein, Saha [GoldenbergRS20] | one-sided | | | |
| Brakensiek, Charikar, Rubinstein [BrakensiekCR20] | one-sided | | | |
| This work | one-sided | (k, k · n^{o(1)}) | n^{1+o(1)} | (n/k + k) · n^{o(1)} |
| Chakraborty, Goldenberg, Koucký [ChakrabortyGK16] | two-sided | | | |
| Brakensiek, Charikar, Rubinstein [BrakensiekCR20] | two-sided | | | |
| Ostrovsky, Rabani [OstrovskyR07] | two-sided | | | |
| This work | two-sided | (k, k · n^{o(1)}) | n^{1+o(1)} | k · n^{o(1)} |
| Goldenberg, Rubinstein, Saha [GoldenbergRS20] | two-sided | | | |
| Goldenberg, Rubinstein, Saha [GoldenbergRS20] | two-sided | | | |

Table 1: A comparison of sublinear-time algorithms for the (k, K)-gap edit distance problem for different gap parameters. All algorithms in this table are randomized and succeed with high probability. Note that some of these results are subsumed by others.

Our starting point is the work by Goldenberg, Rubinstein and Saha [GoldenbergRS20], which studies sublinear algorithms for edit distance in the preprocessing model. Here, we are allowed to preprocess one or both input strings X and Y separately, and then use the precomputed information to solve the (k, K)-gap edit distance problem. This model is motivated by applications where many long strings are compared against each other. For example, the string similarity join problem is to find all pairs of strings in a database (containing, e.g., DNA sequences) which are close in edit distance; see [WandeltDGMMPSTWWL14] for a survey of practically relevant algorithms. Note that in these applications, if we have an algorithm with almost-linear preprocessing time (which is the case for all algorithms we present in this paper), then the overhead incurred by preprocessing is comparable to the time necessary to read and store the strings in the first place. In [GoldenbergRS20], the authors pose and investigate the following open question:

“What is the complexity of approximate edit distance with preprocessing?” [GoldenbergRS20]

This question has spawned significant interest in the community [ChakrabortyGK16, GoldenbergRS20, BrakensiekCR20], and with this paper we too make progress on it. We give an overview of known results in Table 1. Note that most results are hard to compare to each other (one-sided versus two-sided preprocessing, exact versus constant-factor versus coarser approximation guarantees).

In the two-sided model, all known algorithms (with almost-linear preprocessing time and, say, subpolynomial gap K/k ≤ n^{o(1)}) share the common barrier that the query time is Ω̃(k^2). (Here we insist on almost-linear preprocessing time, since the celebrated embedding of edit distance into the ℓ1-metric with distortion 2^{O(√(log n · log log n))} due to Ostrovsky and Rabani [OstrovskyR07] achieves a very fast query time but requires a large polynomial preprocessing time.) Due to this barrier, Goldenberg et al. [GoldenbergRS20] specifically ask whether there exists an approximation algorithm with sub-k^2 query time. One of our contributions is that we answer this question in the affirmative.

Our Results

We develop sublinear-time algorithms for the (k, k · n^{o(1)})-gap edit distance problem, in the one-sided and the two-sided preprocessing model, respectively.

[One-Sided Preprocessing] Let X, Y be length-n strings. After preprocessing Y in time n^{1+o(1)}, we can solve the (k, k · n^{o(1)})-gap edit distance problem for X and Y in time (n/k + k) · n^{o(1)} with high probability.

In comparison to the (k, k^2)-gap algorithms from [GoldenbergRS20, BrakensiekCR20] with best query time Õ(n/k) (for the (k, k^2)-gap problem, the running time bounds Õ(n/k + k) and Õ(n/k) can be considered equal, as for k ≥ √n the algorithm may return a trivial answer), we contribute the following improvement: Ignoring lower-order factors, we reduce the gap from k^2 to k · n^{o(1)} while achieving the same query time and the same preprocessing time n^{1+o(1)}. In comparison to the (k, k · n^{o(1)})-gap algorithm in time (n/k + k^2) · n^{o(1)} from [BringmannCFN22], we achieve the same gap but an improved query time for large k, at the cost of preprocessing one of the strings.

In the two-sided model, we obtain an analogous result, where the query time no longer depends on n (up to lower-order factors).

[Two-Sided Preprocessing] Let X, Y be length-n strings. After preprocessing both X and Y (separately) in time n^{1+o(1)}, we can solve the (k, k · n^{o(1)})-gap edit distance problem for X and Y in time k · n^{o(1)} with high probability.

We remark that all hidden factors in both theorems are n^{o(1)}. For a detailed comparison of these algorithms to the previously known results, see Table 1. We point out that our two-sided result settles the open question from [GoldenbergRS20] of whether there exists an edit distance approximation algorithm with small gap and sub-k^2 query time.

Our Techniques

To achieve our results we build on the recent sublinear-time algorithm by Bringmann, Cassis, Fischer and Nakos [BringmannCFN22], which itself builds on an almost-linear-time algorithm by Andoni, Krauthgamer and Onak [AndoniKO10]. The basic idea of the original algorithm is to split the strings into several smaller parts and recur on these parts with non-uniform precisions. The idea of [BringmannCFN22] is to prune branches in the recursion tree, by detecting and analyzing periodic substructures. Towards that, they designed appropriate property testers to efficiently detect these structures. In our setting, we observe that having preprocessed the string(s), we can prune the computation tree much more easily. Thus, our algorithm proceeds in the same recursive fashion as [BringmannCFN22] and uses a similar set of techniques, but due to the preprocessing it turns out to be simpler and faster.

Further Related Work

In the comparison of sublinear-time algorithms above, we left out streaming and sketching algorithms [BelazzouguiZ16, ChakrabortyGK16, BarYossefJKK04] and document exchange protocols [Jowhari12, BelazzouguiZ16, Haeupler19].

Future Directions

There are several interesting directions for future work. We specifically mention two open problems.

  1. Constant gap? As Table 1 shows, no constant-gap sublinear-time algorithm is known so far. Maybe the one-sided preprocessing setting is more approachable for this challenge. We believe, however, that achieving a constant gap with our approach is hopeless, since we borrow the recursive decomposition introduced in [AndoniKO10], which inherently incurs a polylogarithmic overhead in the approximation factor.

  2. Improving the query time? The well-known Ω(n/K) lower bound for the (k, K)-gap Hamming distance problem (and therefore for edit distance) continues to hold in the one-sided preprocessing setting. In particular, the most optimistic hope is an algorithm with query time (n/k) · n^{o(1)} for the (k, k · n^{o(1)})-gap edit distance problem. Can this be achieved, or is the extra k term in the query time of our one-sided result necessary? For two-sided preprocessing, to the best of our knowledge no lower bound is known.

2 Preliminaries

We set [n] := {0, 1, …, n−1} (in particular, 0 ∈ [n]) and [i..j) := {i, i+1, …, j−1}. We say that an event happens with high probability if it happens with probability at least 1 − n^{−c}, where the degree c of the polynomial can be an arbitrary constant. We write Õ(·) to hide polylogarithmic factors in n, and n^{o(1)} to hide subpolynomial factors.

Let X, Y be strings over an alphabet Σ of polynomial size. We denote by |X| the length of X. We denote by X ∘ Y the concatenation of X and Y. We denote by X[i] the i-th character of X, starting with index zero. We denote by X[i..j) the substring of X with indices in [i..j), that is, including i and excluding j. For out-of-bounds indices we set X[i] := ⊥. If X and Y have the same length, we define their Hamming distance HD(X, Y) as the number of non-matching characters X[i] ≠ Y[i]. For two strings X, Y with possibly different lengths, we define their edit distance ED(X, Y) as the smallest number of character insertions, deletions and substitutions necessary to transform X into Y. An optimal alignment between X and Y is a monotonically non-decreasing function A : [|X| + 1] → [|Y| + 1] such that A(0) = 0, A(|X|) = |Y|, and

    ED(X, Y) = Σ_{i=0}^{|X|−1} ED(X[i..i+1), Y[A(i)..A(i+1))).
It is easy to see that there is an optimal alignment between any two strings X, Y: Trace an optimal path through the standard dynamic program for edit distance, and assign to A(i) the smallest j for which the path crosses the cell (i, j).
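The following minimal Python sketch extracts such an alignment from the standard dynamic program; the function name and the backtracking details are ours, chosen to match the definition above:

```python
def optimal_alignment(x: str, y: str) -> list:
    """Return A[0..len(x)], where A[i] is the smallest j such that some
    optimal edit-distance path passes through DP cell (i, j)."""
    n, m = len(x), len(y)
    # Full DP table: dp[i][j] = ED(x[:i], y[:j]).
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0: dp[i][j] = j
            elif j == 0: dp[i][j] = i
            else:
                cost = 0 if x[i - 1] == y[j - 1] else 1
                dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    # Trace one optimal path back from (n, m); the last value written to
    # A[i] is the smallest j the path attains in row i.
    A = [0] * (n + 1)
    i, j = n, m
    A[i] = j
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (0 if x[i-1] == y[j-1] else 1):
            i, j = i - 1, j - 1     # match or substitution
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            i -= 1                  # deletion from x
        else:
            j -= 1                  # insertion into x
        A[i] = j
    return A

print(optimal_alignment("ab", "b"))  # e.g. [0, 0, 1]
```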

Let T be a rooted tree, and let v be a node in T. We denote by root(T) the root node of T. We denote by parent(v) the parent node of v. We denote by depth(v) the length of the root-to-v path, and by height(v) the length of the longest v-to-leaf path.

3 Overview

3.1 A Linear-Time Algorithm à la Andoni-Krauthgamer-Onak

We start by outlining an almost-linear-time algorithm that approximates the edit distance of two strings X, Y, following the framework of the Andoni-Krauthgamer-Onak algorithm [AndoniKO10], with some changes as in [BringmannCFN22] and some additional modifications (see the novel trick outlined at the end of this subsection).

First Ingredient: A Divide-and-Conquer Scheme

The basic idea of the algorithm is to apply a divide-and-conquer scheme which reduces approximating the global edit distance to approximating the edit distances of several smaller strings. The straightforward idea of partitioning both strings into parts X_1, …, X_B and Y_1, …, Y_B and computing the edit distances ED(X_i, Y_i) does not immediately work; instead, we need to consider several shifts of the string Y. We remark that this concept of recurring on smaller strings for several shifts is quite standard in previous work. The following lemma uses the same ideas as the distance measure defined in [AndoniKO10]. We give a proof in Appendix A.

[Divide and Conquer] Let X, Y be length-n strings, and let B divide n. We write X = X_1 ∘ ⋯ ∘ X_B and Y = Y_1 ∘ ⋯ ∘ Y_B, where all parts have length n/B, and we write Y_i^{(s)} for the length-(n/B) substring of Y starting at position (i−1) · n/B + s (out-of-bounds characters read as ⊥).

  • For all shifts s_1, …, s_B we have that ED(X, Y) ≤ Σ_{i=1}^{B} (ED(X_i, Y_i^{(s_i)}) + 2|s_i|).

  • There are shifts s_1, …, s_B with Σ_{i=1}^{B} ED(X_i, Y_i^{(s_i)}) ≤ 2 · ED(X, Y) and |s_i| ≤ ED(X, Y) for all i.

To explain how to apply the Divide and Conquer lemma, we first specify on which substrings our algorithm is supposed to recur. To this end, let T be a balanced B-ary tree with n leaves; T will act as the "recursion tree" of the algorithm. For a string X of length n, we define a substring X_v for every node v in T as follows: If the subtree below v spans from the i-th to the j-th leaf (ordered from left to right), then we set X_v := X[i..j+1). In particular, X_v is a single character for each leaf v, and X_{root(T)} = X. We further define n_v := |X_v|, and, in analogy to the lemma, we write Y_v^{(s)} for the length-n_v substring of Y whose starting position is shifted by s relative to X_v. For concreteness, we fix a common superconstant branching factor B throughout the paper.
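The following small sketch illustrates this decomposition; the function name `node_spans` and the assumption that n is a power of B are ours, for simplicity:

```python
import math

def node_spans(n: int, B: int):
    """For a balanced B-ary tree whose n leaves correspond to the characters
    of a length-n string (n a power of B), yield for every node v the
    half-open index range [i, j) of the substring X_v. Nodes are identified
    by (depth, index within level)."""
    depth = round(math.log(n, B))
    for d in range(depth + 1):
        span = n // (B ** d)          # n_v: length of X_v at depth d
        for t in range(B ** d):
            yield (d, t), (t * span, (t + 1) * span)

# Example: n = 9, B = 3 gives the root [0, 9), three children of
# length 3, and nine single-character leaves.
for node, (i, j) in node_spans(9, 3):
    print(node, (i, j))
```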

A Simple Algorithm

Based on the Divide and Conquer lemma, we next present a simple (yet slow) algorithm. Our goal is to compute, for each node v in the tree T, an approximation Δ_v(s) of ED(X_v, Y_v^{(s)}) for all shifts s. The result at the root node is returned as the desired approximation of ED(X, Y). The algorithm works as follows: For each leaf v we can cheaply compute Δ_v(s) exactly by comparing the single characters X_v and Y_v^{(s)}. For each internal node v with children v_1, …, v_B we compute

    Δ_v(s) := Σ_{i=1}^{B} min_{s'} ( Δ_{v_i}(s') + 2|s' − s| ).    (1)

A careful application of the Divide and Conquer lemma shows that if the recursive approximations have multiplicative error at most c, then approximating as in (1) incurs multiplicative error O(c). Since we repeat this argument recursively up to depth log_B(n), the multiplicative error accumulates to 2^{O(log_B n)}, which is n^{o(1)} for any superconstant choice of B.
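To make the combine step concrete, here is a naive (and slow) Python reference implementation of the recursion with rule (1), under our reconstruction of the notation. It is meant only to illustrate the information flow, not the actual algorithm; out-of-range characters count as mismatches:

```python
def shifted_estimates(X: str, Y: str, lo: int, hi: int, S: range, B: int = 2) -> dict:
    """Estimate ED(X[lo:hi], Y[lo+s:hi+s]) for every shift s in S by the
    divide-and-conquer rule (1). Slow reference implementation."""
    if hi - lo == 1:  # leaf: compare single characters
        return {s: int(not (0 <= lo + s < len(Y) and X[lo] == Y[lo + s])) for s in S}
    est = {s: 0 for s in S}
    part = (hi - lo) // B
    for i in range(B):  # children on B equal parts (assume B divides hi - lo)
        child = shifted_estimates(X, Y, lo + i * part, lo + (i + 1) * part, S, B)
        for s in S:     # rule (1): best child shift plus cost 2|s' - s|
            est[s] += min(child[sp] + 2 * abs(sp - s) for sp in S)
    return est

# ED("abcdefgh", "abcxefgh") = 1; the estimate at shift 0 is close to it.
print(shifted_estimates("abcdefgh", "abcxefgh", 0, 8, range(-2, 3)))
```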

This simple algorithm achieves the desired approximation quality; however, it is not fast enough: For every node v we have to compute Δ_v(s) for too many shifts s (naively speaking, for up to Θ(n) shifts). As a first step towards dealing with this issue, we show that at every node we can in fact tolerate a certain additive error (in addition to the multiplicative error discussed before) using a technique called Precision Sampling. Then we exploit the freedom of additive errors to run this algorithm for a restricted set of shifts S_v.

Second Ingredient: Precision Sampling

It ultimately suffices to compute an approximation of ED(X, Y) with additive error Θ(K) in order to solve the (k, K)-gap edit distance problem. We leverage this freedom to also solve the recursive subproblems up to some additive error. Specifically, we will work with the following data structure:

[Precision Tree] Let T be a balanced B-ary tree with n leaves. For t > 0, we randomly associate a tolerance t_v to every node v in T as follows:

  • If v is the root, then set t_v := t;

  • otherwise set t_v := t_{parent(v)} · u_v, where we sample u_v ∼ Exp(λ) (the exponential distribution with a suitable parameter λ).

We refer to T as a precision tree with initial tolerance t.
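A stylized Python sketch of this sampling process (the exact scaling and the parameter of the exponential distribution are as in the definition above; here we leave them as inputs, so this is for intuition only):

```python
import random

def precision_tree(depth: int, B: int, t: float, lam: float) -> list:
    """Sample tolerances for a complete B-ary precision tree of the given
    depth: the root gets tolerance t, and every other node gets its
    parent's tolerance times an independent Exp(lam) sample.
    Returns a list of levels, each a list of tolerances."""
    levels = [[t]]
    for _ in range(depth):
        parents = levels[-1]
        levels.append([tp * random.expovariate(lam) for tp in parents for _ in range(B)])
    return levels

random.seed(0)
for lv in precision_tree(depth=3, B=3, t=1.0, lam=2.0):
    print(sum(lv) / len(lv))  # average tolerance per level
```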

The tolerance t_v at a node v determines the additive error which we can tolerate at v. That is, our goal is to approximate ED(X_v, Y_v^{(s)}) with additive error t_v (and the same multiplicative error as before). The initial tolerance is set to t := Θ(K). The critical step is how to combine the recursive approximations with additive errors t_{v_1}, …, t_{v_B} into an approximation with additive error t_v. The naive solution would incur additive error t_{v_1} + ⋯ + t_{v_B}. Instead, we employ the Precision Sampling Lemma [AndoniKO10, AndoniKO11, Andoni17, BringmannCFN22] (see Algorithm 2 in Section 4) to recombine the recursive approximations and avoid this blow-up in the additive error.

An Improved Algorithm

We can now improve the simple algorithm to an almost-linear-time algorithm. In the original Andoni-Krauthgamer-Onak algorithm this was achieved by pruning most recursive subproblems (depending on their tolerances t_v). We will follow a different avenue: Our algorithm recurs on every node in the precision tree, and we obtain an almost-linear-time algorithm by bounding the expected running time per node by n^{o(1)}.

We achieve this by the following novel trick: We restrict the set of feasible shifts at each node v with respect to the tolerance t_v. In fact, we require two constraints: First, we restrict s to values smaller than Θ(K) in absolute value. This first restriction is correct since we only want to maintain edit distances bounded by O(K); this idea was also used in previous works. Second, we restrict the feasible shifts at any node v to multiples of t_v. Then, in order to approximate ED(X_v, Y_v^{(s)}) for an arbitrary shift s, we let s̃ denote the closest multiple of t_v to s and approximate ED(X_v, Y_v^{(s)}) by ED(X_v, Y_v^{(s̃)}). Since |s − s̃| ≤ t_v, both edit distances differ by at most 2t_v. Recall that we can tolerate this error using the Precision Sampling technique. Let S_v denote the set of shifts respecting these restrictions (the precise lower-order term will be fixed later).
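A minimal sketch of this discretization (constants and lower-order terms from the paper are omitted; the names are ours):

```python
def feasible_shifts(K: int, t_v: float) -> range:
    """Stylized feasible shift set S_v: multiples of the tolerance t_v
    with absolute value at most K."""
    step = max(1, int(t_v))        # discretization: multiples of t_v
    start = -(K // step) * step
    return range(start, K + 1, step)

def nearest_feasible(s: int, t_v: float) -> int:
    """Round a shift s to the closest multiple of t_v; the corresponding
    edit distances differ by at most ~2 * t_v."""
    step = max(1, int(t_v))
    return round(s / step) * step

print(list(feasible_shifts(10, 3)))  # [-9, -6, -3, 0, 3, 6, 9]
print(nearest_feasible(7, 3))        # 6
```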

In terms of efficiency, we have improved as follows: At every node v the running time is essentially dominated by the number of feasible shifts |S_v|. Using our discretization trick, there are only O(K/t_v) such shifts. By the following Expected Precision lemma, we can bound this number in expectation by n^{o(1)}.

[Expected Precision] Let T be a B-ary precision tree with initial tolerance t and n leaves, and let v be a node in T. Then, conditioned on a high-probability event E, it holds that

    E[ 1/t_v ∣ E ] ≤ n^{o(1)} / t.

We include a proof in Appendix A. For technical reasons, the lemma is only true conditioned on some high-probability event E. For the remainder of this paper, we implicitly condition on this event E.

3.2 How to Go Sublinear?

Following the idea of [BringmannCFN22], our strategy is to turn this algorithm into a sublinear-time algorithm by not exploring the whole precision tree recursively, and instead only exploring a smaller fragment. To achieve this, the goal is to approximate Δ_v(s) for many nodes v directly, without the need to explore their children; in this case we say that we prune v. The algorithm of [BringmannCFN22] uses several structural insights on periodic versus non-periodic strings to implement its pruning rules. We can avoid this complicated treatment and follow a much simpler avenue, exploiting that we can preprocess the strings.

In the following, we will assume that we have access to an oracle answering the following two queries. In the next section we will argue how to efficiently implement data structures to answer these queries.

  • Matches(v): Returns either a shift s, where s satisfies s ∈ S_v and HD(X_v, Y_v^{(s)}) ≤ t_v, or ⊥ in case that there is no shift s ∈ S_v with HD(X_v, Y_v^{(s)}) = 0; see Section 4.1. (Note that if 0 < min_{s ∈ S_v} HD(X_v, Y_v^{(s)}) ≤ t_v, the query can return either a shift or ⊥.)

  • ShiftED_v(s_1, s_2): For shifts s_1, s_2, returns an approximation of ED(Y_v^{(s_1)}, Y_v^{(s_2)}) with additive error t_v (and n^{o(1)}-multiplicative error, see Section 4.2).

Suppose for the moment that both queries can be answered in constant time. Then we can reduce the running time of the previous almost-linear-time algorithm to time k · n^{o(1)}, as follows: We try to prune each node v by querying Matches(v). If the query reports ⊥, we simply continue recursively as before (i.e., no pruning takes place). However, if the matching query reports a shift s*, then we can prune v as follows: Query ShiftED_v(s*, s) for all shifts s ∈ S_v and return the outcomes as approximations of ED(X_v, Y_v^{(s)}) for all shifts s. For the correctness we apply the triangle inequality and argue that the additive error is bounded by O(t_v).
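The following self-contained toy sketch illustrates the pruning logic. The `Oracle` interface is hypothetical (a stand-in for the preprocessed data structures): for simplicity it answers matching queries exactly and shifted-distance queries by the trivial upper bound 2|s_1 − s_2|:

```python
class Oracle:
    """Toy stand-in for the two preprocessed queries (interface is ours)."""
    def __init__(self, X: str, Y: str):
        self.X, self.Y = X, Y
    def matches(self, lo: int, hi: int, shifts):
        # Return a shift under which Y matches X[lo:hi] exactly, else None.
        for s in shifts:
            if lo + s >= 0 and self.Y[lo + s:hi + s] == self.X[lo:hi]:
                return s
        return None
    def shift_ed(self, lo: int, hi: int, s1: int, s2: int) -> int:
        # ED between the two shifted windows is at most 2|s1 - s2|; we
        # return this trivial upper bound as a placeholder estimate.
        return 2 * abs(s1 - s2)

def estimate(lo: int, hi: int, oracle: Oracle, shifts, B: int = 2) -> dict:
    """Pruned recursion: if a match is found, answer all shifts via
    ShiftED; otherwise recurse on the B children as before."""
    s_match = oracle.matches(lo, hi, shifts)
    if s_match is not None:                       # prune this node
        return {s: oracle.shift_ed(lo, hi, s_match, s) for s in shifts}
    if hi - lo == 1:                              # leaf: compare characters
        return {s: int(oracle.X[lo] != oracle.Y[lo + s:lo + s + 1]) for s in shifts}
    part = (hi - lo) // B
    out = {s: 0 for s in shifts}
    for i in range(B):
        child = estimate(lo + i * part, lo + (i + 1) * part, oracle, shifts, B)
        for s in shifts:
            out[s] += min(child[sp] + 2 * abs(sp - s) for sp in shifts)
    return out

print(estimate(0, 8, Oracle("abcdefgh", "xbcdefgh"), range(-1, 2)))
```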

It remains to argue that the number of recursive computations is bounded by O(k · log_B n). The intuitive argument is as follows: Assume that ED(X, Y) ≤ k (i.e., we are in the "close" case) and consider an optimal alignment between X and Y, which contains at most k mismatches. Recall that each level of the precision tree induces a partition of X into consecutive substrings X_v. Thus, on each level there are at most k substrings X_v which contain a mismatch in the optimal alignment. For all other substrings there are no mismatches, and hence HD(X_v, Y_v^{(s)}) = 0 for some shift s. It follows that on every level there are at most k nodes for which the Matches test fails, and in total there are only O(k · log_B n) such nodes.

3.3 How to Answer Matching and Shifted-Distance Queries?

It remains to design data structures which answer these queries efficiently. We assume that the precision tree has been generated in advance and is shared across all precomputations. In particular, this requires "public randomness" for the otherwise independent precomputations.

Matching Queries

The idea is to precompute and store fingerprints (i.e., hashes) of the substrings Y_v^{(s)} for every node v in the precision tree and all shifts s ∈ S_v. Then, to answer a query Matches(v), we simply compute the fingerprint of X_v and look up whether it equals one of the precomputed fingerprints. Alas, a naive implementation of this idea is too slow, since upon a query we might need to read the whole string X_v. To obtain the desired sublinear query time, we instead subsample the strings X_v and Y_v^{(s)} with rate Θ(log(n) / t_v). In this way we incur additive Hamming error at most t_v, as desired. Formally, we show the following lemma in Section 4.1:

[Matching Queries] We can preprocess Y in expected time n^{1+o(1)} to answer Matches queries in time Õ(n_v / t_v) with high probability. Moreover, we can separately preprocess both X and Y in expected time n^{1+o(1)} to answer Matches queries in time Õ(1) with high probability.

The key difference between the one-sided and the two-sided preprocessing is that in the former we need to compute the fingerprint of X_v at query time, which takes time Õ(n_v / t_v) (we shave a factor of t_v due to the subsampling), while in the latter we can afford to precompute these fingerprints and answer the queries faster.
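A stylized sketch of the fingerprinting idea; Python's built-in `hash` stands in for a universal hash function, and the interface is ours. Sharing the seed ("public randomness") makes fingerprints comparable across separately preprocessed strings:

```python
import random

def make_fingerprint(n: int, rate: float, seed: int):
    """Build a fingerprint function for length-n strings: subsample a set
    of positions with the given rate, then hash the subsequence."""
    rng = random.Random(seed)
    positions = [i for i in range(n) if rng.random() < rate]
    salt = rng.getrandbits(64)
    def fingerprint(s: str) -> int:
        # Hash of the subsampled subsequence: strings at Hamming distance
        # around log(n)/rate or more collide only with small probability.
        return hash((salt, tuple(s[i] for i in positions)))
    return fingerprint

fp = make_fingerprint(n=16, rate=0.5, seed=42)
print(fp("abcdefghabcdefgh") == fp("abcdefghabcdefgh"))  # True
print(fp("abcdefghabcdefgh") == fp("XbcdefghXbcdefXh"))  # likely False
```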

Shifted-Distance Queries

We give two ways to answer ShiftED queries. The first one relies in a black-box manner on any almost-linear-time algorithm computing an edit distance approximation with multiplicative error n^{o(1)} [AndoniKO10, AndoniO12, AndoniN20]. Using the same trick as before to restrict the set of feasible shifts, it suffices to approximate ED(Y_v^{(s_1)}, Y_v^{(s_2)}) for all shifts s_1, s_2 that are multiples of t_v of absolute value O(K). (We could also discretize more aggressively, but this does not improve the performance here.) In this way, we incur an additive error of at most O(t_v). The computation per node v is dominated by the O((K/t_v)^2) edit distance computations on strings of length n_v; bounding this quantity in expectation via the Expected Precision lemma and summing over all nodes v yields the preprocessing time.

We then show that we can improve the preprocessing time to n^{1+o(1)} by applying the non-oblivious embedding of edit distance into ℓ1 by Andoni and Onak [AndoniO12]. Formally, we obtain the following lemma, which we prove in Section 4.3.

[Shifted-Distance Queries] We can preprocess Y in time n^{1+o(1)} to answer ShiftED queries in time n^{o(1)} with high probability.

4 Our Algorithm in Detail

In this section we give detailed proofs of our main theorems by analyzing Algorithms 1 and 2 (see Sections 4.4 and 4.5). Algorithm 2 is the previously discussed reformulation of the Andoni-Krauthgamer-Onak algorithm, with the improvement that the recursive computation can be avoided whenever Algorithm 1 succeeds. Algorithm 1 implements the pruning rule which, as we will argue, triggers often enough to improve the running time.

Throughout we fix the initial tolerance of the precision tree T to t := Θ(K), where K := k · n^{o(1)} denotes the upper gap parameter; the precise choices of the branching factor B and of the lower-order factors are as discussed in Section 3.

Outline

Compared to the technical overview, we present the details in reverse order: Starting with the implementation of the data structures for Matches and ShiftED queries (in Sections 4.1 to 4.3), we first analyze Algorithm 1 (in Section 4.4). Then we analyze Algorithm 2 (in Section 4.5) and put the pieces together for our main theorems (in Section 4.6).

4.1 Matching Queries

In this section we show how to implement data structures that answer Matches queries. We start with a formal definition.

[Matching Queries] Let X, Y be strings of length n, and let v be a node in the precision tree. A query Matches(v) is correctly answered by one of two outputs:

  • a shift s ∈ S_v satisfying HD(X_v, Y_v^{(s)}) ≤ t_v, or

  • ⊥ if there exists no shift s ∈ S_v with HD(X_v, Y_v^{(s)}) = 0.

We restate and prove the Matching Queries lemma from Section 3.3.

Proof.

We first explain the general idea behind the data structure and then point out the specifics of the one-sided and two-sided preprocessing. For each node v in the precision tree, we subsample a set P_v of positions with rate Θ(log(n) / t_v), and we sample a hash function h_v from any universal family of hash functions. (Take for instance the function h(x) = Σ_i a_i x_i mod p for some prime p and random coefficients a_i.)

We associate to a string Z the fingerprint φ_v(Z) := h_v(Z[P_v]), where we write Z[P] for the subsequence of Z with indices in P. We claim that these fingerprints can distinguish any two strings Z, Z' with HD(Z, Z') > t_v with high probability. Indeed, the probability that P_v contains an index i with Z[i] ≠ Z'[i] is at least

    1 − (1 − Θ(log(n)/t_v))^{t_v} ≥ 1 − n^{−Θ(1)}.

Hence, with high probability we have Z[P_v] ≠ Z'[P_v]. Moreover, since h_v is sampled from a universal family of hash functions, we also have φ_v(Z) ≠ φ_v(Z') with high probability. On the other hand, if Z = Z', then clearly φ_v(Z) = φ_v(Z'). We now turn to the implementation details for one-sided and two-sided preprocessing.

One-Sided Preprocessing. In the preprocessing phase, we prepare for every node v the fingerprints φ_v(Y_v^{(s)}) for all shifts s ∈ S_v, and store these fingerprints in a lookup table. In the query phase, given the string X_v, we compute the fingerprint φ_v(X_v) and check whether it appears in the lookup table. If so, we return the shift s corresponding to that precomputed fingerprint. Otherwise, we return ⊥. The correctness follows from the previous paragraph.

The expected running time of the preprocessing phase is bounded by n^{1+o(1)}: for every node v we have to prepare |S_v| = O(K/t_v) fingerprints, each taking time Õ(n_v / t_v). This becomes n^{1+o(1)} in expectation over the tolerances t_v; see the Expected Precision lemma (Section 3.1). The query time is dominated by computing the fingerprint of X_v, which takes expected time Õ(n_v / t_v) (even with high probability, by an application of the Chernoff bound).

Two-Sided Preprocessing. For two-sided preprocessing, we can additionally prepare the fingerprints φ_v(X_v) for all nodes v in the preprocessing phase. The expected preprocessing time is still n^{1+o(1)}, but a query only takes constant time to perform the lookup. ∎

4.2 Simple Shifted-Distance Queries

We next demonstrate how to deal with ShiftED queries, formally defined as follows:

[Shifted-Distance Queries] Let Y be a string of length n, let v be a node in the precision tree and let s_1, s_2 be shifts. A query ShiftED_v(s_1, s_2) computes a number Δ satisfying

    ED(Y_v^{(s_1)}, Y_v^{(s_2)}) / n^{o(1)} − t_v ≤ Δ ≤ n^{o(1)} · ED(Y_v^{(s_1)}, Y_v^{(s_2)}) + t_v.

We will first show how to implement a simpler version of the Shifted-Distance Queries lemma from Section 3.3, at the cost of a worse preprocessing time instead of n^{1+o(1)}. The benefit of this weaker version is that it is a black-box reduction to any almost-linear-time n^{o(1)}-approximation algorithm for edit distance, while the improved version crucially relies on the properties of the particular algorithm by Andoni and Onak [AndoniO12]. In the next section we show how to obtain the speed-up.

[Slower Shifted-Distance Queries] We can preprocess Y in time K · n^{1+o(1)} to answer ShiftED queries in time Õ(1) with high probability.

Proof.

We will use the result that an n^{o(1)}-approximation of the edit distance of two length-n strings can be computed in time n^{1+o(1)} [AndoniKO10, AndoniO12, AndoniN20]. In the preprocessing phase, we compute n^{o(1)}-factor approximations Δ_v(s̃_1, s̃_2) of ED(Y_v^{(s̃_1)}, Y_v^{(s̃_2)}) for all nodes v and all multiples s̃_1, s̃_2 of t_v of absolute value O(K). Then, to answer a query ShiftED_v(s_1, s_2), we let s̃_1, s̃_2 denote the closest such multiples of t_v to s_1, s_2, respectively, and output Δ_v(s̃_1, s̃_2).

First we argue that this gives a good approximation. Indeed, we have |s_i − s̃_i| ≤ t_v by the definition of s̃_i. Therefore:

    Δ_v(s̃_1, s̃_2) ≤ n^{o(1)} · ED(Y_v^{(s̃_1)}, Y_v^{(s̃_2)}) ≤ n^{o(1)} · (ED(Y_v^{(s_1)}, Y_v^{(s_2)}) + 2|s_1 − s̃_1| + 2|s_2 − s̃_2|) ≤ n^{o(1)} · ED(Y_v^{(s_1)}, Y_v^{(s_2)}) + O(t_v) · n^{o(1)},

where the second inequality is an application of the triangle inequality, and the last inequality follows since we can transform Y_v^{(s_i)} into Y_v^{(s̃_i)} by adding and removing |s_i − s̃_i| ≤ t_v symbols at the boundaries (choosing the discretization by a lower-order factor finer than t_v absorbs the n^{o(1)} factor in the additive error). A symmetric argument shows the claimed lower bound.

Next we analyze the running time. For the preprocessing, we compute O((K/t_v)^2) many approximations for each node v, each in time n_v^{1+o(1)}. Bounding the number of shifts in expectation via the Expected Precision lemma (Section 3.1) and summing over all nodes v yields the claimed preprocessing time. Answering a query takes constant time, since we only need to compute the rounded shifts s̃_1, s̃_2 as stated above and perform a constant-time lookup. ∎
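The following sketch mirrors this construction on a toy scale, using exact edit distance in place of the almost-linear-time approximation algorithm (all names and parameters are ours):

```python
def ed(x: str, y: str) -> int:
    """Exact edit distance; a stand-in for the almost-linear-time
    approximation algorithm used in the actual preprocessing."""
    n, m = len(x), len(y)
    dp = list(range(m + 1))
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, m + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x[i - 1] != y[j - 1]))
    return dp[m]

def preprocess_shift_ed(Y: str, lo: int, hi: int, t: int, K: int) -> dict:
    """Precompute d[(s1, s2)] ~ ED(Y[lo+s1:hi+s1], Y[lo+s2:hi+s2]) for all
    shifts that are multiples of the tolerance t with |s| <= K."""
    S = [s for s in range(-(K // t) * t, K + 1, t) if lo + s >= 0]
    return {(s1, s2): ed(Y[lo + s1:hi + s1], Y[lo + s2:hi + s2])
            for s1 in S for s2 in S}

def query_shift_ed(table: dict, t: int, s1: int, s2: int) -> int:
    """Answer ShiftED(s1, s2): round both shifts to the nearest stored
    multiple of t; this adds only O(t) additive error."""
    r = lambda s: round(s / t) * t
    return table[(r(s1), r(s2))]

Y = "abracadabraabracadabra"
table = preprocess_shift_ed(Y, lo=8, hi=14, t=2, K=4)
print(query_shift_ed(table, t=2, s1=1, s2=-3))  # ~ ED(Y[9:15], Y[5:11])
```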

4.3 Faster Shifted-Distance Queries

In this section we show how to improve the preprocessing time for ShiftED queries to n^{1+o(1)}. We thank the anonymous reviewer who suggested this improvement. The key technical tool is the following result by Andoni and Onak [AndoniO12].

[Embedding Substrings into ℓ1 [AndoniO12]] Let Y be a string of length n. Then, for each integer ℓ of the form ℓ = 2^j with j ∈ [log n], we can embed all length-ℓ substrings of Y into vectors v_{ℓ,a} of dimension n^{o(1)} such that for all positions a, b, with high probability it holds that

    ED(Y[a..a+ℓ), Y[b..b+ℓ)) ≤ ‖v_{ℓ,a} − v_{ℓ,b}‖_1 ≤ n^{o(1)} · ED(Y[a..a+ℓ), Y[b..b+ℓ)).

The time to compute all vectors is n^{1+o(1)}.

The main application of this embedding in [AndoniO12] is an n^{o(1)}-approximation of the edit distance of two length-n strings in time n^{1+o(1)} (in [AndoniO12], the embedding is applied to the concatenation of the two strings in order to compute the ℓ1-distance between their corresponding vectors). We remark that the guarantee of the embedding is non-oblivious, in the sense that the algorithm needs to have access to all the substrings it is embedding. In particular, this means that it cannot be directly applied in the two-sided preprocessing setting, where we would like to embed the strings separately.

We restate and prove the Shifted-Distance Queries lemma from Section 3.3.

Proof.

We assume without loss of generality that n is a power of 2 (we can pad both X and Y so that this holds, which has no impact on the running time or the approximation guarantee of our algorithms). We apply the embedding to the string Y and store all the embedded vectors. To answer a query ShiftED_v(s_1, s_2), note that Y_v^{(s_1)} and Y_v^{(s_2)} are substrings of Y of the same length n_v. Thus, the ℓ1-distance between their corresponding vectors gives the desired approximation (with no additive error). Since these vectors have dimension n^{o(1)}, the time to compute and answer the query is n^{o(1)}, as desired. ∎
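On the query side, the resulting data structure is very simple: store one vector per substring and answer a query by a single ℓ1-distance computation. A toy sketch (with random stand-in vectors, since the embedding itself is the involved part; the class and its parameters are ours):

```python
import numpy as np

class ShiftEDIndex:
    """Toy query layer over an Andoni-Onak-style embedding: one vector per
    substring start, queries answered by an l1 distance. Here the vectors
    are random stand-ins; in the real data structure they come from the
    embedding above."""
    def __init__(self, n: int, length: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # vec[a] plays the role of the embedding of Y[a:a+length].
        self.vec = rng.random((n - length + 1, dim))
    def query(self, a: int, b: int) -> float:
        # l1 distance approximates ED(Y[a:a+length], Y[b:b+length])
        # up to the distortion of the embedding; time O(dim).
        return float(np.abs(self.vec[a] - self.vec[b]).sum())

idx = ShiftEDIndex(n=64, length=16, dim=8)
print(idx.query(3, 19))
```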

4.4 Pruning Rule for Preprocessed Strings

In this section we analyze Algorithm 1. We always assume that either only Y, or both X and Y, have been preprocessed as in the Matching Queries and Shifted-Distance Queries lemmas (Section 3.3), so that Matches and ShiftED queries can be answered efficiently.

Input: strings X (un- or preprocessed) and Y (preprocessed), and a node v in the precision tree.
Output: an approximation Δ_v(s) of ED(X_v, Y_v^{(s)}) for all s ∈ S_v, or Fail.
1: if Matches(v) returns a shift s* then
2:     return Δ_v(s) := ShiftED_v(s*, s) for all s ∈ S_v
3: else
4:     return Fail
Algorithm 1

[Correctness of Algorithm 1] Whenever Algorithm 1 does not return Fail, it returns approximations Δ_v(s) satisfying, with high probability,

    ED(X_v, Y_v^{(s)}) / n^{o(1)} − O(t_v) ≤ Δ_v(s) ≤ n^{o(1)} · (ED(X_v, Y_v^{(s)}) + O(t_v)).

Proof.

Algorithm 1 only succeeds if the Matches(v) query in line 1 successfully identified some shift s* with HD(X_v, Y_v^{(s*)}) ≤ t_v (see the Matching Queries definition). In this case, we report Δ_v(s) := ShiftED_v(s*, s). By the Shifted-Distance Queries lemma, this is an approximation of ED(Y_v^{(s*)}, Y_v^{(s)}) with additive error t_v and multiplicative error n^{o(1)}. Combining both facts, and using the triangle inequality together with ED(X_v, Y_v^{(s*)}) ≤ HD(X_v, Y_v^{(s*)}) ≤ t_v, we obtain that

    Δ_v(s) ≤ n^{o(1)} · ED(Y_v^{(s*)}, Y_v^{(s)}) + t_v ≤ n^{o(1)} · (ED(X_v, Y_v^{(s)}) + t_v) + t_v

and

    Δ_v(s) ≥ ED(Y_v^{(s*)}, Y_v^{(s)}) / n^{o(1)} − t_v ≥ (ED(X_v, Y_v^{(s)}) − t_v) / n^{o(1)} − t_v. ∎

[Running Time of Algorithm 1] If only the string Y is preprocessed, then Algorithm 1 runs in expected time ((n_v + K) / t_v) · n^{o(1)}. If both strings are preprocessed, then Algorithm 1 runs in expected time (K / t_v) · n^{o(1)}.

Proof.

The expected running time of the Matches(v) query is bounded by Õ(n_v / t_v) (for one-sided preprocessing) or by Õ(1) (for two-sided preprocessing). The running time of a single ShiftED query is bounded by n^{o(1)}, and we make |S_v| = O(K / t_v) such queries. Hence, the total expected time is ((n_v + K) / t_v) · n^{o(1)} (for one-sided preprocessing) or (K / t_v) · n^{o(1)} (for two-sided preprocessing). ∎

[Efficiency of Algorithm 1] For any two strings X, Y, there are at most O(k · log_B n) many nodes in the precision tree for which Algorithm 1 fails, assuming that ED(X, Y) ≤ k.

Proof.

It suffices to argue that on every level there are at most k nodes for which the Matches query in line 1 fails. We focus on a fixed level of the precision tree, and enumerate the nodes on that level as v_1, …, v_m. By definition we have X = X_{v_1} ∘ ⋯ ∘ X_{v_m}, hence there exist indices i_1 ≤ ⋯ ≤ i_{m+1} such that X_{v_ν} = X[i_ν..i_{ν+1}). Now consider an optimal alignment A between X and Y. In particular, A satisfies

    ED(X, Y) = Σ_{ν=1}^{m} ED(X_{v_ν}, Y[A(i_ν)..A(i_{ν+1}))).

There can be at most k many nonzero terms in the sum, and we claim that every zero term corresponds to a node for which Algorithm 1 succeeds. Indeed, for any zero term we have X_{v_ν} = Y[A(i_ν)..A(i_{ν+1})) and therefore HD(X_{v_ν}, Y_{v_ν}^{(s)}) = 0, where s := A(i_ν) − i_ν. It remains to argue that |s| ≤ k (since otherwise the shift would be out-of-bounds and could not be detected by a Matches query). To see this, observe that

    |s| = |A(i_ν) − i_ν| ≤ ED(X[0..i_ν), Y[0..A(i_ν))) ≤ ED(X, Y) ≤ k,

exploiting again that A is an optimal alignment.

Finally, recall that there are only O(log_B n) levels in the precision tree, hence the total number of nodes for which Algorithm 1 fails is bounded by O(k · log_B n). ∎

4.5 The Complete Algorithm

Our complete Algorithm 2 is essentially the algorithm we described in Section 3, with the additional pruning rule of applying Algorithm 1. We start with the correctness of Algorithm 2. We need the following lemma, which has previously been referred to as a Precision Sampling Lemma [AndoniKO10, AndoniKO11, Andoni17, BringmannCFN22]. The lemma was introduced by Andoni, Krauthgamer and Onak [AndoniKO10], and was refined in [AndoniKO11, Andoni17].

Intuitively, the lemma serves the following purpose: For fixed numbers a_1, …, a_B, say that we have access to approximations ã_1, …, ã_B with multiplicative error c and additive error δ. Then we can naively approximate the sum a_1 + ⋯ + a_B by ã_1 + ⋯ + ã_B with multiplicative error c and additive error B · δ. The Precision Sampling Lemma states that the blow-up in the additive error can be avoided if the approximations instead have additive error δ · u_i for some non-uniformly sampled precisions u_1, …, u_B.

Input: strings X (un- or preprocessed) and Y (preprocessed), and a node v in the precision tree.
Output: an approximation Δ_v(s) of ED(X_v, Y_v^{(s)}) for all s ∈ S_v.
1: if Algorithm 1 succeeds and reports Δ_v(s) for all s ∈ S_v then