Approximate Trace Reconstruction via Median String (in Average-Case)

We consider an approximate version of the trace reconstruction problem, where the goal is to recover an unknown string s∈{0,1}^n from m traces (each trace is generated independently by passing s through a probabilistic insertion-deletion channel with rate p). We present a deterministic near-linear time algorithm for the average-case model, where s is random, that uses only three traces. It runs in near-linear time Õ(n) and with high probability reports a string within edit distance O(ϵ p n) from s for ϵ=Õ(p), which significantly improves over the straightforward bound of O(pn). Technically, our algorithm computes a (1+ϵ)-approximate median of the three input traces. To prove its correctness, our probabilistic analysis shows that an approximate median is indeed close to the unknown s. To achieve a near-linear time bound, we have to bypass the well-known dynamic programming algorithm that computes an optimal median in time O(n^3).

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

07/24/2021

Near-Optimal Average-Case Approximate Trace Reconstruction from Few Traces

In the standard trace reconstruction problem, the goal is to exactly rec...
11/02/2020

Approximating the Median under the Ulam Metric

We study approximation algorithms for variants of the median string prob...
12/12/2020

Approximate Trace Reconstruction

In the usual trace reconstruction problem, the goal is to exactly recons...
08/27/2020

Polynomial-time trace reconstruction in the smoothed complexity model

In the trace reconstruction problem, an unknown source string x ∈{0,1}^n...
07/12/2019

Efficient average-case population recovery in the presence of insertions and deletions

Several recent works have considered the trace reconstruction problem, i...
07/06/2020

Near-Linear Time Edit Distance for Indel Channels

We consider the following model for sampling pairs of strings: s_1 is a ...
03/04/2020

Pivot Selection for Median String Problem

The Median String Problem is W[1]-Hard under the Levenshtein distance, t...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Trace Reconstruction.

One of the most common problems in statistics is to estimate an unknown parameter from a set of noisy observations (or samples). The main objectives are (1) to use as few samples as possible, (2) to minimize the estimation error, and (3) to design an efficient estimation algorithm. One such parameter-estimation problem is

trace reconstruction, where the unknown quantity is a string , and the observations are independent traces, where a trace is a string that results from passing through some noise channel. The goal is to reconstruct using a few traces. (Unless otherwise specified, in this paper we consider .) Various noise channels have been considered so far. The most basic one only performs substitutions. A more challenging channel performs deletions. Even more challenging is the insertion-deletion channel, which scans the string and keeps the next character with probability , deletes it with probability , or inserts a uniformly randomly chosen symbol (without processing the next character) with probability , for some noise-rate parameter . We denote this insertion-deletion channel by , see Section 2 for a formal definition.111In the literature, an insertion-deletion channel with different probabilities for insertion and for deletion has been studied. For simplicity in exposition, we consider a single error probability throughout this paper, however our results can easily be generalized to different insertion and deletion probabilities. Another possible generalization is to allow substitutions along with insertions and deletions. Again, for simplicity, we do not consider substitutions, but with slightly more careful analysis our results could be extended.

The literature studies mostly two variants of trace reconstruction. In the worst-case variant, the unknown string is an arbitrary string from , while in the average-case variant, is assumed to be drawn uniformly at random from . The trace reconstruction problem finds numerous applications in computational biology, DNA storage systems, coding theory, etc. Starting from early 1970s [Kal73], various other versions of this problem have been studied, including combinatorial channels [Lev01a, Lev01b], smoothed complexity [CDL21b], coded trace reconstruction [CGMR20], and population recovery [BCF19, Nar20].

We focus on the average-case variant, where it is known that samples suffice to reconstruct a (random) unknown string over the insertion-deletion channel [HPPZ20]. On the other hand, a recent result [Cha21a] showed that samples are necessary, improving upon the previous best lower bound of  [HL20]. We emphasize that all these upper and lower bounds are for exact trace reconstruction, i.e., for recovering the unknown string perfectly (with no errors). A natural question proposed by Mitzenmacher [Mit09] is whether such a lower bound on the sample complexity can be bypassed by allowing approximation, i.e., by finding a string that is “close” to the unknown string . One of the most fundamental measures of closeness between a pair of strings and , is their edit distance, denoted by and defined as the minimum number of insertion, deletion, and substitution operations needed to transform into . Observe that a trace generated from via an insertion-deletion channel has expected edit distance about from the unknown string (see Section 3). We ask how many traces (or samples) are required to construct a string at a much smaller edit distance from the unknown . (Since the insertion-deletion channel has no substitutions, we also do not consider substitutions in our analysis of the edit distance, however the results probably hold also when both allow also substitutions.)

A practical application of average-case trace reconstruction is in the portable DNA-based data storage system. In the DNA storage system [GBC13, RMR17], a file is preprocessed by encoding it into a DNA sequence. This encoded sequence is randomized using a pseudo-random sequence, and thus the final encoding sequence could be treated as a (pseudo-)random string. The stored (encoded) data is retrieved using next-generation sequencing (like single-molecule real-time sequencing (SMRT) [RCS13] that involves , which generates several noisy copies (traces) of the stored data via some insertion-deletion channel. The final step is to decode back the stored data with as few traces as possible. Currently, researchers use multiple sequence alignment algorithms to reconstruct the trace [YGM17, OAC18]

. Unfortunately, such heuristic algorithms are notoriously difficult to analyze rigorously to show a theoretical guarantee. However, the preprocessing step also involves error-correcting code to encode the strings. Thus it suffices to reconstruct the original string up to some small error (depending on the error-correcting codes used). This specific application gives one motivation to study approximate trace reconstruction.

Our main contribution is to show that it is sufficient to use only three traces to reconstruct the unknown string up to a small edit error.

Theorem 1.1.

There is a constant and a deterministic algorithm that, given as input a noise parameter , and three traces from the insertion-deletion channel for a uniformly random (but unknown) string , outputs in time a string that satisfies .

The probability in this theorem is over the random choice of and the randomness of the insertion-deletion channel . We note that the term in the estimation error can be shaved by increasing the alphabet size to . An edit error of is optimal for three traces, because in expectation characters of are deleted in two of the three traces, and look as if they are inserted to in one of the three traces (which occurs in expectation for even more characters).

Our theorem demonstrates that the number of required traces exhibits a sharp contrast between exact and approximate trace reconstruction. In fact, approximate reconstruction not only beats the lower bound for exact reconstruction, but surprisingly uses only three traces! We conjecture that the estimation error can be reduced further using more than three traces. We believe that our technique can be useful here, but this is left open for future work.

Conjecture 1.

The estimation error in Theorem 1.1 can be reduced to for arbitrarily small , using traces.

This conjecture holds for , as follows from known bounds for exact reconstruction [HPPZ20], and perhaps suggests that a number of traces that is sub-polynomial in it suffices for all .

Median String.

As mentioned earlier, a common heuristic to solve the trace reconstruction problem is multiple sequence alignment, which can be formulated equivalently (see [Gus97] and the references therein) as the problem of finding a median under edit distance. For general context, the median problem is a classical aggregation task in data analysis; its input is a set of points in a metric space relevant to the intended application, and the goal is to find a point (not necessarily from ) with the minimum sum of distances to points in , i.e.,

(1)

Such a point is called a median (or geometric median in a Euclidean space). For many applications, it suffices to find an approximate median, i.e., a point in the metric with approximately minimal objective value (1) . The problem of finding an (approximate) median has been studied extensively both in theory and in applied domains, over various metric spaces, including Euclidean [CLM16] (see references therein for an overview), Hamming (folklore), the edit metric [San75, Kru83, NR03], rankings [DKNS01, ACN08, KS07], Jaccard distance [CKPV10], Ulam [CDK21], and many more [FVJ08, Min15, CCGB17].

The median problem over the edit-distance metric is known as the median string problem [Koh85], and finds numerous applications in computational biology [Gus97, Pev00], DNA storage system [GBC13, RMR17], speech recognition [Koh85], and classification [MJC00]. This problem is known to be NP-hard [dlHC00, NR03] (even W[1]-hard [NR03]), and can be solved by standard dynamic programming [San75, Kru83] in time when the input has strings of length each. From the perspective of approximation algorithms, a multiplicative -approximation to the median is straightforward (this works in every metric space by simply reporting the best among the input strings, i.e., that minimizes the objective). However, no polynomial-time algorithm is known to break below -approximation (i.e., achieve factor for fixed ) for the median string problem, despite several heuristic algorithms and results for special cases [CA97, Kru99, FZ00, PB07, ARJ14, HK16, MAS19, CDK21].

Although the median string (or equivalently multiple sequence alignment) is a common heuristic for trace reconstruction [YGM17, OAC18], to the best of our knowledge there is no definite connection between these two problems. We show that both the problems are roughly the same in the average-case model. It is not difficult to show that any string close to the unknown string is an approximate median. To see this, we can show that for a set of traces of an (unknown) random string , their optimal median objective value is at least with high probability. On the other hand, the median objective value with respect to itself is at most with high probability. (See the proof of Claim 4.5.) Hence, the unknown string is an -approximate median of , and by the triangle inequality, every string close (in edit distance) to is also an approximate median of . One of the major contributions of this paper is the converse direction, showing that given a set of traces of an unknown string, any approximate median of the traces is close (in edit distance) to the unknown string. This is true even for three traces.

Theorem 1.2.

For a large enough and a noise parameter , let the string be chosen uniformly at random, and let be three traces generated by the insertion-deletion channel . If is a -approximate median of for , then , where OPT denotes the optimal median objective value of .

An immediate consequence (see Corollary 4.9) is that for every traces, every -approximate median satisfies .

Thus if we could solve any of the two problems (even approximately), we get an approximate solution to the other problem. E.g., the current best (exact) trace reconstruction algorithm for the average-case [HPPZ20] immediately gives us an time algorithm to find an -approximate median of a set of traces. We leverage this interplay between the two problems to design an efficient algorithm for approximate trace reconstruction. Since one can compute the (exact) median of three strings in time  [San75, Kru83], the above theorem immediately provides us the unknown string up to some small edit error in time . We further reduce the running time to near-linear by cleverly partitioning each of the traces into -size blocks and then applying the median algorithm on these blocks. Finally, we concatenate all the block-medians to get an “approximate” unknown string, leading to Theorem 1.1. One may further note that Theorem 1.1 also provides a -approximate median for any set of traces in the average-case (again due to Theorem 1.2).

Taking the smallest possible in Theorem 1.2, we get that for three traces generated from , with high probability . In comparison, it is not hard to see that with high probability OPT is bounded by roughly . We conjecture that the number of traces increases, the median string converges to the unknown string . In particular, when using traces (instead of just three), with high probability. We hope that our technique can be extended to prove the above conjecture, but we leave it open for future work.

The main implication of this conjecture is an time approximate trace reconstruction algorithm, for any fixed , as follows. It is straightforward to extend our approximate median finding algorithm (in Section 5) to more input strings. (For brevity, we present only for three input strings.) For strings, the running time would be , and for strings this running time is . As a consequence, we will be able to reconstruct in time a string such that , which in particular implies Conjecture 1.

1.1 Related Work

A systematic study on the trace reconstruction problem has been started since [Lev01a, Lev01b, BKKM04]. However, some of its variants appeared even in the early ’70s [Kal73]. One of the main objectives here is to reduce the number of traces required, aka the sample complexity. Both the deletion only and the insertion-deletion channels have been considered so far. In the general worst-case version, the problem considers the unknown string to be any arbitrary string from . The very first result by Batu et al. [BKKM04] asserts that for small deletion probability (noise parameter) , to reconstruct considering samples suffice. A very recent work [CDL21a] improved the sample complexity to while allowing a deletion probability . For any constant deletion probability bounded away from 1, the first subexponential (more specifically, ) sample complexity was shown by [HMPW08], which was later improved to  [NP17, DOS17], and then finally to  [Cha21b].

Another natural variant that has also been widely studied is the average-case, where the unknown string is randomly chosen from . It turns out that this version is significantly simpler than the worst-case in terms of the sample complexity. For sufficiently small noise parameter ( as a function of ), efficient trace reconstruction algorithms are known [BKKM04, KM05, VS08]. For any constant noise parameter bounded away from 1 in case of insertion-deletion channel, the current best sample complexity is  [HPPZ20] improving up on  [PZ17]. Both of these results are built on the worst-case trace reconstruction by [NP17, DOS17]. Furthermore, the trace reconstruction algorithm of [HPPZ20] runs in time.

In the case of the lower bound, information-theoretically, it is easy to see that samples must be needed when the deletion probability is at least some constant. In the worst-case model, the best known lower bound on the sample complexity is  [Cha21a]. For the average-case, McGregor, Price, and Vorotnikova [MPV14] showed that samples are necessary to reconstruct the unknown (random) string . This bound was further improved to by Holden and Lyons [HL20], and very recently to by Chase [Cha21a].

The results described above show an exponential gap between the upper bound and lower bound of the sample complexity. The natural question is, instead of reconstructing the unknown string exactly, if we allow some error in the reconstructed string, then can we reduce the sample complexity? Recently, Davies et al. [DRRS20] presented an algorithm that for a specific class of strings (considering various run-lengths or density assumptions), can compute an approximate trace with additive error under the edit distance while using only samples. The authors also established that to approximate within the edit distance , the number of required samples is , for , in the worst case. Independently, Grigorescu et al. [GSZ20] showed assuming deletion probability , there exist two strings within edit distance 4 such that any mean-based algorithm requires samples to distinguish them.

1.2 Technical Overview

The key contribution of this paper is a linear-time approximate trace reconstruction algorithm that uses only three traces to reconstruct an unknown (random) string up to some small edit error (Theorem 1.1). To get our result, we establish a relation between the (approximate) trace reconstruction problem and the (approximate) median string problem. Consider a uniformly random (unknown) string . We show that for any three traces of generated by the probabilistic insertion-deletion channel , an arbitrary -approximate median of the three trace must be, with high probability, -close in edit distance to (Theorem 1.2). Once we establish this connection, it suffices to solve the median problem (even approximately). The median of three traces can be solved optimally in time using a standard dynamic programming algorithm [San75, Kru83]. It is not difficult to show that the optimal median objective value OPT is at least (Claim 4.5), and thus the computed median is at edit distance at most from the unknown string . This result already beats the known lower bound for exact trace reconstruction in terms of sample complexity. However, the running time is cubic in , whereas current average-case trace reconstruction algorithms run in time  [HPPZ20].

Next we briefly describe the algorithm that improves the running time to . Instead of finding a median of the entire traces, we compute the median block-by-block and then concatenate the resulting blocks. A natural idea is that each such block is just the median of three substrings taken from the three traces, but the challenge is to identify which substring to take from each trace, particularly because is not known. To mitigate this issue, we take the first trace and partition it into disjoint blocks of length each. For each such block, we consider its middle -size sub-block as an anchor. We then locate for each anchor its corresponding substrings in the other two traces and

, using any approximate pattern matching algorithm under the edit metric (e.g. 

[LV89, GP90]) to find the best match of the anchor inside . Each anchor has a true match in and in , i.e., the portion that the anchor generated under the noise channel. Since the anchors in are “well-separated” (by at least ), their true matches are also far apart both in . Further, exploiting the fact that is a random string, we can argue that each anchor’s best match and true match overlap almost completely (i.e., except for a small portion), see Section 5). We thus treat these best match blocks as anchors in and and partition them into blocks. From this point, the algorithm is straightforward. Just consider the first block of each of and compute their median. Then consider the second block from each trace and compute their median, and so on. Finally, concatenate all these block medians, and output the resulting string.

The crucial claim is that the best match and true match of an anchor in are the same except for a small portion, is crucial from two aspects. First, it ensures that any -th block of contains the true match of the -th anchor of . Consequently, computing a median of these blocks reconstructs the corresponding portion of the unknown string up to edit distance with high probability. Thus for “most of the blocks”, we can reconstruct up to such edit distance bound. We can make the length of the non-anchor portions negligible compared to the anchors (simply because a relatively small “buffer” around each anchor suffices), and thus we may ignore them and still ensure that the output string is -close (in edit distance) to the unknown string . (See the proof of Lemma 5.6 for the details.) The second use of that crucial claim is that it helps in searching for the best match of each anchor “locally” (within a -size window) in each , . As a result, we bound the running time of the pattern matching step by . The median computations are also on -size blocks, and thus takes a total time.

It remains to explain the key contribution, which is the connection between the (approximate) trace reconstruction and the (approximate) median string problem. Its first ingredient is that there is an “almost unique” alignment between the unknown string and a trace of it generated by the insertion-deletion channel . We provide below an overview of this analysis (see Section 3 for details). We start by considering the random string and a string generated by passing through the noise channel . For sake of analysis, we can replace with an equivalent probabilistic model , that first computes a random alignment between and , and only then fills in random characters in and in the insertion-positions in . This model is more convenient because it separates the two sources of randomness, for example we can condition on one () when analyzing typical behavior of the other (characters of ).

Figure 1: (a) An example of well separated edit operation: deletes and aligns rest of the characters in block with block . (b) aligns and . For each index appearing left of in , .

In expectation, the channel generates a trace by performing about random edit operations in (planting insertions/deletions), hence the planted alignment has expected cost about . But can these edit operations cancel each other? Can they otherwise interact, leading to the optimal edit distance being smaller? For example, suppose . If first inserts a before and then deletes , then clearly these two operations cancel each other. We show that such events are unlikely. Following this intuition, we establish our first claim, that with high probability and the edit distance between is large, specifically for , see Lemma 3.3; thus, the planted alignment is near-optimal. Towards proving this, we first show that a vast majority of the planted edit operations are well-separated, i.e., have positions between them. In this case, for one operation to cancel another one, the characters appearing between them in must all be equal, which happens with a small probability because is random.

Formally, for almost all indices where performs some edit operation, the block around it in (for a small constant ), satisfies that is the only index in that edits (see Lemma 3.2). Next we show that in every optimal alignment between and , almost all these blocks contribute a cost of . As otherwise, there is locally an alignment that aligns each index in to some character in , while makes exactly one edit operation, say deletes . See for example Figure 1, where aligns and whereas deletes . In this case, and must disagree on at least indices (all indices either to the right or to the left of in ). In Figure 1, all satisfy . The crux is that any pair of symbols in are chosen independently at random unless aligns their positions. Thus probability that in each of the pairs aligned by , the two matched symbols will be equal is . In the formal proof, we address several technical issues, like having not just one but many blocks, and possible correlations due to overlaps between different pairs, which are overcome by a carefully crafted union bound.

We further need to prove that the planted alignment is robust, in the sense that, with high probability, every near-optimal alignment between and must “agree” with the planted alignment on all but a small fraction of the edit operations. Formally, we again consider a partition of into blocks containing exactly one planted edit operation, and show that for almost all such blocks , if maps to a substring in , that near-optimal alignment also maps to (see Lemma 3.6). To see this, suppose there is a near-optimal alignment that maps to . Then following an argument similar to the above, we can show there are many indices in the block such that and disagree on them. Thus, in each such block, tries to match many pairs of symbols that are chosen independently at random, and therefore the probability that matches and with a cost at most is small. Compared to Lemma 3.2, an extra complication here is that now we allow to match and with cost at most (and not only ), and in particular can have three different lengths: . Hence the analysis must argue separately for all these cases, requiring a few additional ideas/observations.

After showing that a near-optimal alignment between a random string and is almost unique (or robust), we use that fact to argue about the distance between and any approximate median of the traces. Let be three independent traces of , generated by . We can view as a uniformly random string, and are generated from by a insertion-deletion channel with a higher noise rate . (See Section 2 for the details.) Hence, any near-optimal alignment between and “agree” with the planted alignment induced by (denoted by and respectively). Next, we consider the alignment from to via , that we get by composing the planted alignment from to induced by (actually the inverse of the alignment from to ) with the planted alignment from to induced by . Similarly, consider the alignment from to via . Then we take any -approximate median of . Consider an optimal alignment between , and , and . Use these three alignments to define an alignment from to via , and an alignment from to via . It is not hard to argue that both and are near-optimal alignments between . Thus, both of them agree with the planted alignment . Similarly, both and agree with the planted alignment . Observe, are not independently generated from by . The overlap between and essentially provides an alignment from to . Again, using the robustness property of the planted alignment, this overlap between and agrees with the planted alignment from to by (actually the inverse of the alignment from to ). On the other hand, since agrees with and agrees with , there is also a huge agreement between the overlap of and the overlap of . The overlap between is essentially the optimal alignment from to (that we have considered before). This in turn implies that there is a huge agreement between the optimal alignment from to and the planted alignment from to by . Hence, we can deduce that and are the same in most of the portions, and thus has small edit distance. We provide the detailed analysis in Section 4.

1.3 Preliminaries

Alignments.

For two strings of length , an alignment is a function that is monotonically increasing on the support of , defined as , and also satisfies for all . An alignment is essentially a common subsequence of , but provides the relevant location information. Define the length (or support size) of the alignment as , i.e., the number of positions in (equivalently in ) that are matched by . Define the cost of to be the number of positions in and in that are not matched by , i.e., . Let denotes the minimum cost of an alignment between .

Given a substring of , let and , be the first and last positions in the substring that are matched by alignment . If the above is not well-defined, i.e., for all , then by convention . Let be the mapping of under . If , then by convention is an empty string. Let be the positions in that are not aligned by , and similarly let be the positions in not aligned by . These quantities are related because the number of matched positions in is the same as in , giving us . Define the cost of alignment on substring to be

By abusing the notation, sometimes we will also use in place of . These definitions easily extend to strings of non-equal length, and even of infinite length.

Lemma 1.3.

Given two strings and an alignment , let be disjoint substrings of . Then

  1. are disjoint substrings of ; and

Proof.

The first claim directly follows from the fact that are disjoint and is monotonically increasing.

Since are disjoint, also are disjoint. Similarly, since are disjoint, also are disjoint. Therefore, . Hence we can claim . ∎

For an alignment , we define the inverse alignment as follows: For each , if for some , set ; otherwise, set .

We use the notation for composition of two functions. Composition of two alignments (or inverse of alignments) is defined in a natural way.

Approximate Median.

Given a set and a string , we refer the quantity by the median objective value of with respect to , denoted by .

Given a set , a median of is a string (not necessarily from ) such that is minimized, i.e., . We refer by . Whenever it will be clear from the context, for brevity we will drop from both and . We call a string a -approximate median, for some , of iff .

2 Probabilistic Generative Model

Let us first introduce a probabilistic generative model. For simplicity, our model is defined using infinite-length strings, but our algorithmic analysis will consider only a finite prefix of each string. Fixing a finite alphabet , we denote by the set of all infinite-length strings over . We write to denote the concatenation of two finite-length strings and .

We actually describe two probabilistic models that are equivalent. The first model is just the insertion-deletion channel mentioned in Section 1. These models are given an arbitrary string (base string) to generate a random string (a trace), but in our intended application is usually a random string. The second model consists of two stages, first “planting” an alignment between two strings, and only then placing random symbols (accordingly). This is more convenient in the analysis, because we often want to condition on the planted alignment and rely on the randomness in choosing symbols.

Model .

Given an infinite-length string and a parameter . Consider the following random procedure:

  1. Initialize . (We use to point to the current index positions of the input string.) Also, initialize an empty string .

  2. Do the following independently at random:

    1. With probability , set and increment . (Match .)

    2. With probability , increment . (Delete .)

    3. With probability , choose independently uniformly at random a character and set . (Insert a random character.)

We call this procedure , and denote the randomized output string by .

Model .

This model first provides a randomized mapping (alignment) , and then uses this alignment (and ) to generate the output string. First, given a parameter , consider the following random procedure to get a mapping :

  1. Initialize and . (Indices of current positions in the input and output strings, respectively.)

  2. Do the following independently at random:

    1. With probability , set and increment both and . (Match .)

    2. With probability , set and increment . (Delete .)

    3. With probability , increment . (Insertion.)

Next, use this and a given string to generate a string as follows. For each ,

  • If there is with then set . (Match .)

  • Otherwise, choose independently uniformly at random a character and set . (Insert a random character.)

We denote by the randomized string generated as above. By construction, is an alignment between and .

Basic Properties.

We claim next that and are equivalent, which is useful because we find it more convenient to analyze . We use

to denote that two random variable

, have equal distribution, i.e., . The next two propositions are immediate.

Proposition 2.1.

For every string and , we have .

Proposition 2.2 (Transitivity).

For every and , let

(2)

Then .

Additional Properties (Random Base String).

Let

be the uniform distribution over strings

, i.e., each character is chosen uniformly at random and independently from . We now state two important observations regarding the process . The first one is a direct corollary of Proposition 2.2. The second observation follows because the probability of insertion is the same as that of deletion at any index in the random process .

Corollary 2.3 (Transitivity).

Let and let be as in (2). Draw a random string , and use it to draw and . Then .

Proposition 2.4 (Symmetry).

Let . Draw a random string and use it to draw another string . Then .

3 Robustness of the Insertion-Deletion Channel

In this section we analyze the finite-length version of our probabilistic model (from Section 2), which gets a random string and generates from it a trace . We provide a high-probability estimates for the cost of the planted alignment between and (Lemma 3.1), and for the optimal alignment (i.e., edit distance) between the two strings (Lemmas 3.3 and 3.4). It follows that with high probability the planted alignment is near-optimal. We then further prove that the planted alignment is robust, in the sense that, with high probability, every near-optimal alignment between the two strings must “agree” with the planted alignment on all but a small fraction of the edit operations (Lemmas 3.6 and 3.7).

Assume henceforth that is a random string , and given a parameter , generate from it a string by the random process described in Section 2, denoting by the random mapping used in this process. A small difference here is that now has finite length, but it can also be viewed as an -length prefix of an infinite string. Similarly, now has finite length and is obtained by applying on , and it can be viewed also as a finite prefix of an infinite string.

Let be the set of indices , for which process performs at least one insert/delete operation after (not including) and up to (including) (i.e., inserting at least one character between , or deleting , or both). This information can clearly be described using alone (independently of and of the symbols chosen for insertions to in the second stage of process ); we omit the formal definition. When clear from the context, we shorten to .

Lemma 3.1.

For every , the probability that is

(3)

For all , we have .

Proof.

For every , the probability that performs edit operations after and up to is , where the first summand represents insertions and one deletion, and the second summand represents insertions and no deletion. Thus .

For every , the probability it appears in is . Then . These events are independent, hence by Chernoff’s bound, the probability that is at most . ∎

As shown in (3), . Hence from now on we replace by . Given and , we consider the event , which informally means that process makes a single “well-spaced” edit operation at position , i.e., there is an edit operation at position and no other edit operations within positions away from . To define it formally, we separate it into two cases, an insertion and a deletion. Observe that these events depend on alone. Let be the event that

  1. (thus ); and

  2. for all , we have and (in particular, they are not ).

Similarly, let be the event that

  1. (thus no index is mapped to );

  2. for all , we have ; and

  3. for all , and .

Now define the set of indices for which any of these two events happens

When clear from the context, we shorten to .

Lemma 3.2.

For every , we have .

Proof.

For an index , define the random variable to be an indicator for the union event . Observe that each of the two events, and , occurs with probability , and these events are disjoint. Hence,

Next we prove a deviation bound for the random variable , which has expectation is . Observe that is a Doob Martingale, and let us apply the method of bounded differences. Revealing (after are already known) might affect the value of ’s for , but by definition their sum is bounded , while the other ’s are independent of hence the expectation of by revealing it. Together, we see that , and therefore by Azuma’s inequality, . ∎

Edit Distance (Optimal Alignment) between .

For each , define a window .

Lemma 3.3.

For every , we have .

At a high level, our proof avoids a direct union bound over all low-cost potential alignments, because there are too many of them. Instead, we introduce a smaller set of basic events that “covers” all these potential alignments, which is equivalent to carefully grouping the potential alignments to get a more “efficient” union bound.

Proof.

We assume henceforth that (the alignment from process ) is known and satisfies , which occurs with high probability by Lemma 3.2. In other words, we condition on and proceed with a probabilistic analysis based only on the randomness of and of the characters inserted into .

Our plan is to define basic events for every two subsets of the same size , representing positions in and in , respectively. We will then show that our event of interest is bounded by these events

(4)

and bound the probability of each basic event by

(5)

The proof will then follow easily using a union bound and a simple calculation.

To define the basic event , we need some notation. Write in increasing order, and similarly , and use these to define blocks in and in , namely, and . Notice that all the blocks are of the same length . Now define to be the event that (i) ;222This implies that the blocks in are disjoint. (ii) the blocks in are disjoint; and (iii) each block in is equal to its corresponding block in . Notice that conditions (i) and (ii) actually depend only on , and thus can be viewed as restrictions on the choice of in (4); with this viewpoint in mind, we can simply write

We proceed to prove (4). Suppose there is an alignment from to with , and consider its cost around each position , namely, . These intervals in are disjoint (by definition of ), and thus by Lemma 1.3,

Let include (the indices of) the summands equal to . Each other summand contributes at least , thus and by rearranging . To get the exact size , we can replace with an arbitrary subset of it of the exact size. Now define . It is easy to verify that the event holds. Indeed, each satisfies , which implies , and thus