
Mean-Based Trace Reconstruction over Practically any Replication-Insertion Channel

by Mahdi Cheraghchi, et al.
University of Michigan
Imperial College London

Mean-based reconstruction is a fundamental, natural approach to worst-case trace reconstruction over channels with synchronization errors. It is known that $\exp(O(n^{1/3}))$ traces are necessary and sufficient for mean-based worst-case trace reconstruction over the deletion channel, and this result was also extended to certain channels combining deletions and geometric insertions of uniformly random bits. In this work, we use a simple extension of the original complex-analytic approach to show that these results are examples of a much more general phenomenon: $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction over any memoryless channel that maps each input bit to an arbitrarily distributed sequence of replications and insertions of random bits, provided the length of this sequence follows a sub-exponential distribution.



1 Introduction

When a length-$n$ message $x$ is sent through a noisy channel $\mathsf{Ch}$, the channel modifies the input in some way to produce a distorted copy of $x$, called a trace. The goal of worst-case trace reconstruction over $\mathsf{Ch}$ is to design an algorithm which recovers any input string $x$ with high probability from as few independent and identically distributed (i.i.d.) traces as possible. This problem was first introduced by Levenshtein [1, 2], who studied it over combinatorial channels causing synchronization errors, such as deletions and insertions of symbols, and over certain discrete memoryless channels. Trace reconstruction over the deletion channel, which independently deletes each input symbol with some probability, was first considered by Batu, Kannan, Khanna, and McGregor [3]. Some of their results were quickly generalized to what we call the geometric insertion-deletion channel [4, 5], which prepends a geometric number of independent, uniformly random symbols to each input symbol and then deletes it with a given probability. Both the deletion and geometric insertion-deletion channels are examples of discrete memoryless synchronization channels [6, 7].

Holenstein, Mitzenmacher, Panigrahy, and Wieder [8] were the first to obtain non-trivial worst-case trace reconstruction algorithms for the deletion channel with constant deletion probability. They showed that $\exp(\tilde{O}(\sqrt{n}))$ traces suffice for mean-based reconstruction of any input string with high probability. By mean-based reconstruction, we mean that the reconstruction algorithm only requires knowledge of the expected value of each trace coordinate. In general, this procedure works as follows: Let $Y_x$ denote the trace distribution on input $x$ and $Y_x^{\infty}$ denote the infinite string obtained by padding $Y_x$ with zeros on the right. The mean trace $\mu_x = (\mu_{x,1}, \mu_{x,2}, \dots)$ is given by

$$\mu_{x,j} = \mathbb{E}\left[Y^{\infty}_{x,j}\right], \quad j = 1, 2, \dots$$

As the first step, the algorithm estimates $\mu_x$ from $t$ traces $T^{(1)}, \dots, T^{(t)}$ sampled i.i.d. according to $Y_x$ (and padded with zeros on the right) via the empirical means

$$\hat{\mu}_j = \frac{1}{t} \sum_{k=1}^{t} T^{(k)}_j, \quad j = 1, 2, \dots \tag{1}$$

Subsequently, it outputs the string $\hat{x}$ that minimizes $\|\hat{\mu} - \mu_{\hat{x}}\|_1$. If $t$ is large enough, we have $\hat{x} = x$ with high probability over the randomness of the traces. Because of the structure of such algorithms, pinpointing the number of traces required for mean-based reconstruction over any channel reduces to bounding $\|\mu_x - \mu_{x'}\|_1$ for any pair of distinct strings $x$ and $x'$. Overall, mean-based reconstruction is a natural paradigm, and it is not only useful over channels with synchronization errors. For example, $O(\log n)$ traces suffice for mean-based reconstruction over the binary symmetric channel, which is optimal.
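The procedure described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the channel is taken to be the plain deletion channel with deletion probability `d`, bits are encoded as 0/1, the search over candidate strings is brute force (so it is only feasible for very small lengths), and all function names are hypothetical rather than taken from the paper.

```python
import itertools
import math
import random


def deletion_mean_trace(x, d):
    """Exact mean trace of x over the deletion channel with deletion probability d.

    Output position j (0-indexed) holds input bit k with probability
    C(k, j) * (1 - d)**(j + 1) * d**(k - j); positions past the end count as 0.
    """
    n = len(x)
    return [
        sum(x[k] * math.comb(k, j) * (1 - d) ** (j + 1) * d ** (k - j)
            for k in range(j, n))
        for j in range(n)
    ]


def sample_trace(x, d, rng):
    """One trace: delete each bit independently with probability d, pad with zeros."""
    kept = [b for b in x if rng.random() >= d]
    return kept + [0] * (len(x) - len(kept))


def empirical_mean_trace(traces):
    """Coordinate-wise average of the zero-padded traces."""
    t = len(traces)
    return [sum(column) / t for column in zip(*traces)]


def mean_based_reconstruct(mu_hat, n, d):
    """Return the candidate whose exact mean trace is closest to mu_hat in l1 distance."""
    def l1_gap(candidate):
        mu = deletion_mean_trace(candidate, d)
        return sum(abs(a - b) for a, b in zip(mu_hat, mu))
    return min(itertools.product((0, 1), repeat=n), key=l1_gap)


# Toy run: estimate the mean trace from sampled traces, then pick the best candidate.
rng = random.Random(0)
x = (1, 0, 1, 1, 0)
traces = [sample_trace(x, 0.3, rng) for _ in range(20000)]
guess = mean_based_reconstruct(empirical_mean_trace(traces), len(x), 0.3)
```

The reconstruction step never looks at individual traces, only at their coordinate-wise average, which is what makes the approach mean-based. Whether `guess` equals `x` depends on how many traces are drawn relative to the smallest gap between mean traces of distinct strings.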

More recently, an elegant complex-analytic approach was employed concurrently by De, O’Donnell, and Servedio [9] and by Nazarov and Peres [10] to show that $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction not only over the deletion channel with constant deletion probability, but also over the more general geometric insertion-deletion channel we described previously. (Nazarov and Peres [10] consider a slightly modified geometric insertion-deletion channel: First, a geometric number of independent, uniformly random symbols is added independently before each input symbol. Then, the resulting string is sent through a deletion channel. The analysis is similar to that of the geometric-insertion channel.) Remarkably, $\exp(\Omega(n^{1/3}))$ traces were shown to also be necessary for mean-based reconstruction over the deletion channel.

Given the fundamental nature of mean-based reconstruction and this state of affairs, the following question arises naturally: Are these results examples of a much more general phenomenon? In particular, is it true that $\exp(O(n^{1/3}))$ traces suffice for mean-based trace reconstruction over any discrete memoryless synchronization channel? We make significant progress in this direction by showing that a simple extension of the analysis from [9, 10] yields that result for a much broader class of such channels which map each input symbol to an arbitrarily distributed sequence of noisy symbol replications and insertions of random symbols, under a mild assumption.

Research in this direction has other practical and theoretical implications. First, studying trace reconstruction over channels introducing more complex synchronization errors than simple i.i.d. deletions is fundamental for the design of reliable DNA-based data storage systems with nanopore-based sequencing [11, 12, 13]. Second, understanding the structure of the mean trace of a string is by itself a natural information-theoretic problem which may lead to improved capacity bounds and coding techniques for channels with synchronization errors, both notoriously difficult problems (see the extensive surveys [14, 15, 7, 16]).

1.1 Related Work

Besides the works mentioned above, there has been significant recent interest in various notions of trace reconstruction.

The mean-based approach of [9, 10] has proven useful to some problems incomparable to our general setting: the deletion channel with position- and symbol-dependent deletion probabilities satisfying strong monotonicity and periodicity assumptions [17]; a combination of the geometric insertion-deletion channel and random shifts of the output string as an intermediate step in the design of average-case trace reconstruction algorithms (which are only required to have low average reconstruction error probability) [18, 19]; trace reconstruction of trees with i.i.d. deletions of vertices [20]; trace reconstruction of matrices with i.i.d. deletions of rows and columns [21]; trace reconstruction of circular strings over the deletion channel [22]. In another direction, Grigorescu, Sudan, and Zhu [23] studied the performance of mean-based reconstruction for distinguishing between strings at low Hamming or edit distance from each other over the deletion channel.

Different complex-analytic methods have been used to, for example, obtain the current best upper bound of $\exp(\tilde{O}(n^{1/5}))$ traces on the trace complexity of the deletion channel [24], as well as upper bounds for trace reconstruction of “smoothed” worst-case strings over the deletion channel [25]. We note, however, that mean-based reconstruction remains the state-of-the-art approach for the geometric insertion-deletion channel.

Other related problems considered include the already-mentioned average-case trace reconstruction problem over the deletion and geometric insertion-deletion channels [3, 8, 26, 18, 19], trace reconstruction over the deletion and geometric insertion-deletion channels with vanishing deletion probabilities [3, 4, 5, 27], trace complexity lower bounds for the deletion channel [3, 26, 28, 29], trace reconstruction of coded strings over the deletion channel [30, 31], approximate trace reconstruction [32], alternative trace reconstruction models motivated by immunology [33], and population recovery over the deletion and geometric insertion-deletion channels [34, 35, 36].

1.2 Notation

For convenience, we denote discrete random variables and their corresponding distributions by uppercase letters, such as

, , and . The expected value of is denoted by . Sets are denoted by calligraphic uppercase letters such as and , and we write . The open disk of radius centered at is . The

-norm of vector

is denoted by . The concatenation of strings and is denoted by .

1.3 Channel Model

We consider a general replication-insertion channel model that, in particular, captures the models studied in [4, 5, 9, 10]. A replication-insertion channel is characterized by a constant

and a joint probability distribution

with and . To avoid trivial settings where trace reconstruction is impossible, we require that and . Given an input string , the channel behaves independently on each input as follows:

  1. Sample a pair with and according to the distribution of ;

  2. Construct an output string . If , then let with probability and with probability . If , let be a uniformly random bit.

The overall output of on input is

For example, the geometric insertion-deletion channel from [4, 5, 9] can be easily instantiated in this general framework by sampling as follows: First, sample

following a geometric distribution with support in

and success probability and

following a Bernoulli distribution with success probability

. Then, set and let if and otherwise. Likewise, the alternative geometric insertion-deletion channel from [10, 18, 19] can also be easily instantiated in our framework.
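The channel model and its geometric insertion-deletion instantiation can be simulated as in the sketch below. Several details are assumptions made for illustration, since the symbols are not fixed here: bits are encoded as 0/1, a replicated bit is flipped with probability `p_flip` and copied otherwise, positions inside each per-bit output block are numbered from 1, the geometric variable counts failures before the first success with success probability `p_geo` in (0, 1], and `p_keep` is the probability that the input bit survives. All names are hypothetical.

```python
import random


def sample_geometric_insdel(rng, p_geo, p_keep):
    """Sample (M, R) for the geometric insertion-deletion instantiation.

    G counts failures before the first success (support 0, 1, 2, ...),
    B is a Bernoulli(p_keep) indicator that the input bit survives.
    M = G + B, and R = {M} if B = 1, else the empty set.
    """
    g = 0
    while rng.random() >= p_geo:
        g += 1
    b = 1 if rng.random() < p_keep else 0
    m = g + b
    return m, ({m} if b == 1 else set())


def replication_insertion_channel(x, sample_pair, p_flip, rng):
    """Send the bit string x through the channel and return one trace."""
    out = []
    for bit in x:
        # Step 1: sample the block length m and the replication positions r.
        m, r = sample_pair(rng)
        # Step 2: fill the block position by position.
        for j in range(1, m + 1):
            if j in r:
                # Replication of the input bit, flipped with probability p_flip.
                out.append(bit ^ 1 if rng.random() < p_flip else bit)
            else:
                # Insertion of a uniformly random bit.
                out.append(rng.randrange(2))
    return out


rng = random.Random(0)
x = [1, 0, 1, 1, 0]
trace = replication_insertion_channel(
    x, lambda r: sample_geometric_insdel(r, 0.7, 0.8), 0.0, rng
)
```

Passing the sampler for the pair as a function keeps the channel routine generic: any other joint distribution over block lengths and replication positions can be plugged in without touching the per-bit logic.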

1.4 Our Contributions

Our main theorem shows that previous results on mean-based trace reconstruction over the deletion and geometric insertion-deletion channels are examples of a much more general phenomenon.

Theorem 1.

Worst-case mean-based trace reconstruction with success probability at least over any channel with sub-exponential random variable $M$ is achievable with $\exp(O(n^{1/3}))$ traces. (A random variable $X$ is sub-exponential if there exists a constant $c > 0$ such that $\Pr[|X| \geq t] \leq 2e^{-ct}$ for all $t \geq 0$.)

Note that many common distributions satisfy the requirement that $M$ is sub-exponential, including geometric, Poisson, and all finitely-supported distributions.

2 Proof of Theorem 1

Fix a replication-insertion channel , where is a sub-exponential random variable. To every string , we can associate a polynomial over defined as

Then, using the definition of mean trace above, we define the mean trace power series as

Let and denote the mean trace truncated at by

To prove Theorem 1, we will show that there exists a constant such that for a large enough , appropriate , and any distinct input strings , their truncated mean traces satisfy


This implies that traces suffice for mean-based worst-case trace reconstruction as follows: Let be the true input and suppose that we have access to traces. Then a direct application of the Chernoff bound and a union bound over all coordinates shows that the empirical mean trace defined in (1) satisfies


with probability at least over the randomness of the traces. On the other hand, if (3) holds, we also have that

for all as a result of (2). This allows us to recover naively from by computing for every and outputting the that minimizes .
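The concentration step can be made quantitative with a short calculation. The version below uses Hoeffding's inequality in place of the Chernoff bound, and its notation is assumed for illustration: $t$ is the number of traces, $\ell$ the number of coordinates being estimated, $\hat{\mu}_j$ and $\mu_{x,j}$ the empirical and true mean of coordinate $j$, $\varepsilon$ the target $\ell_1$ accuracy, and every trace coordinate is assumed to lie in $[-1, 1]$.

```latex
\Pr\Bigl[\,\bigl|\hat{\mu}_j - \mu_{x,j}\bigr| \ge \tfrac{\varepsilon}{\ell}\,\Bigr]
  \;\le\; 2\exp\!\left(-\frac{t\,\varepsilon^{2}}{2\ell^{2}}\right)
  \quad \text{for each coordinate } j,
\qquad\text{hence}\qquad
\Pr\Bigl[\,\sum_{j=1}^{\ell}\bigl|\hat{\mu}_j - \mu_{x,j}\bigr| \ge \varepsilon\,\Bigr]
  \;\le\; 2\ell\,\exp\!\left(-\frac{t\,\varepsilon^{2}}{2\ell^{2}}\right).
```

Under these assumptions, $t \ge (2\ell^{2}/\varepsilon^{2})\ln(2\ell/\delta)$ traces keep the failure probability below $\delta$. If $\varepsilon = \exp(-O(n^{1/3}))$ and $\ell$ grows at most polynomially in $n$, this count is itself $\exp(O(n^{1/3}))$ for any $\delta$ that is not smaller than doubly exponentially small in $n^{1/3}$.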

We prove (2) by relating to for an appropriate choice of . By the triangle inequality, we have

for every such that . Rearranging, it follows that is lower-bounded by


for any such . The lower bound in (2), and thus Theorem 1, follows by combining (4) with the next two lemmas, each bounding a different term in the right-hand side of (4).

Lemma 2.

There exist constants such that for large enough and any distinct strings , it holds that for some satisfying .

Lemma 3.

If for some constant , there exist constants such that if then

for all distinct when is large enough.

Invoking Lemmas 2 and 3, we have that for large enough and any distinct , there exists an appropriate choice possibly depending on and which satisfies and by setting and in (4) yields

for some constant , implying (2).

We prove Lemmas 2 and 3 in Sections 3 and 4, respectively, which completes the argument.

3 Proof of Lemma 2

Our proof of Lemma 2 follows the blueprint of [9, Sections 4 and 5] and [10, Sections 2 and 3]. The key differences lie in Lemmas 4 and 6 below. Lemma 4 generalizes [9, Section 4 and Appendix A.3] and [10, Lemmas 2.1 and 5.2] to arbitrary replication-insertion channels well beyond the deletion and geometric insertion-deletion channels. Lemma 6 requires analyzing the local behavior of the inverse of an arbitrary probability generating function (PGF) in the complex plane around $z = 1$. Remarkably, the desired behavior follows by combining the standard inverse function theorem for analytic functions with basic properties of PGFs. In contrast, the PGFs associated to the deletion and geometric insertion-deletion channels treated in [9, 10, 18, 19] are all Möbius transformations, meaning that their inverses have simple explicit expressions and could be easily analyzed directly.
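To make the remark about Möbius transformations concrete, here is a worked instance for the geometric insertion-deletion channel. The parametrization is an assumption chosen for illustration: $G$ is geometric on $\{0, 1, 2, \dots\}$ with success probability $p$, $B$ is an independent Bernoulli variable with $\Pr[B = 1] = 1 - d$ (so $d$ plays the role of the deletion probability), the output length per input bit is $M = G + B$, and $g_M$ denotes the PGF of $M$.

```latex
g_M(z) \;=\; \mathbb{E}\bigl[z^{G}\bigr]\,\mathbb{E}\bigl[z^{B}\bigr]
       \;=\; \frac{p}{1-(1-p)z}\,\bigl(d + (1-d)z\bigr)
       \;=\; \frac{p(1-d)\,z + pd}{-(1-p)\,z + 1},
\qquad
g_M^{-1}(w) \;=\; \frac{w - pd}{(1-p)\,w + p(1-d)} .
```

Both maps fix the point $1$, and the expression for $g_M$ is valid for $|z| < 1/(1-p)$. Setting $p = 1$ removes the insertions and recovers the deletion channel, where $g_M(z) = d + (1-d)z$ and $g_M^{-1}(w) = (w - d)/(1-d)$.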

As a first step, we show that the mean trace power series is related to the input polynomial through a change of variable. This allows us to bound in terms of for some related to .

Lemma 4.

Let be a replication-insertion channel. Suppose is finite. Let be the distribution given by

and and be the probability generating functions of and , respectively. Then for every and such that is in the disk of convergence of both and , we have .
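For intuition, the change of variable can be checked numerically in the simplest special case. The sketch below assumes the plain deletion channel with deletion probability `d`, for which the output length per input bit has PGF $d + (1-d)z$ and the well-known identity reads: mean trace power series at $z$ equals $(1-d)$ times the input polynomial at $d + (1-d)z$. Coefficients are indexed from 0, bits are encoded as 0/1, and all function names are hypothetical; this is a sanity check of the special case, not the general statement of the lemma.

```python
import cmath
import math


def input_polynomial(x, w):
    """Evaluate sum_k x[k] * w**k, the polynomial attached to the input string."""
    return sum(b * w ** k for k, b in enumerate(x))


def mean_trace_series(x, d, z):
    """Evaluate sum_j mu_j * z**j, where mu_j is the exact mean of output position j."""
    n = len(x)
    total = 0
    for j in range(n):
        # Input bit k lands at output position j with probability
        # C(k, j) * (1 - d)**(j + 1) * d**(k - j).
        mu_j = sum(x[k] * math.comb(k, j) * (1 - d) ** (j + 1) * d ** (k - j)
                   for k in range(j, n))
        total += mu_j * z ** j
    return total


def pgf_output_length(d, z):
    """PGF of the per-bit output length: 0 with probability d, 1 with probability 1 - d."""
    return d + (1 - d) * z


x = (1, 0, 0, 1, 1, 0)
d = 0.4
z = cmath.exp(0.3j)  # a point on the unit circle close to 1
lhs = mean_trace_series(x, d, z)
rhs = (1 - d) * input_polynomial(x, pgf_output_length(d, z))
```

The two values `lhs` and `rhs` agree up to floating-point error because summing the binomial weights over output positions collapses to a power of the PGF.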

Let . Then, Lemma 4 yields


Analogously to [9, 10], we use the following lemma, due to Borwein and Erdélyi [37], to lower bound

Lemma 5 ([37]).

There is a universal constant for which the following holds: Let
be non-zero and define . Let denote the arc . Then, we have for every .

This lemma implies that there is a constant such that for every there exists with satisfying


We can use (6) to lower bound (5), provided there exists such that with good properties. The following lemma ensures this.

Lemma 6.

For large enough there is a constant such that for any there exists satisfying , , and .

As a result of Lemma 6, we can choose a that satisfies , , and for large enough . Using this together with (6), we obtain


Set . Combining (5), (7), and the fact that for large enough , we obtain

for some constant when is large enough. This concludes the proof of Lemma 2 assuming Lemmas 4 and 6.

3.1 Proofs of Lemmas 4 and 6

In this section, we prove the remaining lemmas.

Proof of Lemma 4.

Fix an input string . For each , let denote the indices of that correspond to replications of and let denote the indices of that correspond to insertions of random bits resulting from the channel’s action on . Then we may write

where the

are uniformly distributed over

, the are random variables over that are with probability , and all these are independent of each other, , and . Note that if an output bit is in , then it has expected value . Therefore, we have that

We can use this to show that


We proceed to simplify . Let , where the denote the lengths of the channel outputs associated to each input bit and are i.i.d. according to . We have


We can interchange the sums above because is in the disk of convergence of and . From the definition of ,


Combining (9) with (3.1) yields

Recalling (8) concludes the proof. ∎

We prove Lemma 6 using the standard inverse function theorem stated below.

Lemma 7 ([38, Section VIII.4], adapted).

Let be a non-constant function analytic on a connected open set such that for a given . Then, there exist radii such that for every there exists a unique satisfying . Moreover, the inverse function defined as is analytic on .

Proof of Lemma 6.

Because M is sub-exponential and non-trivial, is a non-constant analytic function on some open ball of radius that satisfies . Hence, Lemma 7 applies with , so there exist and an analytic function such that . In particular, there exists such that for every we can write


This is because , since , and furthermore

for some constant and all such .

Assume that is large enough so that for all . Then, we set . Note that by the definition of , as required.

Combining (11) with and the triangle inequality, we have

as . Since is a continuous function on a neighborhood of , and , it follows that if is large enough. On the other hand, combining (11) with the fact that

by the chain rule, we obtain

The second inequality holds because . The last inequality follows by noting that and for . ∎

4 Proof of Lemma 3

To conclude the argument, we prove Lemma 3 using an argument analogous to [9, Appendix A.2] and the fact that $M$ is sub-exponential.

Let be i.i.d. according to , and set . Then, we have

for every . Since is sub-exponential, a direct application of Bernstein’s inequality [39, Theorem 2.8.1] guarantees the existence of constants such that for and any we have

Combining these observations with the assumption that yields

for some constant and large enough.

5 Future Work

We have shown that $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction over a broad class of replication-insertion channels. However, our channel model does not cover all discrete memoryless synchronization channels as defined by Dobrushin [6, 7]. It would be interesting to extend the result in some form to all such non-trivial channels. On the other hand, to complement the above, it would be interesting to prove trace complexity lower bounds for mean-based reconstruction over all these channels.

Furthermore, it is unclear whether the assumption that $M$ is sub-exponential is necessary for our result. A clear extension of this work would be to either remove this condition or prove that it is necessary for mean-based trace reconstruction from $\exp(O(n^{1/3}))$ traces.