1 Introduction
When a length-$n$ message $x$ is sent through a noisy channel, the channel modifies the input in some way to produce a distorted copy of $x$, called a trace. The goal of worst-case trace reconstruction over a given channel is to design an algorithm which recovers any input string with high probability from as few independent and identically distributed (i.i.d.) traces as possible. This problem was first introduced by Levenshtein [1, 2], who studied it over combinatorial channels causing synchronization errors, such as deletions and insertions of symbols, and over certain discrete memoryless channels. Trace reconstruction over the deletion channel, which independently deletes each input symbol with some probability, was first considered by Batu, Kannan, Khanna, and McGregor [3]. Some of their results were quickly generalized to what we call the geometric insertion-deletion channel [4, 5], which prepends a geometric number of independent, uniformly random symbols to each input symbol and then deletes it with a given probability. Both the deletion and geometric insertion-deletion channels are examples of discrete memoryless synchronization channels [6, 7].
Holenstein, Mitzenmacher, Panigrahy, and Wieder [8] were the first to obtain non-trivial worst-case trace reconstruction algorithms for the deletion channel with constant deletion probability. They showed that $\exp(\widetilde{O}(\sqrt{n}))$ traces suffice for mean-based reconstruction of any input string with high probability. By mean-based reconstruction, we mean that the reconstruction algorithm only requires knowledge of the expected value of each trace coordinate. In general, this procedure works as follows: Pad each trace with zeros on the right to obtain an infinite string, and define the mean trace of an input $x$ as the coordinate-wise expected value of the padded trace on input $x$. As the first step, the algorithm estimates the mean trace of $x$ from traces sampled i.i.d. from the trace distribution on input $x$ via the coordinate-wise empirical means; we refer to this estimate as (1). Subsequently, it outputs the string whose mean trace is closest to this estimate. If the number of traces is large enough, the output equals $x$ with high probability over the randomness of the traces. Because of this structure, pinpointing the number of traces required for mean-based reconstruction over any channel reduces to bounding the distance between the mean traces of any pair of distinct strings. Overall, mean-based reconstruction is a natural paradigm, and it is not only useful over channels with synchronization errors. For example, $O(\log n)$ traces suffice for mean-based reconstruction over the binary symmetric channel, which is optimal.
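The two-step procedure described above can be illustrated on a toy instance. The following is a minimal sketch, not the algorithm analyzed in this paper: it assumes the plain deletion channel, input symbols in {-1, +1}, the l1 distance between mean traces, and a brute-force search over all candidate strings (feasible only for very small lengths). All function names are hypothetical.

```python
import itertools
import random
from math import comb


def deletion_trace(x, q, rng):
    """One trace: each symbol of x is deleted independently with probability q."""
    return [s for s in x if rng.random() >= q]


def empirical_mean_trace(traces, length):
    """Coordinate-wise average of the traces, each zero-padded to `length`."""
    mean = [0.0] * length
    for t in traces:
        for j, s in enumerate(t[:length]):
            mean[j] += s
    return [m / len(traces) for m in mean]


def exact_mean_trace(x, q):
    """Exact mean trace over the deletion channel (0-based coordinates).

    Symbol x[i] lands at output position j when it survives and exactly j of
    the i symbols before it survive.
    """
    n = len(x)
    return [
        sum(x[i] * comb(i, j) * (1 - q) ** (j + 1) * q ** (i - j)
            for i in range(j, n))
        for j in range(n)
    ]


def mean_based_reconstruct(traces, n, q):
    """Return the candidate whose exact mean trace is l1-closest to the estimate."""
    est = empirical_mean_trace(traces, n)

    def dist(cand):
        return sum(abs(a - b) for a, b in zip(est, exact_mean_trace(cand, q)))

    return min(itertools.product((-1, 1), repeat=n), key=dist)


rng = random.Random(0)
x = (1, -1, -1, 1, 1)
traces = [deletion_trace(x, 0.2, rng) for _ in range(5000)]
print(mean_based_reconstruct(traces, len(x), 0.2))
```

The reconstruction step uses the traces only through their coordinate-wise averages, which is what makes the procedure mean-based.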
More recently, an elegant complex-analytic approach was employed concurrently by De, O’Donnell, and Servedio [9] and by Nazarov and Peres [10] to show that $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction not only over the deletion channel with constant deletion probability, but also over the more general geometric insertion-deletion channel we described previously (see Footnote 1). Remarkably, $\exp(\Omega(n^{1/3}))$ traces were shown to also be necessary for mean-based reconstruction over the deletion channel.
Footnote 1: Nazarov and Peres [10] consider a slightly modified geometric insertion-deletion channel: First, a geometric number of independent, uniformly random symbols is added independently before each input symbol. Then, the resulting string is sent through a deletion channel. The analysis is similar to that of the geometric insertion-deletion channel.
Given the fundamental nature of mean-based reconstruction and this state of affairs, the following question arises naturally: Are these results examples of a much more general phenomenon? In particular, is it true that $\exp(O(n^{1/3}))$ traces suffice for mean-based trace reconstruction over any discrete memoryless synchronization channel? We make significant progress in this direction by showing that a simple extension of the analysis from [9, 10] yields that result for a much broader class of such channels which map each input symbol to an arbitrarily distributed sequence of noisy symbol replications and insertions of random symbols, under a mild assumption.
Research in this direction has other practical and theoretical implications. First, studying trace reconstruction over channels introducing more complex synchronization errors than simple i.i.d. deletions is fundamental for the design of reliable DNA-based data storage systems with nanopore-based sequencing [11, 12, 13]. Second, understanding the structure of the mean trace of a string is by itself a natural information-theoretic problem which may lead to improved capacity bounds and coding techniques for channels with synchronization errors, both notoriously difficult problems (see the extensive surveys [14, 15, 7, 16]).
1.1 Related Work
Besides the works mentioned above, there has been significant recent interest in various notions of trace reconstruction.
The mean-based approach of [9, 10] has proven useful for several problems incomparable to our general setting: the deletion channel with position- and symbol-dependent deletion probabilities satisfying strong monotonicity and periodicity assumptions [17]; a combination of the geometric insertion-deletion channel and random shifts of the output string as an intermediate step in the design of average-case trace reconstruction algorithms (which are only required to have low average reconstruction error probability) [18, 19]; trace reconstruction of trees with i.i.d. deletions of vertices [20]; trace reconstruction of matrices with i.i.d. deletions of rows and columns [21]; and trace reconstruction of circular strings over the deletion channel [22]. In another direction, Grigorescu, Sudan, and Zhu [23] studied the performance of mean-based reconstruction for distinguishing between strings at low Hamming or edit distance from each other over the deletion channel.
Different complex-analytic methods have been used to, for example, obtain the current best upper bound of $\exp(\widetilde{O}(n^{1/5}))$ traces on the trace complexity of the deletion channel [24], as well as upper bounds for trace reconstruction of “smoothed” worst-case strings over the deletion channel [25]. We note, however, that mean-based reconstruction remains the state-of-the-art approach for the geometric insertion-deletion channel.
Other related problems considered include the already-mentioned average-case trace reconstruction problem over the deletion and geometric insertion-deletion channels [3, 8, 26, 18, 19], trace reconstruction over the deletion and geometric insertion-deletion channels with vanishing deletion probabilities [3, 4, 5, 27], trace complexity lower bounds for the deletion channel [3, 26, 28, 29], trace reconstruction of coded strings over the deletion channel [30, 31], approximate trace reconstruction [32], alternative trace reconstruction models motivated by immunology [33], and population recovery over the deletion and geometric insertion-deletion channels [34, 35, 36].
1.2 Notation
For convenience, we denote discrete random variables and their corresponding distributions by uppercase letters, such as , , and . The expected value of is denoted by . Sets are denoted by calligraphic uppercase letters such as and , and we write . The open disk of radius centered at is . The -norm of vector is denoted by . The concatenation of strings and is denoted by .
1.3 Channel Model
We consider a general replication-insertion channel model that, in particular, captures the models studied in [4, 5, 9, 10]. A replication-insertion channel is characterized by a constant and a joint probability distribution with and . To avoid trivial settings where trace reconstruction is impossible, we require that and . Given an input string , the channel behaves independently on each input symbol as follows:
- Sample a pair with and according to the distribution of ;
- Construct an output string . If , then let with probability and with probability . If , let be a uniformly random bit.
The overall output of the channel on the input string is the concatenation of the output strings constructed for the individual input symbols, in order.
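The two sampling steps above can be simulated directly. The sketch below rests on several labeled assumptions, since it is illustrative rather than the paper's formal definition: symbols live in {-1, +1}, the sampled pair is represented as an output length `m` together with a set `r` of positions in 1..m that carry noisy replications, and `p_flip` is a hypothetical name for the replication noise probability. The two samplers follow the verbal descriptions of the deletion and geometric insertion-deletion channels given in the introduction; their parameter names are also hypothetical.

```python
import random


def replication_insertion_channel(x, sample_m_r, p_flip, rng):
    """Send x (symbols in {-1, +1}) through a replication-insertion channel.

    sample_m_r(rng) returns (m, r): the output length m for one input symbol
    and the set r of positions in 1..m that hold noisy replications of it.
    """
    out = []
    for symbol in x:
        m, r = sample_m_r(rng)
        for j in range(1, m + 1):
            if j in r:
                # Noisy replication: flip the symbol with probability p_flip.
                out.append(-symbol if rng.random() < p_flip else symbol)
            else:
                # Insertion: a uniformly random symbol.
                out.append(rng.choice((-1, 1)))
    return out


def deletion_sampler(q):
    """(m, r) for the deletion channel: the symbol is dropped with probability q."""
    def sample(rng):
        return (0, set()) if rng.random() < q else (1, {1})
    return sample


def geometric_insertion_deletion_sampler(p_ins, q):
    """(m, r) for a geometric insertion-deletion channel (parameterization assumed).

    A geometric number of random symbols (one more with probability p_ins each
    time) precedes the input symbol, which is then deleted with probability q.
    """
    def sample(rng):
        g = 0
        while rng.random() < p_ins:
            g += 1
        if rng.random() < q:
            return g, set()      # symbol deleted: only insertions remain
        return g + 1, {g + 1}    # symbol kept in the last position
    return sample


rng = random.Random(1)
x = [1, -1, 1, 1, -1, -1, 1, -1]
print(replication_insertion_channel(x, deletion_sampler(0.3), 0.0, rng))
print(replication_insertion_channel(
    x, geometric_insertion_deletion_sampler(0.2, 0.3), 0.0, rng))
```

Passing the sampler as a function keeps the channel loop independent of the particular joint distribution, which mirrors the generality of the model.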
For example, the geometric insertion-deletion channel from [4, 5, 9] can be easily instantiated in this general framework by sampling as follows: First, sample following a geometric distribution with support in and success probability and following a Bernoulli distribution with success probability . Then, set and let if and otherwise. Likewise, the alternative geometric insertion-deletion channel from [10, 18, 19] can also be easily instantiated in our framework.
1.4 Our Contributions
Our main theorem shows that previous results on mean-based trace reconstruction over the deletion and geometric insertion-deletion channels are examples of a much more general phenomenon.
Theorem 1.
Worst-case mean-based trace reconstruction with success probability at least over any channel with sub-exponential random variable $M$ (see Footnote 2) is achievable with $\exp(O(n^{1/3}))$ traces.
Note that many common distributions satisfy the requirement that $M$ is sub-exponential, including geometric, Poisson, and all finitely-supported distributions.
Footnote 2: A random variable is sub-exponential if there exists a constant such that for all .
2 Proof of Theorem 1
Fix a replication-insertion channel , where is a sub-exponential random variable. To every string , we can associate a polynomial over defined as
Then, using the definition of mean trace above, we define the mean trace power series as
Let and denote the mean trace truncated at by
To prove Theorem 1, we will show that there exists a constant such that for a large enough , appropriate , and any distinct input strings , their truncated mean traces satisfy
(2)
This implies that traces suffice for mean-based worst-case trace reconstruction as follows: Let be the true input and suppose that we have access to traces. Then a direct application of the Chernoff bound and a union bound over all coordinates shows that the empirical mean trace defined in (1) satisfies
(3)
with probability at least over the randomness of the traces. On the other hand, if (3) holds, we also have that
for all as a result of (2). This allows us to recover naively from by computing for every and outputting the that minimizes .
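To make the concentration step concrete, the following sketch computes how many traces guarantee that every coordinate of the empirical mean trace is accurate to within a target `eps`. It uses Hoeffding's inequality in place of the Chernoff bound mentioned above, and assumes trace symbols (after zero padding) lie in [-1, 1]. The function name, the constant 1 in the exponent of the example gap, and the choice of `n` coordinates are illustrative assumptions.

```python
import math


def traces_for_accuracy(eps, num_coords, delta):
    """Traces needed so that all num_coords empirical means are within eps
    of their expectations with probability at least 1 - delta.

    Hoeffding's inequality for samples in [-1, 1] gives
    Pr[|mean - E| >= eps] <= 2 * exp(-t * eps**2 / 2) per coordinate;
    a union bound over the coordinates then yields the expression below.
    """
    return math.ceil(2.0 * math.log(2.0 * num_coords / delta) / eps ** 2)


# If distinct mean traces are separated by roughly exp(-n**(1/3)), the number
# of traces grows like exp(2 * n**(1/3)) up to logarithmic factors.
for n in (8, 27, 64, 125):
    gap = math.exp(-n ** (1.0 / 3.0))
    print(n, traces_for_accuracy(gap / 2, n, 0.01))
```

The inverse-square dependence on the accuracy is what turns an exponentially small separation between mean traces into an exponential number of traces.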
We prove (2) by relating to for an appropriate choice of . By the triangle inequality, we have
for every such that . Rearranging, it follows that is lower-bounded by
(4)
for any such . The lower bound in (2), and thus Theorem 1, follows by combining (4) with the next two lemmas, each bounding a different term in the right-hand side of (4).
Lemma 2.
There exist constants such that for large enough and any distinct strings , it holds that for some satisfying .
Lemma 3.
If for some constant , there exist constants such that if then
for all distinct when is large enough.
3 Proof of Lemma 2
Our proof of Lemma 2 follows the blueprint of [9, Sections 4 and 5] and [10, Sections 2 and 3]. The key differences lie in Lemmas 4 and 6 below. Lemma 4 generalizes [9, Section 4 and Appendix A.3] and [10, Lemmas 2.1 and 5.2] to arbitrary replication-insertion channels well beyond the deletion and geometric insertion-deletion channels. Lemma 6 requires analyzing the local behavior of the inverse of an arbitrary probability generating function (PGF) in the complex plane around . Remarkably, the desired behavior follows by combining the standard inverse function theorem for analytic functions with basic properties of PGFs. In contrast, the PGFs associated to the deletion and geometric insertion-deletion channels treated in [9, 10, 18, 19] are all Möbius transformations, meaning that their inverses have simple explicit expressions and could be easily analyzed directly.
As a first step, we show that the mean trace power series is related to the input polynomial through a change of variable. This allows us to bound in terms of for some related to .
Lemma 4.
Let be a replication-insertion channel. Suppose is finite. Let be the distribution given by
and and be the probability generating functions of and , respectively. Then for every and such that is in the disk of convergence of both and , we have .
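A change of variable of this kind can be checked numerically in the simplest special case. The sketch below assumes the plain deletion channel with deletion probability `q`, 0-based indexing of both the input polynomial and the mean trace, and symbols in {-1, +1}. In that setting the output length per input symbol is Bernoulli with probability generating function q + (1 - q) z, and the mean trace power series equals (1 - q) times the input polynomial evaluated at that point, as known from [9, 10]. The helper names are hypothetical, and the exact normalization in Lemma 4 may differ from this special case.

```python
from math import comb


def exact_mean_trace(x, q):
    """Mean trace over the deletion channel: x[i] lands at position j when it
    survives and exactly j of the i symbols before it survive."""
    n = len(x)
    return [
        sum(x[i] * comb(i, j) * (1 - q) ** (j + 1) * q ** (i - j)
            for i in range(j, n))
        for j in range(n)
    ]


def mean_trace_series(x, q, z):
    """Power series sum_j mu_j z^j of the mean trace (a finite sum here)."""
    return sum(mu * z ** j for j, mu in enumerate(exact_mean_trace(x, q)))


def input_polynomial(x, w):
    """Polynomial sum_i x_i w^i associated with the input string."""
    return sum(s * w ** i for i, s in enumerate(x))


x = [1, -1, -1, 1, 1, -1]
q = 0.3
z = complex(0.6, 0.5)
lhs = mean_trace_series(x, q, z)
rhs = (1 - q) * input_polynomial(x, q + (1 - q) * z)
print(abs(lhs - rhs))  # agreement up to floating-point error
```

The identity follows from the binomial theorem applied to each input position, which is why the two sides agree for every complex `z`.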
Let . Then, Lemma 4 yields
(5)
Lemma 5 ([37]).
There is a universal constant for which the following holds:
Let
be non-zero and define .
Let denote the arc .
Then, we have for every .
This lemma implies that there is a constant such that for every there exists with satisfying
(6)
We can use (6) to lower bound (5), provided there exists such that with good properties. The following lemma ensures this.
Lemma 6.
For large enough there is a constant such that for any there exists satisfying , , and .
As a result of Lemma 6, we can choose a that satisfies , , and for large enough . Using this together with (6), we obtain
(7)
Set . Combining (5), (7), and the fact that for large enough , we obtain
for some constant when is large enough. This concludes the proof of Lemma 2 assuming Lemmas 4 and 6.
3.1 Proofs of Lemmas 4 and 6
In this section, we prove the remaining lemmas.
Proof of Lemma 4.
Fix an input string . For each , let denote the indices of that correspond to replications of and let denote the indices of that correspond to insertions of random bits resulting from the channel’s action on . Then we may write
where the are uniformly distributed over , the are random variables over that are with probability , and all these are independent of each other, , and . Note that if an output bit is in , then it has expected value . Therefore, we have that
We can use this to show that
(8)
We proceed to simplify . Let , where the denote the lengths of the channel outputs associated to each input bit and are i.i.d. according to . We have
(9)
We can interchange the sums above because is in the disk of convergence of and . From the definition of ,
(10)
Combining (9) with (10) yields
Recalling (8) concludes the proof. ∎
We prove Lemma 6 using the standard inverse function theorem stated below.
Lemma 7 ([38, Section VIII.4], adapted).
Let be a non-constant function analytic on a connected open set such that for a given . Then, there exist radii such that for every there exists a unique satisfying . Moreover, the inverse function defined as is analytic on .
Proof of Lemma 6.
Because M is sub-exponential and non-trivial, is a non-constant analytic function on some open ball of radius that satisfies . Hence, Lemma 7 applies with , so there exist and an analytic function such that . In particular, there exists such that for every we can write
(11)
This is because , since , and furthermore
for some constant and all such .
Assume that is large enough so that for all . Then, we set . Note that by the definition of , as required.
Combining (11) with and the triangle inequality, we have
as . Since is a continuous function on a neighborhood of , and , it follows that if is large enough. On the other hand, combining (11) with the fact that
by the chain rule, we obtain
The second inequality holds because . The last inequality follows by noting that and for . ∎
4 Proof of Lemma 3
To conclude the argument, we prove Lemma 3 using an argument analogous to [9, Appendix A.2] and the fact that is sub-exponential.
Let be i.i.d. according to , and set . Then, we have
for every . Since is sub-exponential, a direct application of Bernstein’s inequality [39, Theorem 2.8.1] guarantees the existence of constants such that for and any we have
Combining these observations with the assumption that yields
for some constant and large enough.
5 Future Work
We have shown that $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction over a broad class of replication-insertion channels. However, our channel model does not cover all discrete memoryless synchronization channels as defined by Dobrushin [6, 7]. It would be interesting to extend the result in some form to all such non-trivial channels. On the other hand, to complement the above, it would be interesting to prove trace complexity lower bounds for mean-based reconstruction over all these channels.
Furthermore, it is unclear whether the assumption that $M$ is sub-exponential is necessary for our result. A clear extension of this work would be to either remove this condition or prove that it is necessary for mean-based trace reconstruction from $\exp(O(n^{1/3}))$ traces.
References
- [1] V. I. Levenshtein, “Efficient reconstruction of sequences,” IEEE Transactions on Information Theory, vol. 47, no. 1, pp. 2–22, Jan 2001.
- [2] ——, “Efficient reconstruction of sequences from their subsequences or supersequences,” Journal of Combinatorial Theory, Series A, vol. 93, no. 2, pp. 310–332, 2001.
- [3] T. Batu, S. Kannan, S. Khanna, and A. McGregor, “Reconstructing strings from random traces,” in Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2004, pp. 910–918.
- [4] S. Kannan and A. McGregor, “More on reconstructing strings from random traces: insertions and deletions,” in 2005 IEEE International Symposium on Information Theory (ISIT), 2005, pp. 297–301.
- [5] K. Viswanathan and R. Swaminathan, “Improved string reconstruction over insertion-deletion channels,” in Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2008, pp. 399–408.
- [6] R. L. Dobrushin, “Shannon’s theorems for channels with synchronization errors,” Problemy Peredachi Informatsii, vol. 3, no. 4, pp. 18–36, 1967.
- [7] M. Cheraghchi and J. Ribeiro, “An overview of capacity results for synchronization channels,” IEEE Transactions on Information Theory, 2020, to appear. DOI: 10.1109/TIT.2020.2997329.
- [8] T. Holenstein, M. Mitzenmacher, R. Panigrahy, and U. Wieder, “Trace reconstruction with constant deletion probability and related results,” in Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2008, pp. 389–398.
- [9] A. De, R. O’Donnell, and R. A. Servedio, “Optimal mean-based algorithms for trace reconstruction,” Annals of Applied Probability, vol. 29, no. 2, pp. 851–874, Apr 2019.
- [10] F. Nazarov and Y. Peres, “Trace reconstruction with $\exp(O(n^{1/3}))$ samples,” in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2017, pp. 1042–1046.
- [11] S. M. H. T. Yazdi, H. M. Kiah, E. Garcia-Ruiz, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: Trends and methods,” IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 1, no. 3, pp. 230–248, 2015.
- [12] S. M. H. T. Yazdi, R. Gabrys, and O. Milenkovic, “Portable and error-free DNA-based data storage,” Scientific reports, vol. 7, no. 1, p. 5011, 2017.
- [13] L. Organick, S. D. Ang, Y.-J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen et al., “Random access in large-scale DNA data storage,” Nature biotechnology, vol. 36, no. 3, p. 242, 2018.
- [14] M. Mitzenmacher, “A survey of results for deletion channels and related synchronization channels,” Probability Surveys, vol. 6, pp. 1–33, 2009.
- [15] H. Mercier, V. K. Bhargava, and V. Tarokh, “A survey of error-correcting codes for channels with symbol synchronization errors,” IEEE Communications Surveys Tutorials, vol. 12, no. 1, pp. 87–96, First Quarter 2010.
- [16] B. Haeupler and A. Shahrasbi, “Synchronization strings and codes for insertions and deletions – a survey,” IEEE Transactions on Information Theory, 2021, to appear. Available at https://arxiv.org/abs/2101.00711.
- [17] L. Hartung, N. Holden, and Y. Peres, “Trace reconstruction with varying deletion probabilities,” in Proceedings of the 15th Workshop on Analytic Algorithmics and Combinatorics (ANALCO), 2018, pp. 54–61.
- [18] Y. Peres and A. Zhai, “Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice,” in 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Oct 2017, pp. 228–239.
- [19] N. Holden, R. Pemantle, and Y. Peres, “Subpolynomial trace reconstruction for random strings and arbitrary deletion probability,” in Proceedings of the 31st Conference On Learning Theory (COLT), 2018, pp. 1799–1840.
- [20] S. Davies, M. Z. Racz, and C. Rashtchian, “Reconstructing trees from traces,” in Proceedings of the 32nd Conference on Learning Theory (COLT), 2019, pp. 961–978.
- [21] A. Krishnamurthy, A. Mazumdar, A. McGregor, and S. Pal, “Trace reconstruction: Generalized and parameterized,” in 27th Annual European Symposium on Algorithms (ESA), 2019, pp. 68:1–68:25.
- [22] S. Narayanan and M. Ren, “Circular trace reconstruction,” arXiv e-prints, p. arXiv:2009.01346, Sep. 2020, to appear in ITCS 2021.
- [23] E. Grigorescu, M. Sudan, and M. Zhu, “Limitations of mean-based algorithms for trace reconstruction at small distance,” arXiv e-prints, p. arXiv:2011.13737, Nov. 2020.
- [24] Z. Chase, “New upper bounds for trace reconstruction,” arXiv e-prints, p. arXiv:2009.03296, Sep. 2020.
- [25] X. Chen, A. De, C. H. Lee, R. A. Servedio, and S. Sinha, “Polynomial-time trace reconstruction in the smoothed complexity model,” in Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021, pp. 54–73.
- [26] A. McGregor, E. Price, and S. Vorotnikova, “Trace reconstruction revisited,” in 22nd Annual European Symposium on Algorithms (ESA), 2014, pp. 689–700.
- [27] X. Chen, A. De, C. H. Lee, R. A. Servedio, and S. Sinha, “Polynomial-time trace reconstruction in the low deletion rate regime,” arXiv e-prints, p. arXiv:2012.02844, Dec. 2020, to appear in ITCS 2021.
- [28] N. Holden and R. Lyons, “Lower bounds for trace reconstruction,” Ann. Appl. Probab., vol. 30, no. 2, pp. 503–525, Apr. 2020.
- [29] Z. Chase, “New lower bounds for trace reconstruction,” arXiv e-prints, p. arXiv:1905.03031, May 2019, to appear in Ann. Inst. Henri Poincaré Probab. Stat.
- [30] M. Cheraghchi, R. Gabrys, O. Milenkovic, and J. Ribeiro, “Coded trace reconstruction,” IEEE Transactions on Information Theory, vol. 66, no. 10, pp. 6084–6103, 2020.
- [31] J. Brakensiek, R. Li, and B. Spang, “Coded trace reconstruction in a constant number of traces,” in 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020, pp. 482–493.
- [32] S. Davies, M. Z. Racz, C. Rashtchian, and B. G. Schiffer, “Approximate trace reconstruction,” arXiv e-prints, p. arXiv:2012.06713, Dec. 2020.
- [33] V. Bhardwaj, P. A. Pevzner, C. Rashtchian, and Y. Safonova, “Trace reconstruction problems in computational biology,” IEEE Transactions on Information Theory, 2020, to appear. DOI: 10.1109/TIT.2020.3030569.
- [34] F. Ban, X. Chen, A. Freilich, R. A. Servedio, and S. Sinha, “Beyond trace reconstruction: Population recovery from the deletion channel,” in 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), Nov 2019.
- [35] F. Ban, X. Chen, R. A. Servedio, and S. Sinha, “Efficient average-case population recovery in the presence of insertions and deletions,” in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019), 2019, pp. 44:1–44:18.
- [36] S. Narayanan, “Improved algorithms for population recovery from the deletion channel,” in Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021, pp. 1259–1278.
- [37] P. Borwein and T. Erdélyi, “Littlewood-type problems on subarcs of the unit circle,” Indiana University mathematics journal, pp. 1323–1346, 1997.
- [38] T. Gamelin, Complex Analysis, ser. Undergraduate Texts in Mathematics. Springer New York, 2001.
- [39] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.