Due to its longevity and enormous information density, and thanks to rapid advances in technologies for writing (synthesis) and reading (sequencing), DNA is on track to become an attractive medium for archival data storage. Computer scientists and engineers have dreamed of harnessing DNA’s storage capabilities as early as the 1960s [1, 2], and in recent years this idea has developed into an active field of research. In 2012 and 2013, groups led by Church [3] and Goldman [4] independently stored about a megabyte of data in DNA. In 2015, Grass et al. [5] demonstrated that millennia-long storage times are possible by protecting the data both physically and information-theoretically, and designed a robust DNA data storage scheme using modern error-correcting codes. Later in the same year, Yazdi et al. [6] showed how to selectively access files, and in 2017, Erlich and Zielinski [7] demonstrated that practical DNA storage can achieve very high information densities. In 2018, Organick et al. [8] scaled up these techniques and stored about 200 megabytes of data.
DNA is a long molecule made up of four nucleotides (Adenine, Cytosine, Guanine, and Thymine) and, for storage purposes, can be viewed as a string over a four-letter alphabet. However, due to technological constraints, it is difficult and inefficient to synthesize long strands of DNA. Thus, data is stored on short DNA molecules, which are preserved in a pool of DNA and cannot be spatially ordered. Indeed, all the aforementioned works stored information on molecules no longer than a few hundred nucleotides.
Given these constraints, the simplest mathematical model for a DNA storage system is a shuffling channel: data is encoded into $M$ DNA molecules (i.e., strings), each of length $L$, and the output of the channel consists of a random shuffle of the $M$ sequences.
In our previous work [9], we studied the capacity of such a shuffling channel with one additional feature: random sampling of the shuffled sequences. In a DNA storage system, the information is accessed via shotgun sequencing, which in essence corresponds to randomly sampling and reading molecules from the DNA pool. In order to formally study the capacity of this system, we considered an asymptotic regime in which $M$ is the number of sequences/molecules stored, $N = cM$ sequences are sampled, the length of each sequence is small (specifically $L = \beta \log M$ for a fixed $\beta$, and throughout $\log$ is with respect to base $2$), and $M \to \infty$. For this setting, and assuming an alphabet of size two for simplicity, the storage capacity, defined as the maximum number of bits that can be reliably stored per symbol (the total number of symbols is $ML$), was shown to be
$$C = \left(1 - e^{-c}\right)\left(1 - \frac{1}{\beta}\right), \tag{1}$$
provided that $\beta > 1$. If $\beta \le 1$, no positive rate is achievable.
Surprisingly, the capacity of this shuffle-and-sampling channel is achieved by a simple scheme where each sequence is given a unique index (or header), and an erasure outer code is used to protect against missing molecules in the sampling step. The factor $(1 - e^{-c})$ can be understood as the loss due to unseen molecules, and the factor $(1 - 1/\beta)$ corresponds to the loss due to shuffling, that is, the lack of ordering in the molecules.
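The two loss factors can be checked numerically. The sketch below is illustrative only: it assumes that reads are drawn uniformly at random with replacement, that $N = cM$ reads are taken, and that the capacity has the product form $(1 - e^{-c})(1 - 1/\beta)$ for $\beta > 1$. The function names are hypothetical.

```python
import math
import random


def fraction_unseen(num_molecules, num_reads, seed=0):
    """Empirical fraction of molecules never drawn when sampling
    uniformly with replacement (assumed sampling model)."""
    rng = random.Random(seed)
    seen = {rng.randrange(num_molecules) for _ in range(num_reads)}
    return 1.0 - len(seen) / num_molecules


def sampling_shuffling_capacity(c, beta):
    """Assumed capacity form: (1 - e^{-c}) * (1 - 1/beta) for beta > 1,
    and zero otherwise."""
    if beta <= 1:
        return 0.0
    return (1.0 - math.exp(-c)) * (1.0 - 1.0 / beta)


if __name__ == "__main__":
    M = 20000
    for c in (0.5, 1.0, 2.0):
        empirical = fraction_unseen(M, int(c * M))
        print(c, round(empirical, 4), round(math.exp(-c), 4))
    print(sampling_shuffling_capacity(1.0, 4.0))
```

For large $M$ the empirical fraction of unseen molecules should be close to $e^{-c}$, which is the loss the first factor accounts for.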
In practice, however, the DNA sequences are additionally corrupted by substitution, deletion, and insertion errors (which occur during synthesis, sequencing, and storage). For that reason, several recent works have studied the design of error-correcting codes tailored to the error profiles of DNA storage systems [10, 6, 11, 12, 13]. However, generalizing the capacity expression in (1) to a setting in which, in addition to random shuffling and random sampling, the molecules are corrupted by noise is quite nontrivial.
In this work we make progress on this front by studying the capacity of a noisy shuffling channel. We drop the random sampling constraint considered in [9], and focus on a setting in which the sequences are first corrupted by noise and then randomly shuffled. Since in all recently proposed systems [3, 4, 5, 6, 7, 8] substitution errors are more frequent than insertion and deletion errors [14], we focus on substitution noise.
Contributions: We study the capacity of the noisy shuffling channel illustrated in Figure I, where the input/output sequences are binary, and the noisy channel is a Binary Symmetric Channel (BSC) with crossover probability $p$. Our main result establishes that, for a large set of the parameters $p$ and $\beta$, the capacity of the BSC shuffling channel is
$$C = 1 - H(p) - \frac{1}{\beta}, \tag{2}$$
where $H(\cdot)$ is the binary entropy function. Since the capacity of a BSC is $1 - H(p)$, this result implies that the shuffling reduces the BSC capacity by $1/\beta$ bits.
The capacity-achieving scheme consists of encoding each length-$L$ string as a length-$L$ codeword from a BSC capacity-achieving code, and using $\log M$ of its information bits (prior to encoding) to store an index. Hence, at least asymptotically, there is no rate gain in coding across the strings, and index-based approaches, which were used in several practical implementations [5, 6, 7, 8], are optimal.
Outline: The paper is organized as follows. After formalizing the problem setting in Section II, we state our main result and describe the achievability in Section III. Section IV is dedicated to the converse. We conclude by discussing extensions and future directions in Section V.
Related literature: Motivated by DNA-based storage, a few recent works have considered the problem of coding across an unordered set of strings [15, 16, 17, 18]. The setting studied in all these works bears similarities with the one in this paper, but they focus on providing explicit code constructions, as opposed to characterizing the channel capacity, as we do here.
II Problem Setting
The BSC shuffling channel, illustrated in Figure I, can be seen as the concatenation of a BSC with crossover probability $p$ and a shuffling channel, which randomly reorders the binary strings. Formally, we will view the input to the channel as a length-$ML$ binary string
$$X^{ML} = [X_1, X_2, \ldots, X_M]$$
(i.e., $M$ strings of length $L$ concatenated to form a single string of length $ML$). Similarly,
$$Y^{ML} = [Y_1, Y_2, \ldots, Y_M]$$
is the corresponding output. We let $Z^{ML} = [Z_1, Z_2, \ldots, Z_M]$ be the random binary error pattern created by the BSC on $X^{ML}$ and
$$S^M = [S(1), S(2), \ldots, S(M)]$$
be a vector encoding the random shuffle induced by the channel, so that
$$Y_i = X_{S(i)} \oplus Z_{S(i)}$$
for $i = 1, \ldots, M$, where $\oplus$ indicates elementwise modulo-2 addition. Similar to our previous work [9], we let $L = \beta \log M$, and consider the asymptotic regime $M \to \infty$. As shown in [9], this is the proper regime for positive data rates to be achievable.
A rate $R$ is said to be achievable if there is a sequence of codes, each with $2^{MLR}$ codewords, whose error probability goes to zero as $M \to \infty$. The shuffling channel capacity $C$ is the supremum over all achievable rates.
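A minimal simulation of the channel described in this section may help fix ideas. It is a sketch under stated assumptions: each bit is flipped independently with probability `p`, the shuffle is a uniformly random permutation, and `shuffle[i]` is taken to be the index of the input string that produced output `i`. The function name is hypothetical.

```python
import random


def bsc_shuffling_channel(strings, p, seed=None):
    """Pass M equal-length bit lists through a BSC(p), then shuffle them.

    Returns (outputs, shuffle), where outputs[i] is the noisy version of
    strings[shuffle[i]].
    """
    rng = random.Random(seed)
    num_strings = len(strings)

    # BSC: flip each bit independently with probability p.
    noisy = [[bit ^ (rng.random() < p) for bit in s] for s in strings]

    # Shuffling channel: a uniformly random permutation of the strings.
    shuffle = list(range(num_strings))
    rng.shuffle(shuffle)
    outputs = [noisy[shuffle[i]] for i in range(num_strings)]
    return outputs, shuffle


if __name__ == "__main__":
    inputs = [[0, 0, 1, 1], [1, 0, 1, 0], [1, 1, 1, 1]]
    outputs, shuffle = bsc_shuffling_channel(inputs, p=0.1, seed=1)
    print(outputs, shuffle)
```

A decoder only sees `outputs`; the permutation is returned here purely so that experiments can compare an estimate against the truth.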
III Main Result
Our main result establishes that, provided that $\beta$ is large enough, treating each length-$L$ sequence as the input to a separate BSC and encoding a unique index into each sequence is capacity optimal. More precisely, consider a code for a BSC with codewords of length $L$ and rate $1 - H(p) - \epsilon$, for some small $\epsilon > 0$. Using this code, we can store $L(1 - H(p) - \epsilon)$ information bits in each length-$L$ string. We utilize the first $\log M$ of those bits to encode a distinct index for each of the $M$ strings. Hence, in each string we can effectively store $L(1 - H(p) - \epsilon) - \log M$ data bits, and the rate of this scheme is
$$R = \frac{L(1 - H(p) - \epsilon) - \log M}{L} = 1 - H(p) - \epsilon - \frac{1}{\beta}. \tag{3}$$
Since $\epsilon$ can be chosen arbitrarily small, this scheme can have a rate arbitrarily close to
$$1 - H(p) - \frac{1}{\beta}. \tag{4}$$
Strictly speaking, the simple index-based scheme described above needs to be modified using an outer code to guarantee that the rate in (3) is actually achieved. That is because there is a vanishing, but positive, probability that a given string is decoded in error, and since $M \to \infty$, the probability that at least one of the $M$ strings is decoded in error may not go to zero. When an error occurs, the index is also read in error, which may cause a “collision” with a properly decoded string. Nevertheless, since the error probability for each given string vanishes as $M \to \infty$, this can be fixed with an outer code across the data symbols of all the strings, and it is straightforward to see that one can achieve a rate arbitrarily close to the one in (3). Hence, (4) is a lower bound to the capacity $C$.
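The indexing idea can be sketched in a noise-free toy setting. The example below is not the full scheme: it omits both the BSC code and the outer code, and only shows how a fixed-width binary index prepended to each string lets a decoder undo an arbitrary shuffle. The helper names and the use of bit strings are choices made for this sketch.

```python
import math
import random


def encode_with_index(data_blocks):
    """Prepend a fixed-width binary index to each data block.

    Returns (indexed_strings, index_bits)."""
    num_blocks = len(data_blocks)
    # ceil(log2 M) bits are enough to give each of M strings its own index.
    index_bits = max(1, math.ceil(math.log2(num_blocks)))
    indexed = [format(i, f"0{index_bits}b") + block
               for i, block in enumerate(data_blocks)]
    return indexed, index_bits


def decode_with_index(received, index_bits):
    """Sort received strings by their index and strip the index."""
    ordered = sorted(received, key=lambda s: int(s[:index_bits], 2))
    return [s[index_bits:] for s in ordered]


if __name__ == "__main__":
    blocks = ["0101", "1111", "0000", "1001", "0011"]
    indexed, k = encode_with_index(blocks)
    random.Random(1).shuffle(indexed)  # the shuffling channel
    print(decode_with_index(indexed, k) == blocks)
```

With noise, a corrupted index can collide with a correctly decoded one, which is the situation the outer code is meant to handle.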
On the other hand, a simple upper bound is given by
$$C \le \min\left\{1 - H(p),\ 1 - \frac{1}{\beta}\right\}, \tag{5}$$
since both the capacity of the BSC, $1 - H(p)$, and the capacity of the shuffling channel, $1 - 1/\beta$, must be upper bounds to $C$.
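To get a feel for the gap between the index-based lower bound and the simple upper bound, both can be evaluated numerically. The sketch below assumes the forms $1 - H(p) - \epsilon - 1/\beta$ for the scheme's rate and $\min\{1 - H(p),\, 1 - 1/\beta\}$ for the upper bound, with $L = \beta \log_2 M$; the function names are hypothetical and negative values are clipped to zero.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def index_scheme_rate(p, beta, eps=0.0):
    """Rate of the index-based scheme: per-string BSC rate minus the
    log2(M) / L = 1 / beta bits per symbol spent on the index."""
    return max(0.0, 1.0 - binary_entropy(p) - eps - 1.0 / beta)


def simple_upper_bound(p, beta):
    """Minimum of the BSC capacity and the noise-free shuffling capacity."""
    return max(0.0, min(1.0 - binary_entropy(p), 1.0 - 1.0 / beta))


if __name__ == "__main__":
    for p, beta in [(0.01, 4.0), (0.05, 4.0), (0.11, 4.0), (0.11, 10.0)]:
        print(p, beta,
              round(index_scheme_rate(p, beta), 4),
              round(simple_upper_bound(p, beta), 4))
```

The two quantities coincide only in degenerate cases (for example $p = 0$), which is why a sharper converse is needed to pin down the capacity.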
If and ,
Moreover, if , .
As we will see, the converse argument presented in the next section and the capacity expression in (6) hold for a set of parameters larger than those specified in Theorem 1. In fact, for all in the blue region of Figure III, the capacity is given by (6). Hence, for for instance, we need for (6) to hold.
IV Converse
Consider a sequence of codes for the noisy shuffling channel with rate $R$ and vanishing error probability. Let
be the input to the channel when we choose one of the codewords from one such code uniformly at random, and
the corresponding output. We first observe that
where as by Fano’s inequality. Then,
The last equality follows by noticing that, given , one can compute
for , and thus . The second term in (7) can be computed as
In order to finish the converse, we need to jointly bound the first and third terms in equation (7). This is the most challenging part of the proof and is summarized in the following lemma:
If and satisfy
then it holds that
IV-A Intuition for Lemma 1
If we trivially bound each entropy term separately, we obtain
However, intuitively, this bound is too loose. Given the input $X^{ML}$ and the output $Y^{ML}$, one could estimate the shuffle $S^M$ by finding, for each output string $Y_i$, the input string $X_j$ that is closest to it and guessing that $S(i) = j$. This will be a good guess if there is no other input string $X_k$ close to $Y_i$. Hence, we can picture two opposite scenarios, illustrated in Figure IV-A.
In the first case, illustrated in Fig. IV-A(a), the strings are all distant from each other (in the Hamming sense). Hence, the maximum likelihood estimate of $S^M$ given $X^{ML}$ and $Y^{ML}$ will be “close” to the truth and we would expect $H(S^M \mid X^{ML}, Y^{ML})$ to be small. In the second case, illustrated in Fig. IV-A(b), several of the strings are close to each other. So we have less information about $S^M$, and $H(S^M \mid X^{ML}, Y^{ML})$ may be large.
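The nearest-neighbor guess described above can be written down directly. This is a sketch of the intuition only, not a step of the proof; it assumes strings are given as equal-length bit strings, breaks ties in favor of the smallest input index, and uses hypothetical function names.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def estimate_shuffle(inputs, outputs):
    """For each output string, guess the index of the closest input string.

    Ties are broken in favor of the smallest index."""
    return [
        min(range(len(inputs)), key=lambda j: hamming(inputs[j], y))
        for y in outputs
    ]


if __name__ == "__main__":
    inputs = ["000000", "111111", "000111"]
    # Each output is one bit flip away from a different input.
    outputs = ["110111", "000101", "010000"]
    print(estimate_shuffle(inputs, outputs))
```

When the input strings are far apart, as in this example, a single bit flip cannot confuse the guess; when two inputs are close, the same output may be nearly equidistant from both, and the guess becomes unreliable.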
IV-B Proof of Lemma 1
Let be the set of strings observed at the output of the channel. In order to capture whether we are in the scenario of Figure IV-A(a) or (b), we let be the largest subset of so that, for any , where is the Hamming distance and . We assume that in case of ties, an arbitrary tie-breaking rule is used to define (the actual choice will not be relevant for the proof).
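To experiment with the notion of a well-separated subset of output strings, a greedy construction is a convenient stand-in. Note the assumption: the definition above asks for the largest such subset, whereas the greedy pass below only returns a maximal one (no further string can be added), which may be smaller. The distance threshold `min_distance` is a free parameter of this sketch, and the function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_separated_subset(strings, min_distance):
    """Greedily pick strings whose pairwise Hamming distance is at least
    min_distance. The result is maximal, not necessarily the largest."""
    chosen = []
    for s in strings:
        if all(hamming(s, t) >= min_distance for t in chosen):
            chosen.append(s)
    return chosen


if __name__ == "__main__":
    observed = ["0000", "0001", "1111", "1110", "0011"]
    subset = greedy_separated_subset(observed, min_distance=2)
    print(subset, len(subset))
```

By construction, every string left out of the greedy subset lies within distance `min_distance - 1` of some chosen string, which mirrors the covering property used in the proof of (B1).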
We will prove that, given the conditions in Lemma 1, the following two bounds involving hold:
For large , we are intuitively more often in the scenario in Figure IV-A(a), while Figure IV-A(b) corresponds to the case where is small. The bounds above capture the tension between and because (B2) is decreasing in , while (B1) is increasing in (as long as ). Combining (B1) and (B2),
Replacing with and ignoring the terms in this upper bound that do not involve , we have the expression
Let . For , we have
We see that, as long as , the right-hand side of (11) is greater than for large enough. This means that is increasing for , and must attain its maximum at . Notice that . Therefore, (10) can be upper-bounded by setting , which yields
Notice that this holds as long as
for some . From the continuity of , such can be found if (9) is satisfied, proving the lemma. Finally, we prove (B1) and (B2).
Proof of (B1): Since is a deterministic function of and can take at most values,
Next we notice that, for a given , we can write
The first term in (13) is trivially bounded as
Each of the remaining length- strings with must be within a distance from one of the strings in , from the definition of . Hence, conditioned on , each of them can only take at most values, where is a Hamming ball of radius . Since for , we bound the second term in (13) as
Using these bounds back in (12), we obtain
where we used the fact that is a concave function of and Jensen’s inequality. ∎
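The counting step in the proof of (B1) relies on a bound on the size of a Hamming ball. The standard bound, assumed here to be the one intended, states that a Hamming ball of radius $r$ in $\{0,1\}^L$ contains at most $2^{L H(r/L)}$ strings whenever $r \le L/2$. The sketch below compares the exact ball size with this bound; the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def hamming_ball_size(length, radius):
    """Exact number of binary strings within Hamming distance `radius`
    of a fixed string of the given length."""
    return sum(math.comb(length, k) for k in range(radius + 1))


def entropy_bound(length, radius):
    """Upper bound 2^{L * H(r / L)}, valid for radius <= length / 2."""
    return 2.0 ** (length * binary_entropy(radius / length))


if __name__ == "__main__":
    L = 40
    for r in (2, 5, 10, 20):
        print(r, hamming_ball_size(L, r), round(entropy_bound(L, r)))
```

The bound is loose by a polynomial factor in $L$, which does not matter on the exponential scale used in the converse.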
Proof of (B2): Since is a deterministic function of ,
Next we notice that the probability that or more errors occur in a single length- string, for , is at most by the Chernoff bound (where is the binary KL divergence). If we let be the event that , then we have
The conditional entropy term in (15) is upper bounded by
The final step is to bound the conditional entropy term in (16), for the case where . Set . Conditioned on , . Moreover, conditioned on , for any , . Define the set
We claim that, if , must be in . To see this notice that, for any , , we have
implying that . Therefore, for each output string with , can take at most values. Hence we have
where we used Jensen’s inequality in the last step. Since as , , concluding the proof. ∎
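The Chernoff bound invoked in the proof of (B2) can be checked against the exact binomial tail. The sketch assumes the standard form: for $\alpha > p$, the probability of at least $\alpha L$ errors among $L$ independent bits with flip probability $p$ is at most $2^{-L\, D(\alpha \| p)}$, with $D$ the binary KL divergence in bits. The function names are hypothetical.

```python
import math


def binary_kl(a, p):
    """Binary KL divergence D(a || p) in bits, for 0 < p < 1."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log2(x / y)
    return term(a, p) + term(1 - a, 1 - p)


def exact_tail(length, p, alpha):
    """P(Binomial(length, p) >= alpha * length), computed exactly."""
    k_min = math.ceil(alpha * length)
    return sum(
        math.comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(k_min, length + 1)
    )


def chernoff_bound(length, p, alpha):
    """Chernoff upper bound 2^{-L * D(alpha || p)}, valid for alpha > p."""
    return 2.0 ** (-length * binary_kl(alpha, p))


if __name__ == "__main__":
    for L in (50, 100, 200):
        print(L, exact_tail(L, 0.1, 0.2), chernoff_bound(L, 0.1, 0.2))
```

Both quantities decay exponentially in $L$, so with $L$ growing like $\log M$ the per-string probability of an atypically large number of errors decays polynomially in $M$.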
V Discussion and Extensions
While the parameter regime of $(p, \beta)$ in the blue region of Figure III is practically the most relevant one, an interesting question for future work is whether expression (6) is still the capacity of the BSC-shuffling channel if $p$ and $\beta$ do not satisfy (9) (i.e., the gray region in Figure III). Notice that this is a high-noise, short-block regime, and it is reasonable to imagine that coding across the different sequences can be helpful and that an index-based approach might not be optimal.
V-A General symmetric channels
If we view the capacity expression in (6) as $C_{\mathrm{BSC}} - 1/\beta$, it is natural to wonder whether, for a different noisy channel with capacity $C_{\mathrm{noisy}}$, the corresponding noisy shuffling channel would have capacity $C_{\mathrm{noisy}} - 1/\beta$. As it turns out, the converse proof in Section IV can be extended to the class of symmetric discrete memoryless channels (as described in [19, ch. 7.2]).
Consider a noisy shuffling channel formed by a symmetric discrete memoryless channel (SDMC) with output alphabet , followed by a shuffling channel. We focus on the same asymptotic regime from Section II. Then we have:
If is large enough, the capacity of the SDMC-shuffling channel is given by
Moreover, if , .
For symmetric channels, capacity is achieved by making the distribution of the output uniform, which allows a result analogous to Lemma 1 to be obtained. Notice that how large $\beta$ needs to be depends on the specific channel transition matrix.
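For a symmetric channel the capacity has a closed form, $\log_2 |\mathcal{Y}| - H(\text{row})$, where $H(\text{row})$ is the entropy of any row of the transition matrix (see [19, ch. 7.2]). The sketch below evaluates it and subtracts $1/\beta$, which is the form suggested by the discussion above for large enough $\beta$; it does not check how large $\beta$ must be, and the function names are hypothetical.

```python
import math


def symmetric_channel_capacity(transition_matrix):
    """Capacity in bits of a symmetric DMC: log2(|Y|) - H(row).

    Assumes every row is a permutation of the first row, so only the
    first row is inspected."""
    row = transition_matrix[0]
    row_entropy = -sum(q * math.log2(q) for q in row if q > 0)
    return math.log2(len(row)) - row_entropy


def sdmc_shuffling_rate(transition_matrix, beta):
    """Channel capacity minus the 1/beta indexing cost, clipped at zero
    (assumed form, for beta large enough)."""
    return max(0.0, symmetric_channel_capacity(transition_matrix) - 1.0 / beta)


if __name__ == "__main__":
    p = 0.11
    bsc = [[1 - p, p], [p, 1 - p]]
    print(round(symmetric_channel_capacity(bsc), 4))
    print(round(sdmc_shuffling_rate(bsc, 4.0), 4))
```

For the BSC this reduces to $1 - H(p)$, so the second function agrees with the binary expression $1 - H(p) - 1/\beta$.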
VI Concluding Remarks
In this paper we took steps towards the understanding of the fundamental limits of DNA-based storage systems. We proposed a simple model capturing the fact that molecules are stored in an unordered fashion, are short, and are corrupted by individual base errors. Our results show that a simple index-based coding scheme is asymptotically optimal.
While the model captures (moderate) substitution errors, which are the prevalent error source at the nucleotide level in current DNA storage systems, the current generation of systems relies on low-error synthesis and sequencing technologies that are relatively expensive and limited in speed. A key idea towards developing the next generation of DNA storage systems is to employ high-error, but cheaper and faster, synthesis and sequencing technologies such as light-directed maskless synthesis of DNA and nanopore sequencing. Such systems induce a significant amount of insertion and deletion errors. Thus, an important area of further investigation is to understand the capacity of channels which introduce deletions and insertions as well.
In addition, in practice, one can introduce physical redundancy (i.e., duplicate molecules) in storage and redundancy in the DNA sequencing. Thus, an interesting question is to understand how to efficiently code for channels where we have several unordered, independently perturbed copies of each sequence.
[1] M. S. Neiman. Some fundamental issues of microminiaturization. Radiotekhnika, 1(1):3–12, 1964.
[2] E. B. Baum. Building an associative memory vastly larger than the brain. Science, 268(5210):583–585, 1995.
[3] G. M. Church, Y. Gao, and S. Kosuri. Next-generation digital information storage in DNA. Science, 337(6102):1628–1628, 2012.
[4] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77–80, 2013.
[5] R. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie International Edition, 54(8):2552–2555, 2015.
[6] H. T. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic. A rewritable, random-access DNA-based storage system. Sci. Rep., 5:14138, 2015.
[7] Y. Erlich and D. Zielinski. DNA Fountain enables a robust and efficient storage architecture. Science, 2017.
[8] L. Organick, S. D. Ang, Y.-J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen, et al. Random access in large-scale DNA data storage. Nature Biotechnology, 2018.
[9] R. Heckel, I. Shomorony, K. Ramchandran, and D. N. C. Tse. Fundamental limits of DNA storage systems. In IEEE International Symposium on Information Theory (ISIT), pages 3130–3134, 2017.
[10] H. M. Kiah, G. J. Puleo, and O. Milenkovic. Codes for DNA sequence profiles. IEEE Transactions on Information Theory, 62(6):3125–3146, 2016.
[11] R. Gabrys, H. M. Kiah, and O. Milenkovic. Asymmetric Lee distance codes: New bounds and constructions. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5, 2015.
[12] F. Sala, R. Gabrys, C. Schoeny, and L. Dolecek. Exact reconstruction from insertions in synchronization codes. IEEE Transactions on Information Theory, 2017.
[13] M. Kovačević and V. Y. F. Tan. Asymptotically optimal codes correcting fixed-length duplication errors in DNA storage systems. IEEE Communications Letters, 22(11):2194–2197, 2018.
[14] R. Heckel, G. Mikutis, and R. N. Grass. A characterization of the DNA data storage channel. arXiv:1803.03322, 2018.
[15] M. Kovačević and V. Y. F. Tan. Codes in the space of multisets: coding for permutation channels with impairments. IEEE Transactions on Information Theory, 64(7):5156–5169, 2018.
[16] W. Song and K. Cai. Sequence-subset distance and coding for error control in DNA data storage. arXiv:1809.05821, 2018.
[17] A. Lenz, P. H. Siegel, A. Wachter-Zeh, and E. Yaakobi. Anchor-based correction of substitutions in indexed sets. (291763).
[18] A. Lenz, P. H. Siegel, A. Wachter-Zeh, and E. Yaakobi. Coding over sets for DNA storage. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 2411–2415, 2018.
[19] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing, 2nd edition, 2006.