I. Introduction
Consider the problem of transmitting a message by writing it on a piece of paper, which will be torn into small pieces of random sizes and randomly shuffled. This coding problem is illustrated in Figure 1. We refer to it as torn-paper coding, in allusion to the classic dirty-paper coding problem [dirtypaper].
This problem is mainly motivated by macromolecule-based (and in particular DNA-based) data storage, which has recently received significant attention due to several proof-of-concept DNA storage systems [church_nextgeneration_2012, goldman_towards_2013, grass_robust_2015, bornholt_dnabased_2016, erlich_dna_2016, organick_scaling_2017]. In these systems, data is written onto synthesized DNA molecules, which are then stored in solution. During synthesis and storage, molecules in solution are subject to random breaks and, due to the unordered nature of macromolecule-based storage, the resulting pieces are shuffled [heckel_characterization_2018]. Furthermore, the data is read via high-throughput sequencing technologies, which is typically preceded by physical fragmentation of the DNA with techniques like sonication [pomraning2012library]. In addition, the torn-paper channel is related to the DNA shotgun sequencing channel, studied in [MotahariDNA, BBT, gabrys2018unique], but in the context of variable-length reads, which are obtained in nanopore sequencing technologies [laver2015assessing, mao2018models].
We consider the scenario where the channel input is a length-$n$ binary string, which is then torn into pieces of lengths $N_1, N_2, \dots$, each of which has a Geometric($p_n$) distribution. The channel output is the unordered set of these pieces. As we will see, even this noise-free version of the torn-paper coding problem is non-trivial.
To obtain some intuition, notice that $\mathbb{E}[N_i] = 1/p_n$, and hence it is reasonable to compare our problem to the case where the tearing points are evenly separated, and $N_i = 1/p_n$ for all $i$ with probability $1$. In this case, the channel becomes a shuffling channel, similar to the one considered in [noisyshuffling], but with no noise. Coding for the case of deterministic fragments of length $1/p_n$ is easy: since the tearing points are known, we can prefix each fragment with a unique identifier, which allows the decoder to correctly order the fragments. From the results in [noisyshuffling], such an index-based coding scheme is capacity-optimal, and any achievable rate in this case must satisfy, for large $n$,

(1)  $R \le \left(1 - p_n \log(n p_n)\right)^+ .$
If we let $\alpha = \lim_{n \to \infty} p_n \log n$, the capacity for this case becomes $(1 - \alpha)^+$.
It is not clear a priori whether the capacity of the torn-paper channel should be higher or lower than $(1-\alpha)^+$. The fact that the tearing points are not known to the encoder makes it challenging to place a unique identifier in each fragment, suggesting that the torn-paper channel is “harder” and should have a lower capacity. The main result of this paper contradicts this intuition and shows that the capacity of the torn-paper channel with Geometric($p_n$) fragment lengths is higher than $(1-\alpha)^+$. More precisely, we show that the capacity of the torn-paper channel is $C = e^{-\alpha}$. Intuitively, this boost in capacity comes from the tail of the geometric distribution, which guarantees that a fraction of the fragments will be significantly longer than the mean $1/p_n$. This allows the capacity to be positive even for $\alpha \ge 1$, in which case the capacity of the deterministic-tearing case in (1) becomes $0$.

II. Problem Setting
We consider the problem of coding for the torn-paper channel, illustrated in Figure 1. The transmitter encodes a message $W$ into a length-$n$ binary codeword $\vec{x} = (x_1, \dots, x_n)$. The channel output is a set of binary strings

(2)  $\mathcal{Y} = \left\{ \vec{y}_1, \dots, \vec{y}_K \right\}.$

The process by which $\mathcal{Y}$ is obtained is described next.


The channel tears the input sequence $\vec{x}$ into segments of expected length $1/p_n$, for a tearing probability $p_n$. More specifically, let $N_1, N_2, \dots$ be i.i.d. Geometric($p_n$) random variables. Let $K$ be the smallest index such that $N_1 + \dots + N_K \ge n$. Notice that $K$ is also a random variable. The channel tears $\vec{x}$ into segments $\vec{x}_1, \dots, \vec{x}_K$, where $\vec{x}_j = \left(x_{N_1+\dots+N_{j-1}+1}, \dots, x_{N_1+\dots+N_j}\right)$ for $j < K$, and $\vec{x}_K = \left(x_{N_1+\dots+N_{K-1}+1}, \dots, x_n\right)$. We note that this process is equivalent to independently tearing the message in between consecutive bits with probability $p_n$. More precisely, for $i = 1, \dots, n-1$, let $Z_i$ be the binary indicator of whether there is a cut between $x_i$ and $x_{i+1}$. Then, letting the $Z_i$s be i.i.d. Bernoulli($p_n$) random variables results in independent fragments of Geometric($p_n$) length. Also, $K = 1 + \sum_{i=1}^{n-1} Z_i$, implying that $\mathbb{E}[K] = 1 + (n-1)p_n \approx n p_n$.

Given $K$, let $\pi$ be a uniformly distributed random permutation on $\{1, \dots, K\}$. The output segments are then obtained by setting, for $j = 1, \dots, K$, $\vec{y}_j = \vec{x}_{\pi(j)}$. We note that there are no bit-level errors, e.g., bit flips, in this process. We also point out that we allow the tearing probability to be a function of the block length $n$, thus including the subscript $n$ in $p_n$.
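For illustration (not part of the formal model), the Bernoulli-cut formulation above translates directly into a short simulation; the function name and parameter choices here are our own:

```python
import random

def torn_paper_channel(x, p, rng=None):
    """Tear the binary string x between consecutive bits, each independently
    with probability p, and return the fragments in uniformly random order."""
    rng = rng or random.Random(0)
    cuts = [i for i in range(1, len(x)) if rng.random() < p]  # the Z_i's
    bounds = [0] + cuts + [len(x)]
    fragments = [x[a:b] for a, b in zip(bounds, bounds[1:])]
    rng.shuffle(fragments)  # the uniform random permutation pi
    return fragments

fragments = torn_paper_channel("1011001110001011", p=0.25)
# the fragments always partition x, so their total length is n
assert sum(len(f) for f in fragments) == 16
```

The fragment lengths produced this way are (up to truncation of the last piece) i.i.d. Geometric($p$), matching the description in terms of the $N_i$s.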
A code with rate $R$ for the torn-paper channel is a set of $2^{nR}$ binary codewords, each of length $n$, together with a decoding procedure that maps a set of variable-length binary strings to an index $\hat{W} \in \{1, \dots, 2^{nR}\}$. The message $W$ is assumed to be chosen uniformly at random from $\{1, \dots, 2^{nR}\}$, and the error probability $\Pr(\hat{W} \ne W)$ of a code is defined accordingly. A rate $R$ is said to be achievable if there exists a sequence of rate-$R$ codes $\mathcal{C}_n$, with blocklength $n$, whose error probability tends to $0$ as $n \to \infty$. The capacity $C$ is defined as the supremum over all achievable rates. Notice that $C$ should be a function of the sequence of tearing probabilities $\{p_n\}$.
Notation: Throughout the paper, $\log$ represents the logarithm base $2$, while $\ln$ represents the natural logarithm. For functions $f$ and $g$, we write $f(n) \sim g(n)$ if $f(n)/g(n) \to 1$ as $n \to \infty$. For an event $A$, we let $\mathbb{1}_A$ or $\mathbb{1}\{A\}$ be the binary indicator of $A$.
III. Main Results
If the encoder had access to the tearing locations ahead of time, a natural coding scheme would involve placing unique indices on every fragment and using the remaining bits for encoding a message. In particular, if the message block broke evenly into $n p_n$ pieces of length $1/p_n$, results from [noisyshuffling] imply that placing a unique index of length $\log(n p_n)$ in each fragment is capacity-optimal. In this case, the capacity is $(1-\alpha)^+$, where $\alpha = \lim_{n \to \infty} p_n \log n$ (assuming the limit exists). If $\alpha \ge 1$, no positive rate is achievable.
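As a toy illustration of this index-based scheme (our own sketch; the fragment and index sizes are arbitrary, not the asymptotic values from [noisyshuffling]):

```python
import math

def index_encode(payload_bits, frag_len, num_frags):
    """Prefix each fragment with its binary index; the rest carries payload."""
    idx_len = math.ceil(math.log2(num_frags))
    data_per_frag = frag_len - idx_len
    frags = []
    for i in range(num_frags):
        idx = format(i, f"0{idx_len}b")
        chunk = payload_bits[i * data_per_frag:(i + 1) * data_per_frag]
        frags.append(idx + chunk)
    return frags

def index_decode(shuffled_frags, num_frags):
    """Sort fragments by their index prefix and strip the indices."""
    idx_len = math.ceil(math.log2(num_frags))
    ordered = sorted(shuffled_frags, key=lambda f: int(f[:idx_len], 2))
    return "".join(f[idx_len:] for f in ordered)

# 4 fragments of length 8: 2 index bits + 6 payload bits each.
payload = "101100111000101101001101"          # 24 bits
frags = index_encode(payload, frag_len=8, num_frags=4)
assert index_decode(frags[::-1], num_frags=4) == payload  # order is irrelevant
```

Each fragment sacrifices roughly $\log(\text{number of fragments})$ of its bits to the index, which is the source of the rate loss $\alpha$ above.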
However, in our setting, the fragment lengths are random and the same index-based approach cannot be used. Because we do not know the tearing points, we cannot place indices at the beginning of each fragment. Furthermore, while the expected fragment length $1/p_n$ may be long, some fragments may be shorter than $\log(n p_n)$, and a unique index could not be placed in them even if we knew the tearing points. Our main result shows that, surprisingly, the random tearing locations and fragment lengths in fact increase the channel capacity.
Theorem 1.
The capacity of the torn-paper channel is $C = e^{-\alpha}$, where $\alpha = \lim_{n \to \infty} p_n \log n$.
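As a quick numerical sanity check (ours, not from the paper), $e^{-\alpha}$ strictly dominates $(1-\alpha)^+$ for every $\alpha > 0$ and stays positive beyond $\alpha = 1$:

```python
import math

for alpha in (0.25, 0.5, 1.0, 1.5):
    torn = math.exp(-alpha)        # capacity with random (geometric) tearing
    even = max(1.0 - alpha, 0.0)   # capacity with deterministic tearing
    assert torn > even             # strict gap for every alpha > 0
    print(f"alpha={alpha}: e^-alpha={torn:.3f}, (1-alpha)^+={even:.3f}")
```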
In Sections IV and V we prove Theorem 1. To prove the converse to this result, we exploit the fact that, for large $n$, $p_n N_1$ has an approximately exponential distribution. This, together with several concentration results, allows us to partition the set of fragments into multiple bins of fragments with roughly the same size and view the torn-paper channel, in essence, as a set of parallel channels with fixed-size fragments. Our achievability is based on random coding arguments and does not provide much insight into efficient coding schemes. This opens up interesting avenues for future research.
IV. Converse
In order to prove the converse, we first partition the input and output strings based on length. This allows us to view the torn-paper channel as a set of parallel channels, each of which involves fragments of roughly the same size. More precisely, for an integer parameter $k$, we will let

(3)  $\mathcal{Y}_i = \left\{ \vec{y} \in \mathcal{Y} : \tfrac{i}{k}\log n \le |\vec{y}| < \tfrac{i+1}{k}\log n \right\}$

for $i = 0, 1, 2, \dots$, and we will think of the transformation from $\vec{x}$ to $\mathcal{Y}_i$ as a separate channel. Notice that the $i$th channel is intuitively similar to the shuffling channel with equal-length pieces considered in [DNAStorageISIT].
We will use the fact that the number of fragments in $\mathcal{Y}_i$ concentrates as $n \to \infty$. More precisely, we let

(4)  $q_i = \Pr\left( \tfrac{i}{k}\log n \le N_1 < \tfrac{i+1}{k}\log n \right),$

and we have the following lemma, proved in Section VI.
Lemma 1.
For any $\epsilon > 0$ and $n$ large enough,
(5) 
Notice that, since $p_n \log n \to \alpha > 0$, we have $n p_n \to \infty$ as $n \to \infty$. Moreover, asymptotically, $p_n N_1$ approaches an Exp(1) distribution. This known fact is stated as the following lemma, which we also prove in Section VI.
Lemma 2.
If $N$ is a Geometric($p_n$) random variable and $p_n \to 0$, then, for any $t > 0$,

(6)  $\lim_{n \to \infty} \Pr\left( N > t/p_n \right) = e^{-t}.$
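Lemma 2 is easy to check empirically; the following Monte Carlo sketch (sample size and seed are arbitrary choices of ours) samples Geometric($p$) variables by inversion and compares tail frequencies with $e^{-t}$:

```python
import math
import random

rng = random.Random(1)
p, trials = 0.01, 200_000
# inversion sampling: N = ceil(ln U / ln(1-p)) is Geometric(p)
samples = [1 + int(math.log(rng.random()) / math.log(1 - p)) for _ in range(trials)]
for t in (0.5, 1.0, 2.0):
    tail = sum(s > t / p for s in samples) / trials
    assert abs(tail - math.exp(-t)) < 0.01  # matches the exponential tail
```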
Lemma 1 implies that $|\mathcal{Y}_i|$ concentrates around $n p_n q_i$, and

(7)  $\lim_{n \to \infty} q_i = \lim_{n \to \infty} \Pr\left( \tfrac{i}{k}\log n \le N_1 < \tfrac{i+1}{k}\log n \right) = e^{-\frac{i}{k}\alpha} - e^{-\frac{i+1}{k}\alpha},$

where the last equality follows from Lemma 2. Next, we define the event $E$ under which every $|\mathcal{Y}_i|$ is within $\epsilon n p_n$ of $n p_n q_i$, which guarantees, from Lemma 1, that $\Pr(E) \to 1$ as $n \to \infty$. Then,
(8) 
where we loosely upper bound $H(\mathcal{Y})$ by $2n$, since $\mathcal{Y}$ can be fully described by the binary string $\vec{x}$ and the tearing-point indicators $Z_1, \dots, Z_{n-1}$.
In order to bound the entropy of $\mathcal{Y}_i$ given that its size is close to $n p_n q_i$, we first note that the number of possible distinct sequences in $\mathcal{Y}_i$ is at most $2 n^{(i+1)/k}$.
Moreover, given $E$,
(9) 
and the set $\mathcal{Y}_i$ can be seen as a histogram over all possible strings with lengths in the $i$th bin. Notice that we can view the last element of the histogram as containing “excess counts” if $|\mathcal{Y}_i|$ falls short of its maximum value. Hence, from Lemma 1 in [DNAStorageISIT],
(10) 
From (9), we have $|\mathcal{Y}_i| \to \infty$ as $n \to \infty$. Combining (8) and (10), dividing by $n$, and letting $n \to \infty$ yields
(11) 
In order to bound an achievable rate $R$, we use Fano’s inequality to obtain
(12) 
and we conclude that any achievable rate must satisfy the resulting bound. In order to connect (12) and (11), we state the following lemma, which allows us to move the limit inside the summation. The proof is in Section VI.
Lemma 3.
If $\mathcal{Y}_i$ is defined as in (3) for $i = 0, 1, 2, \dots$,
Using this lemma and (11), we can upper bound any achievable rate $R$ as
(13) 
where the last equality is due to a telescoping sum. The remaining summation can be computed as
We conclude that any achievable rate must satisfy the resulting bound for any positive integer $k$. Since the bound converges to $e^{-\alpha}$ as $k \to \infty$, we obtain the outer bound $C \le e^{-\alpha}$.
V. Achievability via Random Coding
A random coding argument can be used to show that any rate $R < e^{-\alpha}$ is achievable. Consider generating a codebook with $2^{nR}$ codewords, by independently picking each symbol as Bernoulli($1/2$). Let $\vec{x} = \vec{x}(W)$, where $\vec{x}(m)$ is the random codeword associated with message $m$. Notice that optimal decoding can be obtained by simply finding an index $m$ such that $\vec{x}(m)$ corresponds to a concatenation of the strings in $\mathcal{Y}$. If more than one such codeword exists, an error is declared.
Suppose message $W$ is chosen and $\mathcal{Y}$ is the random set of output strings. To bound the error probability, we consider a suboptimal decoder that throws out all fragments shorter than $\beta/p_n$, for some $\beta > 0$ to be determined, and simply tries to find a codeword that contains all remaining output strings as non-overlapping substrings. If we let $\mathcal{E}$ be the error event averaged over all codebook choices, we have
Using a similar approach to the one used in Section IV, it can be shown that the relevant fragment statistics concentrate around their means. From Lemma 2, we thus have
(14) 
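A brute-force toy version of this suboptimal decoder (our own illustrative implementation; the paper only needs its existence for the probabilistic argument) can be written as follows:

```python
def contains_nonoverlapping(codeword, fragments):
    """Check whether all fragments can be placed in codeword as
    non-overlapping substrings (backtracking; fine for toy sizes)."""
    def place(frags, occupied):
        if not frags:
            return True
        f, rest = frags[0], frags[1:]
        for s in range(len(codeword) - len(f) + 1):
            span = set(range(s, s + len(f)))
            if codeword[s:s + len(f)] == f and not (span & occupied):
                if place(rest, occupied | span):
                    return True
        return False
    return place(sorted(fragments, key=len, reverse=True), set())

def decode(received, codebook, min_len):
    """Suboptimal decoder: drop fragments shorter than min_len, then find
    a codeword containing all remaining fragments; error (None) on ties."""
    kept = [f for f in received if len(f) >= min_len]
    hits = [m for m, cw in enumerate(codebook)
            if contains_nonoverlapping(cw, kept)]
    return hits[0] if len(hits) == 1 else None

codebook = ["0000111100", "1010101010", "1100110011"]
# fragments torn from codeword 2; the length-1 fragment is discarded
assert decode(["110011", "0011", "1"], codebook, min_len=2) == 2
```

On realistic block lengths this search is exponential; it is only meant to make the decoding rule concrete.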
If we let $V_i$ be the binary indicator of the event $\{N_i \ge \beta/p_n\}$, then the number of fragments kept by the decoder is $\tilde{K} = \sum_{i=1}^{K} V_i$. In Section VI, we prove the following concentration result.
Lemma 4.
For any $\epsilon > 0$, as $n \to \infty$,
(15) 
In addition to characterizing the number of long fragments asymptotically, we will also be interested in the total length of the sequences kept by the decoder. Intuitively, this determines how well these fragments cover their codeword of origin.
Definition 1.
The coverage of the set of fragments of length at least $\beta/p_n$ is defined as

(16)  $\mathrm{cov} = \frac{1}{n} \sum_{i=1}^{K} N_i\, \mathbb{1}\{ N_i \ge \beta/p_n \}.$

Notice that $\mathrm{cov} \le 1$ with probability $1$.
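In code, Definition 1 amounts to the following one-liner (with hypothetical fragment values for illustration):

```python
def coverage(fragments, n, min_len=0):
    """Fraction of the n input bits covered by fragments of length >= min_len."""
    return sum(len(f) for f in fragments if len(f) >= min_len) / n

frags = ["110", "01", "10011", "1", "0110"]  # a tearing of a 15-bit string
assert coverage(frags, n=15) == 1.0          # the full fragment set covers x
assert coverage(frags, n=15, min_len=3) == 12 / 15
```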
In order to characterize the coverage asymptotically, we will again resort to the exponential approximation to the geometric distribution, through the following lemma.
Lemma 5.
If $N$ is a Geometric($p_n$) random variable and $p_n \to 0$, then, for any $t > 0$,

(17)  $\lim_{n \to \infty} p_n\, \mathbb{E}\!\left[ N\, \mathbb{1}\{ N \ge t/p_n \} \right] = \mathbb{E}\!\left[ E\, \mathbb{1}\{ E \ge t \} \right],$

where $E$ is an Exp(1) random variable.
Using Lemma 5, we can characterize the asymptotic value of the coverage and show that it concentrates around this value. More precisely, we show the following lemma in Section VI.
Lemma 6.
For any $\epsilon > 0$, as $n \to \infty$,
(18) 
In particular, Lemma 6 implies that

(19)  $\lim_{n \to \infty} \mathbb{E}[\mathrm{cov}] = (1 + \beta)\, e^{-\beta},$
and that the coverage cannot deviate much from this value with high probability. If we let $\epsilon > 0$ be small, and we define the event
(20) 
then (15) and (18) imply that the probability of this event tends to $1$ as $n \to \infty$. Since $\mathcal{Y}$ is independent of the codewords $\vec{x}(m)$ for $m \ne W$, we can upper bound the probability of error as
Inequality follows from the union bound and from the fact that there are at most $n^{\tilde{K}}$ ways (where $\tilde{K}$ is the number of fragments kept by the decoder) to align the kept strings to a codeword in a non-overlapping way and, given this alignment, the bits of the codeword not covered by fragments must be specified. Since these bounds hold with probability tending to $1$ as $n \to \infty$, we see that we can achieve a rate $R$ as long as
for some $\beta > 0$ and $\epsilon > 0$. Letting $\epsilon \to 0$ yields
for some $\beta > 0$. The right-hand side is maximized by setting $\beta = \alpha$, which implies that we can achieve any rate $R < e^{-\alpha}$.
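Reading the final optimization as maximizing $(1 + \beta - \alpha)e^{-\beta}$ over $\beta$ (our reconstruction of the right-hand side, combining the coverage $(1+\beta)e^{-\beta}$ with an alignment cost $\alpha e^{-\beta}$; treat the exact form as an assumption), a grid search confirms that the optimum sits at $\beta = \alpha$ with value $e^{-\alpha}$:

```python
import math

def rate(beta, alpha):
    # assumed form: coverage (1+beta)e^-beta minus alignment cost alpha*e^-beta
    return (1 + beta - alpha) * math.exp(-beta)

alpha = 0.7
best_val, best_beta = max((rate(b / 1000, alpha), b / 1000) for b in range(3000))
assert abs(best_val - math.exp(-alpha)) < 1e-6
assert abs(best_beta - alpha) < 1e-2
```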
VI. Proofs of Lemmas
Lemma 1.
The number of fragments in $\mathcal{Y}_i$ satisfies the concentration bound (5) for any $\epsilon > 0$ and $n$ large enough.
Proof of Lemma 1.
First notice that, since $K = 1 + \sum_{i=1}^{n-1} Z_i$, where the $Z_i$ are i.i.d. Bernoulli($p_n$) random variables, and using Hoeffding’s inequality,
(21) 
where the last inequality holds for $n$ large enough.
Now suppose the sequence of independent random variables $N_1, N_2, \dots$ is an infinite sequence (and does not stop at index $K$). Let $U_j$ be the binary indicator of the event $\left\{ \tfrac{i}{k}\log n \le N_j < \tfrac{i+1}{k}\log n \right\}$, and let $S_m = \sum_{j=1}^{m} U_j$. Intuitively, $|\mathcal{Y}_i|$ and $S_{\lceil n p_n \rceil}$ should be close. In particular, $S_m$ is a sum of i.i.d. Bernoulli($q_i$) random variables, so Hoeffding’s inequality applies and $\mathbb{E}\big[S_{\lceil n p_n \rceil}\big]$ is close to $n p_n q_i$. If $K$ is close to $n p_n$ and $S_{\lceil n p_n \rceil}$ is close to its mean, by the triangle inequality, $|\mathcal{Y}_i|$ is close to $n p_n q_i$. Therefore,
where we used Hoeffding’s inequality and (21). ∎
Lemma 2.
If $N$ is a Geometric($p_n$) random variable and $p_n \to 0$, then $\Pr(N > t/p_n) \to e^{-t}$ for any $t > 0$.
Proof of Lemma 2.
By definition, $\Pr\left( N > t/p_n \right) = (1 - p_n)^{\lfloor t/p_n \rfloor}$. As $n \to \infty$, $p_n \to 0$ and $p_n \lfloor t/p_n \rfloor \to t$. Hence, $(1 - p_n)^{\lfloor t/p_n \rfloor} \to e^{-t}$, implying the lemma. ∎
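The limit used in this proof can be confirmed numerically (a sanity check of ours, not part of the paper):

```python
import math

# (1 - p)^floor(t/p) -> e^{-t} as p -> 0
t = 1.5
for p in (0.1, 0.01, 0.001, 0.0001):
    approx = (1 - p) ** math.floor(t / p)
    print(f"p={p}: {approx:.6f} vs e^-t = {math.exp(-t):.6f}")
assert abs((1 - 0.0001) ** math.floor(t / 0.0001) - math.exp(-t)) < 1e-3
```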
Lemma 3.
If $\mathcal{Y}_i$ is defined as in (3) for $i = 0, 1, 2, \dots$,
Proof of Lemma 3.
For a fixed integer $m$, we define the sum in Lemma 3 truncated at index $m$, and we have
(22) 
If we define the coverage as in Definition 1, from Lemma 6, we have
Moreover, from Lemma 6, the event that the coverage deviates significantly from its asymptotic value has vanishing probability as $n \to \infty$. This allows us to write
Hence, from (22), we have that for every $\epsilon > 0$ and $m$,
Notice that the remainder term vanishes as $m \to \infty$. Therefore, we can let $\epsilon \to 0$ and $m \to \infty$, and we conclude that
∎
Lemma 4.
The number of fragments of length at least $\beta/p_n$ satisfies the concentration bound (15) for any $\epsilon > 0$ and $n$ large enough.
Proof of Lemma 4.
Let $V_i = \mathbb{1}\{ N_i \ge \beta/p_n \}$, for $i = 1, \dots, K$. Then $\tilde{K} = \sum_{i=1}^{K} V_i$ is the number of fragments kept by the decoder. Since $K$ is random (and not independent of the $V_i$s), we need to follow similar steps to those in the proof of Lemma 1.
Let us assume that the sequence of independent random variables $N_1, N_2, \dots$ is an infinite sequence and let $\tilde{S}_m = \sum_{i=1}^{m} V_i$. Notice that $\tilde{S}_m$ is a sum of i.i.d. Bernoulli random variables with
(23) 
and the standard Hoeffding inequality can be applied. Moreover, from Lemma 2, $\Pr(N_1 \ge \beta/p_n) \to e^{-\beta}$ and, for any $\epsilon > 0$, $\left| \Pr(N_1 \ge \beta/p_n) - e^{-\beta} \right| \le \epsilon$ for $n$ large enough. If we set $m = \lceil n p_n \rceil$ then, for $n$ large enough, $\mathbb{E}[\tilde{S}_m]$ is close to $n p_n e^{-\beta}$. Moreover, if $K$ is close to $n p_n$ and $\tilde{S}_m$ is close to its mean, by the triangle inequality (applied twice), $\tilde{K}$ is close to $n p_n e^{-\beta}$. Hence,
where we used Hoeffding’s inequality and (21). ∎
Lemma 5.
If $N$ is a Geometric($p_n$) random variable and $p_n \to 0$, then, for any $t > 0$,