Communicating over the Torn-Paper Channel

05/26/2020 ∙ by Ilan Shomorony, et al. ∙ University of Illinois at Urbana-Champaign

We consider the problem of communicating over a channel that randomly "tears" the message block into small pieces of different sizes and shuffles them. For the binary torn-paper channel with block length n and pieces of length Geometric(p_n), we characterize the capacity as C = e^-α, where α = lim_n→∞ p_n log n. Our results show that the case of Geometric(p_n)-length fragments and the case of deterministic length-(1/p_n) fragments are qualitatively different and, surprisingly, the capacity of the former is larger. Intuitively, this is due to the fact that, in the random fragments case, large fragments are sometimes observed, which boosts the capacity.


I Introduction

Consider the problem of transmitting a message by writing it on a piece of paper, which will be torn into small pieces of random sizes and randomly shuffled. This coding problem is illustrated in Figure 1. We refer to it as torn-paper coding, in allusion to the classic dirty-paper coding problem [dirtypaper].

This problem is mainly motivated by macromolecule-based (and in particular DNA-based) data storage, which has recently received significant attention due to several proof-of-concept DNA storage systems [church_next-generation_2012, goldman_towards_2013, grass_robust_2015, bornholt_dna-based_2016, erlich_dna_2016, organick_scaling_2017]. In these systems, data is written onto synthesized DNA molecules, which are then stored in solution. During synthesis and storage, molecules in solution are subject to random breaks and, due to the unordered nature of macromolecule-based storage, the resulting pieces are shuffled [heckel_characterization_2018]. Furthermore, the data is read via high-throughput sequencing technologies, a process typically preceded by physical fragmentation of the DNA with techniques like sonication [pomraning2012library]. In addition, the torn-paper channel is related to the DNA shotgun sequencing channel studied in [MotahariDNA, BBT, gabrys2018unique], but with variable-length reads, such as those produced by nanopore sequencing technologies [laver2015assessing, mao2018models].

Fig. 1: The torn-paper channel.

We consider the scenario where the channel input is a length-$n$ binary string, which is then torn into pieces of lengths $N_1, N_2, \dots$, each of which has a $\mathrm{Geometric}(p_n)$ distribution. The channel output is the unordered set of these pieces. As we will see, even this noise-free version of the torn-paper coding problem is non-trivial.

To obtain some intuition, notice that $E[N_i] = 1/p_n$, and hence it is reasonable to compare our problem to the case where the tearing points are evenly separated, and $N_i = 1/p_n$ for all $i$ with probability $1$. In this case, the channel becomes a shuffling channel, similar to the one considered in [noisyshuffling], but with no noise. Coding for the case of deterministic fragments of length $1/p_n$ is easy: since the tearing points are known, we can prefix each fragment with a unique identifier, which allows the decoder to correctly order the fragments. From the results in [noisyshuffling], such an index-based coding scheme is capacity-optimal, and any achievable rate in this case must satisfy, for large $n$,

$R \le 1 - p_n \log(n p_n). \quad (1)$

If we let $\alpha = \lim_{n\to\infty} p_n \log n$, the capacity for this case becomes $(1-\alpha)^+$.

It is not clear a priori whether the capacity of the torn-paper channel should be higher or lower than $(1-\alpha)^+$. The fact that the tearing points are not known to the encoder makes it challenging to place a unique identifier in each fragment, suggesting that the torn-paper channel is "harder" and should have a lower capacity. The main result of this paper contradicts this intuition and shows that the capacity of the torn-paper channel with $\mathrm{Geometric}(p_n)$-length fragments is higher than $(1-\alpha)^+$. More precisely, we show that the capacity of the torn-paper channel is $C = e^{-\alpha}$. Intuitively, this boost in capacity comes from the tail of the geometric distribution, which guarantees that a fraction of the fragments will be significantly larger than the mean $1/p_n$. This allows the capacity to be positive even for $\alpha \ge 1$, in which case the capacity of the deterministic-tearing case in (1) becomes $0$.
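For a concrete feel for this comparison, the following small computation (ours, not from the paper) evaluates both expressions; $e^{-\alpha}$ dominates $(1-\alpha)^+$ for every $\alpha > 0$ and stays positive past $\alpha = 1$.

```python
# Compare the deterministic-tearing capacity (1 - alpha)^+ from (1)
# with the torn-paper capacity e^{-alpha} from Theorem 1.
import math

for alpha in [0.25, 0.5, 1.0, 2.0]:
    det_cap = max(0.0, 1.0 - alpha)   # deterministic fragments of length 1/p_n
    torn_cap = math.exp(-alpha)       # Geometric(p_n) fragments
    print(f"alpha={alpha:4.2f}  deterministic={det_cap:.3f}  torn-paper={torn_cap:.3f}")
```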

II Problem Setting

We consider the problem of coding for the torn-paper channel, illustrated in Figure 1. The transmitter encodes a message $W \in \{1, \dots, 2^{nR}\}$ into a length-$n$ binary codeword $X^n$. The channel output is a set of binary strings

$\mathcal{Y} = \left\{ \vec Y_1, \dots, \vec Y_K \right\}. \quad (2)$

The process by which $\mathcal{Y}$ is obtained is described next.

  1. The channel tears the input sequence $X^n$ into segments of $\mathrm{Geometric}(p_n)$ length, for a tearing probability $p_n$. More specifically, let $N_1, N_2, \dots$ be i.i.d. $\mathrm{Geometric}(p_n)$ random variables. Let $K$ be the smallest index such that $N_1 + \dots + N_K \ge n$. Notice that $K$ is also a random variable.

    The channel tears $X^n$ into segments $\vec X_1, \dots, \vec X_K$, where

    $\vec X_i = \left( X_{N_1 + \dots + N_{i-1} + 1}, \dots, X_{N_1 + \dots + N_i} \right)$

    for $i < K$, and $\vec X_K = \left( X_{N_1 + \dots + N_{K-1} + 1}, \dots, X_n \right)$.

    We note that this process is equivalent to independently tearing the message in between consecutive bits with probability $p_n$ (see the simulation sketch after this list). More precisely, let $Z_1, \dots, Z_{n-1}$ be binary indicators of whether there is a cut between $X_i$ and $X_{i+1}$. Then, letting the $Z_i$s be i.i.d. $\mathrm{Bernoulli}(p_n)$ random variables results in independent fragments of $\mathrm{Geometric}(p_n)$ length. Also, $K = 1 + \sum_{i=1}^{n-1} Z_i$, implying that $E[K] = 1 + (n-1) p_n$.

  2. Given $K$, let $\pi$ be a uniformly distributed random permutation on $\{1, \dots, K\}$. The output segments are then obtained by setting, for $i = 1, \dots, K$, $\vec Y_i = \vec X_{\pi(i)}$.
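The tearing process above is straightforward to simulate; the following minimal sketch (our illustration, with names of our choosing) uses the equivalent Bernoulli($p_n$) cut-indicator description:

```python
# Simulate the torn-paper channel: i.i.d. Bernoulli(p) cuts between
# consecutive bits, followed by a uniformly random shuffle of the pieces.
import random

def torn_paper_channel(x, p, rng):
    cuts = [0] + [i for i in range(1, len(x)) if rng.random() < p] + [len(x)]
    fragments = [x[a:b] for a, b in zip(cuts, cuts[1:])]  # lengths are Geometric(p)
    rng.shuffle(fragments)                                # uniform random permutation pi
    return fragments

rng = random.Random(0)
n, p = 10**4, 0.01                                        # mean fragment length 1/p = 100
x = "".join(rng.choice("01") for _ in range(n))
fragments = torn_paper_channel(x, p, rng)
print(len(fragments), sum(len(f) for f in fragments))     # K pieces covering all n bits
```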

We note that there are no bit-level errors, e.g., bit flips, in this process. We also point out that we allow the tearing probability to be a function of the block length $n$; hence the subscript $n$ in $p_n$.

A code with rate $R$ for the torn-paper channel is a set of $2^{nR}$ binary codewords, each of length $n$, together with a decoding procedure that maps a set of variable-length binary strings $\mathcal{Y}$ to an index $\hat W \in \{1, \dots, 2^{nR}\}$. The message $W$ is assumed to be chosen uniformly at random from $\{1, \dots, 2^{nR}\}$, and the error probability of a code is defined accordingly as $\Pr(\hat W \ne W)$. A rate $R$ is said to be achievable if there exists a sequence of rate-$R$ codes $\mathcal{C}_n$, with blocklength $n$, whose error probability tends to $0$ as $n \to \infty$. The capacity $C$ is defined as the supremum over all achievable rates. Notice that $C$ should be a function of the sequence of tearing probabilities $\{p_n\}$.

Notation: Throughout the paper, $\log$ represents the logarithm base $2$, while $\ln$ represents the natural logarithm. For functions $f$ and $g$, we write $f(n) \sim g(n)$ if $f(n)/g(n) \to 1$ as $n \to \infty$. For an event $A$, we let $\mathbb{1}_A$ or $\mathbb{1}\{A\}$ be the binary indicator of $A$.

III Main Results

If the encoder had access to the tearing locations ahead of time, a natural coding scheme would involve placing unique indices on every fragment, and using the remaining bits for encoding a message. In particular, if the message block broke evenly into pieces of length $1/p_n$, results from [noisyshuffling] imply that placing a unique index of length $\log(n p_n)$ in each fragment is capacity optimal. In this case, the capacity is $(1-\alpha)^+$, where $\alpha = \lim_{n\to\infty} p_n \log n$ (assuming the limit exists). If $\alpha \ge 1$, no positive rate is achievable.

However, in our setting, the fragment lengths are random and the same index-based approach cannot be used. Because we do not know the tearing points, we cannot place indices at the beginning of each fragment. Furthermore, while the expected fragment length $1/p_n$ may be long, some fragments may be shorter than $\log(n p_n)$, and a unique index could not be placed in them even if we knew the tearing points. Our main result shows that, surprisingly, the random tearing locations and fragment lengths in fact increase the channel capacity.

Theorem 1.

The capacity of the torn-paper channel is

$C = e^{-\alpha},$

where $\alpha = \lim_{n\to\infty} p_n \log n$.

In Sections IV and V we prove Theorem 1. To prove the converse to this result, we exploit the fact that, for large $n$, the normalized fragment length $p_n N_i$ has an approximately exponential distribution. This, together with several concentration results, allows us to partition the set of fragments into multiple bins of fragments with roughly the same size and view the torn-paper channel, in essence, as a set of parallel channels with fixed-size fragments. Our achievability proof is based on random coding arguments and does not provide much insight into efficient coding schemes. This opens up interesting avenues for future research.

IV Converse

In order to prove the converse, we first partition the input and output strings based on length. This allows us to view the torn-paper channel as a set of parallel channels, each of which involves fragments of roughly the same size. More precisely, for an integer parameter $T$, we will let

$\mathcal{Y}_t = \left\{ \vec Y_i : \frac{t}{T p_n} \le |\vec Y_i| < \frac{t+1}{T p_n} \right\} \quad (3)$

for $t = 0, 1, 2, \dots$, and we will think of the transformation from $X^n$ to $\mathcal{Y}_t$ as a separate channel. Notice that the $t$th channel is intuitively similar to the shuffling channel with equal-length pieces considered in [DNAStorageISIT].

We will use the fact that the number of fragments $M_t = |\mathcal{Y}_t|$ in $\mathcal{Y}_t$ concentrates as $n \to \infty$. More precisely, we let

$\bar m_t = n p_n \Pr\left( \frac{t}{T p_n} \le N_1 < \frac{t+1}{T p_n} \right), \quad (4)$

and we have the following lemma, proved in Section VI.

Lemma 1.

For any $\epsilon > 0$ and $n$ large enough,

$\Pr\left( |M_t - \bar m_t| > \epsilon n p_n \right) \le 4 e^{-\epsilon^2 n p_n^2 / 8}. \quad (5)$
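As an empirical sanity check on this concentration claim (a sketch premised on our reconstruction of (3) and (4); the parameters are arbitrary), one can bin simulated Geometric($p$) fragment lengths and compare the counts $M_t$ with the limiting values $n p_n (e^{-t/T} - e^{-(t+1)/T})$:

```python
# Compare bin counts M_t with the limiting values n*p*(e^{-t/T} - e^{-(t+1)/T}).
import math, random

rng = random.Random(1)
n, p, T = 10**6, 1e-3, 4
lengths, total = [], 0
while total < n:   # draw Geometric(p) fragment lengths until n bits are covered
    N = 1 + int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
    lengths.append(min(N, n - total))
    total += lengths[-1]

for t in range(3 * T):
    lo, hi = t / (T * p), (t + 1) / (T * p)
    M_t = sum(lo <= ell < hi for ell in lengths)
    bar_m_t = n * p * (math.exp(-t / T) - math.exp(-(t + 1) / T))
    print(f"t={t:2d}  M_t={M_t:6d}  limit={bar_m_t:8.1f}")
```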

Notice that, since $n p_n^2 \to \infty$, the right-hand side of (5) tends to $0$ as $n \to \infty$. Moreover, asymptotically, $p_n N_i$ approaches an $\mathrm{Exp}(1)$ distribution. This known fact is stated as the following lemma, which we also prove in Section VI.

Lemma 2.

If $N$ is a $\mathrm{Geometric}(p_n)$ random variable and $p_n \to 0$ as $n \to \infty$, then, for any $t > 0$,

$\lim_{n\to\infty} \Pr\left( N \ge t/p_n \right) = e^{-t}. \quad (6)$
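The geometric tail can be checked directly; the snippet below (ours) evaluates $\Pr(N \ge t/p_n) = (1-p_n)^{\lceil t/p_n \rceil - 1}$ for shrinking $p_n$ and compares it with $e^{-t}$:

```python
# Lemma 2 numerically: the Geometric(p) tail at t/p approaches e^{-t} as p -> 0.
import math

t = 1.5
for p in [1e-1, 1e-2, 1e-3, 1e-4]:
    tail = (1.0 - p) ** (math.ceil(t / p) - 1)
    print(f"p={p:.0e}  P(N >= t/p)={tail:.5f}  e^-t={math.exp(-t):.5f}")
```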

Lemma 1 implies that $M_t$ concentrates around $\bar m_t$, and

$\lim_{n\to\infty} \frac{\bar m_t}{n p_n} = \lim_{n\to\infty} \Pr\left( \frac{t}{T p_n} \le N_1 < \frac{t+1}{T p_n} \right) = e^{-t/T} - e^{-(t+1)/T}, \quad (7)$

where the last equality follows from Lemma 2. Next, we define the event $\mathcal{B}_t = \left\{ |M_t - \bar m_t| \le \epsilon n p_n \right\}$, where $\epsilon > 0$, which guarantees that, as $n \to \infty$, $\Pr(\mathcal{B}_t^c) \to 0$ from Lemma 1. Then,

$H(\mathcal{Y}_t) \le 1 + H(\mathcal{Y}_t \mid \mathbb{1}_{\mathcal{B}_t}) \le 1 + H(\mathcal{Y}_t \mid \mathcal{B}_t) + \Pr(\mathcal{B}_t^c) \cdot 2n, \quad (8)$

where we loosely upper bound $H(\mathcal{Y}_t \mid \mathcal{B}_t^c)$ with $2n$, since $\mathcal{Y}$ can be fully described by the binary string $X^n$ and the tearing-point indicators $Z_1, \dots, Z_{n-1}$.

In order to bound $H(\mathcal{Y}_t \mid \mathcal{B}_t)$, i.e., the entropy of $\mathcal{Y}_t$ given that its size is close to $\bar m_t$, we first note that the number of possible distinct sequences in $\mathcal{Y}_t$ is

$\sum_{\ell = \lceil t/(T p_n) \rceil}^{\lceil (t+1)/(T p_n) \rceil - 1} 2^{\ell} \le 2^{(t+1)/(T p_n) + 1}.$

Moreover, given $\mathcal{B}_t$,

$M_t \le \bar m_t + \epsilon n p_n =: m_t^+, \quad (9)$

and the set $\mathcal{Y}_t$ can be seen as a histogram over all possible strings with lengths in $[t/(T p_n), (t+1)/(T p_n))$. Notice that we can view the last element of the histogram as containing "excess counts" if $M_t < m_t^+$. Hence, from Lemma 1 in [DNAStorageISIT],

$H(\mathcal{Y}_t \mid \mathcal{B}_t) \le m_t^+ \left( \frac{t+1}{T p_n} - \log m_t^+ + O(1) \right). \quad (10)$

From (9), we have $m_t^+ / (n p_n) \to e^{-t/T} - e^{-(t+1)/T} + \epsilon$ as $n \to \infty$. Combining (8) and (10), dividing by $n$, and letting $n \to \infty$ (and then $\epsilon \to 0$) yields

$\lim_{n\to\infty} \frac{1}{n} H(\mathcal{Y}_t) \le \left( e^{-t/T} - e^{-(t+1)/T} \right) \left( \frac{t+1}{T} - \alpha \right)^+. \quad (11)$

In order to bound an achievable rate $R$, we use Fano's inequality to obtain

$nR \le I(W; \mathcal{Y}) + n \epsilon_n \le H(\mathcal{Y}) + n \epsilon_n \le \sum_{t=0}^{\infty} H(\mathcal{Y}_t) + n \epsilon_n, \quad (12)$

where $\epsilon_n \to 0$ as $n \to \infty$, and we conclude that any achievable rate must satisfy $R \le \lim_{n\to\infty} \sum_{t=0}^{\infty} \frac{1}{n} H(\mathcal{Y}_t)$. In order to connect (12) and (11), we state the following lemma, which allows us to move the limit inside the summation. The proof is in Section VI.

Lemma 3.

If $\mathcal{Y}_t$ is defined as in (3) for $t = 0, 1, 2, \dots$, then

$\lim_{n\to\infty} \sum_{t=0}^{\infty} \frac{1}{n} H(\mathcal{Y}_t) = \sum_{t=0}^{\infty} \lim_{n\to\infty} \frac{1}{n} H(\mathcal{Y}_t).$

Using this lemma and (11), we can upper bound any achievable rate as

$R \le \sum_{t=0}^{\infty} \lim_{n\to\infty} \frac{1}{n} H(\mathcal{Y}_t) \le \sum_{t=t_0}^{\infty} \left( e^{-t/T} - e^{-(t+1)/T} \right) \left( \frac{t+1}{T} - \alpha \right) = e^{-t_0/T} \left( \frac{t_0+1}{T} - \alpha \right) + \frac{1}{T} \sum_{t=t_0+1}^{\infty} e^{-t/T}, \quad (13)$

where $t_0 = \lceil \alpha T \rceil - 1$ is the smallest $t$ for which $(t+1)/T > \alpha$, and the last equality is due to a telescoping sum. The remaining summation can be computed as

$\frac{1}{T} \sum_{t=t_0+1}^{\infty} e^{-t/T} = \frac{1}{T} \cdot \frac{e^{-(t_0+1)/T}}{1 - e^{-1/T}}.$

We conclude that any achievable rate must satisfy

$R \le e^{-t_0/T} \left( \frac{t_0+1}{T} - \alpha \right) + \frac{1}{T} \cdot \frac{e^{-(t_0+1)/T}}{1 - e^{-1/T}}$

for any positive integer $T$. Since

$\lim_{T\to\infty} \left[ e^{-t_0/T} \left( \frac{t_0+1}{T} - \alpha \right) + \frac{1}{T} \cdot \frac{e^{-(t_0+1)/T}}{1 - e^{-1/T}} \right] = e^{-\alpha},$

we obtain the outer bound $C \le e^{-\alpha}$.
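As a numeric sanity check on the finite-$T$ bound (under our reconstruction of (11) and (13); the constants are not from the paper), one can evaluate the summation directly and watch it decrease toward $e^{-\alpha}$:

```python
# Evaluate sum_t (e^{-t/T} - e^{-(t+1)/T}) * max(0, (t+1)/T - alpha) for growing T.
import math

alpha = 0.7
for T in [1, 2, 5, 20, 100, 1000]:
    bound = sum((math.exp(-t / T) - math.exp(-(t + 1) / T)) * max(0.0, (t + 1) / T - alpha)
                for t in range(200 * T))
    print(f"T={T:5d}  bound={bound:.6f}  e^-alpha={math.exp(-alpha):.6f}")
```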

V Achievability via Random Coding

A random coding argument can be used to show that any rate $R < e^{-\alpha}$ is achievable. Consider generating a codebook $\mathcal{C}$ with $2^{nR}$ codewords, by independently picking each symbol as $\mathrm{Bernoulli}(1/2)$. Let $\mathcal{C} = \{X^n(1), \dots, X^n(2^{nR})\}$, where $X^n(w)$ is the random codeword associated with message $w$. Notice that optimal decoding can be obtained by simply finding an index $w$ such that $X^n(w)$ corresponds to a concatenation of the strings in $\mathcal{Y}$ (a toy implementation is sketched below). If more than one such codeword exists, an error is declared.
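For intuition, here is a toy brute-force version of this decoding rule (entirely our illustration: the tiny blocklength, codebook size, and helper names are for demonstration only, and a practical scheme would need to be far more efficient):

```python
# Toy decoder: find the unique codeword that can be reassembled, in some
# order, from the shuffled fragments of the torn-paper channel.
import random

rng = random.Random(3)

def tear(x, p):
    cuts = [0] + [i for i in range(1, len(x)) if rng.random() < p] + [len(x)]
    frags = [x[a:b] for a, b in zip(cuts, cuts[1:])]
    rng.shuffle(frags)
    return frags

def can_assemble(frags, cw):
    # Backtracking: try each remaining fragment as the next prefix of cw.
    if not frags:
        return cw == ""
    return any(cw.startswith(f) and can_assemble(frags[:i] + frags[i + 1:], cw[len(f):])
               for i, f in enumerate(frags))

n, p = 24, 0.15
codebook = ["".join(rng.choice("01") for _ in range(n)) for _ in range(64)]
w = rng.randrange(len(codebook))
frags = tear(codebook[w], p)
hits = [v for v, cw in enumerate(codebook) if can_assemble(frags, cw)]
print("sent:", w, " decoded:", hits[0] if len(hits) == 1 else "error")
```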

Suppose message $W = 1$ is chosen and $\mathcal{Y}$ is the random set of output strings. To bound the error probability we consider a suboptimal decoder that throws out all fragments shorter than $\beta \log n$, for some $\beta > 0$ to be determined, and simply tries to find a codeword that contains all remaining output strings as non-overlapping substrings. Let $\mathcal{Y}'$ be the set of fragments of length at least $\beta \log n$ and let $K' = |\mathcal{Y}'|$. If we let $\mathcal{E}$ be the error event averaged over all codebook choices, we have

$\Pr(\mathcal{E}) \le \sum_{w=2}^{2^{nR}} \Pr\left( X^n(w) \text{ contains all strings in } \mathcal{Y}' \text{ as non-overlapping substrings} \right).$

Using a similar approach to the one used in Section IV, it can be shown that $K'$ concentrates around its expected value. From Lemma 2, we thus have

$\Pr\left( N_i \ge \beta \log n \right) = \Pr\left( p_n N_i \ge \beta p_n \log n \right) \to e^{-\beta\alpha}. \quad (14)$

If we let $B_i$ be the binary indicator of the event $\left\{ N_i \ge \beta \log n \right\}$, then $K' = \sum_{i=1}^{K} B_i$. In Section VI, we prove the following concentration result.

Lemma 4.

For any $\epsilon > 0$, as $n \to \infty$,

$\Pr\left( \left| K' - n p_n e^{-\beta\alpha} \right| > \epsilon n p_n \right) \to 0. \quad (15)$

In addition to characterizing $K'$ asymptotically, we will also be interested in the total length of the sequences in $\mathcal{Y}'$. Intuitively, this determines how well the fragments in $\mathcal{Y}'$ cover their codeword of origin $X^n(1)$.

Definition 1.

The coverage of $\mathcal{Y}'$ is defined as

$\mathrm{cov}(\mathcal{Y}') = \sum_{\vec Y_i \in \mathcal{Y}'} |\vec Y_i|. \quad (16)$

Notice that $\mathrm{cov}(\mathcal{Y}) = n$ with probability $1$.

In order to characterize $\mathrm{cov}(\mathcal{Y}')$ asymptotically, we will again resort to the exponential approximation to a geometric distribution, through the following lemma.

Lemma 5.

If $N$ is a $\mathrm{Geometric}(p_n)$ random variable and $p_n \to 0$ as $n \to \infty$, then, for any $t > 0$,

$\lim_{n\to\infty} E\left[ p_n N \cdot \mathbb{1}\left\{ p_n N \ge t \right\} \right] = E\left[ S \cdot \mathbb{1}\left\{ S \ge t \right\} \right] = (1+t) e^{-t}, \quad (17)$

where $S$ is an $\mathrm{Exp}(1)$ random variable.
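A quick Monte Carlo check of this limit (ours; the closed form follows from $\int_t^\infty s e^{-s} \, ds = (1+t)e^{-t}$):

```python
# Estimate E[p*N * 1{p*N >= t}] for N ~ Geometric(p) and compare to (1+t)e^{-t}.
import math, random

rng = random.Random(2)
t, p, trials = 1.0, 1e-3, 200_000
acc = 0.0
for _ in range(trials):
    N = 1 + int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
    if p * N >= t:
        acc += p * N
print(f"empirical={acc / trials:.4f}   (1+t)e^-t={(1 + t) * math.exp(-t):.4f}")
```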

Using Lemma 5, we can characterize the asymptotic value of $\mathrm{cov}(\mathcal{Y}')$ and show that $\mathrm{cov}(\mathcal{Y}')$ concentrates around this value. More precisely, we show the following lemma in Section VI.

Lemma 6.

For any $\epsilon > 0$, as $n \to \infty$,

$\Pr\left( \left| \mathrm{cov}(\mathcal{Y}') - n (1 + \beta\alpha) e^{-\beta\alpha} \right| > \epsilon n \right) \to 0. \quad (18)$

In particular, Lemma 6 implies that

$E\left[ \mathrm{cov}(\mathcal{Y}') \right] = n (1 + \beta\alpha) e^{-\beta\alpha} + o(n), \quad (19)$

and that $\mathrm{cov}(\mathcal{Y}')$ cannot deviate much from this value with high probability. If we let $\mu = n p_n e^{-\beta\alpha}$ and $\lambda = n (1 + \beta\alpha) e^{-\beta\alpha}$, and we define the event

$\mathcal{A} = \left\{ |K' - \mu| \le \epsilon n p_n \right\} \cap \left\{ |\mathrm{cov}(\mathcal{Y}') - \lambda| \le \epsilon n \right\}, \quad (20)$

then (15) and (18) imply that $\Pr(\mathcal{A}^c) \to 0$ as $n \to \infty$. Since $\mathcal{A}$ is independent of the codebook $\mathcal{C}$, we can upper bound the probability of error as

$\Pr(\mathcal{E}) \le \Pr(\mathcal{A}^c) + \Pr(\mathcal{E} \mid \mathcal{A}) \overset{(a)}{\le} \Pr(\mathcal{A}^c) + 2^{nR} \cdot n^{\mu + \epsilon n p_n} \cdot 2^{-(\lambda - \epsilon n)}.$

Inequality $(a)$ follows from the union bound and from the fact that there are at most $n^{K'}$ ways to align the strings in $\mathcal{Y}'$ to a codeword in a non-overlapping way and, given this alignment, $\mathrm{cov}(\mathcal{Y}')$ bits of the codeword must be specified. Since $\Pr(\mathcal{A}^c) \to 0$ as $n \to \infty$, we see that we can achieve a rate $R$ as long as

$nR + (\mu + \epsilon n p_n) \log n - (\lambda - \epsilon n) < 0$

for some $\epsilon > 0$ and $n$ large enough. Letting $n \to \infty$ yields

$R < e^{-\beta\alpha} \left( 1 + \beta\alpha - \alpha \right) - \epsilon'$

for some $\epsilon' > 0$. The right-hand side is maximized by setting $\beta = 1$, which implies that we can achieve any rate $R < e^{-\alpha}$.
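The optimization over $\beta$ is easy to verify numerically (a sketch under our reconstruction of the rate expression above):

```python
# Maximize e^{-beta*alpha} * (1 + beta*alpha - alpha) over the cutoff beta.
import math

alpha = 0.7
rate, beta = max((math.exp(-b * alpha) * (1 + b * alpha - alpha), b)
                 for b in (i / 1000 for i in range(3001)))
print(f"best rate={rate:.6f} at beta={beta:.3f}   e^-alpha={math.exp(-alpha):.6f}")
```

The maximizer $\beta = 1$ corresponds to discarding exactly those fragments shorter than $\log n$, i.e., those too short to carry a unique index.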

VI Proofs of Lemmas

Lemma 1.

The number of fragments $M_t$ in $\mathcal{Y}_t$ satisfies

$\Pr\left( |M_t - \bar m_t| > \epsilon n p_n \right) \le 4 e^{-\epsilon^2 n p_n^2 / 8}$

for any $\epsilon > 0$ and $n$ large enough.

Proof of Lemma 1.

First notice that, since $K = 1 + \sum_{i=1}^{n-1} Z_i$, where $Z_i$ are i.i.d. $\mathrm{Bernoulli}(p_n)$ random variables, and using Hoeffding's inequality,

$\Pr\left( |K - n p_n| > \tfrac{\epsilon}{2} n p_n \right) \le 2 e^{-\epsilon^2 n p_n^2 / 2} \le e^{-\epsilon^2 n p_n^2 / 4}, \quad (21)$

where the last inequality holds for $n$ large enough.

Now suppose the sequence of independent random variables $N_1, N_2, \dots$ is an infinite sequence (and does not stop at index $K$). Let $\tilde B_i$ be the binary indicator of the event $\left\{ t/(T p_n) \le N_i < (t+1)/(T p_n) \right\}$, and $\tilde M_t = \sum_{i=1}^{\lceil (1+\epsilon/2) n p_n \rceil} \tilde B_i$. Intuitively, $M_t$ and $\tilde M_t$ should be close. In particular, $M_t \le \tilde M_t$ whenever $K \le (1+\epsilon/2) n p_n$. Moreover, $|E[\tilde M_t] - \bar m_t| \le \tfrac{\epsilon}{2} n p_n$ for $n$ large enough. If $|K - n p_n| \le \tfrac{\epsilon}{2} n p_n$ and $|\tilde M_t - E[\tilde M_t]| \le \tfrac{\epsilon}{4} n p_n$, by the triangle inequality, $|M_t - \bar m_t| \le \epsilon n p_n$. Therefore,

$\Pr\left( |M_t - \bar m_t| > \epsilon n p_n \right) \le \Pr\left( |\tilde M_t - E[\tilde M_t]| > \tfrac{\epsilon}{4} n p_n \right) + \Pr\left( |K - n p_n| > \tfrac{\epsilon}{2} n p_n \right) \le 4 e^{-\epsilon^2 n p_n^2 / 8},$

where we used Hoeffding's inequality and (21). ∎

Lemma 2.

If $N$ is a $\mathrm{Geometric}(p_n)$ random variable and $p_n \to 0$ as $n \to \infty$, then, for any $t > 0$, $\lim_{n\to\infty} \Pr\left( N \ge t/p_n \right) = e^{-t}$.

Proof of Lemma 2.

By definition,

$\Pr\left( N \ge t/p_n \right) = (1 - p_n)^{\lceil t/p_n \rceil - 1}.$

As $n \to \infty$, $p_n \to 0$ and $\left( \lceil t/p_n \rceil - 1 \right) \ln(1 - p_n) \to -t$. Hence, $(1 - p_n)^{\lceil t/p_n \rceil - 1} \to e^{-t}$, implying the lemma. ∎

Lemma 3.

If $\mathcal{Y}_t$ is defined as in (3) for $t = 0, 1, 2, \dots$, then

$\lim_{n\to\infty} \sum_{t=0}^{\infty} \frac{1}{n} H(\mathcal{Y}_t) = \sum_{t=0}^{\infty} \lim_{n\to\infty} \frac{1}{n} H(\mathcal{Y}_t).$

Proof of Lemma 3.

For a fixed integer $\tau$, we define $\mathcal{Y}_{\ge \tau} = \bigcup_{t \ge \tau} \mathcal{Y}_t$, the set of all fragments of length at least $\tau/(T p_n)$, and we have

$\sum_{t=0}^{\infty} \frac{1}{n} H(\mathcal{Y}_t) = \sum_{t=0}^{\tau-1} \frac{1}{n} H(\mathcal{Y}_t) + \sum_{t=\tau}^{\infty} \frac{1}{n} H(\mathcal{Y}_t). \quad (22)$

If we define $\mathrm{cov}(\cdot)$ as in Definition 1, from Lemma 6, we have

$\frac{1}{n} E\left[ \mathrm{cov}(\mathcal{Y}_{\ge \tau}) \right] \to \left( 1 + \tau/T \right) e^{-\tau/T}.$

Moreover, from Lemma 6, the event

$\left\{ \mathrm{cov}(\mathcal{Y}_{\ge \tau}) > 2n \left( 1 + \tau/T \right) e^{-\tau/T} \right\}$

has vanishing probability as $n \to \infty$. This allows us to write

$\sum_{t=\tau}^{\infty} \frac{1}{n} H(\mathcal{Y}_t) \le \frac{2}{n} E\left[ \mathrm{cov}(\mathcal{Y}_{\ge \tau}) \right] + o(1) \le 3 \left( 1 + \tau/T \right) e^{-\tau/T}$

for $n$ large enough. Hence, from (22), we have that for every $\tau$ and $n$ large enough,

$0 \le \sum_{t=0}^{\infty} \frac{1}{n} H(\mathcal{Y}_t) - \sum_{t=0}^{\tau-1} \frac{1}{n} H(\mathcal{Y}_t) \le 3 \left( 1 + \tau/T \right) e^{-\tau/T}.$

Notice that $\left( 1 + \tau/T \right) e^{-\tau/T} \to 0$ as $\tau \to \infty$. Therefore, we can let $n \to \infty$ and then $\tau \to \infty$, and we conclude that

$\lim_{n\to\infty} \sum_{t=0}^{\infty} \frac{1}{n} H(\mathcal{Y}_t) = \sum_{t=0}^{\infty} \lim_{n\to\infty} \frac{1}{n} H(\mathcal{Y}_t). \qquad ∎$

Lemma 4.

The number of fragments $K'$ in $\mathcal{Y}'$ satisfies

$\Pr\left( \left| K' - n p_n e^{-\beta\alpha} \right| > \epsilon n p_n \right) \le 4 e^{-\epsilon^2 n p_n^2 / 32}$

for any $\epsilon > 0$ and $n$ large enough.

Proof of Lemma 4.

Let $B_i = \mathbb{1}\left\{ N_i \ge \beta \log n \right\}$, for $i = 1, 2, \dots$. Then $K' = \sum_{i=1}^{K} B_i$. Since $K$ is random (and not independent of the $B_i$s), we need to follow similar steps to those in the proof of Lemma 1.

Let us assume that the sequence of independent random variables $N_1, N_2, \dots$ is an infinite sequence and let $\tilde K' = \sum_{i=1}^{\lceil (1+\epsilon/2) n p_n \rceil} B_i$. Notice that $\tilde K'$ is a sum of i.i.d. Bernoulli random variables with

$\Pr(B_i = 1) = \Pr\left( N_i \ge \beta \log n \right), \quad (23)$

and the standard Hoeffding's inequality can be applied. Moreover, from Lemma 2,

$\Pr\left( N_i \ge \beta \log n \right) \to e^{-\beta\alpha},$

and, for any $\epsilon' > 0$, $\left| \Pr(N_i \ge \beta \log n) - e^{-\beta\alpha} \right| \le \epsilon'$ for $n$ large enough. If we set $\epsilon' = \epsilon/4$ then, for $n$ large enough, we have $\left| E[\tilde K'] - n p_n e^{-\beta\alpha} \right| \le \tfrac{\epsilon}{2} n p_n$. Moreover, if $|K - n p_n| \le \tfrac{\epsilon}{2} n p_n$ and $|\tilde K' - E[\tilde K']| \le \tfrac{\epsilon}{4} n p_n$, by the triangle inequality (applied twice), $|K' - n p_n e^{-\beta\alpha}| \le \epsilon n p_n$. Hence,

$\Pr\left( |K' - n p_n e^{-\beta\alpha}| > \epsilon n p_n \right) \le \Pr\left( |\tilde K' - E[\tilde K']| > \tfrac{\epsilon}{4} n p_n \right) + \Pr\left( |K - n p_n| > \tfrac{\epsilon}{2} n p_n \right) \le 4 e^{-\epsilon^2 n p_n^2 / 32},$

where we used Hoeffding's inequality and (21). ∎

Lemma 5.

If $N$ is a $\mathrm{Geometric}(p_n)$ random variable and $p_n \to 0$ as $n \to \infty$, then, for any $t > 0$,