# Feedback Capacity and Coding for the (0,k)-RLL Input-Constrained BEC

The input-constrained binary erasure channel (BEC) with strictly causal feedback is studied. The channel input sequence must satisfy the (0,k)-runlength limited (RLL) constraint, i.e., no more than k consecutive 0's are allowed. The feedback capacity of this channel is derived for all k≥ 1, and is given by C^fb_(0,k)(ε) = εH_2(δ_0)+∑_i=1^k-1(ε^i+1H_2(δ_i)∏_m=0^i-1δ_m)/1+∑_i=0^k-1(ε^i+1∏_m=0^iδ_m), where ε is the erasure probability, ε=1-ε and H_2(·) is the binary entropy function. The maximization is only over δ_k-1, while the parameters δ_i for i≤ k-2 are straightforward functions of δ_k-1. The lower bound is obtained by constructing a simple coding for all k≥1. It is shown that the feedback capacity can be achieved using zero-error, variable length coding. For the converse, an upper bound on the non-causal setting, where the erasure is available to the encoder just prior to the transmission, is derived. This upper bound coincides with the lower bound and concludes the search for both the feedback capacity and the non-causal capacity. As a result, non-causal knowledge of the erasures at the encoder does not increase the feedback capacity for the (0,k)-RLL input-constrained BEC. This property does not hold in general: the (2,∞)-RLL input-constrained BEC, where every 1' is followed by at least two `0's, is used to show that the feedback capacity can be strictly greater than the non-causal capacity.

## Authors

• 1 publication
• 10 publications
• 5 publications
• ### Bounds on the Feedback Capacity of the (d,∞)-RLL Input-Constrained Binary Erasure Channel

The paper considers the input-constrained binary erasure channel (BEC) w...
01/21/2021 ∙ by V. Arvind Rameshwar, et al. ∙ 0

• ### Error-Free Communication Over State-Dependent Channels with Variable-Length Feedback

The zero-error capacity of state-dependent channels with noiseless feedb...
12/21/2017 ∙ by Mladen Kovačević, et al. ∙ 0

• ### Finite-Blocklength Performance of Sequential Transmission over BSC with Noiseless Feedback

In this paper, we consider the expected blocklength of variable-length c...
02/02/2019 ∙ by Hengjie Yang, et al. ∙ 0

• ### A Repeated Signal Difference for Recognising Patterns

This paper describes a new mechanism that might help with defining patte...
04/18/2016 ∙ by Kieran Greer, et al. ∙ 0

• ### Reinforcement Learning Evaluation and Solution for the Feedback Capacity of the Ising Channel with Large Alphabet

We propose a new method to compute the feedback capacity of unifilar fin...
08/18/2020 ∙ by Ziv Aharoni, et al. ∙ 0

• ### A Geometric Property of Relative Entropy and the Universal Threshold Phenomenon for Binary-Input Channels with Noisy State Information at the Encoder

Tight lower and upper bounds on the ratio of relative entropies of two p...
05/30/2018 ∙ by Shengtian Yang, et al. ∙ 0

• ### The DNA Storage Channel: Capacity and Error Probability

The DNA storage channel is considered, in which M Deoxyribonucleic acid ...
09/26/2021 ∙ by Nir Weinberger, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

The work was supported in part by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no.337752, and the ISF research grant 818/17. Part of this work was presented at the 2017 IEEE Int. Symposium on Information Theory (ISIT 2017), Aachen, Germany [Peled_ISIT].O. Peled, O. Sabag and H. H. Permuter are with the department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel (oripe@post.bgu.ac.il, oronsa@post.bgu.ac.il, haimp@bgu.ac.il). The physical limitations of the hardware used in recording and communication systems cause some digital sequences to be more prone to errors than others. This elicits the need to ensure that such sequences will not be recorded or transmitted. Constrained coding is a method that enables such systems to encode arbitrary data sequences into sequences that abide by the imposed restrictions [Marcus98]. In the classical constrained coding setting, it is assumed that the transmission is noiseless if the transmitted sequence satisfies the imposed constraint. In this paper, however, we consider a transmission of constrained sequences where the transmission is over a noisy channel, the binary erasure channel (BEC) (Fig. 1).

Run-length limited (RLL) constraints are common in magnetic and optical recording standards, where the run length of consecutive ‘’s should be limited between and . A -RLL constrained binary sequence must satisfy two restrictions:

1. at least ’s must follow each ‘’.

2. no more than consecutive ‘’s are allowed.

The first restriction ensures that the frequency of transitions, i.e., or , will not be too high. This is necessary in systems where the sequence is conveyed over band-limited channels. Timing is commonly recovered with a phase-locked loop (PLL) that adjusts the phase of the detection instant according to the observed transition of the received waveform. The second restriction guarantees that the PLL does not fall out of synchronization with the waveform [Immink90runlength-limitedsequences, Marcus98].

Two important families of the RLL constraint are the -RLL and -RLL. These constraints might seem symmetric in some sense, but indeed, may greatly differ in their behavior, see e.g., [dk_1, dk_2]. Therefore, when dealing with RLL constraints, it is common to tackle each of these families separately before approaching the general -RLL. In this paper, we adopt this approach and show that the -RLL problem is solvable, while the same problem with a -RLL constraint is a great deal more challenging.

The model studied in this paper is a BEC (Fig. 1), in which the input sequences must satisfy the -RLL constraint. Two cases of this model are investigated, based on the information that is available to the encoder. In the first case, described in Fig. 2, the encoder has access to all past outputs via a noiseless feedback link. In the second case, described in Fig. 3, the encoder has non-causal access to the erasure that is about to occur, that is, the encoder knows in advance whether the BEC behaves like a clean channel or not. From an operational point of view, the capacity of the non-causal case must be greater than the feedback case due to the additional information that the encoder has.

We show that the feedback capacity of the -RLL input-constrained BEC is:

 Cfb(0,k)(ε)=max0≤δk−1≤12¯¯¯εH2(δ0)+k−1∑i=1(¯¯¯εi+1H2(δi)i−1∏m=0δm)1+k−1∑i=0(¯¯¯εi+1i∏m=0δm), (1)

for all and , where are simple functions of , given in Eq. (6), below. Surprisingly, we are also able show that the non-causal knowledge of the channel erasure does not increase the feedback capacity, so that (1) is the non-causal capacity as well.

This work generalizes the results in [Sabag_BEC], where the feedback capacity of the -RLL input-constrained BEC was calculated111The -RLL constraint is equivalent to the -RLL constraint by swapping ‘’s and ‘’s. In [Sabag_BEC] and other works, [TatikondaMitter_IT09, Chen05, Yang05, PermuterCuffVanRoyWeissman08, trapdoor_generalized, Ising_channel, generalized_Ising, sabag_BIBO], the capacity was derived by formulating it as a dynamic programming (DP) problem and then solving the corresponding Bellman equation. In all past works, the DP state was an element of the -simplex, an essential property in the solution of the Bellman equation. However, the DP state in our case is an element of the -simplex. This makes the approach of solving the Bellman equation intractable and different methods are required.

To circumvent this difficulty, we use alternative techniques to solve the capacity of our problems. The upper bound follows from standard converse techniques for the non-causal model, where the encoder knows the erasure ahead of time. This upper bound is trivially an upper bound also for the feedback model, since non causal knowledge might increase the capacity only. Then, we construct a simple coding scheme for the feedback setting, inspired by the posterior matching principle [horstein_original, shayevitz_posterior_mathcing, Li_elgamal_matching, shayevitz_simple]. The coding scheme enables both the encoder and the decoder to systematically reduce the size of the set of possible messages to a single message, which is then declared by the decoder as the correct message. An analysis of the achieved rate reveals an expression that is similar to the upper bound. The equivalence of these bounds is finally derived, and this concludes both the feedback capacity and the non-causal capacity for our setting.

The remainder of the paper is organized as follows: Section II includes the notations we use and the problem definition. Section III contains the main results of this paper. In Section IV we present the coding scheme and its rate analysis. Section V includes an upper bound of the capacity. In Section VI we discuss the -RLL input constraint, as an example where the non-causal capacity is strictly greater than the feedback capacity. Section VII presents the feedback capacity of the -RLL BEC, as an example for possible future avenues of research. Finally, the appendices contain proofs of several lemmas used throughout the paper.

## Ii Notations and Problem Definition

### Ii-a Notations

Random variables are denoted using a capital letter . Lower-case letters are used to denote realizations of random variables. Calligraphic letters denote sets. The notation means the -tuple and

is a realization of such a vector of random variables. For a real number

, we define . The binary entropy function is denoted by for .

### Ii-B Problem Definition

The BEC (Fig. 1) is memoryless, that is , and can be represented by:

 yi={xi,if θi=✓?,if θi=\char pzd,

where is an i.i.d. process with . A message is drawn uniformly from and is available to the encoder. We define two models, based on the additional information that is available to the encoder: in the first model, at time , the encoder has knowledge of past channel outputs via a noiseless feedback link (Fig. 2); in the second model, at time , the encoder has non-causal access to (Fig. 3). In both cases, the transmission is over a BEC.

The encoder must produce sequences that comply with the -RLL input constraint. This constraint can be described graphically (Fig. 4), where all walks along the directed edges of the graph do not contain the forbidden string. Note that the node has only one outgoing edge, labeled ‘’, which implies that after consecutive ‘’s, the next bit must be a ‘’. The constrained encoder and the decoder operations are made precise by the following code definitions.

###### Definition 1.

A code for an input-constrained BEC is composed of encoding and decoding functions. The encoding functions for the first model (with feedback) are:

 fi:{1,…,2nR}×Yi−1→Xi, i=1,…,n, (2)

satisfying if for all . For the non-causal model the encoding functions are defined by:

 gi:{1,…,2nR}×{✓,\char pzd}i→Xi, i=1,…,n, (3)

satisfying if for all . The decoding function for both models is defined by:

 Ψ:Yn→{1,…,2nR}.

Without loss of generality, we assume that , so that the initial state is .

The average probability of error for a code is defined as . A rate is said to be -achievable if there exists a sequence of codes such that . The capacity is defined to be the supremum over all -achievable rates and is a function of and the erasure probability . Denote by the capacity of the feedback model and that of the non-causal model. Since is computable from and , we have the relation , for all , .

## Iii Main Results

In this section we present the main results, including the feedback capacity and the non-causal capacity for the BEC with -RLL input constraints. We then explain the methodology used to prove the results. The following theorem constitutes our main results regarding the feedback capacity and the capacity achieving coding scheme. Define the function:

 Rε(δ0,…,δk−1)=¯¯¯εH2(δ0)+k−1∑i=1(¯¯¯εi+1H2(δi)i−1∏m=0δm)1+k−1∑i=0(¯¯¯εi+1i∏m=0δm), (4)

where takes values in for .

###### Theorem 1.

The feedback capacity of the -RLL input-constrained BEC with feedback is:

 Cfb(0,k)(ε)=max0≤δk−1≤12Rε(δ0,…,δk−1), (5)

where are functions of and can be calculated recursively using:

 δj=δj+1δj+1+¯¯¯δj+1(¯δj+1¯¯¯δj+2)¯εj=0,1,…,k−2, (6)

with . In addition, there exists a simple coding scheme that achieves the capacity given in (5).

Fig 5 presents graphs of the feedback capacity as a function of for several values of . The capacity is a decreasing function of , and an increasing function of . For , the channel is noiseless and so the capacity is that of the corresponding constraint. For example, , where is the golden ratio, which is known to be the capacity of sequences that do not contain two consecutive ‘’s. For , the output is constant so we have for all . As increases the constraint becomes more lenient and the capacity approaches , which is the capacity of the unconstrained BEC.

Theorem 1 guarantees that even though the function we aim to maximize is a function of variables, to calculate the capacity, one needs only to perform a maximization over . For any , the values of all other variables can be calculated by utilizing the set of equations given in (6).

Our proposed coding scheme has degrees of freedom, represented by . For this reason, it is rather surprising that the feedback capacity is a simple optimization problem of one variable for all . Indeed, the relaxation of the optimization domain shows that optimizing over the k-tuple and the optimization in (5) and (6) are equivalent. In addition we also prove the following:

###### Theorem 2.

Non-causal knowledge of the erasures does not increase the feedback capacity, that is :

 Cfb(0,k)(ε)=Cnc(0,k)(ε).

It is tempting to conjecture that this property holds for the general -RLL constrained BEC, but we will provide a counterexample in Section VI. Theorems 1 and 2 both generalize parallel results shown in [Sabag_BEC], where the special case of was calculated using different techniques. The following inequalities are the main steps required to prove Theorems 1 and 2:

 max0≤δ0,…,δk−1≤12Rε(δ0,…,δk−1)(a)≤Cfb(0,k)(ε)(b)≤Cnc(0,k)(ε)(c)≤max0≤δ0,…,δk−1≤1Rε(δ0,…,δk−1), (7)

where

• Inequality (a) follows from the coding scheme that is presented in Algorithm 1. Specifically, it is shown that is achievable for any choice of .

• Inequality (b) follows from the operational definitions of the code in Section II.

• Inequality (c) follows from standard converse techniques for the non-causal setting.

The next lemma shows that the maximal value of remains the same whether the maximization domain is or . Thus, the chain of inequalities is actually a chain of equalities.

###### Lemma 1.

For all and ,

 max0≤δ0,…,δk−1≤12Rε(δ0,…,δk−1)=max0≤δ0,…,δk−1≤1Rε(δ0,…,δk−1).

Moreover, the k-tuple satisfies Eq. (6).

Theorems 1 and 2 are concluded from the inequalities chain (7) and Lemma 1, which is proved in Appendix A

## Iv Optimal Coding Scheme for the Input-Constrained BEC with Feedback

In this section, we introduce the idea behind the optimal coding scheme, as well as the coding itself, which is presented in Algorithm 1. We then prove that the scheme is feasible, meaning that the generated input sequence does not violate the input constraint, and, finally, calculate its rate.

### Iv-a Coding Scheme

Before presenting the coding scheme itself, we discuss the basic ideas in accordance with which the scheme operates. The coding scheme is a mechanism that allows the encoder to transmit a message to the decoder without violating the channel’s input constraint. The main feature of the scheme is a dynamic set of possible messages that is known both to the encoder and to the decoder at all times. Both parties will systematically reduce the size of the set of possible messages from in the beginning of the transmission process to a single message that will then be announced as the correct message.

Initially, the messages are mapped uniformly to message points in the unit interval by applying . As long as transmission proceeds, the set of possible messages is represented by uniform points on the unit interval with proper scaling.

Channel inputs are determined by labeling functions, which map the unit interval into . Given a labelling, , with a corresponding parameter the encoder assigns the label ‘’ to a subsection of of length and the label ‘’ to the rest of . Fig. 6 depicts the various labelings. Define the following function:

 X(m,L)={0,(L=~l0 and m>¯¯¯δ0) or (L=lj and m<δj, j=0,…,k)1,otherwise. (8)

The channel input is , where is the correct message point and is the labeling being used at time .

The labelling at each time is a function of all channel outpouts and can be computed recursively from the previous channel input and the previous labelling. Therefore, the instantaneous labelling is available both to the encoder and the decoder. Transition between the various labelings is controlled by a finite-state machine (FSM), which is illustrated in Fig. 7. Define the following function:

 G(L,Y)=⎧⎪⎨⎪⎩l0,(L=lk) or (Y=1) or (Y=? and % L=~l0)~l0,Y=? and L≠~l0lj+1,Y=0, (9)

and thus, .

A transmission at time is said to be successful if . Due to the nature of the BEC, whenever the decoder can know with certainty the value of . Denote by and the subsets of messages which are labeled at time with ‘’ and ‘’, respectively. Define and for :

 ^Mi=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩^Mi−1,Yi=?^M(0)i−1,Yi=0^M(1)i−1,Yi=1. (10)

Thus, a successful transmission reduces the size of the set of possible messages. Following a successful transmission, the remaining messages in the set of possible messages are uniformly mapped again to . Fig. 8 depicts a successful transmission and the subsequent reduction of the number of possible messages. The process continues until such a time that the set of possible messages contains only one message, at which point the decoder declares it to be .

Algorithm 1 presents the coding scheme in pseudo code form. The functions and mentioned in the algorithm are defined in Eq. (8) and Eq. (9), respectively.

### Iv-B Feasibility of the Proposed Scheme

First, we show that the coding scheme satisfies the -RLL constraint, that is, no message is mapped by the scheme into a sequence with more than consecutive ‘’s. The following lemma shows that the constraint is satisfied when restricting the scheme parameters .

###### Lemma 2.

If for , then any channel input sequence generated by the proposed coding scheme satisfies the -RLL constraint.

###### Proof.

We show that if for , then no message is labeled ‘’ more than times in a row. From Eqs. (8) and (9), the channel input, is a function of and . Therefore, we divide the proof into three disjoint cases based on the last outputs . For each case, we show that the subsequent sequence of channel inputs cannot contain more than consecutive ‘’s. Assume that transmission begins at the node associated with labeling .

1. Any output sequence (of length ) that contains a ‘’ cannot cause a violation.

2. An output sequence of consecutive ‘’s ends at (Fig. 7). Thus, the next channel input is (Fig 6).

3. Lastly, consider a sequence of outputs that contains both ‘’s and erasures. Assume the first erasure occurred at time instance , meaning that the erasure took place while the encoder was using labeling . This means that all messages between and in were labeled with ‘ times in a row. The next labeling that will be used is . In this labeling, all messages between and are labeled with ‘’. Since , we have that , so all messages that were labeled ‘ times in a row will be labeled ‘’ in . This analysis holds for any .

In summary, setting , ensures that the scheme does not violate the -RLL constraint. ∎

### Iv-C Rate Analysis

The achieved rate is measured by . Define as the number of information bits gained in a single channel use, i.e., the quotient of the size of the set of possible messages before and after the transmission. Additionally, Let be the random variable which corresponds to the labeling and takes values in .

In the following lemma we calculate the expectation of .

###### Lemma 3.

For all , we have that , where is the relevant to the labeling .

###### Proof.

Consider:

 E[Q∣L=l] =εE[Q∣L=l,θ=\char pzd]+¯¯¯εE[Q∣L=l,θ=✓] (a)=¯¯¯εE[Q∣L=l,θ=✓], (11)

where (a) holds since if , then the transmitted symbol is erased by the channel and the set of possible messages is unchanged.

In the proposed coding scheme, labeling assigns a portion of of size to the label ‘’ and the rest to label ‘’ for all . The labeling also assigns of the unit interval to ‘’. If the labeling is employed, then the channel input is distributed according to 222For labelings , the encoder transmits if the correct message falls within a sub-interval of that has length respectively. Note that the messages are discrete points in and it is possible for the partition to occur between two messages. This implies that the transmitted bit is distributed , where is a correction factor. From the continuity of the entropy function, the contribution of this correction factor can be bounded with arbitrary constant by taking the block length be large enough. The precise details are omitted and follow parallel argument to [Sabag_BEC, Appendix C].

Assume that . If the successfully received bit was ‘’, then the new set of possible messages has size , and if it was ‘’, then the new set of possible messages has size . The expected number of bits required to describe the new set of possible messages is . Thus, given that , following a successful transmission the decoder gains bits of information. Substituting into (IV-C) we get:

 E[Q∣L=l]=¯¯¯εH2(δl).

The next lemma calculates the rate achieved by the proposed coding scheme.

###### Lemma 4.

For any , and , the proposed coding scheme achieves the following rate :

 R=¯¯¯εH2(δ0)+k−1∑i=1(¯¯¯εi+1H2(δi)i−1∏m=0δm)1+k−1∑i=0(¯¯¯εi+1i∏m=0δm). (12)
###### Proof.

Consider the averaged gain of information divided by the amount of time:

 R =limn→∞1nn∑j=1E[Qj] (a)=limn→∞1nn∑j=1∑l∈LP(Lj=l)E[Qj∣Lj=l] (b)=∑l∈L¯¯¯εH2(δl)limn→∞1nn∑j=1P(Lj=l) (c)=∑l∈L¯¯¯εH2(δl)π(l) =¯¯¯εH2(δ0)(π(~l0)+π(l0))+k−1∑i=1¯¯¯εH2(δi)π(li) (d)=¯¯¯εH2(δ0)+k−1∑i=1(¯¯¯εi+1H2(δi)i−1∏m=0δm)1+k−1∑i=0(¯¯¯εi+1i∏m=0δm),

where

• Follows from the law of total expectation.

• Follows from Lemma 3 and exchanging the finite sums’ order.

• Follows from the definition of stationary probability. is the stationary probability of labeling . There exists a stationary probability because the random process

is a positive recurrent, irreducible and aperiodic Markov chain, as can be seen from Fig.

7.

• Follows from calculation of the stationary probability of the Markov chain described in Fig. 7 and is parameterized with ,

. Calculating the conditional probability of each edge is simple, using the law of total probability. For example, the conditional distribution of the edge beginning node

and culminating in node is .

From Lemma 2, we conclude that is an achievable rate. ∎

## V Non-Causal Upper Bound

In this section, we present an upper bound of the non-causal capacity for the -RLL input-constrained BEC (given in Eq. (7)). We begin with an observation: it is sufficient to look at a smaller family of codes, called restricted codes. We then proceed to calculate an upper bound on the achievable rate of such codes, using standard converse arguments, as well as the method of types and Markov theory results.

A code is said to be restricted if

 gi(m,θi−1,θi=\char pzd)=1, ∀m,θi−1,i=1,…,n. (13)

Condition (13) states that if an erasure is about to occur, the encoder transmits . The following lemma formalizes the fact that restricted codes can achieve the capacity.

###### Lemma 5.

For the -RLL constrained non-causal BEC, if a rate is achievable, then can be achieved using a sequence of restricted codes.

###### Proof.

Assume the rate is achieved using a sequence of codes: . Define for each a new code that is exactly the same as except that in , whenever the encoder transmits .

The code does not violate the input constraint since the original did not violate the constraint and transmitting is always permitted by the -RLL input constraint. In addition, the channel outputs remain the same whether the code is or . This means that , and so the rate is also achieved by the sequence of . ∎

### V-a Upper Bound Calculation

The following technical lemma is needed for the converse proof:

###### Lemma 6.

For any n-tuple constrained distribution , where , there exists a time invariant constrained Markov distribution such that:

 H~P(Yn)≤H~Q(Yn)+ζn, (14)

where .

The proof is available in Appendix B. This result can readily be generalized to any -RLL constraint imposed on the n-tuple distribution.

In the constraint graph, Fig. 4, a node can be calculated from an any-length tuple by walking along the edges labelled with , since we assume that the initial state is . The notation will be used in various forms for distributions on to signify that the probability for is and that the probability for a constrained word is .

###### Proof of the upper bound.

Let be an achievable rate using a restricted code, and consider the following chain of inequalities:

 nR =H(M) (a)≤I(Yn;M)+εn (b)=H(Yn)−n∑i=1H(Yi∣M,Yi−1)+εn (c)≤H(Yn)−nH2(ε)+εn (d)≤n∑i=1H(Yi∣Yi−1i−k)−nH2(ε)+εn (e)≤max{~Pi(yi∣yi−1)}ni=1n∑i=1H(Yi∣Yi−1i−k)−nH2(ε)+εn (f)=max{~Pi(yi∣yi−1i−k)}ni=1n∑i=1H(Yi∣Yi−1i−k)−nH2(ε)+εn (g)≤max{~P(yi∣yi−1i−k)}ni=1n∑i=1H(Yi∣Yi−1i−k)−nH2(ε)+ε′n (h)=max{~P(yi∣yi−1i−k,si−1)}ni=1n∑i=1H(Yi∣Yi−1i−k,Si−1)−nH2(ε)+ε′n (i)≤max{~P(yi∣yi−1i−k,si−1)}ni=1n∑i=1H(Yi∣Si−1)−nH2(ε)+ε′n (j)=max{~P(yi∣si−1)}ni=1n∑i=1H(Yi∣Si−1)−nH2(ε)+ε′n (k)=max0≤δ0,…,δk−1≤1n∑i=1k−1∑j=0Pr(Si−1=j)H3(¯εδj,ε,¯¯¯ε¯¯¯δj)−nH2(ε)+ε′n (l)=max0≤δ0,…,δk−1≤1n∑i=1k−1∑j=0Pr(Si−1=j)[H2(ε)+¯¯¯εH2(δj)]−nH2(ε)+ε′n =max0≤δ0,…,δk−1≤1¯¯¯εk−1∑j=0H2(δj)n∑i=1Pr(Si−1=j)+ε′n. (15)

where

1. Follows from Fano’s inequality.

2. Follows from the chain rule.

3. Follows from the fact that conditioning reduces entropy, so: .

4. Follows from the fact that conditioning reduces entropy.

5. The maximization domain is the set of all n-tuple distributions which induce and , for all and , respectively.

6. We want to show that it is possible to maximize over a smaller domain and maintain an equality. It Suffices to prove by induction that if we have two distributions and , which induce the same marginal distributions , then and coincide. For the proof is trivial. Assume by induction that and we need to prove that . Indeed we have:

 ~P(1)(yii−k)=~P(1)(yi−1i−k)~P(yi∣yi−1i−k)=~P(2)(yi−1i−k)~P(yi∣yi−1i−k)=~P(2)(yii−k),

since is the same for both distributions by assumption, and the