# Algorithmic Polarization for Hidden Markov Models

Using a mild variant of polar codes we design linear compression schemes compressing Hidden Markov sources (where the source is a Markov chain, but whose state is not necessarily observable from its output), and to decode from Hidden Markov channels (where the channel has a state and the error introduced depends on the state). We give the first polynomial time algorithms that manage to compress and decompress (or encode and decode) at input lengths that are polynomial both in the gap to capacity and the mixing time of the Markov chain. Prior work achieved capacity only asymptotically in the limit of large lengths, and polynomial bounds were not available with respect to either the gap to capacity or mixing time. Our results operate in the setting where the source (or the channel) is known. If the source is unknown then compression at such short lengths would lead to effective algorithms for learning parity with noise -- thus our results are the first to suggest a separation between the complexity of the problem when the source is known versus when it is unknown.


## 1 Introduction

We study the problem of designing coding schemes, specifically encoding and decoding algorithms, that overcome errors caused by stochastic, but not memoryless, channels. Specifically we consider the class of “(hidden) Markov channels” that are stateful, with the states evolving according to some Markov process, and where the distribution of error depends on the state. (We use the term hidden to emphasize the fact that the state itself is not directly observable from the actions of the channel, though in the interest of succinctness we will omit this term for most of the rest of this section.) Such Markovian models capture many natural settings of error, such as bursty error models. (See, for example, Figure 1.) Yet they are often less understood than their memoryless counterparts (or even “explicit Markov models” where the state is completely determined by the actions of the channel). For instance (though this is not relevant to our work) even the capacity of such channels is not known to have a closed form expression in terms of channel parameters. (In particular the exact capacity of the channel in Figure 1 is not known as a function of $\delta$, $p$, and $q$!)

Figure 1: A Markovian Channel: The Nice state flips bits with probability $\delta$ whereas the Noisy state flips with probability $1/2 - \delta$. The stationary probability of the Nice state is $q/p$ times that of the Noisy state.

In this work we aim to design coding schemes that achieve rates arbitrarily close to capacity. Specifically, given a channel of capacity $C$ and gap parameter $\epsilon > 0$, we would like to design codes that achieve a rate of at least $C - \epsilon$ and that admit polynomial time algorithms even at small block lengths $n \le \mathrm{poly}(1/\epsilon)$. Even for the memoryless case such coding schemes were not known till recently. In 2008, Arikan invented a completely novel approach to constructing codes based on “channel polarization” for communication on binary-input memoryless channels, and proved that they enable achieving capacity in the limit of large code lengths with near-linear complexity encoding and decoding. In 2013, independent works by Guruswami and Xia and Hassani et al. gave a finite-length analysis of Arikan’s polar codes, proving that they approach capacity fast, at block lengths bounded by $\mathrm{poly}(1/\epsilon)$ where $\epsilon$ is the difference between the channel capacity and code rate.

The success of polar codes on the memoryless channels might lead to the hope that maybe these codes, or some variants, might lead to similar coding schemes for channels with memory. But such a hope is not easily justified: the analysis of polar codes relies heavily on the fact that errors introduced by the channel are independent and this is exactly what is not true for channels with memory. Despite this seemingly insurmountable barrier, Şaşoğlu  and later Şaşoğlu and Tal  showed, quite surprisingly, that the analysis of polar codes can be carried out even with Markovian channels (and potentially even broader classes of channels). Specifically they show that these codes converge to capacity and even the probability of decoding error, under maximum likelihood decoding, drops exponentially fast in the block length (specifically as on codes of length ; see also , where exponentially fast polarization was also shown at the high entropy end). An extension of Arikan’s successive cancellation decoder from the memoryless case was also given by , building on an earlier version  specific to intersymbol interference channels, leading to efficient decoding algorithms.

However, none of the works above give small bounds on the block length of the codes as a function of the gap to capacity, and more centrally to this work, on the mixing time of the Markov chain. The latter issue gains importance when we turn to the issue of “compressing Markov sources” which turns out to be an intimately related task to that of error-correction for Markov channels as we elaborate below and which is also the central task we turn to in this paper. We start by describing Markov source and the (linear) compression problem.

A (hidden) Markov source over alphabet is given by a Markov chain on some finite state space where each state has an associated distribution over . The source produces information by performing a walk on the chain and at each time step , outputting a letter of drawn according to the distribution associated with the state at time (independent of all previous choices, and previous states). (The phrase “hidden” emphasizes the fact that the output produced by the source does not necessarily reveal the sequence of states visited.)

In the special case of additive Markovian channels where the output of the channel is the sum of the transmitted word with an error vector produced by a Markov source, a well-known correspondence shows that error-correction for the additive Markov channel reduces to the task of designing a compression and decompression algorithm for Markovian sources, with the compression being linear. Indeed in this paper we only focus on this task: our goal turns into that of compressing $n$ bits generated by the source to its entropy up to an additive factor of $\epsilon n$, while $n$ is only polynomially large in $1/\epsilon$.

A central issue in the task of compressing a source is whether the source is known to the compression algorithm or not. While ostensibly the problem should be easier in the “known” setting than in the “unknown” one, we are not aware of any formal results suggesting a difference in complexity. It turns out that compression in the setting where the source is unknown is at least as hard as “learning parity with noise” (we argue this in Appendix B), if the compression works at lengths polynomial in the mixing time and gap to capacity. This suggests that the unknown source setting is hard (under some current beliefs). No corresponding hardness was known for the task of compressing sources when they are known, but no easiness result seems to have been known either (and certainly no linear compression algorithm was known). This leads to the main question addressed (positively) in this work.

#### Our Results.

Our main result is a construction of codes for additive Markov channels that gets close to capacity at block lengths polynomial in and the mixing time of the Markov chain, with polynomial (in fact near-linear) encoding and decoding time. Informally additive channels are those that map inputs from some alphabet to outputs over with an abelian group defined on and the channel generates an error sequence independent of the input sequence, and the output of the channel is just the coordinatewise sum of the input sequence with the error sequence. (In our case the alphabet is a finite field of prime cardinality.) The exact class of channels is described in Definition 2.1, and Theorem 2.2 states our result formally. We stress that we work with additive channels only for conceptual simplicity and that our results should extend to more general symmetric channels though we don’t do so here. Prior to this work no non-trivial Markov channel was known to achieve efficient encoding and decoding at block lengths polynomial in either parameter (gap to capacity or mixing time).

Our construction and analyses turn out to be relatively simple given the works of Şaşoğlu and Tal [4, 9] and the work of Blasiok et al. . The former provides insights on how to work with channels with memory, whereas the latter provides tools needed to get short block length and cleaner abstractions of the efficient decoding algorithm that enable us to apply it in our setting. Our codes are a slight variant of polar codes, where we apply the polar transforms independently to blocks of inputs. This enables us to apply the analysis of  in an essentially black box manner, benefiting both from its polynomially fast convergence guarantee to capacity as well as its generality covering all polarizing matrices over any prime alphabet (and not just the basic Boolean transform covered in ).

We give a more detailed summary of how our codes are obtained and how we analyze them in Section 3 after stating our results and main theorem formally.

## 2 Definitions and Main Results

### 2.1 Notation and Definitions

We will use to denote the finite field with elements. Throughout the paper, we will deal only with the case when is a prime. (This restriction in turn comes from the work of  whose results we use here.)

We use several notations to index matrices. For a matrix , the entry in the th row, th column is denoted or . Columns are denoted by superscripts, i.e., denotes the th column of . Note that . We also use the indices as sets in the natural way. For example denotes the first columns of . denotes the submatrix of elements in the first columns and first rows. denotes the set of elements of indexed by lexicographically smaller indices than . Multiplication of a matrix with a vector is denoted .

For a finite set , let denote the set of probability distributions over . For a random variable and event , we write to denote the conditional distribution of , conditioned on . For example, we may write .

The total-variation distance between two distributions is

$$\|p - q\|_1 := \sum_i |p(i) - q(i)|$$

We consider compression schemes, as a map . The rate of a compression scheme is the ratio .

For a random variable $X$, the (non-normalized) entropy is denoted $H(X)$, and is

$$H(X) := -\sum_i \Pr[X = i] \log(\Pr[X = i])$$

and the normalized entropy is denoted $\overline{H}(X)$, and is

$$\overline{H}(X) := \frac{1}{\log(q)} H(X)$$
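As a small illustration of the two definitions above, the following sketch computes both quantities for a distribution given as a list of probabilities. The function names are hypothetical, and the choice of base-2 logarithms is an assumption (the normalized entropy does not depend on the base, since the same base appears in the numerator and in the denominator).

```python
import math

def entropy(dist):
    """Non-normalized entropy H(X), in bits (base 2 is an assumption)."""
    # Terms with probability 0 contribute nothing, by the usual convention.
    return -sum(p * math.log2(p) for p in dist if p > 0)

def normalized_entropy(dist, q):
    """Normalized entropy H(X) / log(q) for a variable over an alphabet of size q."""
    return entropy(dist) / math.log2(q)

# A uniform symbol over F_3 has normalized entropy 1; a constant symbol has 0.
uniform_f3 = [1 / 3, 1 / 3, 1 / 3]
constant_f3 = [1.0, 0.0, 0.0]
```

With this normalization a variable over a field of size $q$ always has normalized entropy between 0 and 1, which is what lets polarization statements refer to the same two extreme values for every prime $q$.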

A Markov chain is given by an representing the state space , a transition matrix , and a distribution on initial state . The rows of , denoted are thus elements of . A Markov chain generates a random sequence of states determined by letting , and for given . The stationary distribution is the distribution such that if , then all ’s are marginally identically distributed as .

We consider only Markov chains which are irreducible and aperiodic, and hence have a stationary distribution to which they converge in the limit. The rate of convergence is measured by the mixing time, defined below.

The mixing time of a Markov chain is the constant such that for every initial state of the Markov chain, the distribution of state is -close in total variation distance to the stationary distribution .
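To make the notions of stationary distribution and convergence concrete, here is a minimal sketch for a two-state chain in the spirit of Figure 1. The transition probabilities `p` and `q` are illustrative values, the helper names are hypothetical, and the stationary distribution is approximated by plain iteration rather than by any method from the paper. The distance used is the sum of absolute differences, matching the formula given earlier in this section.

```python
def step(dist, M):
    """Advance a distribution over states by one step of the chain M."""
    k = len(M)
    return [sum(dist[s] * M[s][t] for s in range(k)) for t in range(k)]

def l1_distance(a, b):
    """Sum of absolute differences between two distributions."""
    return sum(abs(x - y) for x, y in zip(a, b))

def stationary(M, iters=10_000):
    """Approximate the stationary distribution by repeated stepping
    (valid for irreducible, aperiodic chains)."""
    dist = [1.0 / len(M)] * len(M)
    for _ in range(iters):
        dist = step(dist, M)
    return dist

def worst_case_distance(M, t):
    """Largest distance to stationarity after t steps, over all start states."""
    pi = stationary(M)
    k = len(M)
    worst = 0.0
    for s0 in range(k):
        dist = [1.0 if s == s0 else 0.0 for s in range(k)]
        for _ in range(t):
            dist = step(dist, M)
        worst = max(worst, l1_distance(dist, pi))
    return worst

# State 0 = Nice, state 1 = Noisy; Nice -> Noisy w.p. p, Noisy -> Nice w.p. q.
p, q = 0.1, 0.3
M = [[1 - p, p], [q, 1 - q]]
```

For this chain the stationary probability of the Nice state is $q/(p+q)$, which is $q/p$ times that of the Noisy state, and the worst-case distance shrinks geometrically in the number of steps. A chain with smaller `p` and `q` stays in each state longer and therefore mixes more slowly.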

A (stationary, hidden) Markov source is specified by an alphabet , a Markov chain on states and distributions . The output of the source is a sequence of random variables obtained by first sampling a sequence according to and then sampling independently for each . We let the distribution of output sequences of length , and denote the distribution of i.i.d. samples from .

Similarly, we define an additive Markov channel as a channel which adds noise from a Markov source.

An additive Markov channel , specified by a Markov source over alphabet , is a randomized map obtained as follows: On channel input , the channel outputs where where .
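The following sketch simulates a hidden Markov source and the additive channel it induces. All names and parameter values are illustrative assumptions: `M` is the transition matrix, `emit[s]` is the output distribution of state `s`, and `pi0` is the initial state distribution. The example instantiates the two-state channel of Figure 1 with made-up values of $\delta$, $p$ and $q$.

```python
import random

def sample_markov_source(M, emit, pi0, n, rng):
    """Sample n output symbols: walk the chain and emit one symbol per state."""
    k = len(M)
    state = rng.choices(range(k), weights=pi0)[0]
    out = []
    for _ in range(n):
        alphabet_size = len(emit[state])
        out.append(rng.choices(range(alphabet_size), weights=emit[state])[0])
        state = rng.choices(range(k), weights=M[state])[0]
    return out

def additive_channel(x, noise, q):
    """Coordinatewise sum of the input and the error sequence over F_q."""
    return [(a + b) % q for a, b in zip(x, noise)]

# Figure 1 style channel over F_2 (parameter values are assumptions).
delta, p, q_back = 0.05, 0.1, 0.3
M = [[1 - p, p], [q_back, 1 - q_back]]
emit = [[1 - delta, delta],            # Nice: flip with probability delta
        [0.5 + delta, 0.5 - delta]]    # Noisy: flip with probability 1/2 - delta
pi0 = [q_back / (p + q_back), p / (p + q_back)]  # start from stationarity

rng = random.Random(0)
noise = sample_markov_source(M, emit, pi0, 64, rng)
received = additive_channel([0] * 64, noise, 2)
```

Because the error sequence is generated independently of the input, subtracting the input from the received word always recovers the noise, which is what makes the reduction from channel coding to source compression possible.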

A linear code is a linear map . The rate of a code is the ratio .

For all sets , a constructive source over samplable in time is a distribution such that can be sampled efficiently in time at most , and for every fixed , the conditional distribution can be sampled efficiently in time at most .

Every Markov source with state space is a constructive source samplable in time . That is, for every , let be the random variables generated by the Markov source. Then, the sequence can be sampled in time at most , and moreover for every setting of , the distribution can be sampled in time .

###### Proof.

Sampling can clearly be done by simulating the Markov chain, and sampling from the conditional distribution is possible using the standard Forward Algorithm for inference in Hidden Markov Models, which we describe for completeness in Appendix A. ∎
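The Forward Algorithm mentioned in the proof can be sketched as follows. This is a generic textbook version, not a transcription of Appendix A; the function name and the argument layout (`M` for transitions, `emit[s]` for the output distribution of state `s`, `pi0` for the initial state distribution) are assumptions made for illustration.

```python
def next_symbol_distribution(M, emit, pi0, prefix):
    """Return the distribution of the next output given the outputs in `prefix`.

    Runs in time O(len(prefix) * k^2) for a chain with k states.
    Assumes `prefix` has nonzero probability under the source.
    """
    k = len(M)
    alphabet_size = len(emit[0])
    belief = list(pi0)  # distribution of the current state given outputs so far
    for z in prefix:
        # Condition the current state on having emitted z.
        weighted = [belief[s] * emit[s][z] for s in range(k)]
        total = sum(weighted)
        posterior = [w / total for w in weighted]
        # Advance the chain by one step.
        belief = [sum(posterior[s] * M[s][t] for s in range(k)) for t in range(k)]
    # Mix the per-state output distributions according to the state belief.
    return [sum(belief[s] * emit[s][a] for s in range(k)) for a in range(alphabet_size)]

# A deterministic toy source that alternates 0, 1, 0, 1, ...
M_alt = [[0.0, 1.0], [1.0, 0.0]]
emit_alt = [[1.0, 0.0], [0.0, 1.0]]
pi0_alt = [1.0, 0.0]
```

Sampling from the conditional distribution then amounts to drawing one symbol from the returned list, so the whole sequence can be generated, or any prefix extended, one symbol at a time.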

Finally, we will use the following notion of mixing matrices from  [7, 2], characterizing which matrices lead to good polar codes. In the study of polarization it is well-known that lower-triangular matrices do not polarize at all, and the polarization characteristics of matrices are invariant under column permutations. Mixing matrices are defined to be those that avoid the above cases.

For prime and , is said to be a mixing matrix if is invertible and for every permutation of the columns of , the resulting matrix is not lower-triangular.
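The definition can be checked mechanically for small matrices. The sketch below is a brute-force test, assuming the matrix is given as a list of rows with integer entries and that `q` is prime; the function names are hypothetical, and trying every column permutation is only practical for small dimensions.

```python
from itertools import permutations

def is_invertible_mod_q(M, q):
    """Gaussian elimination over F_q (q prime); True iff M has full rank."""
    A = [[entry % q for entry in row] for row in M]
    k = len(A)
    for col in range(k):
        pivot = next((r for r in range(col, k) if A[r][col] != 0), None)
        if pivot is None:
            return False
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], -1, q)  # modular inverse, Python 3.8+
        for r in range(col + 1, k):
            factor = (A[r][col] * inv) % q
            A[r] = [(a - factor * b) % q for a, b in zip(A[r], A[col])]
    return True

def is_lower_triangular(M, q):
    """True iff every entry strictly above the diagonal is zero mod q."""
    k = len(M)
    return all(M[i][j] % q == 0 for i in range(k) for j in range(i + 1, k))

def is_mixing(M, q):
    """Invertible, and no column permutation is lower-triangular."""
    if not is_invertible_mod_q(M, q):
        return False
    k = len(M)
    for perm in permutations(range(k)):
        permuted = [[row[j] for j in perm] for row in M]
        if is_lower_triangular(permuted, q):
            return False
    return True
```

Under this definition the $2 \times 2$ matrix with rows $(1, 1)$ and $(0, 1)$ over $\mathbb{F}_2$ is mixing, while the identity matrix and the matrix with rows $(1, 0)$ and $(1, 1)$ are not, since the latter two are already lower-triangular.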

### 2.2 Main Theorems

We are now ready to state the main results of this work formally. We begin with the statement for compressing the output of a hidden Markov model.

For every prime and mixing matrix there exists a preprocessing algorithm (Polar-Preprocess, Algorithm 3), a compression algorithm (Polar-Compress, Algorithm 1), a decompression algorithm (Polar-Decompress, Algorithm 2) and a polynomial such that for every , the following properties hold:

1. Polar-Preprocess is a randomized algorithm that takes as input a Markov source with states, and , and runs in time where and outputs auxiliary information for the compressor and decompressor (for ).

2. Polar-Compress takes as input a sequence as well as the auxiliary information output by the preprocessor, runs in time , and outputs a compressed string . Further, for every auxiliary input, the map is a linear map.

3. Polar-Decompress takes as input a Markov source , a compressed string and the auxiliary information output by the preprocessor, runs in time and outputs . (The runtime of the decompression algorithm can be improved to a runtime of by a simple modification, namely by taking the input matrix to be instead of . In fact we believe the decoding algorithm can be improved to an time algorithm with some extra bookkeeping though we don’t do so here.)

The guarantee provided by the above algorithms is that with probability at least , the Preprocessing Algorithm outputs auxiliary information such that

$$\Pr_{Z \sim \mathcal{H}_n}\left[\text{Polar-Decompress}(\mathcal{H}, S; \text{Polar-Compress}(Z; S)) \neq Z\right] \le O\left(\frac{1}{n^2}\right),$$

provided where is the mixing time of .

(In the above hides constants depending and , but not on or .)

The above linear compression directly yields channel coding for additive Markov channels, via a standard reduction (the details of which are in Section 7).

For every prime and mixing matrix there exists a randomized preprocessing algorithm Preprocess, an encoding algorithm Enc, a decoding algorithm Dec, and a polynomial such that for every , the following properties hold:

1. Preprocess is a randomized algorithm that takes as input an additive Markov channel described by Markov source with states, and , and runs in time where , and outputs auxiliary information for .

2. Enc takes as input a message , where , as well as auxiliary information from the preprocessor and outputs and computes Enc in time.

3. Dec takes as input the Markov source , auxiliary information from the preprocessor and a string , runs in time , and outputs an estimate of the message . (This can similarly be improved to a runtime of .)

The guarantee provided by the above algorithms is that with probability at least , the Preprocessing algorithm outputs such that for all we have

provided where is the mixing time of .

(In the above hides constants that may depend on and but not on or .)

The channel coding theorem follows relatively easily from the compression theorem, and so in the next section we focus on the overview of the proof of the latter.

## 3 Overview of our construction

Basics of polarization. We start with the basics of polarization in the setting of compressing samples from an i.i.d. source. To compress a sequence drawn from some source, the idea is to build an invertible linear function such that for all but fraction of the output coordinates , the conditional entropy is either close to $0$ or close to $1$. (Such an effect is called polarization, as the entropies are driven to polarize toward the two extreme values.) Since a deterministic invertible transformation preserves the total entropy, it follows that roughly output coordinates can have entropy close to $1$ and coordinates have (conditional) entropy close to $0$. Letting denote the coordinates whose conditional entropies are not close to zero, the compression function is simply , the projection of the output onto the coordinates in .
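A single polarization step can be worked out numerically for the simplest case. The sketch below assumes two i.i.d. bits $Z_1, Z_2$ with $\Pr[Z_i = 1] = p$ and the basic Boolean transform $(Z_1, Z_2) \mapsto (Z_1 + Z_2, Z_2)$; this choice of transform and the function names are assumptions for illustration. Since $Z_1 + Z_2$ is a bit that equals 1 with probability $2p(1-p)$, and the transform is invertible, the second conditional entropy follows from conservation of total entropy.

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def one_step(p):
    """Entropies after one step on two i.i.d. Bernoulli(p) bits.

    Returns (H(U1), H(U2 | U1)) for U1 = Z1 + Z2 and U2 = Z2.
    """
    h_u1 = h2(2 * p * (1 - p))         # U1 is Bernoulli(2p(1-p))
    h_u2_given_u1 = 2 * h2(p) - h_u1   # total entropy 2*h2(p) is preserved
    return h_u1, h_u2_given_u1

h_first, h_second = one_step(0.11)
```

For any $p$ strictly between 0 and 1/2, the first output is strictly more random than an input bit and the second is strictly less random given the first, while the two entropies still sum to $2h(p)$. Repeating such steps is what drives the conditional entropies toward the two extremes.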

Picking a random linear function would satisfy the properties above with high probability, but this is not known (and unlikely) to be accompanied by efficient algorithms. To get the algorithmics (how to compute efficiently, to determine efficiently, and to decompress efficiently) one uses a recursive construction of . For our purposes the following explanation works best: Let and view and as an matrix over , where the elements of arrive one row at a time. Let denote the operation mapping to that applies to each row of separately. Let denote the operation that applies to each column separately. Then . The base case is given by .

Intuitively, when the elements of are independent and identical, the operation already polarizes the outputs somewhat and so a moderate fraction of the outputs of have conditional entropies moderately close to $0$ or $1$. The further application of further polarizes the output, bringing a larger fraction of the conditional entropies of the output even closer to $0$ or $1$.
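The recursive row-then-column construction can be sketched in a few lines. Because the exact parameters are not legible in the text above, this sketch makes its own assumptions: the length satisfies $n = m^2$ at every level of the recursion (so $n \in \{2, 4, 16, 256, \ldots\}$), the base case is the Boolean map $(u, v) \mapsto (u + v, v)$, and the output is returned in row-major order. The function name is hypothetical.

```python
import math

def polar_transform(z, q=2):
    """Recursive transform: apply the smaller transform to each row, then to
    each column, of z viewed as an m x m matrix filled one row at a time."""
    n = len(z)
    if n < 2:
        raise ValueError("length must be at least 2")
    if n == 2:
        # Assumed base case: (u, v) -> (u + v, v) over F_q.
        return [(z[0] + z[1]) % q, z[1] % q]
    m = math.isqrt(n)
    if m * m != n:
        raise ValueError("length must be of the form 2^(2^s)")
    rows = [z[i * m:(i + 1) * m] for i in range(m)]
    rows = [polar_transform(row, q) for row in rows]              # row step
    cols = [polar_transform([rows[i][j] for i in range(m)], q)    # column step
            for j in range(m)]
    # Flatten back to row-major order.
    return [cols[j][i] for i in range(m) for j in range(m)]
```

Each level is a composition of invertible linear maps, so the whole transform is an invertible linear map on $\mathbb{F}_q^n$, which is why it preserves the total entropy of its input.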

Polarization for Markovian Sources. When applied to a source with memory, the analysis in , reinterpreted to facilitate our subsequent modification of the above polar construction, goes roughly as follows: Since the elements of the row are not really independent one cannot count on the polarization effects of . But, letting one can show that most elements of the column of are almost independent of each other, provided is much larger than the mixing time of the source. (Here we imagine that the entries of arrive row-by-row, so that the source outputs within each row are temporally well-separated from most entries of the previous row, when is large.) Further, this almost independence holds even when conditioning on the columns for most values of . Thus the operation continues to have its polarization effects and this is good enough to get a qualitatively strong polarization theorem (about the operator !).

The above analysis is asymptotic, proving that in the limit of large block lengths we get optimal compression. However, we do not know how to give an effective finite-length analysis of the polarization process for Markovian processes, as the analyses in [5, 6] crucially rely on independence, which we lack within a row.

Our Modified Code and Ingredients of Analysis. To enable a finite-length analysis, we make a minor, but quite important, alteration to the polar code: Instead of using we simply use the transformation (or in other words, we replace the inner function in the definition of by the identity function). This implies that we lose whatever polarization effects of we may have been counting on, but as pointed out above, for Markov sources, we weren’t counting on polarization here anyway!

The crucial property we identify and exploit in the analysis is the following: the Markovian nature of the source plus the row-by-row arrival ordering of , implies that the distribution of the ’th source column conditioned on the previous columns , is close to a product distribution, for all but the last few (say ) columns. (We handle the non-independence in the last few columns by simply outputting those columns in their entirety, rather than only a set of entropy-carrying positions. This only adds an fraction to the output length, which we can afford.)

It turns out that the analysis of the polar transform only needs independent inputs, which however need not be identically distributed. We are then able to apply the recent analysis from , essentially as black box, to argue that will compress each of the conditioned sources to its respective entropy, and also establish fast convergence via quantitatively strong polynomial (in the gap to capacity) upper bounds on the needed to achieve this. Further, we automatically benefit from the generality of the analysis in , which applies not only to the transform at the base case, but in fact any transform (satisfying some minimal necessary conditions) over an arbitrary prime field . Previous works on polar coding for Markovian sources [4, 9, 12] only applied for Boolean sources.

We remark that the use of the identity transform for the rows in is quite counterintuitive. It implies that the compression matrix is a block diagonal matrix (after some permutation of the rows and columns) — and in turn this seems to suggest that we are compressing different parts of the input sequence “independently”. However this is not quite true. The relationship between the blocks ends up influencing the final set of the bits of that are output by the compression algorithm. Furthermore the decompression relies on the information obtained from the decompression of the blocks corresponding to to compute the block .
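The block structure described above can be seen directly in a small sketch. Here the source stream is laid out row by row in a grid with `num_cols` columns, and a polar transform is applied to each column separately, with nothing applied along the rows. The column transform used is a standard recursive Boolean polar transform chosen for illustration (the paper allows any mixing matrix), and all names are hypothetical.

```python
def column_polar(v, q=2):
    """A standard recursive polar transform on a vector of power-of-two length
    (an illustrative stand-in for the column transform)."""
    n = len(v)
    if n == 1:
        return [v[0] % q]
    if n % 2 != 0:
        raise ValueError("length must be a power of two")
    half = n // 2
    top = [(a + b) % q for a, b in zip(v[:half], v[half:])]
    return column_polar(top, q) + column_polar(v[half:], q)

def columnwise_transform(z, num_cols, q=2):
    """Transform each column of the row-by-row grid independently.

    Entry (r, j) of the grid is source output number r * num_cols + j, so
    consecutive entries of one column are num_cols time steps apart.
    Returns the list of transformed columns.
    """
    if len(z) % num_cols != 0:
        raise ValueError("length must be a multiple of num_cols")
    num_rows = len(z) // num_cols
    grid = [z[r * num_cols:(r + 1) * num_cols] for r in range(num_rows)]
    return [column_polar([grid[r][j] for r in range(num_rows)], q)
            for j in range(num_cols)]
```

Changing a single input symbol changes only the transformed column it belongs to, which is the block-diagonal structure of the compression matrix. The dependence between blocks enters only through which coordinates of each column are kept and through the conditional distributions used during decompression.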

Decompression algorithm. Our alteration to apply the identity transform for the rows also helps us with the task of decompression. Toward this, we build on a decompression algorithm for memoryless sources from  that is somewhat different looking from the usual ones in the polar coding literature. This algorithm aims to compute one column at a time, given . Given the first columns , the algorithm first computes the conditional distribution of conditioned on and then uses a recursive decoding algorithm for to determine . The key to the recursive use is again that the decoding algorithm works as long as the input variables are independent (and in particular, does not need them to be identically distributed).

In our Markovian setting, we now have to compute the conditional distribution of conditioned on . But as mentioned above, this conditional distribution is close to a product distribution, say (except for the last few columns where decompression is trivial as we output the entire column). Further, the marginals of this product distribution are easily computed using dynamic programming (via what is called the “Forward Algorithm” for hidden Markov models, described for completeness in Appendix A). We can then determine the ’th column (having already recovered the first columns as ) by running (in a black box fashion) the polar decompressor from  for the memoryless case, feeding this product distribution as the source distribution.

Computing the output indices. Finally we need one more piece to make the result fully constructive. This is the preprocessing needed to compute the subset of the coordinates of that have noticeable conditional entropy. For the memoryless case these computations were shown to be polynomial time computable in the works of [8, 5, 11]. We manage to extend the ideas from Guruswami and Xia  to the case of Markovian channels as well. It turns out the only ingredients needed to make this computation work are, again, the ability to compute the distributions of conditioned on for typical values of . We note that unlike in the setting of memoryless channels (or i.i.d. sources) our preprocessing step is randomized. We believe this is related to the issue that there is no “closed” form solutions to basic questions related to Markovian sources and channels (such as the capacity of the channel in Figure 1) and this forces us to use some random sampling and estimation to compute some of the conditional entropies needed by our algorithms.

Organization of the rest of the paper. In the next section (Section 4) we describe our compression and decompression algorithms. In Section 5 we describe a notion of “nice”-ness for the preprocessing stage and show that if the preprocessing algorithm returns a nice output, then the compression and decompression algorithms work correctly with moderately high probability (over the message produced by the source). In Section 6 we describe our preprocessing algorithm that returns a nice set with all but exponentially small failure probability (over its internal coin tosses). Finally in Section 7 we give the formal proofs of the two main theorems stated in Section 2.2.

## 4 Construction

### 4.1 Compression Algorithm

Our compression, decompression and preprocessing algorithms are defined with respect to arbitrary mixing matrices . (Recall that mixing matrices were defined in Definition 2.1.) Though a reader seeking simplicity may set and . Given integer , let and let be the polarization transform given by .

### 4.2 Fast Decompressor

The decompressor below makes black-box use of the Fast-Decoder from [2, Algorithm 4].

The Fast-Decoder takes as input the description of a product distribution on inputs in , as well as the specified coordinates of the compression . It is intended to decode from the encoding , where , coordinates of are independent, and is defined by on the high-entropy coordinates of (and otherwise). It outputs an estimate of the input .

Note that, for a Markov source on states, Line  7 takes time (time per coordinate of , using the Forward Algorithm). The Fast-Decoder call in Line  9 takes time . Thus, the total runtime is .

## 5 Analysis

The goal of this section is to prove that the decompressor works correctly, with high probability, provided the preprocessing stage returns the appropriate sets . Specifically, we prove Theorem 5 as stated below. But first we need a definition of “nice” sets : We will later show that preprocessing produces such sets and that compression and decompression work correctly (w.h.p.) given nice sets.

[-niceness] Let be a Markov source. For every and , let be the corresponding “independent” distribution. Let .

We call sets “-nice” if they satisfy the following:

Now, the rest of this section will show the following.

There exists a polynomial such that for every , , and the following holds:

Let be an aperiodic irreducible Markov source with alphabet , mixing time and underlying state space . Define random variables as generated by . Then, for all sets that are -nice as per Definition 5, we have:

$$\Pr_Z\left[\text{Polar-Decompress}(\text{Polar-Compress}(Z; \{S_j\}_{j \in [m]})) \neq Z\right] \le n\zeta + m \exp(-\epsilon m / \tau)$$

### 5.1 Proof Overview

Throughout this section, let be a stationary Markov source with alphabet and mixing-time . The key part of the analysis is showing that compression and decompression succeed when applied to the “independent” distribution . To do this, we first show that the compression transform “polarizes” entropies, which follows directly from the results of [2, 3]. Then we show that, provided “nice” sets can be computed (low-entropy sets, a la Definition 5), the compression and decompression succeed with high probability. This also follows essentially in a black-box fashion from the results of . Finally, we argue that the compression and decompression also work for the actual distribution , simply by observing that the involved variables are close in distribution.

We later describe how such “nice” sets can be computed in polynomial time, given the description of the Markov source .

### 5.2 Polarization

In this section, we show that the compression transform polarizes entropies.

Let be a Markov source, and let . Let .

Then, there exists a polynomial such that for every , there exists such that if , the following holds: For all but -fraction of indices , the normalized entropy

$$\overline{H}(\overline{U}_{i,j} \mid \overline{U}_{\prec (i,j)}) \notin (\exp(-m^{\beta}), 1 - \epsilon)$$
###### Proof.

We will show that for each column , all but -fraction of indices have entropies

 ¯H(¯Uji|¯Uj

Indeed, this follows directly from the analysis in . For each , the set of variables are independent and identically distributed. Thus, Theorem 5.2 from  (reproduced below) implies that the conditional entropies are polarized. Specifically, let and be as guaranteed by Theorem 5.2, for the distribution . Then, since , we have

 ¯H(¯Uji|¯Uj

The following theorem is direct from the works . For every , prime , mixing-matrix , discrete set , and any distribution , the following holds. Define the random vectors and where and each component is independent and identically distributed .

Let . Then, the conditional entropies of are polarized: There exists a polynomial and such that for every , if , then all but -fraction of indices have normalized entropy

 ¯H(Xi|X

### 5.3 Independent Analysis

Now we show that the Polar Compressor and Decompressor succeed with high probability, when applied to the “independent” input distribution .

First, we recall the (inefficient) Successive-Cancellation Decoder of Polar Codes. This is reproduced as in , with minor notational changes. We will use this decoder to reason about the efficient fast decoder.

The SC-Decoder is intended to decode from the encoding where , coordinates of are independent, and is the high-entropy coordinates of . It outputs an estimate of that is correct with high probability, from which we can decode the original inputs .

The SC-Decoder takes as input the product distribution on inputs , as well as the high-entropy coordinates .

Note that several of the above steps, including computing the joint distribution and marginal distributions of , are not computationally efficient.

The following claim is equivalent to [2, Claim A.1], and states that the failure probability of the SC-Decoder is at most the sum of conditional entropies on the unspecified coordinates of . Let be a random vector with independent (not necessarily identically distributed) components . Denote the distribution of as . Let , and .

Then,

 Pr[\sc SC-Decoder(D˜Z;˜US)≠U]≤∑i∉S¯H(˜Ui|˜U

Let , and let . For a fixed and fixed conditioning , let denote the distribution .

Then, for all and all ,

 Prz∼¯Zu←Pm(zj)Dzj|z
###### Proof.

This follows directly from Claim 1.

 \Ez∼¯Zu←Pm(zj)Dzj|z

Using the SC-Decoder, we can define the following (inefficient) decompressor. We will then relate its performance to the fast decompressor, and thereby conclude the desired correctness property of the latter.

Let , and . Then, for all sets ,

 Pr¯U←Pcolumnm(¯Z)[\sc SC-Polar-DecompressH(U1S1,U2S2,…,UmSm)≠¯U]≤∑j∈[m],i∉Sj¯H(¯Uji|¯Uj
###### Proof.

This follows directly from Claim 5.3.

 Pr¯U←Pcolumnm(¯Z)[\sc SC-Polar-Decompress(U1S1,U2S2,…,UmSm)≠¯U] =Pr¯U←Pcolumnm(¯Z)⎡⎣⋃j∈[m]{^Uj≠¯Uj and ^U