# General Strong Polarization

Arıkan's exciting discovery of polar codes has provided an altogether new way to efficiently achieve Shannon capacity. Given a (constant-sized) invertible matrix $M$, a family of polar codes can be associated with this matrix and its ability to approach capacity follows from the polarization of an associated [0,1]-bounded martingale, namely its convergence in the limit to either 0 or 1 with probability 1. Arıkan showed appropriate polarization of the martingale associated with the matrix $G_2 = \left(\begin{smallmatrix}1 & 0\\ 1 & 1\end{smallmatrix}\right)$ to get capacity achieving codes. His analysis was later extended to all matrices $M$ which satisfy an obvious necessary condition for polarization. While Arıkan's theorem does not guarantee that the codes achieve capacity at small blocklengths, it turns out that a "strong" analysis of the polarization of the underlying martingale would lead to such constructions. Indeed for the martingale associated with $G_2$ such a strong polarization was shown in two independent works ([Guruswami and Xia, IEEE IT '15] and [Hassani et al., IEEE IT '14]), thereby resolving a major theoretical challenge associated with the efficient attainment of Shannon capacity. In this work we extend the result above to cover martingales associated with all matrices that satisfy the necessary condition for (weak) polarization. In addition to being vastly more general, our proofs of strong polarization are (in our view) also much simpler and modular. Key to our proof is a notion of local polarization that only depends on the evolution of the martingale in a single time step. Our result shows strong polarization over all prime fields and leads to efficient capacity-achieving source codes for compressing arbitrary i.i.d. sources, and capacity-achieving channel codes for arbitrary symmetric memoryless channels.


## 1 Introduction

Polar codes, proposed in Arıkan’s remarkable work [2], gave a fresh information-theoretic approach to construct linear codes that achieve the Shannon capacity of symmetric channels, together with efficient encoding and decoding algorithms. About a decade after their discovery, there is now a vast and extensive body of work on polar coding spanning hundreds of papers, and polar codes are also being considered as one of the candidates for use in 5G wireless (e.g., see [7] and references therein). The underlying concept of polarizing transforms has emerged as a versatile tool to successfully attack a diverse collection of information-theoretic problems beyond the original channel and source coding applications, including wiretap channels [16], the Slepian-Wolf, Wyner-Ziv, and Gelfand-Pinsker problems [14], broadcast channels [9], multiple access channels [22, 1], and interference networks [24]. We recommend the survey by Şaşoğlu [21] for a nice treatment of the early work on polar codes.

The algorithmic interest in polar codes emerges from a consequence shown in the works [11, 12, 10], which show that this approach leads to a family of codes of rate $\mathrm{Capacity}(\mathcal{C}) - \varepsilon$ for transmission over a channel $\mathcal{C}$ of (Shannon) capacity $\mathrm{Capacity}(\mathcal{C})$, where the block length of the code and the decoding time grow only polynomially in $1/\varepsilon$. In contrast, for all previous constructions of codes, the decoding algorithms required time exponential in $1/\varepsilon$. Getting a running time polynomial in $1/\varepsilon$ was arguably one of the most important theoretical challenges in the field of algorithmic coding theory, and polar codes were the first to overcome this challenge. The analyses of polar codes turn into questions about the polarization of certain martingales. The vast class of polar codes alluded to in the previous paragraph all build on polarizing martingales, and the results of [11, 12, 10] show that for one of the families of polar codes, the underlying martingale polarizes "extremely fast" — a notion we refer to as strong polarization (which we will define shortly).

The primary goal of this work is to understand the process of polarization of martingales, and in particular to understand when a martingale polarizes strongly. In attempting to study this question, we come up with a local notion of polarization and show that this local notion is sufficient to imply strong polarization. Applying this improved understanding to the martingales arising in the study of polar codes we show that a simple necessary condition for weak polarization of such martingales is actually sufficient for strong polarization. This allows us to extend the results of [11, 12, 10] to a broad class of codes and show essentially that all polarizing codes lead to polynomial convergence to capacity. Below we formally describe the notion of polarization of martingales and our results.

### 1.1 Polarization of [0,1]-martingales

Our interest is mainly in the (rate of) polarization of a specific family of martingales that we call the Arıkan martingales. We will define these objects later, but first describe the notion of polarization for general $[0,1]$-bounded martingales. Recall that a sequence of random variables $X_0, X_1, X_2, \ldots$ is said to be a martingale if for every $t$ and every $x_0, \ldots, x_t$ it is the case that $\mathbb{E}[X_{t+1} \mid X_0 = x_0, \ldots, X_t = x_t] = x_t$. We say that a martingale is $[0,1]$-bounded (or simply a $[0,1]$-martingale) if $X_t \in [0,1]$ for all $t$.

**Definition (Weak Polarization).** A $[0,1]$-martingale sequence $X_0, X_1, \ldots$ is defined to be weakly polarizing if $\lim_{t\to\infty} X_t$ exists with probability $1$, and this limit is either $0$ or $1$ (and so the limit is a Bernoulli random variable with expectation $X_0$).

Thus a polarizing martingale does not converge to a single value with probability $1$, but rather converges to one of its two extreme values. For the applications to constructions of polar codes, we need more explicit bounds on the rates of convergence, leading to the notions of (regular) polarization and strong polarization defined below.

**Definition ($(\tau,\varepsilon)$-Polarization).** For functions $\tau, \varepsilon : \mathbb{N} \to [0,1]$, a $[0,1]$-martingale sequence $X_0, X_1, \ldots$ is defined to be $(\tau,\varepsilon)$-polarizing if for all $t$ we have

$$\Pr\bigl[X_t \in (\tau(t),\, 1-\tau(t))\bigr] < \varepsilon(t).$$

**Definition (Regular Polarization).** A $[0,1]$-martingale sequence is defined to be regular polarizing if for every constant $\gamma > 0$ there exists a function $\varepsilon : \mathbb{N} \to [0,1]$ with $\lim_{t\to\infty} \varepsilon(t) = 0$ such that the sequence is $(\tau, \varepsilon)$-polarizing for $\tau(t) = 2^{-\gamma t}$.

We refer to the above as the martingale being "sub-exponentially" close to the limit (since the condition holds for every constant $\gamma > 0$). While weak polarization by itself is an interesting phenomenon, regular polarization (of Arıkan martingales) leads to capacity-achieving codes (though without explicit bounds on the length of the code as a function of the gap to capacity); thus regular polarization is well-explored in the literature, and tight necessary and sufficient conditions are known for regular polarization of Arıkan martingales [3, 15].

To get codes of block length polynomially small in the gap to capacity, an even stronger notion of polarization is needed, where we require that the sub-exponential closeness to the limit happens with all but exponentially small probability. We define this formally next.

**Definition (Strong Polarization).** A $[0,1]$-martingale sequence is defined to be strongly polarizing if for every constant $\gamma > 0$ there exist $\eta < 1$ and $\beta < \infty$ such that the martingale is $(\tau, \varepsilon)$-polarizing for $\tau(t) = 2^{-\gamma t}$ and $\varepsilon(t) = \beta \cdot \eta^{t}$.

In contrast to the rich literature on regular polarization, results on strong polarization are quite rare, reflecting a general lack of understanding of this phenomenon. Indeed (roughly) an Arıkan martingale can be associated with every invertible matrix over any finite field $\mathbb{F}_q$, and the only matrix for which strong polarization is known is $G_2$ [11, 12, 10]. (An exception is the work by Pfister and Urbanke [19], who showed that for the $q$-ary erasure channel for large enough $q$, the martingale associated with a Reed-Solomon based matrix proposed in [18] polarizes strongly. A recent (unpublished) work [8] shows that for the binary erasure channel, martingales associated with large random matrices polarize strongly. Both these results obtain optimal parameters for (specific/random) large matrices. However, they only apply to the erasure channel, which is simple to error-correct via Gaussian elimination and therefore not really reflective of the general capacity-achieving power of polar codes.)

Part of the reason behind the lack of understanding of strong polarization is that polarization is a "limiting phenomenon," in that one tries to understand $\lim_{t\to\infty} X_t$, whereas most stochastic processes, and the Arıkan martingales in particular, are defined by local evolution, i.e., a rule that relates $X_{t+1}$ to $X_t$. The main contribution of this work is to give a local definition of polarization (Definition 1.2) and then to show that this definition implies strong polarization (Theorem 1.2). Later we show that Arıkan martingales polarize locally whenever they satisfy a simple condition that is necessary even for weak polarization. As a consequence we get strong polarization for all Arıkan martingales for which previously only regular polarization was known.

### 1.2 Results I: Local Polarization and Implication

Before giving the definition of local polarization, we give some intuition using the following martingale: let $X_0 = 1/2$ and $X_{t+1} = X_t \pm 2^{-(t+2)}$, where the signs are chosen uniformly and independently. Clearly this sequence is not polarizing (the limit of $X_t$ is uniform over $[0,1]$). One reason this happens is that as time progresses, the martingale slows down and stops varying much. We would like to prevent this, but some slowing down is inevitable if a martingale is polarizing: a polarizing martingale settles near the boundary, where it cannot vary much. The first condition in our definition of local polarization insists that this be the only reason a martingale slows down (we refer to this as variance in the middle).
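As a sanity check, the non-polarizing behavior of this toy martingale is easy to observe numerically. The sketch below is our own illustration (not from the paper), assuming the step rule $X_{t+1} = X_t \pm 2^{-(t+2)}$ with uniform independent signs; it estimates where the limiting mass ends up.

```python
import random

def run_martingale(T=30):
    """One trajectory of X_{t+1} = X_t ± 2^{-(t+2)}, starting at X_0 = 1/2."""
    x = 0.5
    for t in range(T):
        x += random.choice((-1, 1)) * 2.0 ** (-(t + 2))
    return x

random.seed(0)
finals = [run_martingale() for _ in range(10_000)]
# The limit spreads over [0, 1] instead of concentrating at the extremes {0, 1}:
middle = sum(0.1 < x < 0.9 for x in finals) / len(finals)
print(f"fraction of trajectories ending in (0.1, 0.9): {middle:.3f}")
```

A uniform limit would place about 80% of the mass in $(0.1, 0.9)$, which is what the experiment shows; a polarizing martingale would instead push this fraction toward $0$.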

Next we consider what happens when a martingale is close to the boundary. For this part, consider a martingale with $X_0 = 1/2$ whose distance to the boundary at most halves in each step. Such a martingale can polarize, and can even exhibit regular polarization, but the probability that $X_t < 2^{-(t+1)}$ is zero (whereas we would like the probability that $X_t$ is smaller than, say, $2^{-t^2}$ to be bounded away from zero). So this martingale definitely does not show strong polarization. This is so since even in the best case the martingale approaches the boundary at a fixed exponential rate, and not a sub-exponential one. To overcome this obstacle we require that when the martingale is close to the boundary, with a fixed constant probability it should get much closer in a single step (a notion we refer to as suction at the ends).

The definition below makes the above requirements precise.

**Definition (Local Polarization).** A $[0,1]$-martingale sequence $X_0, X_1, \ldots$ is locally polarizing if the following conditions hold:

1. (Variance in the middle): For every $\tau > 0$, there is a $\theta = \theta(\tau) > 0$ such that for all $t$, we have: if $X_t \in (\tau, 1-\tau)$ then $\mathrm{Var}[X_{t+1} \mid X_0, \ldots, X_t] \geq \theta$.

2. (Suction at the ends): There exists an $\alpha > 0$, such that for all $c < \infty$, there exists a $\tau = \tau(c) > 0$ such that:

   (a) If $X_t \leq \tau$ then $\Pr[X_{t+1} \leq X_t / c] \geq \alpha$.

   (b) Similarly, if $1 - X_t \leq \tau$ then $\Pr[1 - X_{t+1} \leq (1 - X_t)/c] \geq \alpha$.

We refer to condition (a) above as suction at the low end and condition (b) as suction at the high end.

When we wish to be more explicit, we refer to the sequence as $(\theta(\cdot), \alpha, \tau(\cdot))$-locally polarizing.

As such this definition is neither obviously sufficient for strong polarization, nor is it obviously satisfiable by any interesting martingale. In the rest of the paper, we address these concerns. Our first technical contribution is a general theorem connecting local polarization to strong polarization.

**Theorem (Local vs. Strong Polarization).** If a $[0,1]$-martingale sequence is locally polarizing, then it is also strongly polarizing.

It remains to show that the notion of local polarization is not vacuous. Next, we show that in fact Arıkan martingales polarize locally (under simple necessary conditions). First we give some background on Polar codes.
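To make these notions concrete, here is a small numerical experiment of our own (a sketch, not from the paper) with the textbook instance of such a martingale: for the binary erasure channel under the kernel $G_2$, the erasure parameter evolves as $x \mapsto x^2$ or $x \mapsto 2x - x^2$, each with probability $1/2$. Empirically the sequence escapes the middle region rapidly, in line with strong polarization.

```python
import random

def bec_step(x):
    """One polarization step for the BEC under the G_2 kernel: the erasure
    parameter goes to x^2 or 2x - x^2, each with probability 1/2.
    Note E[X_{t+1} | X_t = x] = (x^2 + (2x - x^2)) / 2 = x, so this is a martingale."""
    return x * x if random.random() < 0.5 else 2 * x - x * x

def unpolarized_fraction(x0=0.5, T=40, trials=20_000, tau=2.0 ** -10):
    """Estimate Pr[X_T in (tau, 1 - tau)], the mass still far from {0, 1}."""
    count = 0
    for _ in range(trials):
        x = x0
        for _ in range(T):
            x = bec_step(x)
        if tau < x < 1 - tau:
            count += 1
    return count / trials

random.seed(1)
print(f"estimated Pr[X_40 in (2^-10, 1 - 2^-10)]: {unpolarized_fraction():.4f}")
```

The estimated probability of remaining in the middle band is a small fraction of a percent after 40 steps, even though the band $(2^{-10}, 1-2^{-10})$ is itself exponentially thin.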

### 1.3 The Arıkan martingale and Polar codes

The setting of polar codes considers an arbitrary symmetric memoryless channel and yields codes that aim to achieve the capacity of this channel. These notions are reviewed in Section 2.2.1. Given any $q$-ary symmetric memoryless channel $\mathcal{C}$ and invertible matrix $M \in \mathbb{F}_q^{k \times k}$, the theory of polar codes implicitly defines a martingale, which we call the Arıkan martingale associated with $(M, \mathcal{C})$, and studies its polarization. (An additional contribution of this work is that we give an explicit compact definition of this martingale; see Definition 4. Since we do not need this definition for the purposes of this section, we defer it to Section 4.) The consequences of regular polarization are described by the following remarkable theorem. (Below we use $M \otimes M'$ to denote the tensor product of the matrices $M$ and $M'$. Further, we use $M^{\otimes t}$ to denote the tensor product of a matrix $M$ with itself $t$ times.)

**Theorem (implied by Arıkan [2]).** Let $\mathcal{C}$ be a $q$-ary symmetric memoryless channel and let $M \in \mathbb{F}_q^{k \times k}$ be an invertible matrix. If the Arıkan martingale associated with $(M, \mathcal{C})$ polarizes regularly, then given $\varepsilon > 0$ and $\delta > 0$ there is a $t_0$ such that for every $t \geq t_0$ there is a code $C$ for $\mathcal{C}$ of dimension at least $(\mathrm{Capacity}(\mathcal{C}) - \varepsilon) \cdot k^t$ such that $C$ is an affine code generated by the restriction of $M^{\otimes t}$ to a subset of its rows and an affine shift. Moreover there is a polynomial time decoding algorithm for these codes that has failure probability bounded by $\delta$. (We remark that the encoding and decoding are not completely uniform as described above, since the subset of rows and the affine shift that are needed to specify the code are only guaranteed to exist. In the case of additive channels, where the shift can be assumed to be zero, the work of Tal and Vardy [23] (or [11, Sec. V]) removes this non-uniformity by giving a polynomial time algorithm to find the subset.)

For $q = 2$, Arıkan and Telatar [3] proved that the martingale associated with the matrix $G_2$ polarizes regularly over any binary input symmetric channel (Arıkan's original paper [2] proved a weaker form of regular polarization, which also sufficed for the decoding error to go to $0$). Subsequent work generalized this to other matrices, with the work of Korada, Şaşoğlu, and Urbanke [15] giving a precise characterization of matrices for which the Arıkan martingale polarizes (again over binary input channels). We will refer to such matrices as mixing.

**Definition (Mixing Matrix).** A matrix $M \in \mathbb{F}_q^{k \times k}$ is said to be mixing if it is invertible and none of the permutations of the rows of $M$ yields an upper triangular matrix, i.e., for every permutation $\pi : [k] \to [k]$ there exist $i > j$ such that $M_{\pi(i), j} \neq 0$.

It is not too hard to show that the Arıkan martingales associated with non-mixing matrices do not polarize (even weakly). In contrast, [15] shows that every mixing matrix over $\mathbb{F}_2$ polarizes regularly. Mori and Tanaka [18] show that the same result holds for all prime fields, and give a slightly more complicated criterion that characterizes (regular) polarization for general fields. (These works show that the decoding failure probability of the resulting polar codes is at most $2^{-n^{\beta}}$, where $n$ is the block length, for some positive $\beta$ determined by the structure of the mixing matrix — this follows from an even stronger decay in the first of the two parameters in the definition of polarization. However, they do not show strong polarization, which is what we achieve.)

As alluded to earlier, strong polarization leads to even more effective code constructions and this is captured by the following theorem.

**Theorem ([2, 11, 12]).** Let $\mathcal{C}$ be a $q$-ary symmetric memoryless channel and let $M \in \mathbb{F}_q^{k \times k}$ be an invertible matrix. If the Arıkan martingale associated with $(M, \mathcal{C})$ polarizes strongly, then for every $c < \infty$ there exists $t_0$ such that for every $\varepsilon > 0$ and every $t \geq t_0 \cdot \log(1/\varepsilon)$ there is an affine code $C \subseteq \mathbb{F}_q^{n}$, generated by the rows of $M^{\otimes t}$ and an affine shift, with the property that the rate of $C$ is at least $\mathrm{Capacity}(\mathcal{C}) - \varepsilon$, and $C$ can be encoded and decoded in time $\mathrm{poly}(n)$, where $n = k^t$, with failure probability of the decoder at most $n^{-c}$.

This theorem is implicit in the works above, but for completeness we include a proof of this theorem in Appendix A. As alluded to earlier, the only Arıkan martingales that were known to polarize strongly were those where the underlying matrix was $G_2$. Specifically, Guruswami and Xia [11] and Hassani et al. [12] show strong polarization of the Arıkan martingale associated with this matrix over any binary input symmetric channel, and Guruswami and Velingker [10] extended this to the case of $q$-ary input channels for prime $q$. By using the concept of local polarization we are able to extend these results to all mixing matrices.

### 1.4 Results II: Local polarization of Arıkan martingales

In our second main result, we show that every mixing matrix gives rise to an Arıkan martingale that is locally polarizing:

**Theorem.** For every prime $q$, for every mixing matrix $M \in \mathbb{F}_q^{k \times k}$, and for every symmetric memoryless channel over $\mathbb{F}_q$, the associated Arıkan martingale sequence is locally polarizing.

As a consequence of Theorems 1.3, 1.2, and 1.4, we have the following theorem.

**Theorem.** For every prime $q$, every mixing matrix $M \in \mathbb{F}_q^{k \times k}$, every symmetric memoryless channel $\mathcal{C}$ over $\mathbb{F}_q$, and every $c < \infty$, there exists $t_0$ such that for every $\varepsilon > 0$ and every $t \geq t_0 \cdot \log(1/\varepsilon)$, there is an affine code $C \subseteq \mathbb{F}_q^{n}$, generated by the rows of $M^{\otimes t}$ and an affine shift, with the property that the rate of $C$ is at least $\mathrm{Capacity}(\mathcal{C}) - \varepsilon$, and $C$ can be encoded and decoded in time $\mathrm{poly}(n)$, where $n = k^t$, with failure probability of the decoder at most $n^{-c}$.

The above theorem shows that the polar codes associated with every mixing matrix achieve the Shannon capacity of a symmetric memoryless channel efficiently, thus vastly expanding the class of polar codes known to satisfy this condition.

Our primary motivation in this work is to develop a general approach to proving polarization that applies to all matrices (matching the simple necessary condition for polarization) and is strong enough for the desired coding-theoretic conclusion (convergence to capacity at polynomial block lengths, the distinguishing feature of polar codes). At the same time, our proof is arguably simpler and brings to light exactly what drives strong polarization — namely, some simple local polarization conditions that hold for the single-step evolution. One concrete motivation to consider polar codes with different choices of mixing matrices is that an appropriate choice can lead to a decoding error probability of $2^{-n^{\beta}}$ for any $\beta < 1$ (as opposed to $\beta < 1/2$ for $G_2$) [15, 18], where $n$ is the block length of the code.

### 1.5 Comparison with previous analyses of (strong) polarization

While most of the ingredients going into our eventual analysis of strong polarization are familiar in the literature on polar codes, our proofs end up being much simpler and modular. We describe some of the key steps in our proofs and contrast them with those in previous works.

Definition of Local Polarization. While we are not aware of a definition similar to local polarization being explicit in the literature before, such notions have been considered implicitly. For instance, for the variance in the middle condition (where we require a lower bound $\theta(\tau)$ on the one-step variance when $X_t \in (\tau, 1-\tau)$), the previous analyses in [11, 10] required the one-step variation to be quadratic in the distance of $X_t$ from the boundary. Indeed this was the most significant technical hurdle in the analysis for the prime case in [10]. In contrast, our requirement on the variance is very weak and qualitative, allowing any function $\theta(\tau) > 0$. Similarly, our requirement in the suction at the ends case is relatively mild and qualitative. In previous analyses the requirements were of the form "if $X_t$ is close to the boundary then, with positive probability, $X_{t+1}$ is quadratically closer." This high demand on the suction case prevented the analyses from relying only on the local behavior of the martingale; instead they had to look at other parameters associated with it, which essentially depend on the entire sequence. (For the reader familiar with previous analyses, this is where the Bhattacharyya parameters enter the picture.) Our approach, in contrast, only requires an arbitrarily large constant-factor drop, and thereby works entirely with the local properties of the martingale.

Local Polarization implies Strong Polarization. Our proof that local polarization implies strong polarization is short (about 3 pages) and comes in two parts. The first part uses a simple variance argument to show that $X_t$ is exponentially close (in $t$) to the limit except with probability exponentially small in $t$. The second part then amplifies $X_t$'s proximity to the limit $\{0,1\}$ to sub-exponentially small values using the suction at the ends guarantee of each local step, coupled with Doob's martingale inequality and standard concentration inequalities. Such a two-part breakdown of the analysis is not new; however, our technical implementation is more abstract, more general, and more compact all at the same time.

Local Polarization of Arıkan martingales. We will elaborate further on the approach for this after defining the Arıkan martingales, but we can say a little bit already now: first, we essentially reduce the analysis of the polarization of the Arıkan martingale associated with an arbitrary mixing matrix to the analysis of the case $M = G_2$. This reduction loses in the parameters specifying the level of local polarization, but since our strong polarization theorem works for any parameter functions, such loss in performance does not hurt the eventual result. Finally, local polarization for the case where the matrix is $G_2$ is of course standard, but even here our proofs (which we include for completeness) are simpler, since they follow from known entropic inequalities on sums of two independent random variables. We stress that even quantitatively weak forms of these inequalities meet our requirements of local polarization, and we do not need strong forms of such inequalities (like Mrs. Gerber's lemma for the binary case [5, 11] and an ad hoc one for the prime case [10]).

Some weaknesses in our analyses. We now point out two weaknesses in our analyses. First, in contrast to the result of Mori and Tanaka [18], who characterize the set of matrices that lead to regular polarization over all fields, we only get a characterization over prime fields. Second, our definition of strong polarization only allows us to bound the failure probability of decoding by an arbitrarily small polynomial in the block length, whereas results such as those in [3] actually get exponentially small ($2^{-n^{\beta}}$ for some $\beta > 0$) failure probability.

In both cases we do not believe that these limitations are inherent to our approach. In particular, the extension to general fields will probably involve more care, but should not run into major technical hurdles. Reducing the failure probability will lead to new technical challenges, but we do believe they can be overcome. Specifically, this requires stronger suction, which does not hold for the Arıkan martingale if one considers a single step of the evolution, but it seems plausible that multiple steps (even two) might show strong enough suction. We hope to investigate this in future work.

Organization of the rest of this paper. We first introduce some of the notation and probabilistic preliminaries used to define and analyze the Arıkan martingale in Section 2. We then prove Theorem 1.2 showing that local polarization implies strong polarization in Section 3. This is followed by the formal definition of the Arıkan martingale in Section 4. Section 5.1 gives an overview of the proof of Theorem 1.4 which asserts that the Arıkan martingale is locally polarizing (under appropriate conditions). Section 5.2 then states the local polarization conditions for sums of two independent variables, with proofs deferred to Section 6. Section 5.3 reduces the analysis of local polarization of general mixing matrices to the conditions studied in Section 5.2 and uses this reduction to prove Theorem 1.4. Finally in Appendix A we show (for completeness) how the Arıkan martingale (and its convergence) can be used to construct capacity achieving codes.

## 2 Preliminaries and Notation

In this section we introduce the notation needed to define the Arıkan martingale (which will be introduced in the following section). We also include information-theoretic and probabilistic inequalities that will be necessary for the subsequent analysis.

### 2.1 Notation

The Arıkan martingale is based on a recursive construction of a vector valued random variable. To cleanly describe this construction it is useful to specify our notational conventions for vectors, tensors and how to view the tensor products of matrices. These notations will be used extensively in the following sections.

#### 2.1.1 Probability Notation

Throughout this work, all random variables involved will be discrete. For a probability distribution $D$ and random variable $X$, we write $X \sim D$ to mean that $X$ is distributed according to $D$, and is independent of all other variables. Similarly, for a set $S$, we write $X \sim S$ to mean that $X$ is independent and uniform over $S$. For a set $\Omega$, let $\Delta(\Omega)$ denote the set of probability distributions over $\Omega$.

We occasionally abuse notation by treating distributions as random variables. That is, for a distribution $D$ over $\mathbb{F}_q^k$ and a matrix $M \in \mathbb{F}_q^{k \times k}$, we write $DM$ to denote the distribution of the random variable $XM$ for $X \sim D$. For a distribution $D$ and an event $E$, we write $D \mid E$ to denote the conditional distribution of $D$ conditioned on $E$.

#### 2.1.2 Tensor Notation

Here we introduce useful notation for dealing with scalars, vectors, tensors, and tensor-products.

All scalars will be non-boldfaced, for example: $x, y, M$.

Any tensors of order $\geq 1$ (including vectors) will be boldfaced, for example: $\mathbf{x}, \mathbf{X}$. One exception to this is the matrix $M$ used in the polarization transforms, which we do not boldface.

Subscripts are used to index tensors, with indices starting from $1$. For example, for a vector $\mathbf{x}$ as above, $x_i$ denotes its $i$-th coordinate. Matrices and higher-order tensors are indexed with multiple subscripts: for a matrix $\mathbf{X}$, we may write $X_{i,j}$. We often index tensors by tuples (multi-indices), which will be boldfaced: for $\mathbf{i} = (i_1, \ldots, i_t)$, we write $\mathbf{X}_{\mathbf{i}} = X_{i_1, \ldots, i_t}$. Let $\leq$ denote the lexicographic order on these indexing tuples.

When an index into a tensor is the concatenation of multiple tuples, we emphasize this by using brackets in the subscript. For example, for a tensor $\mathbf{X}$ as above and tuples $\mathbf{a}$ and $\mathbf{b}$, we may write $\mathbf{X}_{[\mathbf{a}, \mathbf{b}]}$.

For a given tensor $\mathbf{X}$, we can consider fixing some subset of its indices, yielding a slice of $\mathbf{X}$ (a tensor of lower order). We denote this with brackets, using $\cdot$ to denote unspecified indices. For example, for a tensor $\mathbf{X}$ of order two, we have the slices $\mathbf{X}_{[i, \cdot]}$ and $\mathbf{X}_{[\cdot, j]}$.

We somewhat abuse the indexing notation, using $\mathbf{X}_{< \mathbf{i}}$ to mean the set of variables $\{\mathbf{X}_{\mathbf{j}} : \mathbf{j} < \mathbf{i}\}$. Similarly for $\mathbf{X}_{\leq \mathbf{i}}$.

We occasionally unwrap tensors into vectors, using the correspondence between $[k]^t$ and $[k^t]$. Here, we unwrap according to the lexicographic order on tuples.

Finally, for matrices specifically, $M_{i,j}$ specifies the entry in the $i$-th row and $j$-th column of matrix $M$. Throughout, all vectors will be row-vectors by default.

#### 2.1.3 Tensor Product Recursion

The construction of polar codes and analysis of the Arıkan martingale rely crucially on the recursive structure of the tensor product. Here we review the definition of the tensor product, and state its recursive structure.

For a linear transform $M : \mathbb{F}_q^k \to \mathbb{F}_q^k$, let $M^{\otimes t}$ denote the $t$-fold tensor power of $M$. Explicitly (fixing bases for all the spaces involved), this operator acts on tensors $\mathbf{X}$ as:

$$[M^{\otimes t}(\mathbf{X})]_{\mathbf{j}} = \sum_{\mathbf{i} \in [k]^t} \mathbf{X}_{\mathbf{i}}\, M_{i_1, j_1} M_{i_2, j_2} \cdots M_{i_t, j_t}.$$

The tensor product has the following recursive structure: $M^{\otimes t} = M^{\otimes (t-1)} \otimes M$, which corresponds explicitly to:

$$[M^{\otimes t}(\mathbf{X})]_{[\mathbf{a}, j_t]} = \sum_{i_t \in [k]} M_{i_t, j_t}\, \bigl[M^{\otimes (t-1)}(\mathbf{X}_{[\cdot, i_t]})\bigr]_{\mathbf{a}}. \tag{1}$$

In the above, if we define the tensors

$$\mathbf{Y}^{(i_t)} := M^{\otimes (t-1)}(\mathbf{X}_{[\cdot, i_t]})$$

then this becomes

$$[M^{\otimes t}(\mathbf{X})]_{[\mathbf{a}, \cdot]} = M\bigl(\bigl(\mathbf{Y}^{(1)}_{\mathbf{a}}, \mathbf{Y}^{(2)}_{\mathbf{a}}, \ldots, \mathbf{Y}^{(k)}_{\mathbf{a}}\bigr)\bigr) \tag{2}$$

where $\bigl(\mathbf{Y}^{(1)}_{\mathbf{a}}, \ldots, \mathbf{Y}^{(k)}_{\mathbf{a}}\bigr)$ is viewed as a vector in $\mathbb{F}_q^k$.

Finally, we use that $(M^{\otimes t})^{-1} = (M^{-1})^{\otimes t}$.
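As a quick mechanical check (our own sketch with hypothetical helper names, not code from the paper), both the recursion $M^{\otimes t} = M^{\otimes(t-1)} \otimes M$ and the inverse identity are easy to verify over $\mathbb{F}_2$ for the kernel $G_2$, which is its own inverse mod 2:

```python
def kron(A, B):
    """Kronecker (tensor) product of two matrices over F_2, as lists of lists."""
    return [[(a * b) % 2 for a in rowA for b in rowB]
            for rowA in A for rowB in B]

def mat_mul(A, B):
    """Matrix product over F_2."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) % 2 for j in range(p)]
            for i in range(n)]

def tensor_power(M, t):
    """t-fold tensor power M^{(x)t}, built by the recursion R <- R (x) M."""
    R = [[1]]
    for _ in range(t):
        R = kron(R, M)
    return R

G2 = [[1, 0], [1, 1]]  # Arıkan's kernel
t = 3
Mt = tensor_power(G2, t)
# Recursive structure: M^{(x)t} = M^{(x)(t-1)} (x) M.
assert Mt == kron(tensor_power(G2, t - 1), G2)
# (M^{(x)t})^{-1} = (M^{-1})^{(x)t}; over F_2, G_2 is its own inverse.
I = [[1 if i == j else 0 for j in range(2 ** t)] for i in range(2 ** t)]
assert mat_mul(Mt, Mt) == I
print(len(Mt), len(Mt[0]))  # 8 8
```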

### 2.2 Information Theory Preliminaries

For the sake of completeness we include the information-theoretic concepts and tools we use in this paper.

For a discrete random variable $X$, let $H(X)$ denote its (binary) entropy:

$$H(X) := \sum_{a \in \mathrm{Support}(X)} p_X(a) \log\frac{1}{p_X(a)}$$

where $p_X$ is the probability mass function of $X$. Throughout, by default $\log$ denotes $\log_2$.

For $p \in [0,1]$, we overload this notation, letting $H(p)$ denote the entropy of a Bernoulli random variable with expectation $p$.

For arbitrary random variables $X, Y$, let $H(X \mid Y)$ denote the conditional entropy:

$$H(X \mid Y) = \mathbb{E}_{y \sim Y}\bigl[H(X \mid Y = y)\bigr].$$

For a $q$-ary random variable $X$, let $\bar{H}(X)$ denote its $q$-ary entropy:

$$\bar{H}(X) := \frac{H(X)}{\log q}.$$

Finally, the mutual information $I(X; Y)$ between jointly distributed random variables $X, Y$ is:

$$I(X; Y) := H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).$$

We will use the following standard properties of entropy:

1. (Adding independent variables increases entropy): For any random variables $X, Y, Z$ such that $X$ and $Y$ are conditionally independent given $Z$, we have

$$H(X + Y \mid Z) \geq H(X \mid Z). \tag{3}$$

2. (Transforming conditioning): For any random variables $X, Y$, any function $f$, and any bijection $\sigma$, we have

$$H(X \mid Y) = H(X + f(Y) \mid Y) = H(X + f(Y) \mid \sigma(Y)). \tag{4}$$

3. (Chain rule): For arbitrary random variables $X, Y$: $H(X, Y) = H(X) + H(Y \mid X)$.

4. (Conditioning does not increase entropy): For arbitrary random variables $X, Y$: $H(X \mid Y) \leq H(X)$.

5. (Monotonicity): For $p \in [0, 1/2]$, the binary entropy $H(p)$ is non-decreasing in $p$. And for $p \in [1/2, 1]$, the binary entropy $H(p)$ is non-increasing in $p$.

6. (Deterministic postprocessing does not increase entropy): For arbitrary random variables $X, Y$ and function $f$, we have $H(f(X) \mid Y) \leq H(X \mid Y)$.

7. (Conditioning on independent variables): For random variables $X, Y, Z$ where $Z$ is independent of $(X, Y)$, we have $H(X \mid Y, Z) = H(X \mid Y)$.

#### 2.2.1 Channels

Given a finite field $\mathbb{F}_q$ and an output alphabet $\mathcal{Y}$, a $q$-ary channel $\mathcal{C}$ is a probabilistic function from $\mathbb{F}_q$ to $\mathcal{Y}$. Equivalently, it is given by $q$ probability distributions $\{\mathcal{C}(\cdot \mid x)\}_{x \in \mathbb{F}_q}$ supported on $\mathcal{Y}$. We use the notation $\mathcal{C}(\mathbf{X})$ to denote the channel operating on a vector of inputs $\mathbf{X}$. A memoryless channel maps $\mathbb{F}_q^n$ to $\mathcal{Y}^n$ by acting independently (and identically) on each coordinate. A symmetric channel is a memoryless channel where for every pair of inputs $x, x' \in \mathbb{F}_q$ there is a bijection $\sigma : \mathcal{Y} \to \mathcal{Y}$ such that for every $y \in \mathcal{Y}$ it is the case that $\mathcal{C}(y \mid x) = \mathcal{C}(\sigma(y) \mid x')$, and moreover for any pair of outputs $y, y' \in \mathcal{Y}$, the vectors $(\mathcal{C}(y \mid x))_{x \in \mathbb{F}_q}$ and $(\mathcal{C}(y' \mid x))_{x \in \mathbb{F}_q}$ are permutations of each other (see, for example, [4, Section 7.2]). As shown by Shannon, every memoryless channel has a finite capacity, denoted $\mathrm{Capacity}(\mathcal{C})$. For symmetric channels, this is the mutual information $I(X; Y)$ between the input $X$ and the output $Y$, where $X$ is drawn uniformly from $\mathbb{F}_q$ and $Y$ is drawn from $\mathcal{C}(X)$ given $X$.
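For instance, the binary symmetric channel $\mathrm{BSC}(p)$ is symmetric in the above sense (flipping the output bit is the required bijection between the two input rows), and its capacity under a uniform input is $1 - H(p)$. A small sketch of our own (not from the paper):

```python
from math import log2

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of BSC(p): with uniform input, I(X;Y) = H(Y) - H(Y|X) = 1 - H(p)."""
    return 1.0 - h2(p)

assert bsc_capacity(0.0) == 1.0  # noiseless channel
assert bsc_capacity(0.5) == 0.0  # output independent of input
print(f"Capacity of BSC(0.11): {bsc_capacity(0.11):.3f}")
```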

### 2.3 Basic Probabilistic Inequalities

We first show that a random variable with small-enough entropy will usually take its most-likely value:

**Lemma.** Let $X$ be a random variable. Then there exists $\hat{x}$ such that

$$\Pr[X \neq \hat{x}] \leq H(X)$$

and therefore

$$\Pr[X \neq \hat{x}] \leq \bar{H}(X) \log q.$$
###### Proof.

Let $p_i = \Pr[X = i]$ and let $\hat{x}$ be the value maximizing this probability. Let $\gamma := \Pr[X \neq \hat{x}] = \sum_{i \neq \hat{x}} p_i$ and $\alpha := H(X)$. We wish to show that $\gamma \leq \alpha$. If $\gamma \leq 1/2$ we have

$$\begin{aligned} \alpha = H(X) &= \sum_i p_i \log\frac{1}{p_i} \\ &\geq \sum_{i \neq \hat{x}} p_i \log\frac{1}{p_i} && \text{(since all summands are non-negative)} \\ &\geq \sum_{i \neq \hat{x}} p_i \log\frac{1}{\sum_{j \neq \hat{x}} p_j} && \text{(since } p_i \leq \textstyle\sum_{j \neq \hat{x}} p_j\text{)} \\ &= \Bigl(\sum_{i \neq \hat{x}} p_i\Bigr) \cdot \log\frac{1}{\sum_{j \neq \hat{x}} p_j} \\ &= \gamma \cdot \log(1/\gamma) \\ &\geq \gamma && \text{(since } \gamma \leq 1/2 \text{ and so } \log(1/\gamma) \geq 1\text{)} \end{aligned}$$

as desired. Now if $\gamma \geq 1/2$ we have a much simpler case, since now

$$\begin{aligned} \alpha = H(X) &= \sum_i p_i \log\frac{1}{p_i} \\ &\geq \sum_i p_i \log\frac{1}{p_{\hat{x}}} && \text{(since } p_i \leq p_{\hat{x}}\text{)} \\ &= \log\frac{1}{p_{\hat{x}}} && \text{(since } \textstyle\sum_i p_i = 1\text{)} \\ &= \log\frac{1}{1-\gamma} \\ &\geq 1 && \text{(since } \gamma \geq 1/2\text{)}. \end{aligned}$$

But $\gamma$ is always at most $1$, so in this case also we have $\gamma \leq 1 \leq \alpha$ as desired. ∎

For the decoder, we will need a conditional version of Lemma 2.3, saying that if a variable $X$ has low conditional entropy conditioned on $Y$, then $X$ can be predicted well given the instantiation of the variable $Y$.

**Lemma.** Let $X, Y$ be arbitrary discrete random variables with ranges $\mathcal{X}, \mathcal{Y}$ respectively. Then there exists a function $\hat{X} : \mathcal{Y} \to \mathcal{X}$ such that

$$\Pr_{X,Y}[X \neq \hat{X}(Y)] \leq H(X \mid Y).$$

In particular, the following estimator satisfies this:

$$\hat{X}(y) := \arg\max_x \Pr[X = x \mid Y = y].$$
###### Proof.

For every setting $Y = y$, we can bound the error probability of this estimator using Lemma 2.3 applied to the conditional distribution $(X \mid Y = y)$:

$$\begin{aligned}
\Pr_{X,Y}[X \neq \hat{X}(Y)] &= \mathbb{E}_Y\bigl[\Pr_{X\mid Y}[\hat{X}(Y) \neq X]\bigr] \\
&\le \mathbb{E}_Y[H(X \mid Y = y)] && \text{(Lemma 2.3)} \\
&= H(X \mid Y). && ∎
\end{aligned}$$
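To illustrate the conditional bound concretely (a sketch on a toy joint distribution of our choosing, not from the paper), the script below computes the error of the maximum a posteriori estimator and checks that it is at most $H(X \mid Y)$.

```python
import math

# A toy joint pmf of (X, Y) on {0,1} x {0,1} (values are illustrative).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def p_y(y):
    """Marginal probability Pr[Y = y]."""
    return sum(p for (x, yy), p in joint.items() if yy == y)

def cond_entropy_bits():
    """H(X | Y) in bits."""
    return sum(p * math.log2(p_y(y) / p) for (x, y), p in joint.items() if p > 0)

def map_error():
    """Error probability of the estimator x_hat(y) = argmax_x Pr[X = x | Y = y]."""
    return sum(p_y(y) - max(p for (x, yy), p in joint.items() if yy == y)
               for y in (0, 1))

# the conditional lemma: Pr[X != x_hat(Y)] <= H(X | Y)
assert map_error() <= cond_entropy_bits()
```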

We will need an inverse to the usual Chebyshev inequality. Recall that Chebyshev's inequality shows that variables with small variance are concentrated close to their expectation. The Paley–Zygmund inequality below can be used to invert it (somewhat): for a random variable $X$ with comparable fourth and second central moments, by applying the lemma below to $Z = (X - \mathbb{E}[X])^2$ we can deduce that $X$ has positive probability of deviating noticeably from the mean. [Paley-Zygmund] If $Z \ge 0$ is a random variable with finite variance, then for every $\lambda \in [0, 1]$,

$$\Pr(Z > \lambda\,\mathbb{E}[Z]) \ge (1 - \lambda)^2\,\frac{\mathbb{E}[Z]^2}{\mathbb{E}[Z^2]}.$$
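The inequality can be checked exactly on a small discrete example (our own, for illustration): take $Z$ equal to $0$ or $2$ with probability $1/2$ each, so $\mathbb{E}[Z] = 1$ and $\mathbb{E}[Z^2] = 2$.

```python
# Exact check of Paley-Zygmund on a two-point distribution:
# Z = 0 or 2, each with probability 1/2 (so E[Z] = 1, E[Z^2] = 2).
vals, probs = [0.0, 2.0], [0.5, 0.5]

def moment(k):
    """E[Z^k] for the discrete distribution above."""
    return sum(p * v ** k for v, p in zip(vals, probs))

def pz_holds(lam):
    """Check Pr(Z > lam*E[Z]) >= (1-lam)^2 * E[Z]^2 / E[Z^2]."""
    lhs = sum(p for v, p in zip(vals, probs) if v > lam * moment(1))
    rhs = (1.0 - lam) ** 2 * moment(1) ** 2 / moment(2)
    return lhs >= rhs

assert all(pz_holds(lam) for lam in [0.0, 0.25, 0.5, 0.75, 0.9])
```

At $\lambda = 0$ the two sides are both $1/2$, so this example also shows the inequality can be tight.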

Next, we define the notion of a sequence of random variables being adapted to another sequence of variables, which will be useful in our later proofs.

We say that a sequence of random variables $Y_0, Y_1, \ldots$ is adapted to the sequence $X_0, X_1, \ldots$ if and only if for every $t$, $Y_t$ is completely determined given $X_0, \ldots, X_t$. We will use $X_{[0:t]}$ as a shorthand for $(X_0, \ldots, X_t)$, and $\mathbb{E}[\,\cdot \mid \mathcal{F}_t]$ as a shorthand for $\mathbb{E}[\,\cdot \mid X_{[0:t]}]$. If the underlying sequence is clear from context, we will skip it and write just $\mathcal{F}_t$.

Consider a sequence of non-negative random variables $Y_1, \ldots, Y_T$ adapted to the sequence $X_0, \ldots, X_T$. If for every $\lambda > 0$ we have $\Pr(Y_{t+1} > \lambda \mid \mathcal{F}_t) \le \exp(-\lambda)$, then for every $T$:

$$\Pr\Bigl(\sum_{i \le T} Y_i > CT\Bigr) \le \exp(-\Omega(T))$$

for some universal constant $C$.

###### Proof.

First, observe that

$$\mathbb{E}[\exp(Y_{t+1}/2) \mid \mathcal{F}_t] = \int_0^\infty \Pr\bigl(\exp(Y_{t+1}/2) > \lambda \mid \mathcal{F}_t\bigr)\,d\lambda \le 1 + \int_1^\infty \exp(-2\log\lambda)\,d\lambda = 1 + \int_1^\infty \lambda^{-2}\,d\lambda \le \exp(C_0) \tag{5}$$

for some constant $C_0$; here we used $\Pr(\exp(Y_{t+1}/2) > \lambda \mid \mathcal{F}_t) = \Pr(Y_{t+1} > 2\log\lambda \mid \mathcal{F}_t) \le \exp(-2\log\lambda)$ for $\lambda \ge 1$. On the other hand, we have the decomposition (where we apply (5) in the first inequality):

$$\mathbb{E}\Bigl[\exp\Bigl(\sum_{i \le T} Y_i/2\Bigr)\Bigr] = \mathbb{E}\Bigl[\mathbb{E}\Bigl[\exp\Bigl(\sum_{i \le T} Y_i/2\Bigr) \,\Big|\, \mathcal{F}_{T-1}\Bigr]\Bigr] = \mathbb{E}\Bigl[\exp\Bigl(\sum_{i \le T-1} Y_i/2\Bigr)\,\mathbb{E}[\exp(Y_T/2) \mid \mathcal{F}_{T-1}]\Bigr] \le \mathbb{E}\Bigl[\exp\Bigl(\sum_{i \le T-1} Y_i/2\Bigr)\Bigr]\exp(C_0) \le \cdots \le \exp(C_0 T).$$

Now we can apply Markov's inequality to obtain the desired tail bound:

$$\Pr\Bigl(\sum_{i \le T} Y_i > 4C_0 T\Bigr) = \Pr\Bigl(\exp\Bigl(\tfrac{1}{2}\sum_{i \le T} Y_i\Bigr) > \exp(2C_0 T)\Bigr) \le \mathbb{E}\Bigl[\exp\Bigl(\tfrac{1}{2}\sum_{i \le T} Y_i\Bigr)\Bigr]\exp(-2C_0 T) \le \exp(-C_0 T). \qquad ∎$$
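The hypothesis $\Pr(Y_{t+1} > \lambda \mid \mathcal{F}_t) \le \exp(-\lambda)$ is satisfied in particular by i.i.d. $\mathrm{Exp}(1)$ variables (there, even independently of the past), so a quick Monte Carlo run illustrates the tail bound. This is our own sketch; the constant below matches $4C_0$ with $\exp(C_0) = 2$ as in the proof.

```python
import math
import random

# Exp(1) variables satisfy Pr(Y > lam) = exp(-lam), the lemma's hypothesis.
# C = 4*C_0 with exp(C_0) = 2, as in the proof; parameters are illustrative.
random.seed(1)
T, trials = 50, 2000
C = 4 * math.log(2)
bad = sum(1 for _ in range(trials)
          if sum(random.expovariate(1.0) for _ in range(T)) > C * T)
# the event sum_i Y_i > C*T is exponentially unlikely in T
assert bad == 0
```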

Consider a sequence of random variables $Y_1, \ldots, Y_T$ with $Y_t \in \{0, 1\}$, adapted to the sequence $X_0, \ldots, X_T$. If $\Pr(Y_{t+1} = 1 \mid \mathcal{F}_t) \ge p$ for some deterministic value $p > 0$, then for $\mu = pT$ we have

$$\Pr\Bigl(\sum_{t \le T} Y_t < \mu/2\Bigr) \le \exp(-\Omega(\mu)).$$
###### Proof.

Let $M_t = \Pr(Y_t = 1 \mid \mathcal{F}_{t-1})$; we know that $M_t \ge p$ with probability 1. A standard calculation involving Markov's inequality yields the following bound:

$$\begin{aligned}
\Pr\Bigl(\sum_{t \le T} Y_t < \sum_{t \le T} M_t/2\Bigr) &= \Pr\Bigl(\exp\Bigl(-\sum_{t \le T} Y_t + \sum_{t \le T} M_t/2\Bigr) > 1\Bigr) \\
&\le \mathbb{E}\Bigl[\exp\Bigl(\sum_{t \le T}(-Y_t + M_t/2)\Bigr)\Bigr] \\
&= \mathbb{E}\Bigl[\mathbb{E}\Bigl[\exp\Bigl(\sum_{t \le T}(-Y_t + M_t/2)\Bigr) \,\Big|\, X_{[1:T-1]}\Bigr]\Bigr] \\
&\le \mathbb{E}\Bigl[\exp\Bigl(\sum_{t \le T-1}(-Y_t + M_t/2)\Bigr)\,\mathbb{E}\bigl[\exp(-Y_T + M_T/2) \mid X_{[1:T-1]}\bigr]\Bigr].
\end{aligned} \tag{6}$$

We now observe that for any random variable $\tilde{Y} \in \{0,1\}$ with $\mathbb{E}[\tilde{Y}] = p$, we have

$$\log \mathbb{E}[\exp(-\tilde{Y} + p/2)] = \frac{p}{2} + \log\bigl[(1-p) + p e^{-1}\bigr] \le \frac{p}{2} - p + \frac{p}{e} \le -cp$$

with constant $c = \frac{1}{2} - \frac{1}{e} > 0$. In particular $\mathbb{E}[\exp(-Y_T + M_T/2) \mid X_{[1:T-1]}] \le \exp(-cM_T) \le \exp(-cp)$. Plugging this back into (6), we get

$$\Pr\Bigl(\sum_{t \le T} Y_t < \sum_{t \le T} M_t/2\Bigr) \le \mathbb{E}\Bigl[\exp\Bigl(\sum_{t \le T-1}(-Y_t + M_t/2)\Bigr)\Bigr]\exp(-cp) \le \cdots \le \exp(-cpT) = \exp(-c\mu).$$

Moreover, since $\sum_{t \le T} M_t \ge pT = \mu$ deterministically, we have $\Pr\bigl(\sum_{t \le T} Y_t < \mu/2\bigr) \le \Pr\bigl(\sum_{t \le T} Y_t < \sum_{t \le T} M_t/2\bigr) \le \exp(-c\mu)$, as desired. ∎
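In the simplest instance of the lemma, the $Y_t$ are i.i.d. Bernoulli($p$), which trivially satisfies $\Pr(Y_{t+1} = 1 \mid \mathcal{F}_t) \ge p$. A short simulation (our own illustrative parameters) shows how rarely the sum falls below $\mu/2$.

```python
import random

# Y_t i.i.d. Bernoulli(p) satisfies Pr(Y_{t+1} = 1 | F_t) >= p trivially;
# p, T, and the trial count are our own illustrative choices.
random.seed(7)
p, T, trials = 0.4, 300, 1000
mu = p * T
bad = sum(1 for _ in range(trials)
          if sum(random.random() < p for _ in range(T)) < mu / 2)
# Pr[sum_t Y_t < mu/2] <= exp(-Omega(mu)) is negligible at these sizes
assert bad == 0
```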

Finally, we will use the well-known Doob martingale inequality: [Doob's martingale inequality [6, Theorem 5.4.2]] If a sequence $X_0, X_1, \ldots, X_T$ is a martingale, then for every $\lambda > 0$ we have

$$\Pr\Bigl(\sup_{t \le T} X_t > \lambda\Bigr) \le \frac{\mathbb{E}[|X_T|]}{\lambda}.$$

In particular, if $(X_t)_t$ is a nonnegative martingale, then for every $\lambda > 0$ we have

$$\Pr\Bigl(\sup_{t \le T} X_t > \lambda\Bigr) \le \frac{\mathbb{E}[X_0]}{\lambda},$$

since in this case $\mathbb{E}[|X_T|] = \mathbb{E}[X_T] = \mathbb{E}[X_0]$.
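A toy nonnegative martingale makes the inequality easy to visualize: starting from $X_0 = 1$, either double or drop to $0$ with probability $1/2$ each. This example (ours, not from the paper) is also essentially tight, since $\Pr(\sup_t X_t \ge 2^k) = 2^{-k}$.

```python
import random

# Nonnegative martingale: X_{t+1} = 2*X_t or 0, each with probability 1/2,
# started at X_0 = 1 (a toy example). Doob: Pr(sup_t X_t > 8) <= E[X_0]/8.
random.seed(3)
T, trials, lam = 20, 4000, 8.0
count = 0
for _ in range(trials):
    x = sup = 1.0
    for _ in range(T):
        x = 2.0 * x if random.random() < 0.5 else 0.0
        sup = max(sup, x)
    count += sup > lam
# empirical frequency (about 1/16 here) respects the Doob bound 1/8
assert count / trials <= 1.0 / lam
```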

## 3 Local to global polarization

In this section we prove Theorem 1.2, which asserts that every locally polarizing $[0,1]$-martingale is also strongly polarizing. The proofs in this section depend on some basic probabilistic concepts and inequalities that we have seen in Section 2.3.

The proof of this statement is implemented in two main steps: first, we show that any locally polarizing martingale is $\rho^t$-polarizing for some constant $\rho < 1$ depending only on the parameters of local polarization. This means that, except with exponentially small probability, $\min(X_t, 1 - X_t)$ is exponentially small in $t$, which we can use to ensure that in all subsequent steps $X_t$ stays in the range where the conditions of suction at the ends apply (again, except with exponentially small failure probability). Finally, we show that if the martingale stays in the suction-at-the-ends regime, it will polarize strongly — i.e., if we have a $[0,1]$-martingale such that in each step it has probability at least $\alpha$ to decrease by a large constant factor, we can deduce that at the end the martingale is exponentially small with all but exponentially small probability.

We start by showing that in the first steps we do get exponentially small polarization, with all but exponentially small failure probability. This is proved using a simple potential function $\Lambda_t := \min(\sqrt{X_t}, \sqrt{1 - X_t})$, which we show shrinks by a constant factor $(1 - \nu)$, for some $\nu > 0$, in expectation at each step. Previous analyses in [11, 10] tracked $\sqrt{X_t(1 - X_t)}$ (or some tailor-made algebraic functions [13, 17]) as potential functions, and relied on quantitatively strong forms of variation in the middle to demonstrate that the potential diminishes by a constant factor in each step. While such analyses can lead to sharper bounds on the parameter $\nu$, which in turn translate to better scaling exponents in the polynomial convergence to capacity, e.g. see [13, Thm. 18] or [17, Thm. 1], these analyses are more complex, and less general.
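As a concrete illustration (ours, not from the paper): for the binary erasure channel, Arıkan's $2 \times 2$ transform makes the erasure probability evolve as $X_{t+1} \in \{X_t^2,\ 2X_t - X_t^2\}$ with probability $1/2$ each. A Monte Carlo sketch then shows the potential $\mathbb{E}[\min(\sqrt{X_t}, \sqrt{1 - X_t})]$ decaying with $t$, in line with the lemma below.

```python
import math
import random

# BEC martingale under Arikan's transform: X -> X^2 or 2X - X^2, w.p. 1/2
# each. We track the potential E[min(sqrt(X_t), sqrt(1 - X_t))] empirically.
random.seed(11)

def potential(x):
    return min(math.sqrt(x), math.sqrt(1.0 - x))

def sample_path(x0, t):
    x = x0
    for _ in range(t):
        # min(...) guards against 2x - x^2 rounding slightly above 1.0
        x = x * x if random.random() < 0.5 else min(1.0, 2.0 * x - x * x)
    return x

trials = 5000
avgs = [sum(potential(sample_path(0.5, t)) for _ in range(trials)) / trials
        for t in (5, 10, 15, 20)]
# the potential shrinks (roughly geometrically) as t grows
assert avgs[0] > avgs[1] > avgs[2] > avgs[3]
```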

If a $[0,1]$-martingale sequence $X_0, X_1, \ldots$ is locally polarizing, then there exists $\nu > 0$, depending only on the local polarization parameters, such that

$$\mathbb{E}\bigl[\min(\sqrt{X_t}, \sqrt{1 - X_t})\bigr] \le (1 - \nu)^t.$$
###### Proof.

Take $\Lambda_t := \min(\sqrt{X_t}, \sqrt{1 - X_t})$. We will show that $\mathbb{E}[\Lambda_{t+1} \mid \mathcal{F}_t] \le (1 - \nu)\Lambda_t$, for some $\nu > 0$ depending only on the parameters of local polarization. The statement of the lemma will follow by induction.

Let us condition on $X_t = x$, and first consider the case $x \le 1/2$, so that $\Lambda_t = \sqrt{x}$. We know that

$$\mathbb{E}\bigl[\min(\sqrt{X_{t+1}}, \sqrt{1 - X_{t+1}})\bigr] \le \min\bigl(\mathbb{E}[\sqrt{X_{t+1}}],\ \mathbb{E}[\sqrt{1 - X_{t+1}}]\bigr),$$

so we will show that $\mathbb{E}[\sqrt{X_{t+1}}] \le (1 - \nu)\sqrt{x}$. The proof of the case $x > 1/2$, where we bound $\mathbb{E}[\sqrt{1 - X_{t+1}}]$ instead, is symmetric.

Indeed, let us take $T := X_{t+1}/X_t$. Because $(X_t)_t$ is a martingale, we have $\mathbb{E}[T] = 1$, and by Jensen's inequality we have that $\mathbb{E}[\sqrt{T}] \le 1$, where all the expectations above are conditioned on $X_t = x$. Take $\delta$ such that $\mathbb{E}[\sqrt{T}] = 1 - \delta$. We will show a lower bound on $\delta$ in terms of the local polarization parameters $\theta_0$ and $\tau_0$.

The high-level idea of the proof is that we can show that the local polarization criteria imply that $T$ is relatively far from $1$ with noticeable probability; but if $\mathbb{E}[\sqrt{T}]$ were close to one, by the Chebyshev inequality we would be able to deduce that $T$ is far from its mean with much smaller probability. This implies that the mean of $\sqrt{T}$ has to be bounded away from $1$.

More concretely, observe first that by the Chebyshev inequality applied to $\sqrt{T}$, we have $\mathrm{Var}(\sqrt{T}) = \mathbb{E}[T] - \mathbb{E}[\sqrt{T}]^2 = 1 - (1 - \delta)^2 \le 2\delta$; hence, for a suitable absolute constant $C_0$, we have:

$$\Pr\bigl(|T - 1| \ge \delta + C_0\sqrt{\delta}\,\theta_0^{-1}\tau_0^{-1}\bigr) \le \frac{1}{8}\theta_0^2\tau_0^2. \tag{7}$$

On the other hand, because of the variation in the middle condition of local polarization, we have

$$\mathrm{Var}(T) = \frac{\mathbb{E}[X_{t+1}^2] - X_t^2}{X_t^2} \ge \frac{\theta_0}{X_t^2} \ge \theta_0,$$

where the last inequality follows since $X_t \le 1$. Moreover , because