Ever since their discovery  polar codes have been a subject of vast interest, both for their theoretical and practical significance. Theoretical interest in them arises from two desirable features that they exhibit: (1) They give codes of length (for infinitely many ) along with efficient decoding algorithms that correct channel errors with all but exponentially (i.e., ) small failure probability. (2) They also converge to capacity extremely fast — i.e., at block length which is only polynomial in the inverse of the "gap to capacity". The former effect is known to hold in general, i.e., for the entire class of polar codes (up to a minimal and natural necessary condition). The latter was shown to hold in the same generality only recently  — previous works [5, 6, 4] were only able to establish it for one specific construction of polar codes. And while the early works were able to show effects (1) and (2) simultaneously for this construction, the other polar codes were not known to have both features simultaneously.
The main goal of this paper is to remedy this weakness. We show roughly that the techniques of  can be strengthened to achieve both effects simultaneously for the entire broad class of polar codes. In addition to the generality of the result this also leads to quantitative improvements on the error-exponent at polynomially small block lengths in the gap to capacity. We elaborate on these further after some background.
In the theory of Shannon, a memoryless channel is given by a probabilistic map from an input alphabet (a finite field in this paper) to an output alphabet (an abstract set here). A family of codes along with decoding algorithm achieves rate if . It is said to achieve failure probability if for every . Shannon’s celebrated theorem associates a capacity with every channel such that transmission at rate higher than capacity will have constant failure probability, whereas for every , for every sufficiently large , there exist codes of rate with failure probability . The quantity is called the “gap to capacity”. The relationship between the block length , the gap to capacity and the failure probability are the central quantities of interest to this paper.
The specific family of codes we consider in this paper are “polar codes” introduced by Arıkan . These codes are a broad class of (infinite families of) codes, one family for every matrix and symmetric channel. The -th code in the sequence has length , and is given by an affine shift of the code generated by some subset of rows of . It is well known that under a simple necessary condition on (that we call mixing), these codes achieve exponentially small failure probability in a weak sense: Specifically, for every symmetric channel, for every mixing , there exists a such that for every there exists a such that every code in the family of length has at most gap to capacity and achieves failure probability at most . Indeed by picking carefully one could achieve arbitrarily close to (though this approach cannot yield ), and moreover for a given matrix , the range of achievable can be explicitly computed from simple combinatorial properties of this matrix . However, note that these analyses did not provide an explicit relationship between and .
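To make the construction above concrete, the following is a minimal sketch of how the tensor (Kronecker) power of a small matrix over a prime field could be computed, and how the block length grows with the number of tensor steps. The helper names `kron` and `kron_power` and the example kernel `G2` are illustrative assumptions, not notation from this paper.

```python
def kron(A, B, q):
    """Kronecker product of two matrices (lists of rows), entries reduced mod q."""
    return [
        [(a * b) % q for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def kron_power(M, t, q):
    """t-fold Kronecker power of M over F_q; a k x k input gives a k^t x k^t output."""
    result = [[1]]  # the 1 x 1 identity is the neutral element for kron
    for _ in range(t):
        result = kron(result, M, q)
    return result


# Hypothetical example: the 2 x 2 binary kernel [[1, 0], [1, 1]].
G2 = [[1, 0], [1, 1]]
G2_squared = kron_power(G2, 2, 2)
for row in G2_squared:
    print(row)
```

A polar code of length k^t would then be obtained by keeping a subset of the rows of such a power (plus an affine shift); which rows to keep depends on the channel and is not computed in this sketch.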
It was more recently shown [5, 6, 4] that there exists an (specifically ) such that the associated code achieves exponentially small failure probability even at polynomially small block lengths — i.e., when . The associated with this result is bounded well away from . But till last year no other code (for any other matrix ) was even known to achieve failure probability going to zero for polynomially small block lengths. This was remedied in part by a previous work of the authors with Nakkiran and Rudra  where they showed that for every mixing matrix and every symmetric channel the associated code converges at block length growing polynomially with gap to capacity, however their failure probability analysis only yielded . Their work forms the starting point of this work.
1.2 Our results
Our results show that it is possible to combine the general analyses for “polynomial convergence of block length in gap to capacity” (from ) with any strong analysis of the failure probability. Specifically we show the following:
For every mixing matrix and symmetric channel the associated family of polar codes yield exponentially small decoding failure at block lengths polynomial in the gap to capacity.
While the result in Part (1) is general the resulting may not be optimal. We complement this with a result showing that for every there exist polar codes associated with some matrix , that get close to capacity at polynomial block length with decoding failure probability being . We note that no previous analysis yielded such quantitatively strong bounds on any family of polar codes with polynomial block length.
Finally we show that convergence to capacity at polynomial block length comes with almost no price in the failure probability. We show this by proving that if any polar code achieves capacity (even if at very large block lengths) with failure probability , then for every it achieves capacity with failure probability where the block length is a polynomial .
While the third result subsumes the previous two (when combined with known results in the literature), we include the first two to show that it is possible to prove strong results about failure probabilities with blocklength polynomial in the gap to capacity, entirely within the local polarization framework developed in  and here — without appealing to previous analyses. In fact the proofs of those two are quite simple (given the work of ).
On the other hand, for given matrix , the optimal exponent was exactly characterized in terms of explicit combinatorial properties of matrix — but with potentially very large blocklengths . The third result of our paper automatically lifts this theorem to the setting where blocklength is polynomial in the gap to capacity — given matrix one can compute the “correct” exponent as in , and essentially the same exponent is achievable already within polynomial blocklength, whereas no larger exponent is achievable, regardless of how large blocklength one takes.
We now turn to the central ingredient in our analyses of polar codes which we inherit from , namely the “local” analysis of -martingales. It is well-known that the analysis of polar codes can be tied to the analysis of an associated martingale, called the Arıkan martingale in . Specifically given a channel and a matrix one can design a martingale with , such that the performance of the code of length depends on the behavior of the random variable. Specifically to achieve gap to capacity with failure probability , the associated martingale should satisfy . Considering the fact that we want the failure to be exponentially small in and to be inverse polynomially small in and noting , this requires us to prove that .
Usual proofs of this property typically track many aspects of the distribution of , whereas a “local” analysis simply reasons about the distribution of conditioned on . For the Arıkan martingale (as for many other natural martingales) this one-step evolution is much easier to describe than the cumulative effects of -steps. In  a simple local property, called “local polarization”, of this one-step evolution was described (enforcing that the random variable has enough variance if it is not close to the boundary and that it gets sucked to the boundary when it is close). It was then shown that local polarization leads to global polarization, though only for — specifically they showed that .
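As an illustration of the one-step evolution discussed above, consider the simplest textbook case, which is an assumption of this sketch rather than a setting analyzed in this paper: the binary erasure channel with the 2 x 2 kernel [[1, 0], [1, 1]]. There the martingale is known to move from a value z to z^2 or to 2z - z^2 with equal probability. The hypothetical helpers below enumerate all 2^t equally likely values after t steps, so the fraction of unpolarized paths can be inspected exactly.

```python
def bec_levels(eps, t):
    """All 2^t equally likely martingale values after t steps, starting at eps.

    Assumes the standard BEC recursion: z -> z*z or z -> 2*z - z*z.
    """
    levels = [eps]
    for _ in range(t):
        levels = [w for z in levels for w in (z * z, 2 * z - z * z)]
    return levels


def unpolarized_fraction(levels, low, high):
    """Fraction of values strictly between the thresholds low and high."""
    return sum(1 for z in levels if low < z < high) / len(levels)


for t in (4, 8, 12, 16):
    frac = unpolarized_fraction(bec_levels(0.5, t), 0.1, 0.9)
    print(t, frac)
```

The two branches average to z, which is the martingale property; near zero the first branch squares the value (suction at the low end), while in the middle the two branches differ by a constant amount (variance in the middle).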
It is easy to modify the definition of local polarization slightly to get a stronger definition that would imply the desired convergence even for . Indeed we do so, calling it “exponential local polarization” of a martingale, and show that this stronger local polarization leads to exponentially small failure probabilities.
The crux of this paper is in showing that the Arıkan martingale exhibits exponential local polarization. For readers familiar with the technical aspects, this might even be surprising. In fact the most well-studied Arıkan martingale, the one associated with the binary symmetric channel and the matrix is not exponentially locally polarizing. We get around this seemingly forbidding barrier by showing that the martingale associated with the second tensor power of this matrix (its tensor-product with itself) is exponentially locally polarizing, and this is almost as good for us. (Instead of reasoning about every step of the martingale, this allows us to reason about every second step, which is sufficient for us.) Combined with some general reductions as in  this allows us to show that for every symmetric channel and every mixing matrix, the associated martingale is exponentially locally polarizing and this yields our first main result above.
To get failure probability for we show that if the matrix contains the parity check matrix of a code of sufficiently high distance then the Arıkan martingale associated with exhibits exponential local polarization over any symmetric channel, and in turn this leads to codes whose failure probability is for .
Finally we turn to our last result showing that any matrix producing codes with failure probability (but not necessarily for ) also gets failure probability for for some polynomial , and any . This result is obtained by showing that if achieves exponentially small error, then for some large , the matrix contains the parity check matrix of a high-distance code, with distance high enough to imply that its failure probability is .
2 Main Definitions and Results
2.1 Martingales and Polarization
In this section we let be a -bounded martingale, i.e., for all and for every , .
We say that a martingale has exponentially strong polarization if the probability that is not close (as a function of ) to the boundary is exponentially small in . Formally
Definition 2.1 (Exponentially Strong Polarization).
We say that has -exponentially strong polarization if for every there exist constants and such that for every , .
Note that this definition is asymmetric: paths of the martingale that converge to zero have a doubly-exponential rate of convergence, whereas those converging to the upper end do so only exponentially fast. (Footnote 1: It turns out that for the polar coding application, the behavior of the martingale at the lower end is important as it governs the decoding error probability, whereas behavior of the martingale near the upper end is not that important. The probability that the martingale doesn’t polarize corresponds to the gap to capacity.) This should be compared with the notion of strong polarization present in , namely
Definition 2.2 (Strong Polarization).
We say that has strong polarization if for every there exist constants and such that for every , .
As in  the notion of Exponential Strong Polarization is not a local one but rather depends on the long run behavior of . A notion of local polarization, that only relates the evolution of from , was defined in , and shown to imply strong polarization. Let us recall this definition.
Definition 2.3 (Local Polarization).
A -martingale sequence is locally polarizing if the following conditions hold:
(Variance in the middle): For every , there is a such that for all , we have: If then .
(Suction at the ends): There exists an , such that for all , there exists a such that:
If then .
Similarly, if then .
We refer to condition (a) above as Suction at the low end and condition (b) as Suction at the high end.
When we wish to be more explicit, we refer to the sequence as -locally polarizing.
With an eye toward showing exponential strong polarization also via a local analysis, we now define a concept of local polarization tailored to exponential polarization.
Definition 2.4 (Exponential Local Polarization).
We say that has -exponential local polarization if it satisfies local polarization, and the following additional property
(Strong suction at the low end): There exists such that if then .
In the same way as local polarization implies the strong global polarization of a martingale [2, Theorem 1.6], this new stronger local condition implies a stronger global polarization behavior.
Theorem 2.5 (Local to Global Exponential Polarization).
Let . Then if a -bounded martingale satisfies -exponential local polarization then it also satisfies -exponentially strong polarization.
2.2 Matrix Polarization
In this section we relate statements about the local polarization of the Arıkan martingale associated with some matrix (and some channel) to structural properties of itself. The formal definition of the Arıkan martingale is included for completeness in Appendix B, but will not be used in this paper.
We first recall the definition of a mixing matrix — it is a simple necessary condition for associated Arıkan martingale to be non-trivial (i.e. non-constant).
Definition 2.6 (Mixing matrix).
For prime and , is said to be a mixing matrix if is invertible and for every permutation of the rows of , the resulting matrix is not upper-triangular.
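The mixing condition can be checked by brute force for small matrices. The sketch below follows the definition literally (invertibility over the prime field, and no row permutation that yields an upper-triangular matrix); the function names are illustrative, and trying all permutations is only practical for small dimension.

```python
from itertools import permutations


def is_invertible_mod_p(M, p):
    """Gaussian elimination over F_p (p prime); True iff M has full rank."""
    A = [[x % p for x in row] for row in M]
    k = len(A)
    for col in range(k):
        pivot = next((r for r in range(col, k) if A[r][col] != 0), None)
        if pivot is None:
            return False
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], -1, p)  # modular inverse, Python 3.8+
        for r in range(col + 1, k):
            factor = (A[r][col] * inv) % p
            if factor:
                A[r] = [(a - factor * b) % p for a, b in zip(A[r], A[col])]
    return True


def is_upper_triangular(M, p):
    """True iff every entry strictly below the diagonal is zero mod p."""
    return all(M[i][j] % p == 0 for i in range(len(M)) for j in range(i))


def is_mixing(M, p):
    """Invertible, and no permutation of the rows is upper-triangular."""
    if not is_invertible_mod_p(M, p):
        return False
    return not any(
        is_upper_triangular([M[i] for i in perm], p)
        for perm in permutations(range(len(M)))
    )


print(is_mixing([[1, 0], [1, 1]], 2))  # lower-triangular 2 x 2 kernel
print(is_mixing([[0, 1], [1, 0]], 2))  # a permutation matrix
```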
Let us now rewrite the (technical) condition of the Arıkan martingale associated with being exponentially locally polarizing in more direct terms. This leads us to the following definition.
Definition 2.7 (Exponential polarization of matrix).
We say that a matrix satisfies -exponential polarization, if there exist some , such that for any and for any random sequence , where are i.i.d., and satisfy , we have
for at least fraction of indices .
In the above definition and throughout the paper refers to normalized entropy, i.e. , so that , and , similarly . Moreover, for a vector, and , by we denote a vector in with coordinates .
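For readers who want to experiment numerically, here is a minimal sketch of normalized entropy, under the assumption that normalization means dividing the Shannon entropy by the logarithm of the alphabet size q, so that values lie between 0 and 1. The function names and the joint-distribution layout `joint[x][y]` are illustrative choices.

```python
import math


def normalized_entropy(pmf, q):
    """Shannon entropy of pmf divided by log(q); lies in [0, 1] for a q-ary variable."""
    h = -sum(p * math.log(p) for p in pmf if p > 0)
    return h / math.log(q)


def normalized_conditional_entropy(joint, q):
    """Normalized H(X | Y) from a joint table with joint[x][y] = Pr[X = x, Y = y]."""
    num_y = len(joint[0])
    total = 0.0
    for y in range(num_y):
        p_y = sum(row[y] for row in joint)
        if p_y > 0:
            conditional = [row[y] / p_y for row in joint]
            total += p_y * normalized_entropy(conditional, q)
    return total


print(normalized_entropy([0.9, 0.1], 2))
print(normalized_conditional_entropy([[0.5, 0.0], [0.0, 0.5]], 2))
```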
The following lemma explicitly asserts that matrix polarization implies martingale polarization (as claimed).
Lemma 2.8.
If a mixing matrix satisfies -exponential polarization, then the Arıkan martingale associated with is -exponentially locally polarizing.
The proof of the above lemma is very similar to the proof of Theorem 1.10 in  — with definitions of Arıkan martingale and exponential polarization of matrix in hand this proof is routine, although somewhat tedious and notationally heavy. We postpone this proof to the full version of this paper.
In light of the above, and in the context of Theorem 2.5, we have reduced the problem of showing (global) exponentially strong polarization of the Arıkan martingale to understanding the parameters for exponential polarization of specific matrices, based on the structural properties of these matrices.
In this paper we provide three results of this form. The first of our results considers mixing matrices (Definition 2.6) and analyzes their local polarization.
It is well known that if a matrix is not mixing then the associated martingale does not polarize at all (and the corresponding martingale satisfies for every ). In contrast if the matrix is mixing, our first lemma shows that (the tensor-product of with itself) is exponentially polarizing.
Lemma 2.9.
For every mixing matrix and every , matrix satisfies -exponential polarization.
This translates immediately to our first main theorem stated in Section 2.3.
Our second structural result on matrix polarization shows that matrices that contain the parity check matrix of a high distance code lead to very strong exponential polarization parameters.
Lemma 2.10.
If a mixing matrix is decomposed as , where is such that is a linear code of distance larger than , then matrix satisfies -exponential polarization for every .
By using standard results on existence of codes with good distance, we get as an immediate corollary that there exist matrices with almost optimal exponential polarization parameters.
Corollary 2.11.
For every and every prime field , there exist , and matrix , such that matrix satisfies exponential polarization.
Consider a parity check matrix of a BCH code with distance . We can achieve this with a matrix , where . Hence, as soon as , we have . We can now complete to a mixing matrix. ∎
It is worth noting that, by the same argument and standard results on the distance of random linear codes, a random matrix with high probability satisfies a local polarization, with as .
By the whole chain of reductions discussed above, Corollary 2.11 implies that for any there exist polar codes with decoding failure probability , where the blocklength depends polynomially in the desired gap to capacity. Moreover, those codes are ubiquitous — polar codes arising from a large random matrix will usually have this property.
Our final structural result is morally a “converse” to the above: It shows that if a matrix leads to a polar code with exponentially small failure probability then some high tensor power of contains the parity check matrix of a high distance code. In fact more generally if a matrix is the parity check matrix of a code which has a decoding algorithm that corrects errors from a -symmetric channel with failure probability then this code has high distance.
For any finite field we will denote by the distribution on such that for we have , and for any .
Lemma 2.13.
Consider a matrix and arbitrary decoding algorithm , such that for independent random variables with , we have . Then is a code of distance at least .
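The distance of the code defined by a parity check matrix can be verified by exhaustive search on small examples. The sketch below assumes the convention that the code is the set of vectors x with Hx = 0 over the prime field (the transpose convention may differ from the one used in this paper), and the name `kernel_min_distance` is illustrative.

```python
from itertools import product


def kernel_min_distance(H, q):
    """Minimum Hamming weight of a non-zero x in F_q^n with H x = 0 (mod q).

    Returns None when the kernel contains only the zero vector.
    Exhaustive search over q^n vectors, so only suitable for small n.
    """
    n = len(H[0])
    best = None
    for x in product(range(q), repeat=n):
        if not any(x):
            continue  # skip the zero vector
        if all(sum(h * xi for h, xi in zip(row, x)) % q == 0 for row in H):
            weight = sum(1 for xi in x if xi)
            if best is None or weight < best:
                best = weight
    return best


# Parity check matrix of the binary [7, 4] Hamming code (columns are 1..7 in binary).
H_hamming = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]
print(kernel_min_distance(H_hamming, 2))
```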
This lemma, when combined with Lemma 2.10 shows that the only way a polar code associated with a matrix can give exponentially small failure probability is that some tensor of this matrix is locally exponentially polarizing and so in particular this matrix also yields exponentially small failure probabilities at block length polynomial in the gap to capacity.
2.3 Implications for polar codes
We start this section by including the definition of symmetric channel — all our results about polar codes show that we can achieve capacity for those channels.
Definition 2.14 (Symmetric memoryless channel).
A -ary symmetric memoryless channel is any probabilistic function , such that for every there is a bijection such that for every it is the case that , and moreover for any pair , we have (see, for example, [3, Section 7.2]).
Such probabilistic function yields a probabilistic function , by acting independently on each coordinate.
We will now recall the following theorem which shows that if the Arıkan martingale polarizes then a corresponding code achieves capacity with small failure probability.
Theorem 2.15 (Implied by Arıkan ).
Let be a -ary symmetric memoryless channel and let be an invertible matrix. If the Arıkan martingale associated with is -exponentially strongly polarizing then there is a polynomial such that for every and every , there is a code of dimension at least such that is an affine code generated by the restriction of to a subset of its rows and an affine shift. Moreover there is a decoding algorithm for these codes that has failure probability bounded by , and running time . The running time of the accompanying encoding algorithm is also .
We omit the proof of this theorem, which is identical to Theorem 1.7 in  except for minor modifications to incorporate the exponential polarization/failure probability.
Armed with this theorem, we can now convert the structural results asserted in the previous section into convergence and failure probability of polar codes.
Theorem 2.16.
For every prime , every mixing matrix , every symmetric memoryless channel over , there is a polynomial and such that for every and every , there is an affine code , that is generated by the rows of and an affine shift, with the property that the rate of is at least , and can be encoded and decoded in time and failure probability at most .
Theorem 2.17.
For every prime , every symmetric memoryless channel over , and every , there exists , a mixing matrix , and a polynomial such that for every and every , there is an affine code , that is generated by the rows of and an affine shift, with the property that the rate of is at least , and can be encoded and decoded in time and failure probability at most .
Theorem 2.18.
Suppose and satisfy the condition that for every memoryless symmetric additive channel (Footnote 2: An additive symmetric channel is a special case of symmetric channels, where the output is the sum of the input with an “error” generated independently of the input.) and for every , for sufficiently large , there is an affine code of length generated by the rows of of rate at least such that can be decoded with failure probability at most .
Then, for every and every symmetric channel , there is a polynomial such that for every and every there is an affine code , that is generated by the rows of and an affine shift, with the property that the rate of is at least , and can be encoded and decoded in time and failure probability at most .
We prove this theorem in Section 4.
Note that in this theorem, we assume that achieves failure probabilities for additive channels (which is only a subclass of all symmetric channels), to conclude that it achieves failure probability for all symmetric channels. This is potentially useful, as proving good properties of polar codes for additive channels is often simpler — in this setting there is a very clean equivalence between coding and linear compression schemes.
3 Structural analysis of matrices
3.1 Exponential polarization for all mixing matrices
We will first prove that a single specific matrix, namely , after taking second Kronecker power satisfies exponential polarization. In  local polarization of any mixing matrix was shown essentially by reducing to this case. Here we make this reduction more explicit, so that it commutes with taking Kronecker product of a matrix with itself. That is, we will later show that for any mixing matrix exponential polarization of can be reduced to exponential polarization of .
Lemma 3.1.
Consider for nonzero . For every matrix satisfies exponential polarization.
Consider arbitrary sequence of i.i.d. random variables with , as in the definition of exponential polarization. We can explicitly write down matrix as
Matrix has four rows — to achieve parameter of exponential polarization, we just need to show that there is at least one index satisfying the inequality as in the definition of exponential polarization (Definition 2.7). Let us consider vector and similarly . We want to bound
By Lemma C.1 there exist some function , such that . Now, given vector and , we can try to predict as follows: if we report . Otherwise, we report .
We want to show that . Indeed, only if at least two of the variables for are non-zero. By symmetry, we have .
By Fano’s inequality C.2, we have . For any given , there exist such that if we have , hence for those values of we have . ∎
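The last step can be sanity-checked numerically. The sketch below evaluates the standard form of Fano's inequality, normalized by the logarithm of the alphabet size, and compares it with a power of the noise parameter. The error probability model `c * eps**2` (with a hypothetical constant `c`) and the specific exponent are assumptions made for illustration only; they are not the exact quantities from the proof.

```python
import math


def fano_bound_normalized(pe, q):
    """Fano upper bound on H(X | Y), divided by log(q).

    pe is the error probability of some predictor of a q-ary X from Y.
    Uses the standard form: H(X | Y) <= h(pe) + pe * log(q - 1).
    """
    if pe <= 0:
        return 0.0
    if pe >= 1:
        return math.log(q - 1) / math.log(q)
    h = -pe * math.log(pe) - (1 - pe) * math.log(1 - pe)
    return (h + pe * math.log(q - 1)) / math.log(q)


# Hypothetical setting: the predictor fails with probability at most c * eps^2.
c, gamma = 6, 0.5
for eps in (1e-2, 1e-4, 1e-6):
    bound = fano_bound_normalized(c * eps ** 2, 2)
    print(eps, bound, eps ** (2 - gamma))
```

The bound behaves like eps^2 times a logarithmic factor, so for any fixed gamma it eventually drops below eps^(2 - gamma), but how small eps must be depends on gamma and on the constant.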
We will now proceed to show that exponential polarization for of any mixing matrix can be reduced to the theorem above. To this end we define the following containment relation for matrices.
Definition 3.2 (Matrix (useful) containment).
We say that a matrix contains a matrix , if there exist some and a permutation matrix , such that . If moreover the last non-zero row of is a rescaling of the standard basis vector , we say that the containment of in is useful and we denote it by . Note that useful containment is not a partial order.
The following fact about useful containment will be helpful.
Claim 3.3.
If , then for any upper triangular matrix with diagonal elements , we also have .
Consider matrix and permutation as in the definition of useful containment for . We can pick the very same permutation and matrix to witness . All we have to show is that last non-zero row of is standard basis vector . Indeed, if is the last non-zero row of , and , rows are supported exclusively on elements with indices larger than , hence . On the other hand , where the last equality follows from the fact that was useful — that is and for .
The results of Lemma 5.5 in  can be reinterpreted as the following lemma. We give a full new proof here, as we describe it now in the language of useful containment.
Every mixing matrix contains matrix in a useful way.
For any matrix , there is some permutation matrix and pair , such that where is lower triangular, and is upper triangular. Matrix being mixing is equivalent to the statement that and are invertible, and moreover is not diagonal. As such by Claim 3.3 it is enough to show that any lower-triangular , which is not diagonal, contains in a useful way. Indeed, let be the last column of that contains more than a single non-zero entry, and let be the last row with a non-zero entry in column . Note that column has a single non-zero entry . We will show a matrix as in the definition of useful containment. Let us specify a second column of . To specify the first column of we wish to find a linear combination of columns of such that . Then the coefficients can be used as the first column of matrix . We can set those coefficients to for , and ; this setting is correct, because the columns for have only one non-zero entry . Now if is any matrix corresponding to a permutation which maps and , the containment is witnessed by the pair and . ∎
If matrix where and , then .
Consider matrix and permutation as in the definition of useful containment for . Note that . As such, the restriction of the matrix to the rows corresponding to is exactly , and all remaining rows are zero. We can apply an additional permutation matrix so that those are exactly the first rows of the matrix, giving the matrix , and the remaining rows are zero. ∎
If matrix contains matrix in a useful way, then matrix satisfies exponential polarization.
Take and as in the definition of containment. Let moreover be the last non-zero row of . We have
Observe now that . Indeed, according to the definition of useful containment and because is the last non-zero row of , we have (the -th row has only one non-zero entry ), as well as . Therefore
where the last equality follows from the fact that and are identically distributed (i.e. entries in are i.i.d).
This conditional entropy was bounded in the proof of Lemma 3.1. ∎
3.2 Maximally polarizing matrix
In this subsection we will prove Lemma 2.10.
Proof of Lemma 2.10.
Let us again consider a sequence of i.i.d. pairs for , such that . By Lemma C.1, there is some such that . Let us take .
We wish to bound , for all . We have
where the inequalities follow from the fact that for random variables it is always the case that .
we can produce estimate, where .
Let us observe that if then . Indeed, we have , therefore , but on the other hand , and by the assumption on we deduce that . Therefore . All coordinates of are independent, and each is nonzero with probability at most , therefore
and by Fano inequality (Lemma C.2), we have
where . Again, for any , and small enough (with respect to ), we have .
This shows that for any and small enough we have
which completes the proof of exponential polarization for matrix . ∎
3.3 Source coding implies good distance
Proof of Lemma 2.13.
Consider maximum likelihood decoder . By definition, we have .
Note that for distributed according to , we have , where is number of non-zero elements of .
Consider set , and observe that . We say that vector is dominated by (denoted by ) if and only if . We wish to argue that for any and any , we have . Indeed, if , then there is some such that . We will show that , which implies that . Given that , we can equivalently say that there is a vector with and . Hence
Consider now to be minimum weight non-zero vector, and let us denote . We wish to show a lower bound for . By definition of the set we have , and by upward closure of with respect to domination we have .
On the other hand we have . By comparing these two inequalities we get
4 Strong polarization from limiting exponential polarization, generically
Suppose we know that polar codes associated with a matrix achieve capacity with error probability in the limit of block lengths . In this section, we prove a general result that “lifts” (in a black box manner) such a statement to the claim that, for any , polar codes associated with achieve polynomially fast convergence to capacity (i.e., the block length can be as small as for rates within of capacity), and decoding error probability simultaneously. Thus convergence to capacity at finite block length comes with almost no price in the failure probability. Put differently, the result states that one can get polynomial convergence to capacity for free once one has a proof of convergence to capacity in the limit with good decoding error probability. This latter fact was shown in  for the binary alphabet and  for general alphabets.
Proof of Theorem 2.18.
Consider the channel that outputs on input , where for some (depending on ). The hypothesis on implies that for sufficiently large the polar code corresponding to will have failure probability at most on this channel. Using the well-known equivalence between correcting errors for this additive channel, and linear compression schemes, we obtain that for all large enough there is some subset of columns of that defines a linear compression scheme (for i.i.d copies of ), along with an accompanying decompression scheme with error probability (over the randomness of the source) at most .
We now claim that for all , there exists such that the Arıkan martingale associated with some column permuted version of , is -exponentially strongly polarizing.
The proof of this claim is in fact immediate, given the ingredients developed in previous sections. Apply the hypothesis about in the theorem with the choice and chosen small enough as a function so that and let be a large enough promised value of . Put , and and . Using Lemma 2.13, we know there is submatrix