 # Information-Distilling Quantizers

Let X and Y be dependent random variables. This paper considers the problem of designing a scalar quantizer for Y to maximize the mutual information between the quantizer's output and X, and develops fundamental properties and bounds for this form of quantization, which is connected to the log-loss distortion criterion. The main focus is the regime of low I(X;Y), where it is shown that, if X is binary, a constant fraction of the mutual information can always be preserved using O(log(1/I(X;Y))) quantization levels, and there exist distributions for which this many quantization levels are necessary. Furthermore, for larger finite alphabets 2 < |X| < ∞, it is established that an η-fraction of the mutual information can be preserved using roughly (log(|X|/I(X;Y)))^(η·(|X|−1)) quantization levels.


## I Introduction

Let $X$ and $Y$ be a pair of random variables with alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively, and a given distribution $P_{XY}$. This paper deals with the problem of quantizing $Y$ into $M$ values, under the objective of maximizing the mutual information between the quantizer's output and $X$. With a slight abuse of notation (the notation is meant to suggest the distance from a point to a set), we denote the value of the mutual information attained by the optimal $M$-ary quantizer by

$$I(X;[Y]_M) \triangleq \sup_{\tilde{Y}\in[Y]_M} I(X;\tilde{Y}), \tag{1}$$

where $[Y]_M$ is the set of all (deterministic) $M$-ary quantizations of $Y$,

$$[Y]_M \triangleq \{ f(Y) \,:\, f:\mathcal{Y}\to[M] \},$$

and $[M]\triangleq\{1,\ldots,M\}$.
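As a concrete illustration of definition (1), the following minimal Python sketch computes $I(X;[Y]_M)$ for a small finite output alphabet by exhaustive search over all maps $f:\mathcal{Y}\to[M]$. The helper names (`mutual_information`, `best_quantizer`) and the toy joint distribution are illustrative assumptions, not taken from the paper, and the search is exponential in $|\mathcal{Y}|$, so it is only suitable for toy examples.

```python
import itertools
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as a 2-D array (rows: x, columns: y)."""
    joint = np.asarray(joint, dtype=float)
    product = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

def best_quantizer(joint, M):
    """Exhaustive search over all deterministic maps f: Y -> {0, ..., M-1}."""
    joint = np.asarray(joint, dtype=float)
    n_x, n_y = joint.shape
    best_value, best_map = -1.0, None
    for f in itertools.product(range(M), repeat=n_y):
        merged = np.zeros((n_x, M))
        for y, cell in enumerate(f):
            merged[:, cell] += joint[:, y]   # merge outputs that share a cell
        value = mutual_information(merged)
        if value > best_value:
            best_value, best_map = value, f
    return best_value, best_map

# Toy joint distribution (an arbitrary example, not from the paper).
joint = np.array([[0.30, 0.10, 0.05, 0.05],
                  [0.05, 0.05, 0.10, 0.30]])
for M in (2, 3, 4):
    value, f = best_quantizer(joint, M)
    print(M, round(value, 4), f)
```

With $M=|\mathcal{Y}|$ the search recovers $I(X;Y)$ itself, and the value can only grow with $M$, since every $M$-level map is also an $(M+1)$-level map.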

When $X$ and $Y$ are thought of as the input and output of a channel, this problem corresponds to determining the highest available information rate for $M$-level quantization. It is therefore not surprising that this problem has received considerable attention. For example, it is well known [2, Section 2.11] that when $X$ is equiprobable on $\{-1,+1\}$ and $Y=\sqrt{\mathrm{snr}}\,X+Z$ for Gaussian $Z$, it holds that $I(X;[Y]_2)\ge\frac{2}{\pi}I(X;Y)$, which is achieved by taking $\tilde{Y}$ to be the maximum a posteriori (MAP) estimator of $X$ from $Y$. (It has also been demonstrated that if an asymmetric signaling scheme is used instead of binary phase-shift keying (BPSK), the additive white Gaussian noise channel capacity can be attained at low signal-to-noise ratio (SNR) with an asymmetric $2$-level quantizer.)

A characterization of (1) is also required for the construction of good polar codes, since the large output cardinality of polarized channels makes it challenging to evaluate their respective capacities (and identify "frozen" bits). Efficient techniques for channel output quantization that preserve mutual information have been developed to overcome this obstacle, and played a major role in making polar codes implementable [5, 6, 7]. One byproduct of these efforts is a sharp characterization of the additive gap $I(X;Y)-I(X;[Y]_M)$. Specifically, it was recently shown that, for arbitrary $P_{XY}$, this gap is $O(M^{-2/(|\mathcal{X}|-1)})$, whereas there exist $P_{XY}$ for which it is $\Omega(M^{-2/(|\mathcal{X}|-1)})$. The works [5, 6, 7], among others, also provided polynomial-complexity, sub-optimal algorithms for designing such quantizers. In addition, for binary $X$, an algorithm for determining the optimal quantizer has been proposed (drawing upon an earlier result) whose running time is polynomial in $|\mathcal{Y}|$. A supervised learning algorithm, for the scenario where $P_{XY}$ is not known and cannot be estimated with good accuracy, has also been proposed.

It may at first appear surprising that the quality of quantization found in [5, 6, 7] depends on the alphabet size $|\mathcal{X}|$ but not on $|\mathcal{Y}|$. The reason for this is that, given $Y$, the relevant information about $X$ is the posterior distribution $P_{X|Y}$, which is a point on the $(|\mathcal{X}|-1)$-dimensional simplex. Thus, the goal of quantizing $Y$ is essentially a goal of quantizing the probability simplex. The goal of this paper is to understand the fundamental limits of this quantization, as a function of alphabet size. The crucial difference with [5, 6, 7] is that here we focus on the multiplicative gap, i.e., comparing the ratio of $I(X;[Y]_M)$ to $I(X;Y)$. The difference is especially profound in the case when $I(X;Y)$ is small. We ignore the algorithmic aspects of finding the optimal $M$-level quantizer and instead focus on the fundamental properties of the function $I(X;[Y]_M)$. To this end, we define and study the "information distillation" function

$$\mathrm{ID}_M(K,\beta) \triangleq \inf_{\substack{P_{XY}\,:\,|\mathcal{X}|=K \\ I(X;Y)\ge\beta}} I(X;[Y]_M). \tag{2}$$

The infimum above is taken with respect to all joint distributions $P_{XY}$ with discrete input alphabet $\mathcal{X}$ of cardinality $K$ and arbitrary (possibly continuous) output alphabet $\mathcal{Y}$ such that the mutual information $I(X;Y)$ is at least $\beta$. One may wonder whether $K$ has an essential role in the function $\mathrm{ID}_M(K,\beta)$. Proposition 4, stated and proved in Section II, shows that for any $\beta>0$ and $M$ it holds that $\mathrm{ID}_M(K,\beta)\to 0$ as $K\to\infty$. Thus, one must indeed restrict the cardinality of $\mathcal{X}$ in (2) in order to get a meaningful quantity.

Special attention will be given to the binary input alphabet case, $|\mathcal{X}|=2$. In this setting, it may seem at first glance that the optimal binary quantizer should always retain a significant fraction of $I(X;Y)$, and that the MAP quantizer should be sufficient to this end. For large $I(X;Y)$, this is indeed the case, as we show in Proposition 6. As mentioned above, this is also the case if $Y=\sqrt{\mathrm{snr}}\,X+Z$ with Gaussian $Z$, for all values of $\mathrm{snr}$, since the MAP quantizer always retains at least $2/\pi$ of the mutual information. However, perhaps surprisingly, we show that there is no constant $c>0$ such that $I(X;[Y]_2)\ge c\cdot I(X;Y)$ for all $P_{XY}$ with $|\mathcal{X}|=2$.

### I-A Main Results

Our main result is a complete characterization, up to constants, of the binary information distillation function.

###### Theorem 1

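For any mutual information value $0<\beta\le 1$, the binary information distillation function is lower and upper bounded as follows:

$$\beta\cdot f_{\mathrm{lower}}\!\left(\frac{M-1}{\max\{\log(1/\beta),1\}}\right) \le \mathrm{ID}_M(2,\beta) \le \beta\cdot f_{\mathrm{upper}}\!\left(\frac{M-1}{\max\{\log(1/\beta),1\}}\right),$$

where

$$f_{\mathrm{lower}}(t)\triangleq\begin{cases}\dfrac{t}{208} & t<104\\[4pt] 1-\dfrac{52}{t} & t\ge 104\end{cases},\qquad f_{\mathrm{upper}}(t)\triangleq\min\{3t,1\}.$$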
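To get a feel for the numbers in Theorem 1, the bounds can be evaluated directly. The sketch below is a plain transcription of $f_{\mathrm{lower}}$ and $f_{\mathrm{upper}}$, assuming base-2 logarithms (information measured in bits); the function name `theorem1_bounds` is a hypothetical helper introduced here for illustration.

```python
import math

def f_lower(t):
    # Piecewise definition from Theorem 1; both branches equal 1/2 at t = 104.
    return t / 208 if t < 104 else 1 - 52 / t

def f_upper(t):
    return min(3 * t, 1.0)

def theorem1_bounds(M, beta):
    """Return (lower, upper) bounds on ID_M(2, beta) in bits, for 0 < beta <= 1."""
    t = (M - 1) / max(math.log2(1 / beta), 1.0)
    return beta * f_lower(t), beta * f_upper(t)

beta = 2.0 ** -20          # a small mutual information value, in bits
for M in (2, 8, 64, 4096):
    lower, upper = theorem1_bounds(M, beta)
    print(M, lower / beta, upper / beta)   # guaranteed / best-possible fraction
```

For $\beta=2^{-20}$ the argument is $t=(M-1)/20$, so the upper bound already rules out retaining more than a $3/20$ fraction with $M=2$, while the lower bound only becomes substantial once $M$ is a large multiple of $\log(1/\beta)$, reflecting the loose constants.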

The proof is deferred to Section III-D. Note that the negative aspect of this result is in stark contrast to the intuition from the binary additive white Gaussian noise (AWGN) channel. While for the AWGN channel two quantization levels suffice for retaining a $2/\pi$ fraction of $I(X;Y)$, Theorem 4 shows that there exist sequences of distributions for which at least $\Omega(\log(1/I(X;Y)))$ quantization levels are needed in order to retain a fixed fraction of $I(X;Y)$. Furthermore, as illustrated in Section III, for small $I(X;Y)$ and $M=2$, the MAP quantizer can be arbitrarily bad w.r.t. the optimal quantizer, which is in general not "symmetric." On the positive side, $O(\log(1/I(X;Y)))$ quantization levels always suffice for retaining a fixed fraction of $I(X;Y)$.

For the general case where $|\mathcal{X}|>2$, we prove the following.

###### Theorem 2

Define

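$$a_0(M,|\mathcal{X}|,\beta) \triangleq \frac{1}{|\mathcal{X}|-1}\cdot\min\left\{\frac{M-1}{208\log\left(\frac{(|\mathcal{X}|-1)^2}{\beta}\right)},\ \frac{1}{2}\right\} \tag{3}$$

$$a_{|\mathcal{X}|-1}(M,|\mathcal{X}|,\beta)\triangleq\left(1-\left(\frac{52\log\left(\frac{e(|\mathcal{X}|-1)}{\beta}\right)}{M^{1/(|\mathcal{X}|-1)}}\right)^{2/3}\right)^{2} \tag{4}$$

and, for $0<k<|\mathcal{X}|-1$, define

$$a_k(M,|\mathcal{X}|,\beta) \triangleq \frac{k}{|\mathcal{X}|-1}\left(1-\frac{52\log\left(\frac{(|\mathcal{X}|-1)^2}{\beta}\right)}{M^{1/k}}\right). \tag{5}$$

Then, for any $|\mathcal{X}|\ge 2$ and $0<\beta\le\log|\mathcal{X}|$, the information distillation function is upper and lower bounded as follows:

$$\beta\cdot\max_{k\in\{0,1,\ldots,|\mathcal{X}|-1\}} a_k(M,|\mathcal{X}|,\beta)\le \mathrm{ID}_M(|\mathcal{X}|,\beta) \le \beta\cdot f_{\mathrm{upper}}\!\left(\frac{M-1}{\max\{\log(1/\beta),1\}}\right), \tag{6}$$

where $f_{\mathrm{upper}}$ is as defined in Theorem 1.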

The proof of the lower bound is deferred to Section IV, whereas the upper bound follows trivially by noting that $K\mapsto\mathrm{ID}_M(K,\beta)$ is monotone non-increasing and invoking the upper bound on $\mathrm{ID}_M(2,\beta)$ from Theorem 1.

The lower bound from Theorem 2 states that for all and , it holds that suffices to guarantee that . In particular, choosing , we obtain that suffices to attain and, on the other hand, by the upper bound, there exist for which is required in order to attain . Thus, Theorem 2 gives a tight characterization (up to constants independent of ) of the number of quantization levels required in order to maintain a fraction of of . However, we were not successful in establishing an upper bound that matches the lower bound within the range . We nevertheless conjecture that for close to our lower bound is tight.

###### Conjecture 1

For any , there exists some , and a constant , such that for all and , it holds that

 IDM(|X|,β)<η(|X|)⋅β. (7)

As discussed above, prior work [5, 6, 7] has focused on bounding the additive gap. This corresponds to bounding the so-called “degrading cost” [8, 7], which is defined as

$$\mathrm{DC}(|\mathcal{X}|,M)\triangleq\sup_{0<\beta\le\log|\mathcal{X}|}\ \beta-\mathrm{ID}_M(|\mathcal{X}|,\beta) \tag{8}$$

in our notation. In particular, the bound derived in  on is equivalent to the following “constant-gap” result: for every ,

 IDM(|X|,β)≥β−ν(|X|)M−2/(|X|−1)

for some function $\nu(\cdot)$. (It has also been demonstrated that there exist values of $\beta$ for which this bound is tight, in the sense that for some distribution the gap $I(X;Y)-I(X;[Y]_M)$ is at least a constant multiple of $M^{-2/(|\mathcal{X}|-1)}$.) For small $\beta$, however, results of this form are less informative. Indeed, for small $\beta$, this bound requires $M$ to scale like $\beta^{-(|\mathcal{X}|-1)/2}$ in order to preserve a constant fraction of the mutual information. On the other hand, our result shows that $M$ scaling like $(\log(1/\beta))^{|\mathcal{X}|-1}$ suffices for all joint distributions $P_{XY}$.

Notation: In this paper, logarithms are generally taken w.r.t. base $2$, and all information measures are given in bits. When a logarithm is taken w.r.t. base $e$, we use the notation $\ln$ instead of $\log$. We denote the binary entropy function by $h(p)\triangleq -p\log p-(1-p)\log(1-p)$, and its inverse restricted to the interval $[0,1/2]$ by $h^{-1}(\cdot)$. The notation $\lfloor x\rfloor$ denotes the "floor" operation, i.e., the largest integer smaller than or equal to $x$.

## II Properties of $I(X;[Y]_M)$

Let $P_{XY}$ be a joint distribution on $\mathcal{X}\times\mathcal{Y}$ and consider the function $I(X;[Y]_M)$, as defined in (1). The restriction to deterministic functions incurs no loss of generality. Indeed, any random function of $Y$ can be expressed as $f(Y,U)$, where $U$ is some random variable statistically independent of $(X,Y)$. Thus,

 I(X;f(Y,U))≤I(X;f(Y,U),U)=I(X;f(Y,U)|U) (9)

and hence there must exist some $u$ for which $I(X;f(Y,u))\ge I(X;f(Y,U))$. Furthermore, with any function $f:\mathcal{Y}\to[M]$ we can associate a disjoint partition of the $(|\mathcal{X}|-1)$-dimensional cube $[0,1]^{|\mathcal{X}|-1}$ into $M$ regions $\mathcal{A}_1,\ldots,\mathcal{A}_M$, such that $f(y)=i$ iff $P_{X|Y=y}\in\mathcal{A}_i$, for $i=1,\ldots,M$. A remarkable result of Burshtein et al. [10, Theorem 1] shows that the maximum in (1) can without loss of generality be restricted to functions for which there exists an associated partition where the regions are all convex.

Below, we state simple upper and lower bounds on $I(X;[Y]_M)$.

###### Proposition 1 (Simple bounds)

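For any distribution $P_{XY}$ on $\mathcal{X}\times\mathcal{Y}$ with a finite output alphabet, and any $M\le|\mathcal{Y}|$,

$$\frac{M-1}{|\mathcal{Y}|}I(X;Y)\le I(X;[Y]_M)\le\min\{I(X;Y),\log(M)\}.$$

Proof. The upper bound does not require any assumptions on $|\mathcal{Y}|$ and follows from the data processing inequality ($X - Y - f(Y)$ forms a Markov chain in this order), and from $I(X;f(Y))\le H(f(Y))\le\log(M)$.

For the lower bound, we can identify the elements of $\mathcal{Y}$ with $\{1,\ldots,|\mathcal{Y}|\}$ such that

$$P_Y(1)D(P_{X|Y=1}\|P_X)\ge\cdots\ge P_Y(|\mathcal{Y}|)D(P_{X|Y=|\mathcal{Y}|}\|P_X)$$

and take the quantization function

$$f(y)=\begin{cases} y & \text{if } y<M\\ M & \text{if } y\ge M.\end{cases}$$

Recalling that $I(X;Y)=\sum_{y}P_Y(y)D(P_{X|Y=y}\|P_X)$, we see that the first $M-1$ cells of $f$ retain the $M-1$ largest terms of this sum, and hence

$$I(X;f(Y))\ge\frac{M-1}{|\mathcal{Y}|}I(X;Y).$$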
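The quantizer used in the proof of the lower bound is easy to try out numerically. The following sketch, with hypothetical helper names `contributions` and `greedy_quantizer`, keeps the $M-1$ outputs with the largest terms $P_Y(y)D(P_{X|Y=y}\|P_X)$ in cells of their own and merges all remaining outputs into one cell. The random joint distribution is an arbitrary test input, not an example from the paper.

```python
import numpy as np

def contributions(joint):
    """Per-output terms P_Y(y) * D(P_{X|Y=y} || P_X) in bits; they sum to I(X;Y)."""
    joint = np.asarray(joint, dtype=float)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    out = np.zeros(joint.shape[1])
    for y in range(joint.shape[1]):
        if py[y] > 0:
            post = joint[:, y] / py[y]
            mask = post > 0
            out[y] = py[y] * np.sum(post[mask] * np.log2(post[mask] / px[mask]))
    return out

def greedy_quantizer(joint, M):
    """Return the map f (as an index array) and the resulting I(X; f(Y))."""
    joint = np.asarray(joint, dtype=float)
    order = np.argsort(-contributions(joint))   # largest contribution first
    f = np.full(joint.shape[1], M - 1)          # default: the shared last cell
    f[order[:M - 1]] = np.arange(M - 1)         # top M-1 outputs get own cells
    merged = np.zeros((joint.shape[0], M))
    for y, cell in enumerate(f):
        merged[:, cell] += joint[:, y]
    return f, float(contributions(merged).sum())

rng = np.random.default_rng(0)
joint = rng.random((3, 6))
joint /= joint.sum()
full = float(contributions(joint).sum())
for M in range(2, 7):
    f, value = greedy_quantizer(joint, M)
    print(M, round(value / full, 4), ">=", round((M - 1) / 6, 4))
```

The printed ratio $I(X;f(Y))/I(X;Y)$ should never fall below $(M-1)/|\mathcal{Y}|$, as guaranteed by Proposition 1; the construction is generally sub-optimal and serves only to establish the bound.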

For $K<M$, we can construct a (possibly sub-optimal) $K$-level quantizer by first finding the optimal $M$-level quantizer and then quantizing its output to $K$ levels. This, together with the lower bound in Proposition 1, yields the following.

###### Corollary 1

For natural numbers $K<M$ we have

$$I(X;[Y]_K)\ge\frac{K-1}{M}I(X;[Y]_M).$$

###### Remark 1

It is tempting to expect that will have “diminishing returns” in for any , i.e., that it will satisfy the inequality . However, as demonstrated by the following example, this is not the case. Let and , where is additive noise statistically independent of with and . Clearly,

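$$I(X;[Y]_4)=I(X;Y)=2-h(\delta)-(1-\delta)\log(3), \tag{10}$$

and it can be verified that

$$I(X;[Y]_2)=\begin{cases} h\!\left(\tfrac{1}{4}\right)-\tfrac{1}{4}h(\delta)-\tfrac{3}{4}h\!\left(\tfrac{1-\delta}{3}\right) & \delta\le 1/4,\\[4pt] 1-h\!\left(\tfrac{1+2\delta}{3}\right) & \delta>1/4.\end{cases} \tag{11}$$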

Thus, for this example we have that for all .
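Equations (10) and (11) can be checked by brute force. The sketch below assumes a specific reading of the example that is inferred from (10) and (11) rather than stated explicitly in the surviving text: $X$ uniform on $\{0,1,2,3\}$, $Y=X+Z \bmod 4$, with $\Pr(Z=0)=\delta$ and the remaining mass spread evenly over the other three values. All function names are hypothetical.

```python
import itertools
import math
import numpy as np

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(joint):
    joint = np.asarray(joint, dtype=float)
    product = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

def mod4_joint(delta):
    """Assumed model: X uniform on Z_4, Y = X + Z mod 4, Pr(Z=0) = delta."""
    pz = [delta] + [(1 - delta) / 3] * 3
    joint = np.zeros((4, 4))
    for x in range(4):
        for z in range(4):
            joint[x, (x + z) % 4] += 0.25 * pz[z]
    return joint

def best_binary(joint):
    """Brute-force I(X;[Y]_2) over all maps f: Y -> {0, 1}."""
    best = 0.0
    for f in itertools.product(range(2), repeat=joint.shape[1]):
        merged = np.zeros((joint.shape[0], 2))
        for y, cell in enumerate(f):
            merged[:, cell] += joint[:, y]
        best = max(best, mutual_information(merged))
    return best

def formula_10(delta):
    return 2 - h(delta) - (1 - delta) * math.log2(3)

def formula_11(delta):
    if delta <= 0.25:
        return h(0.25) - 0.25 * h(delta) - 0.75 * h((1 - delta) / 3)
    return 1 - h((1 + 2 * delta) / 3)

for delta in (0.1, 0.6):
    joint = mod4_joint(delta)
    print(delta, mutual_information(joint), formula_10(delta),
          best_binary(joint), formula_11(delta))
```

Under this assumed model, the first branch of (11) corresponds to isolating a single output symbol, and the second branch to splitting the outputs into two pairs.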

###### Proposition 2 (Data processing inequality)

If $X - Y - V$ form a Markov chain in this order, then

$$I(X;[V]_M)\le I(X;[Y]_M).$$

Proof. For any function $f:\mathcal{V}\to[M]$ we can generate a random function $\tilde{f}:\mathcal{Y}\to[M]$ which first passes $Y$ through the channel $P_{V|Y}$ and then applies $f$ on its output. By (9), we can always replace $\tilde{f}$ by some deterministic function $\bar{f}$ such that

$$I(X;\bar{f}(Y))\ge I(X;\tilde{f}(Y))=I(X;f(V)).$$

###### Proposition 3

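For a fixed $P_X$, the function $P_{Y|X}\mapsto I(X;[Y]_M)$ is convex.

Proof. For any $f:\mathcal{Y}\to[M]$, let $I_f(P_X\times P_{Y|X})\triangleq I(X;f(Y))$, and note that

$$I(X;[Y]_M)=\sup_{f:\mathcal{Y}\to[M]}I_f(P_X\times P_{Y|X}).$$

Since the supremum of convex functions is also convex, it suffices to show that for a fixed $f$ the function $I_f(P_X\times P_{Y|X})$ is convex in $P_{Y|X}$. To this end, consider two channels $P^1_{Y|X}$ and $P^2_{Y|X}$, and let $P^1_{f(Y)|X}$ and $P^2_{f(Y)|X}$, respectively, be the induced channels from $X$ to $f(Y)$. Clearly, for the channel $\alpha P^1_{Y|X}+(1-\alpha)P^2_{Y|X}$, the induced channel is $\alpha P^1_{f(Y)|X}+(1-\alpha)P^2_{f(Y)|X}$. Let $Z$ be the output of this channel, when the input is $X$. From the convexity of the mutual information w.r.t. the channel we have

$$I_f\big(P_X\times(\alpha P^1_{Y|X}+(1-\alpha)P^2_{Y|X})\big)=I(X;Z)\le\alpha I_f(P_X\times P^1_{Y|X})+(1-\alpha)I_f(P_X\times P^2_{Y|X}),$$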

as desired.

###### Remark 2

In contrast to mutual information, the functional is in general not concave in for a fixed . To see this consider the following example: , , and the channel from to is clean, i.e., . Let and . Clearly, . For any , let . It can be verified that

$$I(X;[Y]_M)<1.$$

###### Remark 3 (Complexity of finding the optimal quantizer)

For the special case where $Y=X$, the function $I(X;[Y]_M)$ reduces to

$$H([Y]_M)\triangleq\sup_{\tilde{Y}\in[Y]_M}H(\tilde{Y}). \tag{12}$$

(Recent work by Cicalese, Gargano and Vaccaro provides closed-form upper and lower bounds on $H([Y]_M)$.) Furthermore, when $M=2$ the optimization problem in (12) is equivalent to

$$\max_{A\subseteq\mathcal{X}}\sum_{x\in A}p_x\quad\text{subject to: }\sum_{x\in A}p_x\le\frac{1}{2}, \tag{13}$$

where $p_x\triangleq\Pr(X=x)$, $x\in\mathcal{X}$. The problem (13) is known as the subset sum problem and is NP-hard. Thus, when $|\mathcal{X}|$ is not constrained, the problem of finding the optimal quantizer of $Y$ is in general NP-hard. Nevertheless, for the case where $X$ is binary, a dynamic programming algorithm finds the optimal quantizer in polynomial time.
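The equivalence between (12) with $M=2$ and the subset sum problem (13) rests on the fact that the binary entropy $h(\cdot)$ is symmetric about $1/2$ and increasing on $[0,1/2]$, so maximizing $H(f(X))$ amounts to pushing $\Pr(f(X)=1)$ as close to $1/2$ from below as possible. The brute-force sketch below illustrates this on a small pmf; the function names and the pmf are illustrative assumptions, and the enumeration is exponential in the alphabet size, in line with the NP-hardness remark.

```python
import itertools
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def subsets(n):
    for r in range(n + 1):
        yield from itertools.combinations(range(n), r)

def best_binary_entropy(p):
    """max over f: X -> {0,1} of H(f(X)), by enumerating the set A = f^{-1}(1)."""
    return max(h(sum(p[i] for i in A)) for A in subsets(len(p)))

def best_subset_sum(p):
    """max of P(A) subject to P(A) <= 1/2, as in (13)."""
    sums = (sum(p[i] for i in A) for A in subsets(len(p)))
    return max(s for s in sums if s <= 0.5 + 1e-12)

p = [0.45, 0.25, 0.20, 0.10]     # toy pmf, not from the paper
s = best_subset_sum(p)
print(s, h(s), best_binary_entropy(p))
```

For this pmf the best feasible subset mass is $0.45$, and $h(0.45)$ coincides with the largest entropy achievable by any binary quantizer of $X$.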

###### Proposition 4

For any $\beta>0$, any natural $M$, and $n$ large enough, we have that

 IDM(2n,β)≤log(M)nlog(e)β, (14)

Consequently, for any $\beta>0$ and natural $M$ we have that $\lim_{K\to\infty}\mathrm{ID}_M(K,\beta)=0$, which motivates the restriction to finite input alphabets in our main theorems.

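Proof. Let $X\sim\mathrm{Bernoulli}(1/2)$ and $Z\sim\mathrm{Bernoulli}(\delta)$ be statistically independent, and let $Y=X\oplus Z$. Let $(X^n,Y^n)$ be $n$ i.i.d. copies of $(X,Y)$. For product distributions we have that for any $U$ satisfying the Markov chain $U - Y^n - X^n$, it holds that [14, 15]

$$\frac{I(U;X^n)}{I(U;Y^n)}\le\sup\frac{I(U;X)}{I(U;Y)}, \tag{15}$$

where the supremum is taken w.r.t. all Markov chains $U - Y - X$ with fixed $P_{XY}$. For the doubly symmetric binary source of interest, this supremum is $(1-2\delta)^2$, and consequently, we obtain that for any $f:\{0,1\}^n\to[M]$, it holds that

$$I(f(Y^n);X^n)\le(1-2\delta)^2 I(f(Y^n);Y^n)\le(1-2\delta)^2 H(f(Y^n))\le(1-2\delta)^2\log(M). \tag{16}$$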

For any , take large enough such that , and set

$$\delta=\frac{1}{2}-\sqrt{\frac{\beta}{n\cdot 2\log(e)}} \tag{17}$$

we obtain that

 I(Xn;[Yn]M) ≤log(M)n⋅2log(e)β. (18)

On the other hand, we have that

$$I(X^n;Y^n)=n(1-h(\delta))=\beta(1+o(1)). \tag{19}$$

Thus, for large enough (14) indeed holds.
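The choice of $\delta$ in (17) can be sanity-checked numerically: as $n$ grows, $n(1-h(\delta))$ stays close to $\beta$, as in (19), while the contraction factor $(1-2\delta)^2$ appearing in (16) decays like $1/n$. The sketch below assumes base-2 logarithms and uses a hypothetical helper name `dsbs_check`.

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def dsbs_check(beta, n):
    """Return (delta, n*(1-h(delta)), (1-2*delta)**2) for delta chosen as in (17)."""
    eps = math.sqrt(beta / (n * 2 * math.log2(math.e)))
    if eps >= 0.5:
        raise ValueError("n is too small for this beta")
    delta = 0.5 - eps
    return delta, n * (1 - h(delta)), (1 - 2 * delta) ** 2

beta = 0.5
for n in (10, 100, 10_000):
    delta, total_information, contraction = dsbs_check(beta, n)
    print(n, delta, total_information, contraction)
```

The total information column approaches $\beta$ while the contraction factor shrinks, which is exactly the mechanism that drives $I(X^n;[Y^n]_M)$ to zero for fixed $M$.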

### II-A Relations to quantization for maximizing divergence

For two distributions $P$, $Q$ on $\mathcal{Y}$, and a natural number $M$, define

$$\psi_M(P,Q)\triangleq\sup_{f:\mathcal{Y}\to[M]}D(P_f\|Q_f), \tag{20}$$

where $P_f$ and $Q_f$ are the distributions on $[M]$ induced by applying the function $f$ on the random variables generated by $P$ and $Q$, respectively. A classical characterization of Gelfand-Yaglom-Perez [16, Section 3.4] shows that $\psi_M(P,Q)\to D(P\|Q)$ as $M\to\infty$. We are interested here in understanding the speed of this convergence. To this end, we prove the following result.

###### Proposition 5

For any , there exist two distributions on such that and for any .

Proof. Consider the following two distributions:

 P(m) =⎧⎨⎩2−mm=1,…,T2−(T−1)m=T0m>T Q(m) =⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩P(m)1≤m≤kg(m)⋅P(m)k

where is some monotonically non-increasing function. We have that

 D(P||Q)=T∑m=k+12−mlog(1/g(m)), (21)

whereas for any we have that

 D(Pf||Qf)≤M⋅maxA⊂{0,1,…}P(A)logP(A)Q(A). (22)

Let . Without loss of generality, we can assume that , as otherwise . Thus, we can define and write

 P(A) =P(Ak)+P(A∖Ak)≤P(Ak)+2⋅2−ℓ Q(A) =Q(Ak)+Q(A∖Ak)≥P(Ak)+2−ℓg(ℓ) (23)

Let , and such that the bounds above read as and , and

 P(A)logP(A)Q(A) ≤−2−ℓtlog(1−τt). (24)

We note that the function is convex and monotone decreasing in the range . This implies that (24) is maximized by choosing such that , for which , and we obtain

 D(Pf∥Qf)

Now, take for some , and note that it is indeed monotone non-increasing in , which yields

 D(P∥Q) =αT∑m=k+11m (26) D(Pf∥Qf) ≤2M(2−ℓ+αℓ)≤2M(2−k+αk). (27)

The statement follows by noting that we can always choose such that the left hand side of (27) is smaller than , and then we can choose and such that the left hand side of (26) is equal to .

Proposition 5 shows that for any fixed $M$, and any value of $D(P\|Q)$, the ratio $\psi_M(P,Q)/D(P\|Q)$ can be arbitrarily small. (However, under some restrictions on the distributions $P$ and $Q$, a quantizer with a fixed number of levels is known to retain a constant fraction of $D(P\|Q)$.) Note that choosing a different $f$-divergence in the definition of $\psi_M$ instead of the KL-divergence could lead to very different results. In particular, under the total variation criterion, the $1$-bit quantizer already attains the full total variation distance for any pair of distributions $P$, $Q$ on $\mathcal{Y}$. An interesting question for future study is for which $f$-divergences the ratio is always positive.

## III Bounds for Binary $X$

In this section, we consider the case of $|\mathcal{X}|=2$, and provide upper and lower bounds on $\mathrm{ID}_M(2,\beta)$. We begin by studying the case where $M=2$, through which we shall demonstrate why the multiplicative decrease in mutual information is small when $I(X;Y)$ is high (close to $1$). These findings illustrate that the more interesting regime for $\mathrm{ID}_M(2,\beta)$ is the one where $\beta$ is small. For this regime, we derive lower and upper bounds that match up to constants that do not depend on $\beta$.

### III-A Binary Quantization ($M=2$)

The aim of this subsection is to analyze the performance of quantizers whose cardinality is equal to that of $\mathcal{X}$. In this case, a natural choice for the quantizer is the maximum a posteriori (MAP) estimator of $X$ from $Y$. Intuitively, when $I(X;Y)$ is high (close to $1$), the MAP estimator should not make many errors and the mutual information between it and $X$ should be high as well. We make this intuition precise below. However, when $I(X;Y)$ is low, it turns out that not only does the MAP estimator fail to retain a significant fraction of $I(X;Y)$, but it can be significantly inferior to other binary quantizers.

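Assume without loss of generality that $\mathcal{X}=\{1,2\}$. The maximum a posteriori (MAP) quantizer is defined by

$$f_{\mathrm{MAP}}(y)=\begin{cases}1 & \text{if } \Pr(X=1|Y=y)>1/2\\ 2 & \text{if } \Pr(X=1|Y=y)<1/2\\ 1\cdot U+2(1-U) & \text{if } \Pr(X=1|Y=y)=1/2,\end{cases} \tag{28}$$

where $U\sim\mathrm{Bernoulli}(1/2)$ is statistically independent of $(X,Y)$. Let $P_{e,\mathrm{MAP}}(y)\triangleq\Pr(X\ne f_{\mathrm{MAP}}(y)|Y=y)$ and $P_{e,\mathrm{MAP}}\triangleq\mathbb{E}_Y P_{e,\mathrm{MAP}}(Y)$. By the concavity of the binary entropy function $h(\cdot)$, we have that $h(p)\ge 2p$ for any $0\le p\le 1/2$, with equality iff $p\in\{0,1/2\}$. Consequently,

$$H(X|Y)=\mathbb{E}_Y h(P_{e,\mathrm{MAP}}(Y))\ge 2P_{e,\mathrm{MAP}}. \tag{29}$$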

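Let $p\triangleq\Pr(X=1)$ and $\beta\triangleq I(X;Y)$. We have that

$$\begin{align} I(X;f_{\mathrm{MAP}}(Y)) &= H(X)-H(X|f_{\mathrm{MAP}}(Y)) \tag{30}\\ &= h(p)-\Pr(f_{\mathrm{MAP}}(Y)=1)\,h\big(\Pr(X\ne 1|f_{\mathrm{MAP}}(Y)=1)\big) \nonumber\\ &\qquad -\Pr(f_{\mathrm{MAP}}(Y)=2)\,h\big(\Pr(X\ne 2|f_{\mathrm{MAP}}(Y)=2)\big) \nonumber\\ &\ge h(p)-h(P_{e,\mathrm{MAP}}) \nonumber\\ &\ge h(p)-h\!\left(\frac{H(X|Y)}{2}\right) \nonumber\\ &= h(p)-h\!\left(\frac{h(p)-\beta}{2}\right). \tag{31}\end{align}$$

Since $\beta\le h(p)\le 1$, we have obtained that

$$I(X;f_{\mathrm{MAP}}(Y))\ge\min_{\beta\le t\le 1}\ t-h\!\left(\frac{t-\beta}{2}\right)=\begin{cases}\beta+\frac{2}{5}-h\!\left(\frac{1}{5}\right) & \beta<\frac{3}{5}\\[4pt] 1-h\!\left(\frac{1-\beta}{2}\right) & \beta\ge\frac{3}{5}.\end{cases} \tag{32}$$

Since $I(X;[Y]_2)\ge I(X;f_{\mathrm{MAP}}(Y))$, it follows that the right hand side of (32) is a lower bound on $\mathrm{ID}_2(2,\beta)$.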

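In order to obtain an upper bound on $\mathrm{ID}_2(2,\beta)$, assume $X\sim\mathrm{Bernoulli}(1/2)$ and $P_{Y|X}$ is the binary erasure channel (BEC), i.e., $\mathcal{Y}=\{0,1,?\}$ and

$$\Pr(Y=y|X=x)=\begin{cases}\beta & \text{if } y=x\\ 1-\beta & \text{if } y=\,?,\end{cases} \tag{33}$$

such that $I(X;Y)=\beta$. Consider the quantizer

$$f_Z(y)=\begin{cases}1 & \text{if } y\in\{1,?\}\\ 2 & \text{if } y=0.\end{cases}$$

Since there exists an optimal deterministic quantizer, and any deterministic $1$-bit quantizer for the BEC output is of the form $f_Z$, this must be an optimal $1$-bit quantizer. Note that the induced channel from $X$ to $f_Z(Y)$ is a $Z$-channel, and it satisfies

$$I(X;f_Z(Y))=\frac{\beta}{2}h\!\left(\frac{1-\beta}{2-\beta}\right)+1-h\!\left(\frac{1-\beta}{2-\beta}\right). \tag{34}$$

By the optimality of the quantizer $f_Z$ for this particular distribution, it follows that the right hand side of (34) constitutes an upper bound on $\mathrm{ID}_2(2,\beta)$.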

We have therefore established the following proposition.

###### Proposition 6

For all we have

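$$1-h\!\left(\frac{\epsilon}{2}\right)\le\mathrm{ID}_2(2,1-\epsilon)\le 1-\frac{1+\epsilon}{2}h\!\left(\frac{\epsilon}{1+\epsilon}\right). \tag{35}$$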

Thus, for large , the loss for quantizing the output to one bit is small and the fraction of the mutual information that can be retained approaches as the mutual information increases. In particular, the natural MAP quantizer is never too bad, and retains a significant fraction of at least of the mutual information .

In the small $\beta$ regime, we arrive at qualitatively different behavior. We next show that the MAP quantizer can be highly sub-optimal when $\beta$ is small. To that end, consider again the distribution $X\sim\mathrm{Bernoulli}(1/2)$ and $P_{Y|X}$ given by (33), i.e., a BEC. It is easy to verify that in this case both inequalities in (31) are in fact equalities for all $\beta$. It follows that for a BEC with capacity $\beta$ and uniform input, we have that

$$I(X;f_{\mathrm{MAP}}(Y))=1-h\!\left(\frac{1-\beta}{2}\right)=\frac{\log e}{2}\beta^2+o(\beta^2), \tag{36}$$

$$I(X;f_Z(Y))=\frac{\beta}{2}h\!\left(\frac{1-\beta}{2-\beta}\right)+1-h\!\left(\frac{1-\beta}{2-\beta}\right)=\frac{\beta}{2}+o(\beta). \tag{37}$$

Thus, the asymmetric quantizer $f_Z$ retains roughly $1/2$ of the mutual information, whereas the fraction of mutual information retained by the symmetric MAP quantizer vanishes as $\beta$ goes to zero.
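The gap between the two quantizers on the BEC is easy to reproduce numerically. The sketch below represents each quantizer as a stochastic matrix from the channel output to the quantizer cell, so that the randomized tie-breaking of the MAP rule in (28) is captured by a row with entries $1/2$. The matrix names `F_MAP` and `F_Z` and the helper functions are illustrative assumptions, with $X\in\{0,1\}$ and the output order $(0,1,?)$.

```python
import math
import numpy as np

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(joint):
    joint = np.asarray(joint, dtype=float)
    product = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

def bec_joint(beta):
    """Joint pmf of uniform X in {0,1} and BEC output Y in {0, 1, ?}, as in (33)."""
    return 0.5 * np.array([[beta, 0.0, 1 - beta],
                           [0.0, beta, 1 - beta]])

# Row = channel output y in the order (0, 1, ?), column = quantizer cell.
F_MAP = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # erasure -> fair coin
F_Z = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])     # {1, ?} share a cell

def quantized_information(beta, kernel):
    """I(X; f(Y)) for the (possibly randomized) quantizer described by kernel."""
    return mutual_information(bec_joint(beta) @ kernel)

for beta in (0.5, 0.1, 0.01):
    i_map = quantized_information(beta, F_MAP)
    i_z = quantized_information(beta, F_Z)
    print(beta, i_map / beta, i_z / beta)   # fractions of I(X;Y) = beta retained
```

As $\beta$ decreases, the fraction retained by `F_Z` settles near $1/2$ while the fraction retained by `F_MAP` shrinks roughly linearly in $\beta$, in agreement with (36) and (37).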

One can argue that $f_Z$ is a MAP estimator just as $f_{\mathrm{MAP}}$, as the two quantizers attain the same error probability in guessing the value of $X$ based on $Y$, and dismiss our findings about the sub-optimality of $f_{\mathrm{MAP}}$ by attributing it to the randomness required by the MAP quantizer, as defined in (28), in the BEC setting. This is not the case, however. To see this, consider a channel with binary symmetric input and output alphabet $\mathcal{Y}=\{0,1\}\times\{g,b\}$, defined by

$$\Pr(Y=y|X=x)=\begin{cases}\beta & \text{if } y=(x,g)\\ (1-\beta)\left(\frac{1}{2}+\delta\right) & \text{if } y=(x,b)\\ (1-\beta)\left(\frac{1}{2}-\delta\right) & \text{if } y=(1-x,b),\end{cases}$$

for some