Few information-theoretic constructs have captured the attention of machine learning researchers and practitioners as much as the Information Bottleneck (IB) [1]. Given two correlated random variables $X$ and $Y$ with joint distribution $P_{X,Y}$, the goal of the IB is to determine a mapping $P_{T|X}$ that produces a new representation $T$ of $X$ such that (i) $Y - X - T$ forms a Markov chain and (ii) $I(Y;T)$ is maximized (information preserved) while $I(X;T)$ is minimized (compression). This tradeoff can be quantified by the Lagrangian functional $\mathcal{L}(P_{T|X}; \beta) \triangleq I(Y;T) - \beta I(X;T)$. The IB has proved useful in many machine learning problems, such as clustering [2, 3]. More recently, the IB framework has been used to analyze the training process of deep neural networks [4, 5].
In an inverse context, the Privacy Funnel (PF), introduced in [7], seeks to determine a mapping $P_{T|X}$ that minimizes $I(Y;T)$ (privacy leakage) while assuring $I(X;T) \geq R$ (revealing useful information). Analogously, the PF can be solved by considering the same Lagrangian functional $\mathcal{L}(P_{T|X}; \beta)$, now minimized rather than maximized. The privacy funnel (and its variants) has been shown to be useful in information-theoretic privacy [6, 7].
The choice of mutual information in both the IB and the PF frameworks does not seem to carry any specific “operational” significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [1, Eq. 28] that can be solved iteratively in the IB case. In fact, this property is unique to mutual information among many other information metrics [8].
Nevertheless, at least in theory, one can replace the mutual information with a broader family of measures based on $f$-divergences.¹

¹Given two probability distributions $P$ and $Q$ and a convex function $f : (0, \infty) \to \mathbb{R}$ with $f(1) = 0$, the $f$-divergence is $D_f(P \,\|\, Q) \triangleq \sum_x Q(x)\, f\!\left(\tfrac{P(x)}{Q(x)}\right)$.
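To make the footnote's definition concrete, $D_f$ can be evaluated numerically; the sketch below is our own illustration (the helper names are not from the paper) showing how KL divergence, total variation, and $\chi^2$-divergence arise from different choices of $f$.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x)/q(x)) for finite distributions p, q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * f(pi / qi)
        elif pi > 0:
            # q(x) = 0 < p(x): the divergence is infinite for f growing at infinity
            return float("inf")
    return total

kl   = lambda t: t * math.log2(t) if t > 0 else 0.0  # f(t) = t log t   -> KL divergence (bits)
tv   = lambda t: 0.5 * abs(t - 1)                    # f(t) = |t - 1|/2 -> total variation
chi2 = lambda t: (t - 1) ** 2                        # f(t) = (t - 1)^2 -> chi^2-divergence

p, q = [0.5, 0.5], [0.25, 0.75]
print(f_divergence(p, q, tv))    # 0.25
print(f_divergence(p, q, chi2))  # 0.333...
```

Note that $f(1) = 0$ for all three choices, so $D_f(P \| P) = 0$, as required.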
In this paper, we consider a wider class of bottleneck problems which includes the IB and the PF. We define the $f$-information between two random variables $X$ and $Y$ as $I_f(X;Y) \triangleq D_f(P_{X,Y} \,\|\, P_X P_Y)$, and introduce the following bottleneck functional
$$\sup_{P_{T|X}} \; I_g(Y;T) - \beta\, I_f(X;T), \tag{1}$$
and the funnel functional
$$\inf_{P_{T|X}} \; I_g(Y;T) - \beta\, I_f(X;T), \tag{2}$$
where $f$ and $g$ are convex functions. Different incarnations of $f$-information have already appeared, e.g., $\chi^2$-information in [6] for $f(t) = (t-1)^2$. These metrics possess an “operational” significance that is arguably more useful in statistical learning and privacy applications than mutual information. For instance, total variation and Hellinger distance play important roles in hypothesis testing, and $\chi^2$-divergence in estimation problems [6]. Formulations (1) and (2) for a broader class of divergences can be potentially useful to emerging applications of information theory in machine learning.
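The $f$-information just defined can be computed directly from a joint pmf. The following sketch (our illustration; names are not from the paper) evaluates $I_f(X;Y) = D_f(P_{X,Y} \| P_X P_Y)$ and confirms that it vanishes for independent variables, since then $P_{X,Y} = P_X P_Y$ and $f(1) = 0$.

```python
import math

def f_information(joint, f):
    """I_f(X;Y) = D_f(P_{X,Y} || P_X P_Y), joint pmf given as a matrix (rows: X, cols: Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, pij in enumerate(row):
            qij = px[i] * py[j]                       # product of marginals
            if qij > 0:
                total += qij * f(pij / qij)
    return total

chi2 = lambda t: (t - 1) ** 2                         # chi^2-information
mi   = lambda t: t * math.log2(t) if t > 0 else 0.0   # mutual information (bits)

indep = [[0.1, 0.3], [0.15, 0.45]]   # factorizes as outer([0.4, 0.6], [0.25, 0.75])
dep   = [[0.4, 0.1], [0.1, 0.4]]     # strongly correlated pair
```

Here `f_information(indep, chi2)` is zero, while `f_information(dep, chi2)` is strictly positive.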
Computing (1) and (2) reduces to characterizing the upper and lower boundaries, respectively, of the two-dimensional set
$$\big\{ \big(I_f(X;T),\, I_g(Y;T)\big) : Y - X - T \big\}. \tag{3}$$
It is worth mentioning that studying (3) is at the heart of strong data processing inequalities [9] as well as fundamental limits of privacy [7]. Witsenhausen and Wyner [12] investigated the lower boundary of the related set $\{(H(X|T), H(Y|T))\}$, where $H$ is the entropy function. In particular, they proposed an algorithm for analytically computing this boundary based on a dual formulation. When $X$ is binary and $P_{Y|X}$ is a binary symmetric channel (BSC), the lower boundary corresponds to the well-known Mrs. Gerber's Lemma. Related convex techniques have also been used to characterize several network information-theoretic regions [13, 14].
We generalize the approach in [12] to study the boundaries of (3) for a broader class of $f$-information metrics, characterizing properties of new bottleneck problems of the form (1) and (2). In particular, we investigate the estimation-theoretic variants of the information bottleneck and privacy funnel using $\chi^2$-divergence, which we call the Estimation Bottleneck and the Estimation Privacy Funnel, respectively. In the binary symmetric case, the upper boundary corresponds to a counterpart of Mrs. Gerber's Lemma, which we call Mr. Gerber's Lemma. We further extend these lemmas to Arimoto conditional entropy [15].
II Geometric Properties
Let $X$ and $Y$ be two random variables having joint distribution $P_{X,Y}$ with supports $\mathcal{X} = \{1, \dots, m\}$ and $\mathcal{Y} = \{1, \dots, n\}$, respectively. We denote by $\mathbf{p} \in \Delta_m$ the marginal probability vector of $X$ with entries $p_i \triangleq \Pr(X = i)$, where $\Delta_m \triangleq \{\mathbf{x} \in \mathbb{R}^m : x_i \geq 0, \sum_{i=1}^m x_i = 1\}$ is the $(m-1)$-dimensional simplex. We denote by $W$ the $m \times n$ stochastic matrix whose entries are given by the channel transformation, i.e., $W_{ij} \triangleq P_{Y|X}(j \mid i)$; thus, viewing $\mathbf{p}$ as a row vector, the marginal probability vector of $Y$ is $\mathbf{p}W$. For a discrete random variable $T$ with support $\mathcal{T}$, let $\mathbf{p}_{X|T=t}$ denote the vector with entries $P_{X|T}(i \mid t)$, and let the marginal of $T$ be $P_T$. We denote by $H$ the entropy function, i.e., $H(\mathbf{x}) \triangleq -\sum_i x_i \log x_i$ with the convention $0 \log 0 \triangleq 0$, so that $H(\mathbf{p}) = H(X)$. Finally, we denote the convex hull of a set $\mathcal{A}$ by $\mathsf{conv}(\mathcal{A})$, and the boundary of $\mathcal{A}$ by $\partial\mathcal{A}$.
II-B Geometry of Bottleneck Problems
Let $\phi : \Delta_m \to \mathbb{R}$ and $\psi : \Delta_n \to \mathbb{R}$ be continuous and bounded mappings over simplices of dimension $m$ and $n$, respectively. We study the set (3) by first considering a more general context, and then specialize it to different information metrics in the following sections. We consider the tuple
$$\big(\phi(\mathbf{x}),\, \psi(\mathbf{x}W)\big), \tag{4}$$
where $\mathbf{x} \in \Delta_m$ and $\mathbf{x}W \in \Delta_n$. Recall that $Y$, $X$, and $T$ form the Markov chain $Y - X - T$, so that $\mathbf{p}_{Y|T=t} = \mathbf{p}_{X|T=t} W$. Therefore, we are interested in the following set for a fixed channel $W$:
$$\mathcal{C}_W(\mathbf{p}) \triangleq \Big\{ \big(\mathbb{E}_T[\phi(\mathbf{p}_{X|T})],\, \mathbb{E}_T[\psi(\mathbf{p}_{X|T} W)]\big) : \mathbb{E}_T[\mathbf{p}_{X|T}] = \mathbf{p} \Big\}. \tag{5}$$
Moreover, we define $\mathcal{S}_W \triangleq \{(\phi(\mathbf{x}), \psi(\mathbf{x}W)) : \mathbf{x} \in \Delta_m\}$. The next lemma is a direct generalization of [12, Lemma 2.1].

Lemma 1. $\mathcal{C}_W(\mathbf{p})$ is convex and compact with $\mathcal{C}_W(\mathbf{p}) \subseteq \mathsf{conv}(\mathcal{S}_W)$. In addition, all points in $\mathcal{C}_W(\mathbf{p})$ can be written as a convex combination of at most $m+1$ points of $\mathcal{S}_W$; in other words, it suffices to consider $|\mathcal{T}| \leq m+1$.
Let the upper and lower boundaries of $\mathcal{C}_W(\mathbf{p})$ be denoted by $\overline{\mathcal{C}}$ and $\underline{\mathcal{C}}$, respectively, i.e., we have
$$\underline{\mathcal{C}}(u) \triangleq \inf \{ v : (u, v) \in \mathcal{C}_W(\mathbf{p}) \}, \tag{6}$$
$$\overline{\mathcal{C}}(u) \triangleq \sup \{ v : (u, v) \in \mathcal{C}_W(\mathbf{p}) \}. \tag{7}$$
Under appropriate conditions on $\phi$ and $\psi$ (depending on the choice of information metric), $\mathcal{C}_W(\mathbf{p})$ is non-empty, and hence its compactness allows one to replace the infimum and supremum in (6) and (7) with a minimum and a maximum, respectively. Moreover, it follows from the convexity of $\mathcal{C}_W(\mathbf{p})$ that $\underline{\mathcal{C}}$ is convex and $\overline{\mathcal{C}}$ is concave.
II-C Dual Formulations
Since $\mathcal{C}_W(\mathbf{p})$ is a convex set, its upper and lower boundaries are equivalently represented by its supporting hyperplanes. We use the dual approach introduced in [12] to evaluate $\underline{\mathcal{C}}$ and $\overline{\mathcal{C}}$. For a given $\lambda \geq 0$, define the conjugate function
$$\underline{\mathcal{C}}^*(\lambda) \triangleq \min_{(u,v) \in \mathcal{C}_W(\mathbf{p})} \; v - \lambda u. \tag{8}$$
Note that the graph of $\underline{\mathcal{C}}$ is the lower boundary of $\mathcal{C}_W(\mathbf{p})$. It then follows that the point that achieves the minimum in (8) lies on the lower boundary of $\mathcal{C}_W(\mathbf{p})$ with supporting line of slope $\lambda$, and hence corresponds to a point on the graph of $\underline{\mathcal{C}}$.
We now turn our attention to evaluating $\underline{\mathcal{C}}^*(\lambda)$. Let
$$F_\lambda(\mathbf{x}) \triangleq \psi(\mathbf{x}W) - \lambda\, \phi(\mathbf{x}), \quad \mathbf{x} \in \Delta_m.$$
We observe that $\mathcal{G}_\lambda \triangleq \{(\mathbf{x}, F_\lambda(\mathbf{x})) : \mathbf{x} \in \Delta_m\}$ is the graph of the function $F_\lambda$ on $\Delta_m$. Since taking expectations over $T$ preserves convex combinations, we have
$$\mathsf{conv}(\mathcal{G}_\lambda) = \big\{ \big(\mathbb{E}_T[\mathbf{p}_{X|T}],\, \mathbb{E}_T[F_\lambda(\mathbf{p}_{X|T})]\big) : P_T,\, P_{X|T} \big\}$$
as the convex hull of the graph of $F_\lambda$. Thus, $\underline{\mathcal{C}}^*(\lambda)$ is given by the lower convex envelope of $F_\lambda$ evaluated at $\mathbf{p}$. The same goes, mutatis mutandis, for $\overline{\mathcal{C}}$: its conjugate function $\overline{\mathcal{C}}^*$, defined as
$$\overline{\mathcal{C}}^*(\lambda) \triangleq \max_{(u,v) \in \mathcal{C}_W(\mathbf{p})} \; v - \lambda u,$$
coincides with the upper concave envelope of $F_\lambda$ evaluated at $\mathbf{p}$.
These properties lead to a procedure for characterizing $\underline{\mathcal{C}}$ and $\overline{\mathcal{C}}$. We illustrate it in detail for $\underline{\mathcal{C}}$; the case of $\overline{\mathcal{C}}$ follows by using the concave envelope instead. For $\mathbf{p} \in \Delta_m$, there are two scenarios:
Trivial case: If the point $(\mathbf{p}, F_\lambda(\mathbf{p}))$ lies both on the graph of $F_\lambda$ and on its lower convex envelope, then $\underline{\mathcal{C}}^*(\lambda) = F_\lambda(\mathbf{p})$. In this case, the minimum in (8) simply reduces to evaluating $F_\lambda$ at $\mathbf{p}$, and an optimal $T$ satisfies $\mathbf{p}_{X|T=t} = \mathbf{p}$ for some $t$, i.e., $T$ is independent of $X$.
Non-trivial case: If $F_\lambda(\mathbf{p})$ lies strictly above the lower convex envelope, then the envelope point at $\mathbf{p}$ is the convex combination of points $(\mathbf{x}_i, F_\lambda(\mathbf{x}_i))$ with weights $a_i \geq 0$, where $\sum_i a_i \mathbf{x}_i = \mathbf{p}$ and $\sum_i a_i = 1$. Then $\underline{\mathcal{C}}^*(\lambda)$ is given by $\sum_i a_i F_\lambda(\mathbf{x}_i)$. Moreover, an optimal $T$ is attained by setting $P_T(i) = a_i$ and $\mathbf{p}_{X|T=i} = \mathbf{x}_i$.
Hence, the points on the graph of $\underline{\mathcal{C}}$ can be obtained by considering only the points at which $F_\lambda$ differs from its convex envelope, since those are exactly the points not induced by the trivial case. Algorithm 1 summarizes our previous discussion.
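The trivial/non-trivial dichotomy can be visualized in one dimension. The sketch below is our own illustration (not the paper's Algorithm 1): it samples a function on a grid, builds its lower convex envelope with a monotone-chain lower hull, and detects a "non-trivial" point where the function sits strictly above its envelope.

```python
def lower_convex_envelope(xs, ys):
    """Return the lower-hull vertices of the points (xs[i], ys[i]); xs must be sorted."""
    hull = []
    for pt in zip(xs, ys):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop hull[-1] if it lies on or above the chord hull[-2] -> pt
            if (y2 - y1) * (pt[0] - x1) >= (pt[1] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(pt)
    return hull

def envelope_value(hull, x):
    """Piecewise-linear interpolation of the envelope at x."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            w = 0.0 if x2 == x1 else (x - x1) / (x2 - x1)
            return (1 - w) * y1 + w * y2
    raise ValueError("x outside the sampled range")

xs = [i / 100 for i in range(101)]
F = lambda x: (x - 0.2) ** 2 * (x - 0.8) ** 2   # double well: envelope is flat between the wells
hull = lower_convex_envelope(xs, [F(x) for x in xs])
gap = F(0.5) - envelope_value(hull, 0.5)        # > 0: x = 0.5 is a "non-trivial" point
```

At $x = 0.5$ the function exceeds its envelope (the chord between the two wells), so an optimal $T$ there mixes the two tangency points; near $x = 0.1$ the function is convex and meets its envelope, the trivial case.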
II-D Matched Channels
The geometry of bottleneck problems leads to intriguing properties of the optimal mappings. Our previous discussion reveals that the points $\mathbf{x}_1, \dots, \mathbf{x}_k$ used to form the convex envelope of $F_\lambda$ are special: they determine a channel $P_{X|T}$ such that, for any distribution $\mathbf{p}$ in their convex hull, the resulting pair lies on the boundary of $\mathcal{C}_W(\mathbf{p})$ with supporting line of slope $\lambda$. In this case, we say that the points $\mathbf{x}_1, \dots, \mathbf{x}_k$ form a matched channel for $W$ and $\lambda$.
Definition 1 (Matched Channel).
For a fixed channel $W$ and $\lambda \geq 0$, we say that a channel $P_{X|T}$ given by points $\mathbf{x}_1, \dots, \mathbf{x}_k \in \Delta_m$ is matched to $W$ if there exists $c \in \mathbb{R}$ such that each $(\mathbf{x}_i, F_\lambda(\mathbf{x}_i))$ lies on the lower convex envelope of $F_\lambda$ and
$$\psi(\mathbf{x}_i W) - \lambda\, \phi(\mathbf{x}_i) = c, \quad i \in \{1, \dots, k\}.$$
Using an elementary result in convex geometry (see Lemma 2 in the Appendix), we immediately have the following theorem.
Theorem 1 (Matched Channel).
Let $P_{X|T}$, given by points $\mathbf{x}_1, \dots, \mathbf{x}_k$, be a matched channel for $W$ and $\lambda$. Then for any $\mathbf{p} \in \mathsf{conv}(\{\mathbf{x}_1, \dots, \mathbf{x}_k\})$, the pair attained by $P_{X|T}$ lies on the graph of $\underline{\mathcal{C}}$, i.e., on the lower boundary of $\mathcal{C}_W(\mathbf{p})$.
See the Appendix. ∎
From Theorem 1, we know that for any distribution $\mathbf{p}$, matched channels are entirely determined by the points on the graph of $F_\lambda$ whose convex combinations yield the convex envelope of $F_\lambda$ at $\mathbf{p}$. This implies that, as long as the envelope at $\mathbf{p}$ is attained by the same points, a small perturbation around $\mathbf{p}$ does not change the matched channel but simply changes the weights $a_i$. Thus, optimal mappings are surprisingly robust to small errors in the estimation of $\mathbf{p}$, which could potentially give pragmatic advantages when applying bottleneck problems to real data. However, if $\mathbf{p}$ changes substantially, we can recover the matched channels by first solving for the points that form the new convex envelope at $\mathbf{p}$, and then applying Bayes' rule to obtain $P_{T|X}$.
III Generalizing Bottleneck Problems
In this section, we demonstrate how the tools developed in Section II can be applied to new bottleneck problems of the form (1) and (2). We then revisit the IB and PF, and also study their estimation-theoretic variants.
Consider the Markov chain $Y - X - T$. Our goal is to describe the achievable pairs of $f$-divergences
$$\big(I_f(X;T),\, I_g(Y;T)\big).$$
Observe that for a given $P_{T|X}$ we have
$$I_f(X;T) = \mathbb{E}_T\big[D_f(P_{X|T} \,\|\, P_X)\big],$$
and hence $I_f(X;T)$ can be expressed as $\mathbb{E}_T[\phi(\mathbf{p}_{X|T})]$ for some function $\phi$ on $\Delta_m$. Similarly, since $\mathbf{p}_{Y|T} = \mathbf{p}_{X|T} W$, we have
$$I_g(Y;T) = \mathbb{E}_T\big[D_g(P_{Y|T} \,\|\, P_Y)\big] = \mathbb{E}_T\big[\psi(\mathbf{p}_{X|T})\big].$$
Hence the corresponding set of achievable pairs for varying $P_{T|X}$ has the same form as (5). Letting
$$\phi(\mathbf{x}) \triangleq D_f(\mathbf{x} \,\|\, \mathbf{p}) \quad \text{and} \quad \psi(\mathbf{x}) \triangleq D_g(\mathbf{x}W \,\|\, \mathbf{p}W),$$
we can thus apply Algorithm 1 to characterize $\underline{\mathcal{C}}$ and $\overline{\mathcal{C}}$.
Next, we show how the usual IB and PF fit in our formulation, and study their estimation-theoretic counterparts. We note, however, that the previous analysis does not require $f$ and $g$ to be equal.
III-A Information Bottleneck
Assuming $f(t) = g(t) = t \log t$ in the bottleneck functional (1), we have $I_f(X;T) = I(X;T)$ and $I_g(Y;T) = I(Y;T)$. Thus, the set of points on the upper boundary $\overline{\mathcal{C}}$ corresponds to the set of solutions of the IB problem.
It is worth mentioning that the same geometric approach can also be applied directly to entropy functions, which also leads to the IB formulation. In fact, this is exactly the setting studied in [12]. Specifically, choosing $\phi(\mathbf{x}) = H(\mathbf{x})$ and $\psi(\mathbf{x}) = H(\mathbf{x}W)$, the set of points
$$\big\{ \big(H(X|T),\, H(Y|T)\big) : Y - X - T \big\}$$
also corresponds to the set of solutions of the IB. The IB is closely related to strong data processing inequalities. See [11, Proposition 2] for more details in the case of the BSC (see also Fig. 1 (right)).
III-B Privacy Funnel
Assuming $f(t) = g(t) = t \log t$ in the funnel functional (2), the set of points on the lower boundary $\underline{\mathcal{C}}$ corresponds to the set of solutions of the privacy funnel, introduced in [7]. Equivalently, using the entropy function as in Section III-A, the set of points
$$\big\{ \big(H(X|T),\, H(Y|T)\big) : Y - X - T \big\}$$
on the corresponding upper boundary also corresponds to the set of solutions, which follows from the fact that the boundary is monotonically non-decreasing; see Fig. 1 (right).
III-C Estimation Bottleneck
One can move away from the usual IB and define new bottleneck problems by considering different functions $f$ and $g$. For instance, if $f(t) = (t-1)^2$, then the corresponding $f$-information, called $\chi^2$-information, is defined as
$$I_{\chi^2}(X;Y) \triangleq \chi^2(P_{X,Y} \,\|\, P_X P_Y).$$
We simplify the notation in (1) for $\chi^2$-information as
$$\sup_{P_{T|X}} \; I_{\chi^2}(Y;T) - \beta\, I_{\chi^2}(X;T), \tag{24}$$
which we call the Estimation Bottleneck.
The reasons to specifically study $\chi^2$-information are two-fold. First, it has been shown in [6] that $I_{\chi^2}(X;Y)$ equals the sum of the principal inertia components (PICs) of $X$ and $Y$. Moreover, if the PICs between $X$ and $Y$ are large, then the minimum mean square error (MMSE) of estimating $Y$ given $X$ will be small [6, Theorem 1], thereby enabling reliable estimation. Hence, if the goal of an estimation problem is to minimize the MMSE, we can equivalently consider maximizing $I_{\chi^2}$. Second, following the spirit of the IB, we also add a constraint on $I_{\chi^2}(X;T)$ for the new representation $T$, as the $\chi^2$-divergence serves as a sharp bound for any $f$-divergence.
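The PIC connection is easy to verify numerically: the $\chi^2$-information equals $\|Q\|_F^2 - 1$, where $Q_{ij} = P_{i,j}/\sqrt{p_i q_j}$ and the squared singular values of $Q$ other than the trivial one (equal to 1) are the PICs. A small check of this identity (our own sketch, not the paper's code):

```python
import numpy as np

def chi2_information(P):
    """I_{chi^2}(X;Y) computed directly as chi^2(P_{X,Y} || P_X P_Y)."""
    indep = np.outer(P.sum(axis=1), P.sum(axis=0))
    return float(np.sum((P - indep) ** 2 / indep))

def chi2_via_pics(P):
    """Same quantity from the PICs: squared singular values of
    Q_ij = P_ij / sqrt(p_i q_j), dropping the trivial singular value 1."""
    p, q = P.sum(axis=1), P.sum(axis=0)
    Q = P / np.sqrt(np.outer(p, q))
    s = np.linalg.svd(Q, compute_uv=False)
    return float(np.sum(s ** 2) - 1.0)

rng = np.random.default_rng(0)
P = rng.random((3, 4))
P /= P.sum()   # random joint pmf with full support
```

Both routines agree to machine precision, and both return zero on any product distribution.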
III-D Estimation Privacy Funnel
Motivated by the connection between $\chi^2$-information and estimation problems mentioned in Section III-C, we propose the Estimation Privacy Funnel
$$\inf_{P_{T|X}} \; I_{\chi^2}(Y;T) - \beta\, I_{\chi^2}(X;T), \tag{25}$$
where the privacy leakage is measured in terms of MMSE. The practical significance of (25) is justified as follows. Suppose $Y$ represents private data (e.g., political preferences) and $X$ (e.g., movie ratings) is correlated with $Y$. The main objective, formulated by (25), is to construct a privacy-assuring mapping $P_{T|X}$ such that the information disclosed about $Y$ by $T$ is minimized, thus minimizing privacy leakage, while preserving the estimation efficiency that $T$ provides about $X$. Similarly, the solutions of the estimation privacy funnel (25) correspond to the set of points on the lower boundary $\underline{\mathcal{C}}$; see Fig. 1 (left).
IV Mrs. and Mr. Gerber's Lemmas
The study of the upper and lower boundaries of achievable mutual information pairs is essential in multi-user information theory [13]. In the binary symmetric case, we not only rephrase Mrs. Gerber's Lemma, but also derive its counterpart for the PF. Furthermore, we discuss analogous results to Mrs. and Mr. Gerber's Lemmas for Arimoto conditional entropy.
IV-A Mr. Gerber's Lemma
We apply the duality argument developed in Section II-C for $\underline{\mathcal{C}}$ and $\overline{\mathcal{C}}$ to characterize the IB and the PF in the binary symmetric case. In particular, let $P_{Y|X}$ be the BSC with crossover probability $\delta \in [0, 1/2]$ and $X \sim \mathsf{Bernoulli}(p)$. For $a, b \in [0, 1]$, denote $a * b \triangleq a(1-b) + b(1-a)$. We denote by $h$ the binary entropy function $h(x) \triangleq -x \log x - (1-x) \log(1-x)$.
Let $u \triangleq H(X|T)$. It was shown in [12] that
$$H(Y|T) \geq h\big(\delta * h^{-1}(u)\big), \tag{26}$$
where $h^{-1} : [0, 1] \to [0, 1/2]$ is the inverse function of $h$ restricted to $[0, 1/2]$. Eq. (26) is well known as Mrs. Gerber's Lemma (MGL). In this case, the matched channel is also a BSC. Using the approach outlined in Section II, we derive a counterpart result for the upper boundary $\overline{\mathcal{C}}$, and call it Mr. Gerber's Lemma.
Theorem 2 (Mr. Gerber’s Lemma).
For $p \leq 1/2$ and $u = H(X|T) \in [0, h(p)]$, we have
$$H(Y|T) \leq \begin{cases} h(\delta) + u\,\big(1 - h(\delta)\big), & 0 \leq u \leq 2p, \\ \big(1 - \tfrac{p}{\alpha}\big)\, h(\delta) + \tfrac{p}{\alpha}\, h(\alpha * \delta), & 2p < u \leq h(p), \end{cases} \tag{27}$$
where $\alpha \in [p, 1/2)$ is the unique solution of $\tfrac{p}{\alpha}\, h(\alpha) = u$.
See the Appendix. ∎
In summary, in the binary symmetric case, the set of solutions for the IB follows from Mrs. Gerber's Lemma (26) and is given by the lower boundary of the achievable region, and the set of solutions for the PF follows from Mr. Gerber's Lemma (27) and is given by the upper boundary. The upper and lower boundaries of the achievable pairs $(H(X|T), H(Y|T))$ are therefore characterized by Mr. and Mrs. Gerber's Lemmas, respectively.
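Mrs. Gerber's Lemma (26) can be checked numerically against arbitrary finite $T$. The sketch below (our own illustration, with hypothetical helper names) draws random conditional distributions $P(X = 1 \mid T = t)$, forms $H(X|T)$ and $H(Y|T)$ for $Y$ the output of a BSC, and records the slack of the bound.

```python
import math
import random

def h(x):
    """Binary entropy in bits."""
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def h_inv(y):
    """Inverse of h on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = (lo + hi) / 2
        if h(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

star = lambda a, b: a * (1 - b) + b * (1 - a)     # binary convolution a * b

random.seed(1)
delta = 0.11
slacks = []
for _ in range(500):
    k = random.randint(1, 4)                      # support size of T
    w = [random.random() for _ in range(k)]
    s = sum(w); w = [v / s for v in w]            # P_T
    xs = [random.random() for _ in range(k)]      # P(X=1 | T=t)
    HX = sum(wt * h(x) for wt, x in zip(w, xs))   # H(X|T)
    HY = sum(wt * h(star(x, delta)) for wt, x in zip(w, xs))  # H(Y|T), Y = BSC(delta)(X)
    slacks.append(HY - h(star(h_inv(HX), delta))) # MGL slack, nonnegative by (26)
```

Every sampled slack is nonnegative (up to bisection precision), as (26) requires; the slack vanishes exactly when $P_{X|T}$ is a matched BSC.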
IV-B Achievable Pairs of Arimoto Conditional Entropy
When $\alpha \in (0,1) \cup (1,\infty)$, we define the Arimoto conditional entropy
$$H_\alpha(X|T) \triangleq \frac{\alpha}{1-\alpha} \log \sum_t P_T(t)\, \big\|\mathbf{p}_{X|T=t}\big\|_\alpha,$$
where $\|\mathbf{x}\|_\alpha \triangleq \big(\sum_i x_i^\alpha\big)^{1/\alpha}$. Hence, choosing $\phi(\mathbf{x}) = \|\mathbf{x}\|_\alpha$ and $\psi(\mathbf{x}) = \|\mathbf{x}W\|_\alpha$, the set of achievable Arimoto conditional entropy pairs can be obtained from $\mathcal{C}_W(\mathbf{p})$ by the nonlinear mapping
$$(u, v) \mapsto \Big( \frac{\alpha}{1-\alpha} \log u,\; \frac{\alpha}{1-\alpha} \log v \Big). \tag{28}$$
With (28) at hand, the Arimoto mutual information of order $\alpha$ can be defined as $I_\alpha^{\mathsf{A}}(X;Y) \triangleq H_\alpha(X) - H_\alpha(X|Y)$, where $H_\alpha(X)$ is the Rényi entropy of order $\alpha$. Arimoto conditional entropy has proven useful in approximating the minimum error probability of Bayesian $M$-ary hypothesis testing [15].
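The definition above can be sanity-checked numerically: as $\alpha \to 1$, $H_\alpha(X|Y)$ approaches the Shannon conditional entropy, and it is non-increasing in $\alpha$. A minimal sketch (ours, working in bits):

```python
import math

def arimoto_cond_entropy(joint, alpha):
    """H_alpha(X|Y) = alpha/(1-alpha) * log2( sum_y P_Y(y) * ||P_{X|Y=y}||_alpha )."""
    py = [sum(col) for col in zip(*joint)]
    acc = 0.0
    for j, pyj in enumerate(py):
        if pyj > 0:
            norm = sum((row[j] / pyj) ** alpha for row in joint) ** (1.0 / alpha)
            acc += pyj * norm
    return alpha / (1.0 - alpha) * math.log2(acc)

def shannon_cond_entropy(joint):
    """H(X|Y) in bits, for comparison in the limit alpha -> 1."""
    py = [sum(col) for col in zip(*joint)]
    acc = 0.0
    for j, pyj in enumerate(py):
        for row in joint:
            if row[j] > 0:
                acc -= row[j] * math.log2(row[j] / pyj)
    return acc

joint = [[0.20, 0.25],
         [0.10, 0.45]]   # rows: X, cols: Y
```

For `alpha` close to 1 the two quantities agree closely, while larger `alpha` gives a smaller conditional entropy.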
IV-C Arimoto's Mr. and Mrs. Gerber's Lemmas
Due to the importance of Arimoto conditional entropy, we study the extensions of Mr. and Mrs. Gerber's Lemmas to Arimoto conditional entropy, naming them Arimoto's Mr. and Mrs. Gerber's Lemmas, respectively.
Let $\phi(\mathbf{x}) \triangleq \|\mathbf{x}\|_\alpha$, and also $\psi(\mathbf{x}) \triangleq \|\mathbf{x}W\|_\alpha$. Since for $\alpha > 1$ the mapping $u \mapsto \frac{\alpha}{1-\alpha} \log u$ is strictly decreasing, bounds on the Arimoto conditional entropies follow from bounds on $\mathbb{E}_T[\phi(\mathbf{p}_{X|T})]$ and $\mathbb{E}_T[\psi(\mathbf{p}_{X|T})]$ with the roles of the upper and lower boundaries interchanged. Define $\underline{v}(u)$ and $\overline{v}(u)$ respectively as the minimum and maximum of $\mathbb{E}_T[\psi(\mathbf{p}_{X|T})]$ when $\mathbb{E}_T[\phi(\mathbf{p}_{X|T})] = u$ in the binary symmetric case. For simplicity, denote $N_\alpha(x) \triangleq \|(x, 1-x)\|_\alpha$ for $x \in [0, 1]$. In this case, following Section II-C, we have $F_\lambda(x) = N_\alpha(\delta * x) - \lambda N_\alpha(x)$, which leads to the following theorem.
Theorem 3 (Arimoto’s Mrs. Gerber’s Lemma).
For $\alpha > 1$ and $p \leq 1/2$, let $u \triangleq \exp\big(\tfrac{1-\alpha}{\alpha} H_\alpha(X|T)\big)$, and let $x_1 \in [p, 1/2]$ be such that $\big(1 - \tfrac{p}{x_1}\big) + \tfrac{p}{x_1} N_\alpha(x_1) = u$; then we have
$$H_\alpha(Y|T) \geq \frac{\alpha}{1-\alpha} \log \Big[ \big(1 - \tfrac{p}{x_1}\big)\, N_\alpha(\delta) + \tfrac{p}{x_1}\, N_\alpha(\delta * x_1) \Big].$$
In particular, there exists a matched channel attaining the bound.
See the Appendix. ∎
Analogous to this theorem, we also obtain the following generalization of Mr. Gerber’s Lemma.
Theorem 4 (Arimoto’s Mr. Gerber’s Lemma).
For $\alpha > 1$, let $u \triangleq \exp\big(\tfrac{1-\alpha}{\alpha} H_\alpha(X|T)\big)$ and $x_0 \triangleq N_\alpha^{-1}(u)$; for $p \in [x_0, 1 - x_0]$, we have
$$H_\alpha(Y|T) \leq \frac{\alpha}{1-\alpha} \log N_\alpha(\delta * x_0),$$
where $N_\alpha^{-1}$ denotes the inverse of $N_\alpha$ restricted to $[0, 1/2]$. In particular, there exists a matched BSC attaining the bound.
Consequently, for $\alpha > 1$, Arimoto's Mrs. and Mr. Gerber's Lemmas jointly characterize the achievable sets of Arimoto conditional entropy pairs; see Fig. 1 (right).
V Final Remarks
In this paper, we study the geometric structure behind bottleneck problems, and generalize the IB and the PF to a broader class of $f$-divergences. In particular, we consider estimation-theoretic variants of the IB and PF. Moreover, we show how bottleneck problems can be used to obtain the counterpart of Mrs. Gerber's Lemma (called Mr. Gerber's Lemma), and derive versions of Mrs. and Mr. Gerber's Lemmas for Arimoto conditional entropy. These results can potentially be useful for new applications of information theory in machine learning.
A Lemma 2

Lemma 2 ([18]).
Let $\mathcal{A}$ be a connected and non-empty subset of $\mathbb{R}^d$, and let $\mathbf{z} \in \mathsf{conv}(\mathcal{A})$. Assume there exists a hyperplane $\mathcal{H}$ supporting $\mathsf{conv}(\mathcal{A})$ at $\mathbf{z}$. Then there exist points $\mathbf{z}_1, \dots, \mathbf{z}_d \in \mathcal{A} \cap \mathcal{H}$ and weights $a_1, \dots, a_d \geq 0$ with $\sum_{i=1}^d a_i = 1$ such that $\mathbf{z} = \sum_{i=1}^d a_i \mathbf{z}_i$. Furthermore, any convex combination of $\mathbf{z}_1, \dots, \mathbf{z}_d$ lies on $\partial\, \mathsf{conv}(\mathcal{A})$.
B Proof of Theorem 1 (Matched Channel)
Recall that $\underline{\mathcal{C}}$ is determined parametrically in $\lambda$ by the points at which $F_\lambda$ does not match its convex envelope. Thus, the columns of the channel transformation matrix of a matched channel correspond to extreme points $\mathbf{x}_1, \dots, \mathbf{x}_k$ at which $F_\lambda$ matches its envelope. Moreover, there exist weights for which the convex combination of these points corresponds to a point of $\mathcal{C}_W(\mathbf{p})$ with supporting line of slope $\lambda$. Using Lemma 2, any non-trivial convex combination of $\mathbf{x}_1, \dots, \mathbf{x}_k$ results in a point which is on the convex envelope of $F_\lambda$ and determines a corresponding point on the graph of $\underline{\mathcal{C}}$.
C Proof of Theorem 2 (Mr. Gerber's Lemma)
Take $\phi(x) = h(x)$; we have $F_\lambda(x) = h(\delta * x) - \lambda h(x)$ for $x \in [0, 1]$. For $\lambda \geq (1 - 2\delta)^2$, $F_\lambda$ is convex in $x$, and its upper concave envelope is the linear combination of $F_\lambda(0)$ and $F_\lambda(1)$. For $\lambda < (1 - 2\delta)^2$, $F_\lambda$ is concave in a region centered at $x = 1/2$, where it reaches a local maximum. Assuming $p \leq 1/2$, if $\lambda < (1 - 2\delta)^2$, then there exists $x_0 \in (0, 1/2]$ such that for $p \in [0, x_0]$, the envelope at $p$ is a convex combination of $F_\lambda(0)$ and $F_\lambda(x_0)$. Finally, if $p \in [x_0, 1/2]$, then any point on the upper concave envelope of $F_\lambda$ also lies on its graph.
Hence, assuming $p \leq 1/2$, the distribution that achieves $\overline{\mathcal{C}}(u)$ falls into one of two cases:
1) $X|T$ taking values in $\{0, 1, 1/2\}$ with weights $1 - p - \tfrac{u}{2}$, $p - \tfrac{u}{2}$, and $u$, respectively, yielding $H(Y|T) = h(\delta) + u\,(1 - h(\delta))$;
2) $X|T$ taking values in $\{0, \alpha\}$ with weights $1 - \tfrac{p}{\alpha}$ and $\tfrac{p}{\alpha}$, where $\alpha$ satisfies $\tfrac{p}{\alpha}\, h(\alpha) = u$, yielding $H(Y|T) = \big(1 - \tfrac{p}{\alpha}\big) h(\delta) + \tfrac{p}{\alpha}\, h(\alpha * \delta)$.
Rearranging (1) and (2), the result in (27) follows.
D Proof of Theorem 3
Note that $N_\alpha(x)$ is convex in $x$ for $\alpha > 1$, and so is $N_\alpha(\delta * x)$. For suitable $\lambda$, $F_\lambda''$ is negative on an interval symmetric about $x = 1/2$, and positive elsewhere, so that $F_\lambda$ attains a local maximum at $x = 1/2$. By symmetry, the lower convex envelope of the graph of $F_\lambda$ is obtained by replacing $F_\lambda$ with the chord joining $(x_0, F_\lambda(x_0))$ and $(1 - x_0, F_\lambda(1 - x_0))$ for $x \in [x_0, 1 - x_0]$. Therefore, for a given $p$, if $p \in [x_0, 1 - x_0]$, then the envelope point at $p$ is a convex combination of $F_\lambda(x_0)$ and $F_\lambda(1 - x_0)$. Hence, we have $\mathbb{E}_T[\phi] = N_\alpha(x_0)$ and $\mathbb{E}_T[\psi] = N_\alpha(\delta * x_0)$ for the achieving distribution, and the theorem follows.
[1] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc. of IEEE Allerton, 2000.
[2] N. Tishby and N. Slonim, “Data clustering by Markovian relaxation and the information bottleneck method,” in Proc. of NIPS, 2001.
[3] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method,” in Proc. of ACM SIGIR, 2000.
[4] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in Proc. of IEEE ITW, 2015.
[5] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
[6] F. P. Calmon, A. Makhdoumi, M. Médard, M. Varia, M. Christiansen, and K. R. Duffy, “Principal inertia components and applications,” IEEE Trans. Inf. Theory, vol. 63, no. 9, pp. 5011–5038, 2017.
[7] F. P. Calmon, A. Makhdoumi, and M. Médard, “Fundamental limits of perfect privacy,” in Proc. of IEEE ISIT, 2015.
[8] P. Harremoës and N. Tishby, “The information bottleneck revisited or how to choose a good distortion measure,” in Proc. of IEEE ISIT, 2007.
[9] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan. 2016.
[10] ——, “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC), vol. 6, pp. 2012–2016, 2014.
[11] F. P. Calmon, Y. Polyanskiy, and Y. Wu, “Strong data processing inequalities in power-constrained Gaussian channels,” in Proc. of IEEE ISIT, 2015.
[12] H. Witsenhausen and A. Wyner, “A conditional entropy bound for a pair of discrete random variables,” IEEE Trans. Inf. Theory, vol. 21, no. 5, pp. 493–501, 1975.
[13] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[14] C. Nair, “Upper concave envelopes and auxiliary random variables,” Int. J. Adv. Eng. Sci. Appl. Math., vol. 5, no. 1, pp. 12–20, 2013.
[15] I. Sason and S. Verdú, “Arimoto–Rényi conditional entropy and Bayesian $M$-ary hypothesis testing,” arXiv preprint arXiv:1701.01974, 2017.
[16] H. Hsu, S. Asoodeh, S. Salamatian, and F. P. Calmon, “Generalizing bottleneck problems - extended version.” [Online]. Available: https://github.com/HsiangHsu/ISIT-18-Extended-Version
[17] A. Makur and L. Zheng, “Bounds between contraction coefficients,” in Proc. of IEEE Allerton, 2015.
[18] H. G. Eggleston, Convexity. Wiley Online Library, 1966.