I. Introduction
Few information-theoretic constructs have captured the attention of machine learning researchers and practitioners as much as the Information Bottleneck (IB) [1]. Given two correlated random variables $X$ and $Y$ with joint distribution $P_{X,Y}$, the goal of the IB is to determine a mapping $P_{T|Y}$ that produces a new representation $T$ of $Y$ such that (i) $X - Y - T$ forms a Markov chain and (ii) $I(X;T)$ is maximized (information preserved) while $I(Y;T)$ is minimized (compression). This tradeoff can be quantified by the Lagrangian functional $I(X;T) - \beta I(Y;T)$. The IB has proved useful in many machine learning problems, such as clustering [2] and natural language processing [3]. More recently, the IB framework has been used to analyze the training process of deep neural networks [4, 5].

In an inverse context, the Privacy Funnel (PF), introduced in [6], seeks to determine a mapping $P_{T|Y}$ that minimizes $I(X;T)$ (privacy leakage) while assuring $I(Y;T) \geq R$ (revealing useful information). Analogously, the PF can be solved by considering the functional $I(X;T) - \beta I(Y;T)$, now to be minimized. The privacy funnel (and its variants) has proven useful in information-theoretic privacy [6, 7].
The choice of mutual information in both the IB and the PF frameworks does not seem to carry any specific “operational” significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [1, Eq. 28] that can be solved iteratively in the IB case. In fact, this property is unique to mutual information among many other information metrics [8]. Nevertheless, at least in theory, one can replace the mutual information with a broader family of measures based on $f$-divergences.¹

¹Given two probability distributions $P$ and $Q$ over a finite alphabet $\mathcal{X}$ and a convex function $f:(0,\infty)\to\mathbb{R}$ with $f(1)=0$, the $f$-divergence is $D_f(P\|Q) \triangleq \sum_{x\in\mathcal{X}} Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right)$.

In this paper, we consider a wider class of bottleneck problems which includes the IB and the PF. We define the $f$-information between two random variables $X$ and $Y$ as $I_f(X;Y) \triangleq D_f(P_{X,Y}\,\|\,P_X P_Y)$, and introduce the following bottleneck functional
$$U_{f,g}(R) \triangleq \sup\big\{ I_f(X;T) : I_g(Y;T) \le R,\; X - Y - T \big\} \tag{1}$$
and the funnel functional
$$L_{f,g}(R) \triangleq \inf\big\{ I_f(X;T) : I_g(Y;T) \ge R,\; X - Y - T \big\} \tag{2}$$
where $f$ and $g$ are convex functions and the optimization is over all mappings $P_{T|Y}$. Different incarnations of $f$-information have already appeared, e.g., in [9] in the study of dissipation of information in constrained channels. These metrics possess “operational” significance that is arguably more useful in statistical learning and privacy applications than mutual information. For instance, total variation and the Hellinger distance play important roles in hypothesis testing [10], and the $\chi^2$-divergence in estimation problems [6]. Formulations (1) and (2) for a broader class of $f$-divergences can be potentially useful to emerging applications of information theory in machine learning.
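For concreteness, the two basic quantities above can be sketched in a few lines of code. The snippet below is our own illustration (function names are ours, not from the paper): it computes $D_f$ and $I_f$ for finite distributions, with $f(t)=t\log t$ (mutual information, in nats), $f(t)=(t-1)^2$ ($\chi^2$-information), and $f(t)=|t-1|/2$ (total variation) as example choices of $f$.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)) for a convex f with f(1) = 0.
    Assumes Q(x) > 0 wherever P(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = q > 0
    return float(np.sum(q[mask] * f(p[mask] / q[mask])))

def f_information(p_xy, f):
    """I_f(X;Y) = D_f(P_{X,Y} || P_X P_Y) for a joint-distribution matrix p_xy."""
    p_xy = np.asarray(p_xy, float)
    p_x = p_xy.sum(axis=1)  # marginal of X (rows)
    p_y = p_xy.sum(axis=0)  # marginal of Y (columns)
    return f_divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel(), f)

kl   = lambda t: t * np.log(t)         # f(t) = t log t   -> mutual information (nats)
chi2 = lambda t: (t - 1.0) ** 2        # f(t) = (t-1)^2   -> chi^2-information
tv   = lambda t: 0.5 * np.abs(t - 1.0) # f(t) = |t-1|/2   -> total variation

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
mi_nats = f_information(p_xy, kl)  # positive, since X and Y are correlated
```

Any convex $f$ with $f(1)=0$ can be plugged in the same way, which is exactly the degree of freedom exploited by (1) and (2).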
Computing $U_{f,g}$ and $L_{f,g}$ reduces to characterizing the upper and lower boundaries, respectively, of the two-dimensional set

$$\big\{ (I_g(Y;T),\, I_f(X;T)) : X - Y - T \big\}. \tag{3}$$
It is worth mentioning that studying (3) is at the heart of strong data processing inequalities [11] as well as fundamental limits of privacy [7]. Witsenhausen and Wyner [12] investigated the lower boundary of the related set $\{(H(Y|T), H(X|T)) : X - Y - T\}$, where $H$ is the entropy function. In particular, they proposed a method for computing this boundary analytically based on a dual formulation. When $Y$ is binary and $P_{X|Y}$ is a binary symmetric channel (BSC), the lower boundary corresponds to the well-known Mrs. Gerber’s Lemma [13]. Related convex techniques have also been used to characterize several network information-theoretic regions [14].
We generalize the approach in [12] to study the boundaries of (3) for a broader class of information metrics, characterizing properties of new bottleneck problems of the form (1) and (2). In particular, we investigate the estimation-theoretic variants of the information bottleneck and privacy funnel based on the $\chi^2$-divergence, which we call the Estimation Bottleneck and the Estimation Privacy Funnel, respectively. In the binary symmetric case, the upper boundary corresponds to the inverse of Mrs. Gerber’s Lemma, which we call Mr. Gerber’s Lemma. We further extend these lemmas to Arimoto conditional entropy [15].
II. Geometric Properties
II-A Notation
Let $X$ and $Y$ be two random variables having joint distribution $P_{X,Y}$ with supports $\mathcal{X}$ and $\mathcal{Y}$ of sizes $m$ and $n$, respectively. We denote by $p_Y \in \Delta_n$ the marginal probability vector with entries $p_Y(y) \triangleq \Pr\{Y = y\}$, where $\Delta_n$ is the $(n-1)$-dimensional probability simplex. We denote by $W$ the $m \times n$ stochastic matrix whose entries are the channel transformation probabilities, i.e., $W(x|y) \triangleq \Pr\{X = x \mid Y = y\}$; thus, $p_X = W p_Y$. For a discrete random variable $T$ with support $\mathcal{T}$, let $q_t \triangleq p_{Y|T=t}$, and let the marginal of $T$ be $p_T$. We denote by $H$ the entropy function, i.e., $H(p) \triangleq -\sum_i p_i \log p_i$, with the convention $0 \log 0 = 0$. Finally, we denote the convex hull of a set $\mathcal{A}$ by $\mathrm{conv}(\mathcal{A})$, and the boundary of a set $\mathcal{A}$ by $\partial\mathcal{A}$.

II-B Geometry of Bottleneck Problems
Let $\Phi:\Delta_m\to\mathbb{R}$ and $\Gamma:\Delta_n\to\mathbb{R}$ be continuous and bounded mappings over the simplices of dimension $m-1$ and $n-1$, respectively. We study the set (3) by first considering a more general setting, and then specializing it to different information metrics in the following sections. We consider the tuple

$$A(q) \triangleq \big(\Gamma(q),\, \Phi(Wq)\big), \qquad q \in \Delta_n, \tag{4}$$

where $q$ plays the role of the conditional distribution $p_{Y|T=t}$ and $Wq$ that of $p_{X|T=t}$. Recall that $T$, $Y$, and $X$ form the Markov chain $T - Y - X$. Therefore, we are interested in the following set for a fixed channel $W$:

$$\mathcal{S}(p_Y) \triangleq \Big\{ \Big(\textstyle\sum_t p_T(t)\,\Gamma(q_t),\; \sum_t p_T(t)\,\Phi(Wq_t)\Big) : \sum_t p_T(t)\,q_t = p_Y \Big\}. \tag{5}$$

Moreover, we define $\mathcal{A} \triangleq \{A(q) : q \in \Delta_n\}$, the graph of (4). The next lemma is a direct generalization of [12, Lemma 2.1].
Lemma 1.
$\mathcal{S}(p_Y)$ is convex and compact with $\mathcal{S}(p_Y) \subseteq \mathrm{conv}(\mathcal{A})$. In addition, all points in $\mathcal{S}(p_Y)$ can be written as a convex combination of at most $n+1$ points of $\mathcal{A}$; in other words, every point of $\mathcal{S}(p_Y)$ is achievable with $|\mathcal{T}| \le n+1$.
Let the upper and lower boundaries of $\mathcal{S}(p_Y)$ be denoted by $\mathsf{U}$ and $\mathsf{L}$, respectively, i.e.,

$$\mathsf{U}(c) \triangleq \sup\{\phi : (c, \phi) \in \mathcal{S}(p_Y)\}, \tag{6}$$

$$\mathsf{L}(c) \triangleq \inf\{\phi : (c, \phi) \in \mathcal{S}(p_Y)\}. \tag{7}$$

Under appropriate conditions on $\Phi$ and $\Gamma$ (depending on the choice of information metric), the set $\{\phi : (c,\phi)\in\mathcal{S}(p_Y)\}$ is nonempty, and hence the compactness of $\mathcal{S}(p_Y)$ allows one to replace the supremum and infimum in (6) and (7) with maximum and minimum, respectively. Moreover, it follows from the convexity of $\mathcal{S}(p_Y)$ that $\mathsf{L}$ is convex and $\mathsf{U}$ is concave.
II-C Dual Formulations
Since $\mathcal{S}(p_Y)$ is a convex set, its upper and lower boundaries are equivalently represented by its supporting hyperplanes. We use the dual approach introduced in [12] to evaluate $\mathsf{U}$ and $\mathsf{L}$. For a given $\lambda \ge 0$, define the conjugate function

$$\mathsf{L}^*(\lambda) \triangleq \min_{(c,\phi)\in\mathcal{S}(p_Y)} \phi - \lambda c. \tag{8}$$

Note that the graph of $\mathsf{L}$ is the lower boundary of $\mathcal{S}(p_Y)$. It then follows that the point that achieves the minimum in (8) lies on the lower boundary of $\mathcal{S}(p_Y)$ with supporting line of slope $\lambda$, and hence corresponds to a point $(c, \mathsf{L}(c))$.
We now turn our attention to evaluating $\mathsf{L}^*$. Let

$$\mathcal{A}_\lambda \triangleq \big\{ \big(q,\, \Phi(Wq) - \lambda\,\Gamma(q)\big) : q \in \Delta_n \big\}. \tag{9}$$

We observe that $\mathcal{A}_\lambda$ is the graph of the function on $\Delta_n$ given by

$$\varphi_\lambda(q) \triangleq \Phi(Wq) - \lambda\,\Gamma(q). \tag{10}$$

Since the mapping $(c, \phi) \mapsto \phi - \lambda c$ preserves convexity, we have

$$\mathsf{L}^*(\lambda) = \min\{\psi : (p_Y, \psi) \in \mathrm{conv}(\mathcal{A}_\lambda)\}, \tag{11}$$

with $\mathrm{conv}(\mathcal{A}_\lambda)$ the convex hull of the graph of $\varphi_\lambda$. Thus, $\mathsf{L}^*(\lambda)$ is given by the lower convex envelope $\check{\varphi}_\lambda$ of $\varphi_\lambda$ evaluated at $p_Y$. The same goes, mutatis mutandis, for $\mathsf{U}$: its conjugate function $\mathsf{U}^*$, defined as $\mathsf{U}^*(\lambda) \triangleq \max_{(c,\phi)\in\mathcal{S}(p_Y)} \phi - \lambda c$, coincides with the upper concave envelope $\hat{\varphi}_\lambda$ of $\varphi_\lambda$ evaluated at $p_Y$.
These properties lead to a procedure for characterizing $\mathsf{L}$ and $\mathsf{U}$. We illustrate it in detail for $\mathsf{L}$; $\mathsf{U}$ follows by using the concave envelope instead. For a given $\lambda$, there are two scenarios:

• Trivial case: If $(p_Y, \varphi_\lambda(p_Y))$ is in both $\mathcal{A}_\lambda$ and $\partial\,\mathrm{conv}(\mathcal{A}_\lambda)$, then $\mathsf{L}^*(\lambda) = \varphi_\lambda(p_Y)$. In this case, the optimization simply reduces to evaluating $\varphi_\lambda$ at $p_Y$, and the optimal $P_{T|Y}$ has $q_t = p_Y$ for some $t$, i.e., $T$ independent of $Y$.

• Non-trivial case: If $\varphi_\lambda(p_Y) > \check{\varphi}_\lambda(p_Y)$, then $(p_Y, \check{\varphi}_\lambda(p_Y))$ is the convex combination of points $(q_t, \varphi_\lambda(q_t))$ with weights $p_T(t)$, where $\varphi_\lambda(q_t) = \check{\varphi}_\lambda(q_t)$ for each $t$, and $\sum_t p_T(t)\,q_t = p_Y$. Then $\mathsf{L}^*(\lambda)$ is given by $\sum_t p_T(t)\,\varphi_\lambda(q_t)$. Moreover, an optimal $P_{T|Y}$ is attained by setting $p_{Y|T=t} = q_t$ with weights $p_T(t)$.

Hence, the points on the graph of $\mathsf{L}$ can be obtained by considering only the points at which $\varphi_\lambda$ differs from its convex envelope, since those are exactly the points where $\mathsf{L}^*$ is not induced by the trivial case. Algorithm 1 summarizes the previous discussion.
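The core step of this procedure, computing the lower convex envelope of $\varphi_\lambda$ and evaluating it at $p_Y$, can be sketched for binary $Y$ (so $q$ reduces to a scalar). The code below is our own illustration, not a reproduction of Algorithm 1: it builds the envelope on a grid via a lower-hull sweep and evaluates it at a chosen $p_Y$, here with $\Phi = \Gamma = $ binary entropy and a BSC channel as a worked example.

```python
import numpy as np

def lower_convex_envelope(xs, ys):
    """Lower convex envelope of the points (xs[i], ys[i]) with xs ascending,
    evaluated back on the grid xs (Andrew's monotone-chain lower hull)."""
    hull = []
    for x, y in zip(xs, ys):
        # pop the last hull point while it lies on or above the chord to (x, y)
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    hx, hy = zip(*hull)
    return np.interp(xs, hx, hy)

def h(p):
    """Binary entropy in bits (clipped away from 0 and 1 for stability)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# phi_lambda(q) = Phi(Wq) - lambda * Gamma(q) for a BSC(delta):
delta, lam = 0.1, 0.5
qs = np.linspace(0.0, 1.0, 2001)
phi = h(delta * (1 - qs) + (1 - delta) * qs) - lam * h(qs)
env = lower_convex_envelope(qs, phi)
Lstar = env[1000]  # envelope evaluated at p_Y = 1/2, i.e. L*(lambda)
```

In the trivial case the envelope coincides with $\varphi_\lambda$ at $p_Y$; in the non-trivial case it is a chord, whose endpoints give the conditional distributions $q_t$ of the optimal channel.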
II-D Matched Channels
The geometry of bottleneck problems leads to intriguing properties of optimal mappings. Our previous discussion reveals that the points $q_t$ used to form the convex envelope of $\varphi_\lambda$ are special: they determine a channel $P_{T|Y}$ such that, for any distribution $p_Y$ in their convex hull, the resulting point lies on the boundary of $\mathcal{S}(p_Y)$ with supporting line of slope $\lambda$. In this case, we say that the points $\{q_t\}$ form a matched channel for $W$ and $\lambda$.

Definition 1 (Matched Channel).
For a fixed channel $W$ and $\lambda \ge 0$, we say that $P_{T|Y}$, with conditional distributions $q_t = p_{Y|T=t}$, is matched to $p_Y$ if there exist weights $p_T$ such that $\sum_t p_T(t)\,q_t = p_Y$ and

$$\varphi_\lambda(q_t) = \check{\varphi}_\lambda(q_t) \quad \text{for all } t \in \mathcal{T}. \tag{12}$$
Using an elementary result in convex geometry (see Lemma 2 in the Appendix), we immediately have the following theorem.

Theorem 1.
Let $P_{T|Y}$ be a matched channel for $p_Y$. Then for any $p'_Y \in \mathrm{conv}(\{q_t\})$, we have

$$\check{\varphi}_\lambda(p'_Y) = \sum_t p'_T(t)\,\varphi_\lambda(q_t), \tag{13}$$

where the weights $p'_T$ satisfy $\sum_t p'_T(t)\,q_t = p'_Y$; in particular, $P_{T|Y}$ is also matched to $p'_Y$.
Proof.
See the Appendix. ∎
From Theorem 1, we know that for any distribution $p_Y$, matched channels are entirely determined by the points on the curve $\varphi_\lambda$ whose convex combinations lead to the convex envelope of $\varphi_\lambda$ at $p_Y$. This implies that, as long as $\varphi_\lambda$ differs from its convex envelope in a neighborhood of $p_Y$, a small perturbation of $p_Y$ does not change the matched channel but simply changes the weights $p_T$. Thus, optimal mappings are surprisingly robust to small errors in the estimation of $p_Y$, which could be a pragmatic advantage when applying bottleneck problems to real data. However, if $p_Y$ changes, we can recover the matched channel by first solving for the weights $p_T$ via $\sum_t p_T(t)\,q_t = p_Y$, and then applying Bayes’ rule to obtain $P_{T|Y}$.
III. Generalizing Bottleneck Problems
In this section, we demonstrate how the tools developed in Section II can be applied to new bottleneck problems of the form (1) and (2). We then revisit the IB and PF, and also study their estimationtheoretic variants.
Consider the Markov chain $X - Y - T$. Our goal is to describe the achievable pairs of divergences

$$\big( I_g(Y;T),\; I_f(X;T) \big). \tag{14}$$

Observe that for a given $P_{T|Y}$ we have

$$I_f(X;T) = D_f\big(P_{X,T}\,\big\|\,P_X P_T\big) \tag{15}$$

$$= \sum_t p_T(t)\, D_f\big(p_{X|T=t}\,\big\|\,p_X\big) \tag{16}$$

$$= \sum_t p_T(t)\, D_f\big(Wq_t\,\big\|\,Wp_Y\big), \tag{17}$$

and hence $I_f(X;T)$ can be expressed as

$$I_f(X;T) = \sum_t p_T(t)\,\Phi_f(Wq_t) \tag{18}$$

for the function $\Phi_f(p) \triangleq D_f(p\,\|\,p_X)$. Similarly, defining $\Gamma_g(q) \triangleq D_g(q\,\|\,p_Y)$, we have

$$I_g(Y;T) = \sum_t p_T(t)\,\Gamma_g(q_t). \tag{19}$$

Hence the corresponding set of pairs (14) for varying $P_{T|Y}$ has the same form as (5). Letting

$$\Phi = \Phi_f \quad \text{and} \quad \Gamma = \Gamma_g, \tag{20}$$

we can thus apply Algorithm 1 to characterize $U_{f,g}$ and $L_{f,g}$.
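The per-letter decomposition of $I_f(X;T)$ over the values of $T$ can be checked numerically. The sketch below is our own (all names illustrative; $f(t)=t\log t$ chosen for concreteness): it draws a random channel and representation and verifies that the direct computation of $I_f(X;T)$ agrees with the sum $\sum_t p_T(t)\,D_f(Wq_t\,\|\,Wp_Y)$.

```python
import numpy as np

def D_f(p, q, f):
    """f-divergence D_f(p||q) = sum_x q(x) f(p(x)/q(x)); assumes q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

kl = lambda t: t * np.log(t)  # f(t) = t log t, so that I_f = I (nats)

rng = np.random.default_rng(1)
n, m, k = 3, 3, 4                          # |Y|, |X|, |T|
p_Y = rng.dirichlet(np.ones(n))
W = rng.dirichlet(np.ones(m), size=n).T    # W[x, y] = P(X=x | Y=y); columns sum to 1
P_TgY = rng.dirichlet(np.ones(k), size=n)  # P_TgY[y, t] = P(T=t | Y=y)

p_T = p_Y @ P_TgY                          # marginal of T
q_t = (p_Y[:, None] * P_TgY) / p_T         # q_t[:, t] = p_{Y|T=t}

# Per-letter form: sum_t p_T(t) D_f(W q_t || W p_Y)
rhs = sum(p_T[t] * D_f(W @ q_t[:, t], W @ p_Y, kl) for t in range(k))

# Direct form: I_f(X;T) = D_f(P_{X,T} || P_X P_T), using P_{X,T} = W diag(p_T) q
P_XT = W @ (q_t * p_T)
lhs = D_f(P_XT.ravel(), np.outer(W @ p_Y, p_T).ravel(), kl)
```

The two quantities agree up to floating-point error, which is what licenses casting (14) in the form (5) and running the envelope procedure on $\Phi_f$ and $\Gamma_g$.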
Next, we show how the usual IB and PF fit in our formulation, and study their estimation-theoretic counterparts. We note, however, that the previous analysis does not require $\Phi$ and $\Gamma$ to be induced by $f$-divergences.
III-A Information Bottleneck
Assuming $f(t) = g(t) = t\log t$ in the bottleneck functional (1), we have $I_f(X;T) = I(X;T)$ and $I_g(Y;T) = I(Y;T)$. Thus, the set of points $(R, U_{f,g}(R))$ corresponds to the set of solutions of the IB problem.
It is worth mentioning that the same geometric approach can also be applied directly to entropy functions, which also leads to the IB formulation. In fact, this is exactly the setting studied in [12]. Specifically, choosing $\Phi(p) = H(p)$ and $\Gamma(q) = H(q)$, the set of points

$$\big\{ (H(Y|T),\, H(X|T)) : X - Y - T \big\} \tag{21}$$

also corresponds to the set of solutions of the IB, since $I(X;T) = H(X) - H(X|T)$ and $I(Y;T) = H(Y) - H(Y|T)$. The IB is closely related to strong data processing inequalities; see [11, Proposition 2] for more details in the case of the BSC (see also Fig. 1 (right)).
III-B Privacy Funnel
Assuming $f(t) = g(t) = t\log t$ in the funnel functional (2), the set of points $(R, L_{f,g}(R))$ corresponds to the set of solutions of the privacy funnel, introduced in [6]. Equivalently, using the entropy functions as in Section III-A, the set of points

$$\big\{ (c,\, \mathsf{U}(c)) : 0 \le c \le H(Y) \big\} \tag{22}$$

on the upper boundary of (21) also corresponds to the set of solutions, which follows from the fact that $R \mapsto L_{f,g}(R)$ is monotonically non-decreasing; see Fig. 1 (right).
III-C Estimation Bottleneck
One can move away from the usual IB and define new bottleneck problems by considering different functions $f$ and $g$. For instance, if $f(t) = (t-1)^2$, then the corresponding $f$-information, called $\chi^2$-information, is defined as

$$I_{\chi^2}(X;Y) \triangleq D_{\chi^2}\big(P_{X,Y}\,\big\|\,P_X P_Y\big). \tag{23}$$

We simplify the notation in (1) for $\chi^2$-information as

$$U_{\chi^2}(R) \triangleq \sup\big\{ I_{\chi^2}(X;T) : I_{\chi^2}(Y;T) \le R,\; X - Y - T \big\}. \tag{24}$$
The reasons to specifically study $\chi^2$-information are twofold. First, it has been shown in [6] that $I_{\chi^2}(X;Y) = \sum_i \lambda_i(X;Y)$, where $\lambda_i(X;Y)$ is the $i$-th principal inertia component (PIC) of $P_{X,Y}$. Moreover, if the PICs between $X$ and $Y$ are large, then the minimum mean-squared error (MMSE) of estimating $X$ given $Y$ will be small [6, Theorem 1], thereby enabling reliable estimation. Hence, if the goal of an estimation problem is to minimize the MMSE, we can equivalently consider maximizing $I_{\chi^2}(X;T)$. Second, following the spirit of the IB, we also add the constraint $I_{\chi^2}(Y;T) \le R$ on the new representation $T$, as the $\chi^2$-divergence serves as a sharp bound for other $f$-divergences [17].
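The identity between $\chi^2$-information and the sum of PICs is easy to verify numerically: the PICs are the squared singular values, other than the trivial unit one, of the matrix $D_X^{-1/2} P_{X,Y} D_Y^{-1/2}$. A small sketch (our code, assuming strictly positive marginals):

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# chi^2-information computed directly as an f-divergence with f(t) = (t-1)^2
ratio = p_xy / np.outer(p_x, p_y)
I_chi2 = float(np.sum(np.outer(p_x, p_y) * (ratio - 1.0) ** 2))

# The same quantity via principal inertia components: squared singular values
# of Q = D_X^{-1/2} P_{XY} D_Y^{-1/2}, discarding the trivial sigma_1 = 1
Q = p_xy / np.sqrt(np.outer(p_x, p_y))
sv = np.linalg.svd(Q, compute_uv=False)
pics = sv ** 2
I_chi2_pic = float(pics[1:].sum())
```

The spectral form also makes the estimation connection concrete: each PIC quantifies how well one (appropriately normalized) function of $X$ can be predicted from $Y$ in mean-squared error.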
III-D Estimation Privacy Funnel
Motivated by the connection between $\chi^2$-information and estimation problems mentioned in Section III-C, we propose

$$L_{\chi^2}(R) \triangleq \inf\big\{ I_{\chi^2}(X;T) : I_{\chi^2}(Y;T) \ge R,\; X - Y - T \big\}, \tag{25}$$

where the privacy is measured in terms of MMSE. The practical significance of (25) is justified as follows. Suppose $X$ represents private data (e.g., political preference) and $Y$ (e.g., movie ratings) is correlated with $X$. The main objective, formulated by (25), is to construct a privacy-assuring mapping $P_{T|Y}$ such that the information disclosed about $X$ by $T$ is minimized, thus minimizing privacy leakage, while preserving the estimation efficiency that $T$ provides about $Y$. Similarly, the solutions of the estimation privacy funnel (25) correspond to the set of points $(R, L_{\chi^2}(R))$; see Fig. 1 (left).
IV. Mrs. and Mr. Gerber’s Lemmas
The study of the upper and lower boundaries of achievable mutual information pairs is essential in multi-user information theory [13]. In the binary symmetric case, we not only rephrase Mrs. Gerber’s Lemma [13], but also derive its counterpart for the PF. Furthermore, we discuss analogous results to Mrs. and Mr. Gerber’s Lemmas for Arimoto conditional entropy.
IV-A Mr. Gerber’s Lemma
We apply the duality argument developed in Section II-C for $\Phi = \Gamma = H$ to characterize the IB and the PF in the binary symmetric case. In particular, let $W$ be the BSC with crossover probability $\delta \in [0, 1/2]$ and $p_Y = (1-p, p)$. For $a, b \in [0,1]$, denote $a \star b \triangleq a(1-b) + (1-a)b$. We denote by $h_b$ the binary entropy function $h_b(p) \triangleq -p\log p - (1-p)\log(1-p)$.

Let $c \triangleq H(Y|T)$. It was shown in [12] that

$$\mathsf{L}(c) = h_b\big(\delta \star h_b^{-1}(c)\big), \tag{26}$$

i.e., $H(X|T) \ge h_b(\delta \star h_b^{-1}(H(Y|T)))$, where $h_b^{-1}$ is the inverse function of $h_b$ restricted to $[0, 1/2]$. Eq. (26) is well known as Mrs. Gerber’s Lemma (MGL). In this case, the matched channel is also a BSC (with an appropriate crossover probability). Using the approach outlined in Section II, we derive a counterpart result for the upper boundary $\mathsf{U}$, and call it Mr. Gerber’s Lemma.
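Mrs. Gerber’s Lemma lends itself to a direct numerical check: for any channel $P_{T|Y}$, the pair $(H(Y|T), H(X|T))$ must sit on or above the curve in (26). The sketch below (our code, in bits, with $h_b^{-1}$ computed by bisection; all helper names are ours) computes the gap between $H(X|T)$ and the MGL bound, which is non-negative by (26).

```python
import numpy as np

def h(p):
    """Binary entropy in bits, clipped for numerical stability."""
    p = np.clip(np.asarray(p, float), 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def h_inv(v):
    """Inverse of the binary entropy on [0, 1/2], via bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if h(mid) < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def star(a, b):
    """Binary convolution a * b = a(1-b) + (1-a)b."""
    return a * (1 - b) + (1 - a) * b

def mgl_gap(delta, p1, P_TgY):
    """H(X|T) minus the Mrs. Gerber bound, for Y ~ Bern(p1),
    X = Y xor Bern(delta), and T drawn from Y by the rows of P_TgY."""
    p_yt = np.array([1 - p1, p1])[:, None] * P_TgY  # joint P(Y, T)
    p_t = p_yt.sum(axis=0)
    q1 = p_yt[1] / p_t                              # P(Y=1 | T=t)
    HY_T = float(np.sum(p_t * h(q1)))
    HX_T = float(np.sum(p_t * h(star(delta, q1))))
    return float(HX_T - h(star(delta, h_inv(HY_T))))

gap = mgl_gap(0.1, 0.3, np.array([[0.7, 0.2, 0.1],
                                  [0.1, 0.3, 0.6]]))  # non-negative by (26)
```

Sampling many random channels $P_{T|Y}$ and repeating the check traces out the feasible region between the lower boundary (26) and the upper boundary given next.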
Theorem 2 (Mr. Gerber’s Lemma).
For $c \in [0, h_b(p)]$, we have

$$\mathsf{U}(c) = \theta\, h_b(\delta \star \alpha) + (1-\theta)\log 2, \tag{27}$$

where $\theta \in [0,1]$ and $\alpha \in [0,1/2]$ are the solutions of $\theta\, h_b(\alpha) + (1-\theta)\log 2 = c$ and $\theta\alpha + (1-\theta)/2 = p$.
Proof.
See the Appendix. ∎
In summary, in the binary symmetric case, the set of solutions of the IB follows from Mrs. Gerber’s Lemma (26) and is given by $\{(c, \mathsf{L}(c))\}$, and the set of solutions of the PF follows from Mr. Gerber’s Lemma (27) and is given by $\{(c, \mathsf{U}(c))\}$. The upper and lower boundaries of the achievable pairs $(H(Y|T), H(X|T))$ are therefore characterized by Mr. and Mrs. Gerber’s Lemmas, respectively.
IV-B Achievable Pairs of Arimoto Conditional Entropy
Besides divergences and entropy functions, one can choose the $\alpha$-norm $\|p\|_\alpha \triangleq \big(\sum_x p(x)^\alpha\big)^{1/\alpha}$ for $\Phi$ and $\Gamma$ in (4), which results in Arimoto’s version of conditional Rényi entropy (Arimoto conditional entropy) [15] of order $\alpha$:

$$H_\alpha(X|Y) \triangleq \frac{\alpha}{1-\alpha}\,\log \sum_y \Big( \sum_x P_{X,Y}(x,y)^\alpha \Big)^{1/\alpha}. \tag{28}$$

When $\alpha = 1$, we define $H_1(X|Y) \triangleq H(X|Y)$, recovering the Shannon conditional entropy in the limit. Hence, the set of achievable Arimoto conditional entropy pairs can be obtained from the boundaries of $\mathcal{S}(p_Y)$ by the nonlinear mapping

$$(c, \phi) \mapsto \Big( \frac{\alpha}{1-\alpha}\log c,\; \frac{\alpha}{1-\alpha}\log \phi \Big). \tag{29}$$

With (28) at hand, the Arimoto mutual information [15] of order $\alpha$ can be defined as $I_\alpha(X;Y) \triangleq H_\alpha(X) - H_\alpha(X|Y)$, where $H_\alpha(X)$ is the Rényi entropy of order $\alpha$. Arimoto conditional entropy has been proven useful in approximating the minimum error probability of Bayesian $M$-ary hypothesis testing [15].
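Eq. (28) can be implemented in a few lines, and two standard sanity checks are easy to run: if $X$ is uniform and independent of $Y$, then $H_\alpha(X|Y) = \log_2|\mathcal{X}|$ for every $\alpha$, and for large $\alpha$ the expression approaches $-\log_2\sum_y \max_x P_{X,Y}(x,y)$, the quantity governing the Bayesian error probability. The code is our own sketch (bits, $\alpha \neq 1$).

```python
import numpy as np

def arimoto_cond_entropy(p_xy, alpha):
    """H_alpha(X|Y) = alpha/(1-alpha) * log2 sum_y (sum_x P(x,y)^alpha)^(1/alpha),
    in bits, following Eq. (28); rows of p_xy index x, columns index y."""
    p_xy = np.asarray(p_xy, float)
    inner = (p_xy ** alpha).sum(axis=0) ** (1.0 / alpha)
    return float(alpha / (1.0 - alpha) * np.log2(inner.sum()))

# X uniform and independent of Y: H_alpha(X|Y) = 1 bit for every alpha
p_indep = np.outer([0.5, 0.5], [0.3, 0.7])
H2 = arimoto_cond_entropy(p_indep, 2.0)
```

Taking $\alpha \to 1$ numerically (e.g., $\alpha = 1.0001$) recovers the Shannon conditional entropy up to discretization error, consistent with the convention $H_1(X|Y) = H(X|Y)$.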
IV-C Arimoto’s Mr. and Mrs. Gerber’s Lemmas
Due to the importance of Arimoto conditional entropy [15], we study extensions of Mr. and Mrs. Gerber’s Lemmas to Arimoto conditional entropy, naming them Arimoto’s Mr. and Mrs. Gerber’s Lemmas, respectively.
Let $\Phi(q) = \Gamma(q) = \|q\|_\alpha$ in (4), and also let $W$ be the BSC with crossover probability $\delta$. Since, for $\alpha > 1$, the mapping $v \mapsto \frac{\alpha}{1-\alpha}\log v$ is strictly decreasing, the upper (resp. lower) boundary of the achievable Arimoto conditional entropy pairs corresponds to the lower (resp. upper) boundary of $\mathcal{S}(p_Y)$. Define $\mathsf{L}_\alpha(c)$ and $\mathsf{U}_\alpha(c)$ respectively as the minimum and maximum of $\sum_t p_T(t)\,\|Wq_t\|_\alpha$ when $\sum_t p_T(t)\,\|q_t\|_\alpha = c$ for $X - Y - T$. For simplicity, denote by $N_\alpha(q) \triangleq \|(1-q, q)\|_\alpha$ the $\alpha$-norm of a binary distribution. In this case, following Section II-C, we have $\varphi_\lambda(q) = N_\alpha(\delta \star q) - \lambda\,N_\alpha(q)$, which leads to the following theorem.
Theorem 3 (Arimoto’s Mrs. Gerber’s Lemma).
For $\alpha > 1$, let $N_\alpha^{-1}$ be the inverse of $N_\alpha$ restricted to $[0,1/2]$; for $c$ in the achievable range, we have

$$\mathsf{L}_\alpha(c) = N_\alpha\big(\delta \star N_\alpha^{-1}(c)\big). \tag{30}$$

In particular, there exists a matched channel that is a BSC, so that (30) is attained with $|\mathcal{T}| = 2$.
Proof.
See the Appendix. ∎
Analogous to this theorem, we also obtain the following generalization of Mr. Gerber’s Lemma.
Theorem 4 (Arimoto’s Mr. Gerber’s Lemma).
For $\alpha > 1$ and $c$ in the achievable range, we have

$$\mathsf{U}_\alpha(c) = \theta\, N_\alpha(\delta \star q^*) + (1-\theta)\, N_\alpha(1/2), \tag{31}$$

where $\theta \in [0,1]$ and $q^* \in [0,1/2]$ satisfy $\theta\, N_\alpha(q^*) + (1-\theta)\, N_\alpha(1/2) = c$ and $\theta q^* + (1-\theta)/2 = p$. In particular, the maximum is attained by time-sharing between a BSC-type component and the uniform distribution.
Consequently, for $\alpha > 1$, Arimoto’s Mrs. and Mr. Gerber’s Lemmas jointly characterize the achievable sets of Arimoto conditional entropy pairs; see Fig. 1 (right).
V. Final Remarks
In this paper, we study the geometric structure behind bottleneck problems, and generalize the IB and PF to a broader class of $f$-divergences. In particular, we consider estimation-theoretic variants of the IB and PF. Moreover, we show how the geometry of bottleneck problems can be used to derive the counterpart of Mrs. Gerber’s Lemma (called Mr. Gerber’s Lemma), and derive versions of Mrs. and Mr. Gerber’s Lemmas for Arimoto conditional entropy. These results can be potentially useful for new applications of information theory in machine learning.
A Lemma 2
Lemma 2 ([18]).
Let $\mathcal{C}$ be a connected and nonempty subset of $\mathbb{R}^d$, and let $x \in \mathrm{conv}(\mathcal{C})$. Then there exist points $x_1, \ldots, x_d \in \mathcal{C}$ and weights $\theta_1, \ldots, \theta_d \ge 0$ with $\sum_{i=1}^{d}\theta_i = 1$ such that $x = \sum_{i=1}^{d}\theta_i x_i$. Furthermore, the connectedness of $\mathcal{C}$ allows $d$ points instead of the $d+1$ points required by Carathéodory’s theorem.
B Proof of Theorem 1 (Matched Channel)
Recall that $\mathsf{L}$ is determined parametrically in $\lambda$ by the points where $\varphi_\lambda$ does not match its convex envelope $\check{\varphi}_\lambda$. Thus, the columns of the channel transformation matrix of a matched channel correspond to extreme points $q_t$ where $\varphi_\lambda$ matches $\check{\varphi}_\lambda$. However, there exist weights $p_T$ whose convex combination corresponds to a point $(p_Y, \check{\varphi}_\lambda(p_Y))$. Using Lemma 2, any non-trivial convex combination of the $q_t$ results in a point which is on the convex envelope of $\varphi_\lambda$ and determines a corresponding point on the curve $\mathsf{L}$.
C Proof of Theorem 2 (Mr. Gerber’s Lemma)
Take $\Phi = \Gamma = h_b$; we then have $\varphi_\lambda(q) = h_b(\delta \star q) - \lambda\, h_b(q)$. For $\lambda \ge (1-2\delta)^2$, $\varphi_\lambda$ is convex in $q$, and its upper concave envelope is the chord joining $(0, h_b(\delta))$ and $(1, h_b(\delta))$. For $\lambda < (1-2\delta)^2$, $\varphi_\lambda$ is concave in a region centered at $q = 1/2$, where it reaches a local maximum. Consequently, if $p$ lies outside this region, the upper concave envelope of $\varphi_\lambda$ at $p$ is a linear combination of an endpoint of the interval and a point of the concave region; if $p$ lies inside it, any point on the upper concave envelope of $\varphi_\lambda$ also lies in $\mathcal{A}_\lambda$.

Hence, the distribution that achieves $\mathsf{U}(c)$ will be of one of two types:

(i) $T$ is binary with $p_{Y|T=t} \in \{\alpha, 1-\alpha\}$ and weights $\theta$, $1-\theta$;

(ii) $T$ takes values in a ternary alphabet with $p_{Y|T=t} \in \{\alpha, 1-\alpha, 1/2\}$.

Rearranging cases (i) and (ii), the result in (27) follows.
D Proof of Theorem 3 (Arimoto’s Mrs. Gerber’s Lemma)
Note that $\varphi_\lambda(q) = N_\alpha(\delta \star q) - \lambda\,N_\alpha(q)$ is convex for $\lambda$ sufficiently small. For larger $\lambda$, $\varphi_\lambda''$ is negative on an interval symmetric about $q = 1/2$, and is positive elsewhere, with a local maximum of $\varphi_\lambda$ at $q = 1/2$. By symmetry, the lower convex envelope of the graph is obtained by replacing the arc of $\varphi_\lambda$ over this interval by a chord. Therefore, for a given $p$, if $p$ falls inside the interval, then the envelope point is a convex combination of the two endpoints of the chord; otherwise $\varphi_\lambda$ coincides with its envelope at $p$. Hence, we have $\mathsf{L}_\alpha(c) = N_\alpha(\delta \star N_\alpha^{-1}(c))$ outside the interval and the linear interpolation within it.
References
 [1] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc. of IEEE Allerton, 2000.
 [2] N. Tishby and N. Slonim, “Data clustering by Markovian relaxation and the information bottleneck method,” in Proc. of NIPS, 2001.
 [3] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method,” in Proc. of ACM SIGIR, 2000.

 [4] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in Proc. of IEEE ITW, 2015.
 [5] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
 [6] F. P. Calmon, A. Makhdoumi, M. Médard, M. Varia, M. Christiansen, and K. R. Duffy, “Principal inertia components and applications,” IEEE Trans. Inf. Theory, vol. 63, no. 9, pp. 5011–5038, 2017.
 [7] F. P. Calmon, A. Makhdoumi, and M. Médard, “Fundamental limits of perfect privacy,” in Proc. of IEEE ISIT, 2015.
 [8] P. Harremoës and N. Tishby, “The information bottleneck revisited or how to choose a good distortion measure,” in Proc. of IEEE ISIT, 2007.
 [9] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan. 2016.
 [10] ——, “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC), vol. 6, pp. 2012–2016, 2014.
 [11] F. P. Calmon, Y. Polyanskiy, and Y. Wu, “Strong data processing inequalities in power-constrained Gaussian channels,” in Proc. of IEEE ISIT, 2015.
 [12] H. Witsenhausen and A. Wyner, “A conditional entropy bound for a pair of discrete random variables,” IEEE Trans. Inf. Theory, vol. 21, no. 5, pp. 493–501, 1975.
 [13] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
 [14] C. Nair, “Upper concave envelopes and auxiliary random variables,” Int. J. Adv. Eng. Sci. Appl. Math., vol. 5, no. 1, pp. 12–20, 2013.
 [15] I. Sason and S. Verdú, “Arimoto–Rényi conditional entropy and Bayesian M-ary hypothesis testing,” arXiv preprint arXiv:1701.01974, 2017.
 [16] H. Hsu, S. Asoodeh, S. Salamatian, and F. P. Calmon, “Generalizing bottleneck problems: extended version.” [Online]. Available: https://github.com/HsiangHsu/ISIT18ExtendedVersion
 [17] A. Makur and L. Zheng, “Bounds between contraction coefficients,” in Proc. of IEEE Allerton, 2015.
 [18] H. G. Eggleston, Convexity. Wiley Online Library, 1966.