Generalizing Bottleneck Problems

02/16/2018 ∙ by Hsiang Hsu, et al. ∙ MIT, The University of Chicago, Harvard University

Given a pair of random variables (X,Y) ∼ P_XY and two convex functions f_1 and f_2, we introduce two bottleneck functionals as the lower and upper boundaries of the two-dimensional convex set that consists of the pairs (I_{f_1}(W; X), I_{f_2}(W; Y)), where I_f denotes f-information and W varies over the set of all discrete random variables satisfying the Markov condition W → X → Y. Applying Witsenhausen and Wyner's approach, we provide an algorithm for computing the boundaries of this set for f_1, f_2, and discrete P_XY. In the binary symmetric case, we fully characterize the set when (i) f_1(t)=f_2(t)=t log t, (ii) f_1(t)=f_2(t)=t^2−1, and (iii) f_1 and f_2 are both ℓ^β-norm functions for β > 1. We then argue that the upper and lower boundaries in (i) correspond to Mrs. Gerber's Lemma and its inverse (which we call Mr. Gerber's Lemma), in (ii) correspond to estimation-theoretic variants of the Information Bottleneck and the Privacy Funnel, and in (iii) correspond to an Arimoto Information Bottleneck and Privacy Funnel.


I Introduction

Few information-theoretic constructs have captured the attention of machine learning researchers and practitioners as much as the Information Bottleneck (IB) [1]. Given two correlated random variables X and Y with joint distribution P_XY, the goal of the IB is to determine a mapping P_{W|X} that produces a new representation W of X such that (i) W → X → Y and (ii) I(W;Y) is maximized (information preserved) while I(W;X) is minimized (compression). This tradeoff can be quantified by the Lagrangian functional I(W;X) − βI(W;Y). The IB has proved useful in many machine learning problems, such as clustering [2] and natural language processing [3]. More recently, the IB framework has been used to analyze the training process of deep neural networks [4, 5].

In an inverse context, the Privacy Funnel (PF), introduced in [6], seeks to determine a mapping P_{W|X} that minimizes I(W;Y) (privacy leakage) while assuring that I(W;X) is large (revealing useful information). Analogously, the PF can be solved by considering a Lagrangian functional that trades off I(W;Y) against I(W;X). The privacy funnel (and its variants) has been shown to be useful in information-theoretic privacy [6, 7].

The choice of mutual information in both the IB and the PF frameworks does not seem to carry any specific “operational” significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [1, Eq. 28] that can be solved iteratively in the IB case. In fact, this property is unique to mutual information among many other information metrics [8]. Nevertheless, at least in theory, one can replace the mutual information with a broader family of measures based on f-divergences¹.

¹ Given two probability distributions P and Q and a convex function f with f(1) = 0, the f-divergence is defined as D_f(P‖Q) ≜ Σ_x Q(x) f(P(x)/Q(x)).

In this paper, we consider a wider class of bottleneck problems which includes the IB and the PF. We define the f-information between two random variables X and Y as I_f(X;Y) ≜ D_f(P_XY ‖ P_X P_Y), and introduce, for a given threshold r ≥ 0, the following bottleneck functional

(1)   sup { I_{f_2}(W;Y) : W → X → Y, I_{f_1}(W;X) ≤ r }

and the funnel functional

(2)   inf { I_{f_2}(W;Y) : W → X → Y, I_{f_1}(W;X) ≥ r },

where f_1 and f_2 are convex functions. Different incarnations of f-information have already appeared, e.g., in [9] for specific choices of f. These metrics possess “operational” significance that is arguably more useful in statistical learning and privacy applications than mutual information. For instance, total variation and Hellinger distance play important roles in hypothesis testing [10], and χ²-divergence in estimation problems [6]. Formulations (1) and (2) for a broader class of divergences can be potentially useful to emerging applications of information theory in machine learning.
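To make the f-information concrete, the following short numerical sketch (not part of the paper; the function names and the example joint distribution are illustrative) evaluates I_f(X;Y) = D_f(P_XY ‖ P_X P_Y) for a discrete joint pmf, with f(t) = t log t recovering mutual information and f(t) = t² − 1 recovering χ²-information.

```python
import numpy as np

def f_information(P_xy, f):
    """I_f(X;Y) = D_f(P_XY || P_X P_Y) for a discrete joint pmf matrix P_xy."""
    P_xy = np.asarray(P_xy, dtype=float)
    p_x = P_xy.sum(axis=1, keepdims=True)          # marginal of X (column)
    p_y = P_xy.sum(axis=0, keepdims=True)          # marginal of Y (row)
    Q = p_x * p_y                                  # product-of-marginals pmf
    ratio = np.divide(P_xy, Q, out=np.ones_like(P_xy), where=Q > 0)
    return float(np.sum(Q * f(ratio)))

# f(t) = t log t  ->  mutual information (in nats)
f_kl = lambda t: t * np.log(np.where(t > 0, t, 1.0))
# f(t) = t^2 - 1  ->  chi-squared information
f_chi2 = lambda t: t**2 - 1

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                      # a correlated binary pair
print(f_information(P_xy, f_kl), f_information(P_xy, f_chi2))
```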

Computing (1) and (2) reduces to characterizing the upper and lower boundaries, respectively, of the two-dimensional set

(3)   { (I_{f_1}(W;X), I_{f_2}(W;Y)) : W → X → Y }.

It is worth mentioning that studying (3) is at the heart of strong data processing inequalities [11] as well as fundamental limits of privacy [7]. Witsenhausen and Wyner [12] investigated the lower boundary of a related set consisting of the pairs (H(X|W), H(Y|W)), where H is the entropy function. In particular, they proposed an algorithm for analytically computing this boundary based on a dual formulation. When X is binary and P_{Y|X} is a binary symmetric channel (BSC), the lower boundary of this set corresponds to the well-known Mrs. Gerber's Lemma [13]. Related convex techniques have also been used to characterize network information-theoretic regions [14].

We generalize the approach in [12] to study the boundaries of (3) for a broader class of f-information metrics, characterizing properties of new bottleneck problems of the form (1) and (2). In particular, we investigate the estimation-theoretic variants of the information bottleneck and privacy funnel based on χ²-divergence, which we call the Estimation Bottleneck and the Estimation Privacy Funnel, respectively. In the binary symmetric case, the upper boundary corresponds to the inverse of Mrs. Gerber's Lemma, which we call Mr. Gerber's Lemma. We further extend both lemmas to Arimoto conditional entropy [15].

This paper is organized as follows. Section II introduces the geometry of bottleneck problems. In Section III, we formulate variational bottleneck problems and explore their use, and we provide further applications to information inequalities in Section IV. Proofs of the results are available in [16].

II Geometric Properties

II-A Notation

Let X and Y be two random variables having joint distribution P_XY with supports 𝒳 and 𝒴, respectively. We denote by p_X the marginal probability vector of X, with entries p_X(x) ≜ Pr(X = x), viewed as a point of the probability simplex Δ over 𝒳. We denote by P_{Y|X} the stochastic matrix whose entries are given by the channel transformation Pr(Y = y | X = x); thus, the marginal p_Y is obtained by applying the channel to p_X, written p_Y = p_X P_{Y|X}. For a discrete random variable W with support 𝒲, we write p_W for its marginal and p_{X|W=w} for the conditional distribution of X given W = w. We denote by H the entropy function. Finally, we denote the convex hull of a set A by conv(A), and the boundary of a set A by ∂A.

II-B Geometry of Bottleneck Problems

Let g_1 and g_2 be continuous and bounded mappings over the probability simplices associated with 𝒳 and 𝒴, respectively. We study the set (3) by first considering a more general context, and then specializing it to different information metrics in the following sections. For a random variable W, we consider the tuple

(4)   ( E[ g_1(p_{X|W}) ], E[ g_2(p_{Y|W}) ] ),

where p_{X|W=w} ∈ Δ and p_{Y|W=w} = p_{X|W=w} P_{Y|X}. Recall that W, X, and Y form the Markov chain W → X → Y. Therefore, we are interested in the following set for a fixed channel P_{Y|X}:

(5)   S(p_X) ≜ { ( E[ g_1(p_{X|W}) ], E[ g_2(p_{Y|W}) ] ) : W → X → Y, X ∼ p_X }.

Moreover, we consider the curve {(g_1(q), g_2(q P_{Y|X})) : q ∈ Δ} traced out by deterministic choices of p_{X|W}. The next lemma is a direct generalization of [12, Lemma 2.1].

Lemma 1.

S(p_X) is convex and compact. In addition, all points in S(p_X) can be written as a convex combination of at most |𝒳| + 1 points of the curve {(g_1(q), g_2(q P_{Y|X})) : q ∈ Δ}; in other words, it suffices to consider W with |𝒲| ≤ |𝒳| + 1.

Let the upper and lower boundaries of S(p_X) be denoted by B_S and F_S, respectively, i.e., we have

(6)   B_S(r) ≜ sup { v : (r, v) ∈ S(p_X) },
(7)   F_S(r) ≜ inf { v : (r, v) ∈ S(p_X) }.

Under appropriate conditions on g_1 and g_2 (depending on the choice of information metric), S(p_X) is non-empty, and hence the compactness of S(p_X) allows one to replace the infimum and supremum in (6) and (7) with a minimum and maximum, respectively. Moreover, it follows from the convexity of S(p_X) that F_S is convex and B_S is concave.
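To visualize the set (5) and its boundaries, the following sketch (illustrative only, not from the paper) fixes a binary symmetric P_XY, samples random test channels P_{W|X} with a small W-alphabet, and records the resulting mutual-information pairs; the upper and lower edges of the resulting scatter approximate the bottleneck and funnel curves in the case f_1(t) = f_2(t) = t log t.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(P):
    """I(A;B) in nats for a joint pmf matrix P."""
    pa = P.sum(axis=1, keepdims=True)
    pb = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (pa * pb)[mask])))

P_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])                  # fixed joint distribution of (X, Y)
p_x = P_xy.sum(axis=1)
P_y_given_x = P_xy / p_x[:, None]                # rows are P_{Y|X=x}

pairs = []
for _ in range(5000):
    P_w_given_x = rng.dirichlet(np.ones(3), size=2)   # random test channel P_{W|X}
    P_xw = p_x[:, None] * P_w_given_x                 # joint pmf of (X, W)
    P_yw = P_y_given_x.T @ P_xw                       # joint pmf of (Y, W) via W -> X -> Y
    pairs.append((mutual_information(P_xw), mutual_information(P_yw)))

pairs = np.array(pairs)   # scatter inside the set; its upper/lower envelopes
                          # approximate the IB and PF boundaries
```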

II-C Dual Formulations

Since S(p_X) is a convex set, its upper and lower boundaries are equivalently represented by its supporting hyperplanes. We use the dual approach introduced in [12] to evaluate F_S and B_S. For a given slope λ ≥ 0, define the conjugate function

(8)   F_S*(λ) ≜ min { v_2 − λ v_1 : (v_1, v_2) ∈ S(p_X) }.

Note that the graph of F_S is the lower boundary of S(p_X). It then follows that the point that achieves the minimum in (8) lies on the lower boundary of S(p_X) with supporting line of slope λ, and hence corresponds to a point (r, F_S(r)) for some r.

We now turn our attention to evaluating F_S*(λ). For q ∈ Δ, let

(9)   ψ_λ(q) ≜ g_2(q P_{Y|X}) − λ g_1(q),

so that, for any W satisfying W → X → Y and X ∼ p_X,

(10)   E[ g_2(p_{Y|W}) ] − λ E[ g_1(p_{X|W}) ] = E[ ψ_λ(p_{X|W}) ].

Since taking expectations amounts to forming convex combinations of points on the graph of ψ_λ, subject to E[ p_{X|W} ] = p_X, we have

(11)   F_S*(λ) = inf { E[ ψ_λ(p_{X|W}) ] : E[ p_{X|W} ] = p_X },

which is the value, above p_X, of the convex hull of the graph of ψ_λ. Thus, F_S*(λ) is given by the lower convex envelope of ψ_λ evaluated at p_X. The same goes, mutatis mutandis, for B_S: its conjugate function B_S*(λ), defined with a maximum in place of the minimum in (8), coincides with the upper concave envelope of ψ_λ at p_X.

These properties lead to a procedure for characterizing F_S and B_S. We illustrate the procedure in detail for F_S; the case of B_S follows by using the upper concave envelope instead. For a given slope λ, there are two scenarios:

  1. Trivial case: If the point (p_X, ψ_λ(p_X)) lies on both the graph of ψ_λ and its lower convex envelope, then F_S*(λ) = ψ_λ(p_X). In this case, the optimal W is degenerate, with p_{X|W=w} = p_X for every w, independently of λ.

  2. Non-trivial case: Otherwise, the envelope value at p_X is a convex combination of points (q_i, ψ_λ(q_i)) on the graph of ψ_λ, with weights μ_i satisfying Σ_i μ_i q_i = p_X. Then F_S*(λ) = Σ_i μ_i ψ_λ(q_i), and an optimal W is attained by setting Pr(W = i) = μ_i and p_{X|W=i} = q_i.

Hence, the points on the graph of F_S can be obtained by considering only the points of the simplex at which ψ_λ differs from its convex envelope, since those are exactly the points that are not induced by the trivial case. Algorithm 1 summarizes the previous discussion.

1: Input: p_X, P_{Y|X}, g_1, g_2, slope λ
2: Output: F_S*(λ) and an optimal P_{W|X}
3: Compute ψ_λ(q) ≜ g_2(q P_{Y|X}) − λ g_1(q), q ∈ Δ
4: K ← lower convex envelope of ψ_λ
5: if K(p_X) = ψ_λ(p_X) then
6:     return ψ_λ(p_X) with the trivial (constant) W
7: else
8:     find points q_1, …, q_k and weights μ_1, …, μ_k such that Σ_i μ_i q_i = p_X and K(p_X) = Σ_i μ_i ψ_λ(q_i)
9:     set Pr(W = i) ← μ_i and p_{X|W=i} ← q_i
10:     return Σ_i μ_i ψ_λ(q_i)
Algorithm 1 Computing F_S*(λ) at slope λ (use the upper concave envelope for B_S*(λ))
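A rough numerical rendering of the envelope computation behind Algorithm 1, specialized to the entropy instance of [12] with binary X so that the simplex is the interval [0,1]; this is a sketch under those assumptions (the grid, the channel parameters, and all names are illustrative), not the paper's implementation.

```python
import numpy as np

def binary_entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def lower_convex_envelope(x, y):
    """Lower convex envelope of points (x_i, y_i) with x sorted, via a monotone chain."""
    hull = []                                   # indices of the lower hull
    for i in range(len(x)):
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            # drop the last hull point if it lies on or above the chord from i1 to i
            if (y[i2] - y[i1]) * (x[i] - x[i1]) >= (y[i] - y[i1]) * (x[i2] - x[i1]):
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(x, x[hull], y[hull])       # envelope evaluated on the grid

# Entropy instance: g1(q) = H(q), g2(q) = H(q P_{Y|X}) for a BSC(delta)
delta, p_x, lam = 0.1, 0.3, 0.5                 # illustrative channel, marginal, slope
q = np.linspace(0.0, 1.0, 2001)                 # grid over the (binary) simplex
psi = binary_entropy(q * (1 - delta) + (1 - q) * delta) - lam * binary_entropy(q)

env = lower_convex_envelope(q, psi)
conjugate_at_px = np.interp(p_x, q, env)        # dual value at p_X
trivial = np.isclose(np.interp(p_x, q, psi), conjugate_at_px, atol=1e-9)
```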

II-D Matched Channels

The geometry of bottleneck problems leads to intriguing properties of the optimal mappings. Our previous discussion reveals that the points q_1, …, q_k used to form the convex envelope of ψ_λ are special: they determine a channel P_{X|W} such that, for any distribution p_X in the convex hull of these points, the resulting value of (4) lies on the boundary of S(p_X) with supporting line of slope λ. In this case, we say that the points q_1, …, q_k form a matched channel for P_{Y|X} and slope λ.

Definition 1 (Matched Channel).

For a fixed channel P_{Y|X} and slope λ, we say that a channel P_{X|W} with columns q_1, …, q_k is matched to P_{Y|X} if there exist weights μ_1, …, μ_k such that Σ_i μ_i q_i = p_X and

(12)   Σ_i μ_i ψ_λ(q_i) equals the lower convex (resp. upper concave) envelope of ψ_λ at p_X.

Using an elementary result in convex geometry (see Lemma 2 in the Appendix), we immediately have the following theorem.

Theorem 1.

Let P_{X|W} be a matched channel for P_{Y|X} at slope λ. Then for any p_X in the convex hull of its columns, the point (4) induced by P_{X|W} satisfies

(13)   ( E[ g_1(p_{X|W}) ], E[ g_2(p_{Y|W}) ] ) ∈ ∂S(p_X),

with supporting line of slope λ.
Proof.

See the Appendix. ∎

From Theorem 1, we know that for any distribution p_X, matched channels are entirely determined by the points on the curve whose convex combinations lead to the convex envelope of ψ_λ at p_X. This implies that, as long as the envelope is formed by the same contact points at p_X, a small perturbation of p_X does not change the matched channel but simply changes the weights μ_i. Thus, optimal mappings are surprisingly robust to small errors in the estimation of p_X, which could potentially give pragmatic advantages when applying bottleneck problems to real data. However, if p_X changes, we can recover the corresponding mapping by first solving for the new weights μ_i via Σ_i μ_i q_i = p_X, and then applying Bayes' rule to obtain P_{W|X}.

Note that the properties above hold only when g_1 and g_2 do not depend on p_X. Specifically, matched channels do not exist for the cases studied in Sections III-C and III-D.

III Generalizing Bottleneck Problems

In this section, we demonstrate how the tools developed in Section II can be applied to new bottleneck problems of the form (1) and (2). We then revisit the IB and PF, and also study their estimation-theoretic variants.

Consider the Markov chain W → X → Y. Our goal is to describe the achievable pairs of f-informations

(14)   ( I_{f_1}(W;X), I_{f_2}(W;Y) ).

Observe that for a given P_{W|X} we have

(15)   I_{f_1}(W;X) = D_{f_1}( P_{WX} ‖ P_W P_X )
(16)   = Σ_w P_W(w) Σ_x p_X(x) f_1( p_{X|W}(x|w) / p_X(x) )
(17)   = E[ D_{f_1}( p_{X|W} ‖ p_X ) ],

and hence I_{f_1}(W;X) can be expressed as

(18)   I_{f_1}(W;X) = E[ g_1( p_{X|W} ) ]

for some function g_1 determined by f_1 and p_X. Similarly, defining g_2 from f_2 and p_Y, we have

(19)   I_{f_2}(W;Y) = E[ g_2( p_{Y|W} ) ].

Hence the corresponding set of pairs (14) for varying P_{W|X} has the same form as (5). Letting

(20)   g_1(q) ≜ D_{f_1}( q ‖ p_X )  and  g_2(p) ≜ D_{f_2}( p ‖ p_Y ),

we can thus apply Algorithm 1 to characterize the upper and lower boundaries of (3).

Next, we show how the usual IB and PF fit in our formulation, and study their estimation-theoretic counterparts. We note, however, that the previous analysis does not require f_1 and f_2 to coincide.

III-A Information Bottleneck

Assuming f_1(t) = f_2(t) = t log t in the bottleneck functional (1), we have I_{f_1}(W;X) = I(W;X) and I_{f_2}(W;Y) = I(W;Y). Thus, the set of points on the upper boundary of (3) corresponds to the set of solutions of the IB problem.

It is worth mentioning that the same geometric approach can also be applied directly to entropy functions, which also leads to the IB formulation. In fact, this is exactly the setting studied in [12]. Specifically, choosing g_1 = H and g_2 = H, the set of points

(21)   { ( H(X|W), H(Y|W) ) : W → X → Y }

on the corresponding lower boundary also corresponds to the set of solutions of the IB. The IB is closely related to strong data processing inequalities; see [11, Proposition 2] for more details in the case of the BSC (see also Fig. 1 (right)).

III-B Privacy Funnel

Assuming f_1(t) = f_2(t) = t log t in the funnel functional (2), the set of points on the lower boundary of (3) corresponds to the set of solutions of the privacy funnel, introduced in [6]. Equivalently, using the entropy function as in Section III-A, the set of points

(22)   { ( H(X|W), H(Y|W) ) : W → X → Y }

on the corresponding upper boundary also corresponds to the set of solutions; this follows from the fact that the boundary, as a function of the constraint value, is monotonically non-decreasing. See Fig. 1 (right).

III-C Estimation Bottleneck

One can move away from the usual IB and define new bottleneck problems by considering different functions f_1 and f_2. For instance, if f(t) = t^2 − 1, then the corresponding f-information, called χ²-information, is defined as

(23)   I_{χ²}(X;Y) ≜ D_{χ²}( P_XY ‖ P_X P_Y ).

We simplify the notation in (1) for χ²-information as

(24)   sup { I_{χ²}(W;Y) : W → X → Y, I_{χ²}(W;X) ≤ r }.

The reasons to specifically study χ²-information are two-fold. First, it has been shown in [6] that I_{χ²}(X;Y) equals the sum of the principal inertia components (PICs) of X and Y. Moreover, if the PICs between X and Y are large, then the minimum mean square error (MMSE) of estimating X given Y will be small [6, Theorem 1], thereby enabling reliable estimation. Hence, if the goal of an estimation problem is to minimize the MMSE of estimating Y from the representation W, we can equivalently consider maximizing I_{χ²}(W;Y). Second, following the spirit of the IB, we also add the constraint I_{χ²}(W;X) ≤ r on the new representation W, as χ²-divergence provides sharp bounds for any f-divergence [17].

Due to the above connection between χ²-information and estimation problems, we call (24) the Estimation Bottleneck (EB) problem. Clearly, the set of points on the upper boundary of the achievable (I_{χ²}(W;X), I_{χ²}(W;Y)) pairs corresponds to the set of solutions of (24); see Fig. 1 (left). The bound shown there comes from the PIC analysis [6].
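The identity quoted from [6], that χ²-information equals the sum of the principal inertia components, can be checked numerically. The sketch below (illustrative, not from the paper) computes I_{χ²}(X;Y) both directly from its definition and through the singular values of the normalized joint-distribution matrix, whose squared values (excluding the trivial singular value 1) are the PICs.

```python
import numpy as np

def chi2_information(P_xy):
    """I_{chi^2}(X;Y): directly, and as the sum of principal inertia components."""
    P_xy = np.asarray(P_xy, dtype=float)
    p_x = P_xy.sum(axis=1)
    p_y = P_xy.sum(axis=0)
    direct = np.sum(P_xy**2 / np.outer(p_x, p_y)) - 1.0
    # Spectral route: PICs are the squared singular values of Q, the largest of which is 1.
    Q = P_xy / np.sqrt(np.outer(p_x, p_y))
    pics = np.sort(np.linalg.svd(Q, compute_uv=False)**2)[::-1][1:]
    return direct, pics.sum(), pics

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(chi2_information(P_xy))   # both routes agree; the PICs drive the MMSE bounds in [6]
```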

III-D Estimation Privacy Funnel

Motivated by the connection between χ²-information and estimation problems mentioned in Section III-C, we propose

(25)   inf { I_{χ²}(W;Y) : W → X → Y, I_{χ²}(W;X) ≥ r },

where the privacy is measured in terms of MMSE. The practical significance of (25) is justified as follows. Suppose Y represents private data (e.g., political preferences) and X (e.g., movie ratings) is correlated with Y. The main objective, formulated by (25), is to construct a privacy-assuring mapping P_{W|X} such that the information disclosed about Y by W is minimized, thus minimizing privacy leakage, while preserving the estimation efficiency that W provides about X. Similarly, the solutions of the estimation privacy funnel (25) correspond to the set of points on the lower boundary of the achievable (I_{χ²}(W;X), I_{χ²}(W;Y)) pairs; see Fig. 1 (left).

Fig. 1: P_{Y|X} is a BSC with crossover probability δ and X ∼ Bernoulli(p). Left: the Estimation Bottleneck and the Estimation Privacy Funnel. Right: set of achievable pairs of Arimoto mutual information for the BSC. Note that when β = 1, the upper and lower boundaries correspond to the IB and the PF.

IV Mrs. and Mr. Gerber's Lemmas

The study of upper and lower boundaries of achievable mutual information pairs is essential in multi-user information theory [13]. In the binary symmetric case, we not only rephrase Mrs. Gerber's Lemma [13], but also derive its counterpart for the PF. Furthermore, we discuss analogous results to Mrs. and Mr. Gerber's Lemmas for Arimoto conditional entropy.

IV-A Mr. Gerber's Lemma

We apply the duality argument developed in Section II-C, with g_1 = g_2 = H, to characterize the IB and the PF in the binary symmetric case. In particular, let P_{Y|X} be the BSC with crossover probability δ and X ∼ Bernoulli(p). We denote by h the binary entropy function h(q) ≜ −q log q − (1 − q) log(1 − q).

Let u ≜ H(X|W). It was shown in [12] that

(26)   H(Y|W) ≥ h( δ ∗ h⁻¹(u) ),

where h⁻¹ : [0,1] → [0, 1/2] is the inverse function of h and a ∗ b ≜ a(1 − b) + b(1 − a) for a, b ∈ [0,1]. Eq. (26) is well known as Mrs. Gerber's Lemma (MGL), and it is tight on the lower boundary of the achievable (H(X|W), H(Y|W)) pairs. In this case, the matched channel is also a BSC. Using the approach outlined in Section II, we derive a counterpart result for the upper boundary, and call it Mr. Gerber's Lemma.
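A small numerical sketch of Mrs. Gerber's Lemma for a BSC (illustrative only, not from the paper): it evaluates the lower bound h(δ ∗ h⁻¹(u)) on H(Y|W) for several values of u = H(X|W), computing h⁻¹ by bisection.

```python
import math

def h(q):
    """Binary entropy (bits)."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def h_inv(u, tol=1e-12):
    """Inverse of h restricted to [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < u else (lo, mid)
    return 0.5 * (lo + hi)

def binary_conv(a, b):
    """Binary convolution a * b = a(1-b) + b(1-a)."""
    return a * (1 - b) + b * (1 - a)

delta = 0.1                                     # BSC crossover probability (illustrative)
for u in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:        # u plays the role of H(X|W)
    print(u, h(binary_conv(delta, h_inv(u))))   # Mrs. Gerber: H(Y|W) >= h(delta * h^{-1}(u))
```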

Theorem 2 (Mr. Gerber’s Lemma).

For u ∈ [0, h(p)], we have

(27)

where the maximizing distributions are characterized in the proof in the Appendix.

Proof.

See the Appendix. ∎

In summary, in the binary symmetric case, the set of solutions for the IB follows from Mrs. Gerber's Lemma (26), and the set of solutions for the PF follows from Mr. Gerber's Lemma (27). The upper and lower boundaries of the achievable mutual information pairs are therefore characterized by Mrs. and Mr. Gerber's Lemmas, respectively.

IV-B Achievable Pairs of Arimoto Conditional Entropy

Besides f-divergences and the entropy functions, one can choose the ℓ^β-norm for g_1 and g_2 in (4), which results in Arimoto's version of conditional Rényi entropy (Arimoto conditional entropy) [15] of order β:

(28)   H_β(X|Y) ≜ (β/(1 − β)) log Σ_y p_Y(y) ( Σ_x P_{X|Y}(x|y)^β )^{1/β}.

When β = 1, we define H_1(X|Y) ≜ H(X|Y). Hence, the set of achievable Arimoto conditional entropy pairs can be obtained from (5) by the nonlinear mapping

(29)   (v_1, v_2) ↦ ( (β/(1 − β)) log v_1, (β/(1 − β)) log v_2 ).

With (28) at hand, the Arimoto mutual information [15] of order β can be defined as I_β(X;Y) ≜ H_β(X) − H_β(X|Y), where H_β is the Rényi entropy of order β. Arimoto conditional entropy has proven useful in approximating the minimum error probability of Bayesian M-ary hypothesis testing [15].
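For concreteness, here is a sketch (not from the paper; base-2 logarithms, β ≠ 1, and the example distribution are assumptions) of Arimoto conditional entropy and the induced Arimoto mutual information of order β, following the definitions above.

```python
import numpy as np

def renyi_entropy(p, beta):
    """Renyi entropy of order beta (bits), beta != 1."""
    return np.log2(np.sum(p**beta)) / (1.0 - beta)

def arimoto_conditional_entropy(P_xy, beta):
    """Arimoto conditional entropy H_beta(X|Y) (bits), beta > 0, beta != 1."""
    P_xy = np.asarray(P_xy, dtype=float)
    p_y = P_xy.sum(axis=0)
    P_x_given_y = P_xy / p_y[None, :]                           # columns are P_{X|Y=y}
    norms = np.sum(P_x_given_y**beta, axis=0)**(1.0 / beta)     # ||P_{X|Y=y}||_beta
    return (beta / (1.0 - beta)) * np.log2(np.sum(p_y * norms))

def arimoto_mutual_information(P_xy, beta):
    """I_beta(X;Y) = H_beta(X) - H_beta(X|Y)."""
    p_x = np.asarray(P_xy, dtype=float).sum(axis=1)
    return renyi_entropy(p_x, beta) - arimoto_conditional_entropy(P_xy, beta)

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
for beta in (0.5, 2.0, 1.001):       # values near 1 approach the Shannon quantities
    print(beta, arimoto_mutual_information(P_xy, beta))
```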

IV-C Arimoto's Mr. and Mrs. Gerber's Lemmas

Due to the importance of Arimoto conditional entropy [15], we study the extensions of Mr. and Mrs. Gerber’s Lemmas for Arimoto conditional entropy, naming them Arimoto’s Mr. and Mrs. Gerber’s Lemmas respectively.

Let g_1(q) ≜ ‖q‖_β and g_2(q) ≜ ‖q P_{Y|X}‖_β, so that (4) recovers the Arimoto quantities through the mapping (29). Since β/(1 − β) < 0 for β > 1, the mapping v ↦ (β/(1 − β)) log v is strictly decreasing, and the upper and lower boundaries of the achievable pairs are therefore exchanged under (29). In this case, following Section II-C, the boundaries are again obtained from the convex and concave envelopes of the corresponding function ψ_λ, which leads to the following theorem.

Theorem 3 (Arimoto’s Mrs. Gerber’s Lemma).

For β > 1 and u in the achievable range, we have

(30)

Proof.

See the Appendix. ∎

Analogous to this theorem, we also obtain the following generalization of Mr. Gerber’s Lemma.

Theorem 4 (Arimoto’s Mr. Gerber’s Lemma).

For β > 1 and u in the achievable range, we have

(31)

Consequently, for β > 1, Arimoto's Mrs. and Mr. Gerber's Lemmas jointly characterize the achievable sets of Arimoto conditional entropy pairs; see Fig. 1 (right).

V Final Remarks

In this paper, we study the geometric structure behind bottleneck problems, and generalize the IB and PF to a broader class of f-divergences. In particular, we consider estimation-theoretic variants of the IB and PF. Moreover, we show how bottleneck problems can be used to calculate the counterpart of Mrs. Gerber's Lemma (called Mr. Gerber's Lemma), and derive versions of Mrs. and Mr. Gerber's Lemmas for Arimoto conditional entropy. These results can be potentially useful for new applications of information theory in machine learning.

Appendix A: Lemma 2

Lemma 2 ([18]).

Let A be a connected and non-empty subset of R^d. Then every point of conv(A) can be written as a convex combination of at most d points of A.

Appendix B: Proof of Theorem 1 (Matched Channel)

Recall that the non-trivial part of F_S is determined parametrically by the points of the simplex at which ψ_λ does not match its convex envelope. Thus, the columns of the channel transformation matrix of a matched channel correspond to extreme points at which ψ_λ does match its convex envelope. By Lemma 2, any non-trivial convex combination of these points results in a point that lies on the convex envelope of ψ_λ and therefore determines a corresponding point on the graph of F_S.

Appendix C: Proof of Theorem 2 (Mr. Gerber's Lemma)

Take g_1 = g_2 = h, so that ψ_λ(q) = h(q ∗ δ) − λ h(q). For small λ, ψ_λ is convex in q. For larger λ, ψ_λ is concave in a region centered at q = 1/2, where it reaches a local maximum, and convex elsewhere. Consequently, the upper concave envelope of ψ_λ is obtained by mixing the endpoints q ∈ {0, 1}, at which ψ_λ equals h(δ), with points of this central region. Assuming p ≤ 1/2, the distribution that achieves the upper boundary is therefore of one of two forms:

  1. W binary, with p_{X|W} supported on one endpoint of [0,1] and one interior point.

  2. W ternary, with p_{X|W} supported on both endpoints of [0,1] and one interior point.

Rearranging the expressions obtained in cases (1) and (2), the result in (27) follows.

Appendix D: Proof of Theorem 3 (Arimoto's Mrs. Gerber's Lemma)

Note that q ↦ ‖q‖_β is convex for β > 1. For the relevant range of λ, ψ_λ is concave on an interval symmetric about q = 1/2, where it attains a local maximum, and convex elsewhere. By symmetry, the lower convex envelope of its graph is obtained by replacing ψ_λ on this interval with the chord connecting the endpoints of the interval. Therefore, for a given p, if p lies in this interval, then the envelope value at p is a convex combination of the values at the two endpoints; otherwise it coincides with ψ_λ(p). The result in (30) follows.

References

  • [1] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc. of IEEE Allerton, 2000.
  • [2] N. Tishby and N. Slonim, “Data clustering by Markovian relaxation and the information bottleneck method,” in Proc. of NIPS, 2001.
  • [3] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method,” in Proc. of ACM SIGIR, 2000.
  • [4] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in Proc. of IEEE ITW, 2015.
  • [5] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
  • [6] F. P. Calmon, A. Makhdoumi, M. Médard, M. Varia, M. Christiansen, and K. R. Duffy, “Principal inertia components and applications,” IEEE Trans. Inf. Theory, vol. 63, no. 9, pp. 5011–5038, 2017.
  • [7] F. P. Calmon, A. Makhdoumi, and M. Médard, “Fundamental limits of perfect privacy,” in Proc. of IEEE ISIT, 2015.
  • [8] P. Harremoës and N. Tishby, “The information bottleneck revisited or how to choose a good distortion measure,” in Proc. of IEEE ISIT, 2007.
  • [9] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan 2016.
  • [10] ——, “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC), vol. 6, pp. 2012–2016, 2014.
  • [11] F. P. Calmon, Y. Polyanskiy, and Y. Wu, “Strong data processing inequalities in power-constrained Gaussian channels,” in Proc. of IEEE ISIT, 2015.
  • [12] H. Witsenhausen and A. Wyner, “A conditional entropy bound for a pair of discrete random variables,” IEEE Trans. Inf. Theory, vol. 21, no. 5, pp. 493–501, 1975.
  • [13] A. El Gamal and Y.-H. Kim, Network Information Theory.   Cambridge University Press, 2011.
  • [14] C. Nair, “Upper concave envelopes and auxiliary random variables,” Int. J. Adv. Eng. Sci. Appl. Math., vol. 5, no. 1, pp. 12–20, 2013.
  • [15] I. Sason and S. Verdú, “Arimoto-Rényi conditional entropy and Bayesian M-ary hypothesis testing,” arXiv preprint arXiv:1701.01974, 2017.
  • [16] H. Hsu, S. Asoodeh, S. Salamatian, and F. P. Calmon, “Generalizing bottleneck problems - extended version.” [Online]. Available: https://github.com/HsiangHsu/ISIT-18-Extended-Version
  • [17] A. Makur and L. Zheng, “Bounds between contraction coefficients,” in Proc. of IEEE Allerton, 2015.
  • [18] H. G. Eggleston, Convexity.   Wiley Online Library, 1966.