The Multi-layer Information Bottleneck Problem

The multi-layer information bottleneck (IB) problem, where information is propagated (or successively refined) from layer to layer, is considered. Based on the information forwarded by the preceding layer, each stage of the network is required to preserve a certain level of relevance with regard to a specific hidden variable, quantified by the mutual information. The hidden variables and the source can be arbitrarily correlated. The optimal trade-off between the rates of relevance and compression (or complexity) is obtained through a single-letter characterization, referred to as the rate-relevance region. Conditions for successive refinability are given. A binary source with BSC hidden variables and a binary source with mixed BSC/BEC hidden variables are both proved to be successively refinable. We further extend our results to Gaussian models. A counterexample in which successive refinability does not hold is also provided.

I Introduction

A fundamental problem in statistical learning is to extract the relevant essence of data from high-dimensional, noisy sources. In supervised learning (e.g., speaker identification in speech recognition), a set of properties or statistical relationships is pre-specified as the relevant information of interest (e.g., the name, age, or gender of the speaker) to be learned from the data; in unsupervised learning, clusters or low-dimensional representations play the same role. This can be connected to the lossy source compression problem in information theory, where an original source is compressed subject to a specifically defined distortion (or loss) with regard to specified relevant information.

A remarkable step towards understanding the information relevance problem using fundamental information-theoretic concepts was made by Tishby et al. [1] with the introduction of the "information bottleneck" (IB) method. The relevant information in an observable variable $X$ is defined as the information that $X$ can provide about another hidden variable $Y$. The IB framework characterizes the trade-off between the information rate (or complexity) of the reproduction signal $U$, and the amount of mutual information it provides about $Y$. The IB method has been found useful in a wide variety of learning applications, e.g., word clustering [2] and image clustering [3], etc. In particular, interesting connections have recently been made between deep learning [4] and the successively refined IB method [5].

Despite the success of the IB method in the machine learning domain, less effort has been invested in studying it from an information-theoretic point of view. Gilad-Bachrach et al. [6] characterize the optimal trade-off between the rates of information and relevance, and provide a single-letter region. As a matter of fact, the conventional IB problem follows as a special instance of the conventional noisy lossy source coding problem [7]. Extensions of this information-theoretic framework address the collaborative IB problem by Vera et al. [8], and the distributed biclustering problem by Pichler et al. [9]. Further connections to the problem of joint testing and lossy reconstruction have recently been studied by Katz et al. [10]. Also in the information-theoretic context, the IB problem is closely related to the pattern classification problem studied in [11, 12, 13], which provides another operational meaning to IB.

In this work, we introduce and investigate the multi-layer IB problem with non-identical hidden variables at each layer. This scenario is highly motivated by deep neural networks (DNNs) and the recent work in [5]. Along the propagation of a DNN, each layer compresses its input, which is the output of the preceding layer, to a lower dimensional output, which is forwarded to the next layer. Another scenario may be the hierarchical, multi-layer network, in which information is propagated from higher layers to lower layers sequentially. Users in different layers may be interested in different properties of the original source. The main result of this paper is the full characterization of the rate-relevance region of the multi-layer IB problem. Conditions are provided for successive refinability, in the sense of the existence of codes that asymptotically achieve the rate-relevance function simultaneously at all the layers. A binary source with BSC hidden variables and a binary source with mixed BSC/BEC hidden variables (BSC hidden variables are obtained by passing the source through a binary symmetric channel, whereas BEC hidden variables are obtained through a binary erasure channel) are both proved to be successively refinable. Successive refinability is also shown for Gaussian sources. We further present a counterexample for which successive refinability no longer holds. It is worth mentioning that the successive refinability of the IB problem is also investigated in [14], with identical hidden variables.

The rest of the paper is organized as follows. Section II provides the definitions and presents the main result, the achievability and converse proofs of which are provided in the Appendices. The definition and conditions of successive refinability are shown in Section III. Examples are presented in Section IV. Finally, we conclude the paper in Section V.

II Problem formulation

Let $(X^n, Y_1^n, \ldots, Y_L^n)$ be a sequence of $n$ i.i.d. copies of the discrete random variables $(X, Y_1, \ldots, Y_L)$ taking values in the finite alphabets $\mathcal{X} \times \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_L$, jointly distributed according to $p(x, y_1, \ldots, y_L)$, where $X$ is an observable variable, while $Y_1, \ldots, Y_L$ are hidden variables arbitrarily correlated with $X$.

An $(n, R_1, \ldots, R_L)$ code for the $L$-layer IB problem, as illustrated in Fig. 1, consists of $L$ encoding functions $f_1, \ldots, f_L$, defined as $f_1: \mathcal{X}^n \to \mathcal{Z}_1$ and $f_l: \mathcal{Z}_{l-1} \to \mathcal{Z}_l$ for $l = 2, \ldots, L$, where we set $Z_1 = f_1(X^n)$, and $Z_l = f_l(Z_{l-1})$, $l = 2, \ldots, L$. That is, $R_l = \frac{1}{n}\log|\mathcal{Z}_l|$ is the rate of the $l$-th layer encoding function $f_l$, $l = 1, \ldots, L$, and we assume $R_1 \ge R_2 \ge \cdots \ge R_L$.

Definition 1.

(Achievability) For non-negative values $R_1, \ldots, R_L$ and $\mu_1, \ldots, \mu_L$, the tuple $(R_1, \ldots, R_L, \mu_1, \ldots, \mu_L)$ is said to be achievable if, for every $\epsilon > 0$, there exists an $(n, R_1, \ldots, R_L)$-code s.t.

$$\frac{1}{n} I(Y_l^n; Z_l) \ge \mu_l - \epsilon, \qquad l = 1, \ldots, L, \tag{1}$$

for sufficiently large $n$, where $Z_1 = f_1(X^n)$, $Z_l = f_l(Z_{l-1})$ for $l = 2, \ldots, L$, and $Y_l^n = (Y_{l,1}, \ldots, Y_{l,n})$.

The value of $\mu_l$ imposes a lower bound on $\frac{1}{n} I(Y_l^n; Z_l)$, i.e., the relevance with respect to the hidden variable $Y_l$ after $l$-layer encoding of the observable sequence $X^n$. Our goal is to characterize the rate-relevance region, $\mathcal{R}$, which is the set of all achievable tuples $(R_1, \ldots, R_L, \mu_1, \ldots, \mu_L)$.

Theorem 1.

The rate-relevance region, $\mathcal{R}$, is characterized by the closure of the set of all tuples $(R_1, \ldots, R_L, \mu_1, \ldots, \mu_L)$ that satisfy

$$R_l \ge I(X; U_l, \ldots, U_L), \qquad l = 1, \ldots, L, \tag{2}$$

for some probability distribution $p(u_1, \ldots, u_L \mid x)$ s.t.

$$\mu_l \le I(Y_l; U_l, \ldots, U_L), \qquad l = 1, \ldots, L. \tag{3}$$
Proof.

A proof is provided in the Appendices. ∎

III Successive Refinability of Multi-Layer IB

The rate-relevance function for a single-layer setting with relevance constraint $\mu$ regarding the hidden variable $Y$ is denoted by $R_{X \to Y}(\mu)$, and characterized in [6] as:

$$R_{X \to Y}(\mu) = \min_{p(u|x):\; U - X - Y,\; I(Y;U) \ge \mu} I(X; U). \tag{4}$$
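Since all alphabets here are finite, the minimization in (4) can be checked numerically. The following sketch (an illustration we add here, not part of the original analysis) brute-forces (4) for a binary uniform source observed through a BSC, restricting the search to binary auxiliaries $U$, an assumption that happens to suffice for this symmetric example; the grid search can only approximate the true minimum from above.

```python
import numpy as np

def Hb(q):
    """Binary entropy in bits, clipped away from the endpoints."""
    q = min(max(q, 1e-12), 1 - 1e-12)
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def Hb_inv(h):
    """Inverse of Hb restricted to [0, 1/2], via bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if Hb(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mutual_information(P):
    """I(A;B) in bits from a joint probability table P[a, b]."""
    P = np.asarray(P, dtype=float)
    outer = P.sum(axis=1, keepdims=True) @ P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / outer[mask])).sum())

# Toy source: X ~ Bern(1/2), Y the output of a BSC with crossover p.
p, mu = 0.1, 0.3
Pxy = np.array([[0.5 * (1 - p), 0.5 * p],
                [0.5 * p, 0.5 * (1 - p)]])
px = Pxy.sum(axis=1)

# Brute-force the minimization in (4) over binary test channels p(u|x),
# parameterized by a = p(U=1|X=0) and b = p(U=1|X=1).
best = np.inf
for a in np.linspace(0.0, 1.0, 101):
    for b in np.linspace(0.0, 1.0, 101):
        Pu_given_x = np.array([[1 - a, a],
                               [1 - b, b]])
        Pxu = px[:, None] * Pu_given_x   # joint of (X, U)
        Pyu = Pxy.T @ Pu_given_x         # joint of (Y, U), via U - X - Y
        if mutual_information(Pyu) >= mu:
            best = min(best, mutual_information(Pxu))

# Closed form for this source (derived in Section IV-A via Mrs. Gerber's Lemma).
beta = (Hb_inv(1 - mu) - p) / (1 - 2 * p)
closed_form = 1 - Hb(beta)
```

The brute-force value `best` can never fall below the converse value `closed_form`, and approaches it from above as the grid is refined.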
Definition 2.

Source $X$ is said to be successively refinable for the $L$-layer IB problem with regard to the correlated relevant hidden variables $Y_1, \ldots, Y_L$ with relevance constraints $\mu_1, \ldots, \mu_L$, respectively, if

$$\left(R_{X \to Y_1}(\mu_1), \ldots, R_{X \to Y_L}(\mu_L), \mu_1, \ldots, \mu_L\right) \in \mathcal{R}. \tag{5}$$
Theorem 2.

Source $X$ is successively refinable for the $L$-layer IB problem with relevance constraints $\mu_1$, …, $\mu_L$ with regard to the hidden variables $Y_1, \ldots, Y_L$, iff there exist random variables $U_1, \ldots, U_L$, satisfying $(Y_1, \ldots, Y_L) - X - (U_1, \ldots, U_L)$, such that the following conditions hold simultaneously for $l = 1, \ldots, L$:

1. $I(Y_l; U_l, \ldots, U_L) \ge \mu_l$,

2. $I(X; U_l, \ldots, U_L) = R_{X \to Y_l}(\mu_l)$.

Proof.

Theorem 2 follows directly from Definition 2 and Theorem 1. ∎

IV Examples

IV-A Binary Source with Symmetric Hidden Variables

We consider $\mathcal{X} = \mathcal{Y}_l = \{0, 1\}$, $l = 1, \ldots, L$. The observable variable $X \sim \mathrm{Bern}(1/2)$, and the hidden variables are obtained by passing the source through $L$ independent BSCs, i.e., $Y_l = X \oplus N_l$, where $N_l \sim \mathrm{Bern}(p_l)$, $p_l \in (0, 1/2)$, is independent of $X$, and $\oplus$ denotes modulo-2 addition.

We first derive the rate-relevance function $R_{X \to Y_l}(\mu_l)$. Denote by $U_l$ any random variable for which $U_l - X - Y_l$ and $I(Y_l; U_l) \ge \mu_l$. We have the following inequality:

$$\begin{aligned}
\mu_l &\le H(Y_l) - H(X \oplus N_l \mid U_l) && (6a)\\
&\le 1 - H_b\!\left(p_l * H_b^{-1}(H(X \mid U_l))\right) && (6b)\\
&= 1 - H_b\!\left(p_l * H_b^{-1}(1 - I(X; U_l))\right), && (6c)
\end{aligned}$$

where the operation $a * b$ is defined as $a * b = a(1-b) + b(1-a)$, $H_b$ is the binary entropy function, defined as $H_b(p) = -p\log p - (1-p)\log(1-p)$, and $H_b^{-1}$ is the inverse of the binary entropy function with $H_b^{-1}(\cdot) \in [0, 1/2]$. (6b) follows from Mrs. Gerber's Lemma and the fact that $H(Y_l \mid U_l) = H(X \oplus N_l \mid U_l)$. From (6), we obtain $H_b^{-1}(1 - I(X; U_l)) \le \frac{H_b^{-1}(1 - \mu_l) - p_l}{1 - 2p_l}$. Thus, we have $R_{X \to Y_l}(\mu_l) \ge 1 - H_b\!\left(\frac{H_b^{-1}(1 - \mu_l) - p_l}{1 - 2p_l}\right)$. Note that by letting $U_l = X \oplus V_l$, where $V_l \sim \mathrm{Bern}(\beta_l)$ with $\beta_l = \frac{H_b^{-1}(1 - \mu_l) - p_l}{1 - 2p_l}$ is independent of $X$ and $N_l$, we have $I(Y_l; U_l) = \mu_l$ and $I(X; U_l) = 1 - H_b(\beta_l)$. We can conclude that $R_{X \to Y_l}(\mu_l) = 1 - H_b\!\left(\frac{H_b^{-1}(1 - \mu_l) - p_l}{1 - 2p_l}\right)$, and the $U_l$ given above is a rate-relevance function achieving auxiliary random variable.
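As a quick numerical sanity check of this closed form (our own illustration; the values of $p$ and $\mu$ are arbitrary), the achieving test channel $U = X \oplus V$ can be evaluated directly: the induced relevance $I(Y; U) = 1 - H_b(\beta * p)$ should land exactly on the constraint $\mu$.

```python
import math

def Hb(q):
    """Binary entropy in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def Hb_inv(h):
    """Inverse of Hb restricted to [0, 1/2], via bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if Hb(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bconv(a, b):
    """Binary convolution a * b = a(1-b) + b(1-a)."""
    return a * (1 - b) + b * (1 - a)

p, mu = 0.1, 0.3                             # illustrative BSC crossover and constraint
beta = (Hb_inv(1 - mu) - p) / (1 - 2 * p)    # test-channel crossover of U = X xor V
rate = 1 - Hb(beta)                          # I(X;U), i.e. the rate-relevance function
relevance = 1 - Hb(bconv(beta, p))           # I(Y;U)
```

Here `relevance` reproduces $\mu$ up to bisection precision, and `rate` is the corresponding point on the rate-relevance curve.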

Lemma 1.

Binary sources as described above are always successively refinable for the $L$-layer IB problem if $p_1 \ge p_2 \ge \cdots \ge p_L$ and $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_L$.

Proof.

Since $p_1 \ge \cdots \ge p_L$ and $\mu_1 \ge \cdots \ge \mu_L$ imply $\beta_1 \le \beta_2 \le \cdots \le \beta_L$ for the optimal test-channel parameters $\beta_l = \frac{H_b^{-1}(1 - \mu_l) - p_l}{1 - 2p_l}$ of Section IV-A, we can find binary variables $V_1, \ldots, V_L$, with $V_l \sim \mathrm{Bern}(\alpha_l)$, independent of each other and of $X$, such that $\alpha_1 * \alpha_2 * \cdots * \alpha_l = \beta_l$ for $l = 1, \ldots, L$. By choosing the auxiliary random variables $U_l = X \oplus V_1 \oplus \cdots \oplus V_l$, we have $I(Y_l; U_l, \ldots, U_L) = I(Y_l; U_l) = \mu_l$ and $I(X; U_l, \ldots, U_L) = I(X; U_l) = R_{X \to Y_l}(\mu_l)$, for $l = 1, \ldots, L$, since $X - U_1 - U_2 - \cdots - U_L$ forms a Markov chain. Together with Theorem 2, this concludes the proof of Lemma 1. ∎
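The nested construction used in the proof can also be checked numerically (our own sketch; the parameter values are illustrative and chosen so that the optimal test-channel parameters satisfy $\beta_1 \le \beta_2$): cascading $U_1 = X \oplus V_1$ and $U_2 = U_1 \oplus V_2$ realizes both optimal test channels simultaneously.

```python
import math

def Hb(q):
    """Binary entropy in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def Hb_inv(h):
    """Inverse of Hb restricted to [0, 1/2], via bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if Hb(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bconv(a, b):
    """Binary convolution a * b = a(1-b) + b(1-a)."""
    return a * (1 - b) + b * (1 - a)

p1, p2 = 0.2, 0.1        # BSC parameters, p1 >= p2
mu1, mu2 = 0.25, 0.15    # relevance constraints, mu1 >= mu2
beta1 = (Hb_inv(1 - mu1) - p1) / (1 - 2 * p1)
beta2 = (Hb_inv(1 - mu2) - p2) / (1 - 2 * p2)
# V2 ~ Bern(alpha) solves beta1 * alpha = beta2, giving U2 = U1 xor V2
alpha = (beta2 - beta1) / (1 - 2 * beta1)
rate1, rate2 = 1 - Hb(beta1), 1 - Hb(beta2)     # I(X;U1), I(X;U2)
rel2 = 1 - Hb(bconv(bconv(beta1, alpha), p2))   # I(Y2;U2) through the cascade
```

The cascade meets the layer-2 relevance constraint exactly while the per-layer rates are nonincreasing, as the coding theorem requires.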

IV-B Binary Source with Mixed Hidden Variables

Here we consider a two-layer IB problem, i.e., $L = 2$. The joint distribution of $(X, Y_1, Y_2)$ is illustrated in Fig. 2, where $X$ is a binary random variable of distribution $\mathrm{Bern}(1/2)$ as in the previous example, but $Y_1$ is the output of a binary erasure channel (BEC) with erasure probability $\epsilon$ ($0 < \epsilon < 1$) when $X$ is the input, and $Y_2$ is the output of a binary symmetric channel (BSC) with crossover probability $p$, $p \in (0, 1/2)$. A similar example can be found in [15], where the optimality of the proposed coding scheme does not always hold in their setting. We first derive the rate-relevance function $R_{X \to Y_1}(\mu_1)$. Denote by $U_1$ any random variable such that $U_1 - X - Y_1$ and $I(Y_1; U_1) \ge \mu_1$. We have the following chain of equalities:

$$\begin{aligned}
H(Y_1 \mid U_1) &= H(Y_1 \mid U_1) + H(X \mid U_1, Y_1) - H(X \mid U_1, Y_1) && (7a)\\
&= H(Y_1, X \mid U_1) - H(X \mid U_1, Y_1) && (7b)\\
&= H(X \mid U_1) + H(Y_1 \mid X, U_1) - H(X \mid U_1, Y_1) && (7c)\\
&= 1 - I(X; U_1) + H(Y_1 \mid X) - \big(p(Y_1 \ne e) H(X \mid U_1, X) + p(Y_1 = e) H(X \mid U_1, Y_1 = e)\big) && (7d)\\
&= 1 - I(X; U_1) + H_b(\epsilon) - \epsilon H(X \mid U_1) && (7e)\\
&= 1 + (\epsilon - 1) I(X; U_1) - \epsilon + H_b(\epsilon). && (7f)
\end{aligned}$$

Since $\mu_1 \le I(Y_1; U_1) = H(Y_1) - H(Y_1 \mid U_1)$, where $H(Y_1) = 1 - \epsilon + H_b(\epsilon)$, it follows that $\mu_1 \le (1 - \epsilon) I(X; U_1)$, i.e., $R_{X \to Y_1}(\mu_1) = \frac{\mu_1}{1 - \epsilon}$, which can be achieved by setting $U_1 = X \oplus V_1$, where $V_1 \sim \mathrm{Bern}(\beta_1)$ with $1 - H_b(\beta_1) = \frac{\mu_1}{1 - \epsilon}$ is independent of $X$. We have $R_{X \to Y_2}(\mu_2) = 1 - H_b\!\left(\frac{H_b^{-1}(1 - \mu_2) - p}{1 - 2p}\right)$ from Section IV-A, which can be achieved by setting $U_2 = X \oplus V_2$, where $V_2 \sim \mathrm{Bern}(\beta_2)$ with $\beta_2 = \frac{H_b^{-1}(1 - \mu_2) - p}{1 - 2p}$ is independent of $X$.
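The key identity behind this derivation, $I(Y_1; U_1) = (1 - \epsilon) I(X; U_1)$ for any $U_1$ with $U_1 - X - Y_1$, can be verified on the joint table of $(Y_1, U_1)$ (a sketch we add for illustration; the values of $\epsilon$ and the test-channel crossover below are arbitrary):

```python
import numpy as np

def mutual_information(P):
    """I(A;B) in bits from a joint probability table P[a, b]."""
    P = np.asarray(P, dtype=float)
    outer = P.sum(axis=1, keepdims=True) @ P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / outer[mask])).sum())

eps, beta = 0.3, 0.11   # BEC erasure probability and test-channel crossover
# X ~ Bern(1/2), U1 = X xor V1 with V1 ~ Bern(beta): joint p(x, u1)
Pxu = 0.5 * np.array([[1 - beta, beta],
                      [beta, 1 - beta]])
# Y1 in {0, 1, e}: rows 0 and 1 are the non-erased outputs, row 2 the erasure
Pyu = np.vstack([(1 - eps) * Pxu,
                 eps * Pxu.sum(axis=0, keepdims=True)])
I_XU = mutual_information(Pxu)
I_YU = mutual_information(Pyu)
```

The erased output contributes no information, so the relevance is exactly the non-erasure probability times $I(X; U_1)$.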

Lemma 2.

The binary source with mixed BEC/BSC hidden variables as described above is always successively refinable for the two-layer IB problem if $R_{X \to Y_1}(\mu_1) \ge R_{X \to Y_2}(\mu_2)$, i.e., if $\beta_1 \le \beta_2$ for the optimal test-channel parameters given above.

Proof.

The proof follows the same arguments as in the proof of Lemma 1. ∎

Note that, contrary to what one might expect, successive refinability still holds in this example despite the mixed hidden variables, since an auxiliary random variable of the form $U_1 = X \oplus V_1$ achieves the rate-relevance function even for the BEC hidden variable.

IV-C Jointly Gaussian Source and Hidden Variables

It is not difficult to verify that the above achievability results remain valid for Gaussian sources by employing a quantization procedure over the sources and appropriate test channels [16].

Let $X$ and $Y_l$, $l = 1, \ldots, L$, be jointly Gaussian zero-mean random variables, such that $Y_l = X + N_l$, where $X \sim \mathcal{N}(0, \sigma_x^2)$ and $N_l \sim \mathcal{N}(0, \sigma_{N_l}^2)$ is independent of $X$, $l = 1, \ldots, L$. As in the previous examples, we first derive a lower bound on the rate-relevance function $R_{X \to Y_l}(\mu_l)$. Denote by $U_l$ any random variable such that $U_l - X - Y_l$ and $I(Y_l; U_l) \ge \mu_l$. We have the following sequence of inequalities:

$$\begin{aligned}
\mu_l &\le I(Y_l; U_l) && (8a)\\
&= h(Y_l) - h(X + N_l \mid U_l) && (8b)\\
&\le \tfrac{1}{2}\log\!\left(2\pi e(\sigma_x^2 + \sigma_{N_l}^2)\right) - \tfrac{1}{2}\log\!\left(2\pi e \sigma_{N_l}^2 + \exp(2 h(X \mid U_l))\right) && (8c)\\
&= \tfrac{1}{2}\log\!\left(2\pi e(\sigma_x^2 + \sigma_{N_l}^2)\right) - \tfrac{1}{2}\log\!\left(2\pi e \sigma_{N_l}^2 + \exp(2 h(X) - 2 I(X; U_l))\right) && (8d)\\
&= \tfrac{1}{2}\log\!\left(2\pi e(\sigma_x^2 + \sigma_{N_l}^2)\right) - \tfrac{1}{2}\log\!\left(2\pi e \sigma_{N_l}^2 + 2\pi e \sigma_x^2 \exp(-2 I(X; U_l))\right), && (8e)
\end{aligned}$$

where (8c) follows from the conditional entropy power inequality (EPI) (Section 2.2 in [16]), and (8d) from $h(X \mid U_l) = h(X) - I(X; U_l)$. Rearranging (8), we obtain a lower bound on $R_{X \to Y_l}(\mu_l)$:

$$R_{X \to Y_l}(\mu_l) \ge \frac{1}{2}\log\frac{2^{2\mu_l}\sigma_x^2}{\sigma_x^2 + \sigma_{N_l}^2 - 2^{2\mu_l}\sigma_{N_l}^2}, \tag{9}$$

which is achieved with equality by setting $U_l = X + P_l$, where $P_l \sim \mathcal{N}(0, \sigma_{P_l}^2)$ is independent of $X$, $l = 1, \ldots, L$, and $\sigma_{P_l}^2$ is given by:

$$\sigma_{P_l}^2 = \frac{2\pi e\, \sigma_x^2\, 2^{-2R_{X \to Y_l}(\mu_l)}}{2\pi e\left(1 - 2^{-2R_{X \to Y_l}(\mu_l)}\right)} = \frac{\sigma_x^2}{2^{2R_{X \to Y_l}(\mu_l)} - 1}. \tag{10}$$
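As a numerical check of (9) and (10) (an illustration we add; the variances and $\mu_l$ below are arbitrary, with logs taken in bits), the Gaussian test channel $U_l = X + P_l$ with the variance in (10) meets the rate bound and the relevance constraint simultaneously, since the EPI step holds with equality for Gaussian conditionals:

```python
import math

sx2, sn2, mu = 1.0, 0.5, 0.3   # source variance, noise variance, relevance constraint
# Lower bound (9) on the rate
R = 0.5 * math.log2(2 ** (2 * mu) * sx2 / (sx2 + sn2 - 2 ** (2 * mu) * sn2))
# Test-channel variance from (10)
sp2 = sx2 / (2 ** (2 * R) - 1)
# For jointly Gaussian Y = X + N and U = X + P (N, P independent of X):
I_XU = 0.5 * math.log2(1 + sx2 / sp2)
sy2, su2 = sx2 + sn2, sx2 + sp2
I_YU = 0.5 * math.log2(sy2 * su2 / (sy2 * su2 - sx2 ** 2))
```

With this choice, the rate $I(X; U_l)$ equals the bound (9) and the relevance $I(Y_l; U_l)$ equals $\mu_l$, confirming that (9) is the rate-relevance function.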
Lemma 3.

Gaussian sources as described above are always successively refinable for the $L$-layer IB problem if $\sigma_{N_1}^2 \ge \sigma_{N_2}^2 \ge \cdots \ge \sigma_{N_L}^2$ and $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_L$.

IV-D Counterexample on Successive Refinability

In this section, we show that the multi-layer IB problem is not always successively refinable. We consider a two-layer IB problem, i.e., $L = 2$. Let $X = (X_1, X_2)$, where $X_1$ and $X_2$ are two independent discrete random variables, and we have $Y_1 = X_1$ and $Y_2 = X_2$, with $\mu_1 \le H(X_1)$ and $\mu_2 \le H(X_2)$. We first derive the rate-relevance function $R_{X \to Y_1}(\mu_1)$. Denote by $U_1$ any random variable such that $U_1 - X - Y_1$ and $I(Y_1; U_1) \ge \mu_1$. We have:

$$\begin{aligned}
I(X; U_1) &= I(X_1, X_2; U_1) && (11a)\\
&\ge I(X_1; U_1) && (11b)\\
&\ge \mu_1. && (11c)
\end{aligned}$$

By setting $U_1^*$ as

$$U_1^* = \begin{cases} X_1, & \text{with probability } \frac{\mu_1}{H(X_1)},\\ 0, & \text{with probability } 1 - \frac{\mu_1}{H(X_1)}, \end{cases} \tag{12}$$

where $0$ denotes a constant symbol outside the alphabet of $X_1$, we have $I(Y_1; U_1^*) = \mu_1$ and $I(X; U_1^*) = \mu_1$, which achieves the lower bound shown in (11). We can conclude that $R_{X \to Y_1}(\mu_1) = \mu_1$, and any rate-relevance function achieving random variable $U_1^*$ should satisfy $I(U_1^*; X_2) = 0$, since equality in (11b) implies $I(X_2; U_1^* \mid X_1) = 0$, and $I(U_1^*; X_2) \le I(X_1; X_2) + I(X_2; U_1^* \mid X_1) = 0$. Similarly, we can conclude that $R_{X \to Y_2}(\mu_2) = \mu_2$, and any rate-relevance function achieving random variable $U_2^*$ should satisfy $I(U_2^*; X_1) = 0$.
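The properties of $U_1^*$ can be checked directly on the joint table of $(X, U_1^*)$ (our own sketch, taking $X_1, X_2$ i.i.d. Bern(1/2), so that $H(X_1) = 1$ bit, and an arbitrary $\mu_1$; the constant output of $U_1^*$ is encoded as a third symbol distinct from the values of $X_1$):

```python
import numpy as np

def mutual_information(P):
    """I(A;B) in bits from a joint probability table P[a, b]."""
    P = np.asarray(P, dtype=float)
    outer = P.sum(axis=1, keepdims=True) @ P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / outer[mask])).sum())

mu1 = 0.6            # relevance constraint; mu1 <= H(X1) = 1 bit
theta = mu1 / 1.0    # reveal probability mu1 / H(X1) from (12)

# Rows index x = (x1, x2) as 2*x1 + x2; columns index u1 in {0, 1, const}.
Pxu = np.zeros((4, 3))
for x1 in (0, 1):
    for x2 in (0, 1):
        Pxu[2 * x1 + x2, x1] += 0.25 * theta         # U1* reveals X1
        Pxu[2 * x1 + x2, 2] += 0.25 * (1 - theta)    # U1* outputs the constant

I_XU = mutual_information(Pxu)                        # I(X; U1*)
Px1u = np.array([Pxu[0] + Pxu[1], Pxu[2] + Pxu[3]])  # joint of (Y1, U1*) = (X1, U1*)
Px2u = np.array([Pxu[0] + Pxu[2], Pxu[1] + Pxu[3]])  # joint of (X2, U1*)
I_Y1U = mutual_information(Px1u)
I_X2U = mutual_information(Px2u)
```

Both $I(X; U_1^*)$ and $I(Y_1; U_1^*)$ equal $\mu_1$, while $U_1^*$ carries no information about $X_2$, matching the claims above.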

Lemma 4.

Source $X$ with hidden variables $Y_1$ and $Y_2$ as described above is not successively refinable for the two-layer IB problem whenever $\mu_2 > 0$.

Proof.

For any rate-relevance function achieving random variables $U_1^*$ and $U_2^*$, we have

$$\begin{aligned}
I(U_2^*; X \mid U_1^*) &= I(U_2^*; X_1, X_2 \mid U_1^*) && (13a)\\
&= I(U_2^*; X_2 \mid U_1^*) + I(U_2^*; X_1 \mid U_1^*, X_2) && (13b)\\
&\ge I(U_2^*; X_2 \mid U_1^*) && (13c)\\
&= I(U_1^*, U_2^*; X_2) - I(U_1^*; X_2) && (13d)\\
&= I(U_2^*; X_2) + I(U_1^*; X_2 \mid U_2^*) && (13e)\\
&\ge I(U_2^*; X_2) \ge \mu_2, && (13f)
\end{aligned}$$

where (13e) is due to $I(U_1^*; X_2) = 0$, which follows from

$$\begin{aligned}
I(X_1; X_2 \mid U_1^*) &= I(X_1, U_1^*; X_2) - I(U_1^*; X_2) && (14a)\\
&= I(X_2; X_1) + I(X_2; U_1^* \mid X_1) - I(U_1^*; X_2) && (14b)\\
&= -I(U_1^*; X_2), && (14c)
\end{aligned}$$

together with the non-negativity of mutual information.

If $\mu_2 > 0$, then $I(U_2^*; X \mid U_1^*) > 0$, which implies that $U_2^* - U_1^* - X$ cannot form a Markov chain for any rate-relevance function achieving random variables $U_1^*$ and $U_2^*$; consequently, $I(X; U_1^*, U_2^*) = I(X; U_1^*) + I(U_2^*; X \mid U_1^*) > \mu_1 = R_{X \to Y_1}(\mu_1)$. With Theorem 2, we have proven Lemma 4. ∎

V Conclusion

The multi-layer IB problem with non-identical relevant hidden variables was investigated. A single-letter expression of the rate-relevance region was given. The definition of and conditions for successive refinability were presented, which were further investigated for binary and Gaussian sources. A counterexample to successive refinability was also provided.

References

• [1] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc. of Annu. Allerton Conf. Commun., Contr. Comput., Monticello, IL, Sep. 1999, pp. 368–377.
• [2] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method,” in Proc. of Annu. Conf. Research and Development in inform. retrieval, Athens, Greece, 2000, pp. 208–215.
• [3] J. Goldberger, H. Greenspan, and S. Gordon, “Unsupervised image clustering using the information bottleneck method,” in Joint Pattern Recog. Symp., Zurich, Switzerland, Sep. 2002, pp. 158–165.
• [4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
• [5] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW), April 2015, pp. 1–5.
• [6] R. Gilad-Bachrach, A. Navot, and N. Tishby, “An information theoretic tradeoff between complexity and accuracy,” in Proc. of Conf. Learning Theory (COLT), Washington, Aug. 2003, pp. 595–609.
• [7] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IRE Transactions on Information Theory, vol. 8, no. 5, pp. 293–304, September 1962.
• [8] M. Vera, L. R. Vega, and P. Piantanida, “Collaborative representation learning,” CoRR, vol. abs/1604.01433, 2016. [Online]. Available: http://arxiv.org/abs/1604.01433
• [9] G. Pichler, P. Piantanida, and G. Matz, “Distributed information-theoretic biclustering,” in 2016 IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 1083–1087.
• [10] G. Katz, P. Piantanida, and M. Debbah, “Distributed binary detection with lossy data compression,” IEEE Transactions on Information Theory, vol. PP, no. 99, pp. 1–1, 2017.
• [11] F. M. J. Willems, T. Kalker, J. Goseling, and J.-P. Linnartz, “On the capacity of a biometrical identification system,” in Proc. IEEE Int. Symp. Inform. Theory (ISIT), Jun. 2003, p. 82.
• [12] E. Tuncel, “Capacity/storage tradeoff in high-dimensional identification systems,” IEEE Trans. Inform. Theory, vol. 55, no. 5, May 2009.
• [13] E. Tuncel and D. Gunduz, “Identification and lossy reconstruction in noisy databases,” IEEE Trans. Inform. Theory, vol. 60, no. 2, Feb. 2014.
• [14] C. Tian and J. Chen, “Successive refinement for hypothesis testing and lossless one-helper problem,” IEEE Trans. Inform. Theory, vol. 54, no. 10, pp. 4666–4681, 2008.
• [15] J. Villard and P. Piantanida, “Secure multiterminal source coding with side information at the eavesdropper,” IEEE Trans. Inform. Theory, vol. 59, no. 6, pp. 3668–3692, 2013.
• [16] A. El Gamal and Y.-H. Kim, Network Information Theory.   New York, NY, USA: Cambridge University Press, 2012.

Appendix A Achievability of Theorem 1

Consider first the direct part, i.e., that every tuple $(R_1, \ldots, R_L, \mu_1, \ldots, \mu_L)$ satisfying (2) and (3) is achievable.

Code generation. Fix a conditional probability mass function (pmf) $p(u_1, \ldots, u_L \mid x)$ such that $\mu_l \le I(Y_l; U_l, \ldots, U_L)$, for $l = 1, \ldots, L$. First randomly generate $2^{nR_L}$ sequences $u_L^n(i_L)$, $i_L = 1, \ldots, 2^{nR_L}$, independent and identically distributed (i.i.d.) according to $p(u_L)$; then for each $u_L^n(i_L)$ randomly generate $2^{n(R_{L-1} - R_L)}$ sequences $u_{L-1}^n(i_{L-1}, i_L)$, $i_{L-1} = 1, \ldots, 2^{n(R_{L-1} - R_L)}$, conditionally i.i.d. according to $p(u_{L-1} \mid u_L)$; and continue in the same manner, for each $(u_{l+1}^n(i_{l+1}, \ldots, i_L), \ldots, u_L^n(i_L))$ randomly generate $2^{n(R_l - R_{l+1})}$ sequences $u_l^n(i_l, \ldots, i_L)$, $i_l = 1, \ldots, 2^{n(R_l - R_{l+1})}$, conditionally i.i.d. according to $p(u_l \mid u_{l+1}, \ldots, u_L)$, for $l = L-2, \ldots, 1$.

Encoding and Decoding. After observing $x^n$, the first encoder finds an index tuple $(i_1, \ldots, i_L)$ such that $(x^n, u_1^n(i_1, \ldots, i_L), \ldots, u_L^n(i_L))$ is in the set $T_\epsilon^n(X, U_1, \ldots, U_L)$, which is the set of jointly typical vectors of the random variables $(X, U_1, \ldots, U_L)$. If more than one such tuple exists, any one of them is selected. If no such tuple exists, we call it an error, and set $(i_1, \ldots, i_L) = (1, \ldots, 1)$. Then the $l$-th encoder outputs $Z_l = (i_l, \ldots, i_L)$ at a total rate of $R_l$, for $l = 1, \ldots, L$, and sends to the $(l+1)$-th encoder, if $l < L$, the index tuple $(i_{l+1}, \ldots, i_L)$. Given the index tuple $(i_l, \ldots, i_L)$, the $l$-th decoder declares $u_l^n(i_l, \ldots, i_L)$ as its output, for $l = 1, \ldots, L$.

Relevance. First, we note that if there is no error in the encoding step, i.e., an index tuple $(i_1, \ldots, i_L)$ such that $(x^n, u_1^n(i_1, \ldots, i_L), \ldots, u_L^n(i_L)) \in T_\epsilon^n(X, U_1, \ldots, U_L)$ is found, then the relevance condition $\frac{1}{n} I(Y_l^n; Z_l) \ge \mu_l - \epsilon$, $l = 1, \ldots, L$, is satisfied by the definition of $p(u_1, \ldots, u_L \mid x)$ and the Markov lemma. We then focus on the analysis of the probability of error, i.e., the probability that such an index tuple cannot be found in the encoding step.

An error occurs if one of the following events happens:

$$\begin{aligned}
E_0 &: x^n \notin T_\epsilon^n(X); && (15a)\\
E_1 &: x^n \in T_\epsilon^n(X),\ (x^n, u_L^n(i_L)) \notin T_\epsilon^n(X, U_L),\ \text{for all } i_L = 1, \ldots, 2^{nR_L}; && (15b)\\
E_l &: (x^n, u_{L-l+2}^n(i_{L-l+2}, \ldots, i_L), \ldots, u_L^n(i_L)) \in T_\epsilon^n(X, U_{L-l+2}, \ldots, U_L),\\
&\quad (x^n, u_{L-l+1}^n(i_{L-l+1}, \ldots, i_L), \ldots, u_L^n(i_L)) \notin T_\epsilon^n(X, U_{L-l+1}, \ldots, U_L),\\
&\quad \text{for all } i_{L-l+1} = 1, \ldots, 2^{n(R_{L-l+1} - R_{L-l+2})}; && (15c)
\end{aligned}$$

for $l = 2, \ldots, L$. It is clear that $P(E_0) \to 0$ as $n \to \infty$. Based on the properties of typical sequences:

$$\begin{aligned}
P(E_1) &\xrightarrow{n \to \infty} 0, \quad \text{if } R_L \ge I(X; U_L); && (16a)\\
P(E_l) &\xrightarrow{n \to \infty} 0, \quad \text{if } R_{L-l+1} - R_{L-l+2} \ge I(X; U_{L-l+1} \mid U_{L-l+2}, \ldots, U_L), && (16b)
\end{aligned}$$

for $l = 2, \ldots, L$. Summing the constraints in (16) from layer $L$ down to layer $l$ yields $R_l \ge I(X; U_l, \ldots, U_L)$, which matches (2) and completes the achievability proof.

Appendix B Converse of Theorem 1

Next, we prove that every achievable tuple $(R_1, \ldots, R_L, \mu_1, \ldots, \mu_L)$ must belong to $\mathcal{R}$. The system achieving this tuple is specified by the encoding functions $f_1, \ldots, f_L$, i.e.,

$$\begin{aligned}
f_1 &: \mathcal{X}^n \to \mathcal{Z}_1; && (17a)\\
f_l &: \mathcal{Z}_{l-1} \to \mathcal{Z}_l, \qquad l = 2, \ldots, L, && (17b)
\end{aligned}$$

such that

$$\begin{aligned}
R_l &\ge \frac{1}{n}\log|\mathcal{Z}_l|; && (18a)\\
\mu_l &\le \frac{1}{n} I(Y_l^n; Z_l), \qquad \text{for } l = 1, \ldots, L. && (18b)
\end{aligned}$$

By setting $U_{l,i} = (Z_l, X^{i-1})$, for $l = 1, \ldots, L$ and $i = 1, \ldots, n$, where $X^{i-1} = (X_1, \ldots, X_{i-1})$, we have

$$\begin{aligned}
n R_l &\ge I(Z_l, \ldots, Z_L; X^n) && (19a)\\
&= \sum_{i=1}^n I(U_{l,i}, \ldots, U_{L,i}; X_i) && (19b)\\
&= n I(U_l, \ldots, U_L; X), && (19c)
\end{aligned}$$

where (19a) is due to the fact that $Z_{l+1}, \ldots, Z_L$ are all deterministic functions of $Z_l$; (19b) follows from the definitions of $U_{l,i}, \ldots, U_{L,i}$, the chain rule, and the fact that the $X_i$'s are i.i.d.; and (19c) follows by defining $U_l = (U_{l,Q}, Q)$ and $X = X_Q$, where $Q$ is a random variable independent of all other random variables, and uniformly distributed over the set $\{1, \ldots, n\}$. We can also write

$$\begin{aligned}
n \mu_l &\le I(Y_l^n; Z_l) && (20a)\\
&= \sum_{i=1}^n I(Y_{l,i}; Z_l, \ldots, Z_L, Y_l^{i-1}) && (20b)\\
&\le \sum_{i=1}^n I(Y_{l,i}; Z_l, \ldots, Z_L, Y_l^{i-1}, X^{i-1}) && (20c)\\
&= \sum_{i=1}^n I(Y_{l,i}; Z_l, \ldots, Z_L, X^{i-1}) && (20d)\\
&\qquad + I(Y_{l,i}; Y_l^{i-1} \mid Z_l, \ldots, Z_L, X^{i-1}) && (20e)\\
&= \sum_{i=1}^n I(Y_{l,i}; U_{l,i}, \ldots, U_{L,i}) && (20f)\\
&= n I(Y_l; U_l, \ldots, U_L), && (20g)
\end{aligned}$$

where (20c) is due to the non-negativity of mutual information; and (20f) follows since $Y_l^{i-1} - X^{i-1} - (Y_{l,i}, Z_l, \ldots, Z_L)$ forms a Markov chain (the source is memoryless), so that $I(Y_{l,i}; Y_l^{i-1} \mid Z_l, \ldots, Z_L, X^{i-1}) = 0$, together with the definition of $U_{l,i}$. Finally, (20g) follows from the definitions of $U_l$, $Y_l = Y_{l,Q}$, and $Q$. Combining (19) and (20) completes the proof of the converse. ∎