I Introduction
A fundamental problem in statistical learning is to extract the relevant essence of data from high-dimensional, noisy sources. In supervised learning (e.g., speaker identification in speech recognition), a set of properties or statistical relationships is prespecified as the relevant information of interest (e.g., the name, age or gender of the speaker) to be learned from data; in unsupervised learning, clusters or low-dimensional representations play the same role. This can be connected to the lossy source compression problem in information theory, where an original source is compressed subject to a specifically defined distortion (or loss) with regard to the specified relevant information.
A remarkable step towards understanding the information relevance problem using fundamental information-theoretic concepts was made by Tishby et al. [1] with the introduction of the “information bottleneck” (IB) method. The relevant information in an observable variable $X$ is defined as the information $X$ can provide about another hidden variable $Y$. The IB framework characterizes the tradeoff between the information rate (or complexity) of the reproduction signal $U$ and the amount of mutual information it provides about $Y$. The IB method has been found useful in a wide variety of learning applications, e.g., word clustering [2], image clustering [3], etc. In particular, interesting connections have recently been made between deep learning [4] and the successively refined IB method [5].

Despite the success of the IB method in the machine learning domain, less effort has been invested in studying it from an information-theoretic viewpoint. Gilad-Bachrach
et al. [6] characterize the optimal tradeoff between the rates of information and relevance, and provide a single-letter region. As a matter of fact, the conventional IB problem follows as a special instance of the noisy lossy source coding problem [7]. Extensions of this information-theoretic framework address the collaborative IB problem by Vera et al. [8], and the distributed biclustering problem by Pichler et al. [9]. Further connections to the problem of joint testing and lossy reconstruction have recently been studied by Katz et al. [10]. Also in the information-theoretic context, the IB problem is closely related to the pattern classification problem studied in [11, 12, 13], which provides another operational meaning to IB.

In this work, we introduce and investigate the multi-layer IB problem with non-identical hidden variables at each layer. This scenario is highly motivated by deep neural networks (DNNs) and the recent work in [5]. Along the propagation of a DNN, each layer compresses its input, which is the output of the preceding layer, into a lower-dimensional output, which is forwarded to the next layer. Another scenario may be the hierarchical, multi-layer network, in which information is propagated from higher layers to lower layers sequentially. Users in different layers may be interested in different properties of the original source. The main result of this paper is the full characterization of the rate-relevance region of the multi-layer IB problem. Conditions are provided for successive refinability, in the sense of the existence of codes that asymptotically achieve the rate-relevance function simultaneously at all layers. A binary source with BSC hidden variables and a binary source with mixed BSC-BEC hidden variables (BSC hidden variables are obtained by passing the source through a binary symmetric channel, whereas BEC hidden variables are obtained through a binary erasure channel) are both proved to be successively refinable. Successive refinability is also shown for Gaussian sources. We further present a counterexample for which successive refinability no longer holds. It is worth mentioning that the successive refinability of the IB problem is also investigated in [14], with identical hidden variables.

The rest of the paper is organized as follows. Section II provides the definitions and presents the main result, whose achievability and converse proofs are provided in the Appendices. The definition and conditions of successive refinability are given in Section III. Examples are presented in Section IV. Finally, we conclude the paper in Section V.
II Problem Formulation
Let $\{(X_i, Y_{1,i}, \dots, Y_{L,i})\}_{i=1}^n$ be a sequence of $n$ i.i.d. copies of the discrete random variables $(X, Y_1, \dots, Y_L)$ taking values in finite alphabets, jointly distributed according to $p(x, y_1, \dots, y_L)$, where $X$ is an observable variable, while $Y_1, \dots, Y_L$ are hidden variables arbitrarily correlated with $X$.

An $(n, R_1, \dots, R_L)$ code for the $L$-layer IB problem, as illustrated in Fig. 1, consists of $L$ encoding functions $f_l$, $l = 1, \dots, L$, defined as $f_l: \mathcal{W}_{l-1} \to \mathcal{W}_l$, where we set $\mathcal{W}_0 = \mathcal{X}^n$, and $\mathcal{W}_l = \{1, \dots, 2^{nR_l}\}$, $l = 1, \dots, L$. That is, $R_l$ is the rate of the $l$th layer encoding function $f_l$, $l = 1, \dots, L$, and we assume $R_1 \ge R_2 \ge \cdots \ge R_L$.
Definition 1.
(Achievability) For some $L \ge 1$ and non-negative values $(R_1, \dots, R_L, \Delta_1, \dots, \Delta_L)$, the tuple $(R_1, \dots, R_L, \Delta_1, \dots, \Delta_L)$ is said to be achievable if, for every $\epsilon > 0$, there exists an $(n, R_1 + \epsilon, \dots, R_L + \epsilon)$ code s.t.

(1) $\frac{1}{n} I(Y_l^n; W_l) \ge \Delta_l - \epsilon, \quad l = 1, \dots, L,$

for sufficiently large $n$, where $W_l = f_l(W_{l-1})$, for $l = 1, \dots, L$, and $W_0 = X^n$.
The value of $\Delta_l$ imposes a lower bound on $\frac{1}{n} I(Y_l^n; W_l)$, i.e., the relevance with respect to the hidden variable $Y_l$ after the $l$th-layer encoding of the observable sequence $X^n$. Our goal is to characterize the rate-relevance region, $\mathcal{R}$, which is the set of all achievable tuples $(R_1, \dots, R_L, \Delta_1, \dots, \Delta_L)$.
Theorem 1.
The rate-relevance region, $\mathcal{R}$, is characterized by the closure of the set of all tuples $(R_1, \dots, R_L, \Delta_1, \dots, \Delta_L)$ that satisfy

(2) $R_l \ge I(X; U_l), \quad \Delta_l \le I(Y_l; U_l), \quad l = 1, \dots, L,$

for some probability $p(u_1, \dots, u_L \mid x)$ s.t.

(3) $U_L - U_{L-1} - \cdots - U_1 - X - (Y_1, \dots, Y_L)$ forms a Markov chain.
Proof.
A proof is provided in the Appendices. ∎
III Successive Refinability of Multi-Layer IB
The rate-relevance function for a single-layer setting with relevance constraint $\Delta$ regarding the hidden variable $Y$ is denoted by $R(\Delta; Y)$, and is characterized in [6] as:

(4) $R(\Delta; Y) = \min_{p(u \mid x):\, I(Y; U) \ge \Delta,\, U - X - Y} I(X; U).$
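For intuition, the minimization in (4) can be evaluated numerically. The sketch below (an illustration of ours, not part of the original analysis) brute-forces $R(\Delta; Y)$ for a Bern$(\frac{1}{2})$ source observed through a BSC with crossover probability $p$, restricting $U$ to a binary alphabet, and compares the result with the closed form $1 - h(\alpha)$ obtained via Mrs. Gerber's Lemma in Section IV-A; all function names are our own.

```python
import math

def h(x):
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def h_inv(y):
    """Inverse of h restricted to [0, 1/2], via bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = (lo + hi) / 2
        if h(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def rate_relevance_brute(p, delta, steps=300):
    """Brute-force R(delta; Y) for X ~ Bern(1/2), Y = X xor Bern(p):
    minimize I(X;U) over binary test channels p(u|x) s.t. I(Y;U) >= delta."""
    best = 1.0
    for i in range(steps + 1):
        a = i / steps            # P(U = 0 | X = 0)
        for j in range(steps + 1):
            b = j / steps        # P(U = 0 | X = 1)
            hu = h((a + b) / 2)  # H(U), since X is uniform
            i_yu = hu - (h((1 - p) * a + p * b) + h(p * a + (1 - p) * b)) / 2
            if i_yu >= delta:
                best = min(best, hu - (h(a) + h(b)) / 2)  # I(X;U)
    return best

def rate_relevance_closed(p, delta):
    """Closed form 1 - h(alpha) from the Mrs. Gerber argument."""
    alpha = (h_inv(1 - delta) - p) / (1 - 2 * p)
    return 1 - h(alpha)
```

The agreement of the two values (up to grid resolution) illustrates that a binary auxiliary alphabet suffices for this source.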
Definition 2.
Source $X$ is said to be successively refinable for the $L$-layer IB problem with regard to the correlated relevant hidden variables $Y_1, \dots, Y_L$ with relevance constraints $\Delta_1, \dots, \Delta_L$, respectively, if

(5) $(R(\Delta_1; Y_1), \dots, R(\Delta_L; Y_L), \Delta_1, \dots, \Delta_L) \in \mathcal{R}.$
Theorem 2.
Source $X$ is successively refinable for the $L$-layer IB problem with relevance constraints $\Delta_1$, …, $\Delta_L$ with regard to hidden variables $Y_1, \dots, Y_L$, iff there exist random variables $U_1, \dots, U_L$, satisfying the Markov chain $U_L - U_{L-1} - \cdots - U_1 - X$, such that the following conditions hold simultaneously for $l = 1, \dots, L$:

$\Delta_l \le I(Y_l; U_l)$,

$R(\Delta_l; Y_l) = I(X; U_l)$.
Proof.
Theorem 2 follows directly from Definition 2 and Theorem 1. ∎
IV Examples
IV-A Binary Source with Symmetric Hidden Variables
We consider binary alphabets $\mathcal{X} = \mathcal{Y}_l = \{0, 1\}$, $l = 1, \dots, L$. The observable variable $X$ has a Bernoulli $\frac{1}{2}$ distribution (denoted as Bern$(\frac{1}{2})$), and the hidden variables are obtained by passing the source through independent BSCs, i.e., $Y_l = X \oplus Z_l$, where $Z_l \sim$ Bern$(p_l)$, $p_l \in (0, \frac{1}{2})$, is independent of $X$, and $\oplus$ denotes modulo-2 addition.

We first derive the rate-relevance function $R(\Delta_l; Y_l)$. Denote by $U$ any random variable for which $U - X - Y_l$ and $I(Y_l; U) \ge \Delta_l$. We have the following inequality:

(6a) $\Delta_l \le I(Y_l; U) = H(Y_l) - H(Y_l \mid U)$

(6b) $\le 1 - h\big(p_l \star h^{-1}(H(X \mid U))\big)$

(6c) $= 1 - h\big(p_l \star h^{-1}(1 - I(X; U))\big),$

where the operation $\star$ is defined as $a \star b = a(1-b) + b(1-a)$, $h(\cdot)$ is the binary entropy function, defined as $h(x) = -x \log x - (1-x) \log(1-x)$, and $h^{-1}: [0, 1] \to [0, \frac{1}{2}]$ is the inverse of the binary entropy function with range $[0, \frac{1}{2}]$. (6b) follows from Mrs. Gerber's Lemma and the fact that $H(Y_l) = 1$. From (6), we obtain $I(X; U) \ge 1 - h(\alpha_l)$, where $\alpha_l \triangleq \frac{h^{-1}(1 - \Delta_l) - p_l}{1 - 2 p_l}$. Thus, we have $R(\Delta_l; Y_l) \ge 1 - h(\alpha_l)$. Note that by letting $U = X \oplus V_l$, where $V_l \sim$ Bern$(\alpha_l)$ is independent of $X$ and $Z_l$, we have $I(X; U) = 1 - h(\alpha_l)$ and $I(Y_l; U) = 1 - h(p_l \star \alpha_l) = \Delta_l$. We can conclude that $R(\Delta_l; Y_l) = 1 - h(\alpha_l)$, and $U$ given above is a rate-relevance function achieving auxiliary random variable.
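The achieving auxiliary variable is a binary symmetric test channel, and such channels compose: a BSC($\alpha_1$) followed by an independent BSC($\beta$) is equivalent to a single BSC($\alpha_1 \star \beta$). The following sketch (ours, for illustration) checks this by multiplying channel matrices; this composition property is what drives the layered construction in Lemma 1 below.

```python
def star(a, b):
    """Binary convolution a(1-b) + b(1-a): crossover of two cascaded BSCs."""
    return a * (1 - b) + b * (1 - a)

def bsc(p):
    """2x2 BSC transition matrix; rows index the input, columns the output."""
    return [[1 - p, p], [p, 1 - p]]

def cascade(m1, m2):
    """Channel obtained by feeding the output of m1 into m2 (matrix product)."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

alpha1, alpha2 = 0.11, 0.25                   # alpha_1 <= alpha_2 <= 1/2
beta = (alpha2 - alpha1) / (1 - 2 * alpha1)   # chosen so that star(alpha1, beta) = alpha2
composed = cascade(bsc(alpha1), bsc(beta))    # BSC(alpha1) followed by BSC(beta)
```

The off-diagonal entries of `composed` equal $\alpha_2$, so the coarser test channel can always be realized by degrading the finer one when the crossover probabilities are ordered.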
Lemma 1.
Binary sources as described above are always successively refinable for the $L$-layer IB problem if $\Delta_l \le 1 - h(p_l)$, $l = 1, \dots, L$, and $\alpha_1 \le \alpha_2 \le \cdots \le \alpha_L$, where $\alpha_l \triangleq \frac{h^{-1}(1 - \Delta_l) - p_l}{1 - 2 p_l}$.
Proof.
Since $\alpha_1 \le \alpha_2 \le \cdots \le \alpha_L \le \frac{1}{2}$, we can find binary variables $V_1, \dots, V_L$, independent of each other and $X$, such that $V_1 \oplus \cdots \oplus V_l \sim$ Bern$(\alpha_l)$ for $l = 1, \dots, L$. By choosing the auxiliary random variables $U_l = X \oplus V_1 \oplus \cdots \oplus V_l$, we have $I(X; U_l) = 1 - h(\alpha_l)$ and $I(Y_l; U_l) = 1 - h(p_l \star \alpha_l) = \Delta_l$, for $l = 1, \dots, L$, and $U_L - U_{L-1} - \cdots - U_1 - X$ forms a Markov chain. Together with Theorem 2, this concludes the proof of Lemma 1. ∎

IV-B Binary Source with Mixed Hidden Variables
Here we consider a two-layer IB problem, i.e., $L = 2$. The joint distribution of $(X, Y_1, Y_2)$ is illustrated in Fig. 2, where $X$ is a binary random variable with distribution Bern$(\frac{1}{2})$ as in the previous example, but $Y_1$ is the output of a BEC with erasure probability $e$ ($0 < e < 1$) when $X$ is the input, and $Y_2$ is the output of a BSC with crossover probability $p_2$, $p_2 \in (0, \frac{1}{2})$. A similar example can be found in [15], where the optimality of the proposed coding scheme does not always hold for their setting. We first derive the rate-relevance function $R(\Delta_1; Y_1)$. Denote by $U$ any random variable such that $U - X - Y_1$ and $I(Y_1; U) \ge \Delta_1$. We have the following inequality:
(7a) $\Delta_1 \le I(Y_1; U)$

(7b) $= H(Y_1) - H(Y_1 \mid U)$

(7c) $= h(e) + (1 - e) H(X) - H(Y_1 \mid U)$

(7d) $= h(e) + (1 - e) H(X) - \big(h(e) + (1 - e) H(X \mid U)\big)$

(7e) $= (1 - e)\big(H(X) - H(X \mid U)\big)$

(7f) $= (1 - e) I(X; U).$
Since $\Delta_1 \le (1 - e) I(X; U)$, it follows that $R(\Delta_1; Y_1) = \frac{\Delta_1}{1 - e} = 1 - h(\beta_1)$, where $\beta_1$ is defined as $h^{-1}\big(1 - \frac{\Delta_1}{1 - e}\big)$; this can be achieved by setting $U = X \oplus V_1$, where $V_1 \sim$ Bern$(\beta_1)$ is independent of $X$. We have $R(\Delta_2; Y_2) = 1 - h(\alpha_2)$ from Section IV-A, which can be achieved by setting $U = X \oplus V_2$, where $V_2 \sim$ Bern$(\alpha_2)$ is independent of $X$.
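The crucial step in (7) is that $I(Y_1; U) = (1 - e) I(X; U)$ for every $U$ satisfying $U - X - Y_1$, because the erasure event is independent of $(X, U)$. The following sketch (ours; the parameter values are arbitrary) verifies the identity exactly on a small joint pmf:

```python
import math

def H(probs):
    """Entropy in bits of a probability vector (zeros ignored)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mi(joint):
    """Mutual information I(A;B) from a joint pmf {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return H(pa.values()) + H(pb.values()) - H(joint.values())

e, alpha = 0.3, 0.11
# U = X xor V with V ~ Bern(alpha): joint pmf of (X, U), X uniform
p_xu = {(x, u): 0.5 * (alpha if u != x else 1 - alpha)
        for x in (0, 1) for u in (0, 1)}
# Y1 is X through a BEC(e); '?' is the erasure symbol
p_yu = {}
for (x, u), p in p_xu.items():
    p_yu[(x, u)] = p_yu.get((x, u), 0) + (1 - e) * p
    p_yu[('?', u)] = p_yu.get(('?', u), 0) + e * p

i_xu, i_yu = mi(p_xu), mi(p_yu)
```

Since the identity holds with equality for every test channel, the BEC relevance constraint simply rescales the rate, which is why $R(\Delta_1; Y_1)$ is linear in $\Delta_1$.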
Lemma 2.
Binary sources with mixed BEC/BSC hidden variables as described above are always successively refinable for the two-layer IB problem if $\Delta_2 \le 1 - h(p_2)$ and $\beta_1 \le \alpha_2$, where $\beta_1 \triangleq h^{-1}\big(1 - \frac{\Delta_1}{1 - e}\big)$ and $\alpha_2 \triangleq \frac{h^{-1}(1 - \Delta_2) - p_2}{1 - 2 p_2}$.
Proof.
The proof follows the same arguments as in the proof of Lemma 1. ∎
Note that, contrary to our expectation, successive refinability is still achievable in this example despite the mixed hidden variables, since an auxiliary variable of the form $U = X \oplus V$ achieves the rate-relevance function despite the BEC hidden variable.
IV-C Jointly Gaussian Source and Hidden Variables
It is not difficult to verify that the above achievability results remain valid for Gaussian sources by employing a quantization procedure over the sources and appropriate test channels [16].
Let $X$ and $Y_l$, $l = 1, \dots, L$, be jointly Gaussian zero-mean random variables, such that $Y_l = X + N_l$, where $X \sim \mathcal{N}(0, \sigma_X^2)$ and $N_l \sim \mathcal{N}(0, \sigma_l^2)$ is independent of $X$, $l = 1, \dots, L$. As in the previous examples, we first derive a lower bound on the rate-relevance function $R(\Delta_l; Y_l)$. Denote by $U$ any random variable such that $U - X - Y_l$ and $I(Y_l; U) \ge \Delta_l$. We have the following sequence of inequalities:
(8a) $\Delta_l \le I(Y_l; U)$

(8b) $= h(Y_l) - h(Y_l \mid U)$

(8c) $= \frac{1}{2} \log 2 \pi e (\sigma_X^2 + \sigma_l^2) - h(X + N_l \mid U)$

(8d) $\le \frac{1}{2} \log 2 \pi e (\sigma_X^2 + \sigma_l^2) - \frac{1}{2} \log\big(2^{2 h(X \mid U)} + 2 \pi e \sigma_l^2\big)$

(8e) $= \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_l^2}{\sigma_X^2 2^{-2 I(X; U)} + \sigma_l^2},$
where (8d) follows from the conditional Entropy Power Inequality (EPI) (Section 2.2 in [16]). We can also obtain an outer bound on $R(\Delta_l; Y_l)$:

(9) $R(\Delta_l; Y_l) \le \frac{1}{2} \log\Big(1 + \frac{\sigma_X^2}{q_l}\Big),$

by setting $U_l = X + V_l$, $V_l \sim \mathcal{N}(0, q_l)$ independent of $X$ and $N_l$, where $q_l$ is given by:

(10) $q_l = \frac{\sigma_X^2 D_l}{\sigma_X^2 - D_l}, \qquad D_l \triangleq (\sigma_X^2 + \sigma_l^2) 2^{-2 \Delta_l} - \sigma_l^2.$
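As a sanity check on the Gaussian example, the standard closed forms for mutual information between jointly Gaussian variables confirm that a test channel $U = X + V$, $V \sim \mathcal{N}(0, q)$, with $q$ chosen to meet the relevance constraint with equality, attains relevance exactly $\Delta_l$. The parameterization below is our own illustrative reading of the derivation:

```python
import math

def gauss_mi(signal_var, noise_var):
    """I(A; A + N) = 0.5 * log2(1 + signal_var / noise_var), A and N independent Gaussians."""
    return 0.5 * math.log2(1 + signal_var / noise_var)

sx2, sl2, delta = 1.0, 0.5, 0.4    # source variance, hidden-channel noise variance, target relevance
assert delta < gauss_mi(sx2, sl2)  # relevance cannot exceed I(X; Y_l)

# conditional variance of X given U needed to meet the relevance constraint with equality
D = (sx2 + sl2) * 2 ** (-2 * delta) - sl2
q = sx2 * D / (sx2 - D)            # test-channel noise variance: U = X + V, V ~ N(0, q)

rate = gauss_mi(sx2, q)                                       # I(X; U)
cond_var = sx2 * q / (sx2 + q)                                # Var(X | U)
relevance = 0.5 * math.log2((sx2 + sl2) / (cond_var + sl2))   # I(Y_l; U)
```

The check confirms that the additive Gaussian test channel meets the relevance target exactly, while the rate paid is the corresponding point of the lower bound in (8).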
Lemma 3.
Gaussian sources as described above are always successively refinable for the $L$-layer IB problem if $\Delta_l < I(X; Y_l)$ for $l = 1, \dots, L$ and $q_1 \le q_2 \le \cdots \le q_L$.
IV-D Counterexample on Successive Refinability
In this section, we show that the multi-layer IB problem is not always successively refinable. We consider a two-layer IB problem, i.e., $L = 2$. Let $X = (X_1, X_2)$, where $X_1$ and $X_2$ are two independent discrete random variables, and we have $Y_1 = X_1$ and $Y_2 = X_2$. We first derive the rate-relevance function $R(\Delta_1; Y_1)$. Denote by $U$ any random variable such that $U - X - Y_1$ and $I(Y_1; U) \ge \Delta_1$. We have:
(11a) $\Delta_1 \le I(Y_1; U)$

(11b) $= I(X_1; U)$

(11c) $\le I(X; U).$
By setting $U$ as

(12) $U = \begin{cases} X_1, & \text{with probability } \frac{\Delta_1}{H(X_1)}, \\ \ast, & \text{otherwise}, \end{cases}$

where $\ast$ is a constant erasure symbol, we have $I(X_1; U) = \Delta_1$, and $I(X; U) = \Delta_1$, which achieves the lower bound shown in (11). We can conclude that $R(\Delta_1; Y_1) = \Delta_1$, and any rate-relevance function achieving random variable $U_1$ should satisfy $I(X_2; U_1) = 0$, since $I(X; U_1) \ge I(X_1; U_1) + I(X_2; U_1)$ and optimality requires $I(X; U_1) = I(X_1; U_1) = \Delta_1$. Similarly, we can conclude that $R(\Delta_2; Y_2) = \Delta_2$, and any rate-relevance function achieving random variable $U_2$ should satisfy $I(X_1; U_2) = 0$.
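To make the counterexample concrete, the sketch below (ours; we take $X_1$ and $X_2$ to be fair bits, and instantiate the auxiliary variable as an erasure of $X_1$, which is one choice consistent with the derivation) verifies that such a $U$ attains $I(X; U) = I(X_1; U) = \Delta_1$ while $I(X_2; U) = 0$:

```python
import math

def H(probs):
    """Entropy in bits of a probability vector (zeros ignored)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mi(joint):
    """Mutual information I(A;B) from a joint pmf {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return H(pa.values()) + H(pb.values()) - H(joint.values())

delta1 = 0.3
er = 1 - delta1        # erasure probability: I(X1;U) = (1 - er) * H(X1), H(X1) = 1 bit here
p_xu, p_x1u, p_x2u = {}, {}, {}
for x1 in (0, 1):
    for x2 in (0, 1):
        p = 0.25       # X = (X1, X2), independent fair bits
        for u, pu in ((x1, 1 - er), ('?', er)):  # U: X1 revealed or erased
            p_xu[((x1, x2), u)] = p_xu.get(((x1, x2), u), 0) + p * pu
            p_x1u[(x1, u)] = p_x1u.get((x1, u), 0) + p * pu
            p_x2u[(x2, u)] = p_x2u.get((x2, u), 0) + p * pu
```

Any optimal first-layer description is thus completely uninformative about $X_2$, which is exactly what breaks the Markov-chain requirement of Theorem 2.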
Lemma 4.
Source $X$ with hidden variables $Y_1$ and $Y_2$ as described above is not successively refinable for the two-layer IB problem (for any $\Delta_2 > 0$).
Proof.
For any rate-relevance function achieving random variables $U_1$ and $U_2$ satisfying the Markov chain $U_2 - U_1 - X$, we have

(13a) $\Delta_2 \le I(Y_2; U_2)$

(13b) $= I(X_2; U_2)$

(13c) $\le I(X_2; U_1)$

(13d) $= H(X_2) - H(X_2 \mid U_1)$

(13e) $= H(X_2) - H(X_2)$

(13f) $= 0,$
where (13e) is due to $H(X_2 \mid U_1) = H(X_2)$, which follows from

(14a) $\Delta_1 = I(X; U_1)$

(14b) $= I(X_1; U_1) + I(X_2; U_1 \mid X_1)$

(14c) $\ge I(X_1; U_1) + I(X_2; U_1) \ge \Delta_1 + I(X_2; U_1).$
If $\Delta_2 > 0$, (13) yields a contradiction, which implies that $U_2$, $U_1$ and $X$ cannot form a Markov chain for any rate-relevance function achieving random variables $U_1$ and $U_2$. With Theorem 2, we have proven Lemma 4. ∎

V Conclusion
The multi-layer IB problem with non-identical relevant variables was investigated. A single-letter expression of the rate-relevance region was given. The definition and conditions of successive refinability were presented, and successive refinability was further investigated for binary and Gaussian sources. A counterexample to successive refinability was also provided.
References
 [1] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc. of Annu. Allerton Conf. Commun., Contr. Comput., Monticello, IL, Sep. 1999, pp. 368–377.
 [2] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method,” in Proc. of Annu. Conf. Research and Development in inform. retrieval, Athens, Greece, 2000, pp. 208–215.
 [3] J. Goldberger, H. Greenspan, and S. Gordon, “Unsupervised image clustering using the information bottleneck method,” in Joint Pattern Recog. Symp., Zurich, Switzerland, Sep. 2002, pp. 158–165.
 [4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
 [5] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW), April 2015, pp. 1–5.
 [6] R. Gilad-Bachrach, A. Navot, and N. Tishby, “An information theoretic tradeoff between complexity and accuracy,” in Proc. of Conf. Learning Theory (COLT), Washington, Aug. 2003, pp. 595–609.
 [7] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IRE Transactions on Information Theory, vol. 8, no. 5, pp. 293–304, September 1962.
 [8] M. Vera, L. R. Vega, and P. Piantanida, “Collaborative representation learning,” CoRR, vol. abs/1604.01433, 2016. [Online]. Available: http://arxiv.org/abs/1604.01433
 [9] G. Pichler, P. Piantanida, and G. Matz, “Distributed information-theoretic biclustering,” in 2016 IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 1083–1087.
 [10] G. Katz, P. Piantanida, and M. Debbah, “Distributed binary detection with lossy data compression,” IEEE Transactions on Information Theory, vol. PP, no. 99, pp. 1–1, 2017.
 [11] F. M. J. Willems, T. Kalker, J. Goseling, and J.-P. Linnartz, “On the capacity of a biometrical identification system,” in Proc. IEEE Int. Symp. Inform. Theory (ISIT), Jun. 2003, p. 82.
 [12] E. Tuncel, “Capacity/storage tradeoff in high-dimensional identification systems,” IEEE Trans. Inform. Theory, no. 5, 2009.
 [13] E. Tuncel and D. Gunduz, “Identification and lossy reconstruction in noisy databases,” IEEE Trans. Inform. Theory, no. 2, 2014.
 [14] C. Tian and J. Chen, “Successive refinement for hypothesis testing and lossless one-helper problem,” IEEE Trans. Inform. Theory, vol. 54, no. 10, pp. 4666–4681, 2008.
 [15] J. Villard and P. Piantanida, “Secure multiterminal source coding with side information at the eavesdropper,” IEEE Trans. Inform. Theory, vol. 59, no. 6, pp. 3668–3692, 2013.
 [16] A. El Gamal and Y.-H. Kim, Network Information Theory. New York, NY, USA: Cambridge University Press, 2012.
Appendix A Achievability of Theorem 1
Consider first the direct part, i.e., that every tuple $(R_1, \dots, R_L, \Delta_1, \dots, \Delta_L)$ satisfying (2) and (3) is achievable.
Code generation. Fix a conditional probability mass function (pmf) $p(u_1, \dots, u_L \mid x)$ such that (2) and (3) hold, and set $\tilde{R}_l \triangleq I(X; U_l) + \epsilon$, for $l = 1, \dots, L$. First randomly generate $2^{n \tilde{R}_L}$ sequences $u_L^n(j_L)$, $j_L \in \{1, \dots, 2^{n \tilde{R}_L}\}$, independent and identically distributed (i.i.d.) according to $\prod_{i=1}^n p(u_{L,i})$; then for each $u_L^n(j_L)$ randomly generate $2^{n(\tilde{R}_{L-1} - \tilde{R}_L)}$ sequences $u_{L-1}^n(j_{L-1}, j_L)$, conditionally i.i.d. according to $\prod_{i=1}^n p(u_{L-1,i} \mid u_{L,i})$; and continue in the same manner, for each $(u_{l+1}^n, \dots, u_L^n)$ randomly generate $2^{n(\tilde{R}_l - \tilde{R}_{l+1})}$ sequences $u_l^n(j_l, \dots, j_L)$, conditionally i.i.d. according to $\prod_{i=1}^n p(u_{l,i} \mid u_{l+1,i})$, for $l = L-2, \dots, 1$.
Encoding and decoding. After observing $x^n$, the first encoder finds an index tuple $(j_1, \dots, j_L)$ such that $(x^n, u_1^n(j_1, \dots, j_L), \dots, u_L^n(j_L))$ is in the set $\mathcal{T}_\epsilon^{(n)}$, which is the set of jointly typical vectors of the random variables $(X, U_1, \dots, U_L)$. If more than one such tuple exists, any one of them is selected. If no such tuple exists, we call it an error, and set $(j_1, \dots, j_L) = (1, \dots, 1)$. Then the $l$th encoder outputs $u_l^n(j_l, \dots, j_L)$, for $l = 1, \dots, L$, and sends to the $(l+1)$th encoder, if $l < L$, the index tuple $(j_{l+1}, \dots, j_L)$ at a total rate of $\tilde{R}_{l+1}$. Given the index tuple $(j_l, \dots, j_L)$, the $l$th decoder declares $u_l^n(j_l, \dots, j_L)$ as its output, for $l = 1, \dots, L$.
Relevance. First, we note that if there is no error in the encoding step, i.e., an index tuple $(j_1, \dots, j_L)$ such that $(x^n, u_1^n, \dots, u_L^n) \in \mathcal{T}_\epsilon^{(n)}$ is found, then the relevance condition $\frac{1}{n} I(Y_l^n; W_l) \ge \Delta_l - \epsilon$, $l = 1, \dots, L$, is satisfied by the definition of $\mathcal{T}_\epsilon^{(n)}$ and the Markov lemma. Then we focus on the analysis of the probability of error, i.e., the probability that such an index tuple cannot be found in the encoding step.
An error occurs if one of the following events happens:

(15a) $\mathcal{E}_1 = \big\{ X^n \notin \mathcal{T}_\epsilon^{(n)} \big\},$

(15b) $\mathcal{E}_2 = \big\{ (X^n, u_L^n(j_L)) \notin \mathcal{T}_\epsilon^{(n)} \text{ for all } j_L \big\},$

(15c) $\mathcal{E}_{l+2} = \big\{ (X^n, u_l^n(j_l, \dots, j_L), \dots, u_L^n(j_L)) \notin \mathcal{T}_\epsilon^{(n)} \text{ for all } j_l \big\},$

for $l = 1, \dots, L-1$. It is clear that $\Pr(\mathcal{E}_1) \to 0$ as $n \to \infty$. Based on the properties of typical sequences:

(16a) $\Pr(\mathcal{E}_2) \to 0 \text{ as } n \to \infty, \text{ if } \tilde{R}_L > I(X; U_L),$

(16b) $\Pr(\mathcal{E}_{l+2}) \to 0 \text{ as } n \to \infty, \text{ if } \tilde{R}_l - \tilde{R}_{l+1} > I(X; U_l \mid U_{l+1}),$

for $l = 1, \dots, L-1$.
Appendix B Converse of Theorem 1
Next, we prove that every achievable tuple $(R_1, \dots, R_L, \Delta_1, \dots, \Delta_L)$ must belong to $\mathcal{R}$. The system achieving $(R_1, \dots, R_L, \Delta_1, \dots, \Delta_L)$ is specified by the encoding functions $f_1, \dots, f_L$, i.e.,

(17a) $W_1 = f_1(X^n),$

(17b) $W_l = f_l(W_{l-1}), \quad l = 2, \dots, L,$

such that

(18a) $\frac{1}{n} \log |\mathcal{W}_l| \le R_l + \epsilon,$

(18b) $\frac{1}{n} I(Y_l^n; W_l) \ge \Delta_l - \epsilon, \quad l = 1, \dots, L.$
By setting $U_{l,i} \triangleq (W_l, X^{i-1})$, and $U_l \triangleq (U_{l,Q}, Q)$, for $l = 1, \dots, L$ and $i = 1, \dots, n$, where $X^{i-1} = (X_1, \dots, X_{i-1})$, we have

(19a) $n(R_l + \epsilon) \ge H(W_l) = I(X^n; W_l)$

(19b) $= \sum_{i=1}^n I(X_i; W_l \mid X^{i-1}) = \sum_{i=1}^n I(X_i; U_{l,i})$

(19c) $= n I(X_Q; U_{l,Q} \mid Q) = n I(X; U_l),$

where (19a) is due to the fact that $W_1, \dots, W_L$ are all deterministic functions of $X^n$; (19b) follows from the definitions of $U_{l,i}$ and the fact that the $X_i$ are i.i.d.; and (19c) follows by defining $X \triangleq X_Q$, where $Q$ is a random variable independent of all other random variables, and uniformly distributed over the set $\{1, \dots, n\}$. We can also write

(20a) $n(\Delta_l - \epsilon) \le I(Y_l^n; W_l)$

(20b) $= \sum_{i=1}^n I(Y_{l,i}; W_l \mid Y_l^{i-1})$

(20c) $\le \sum_{i=1}^n I(Y_{l,i}; W_l, Y_l^{i-1})$

(20d) $\le \sum_{i=1}^n I(Y_{l,i}; W_l, X^{i-1}, Y_l^{i-1})$

(20e) $= \sum_{i=1}^n \big( I(Y_{l,i}; W_l, X^{i-1}) + I(Y_{l,i}; Y_l^{i-1} \mid W_l, X^{i-1}) \big)$

(20f) $= \sum_{i=1}^n I(Y_{l,i}; U_{l,i})$

(20g) $= n I(Y_{l,Q}; U_{l,Q} \mid Q) = n I(Y_l; U_l),$

where (20c) is due to the nonnegativity of mutual information; and (20f) follows since $Y_l^{i-1} - (W_l, X^{i-1}) - Y_{l,i}$ forms a Markov chain.