1. Introduction
The information bottleneck (IB) is an approach to learning data representations based on a notion of minimal sufficiency. The general idea is to map an input source into a representation that retains as little information as possible about the input (minimality), but retains as much information as possible in relation to a target variable of interest (sufficiency). See Figure 1. For example, in a classification problem, the target variable could be the class label of the input data. In a reconstruction problem, the target variable could be a denoised reconstruction of the input. Intuitively, a representation that is minimal in relation to a given task will discard nuisances in the inputs that are irrelevant to the task, and hence distill more meaningful information and allow for better generalization.
In a typical bottleneck paradigm, an input variable $X$ is first mapped to an intermediate representation variable $Z$, and then $Z$ is mapped to an output variable of interest $Y$. We call the mappings, respectively, a representation model (encoder) and an inference model (decoder). The channel $\kappa$ models the true relation between the input $X$ and the output $Y$. In general, the channel $\kappa$ is unknown and only accessible through a set of examples $(x^{(i)}, y^{(i)})_{i=1}^{N}$. We would like to obtain an approximation of $\kappa$ using a probabilistic model that consists of the encoder-decoder pair.
The IB methods (Witsenhausen and Wyner 1975, Tishby et al. 1999, Harremoës and Tishby 2007, Hsu et al. 2018) have found numerous applications, e.g., in representation learning, clustering, classification, generative modelling, reinforcement learning, and analyzing training in deep neural networks, among others (see, e.g., Shamir et al. 2008, Gondek and Hofmann 2003, Higgins et al. 2017, Alemi et al. 2016, Tishby and Zaslavsky 2015).
In the traditional IB, minimality and sufficiency are measured in terms of the mutual information. Computing the mutual information can be challenging in practice. Various recent works have formulated more tractable functions by way of variational bounds on the mutual information (Chalk et al. 2016, Alemi et al. 2016, Kolchinsky et al. 2017), sandwiching the objective function of interest.
Instead of approximating the sufficiency term of the IB, we formulate a new bottleneck method that minimizes deficiency. Deficiencies provide a principled way of approximating complex channels by relatively simpler ones. The deficiency of a decoder with respect to the true channel between input and output variables quantifies how well any stochastic encoding at the decoder input can be used to approximate the true channel. Deficiencies have a rich heritage in the theory of comparison of statistical experiments (Blackwell 1953, Le Cam 1964, Torgersen 1991). From this angle, the formalism of deficiencies has been used to obtain bounds on optimal risk gaps of statistical decision problems. As we show, the deficiency bottleneck minimizes a regularized risk gap. Moreover, the proposed method has an immediate variational formulation that can be easily implemented as a modification of the Variational Information Bottleneck (VIB) (Alemi et al. 2016). In fact, both methods coincide in the limit of single-shot Monte Carlo approximations. We call our method the Variational Deficiency Bottleneck (VDB).
As we show in Proposition 2, perfect optimization of the IB sufficiency corresponds to perfect minimization of the DB deficiency. However, when working over a parametrized model and adding the bottleneck regularizer, the two methods have different preferences, with the DB being closer to the optimal risk gap. Experiments on basic data sets show that the VDB is able to obtain more compressed representations than the VIB while performing equally well or better in terms of test accuracy.
2. The variational deficiency bottleneck (VDB)
Let $X$ denote an observation or input variable and $Y$ an output variable of interest. Let $p(x,y) = \pi(x)\kappa(y|x)$ be the true joint distribution, where the conditional distribution or channel $\kappa(y|x)$ describes how the output depends on the input. We consider the situation where the true channel is unknown, but we are given a set of independent and identically distributed (i.i.d.) samples $(x^{(i)}, y^{(i)})_{i=1}^{N}$ from $p$. Our goal is to use this data to learn a more structured version of the channel $\kappa$, by first “compressing” the input $X$ to an intermediate representation variable $Z$ and subsequently mapping the representation back to the output $Y$. The presence of an intermediate representation can be regarded as a bottleneck, a model selection problem, or as a regularization strategy.
We define a representation model and an inference model using two parameterized families of channels $e(z|x)$ and $d(y|z)$. We will refer to $e$ and $d$ as an encoder and a decoder. The encoder-decoder pair induces a model $\hat{\kappa}(y|x) = \int d(y|z)\, e(z|x)\, dz$. Equivalently, we write $\hat{\kappa} = d \circ e$.
Given a representation, we want the decoder to be as powerful as the original channel in terms of ability to recover the output. The deficiency of a decoder $d$ w.r.t. $\kappa$ quantifies the extent to which any preprocessing of $d$ (by way of randomized encodings) can be used to approximate $\kappa$ (in the KL-distance sense). Let $\mathsf{M}(\mathcal{X};\mathcal{Z})$ denote the space of all channels from $\mathcal{X}$ to $\mathcal{Z}$. We define the deficiency of $d$ w.r.t. $\kappa$ as follows.
Definition 1.
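Given the channel $\kappa$ from $\mathcal{X}$ to $\mathcal{Y}$, and a decoder $d$ from some $\mathcal{Z}$ to $\mathcal{Y}$, the deficiency of $d$ w.r.t. $\kappa$ is defined as
$$\delta^{\pi}(d,\kappa) = \min_{e \in \mathsf{M}(\mathcal{X};\mathcal{Z})} D_{\mathrm{KL}}(\kappa \,\|\, d \circ e \mid \pi). \tag{1}$$
Here $D_{\mathrm{KL}}(\cdot \,\|\, \cdot \mid \pi)$ is the conditional KL divergence (Csiszár and Körner 2011), and $\pi$ is an input distribution over $\mathcal{X}$. The definition is similar in spirit to Lucien Le Cam’s notion of weighted deficiencies of one channel w.r.t. another (Le Cam 1964; Torgersen 1991, Section 6.2) and its recent generalization by Raginsky (2011).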
We propose to train the model by minimizing the deficiency of $d$ w.r.t. $\kappa$ subject to a regularization that penalizes complex representations. The regularization is achieved by limiting the rate $I(Z;X)$, the mutual information between the representation and the raw inputs. We call our method the Deficiency Bottleneck (DB). The DB minimizes the following objective over all tuples $(e, d)$:
$$\min_{e,\,d}\; \delta^{\pi}(d,\kappa) + \beta\, I(Z;X). \tag{2}$$
The parameter $\beta \ge 0$ allows us to adjust the level of regularization.
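For any distribution $r(z)$, the rate term admits a simple variational upper bound (Csiszár and Körner 2011, Eq. (8.7)):
$$I(Z;X) \le D_{\mathrm{KL}}(e \,\|\, r \mid \pi) = \mathbb{E}_{x \sim \pi}\, D_{\mathrm{KL}}\big(e(\cdot|x) \,\|\, r\big). \tag{3}$$
Let $\hat{p}_{\mathrm{data}}$ be the empirical distribution of the data (input-output pairs). By noting that $\delta^{\pi}(d,\kappa) \le D_{\mathrm{KL}}(\kappa \,\|\, d \circ e \mid \pi)$ for any $e \in \mathsf{M}(\mathcal{X};\mathcal{Z})$, and ignoring (unknown) data-dependent constants, we obtain the following optimization objective, which we call the Variational Deficiency Bottleneck (VDB) objective:
$$\mathcal{L}_{\mathrm{VDB}}(e,d) = \mathbb{E}_{(x,y) \sim \hat{p}_{\mathrm{data}}}\Big[-\log \int d(y|z)\, e(z|x)\, dz + \beta\, D_{\mathrm{KL}}\big(e(\cdot|x) \,\|\, r\big)\Big]. \tag{4}$$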
The computation is simplified by defining $r(z)$ to be a standard multivariate Gaussian distribution and using an encoder that outputs a Gaussian distribution. More precisely, we consider an encoder of the form $e(z|x) = \mathcal{N}(z \mid f_{\phi}(x))$, where $f_{\phi}$ is a neural network that outputs the parameters of a Gaussian distribution. Using the reparametrization trick (Kingma and Welling 2013, Rezende et al. 2014), we then write $z = f(x, \epsilon)$, where $f$ is a function of $x$ and the realization $\epsilon$ of a standard normal distribution. This allows us to do stochastic backpropagation through a single sample $z$. The KL term admits an analytic expression for a choice of Gaussian $r(z)$ and encoders. We train the model by minimizing the following empirical objective:
$$\frac{1}{N} \sum_{i=1}^{N} \Big[-\log \Big(\frac{1}{M} \sum_{j=1}^{M} d\big(y^{(i)} \mid f(x^{(i)}, \epsilon^{(j)})\big)\Big) + \beta\, D_{\mathrm{KL}}\big(e(\cdot|x^{(i)}) \,\|\, r\big)\Big]. \tag{5}$$
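The two ingredients of this step, the closed-form KL term and the reparametrized sample, can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes a diagonal Gaussian encoder and a standard normal $r(z)$, and the function names are hypothetical.

```python
import math
import random


def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats.

    Assumes a diagonal covariance; per coordinate the term is
    0.5 * (sigma^2 + mu^2 - 1) - log(sigma).
    """
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


def reparametrize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparametrization trick)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.3]   # illustrative encoder outputs
z = reparametrize(mu, sigma, rng)      # one stochastic sample of the representation
rate_term = kl_diag_gaussian_to_standard_normal(mu, sigma)
```

Because the noise enters only through `eps`, gradients with respect to `mu` and `sigma` can flow through the sample in an automatic differentiation framework.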
For training, we use minibatches of size $N$. For Monte Carlo estimates of the expectation inside the log, we draw $M$ samples from the encoding distribution.
We note that the Variational Information Bottleneck (VIB) (Alemi et al. 2016) leads to a similar-looking objective function, with the only difference that the sum over $j$ is outside of the log. By Jensen’s inequality, the VIB loss is an upper bound to our loss. If one uses a single sample from the encoding distribution (i.e., $M = 1$), the VDB and the VIB objective functions coincide.
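The difference between the two losses for a single data point can be made concrete with a short sketch. The decoder probabilities below are made-up illustrative values for $d(y \mid z_j)$ at $M = 3$ sampled representations, and the function names are hypothetical.

```python
import math


def vdb_log_loss(probs):
    """-log of the Monte Carlo average of d(y | z_j): average inside the log."""
    return -math.log(sum(probs) / len(probs))


def vib_log_loss(probs):
    """Monte Carlo average of -log d(y | z_j): average outside the log."""
    return sum(-math.log(p) for p in probs) / len(probs)


probs = [0.9, 0.5, 0.2]        # illustrative values of d(y | z_j), j = 1..3
vdb = vdb_log_loss(probs)
vib = vib_log_loss(probs)      # never smaller than vdb, by Jensen's inequality
```

With a single sample the two functions return the same value, which is the sense in which the VDB recovers the VIB.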
The average log-loss and the rate term in the VDB objective (equation 4) are the two fundamental quantities that govern the probability of error when the model is a classifier. For a detailed discussion of these relations, see Appendix A.
3. Blackwell Sufficiency and Channel Deficiency
In this section, we discuss an intuitive geometric interpretation of the deficiency in the space of probability distributions over the output variable. We also give an operational interpretation of the deficiency as a deviation from Blackwell sufficiency (in the KL-distance sense). Finally, we discuss its relation to the log-loss.
3.1. Deficiency and Decision Geometry
We first formulate the learning task as a decision problem. We show that $\delta^{\pi}(d,\kappa)$ quantifies the gap in the optimal risks of decision problems when using the channel $d$ rather than $\kappa$.
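Let $\mathcal{X}$, $\mathcal{Y}$ denote the spaces of possible inputs and outputs. In the following, we assume that $\mathcal{X}$ and $\mathcal{Y}$ are finite. Let $\mathcal{P}_{\mathcal{Y}}$ be the set of all distributions on $\mathcal{Y}$. For every $x \in \mathcal{X}$, define $\kappa_x \in \mathcal{P}_{\mathcal{Y}}$ as $\kappa_x(y) = \kappa(y|x)$. Nature draws $x \sim \pi$ and $y \sim \kappa_x$. The learner observes $x$ and quotes a distribution $q_x \in \mathcal{P}_{\mathcal{Y}}$ that expresses her uncertainty about the true value $y$. The quality of a quote $q_x$ in relation to $y$ is measured by an extended real-valued loss function called the score, $\ell : \mathcal{Y} \times \mathcal{P}_{\mathcal{Y}} \to \overline{\mathbb{R}}$. For background on this special kind of loss function see, e.g., Grünwald et al. 2004, Gneiting and Raftery 2007, Parry et al. 2012. Ideally, the quote $q_x$ should be as close as possible to the true conditional distribution $\kappa_x$. This is achieved by minimizing the expected loss $L(\kappa_x, q_x) = \mathbb{E}_{y \sim \kappa_x}\, \ell(y, q_x)$, for all $x$. The score is called proper if $\kappa_x \in \arg\min_{q} L(\kappa_x, q)$. Define the Bayes act against $\kappa_x$ as the optimal quote $q_x^{*} = \arg\min_{q} L(\kappa_x, q)$. If multiple Bayes acts exist, then select one arbitrarily. Define the Bayes risk for the distribution $p = \pi \times \kappa$ as $R(p, \ell) = \mathbb{E}_{x \sim \pi}\, L(\kappa_x, q_x^{*})$. A score is strictly proper if the Bayes act is unique. An example of a strictly proper score is the log-loss function defined as $\ell_{L}(y, q) = -\log q(y)$. For the log-loss, the Bayes act is $q_x^{*} = \kappa_x$ and the Bayes risk is just the conditional entropy
$$R(p, \ell_{L}) = \mathbb{E}_{x \sim \pi}\, \mathbb{E}_{y \sim \kappa_x}\big[-\log \kappa_x(y)\big] = H(Y|X). \tag{6}$$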
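Given a representation $z$ (output by some encoder), when using the decoder $d$, the learner is constrained to quote a distribution from a subset of $\mathcal{P}_{\mathcal{Y}}$ which is the convex hull of the points $\{d_z\}_{z \in \mathcal{Z}}$, where $d_z(y) = d(y|z)$. Let $\mathcal{C} = \mathrm{conv}\{d_z : z \in \mathcal{Z}\}$. The Bayes act against $\kappa_x$ is
$$q_x^{*} = \arg\min_{q \in \mathcal{C}} D_{\mathrm{KL}}(\kappa_x \,\|\, q). \tag{7}$$
$q_x^{*}$ has an interpretation as the reverse I-projection of $\kappa_x$ to the convex set of probability measures $\mathcal{C}$ (Csiszár and Matuš 2003). (Footnote 1: Such a projection exists and is not necessarily unique since the set we are projecting onto is not log-convex. If non-unique, we arbitrarily select one of the minimizers as the Bayes act.) We call the associated Bayes risk the projected Bayes risk $R_d(p, \ell_{L})$ and the associated conditional entropy the projected conditional entropy $H_d(Y|X)$,
$$R_d(p, \ell_{L}) = \mathbb{E}_{x \sim \pi}\, \mathbb{E}_{y \sim \kappa_x}\big[-\log q_x^{*}(y)\big] = H_d(Y|X). \tag{8}$$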
The gap between the optimal risk of a decision based on an intermediate representation and that of a decision based on the input data is just the deficiency. This follows from noting that
$$\delta^{\pi}(d,\kappa) = \mathbb{E}_{x \sim \pi}\, D_{\mathrm{KL}}(\kappa_x \,\|\, q_x^{*}) = H_d(Y|X) - H(Y|X). \tag{9}$$
$\delta^{\pi}(d,\kappa)$ vanishes if and only if the optimal quote against $\kappa_x$ matches $\kappa_x$ for all $x$. This gives an intuitive geometric interpretation of a vanishing deficiency in the space of distributions over $\mathcal{Y}$.
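For small finite alphabets, the deficiency can be evaluated numerically by projecting each true conditional onto the convex hull of the decoder outputs. The sketch below is an illustration under stated assumptions rather than the paper's procedure: it assumes strictly positive decoder rows, uses the standard EM fixed-point update for mixture weights to minimize the KL divergence, and all names and numbers are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) in nats for finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(w, rows):
    """Mixture sum_z w[z] * rows[z] as a distribution over outputs."""
    return [sum(w[z] * rows[z][y] for z in range(len(rows)))
            for y in range(len(rows[0]))]


def project_onto_hull(p, decoder_rows, iters=500):
    """Approximate argmin over q in conv{d_z} of D(p || q) via EM weight updates."""
    k = len(decoder_rows)
    w = [1.0 / k] * k                      # full-support initial weights
    for _ in range(iters):
        q = mix(w, decoder_rows)
        w = [w[z] * sum(p[y] * decoder_rows[z][y] / q[y]
                        for y in range(len(p)) if p[y] > 0)
             for z in range(k)]
    return mix(w, decoder_rows)


def deficiency(pi, kappa_rows, decoder_rows):
    """Weighted average over inputs x of D(kappa_x || projection of kappa_x)."""
    return sum(pi[x] * kl(kappa_rows[x], project_onto_hull(kappa_rows[x], decoder_rows))
               for x in range(len(pi)))


decoder = [[0.9, 0.1], [0.2, 0.8]]         # rows are d_z, z = 0, 1
pi = [0.5, 0.5]
inside = [[0.76, 0.24], [0.55, 0.45]]      # both kappa_x lie in the hull
outside = [[0.95, 0.05], [0.55, 0.45]]     # first kappa_x lies outside the hull
delta_inside = deficiency(pi, inside, decoder)
delta_outside = deficiency(pi, outside, decoder)
```

The first channel can be simulated exactly by a randomization at the decoder input, so its deficiency is numerically zero; the second cannot, and its deficiency is strictly positive.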
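Given a decoder channel $d$, since $\delta^{\pi}(d,\kappa) \le D_{\mathrm{KL}}(\kappa \,\|\, d \circ e \mid \pi)$ for any $e \in \mathsf{M}(\mathcal{X};\mathcal{Z})$, the loss term in the VDB objective is a variational upper bound on the projected conditional entropy $H_d(Y|X)$. However, this loss is still a lower bound to the standard cross-entropy loss in the VIB objective (Alemi et al. 2016), i.e.,
$$H_d(Y|X) \le \mathbb{E}_{(x,y) \sim p}\Big[-\log \int d(y|z)\, e(z|x)\, dz\Big] \le \mathbb{E}_{(x,y) \sim p}\Big[\int e(z|x)\, \big(-\log d(y|z)\big)\, dz\Big]. \tag{10}$$
The second inequality follows simply from the convexity of the negative logarithm function.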
3.2. Deficiency as a KL-distance from Input-Blackwell Sufficiency
In a seminal paper, David Blackwell (1953) asked the following question: if a learner wishes to make an optimal decision about some target variable of interest and she can choose between two channels with a common input alphabet, which one should she prefer? She can rank the channels by comparing her optimal risks: she will always prefer one channel over another if her optimal risk when using the former is at most that when using the latter for any decision problem. She can also rank the variables purely probabilistically: she will always prefer the former if the latter is an output-degraded version of the former, in the sense that she can simulate a single use of the latter by randomizing at the output of the former. Blackwell showed that these two criteria are equivalent.
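Very recently, Nasser (2017) asked the same question, only now the learner has to choose between two channels with a common output alphabet. Given two channels, $\kappa \in \mathsf{M}(\mathcal{X};\mathcal{Y})$ and $d \in \mathsf{M}(\mathcal{Z};\mathcal{Y})$, we say that $\kappa$ is input-degraded from $d$ and write $d \succeq_{\mathcal{Y}} \kappa$ if $\kappa = d \circ e$ for some $e \in \mathsf{M}(\mathcal{X};\mathcal{Z})$. Stated in another way, $d$ can be reduced to $\kappa$ by applying a randomization $e$ at its input. Nasser (2017) gave a characterization of input-degradedness that is similar to Blackwell’s theorem (Blackwell 1953). We say that $d$ is input-Blackwell sufficient for $\kappa$ if $d \succeq_{\mathcal{Y}} \kappa$.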
Input-Blackwell sufficiency induces a preorder on the set of all channels with the same output alphabet. In practice, most channels are incomparable, i.e., one cannot be reduced to another by a randomization. When such is the case, our deficiency quantifies how far the true channel $\kappa$ is from being a randomization (by way of all input encodings) of the decoder $d$. See Appendix B for a brief summary of Blackwell-Le Cam theory.
3.3. Deficiency and the Log-Loss
When $Y - X - Z$ is a Markov chain, the conditional mutual information $I(Y;X|Z)$ is the Bayes risk gap for the log-loss. This is apparent from noting that $I(Y;X|Z) = H(Y|Z) - H(Y|X)$. This risk gap is closely related to Blackwell’s original notion of sufficiency. Since the log-loss is strictly proper, a vanishing $I(Y;X|Z)$ implies that the risk gap is zero for all loss functions. This suggests that minimizing the log-loss risk gap under a suitable regularization constraint is a potential recipe for constructing representations $Z$ that are approximately sufficient for $X$ w.r.t. $Y$, since in the limit when $I(Y;X|Z) = 0$ one would achieve $I(Y;Z) = I(Y;X)$. This is indeed the basis for the IB algorithm (Tishby et al. 1999) and its generalization, clustering with Bregman divergences (Banerjee et al. 2005, van Rooyen and Williamson 2015).
One can also approximate a sufficient statistic by minimizing deficiencies instead. This follows from noting the following equivalence. The proof is in Appendix C.
Proposition 2.
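When $Y - X - Z$ is a Markov chain, $\delta^{\pi}(d,\kappa) = 0 \iff I(Y;X|Z) = 0$.
In general, for the bottleneck paradigms involving the conditional mutual information (IB) and the deficiency (DB), we have the following relationship:
$$\min_{e} \big[I(Y;X|Z) + \beta\, I(Z;X)\big] \;\ge\; \min_{e,\,d} \big[\delta^{\pi}(d,\kappa) + \beta\, I(Z;X)\big]. \tag{11}$$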
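One way to see why the deficiency cannot exceed the conditional mutual information is the following short derivation. It is a sketch under the assumption that $Y - X - Z$ is a Markov chain with a fixed encoder $e$ and that the decoder is chosen as $d_z = p_{Y|Z=z}$; the notation follows Definition 1.

```latex
\begin{align*}
\delta^{\pi}(d,\kappa)
  &\le D_{\mathrm{KL}}(\kappa \,\|\, d \circ e \mid \pi)
   = \sum_{x} \pi(x)\, D_{\mathrm{KL}}\Big(\kappa_x \,\Big\|\, \sum_{z} e(z|x)\, d_z\Big) \\
  &\le \sum_{x} \pi(x) \sum_{z} e(z|x)\, D_{\mathrm{KL}}(\kappa_x \,\|\, d_z)
   && \text{(convexity of the KL divergence in its second argument)} \\
  &= I(Y;X|Z)
   && \text{(since } p_{Y|X=x,Z=z} = \kappa_x \text{ and } d_z = p_{Y|Z=z}\text{)}.
\end{align*}
```

Adding $\beta\, I(Z;X)$ to both sides and minimizing gives a relationship of the form stated above.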
It is clear from Proposition 2 that the representations are going to be the same only in the limit of exact sufficiency. Our experiments corroborate that for achieving the same level of sufficiency, one needs to store less information about the input when minimizing the deficiencies than when minimizing the conditional mutual information.
4. Experiments
We present some experiments on the MNIST dataset (LeCun and Cortes 2010). Classification on MNIST is a very well studied problem. The main objective of our experiments is to evaluate the information-theoretic properties of the representations learned by the VDB and whether it can match the classification accuracy provided by other bottleneck methods.
For the encoder we use a fully connected feedforward network with 784 input units, 1024 ReLUs, 1024 ReLUs, and 512 linear output units. The deterministic output of this network is interpreted as the vector of means and variances of a 256-dimensional Gaussian distribution. The decoder is a simple logistic regression model with a softmax layer. These are the same model settings used by Alemi et al. (2016). We implement the algorithm in TensorFlow and train for 200 epochs using the Adam optimizer.
As can be seen from the upper panels in Figure 2, the test accuracy is stable with increasing $M$. We note that $M = 1$ is just the VIB model (Alemi et al. 2016). The lower left panel of Figure 2 shows the information bottleneck curve. The IB curve traces the mutual information between representation and output vs. the mutual information between representation and input, for different values of the regularization parameter $\beta$ at the end of training. In the case of the VDB, we substitute $I(Z;Y)$ by the corresponding term in our algorithm, which is the output entropy $H(Y)$ minus the loss term of the VDB objective, and which is the value that we are actually plotting. For orientation, lower values of $\beta$ have higher values of $I(Z;X)$ (towards the right of the plot). For small values of $\beta$, when the effect of the regularization is negligible, the bottleneck allows more information from the input through the representation. In this case, the mutual information between the representation and output increases on the training set, but not necessarily on the test set. This is manifest in the gap between the train and test curves, indicative of a degradation in generalization. For intermediate values of $\beta$, the gap is smaller for larger values of $M$ (our method).
The lower right panel of Figure 2 plots the minimality term vs. $\beta$. We see that, over an intermediate range of $\beta$ and for the same level of sufficiency, setting $M > 1$ consistently achieves more compression of the input compared to the setting $M = 1$. The dynamics of the information quantities during training are also interesting. We provide figures on these in Appendix D.
In order to visualize the representations, we also train the VDB on MNIST with a 2-dimensional representation. We use the same settings as before, with the only difference that the dimension of the output layer of the encoder is 4, with two coordinates representing the mean, and two a diagonal covariance matrix. The results are shown in Figure 3. For
, the representations are well separated, depending on the class. For related figures in the setting of unsupervised learning, see Appendix E.
The learning dynamics of the mutual information and classification accuracy are shown in Figure 4. The left panel has an interpretation in terms of a phase where the model is mainly fitting the input-output relationship and hence increasing the mutual information $I(Z;Y)$, followed by a compression phase, where training is mainly reducing $I(Z;X)$, leading to better generalization. The right panel shows the test accuracy as training progresses. Higher values of $M$ (our method) usually lead to better accuracy. An exception is when the number of posterior samples for classification is large.
5. Discussion
We have formulated a bottleneck method based on channel deficiencies. The deficiency of a decoder with respect to the true channel between input and output quantifies how well a randomization at the decoder input (by way of stochastic encodings) can be used to simulate the true channel. The VDB has a natural variational formulation which recovers the VIB in the limit of a single sample of the encoder output. Experiments demonstrate that the VDB can learn more compressed representations while retaining the same discriminative capacity. The method has a statistical decision-theoretic appeal. Moreover, the resulting variational objective of the DB can be implemented as an easy modification of the VIB, with little to no computational overhead.
Given two channels that convey information about a target variable of interest, two different notions of deficiencies arise, depending on whether the target resides at the common input or the common output of the given channels. When the target is at the common output of the two channels, as in a typical bottleneck setting (see Figure 1), our Definition 1 has a natural interpretation as a KL-divergence from input-Blackwell sufficiency (Nasser 2017). Here sufficiency is achieved by applying a randomization at the input of the decoder with the goal of simulating the true channel. The notion of input-Blackwell sufficiency contrasts with Blackwell’s original notion of sufficiency (Blackwell 1953) in the sense that Blackwell’s theory compares two channels with a common input. One can again define a notion of deficiency in this setting (see Appendix B for a discussion on deficiencies in the classical Blackwell setup). The associated channels (one from $Y$ to $X$ and the other from $Y$ to $Z$) do not, however, have a natural interpretation in a typical bottleneck setting. In contrast, the input-Blackwell setup appears to be much more intuitive in this context. This subtle distinction seems to have gone unnoticed in the literature (see, e.g., van Rooyen and Williamson 2014; 2015).
The more detailed view of information emerging from this analysis explains various effects and opens the door to multiple generalizations. In the spirit of the VDB, one can formulate a variational deficiency autoencoder (VDAE) as well (see sketch in Appendix E). On a related note, we mention that the deficiency is a lower bound to a quantity called the unique information (Bertschinger et al. 2014, Banerjee et al. 2018a) (see details in Appendix C). An alternating minimization algorithm similar in spirit to the classical Blahut-Arimoto algorithm (Blahut 1972) has been proposed to compute this quantity (Banerjee et al. 2018b). Such an algorithm, however, is not feasible in a deep neural network implementation. In the limit $\beta \to 0$, the VDB is a step forward towards estimating the unique information. This might be of independent interest in improving the practicality of the theory of information decompositions.
References
 Alemi et al. (2016) A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck, 2016. URL http://arxiv.org/abs/1612.00410. ICLR17.

 Banerjee et al. (2005) A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
 Banerjee et al. (2018a) P. K. Banerjee, E. Olbrich, J. Jost, and J. Rauh. Unique informations and deficiencies. arXiv preprint arXiv:1807.05103, 2018a. Allerton 2018.
 Banerjee et al. (2018b) P. K. Banerjee, J. Rauh, and G. Montúfar. Computing the unique information. In Proc. IEEE ISIT, pages 141–145. IEEE, 2018b.
 Bertschinger and Rauh (2014) N. Bertschinger and J. Rauh. The Blackwell relation defines no lattice. In Proc. IEEE ISIT, pages 2479–2483. IEEE, 2014.
 Bertschinger et al. (2014) N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay. Quantifying unique information. Entropy, 16(4):2161–2183, 2014.
 Blackwell (1953) D. Blackwell. Equivalent comparisons of experiments. The Annals of Mathematical Statistics, 24(2):265–272, 1953.
 Blahut (1972) R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
 Boucheron et al. (2005) S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
 Chalk et al. (2016) M. Chalk, O. Marre, and G. Tkacik. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pages 1957–1965, 2016.
 Csiszár (1972) I. Csiszár. A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1–4):191–213, 1972.
 Csiszár and Körner (2011) I. Csiszár and J. Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
 Csiszár and Matuš (2003) I. Csiszár and F. Matuš. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
 Gneiting and Raftery (2007) T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
 Gondek and Hofmann (2003) D. Gondek and T. Hofmann. Conditional information bottleneck clustering. In 3rd IEEE International Conference on Data Mining, Workshop on Clustering Large Data Sets, 2003.
 Grünwald et al. (2004) P. D. Grünwald, A. P. Dawid, et al. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367–1433, 2004.
 Harder et al. (2013) M. Harder, C. Salge, and D. Polani. A bivariate measure of redundant information. Physical Review E, 87:012130, 2013.
 Harremoës and Tishby (2007) P. Harremoës and N. Tishby. The information bottleneck revisited or how to choose a good distortion measure. In Proc. IEEE ISIT, pages 566–570. IEEE, 2007.
 Higgins et al. (2017) I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. 2017. ICLR17.
 Hsu et al. (2018) H. Hsu, S. Asoodeh, S. Salamatian, and F. P. Calmon. Generalizing bottleneck problems. In Proc. IEEE ISIT, pages 531–535. IEEE, 2018.
 Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. ICLR13.
 Kolchinsky et al. (2017) A. Kolchinsky, B. D. Tracey, and D. H. Wolpert. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436, 2017.
 Körner and Marton (1975) J. Körner and K. Marton. Comparison of two noisy channels. In Topics in information theory, volume 16, pages 411–423. Colloquia Mathematica Societatis János Bolyai, Keszthely (Hungary), 1975.
 Le Cam (1964) L. Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, pages 1419–1455, 1964.
 LeCun and Cortes (2010) Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Liese and Vajda (2006) F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
 Nasser (2017) R. Nasser. On the input-degradedness and input-equivalence between channels. In Proc. IEEE ISIT, pages 2453–2457. IEEE, 2017.
 Parry et al. (2012) M. Parry, A. P. Dawid, S. Lauritzen, et al. Proper local scoring rules. The Annals of Statistics, 40(1):561–592, 2012.
 Raginsky (2011) M. Raginsky. Shannon meets Blackwell and Le Cam: Channels, codes, and statistical experiments. In Proc. IEEE ISIT, pages 1220–1224. IEEE, 2011.
 Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
 Shamir et al. (2008) O. Shamir, S. Sabato, and N. Tishby. Learning and generalization with the information bottleneck. In International Conference on Algorithmic Learning Theory, pages 92–107. Springer, 2008.
 Tishby and Zaslavsky (2015) N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pages 1–5. IEEE, 2015.
 Tishby et al. (1999) N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
 Torgersen (1991) E. Torgersen. Comparison of statistical experiments, volume 36. Cambridge University Press, 1991.
 van Rooyen and Williamson (2014) B. van Rooyen and R. C. Williamson. Le Cam meets LeCun: Deficiency and generic feature learning. arXiv preprint arXiv:1402.4884, 2014.
 van Rooyen and Williamson (2015) B. van Rooyen and R. C. Williamson. A theory of feature learning. arXiv preprint arXiv:1504.00083, 2015.
 Vera et al. (2018) M. Vera, P. Piantanida, and L. R. Vega. The role of information complexity and randomization in representation learning. arXiv preprint arXiv:1802.05355, 2018.

 Witsenhausen and Wyner (1975) H. Witsenhausen and A. Wyner. A conditional entropy bound for a pair of discrete random variables. IEEE Transactions on Information Theory, 21(5):493–501, 1975.
Appendix A Misclassification error and the average log-loss
In a classification task, the goal is to use the training dataset to learn a classifier $\hat{y}$ that minimizes the probability of error under the true data distribution, defined as follows.
$$P_{E}(\hat{y}) = \mathbb{P}_{(x,y) \sim p}\big(\hat{y}(x) \ne y\big). \tag{12}$$
It is well known that the optimal classifier that gives the smallest probability of error is the Bayes classifier [Boucheron et al., 2005]. Since we do not know the true data distribution, we try to learn based on the empirical error. Directly minimizing the empirical probability of error over the training dataset is in general an NP-hard problem. In practice, one minimizes a surrogate loss function that is a convex upper bound on $P_{E}$. A natural surrogate is the average log-loss function $\mathbb{E}_{(x,y) \sim p}\big[-\log \hat{\kappa}(y|x)\big]$. When the model is $\hat{\kappa} = d \circ e$, the following upper bounds are immediate from using Jensen’s inequality.
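$$P_{E} \le 1 - \exp\Big(-\mathbb{E}_{(x,y) \sim p}\Big[-\log \int d(y|z)\, e(z|x)\, dz\Big]\Big) \le 1 - \exp\Big(-\mathbb{E}_{(x,y) \sim p}\Big[\int e(z|x)\, \big(-\log d(y|z)\big)\, dz\Big]\Big). \tag{13}$$
The bound using the standard cross-entropy loss is evidently weaker than the one using the average log-loss. A lower bound on the probability of error is controlled by a convex functional of the mutual information between the representation and the raw inputs [Vera et al., 2018; see, e.g., Lemma 4]. The average log-loss and the rate term in the VDB objective (equation 4) are two fundamental quantities that govern the probability of error.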
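A quick numerical check of a Jensen-type bound of this kind is sketched below. It assumes a randomized classifier that samples its label from the model distribution, so that the error probability equals one minus the average model probability of the true label; the probabilities and function names are illustrative assumptions, not values from the experiments.

```python
import math


def error_probability(true_label_probs):
    """Error of a classifier that samples its label from the model:
    1 - average model probability assigned to the true label."""
    return 1.0 - sum(true_label_probs) / len(true_label_probs)


def log_loss_bound(true_label_probs):
    """Upper bound 1 - exp(-average log-loss), obtained from Jensen's inequality."""
    avg_log_loss = sum(-math.log(p) for p in true_label_probs) / len(true_label_probs)
    return 1.0 - math.exp(-avg_log_loss)


probs = [0.9, 0.6, 0.75, 0.3]   # illustrative model probabilities of the true labels
err = error_probability(probs)
bound = log_loss_bound(probs)
```

The bound is tight when the model assigns the same probability to the true label on every example.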
Appendix B Classical Theory of Comparison of Channels
In this section, we discuss the classical theory of comparison of channels due to Blackwell [1953] and its extension by Le Cam [1964], Torgersen [1991] and more recently by Raginsky [2011].
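Suppose that a learner wishes to predict the value of a random variable $Y$ that takes values in a set $\mathcal{Y}$. She has a set of actions $\mathcal{A}$. Each action incurs a loss $\ell(y, a)$ that depends on the true state $y$ of $Y$ and the chosen action $a$. Let $\pi_Y$ encode the learner’s uncertainty about the true state $y$. The tuple $(\pi_Y, \mathcal{A}, \ell)$ is called a decision problem. Before choosing her action, the learner observes a random variable $X$ through a channel $\mu \in \mathsf{M}(\mathcal{Y};\mathcal{X})$. An ideal learner chooses a strategy $\rho \in \mathsf{M}(\mathcal{X};\mathcal{A})$ that minimizes her expected loss or risk $R(\pi_Y, \mu, \rho, \ell) = \sum_{y} \pi_Y(y) \sum_{x} \mu(x|y) \sum_{a} \rho(a|x)\, \ell(y, a)$. The optimal risk when using the channel $\mu$ is $R(\pi_Y, \mu, \ell) = \min_{\rho} R(\pi_Y, \mu, \rho, \ell)$.
Suppose now that the learner has to choose between $X$ and another random variable $Z$ that she observes through a second channel $\nu \in \mathsf{M}(\mathcal{Y};\mathcal{Z})$ with common input $Y$. She can always discard $Z$ in favor of $X$ if, knowing $X$, she can simulate a single use of $Z$ by randomly sampling a $z' \in \mathcal{Z}$ after each observation $x$.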
Definition 3.
We say that $Z$ is output-degraded from $X$ w.r.t. $Y$, denoted $X \sqsupseteq_{Y} Z$, if there exists a random variable $Z'$ such that the pairs $(Y, Z)$ and $(Y, Z')$ are stochastically indistinguishable, and $Y - X - Z'$ is a Markov chain.
She can also discard $Z$ if her optimal risk when using $X$ is at most that when using $Z$ for any decision problem. Write $X \sqsupseteq'_{Y} Z$ if $R(\pi_Y, \mu, \ell) \le R(\pi_Y, \nu, \ell)$ for any decision problem. Blackwell [1953] showed the equivalence of these two relations.
Theorem 4.
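(Blackwell’s Theorem) $X \sqsupseteq_{Y} Z \iff X \sqsupseteq'_{Y} Z$.
Write $\mu \sqsupseteq_{Y} \nu$ if $\nu = \lambda \circ \mu$ for some $\lambda \in \mathsf{M}(\mathcal{X};\mathcal{Z})$. If $\pi_Y$ has full support, then it is easy to check that $\mu \sqsupseteq_{Y} \nu \iff X \sqsupseteq_{Y} Z$ [Bertschinger and Rauh, 2014; Theorem 4].
The learner can also compare $\mu$ and $\nu$ by comparing the mutual informations $I(Y;X)$ and $I(Y;Z)$ between the common input $Y$ and the channel outputs $X$ and $Z$.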
Definition 5.
$\mu$ is said to be more capable than $\nu$, denoted $\mu \sqsupseteq_{\mathrm{mc}} \nu$, if $I(Y;X) \ge I(Y;Z)$ for all probability distributions on $\mathcal{Y}$.
It follows from the data processing inequality that $\mu \sqsupseteq_{Y} \nu \implies \mu \sqsupseteq_{\mathrm{mc}} \nu$. However, the converse implication is not true in general [Körner and Marton, 1975].
The converse to Blackwell’s theorem states that if the relation $X \sqsupseteq_{Y} Z$ does not hold, then there exists a set of actions $\mathcal{A}$ and a loss function $\ell$ such that $R(\pi_Y, \mu, \ell) > R(\pi_Y, \nu, \ell)$. Le Cam introduced the concept of a deficiency of $\mu$ w.r.t. $\nu$ to express this deficit in optimal risks [Le Cam, 1964] in terms of an approximation of $\nu$ from $\mu$ via Markov kernels.
Definition 6.
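The deficiency of $\mu$ w.r.t. $\nu$ is
$$\delta(\mu, \nu) = \inf_{\lambda \in \mathsf{M}(\mathcal{X};\mathcal{Z})}\, \sup_{y \in \mathcal{Y}}\, \big\| \lambda \circ \mu_y - \nu_y \big\|_{\mathrm{TV}}, \tag{14}$$
where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation distance.
When the distribution of the common input to the channels is fixed, one can define a weighted deficiency [Torgersen, 1991; Section 6.2].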
Definition 7.
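Given $\pi_Y$, the weighted deficiency of $\mu$ w.r.t. $\nu$ is
$$\delta^{\pi}(\mu, \nu) = \inf_{\lambda \in \mathsf{M}(\mathcal{X};\mathcal{Z})}\, \mathbb{E}_{y \sim \pi_Y}\, \big\| \lambda \circ \mu_y - \nu_y \big\|_{\mathrm{TV}}. \tag{15}$$
Le Cam’s randomization criterion [Le Cam, 1964] shows that deficiencies quantify the maximal gap in the optimal risks of decision problems when using the channel $\mu$ rather than $\nu$.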
Theorem 8 (Le Cam [1964]).
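Fix $\mu \in \mathsf{M}(\mathcal{Y};\mathcal{X})$, $\nu \in \mathsf{M}(\mathcal{Y};\mathcal{Z})$, and a probability distribution $\pi_Y$ on $\mathcal{Y}$ and write $\|\ell\|_{\infty} = \max_{y,a} \ell(y, a)$. For every $\epsilon > 0$, $\delta^{\pi}(\mu, \nu) \le \epsilon$ if and only if $R(\pi_Y, \mu, \ell) - R(\pi_Y, \nu, \ell) \le \epsilon \|\ell\|_{\infty}$ for any set of actions $\mathcal{A}$ and any bounded loss function $\ell$.
Raginsky [2011] introduced a broad class of deficiency-like quantities using the notion of a generalized divergence between probability distributions that satisfies a monotonicity property w.r.t. data processing. The family of $f$-divergences due to Csiszár belongs to this class [Liese and Vajda, 2006].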
Definition 9.
The $f$-deficiency of $\mu$ w.r.t. $\nu$ is
$$\delta_{f}(\mu, \nu) = \inf_{\lambda \in \mathsf{M}(\mathcal{X};\mathcal{Z})}\, \sup_{y \in \mathcal{Y}}\, D_{f}\big(\nu_y \,\|\, \lambda \circ \mu_y\big). \tag{16}$$
Many common divergences, such as the Kullback-Leibler (KL) divergence, the reverse-KL divergence, and the total variation distance are $f$-divergences. When the channel $\mu$ is such that its output is constant, no matter what the input, the corresponding $f$-deficiency is called $f$-informativity [Csiszár, 1972]. The $f$-informativity associated with the KL divergence is just the channel capacity, which has a geometric interpretation as an “information radius”.
We can also define a weighted $f$-deficiency of $\mu$ w.r.t. $\nu$.
Definition 10.
The weighted deficiency of w.r.t. is
(17) 
Specializing to the KL divergence, we have the following definition.
Definition 11.
The weighted output deficiency of w.r.t. is
(18) 
where the subscript in emphasizes the fact that the randomization is at the output of the channel .
Note that if and only if , which captures the intuition that if is small, then is approximately output-degraded from w.r.t. . Using Pinsker’s inequality, we have
(19) 
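The relation between the KL-based and the total-variation-based weighted quantities can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the weighted output deficiency to be a minimum, over output randomizations (decoders), of an input-weighted KL divergence between the target channel and the composed channel, so any fixed decoder yields an upper bound on it. The names `prior`, `target`, `encoder`, and `decoder`, and all numerical values, are hypothetical. Divergences are in nats, for which Pinsker's inequality reads TV ≤ sqrt(KL / 2); the weighted version follows from the concavity of the square root.

```python
import math

def compose(decoder, encoder):
    """Channel x -> y obtained by applying encoder (x -> z), then decoder (z -> y)."""
    n_z, n_y = len(decoder), len(decoder[0])
    return [
        [sum(enc_row[z] * decoder[z][y] for z in range(n_z)) for y in range(n_y)]
        for enc_row in encoder
    ]

def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance between discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def weighted(div, prior, chan_a, chan_b):
    """Average a row-wise divergence between two channels under the input prior."""
    return sum(w * div(a, b) for w, a, b in zip(prior, chan_a, chan_b))

# Hypothetical toy problem: 3 inputs, 2 codes, 2 outputs.
prior = [0.5, 0.3, 0.2]
target = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]   # true channel x -> y
encoder = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]  # representation x -> z
decoder = [[0.95, 0.05], [0.15, 0.85]]          # one fixed randomization z -> y

approx = compose(decoder, encoder)
kl_gap = weighted(kl, prior, target, approx)  # upper-bounds the min over decoders
tv_gap = weighted(tv, prior, target, approx)
pinsker_bound = math.sqrt(kl_gap / 2.0)

print(f"weighted KL = {kl_gap:.4f}, weighted TV = {tv_gap:.4f}, "
      f"Pinsker bound = {pinsker_bound:.4f}")
```

Minimizing `kl_gap` over the decoder (and, in a bottleneck, over the encoder as well) would tighten the bound; the sketch fixes both to keep the arithmetic transparent.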
Appendix C Unique Information Bottleneck
In this section, we give a new perspective on the Information Bottleneck paradigm using nonnegative mutual information decompositions. The quantity we are interested in is the unique information proposed by Bertschinger et al. [2014]. Work in a similar vein includes Harder et al. [2013] and, more recently, Banerjee et al. [2018a], which gives an operationalization of the unique information.
Consider three random variables , , and with joint distribution . The mutual information between and can be decomposed into information that has about that is unknown to (we call this the unique information of w.r.t. ) and information that has about that is known to (we call this the shared information).
(20) 
Conditioning on destroys the shared information but creates complementary or synergistic information from the interaction of and .
(21) 
Using the chain rule, the total information that the pair conveys about can be decomposed into four terms.
(22) 
(23) 
, , and are nonnegative functions that depend continuously on the joint distribution of .
For completeness, we rewrite the information decomposition equations below.
(24a)  
(24b)  
(24c)  
(24d) 
The unique information can be interpreted as either the conditional mutual information without the synergy, or as the mutual information without the redundancy.
When is a Markov chain, the information decomposition is
(25a)  
(25b)  
(25c)  
(25d) 
The Information bottleneck [Tishby et al., 1999] minimizes the following objective
(26) 
over all encoders . Since is a Markov chain, the sufficiency term in the IB objective depends on the pairwise marginals and , while the minimality term depends on the marginal. From equation 25b, it follows that one can equivalently write the IB objective function as
(27) 
From an information decomposition perspective, the original IB is actually minimizing just the unique information subject to a regularization constraint. This is a simple consequence of the fact that the synergistic information vanishes (see equation 25d) under the Markov chain condition. Hence, one might equivalently call the original IB the Unique Information Bottleneck.
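The chain-rule step behind this rewriting can be verified on a small example. The sketch below is an illustration only: the labels `Y` (target), `X` (input), and `Z` (representation) and all probabilities are assumptions chosen here. It builds a joint distribution in which the target and the representation are conditionally independent given the input, and checks that the information the representation carries about the target beyond the input is zero, so that the conditional mutual information between target and input given the representation equals I(Y;X) − I(Y;Z). Maximizing the sufficiency term I(Y;Z) is therefore the same as minimizing that conditional mutual information.

```python
import math
from itertools import product

# Hypothetical binary example with Markov structure Y - X - Z.
p_x = [0.4, 0.6]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]    # true channel x -> y
p_z_given_x = [[0.7, 0.3], [0.25, 0.75]]  # encoder x -> z

# Joint distribution indexed by (y, x, z).
joint = {
    (y, x, z): p_x[x] * p_y_given_x[x][y] * p_z_given_x[x][z]
    for y, x, z in product(range(2), repeat=3)
}

def marginal(dist, keep):
    """Marginalize a joint distribution onto the coordinates listed in keep."""
    out = {}
    for key, p in dist.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def mutual_information(dist, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return (entropy(marginal(dist, a)) + entropy(marginal(dist, b))
            - entropy(marginal(dist, a + b)))

def conditional_mi(dist, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(marginal(dist, a + c)) + entropy(marginal(dist, b + c))
            - entropy(marginal(dist, a + b + c)) - entropy(marginal(dist, c)))

Y, X, Z = (0,), (1,), (2,)
i_yx = mutual_information(joint, Y, X)
i_yz = mutual_information(joint, Y, Z)
i_yx_given_z = conditional_mi(joint, Y, X, Z)
i_yz_given_x = conditional_mi(joint, Y, Z, X)

print(f"I(Y;Z|X) = {i_yz_given_x:.2e}")
print(f"I(Y;X|Z) = {i_yx_given_z:.6f}, I(Y;X) - I(Y;Z) = {i_yx - i_yz:.6f}")
```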
Appealing to classical Blackwell theory, Bertschinger et al. [2014] defined a nonnegative decomposition of the mutual information based on the idea that the unique and shared information should depend only on the pairwise marginals and .
Definition 12.
Let , and let , be two channels with the same input alphabet such that and . Define
(28a)  
(28b)  
(28c)  
(28d)  
(28e) 
where the subscript in denotes the joint distribution on which the quantities are computed.
The functions , , and are nonnegative and satisfy equation 24. Furthermore, and depend on the marginal distributions of the pairs and . Only the function depends on the full joint .
satisfies the following intuitive property in relation to Blackwell’s theorem 4.
Proposition 13.
[Bertschinger et al., 2014; Lemma 6] .
The equivalence in Proposition 2 follows from noting that [Banerjee et al., 2018a; Proposition 28] and the fact that when is a Markov chain.
Since there is no complementary information in the IB setting, one may choose to minimize either which is in fact equal to (the original IB objective) or the deficiency . From the discussion above, it is clear that the results are equivalent only in the limit of exact sufficiency since . In all other cases, we expect to find something (subtly) different from IB. In general, for the bottlenecks, we have:
(29) 
Hence, for achieving the same level of sufficiency, one needs to store less information about the input when minimizing the deficiencies than when minimizing the conditional mutual information.
Appendix D Additional figures on VDB experiments
Appendix E Unsupervised representation learning using the VDB
In this section, we discuss some preliminary results on an unsupervised version of the VDB objective, which bears some resemblance to the β-VAE [Higgins et al., 2017]. The cross-entropy loss that appears in an autoencoder is similar to the cross entropy that appears in the information bottleneck. In a spirit similar to the VDB, we can formulate a deficiency autoencoder, which we call the Variational Deficiency Autoencoder (VDAE).
Let be the true data density. The optimization objective in the VDAE is
(30) 
where is defined to be a standard multivariate Gaussian distribution and the parameter allows us to interpolate between pure autoencoding () and pure autodecoding () behavior.
The optimization objective in the β-VAE [Higgins et al., 2017] is
(31) 
We note that the β-VAE has a training objective similar to that of the VDAE, the only difference being that the integral is outside the log.
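The effect of moving the integral across the log can be seen on a single data point. The sketch below is a hedged illustration, not the experimental setup: the Gaussian encoder and decoder, their parameters, and the sample size are all hypothetical. It draws codes from the encoder and compares the average log-likelihood (integral outside the log) with the log of the average likelihood (integral inside the log). By Jensen's inequality the second is never smaller than the first, so the corresponding reconstruction loss with the integral inside the log is never larger.

```python
import math
import random

def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))

def gaussian_log_density(x, mean, std=1.0):
    """Log density of a univariate Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

rng = random.Random(0)
x = 1.0

# Hypothetical encoder: z | x ~ N(0.8 * x, 0.5^2).
codes = [rng.gauss(0.8 * x, 0.5) for _ in range(2000)]

# Hypothetical decoder: x | z ~ N(z, 1).
log_liks = [gaussian_log_density(x, mean=z) for z in codes]

log_outside = sum(log_liks) / len(log_liks)  # estimate of E_z[log d(x|z)]
log_inside = log_mean_exp(log_liks)          # estimate of log E_z[d(x|z)]

print(f"integral outside the log: {log_outside:.4f}")
print(f"integral inside the log:  {log_inside:.4f}")
print(f"Jensen gap:               {log_inside - log_outside:.4f}")
```

The gap shrinks as the encoder becomes more concentrated, since the two estimates coincide when the code is deterministic.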
Figures 6 and 7 show some preliminary experiments on the MNIST dataset. The representations are visually comparable with results obtained in other standard works on the Variational Autoencoder.