 # The Variational Deficiency Bottleneck

We introduce a bottleneck method for learning data representations based on channel deficiency, rather than the more traditional information sufficiency. A variational upper bound allows us to implement this method efficiently. The bound itself is bounded above by the variational information bottleneck objective, and the two methods coincide in the regime of single-shot Monte Carlo approximations. The notion of deficiency provides a principled way of approximating complicated channels by relatively simpler ones. The deficiency of one channel w.r.t. another has an operational interpretation in terms of the optimal risk gap of decision problems, capturing classification as a special case. Unsupervised generalizations are possible, such as the deficiency autoencoder, which can also be formulated in a variational form. Experiments demonstrate that the deficiency bottleneck can provide advantages in terms of minimal sufficiency as measured by information bottleneck curves, while retaining a good test performance in classification and reconstruction tasks.


## 1. Introduction

The information bottleneck (IB) is an approach to learning data representations based on a notion of minimal sufficiency. The general idea is to map an input source into a representation that retains as little information as possible about the input (minimality), while retaining as much information as possible about a target variable of interest (sufficiency). See Figure 1. For example, in a classification problem, the target variable could be the class label of the input data. In a reconstruction problem, the target variable could be a denoised reconstruction of the input. Intuitively, a representation that is minimal in relation to a given task will discard nuisances in the inputs that are irrelevant to the task, and hence distill more meaningful information and allow for better generalization.

In a typical bottleneck paradigm, an input variable $X$ is first mapped to an intermediate representation variable $Z$, and then $Z$ is mapped to an output variable of interest $Y$. We call these mappings, respectively, a representation model (encoder) and an inference model (decoder). The channel $\kappa$ models the true relation between the input $X$ and the output $Y$. In general, the channel $\kappa$ is unknown, and only accessible through a set of examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$. We would like to obtain an approximation of $\kappa$ using a probabilistic model that comprises the encoder-decoder pair.

The IB methods (Witsenhausen and Wyner 1975, Tishby et al. 1999, Harremoës and Tishby 2007, Hsu et al. 2018) have found numerous applications, e.g., in representation learning, clustering, classification, generative modelling, reinforcement learning, and analyzing training in deep neural networks (see, e.g., Shamir et al. 2008, Gondek and Hofmann 2003, Higgins et al. 2017, Alemi et al. 2016, Tishby and Zaslavsky 2015).

In the traditional IB, minimality and sufficiency are measured in terms of the mutual information. Computing the mutual information can be challenging in practice. Various recent works have formulated more tractable functions by way of variational bounds on the mutual information (Chalk et al. 2016, Alemi et al. 2016, Kolchinsky et al. 2017), sandwiching the objective function of interest.

Instead of approximating the sufficiency term of the IB, we formulate a new bottleneck method that minimizes deficiency. Deficiencies provide a principled way of approximating complex channels by relatively simpler ones. The deficiency of a decoder with respect to the true channel between input and output variables quantifies how well any stochastic encoding at the decoder input can be used to approximate the true channel. Deficiencies have a rich heritage in the theory of comparison of statistical experiments (Blackwell 1953, Le Cam 1964, Torgersen 1991). From this angle, the formalism of deficiencies has been used to obtain bounds on optimal risk gaps of statistical decision problems. As we show, the deficiency bottleneck minimizes a regularized risk gap. Moreover, the proposed method has an immediate variational formulation that can be easily implemented as a modification of the Variational Information Bottleneck (VIB) (Alemi et al. 2016). In fact, both methods coincide in the limit of single-shot Monte Carlo approximations. We call our method the Variational Deficiency Bottleneck (VDB).

As we show in Proposition 2, perfect optimization of the IB sufficiency corresponds to perfect minimization of the DB deficiency. However, when working over a parametrized model and adding the bottleneck regularizer, the two methods have different preferences, with the DB being closer to the optimal risk gap. Experiments on basic data sets show that the VDB is able to obtain more compressed representations than the VIB while performing equally well or better in terms of test accuracy.

We describe the details of our method in Section 2. We elaborate on the theory of deficiencies in Section 3. Experimental results with the VDB are presented in Section 4.

## 2. The variational deficiency bottleneck (VDB)

Let $X$ denote an observation or input variable and $Y$ an output variable of interest. Let $p(x,y) = \kappa(y|x)\,\pi(x)$ be the true joint distribution, where the conditional distribution or channel $\kappa(y|x)$ describes how the output depends on the input. We consider the situation where the true channel is unknown, but we are given a set of $N$ independent and identically distributed (i.i.d.) samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$ from $p(x,y)$. Our goal is to use this data to learn a more structured version of the channel $\kappa$, by first “compressing” the input $X$ to an intermediate representation variable $Z$ and subsequently mapping the representation back to the output $Y$. The presence of an intermediate representation can be regarded as a bottleneck, a model selection problem, or a regularization strategy.

We define a representation model and an inference model using two parameterized families of channels $e(z|x)$ and $d(y|z)$. We will refer to $e$ and $d$ as an encoder and a decoder. The encoder-decoder pair induces a model $(d \circ e)(y|x) = \int d(y|z)\, e(z|x)\, dz$. We denote this composed channel by $d \circ e$.

Given a representation, we want the decoder to be as powerful as the original channel $\kappa$ in terms of its ability to recover the output. The deficiency of a decoder $d$ w.r.t. $\kappa$ quantifies the extent to which any pre-processing of the input of $d$ (by way of randomized encodings) can be used to approximate $\kappa$ (in the KL-divergence sense). Let $\mathcal{M}(\mathcal{X};\mathcal{Z})$ denote the space of all channels from $\mathcal{X}$ to $\mathcal{Z}$. We define the deficiency of $d$ w.r.t. $\kappa$ as follows.

###### Definition 1.

Given the channel $\kappa$ from $\mathcal{X}$ to $\mathcal{Y}$ and a decoder $d$ from some $\mathcal{Z}$ to $\mathcal{Y}$, the deficiency of $d$ w.r.t. $\kappa$ is defined as

$$\delta_\pi(d, \kappa) = \min_{e \in \mathcal{M}(\mathcal{X};\mathcal{Z})} D_{\mathrm{KL}}\big(\kappa \,\big\|\, d \circ e \,\big|\, \pi\big). \qquad (1)$$

Here $D_{\mathrm{KL}}(\cdot \,\|\, \cdot \mid \pi)$ is the conditional KL divergence (Csiszár and Körner 2011), and $\pi$ is an input distribution over $\mathcal{X}$. The definition is similar in spirit to Lucien Le Cam’s notion of weighted deficiencies of one channel w.r.t. another (Le Cam 1964; Torgersen 1991, Section 6.2) and its recent generalization by Raginsky (2011).
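To make Definition 1 concrete, the deficiency can be approximated by brute force when all alphabets are binary. The following sketch (with made-up channel values, not taken from the paper) grid-searches over $2 \times 2$ stochastic encoders $e(z|x)$:

```python
import numpy as np
from itertools import product

# Toy instance of Definition 1 with binary alphabets (|X| = |Z| = |Y| = 2).
# kappa[x, y] = true channel kappa(y|x); d[z, y] = decoder d(y|z); pi = input dist.
# All numeric values below are hypothetical.
kappa = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
d = np.array([[0.85, 0.15],
              [0.25, 0.75]])
pi = np.array([0.5, 0.5])

def cond_kl(p, q, pi):
    """Conditional KL divergence D_KL(p || q | pi) for row-stochastic p, q."""
    return float(np.sum(pi[:, None] * p * np.log(p / q)))

# Brute-force the minimum over encoders e(z|x) on a grid
# (each row of e is a distribution over Z).
grid = np.linspace(0.0, 1.0, 201)
best = np.inf
for a, b in product(grid, grid):
    e = np.array([[a, 1 - a], [b, 1 - b]])   # e[x, z] = e(z|x)
    q = e @ d                                 # composed channel (d o e)(y|x)
    if np.all(q > 0):
        best = min(best, cond_kl(kappa, q, pi))

print(f"deficiency ~ {best:.4f}")
```

For larger alphabets or a continuous $\mathcal{Z}$, this grid search is of course infeasible; the variational formulation of this section replaces it with stochastic-gradient optimization over a parametrized encoder.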

We propose to train the model by minimizing the deficiency of $d$ w.r.t. $\kappa$, subject to a regularization that penalizes complex representations. The regularization is achieved by limiting the rate $I(Z;X)$, the mutual information between the representation and the raw inputs. We call our method the Deficiency Bottleneck (DB). The DB minimizes the following objective over all pairs $(e, d)$:

$$\mathcal{L}_{\mathrm{DB}}(e, d) := \delta_\pi(d, \kappa) + \beta\, I(Z;X). \qquad (2)$$

The parameter $\beta \geq 0$ allows us to adjust the level of regularization.

For any distribution $r(z)$, the rate term admits a simple variational upper bound (Csiszár and Körner 2011, Eq. (8.7)):

$$I(Z;X) \leq \int p(x,z) \log \frac{e(z|x)}{r(z)}\, dx\, dz. \qquad (3)$$
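For finite alphabets, the bound in Eq. (3) and its slack can be checked numerically; the input distribution, encoder, and choice of $r(z)$ below are arbitrary made-up examples:

```python
import numpy as np

# Finite-alphabet check of the bound in Eq. (3): for any r(z),
#   I(Z;X) <= sum_{x,z} p(x,z) log( e(z|x) / r(z) ),
# with gap exactly D_KL(p(z) || r(z)). All numeric values are hypothetical.
pi = np.array([0.4, 0.6])                      # p(x)
e = np.array([[0.7, 0.3],
              [0.2, 0.8]])                     # e[x, z] = e(z|x)
p_xz = pi[:, None] * e                         # joint p(x, z)
p_z = p_xz.sum(axis=0)                         # marginal p(z)

I_ZX = np.sum(p_xz * np.log(e / p_z[None, :]))

r = np.array([0.5, 0.5])                       # an arbitrary choice of r(z)
bound = np.sum(p_xz * np.log(e / r[None, :]))
gap = np.sum(p_z * np.log(p_z / r))            # D_KL(p(z) || r(z)) >= 0

print(I_ZX, bound)
```

The bound is tight exactly when $r(z)$ equals the marginal $p(z)$ induced by the encoder, which is why a fixed $r$ acts as a (possibly loose) surrogate for the true rate.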

Let $\hat{p}_{\mathrm{data}}$ be the empirical distribution of the data (input-output pairs). By noting that $\delta_\pi(d, \kappa) \leq D_{\mathrm{KL}}(\kappa \,\|\, d \circ e \mid \pi)$ for any $e$, and ignoring (unknown) data-dependent constants, we obtain the following optimization objective, which we call the Variational Deficiency Bottleneck (VDB) objective:

$$\mathcal{L}_{\mathrm{VDB}}(e, d) := \mathbb{E}_{(x,y)\sim \hat{p}_{\mathrm{data}}}\Big[ -\log\big((d \circ e)(y|x)\big) + \beta\, D_{\mathrm{KL}}\big(e(Z|x) \,\|\, r(Z)\big) \Big]. \qquad (4)$$

The computation is simplified by defining $r(z)$ to be a standard multivariate Gaussian distribution and using an encoder that outputs a Gaussian distribution. More precisely, we consider an encoder of the form $e(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$, where a neural network outputs the parameters of the Gaussian distribution. Using the reparametrization trick (Kingma and Welling 2013, Rezende et al. 2014), we then write $z = f(x, \epsilon)$, where $f$ is a deterministic function of $x$ and the realization $\epsilon$ of a standard normal distribution. This allows us to do stochastic backpropagation through a single sample $z$. The KL term admits an analytic expression for this choice of Gaussian $r$ and encoder. We train the model by minimizing the following empirical objective:
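The two ingredients just described, reparametrized sampling and the analytic Gaussian KL, can be sketched as follows (a minimal numpy sketch assuming a diagonal Gaussian encoder; the mean and log-variance values are hypothetical placeholders for a network's output):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparametrize(mu, logvar, M):
    """Draw M samples z = f(x, eps) = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal((M,) + mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Analytic D_KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

mu = np.array([0.5, -1.0])        # encoder mean for one input x (made up)
logvar = np.array([0.0, -2.0])    # encoder log-variances (made up)

z = reparametrize(mu, logvar, M=6)   # M posterior samples, differentiable in mu
print(z.shape, kl_to_standard_normal(mu, logvar))
```

Because the noise $\epsilon$ is sampled independently of the parameters, gradients flow through $\mu$ and $\log\Sigma$ while the KL term is evaluated in closed form, avoiding any Monte Carlo estimate of the rate.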

$$\frac{1}{N}\sum_{i=1}^{N}\left[ -\log\left( \frac{1}{M}\sum_{j=1}^{M} d\big(y^{(i)} \,\big|\, f(x^{(i)}, \epsilon^{(j)})\big) \right) + \beta\, D_{\mathrm{KL}}\big(e(Z|x^{(i)}) \,\|\, r(Z)\big) \right]. \qquad (5)$$

For training, we choose a mini-batch size of $N$. For Monte Carlo estimates of the expectation inside the log, we choose $M$ samples from the encoding distribution.

We note that the Variational Information Bottleneck (VIB) (Alemi et al. 2016) leads to a similar-looking objective function, with the only difference that the sum over $j$ is outside of the log. By Jensen’s inequality, the VIB loss is an upper bound on our loss. If one uses a single sample from the encoding distribution (i.e., $M = 1$), the VDB and the VIB objective functions coincide.
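The difference between the two objectives is easy to see numerically. A minimal sketch with made-up decoder probabilities standing in for $d(y^{(i)} \mid z^{(i,j)})$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decoder probabilities of the correct label for one mini-batch:
# N examples, M posterior samples each (all values are made up).
N, M = 4, 12
d_probs = rng.uniform(0.05, 0.95, size=(N, M))

# VDB: the Monte Carlo average over the M samples goes INSIDE the log, as in Eq. (5).
vdb_loss = np.mean(-np.log(d_probs.mean(axis=1)))
# VIB: the average over the M samples stays OUTSIDE the log.
vib_loss = np.mean(-np.log(d_probs))

print(vdb_loss, vib_loss)   # Jensen's inequality gives vdb_loss <= vib_loss
```

With `M = 1` the inner average is over a single sample, so the two losses are identical, matching the statement above.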

The average log-loss and the rate term in the VDB objective (Equation 4) are the two fundamental quantities that govern the probability of error when the model is a classifier. For a detailed discussion of these relations, see Appendix A.

## 3. Blackwell Sufficiency and Channel Deficiency

In this section, we discuss an intuitive geometric interpretation of the deficiency in the space of probability distributions over the output variable. We also give an operational interpretation of the deficiency as a deviation from Blackwell sufficiency (in the KL-distance sense). Finally, we discuss its relation to the log-loss.

### 3.1. Deficiency and Decision Geometry

We first formulate the learning task as a decision problem. We show that $\delta_\pi(d, \kappa)$ quantifies the gap in the optimal risks of decision problems when the learner uses the decoder $d$ (through a representation) rather than the true channel $\kappa$.

Let $\mathcal{X}$, $\mathcal{Y}$ denote the spaces of possible inputs and outputs. In the following, we assume that $\mathcal{X}$ and $\mathcal{Y}$ are finite. Let $\mathcal{P}_{\mathcal{Y}}$ be the set of all distributions on $\mathcal{Y}$. For every $x \in \mathcal{X}$, define $\kappa_x$ as $\kappa(\cdot|x)$. Nature draws $x \sim \pi$ and $y \sim \kappa_x$. The learner observes $x$ and quotes a distribution $q_x \in \mathcal{P}_{\mathcal{Y}}$ that expresses her uncertainty about the true value $y$. The quality of a quote $q_x$ in relation to $y$ is measured by an extended real-valued loss function called the score $\ell(y, q_x)$. For background on this special kind of loss function see, e.g., Grünwald et al. 2004, Gneiting and Raftery 2007, Parry et al. 2012. Ideally, the quote $q_x$ should be as close as possible to the true conditional distribution $\kappa_x$. This is achieved by minimizing the expected loss $L(\kappa_x, q_x) := \mathbb{E}_{y \sim \kappa_x}[\ell(y, q_x)]$, for all $x$. The score is called proper if $L(\kappa_x, \kappa_x) \leq L(\kappa_x, q_x)$ for all $q_x$. Define the Bayes act against $\kappa_x$ as the optimal quote

$$q_x^* := \operatorname*{arg\,min}_{q_x \in \mathcal{P}_{\mathcal{Y}}} L(\kappa_x, q_x).$$

If multiple Bayes acts exist, then select one arbitrarily. Define the Bayes risk for the distribution $p_{XY}$ as $R(p_{XY}, \ell) := \mathbb{E}_{x \sim \pi}[L(\kappa_x, q_x^*)]$. A score is strictly proper if the Bayes act is unique. An example of a strictly proper score is the log-loss, defined as $\ell_{\mathrm{L}}(y, q) := -\log q(y)$. For the log-loss, the Bayes act is $q_x^* = \kappa_x$ and the Bayes risk is just the conditional entropy:

$$R(p_{XY}, \ell_{\mathrm{L}}) = \mathbb{E}_{x\sim\pi}\mathbb{E}_{y\sim\kappa_x}[-\log q_x^*(y)] = \mathbb{E}_{x\sim\pi}\mathbb{E}_{y\sim\kappa_x}[-\log \kappa_x(y)] = H(Y|X). \qquad (6)$$
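A quick numeric check of this identity on a made-up binary channel (the distributions are hypothetical):

```python
import numpy as np

# Eq. (6): for the log-loss, quoting the Bayes act q*_x = kappa_x gives a
# Bayes risk equal to the conditional entropy H(Y|X) (in nats).
pi = np.array([0.3, 0.7])                      # input distribution pi(x)
kappa = np.array([[0.9, 0.1],
                  [0.4, 0.6]])                 # kappa[x, y] = kappa(y|x)

# Risk of the Bayes act: E_{x ~ pi} E_{y ~ kappa_x} [ -log kappa_x(y) ] = H(Y|X).
bayes_risk = np.sum(pi[:, None] * kappa * (-np.log(kappa)))

# Quoting any other fixed distribution, e.g. the marginal p(y), can only do
# worse; the excess risk here is the mutual information I(X;Y) >= 0.
p_y = pi @ kappa
marginal_risk = np.sum(pi[:, None] * kappa * (-np.log(p_y))[None, :])

print(bayes_risk, marginal_risk)
```

The gap `marginal_risk - bayes_risk` is $I(X;Y)$, illustrating that informative observations strictly reduce the log-loss risk.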

Given a representation $Z$ (output by some encoder), when using the decoder $d$, the learner is constrained to quote a distribution from a subset of $\mathcal{P}_{\mathcal{Y}}$, namely the convex hull of the points $\{d(\cdot|z)\}_{z \in \mathcal{Z}}$. Let $\mathcal{C} := \mathrm{conv}\{d(\cdot|z) : z \in \mathcal{Z}\}$. The Bayes act against $\kappa_x$ is

$$q_x^{*Z} := \operatorname*{arg\,min}_{q_x \in \mathcal{C}} \mathbb{E}_{y\sim\kappa_x}[-\log q_x(y)]. \qquad (7)$$

The quote $q_x^{*Z}$ has an interpretation as the reverse I-projection of $\kappa_x$ onto the convex set of probability measures $\mathcal{C}$ (Csiszár and Matuš 2003). (Such a projection exists but is not necessarily unique, since the set we are projecting onto is not log-convex. If nonunique, we arbitrarily select one of the minimizers as the Bayes act.) We call the associated Bayes risk the projected Bayes risk $R^Z$, and the associated conditional entropy the projected conditional entropy $H^Z(Y|X)$:

$$R^Z(p_{XY}, \ell_{\mathrm{L}}) = \mathbb{E}_{x\sim\pi}\mathbb{E}_{y\sim\kappa_x}[-\log q_x^{*Z}(y)] = H^Z(Y|X). \qquad (8)$$

The gap in the optimal risks, when making a decision based on an intermediate representation rather than on the input data directly, is just the deficiency. This follows from noting that

$$\begin{aligned}
\Delta R = H^Z(Y|X) - H(Y|X) &= \sum_{x\in\mathcal{X}} \pi(x) \min_{q_x \in \mathcal{C} \subset \mathcal{P}_{\mathcal{Y}}} D_{\mathrm{KL}}(\kappa_x \,\|\, q_x) \\
&= \min_{e \in \mathcal{M}(\mathcal{X};\mathcal{Z})} \sum_{x\in\mathcal{X}} \pi(x)\, D_{\mathrm{KL}}\big(\kappa_x \,\|\, (d \circ e)_x\big) \\
&= \min_{e \in \mathcal{M}(\mathcal{X};\mathcal{Z})} D_{\mathrm{KL}}\big(\kappa \,\|\, d \circ e \,\big|\, \pi\big) = \delta_\pi(d, \kappa). \qquad (9)
\end{aligned}$$

The gap $\Delta R$ vanishes if and only if the optimal quote against $\kappa_x$ matches $\kappa_x$ for all $x$. This gives an intuitive geometric interpretation of a vanishing deficiency in the space of distributions over $\mathcal{Y}$.

Given a decoder channel $d$, since $(d \circ e)(\cdot|x) \in \mathcal{C}$ for any $e$ and $x$, the loss term in the VDB objective is a variational upper bound on the projected conditional entropy $H^Z(Y|X)$. However, this loss is still a lower bound on the standard cross-entropy loss in the VIB objective (Alemi et al. 2016), i.e.,

$$\mathbb{E}_{(x,y)\sim\hat{p}_{\mathrm{data}}}\big[-\log (d \circ e)(y|x)\big] \;\leq\; \mathbb{E}_{(x,y)\sim\hat{p}_{\mathrm{data}}}\left[ \int -e(z|x)\log d(y|z)\, dz \right]. \qquad (10)$$

This follows simply from the convexity of the negative logarithm function.

### 3.2. Deficiency as a KL-distance from Input-Blackwell Sufficiency

In a seminal paper David Blackwell (1953) asked the following question: if a learner wishes to make an optimal decision about some target variable of interest and she can choose between two channels with a common input alphabet, which one should she prefer? She can rank the channels by comparing her optimal risks: she will always prefer one channel over another if her optimal risk when using the former is at most that when using the latter for any decision problem. She can also rank the variables purely probabilistically: she will always prefer the former if the latter is an output-degraded version of the former, in the sense that she can simulate a single use of the latter by randomizing at the output of the former. Blackwell showed that these two criteria are equivalent.

Very recently, Nasser (2017) asked the same question, only now the learner has to choose between two channels with a common output alphabet. Given two channels $\kappa$ and $d$, we say that $\kappa$ is input-degraded from $d$, and write $\kappa \preceq_{\mathrm{in}} d$, if $\kappa = d \circ e$ for some $e \in \mathcal{M}(\mathcal{X};\mathcal{Z})$. Stated another way, $\kappa$ can be obtained from $d$ by applying a randomization at its input. Nasser (2017) gave a characterization of input-degradedness that is similar to Blackwell’s theorem (Blackwell 1953). We say that $d$ is input-Blackwell sufficient for $\kappa$ if $\kappa \preceq_{\mathrm{in}} d$.

Input-Blackwell sufficiency induces a preorder on the set of all channels with the same output alphabet. In practice, most channels are incomparable, i.e., one cannot be reduced to the other by a randomization. When this is the case, our deficiency quantifies how far the true channel $\kappa$ is from being a randomization (by way of input encodings) of the decoder $d$. See Appendix B for a brief summary of Blackwell-Le Cam theory.

### 3.3. Deficiency and the Log-Loss

When $Y - X - Z$ is a Markov chain, the conditional mutual information $I(Y;X|Z)$ is the Bayes risk gap for the log-loss. This is apparent from noting that $I(Y;X|Z) = H(Y|Z) - H(Y|X)$. This risk gap is closely related to Blackwell’s original notion of sufficiency. Since the log-loss is strictly proper, a vanishing $I(Y;X|Z)$ implies that the risk gap is zero for all loss functions. This suggests that minimizing the log-loss risk gap under a suitable regularization constraint is a potential recipe for constructing representations $Z$ that are approximately sufficient for $Y$ w.r.t. $X$, since in the limit $I(Y;X|Z) \to 0$ one would achieve exact sufficiency. This is indeed the basis for the IB algorithm (Tishby et al. 1999) and its generalization, clustering with Bregman divergences (Banerjee et al. 2005, van Rooyen and Williamson 2015).

One can also approximate a sufficient statistic by minimizing deficiencies instead. This follows from noting the following equivalence. The proof is in Appendix C.

###### Proposition 2.

When $Y - X - Z$ is a Markov chain, $I(Y;X|Z) = 0$ if and only if $\delta_\pi(d, \kappa) = 0$.

In general, for the bottleneck paradigms involving the conditional mutual information (IB) and the deficiency (DB), we have the following relationship:

$$\min_{e(z|x)\,:\; I(Y;X|Z) \leq \epsilon} I(X;Z) \;\geq\; \min_{e(z|x)\,:\; \delta_\pi(d,\kappa) \leq \epsilon} I(X;Z). \qquad (11)$$

It is clear from Proposition 2 that the representations are going to be the same only in the limit of exact sufficiency. Our experiments corroborate that, for achieving the same level of sufficiency, one needs to store less information about the input $X$ when minimizing deficiencies than when minimizing the conditional mutual information.

## 4. Experiments

We present some experiments on the MNIST dataset (LeCun and Cortes 2010). Classification on MNIST is a very well studied problem. The main objective of our experiments is to evaluate the information-theoretic properties of the representations learned by the VDB and whether it can match the classification accuracy provided by other bottleneck methods.

For the encoder we use a fully connected feedforward network with 784 input units, 1024 ReLUs, 1024 ReLUs, and 512 linear output units. The deterministic output of this network is interpreted as the vector of means and variances of a 256-dimensional Gaussian distribution. The decoder is a simple logistic regression model with a softmax layer. These are the same settings as in the model used by Alemi et al. (2016). We implement the algorithm in TensorFlow and train for 200 epochs using the Adam optimizer.

As can be seen from the upper panels in Figure 2, the test accuracy is stable with increasing $M$. We note that $M = 1$ is just the VIB model (Alemi et al. 2016). The lower left panel of Figure 2 shows the information bottleneck curve. The IB curve traces the mutual information $I(Z;Y)$ between representation and output vs. the mutual information $I(Z;X)$ between representation and input, for different values of the regularization parameter $\beta$ at the end of training. In the case of the VDB, we substitute $I(Z;Y)$ by the corresponding term in our algorithm, which is $H(Y)$ minus the average log-loss, and this is the value that we actually plot. Here $H(Y)$ is the entropy of the output. For orientation, lower values of $\beta$ have higher values of $I(X;Z)$ (towards the right of the plot). For small values of $\beta$, when the effect of the regularization is negligible, the bottleneck allows more information from the input through the representation. In this case, the mutual information between the representation and output increases on the training set, but not necessarily on the test set. This is manifest in the gap between the train and test curves, indicative of a degradation in generalization. For intermediate values of $\beta$, the gap is smaller for larger values of $M$ (our method).

The lower right panel of Figure 2 plots the minimality term $I(Z;X)$ vs. $\beta$. We see that, for $\beta$ in an intermediate range, for the same level of sufficiency, setting $M > 1$ consistently achieves more compression of the input compared to the setting $M = 1$. The dynamics of the information quantities during training are also interesting. We provide figures on these in Appendix D.

Figure 2. Effect of the regularization parameter $\beta$. The upper panels show the accuracy on train and test data after training for different values of $M$ and two different values of $L$. Here, $M$ is the number of encoder output samples used in the training objective, and $L$ is the number of encoder output samples used for evaluating the classifier (i.e., we use $\frac{1}{L}\sum_{j=1}^{L} d(y|z^{(j)})$, where $z^{(j)} \sim e(z|x)$). The lower left panel shows the information bottleneck curve, which traces the sufficiency vs. minimality terms after training for different values of $\beta$ (see text). The curves are averages over 5 repetitions of the experiment. Each curve corresponds to one value of $M = 1, 3, 6, 12$. Notice the generalization gap in the lower left panel for small values of $\beta$ (towards the right of the plot). The lower right panel plots the minimality term vs. $\beta$. Evidently, the levels of compression vary depending on $M$. Higher values of $M$ (our method) lead to a more compressed representation. For $M = 1$, the VDB and the VIB models coincide.

Figure 3. We trained the VDB on MNIST with the basic encoder given by a fully connected network with two hidden layers of ReLUs producing the means and variances of a 2D independent Gaussian latent representation. Ellipses represent the posterior distributions of 1000 input images in latent space after training with $\beta = 10^{0}, 10^{-1}, 10^{-3}, 10^{-5}$ and $M = 1, 3, 6, 12$. Color corresponds to the class label.

Figure 4. Learning curves for MNIST, where the encoder is an MLP of size 784–1024–1024–2K, the last layer being a $K = 2$ dimensional diagonal Gaussian. The decoder is simply a softmax with 10 classes. The left figure plots the mutual information between the representation and the class label, $I(Z;Y)$, against the mutual information between the representation at the last layer of the encoder and the input, $I(Z;X)$, as training progresses. The former increases monotonically, while the latter increases and then decreases.

In order to visualize the representations, we also train the VDB on MNIST with a 2-dimensional representation. We use the same settings as before, with the only difference that the dimension of the output layer of the encoder is 4, with two coordinates representing the mean and two a diagonal covariance matrix. The results are shown in Figure 3. For suitable values of $\beta$, the representations are well separated, depending on the class. For related figures in the setting of unsupervised learning, see Appendix E. The learning dynamics of the mutual information and classification accuracy are shown in Figure 4. The left panel has an interpretation in terms of a phase where the model is mainly fitting the input-output relationship and hence increasing the mutual information $I(Z;X)$, followed by a compression phase, where training mainly reduces $I(Z;X)$, leading to better generalization. The right panel shows the test accuracy as training progresses. Higher values of $M$ (our method) usually lead to better accuracy. An exception is when the number $L$ of posterior samples used for classification is large.

## 5. Discussion

We have formulated a bottleneck method based on channel deficiencies. The deficiency of a decoder with respect to the true channel between input and output quantifies how well a randomization at the decoder input (by way of stochastic encodings) can be used to simulate the true channel. The VDB has a natural variational formulation which recovers the VIB in the limit of a single sample of the encoder output. Experiments demonstrate that the VDB can learn more compressed representations while retaining the same discriminative capacity. The method has a statistical decision-theoretic appeal. Moreover, the resulting variational objective of the DB can be implemented as an easy modification of the VIB, with little to no computational overhead.

Given two channels that convey information about a target variable of interest, two different notions of deficiency arise, depending on whether the target resides at the common input or at the common output of the given channels. When the target is at the common output of the two channels, as in a typical bottleneck setting (see Figure 1), our Definition 1 has a natural interpretation as a KL-divergence from input-Blackwell sufficiency (Nasser 2017). Here sufficiency is achieved by applying a randomization at the input of the decoder with the goal of simulating the true channel. The notion of input-Blackwell sufficiency contrasts with Blackwell’s original notion of sufficiency (Blackwell 1953) in the sense that Blackwell’s theory compares two channels with a common input. One can again define a notion of deficiency in this setting (see Appendix B for a discussion of deficiencies in the classical Blackwell setup). The associated channels (one from $Y$ to $X$ and the other from $Y$ to $Z$) do not, however, have a natural interpretation in a typical bottleneck setting. In contrast, the input-Blackwell setup is much more intuitive in this context. This subtle distinction seems to have gone unnoticed in the literature (see, e.g., van Rooyen and Williamson 2014; 2015).

The more detailed view of information emerging from this analysis explains various effects and opens the door to multiple generalizations. In the spirit of the VDB, one can formulate a variational deficiency autoencoder (VDAE) as well (see the sketch in Appendix E). On a related note, we mention that the deficiency is a lower bound on a quantity called the unique information (Bertschinger et al. 2014, Banerjee et al. 2018a) (see details in Appendix C). An alternating minimization algorithm, similar in spirit to the classical Blahut-Arimoto algorithm (Blahut 1972), has been proposed to compute this quantity (Banerjee et al. 2018b). Such an algorithm, however, is not feasible in a deep neural network implementation. In the limit $\beta \to 0$, the VDB is a step towards estimating the unique information. This might be of independent interest in improving the practicality of the theory of information decompositions.

## References

• Alemi et al. (2016) A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck, 2016. URL http://arxiv.org/abs/1612.00410. ICLR17.
• Banerjee et al. (2005) A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
• Banerjee et al. (2018a) P. K. Banerjee, E. Olbrich, J. Jost, and J. Rauh. Unique informations and deficiencies. arXiv preprint arXiv:1807.05103, 2018a. Allerton 2018.
• Banerjee et al. (2018b) P. K. Banerjee, J. Rauh, and G. Montúfar. Computing the unique information. In Proc. IEEE ISIT, pages 141–145. IEEE, 2018b.
• Bertschinger and Rauh (2014) N. Bertschinger and J. Rauh. The Blackwell relation defines no lattice. In Proc. IEEE ISIT, pages 2479–2483. IEEE, 2014.
• Bertschinger et al. (2014) N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay. Quantifying unique information. Entropy, 16(4):2161–2183, 2014.
• Blackwell (1953) D. Blackwell. Equivalent comparisons of experiments. The Annals of Mathematical Statistics, 24(2):265–272, 1953.
• Blahut (1972) R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
• Boucheron et al. (2005) S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
• Chalk et al. (2016) M. Chalk, O. Marre, and G. Tkacik. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pages 1957–1965, 2016.
• Csiszár (1972) I. Csiszár. A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1-4):191–213, 1972.
• Csiszár and Körner (2011) I. Csiszár and J. Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
• Csiszár and Matuš (2003) I. Csiszár and F. Matuš. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
• Gneiting and Raftery (2007) T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
• Gondek and Hofmann (2003) D. Gondek and T. Hofmann. Conditional information bottleneck clustering. In 3rd IEEE International Conference on Data Mining, Workshop on Clustering Large Data Sets, 2003.
• Grünwald et al. (2004) P. D. Grünwald, A. P. Dawid, et al. Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. The Annals of Statistics, 32(4):1367–1433, 2004.
• Harder et al. (2013) M. Harder, C. Salge, and D. Polani. A bivariate measure of redundant information. Physical Review E, 87:012130, 2013.
• Harremoës and Tishby (2007) P. Harremoës and N. Tishby. The information bottleneck revisited or how to choose a good distortion measure. In Proc. IEEE ISIT, pages 566–570. IEEE, 2007.
• Higgins et al. (2017) I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. 2017. ICLR17.
• Hsu et al. (2018) H. Hsu, S. Asoodeh, S. Salamatian, and F. P. Calmon. Generalizing bottleneck problems. In Proc. IEEE ISIT, pages 531–535. IEEE, 2018.
• Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. ICLR13.
• Kolchinsky et al. (2017) A. Kolchinsky, B. D. Tracey, and D. H. Wolpert. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436, 2017.
• Körner and Marton (1975) J. Körner and K. Marton. Comparison of two noisy channels. In Topics in Information Theory, volume 16, pages 411–423. Colloquia Mathematica Societatis János Bolyai, Keszthely (Hungary), 1975.
• Le Cam (1964) L. Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, pages 1419–1455, 1964.
• LeCun and Cortes (2010) Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
• Liese and Vajda (2006) F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
• Nasser (2017) R. Nasser. On the input-degradedness and input-equivalence between channels. In Proc. IEEE ISIT, pages 2453–2457. IEEE, 2017.
• Parry et al. (2012) M. Parry, A. P. Dawid, S. Lauritzen, et al. Proper local scoring rules. The Annals of Statistics, 40(1):561–592, 2012.
• Raginsky (2011) M. Raginsky. Shannon meets Blackwell and Le Cam: Channels, codes, and statistical experiments. In Proc. IEEE ISIT, pages 1220–1224. IEEE, 2011.
• Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
• Shamir et al. (2008) O. Shamir, S. Sabato, and N. Tishby. Learning and generalization with the information bottleneck. In International Conference on Algorithmic Learning Theory, pages 92–107. Springer, 2008.
• Tishby and Zaslavsky (2015) N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pages 1–5. IEEE, 2015.
• Tishby et al. (1999) N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
• Torgersen (1991) E. Torgersen. Comparison of statistical experiments, volume 36. Cambridge University Press, 1991.
• van Rooyen and Williamson (2014) B. van Rooyen and R. C. Williamson. Le Cam meets LeCun: Deficiency and generic feature learning. arXiv preprint arXiv:1402.4884, 2014.
• van Rooyen and Williamson (2015) B. van Rooyen and R. C. Williamson. A theory of feature learning. arXiv preprint arXiv:1504.00083, 2015.
• Vera et al. (2018) M. Vera, P. Piantanida, and L. R. Vega. The role of information complexity and randomization in representation learning. arXiv preprint arXiv:1802.05355, 2018.
• Witsenhausen and Wyner (1975) H. Witsenhausen and A. Wyner. A conditional entropy bound for a pair of discrete random variables. IEEE Transactions on Information Theory, 21(5):493–501, 1975.

## Appendix A Misclassification error and the average log-loss

In a classification task, the goal is to use the training dataset to learn a classifier $\hat{\kappa}$ that minimizes the probability of error under the true data distribution, defined as follows.

$$P_E(\hat{\kappa}) := 1 - \mathbb{E}_{(x,y)\sim p}[\hat{\kappa}(y|x)]. \qquad (12)$$

It is well known that the optimal classifier, i.e., the one that gives the smallest probability of error, is the Bayes classifier [Boucheron et al., 2005]. Since we do not know the true data distribution, we learn based on the empirical error. Directly minimizing the empirical probability of error over the training dataset is in general an NP-hard problem. In practice, one minimizes a surrogate loss function that is a convex upper bound on $P_E$. A natural surrogate is the average log-loss $\mathbb{E}_{(x,y)}[-\log (d \circ e)(y|x)]$. When the model is $d \circ e$, the following upper bounds are immediate from Jensen’s inequality.

$$P_E(\hat\kappa) \;\le\; 1 - \exp\big(-\mathbb{E}_{(x,y)\sim p}[-\log d\circ e(y|x)]\big) \;\le\; 1 - \exp\big(-\mathbb{E}_{(x,y)\sim p}\,\mathbb{E}_{z\sim e(z|x)}[-\log d(y|z)]\big) \tag{13}$$

The bound using the standard cross-entropy loss is evidently weaker than the one using the average log-loss of $d\circ e$. A lower bound on the probability of error is controlled by a convex functional of the mutual information $I(X;Z)$ between the raw inputs and the representation [Vera et al., 2018, Lemma 4]. The average log-loss and the rate term in the VDB objective (Equation 4) are thus two fundamental quantities that govern the probability of error.
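On finite alphabets, the chain of inequalities in Equation 13 can be checked numerically. The sketch below draws a random toy encoder $e(z|x)$ and decoder $d(y|z)$ (all names are ours, purely illustrative, not from the paper) and verifies that the probability of error is bounded by the exponentiated log-loss of $d\circ e$, which in turn is bounded by the standard cross-entropy term:

```python
import numpy as np

# Toy finite model (purely illustrative): N inputs, K classes, Zdim latent states.
rng = np.random.default_rng(0)
N, K, Zdim = 1000, 3, 5
e_probs = rng.dirichlet(np.ones(Zdim), size=N)        # encoder e(z|x), one row per input x
d = rng.dirichlet(np.ones(K), size=(N, Zdim))         # decoder d(y|z) (allowed to vary with x here)
y = rng.integers(K, size=N)                           # true labels

kap = np.einsum('nz,nzk->nk', e_probs, d)             # composite classifier (d∘e)(y|x)
p_true = kap[np.arange(N), y]                         # probability assigned to the true label
pe = 1.0 - p_true.mean()                              # probability of error, Eq. (12)
avg_log_loss = -np.log(p_true).mean()                 # E[−log (d∘e)(y|x)]
cross_entropy = (e_probs * -np.log(d[np.arange(N), :, y])).sum(1).mean()  # E E_z[−log d(y|z)]

# The chain of bounds in Eq. (13):
assert pe <= 1.0 - np.exp(-avg_log_loss) <= 1.0 - np.exp(-cross_entropy)
```

The gap between the two exponentiated bounds is exactly the per-sample Jensen gap incurred by moving the expectation over $z$ outside the logarithm.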

## Appendix B Classical Theory of Comparison of Channels

In this section, we discuss the classical theory of comparison of channels due to Blackwell, and its extensions by Le Cam [1964], Torgersen [1991], and more recently Raginsky.

Suppose that a learner wishes to predict the value of a random variable $Y$ that takes values in a set $\mathcal{Y}$. She has a set of actions $A$. Each action incurs a loss $\ell(y, a)$ that depends on the true state $y$ of $Y$ and the chosen action $a$. Let $\pi \in P(\mathcal{Y})$ encode the learner's uncertainty about the true state. The tuple $(\pi, A, \ell)$ is called a decision problem. Before choosing her action, the learner observes a random variable $X$ through a channel $\kappa$. An ideal learner chooses a strategy, i.e., a Markov kernel from observations to actions, that minimizes her expected loss, or risk. The optimal risk when using the channel $\kappa$ is denoted $R(\pi, \kappa, \ell)$.

Suppose now that the learner has to choose between $X$ and another random variable $Z$ that she observes through a second channel $\mu$ with common input $Y$. She can always discard $Z$ in favor of $X$ if, knowing $x$, she can simulate a single use of $\mu$ by randomly sampling a $z$ after each observation $x$.

###### Definition 3.

We say that $Z$ is output-degraded from $X$ w.r.t. $Y$, denoted $\mu \sqsubseteq_\pi \kappa$, if there exists a random variable $Z'$ such that the pairs $(Y,Z)$ and $(Y,Z')$ are stochastically indistinguishable, and $Y - X - Z'$ is a Markov chain.

She can also discard $Z$ if her optimal risk when using $X$ is at most that when using $Z$ for any decision problem. Blackwell showed the equivalence of these two relations.

###### Theorem 4.

(Blackwell's Theorem) $\mu \sqsubseteq_\pi \kappa$ if and only if $R(\pi,\kappa,\ell) \le R(\pi,\mu,\ell)$ for every decision problem $(\pi, A, \ell)$.

Write $\mu \sqsubseteq \kappa$ if $\mu_y = \lambda\circ\kappa_y$ for all $y \in \mathcal{Y}$, for some Markov kernel $\lambda$. If $\pi_Y$ has full support, then it is easy to check that $\mu \sqsubseteq \kappa$ if and only if $\mu \sqsubseteq_\pi \kappa$ [Bertschinger and Rauh, 2014, Theorem 4].

The learner can also compare $X$ and $Z$ by comparing the mutual informations $I(Y;X)$ and $I(Y;Z)$ between the common input $Y$ and the channel outputs $X$ and $Z$.

###### Definition 5.

$\kappa$ is said to be more capable than $\mu$, denoted $\mu \sqsubseteq_{mc} \kappa$, if $I(Y;X) \ge I(Y;Z)$ for all probability distributions $\pi_Y$ on $\mathcal{Y}$.

It follows from the data processing inequality that $\mu \sqsubseteq \kappa$ implies $\mu \sqsubseteq_{mc} \kappa$. However, the converse implication is not true in general [Körner and Marton, 1975].

The converse to Blackwell's theorem states that if the relation $\kappa \sqsubseteq_\pi \mu$ does not hold, then there exist a set of actions $A$ and a loss function $\ell$ such that $R(\pi,\kappa,\ell) < R(\pi,\mu,\ell)$. Le Cam introduced the concept of a deficiency of $\mu$ w.r.t. $\kappa$ to express this deficit in optimal risks [Le Cam, 1964], in terms of an approximation of $\kappa$ by randomizations at the output of $\mu$ via Markov kernels.

###### Definition 6.

The deficiency of $\mu$ w.r.t. $\kappa$ is

$$\delta(\mu,\kappa) := \inf_{\lambda\in M(Z;X)} \sup_{y\in\mathcal{Y}} \|\lambda\circ\mu_y - \kappa_y\|_{TV}, \tag{14}$$

where $\|\cdot\|_{TV}$ denotes the total variation distance.

When the distribution of the common input to the channels is fixed, one can define a weighted deficiency [Torgersen, 1991; Section 6.2].

###### Definition 7.

Given $\pi_Y \in P(\mathcal{Y})$, the weighted deficiency of $\mu$ w.r.t. $\kappa$ is

$$\delta_\pi(\mu,\kappa) := \inf_{\lambda\in M(Z;X)} \mathbb{E}_{y\sim\pi_Y} \|\lambda\circ\mu_y - \kappa_y\|_{TV}. \tag{15}$$

Le Cam's randomization criterion [Le Cam, 1964] shows that deficiencies quantify the maximal gap in the optimal risks of decision problems when using the channel $\mu$ rather than $\kappa$.

###### Theorem 8 (Le Cam).

Fix $\epsilon \ge 0$ and a probability distribution $\pi$ on $\mathcal{Y}$. Then $\delta_\pi(\mu,\kappa) \le \epsilon$ if and only if $R(\pi,\mu,\ell) \le R(\pi,\kappa,\ell) + \epsilon$ for any set of actions $A$ and any loss function $\ell$ bounded by one.

Raginsky introduced a broad class of deficiency-like quantities using the notion of a generalized divergence between probability distributions, namely any divergence satisfying a monotonicity property w.r.t. data processing. The family of $f$-divergences due to Csiszár belongs to this class [Liese and Vajda, 2006].

###### Definition 9.

The $f$-deficiency of $\mu$ w.r.t. $\kappa$ is

$$\delta_f(\mu,\kappa) := \inf_{\lambda\in M(Z;X)} \sup_{y\in\mathcal{Y}} D_f(\kappa_y \,\|\, \lambda\circ\mu_y). \tag{16}$$

Many common divergences, such as the Kullback–Leibler (KL) divergence, the reverse-KL divergence, and the total variation distance, are $f$-divergences. When the channel $\mu$ is such that its output is constant, no matter what the input, the corresponding $f$-deficiency is called the $f$-informativity [Csiszár, 1972]. The $f$-informativity associated with the KL divergence is just the channel capacity, which has a geometric interpretation as an "information radius".
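For a finite channel, the capacity (the KL informativity just mentioned) can be computed with the classical Blahut–Arimoto alternating updates. The sketch below is our own illustration, not code from the paper; it recovers the capacity of a binary symmetric channel, known in closed form to be $1 - H_2(p)$ bits:

```python
import numpy as np

def blahut_arimoto(W, tol=1e-12, max_iter=10_000):
    """Channel capacity max_r I(r, W) via Blahut-Arimoto, in bits.

    W: array (|X|, |Y|) with rows W[x] = p(y | x).
    """
    n = W.shape[0]
    r = np.full(n, 1.0 / n)                     # input distribution, initialized uniform
    for _ in range(max_iter):
        q = r[:, None] * W                      # unnormalized posterior q(x|y)
        q /= q.sum(0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            logq = np.where(W > 0, np.log(q), 0.0)
        r_new = np.exp((W * logq).sum(1))       # r(x) ∝ exp( Σ_y W(y|x) ln q(x|y) )
        r_new /= r_new.sum()
        if np.abs(r_new - r).max() < tol:
            r = r_new
            break
        r = r_new
    q = r[:, None] * W
    q /= q.sum(0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(W > 0, np.log(q / r[:, None]), 0.0)
    return float((r[:, None] * W * ratio).sum() / np.log(2))

# Binary symmetric channel with flip probability 0.1: capacity = 1 − H₂(0.1)
cap = blahut_arimoto(np.array([[0.9, 0.1], [0.1, 0.9]]))
```

For the symmetric channel the optimal input distribution is uniform, so the iteration converges immediately; for asymmetric channels the same updates converge to the capacity-achieving input.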

We can also define a weighted $f$-deficiency of $\mu$ w.r.t. $\kappa$.

###### Definition 10.

Given $\pi_Y \in P(\mathcal{Y})$, the weighted $f$-deficiency of $\mu$ w.r.t. $\kappa$ is

$$\delta_{f,\pi}(\mu,\kappa) := \inf_{\lambda\in M(Z;X)} D_f(\kappa_Y \,\|\, \lambda\circ\mu_Y \,|\, \pi_Y) = \inf_{\lambda\in M(Z;X)} \mathbb{E}_{y\sim\pi_Y} D_f(\kappa_y \,\|\, \lambda\circ\mu_y). \tag{17}$$

Specializing to the KL divergence, we have the following definition.

###### Definition 11.

The weighted output deficiency of $\mu$ w.r.t. $\kappa$ is

$$\delta^{\pi}_{o}(\mu,\kappa) := \inf_{\lambda\in M(Z;X)} D_{KL}(\kappa_Y \,\|\, \lambda\circ\mu_Y \,|\, \pi_Y), \tag{18}$$

where the subscript $o$ in $\delta^{\pi}_{o}$ emphasizes the fact that the randomization $\lambda$ is applied at the output of the channel $\mu$.

Note that $\delta^{\pi}_{o}(\mu,\kappa) = 0$ if and only if $\kappa \sqsubseteq_\pi \mu$, which captures the intuition that if $\delta^{\pi}_{o}(\mu,\kappa)$ is small, then $X$ is approximately output-degraded from $Z$ w.r.t. $Y$. Using Pinsker's inequality, we have

$$\delta_\pi(\mu,\kappa) \le \sqrt{\tfrac{\ln 2}{2}\,\delta^{\pi}_{o}(\mu,\kappa)}. \tag{19}$$
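On finite alphabets, the weighted deficiency of Definition 7 is a linear program: the Markov kernel $\lambda$ enters the objective linearly once the absolute values inside the total variation are handled with slack variables. The following sketch (our own finite-alphabet illustration; the function and variable names are not from the paper) solves it with scipy:

```python
import numpy as np
from scipy.optimize import linprog

def weighted_tv_deficiency(mu, kappa, pi):
    """Weighted deficiency δ_π(μ, κ) of Definition 7 on finite alphabets.

    mu:    (|Y|, |Z|) array, row y is the distribution μ_y over Z
    kappa: (|Y|, |X|) array, row y is the distribution κ_y over X
    pi:    (|Y|,) array, the input weights π_Y

    Minimizes Σ_y π_y ||λ∘μ_y − κ_y||_TV over Markov kernels λ(x|z),
    as an LP with slack variables t_{y,x} ≥ |(λ∘μ_y)(x) − κ_y(x)|.
    """
    Y, Z = mu.shape
    X = kappa.shape[1]
    n_lam, n_t = Z * X, Y * X                      # variables: λ flattened, then t
    c = np.concatenate([np.zeros(n_lam), 0.5 * np.repeat(pi, X)])
    A_ub, b_ub = [], []
    for y in range(Y):
        for x in range(X):
            row = np.zeros(n_lam + n_t)
            row[np.arange(Z) * X + x] = mu[y]      # (λ∘μ_y)(x) = Σ_z μ_y(z) λ(x|z)
            row[n_lam + y * X + x] = -1.0
            A_ub.append(row.copy()); b_ub.append(kappa[y, x])    #  (λ∘μ_y)(x) − κ_y(x) ≤ t
            row[:n_lam] *= -1.0
            A_ub.append(row); b_ub.append(-kappa[y, x])          # −(λ∘μ_y)(x) + κ_y(x) ≤ t
    A_eq = np.zeros((Z, n_lam + n_t))              # each column λ(·|z) sums to one
    for z in range(Z):
        A_eq[z, z * X:(z + 1) * X] = 1.0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=np.ones(Z), bounds=(0, None))
    return res.fun
```

If $\kappa_y = \lambda\circ\mu_y$ exactly for some kernel $\lambda$, the optimum is zero; a completely uninformative $\mu$ (constant output) against a noiseless binary $\kappa$ with uniform $\pi_Y$ gives deficiency $1/2$.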

## Appendix C Unique Information Bottleneck

In this section, we give a new perspective on the information bottleneck paradigm using nonnegative mutual information decompositions. The quantity we are interested in is the unique information proposed in [Bertschinger et al., 2014]. Work in a similar vein includes [Harder et al., 2013] and, more recently, [Banerjee et al., 2018a], which gives an operationalization of the unique information.

Consider three random variables $X$, $Y$, and $Z$ with joint distribution $P$. The mutual information between $Y$ and $X$ can be decomposed into information that $X$ has about $Y$ that is unknown to $Z$ (we call this the unique information of $X$ w.r.t. $Z$) and information that $X$ has about $Y$ that is known to $Z$ (we call this the shared information).

 I(Y;X)=˜UI(Y;X∖Z)% unique X wrt Z+˜SI(Y;X,Z)shared (% redundant). (20)

Conditioning on $Z$ destroys the shared information but creates complementary, or synergistic, information from the interaction of $X$ and $Z$.

 I(Y;X|Z)=˜UI(Y;X∖Z)% unique X wrt Z+˜CI(Y;X,Z)complementary (% synergistic). (21)

Using the chain rule, the total information that the pair $(X,Z)$ conveys about $Y$ can be decomposed into four terms.

$$I(Y;XZ) = I(Y;X) + I(Y;Z|X) \tag{22}$$
$$\phantom{I(Y;XZ)} = \widetilde{UI}(Y;X\setminus Z) + \widetilde{SI}(Y;X,Z) + \widetilde{UI}(Y;Z\setminus X) + \widetilde{CI}(Y;X,Z). \tag{23}$$

$\widetilde{UI}$, $\widetilde{SI}$, and $\widetilde{CI}$ are nonnegative functions that depend continuously on the joint distribution of $(X,Y,Z)$.

For completeness, we rewrite the information decomposition equations below.

$$I(Y;X) = \widetilde{UI}(Y;X\setminus Z) + \widetilde{SI}(Y;X,Z), \tag{24a}$$
$$I(Y;Z) = \widetilde{UI}(Y;Z\setminus X) + \widetilde{SI}(Y;X,Z), \tag{24b}$$
$$I(Y;X|Z) = \widetilde{UI}(Y;X\setminus Z) + \widetilde{CI}(Y;X,Z), \tag{24c}$$
$$I(Y;Z|X) = \widetilde{UI}(Y;Z\setminus X) + \widetilde{CI}(Y;X,Z). \tag{24d}$$

The unique information can be interpreted as either the conditional mutual information without the synergy, or as the mutual information without the redundancy.

When $Y - X - Z$ is a Markov chain, the information decomposition is

$$\widetilde{UI}(Y;Z\setminus X) = 0, \tag{25a}$$
$$\widetilde{UI}(Y;X\setminus Z) = I(Y;X|Z) = I(Y;X) - I(Y;Z), \tag{25b}$$
$$\widetilde{SI}(Y;X,Z) = I(Y;Z), \tag{25c}$$
$$\widetilde{CI}(Y;X,Z) = 0. \tag{25d}$$
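The Markov-chain identities in Equation 25 reduce all four terms to ordinary mutual informations, so they can be checked directly. The sketch below (names ours, purely illustrative) builds a random joint $p(y,x)$ and an encoder $e(z|x)$, forms the Markov chain $Y - X - Z$, and verifies $I(Y;X|Z) = I(Y;X) - I(Y;Z)$, i.e., Equation 25b:

```python
import numpy as np

def mi(p):
    """Mutual information I(A;B) in nats from a joint array p[a, b]."""
    pa = p.sum(1, keepdims=True)
    pb = p.sum(0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

rng = np.random.default_rng(0)
p_yx = rng.random((3, 4)); p_yx /= p_yx.sum()            # joint p(y, x)
e = rng.random((4, 2)); e /= e.sum(1, keepdims=True)     # encoder e(z|x)
p_yxz = p_yx[:, :, None] * e[None, :, :]                 # Markov chain Y − X − Z

I_yx = mi(p_yxz.sum(2))                                  # I(Y;X)
I_yz = mi(p_yxz.sum(1))                                  # I(Y;Z)
p_z = p_yxz.sum(axis=(0, 1))
I_yx_given_z = sum(p_z[z] * mi(p_yxz[:, :, z] / p_z[z]) for z in range(2))

assert I_yz <= I_yx + 1e-12                              # data processing inequality
assert abs(I_yx_given_z - (I_yx - I_yz)) < 1e-9          # Eq. (25b)
```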

The information bottleneck [Tishby et al., 1999] minimizes the following objective

$$L_{IB}(e) = I(Y;X|Z) + \beta\, I(X;Z), \tag{26}$$

over all encoders $e(z|x)$. Since $Y - X - Z$ is a Markov chain, the sufficiency term $I(Y;X|Z)$ in the IB objective depends only on the pairwise marginals $(Y,X)$ and $(Y,Z)$, while the minimality term $I(X;Z)$ depends on the $(X,Z)$-marginal. From Equation 25b, it follows that one can equivalently write the IB objective function as

$$L_{IB}(e) = \widetilde{UI}(Y;X\setminus Z) + \beta\, I(X;Z). \tag{27}$$

From an information decomposition perspective, the original IB is actually minimizing just the unique information subject to a regularization constraint. This is a simple consequence of the fact that the synergistic information $\widetilde{CI}(Y;X,Z) = 0$ (see Equation 25d) when we have the Markov chain condition $Y - X - Z$. Hence, one might equivalently call the original IB the unique information bottleneck.

Appealing to classical Blackwell theory, Bertschinger et al. [2014] defined a nonnegative decomposition of the mutual information $I(Y;XZ)$ based on the idea that the unique and shared information should depend only on the pairwise marginals $(Y,X)$ and $(Y,Z)$.

###### Definition 12.

Let $\pi_Y \in P(\mathcal{Y})$, and let $\kappa\colon \mathcal{Y} \to \mathcal{X}$, $\mu\colon \mathcal{Y} \to \mathcal{Z}$ be two channels with the same input alphabet such that $P_{YX}(y,x) = \pi_Y(y)\kappa_y(x)$ and $P_{YZ}(y,z) = \pi_Y(y)\mu_y(z)$. Define

$$\Delta_P = \{Q \in P_{\mathcal{Y}\times\mathcal{X}\times\mathcal{Z}} :\ Q_{YX}(y,x) = \pi_Y(y)\kappa_y(x),\ Q_{YZ}(y,z) = \pi_Y(y)\mu_y(z)\}, \tag{28a}$$
$$UI(Y;X\setminus Z) = \min_{Q\in\Delta_P} I_Q(Y;X|Z), \tag{28b}$$
$$UI(Y;Z\setminus X) = \min_{Q\in\Delta_P} I_Q(Y;Z|X), \tag{28c}$$
$$SI(Y;X,Z) = I(Y;X) - UI(Y;X\setminus Z), \tag{28d}$$
$$CI(Y;X,Z) = I(Y;X|Z) - UI(Y;X\setminus Z), \tag{28e}$$

where the subscript $Q$ in $I_Q$ denotes the joint distribution on which the quantities are computed.

The functions $UI$, $SI$, and $CI$ are nonnegative and satisfy Equation 24. Furthermore, $UI$ and $SI$ depend only on the marginal distributions of the pairs $(Y,X)$ and $(Y,Z)$. Only the function $CI$ depends on the full joint distribution $P$.

$UI$ satisfies the following intuitive property in relation to Blackwell's Theorem 4.

###### Proposition 13.

[Bertschinger et al., 2014, Lemma 6] $UI(Y;Z\setminus X) = 0$ if and only if $Z$ is output-degraded from $X$ w.r.t. $Y$.

The equivalence in Proposition 2 follows from noting that $UI(Y;X\setminus Z) = 0$ if and only if $\delta_\pi(\mu,\kappa) = 0$ [Banerjee et al., 2018a, Proposition 28], and the fact that $\widetilde{CI}(Y;X,Z) = 0$ when $Y - X - Z$ is a Markov chain.

Since in the IB setting there is no complementary information, one may choose to minimize either $UI(Y;X\setminus Z)$, which is in fact equal to $I(Y;X|Z)$ (the original IB objective), or the deficiency $\delta_\pi(\mu,\kappa)$. From the discussion above, it is clear that the two are equivalent only in the limit of exact sufficiency, since $UI(Y;X\setminus Z) = 0$ if and only if $\delta_\pi(\mu,\kappa) = 0$. In all other cases, we expect to find something (subtly) different from IB. In general, for the bottlenecks, we have:

$$\min_{e(z|x):\ I(Y;X|Z)\le\epsilon} I(X;Z) \;\ge\; \min_{e(z|x):\ \delta_\pi(\mu,\kappa)\le\epsilon} I(X;Z). \tag{29}$$

Hence, to achieve the same level of sufficiency, one needs to store less information about the input $X$ when minimizing the deficiency than when minimizing the conditional mutual information.

## Appendix D Additional figures on VDB experiments

Figure 5. Evolution of the mutual information between representation and output vs. representation and input (values farther up and to the left are better) over 200 training epochs (dark to light color) on MNIST. The curves are averages over 20 repetitions of the experiment. At early epochs, training mainly effects a fitting of the input-output relationship and an increase of I(Z;Y). At later epochs, training mainly effects a decrease of I(Z;X), which corresponds to the representation increasingly discarding information about the input. An exception is when the regularization parameter β is very small. In this case the representation captures more information about the input, and longer training decreases I(Z;Y), which is indicative of overfitting to the training data. Higher values of M (our method) lead to the representation capturing more information about the target, while at the same time discarding more information about the input. M=1 corresponds to the Variational Information Bottleneck.

## Appendix E Unsupervised representation learning using the VDB

In this section, we discuss some preliminary results on an unsupervised version of the VDB objective, which bears some resemblance to the β-VAE [Higgins et al., 2017]. The cross-entropy loss that appears in an autoencoder is similar to the cross-entropy loss that appears in the information bottleneck. In a spirit similar to the VDB, we can formulate a deficiency autoencoder. We call the unsupervised variational version the Variational Deficiency Autoencoder (VDAE).

Let $p(x)$ be the true data density. The optimization objective of the VDAE is

$$\min_{e(z|x),\,d(x|z)} \int p(x)\Big[-\log\!\int d(x|z)\,e(z|x)\,dz + \beta\, D_{KL}\big(e(Z|x)\,\|\,r(Z)\big)\Big]\,dx, \tag{30}$$

where $r(Z)$ is defined to be a standard multivariate Gaussian distribution $\mathcal{N}(0, I)$, and the parameter $\beta$ allows us to interpolate between pure autoencoding ($\beta \to 0$) and pure autodecoding ($\beta \to \infty$) behavior.

The optimization objective of the β-VAE [Higgins et al., 2017] is

$$\min_{e(z|x),\,d(x|z)} \int p(x)\Big[-\!\int e(z|x)\log d(x|z)\,dz + \beta\, D_{KL}\big(e(Z|x)\,\|\,r(Z)\big)\Big]\,dx. \tag{31}$$

We note that the β-VAE has a similar-looking training objective to the VDAE, with the only difference being that in the β-VAE the expectation over $z$ is outside the logarithm, whereas in the VDAE it is inside.
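The relation between the two reconstruction terms is just Jensen's inequality: since $-\log$ is convex, the log-of-the-average (VDAE) never exceeds the average-of-the-log (β-VAE), and the two coincide for a single Monte Carlo sample. A minimal numeric check with toy likelihood values (names ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8
# toy decoder likelihoods d(x|z_m) for M samples z_m ~ e(z|x) at a fixed input x
d_vals = rng.uniform(0.05, 0.95, size=M)

vdae_recon = -np.log(d_vals.mean())   # log of the average: Monte Carlo form of Eq. (30)
bvae_recon = -np.log(d_vals).mean()   # average of the log: Monte Carlo form of Eq. (31)

assert vdae_recon <= bvae_recon       # Jensen: the VDAE term is never larger
assert -np.log(d_vals[:1].mean()) == -np.log(d_vals[:1]).mean()  # M = 1: the two coincide
```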

Figures 6 and 7 show some preliminary experiments on the MNIST dataset. The representations are visually comparable with results obtained in other standard works on the variational autoencoder.

Figure 6. The learned MNIST manifold for the VDAE with M=3. The plot shows the representation (mean values of the posterior) for 5000 test examples.

Figure 7. Sampling grids in latent space for the VDAE. These plots show the geometric coherence in the latent space of the decoder. The settings are as in Figure 6.