1 Introduction
Tishby et al. (2000) introduced the Information Bottleneck (IB) objective function, which learns a representation Z of observed variables (X, Y) that retains as little information about X as possible, but simultaneously captures as much information about Y as possible:
(1)  IB_β(X, Y; Z) = I(X; Z) − βI(Y; Z)
I(·;·) is the mutual information. The hyperparameter β controls the tradeoff between compression and prediction, in the same spirit as Rate-Distortion Theory (Shannon, 1948), but with a learned representation function P(Z|X) that automatically captures some part of the "semantically meaningful" information, where the semantics are determined by the observed relationship between X and Y. The IB framework has been extended to and extensively studied in a variety of scenarios, including Gaussian variables (Chechik et al., 2005), meta-Gaussians (Rey and Roth, 2012), continuous variables via variational methods (Alemi et al., 2016; Chalk et al., 2016; Fischer, 2018), deterministic scenarios (Strouse and Schwab, 2017a; Kolchinsky et al., 2019), and geometric clustering (Strouse and Schwab, 2017b), and it is used for learning invariant and disentangled representations in deep neural nets (Achille and Soatto, 2018a, b). However, a core issue remains: how should we set a good β? In the original work, the authors recommend sweeping β, which can be prohibitively expensive in practice, and which also leaves open interesting theoretical questions around the relationship between β, the representation Z, and the observed data, P(X, Y).
This work begins to answer some of those questions by characterizing the onset of learning. Specifically:

We show that an improperly chosen β may result in a failure to learn: the trivial solution p(z|x) = p(z) becomes the global minimum of the IB objective, even for β ≫ 1 (Section 1.1).

We introduce the concept of IB-Learnability, and show that when we vary β, the IB objective will undergo a phase transition from the inability to learn to the ability to learn (Section 3).

Using the second-order variation, we derive sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good β (Section 4).

We show that IB-Learnability is determined by the largest confident, typical, and imbalanced subset of the examples (the conspicuous subset), reveal its relationship with the slope of the Pareto frontier at the origin of the information plane I(Y; Z) vs. I(X; Z), and discuss its relation to model capacity (Section 5).

We additionally prove a deep relationship between IB-Learnability, the hypercontractivity coefficient, the contraction coefficient, and the maximum correlation (Section 5).
We also present an algorithm for estimating the onset of IB-Learnability and the conspicuous subset, and demonstrate that it does a good job of approximating both the theoretical predictions and the empirical results (Section 6). Finally, we use our main results to demonstrate on synthetic datasets, MNIST (LeCun et al., 1998), and CIFAR10 (Krizhevsky and Hinton, 2009) that the theoretical prediction for IB-Learnability closely matches experiment (Section 7).
1.1 A Motivating Example
How can we choose a good β? To gain intuition, consider learning multiple Variational Information Bottleneck (VIB) representations (Alemi et al., 2016) of MNIST (LeCun et al., 1998) at different β. We select the digits 0 and 1 for binary classification, and add class-conditional noise (Angluin and Laird, 1988) to the labels with flip probability 0.2, which simulates a general scenario where the data may be noisy and the dependence of Y on X is not deterministic. The algorithm only sees the corrupted labels. Fig. 1 shows the converged accuracy on the true labels for the VIB models plotted against β. We see clearly that below a threshold value of β, no learning happens, and the accuracy is the same as random guessing. Beginning at that threshold, there is a clear phase transition where the accuracy sharply increases, indicating the objective is able to learn a nontrivial representation. This kind of phase transition is typical in our experiments in Section 7. When the noise rate is high, the transition can happen at β ≫ 1; i.e., we need a large β "force" to extract relevant information from X to predict Y. In that case, an improperly chosen β in the unlearnable region will preclude learning a useful representation.
2 Related Work
The original IB work (Tishby et al., 2000) provides a tabular method for exactly computing the optimal encoder distribution for a given β and cardinality of the discrete representation, |Z|. Thus, the search for the desired model involves not only sweeping β, but also considering different representation dimensionalities. These restrictions were lifted somewhat by Chechik et al. (2005), which presents the Gaussian Information Bottleneck (GIB) for learning a multivariate Gaussian representation Z of (X, Y), assuming that both X and Y are also multivariate Gaussians. They also note the presence of the trivial solution not only when β ≤ 1, but also depending on the eigenspectrum of the observed variables. However, the restriction to multivariate Gaussian datasets limits the generality of the analysis. Another analytic treatment of IB is given in Rey and Roth (2012), which reformulates the objective in terms of copula functions. As with the GIB approach, this formulation restricts the form of the data distributions: the copula functions for the joint distribution P(X, Y) are assumed to be known, which is unlikely in practice. Strouse and Schwab (2017a) present the Deterministic Information Bottleneck (DIB), which minimizes the coding cost of the representation, H(Z), rather than the transmission cost, I(X; Z), as in IB. This approach learns hard clusterings with different code entropies that vary with β. In this case, it is clear that a hard clustering with minimal H(Z) will result in a single cluster for all of the data, which is the DIB trivial solution. No analysis is given beyond this fact to predict the actual onset of learnability, however.
The first amortized IB objective is the Variational Information Bottleneck (VIB) of Alemi et al. (2016). VIB replaces the exact, tabular approach of IB with variational approximations of the classifier distribution (p(y|z)) and the marginal distribution (p(z)). This approach cleanly permits learning a stochastic encoder, p(z|x), that is applicable to any x, rather than just the particular X seen at training time. The cost of this flexibility is the use of variational approximations that may be less expressive than the tabular method. Nevertheless, in practice, VIB learns easily and is simple to implement, so we rely on VIB models for our experimental confirmation. Closely related to IB is the recently proposed Conditional Entropy Bottleneck (CEB) (Fischer, 2018). CEB attempts to explicitly learn the Minimum Necessary Information (MNI), defined as the point in the information plane where I(X; Z) = I(Y; Z) = I(X; Y). The MNI point may not be achievable even in principle for a particular dataset. However, the CEB objective provides an explicit estimate of how closely the model is approaching the MNI point by observing that a necessary condition for reaching the MNI point occurs when I(X; Z|Y) = 0. The CEB objective is equivalent to the IB objective under a simple reparameterization of β, so our analysis of IB-Learnability applies equally to CEB.
Kolchinsky et al. (2019) present analytic and empirical results about trivial solutions in the particular setting of Y being a deterministic function of X in the observed sample. However, their use of the term "trivial solution" is distinct from ours. They are referring to the observation that, in this setting, the IB curve will demonstrate trivial interpolation between two different but valid solutions on the optimal frontier, rather than demonstrating a nontrivial tradeoff between compression and prediction as expected when varying the IB Lagrangian. Our use of "trivial" refers to whether IB is capable of learning at all given a certain dataset and value of β.
Achille and Soatto (2018b) apply the IB Lagrangian to the weights of a neural network, yielding Information Dropout. In Achille and Soatto (2018a), the authors give a deep and compelling analysis of how the IB Lagrangian can yield invariant and disentangled representations. They do not, however, consider the question of the onset of learning, although they are aware that not all models will learn a nontrivial representation. More recently, Achille et al. (2018) repurpose the Information Dropout IB Lagrangian as a Kolmogorov Structure Function to analyze the ease with which a previously trained network can be fine-tuned for a new task. While that work is tangentially related to learnability, the question it addresses is substantially different from our investigation of the onset of learning. Our work is also closely related to the hypercontractivity coefficient (Anantharam et al. (2013); Polyanskiy and Wu (2017)), defined as ξ(X; Y) = sup_Z I(Y; Z)/I(X; Z), which by definition equals the inverse of β₀, our IB-Learnability threshold. In Anantharam et al. (2013), the authors prove that the hypercontractivity coefficient equals the contraction coefficient η_KL(p(y|x), p(x)), and Kim et al. (2017) propose a practical algorithm to estimate ξ(X; Y), which provides a measure for potential influence in the data. Although our goal is different, the sufficient conditions we provide for IB-Learnability are also lower bounds for the hypercontractivity coefficient.
3 IB-Learnability
We are given instances of (x, y) drawn from a distribution with probability (density) P(X, Y), where unless otherwise stated, both X and Y can be discrete or continuous variables. (X, Y) is our training data, and may be characterized by different types of noise. The nature of this training data and the choice of β will be sufficient to predict the transition from unlearnable to learnable.
We can learn a representation Z of X with conditional probability p(z|x), such that X, Y, Z obey the Markov chain Z − X − Y. (We use capital letters X, Y, Z for random variables and lowercase x, y, z to denote instances of variables, with P(·) and p(·) denoting their probability or probability density, respectively.) Eq. (1) above gives the IB objective with Lagrange multiplier β, IB_β(X, Y; Z), which is a functional of p(z|x): IB_β(X, Y; Z) = IB_β[p(z|x)]. The IB learning task is to find a conditional probability p(z|x) that minimizes IB_β(X, Y; Z). The larger β, the more the objective favors making a good prediction for Y. Conversely, the smaller β, the more the objective favors learning a concise representation. How can we select β such that the IB objective learns a useful representation? In practice, the selection of β is done empirically. Indeed, Tishby et al. (2000) recommend "sweeping β". In this paper, we provide theoretical guidance for choosing β by introducing the concept of IB-Learnability and providing a series of IB-learnable conditions.
Definition 1.
(X, Y) is IB_β-learnable if there exists a Z given by some p(z|x), such that IB_β(X, Y; Z) < IB_β(X, Y; Z_trivial), where Z_trivial characterizes the trivial representation with p(z|x) = p(z), i.e. Z independent of X.
If (X, Y) is IB_β-learnable, then when IB_β(X, Y; Z) is globally minimized, it will not learn a trivial representation. On the other hand, if (X, Y) is not IB_β-learnable, then when IB_β(X, Y; Z) is globally minimized, it may learn a trivial representation.
Trivial solutions.
Definition 1 defines trivial solutions in terms of representations where p(z|x) = p(z). Another type of trivial solution occurs when I(X; Z) > 0 but I(Y; Z) = 0. This type of trivial solution is not directly achievable by the IB objective, as I(X; Z) is minimized, but it can be achieved by construction or by chance. It is possible that starting learning from such a representation could result in access to nontrivial solutions not available from p(z|x) = p(z). We do not attempt to investigate this type of trivial solution in this work.
Necessary condition for IB-Learnability.
From Definition 1, we can see that IB_β-Learnability for any dataset requires β > 1. In fact, from the Markov chain Z − X − Y, we have I(Y; Z) ≤ I(X; Z) via the data-processing inequality. If β ≤ 1, then since I(X; Z) ≥ 0 and I(Y; Z) ≥ 0, we have IB_β(X, Y; Z) = I(X; Z) − βI(Y; Z) ≥ (1 − β)I(Y; Z) ≥ 0, with equality attained by the trivial representation. Hence (X, Y) is not IB_β-learnable for β ≤ 1.
Due to the reparameterization invariance of mutual information, we have the following theorem for IB_β-Learnability:
Theorem 1.
Let X′ = g(X), where g is an invertible map (if X is a continuous variable, g is additionally required to be continuous). Then (X, Y) and (X′, Y) have the same IB_β-Learnability.
4 Sufficient Conditions for IB-Learnability
Given (X, Y), how can we determine whether it is IB_β-learnable? To answer this question, we derive a series of sufficient conditions for IB_β-Learnability, starting from its definition. The conditions are in increasing order of practicality, while sacrificing as little generality as possible.
Theorem 2.
If (X, Y) is IB_β₁-learnable, then for any β₂ > β₁, it is IB_β₂-learnable.
Based on Theorem 2, the range of β such that (X, Y) is IB_β-learnable has the form β ∈ (β₀, +∞). Thus, β₀ is the threshold of IB-Learnability.
Lemma 2.1.
p(z|x) = p(z) is a stationary solution for IB_β(X, Y; Z).
The proof in Appendix F shows that both first-order variations δI(X; Z) and δI(Y; Z) vanish at the trivial representation p(z|x) = p(z), so δIB_β(X, Y; Z) = 0 at the trivial representation.
Lemma 2.1 yields our strategy for finding sufficient conditions for IB_β-Learnability: find conditions such that p(z|x) = p(z) is not a local minimum for the functional IB_β[p(z|x)]. Based on the necessary condition for a minimum (Appendix D), we have the following theorem. (The theorems in this paper deal with learnability w.r.t. the true mutual information. If parameterized models are used to approximate the mutual information, the limitation of model capacity will translate into more uncertainty of Y given X, viewed through the lens of the model.)
Theorem 3 (Suff. Cond. 1).
A sufficient condition for (X, Y) to be IB_β-learnable is that there exists a perturbation function h(z|x), so that the perturbed probability (density) is p′(z|x) = p(z|x) + ε·h(z|x) (for integrals, whenever a variable is discrete, we can simply replace the integral by a summation Σ), with ∫ h(z|x) dz = 0, such that the second-order variation δ²IB_β(X, Y; Z) < 0 at the trivial representation p(z|x) = p(z).
The proof for Theorem 3 is given in Appendix D. Intuitively, if δ²IB_β(X, Y; Z) < 0, we can always find a p′(z|x) in the neighborhood of the trivial representation p(z|x) = p(z), such that IB_β[p′(z|x)] < IB_β[p(z)], thus satisfying the definition for IB_β-Learnability.
To make Theorem 3 more practical, we perturb p(z|x) around the trivial solution p(z|x) = p(z), and expand IB_β to second order in the perturbation parameter ε. We can then prove Theorem 4:
Theorem 4 (Suff. Cond. 2).
A sufficient condition for (X, Y) to be IB_β-learnable is that X and Y are not independent, and
(2)  β > inf_{h(x)} β₀[h(x)]
where the functional β₀[h(x)] is given by
β₀[h(x)] = ( E_{x∼p(x)}[h(x)²] − (E_{x∼p(x)}[h(x)])² ) / ( E_{y∼p(y)}[ (E_{x∼p(x|y)}[h(x)])² ] − (E_{x∼p(x)}[h(x)])² )
Moreover, we have that (inf_{h(x)} β₀[h(x)])⁻¹ is a lower bound of the slope of the Pareto frontier in the information plane I(Y; Z) vs. I(X; Z) at the origin.
The proof is given in Appendix G, which also shows that if the condition in Theorem 4 is satisfied, we can construct a perturbation function h(z|x) = h(x)g(z), for some g(z), such that it satisfies Theorem 3. It also shows that the converse is true: if there exists an h(z|x) such that the condition in Theorem 3 is true, then the condition in Theorem 4 is satisfied. (We do not claim that any h(z|x) satisfying Theorem 3 can be decomposed to h(x)g(z) at the onset of learning. But from the equivalence of Theorems 3 and 4 as explained above, when there exists an h(z|x) such that Theorem 3 is satisfied, we can always construct an h(z|x) = h(x)g(z) that also satisfies Theorem 3.) That is, β₀ = inf_{h(x)} β₀[h(x)]. Moreover, letting h*(x) denote the optimal perturbation function at the trivial solution, we have
(3)  p̂(y|z) = p(y) ( 1 + ε ( E_{x∼p(x|y)}[h*(x)] − E_{x∼p(x)}[h*(x)] ) ) + O(ε²)
where p̂(y|z) is the p(y|z) estimated by IB for a certain z, and ε is a small constant. This shows how the p(y|z) estimated by IB explicitly depends on h*(x) at the onset of learning. The proof is provided in Appendix H.
Theorem 4 suggests a method to estimate β₀: we can parameterize h(x), e.g. by a neural network, with the objective of minimizing β₀[h(x)]. At its minimum, β₀[h(x)] provides an upper bound for β₀, and h(x) provides a soft clustering of the examples, corresponding to a nontrivial perturbation of p(z|x) at the trivial solution that minimizes β₀[h(x)].
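As a sanity check of this estimation strategy, the functional β₀[h(x)] can be evaluated directly on samples when Y is discrete; a neural parameterization of h(x) would minimize the same quantity by SGD. The following is a minimal sketch with our own code and names, not the paper's implementation:

```python
import numpy as np

def beta0_functional(h, y, n_classes):
    """Sample estimate of beta_0[h] from Eq. (2):
    ( E[h^2] - E[h]^2 ) / ( E_y[ (E_{x|y}[h])^2 ] - E[h]^2 ).
    `h` holds h(x) for each example; `y` holds the observed labels."""
    h = np.asarray(h, dtype=float)
    mean_h = h.mean()
    numerator = (h ** 2).mean() - mean_h ** 2
    denom = -mean_h ** 2
    for c in range(n_classes):
        mask = (y == c)
        p_c = mask.mean()               # empirical p(y = c)
        if p_c > 0:
            denom += p_c * h[mask].mean() ** 2
    return numerator / denom
```

For a deterministic, perfectly separated binary problem with h(x) equal to the class indicator, this evaluates to 1, matching the β₀ = 1 threshold for deterministic labels discussed in Section 6.2.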
Alternatively, based on the properties of β₀[h(x)], we can use a specific functional form for h(x) in Eq. (2) and obtain a stronger sufficient condition for IB_β-Learnability. But we want to choose h(x) as near to the infimum as possible. To do this, we note the following characteristic of the R.H.S. of Eq. (2):

We can set h(x) to be nonzero for x ∈ Ω_x, for some region Ω_x ⊆ X, and 0 otherwise. Then we obtain the following sufficient condition:
(4)  β > inf_{Ω_x ⊆ X} ( p(Ω_x)(1 − p(Ω_x)) ) / ( E_{y∼p(y)}[ p(Ω_x|y)² ] − p(Ω_x)² )
Based on these observations, we can let h(x) be a nonzero constant inside some region Ω_x ⊆ X and 0 otherwise, so that the infimum over an arbitrary function h(x) simplifies to an infimum over Ω_x, and we obtain a sufficient condition for IB_β-Learnability, which is a key result of this paper:
Theorem 5 (Conspicuous Subset Suff. Cond.).
A sufficient condition for (X, Y) to be IB_β-learnable is that X and Y are not independent, and
(5)  β > inf_{Ω_x ⊆ X} β₀(Ω_x),   β₀(Ω_x) = ( 1 − p(Ω_x) ) / ( p(Ω_x) [ Σ_y p(y|Ω_x)²/p(y) − 1 ] )
where
Ω_x denotes the event that x ∈ Ω_x, with probability p(Ω_x).
(inf_{Ω_x ⊆ X} β₀(Ω_x))⁻¹ gives a lower bound of the slope of the Pareto frontier in the information plane I(Y; Z) vs. I(X; Z) at the origin.
The proof is given in Appendix I. In the proof we also show that this condition is invariant to invertible mappings of X.
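Once a candidate subset Ω_x is fixed, Eq. (5) is straightforward to evaluate. A minimal sketch (our own code; it assumes the quantities p(Ω_x), p(y|Ω_x), and p(y) have already been estimated):

```python
import numpy as np

def beta0_subset(p_omega, p_y_given_omega, p_y):
    """beta_0(Omega_x) from Eq. (5):
    (1 - p(Omega)) / ( p(Omega) * (sum_y p(y|Omega)^2 / p(y) - 1) )."""
    p_y_given_omega = np.asarray(p_y_given_omega, dtype=float)
    p_y = np.asarray(p_y, dtype=float)
    denom = p_omega * ((p_y_given_omega ** 2 / p_y).sum() - 1.0)
    return (1.0 - p_omega) / denom
```

For example, with balanced binary classes and 20% class-conditional label noise, taking Ω_x to be one true class gives p(Ω_x) = 0.5 and p(y|Ω_x) = (0.8, 0.2), so β₀(Ω_x) = 0.5 / (0.5 · 0.36) ≈ 2.78; with no noise the same subset gives exactly 1.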
5 Discussion
The conspicuous subset determines β₀.
From Eq. (5), we see that three characteristics of the subset Ω_x ⊆ X lead to a low β₀: (1) confidence: p(y|Ω_x) is large for some y; (2) typicality and size: the number of elements in Ω_x is large, or the elements in Ω_x are typical, leading to a large probability p(Ω_x); (3) imbalance: p(y) is small for the class favored by Ω_x, but large for its complement. In summary, β₀ will be determined by the largest confident, typical and imbalanced subset of examples, or an equilibrium of those characteristics. We term the Ω_x attaining the minimum the conspicuous subset.
Multiple phase transitions.
Based on this characterization of β₀, we can hypothesize datasets with multiple learnability phase transitions. Specifically, consider a region Ω_x that is small but "typical", consists of all elements confidently predicted as y_Ω by p(y|x), and where y_Ω is the least common class. By construction, this Ω_x will dominate the infimum in Eq. (5), resulting in a small value of β₀. However, the remaining examples effectively form a new dataset, (X′, Y′). At exactly β₀, we may have that the current encoder, p(z|x), has no mutual information with the remaining classes; i.e., I(Y′; Z) = 0. In this case, Definition 1 applies to (X′, Y′) with respect to Z. We might expect to see that, at β₀, learning will plateau until we get to some β₀′ > β₀ that defines the phase transition for (X′, Y′). Clearly this process could repeat many times, with each new dataset being distinctly more difficult to learn than the last.
Similarity to information measures.
The denominator of β₀(Ω_x) in Eq. (5) is closely related to mutual information. Using the inequality z − 1 ≥ log z for z > 0, it becomes:
Σ_y p(y|Ω_x)²/p(y) − 1 = E_{y∼p(y|Ω_x)}[ p(y|Ω_x)/p(y) ] − 1 ≥ E_{y∼p(y|Ω_x)}[ log( p(y|Ω_x)/p(y) ) ] ≡ Ĩ(Ω_x; Y)
where Ĩ(Ω_x; Y) is the mutual information "density" at x ∈ Ω_x. Of course, this quantity is also D_KL( p(y|Ω_x) ∥ p(y) ) ≥ 0, so we know that the denominator of Eq. (5) is nonnegative. Incidentally, E_{y∼p(y|Ω_x)}[ p(y|Ω_x)/p(y) ] − 1 is the density of "rational mutual information" (Lin and Tegmark, 2016) at Ω_x.
Similarly, the numerator of β₀(Ω_x) is related to the self-information of Ω_x: 1 − p(Ω_x) ≤ −log p(Ω_x),
so we can estimate the phase transition as:
(6)  β₀ ≈ inf_{Ω_x ⊆ X} ( −log p(Ω_x) ) / Ĩ(Ω_x; Y)
Since Eq. (6) uses upper bounds on both the numerator and the denominator, it does not give us a bound on β₀.
Estimating model capacity.
The observation that a model cannot distinguish between cluster overlap in the data and its own lack of capacity gives an interesting way to use IB-Learnability to measure the capacity of a set of models relative to the task they are being used to solve.
Learnability and the Information Plane.
Many of our results can be interpreted in terms of the geometry of the Pareto frontier illustrated in Fig. 2, which describes the tradeoff between increasing I(Y; Z) and decreasing I(X; Z). At any point on this frontier that minimizes IB_β = I(X; Z) − βI(Y; Z), the frontier will have slope 1/β if it is differentiable. If the frontier is also concave (has negative second derivative), then this slope will take its maximum at the origin, which implies IB_β-Learnability for any β whose inverse is less than that maximum slope, so that the threshold β₀ for IB-Learnability is simply the inverse slope of the frontier at the origin. More generally, as long as the Pareto frontier is differentiable, the threshold for IB-Learnability is the inverse of its maximum slope. Indeed, Theorem 4 and Theorem 5 give lower bounds on the slope of the Pareto frontier at the origin.
IB-Learnability, hypercontractivity, and maximum correlation.
IB-Learnability and the sufficient conditions we provide harbor a deep connection with hypercontractivity and maximum correlation:
(7)  1/β₀ = ξ(X; Y) = η_KL ≥ ρ_m(X; Y)²
which we prove in Appendix K. Here ρ_m(X; Y) = max_{f,g} E[f(X)g(Y)] s.t. E[f(X)] = E[g(Y)] = 0 and E[f(X)²] = E[g(Y)²] = 1 is the maximum correlation (Hirschfeld, 1935; Gebelein, 1941), ξ(X; Y) is the hypercontractivity coefficient, and η_KL is the contraction coefficient. Our proof relies on the results of Anantharam et al. (2013). Our work reveals the deep relationship between IB-Learnability and these earlier concepts, and provides additional insight about what aspects of a dataset give rise to high maximum correlation and hypercontractivity: the most confident, typical, imbalanced subset of the examples.
6 Estimating the IB-Learnability Condition
Theorem 5 not only reveals the relationship between the learnability threshold β₀ and the least noisy region of P(X, Y), but also provides a way to practically estimate β₀, both in the general classification case and in more structured settings.
6.1 Estimation Algorithm
Based on Theorem 5, for general classification tasks we suggest Algorithm 1 to empirically estimate an upper bound β̂₀ ≥ β₀, as well as to discover the conspicuous subset that determines β₀.
We approximate the probability p(x) of each example by its empirical probability, p̂(x) = 1/N, where N is the number of examples in the dataset (e.g., for MNIST, p̂(x) = 1/60000). The algorithm starts by first learning a maximum likelihood model of p(y|x), using e.g. a feedforward neural network. It then constructs a matrix P_{y|x} and a vector P_y to store the estimated p(y|x) and p(y) for all the examples in the dataset. To find the subset Ω_x such that β₀(Ω_x) is as small as possible, by the previous analysis we want to find a conspicuous subset whose p(y|x) is large for a certain class y (to make the denominator of Eq. (5) large), and which contains as many elements as possible (to make the numerator small). We suggest the following heuristic to discover such a conspicuous subset. For each class y, we sort the rows of P_{y|x} by their probability for the pivot class y in decreasing order, and then perform a search over the number of retained rows defining Ω_x. Since β₀(Ω_x) is large when Ω_x contains too few or too many elements, the minimum of β₀(Ω_x) for class y will typically be reached with some intermediate-sized subset, and we can use binary search or another discrete search algorithm for the optimization. The algorithm stops when β̂₀ does not improve by a tolerance ε. The algorithm then returns β̂₀ as the minimum over all classes y, as well as the conspicuous subset that determines this β̂₀. After estimating β̂₀, we can then use it for learning with IB, either directly, or as an anchor for a region where we can perform a much smaller sweep than we otherwise would have. This may be particularly important for very noisy datasets, where β₀ can be very large.
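Assuming Algorithm 1 follows the description above, a minimal rendering might look as follows. All names are ours, we assume a uniform empirical p(x) = 1/N, and for clarity we use an exhaustive scan over subset sizes where the paper suggests binary or other discrete search:

```python
import numpy as np

def estimate_beta0(p_y_given_x, p_y):
    """Sketch of Algorithm 1: for each pivot class, sort examples by
    confidence p(pivot | x), then scan subset sizes for the minimum of Eq. (5).
    Returns (beta0_hat, indices of the conspicuous subset)."""
    n, k = p_y_given_x.shape
    best, best_subset = np.inf, None
    for c in range(k):
        order = np.argsort(-p_y_given_x[:, c])           # most confident first
        csum = np.cumsum(p_y_given_x[order], axis=0)     # running sums of p(y|x)
        for m in range(1, n):                            # Omega_x of size m < n
            p_omega = m / n
            p_y_omega = csum[m - 1] / m                  # estimated p(y | Omega_x)
            denom = p_omega * ((p_y_omega ** 2 / p_y).sum() - 1.0)
            if denom > 0:
                b = (1 - p_omega) / denom
                if b < best:
                    best, best_subset = b, order[:m]
    return best, best_subset
```

On a toy balanced binary problem with 20% class-conditional noise (rows of p(y|x) equal to (0.8, 0.2) or (0.2, 0.8)), the scan recovers one true class as the conspicuous subset, with β̂₀ = 1/0.36 ≈ 2.78, matching Corollary 5.1.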
6.2 Special Cases for Estimating β₀
Theorem 5 may still be challenging to apply, due to the difficulty of making accurate estimates of p(y|x) and of searching over subsets Ω_x ⊆ X. However, if the learning problem is more structured, we may be able to obtain a simpler formula for the sufficient condition.
Class-conditional label noise.
Classification with noisy labels is a common practical scenario. An important noise model is that the labels are randomly flipped with some hidden class-conditional probabilities, and we only observe the corrupted labels. This problem has been studied extensively (Angluin and Laird, 1988; Natarajan et al., 2013; Liu and Tao, 2016; Xiao et al., 2015; Northcutt et al., 2017). If IB is applied to this scenario, how large a β do we need? The following corollary provides a simple formula.
Corollary 5.1.
Suppose that the true class labels are y*, and the regions of the input space belonging to each y* have no overlap. We only observe the corrupted labels y with class-conditional noise p(y|y*, x) = p(y|y*), and Y is not independent of X. We have that a sufficient condition for IB_β-Learnability is:
(8)  β > min_{y*} ( 1 − p(y*) ) / ( p(y*) [ Σ_y p(y|y*)²/p(y) − 1 ] )
We see that under class-conditional noise, the sufficient condition reduces to a discrete formula which only depends on the noise rates p(y|y*) and the true class probabilities p(y*), which can be accurately estimated via e.g. Northcutt et al. (2017). Additionally, if we know that the noise is class-conditional, but the observed β₀ is greater than the R.H.S. of Eq. (8), we can deduce that there is overlap between the true classes. The proof of Corollary 5.1 is provided in Appendix J.
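Eq. (8) is simple enough to evaluate directly from the noise rates and the true class probabilities. A sketch (our own code; `p_true` and `noise` are assumed known or estimated, and the observed p(y) follows from them):

```python
def beta0_class_conditional(p_true, noise):
    """Eq. (8) for class-conditional label noise.
    p_true[i] = p(y*_i); noise[i][j] = p(y = j | y* = i)."""
    k = len(p_true)
    # observed marginal: p(y) = sum_{y*} p(y*) p(y | y*)
    p_obs = [sum(p_true[i] * noise[i][j] for i in range(k)) for j in range(k)]
    best = float("inf")
    for i in range(k):
        s = sum(noise[i][j] ** 2 / p_obs[j] for j in range(k))
        best = min(best, (1 - p_true[i]) / (p_true[i] * (s - 1)))
    return best
```

With balanced binary classes and flip rate 0.2 this gives 1/0.36 ≈ 2.78, and with zero noise it reduces to the deterministic threshold of 1, consistent with Corollary 5.2.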
Deterministic relationships.
Theorem 5 also reveals that β₀ relates closely to whether Y is a deterministic function of X, as shown by Corollary 5.2:
Corollary 5.2.
Assume that Y contains at least one value y such that its probability satisfies 0 < p(y) < 1. If Y is a deterministic function of X and not independent of X, then a sufficient condition for IB_β-Learnability is β > 1.
The assumption in Corollary 5.2 is satisfied by classification and by certain regression problems. Combined with the necessary condition β > 1 for any dataset to be learnable (Section 3), we have that under this assumption, if Y is a deterministic function of X, then a necessary and sufficient condition for learnability is β > 1; i.e., its β₀ is 1. The proof of Corollary 5.2 is provided in Appendix J.
Therefore, in practice, if we find that β₀ > 1, we may infer that Y is not a deterministic function of X. For a classification task, we may infer that either some classes have overlap, or the labels are noisy. However, recall that finite models may add effective class overlap if they have insufficient capacity for the learning task, as mentioned in Section 4. This may translate into a higher observed β₀, even when learning deterministic functions.
7 Experiments
To test how the theoretical conditions for learnability match experiment, we apply them to synthetic data with varying noise rates and class overlap, MNIST binary classification with varying noise rates, and CIFAR10 classification, comparing with the β₀ found experimentally. We also compare with the algorithm in Kim et al. (2017) for estimating the hypercontractivity coefficient (= 1/β₀) via the contraction coefficient η_KL. Experiment details are in Appendix L.
7.1 Synthetic Dataset Experiments
We construct a set of datasets from 2D mixtures of 2 Gaussians as X, with the identity of the mixture component as Y. We simulate two practical scenarios with these datasets: (1) noisy labels with class-conditional noise, and (2) class overlap. For (1), we vary the class-conditional noise rates. For (2), we vary class overlap by tuning the distance between the Gaussians. For each experiment, we sweep β with exponential steps, and observe I(X; Z) and I(Y; Z). We then compare the empirical β₀ indicated by the onset of above-zero I(Y; Z) with the predicted values for β₀.
Classification with class-conditional noise.
In this experiment, X is drawn from a mixture of two 2D Gaussians with diagonal covariance. The two components are separated by a distance of 16 (hence have virtually no overlap) and have equal mixture weight. For each x, the true label is the identity of the component it belongs to. We create multiple datasets by randomly flipping the labels with a certain noise rate ρ. For each dataset, we train VIB models across a range of β, and observe the onset of learning, i.e. accuracy above random chance (Observed). To test how different methods perform in estimating β₀, we apply the following: (1) Corollary 5.1, since this is classification with class-conditional noise and the two true classes have virtually no overlap; (2) Alg. 1 with the true p(y|x); (3) the algorithm in Kim et al. (2017) that estimates ξ(X; Y), provided with the true p(y|x); (4) β₀[h(x)] in Eq. (2) with the true p(y|x); (5) Alg. 1 with p(y|x) estimated by a neural net; (6) β₀[h(x)] with the same estimated p(y|x) as in (5). The results are shown in Fig. 3 and in Appendix L.1. From Fig. 3 we see the following. (A) When using the true p(y|x), both Alg. 1 and β₀[h(x)] generally upper bound the empirical β₀, and Alg. 1 is generally tighter. (B) When using the true p(y|x), Alg. 1 and Corollary 5.1 give the same result. (C) Comparing Alg. 1 and β₀[h(x)], both of which use the same empirically estimated p(y|x), both approaches provide good estimates in the low-noise region; however, in the high-noise region, Alg. 1 gives more precise values than β₀[h(x)], indicating that Alg. 1 is more robust to estimation error in p(y|x). (D) Eq. (2) empirically upper bounds the experimentally observed β₀, and gives almost the same result as the theoretical estimates from Corollary 5.1 and from Alg. 1 with the true p(y|x). In the classification setting, this approach doesn't require any learned estimate of p(y|x), as we can directly use the empirical p(y|x) and p(y) from SGD minibatches.
This experiment also shows that for datasets where the signal-to-noise ratio is small, β₀ can be very high. Instead of blindly sweeping β, our result can provide guidance for setting β so learning can happen.
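The synthetic setup above can be reproduced in a few lines. The sketch below is our own code (parameter names are ours, and unit variance is our assumption since the covariance value is not stated here): it draws X from two 2D Gaussians a given distance apart and flips the component labels with probability ρ.

```python
import numpy as np

def make_noisy_mixture(n, distance, rho, seed=0):
    """Mixture-of-two-Gaussians dataset with class-conditional label noise.
    Returns inputs x, true labels y_true, and observed (corrupted) labels y_obs."""
    rng = np.random.default_rng(seed)
    y_true = rng.integers(0, 2, size=n)                  # component identity
    centers = np.array([[0.0, 0.0], [distance, 0.0]])    # means `distance` apart
    x = centers[y_true] + rng.normal(size=(n, 2))        # unit-variance components
    flip = rng.random(n) < rho                           # flip each label w.p. rho
    y_obs = np.where(flip, 1 - y_true, y_true)
    return x, y_true, y_obs
```

Sweeping `rho` and `distance` reproduces scenarios (1) and (2) respectively; the learner is only ever shown `y_obs`.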
Classification with class overlap.
In this experiment, we test how different amounts of overlap between classes influence β₀. We again use a mixture of two 2D Gaussians with diagonal covariance, now with mixture weights 0.6 and 0.4. We vary the distance between the Gaussians from 8.0 down to 0.8 and observe the estimated β̂₀. Since we don't add noise to the labels, if there were no overlap and a deterministic map from X to Y, we would have β₀ = 1 by Corollary 5.2. The more overlap between the two classes, the more uncertain Y is given X, and by Eq. (5) we expect β₀ to be larger, which is corroborated in Fig. 4.
7.2 MNIST Experiments
We perform binary classification with digits 0 and 1 and, as before, add class-conditional noise to the labels with varying noise rates ρ. To explore how model capacity influences the onset of learning, for each dataset we train two sets of VIB models differing only by the number of neurons in the hidden layers of the encoder: one with a larger hidden layer, the other with a smaller one. As we describe in Section 4, insufficient capacity will result in more uncertainty of Y given X from the point of view of the model, so we expect the observed β₀ for the smaller model to be larger. This result is confirmed by the experiment (Fig. 5). Also, in Fig. 5 we plot β̂₀ given by the different estimation methods. We see that the observations (A), (B), (C) and (D) in Section 7.1 still hold.
7.3 MNIST Experiments using Equation 2
To see what IB learns at its onset of learning for the full MNIST dataset, we optimize Eq. (2) w.r.t. h(x) on the full MNIST dataset, and visualize the clustering of digits by h(x). Eq. (2) can be optimized by SGD with any differentiable parameterized mapping h: X → ℝ. In this case, we chose to parameterize h(x) with a PixelCNN++ architecture (van den Oord et al., 2016; Salimans et al., 2017), as PixelCNN++ is a powerful autoregressive model for images that gives a scalar output (normally interpreted as log p(x)). Eq. (2) should generally give two clusters in the output space, as discussed in Section 4. In this setup, smaller values of h(x) correspond to the subset of the data that is easiest to learn. Fig. 6 shows two strongly separated clusters, as well as the threshold we chose to divide them. Fig. 8 shows the first 5,776 MNIST training examples as sorted by our learned h(x), with the examples above the threshold highlighted in red. We can clearly see that our learned h(x) has separated the "easy" one (1) digits from the rest of the MNIST training set.
7.4 CIFAR10 Forgetting Experiments
For CIFAR10 (Krizhevsky and Hinton, 2009), we study how forgetting varies with β. In other words, given a VIB model trained at some high β, if we anneal β down to some much lower value, what does the model converge to? Using Alg. 1, we estimated β̂₀ on a version of CIFAR10 with 20% label noise, where p(y|x) is estimated by maximum likelihood training with the same encoder and classifier architectures as used for VIB. For the VIB models, the lowest β with performance above chance was a very tight match with the estimate from Alg. 1. See Appendix L.2 for details.
8 Conclusion
In this paper, we have presented theoretical results for predicting the onset of learning, and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset, and showed that those predictions are accurate, even in cases of extreme label noise. We believe these results will provide theoretical and practical guidance for choosing β in the IB framework for balancing prediction and compression. Our work also raises other questions, such as whether there are other phase transitions in learnability that might be identified. We hope to address some of those questions in future work.
Acknowledgements
Tailin Wu’s work was supported by the Casey and Family Foundation, the Foundational Questions Institute, and the Rothberg Family Fund for Cognitive Science. He thanks the Center for Brains, Minds, and Machines (CBMM) for its hospitality.
References

Achille and Soatto (2018a) Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018a.
 Achille and Soatto (2018b) Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018b.
 Achille et al. (2018) Alessandro Achille, Glen Mbeng, and Stefano Soatto. The Dynamics of Differential Learning I: Information-Dynamics and Task Reachability. arXiv preprint arXiv:1810.02440, 2018.
 Alemi et al. (2016) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
 Anantharam et al. (2013) Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv preprint arXiv:1304.6133, 2013.
 Angluin and Laird (1988) Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
 Chalk et al. (2016) Matthew Chalk, Olivier Marre, and Gasper Tkacik. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pages 1957–1965, 2016.
 Chechik et al. (2005) Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.
 Cubuk et al. (2018) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
 Erkip and Cover (1998) Elza Erkip and Thomas M Cover. The efficiency of investment information. IEEE Transactions on Information Theory, 44(3):1026–1040, 1998.
 Fischer (2018) Ian Fischer. The conditional entropy bottleneck, 2018. URL openreview.net/forum?id=rkVOXhAqY7.
 Gebelein (1941) Hans Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364–379, 1941.
 Gelfand et al. (2000) Izrail Moiseevitch Gelfand, Richard A Silverman, et al. Calculus of variations. Courier Corporation, 2000.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 Hirschfeld (1935) Hermann O Hirschfeld. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 31, pages 520–524. Cambridge University Press, 1935.
 Kim et al. (2017) Hyeji Kim, Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Discovering potential correlations via hypercontractivity. In Advances in Neural Information Processing Systems, pages 4577–4587, 2017.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kolchinsky et al. (2019) Artemy Kolchinsky, Brendan D Tracey, and Steven Van Kuyk. Caveats for information bottleneck in deterministic scenarios. ICLR, 2019.
 Kraskov et al. (2004) Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lin and Tegmark (2016) Henry W Lin and Max Tegmark. Criticality in formal languages and statistical physics. arXiv preprint arXiv:1606.06737, 2016.
 Liu and Tao (2016) Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2016.
 Natarajan et al. (2013) Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
 Northcutt et al. (2017) Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936, 2017.

 Polyanskiy and Wu (2017) Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and Bayesian networks. In Convexity and Concentration, pages 211–249. Springer, 2017.
 Rényi (1959) Alfréd Rényi. On measures of dependence. Acta Mathematica Hungarica, 10(3–4):441–451, 1959.
 Rey and Roth (2012) Mélanie Rey and Volker Roth. Meta-Gaussian information bottleneck. In Advances in Neural Information Processing Systems, pages 1916–1924, 2012.
 Salimans et al. (2017) Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR, 2017.
 Shannon (1948) Claude Elwood Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27:379–423, 1948.
 Strouse and Schwab (2017a) DJ Strouse and David J Schwab. The deterministic information bottleneck. Neural computation, 29(6):1611–1630, 2017a.
 Strouse and Schwab (2017b) DJ Strouse and David J Schwab. The information bottleneck and geometric clustering. arXiv preprint arXiv:1712.09657, 2017b.
 Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 van den Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional Image Generation with PixelCNN Decoders. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4790–4798. Curran Associates, Inc., 2016.
 Xiao et al. (2015) Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.
 Zagoruyko and Komodakis (2016) S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv preprint arXiv:1605.07146, 2016.
Appendix A Preliminaries: first-order and second-order variations
Let the functional $J[f]$ be defined on some normed linear space $\mathcal{F}$. Let us add a perturbative function $h(x)$ to $f(x)$; the change in the functional can then be expanded as

$$\Delta J = J[f(x) + h(x)] - J[f(x)] = \varphi_1[h(x)] + \varphi_2[h(x)] + \epsilon\,\|h\|^2,$$

where $\|h\|$ denotes the norm of $h$ and $\epsilon \to 0$ as $\|h\| \to 0$. $\varphi_1[h(x)]$ is a linear functional of $h(x)$, and is called the first-order variation, denoted $\delta J[h(x)]$. $\varphi_2[h(x)]$ is a quadratic functional of $h(x)$, and is called the second-order variation, denoted $\delta^2 J[h(x)]$.
If $\delta J[h(x)] \equiv 0$, we call $f(x)$ a stationary solution for the functional $J[f]$.
If $\Delta J \geq 0$ for all $h(x)$ such that $f(x) + h(x)$ is in the neighborhood of $f(x)$, we call $f(x)$ a (local) minimum of $J[f]$.
Appendix B Proof of Theorem 1
Proof.
If $(X, Y)$ is IB$_\beta$-learnable, then there exists a $Z$ given by some $p(z|x)$ such that $\mathrm{IB}_\beta(X, Y; Z) < 0$, where $Z$ satisfies the Markov chain $Z$–$X$–$Y$. Since $g$ is an invertible map (if $X$ is a continuous variable, $g$ is additionally required to be continuous), and mutual information is invariant under such an invertible map (Kraskov et al. (2004)), we have $\mathrm{IB}_\beta(g(X), Y; Z) = \mathrm{IB}_\beta(X, Y; Z) < 0$, so $(g(X), Y)$ is IB$_\beta$-learnable. On the other hand, if $(X, Y)$ is not IB$_\beta$-learnable, then for all $Z$ satisfying the Markov chain $Z$–$X$–$Y$, we have $\mathrm{IB}_\beta(X, Y; Z) \geq 0$. Again using mutual information’s invariance under $g$, we have $\mathrm{IB}_\beta(g(X), Y; Z) \geq 0$ for all such $Z$, so $(g(X), Y)$ is not IB$_\beta$-learnable. Therefore, we have that $(X, Y)$ and $(g(X), Y)$ have the same IB$_\beta$-learnability.
∎
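The invariance used in this proof can be checked numerically for discrete variables, where an invertible map of X is simply a relabeling of its values. The joint table below is an arbitrary toy example of ours:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats from a discrete joint probability table p_xy[i, j]."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return float(np.sum(p_xy[m] * np.log(p_xy[m] / (p_x @ p_y)[m])))

# Toy joint distribution over X in {0, 1, 2} and Y in {0, 1}.
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.25]])

# For discrete X, an invertible map g is a permutation of X's values,
# i.e. a row permutation of the joint table.
g = [2, 0, 1]
mi_original = mutual_information(p_xy)
mi_mapped = mutual_information(p_xy[g])
```

Since relabeling X only reorders the rows, every term in the mutual-information sum is preserved, and the two values agree exactly.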
Appendix C Proof of Theorem 2
Proof.
At the trivial representation $p(z|x) = p(z)$, we have $I(X; Z) = 0$, and $I(Y; Z) = 0$ due to the Markov chain $Z$–$X$–$Y$, so $\mathrm{IB}_\beta(X, Y; Z) = 0$ for any $\beta$. Since $(X, Y)$ is IB$_\beta$-learnable, there exists a $Z$ given by a $p(z|x)$ such that $\mathrm{IB}_\beta(X, Y; Z) = I(X; Z) - \beta I(Y; Z) < 0$. Since $I(X; Z) \geq 0$, we have $\beta I(Y; Z) > I(X; Z) \geq 0$, and hence $I(Y; Z) > 0$.
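The first step of the proof, that the trivial representation carries no information, can also be verified numerically. This is a sketch with an assumed toy distribution: when the encoder ignores its input, the joint over (X, Z) factorizes and the mutual information vanishes.

```python
import numpy as np

def mi(p_ab):
    """I(A;B) in nats from a discrete joint probability table."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log(p_ab[m] / (pa @ pb)[m])))

p_x = np.array([0.4, 0.6])
p_z = np.array([0.3, 0.7])      # trivial encoder: p(z|x) = p(z) for every x
p_xz = np.outer(p_x, p_z)       # joint factorizes, so I(X;Z) = 0

# By the Markov chain Z–X–Y, I(Y;Z) = 0 as well, and therefore
# IB_beta = I(X;Z) - beta * I(Y;Z) = 0 for any beta.
trivial_mi = mi(p_xz)
```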