# Learnability for the Information Bottleneck

The Information Bottleneck (IB) method (tishby2000information) provides an insightful and principled approach to balancing compression and prediction for representation learning. The IB objective I(X;Z) − β I(Y;Z) employs a Lagrange multiplier β to tune this trade-off. However, in practice, not only is β chosen empirically without theoretical guidance, but there is also little theoretical understanding of the relationship between β, learnability, the intrinsic nature of the dataset, and model capacity. In this paper, we show that if β is improperly chosen, learning cannot happen: the trivial representation P(Z|X) = P(Z) becomes the global minimum of the IB objective. We show how this can be avoided by identifying a sharp phase transition between the unlearnable and the learnable regimes that arises as β is varied. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good β. We further show that IB-Learnability is determined by the largest confident, typical, and imbalanced subset of the examples (the conspicuous subset), and discuss its relation to model capacity. We give practical algorithms to estimate the minimum β for a given dataset. Finally, we empirically demonstrate our theoretical conditions with analyses of synthetic datasets, MNIST, and CIFAR10.


## 1 Introduction

Tishby et al. (2000) introduced the Information Bottleneck (IB) objective function, which learns a representation Z of observed variables (X, Y) that retains as little information about X as possible, but simultaneously captures as much information about Y as possible:

 min IB_β(X,Y;Z) = min [I(X;Z) − β I(Y;Z)]  (1)

I(⋅;⋅) is the mutual information. The hyperparameter β controls the trade-off between compression and prediction, in the same spirit as Rate-Distortion Theory (Shannon, 1948), but with a learned representation function P(Z|X) that automatically captures some part of the “semantically meaningful” information, where the semantics are determined by the observed relationship between X and Y.
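To make the objective concrete, here is a minimal numerical sketch of Eq. (1) for a discrete joint distribution (our own illustration; the function names are ours). Note that the trivial encoder p(z|x) = p(z) always achieves IB_β = 0, so the question of learnability studied below is whether some encoder achieves a negative value.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in nats for a joint distribution given as a 2-D array pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def ib_objective(pxy, pz_given_x, beta):
    """IB_beta = I(X;Z) - beta * I(Y;Z) for an encoder pz_given_x[x, z]."""
    px = pxy.sum(axis=1)
    pxz = px[:, None] * pz_given_x   # joint p(x, z)
    pyz = pxy.T @ pz_given_x         # joint p(y, z) via the Markov chain Z - X - Y
    return mutual_information(pxz) - beta * mutual_information(pyz)

# A binary joint with some label noise; the trivial encoder p(z|x) = p(z)
# yields IB_beta = 0 for every beta, while a non-trivial encoder can go below 0.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
trivial = np.tile([0.5, 0.5], (2, 1))
```

For example, with the identity encoder and a large β, the objective becomes negative, i.e., the non-trivial encoder beats the trivial one.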

The IB framework has been extended to and extensively studied in a variety of scenarios, including Gaussian variables (Chechik et al. (2005)), meta-Gaussians (Rey and Roth (2012)), continuous variables via variational methods (Alemi et al. (2016); Chalk et al. (2016); Fischer (2018)), deterministic scenarios (Strouse and Schwab (2017a); Kolchinsky et al. (2019)), geometric clustering (Strouse and Schwab (2017b)), and is used for learning invariant and disentangled representations in deep neural nets (Achille and Soatto (2018a, b)). However, a core issue remains: how should we set a good β? In the original work, the authors recommend sweeping β, which can be prohibitively expensive in practice, but also leaves open interesting theoretical questions around the relationship between β, learnability, and the observed data, P(X,Y).

This work begins to answer some of those questions by characterizing the onset of learning. Specifically:

• We show that an improperly chosen β may result in a failure to learn: the trivial representation P(Z|X) = P(Z) becomes the global minimum of the IB objective, even for β > 1 (Section 1.1).

• We introduce the concept of IB-Learnability, and show that when we vary β, the IB objective undergoes a phase transition from the inability to learn to the ability to learn (Section 3).

• Using the second-order variation, we derive sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good β (Section 4).

• We show that IB-Learnability is determined by the largest confident, typical, and imbalanced subset of the examples (the conspicuous subset), reveal its relationship with the slope of the Pareto frontier at the origin of the information plane, I(Y;Z) vs. I(X;Z), and discuss its relation to model capacity (Section 5).

• We additionally prove a deep relationship between IB-Learnability, the hypercontractivity coefficient, the contraction coefficient, and the maximum correlation (Section 5).

We also present an algorithm for estimating the onset of IB-Learnability and the conspicuous subset, and demonstrate that it does a good job of approximating both the theoretical predictions and the empirical results (Section 6). Finally, we use our main results to demonstrate on synthetic datasets, MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky and Hinton, 2009) that the theoretical prediction for IB-Learnability closely matches experiment (Section 7).

### 1.1 A Motivating Example

How can we choose a good β? To gain intuition, consider learning multiple Variational Information Bottleneck (VIB) representations (Alemi et al., 2016) of MNIST (LeCun et al., 1998) at different values of β. We select the digits 0 and 1 for binary classification, and add class-conditional noise (Angluin and Laird, 1988) to the labels with flip probability 0.2, which simulates a general scenario where the data may be noisy and the dependence of Y on X is not deterministic. The algorithm only sees the corrupted labels. Fig. 1 shows the converged accuracy on the true labels for the VIB models plotted against β. We see clearly that below a threshold value of β, no learning happens, and the accuracy is the same as random guessing. Beyond that threshold, there is a clear phase transition where the accuracy sharply increases, indicating the objective is able to learn a non-trivial representation. This kind of phase transition is typical in our experiments in Section 7. When the noise rate is high, the transition can happen at large β; i.e., we need a large “force” to extract the information in X relevant for predicting Y. In that case, an improperly chosen β in the unlearnable region will preclude learning a useful representation.
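The label corruption in this example is simple to reproduce. A sketch (our own; for the symmetric binary case where both classes share the same flip rate):

```python
import numpy as np

def corrupt_labels(y, flip_prob, rng):
    """Flip each binary label independently with probability flip_prob
    (class-conditional noise with equal flip rates for both classes)."""
    flips = rng.random(y.shape) < flip_prob
    return np.where(flips, 1 - y, y)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100_000)
y_noisy = corrupt_labels(y_true, 0.2, rng)  # what the learner actually sees
```

A learner is only shown `y_noisy`; whether IB can extract the remaining signal depends on β, as the rest of the paper quantifies.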

## 2 Related Work

The original IB work (Tishby et al., 2000) provides a tabular method for exactly computing the optimal encoder distribution for a given β and cardinality of the discrete representation, |Z|. Thus, the search for the desired model involves not only sweeping β, but also considering different representation dimensionalities. These restrictions were lifted somewhat by Chechik et al. (2005), which presents the Gaussian Information Bottleneck (GIB) for learning a multivariate Gaussian representation Z of X, assuming that both X and Y are multivariate Gaussians. They also note the presence of the trivial solution not only when β ≤ 1, but also at larger β depending on the eigenspectrum of the observed variables. However, the restriction to multivariate Gaussian datasets limits the generality of the analysis. Another analytic treatment of IB is given in Rey and Roth (2012), which reformulates the objective in terms of copula functions. As with the GIB approach, this formulation restricts the form of the data distributions: the copula functions for the joint distribution P(X,Y) are assumed to be known, which is unlikely in practice.

Strouse and Schwab (2017a) present the Deterministic Information Bottleneck (DIB), which minimizes the coding cost of the representation, H(Z), rather than the transmission cost, I(X;Z), as in IB. This approach learns hard clusterings with different code entropies that vary with β. In this case, it is clear that a hard clustering with minimal H(Z) will result in a single cluster for all of the data, which is the DIB trivial solution. No analysis is given beyond this fact to predict the actual onset of learnability, however.

The first amortized IB objective is the Variational Information Bottleneck (VIB) of Alemi et al. (2016). VIB replaces the exact, tabular approach of IB with variational approximations of the classifier distribution, q(y|z), and the marginal distribution, q(z). This approach cleanly permits learning a stochastic encoder, q(z|x), that is applicable to any x, rather than just the particular X seen at training time. The cost of this flexibility is the use of variational approximations that may be less expressive than the tabular method. Nevertheless, in practice, VIB learns easily and is simple to implement, so we rely on VIB models for our experimental confirmation.

Closely related to IB is the recently proposed Conditional Entropy Bottleneck (CEB) (Fischer, 2018). CEB attempts to explicitly learn the Minimum Necessary Information (MNI), defined as the point in the information plane where I(X;Z) = I(Y;Z) = I(X;Y). The MNI point may not be achievable even in principle for a particular dataset. However, the CEB objective provides an explicit estimate of how closely the model is approaching the MNI point, by observing that a necessary condition for reaching it is I(X;Z) = I(Y;Z). The CEB objective is equivalent to IB at β = 2, so our analysis of IB-Learnability applies equally to CEB.

Kolchinsky et al. (2019) present analytic and empirical results about trivial solutions in the particular setting where Y is a deterministic function of X in the observed sample. However, their use of the term “trivial solution” is distinct from ours. They refer to the observation that, in this setting, varying β yields only trivial interpolation between two different but valid solutions on the optimal frontier, rather than demonstrating a non-trivial trade-off between compression and prediction as expected when varying the IB Lagrangian. Our use of “trivial” refers to whether IB is capable of learning at all, given a certain dataset and value of β.

Achille and Soatto (2018b) apply the IB Lagrangian to the weights of a neural network, yielding InfoDropout. In Achille and Soatto (2018a), the authors give a deep and compelling analysis of how the IB Lagrangian can yield invariant and disentangled representations. They do not, however, consider the question of the onset of learning, although they are aware that not all models will learn a non-trivial representation. More recently, Achille et al. (2018) repurpose the InfoDropout IB Lagrangian as a Kolmogorov Structure Function to analyze the ease with which a previously-trained network can be fine-tuned for a new task. While that work is tangentially related to learnability, the question it addresses is substantially different from our investigation of the onset of learning.

Our work is also closely related to the hypercontractivity coefficient (Anantharam et al. (2013); Polyanskiy and Wu (2017)), defined as ξ(X;Y) = sup_{Z−X−Y} I(Y;Z)/I(X;Z), which by definition equals the inverse of β0, our IB-Learnability threshold. Anantharam et al. (2013) prove that the hypercontractivity coefficient equals the contraction coefficient η_KL, and Kim et al. (2017) propose a practical algorithm to estimate η_KL, which provides a measure of potential influence in the data. Although our goal is different, the sufficient conditions we provide for IB-Learnability also yield lower bounds on the hypercontractivity coefficient.

## 3 IB-Learnability

We are given instances of (x, y) drawn from a distribution with probability (density) p(x, y), where, unless otherwise stated, both X and Y can be discrete or continuous variables. (X, Y) is our training data, and may be characterized by different types of noise. The nature of this training data and the choice of β will be sufficient to predict the transition from unlearnable to learnable.

We can learn a representation Z of X with conditional probability111We use capital letters X, Y, Z for random variables and lowercase x, y, z to denote instances of variables, with P(⋅) and p(⋅) denoting their probability and probability density, respectively. p(z|x), such that X, Y, Z obey the Markov chain Z − X − Y. Eq. 1 above gives the IB objective with Lagrange multiplier β, IB_β(X,Y;Z), which is a functional of p(z|x): IB_β(X,Y;Z) = IB_β[p(z|x)]. The IB learning task is to find a conditional probability p(z|x) that minimizes IB_β(X,Y;Z). The larger β, the more the objective favors making a good prediction for Y. Conversely, the smaller β, the more the objective favors learning a concise representation.

How can we select a β such that the IB objective learns a useful representation? In practice, the selection of β is done empirically. Indeed, Tishby et al. (2000) recommend “sweeping β”. In this paper, we provide theoretical guidance for choosing β by introducing the concept of IB-Learnability and providing a series of sufficient conditions for it.

###### Definition 1.

(X, Y) is β-learnable if there exists a Z given by some p(z|x), such that IB_β(X,Y;Z) < IB_β(X,Y;Z_trivial), where Z_trivial characterizes the trivial representation with p(z|x) = p(z), i.e., Z is independent of X.

If (X, Y) is β-learnable, then when IB_β(X,Y;Z) is globally minimized, it will not learn a trivial representation. On the other hand, if (X, Y) is not β-learnable, then when IB_β(X,Y;Z) is globally minimized, it may learn a trivial representation.

##### Trivial solutions.

Definition 1 defines trivial solutions in terms of representations with I(X;Z) = I(Y;Z) = 0. Another type of trivial solution occurs when I(X;Z) > 0 but I(Y;Z) = 0. This type of trivial solution is not directly achievable by the IB objective, as I(X;Z) is minimized, but it can be achieved by construction or by chance. It is possible that starting learning from I(X;Z) > 0 could result in access to non-trivial solutions not available from I(X;Z) = 0. We do not attempt to investigate this type of trivial solution in this work.

##### Necessary condition for IB-Learnability.

From Definition 1, we can see that β-Learnability for any dataset requires β > 1. In fact, from the Markov chain Z − X − Y, we have I(Y;Z) ≤ I(X;Z) via the data-processing inequality. If β ≤ 1, then IB_β(X,Y;Z) = I(X;Z) − β I(Y;Z) ≥ (1 − β) I(Y;Z) ≥ 0 = IB_β(X,Y;Z_trivial), so no Z can beat the trivial representation. Hence (X, Y) is not β-learnable for β ≤ 1.

Due to the reparameterization invariance of mutual information, we have the following theorem for β-Learnability:

###### Theorem 1.

Let X′ = g(X), where g is an invertible map (if X is a continuous variable, g is additionally required to be continuous). Then (X, Y) and (X′, Y) have the same β-Learnability.

The proof of Theorem 1 is in Appendix B. Theorem 1 implies a favorable property for any condition for β-Learnability: the condition should be invariant to invertible mappings of X. We will inspect this invariance in the conditions we derive in the following sections.

## 4 Sufficient Conditions for IB-Learnability

Given (X, Y), how can we determine whether it is β-learnable? To answer this question, we derive a series of sufficient conditions for β-Learnability, starting from its definition. The conditions are in increasing order of practicality, while sacrificing as little generality as possible.

Firstly, Theorem 2 characterizes the shape of the β-Learnability range, with proof in Appendix C:

###### Theorem 2.

If (X, Y) is β1-learnable, then for any β2 > β1, (X, Y) is β2-learnable.

Based on Theorem 2, the range of β such that (X, Y) is β-learnable has the form β ∈ (β0, +∞). Thus, β0 is the threshold of IB-Learnability.

###### Lemma 2.1.

The trivial representation p(z|x) = p(z) is a stationary solution for IB_β(X,Y;Z).

The proof in Appendix F shows that both first-order variations δI(X;Z) and δI(Y;Z) vanish at the trivial representation p(z|x) = p(z), so δIB_β = 0 there.

Lemma 2.1 yields our strategy for finding sufficient conditions for learnability: find conditions such that p(z|x) = p(z) is not a local minimum of the functional IB_β[p(z|x)]. Based on the necessary condition for a minimum (Appendix D), we have the following theorem222The theorems in this paper deal with learnability w.r.t. the true mutual information. If parameterized models are used to approximate the mutual information, the limitation of model capacity will translate into more uncertainty of Y given X, viewed through the lens of the model.:

###### Theorem 3 (Suff. Cond. 1).

A sufficient condition for (X, Y) to be β-learnable is that there exists a perturbation function h(z|x)333So that the perturbed probability (density) is p′(z|x) = p(z|x) + ε·h(z|x). Also, for integrals, whenever a variable is discrete, we can simply replace the integral ∫dz by the summation Σz. with ∫h(z|x)dz = 0, such that the second-order variation δ²IB_β[p(z|x)] < 0 at the trivial representation p(z|x) = p(z).

The proof of Theorem 3 is given in Appendix D. Intuitively, if δ²IB_β[p(z|x)] < 0, we can always find a p′(z|x) in the neighborhood of the trivial representation p(z|x) = p(z) such that IB_β[p′(z|x)] < IB_β[p(z)], thus satisfying the definition of β-Learnability.

To make Theorem 3 more practical, we perturb p(z|x) around the trivial solution p(z|x) = p(z), and expand IB_β to second order in the perturbation parameter ε. We can then prove Theorem 4:

###### Theorem 4 (Suff. Cond. 2).

A sufficient condition for (X, Y) to be β-learnable is that X and Y are not independent, and

 β > inf_{h(x)} β0[h(x)]  (2)

where the functional β0[h(x)] is given by

 β0[h(x)] = (E_{x∼p(x)}[h(x)²] − E_{x∼p(x)}[h(x)]²) / (E_{y∼p(y)}[E_{x∼p(x|y)}[h(x)]²] − E_{x∼p(x)}[h(x)]²)

Moreover, we have that (inf_{h(x)} β0[h(x)])⁻¹ is a lower bound on the slope of the Pareto frontier in the information plane, I(Y;Z) vs. I(X;Z), at the origin.

The proof is given in Appendix G, which also shows that if the condition of Theorem 4 is satisfied, we can construct a perturbation function h(z|x) with h(z|x) = h(x)f(z), for some f(z), such that h(z|x) satisfies Theorem 3. It also shows that the converse is true: if there exists an h(z|x) such that the condition in Theorem 3 is true, then the condition in Theorem 4 is satisfied444We do not claim that any h(z|x) satisfying Theorem 3 can be decomposed into h(x)f(z) at the onset of learning. But from the equivalence of Theorems 3 and 4 as explained above, when there exists an h(z|x) such that Theorem 3 is satisfied, we can always construct an h(z|x) of this form that also satisfies Theorem 3.; i.e., the conditions in Theorems 3 and 4 are equivalent. Moreover, letting h*(x) denote the optimal perturbation function at the trivial solution, we have

 p_β(y|x) = p(y) + ε² C_z (h*(x) − h̄*) ⋅ ∫ p(x′, y)(h*(x′) − h̄*) dx′  (3)

where p_β(y|x) is the p(y|x) estimated by IB at a given β, h̄* = E_{x∼p(x)}[h*(x)], and C_z is a constant. This shows how the p(y|x) estimated by IB explicitly depends on h*(x) at the onset of learning. The proof is provided in Appendix H.

Theorem 4 suggests a method to estimate β0: we can parameterize h(x), e.g. by a neural network, with the objective of minimizing β0[h(x)]. At its minimum, β0[h(x)] provides an upper bound for β0, and h(x) provides a soft clustering of the examples, corresponding to a nontrivial perturbation of p(z|x) at the trivial solution that minimizes IB_β.
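As a small sanity check of this estimation strategy, the sketch below (our own construction; the function names and the random-search stand-in for a neural network are ours) evaluates the discrete form of the functional as we read it, Var(h(X)) / Var(E[h(X)|Y]), and minimizes it over h. For binary X and Y related by symmetric label noise with flip rate 0.2, the minimum matches 1/(1 − 2·0.2)² ≈ 2.78, the value one can also derive from Corollary 5.1.

```python
import numpy as np

def beta0_of_h(h, pxy):
    """beta0[h] = Var_x(h) / Var_y(E[h(X) | Y]) for a discrete joint pxy[x, y]."""
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    mean_h = h @ px
    var_h = h**2 @ px - mean_h**2
    h_given_y = (h @ pxy) / py          # E[h(X) | Y = y]
    var_hy = h_given_y**2 @ py - mean_h**2
    return var_h / var_hy

def estimate_beta0(pxy, n_restarts=200, seed=0):
    """Crude stand-in for the neural-net parameterization: random search
    over h vectors; each trial gives an upper bound beta0[h] >= beta0."""
    rng = np.random.default_rng(seed)
    return min(beta0_of_h(rng.standard_normal(pxy.shape[0]), pxy)
               for _ in range(n_restarts))

# Binary X, Y related by symmetric label noise with flip rate 0.2:
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
```

Because the ratio is invariant to shifting and rescaling h, every non-constant h on a binary X already attains the infimum here; richer X spaces are where the parameterized search earns its keep.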

Alternatively, based on the properties of β0[h(x)], we can use a specific functional form for h(x) in Eq. (2) and obtain a stronger sufficient condition for β-Learnability. But we want the resulting threshold to be as near to the infimum as possible. To do this, we note the following characteristics of the R.H.S. of Eq. (2):

• We can set h(x) to be nonzero if x ∈ Ωx for some region Ωx ⊂ X, and 0 otherwise. Then we obtain the following sufficient condition:

 β > inf_{h(x), Ωx⊂X} ( E_{x∼p(x), x∈Ωx}[h(x)²] / (E_{x∼p(x), x∈Ωx}[h(x)])² − 1 ) / ( ∫dy p(y) ( E_{x∼p(x), x∈Ωx}[p(y|x)h(x)] / (p(y) E_{x∼p(x), x∈Ωx}[h(x)]) )² − 1 )  (4)
• The numerator of the R.H.S. of Eq. (4) attains its minimum when h(x) is a constant within Ωx. This can be proved using the Cauchy–Schwarz inequality ⟨h, g⟩² ≤ ⟨h, h⟩⟨g, g⟩, setting g(x) ≡ 1 and defining the inner product as ⟨h, g⟩ = ∫_{Ωx} h(x)g(x)p(x)dx. Therefore, the numerator of the R.H.S. of Eq. (4) is at least 1/p(Ωx) − 1, and attains equality when h(x) is constant within Ωx.

Based on these observations, we can let h(x) be a nonzero constant inside some region Ωx and 0 otherwise, so the infimum over an arbitrary function h(x) simplifies to an infimum over Ωx, and we obtain a sufficient condition for β-Learnability, which is a key result of this paper:

###### Theorem 5 (Conspicuous Subset Suff. Cond.).

A sufficient condition for (X, Y) to be β-learnable is that X and Y are not independent, and

 β > inf_{Ωx⊂X} β0(Ωx)  (5)

where

 β0(Ωx) = (1/p(Ωx) − 1) / E_{y∼p(y|Ωx)}[p(y|Ωx)/p(y) − 1]

Here, x ∈ Ωx denotes the event that x lies in the subset Ωx ⊂ X, which has probability p(Ωx).

(inf_{Ωx⊂X} β0(Ωx))⁻¹ gives a lower bound on the slope of the Pareto frontier in the information plane, I(Y;Z) vs. I(X;Z), at the origin.

The proof is given in Appendix I. In the proof we also show that this condition is invariant to invertible mappings of X.
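For a discrete joint distribution, the infimum in Eq. (5) can be brute-forced over all non-empty proper subsets of X. The sketch below is our own illustration (names are ours):

```python
import numpy as np
from itertools import combinations

def beta0_of_subset(pxy, omega):
    """beta0(Omega_x) from Theorem 5, for a discrete joint pxy[x, y]
    and a tuple of x-indices omega."""
    idx = list(omega)
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    p_omega = px[idx].sum()
    py_omega = pxy[idx, :].sum(axis=0) / p_omega   # p(y | Omega_x)
    numerator = 1.0 / p_omega - 1.0
    denominator = (py_omega * (py_omega / py - 1.0)).sum()
    return numerator / denominator

def min_beta0(pxy):
    """Brute-force the infimum over all non-empty proper subsets of X."""
    n = pxy.shape[0]
    return min(beta0_of_subset(pxy, c)
               for r in range(1, n)
               for c in combinations(range(n), r))
```

For the binary symmetric-noise joint used earlier (flip rate 0.2), the minimizing subset is one of the two clusters and β0(Ωx) = 1/0.36 ≈ 2.78; for a noiseless deterministic joint, the minimum is 1, anticipating Corollary 5.2.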

## 5 Discussion

##### The conspicuous subset determines β0.

From Eq. (5), we see that three characteristics of the subset Ωx ⊂ X lead to a low β0(Ωx): (1) confidence: p(y|Ωx) is large for some class y; (2) typicality and size: the number of elements in Ωx is large, or the elements in Ωx are typical, leading to a large probability p(Ωx); (3) imbalance: p(y) is small for the class that dominates Ωx, but large for its complement. In summary, β0 will be determined by the largest confident, typical, and imbalanced subset of examples, or an equilibrium of those characteristics. We term the Ωx achieving the minimization the conspicuous subset.

##### Multiple phase transitions.

Based on this characterization of the conspicuous subset, we can hypothesize datasets with multiple learnability phase transitions. Specifically, consider a conspicuous subset Ωx that is small but “typical”, consists of all elements confidently predicted as some class y* by p(y|x), and where y* is the least common class. By construction, this Ωx will dominate the infimum in Eq. (5), resulting in a small value of β0. However, the remaining examples effectively form a new dataset, (X1, Y1). At exactly β0, we may have that the current encoder has no mutual information with the remaining classes; i.e., I(Y1;Z) = 0. In this case, Definition 1 applies to (X1, Y1). We might expect to see that, above β0, learning will plateau until we reach some β1 that defines the phase transition for (X1, Y1). Clearly this process could repeat many times, with each new dataset (Xi, Yi) being distinctly more difficult to learn than (Xi−1, Yi−1).

##### Similarity to information measures.

The denominator of β0(Ωx) in Eq. (5) is closely related to mutual information. Using the inequality x − 1 ≥ log x for x > 0, it becomes:

 E_{y∼p(y|Ωx)}[p(y|Ωx)/p(y) − 1] ≥ E_{y∼p(y|Ωx)}[log(p(y|Ωx)/p(y))] = Ĩ(Ωx;Y)

where Ĩ(Ωx;Y) is the mutual information “density” at x ∈ Ωx. Of course, this quantity is also the KL divergence D_KL(p(y|Ωx) ∥ p(y)), so we know that the denominator of Eq. (5) is non-negative. Incidentally, the denominator of Eq. (5) is the density of “rational mutual information” (Lin and Tegmark (2016)) at Ωx.

Similarly, the numerator of β0(Ωx) is related to the self-information of Ωx:

 1/p(Ωx) − 1 ≥ log(1/p(Ωx)) = −log p(Ωx) = h(Ωx)

so we can estimate the phase transition as:

 β0 ⪆ inf_{Ωx⊂X} h(Ωx) / Ĩ(Ωx;Y)  (6)

Since Eq. (6) replaces both the numerator and the denominator by lower bounds, it does not give a bound on β0.

##### Estimating model capacity.

The observation that a model cannot distinguish between cluster overlap in the data and its own lack of capacity suggests an interesting way to use IB-Learnability to measure the capacity of a set of models relative to the task they are being used to solve.

##### Learnability and the Information Plane.

Many of our results can be interpreted in terms of the geometry of the Pareto frontier illustrated in Fig. 2, which describes the trade-off between increasing I(Y;Z) and decreasing I(X;Z). At any point on this frontier that minimizes IB_β, the frontier will have slope 1/β if it is differentiable. If the frontier is also concave (has negative second derivative), then this slope will take its maximum at the origin, which implies β-Learnability for any β exceeding the inverse of that slope; the threshold β0 for IB-Learnability is then simply the inverse slope of the frontier at the origin. More generally, as long as the Pareto frontier is differentiable, the threshold for IB-Learnability is the inverse of its maximum slope. Indeed, Theorem 4 and Theorem 5 give lower bounds on the slope of the Pareto frontier at the origin.

##### IB-Learnability, hypercontractivity, and maximum correlation.

IB-Learnability and the sufficient conditions we provide harbor a deep connection with hypercontractivity and maximum correlation:

 1/β0 = ξ(X;Y) = η_KL ≥ sup_{h(x)} 1/β0[h(x)] = ρm²(X;Y)  (7)

which we prove in Appendix K. Here ρm(X;Y) = max_{f,g} E[f(X)g(Y)] s.t. E[f(X)] = E[g(Y)] = 0 and E[f²(X)] = E[g²(Y)] = 1 is the maximum correlation (Hirschfeld, 1935; Gebelein, 1941), ξ(X;Y) is the hypercontractivity coefficient, and η_KL is the contraction coefficient. Our proof relies on Anantharam et al. (2013)'s proof that ξ(X;Y) = η_KL. Our work reveals the deep relationship between IB-Learnability and these earlier concepts, and provides additional insight about what aspects of a dataset give rise to high maximum correlation and hypercontractivity: the most confident, typical, imbalanced subset of the examples.
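For discrete variables, the maximum correlation appearing in Eq. (7) has a classical characterization: it is the second-largest singular value of the matrix Q[x, y] = p(x, y)/√(p(x)p(y)) (the largest singular value is always 1). This gives a quick way to compute the bound ρm²(X;Y) ≤ 1/β0; the sketch below is our own illustration.

```python
import numpy as np

def max_correlation(pxy):
    """rho_m(X;Y) for a discrete joint distribution, via the SVD of
    Q[x, y] = p(x, y) / sqrt(p(x) * p(y)).  The top singular value of Q
    is always 1; the second one is the maximum correlation."""
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    q = pxy / np.sqrt(np.outer(px, py))
    s = np.linalg.svd(q, compute_uv=False)   # singular values, descending
    return s[1]
```

For the doubly symmetric binary example above with flip rate 0.2, ρm = 1 − 2·0.2 = 0.6, so the bound gives β0 ≤ 1/0.36 ≈ 2.78, matching the subset-based estimates from the previous sections.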

## 6 Estimating the IB-Learnability Condition

Theorem 5 not only reveals the relationship between the learnability threshold β0 and the least noisy region of P(X,Y), but also provides a way to practically estimate β0, both in the general classification case and in more structured settings.

### 6.1 Estimation Algorithm

Based on Theorem 5, for general classification tasks we suggest Algorithm 1 to empirically estimate an upper bound on β0, as well as to discover the conspicuous subset that determines it.

We approximate the probability p(x) of each example by its empirical probability p̃(x). E.g., for MNIST, p̃(x) = 1/N, where N is the number of examples in the dataset. The algorithm starts by learning a maximum likelihood model of p(y|x), using e.g. feed-forward neural networks. It then constructs a matrix storing the estimated p(y|x) and a vector storing p̃(x) for all the examples in the dataset. To find the subset Ωx whose β0(Ωx) is as small as possible, by the previous analysis we want a conspicuous subset whose p(y|Ωx) is large for a certain class (to make the denominator of Eq. (5) large), while containing as many elements as possible (to make the numerator small).

We suggest the following heuristic to discover such a conspicuous subset. For each class c, we sort the rows of the p(y|x) matrix by the predicted probability of the pivot class c in decreasing order, and then search over subsets consisting of the top-ranked examples. Since β0(Ωx) is large when Ωx contains too few or too many elements, the minimum for class c will typically be reached at some intermediate-sized subset, and we can use binary search or another discrete search algorithm for the optimization. The search stops when the estimate does not improve by more than a tolerance ε. The algorithm then returns the minimum estimate over all classes c, as well as the conspicuous subset that determines it.
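The search just described can be sketched in a few lines (our own simplified reading of the procedure: we scan over all top-k subset sizes instead of using binary search, and assume a uniform empirical p̃(x) = 1/N):

```python
import numpy as np

def beta0_upper_bound(py_given_x, py):
    """For each pivot class, rank examples by the estimated p(c|x) and scan
    top-k subsets, tracking the smallest beta0(Omega) from Theorem 5.
    Assumes uniform empirical p(x) = 1/N over the N rows of py_given_x."""
    n, n_classes = py_given_x.shape
    best_beta0, best_subset = np.inf, None
    for c in range(n_classes):
        order = np.argsort(-py_given_x[:, c])
        csum = np.cumsum(py_given_x[order], axis=0)   # csum[k-1]: sum over top-k
        for k in range(1, n):                          # proper subsets only
            p_omega = k / n
            py_omega = csum[k - 1] / k                 # p(y | Omega) for top-k
            denom = (py_omega * (py_omega / py - 1.0)).sum()
            if denom <= 0:
                continue
            b = (1.0 / p_omega - 1.0) / denom
            if b < best_beta0:
                best_beta0, best_subset = b, order[:k]
    return best_beta0, best_subset
```

On a toy dataset of 100 examples whose p(y|x) is either (0.8, 0.2) or (0.2, 0.8), the scan recovers one 50-example cluster as the conspicuous subset and returns an upper bound of 1/0.36 ≈ 2.78.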

After estimating β0, we can use it for learning with IB, either directly, or as an anchor for a region where we can perform a much smaller sweep of β than we otherwise would have. This may be particularly important for very noisy datasets, where β0 can be very large.

### 6.2 Special Cases for Estimating β0

Theorem 5 may still be challenging to apply, due to the difficulty of making accurate estimates of p(y|x) and of searching over subsets Ωx ⊂ X. However, if the learning problem is more structured, we may be able to obtain a simpler formula for the sufficient condition.

##### Class-conditional label noise.

Classification with noisy labels is a common practical scenario. An important noise model is that the labels are randomly flipped with some hidden class-conditional probabilities, and we only observe the corrupted labels. This problem has been studied extensively (Angluin and Laird, 1988; Natarajan et al., 2013; Liu and Tao, 2016; Xiao et al., 2015; Northcutt et al., 2017). If IB is applied to this scenario, how large a β do we need? The following corollary provides a simple formula.

###### Corollary 5.1.

Suppose that the true class labels are y*, and the regions of the input space belonging to different y* have no overlap. We only observe the corrupted labels y, with class-conditional noise p(y|y*, x) = p(y|y*), and Y is not independent of X. Then a sufficient condition for β-Learnability is:

 β > inf_{y*} (1/p(y*) − 1) / (Σ_y p(y|y*)²/p(y) − 1)  (8)

We see that under class-conditional noise, the sufficient condition reduces to a discrete formula that depends only on the noise rates p(y|y*) and the true class probabilities p(y*), which can be accurately estimated via e.g. Northcutt et al. (2017). Additionally, if we know that the noise is class-conditional, but the observed β0 is greater than the R.H.S. of Eq. (8), we can deduce that there is overlap between the true classes. The proof of Corollary 5.1 is provided in Appendix J.
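Corollary 5.1's bound is cheap to evaluate once the noise rates are known or estimated. A sketch (our own; `noise_matrix[i, j]` denotes p(y = j | y* = i)):

```python
import numpy as np

def beta0_class_conditional(noise_matrix, p_true):
    """Evaluate the R.H.S. of Eq. (8) from the class-conditional noise
    rates p(y|y*) and the true class probabilities p(y*)."""
    p_obs = p_true @ noise_matrix                  # observed label marginal p(y)
    best = np.inf
    for i, p_star in enumerate(p_true):
        numerator = 1.0 / p_star - 1.0
        denominator = (noise_matrix[i] ** 2 / p_obs).sum() - 1.0
        best = min(best, numerator / denominator)
    return best
```

For symmetric binary noise with flip rate ρ and balanced classes, the bound works out to 1/(1 − 2ρ)²; with noiseless labels (an identity noise matrix) it collapses to 1, anticipating Corollary 5.2 below.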

##### Deterministic relationships.

Theorem 5 also reveals that β0 relates closely to whether Y is a deterministic function of X, as shown by Corollary 5.2:

###### Corollary 5.2.

Assume that Y contains at least one value y such that its probability p(y) > 0. If Y is a deterministic function of X and is not independent of X, then a sufficient condition for β-Learnability is β > 1.

The assumption in Corollary 5.2 is satisfied by classification, and by certain regression problems. Combined with the necessary condition β > 1 for any dataset to be β-learnable (Section 3), we have that, under the assumption, if Y is a deterministic function of X, then a necessary and sufficient condition for β-Learnability is β > 1; i.e., its β0 is 1. The proof of Corollary 5.2 is provided in Appendix J.

Therefore, in practice, if we find that β0 > 1, we may infer that Y is not a deterministic function of X. For a classification task, we may infer that either some classes overlap, or the labels are noisy. However, recall that finite models may add effective class overlap if they have insufficient capacity for the learning task, as mentioned in Section 4. This may translate into a higher observed β0, even when learning deterministic functions.

## 7 Experiments

To test how well the theoretical conditions for β-Learnability match experiment, we apply them to synthetic data with varying noise rates and class overlap, to MNIST binary classification with varying noise rates, and to CIFAR10 classification, comparing with the β0 found experimentally. We also compare with the algorithm in Kim et al. (2017) for estimating the hypercontractivity coefficient ξ (= 1/β0) via the contraction coefficient η_KL. Experiment details are in Appendix L.

### 7.1 Synthetic Dataset Experiments

We construct a set of datasets from 2D mixtures of 2 Gaussians as X, with the identity of the mixture component as Y. We simulate two practical scenarios with these datasets: (1) noisy labels with class-conditional noise, and (2) class overlap. For (1), we vary the class-conditional noise rates. For (2), we vary class overlap by tuning the distance between the Gaussians. For each experiment, we sweep β with exponential steps, and observe I(X;Z) and I(Y;Z). We then compare the empirical β0, indicated by the onset of above-zero I(Y;Z), with the predicted values for β0.

##### Classification with class-conditional noise.

In this experiment, X is drawn from a mixture of Gaussians with 2 components, each a 2D Gaussian with diagonal covariance. The two components have distance 16 between their means (hence virtually no overlap) and equal mixture weights. For each x, the label y is the identity of the component it belongs to. We create multiple datasets by randomly flipping the labels with a range of noise rates. For each dataset, we train VIB models across a range of β, and observe the onset of learning via accuracy exceeding random guessing (Observed). To test how different methods perform in estimating β0, we apply the following: (1) Corollary 5.1, since this is classification with class-conditional noise and the two true classes have virtually no overlap; (2) Alg. 1 with the true p(y|x); (3) the algorithm in Kim et al. (2017) that estimates η_KL, provided with the true p(y|x); (4) inf_{h(x)} β0[h(x)] in Eq. (2); (5) Alg. 1 with p(y|x) estimated by a neural net; (6) the η_KL estimate with the same estimated p(y|x) as in (5). The results are shown in Fig. 3 and in Appendix L.1.

From Fig. 3 we see the following. (A) When using the true p(y|x), both Alg. 1 and the η_KL estimate generally upper bound the empirical β0, and Alg. 1 is generally tighter. (B) When using the true p(y|x), Alg. 1 and Corollary 5.1 give the same result. (C) Comparing Alg. 1 and the η_KL estimate, both using the same empirically estimated p(y|x), both approaches provide good estimates in the low-noise region; however, in the high-noise region, Alg. 1 gives more precise values, indicating that it is more robust to the estimation error of p(y|x). (D) inf_{h(x)} β0[h(x)] from Eq. (2) empirically upper bounds the experimentally observed β0, and gives almost the same result as the theoretical estimates from Corollary 5.1 and Alg. 1 with the true p(y|x). In the classification setting, this approach doesn't require any learned estimate of p(y|x), as we can directly use the empirical p(x) and p(y|x) from SGD mini-batches.

This experiment also shows that for datasets where the signal-to-noise ratio is small, β0 can be very high. Instead of blindly sweeping β, our result can provide guidance for setting β so that learning can happen.

##### Classification with class overlap.

In this experiment, we test how different amounts of overlap between classes influence β₀. We again use a mixture of Gaussians with two components, each a 2D Gaussian with diagonal covariance matrix. The two components have mixture weights 0.6 and 0.4. We vary the distance between the Gaussians from 8.0 down to 0.8 and observe the resulting β₀. Since we do not add noise to the labels, with no overlap and a deterministic map from x to y we would have β₀ = 1 by Corollary 5.2. The more the two classes overlap, the more uncertain y is given x, so by Eq. (5) we expect β₀ to be larger, which is corroborated by Fig. 4.
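The claim that overlap increases the uncertainty of y given x can be checked numerically by computing the conditional entropy H(Y|X) for a 1D analogue of this mixture (a sketch with assumed unit variances; only the mixture weights 0.6/0.4 come from the text):

```python
import numpy as np

def conditional_entropy(distance, weights=(0.6, 0.4)):
    """Numerically estimate H(Y|X) in bits for two unit-variance 1D Gaussians."""
    grid = np.linspace(-20.0, 20.0, 4001)
    dx = grid[1] - grid[0]
    mus = (-distance / 2, distance / 2)
    dens = np.stack([w * np.exp(-0.5 * (grid - mu) ** 2) / np.sqrt(2 * np.pi)
                     for w, mu in zip(weights, mus)])
    px = dens.sum(axis=0)                       # marginal p(x) on the grid
    post = dens / px                            # posterior P(y | x)
    h = -np.sum(post * np.log2(np.clip(post, 1e-300, 1.0)), axis=0)
    return float(np.sum(px * h) * dx)           # E_x[H(Y | X=x)]

# More overlap between the components means more uncertainty in y given x.
assert conditional_entropy(8.0) < conditional_entropy(0.8)
```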

### 7.2 MNIST Experiments

We perform binary classification with digits 0 and 1 and, as before, add class-conditional noise to the labels with varying noise rates. To explore how model capacity influences the onset of learning, for each dataset we train two sets of VIB models that differ only in the number of neurons in the hidden layers of the encoder: one set with a larger hidden layer and one with a smaller hidden layer. As we describe in Section 4, insufficient capacity results in more uncertainty of y given x from the point of view of the model, so we expect the observed β₀ for the lower-capacity model to be larger. This is confirmed by the experiment (Fig. 5). In Fig. 5 we also plot β₀ as given by the different estimation methods, and see that observations (A), (B), (C) and (D) from Section 7.1 still hold.

### 7.3 MNIST Experiments using Equation 2

To see what IB learns at the onset of learning for the full MNIST dataset, we optimize Eq. (2) w.r.t. the full MNIST dataset and visualize the resulting clustering of digits. Eq. (2) can be optimized with SGD using any differentiable parameterized mapping h(x). Here we parameterize h(x) with a PixelCNN++ architecture (van den Oord et al., 2016; Salimans et al., 2017), as PixelCNN++ is a powerful autoregressive model for images that produces a scalar output (normally interpreted as log p(x)). Eq. (2) should generally give two clusters in the output space, as discussed in Section 4. In this setup, smaller values of h(x) correspond to the subset of the data that is easiest to learn. Fig. 6 shows two strongly separated clusters, as well as the threshold we choose to divide them. Fig. 8 shows the first 5,776 MNIST training examples sorted by our learned h(x), with the examples above the threshold highlighted in red. Our learned h(x) has clearly separated the "easy" 1 digits from the rest of the MNIST training set.
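Choosing a threshold between two well-separated clusters of scalar outputs, as in Fig. 6, can be done with a simple 1D two-means split (a sketch; the paper does not specify its threshold rule, so this particular procedure is our own illustration):

```python
import numpy as np

def two_means_threshold(values, iters=100):
    """Split 1D values into two clusters; return the midpoint of the two means."""
    v = np.asarray(values, dtype=float)
    c0, c1 = v.min(), v.max()                  # initialize centers at the extremes
    for _ in range(iters):
        in_c1 = np.abs(v - c1) < np.abs(v - c0)
        c0, c1 = v[~in_c1].mean(), v[in_c1].mean()
    return (c0 + c1) / 2.0

# Toy example: two well-separated groups of scalar outputs.
rng = np.random.default_rng(0)
vals = np.concatenate([rng.normal(-5.0, 0.5, 100),   # "easy" cluster
                       rng.normal(5.0, 0.5, 100)])   # the rest
threshold = two_means_threshold(vals)
```

Examples falling below the threshold would be flagged as the easy-to-learn subset.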

### 7.4 CIFAR10 Forgetting Experiments

For CIFAR10 (Krizhevsky and Hinton, 2009), we study how forgetting varies with β: given a VIB model trained at some high β, if we anneal β down to a much lower value, what does the model converge to? Using Alg. 1, we estimated β₀ on a version of CIFAR10 with 20% label noise, where p(y|x) is estimated by maximum-likelihood training with the same encoder and classifier architectures as used for VIB. For the VIB models, the lowest β with performance above chance closely matched the estimate from Alg. 1. See Appendix L.2 for details.
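The annealing protocol can be sketched as a geometric schedule from the high β down to the low target, warm-starting each training phase from the previous phase's weights (the endpoints and step count below are hypothetical; the text does not specify the schedule used):

```python
import numpy as np

def geometric_anneal(beta_hi, beta_lo, steps):
    """Geometric schedule from beta_hi down to beta_lo, inclusive."""
    return beta_hi * (beta_lo / beta_hi) ** np.linspace(0.0, 1.0, steps)

# Hypothetical endpoints: start training at beta=100, anneal to beta=1
# over 10 phases; the model's solution at low beta depends on this path.
schedule = geometric_anneal(100.0, 1.0, 10)
```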

## 8 Conclusion

In this paper, we have presented theoretical results for predicting the onset of learning, and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset, and showed that those predictions are accurate, even in cases of extreme label noise. We believe these results will provide theoretical and practical guidance for choosing β in the IB framework for balancing prediction and compression. Our work also raises other questions, such as whether there are other phase transitions in learnability that might be identified. We hope to address some of those questions in future work.

#### Acknowledgements

Tailin Wu’s work was supported by The Casey and Family Foundation, the Foundational Questions Institute and the Rothberg Family Fund for Cognitive Science. He thanks the Center for Brains, Minds, and Machines (CBMM) for hospitality.

## Appendix A Preliminaries: first-order and second-order variations

Let the functional F[f(x)] be defined on some normed linear space. Adding a perturbative function ϵ⋅h(x) to f(x), the change in the functional can be expanded as

 ΔF[f(x)] = F[f(x)+ϵ⋅h(x)] − F[f(x)] = φ₁[f(x)] + φ₂[f(x)] + O(ϵ³||h||²)

where ||h|| denotes the norm of h(x); φ₁[f(x)] is a linear functional of ϵ⋅h(x), called the first-order variation and denoted δF[f(x)]; φ₂[f(x)] is a quadratic functional of ϵ⋅h(x), called the second-order variation and denoted δ²F[f(x)].

If δF[f(x)] = 0, we call f(x) a stationary solution of the functional F.

If ΔF[f(x)] ≥ 0 for all h(x) such that f(x)+ϵ⋅h(x) is in a neighborhood of f(x), we call f(x) a (local) minimum of F.
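These definitions can be verified numerically on a simple quadratic functional F[f] = ∫ f(x)² dx, for which the expansion terminates exactly at second order (a self-contained sketch; the grid and test functions are arbitrary choices):

```python
import numpy as np

# Discretize functions on [0, 1].
x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
f = np.sin(2 * np.pi * x)       # base function f(x)
h = np.cos(3 * np.pi * x)       # perturbation direction h(x)
eps = 1e-3

F = lambda g: np.sum(g ** 2) * dx               # F[f] = ∫ f(x)² dx
delta_F = F(f + eps * h) - F(f)                 # ΔF[f(x)]
phi1 = np.sum(2 * f * (eps * h)) * dx           # linear in ϵ·h: first-order variation
phi2 = np.sum((eps * h) ** 2) * dx              # quadratic in ϵ·h: second-order variation

# For this quadratic functional there is no higher-order remainder.
assert abs(delta_F - (phi1 + phi2)) < 1e-12
```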

## Appendix B Proof of Theorem 1

###### Proof.

If (X, Y) is β-learnable, then there exists a Z given by some P(Z|X) such that IB_β(X, Y; Z) ≡ I(X;Z) − βI(Y;Z) < IB_β(X, Y; Z_trivial), where the trivial representation Z_trivial satisfies P(Z_trivial|X) = P(Z_trivial). Since g is an invertible map (if X is a continuous variable, g is additionally required to be continuous), and mutual information is invariant under such an invertible map (Kraskov et al., 2004), we have IB_β(g(X), Y; Z) = IB_β(X, Y; Z), so (g(X), Y) is β-learnable. On the other hand, if (X, Y) is not β-learnable, then for every Z given by any P(Z|X) we have IB_β(X, Y; Z) ≥ IB_β(X, Y; Z_trivial). Again using the invariance of mutual information under g, we have IB_β(g(X), Y; Z) ≥ IB_β(g(X), Y; Z_trivial) for all such Z, so (g(X), Y) is not β-learnable. Therefore, (X, Y) and (g(X), Y) have the same β-learnability.
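The key fact used in the proof, that mutual information is invariant under an invertible map of X, is easy to check numerically in the discrete case, where g reduces to a permutation of X's support (a minimal sketch):

```python
import numpy as np

def mutual_info(joint):
    """I(X;Z) in nats from a joint probability table P(x, z)."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    pz = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * pz)[mask])))

# A joint P(x, z), and the same joint after relabeling X by an invertible
# map g (a permutation of X's support in the discrete case).
joint = np.array([[0.20, 0.10],
                  [0.05, 0.25],
                  [0.30, 0.10]])
joint_g = joint[[2, 0, 1]]          # rows permuted by g

assert np.isclose(mutual_info(joint), mutual_info(joint_g))
```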

## Appendix C Proof of Theorem 2

###### Proof.

At the trivial representation P(Z|X) = P(Z), we have I(X;Z) = 0, and I(Y;Z) = 0 due to the Markov chain Z – X – Y, so IB_β(X, Y; Z) = 0 for any β. Since (X, Y) is β₁-learnable, there exists a Z given by a P(Z|X) such that I(X;Z) − β₁I(Y;Z) < 0. Since β₂ > β₁, and I(Y;Z) ≥ 0, we have