Efforts to understand the empirical success of deep learning have followed two main lines: Representation learning and optimization. In optimization, a deep network is treated as a black-box family of functions for which we want to find parameters (weights) that yield good generalization. Aside from the difficulties due to the non-convexity of the loss function, the fact that deep networks are heavily over-parametrized presents a theoretical challenge: The bias-variance tradeoff suggests they should overfit; yet, even without explicit regularization, they perform remarkably well in practice. Recent work suggests that this is related to properties of the loss landscape and to the implicit regularization performed by stochastic gradient descent (SGD), but the overall picture is still hazy (Zhang et al., 2017).
Representation learning, on the other hand, focuses on the properties of the representation learned by the layers of the network (the activations) while remaining largely agnostic to the particular optimization process used. In fact, the effectiveness of deep learning is often ascribed to the ability of deep networks to learn representations that are insensitive (invariant) to nuisances such as translations, rotations, occlusions, and also “disentangled,” that is, separating factors in the high-dimensional space of data (Bengio et al., 2009). Careful engineering of the architecture plays an important role in achieving insensitivity to simple geometric nuisance transformations, like translations and small deformations; however, more complex and dataset-specific nuisances still need to be learned. This poses a riddle: If neither the architecture nor the loss function explicitly enforces invariance and disentangling, how can these properties emerge consistently in deep networks trained by simple generic optimization?
In this work, we address these questions by establishing information theoretic connections between these concepts. In particular, we show that: (a) a sufficient representation of the data is invariant if and only if it is minimal, i.e., it contains the smallest amount of information; (b) the information in the representation, along with its total correlation (a measure of disentanglement) are tightly bounded by the information that the weights retain about the dataset; (c) the information in the weights, which is related to overfitting (Hinton and Van Camp, 1993), flat minima (Hochreiter and Schmidhuber, 1997), and a PAC-Bayes upper-bound on the test error (Section 6), can be controlled by implicit or explicit regularization. Moreover, we show that adding noise during the training is a simple and natural way of biasing the network towards invariant representations.
Finally, we perform several experiments with realistic architectures and datasets to validate the assumptions underlying our claims. In particular, we show that using the information in the weights to measure the complexity of a deep neural network (DNN), rather than the number of its parameters, leads to a sharp and theoretically predicted transition between overfitting and underfitting regimes for random labels, shedding light on the questions of Zhang et al. (2017).
1.1 Related work
The Information Bottleneck (IB) was introduced by Tishby et al. (1999) as a generalization of minimal sufficient statistics that allows trading off fidelity (sufficiency) and complexity of a representation. In particular, the introduction of the IB Lagrangian reduces finding a minimal sufficient representation of the data to a variational optimization problem. Later, Tishby and Zaslavsky (2015) and Shwartz-Ziv and Tishby (2017) advocated using the Information Bottleneck between the test data and the activations of a deep neural network, to study the sufficiency and minimality of the resulting representation. In parallel developments, the IB Lagrangian was used as a regularized loss function for learning representation, leading to new information theoretic regularizers (Achille and Soatto, 2016; Alemi et al., 2017).
In this paper, we introduce the use of the IB Lagrangian between the weights of a network and the training data, as opposed to the traditional one between the activations and the test datum. We show that this can be seen both as a generalization of Variational Inference, related to Hinton and Van Camp (1993), and as a special case of the more general PAC-Bayes framework (McAllester, 2013), that can be used to compute high-probability upper-bounds on the test error of the network. One of our main contributions is then to show that, due to a particular duality induced by the architecture of deep networks, minimality of the weights (a function of the training dataset) and of the learned representation (a function of the test input) are connected: in particular we show that networks regularized either explicitly, or implicitly by SGD, are biased toward learning invariant and disentangled representations. The theory we develop could be used to explain the phenomena described in small-scale experiments in Shwartz-Ziv and Tishby (2017), whereby the initial fast convergence of SGD is related to sufficiency of the representation, while the later asymptotic phase is related to compression of the activations: While SGD is seemingly agnostic to the property of the learned representation, we show that it does minimize the information in the weights, from which the compression of the activations follows as a corollary of our bounds. Practical implementation of this theory on real large-scale problems is made possible by advances in Stochastic Gradient Variational Bayes (Kingma and Welling, 2014; Kingma et al., 2015).
Representations learned by deep networks are observed to be insensitive to complex nuisance transformations of the data. To a certain extent, this can be attributed to the architecture. For instance, the use of convolutional layers and max-pooling can be shown to yield insensitivity to local group transformations (Bruna and Mallat, 2011; Anselmi et al., 2016; Soatto and Chiuso, 2016). But for more complex, dataset-specific, and in particular non-local, non-group transformations, such insensitivity must be acquired as part of the learning process, rather than being coded in the architecture. We show that a sufficient representation is maximally insensitive to nuisances if and only if it is minimal, allowing us to prove that a regularized network is naturally biased toward learning invariant representations of the data.
Efforts to develop a theoretical framework for representation learning include Tishby and Zaslavsky (2015) and Shwartz-Ziv and Tishby (2017), who consider representations as stochastic functions that approximate minimal sufficient statistics, unlike Bruna and Mallat (2011), who construct representations as (deterministic) operators that are invertible in the limit, while exhibiting reduced sensitivity (“stability”) to small perturbations of the data. Some of the deterministic constructions are based on the assumption that the underlying data is spatially stationary, and therefore work best on textures and other visual data that are not subject to occlusions and scaling nuisances. Anselmi et al. (2016) develop a theory of invariance to locally compact groups, and aim to construct maximal (“distinctive”) invariants, as do Sundaramoorthi et al. (2009), who, however, assume nuisances to be infinite-dimensional groups (Grenander, 1993). These efforts are limited by the assumption that nuisances have a group structure. Such assumptions were relaxed by Soatto and Chiuso (2016), who advocate seeking sufficient invariants, rather than maximal ones. We further advance this approach, but unlike prior work on sufficient dimensionality reduction, we do not seek to minimize the dimension of the representation, but rather its information content, as prescribed by our theory. Recent advances in Deep Learning provide us with computationally viable methods to train high-dimensional models and predict and quantify observed phenomena such as convergence to flat minima and transitions from overfitting to underfitting random labels, thus bringing the theory to fruition. Other theoretical efforts focus on complexity considerations, and explain the success of deep networks by way of statistical or computational efficiency (Lee et al., 2017; Bengio et al., 2009; LeCun, 2012).
“Disentanglement” is an often-cited property of deep networks (Bengio et al., 2009), but seldom formalized and studied analytically, although Ver Steeg and Galstyan (2015) have suggested studying it using the Total Correlation of the representation, also known as multi-variate mutual information, which we also use.
We also connect invariance properties of the representation to the geometry of the optimization residual, and to the phenomenon of flat minima (Dinh et al., 2017).
Following suggestions by David McAllester, we have also explored relations between our theory and the PAC-Bayes framework (McAllester, 2013). As we show, our theory can also be derived in the PAC-Bayes framework, without resorting to information quantities and the Information Bottleneck, thus providing both an independent and alternative derivation, and a theoretically rigorous and computationally easy way to upper-bound the optimal loss function.
A training dataset $\mathcal{D} = (\mathbf{x}, \mathbf{y})$, where $\mathbf{x} = \{x^{(i)}\}_{i=1}^N$ and $\mathbf{y} = \{y^{(i)}\}_{i=1}^N$, consists of independent and identically distributed (IID) samples from an unknown distribution $p_\theta(x, y)$ parametrized by $\theta$. Following a Bayesian approach, we also consider $\theta$ to be a random variable, sampled from an unknown distribution $p(\theta)$. Unless specified otherwise, we denote by $x$ the measured data, while $y$ are (usually discrete) labels associated with the data; a test datum $x$ is a random variable (or random vector). Given a sample of $x$, our goal is to infer the random variable $y$, which is therefore referred to as our task.
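We will make frequent use of the following standard information theoretic quantities (Cover and Thomas, 2012): the Shannon entropy $H(x) = \mathbb{E}_p[-\log p(x)]$, the conditional entropy $H(x|y) = H(x,y) - H(y)$, the (conditional) mutual information $I(x;y|z) = H(x|z) - H(x|y,z)$, the Kullback-Leibler (KL) divergence $\mathrm{KL}(p(x)\,\|\,q(x)) = \mathbb{E}_p[\log p(x)/q(x)]$, the cross-entropy $H_{p,q}(x) = \mathbb{E}_p[-\log q(x)]$, and the total correlation $TC(z)$, which is defined as
$$TC(z) = \mathrm{KL}\Big(p(z)\,\Big\|\,\prod_i p(z_i)\Big),$$
where $p(z_i)$ are the marginal distributions of the components of $z$. Recall that the KL divergence between two distributions is always non-negative and zero if and only if they are equal. In particular $TC(z)$ is zero if and only if the components of $z$ are independent, in which case we say that $z$ is disentangled. We often use the following identity:
$$I(z;x) = \mathbb{E}_{x \sim p(x)}\, \mathrm{KL}(p(z|x)\,\|\,p(z)).$$
We say that $x$, $z$, $y$ form a Markov chain, indicated with $x \to z \to y$, if $p(y|x,z) = p(y|z)$. The Data Processing Inequality (DPI) for a Markov chain $x \to z \to y$ ensures that $I(x;z) \ge I(x;y)$.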
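As a concrete illustration of the quantities above, the following minimal Python sketch computes the KL divergence and the total correlation of small discrete distributions. The helper names (`kl`, `marginal`, `total_correlation`) and the two toy distributions are illustrative assumptions, not part of the original text.

```python
import math
from itertools import product

def kl(p, q):
    """KL(p || q) in nats for two dicts mapping outcomes to probabilities."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

def marginal(joint, axis):
    """Marginal distribution of one component of a joint over tuples."""
    out = {}
    for outcome, pv in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + pv
    return out

def total_correlation(joint):
    """KL between the joint and the product of its component marginals."""
    n = len(next(iter(joint)))
    margs = [marginal(joint, i) for i in range(n)]
    prod_marg = {z: math.prod(margs[i][z[i]] for i in range(n)) for z in joint}
    return kl(joint, prod_marg)

# Two independent fair bits: independent components, so the total correlation is 0.
independent = {(a, b): 0.25 for a, b in product((0, 1), repeat=2)}
# Two perfectly correlated fair bits: the total correlation is log(2) nats.
copied = {(0, 0): 0.5, (1, 1): 0.5}

tc_independent = total_correlation(independent)
tc_copied = total_correlation(copied)
```

In this toy case the independent pair is "disentangled" in the sense used above, while the copied pair is not.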
2.1 Information Bottleneck
We say that $z$ is a representation of $x$ if $z$ is a stochastic function of $x$, or equivalently if the distribution of $z$ is fully described by the conditional $p(z|x)$. In particular we have the Markov chain $y \to x \to z$. We say that a representation $z$ of $x$ is sufficient for $y$ if $y \perp x \mid z$, or equivalently if $I(z;y) = I(x;y)$; it is minimal when $I(x;z)$ is smallest among sufficient representations. To study the trade-off between sufficiency and minimality, Tishby et al. (1999) introduce the Information Bottleneck Lagrangian
$$\mathcal{L}(p(z|x)) = H(y|z) + \beta\, I(z;x), \tag{1}$$
where $\beta$ trades off sufficiency (first term) and minimality (second term); in the limit $\beta \to 0$, the IB Lagrangian is minimized when $z$ is minimal and sufficient.
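The trade-off can be made concrete on a toy discrete problem. The sketch below assumes the Lagrangian has the form "conditional entropy of the task given the representation, plus a multiplier times the information between representation and data", with the first term measuring sufficiency and the second minimality; the function name `ib_lagrangian`, the four-symbol input, and the three candidate encoders are hypothetical choices made only for illustration.

```python
import math

def ib_lagrangian(p_xy, encoder, beta):
    """H(y|z) + beta * I(z;x) in nats.

    p_xy:    dict {(x, y): probability}
    encoder: dict {x: {z: probability}} describing the stochastic map p(z|x)
    """
    p_x, p_z, p_zy, p_xz = {}, {}, {}, {}
    for (x, y), pxy in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + pxy
        for z, pzx in encoder[x].items():
            mass = pxy * pzx
            p_z[z] = p_z.get(z, 0.0) + mass
            p_zy[(z, y)] = p_zy.get((z, y), 0.0) + mass
            p_xz[(x, z)] = p_xz.get((x, z), 0.0) + mass
    h_y_given_z = -sum(p * math.log(p / p_z[z])
                       for (z, y), p in p_zy.items() if p > 0)
    i_zx = sum(p * math.log(p / (p_x[x] * p_z[z]))
               for (x, z), p in p_xz.items() if p > 0)
    return h_y_given_z + beta * i_zx

# Toy task: x is uniform on {0, 1, 2, 3} and the label is y = x // 2.
p_xy = {(x, x // 2): 0.25 for x in range(4)}
identity = {x: {x: 1.0} for x in range(4)}       # sufficient, not minimal
minimal = {x: {x // 2: 1.0} for x in range(4)}   # sufficient and minimal
constant = {x: {0: 1.0} for x in range(4)}       # minimal, not sufficient

beta = 0.1
losses = {name: ib_lagrangian(p_xy, enc, beta)
          for name, enc in [("identity", identity),
                            ("minimal", minimal),
                            ("constant", constant)]}
```

For a small multiplier, the encoder that keeps only the label attains the lowest value among the three, matching the statement that the Lagrangian favors minimal sufficient representations in that regime.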
2.2 Nuisances for a task
A nuisance is any random variable that affects the observed data $x$, but is not informative to the task we are trying to solve. More formally, a random variable $n$ is a nuisance for the task $y$ if $y \perp n$, or equivalently $I(y;n) = 0$. Similarly, we say that the representation $z$ is invariant to the nuisance $n$ if $z \perp n$, or $I(z;n) = 0$. When $z$ is not strictly invariant but minimizes $I(z;n)$ among all sufficient representations, we say that the representation $z$ is maximally insensitive to $n$.
One typical example of nuisance is a group $G$, such as translation or rotation, acting on the data. In this case, a deterministic representation $f$ is invariant to the nuisances if and only if for all $g \in G$ we have $f(g \cdot x) = f(x)$. Our definition however is more general in that it is not restricted to deterministic functions, nor to group nuisances. An important consequence of this generality is that the observed data $x$ can always be written as a deterministic function of the task $y$ and of all nuisances $n$ affecting the data, as explained by the following proposition.
3 Properties of representations
To simplify the inference process, instead of working directly with the observed high dimensional data, we want to use a representation $z$ that captures and exposes only the relevant information about the task $y$. Ideally, such a representation should be (a) sufficient for the task $y$, i.e. $I(y;z) = I(y;x)$, so that information about $y$ is not lost; among all sufficient representations, it should be (b) minimal, i.e. $I(z;x)$ is minimized, so that it retains as little about $x$ as possible, simplifying the role of the classifier; finally, it should be (c) invariant to the effect of nuisances, $I(z;n) = 0$, so that the final classifier will not overfit to spurious correlations present in the training dataset between nuisances $n$ and labels $y$. Such a representation, if it exists, would not be unique, since any bijective mapping preserves all these properties. We can use this to our advantage and further aim to make the representation (d) maximally disentangled, i.e., such that $TC(z)$ is minimal. This simplifies the classifier rule, since no information will be present in the higher-order correlations between the components of $z$.
Inferring a representation that satisfies all these properties may seem daunting. However, in this section we show that we only need to enforce (a) sufficiency and (b) minimality, from which invariance and disentanglement follow naturally thanks to the stacking of noisy layers of computation in deep networks. We will then show that sufficiency and minimality of the learned representation can also be promoted easily through implicit or explicit regularization during the training process.
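[Invariance and minimality] Let $n$ be a nuisance for the task $y$ and let $z$ be a sufficient representation of the input $x$. Suppose that $z$ depends on $n$ only through $x$ (i.e., $n \to x \to z$). Then,
$$I(z;n) \le I(z;x) - I(x;y).$$
Moreover, there is a nuisance $n$ such that equality holds up to a (generally small) residual $\epsilon$,
$$I(z;n) = I(z;x) - I(x;y) - \epsilon,$$
where $\epsilon := I(z;y|n) - I(x;y)$. In particular $0 \le \epsilon \le H(y|x)$, and $\epsilon = 0$ whenever $y$ is a deterministic function of $x$. Under these conditions, a sufficient statistic $z$ is invariant (maximally insensitive) to nuisances if and only if it is minimal. Since $\epsilon \le H(y|x)$, and usually $H(y|x) = 0$ or at least $H(y|x) \ll I(x;z)$, we can generally ignore the extra term.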
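The proposition can be checked numerically on a small example. The sketch below is a toy model chosen for illustration, not taken from the paper: the task is a fair bit, the nuisance is an independent fair bit, the data is the pair (task, nuisance), and the representation keeps the task bit together with a copy of the nuisance bit that is flipped with probability `flip`. The representation is therefore sufficient, and it depends on the nuisance only through the data. All function names are hypothetical.

```python
import math
from itertools import product

def mutual_information(joint):
    """I(a;b) in nats from a dict {(a, b): probability}."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

def toy_sides(flip):
    """Return (I(z;n), I(z;x) - I(x;y)) for the toy model described above."""
    zn, zx, xy = {}, {}, {}
    for y, n, noisy in product((0, 1), repeat=3):
        # y and n are independent fair bits; noisy is n flipped w.p. `flip`.
        p = 0.25 * (flip if noisy != n else 1.0 - flip)
        x, z = (y, n), (y, noisy)
        zn[(z, n)] = zn.get((z, n), 0.0) + p
        zx[(z, x)] = zx.get((z, x), 0.0) + p
        xy[(x, y)] = xy.get((x, y), 0.0) + p
    return mutual_information(zn), mutual_information(zx) - mutual_information(xy)

results = {flip: toy_sides(flip) for flip in (0.0, 0.1, 0.3, 0.5)}
```

Because the task is a deterministic function of the data in this toy model, the two sides coincide for every noise level: adding noise lowers the information the representation keeps about the data and lowers its sensitivity to the nuisance by the same amount.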
An important consequence of this proposition is that we can construct invariants by simply reducing the amount of information $z$ contains about $x$, while retaining the minimum amount that we need for the task $y$. This provides the network a way to automatically learn invariance to complex nuisances, which is complementary to the invariance imposed by the architecture. Specifically, one way of enforcing minimality explicitly, and hence invariance, is through the IB Lagrangian.
[Invariants from the Information Bottleneck] Minimizing the IB Lagrangian
$$\mathcal{L}(p(z|x)) = H(y|z) + \beta\, I(z;x),$$
in the limit $\beta \to 0$, yields a sufficient invariant representation $z$ of the test datum $x$ for the task $y$.
Remarkably, the IB Lagrangian can be seen as the standard cross-entropy loss, plus a regularizer that promotes invariance. This fact, without proof, is implicitly used in Achille and Soatto (2016), who also provide an efficient algorithm to perform the optimization. Alemi et al. (2017) also propose a related algorithm and show improved resistance to adversarial nuisances. In addition to modifying the cost function, invariance can also be fostered by choice of architecture:
[Bottlenecks promote invariance] Suppose we have the Markov chain of layers
$$x \to z_1 \to z_2,$$
and suppose that there is a communication or computation bottleneck between and such that . Then, if is still sufficient, it is more invariant to nuisances than . More precisely, for all nuisances we have . Such a bottleneck can happen for example because , e.g., after a pooling layer, or because the channel between and is noisy, e.g., because of dropout.
[Stacking increases invariance] Assume that we have the Markov chain of layers
$$x \to z_1 \to z_2 \to \ldots \to z_L,$$
and that the last layer $z_L$ is sufficient of $x$ for $y$. Then $z_L$ is more insensitive to nuisances than all the preceding layers.
Notice, however, that the above corollary does not simply imply that the more layers the merrier, as it assumes that one has successfully trained the network ($z_L$ is sufficient), which becomes increasingly difficult as the size grows. Also note that in some architectures, such as ResNets (He et al., 2016), the layers do not necessarily form a Markov chain because of skip connections; however, their “blocks” still do.
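The role of the Data Processing Inequality in the stacking argument can be illustrated with a deliberately simple chain. The sketch below assumes a single fair nuisance bit passed through a sequence of binary symmetric channels, each standing in for a noisy layer; it tracks only the nuisance information and says nothing about sufficiency, which the corollary additionally requires. The function names and flip probabilities are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def compose_flip(p, q):
    """Flip probability of two binary symmetric channels in series."""
    return p * (1 - q) + q * (1 - p)

def nuisance_information(flips):
    """Information (nats) each layer retains about a fair nuisance bit
    after passing through binary symmetric channels with the given flips."""
    infos, total = [], 0.0
    for f in flips:
        total = compose_flip(total, f)
        infos.append(math.log(2) - binary_entropy(total))
    return infos

# Three noisy layers, each flipping the bit with probability 0.05.
infos = nuisance_information([0.05, 0.05, 0.05])
```

Each additional noisy layer can only lose nuisance information, so the list `infos` is non-increasing, in line with the Data Processing Inequality along the chain of layers.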
4 Learning minimal weights
In this section, we let $p_\theta(x, y)$ be an (unknown) data distribution from which we randomly sample a dataset $\mathcal{D}$. The parameter $\theta$ of the distribution is also assumed to be a random variable with an (unknown) prior distribution $p(\theta)$. For example $p_\theta$ can be a fairly general generative model for natural images, and $\theta$ can be the parameters of the model that generated our dataset.
We then consider a deep neural network that implements a map from an input $x$ to a class distribution $q(y|x,w)$. (We use $p$ to denote the real, and unknown, data distribution, while $q$ denotes approximate distributions that are optimized during training.) In full generality, and following a Bayesian approach, we let the weights $w$ of the network be sampled from a parametrized distribution $q(w|\mathcal{D})$, whose parameters are optimized during training. (Note that, while the two are somewhat related, here by $q(w|\mathcal{D})$ we denote the output distribution of the weights after training with our choice algorithm on the dataset $\mathcal{D}$, and not the Bayesian posterior of the weights given the dataset, which would be denoted $p(w|\mathcal{D})$. When $q(w|\mathcal{D})$ is a Dirac delta at a point, we recover the standard loss function for a MAP estimate of the weights.) The network is then trained to minimize the expected cross-entropy loss, so that $q(y|x,w)$ approximates $p_\theta(y|x)$. (Note that for generality here we treat the dataset $\mathcal{D}$ as a random variable. In practice, when a single dataset is given, the expectation w.r.t. the dataset can be ignored.)
One of the main problems in optimizing a DNN is that the cross-entropy loss is notoriously prone to overfitting. In fact, one can easily minimize it even for completely random labels (Zhang et al. (2017) and Figure 1). The fact that, somehow, such highly over-parametrized functions manage to generalize when trained on real labels has puzzled theoreticians and prompted some to wonder whether this may be inconsistent with the intuitive interpretation of the bias-variance tradeoff theorem, whereby unregularized complex models should overfit wildly. However, as we show next, there is no inconsistency if one measures complexity by the information content, and not the dimensionality, of the weights.
To gain some insights about the possible causes of over-fitting, we can use the following decomposition of the cross-entropy loss (we refer to Appendix C for the proof and the precise definition of each term):
The first term of the right-hand side of (8) relates to the intrinsic error that we would commit in predicting the labels even if we knew the underlying data distribution $p_\theta$; the second term measures how much of the information that the dataset has about the parameter $\theta$ is captured by the weights; the third term relates to the efficiency of the model and the class of functions with respect to which the loss is optimized. The last, and only negative, term relates to how much information about the labels, but uninformative of the underlying data distribution, is memorized in the weights. Unfortunately, without implicit or explicit regularization, the network can minimize the cross-entropy loss (LHS) by just maximizing the last term of eq. 8, i.e., by memorizing the dataset, which yields poor generalization.
To prevent the network from doing this, we can neutralize the effect of the negative term by adding it back to the loss function, leading to a regularized loss . However, computing, or even approximating, the value of is at least as difficult as fitting the model itself.
We can, however, add an upper bound to $I(\mathcal{D};w|\theta)$ to obtain the desired result. In particular, we explore two alternate paths that lead to equivalent conclusions under different premises and assumptions: In one case, we use a PAC-Bayes upper-bound, which is $\mathrm{KL}(q(w|\mathcal{D})\,\|\,p(w))$ for a prior $p(w)$. In the other, we use the IB Lagrangian and upper-bound it with $I(\mathcal{D};w)$. We discuss this latter approach next, and look at the PAC-Bayes approach in Section 6.
For the latter approach, notice that to successfully learn the distribution , we only need to memorize in the information about the latent parameters , that is we need , which is bounded above by a constant. On the other hand, to overfit, the term needs to grow linearly with the number of training samples . We can exploit this fact to prevent overfitting by adding a Lagrange multiplier to make the amount of information a constant with respect to , leading to the regularized loss function
which is, remarkably, the same IB Lagrangian in (1), but now interpreted as a function of rather than . This use of the IB Lagrangian is, to the best of our knowledge, novel, as the role of the Information Bottleneck has thus far been confined to characterizing the activations of the network, and not as a learning criterion. Equation 3 can be seen as a generalization of other suggestions in the literature:
IB Lagrangian, Variational Learning and Dropout.
Minimizing the information stored in the weights was proposed as far back as Hinton and Van Camp (1993) as a way of simplifying neural networks, but no efficient algorithm to perform the optimization was known at the time. For the particular choice $\beta = 1$, the IB Lagrangian reduces to the variational lower-bound (VLBO) of the marginal log-likelihood $p(\mathbf{y}|\mathbf{x})$. Therefore, minimizing eq. 3 can also be seen as a generalization of variational learning. A particular case of this was studied by Kingma et al. (2015), who first showed that a generalization of Dropout, called Variational Dropout, could be used in conjunction with the reparametrization trick (Kingma and Welling, 2014) to minimize the loss efficiently.
Information in the weights as a measure of complexity.
Just as Hinton and Van Camp (1993) suggested, we also advocate using the information regularizer $I(w;\mathcal{D})$ as a measure of the effective complexity of a network, rather than the number of parameters $\dim(w)$, which is merely an upper bound on the complexity. As we show in experiments, this allows us to recover a version of the bias-variance tradeoff where networks with lower information complexity underfit the data, and networks with higher complexity overfit. In contrast, there is no clear relationship between number of parameters and overfitting (Zhang et al., 2017). Moreover, for random labels the information complexity allows us to precisely predict the overfitting and underfitting behavior of the network (Section 7).
4.1 Computable upper-bound to the loss
Unfortunately, computing $I(w;\mathcal{D})$ is still too complicated, since it requires us to know the marginal $q(w)$ over all possible datasets and trainings of the network. To avoid computing this term, we can use the more general upper-bound
$$I(w;\mathcal{D}) = \mathbb{E}_{\mathcal{D}}\,\mathrm{KL}(q(w|\mathcal{D})\,\|\,q(w)) \le \mathbb{E}_{\mathcal{D}}\,\mathrm{KL}(q(w|\mathcal{D})\,\|\,p(w)),$$
where $p(w)$ is any fixed distribution of the weights. Assuming for simplicity's sake that the dataset is fixed, so we can ignore the expectation over $\mathcal{D}$, this gives us the following upper bound to the optimal loss function
Generally, we want to pick in order to give the sharpest upper-bound, and to be a fully factorized distribution, i.e., a distribution with independent components, in order to make the computation of the KL term easier. The sharpest upper-bound to that can be obtained using a factorized distribution is obtained when where denotes the marginal distributions of the components of . With this choice of prior, our final loss function becomes
for some fixed distribution that approximates the real marginal distribution . The IB Lagrangian for the weights in eq. 3 can be seen as a generally intractable special case of eq. 5 that gives the sharpest upper-bound to our desired loss in this family of losses.
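The reason any fixed distribution over the weights yields an upper bound can be verified on a small discrete example: the expected KL divergence from the weight distribution to a fixed prior equals the mutual information between weights and dataset plus the KL divergence from the true marginal to that prior, and the latter is non-negative. The sketch below uses a scalar weight with three possible values and two equally likely datasets; all numbers and names are illustrative assumptions.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two probability lists of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two equally likely datasets and the weight distribution learned on each.
p_data = [0.5, 0.5]
q_w_given_d = [[0.7, 0.2, 0.1],
               [0.1, 0.3, 0.6]]

# True marginal of the weights over datasets.
q_w = [sum(pd * q[i] for pd, q in zip(p_data, q_w_given_d)) for i in range(3)]

def expected_kl(prior):
    """Expected KL from the weight distribution to a fixed prior."""
    return sum(pd * kl(q, prior) for pd, q in zip(p_data, q_w_given_d))

uniform = [1 / 3, 1 / 3, 1 / 3]
info = expected_kl(q_w)               # mutual information between w and the dataset
bound_uniform = expected_kl(uniform)  # upper bound obtained with a uniform prior
gap = kl(q_w, uniform)                # slack introduced by the mismatched prior
```

The bound is tight exactly when the prior equals the true marginal, which is why a prior that approximates the marginal is preferred. The factorization over weight components discussed above is not exercised by this scalar example.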
In the following, to keep the notation uncluttered, we will denote our upperbound to the mutual information simply by , where
4.2 Closed-form expression for the loss
To derive precise and empirically verifiable statements about , we need an analytical expression for it. To this end, following Kingma et al. (2015), we make the following modeling assumptions.
Let $w$ denote the vector containing all the parameters (weights) in the network, and let $W^k$ denote the weight matrix at layer $k$. We assume an improper log-uniform prior on $w$, that is $\tilde{q}(w_i) = c/|w_i|$. Notice that this is the only scale-invariant prior (Kingma et al., 2015), and closely matches the real marginal distributions of the weights in a trained network. Then, we parametrize the weight distribution $q(w_i|\mathcal{D})$ during training as
$$w_i|\mathcal{D} \sim \epsilon_i \hat{w}_i,$$
where $\hat{w}_i$ is a learned mean, and $\epsilon_i \sim \log\mathcal{N}(-\alpha_i/2, \alpha_i)$ is IID multiplicative log-normal noise with mean 1 and variance $\exp(\alpha_i) - 1$ (for a log-normal $\log\mathcal{N}(\mu, \sigma^2)$, mean and variance are respectively $\exp(\mu + \sigma^2/2)$ and $[\exp(\sigma^2) - 1]\exp(2\mu + \sigma^2)$). Note that while Kingma et al. (2015) use this parametrization as a local approximation of the Bayesian posterior for a given (log-uniform) prior, we rather define the distribution of the weights $w$ after training on the dataset $\mathcal{D}$ to be $q(w|\mathcal{D})$.
[Information in the weights] Under the previous modeling assumptions, the upper-bound to the information that the weights contain about the dataset is
$$I(w;\mathcal{D}) \le \tilde{I}(w;\mathcal{D}) = -\frac{1}{2}\sum_{i=1}^{\dim(w)} \log \alpha_i + C,$$
where the constant $C$ is arbitrary due to the improper prior.
[On the constant $C$] To simplify the exposition, since the optimization is unaffected by any additive constant, in the following we abuse the notation and, under the modeling assumptions stated above, we rather define $\tilde{I}(w;\mathcal{D}) := -\frac{1}{2}\sum_{i=1}^{\dim(w)} \log \alpha_i$. Neklyudov et al. (2017) also suggest a principled way of dealing with the arbitrary constant by using a proper log-uniform prior.
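A minimal sketch of these modeling assumptions follows. It assumes that each weight is a learned mean multiplied by log-normal noise whose underlying normal has mean minus half the noise parameter and variance equal to the noise parameter (so that the noise has mean 1), and that, up to the arbitrary additive constant, the information bound is minus one half of the sum of the log noise parameters. Both the parametrization and the formula are stated here as assumptions, since the corresponding equations are not legible in this copy of the text; the function names are hypothetical.

```python
import math
import random

def sample_weight(w_hat, alpha, rng):
    """Draw one noisy weight: learned mean times unit-mean log-normal noise.

    The noise is exp(N(-alpha/2, alpha)), whose mean is 1 and whose
    variance is exp(alpha) - 1.
    """
    eps = rng.lognormvariate(-alpha / 2.0, math.sqrt(alpha))
    return eps * w_hat

def information_in_weights(alphas):
    """Assumed bound, up to an additive constant: -1/2 * sum(log(alpha_i))."""
    return -0.5 * sum(math.log(a) for a in alphas)

# Less noise (smaller alpha) means the weights must be stored more precisely,
# which corresponds to more information under this measure.
precise = information_in_weights([0.01, 0.01])
noisy = information_in_weights([1.0, 1.0])
```

Under this measure, increasing the noise on a weight reduces the information it stores, which is the mechanism by which noise injection during training acts as a regularizer.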
4.3 Flat minima have low information
Thus far we have suggested that adding the explicit information regularizer $I(w;\mathcal{D})$ prevents the network from memorizing the dataset and thus avoids overfitting, which we also confirm empirically in Section 7. Real networks, however, are not commonly trained with this regularizer, which seemingly undermines the theory. We claim that, even when not explicitly controlled, $I(w;\mathcal{D})$ is implicitly regularized by the use of SGD. In particular, empirical evidence (Chaudhari et al., 2017) suggests that SGD biases the optimization toward “flat minima”, that is, local minima whose Hessian has mostly small eigenvalues. These minima can be interpreted exactly as having low information, as suggested early on by Hochreiter and Schmidhuber (1997): Intuitively, since the loss landscape is locally flat, the weights may be stored at lower precision without incurring excessive inference error. As a consequence of previous claims, we can then see flat minima as having better generalization properties and, as we will see in Section 5, the associated representation of the data is more invariant and disentangled. For completeness, here we derive a more precise relationship between flatness (measured by the nuclear norm of the loss Hessian), and the information content based on our model.
[Flat minima have low information] Let $\hat{w}$ be a local minimum of the cross-entropy loss, and let $\mathcal{H}$ be the Hessian at that point. Then, for the optimal choice of the posterior centered at $\hat{w}$ that optimizes the IB Lagrangian, we have
where and denotes the nuclear norm.
Notice that a converse inequality, that is, low information implies flatness, need not hold, so there is no contradiction with the results of Dinh et al. (2017). Also note that for $\tilde{I}(w;\mathcal{D})$ to be invariant to reparametrization one has to consider the constant $C$, which we have ignored (Section 4.2).
In the next section, we prove one of our main results, that networks with low information in the weights realize invariant and disentangled representations. Therefore, invariance and disentanglement emerge naturally when training a network with implicit (SGD) or explicit (IB Lagrangian) regularization, and are related to flat minima.
5 Duality of the Bottleneck
The following proposition gives the fundamental link in our model between information in the weights, and hence flatness of the local minima, minimality of the representation, and disentanglement. Let $z = Wx$, and assume as before $W = \epsilon \odot \hat{W}$, with $\epsilon_{i,j} \sim \log\mathcal{N}(-\alpha_i/2, \alpha_i)$. Further assume that the marginals of $p(z)$ and $p(z|x)$ are both approximately Gaussian (which is reasonable for large $\dim(x)$ by the Central Limit Theorem). Then,
where denotes the -th row of the matrix , and is the noise variance . In particular, is a monotone decreasing function of the weight variances .
The above identity is difficult to apply in practice, but with some additional hypotheses, we can derive a cleaner uniform tight bound on .
[Uniform bound for one layer] Let $z = Wx$, where $W = \epsilon \odot \hat{W}$, where $\epsilon_{i,j} \sim \log\mathcal{N}(-\alpha/2, \alpha)$; assume that the components of $x$ are uncorrelated, and that their kurtosis is uniformly bounded (this is a technical hypothesis, always satisfied if the components are IID, (sub-)Gaussian, or with uniformly bounded support). Then, there is a strictly increasing function $g(\alpha)$ s.t. we have the uniform bound
where , and is related to by . In particular, $I(x;z) + TC(z)$ is tightly bounded by the information in the weights and increases strictly with it.
The above theorems tell us that whenever we decrease the information in the weights, either by explicit regularization, or by implicit regularization (e.g., using SGD), we automatically improve the minimality, and hence, by Section 3, the invariance, and the disentanglement of the learned representation. In particular, we obtain as a corollary that SGD is biased toward learning invariant and disentangled representations of the data. Using the Markov property of the layers, we can easily extend this bound to multiple layers:
[Multi-layer case] Let for be weight matrices, with and , and let , where and is any nonlinearity. Then,
[Tightness] While the bound in Section 5 is tight, the bound in the multilayer case need not be. This is to be expected: Reducing the information in the weights creates a bottleneck, but we do not know how much information about $x$ will actually go through this bottleneck. Often, the final layers will let most of the information through, while initial layers will drop the most.
6 Connection with PAC-Bayes bounds
In this section we show that using a PAC-Bayes bound, we arrive at the same regularized loss function eq. 5 we obtained using the Information Bottleneck, without the need of any approximation. By Theorem 2 of McAllester (2013), we have that for any fixed , prior , and any weight distribution , the test error that the network commits using the weight distribution is upper-bounded in expectation by
Now, recall that since we have
the sharpest PAC-Bayes upper-bound to the test error is obtained when , in which case eq. 7 reduces (modulo a multiplicative constant) to the IB Lagrangian of the weights. That is, the IB Lagrangian for the weights can be considered as a special case of PAC-Bayes giving the sharpest bound.
Unfortunately, as we noticed in Section 4, the joint marginal of the weights is not tractable. To circumvent the problem, we can instead consider the sharpest PAC-Bayes upper-bound that can be obtained using a tractable factorized prior $p(w)$, which is obtained exactly when $p(w)$ is the product of the marginals, leading again to our practical loss eq. 5.
On a last note, recall that under our modeling assumptions the marginal is assumed to be an improper log-uniform distribution. While this has the advantage of being a non-informative prior that closely matches the real marginal of the weights of the network, it also has the disadvantage that it is only defined modulo an additive constant, therefore making the bound on the test error vacuous under our model. The problem of computing non-vacuous bounds for real deep neural networks has been addressed by Dziugaite and Roy (2017).
7 Empirical validation
As pointed out by Zhang et al. (2017), when a standard convolutional neural network (CNN) is trained on CIFAR-10 to fit random labels, the network is able to (over)fit them perfectly. This is easily explained in our framework: It simply means that the network is complex enough to overfit but, as we show here, it has to pay a steep price in terms of information complexity of the weights (Figure 2). On the other hand, information regularization prevents overfitting in exactly the way predicted by the theory.
In particular, in the case of completely random labels, we have , since is by construction random. Therefore, eq. 3 is an optimal regularizer: Regardless of the dataset size , for it should completely prevent memorization and hence overfitting, while for overfitting is possible. The empirical behavior of the network, shown in Figure 1, closely follows this prediction. For real labels, the model is still able to overfit when , but importantly there is a large interval of where the model fits the data without overfitting. Indeed, as soon as is larger than , the model trained on real data fits real labels without excessive overfitting (Figure 1).
In Figure 2, we measure the quantity of information in the weights for different levels of corruption of the labels. To do this, we fix so that the network is able to overfit, and for various levels of corruption we train until convergence, and then compute for the trained model. As expected, increasing the randomness of the labels increases the quantity of information we need to fit the dataset. For completely random labels, increases by nats/sample, which is of the same order of magnitude as the quantity required to memorize a 10-class label ( nats/sample), as shown in Figure 2.
7.1 Nuisance invariance
Section 5 shows that by decreasing the information in the weights , which can be done for example using eq. 3, the learned representation will be increasingly minimal, and therefore insensitive to nuisance factors , as measured by . Here, we adapt a technique from the GAN literature (Sønderby et al., 2016) that allows us to explicitly measure and validate this effect, provided we can sample from the nuisance distribution and from ; that is, if given a nuisance we can generate data affected by that nuisance. Recall that by definition we have
To approximate the expectations via sampling we need a way to approximate the likelihood ratio . This can be done as follows: Let be a binary discriminator that given the representation and the nuisance tries to decide whether is sampled from the posterior distribution or from the prior . Since by hypothesis we can generate samples from both distributions, we can generate data to train this discriminator. Intuitively, if the discriminator is not able to classify, it means that is insensitive to changes of . Precisely, since the optimal discriminator is
if we assume that is close to the optimal discriminator , we have
therefore we can use to estimate the log-likelihood ratio, and so also the mutual information . Notice however that this comes with no guarantees on the quality of the approximation.
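As a sanity check of this estimator, the following minimal sketch evaluates it on a toy problem where everything is known in closed form. The toy model is our assumption, not the experimental setup: the nuisance is a standard normal scalar, the representation is the nuisance plus Gaussian noise, and the discriminator is the analytically optimal one (the posterior density divided by the sum of posterior and prior densities) rather than a trained network. All function names are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def optimal_discriminator(z, n, noise_var):
    """d*(z, n) = p(z|n) / (p(z|n) + p(z)) for the toy model z = n + noise."""
    log_posterior = log_normal_pdf(z, n, noise_var)        # p(z|n)
    log_prior = log_normal_pdf(z, 0.0, 1.0 + noise_var)    # p(z), since n ~ N(0, 1)
    # Ratio of densities written as a sigmoid of the log-likelihood ratio.
    return 1.0 / (1.0 + np.exp(-(log_posterior - log_prior)))

def estimate_mutual_information(d_values, eps=1e-12):
    """Average of log d / (1 - d) over samples drawn from the joint."""
    d = np.clip(d_values, eps, 1 - eps)
    return float(np.mean(np.log(d) - np.log1p(-d)))

rng = np.random.default_rng(0)
noise_var = 1.0
n = rng.standard_normal(200_000)                                    # nuisance samples
z = n + np.sqrt(noise_var) * rng.standard_normal(n.shape)           # representation samples
mi_estimate = estimate_mutual_information(optimal_discriminator(z, n, noise_var))
mi_exact = 0.5 * np.log(1.0 + 1.0 / noise_var)                      # closed form for this toy model
```

With the optimal discriminator the estimate agrees with the closed-form value up to Monte Carlo error. With a trained discriminator the gap depends on how close it is to optimal, which is the caveat stated above.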
To test this algorithm, we add random occlusion nuisances to MNIST digits (Figure 3). In this case, the nuisance is the occlusion pattern, while the observed data is the occluded digit. For various values of , we train a classifier on this data in order to learn a representation , and, for each representation obtained this way, we train a discriminator as described above and we compute the resulting approximation of . The results in Figure 3 show that decreasing the information in the weights makes the representation increasingly insensitive to .
7.2 Visualizing the representation
Even when we cannot generate data affected by nuisances as in the previous section, we can still visualize the information content of to learn what nuisances are discarded in the representation. To this end, given a representation , we want to learn to sample from a distribution of images that are maximally likely to have as their representation. Formally, this means that we want a distribution that maximizes the amortized maximum a posteriori estimate of:
Unfortunately, the term in the expression is difficult to estimate. However, Sønderby et al. (2016) notice that the modified gain function
differs from the amortized MAP only by a term , which has the positive effect of improving the exploration of the reconstruction, and contains the term , which can be estimated easily using the discriminator network of a GAN (Sønderby et al., 2016). To maximize this gain, we can simply train a GAN with an additional reconstruction loss .
To test this algorithm, we train a representation to classify the 40 binary attributes in the CelebA face dataset (Yang et al., 2015), and then use the above loss function to train a GAN to reconstruct an input image from the representation . The results in Figure 4 show that, as expected, increasing the value of , and therefore reducing , generates samples that have increasingly random backgrounds and hairstyles (nuisances), while retaining facial features. In other words, the representation is increasingly insensitive to nuisances affecting the data, while information pertaining to the task is retained in the reconstruction .
8 Discussion and conclusion
In this work, we have presented bounds, some of which are tight, that connect the amount of information in the weights, the amount of information in the activations, the invariance property of the network, and the geometry of the residual loss. These results leverage the structure of deep networks, in particular the multiplicative action of the weights, and the Markov property of the layers. This leads to the surprising result that reducing information stored in the weights about the past (dataset) results in desirable properties of the learned internal representation of the test datum.
Our notion of representation is intrinsically stochastic. This simplifies the computation as well as the derivation of information-based relations. However, note that even if we start with a deterministic representation , Section 4.3 gives us a way of converting it to a stochastic representation whose quality depends on the flatness of the minimum. Our theory relies heavily on the Information Bottleneck Principle, which dates back over two decades but was under-utilized until recently because of the lack of tools to efficiently approximate and optimize the Information Bottleneck Lagrangian.
This work focuses on the inference and learning of optimal representations that seek to get the most out of the data we have for a specific task. This does not guarantee a good outcome since, due to the Data Processing Inequality, the representation can be easier to use but ultimately no more informative than the data themselves. An orthogonal but equally interesting issue is how to get the most informative data possible, which is the subject of active learning, experiment design, and perceptual exploration.
Supported by ONR N00014-17-1-2072, ARO W911NF-17-1-0304, AFOSR FA9550-15-1-0229 and FA8650-11-1-7156. We wish to thank David McAllester, Kevin Murphy, and Alessandro Chiuso for insightful comments and suggestions.
Appendix A Details of the experiments
A.1 Random labels
We use an experimental setup similar to that of Zhang et al. (2017). In particular, we train a small version of AlexNet on a 28×28 central crop of CIFAR-10 with completely random labels. The dataset is normalized using the global channel-wise mean and variance, but no additional data augmentation is performed. The exact structure of the network is in Table 1. We train with two learning rates and pick the best performing network of the two. Generally, we found that a higher learning rate is needed to overfit when the number of training samples is small, while a lower learning rate is needed for larger . We train with SGD with momentum for epochs, reducing the learning rate by a factor of every epochs. We use a large batch size of to minimize the noise coming from SGD. No weight decay or other regularization methods are used.
The final plot is obtained by triangulating the convex envelope of the data points, and by interpolating their value on the resulting simplexes. Outside of the convex envelope (where the accuracy is mostly constant), the value was obtained by inpainting.
To measure the information content of the weights as the percentage of corrupted labels varies, we fix , and and train the network on different corruption levels with the same settings as before.
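The label corruption used in this measurement can be sketched as follows. This is one possible scheme under our own assumptions: a chosen fraction of the labels is replaced by classes drawn uniformly at random, so a "corrupted" label may still coincide with the original one. The function name `corrupt_labels` and its parameters are hypothetical.

```python
import numpy as np

def corrupt_labels(labels, corruption_fraction, num_classes=10, rng=None):
    """Replace a random subset of the labels with uniformly sampled classes."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels).copy()
    num_corrupted = int(round(corruption_fraction * labels.size))
    # Indices of the labels to resample, chosen without repetition.
    idx = rng.choice(labels.size, size=num_corrupted, replace=False)
    labels[idx] = rng.integers(0, num_classes, size=num_corrupted)
    return labels, idx

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)            # stand-in for the true labels
noisy, corrupted_idx = corrupt_labels(clean, 0.5, rng=rng)

# Entropy, in nats, of a uniformly random 10-class label: the information
# needed per sample to memorize completely random labels.
label_entropy_nats = float(np.log(10))
```

The last line gives the reference value against which the increase in the information in the weights can be compared for fully random 10-class labels.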
A.2 Nuisance invariance
The cluttered MNIST dataset is generated by adding 10 squares uniformly at random on the digits of the MNIST dataset (LeCun et al., 1998). For each level of , we train the classifier in Table 1 on this dataset. The weights of all layers, excluding the first and last one, are treated as a random variable with multiplicative Gaussian noise (Appendix B) and optimized using the local reparameterization trick of Kingma et al. (2015). We use the last convolutional layer before classification as representation .
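A minimal sketch of how such occluded digits could be generated is given below. Only the number of squares (10) comes from the text. The square size, the fill value of the occluded pixels, and the function name `add_random_squares` are assumptions made for illustration, and a random array stands in for an actual 28×28 MNIST digit.

```python
import numpy as np

def add_random_squares(image, num_squares=10, square_size=4, fill_value=1.0, rng=None):
    """Occlude `image` with squares placed uniformly at random.

    Returns the occluded image and the boolean occlusion pattern (the nuisance).
    `square_size` and `fill_value` are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    pattern = np.zeros((height, width), dtype=bool)
    for _ in range(num_squares):
        # Top-left corner chosen so that the square fits inside the image.
        top = rng.integers(0, height - square_size + 1)
        left = rng.integers(0, width - square_size + 1)
        pattern[top:top + square_size, left:left + square_size] = True
    occluded = np.where(pattern, fill_value, image)
    return occluded, pattern

rng = np.random.default_rng(0)
digit = rng.random((28, 28))                  # stand-in for an MNIST digit
occluded_digit, nuisance = add_random_squares(digit, rng=rng)
```

Returning the pattern together with the occluded image gives matched pairs of nuisance and data, which is what training the discriminator of Section 7.1 requires.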
The discriminator network used to estimate the log-likelihood ratio is constructed as follows: the inputs are the nuisance pattern , which is an image containing 10 random occluding squares, and the 7×7×192 representation obtained from the classifier. First we preprocess using the following network: conv 48 → conv 48 → conv 96 s2 → conv 96 → conv 96 → conv 96 s2, where each conv block is a 3×3 convolution followed by batch normalization and ReLU. Then, we concatenate the result with along the feature maps, and the final discriminator output is obtained by applying the following network: conv 192 → conv 192 → conv 1×1×192 → conv 1×1×1 → AvgPooling 7×7 → sigmoid.
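The following sketch traces the spatial size of the nuisance pattern through the preprocessing network, to check that it can be concatenated with the 7×7×192 representation. Two assumptions are ours: the pattern has the 28×28 size of an MNIST digit, and every 3×3 convolution uses padding 1. The helper names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_shapes(input_size, layers):
    """Apply a list of (out_channels, stride) conv blocks; return (size, channels)."""
    size, channels = input_size, None
    for out_channels, stride in layers:
        size = conv_output_size(size, stride=stride)
        channels = out_channels
    return size, channels

# conv 48, conv 48, conv 96 s2, conv 96, conv 96, conv 96 s2
preprocess = [(48, 1), (48, 1), (96, 2), (96, 1), (96, 1), (96, 2)]
nuisance_size, nuisance_channels = trace_shapes(28, preprocess)

representation_size, representation_channels = 7, 192
# Concatenation along the feature maps requires matching spatial sizes.
concat_channels = nuisance_channels + representation_channels
```

Under these assumptions the two stride-2 blocks reduce 28×28 to 7×7, so the preprocessed pattern and the representation share the same spatial grid before concatenation.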
|conv 64 + BN|
|FC 3136x384 + BN|
|FC 384x192 + BN|
|conv 96 + BN + ReLU|
|conv 96 + BN + ReLU|
|conv 192 s2 + BN + ReLU|
|conv 192 + BN + ReLU|
|conv 192 + BN + ReLU|
|conv 192 s2 + BN + ReLU|
|conv 192 + BN + ReLU|
|conv 192 + BN + ReLU|
|Average pooling 7x7|
kernel, “s2” denotes a convolution with stride 2. The final representation we use consists of the activations of the last “conv 192” layer.
A.3 Visualizing the representation
We train a classifier on the images from the CelebA dataset resized to 32×32. The task is to recover the 40 binary attributes associated with each image. The classifier network is the same as the one in Table 1 with the following modifications: we use Exponential Linear Units (ELU) (Clevert et al., 2015) for the activations, instead of ReLU, since invertible activations generally perform better when training a GAN, and we halve the number of filters in all layers to reduce the training time. A sigmoid nonlinearity is applied to the final 40-way output of the network.
To generate the image given the 8×8×96 representation computed by the classifier, we use a structure similar to DCGAN (Radford et al., 2016), namely conv 256 → ConvT 256s2 → ConvT 128s2 → conv 3 → tanh, where ConvT 256s2 denotes a transpose convolution with 256 feature maps and stride 2. All convolutions have a batch normalization layer before the activations.
Finally, the discriminator network is conv 64s2 → conv 128s2 → ConvT 256s2 → conv 1 → sigmoid. Here, all convolutions use batch normalization followed by Leaky ReLU activations.
In this experiment, we use Gaussian multiplicative noise, which is slightly more stable during training (Appendix B). To stabilize the training of the GAN, we found it useful to (1) scale down the “reconstruction error” term in the loss function and (2) slowly increase the weight of the reconstruction error up to the desired value during training.
Appendix B Gaussian multiplicative noise
In developing the theory, we chose to use log-normal multiplicative noise for the weights: The main benefit is that with this choice the information in the weights can be expressed in closed form, up to an arbitrary constant which does not matter during the optimization process (but see also Neklyudov et al. (2017) for a principled approach to this problem that uses a proper log-uniform prior). Another possibility, suggested by Kingma et al. (2015) is to use Gaussian multiplicative noise with mean 1. Unfortunately, there is no analytical expression for when using Gaussian noise, but can still be approximated numerically with high precision (Molchanov et al., 2017), and it makes the training process slightly more stable. All our theory holds with minimal changes also in this case, and we use this choice in some experiments.
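For a single linear layer, Gaussian multiplicative noise with mean 1 combined with the local reparameterization trick of Kingma et al. (2015) can be sketched as below. This is a minimal illustration under our own assumptions: a single noise variance `alpha` shared by all weights, a dense layer without bias, and hypothetical function names. A direct sampler that draws a fresh noisy weight matrix for every input is included only as a reference for comparison.

```python
import numpy as np

def noisy_linear_local_reparam(x, w_hat, alpha, rng):
    """Sample pre-activations for weights w = w_hat * eps, eps ~ N(1, alpha),
    by sampling the Gaussian output directly instead of the weights."""
    mean = x @ w_hat
    var = alpha * ((x ** 2) @ (w_hat ** 2))
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)

def noisy_linear_direct(x, w_hat, alpha, rng):
    """Reference: draw an independent noisy weight matrix for each input row."""
    eps = 1.0 + np.sqrt(alpha) * rng.standard_normal((x.shape[0],) + w_hat.shape)
    return np.einsum("bi,bij->bj", x, w_hat[None] * eps)

rng = np.random.default_rng(0)
x = np.tile(rng.standard_normal((1, 5)), (50_000, 1))   # the same input, repeated
w_hat = rng.standard_normal((5, 3))
alpha = 0.3
out_local = noisy_linear_local_reparam(x, w_hat, alpha, rng)
out_direct = noisy_linear_direct(x, w_hat, alpha, rng)
```

Both samplers produce outputs with the same mean and variance. The local version draws one Gaussian per output unit rather than one per weight, which is why it is the cheaper choice during training.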
Appendix C Proofs of theorems
[Task-nuisance decomposition] Given a joint distribution , where is a discrete random variable, we can always find a random variable independent of such that , for some deterministic function .
Proof. Fix to be the uniform distribution on . We claim that, for a fixed value of , there is a function such that , where denotes the push-forward map of measures. Given the claim, let . Since is a discrete random variable, is easily seen to be a measurable function and by construction . To see the claim, notice that, since there exists a measurable isomorphism between and (Theorem 3.1.1 of Berberian (1988)), we can assume without loss of generality that . In this case, by definition, we can take where is the cumulative distribution function of.
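The construction in this proof is inverse-transform sampling, which the following sketch illustrates on a toy example. The setting is our assumption: the task variable takes three values, the data given the task follow an exponential distribution whose rate depends on the task, and the nuisance is uniform on the unit interval and independent of the task. The function name and the rates are hypothetical.

```python
import numpy as np

def sample_x_given_y(y, n, rates):
    """Deterministic map f(y, n): inverse CDF of Exponential(rates[y]) applied to n."""
    # CDF is F(x) = 1 - exp(-rate * x), so F^{-1}(n) = -log(1 - n) / rate.
    return -np.log1p(-n) / rates[y]

rng = np.random.default_rng(0)
rates = np.array([0.5, 1.0, 4.0])            # hypothetical conditional distributions
y = rng.integers(0, 3, size=300_000)          # discrete task variable
n = rng.random(y.shape)                       # nuisance: uniform on [0, 1), independent of y
x = sample_x_given_y(y, n, rates)             # data as a deterministic function of (y, n)
```

All the randomness of the data given the task is carried by the uniform nuisance, while the map itself is deterministic, as the proposition requires.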
[Invariance and minimality] Let be a nuisance for the task and let be a sufficient representation of the input . Suppose that depends on only through (i.e., ). Then,
Moreover, there exists a nuisance such that equality holds up to a (generally small) residual
where . In particular , and whenever is a deterministic function of . Under these conditions, a sufficient statistic is invariant (maximally insensitive) to nuisances if and only if it is minimal.
Proof. By hypothesis, we have the Markov chain ; therefore, by the DPI, we have . The first term can be rewritten using the chain rule as , giving us
Now, since and are independent, . In fact,
Substituting in the inequality above, and using the fact that is sufficient, we finally obtain
Moreover, let be as in Section 2.2. Then, since is a deterministic function of and , we have
with defined as above. Using the sufficiency of , the previous inequality for , and the DPI, we get the chain of inequalities
from which we obtain the desired bounds for .
While the proof of the following theorem is quite simple, some clarifications on the notation are in order: We assume, following a Bayesian perspective, that the data is generated by some generative model , where the parameters of the model are sampled from some (unknown) prior . Given the parameters , the training dataset is composed of i.i.d. samples from the unknown distribution . The output of the training algorithm on the dataset is a (generally simple, e.g., normal or log-normal) distribution over the weights. Putting everything together, we have a well-defined joint distribution .
Given the weights , the network then defines an inference distribution , which we know and can compute explicitly. Another distribution, which instead we do not know, is , which is obtained from and expresses the optimal inference we could perform on the labels using the information contained in the weights. In a well-trained network, we want the distribution approximated by the network to match the optimal distribution .
Finally, recall that the conditional entropy is defined as
where can be one random variable or a tuple of random variables. When not specified, it is assumed that the cross-entropy is computed with respect to the unknown underlying data distribution . Similarly, the conditional cross-entropy is defined as
[Information Decomposition] Let denote the training dataset, then for any training procedure, we have
Proof. Recall that cross-entropy can be written as
so we only have to prove that
which is easily done using the following identities:
[Information in the weights] Under the previous modeling assumptions, the upper-bound to the information that the weights contain about the dataset is
where the constant is arbitrary due to the improper prior.
Proof. Recall that we defined the upper-bound as
where is a factorized log-uniform prior. Since the KL divergence is reparametrization invariant, we have:
where we have used the formula for the entropy of a Gaussian and the fact that the KL divergence of a distribution from the uniform prior is the negative entropy of the distribution modulo an arbitrary constant.
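The two facts used in this proof can be combined into a short numerical sketch. We assume, as in the log-normal noise model of Appendix B, that each weight is Gaussian in the logarithmic parametrization with its own variance, here called `alpha` (our name). The KL divergence from the uniform prior in that parametrization is then minus the sum of the Gaussian entropies, plus a constant that we leave as a free parameter. Function names are hypothetical.

```python
import numpy as np

def gaussian_entropy(variance):
    """Differential entropy, in nats, of a 1-D Gaussian with the given variance."""
    return 0.5 * np.log(2 * np.pi * np.e * variance)

def information_upper_bound(alpha, constant=0.0):
    """Minus the summed Gaussian entropies of the log-weights, up to `constant`."""
    alpha = np.asarray(alpha, dtype=float)
    return float(-np.sum(gaussian_entropy(alpha)) + constant)

alpha_small = np.array([0.01, 0.01, 0.01])    # little noise: weights are precise
alpha_large = np.array([1.0, 1.0, 1.0])       # more noise: weights are coarse
gap = information_upper_bound(alpha_small) - information_upper_bound(alpha_large)
```

Only differences of this quantity are meaningful, since the constant is arbitrary. In the example, lowering the noise variance of three weights from 1 to 0.01 raises the bound by 1.5 log(100) nats.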
[Flat minima have low information] Let be a local minimum of the cross-entropy loss , and let be the Hessian at that point. Then, for the optimal choice of the posterior centered at that optimizes the IB Lagrangian, we have
where and denotes the nuclear norm.
Proof. First, we switch to a logarithmic parametrization of the weights, and let (we can ignore the sign of the weights since it is locally constant). In this parametrization, we can approximate the IB Lagrangian to second order as
where . Now, notice that since