Several pleasant features underlie the success of deep neural networks (DNNs): the scarcity of bad minima encountered in their optimization Draxler et al. (2018), their ability to generalize well despite being heavily over-parametrized Neyshabur et al. (2018, 2014) and expressive Zhang et al. (2016), and their ability to generate internal representations which generalize across different domains and tasks Yosinski et al. (2014); Sermanet et al. (2013). Our current understanding of these features is, however, largely empirical. Thus the important task of designing more robust DNNs which train faster and allow for transfer of knowledge across domains (transfer learning) involves various ad-hoc choices, trial and error, and hard-to-teach craftsmanship.
While training and generalization can be analyzed in the limit of wide networks Mei et al. (2018); De Palma et al. (2018), the issue of internal representations and transfer learning is inherently depth related. The practical setup of transfer learning Yosinski et al. (2014); Sermanet et al. (2013) (and some semi-supervised learning schemes Kingma et al. (2014)) typically involves training a DNN on one task with large amounts of data (say, image classification), cutting and freezing several of the lowest layers of that DNN, adding a smaller DNN on top of these frozen layers, and training it for a second task (say, localization of objects in images) with a smaller dataset. The fact that transfer learning often works quite well implies that, to some degree, layers in a DNN learn data representations which are "useful", or aid in solving the task, even without very specific knowledge of the particular weights of subsequent layers.
A way of formalizing the usefulness of internal representations is to consider layer-wise greedy optimization (LEGO). Indeed, if weight-specific knowledge of subsequent layers is unimportant, it should be possible to train each layer of the DNN individually. Such optimization should use, at most, knowledge about the architecture of subsequent layers and the task at hand. Thus a set of layer-wise loss functions should exist which depend, at most, on the architecture and the task. Such loss functions would quantify the set of useful representations each layer should aim for, thereby quantifying the role of a layer in a DNN. Furthermore, such loss functions would in principle allow one to determine whether a layer can be successfully transferred, by measuring how well these useful representations agree between different architectures and tasks. The ability to draw analytic insights from such layer-wise loss functions depends heavily on how explicit they are. While many ideas for such layer loss functions have been proposed, to the best of our knowledge, the ones which are explicit do not yield state-of-the-art performance and the ones which yield state-of-the-art performance are not explicit (see review below).
How can one obtain such explicit loss functions? Here we turn to a recent work analyzing very wide networks Lee et al. (2018). This work capitalizes on the fact that, in the infinite-width limit, fully connected DNNs, when marginalized over their internal parameters, behave as Gaussian Processes (GPs). Such a GP is fully characterized by its covariance-function ($K(x,x')$), which measures how two different inputs ($x$, $x'$) correlate at the output of the DNN when marginalized over the weights' distribution. As Bayesian Inference can be carried out exactly on GPs Rasmussen and Williams (2005), one has an analytic handle on Bayesian Inference in wide DNNs. Notably, one may worry that infinitely wide DNNs with an infinite number of parameters will be of little use due to over-fitting; however, that is not the case in practice. In fact, various works show that the wider the network, the better it seems to generalize Neyshabur et al. (2018, 2014). Interestingly, there is also evidence that GP predictions remain a good approximation even for networks of depth 10000 Xiao et al. (2018) at initialization. Here we seek to leverage this GP viewpoint for constructing layer-wise loss functions.
Our main contributions are as follows:
1. We derive and test a novel set of explicit supervised layer-wise loss functions, Deep Gaussian Layer-wise losses (DGLs), for fully connected DNNs. The DGLs lead to state-of-the-art performance on MNIST and CIFAR10 when used in LEGO and can also be used to monitor standard end-to-end optimization. The DGLs are architecture dependent, but only through a few effective parameters.
2. We analyze in depth the DGL of a pre-classifier layer and use this analysis to shed light on issues with the Information Bottleneck (IB) approach to DNNs.
3. We show that the covariance-functions of finite-width DNNs agree quite well with those of very wide networks, and suggest a fitting ansatz which makes this agreement even tighter.
4. We provide strong evidence that the GP approach to DNNs is an excellent approximation for the behavior of DNNs with widths as small as 20 neurons.
Related work: The idea of analyzing DNNs layer by layer has a long history. Several early successes of deep networks were obtained using LEGO strategies. In particular, good generative models of hand-written digits Hinton et al. (2006) and phonetics classifiers Mohamed et al. (2012) were trained using an unsupervised (i.e. label-unaware) LEGO strategy, which for the latter work was supplemented by stochastic gradient descent (SGD) fine-tuning. Following some attempts to perform supervised LEGO Bengio et al. (2006), the common practice became to use LEGO as a pre-training initialization protocol LeCun et al. (2015). As simpler initialization protocols came along Glorot and Bengio (2010), SGD on the entire network (end-to-end) became the common practice. More recent works include several implicit loss functions based on IB, all having in common that an auxiliary DNN has to be trained in order to evaluate the loss as well as . More analytic approaches include an unsupervised LEGO training algorithm Kadmon and Sompolinsky (2016); Meir and Domany (1988) followed by a classifier for datasets resembling Gaussian mixtures, a biologically inspired unsupervised algorithm, and target methods where layers are trained to fit specific targets chosen by a backwards pass on the network Lee et al. (2014).
II Gaussian Processes and finite-width DNNs
Here we briefly survey relevant results on GPs Rasmussen and Williams (2005) and their covariance-functions. Gaussian Processes are a generalization of multi-variable Gaussian distributions to a distribution over functions ($f(x)$) Rasmussen and Williams (2005). Being Gaussian, they are completely defined by their first and second moments. The first is typically taken to be zero, and the second is known as the covariance-function ($K(x,x') = \langle f(x) f(x') \rangle$, where $\langle \cdot \rangle$ denotes expectation under the GP distribution). In addition, GPs allow for exact Bayesian Inference. An important conceptual step here is to view the function $f(x)$ as an (infinite-dimensional/non-parametric) representation of the model parameters.
The equivalence between GPs and very wide DNNs stems from the fact that, in the infinite width (channel) limit, fully-connected (convolutional) DNNs with an uncorrelated prior on the weights are equivalent to GPs Cho and Saul (2009); Novak et al. (2018). Here $f(x)$ becomes the DNN's output and $x$ denotes the input to the DNN. Alternatively stated, the probability distribution on the space of functions generated by a DNN with random weights is a Gaussian one. Consequently, exact Bayesian Inference on such DNNs is possible Lee et al. (2018); Cho and Saul (2009) and is explicitly given by

$$f_* = \sum_{n,m} K(x_*, x_n)\, [K + \sigma^2 I]^{-1}_{nm}\, y_m \qquad (1)$$

where $x_*$ is a new datapoint, $y_m$ are the training targets, $x_n$ are the training data-points, $K_{nm} = K(x_n, x_m)$ is the covariance-matrix (the covariance-function projected on the training dataset), $\sigma^2$ is a regulator corresponding to a noisy measurement of $f$, and $I$ is the identity matrix. Notably, while not written explicitly, $f_*$ is $x_*$ dependent. Some intuition for this formula can be gained by verifying that, for $\sigma^2 \to 0$, $x_* = x_n$ yields $f_* = y_n$.
Implicit in the above matrix inversion is the full Bayesian integration over all DNN weights, weighted by their likelihood given the dataset. Gaussian Processes in which the covariance-function is derived as above from infinite-width DNNs are called Neural Network Gaussian Processes (NNGPs) Lee et al. (2018).
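The mean GP prediction discussed above can be sketched numerically. The snippet below is a minimal illustration, assuming the standard GP regression formula from Rasmussen and Williams (2005); the function names, the squared-exponential stand-in kernel, and the default regulator are illustrative choices and are not taken from this work.

```python
import numpy as np

def gp_posterior_mean(K_train, K_star, y, sigma2=1e-6):
    """Mean GP prediction f_* = K_star [K_train + sigma2 I]^{-1} y.

    K_train: (N, N) covariance matrix on the training points.
    K_star:  (M, N) covariance between M new points and the training points.
    y:       (N, C) training targets.
    sigma2:  noise regulator (illustrative default).
    """
    n = K_train.shape[0]
    # Solve the linear system rather than forming the inverse explicitly.
    alpha = np.linalg.solve(K_train + sigma2 * np.eye(n), y)
    return K_star @ alpha

def rbf_kernel(a, b, length=1.0):
    """Stand-in covariance-function (squared exponential), for illustration only."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))      # synthetic training inputs
Y = rng.normal(size=(8, 2))      # synthetic training targets
K = rbf_kernel(X, X)

# Predicting on the training points with a vanishing regulator
# should reproduce the training targets.
pred_on_train = gp_posterior_mean(K, K, Y, sigma2=1e-10)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical convenience; an NNGP covariance-function could be substituted for `rbf_kernel` without changing the rest.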
Several works have noted quantitative and qualitative similarities between end-to-end SGD training and Bayesian Inference Welling and Teh (2011); Mandt et al. (2017); Jacot et al. (2018); Chaudhari and Soatto (2018); Lee et al. (2018). In particular for the CIFAR-10 and MNIST datasets, Bayesian predictions and SGD predictions were shown to be in tight agreement Lee et al. (2018). Thus we shall assume that NNGP performance is a monotonic function of the average SGD performance for wide DNNs.
Turning away from the infinite-width limit, a finite-width DNN cannot be viewed as a GP, at least not strictly. However, one may still attempt to approximate it using a GP, in what can be thought of as a mean-field/Gaussian approximation. The covariance-function of this GP approximation is $K(x,x') = \langle f(x) f(x') \rangle$, as defined above. If non-Gaussian corrections are small, the performance of Bayesian inference using the GP approximation would be a monotonic function of SGD performance. We shall assume from now on that this is the case, i.e. that the two performances are related monotonically. This assumption is supported below by our numerical results.
Still, an important problem remains, which is how to calculate this GP approximation. Indeed, the space of all possible covariance-functions ($K(x,x')$) for a high-dimensional $x$ is huge, thus requiring large amounts of data for a proper fit. Here we make the simplifying assumption, inspired by recent results Xiao et al. (2018), that the approximating GPs have the functional form of the infinite-width NNGP with renormalized prior parameters. Conveniently, NNGP covariance-functions can be written using an explicit formula which involves the non-linearity of the network and the prior on the weights and biases ($\sigma_{w,l}^2$, $\sigma_{b,l}^2$) at all layers ($l = 0, \dots, L$, where $L$ is the DNN depth). Focusing on the case of the commonly used ReLU activations, the resulting approximate covariance-function of a depth-$L$ network at infinite width ($K_L$) is given by the following recursive relation Cho and Saul (2009)

$$K_l(x,x') = \sigma_{b,l}^2 + \frac{\sigma_{w,l}^2}{2\pi} \sqrt{K_{l-1}(x,x)\, K_{l-1}(x',x')} \left( \sin\theta_{l-1} + (\pi - \theta_{l-1}) \cos\theta_{l-1} \right) \qquad (2)$$

where $\theta_l = \arccos\left( K_l(x,x') / \sqrt{K_l(x,x)\, K_l(x',x')} \right)$ and $K_0(x,x') = \sigma_{b,0}^2 + \sigma_{w,0}^2\, x \cdot x' / d_{in}$, with $d_{in}$ the input dimension. As shown in Fig. (1), where the $\sigma_{w,l}^2$ and $\sigma_{b,l}^2$ are taken from their microscopic values (MF) or from a fit (FIT), $K_L$ agrees well with the empirical (sampled) covariance-function. We note in passing that similar explicit formulas also exist for error-function activations.
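As a rough illustration of the ReLU recursion referred to above, the sketch below iterates the arc-cosine kernel update in the form given by Cho and Saul (2009) and Lee et al. (2018). It assumes, for simplicity, the same weight and bias variances at every layer and non-zero inputs; the function name and default values are hypothetical.

```python
import numpy as np

def relu_nngp_kernel(X1, X2, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP covariance between rows of X1 and X2 after `depth` ReLU layers (sketch).

    Assumes shared (sigma_w2, sigma_b2) across layers and non-zero input rows.
    """
    d_in = X1.shape[1]
    # Base case: covariance after the first affine layer.
    k12 = sigma_b2 + sigma_w2 * (X1 @ X2.T) / d_in
    k11 = sigma_b2 + sigma_w2 * (X1 * X1).sum(1) / d_in   # K(x, x) for rows of X1
    k22 = sigma_b2 + sigma_w2 * (X2 * X2).sum(1) / d_in   # K(x', x') for rows of X2
    for _ in range(depth):
        norm = np.sqrt(np.outer(k11, k22))
        cos_t = np.clip(k12 / norm, -1.0, 1.0)            # guard against round-off
        theta = np.arccos(cos_t)
        k12 = sigma_b2 + sigma_w2 / (2 * np.pi) * norm * (
            np.sin(theta) + (np.pi - theta) * cos_t
        )
        # On the diagonal theta = 0, so the update reduces to a linear map.
        k11 = sigma_b2 + 0.5 * sigma_w2 * k11
        k22 = sigma_b2 + 0.5 * sigma_w2 * k22
    return k12

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))
K = relu_nngp_kernel(X, X, depth=3)
```

With `sigma_w2 = 2` and no bias the diagonal of the kernel is preserved from layer to layer, which gives a quick sanity check on an implementation.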
III Deriving the Deep Gaussian Layer-wise Loss functions
To derive the DGL functions let us start with a LEGO strategy which should be optimal in terms of performance yet highly non-explicit (see Fig. 2). We begin from the input layer and consider it as our current trainee layer. For every set of its parameters we perform standard end-to-end training of the entire network between the trainee layer and the classifier (the top-network), with the trainee layer's parameters kept frozen. Next we repeat this training infinitely many times and treat the average performance as a loss function for the trainee layer. Then we optimize the trainee layer's parameters such that this loss is minimal. Subsequently we act on the dataset using the optimized trainee layer to obtain the representation of the dataset in activation space. We then repeat the process for the next layer, with these representations as inputs. This process continues until all activated layers have been trained. The last classifier layer is then trained using MSE loss.
Provided that freezing the parameters of the trainee layer does not induce optimization issues in the top-network SGD, the above procedure would yield the same performance as average end-to-end SGD. Such optimization issues in the top-network, better known as co-adaptation issues Yosinski et al. (2014), arise from tight coupling between top-network and trainee-layer weights. They imply that the trainee-layer representation learned by standard end-to-end training is highly correlated with the top-network and thus inadequate for transfer learning Yosinski et al. (2014).
Co-adaptation is considered adversarial to learning also outside the scope of LEGO and transfer learning. Indeed the success of dropout regularization is partially associated with its ability to mitigate co-adaptation Srivastava et al. (2014). Additionally, co-adaptation being a local minima issue, is more likely to occur away from the over-parametrized regime where modern practical interest lies. We thus make a second assumption which is that co-adaptation effects are small. Note that if co-adaptation is unavoidable one may still group the co-adapting layers into a block of layers and treat this block as an effective layer in the algorithm discussed below.
Assuming no co-adaptation, as well as the monotonic relation between GP and SGD performance discussed above, we shall now derive the DGL functions by approximating the average top-network performance using NNGP Bayesian prediction. To this end let us either consider a regression problem with data-points ($x_n$) and targets ($y_n$), or rephrase a classification problem as a regression problem by taking $y_n$ to be a one-hot encoding of the categorical labels. For concreteness we focus on the bottom/input layer (see Fig. 2), which acts on the $x_n$ and maps each to its value in the activation space of the input layer ($h_n$). Consider training the top-network on the dataset represented by ($h_n$, $y_n$). Taking the GP approximation, we consider Eq. (1) with $x_n$ replaced by $h_n$ and $K$ replaced by the covariance-function of the top-network. The resulting equation now describes how an unseen activation ($h_*$) would be classified by a trained top-network. To make this into a loss function for the training dataset, rather than for an unseen point, we adopt a leave-one-out cross-validation strategy: we iterate over all data-points, take each one out in turn, treat it as an unseen point, and measure how well we predict its label using the mean Bayesian NNGP prediction.
Assuming the covariance-matrix of the top-network has no kernel, taking $\sigma^2 \to 0$, and performing some straightforward algebra (see App. I.), the MSE loss of the leave-one-out predictions can be expressed using the inverse of this covariance-matrix over the training dataset.
A few technical comments are in order. The DGL is a function of the trainee layer's parameters via the activations, which enter the covariance-matrix whose inverse appears in the loss. Apart from the need to determine the top-network's effective parameters numerically or through meta-optimization, the DGL is an explicit function of all the points in the dataset. For the case of ReLU networks without biases, it can be seen from Eq. 2 that all of the weight variances collapse into one scale parameter. Lastly, we stress that this loss gives a score to a full dataset rather than to points in the dataset.
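The leave-one-out construction described above can be checked numerically. The sketch below is not the DGL formula of this work; it uses the standard GP leave-one-out identity from Rasmussen and Williams (2005), in which the held-out residual of point $i$ equals $[B y]_i / B_{ii}$ with $B$ the inverse of the regularized covariance-matrix, and compares it against explicit refitting. The normalization (mean over points, sum over outputs), the stand-in kernel, and all names are assumptions made for illustration.

```python
import numpy as np

def loo_mse_closed_form(K, Y, sigma2=0.0):
    """Leave-one-out MSE of the GP mean prediction from a single matrix inverse."""
    n = K.shape[0]
    B = np.linalg.inv(K + sigma2 * np.eye(n))
    # Standard GP identity: held-out residual of point i is [B Y]_i / B_ii.
    resid = (B @ Y) / np.diag(B)[:, None]
    return np.mean(np.sum(resid**2, axis=1))

def loo_mse_brute_force(K, Y, sigma2=0.0):
    """Same quantity, computed by refitting with each point held out in turn."""
    n = K.shape[0]
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        K_rest = K[np.ix_(keep, keep)] + sigma2 * np.eye(n - 1)
        pred = K[i, keep] @ np.linalg.solve(K_rest, Y[keep])
        errs.append(np.sum((Y[i] - pred) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
H = rng.normal(size=(12, 4))                           # synthetic activations
Y = np.eye(2)[rng.integers(0, 2, size=12)]             # one-hot targets, two classes
K = np.exp(-0.5 * ((H[:, None] - H[None]) ** 2).sum(-1))  # stand-in covariance-matrix

fast = loo_mse_closed_form(K, Y, sigma2=1e-3)
slow = loo_mse_brute_force(K, Y, sigma2=1e-3)
```

The closed form needs one inverse for the whole dataset, which is what makes a loss of this type an explicit function of all data-points at once.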
We turn to discuss the structure and symmetries of the DGL. As the DGL depends on the layer representation only through the covariance-matrix, which in turn depends only on the covariance-function of the top-network, it inherits all the symmetries of the latter. For fully connected top-networks it is thus invariant under any orthogonal transformation ($O(d_l)$, where $d_l$ is the dimension of the vector of the $l$'th layer-representation). An additional structure is that the DGL depends on the targets only through the dot-product of the targets which, for the one-hot encoded case, means it is zero unless the labels are equal. The covariance-dependent central piece is "unsupervised", or unaware of the labels. It is a negative definite matrix, ensuring that the optimal DGL is zero, as one expects from a proxy to the MSE loss. One can think of this matrix as a measure of the sample-similarity bias of the DNN (more specifically, the top-network): when its entry is small (large) for two data-points, the network tends to associate different (similar) targets to them in any classification task. Crucially, this entry is not a simple pairwise function of the two data-points, but rather depends on the entire dataset through the covariance-matrix inversion. The DGL function can thus be interpreted as the sample similarity (in the context of the dataset) weighted by the fixed-target similarity.
IV The case of a depth one network
It is illustrative to demonstrate our approach on a case where the inversion of the covariance-matrix can be carried out explicitly. To this end we consider a DNN consisting of a fully-connected or convolutional bottom/input layer with weights $W$ and any type of activation. This layer outputs a $d$-dimensional activation vector ($h$), which is fed into a linear layer with two outputs. We consider a binary regression task with two targets, which can also be thought of as a binary classification task.
Notably in this shallow scenario many of the assumptions we made in deriving the DGL functions are exact. A linear layer of any width, when marginalized over its weights, is a Gaussian Process whose covariance-function is that of the infinite-width limit. Moreover the loss landscape of an MSE-classifier is convex and therefore co-adaptation effects are absent.
To express the DGL function for this input layer, our first task is to find the covariance-function of the top-network, namely the linear layer. Assuming a Gaussian prior of variance $\sigma_w^2$ on each of the linear layer's matrix weights and zero bias, it is easy to show (App. II.) that

$$K_{nm} = \sigma_w^2\, h_n \cdot h_m = \sigma_w^2\, [H H^T]_{nm}$$

where $H$ is an $N$ by $d$ matrix given by $H_{ni} = [h_n]_i$.
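The covariance-function of a random linear layer can be checked by direct sampling. The sketch below assumes independent Gaussian weights of variance `sigma_w2` (with no additional width-dependent normalization) and zero bias, so that a single output $f(h) = w \cdot h$ has $\langle f(h) f(h') \rangle = \sigma_w^2\, h \cdot h'$; the sample size and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma_w2, n_samples = 5, 0.5, 200_000
h1, h2 = rng.normal(size=d), rng.normal(size=d)   # two activation vectors

# Sample many independent linear read-outs f(h) = w . h with w_i ~ N(0, sigma_w2).
W = rng.normal(scale=np.sqrt(sigma_w2), size=(n_samples, d))
f1, f2 = W @ h1, W @ h2

empirical = np.mean(f1 * f2)                      # Monte Carlo estimate of <f(h1) f(h2)>
analytic = sigma_w2 * (h1 @ h2)                   # sigma_w^2 h1 . h2
std_err = np.std(f1 * f2) / np.sqrt(n_samples)    # statistical error of the estimate
```

The two numbers should agree to within a few standard errors; no wide-network limit is needed here because a linear layer of any width is exactly Gaussian once its weights are marginalized.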
To facilitate the analysis we next make the assumption that the number of data-points ($N$) is much larger than the number of labels and also take a vanishing regulator ($\sigma^2 \to 0$). As a result we find that the covariance-matrix has a kernel whose dimension is at least $N - d$. To leading order in $\sigma^2$ one finds that $[K + \sigma^2 I]^{-1} = \sigma^{-2} P$, where $P$ is the projector onto the kernel of $K$. This projector is given by (see App. 3)

$$P = I - H (H^T H)^{-1} H^T$$

with $H$ the $N$ by $d$ matrix whose rows are the activation vectors. Indeed one can easily verify that $P^2 = P$ and that $K P = 0$, as required. Plugging these results into Eq. 3 one finds that to leading order in $\sigma^2$
The above equation tells us how to train a layer whose output ($h$) gets fed into a linear classifier. Let us first discuss its symmetry properties. The first term in this equation is constant under the optimization of the layer's weights, hence we may discard it. The second term is invariant under any rotation ($O(d)$) of the dataset in activation space. Indeed, such transformations can be carried out by the classifier itself, and hence such changes to the dataset should not affect the performance of the classifier. A bit unexpected is that the DGL is also invariant under the bigger group of invertible linear transformations ($GL(d)$). While a generic classifier can indeed undo any linear transformation, the prior we put on its weights limits the extent to which it can undo a transformation with vanishing eigenvalues. This enhanced symmetry is a result of taking the $\sigma^2 \to 0$ limit, which allows the Gaussian Process to distinguish vanishingly small differences in the activations. In practice a finite $\sigma^2$ is often needed for numerical stability, and this breaks the symmetry down to an $O(d)$ symmetry.
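The kernel projector used in this section can be verified numerically. The sketch below assumes a linear top-network covariance-matrix of the form $K = \sigma_w^2 H H^T$, with $H$ the $N$ by $d$ matrix of activations having full column rank, and takes the projector to be $P = I - H (H^T H)^{-1} H^T$; the sizes and regulator value are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma_w2 = 30, 4, 1.0
H = rng.normal(size=(N, d))              # rows are activation vectors h_n
K = sigma_w2 * H @ H.T                   # covariance-matrix of a linear top-network

# Projector onto the kernel of K (assumes H has full column rank).
P = np.eye(N) - H @ np.linalg.solve(H.T @ H, H.T)

# For a small regulator, sigma2 * [K + sigma2 I]^{-1} should approach P.
sigma2 = 1e-6
inv_reg = np.linalg.inv(K + sigma2 * np.eye(N))
kernel_dim = int(round(np.trace(P)))     # dimension of the kernel, here N - d
```

Because the rank of $K$ is at most $d$, the trace of the projector counts the $N - d$ directions in which the regularized inverse diverges as the regulator is removed.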
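Next we discuss how the DGL sees the geometry of the dataset. Notably, $\Sigma = \frac{1}{N} H^T H$, with $H$ the $N$ by $d$ matrix whose rows are the activation vectors $h_n$, is the covariance matrix of the dataset in activation space. Since it is positive definite we can write $\Sigma = \Sigma^{1/2} \Sigma^{1/2}$, and therefore $\Sigma^{-1} = \Sigma^{-1/2} \Sigma^{-1/2}$. We then define $\tilde{h}_n = \Sigma^{-1/2} h_n$ as the normalized dataset. Indeed, its covariance matrix is the identity. Thus, $h_n^T \Sigma^{-1} h_m = \tilde{h}_n \cdot \tilde{h}_m$. In these coordinates the loss is a simple pairwise interaction between normalized datapoints, which tends to make points with equal (opposite) labels closer (far apart). It thus favors the formation of separate droplets in the normalized representation, as illustrated in Fig. 3 panel (B).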
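The normalization step and the enlarged symmetry discussed above can be illustrated with a short whitening example. The sketch assumes the dataset covariance is the uncentered second moment $H^T H / N$ of the activations and that it is positive definite; the helper name and the synthetic data are hypothetical.

```python
import numpy as np

def normalize_dataset(H):
    """Whiten activations: h_n -> Sigma^{-1/2} h_n with Sigma = H^T H / N."""
    N = H.shape[0]
    Sigma = H.T @ H / N
    evals, evecs = np.linalg.eigh(Sigma)          # Sigma is symmetric positive definite
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return H @ Sigma_inv_sqrt

rng = np.random.default_rng(0)
# Synthetic activations with very different scales along different directions.
mix = np.array([[3.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 0.1]])
H = rng.normal(size=(200, 3)) @ mix
H_tilde = normalize_dataset(H)

# Pairwise dot-products of the normalized data before and after an
# invertible linear map A acting on every activation vector.
A = np.eye(3) + 0.2 * rng.normal(size=(3, 3))
G = H_tilde @ H_tilde.T
H_tilde_A = normalize_dataset(H @ A.T)
G_transformed = H_tilde_A @ H_tilde_A.T
```

The Gram matrix of the normalized data is unchanged by the linear map, which is the invariance under invertible linear transformations described in the text.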
The fact that it encourages droplets in the normalized representation rather than in the representation itself is very sensible. Indeed, the classifier's performance is a measure of the linear separability of the dataset. This means that points with opposite labels should be on opposite sides of a hyper-plane. However, no further improvement in train performance is gained by making equal-label points closer in Euclidean distance. Hence a pairwise interaction (encouraging a high dot-product between similar labels) without normalization is unlikely to be a faithful measure of linear separability. Once the dataset is normalized, the spread over the directions along the hyper-plane is made to be of order one, hence equal-label points do look like they bunch together into droplets (see Fig. 3). In fact, based on the above symmetry discussion, one finds that the DGL favors any geometry given by any invertible linear transformation acting on a dataset representation consisting of two well-separated droplets. This is a sensible measure of linear separability for generic datasets.
V Contrast with Information Bottleneck approaches
It is interesting to compare the DGL with a different loss function drawn from recent works on the information bottleneck (IB) Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017). In those works it was argued that the role of a layer is to compress the layer's representation while maintaining the information on the labels. Formally, this means minimizing the mutual information quantity

$$\mathcal{L}_{IB} = I(h; x) - \beta\, I(h; y)$$

for large $\beta$, where $h$ denotes the layer representation, $x$ the input and $y$ the label. A subtle yet important issue here is that, for deterministic networks, these mutual information quantities are either constant or infinite, depending on how one views the entropy of a point. To overcome this, the original works used binning of some linear dimension in activation space, and other works added Gaussian noise of fixed variance to the activations Saxe et al. (2018). In the case in which three data-points rarely come within a bin size or a noise width of one another, both regularization schemes effectively lead to a pairwise interaction between data-points Kolchinsky and Tracey (2017); Goldfeld et al. (2018) (see also App. 2.). Notably, this is almost always the case at high dimension or when the regulator is taken to zero. For the Gaussian regulator the resulting loss is particularly simple (see App. 2.): it is built from a Gaussian-decaying pairwise interaction on the scale of the noise width, given explicitly by the difference in entropy between a $d$-dimensional Gaussian distribution with the noise variance and a mixture of two such Gaussians at a given distance.
At the input of the linear classifier, one can easily see the differences between the layer-representations favored by the IB loss and by the DGL (see Fig. 3). The former, being unaware of the classifier or the architecture, simply encourages the formation of droplets which, as argued previously, are not a faithful measure of linear separability. To achieve this unnecessary goal it is likely to compromise on the margin. The latter, being aware of the classifier, encourages linear separability. We conclude that the IB loss is unlikely to be a good layer-wise loss function close to the classifier. This lack of architecture awareness of IB (regulated using binning or Gaussian noise) is generally concerning.
VI Numerical tests
Here we report several numerical experiments aimed at testing whether the DGLs can monitor standard end-to-end optimization and at measuring the effectiveness of the DGL functions in LEGO. Experiments were conducted on three datasets: MNIST, with 10k training samples randomly selected from the full MNIST training set and balanced to have an equal number of samples from each label; CIFAR10, with 10k training samples similarly selected and balanced in terms of labels; and Binary MNIST, with only the digits 1 and 7 and 2k training samples, similarly selected and balanced in terms of labels. For each dataset, an additional validation set of size equal to the training set was randomly selected from the full respective training set, excluding the samples selected for the training set. The validation set was balanced in terms of labels. For MNIST and CIFAR10 the reported test set was the respective standard test-set, and for Binary MNIST the reported test set consisted of the samples from the standard test-set with labels 1 and 7. The test sets were not balanced in terms of labels.
All experiments were conducted using fully-connected DNNs, with depth , consisting of activated layers with fixed width () and a linear classifier layer with output dimension given by the number of classes. The targets were zero-mean one-hot encoded in all experiments except for , where the labels were one-hot encoded. The loss function for all non-DGL training was MSE loss.
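The data preparation described above can be sketched as follows. This is a minimal illustration on synthetic labels, not the code used for the experiments: the helper names are hypothetical, and "zero-mean one-hot" is read here as subtracting $1/C$ from every entry of a $C$-class one-hot vector, which is an assumption.

```python
import numpy as np

def balanced_split(labels, n_train, n_val, rng):
    """Pick disjoint, label-balanced training and validation index sets."""
    classes = np.unique(labels)
    per_train, per_val = n_train // len(classes), n_val // len(classes)
    train_idx, val_idx = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.append(idx[:per_train])
        val_idx.append(idx[per_train:per_train + per_val])
    return np.concatenate(train_idx), np.concatenate(val_idx)

def zero_mean_one_hot(labels, n_classes):
    """One-hot targets shifted so that each row sums to zero (assumed convention)."""
    return np.eye(n_classes)[labels] - 1.0 / n_classes

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=5000)           # synthetic stand-in for dataset labels
train_idx, val_idx = balanced_split(labels, n_train=1000, n_val=1000, rng=rng)
targets = zero_mean_one_hot(labels[train_idx], 10)
```

Drawing both subsets from one per-class permutation guarantees that the validation samples exclude the training samples.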
For each dataset we conducted the following procedure:
1. End-to-end SGD training under MSE loss.
2. Evaluation of the mean-field covariance function of the end-to-end-trained network.
3. DGL-monitored end-to-end SGD training under MSE loss, with the same hyperparameters as in step 1. and with the mean-field covariance function evaluated at step 2.
4. LEGO training of all activated layers under DGL, using the mean-field covariance function evaluated at step 2. The activated layers were optimized sequentially, starting from the input layer. Each layer was optimized once, then kept frozen during the optimization of subsequent layers.
5. Training of the linear classifier layer only, under MSE loss, with the activated layers frozen, either at the DGL-optimized weights or at the randomly-initialized values.
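The sequential structure of step 4. can be sketched as a short loop. The code below is only a skeleton under stated assumptions: `lego_train`, its argument layout, and the `optimize_layer` callback are hypothetical names, and the callback stands in for whatever procedure minimizes a layer-wise loss such as the DGL.

```python
import numpy as np

def lego_train(X, layers, optimize_layer):
    """Greedy layer-wise training: optimize each layer once, freeze it, move on.

    layers:         list of (params, forward) pairs, forward(params, inputs) -> activations.
    optimize_layer: callable (params, forward, inputs) -> trained params; a stand-in
                    for minimizing a layer-wise loss on the current inputs.
    """
    inputs, trained = X, []
    for params, forward in layers:
        params = optimize_layer(params, forward, inputs)   # train the current trainee layer
        trained.append(params)                             # kept frozen from here on
        inputs = forward(params, inputs)                   # representation fed to the next layer
    return trained, inputs

relu = lambda W, h: np.maximum(h @ W, 0.0)
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
layers = [(rng.normal(size=(8, 6)), relu), (rng.normal(size=(6, 4)), relu)]
no_op = lambda params, forward, inputs: params             # placeholder optimizer
trained, H_last = lego_train(X, layers, no_op)
```

With the placeholder optimizer the loop reduces to a plain forward pass, which corresponds to the randomly initialized baseline; the final representation would then be passed to the linear classifier trained in step 5.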
End-to-end training was done using either the vanilla SGD optimizer or the Adam optimizer with standard internal parameters. All DGL training was done using the Adam optimizer with standard internal parameters. All training was done with fixed learning rates and weight decay, which were manually selected for each step in each dataset. The best hyperparameters for each step were selected for minimal loss on the validation set.
DGL Monitoring. Figure (4) shows DGL monitoring of end-to-end training (step 2.) of a network with . Even at this small width, DGL tracks end-to-end training very well. Various finer details are discussed in the caption.
DGL LEGO. Table 1 shows the test performances of steps 4. & 5. on the three aforementioned datasets for several depth and width choices. End-to-end test accuracy is taken from Ref. [Lee et al., 2018], apart from Binary MNIST, where we report the test accuracy obtained at step 1. end-to-end training. The Random column serves as a simple baseline for the effect of depth, where we take the randomly initialized network and freeze the weights of all layers apart from the linear classifier.
Following recent works on the Information Bottleneck (IB) theory of deep learning Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017), there has been a surge of works analyzing the layer representations generated by deep neural networks from both a geometrical and an information-theory viewpoint. In this work we argued, both theoretically and numerically, that one can formalize what constitutes a good layer representation explicitly using a set of loss functions: the DGL functions. These loss functions differ from the losses implied by IB in many aspects, but mainly in the fact that they are aware of the architecture of the network. We argued that this is essential, at least close to the classifier.
The DGL functions are well capable of monitoring the optimization of end-to-end training in a layer-wise fashion. Moreover they enable a competitive layer by layer optimization of the network. Although such training is admittedly slower, it has the advantage of generating layer representations with no co-adaptation effects which are likely to be better for transfer learning Yosinski et al. (2014).
To the best of our knowledge, our LEGO approach outperforms all other explicit LEGO approaches (i.e. ones which do not require auxiliary DNNs). Nevertheless, our aim here is not to provide a more powerful algorithm for optimization. Rather, we wish to open an analytic window to the incremental role of DNN layers and the representations they learn. Indeed, the explicit nature of the DGL functions, combined with their high level of structure, symmetry, and empirical accuracy, invites further study regarding the interpretability of layer representations. Understanding these representations would help unravel the inner workings of DNNs and facilitate their use in a more modular fashion across different domains.
- Draxler et al. (2018) F. Draxler, K. Veschgini, M. Salmhofer, and F. A. Hamprecht, arXiv e-prints arXiv:1803.00885 (2018), eprint 1803.00885.
- Neyshabur et al. (2018) B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro, arXiv e-prints arXiv:1805.12076 (2018), eprint 1805.12076.
- Neyshabur et al. (2014) B. Neyshabur, R. Tomioka, and N. Srebro, arXiv e-prints arXiv:1412.6614 (2014), eprint 1412.6614.
- Zhang et al. (2016) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, arXiv e-prints arXiv:1611.03530 (2016), eprint 1611.03530.
- Yosinski et al. (2014) J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (MIT Press, Cambridge, MA, USA, 2014), NIPS’14, pp. 3320–3328, URL http://dl.acm.org/citation.cfm?id=2969033.2969197.
- Sermanet et al. (2013) P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, arXiv e-prints arXiv:1312.6229 (2013), eprint 1312.6229.
- Mei et al. (2018) S. Mei, A. Montanari, and P.-M. Nguyen, Proceedings of the National Academy of Sciences 115, E7665 (2018), ISSN 0027-8424, eprint https://www.pnas.org/content/115/33/E7665.full.pdf, URL https://www.pnas.org/content/115/33/E7665.
- De Palma et al. (2018) G. De Palma, B. Toussi Kiani, and S. Lloyd, arXiv e-prints arXiv:1812.10156 (2018), eprint 1812.10156.
- Kingma et al. (2014) D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014), pp. 3581–3589, URL http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf.
- Lee et al. (2018) J. Lee, J. Sohl-Dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri, in International Conference on Learning Representations (2018), URL https://openreview.net/forum?id=B1EA-M-0Z.
- Rasmussen and Williams (2005) C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) (The MIT Press, 2005), ISBN 026218253X.
- Xiao et al. (2018) L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington, arXiv e-prints arXiv:1806.05393 (2018), eprint 1806.05393.
- Hinton et al. (2006) G. E. Hinton, S. Osindero, and Y.-W. Teh, Neural Computation 18, 1527 (2006), pMID: 16764513, eprint https://doi.org/10.1162/neco.2006.18.7.1527, URL https://doi.org/10.1162/neco.2006.18.7.1527.
- Mohamed et al. (2012) A. Mohamed, G. E. Dahl, and G. Hinton, IEEE Transactions on Audio, Speech, and Language Processing 20, 14 (2012), ISSN 1558-7916.
- Bengio et al. (2006) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, in Proceedings of the 19th International Conference on Neural Information Processing Systems (MIT Press, Cambridge, MA, USA, 2006), NIPS’06, pp. 153–160, URL http://dl.acm.org/citation.cfm?id=2976456.2976476.
- LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 EP (2015), URL http://dx.doi.org/10.1038/nature14539.
- Glorot and Bengio (2010) X. Glorot and Y. Bengio, in Proceedings of the thirteenth international conference on artificial intelligence and statistics (2010), pp. 249–256.
- Kadmon and Sompolinsky (2016) J. Kadmon and H. Sompolinsky, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016), pp. 4781–4789.
- Meir and Domany (1988) R. Meir and E. Domany, Phys. Rev. A 37, 608 (1988), URL https://link.aps.org/doi/10.1103/PhysRevA.37.608.
- Lee et al. (2014) D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio, arXiv e-prints arXiv:1412.7525 (2014), eprint 1412.7525.
- Cho and Saul (2009) Y. Cho and L. K. Saul, in Proceedings of the 22nd International Conference on Neural Information Processing Systems (Curran Associates Inc., USA, 2009), NIPS’09, pp. 342–350, ISBN 978-1-61567-911-9, URL http://dl.acm.org/citation.cfm?id=2984093.2984132.
- Novak et al. (2018) R. Novak, L. Xiao, J. Lee, Y. Bahri, G. Yang, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein, arXiv e-prints arXiv:1810.05148 (2018), eprint 1810.05148.
- Welling and Teh (2011) M. Welling and Y. W. Teh, in Proceedings of the 28th International Conference on International Conference on Machine Learning (Omnipress, USA, 2011), ICML’11, pp. 681–688, ISBN 978-1-4503-0619-5, URL http://dl.acm.org/citation.cfm?id=3104482.3104568.
- Mandt et al. (2017) S. Mandt, M. D. Hoffman, and D. M. Blei, ArXiv e-prints (2017), eprint 1704.04289.
- Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler, ArXiv e-prints (2018), eprint 1806.07572.
- Chaudhari and Soatto (2018) P. Chaudhari and S. Soatto, in International Conference on Learning Representations (2018), URL https://openreview.net/forum?id=HyWrIgW0W.
- Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, J. Mach. Learn. Res. 15, 1929 (2014), ISSN 1532-4435, URL http://dl.acm.org/citation.cfm?id=2627435.2670313.
- Tishby and Zaslavsky (2015) N. Tishby and N. Zaslavsky, ArXiv e-prints (2015), eprint 1503.02406.
- Shwartz-Ziv and Tishby (2017) R. Shwartz-Ziv and N. Tishby, ArXiv e-prints (2017), eprint 1703.00810.
- Saxe et al. (2018) A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, in International Conference on Learning Representations (2018), URL https://openreview.net/forum?id=ry_WPG-A-.
- Kolchinsky and Tracey (2017) A. Kolchinsky and B. D. Tracey, Entropy 19 (2017), ISSN 1099-4300, URL http://www.mdpi.com/1099-4300/19/7/361.
- Goldfeld et al. (2018) Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy, ArXiv e-prints (2018), eprint 1810.05728.
- Rifkin and Klautau (2004) R. Rifkin and A. Klautau, J. Mach. Learn. Res. 5, 101 (2004), ISSN 1532-4435, URL http://dl.acm.org/citation.cfm?id=1005332.1005336.
- (under double blind review https://openreview.net/forum?id=r1Nb5i05tX) (2018) A. (under double blind review https://openreview.net/forum?id=r1Nb5i05tX) (2018), URL https://openreview.net/forum?id=r1Nb5i05tX.
Appendix A Derivation of the DGL functions
Here we consider a multi-label classification dataset () consisting of data points each described by a dimensional vector and a “one-hot” two-dimensional label (target) vector () for each class. As in Rifkin and Klautau (2004); Lee et al. (2018), we treat classification as a regression task where the network’s outputs for a given class are optimized to be close to the one-hot label (MSE loss).
Next we define the -left-out dataset () consisting of all points except the point . Our starting point for defining the DGL is the Bayesian prediction formula for the label vector () of an unseen datapoint () (unseen with respect to )
where is the covariance function projected on the dataset , where is the -minor of or equivalently the covariance-function projected onto , and is the identity matrix in an dimensional space. Note that we choose indices to remain faithful to data-points, so that the indices of are chosen to be the set rather than .
It is convenient, both analytically and numerically, to relate and . To this end we employ a relation between the inverse of a positive-definite matrix () and its minor ()
Notably, since is positive definite and bounded, is also positive definite and so the above denominator is always nonzero. Note that since is positive semi-definite is positive-definite. The difference between the r.h.s. of the above two equations lies solely in the allowed values of ( for the first Eq. and for the second).
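The kind of relation invoked here can be checked numerically. The sketch below assumes the standard Schur-complement identity, in which the inverse of the matrix with row and column i deleted is obtained from the full inverse with the entry [A^{-1}]_{ii} as denominator; the names `inverse_of_minor`, `a`, and `a_inv` are illustrative and do not come from the text, and the identity is offered as a plausible reading of the relation above rather than as the exact equation of the original.

```python
import numpy as np

def inverse_of_minor(a_inv, i):
    """Inverse of A with row/column i deleted, using only A^{-1}.

    Assumed identity (Schur complement), for j, k != i:
      [A_{-i}^{-1}]_{jk} = [A^{-1}]_{jk} - [A^{-1}]_{ji} [A^{-1}]_{ik} / [A^{-1}]_{ii}
    """
    keep = np.delete(np.arange(a_inv.shape[0]), i)
    block = a_inv[np.ix_(keep, keep)]
    return block - np.outer(a_inv[keep, i], a_inv[i, keep]) / a_inv[i, i]

# Random positive-definite test matrix (illustrative only).
rng = np.random.default_rng(0)
m = rng.normal(size=(6, 6))
a = m @ m.T + 0.1 * np.eye(6)
a_inv = np.linalg.inv(a)

i = 2
keep = np.delete(np.arange(6), i)
direct = np.linalg.inv(a[np.ix_(keep, keep)])   # brute-force inverse of the minor
max_err = np.max(np.abs(inverse_of_minor(a_inv, i) - direct))
print(max_err)
```

Because the denominator is a diagonal entry of the inverse of a positive-definite matrix, it is strictly positive, which matches the remark that the denominator never vanishes.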
Following this one can show that
where is the projector onto the subspace , is the kernel subspace of , and is the image subspace of .
Turning to the variance in the predicted target vector () the standard formula gives Rasmussen and Williams (2005)
which using the above relations gives
Note that since is positive definite with a maximal eigenvalue of we get that and therefore the variance is non-negative, as required.
We next define the DGL function as the MSE loss of the Bayesian prediction
Notably, one can also add the variance () to this expression, making it a more accurate measure of the expected MSE loss. For simplicity, and since we found that it makes little difference in practice, we did not do so in the text. The GitHub repository we opened has this option available. In the generic case in which the covariance-matrix has no kernel and taking the limit of zero we obtain
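The leave-one-out construction of this appendix can be illustrated with a short numerical sketch. It assumes the textbook leave-one-out identity for Gaussian Process regression, in which the residual of predicting point i from all other points equals [A^{-1} y]_i / [A^{-1}]_{ii} with A the covariance matrix plus noise variance times the identity. The RBF kernel, the function names, and the averaging over points are placeholders chosen for illustration; the normalization need not match the DGL as defined in the text.

```python
import numpy as np

def loo_gp_mse(k, y, noise_var):
    """Leave-one-out MSE of GP regression from a single matrix inverse."""
    a_inv = np.linalg.inv(k + noise_var * np.eye(k.shape[0]))
    # Residual y_i - prediction_i, where prediction_i uses all points except i.
    resid = (a_inv @ y) / np.diag(a_inv)[:, None]
    return np.mean(np.sum(resid ** 2, axis=1))

def loo_gp_mse_brute(k, y, noise_var):
    """Same quantity, computed by explicitly removing one point at a time."""
    n = k.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        a = k[np.ix_(keep, keep)] + noise_var * np.eye(n - 1)
        pred = k[i, keep] @ np.linalg.solve(a, y[keep])
        total += np.sum((y[i] - pred) ** 2)
    return total / n

rng = np.random.default_rng(1)
x = rng.normal(size=(20, 3))                  # toy inputs (assumption)
labels = rng.integers(0, 2, size=20)
y = np.eye(2)[labels]                         # one-hot targets
sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
k = np.exp(-0.5 * sq)                         # placeholder RBF covariance

fast = loo_gp_mse(k, y, noise_var=0.1)
slow = loo_gp_mse_brute(k, y, noise_var=0.1)
print(fast, slow)
```

The closed form needs one inverse of the full matrix instead of one linear solve per left-out point, which is the practical benefit of relating the full inverse to that of its minors.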
Appendix B Information Bottleneck from the Pair Distribution Function
The Information Bottleneck (IB) approach asserts Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017) that each layer, having activations , minimizes the loss function , where () is the mutual information between the activations and the input (label) and is an undetermined layer-specific constant which is usually of order 100 Shwartz-Ziv and Tishby (2017). Notably, IB was proposed for deterministic networks in which is a deterministic function of . As commented in many works Saxe et al. (2018); Kolchinsky and Tracey (2017), in such settings mutual information quantities are ill-defined and require a regulator. The regulator defines how much information is in one data-point and how close two points have to be to collapse into one point. One type of regulator that several authors recommend Saxe et al. (2018); (under double blind review https://openreview.net/forum?id=r1Nb5i05tX) (2018) consists of adding very small Gaussian random noise to and using that perturbed in the above loss.
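For reference, the noise-regularized IB objective described above is commonly written as follows. The symbols are assumptions made for illustration and need not match the notation of the main text: $T$ for the activations, $X$ for the input, $Y$ for the label, $\beta$ for the layer-specific constant, $\sigma^{2}$ for the variance of the added noise, and $d$ for the dimension of the activations. The last equality holds when $T$ is a deterministic function of $X$.

```latex
\begin{align}
\mathcal{L}_{\mathrm{IB}} &= I(\tilde{T};X) - \beta\, I(\tilde{T};Y),
\qquad \tilde{T} = T + \epsilon,\quad
\epsilon \sim \mathcal{N}\!\left(0,\sigma^{2}\mathbb{1}_{d}\right),\\
I(\tilde{T};X) &= H(\tilde{T}) - H(\tilde{T}\,|\,X),
\qquad H(\tilde{T}\,|\,X) = \tfrac{d}{2}\log\!\left(2\pi e\,\sigma^{2}\right).
\end{align}
```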
For much smaller than the typical inter-datapoint spacing and at high dimension, one can fairly assume that pairs of data-points coming close in the space of activations cause the vast majority of the information loss, whereas triplets of data-points coming close are far rarer. Clearly, for low enough (i.e., the deterministic limit) this is always true unless three points happen to collapse exactly onto one another. Taking this as our prescription for determining , we show below that the mutual information becomes a property of the pair-distribution function (PDF) of the dataset (defined below), and as a result the IB compression can be measured solely through knowledge of the pairwise distances between all points. Such PDFs were analyzed in Ref. Goldfeld et al. (2018), and indeed compression (following auxiliary noise addition) was linked to a reduction of pairwise distances in these PDFs.
We now turn to establishing the mapping between the mutual information with a small -noise regulator and the pair-distribution function. For brevity we focus only on . We make the reasonable assumption that data-points () have no repetitions and are all equally likely. Using we first find that the second contribution is just the entropy of (). The latter is a dimensional Gaussian distribution with variance , which we denote by . The former is the entropy of . In cases where all data-points in space () are much further apart on the scale of the entropy becomes that of choosing a data-point (, where is the number of datapoints) plus that of a single datapoint . This implies that as expected in this limit. Next consider the case where some points are far apart but some points are bound into pairs. The entropy is now given by
where runs over all pairs, is the distance between members of the pair, and is the entropy of a mixture of two d-dimensional Gaussians with variance at distance . Noting that decays as one can just as well extend this sum over pairs to a sum over all points, finally arriving at
A sum of two-particle (two-data-point) terms such as the one above can always be expressed using the pair-distribution function (PDF), whose standard definition is
it is then easy to verify that
Similarly can be expressed using the opposite-label PDF given by
where and scan data-points with opposite labels. We thus conclude that optimization of the IB functional following -noise regularization, either in the limit of or in the limit where three points reaching a distance of are rare, is simply a particular type of label-dependent pairwise interaction.
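The pairwise picture of this appendix can be sketched numerically. The code below assumes equally likely data-points, isotropic Gaussian noise of standard deviation `sigma`, and that close points occur only as isolated pairs; under these assumptions the entropy is log N plus the single-Gaussian entropy, minus (2/N) times a per-pair deficit that depends only on the pair distance. The function names and the 12-sigma cutoff are illustrative choices, not taken from the text.

```python
import numpy as np

def mixture_entropy_excess(r, sigma, grid=20001):
    """Entropy (nats) of an equal mixture of two isotropic Gaussians a distance
    r apart, minus the entropy of a single one. The components differ only
    along the separation axis, so a 1-D integral suffices in any dimension."""
    z = np.linspace(-r / 2 - 12 * sigma, r / 2 + 12 * sigma, grid)
    norm = 1.0 / np.sqrt(2 * np.pi * sigma ** 2)
    p = 0.5 * norm * (np.exp(-(z - r / 2) ** 2 / (2 * sigma ** 2))
                      + np.exp(-(z + r / 2) ** 2 / (2 * sigma ** 2)))
    safe = np.where(p > 0, p, 1.0)            # avoid log(0) where p underflows
    h_mix = np.sum(-p * np.log(safe)) * (z[1] - z[0])
    return h_mix - 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

def pairwise_entropy_estimate(x, sigma):
    """Entropy of the noisy activations in the isolated-pairs approximation."""
    n, d = x.shape
    h_single = 0.5 * d * np.log(2 * np.pi * np.e * sigma ** 2)
    i, j = np.triu_indices(n, k=1)
    r = np.linalg.norm(x[i] - x[j], axis=1)   # all pairwise distances
    close = r < 12 * sigma                    # farther pairs contribute negligibly
    deficit = sum(np.log(2) - mixture_entropy_excess(rij, sigma)
                  for rij in r[close])
    return np.log(n) + h_single - 2.0 / n * deficit

sigma = 0.1
x_far = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 3.0], [4.0, -6.0]])
x_pair = x_far.copy()
x_pair[1] = x_pair[0] + np.array([0.05, 0.0])  # bring two points close together

h_far = pairwise_entropy_estimate(x_far, sigma)
h_pair = pairwise_entropy_estimate(x_pair, sigma)
print(h_far, h_pair)
```

Since the estimate depends on the data only through the list of pairwise distances, it can equally be evaluated from a histogram of those distances, computed either over all pairs or over opposite-label pairs only.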
Appendix C DGL for the pre-classifier layer
Here we derive in detail the DGL of the pre-classifier layer, starting from the inverse of . This matrix is defined by
where we recall that is an by matrix given by . Taking the limit of one immediately has that
Without fine-tuning is positive-definite. Notably, this statement is equivalent to saying that the matrix has linearly independent columns. Notably, when having two linearly dependent columns requires fine-tuning of parameters; hence when this becomes extremely unlikely under any reasonable ensemble for .
In this case one can show that . Indeed
The first equation implies that is a projector (in fact a Hermitian projector, as is easy to verify). The second implies that its image is in the kernel of . The third implies that its kernel is in the image of . All in all, it is a projector whose image coincides with the kernel of as required.
Next we consider Eqs. (11). The fact that the kernel is non-trivial adds several complicated terms to our loss. These terms all depend on which we next expand as
where on the right-hand side we noted that , the image of , is of dimension ; consequently the norm of the operator is , while the norm of the . Notably, this statement is only accurate element-wise when we assume that has no particular relation to the basis in which the matrix is written. For this not to hold, at least one dimensional row of would have to be orthogonal to all the remaining rows. This is again exponentially unlikely in the limit of under any reasonable ensemble for .
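The projector properties listed above can be checked numerically under an assumed concrete form. The sketch takes the covariance matrix to be `k = h @ h.T`, with `h` an n-by-d matrix of activations (n > d) whose columns are linearly independent, and the candidate projector to be `p = I - h (h^T h)^{-1} h^T`. These names and this specific form are assumptions for illustration, since the defining equations are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 12, 4                                   # more data-points than features
h = rng.normal(size=(n, d))                    # generic h: independent columns
k = h @ h.T                                    # rank-d covariance with a kernel

# Candidate projector onto the kernel of k (assumed form).
p = np.eye(n) - h @ np.linalg.solve(h.T @ h, h.T)

idempotent_err = np.max(np.abs(p @ p - p))     # p is a projector
symmetric_err = np.max(np.abs(p - p.T))        # and a Hermitian one
kernel_err = np.max(np.abs(k @ p))             # its image lies in ker(k)
rank = int(round(np.trace(p)))                 # dimension of the image: n - d

# Small-noise limit: noise_var * (k + noise_var)^{-1} approaches the projector.
noise_var = 1e-6
p_limit = noise_var * np.linalg.inv(k + noise_var * np.eye(n))
limit_err = np.max(np.abs(p_limit - p))
print(idempotent_err, symmetric_err, kernel_err, rank, limit_err)
```

The solve with `h.T @ h` is well defined exactly when the columns of `h` are linearly independent, which is the generic situation described above.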
Accordingly we treat the expansion in the as an expansion in . For instance we can then expand
Plugging this into Eq. (15) we obtain
as in the main text.