I Introduction
Several pleasant features underlie the success of deep learning: the scarcity of bad minima encountered in their optimization Draxler et al. (2018), their ability to generalize well despite being heavily overparametrized Neyshabur et al. (2018, 2014) and expressive Zhang et al. (2016), and their ability to generate internal representations which generalize across different domains and tasks Yosinski et al. (2014); Sermanet et al. (2013). Our current understanding of these features is, however, largely empirical. Thus the important task of designing more robust DNNs which train faster and allow for transfer of knowledge across domains (transfer learning) involves various ad hoc choices, trial and error, and hard-to-teach craftsmanship.
While training and generalization can be analyzed in the limit of wide networks Mei et al. (2018); De Palma et al. (2018), the issue of internal representations and transfer learning is inherently depth related. The practical setup of transfer learning Yosinski et al. (2014); Sermanet et al. (2013) (and some semi-supervised learning schemes Kingma et al. (2014)) typically involves training a DNN on a task with large amounts of data (say image classification), cutting and freezing several of the lowest layers of that DNN, adding a smaller DNN on top of these frozen layers, and training it for a different task (say localization of objects in images) with a smaller dataset. The fact that transfer learning often works quite well implies that, to some degree, layers in a DNN learn data representations which are "useful", or aid in solving the task, even without very specific knowledge of the particular weights of subsequent layers.

A way of formalizing the usefulness of internal representations is to consider layer-wise greedy optimization (LEGO). Indeed, if weight-specific knowledge of subsequent layers is unimportant, it should be possible to train each layer of the DNN individually. Such optimization should use, at most, knowledge about the architecture of subsequent layers and the task at hand. Thus a set of layer-wise loss functions should exist which depend, at most, on the architecture and the task. Such loss functions would quantify the set of useful representations each layer should aim for, thereby quantifying the role of a layer in a DNN. Furthermore, such loss functions would in principle allow one to determine whether a layer can be successfully transferred, by measuring how well these useful representations agree between different architectures and tasks. The ability to draw analytic insights from such layer-wise loss functions depends heavily on how explicit they are. While many ideas for such layer loss functions have been proposed, to the best of our knowledge the ones which are explicit do not yield state-of-the-art performance, and the ones which yield state-of-the-art performance are not explicit (see the review below).
How can one obtain such explicit loss functions? Here we turn to a recent work analyzing very wide networks Lee et al. (2018). This work capitalizes on the fact that in the infinite-width limit, fully connected DNNs, when marginalized over their internal parameters, behave as Gaussian Processes (GPs). Such a GP is fully characterized by its covariance function $K(x, x')$, which measures how two different inputs ($x$ and $x'$) correlate at the output of the DNN, when marginalized over the weights' distribution. As Bayesian Inference can be carried out exactly on GPs Rasmussen and Williams (2005), one has an analytic handle on Bayesian Inference in wide DNNs. Notably, one may worry that infinitely wide DNNs, having an infinite number of parameters, would be of little use due to overfitting; however, that is not the case in practice. In fact, various works show that the wider the network, the better it seems to generalize Neyshabur et al. (2018, 2014). Interestingly, there is also evidence that GP predictions remain a good approximation even for networks of depth 10000 Xiao et al. (2018) at initialization. Here we seek to leverage this GP viewpoint for constructing layer-wise loss functions.

Our contributions:
1. We derive and test a novel set of explicit supervised layer-wise loss functions, Deep Gaussian Layer-wise losses (DGLs), for fully connected DNNs. The DGLs lead to state-of-the-art performance on MNIST and CIFAR-10 when used in layer-wise greedy optimization (LEGO), and can also be used to monitor standard end-to-end optimization. The DGLs are architecture dependent, but only through a few effective parameters.
2. We analyze in depth the DGL of a pre-classifier layer and use this analysis to shed light on issues with the Information Bottleneck (IB) approach to DNNs.
3. We show that the covariance functions of finite-width DNNs agree quite well with those of very wide networks, and suggest a fitting ansatz which makes this agreement even tighter.
4. We provide strong evidence that the GP approach to DNNs is an excellent approximation for the behavior of DNNs with widths as small as 20 neurons.
Related work: The idea of analyzing DNNs layer by layer has a long history. Several early successes of deep networks were obtained using LEGO strategies. In particular, good generative models of handwritten digits Hinton et al. (2006) and phonetics classifiers Mohamed et al. (2012) were trained using an unsupervised (i.e., label-unaware) LEGO strategy, which for the latter work was supplemented by stochastic gradient descent (SGD) fine-tuning. Following some attempts to perform supervised LEGO Bengio et al. (2006), the common practice became to use LEGO as a pre-training initialization protocol LeCun et al. (2015). As simpler initialization protocols came along Glorot and Bengio (2010), SGD on the entire network (end-to-end) became the common practice. More recent works include several implicit loss functions based on the IB, all having in common that an auxiliary DNN has to be trained in order to evaluate the loss. More analytic approaches include an unsupervised LEGO training algorithm Kadmon and Sompolinsky (2016); Meir and Domany (1988) followed by a classifier, for datasets resembling Gaussian mixtures, a biologically inspired unsupervised algorithm, and target methods where layers are trained to fit specific targets chosen by a backwards pass on the network Lee et al. (2014).

II Gaussian Processes and finite-width DNNs
Here we briefly survey relevant results on GPs Rasmussen and Williams (2005) and their covariance functions. Gaussian Processes are a generalization of multivariable Gaussian distributions to a distribution over functions $f(x)$ Rasmussen and Williams (2005). Being Gaussian, they are completely defined by the first and second moments. The first is typically taken to be zero, and the second is known as the covariance function ($K(x, x') = \langle f(x) f(x') \rangle$, where $\langle \cdot \rangle$ denotes expectation under the GP distribution). In addition, GPs allow for exact Bayesian Inference. An important conceptual step here is to view the function $f$ as an (infinite-dimensional/non-parametric) representation of the model parameters.

The equivalence between GPs and very wide DNNs stems from the fact that in the infinite width (channel) limit, fully-connected (convolutional) DNNs with an uncorrelated Gaussian prior on the weights are equivalent to GPs Cho and Saul (2009); Novak et al. (2018). Here $f$ becomes the DNN's output and $x$ denotes the input to the DNN. Alternatively stated, the probability distribution on the space of functions generated by a DNN with random weights is a Gaussian one. Consequently, exact Bayesian Inference on such DNNs is possible
Lee et al. (2018); Cho and Saul (2009), and is explicitly given by

$$\bar{f}(x_*) = K(x_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} Y, \qquad (1)$$

where $x_*$ is a new datapoint, $\bar{f}(x_*)$ is the predicted target vector, typically chosen as a one-hot encoding of the categorical label, $Y$ are the training targets, $X$ are the training datapoints, $K(X, X)$ is the covariance matrix (the covariance function projected on the training dataset), $\sigma^2$ is a regulator corresponding to a noisy measurement of $f$, and $I$ is the identity matrix. Notably, while not written explicitly, the prediction depends on $\sigma$. Some intuition for this formula can be gained by verifying that taking $x_*$ equal to a training point $x_\mu$ (at $\sigma \to 0$) yields $\bar{f}(x_\mu) = y_\mu$.

Implicit in the above matrix inversion is the full Bayesian integration over all DNN weights, weighted by their likelihood given the dataset. Gaussian Processes in which the covariance function is derived as above from infinite-width DNNs are called NNGPs Lee et al. (2018).
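As a concrete illustration of Eq. (1), the mean Bayesian prediction amounts to one linear solve. The following minimal numpy sketch is ours (the function name and the toy RBF kernel are illustrative assumptions, not part of the NNGP construction):

```python
import numpy as np

def gp_mean_prediction(K_star, K_train, Y, sigma2):
    """Eq. (1): mean GP prediction  K(x*, X) [K(X, X) + sigma^2 I]^{-1} Y.

    K_star : (m, n) covariances between m new points and the n training points
    K_train: (n, n) covariance matrix on the training set
    Y      : (n, c) training targets (e.g. one-hot labels)
    sigma2 : regulator corresponding to a noisy measurement of f
    """
    n = K_train.shape[0]
    return K_star @ np.linalg.solve(K_train + sigma2 * np.eye(n), Y)

# Toy check of the intuition above: for x* equal to a training point and
# sigma2 -> 0, the prediction reproduces that point's target.
X = np.array([[0.0], [1.0], [2.5]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # toy RBF kernel
pred = gp_mean_prediction(K[:1], K, Y, sigma2=1e-10)
```

With $\sigma^2 \to 0$, `pred` reproduces `Y[0]` up to numerical precision.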
Several works have noted quantitative and qualitative similarities between end-to-end SGD training and Bayesian Inference Welling and Teh (2011); Mandt et al. (2017); Jacot et al. (2018); Chaudhari and Soatto (2018); Lee et al. (2018). In particular, for the CIFAR-10 and MNIST datasets, Bayesian predictions and SGD predictions were shown to be in tight agreement Lee et al. (2018). We shall thus assume that NNGP performance is a monotonic function of the average SGD performance for wide DNNs.
Turning away from the infinite-width limit, a finite-width DNN cannot be viewed as a GP, at least not strictly. However, one may still attempt to approximate it by a GP, in what can be thought of as a mean-field/Gaussian approximation. The covariance function of this GP approximation is the network's covariance function, as defined above. If non-Gaussian corrections are small, the performance of Bayesian inference using the GP approximation would be a monotonic function of SGD performance. We shall assume this from now on, interpreted as a monotonic relation between the two performances. This assumption is supported below by our numerical results.
Still, an important problem remains: how to calculate this GP approximation. Indeed, the space of all possible covariance functions $K(x, x')$ for high-dimensional $x$ is huge, thus requiring large amounts of data for a proper fit. Here we make the simplifying assumption, inspired by recent results Xiao et al. (2018), that the approximating GPs have the functional form of the infinite-width NNGP with renormalized prior parameters. Conveniently, NNGP covariance functions can be written using an explicit formula which involves the nonlinearity of the network and the prior on the weights and biases ($\sigma_w^2, \sigma_b^2$) at all layers $l = 0 \ldots L$, where $L$ is the DNN depth. Focusing on the case of the commonly used ReLU activations, the resulting approximate covariance function of a depth-$L$ network at infinite width is given by the following recursive relation Cho and Saul (2009):

$$K^{(l+1)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi}\sqrt{K^{(l)}(x, x)\,K^{(l)}(x', x')}\left[\sin\theta_l + (\pi - \theta_l)\cos\theta_l\right], \qquad (2)$$

where $\theta_l = \arccos\!\left(K^{(l)}(x, x')/\sqrt{K^{(l)}(x, x)\,K^{(l)}(x', x')}\right)$ and $K^{(0)}(x, x') = \sigma_b^2 + \sigma_w^2\, x \cdot x'/d$ for $d$-dimensional inputs. As shown in Fig. (1), where $\sigma_w, \sigma_b$ are taken either from their microscopic values (MF) or from a fit (FIT), this agrees well with the empirical (sampled) covariance function. We note in passing that similar explicit formulas exist also for error-function activations.
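The recursion of Eq. (2) is straightforward to implement. Below is a sketch in our own notation of the ReLU (arccosine-kernel) recursion of Cho and Saul (2009); the function name and defaults are ours:

```python
import numpy as np

def nngp_relu_cov(x1, x2, depth, sw2=2.0, sb2=0.0):
    """ReLU NNGP covariance via the arccosine-kernel recursion of Eq. (2).

    sw2, sb2: prior variances of the weights and biases (per layer).
    """
    d = x1.shape[0]
    # layer-0 covariance coming from the first affine layer acting on inputs
    k11 = sb2 + sw2 * x1 @ x1 / d
    k22 = sb2 + sw2 * x2 @ x2 / d
    k12 = sb2 + sw2 * x1 @ x2 / d
    for _ in range(depth):
        norm = np.sqrt(k11 * k22)
        theta = np.arccos(np.clip(k12 / norm, -1.0, 1.0))
        k12 = sb2 + sw2 / (2 * np.pi) * norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
        k11 = sb2 + 0.5 * sw2 * k11  # diagonal case of the recursion (theta = 0)
        k22 = sb2 + 0.5 * sw2 * k22
    return k12
```

Note that for the choice $\sigma_w^2 = 2$, $\sigma_b^2 = 0$, the diagonal $K(x, x)$ is preserved under depth, so the recursion remains well behaved for deep networks.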
III Deriving the Deep Gaussian Layer-wise Loss functions
To derive the DGL functions, let us start with a LEGO strategy which should be optimal in terms of performance yet is highly non-explicit (see Fig. 2): We begin from the input layer and consider it as our current trainee layer. For every set of its parameters $W$, we perform standard end-to-end training of the entire network between the trainee layer and the classifier (the top-network), with $W$ kept frozen. Next we repeat this training infinitely many times and treat the average performance as a loss function for the trainee layer. We then optimize the parameters $W$ so as to minimize this loss. Subsequently we act on the dataset using the optimized trainee layer to obtain the representation of the dataset in its activation space. We then repeat the process for the next layer, with these activations as inputs. This process continues until all layers below the classifier have been trained. The last classifier layer is then trained using the MSE loss.
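Schematically, the greedy layer-wise procedure just described can be sketched as follows. This is our own illustrative skeleton: the finite-difference optimizer and the `layer_loss` argument are placeholders for whatever layer-wise loss one adopts (e.g. the DGL derived below), not the optimizers used in this work:

```python
import numpy as np

def lego_train(X, Y, widths, layer_loss, steps=100, lr=0.1, eps=1e-4, seed=0):
    """Greedy layer-wise (LEGO) training: optimize each layer on its own
    loss, freeze it, and feed its activations to the next trainee layer."""
    rng = np.random.default_rng(seed)
    reps = X
    frozen = []
    for width in widths:
        W = rng.normal(size=(reps.shape[1], width)) / np.sqrt(reps.shape[1])
        for _ in range(steps):
            base = layer_loss(np.maximum(reps @ W, 0.0), Y)
            grad = np.zeros_like(W)
            for i in range(W.shape[0]):          # crude finite-difference gradient
                for j in range(W.shape[1]):
                    W[i, j] += eps
                    grad[i, j] = (layer_loss(np.maximum(reps @ W, 0.0), Y) - base) / eps
                    W[i, j] -= eps
            W -= lr * grad
        frozen.append(W)
        reps = np.maximum(reps @ W, 0.0)  # frozen representation for the next layer
    return frozen, reps
```

The classifier is then trained on the final `reps` under the MSE loss, as in the last step above.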
Provided that freezing the parameters of the trainee layer does not induce optimization issues in the top-network SGD, the above procedure would yield the same performance as average end-to-end SGD. Such optimization issues in the top-network, better known as co-adaptation issues Yosinski et al. (2014), arise from a tight coupling between top-network and trainee-layer weights. They imply that the trainee-layer representation learned by standard end-to-end training is highly correlated with the top-network, and thus inadequate for transfer learning Yosinski et al. (2014).

Co-adaptation is considered adversarial to learning also outside the scope of LEGO and transfer learning. Indeed, the success of dropout regularization is partially attributed to its ability to mitigate co-adaptation Srivastava et al. (2014). Additionally, co-adaptation, being a local-minima issue, is more likely to occur away from the overparametrized regime where modern practical interest lies. We thus make a second assumption, namely that co-adaptation effects are small. Note that if co-adaptation is unavoidable, one may still group the co-adapting layers into a block of layers and treat this block as an effective layer in the algorithm discussed below.
Assuming no co-adaptation, we shall now derive the DGL functions by approximating the average top-network performance using the NNGP Bayesian prediction. To this end, let us either consider a regression problem with datapoints ($x_\mu$) and targets ($y_\mu$), or rephrase a classification problem as a regression problem by taking $y_\mu$ to be a one-hot encoding of the categorical labels. For concreteness we focus on the bottom/input layer (see Fig. 2), which acts on the inputs and maps each $x_\mu$ to its value $z_\mu$ in the activation space of the input layer. Consider training the top-network on the dataset represented by ($z_\mu$, $y_\mu$). Taking the GP approximation, we consider Eq. (1) with $x$ replaced by $z$ and $K$ replaced by the covariance function of the top-network. The resulting equation now describes how an unseen activation $z_*$ would be classified by a trained top-network. To make this into a loss function for the training dataset, rather than for an unseen point, we adopt a leave-one-out cross-validation strategy: we iterate over all datapoints, take each one out in turn, treat it as an unseen point, and measure how well we predict its label using the mean Bayesian NNGP prediction.
Assuming $K$ has no kernel, taking $\sigma \to 0$, and performing some straightforward algebra (see App. I), the MSE loss of the leave-one-out predictions can be expressed using the inverse of the covariance matrix over the training dataset ($\tilde{K} = K^{-1}$):

$$\mathcal{L}_{\mathrm{DGL}} = \sum_\mu \left(\frac{[\tilde{K} Y]_\mu}{\tilde{K}_{\mu\mu}}\right)^2. \qquad (3)$$

A few technical comments are in order. The DGL is a function of the trainee layer's parameters via the activations $z_\mu$, which enter $K$, whose inverse is $\tilde{K}$. Apart from the need to determine the top-network's effective parameters ($\sigma_w^2, \sigma_b^2$) numerically or through meta-optimization, the DGL is an explicit function of all the points in the dataset. For the case of ReLU networks without biases, it can be seen from Eq. (2) that all the prior parameters collapse into one overall scale parameter. Lastly, we stress that this loss gives a score to a full dataset rather than to individual points in the dataset.
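In code, the leave-one-out loss amounts to a single matrix inversion. The sketch below is our own implementation (with a small regulator kept for numerical stability); its test against a brute-force leave-one-out computation illustrates why the inverse covariance matrix suffices:

```python
import numpy as np

def dgl(K, Y, sigma2=1e-8):
    """Leave-one-out MSE of the mean GP predictions.

    Uses the standard GP identity: the leave-one-out residual of point mu is
    [K~ Y]_mu / K~_mumu, with K~ the inverse of the (regulated) covariance.
    """
    Kt = np.linalg.inv(K + sigma2 * np.eye(K.shape[0]))
    loo_residuals = (Kt @ Y) / np.diag(Kt)[:, None]
    return np.sum(loo_residuals ** 2)
```

Note that the loss involves the entire dataset through the inversion: changing one activation $z_\mu$ changes the leave-one-out residuals of all other points.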
We turn to discuss the structure and symmetries of the DGL. As it depends on the activations only through $\tilde{K}$, which in turn depends only on $K$, it inherits all the symmetries of the latter. For fully connected top-networks it is thus invariant under any orthogonal transformation ($z_\mu \to O z_\mu$) of the vectors of the $l$'th layer representation. An additional structure is that the DGL depends on the targets only through their dot products ($y_\mu \cdot y_\nu$) which, for the one-hot encoded case, are zero unless the labels are equal. The $K$-dependent central piece is "unsupervised", or unaware of the labels. It is a negative-definite matrix, ensuring that the optimal DGL is zero, as one expects from a proxy to the MSE loss. One can think of this central piece as a measure of the sample-similarity bias of the DNN (more specifically, of the top-network): when its entry for two datapoints is small (large), the network tends to associate different (similar) targets to them in any classification task. Crucially, this is not a simple pairwise dependence on $z_\mu, z_\nu$, but rather depends on the entire dataset through the covariance-matrix inversion. The DGL function can thus be interpreted as the sample similarity (in the context of the dataset) weighted by the fixed target similarity.
IV The case of a depth-one network
It is illustrative to demonstrate our approach on a case where the inversion of the covariance matrix can be carried out explicitly. To this end we consider a DNN consisting of a fully-connected or convolutional bottom/input layer with weights $W$ and any type of activation. This layer outputs an $N$-dimensional activation vector ($z$) which is fed into a linear layer with two outputs. We consider a binary regression task with two targets, which can also be thought of as a binary classification task.

Notably, in this shallow scenario many of the assumptions we made in deriving the DGL functions are exact. A linear layer of any width, when marginalized over its weights, is a Gaussian Process whose covariance function is that of the infinite-width limit. Moreover, the loss landscape of an MSE classifier is convex, and therefore co-adaptation effects are absent.
To express the DGL function for this input layer, our first task is to find the covariance function of the top-network, namely the linear layer. Assuming a Gaussian prior of variance $\sigma_w^2/N$ on each of the linear layer's matrix weights and zero bias, it is easy to show (App. II) that

$$K(z, z') = \frac{\sigma_w^2}{N}\, z \cdot z', \qquad (4)$$

so that on the dataset $K$ is an $n$-by-$n$ matrix given by $K_{\mu\nu} = (\sigma_w^2/N)\, z_\mu \cdot z_\nu$.
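The linear-layer covariance function is easy to verify empirically by sampling random linear layers; this small check (our own script, mirroring the marginalization over the weights) compares the sampled correlator with Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(0)
N, samples, sigma2 = 50, 100000, 1.0
z1, z2 = rng.normal(size=(2, N))

# sample f(z) = a . z with i.i.d. weights of variance sigma2 / N
A = rng.normal(scale=np.sqrt(sigma2 / N), size=(samples, N))
f1, f2 = A @ z1, A @ z2

empirical = np.mean(f1 * f2)            # sampled <f(z1) f(z2)>
analytic = sigma2 / N * z1 @ z2         # Eq. (4)
```

The two values agree within the Monte Carlo sampling error.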
To facilitate the analysis we next assume that the number of datapoints ($n$) is much larger than the width ($N$), and we also take a vanishing regulator ($\sigma \to 0$). As a result, the covariance matrix has a kernel whose dimension is at least $n - N$. To leading order in $\sigma^2$ one finds that $\tilde{K} = \sigma^{-2} P$, where $P$ is the projector onto the kernel of $K$. This projector is given by (see App. 3)

$$P_{\mu\nu} = \delta_{\mu\nu} - z_\mu^T (Z Z^T)^{-1} z_\nu, \qquad (5)$$

where $Z$ is the $N \times n$ matrix whose columns are the activations $z_\mu$. Indeed, one can easily verify that $P K = 0$ and that $P^2 = P$, as required. Plugging these results into Eq. (3), one finds that to leading order in $1/n$

$$\mathcal{L}_{\mathrm{DGL}} = \sum_\mu |y_\mu|^2 - \sum_{\mu\nu} (y_\mu \cdot y_\nu)\, z_\mu^T (Z Z^T)^{-1} z_\nu. \qquad (6)$$
The above equation tells us how to train a layer whose output ($z$) gets fed into a linear classifier. Let us first discuss its symmetry properties. The first term in this equation is constant under the optimization of $W$, hence we may discard it. The second term is invariant under any rotation ($z_\mu \to O z_\mu$) of the dataset in activation space. Indeed, such transformations can be undone by the classifier itself, and hence such changes to the dataset should not affect the performance of the classifier. A bit unexpected is that the loss is also invariant under the bigger group of invertible linear transformations ($z_\mu \to M z_\mu$). While a generic classifier can indeed undo any invertible linear transformation, the prior we put on its weights limits the extent to which it can undo a transformation with vanishing eigenvalues. This enhanced symmetry is a result of taking the $\sigma \to 0$ limit, which allows the Gaussian Process to distinguish vanishingly small differences in $z$. In practice a finite $\sigma$ is often needed for numerical stability, and this breaks the symmetry down to an orthogonal one.

Next we discuss how the loss sees the geometry of the dataset. Notably, $\Sigma = Z Z^T$ is (up to normalization) the covariance matrix of the dataset in activation space. Since it is positive definite we can write $\Sigma = A A^T$, and therefore $z_\mu^T \Sigma^{-1} z_\nu = (A^{-1} z_\mu) \cdot (A^{-1} z_\nu)$. We then define $\tilde{z}_\mu = A^{-1} z_\mu$ as the normalized dataset; indeed, its covariance matrix is the identity. Thus, $\mathcal{L}_{\mathrm{DGL}} = \mathrm{const} - \sum_{\mu\nu} (y_\mu \cdot y_\nu)\, \tilde{z}_\mu \cdot \tilde{z}_\nu$. In these coordinates the loss is a simple pairwise interaction between normalized datapoints, which tends to make points with equal (opposite) labels closer (farther apart). It thus favors the formation of separate droplets in the normalized representation, as illustrated in Fig. 3 panel (B).
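Both the projector of Eq. (5) and the normalized ("whitened") coordinates can be checked numerically; in this sketch (our own notation) $Z$ is the $N \times n$ matrix whose columns are the activations:

```python
import numpy as np

def kernel_projector(Z):
    """Eq. (5): projector onto the kernel of the Gram matrix K ~ Z^T Z (N < n)."""
    n = Z.shape[1]
    return np.eye(n) - Z.T @ np.linalg.inv(Z @ Z.T) @ Z

def normalize_dataset(Z):
    """Whitening: with Sigma = Z Z^T = A A^T, return A^{-1} Z,
    whose empirical covariance is the identity."""
    A = np.linalg.cholesky(Z @ Z.T)
    return np.linalg.solve(A, Z)
```

One can verify that $P$ annihilates the covariance matrix ($PK = 0$), is idempotent ($P^2 = P$), and has trace $n - N$, while the whitened activations satisfy $\tilde{Z}\tilde{Z}^T = I$; in the whitened coordinates the pairwise term of the loss is just $\tilde{z}_\mu \cdot \tilde{z}_\nu$.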
The fact that the loss encourages droplets in the normalized representation, rather than in the representation itself, is very sensible. Indeed, the classifier's performance is a measure of the linear separability of the dataset. This means that points with opposite labels should be on opposite sides of a hyperplane; however, no further improvement in train performance is gained by making equal-label points closer in Euclidean distance. Hence a pairwise interaction (encouraging a high dot product between similar labels) without normalization is unlikely to be a faithful measure of linear separability. Once the dataset is normalized, the spread over the directions along the hyperplane is made to be of order one, hence equal-label points do look like they bunch together into droplets (see Fig. 3). In fact, based on the above symmetry discussion, one finds that the loss favors any geometry given by an invertible linear transformation acting on a dataset representation consisting of two well-separated droplets. This is a sensible measure of linear separability for generic datasets.
V Contrast with Information Bottleneck approaches
It is interesting to compare the DGL with a different loss function drawn from recent works on the information bottleneck (IB) Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017). In those works it was argued that the role of a layer $T$ is to compress the layer's representation while maintaining the information on the labels. Formally, this means minimizing the mutual-information quantity

$$\mathcal{L}_{\mathrm{IB}} = I(X; T) - \beta\, I(T; Y) \qquad (7)$$

for large $\beta$. A subtle yet important issue here is the fact that for deterministic networks these mutual-information quantities are either constant or infinite, depending on how one views the entropy of a point. To overcome this, the original works used binning of some linear dimension in activation space, and other works added a Gaussian noise of variance $\sigma^2$ to the activations Saxe et al. (2018). In cases in which three datapoints becoming close on the scale of the regulator are rare, both regularization schemes effectively lead to a pairwise interaction between datapoints Kolchinsky and Tracey (2017); Goldfeld et al. (2018) (see also App. 2). Notably, this is almost always the case at high dimension or when the regulator is taken to zero. For the Gaussian regulator the resulting loss is particularly simple and is given by (see App. 2)
(8) 
where the pairwise interaction is Gaussianly decaying on the scale of $\sigma$, given explicitly by the difference in entropy between a single $d$-dimensional Gaussian distribution of variance $\sigma^2$ and a mixture of two such Gaussians at distance $|z_\mu - z_\nu|$.
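For concreteness, the Gaussian-regulated $I(X; T)$ admits a pairwise-distance estimate in the spirit of Kolchinsky and Tracey (2017). The following sketch is our own implementation of such a KL-divergence-based pairwise bound, not the exact loss of Eq. (8):

```python
import numpy as np

def pairwise_mi_estimate(Z, sigma2):
    """Pairwise estimate of I(X; T) for T = z(X) + Gaussian noise of variance sigma2.

    Each datapoint contributes only through its distances to all others,
    via the Gaussian overlap exp(-|z_mu - z_nu|^2 / (2 sigma2))."""
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * sigma2))
    # log of the average pairwise overlap, datapoint by datapoint
    return -np.mean(np.log(np.mean(w, axis=1)))
```

The estimate behaves as expected: it vanishes when all activations coincide (full compression) and saturates at $\log n$ when all $n$ points are far apart on the scale of $\sigma$.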
At the input of the linear classifier, one can easily see the differences between the layer representations favored by the IB loss and by the DGL (see Fig. 3). The former, being unaware of the classifier and the architecture, simply encourages the formation of droplets which, as argued previously, are not a faithful measure of linear separability; to achieve this unnecessary goal it is likely to compromise on the margin. The latter, being aware of the classifier, encourages linear separability. We conclude that the IB loss is unlikely to be a good layer-wise loss function close to the classifier. This lack of architecture awareness of the IB (regulated using binning or Gaussian noise) is generally concerning.
VI Numerical tests
Here we report several numerical experiments aimed at testing whether the DGLs can monitor standard end-to-end optimization, and at measuring the effectiveness of the DGL functions in LEGO. Experiments were conducted on three datasets: MNIST, with 10k training samples randomly selected from the full MNIST training set and balanced to have an equal number of samples from each label; CIFAR-10, with 10k training samples similarly selected and balanced in terms of labels; and binary MNIST, with only the digits 1 and 7 and 2k training samples, similarly selected and balanced in terms of labels. For each dataset, an additional validation set, of size equal to the training set, was randomly selected from the full respective training set, excluding the samples selected for the training set. The validation set was balanced in terms of labels. For MNIST and CIFAR-10 the reported test set was the respective standard test set, and for binary MNIST the reported test set consisted of the samples from the standard test set with labels 1 and 7. The test sets were not balanced in terms of labels.
All experiments were conducted using fully-connected DNNs of depth $L$, consisting of activated layers of fixed width ($d$) and a linear classifier layer with output dimension given by the number of classes. The targets were zero-mean one-hot encoded in all experiments except for binary MNIST, where the labels were one-hot encoded. The loss function for all non-DGL training was the MSE loss.
For each dataset we conducted the following procedure:
1. End-to-end SGD training under the MSE loss.
2. Evaluation of the mean-field covariance function of the end-to-end-trained network.
3. DGL-monitored end-to-end SGD training under the MSE loss, with the same hyperparameters as in step 1 and with the mean-field covariance function evaluated at step 2.
4. LEGO training of all activated layers under the DGL, using the mean-field covariance function evaluated at step 2. The activated layers were optimized sequentially, starting from the input layer. Each layer was optimized once, then kept frozen during the optimization of subsequent layers.
5. Training of the linear classifier layer only, under the MSE loss, with the activated layers frozen, either at the DGL-optimized weights or at their randomly-initialized values.
End-to-end training was done using either the vanilla SGD optimizer or the Adam optimizer with standard internal parameters. All DGL training was done using the Adam optimizer with standard internal parameters. All training was done with fixed learning rate and weight decay, which were manually selected for each step in each dataset. The best hyperparameters for each step were selected for minimal loss on the validation set.
DGL Monitoring. Figure (4) shows DGL monitoring of end-to-end training (step 3) of a narrow network. Even at this small width, the DGL tracks end-to-end training very well. Various finer details are discussed in the caption.
DGL LEGO. Table 1 shows the test performance of steps 4 and 5 on the three aforementioned datasets for several $L$ and $d$ choices. End-to-end test accuracy is taken from Ref. Lee et al. (2018), apart from binary MNIST, where we report the test accuracy obtained in step 1 end-to-end training. The Random column serves as a simple baseline for the effect of depth, where we take the randomly initialized network and freeze the weights of all layers apart from the linear classifier.
Dataset       | L/d    | End-to-end | DGL   | Random
MNIST         | 2/2000 | 97.71      | 97.18 | 94.42
MNIST         | 3/1000 | 96.59      | 97.15 | 92.18
CIFAR-10      | 5/2000 | 45.40      | 47.45 | 34.28
Binary MNIST  | 2/20   | 98.52      | 99.26 | 87.29
Binary MNIST  | 3/20   | 98.61      | 99.21 | 93.52
VII Discussion
Following recent works on the Information Bottleneck (IB) theory of deep learning Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017), there has been a surge of works analyzing the layer representations generated by deep neural networks from both a geometrical and an information-theoretic viewpoint. In this work we argued, both theoretically and numerically, that one can formalize what constitutes a good layer representation explicitly, using a set of loss functions: the DGL functions. These loss functions differ from the losses implied by the IB in many aspects, but mainly in the fact that they are aware of the architecture of the network. We argued that this is essential, at least close to the classifier.
The DGL functions are well capable of monitoring the optimization of endtoend training in a layerwise fashion. Moreover they enable a competitive layer by layer optimization of the network. Although such training is admittedly slower, it has the advantage of generating layer representations with no coadaptation effects which are likely to be better for transfer learning Yosinski et al. (2014).
To the best of our knowledge, our LEGO approach outperforms all other explicit LEGO approaches (i.e., ones which do not require auxiliary DNNs). Nevertheless, our aim here is not to provide a more powerful algorithm for optimization. Rather, we wish to open an analytic window onto the incremental role of DNN layers and the representations they learn. Indeed, the explicit nature of the DGL functions, combined with their high level of structure, symmetry, and empirical accuracy, invites further study regarding the interpretability of layer representations. Understanding these representations would help unravel the inner workings of DNNs and facilitate their use in a more modular fashion across different domains.
References
 Draxler et al. (2018) F. Draxler, K. Veschgini, M. Salmhofer, and F. A. Hamprecht, arXiv eprints arXiv:1803.00885 (2018), eprint 1803.00885.
 Neyshabur et al. (2018) B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro, arXiv eprints arXiv:1805.12076 (2018), eprint 1805.12076.
 Neyshabur et al. (2014) B. Neyshabur, R. Tomioka, and N. Srebro, arXiv eprints arXiv:1412.6614 (2014), eprint 1412.6614.
 Zhang et al. (2016) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, arXiv eprints arXiv:1611.03530 (2016), eprint 1611.03530.
 Yosinski et al. (2014) J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, in Proceedings of the 27th International Conference on Neural Information Processing Systems  Volume 2 (MIT Press, Cambridge, MA, USA, 2014), NIPS’14, pp. 3320–3328, URL http://dl.acm.org/citation.cfm?id=2969033.2969197.
 Sermanet et al. (2013) P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, arXiv eprints arXiv:1312.6229 (2013), eprint 1312.6229.
 Mei et al. (2018) S. Mei, A. Montanari, and P.M. Nguyen, Proceedings of the National Academy of Sciences 115, E7665 (2018), ISSN 00278424, eprint https://www.pnas.org/content/115/33/E7665.full.pdf, URL https://www.pnas.org/content/115/33/E7665.
 De Palma et al. (2018) G. De Palma, B. Toussi Kiani, and S. Lloyd, arXiv eprints arXiv:1812.10156 (2018), eprint 1812.10156.
 Kingma et al. (2014) D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014), pp. 3581–3589, URL http://papers.nips.cc/paper/5352semisupervisedlearningwithdeepgenerativemodels.pdf.
 Lee et al. (2018) J. Lee, J. Sohldickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri, in International Conference on Learning Representations (2018), URL https://openreview.net/forum?id=B1EAM0Z.

 Rasmussen and Williams (2005) C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) (The MIT Press, 2005), ISBN 026218253X.
 Xiao et al. (2018) L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington, arXiv e-prints arXiv:1806.05393 (2018), eprint 1806.05393.
 Hinton et al. (2006) G. E. Hinton, S. Osindero, and Y.W. Teh, Neural Computation 18, 1527 (2006), pMID: 16764513, eprint https://doi.org/10.1162/neco.2006.18.7.1527, URL https://doi.org/10.1162/neco.2006.18.7.1527.
 Mohamed et al. (2012) A. Mohamed, G. E. Dahl, and G. Hinton, IEEE Transactions on Audio, Speech, and Language Processing 20, 14 (2012), ISSN 15587916.
 Bengio et al. (2006) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, in Proceedings of the 19th International Conference on Neural Information Processing Systems (MIT Press, Cambridge, MA, USA, 2006), NIPS’06, pp. 153–160, URL http://dl.acm.org/citation.cfm?id=2976456.2976476.
 LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 EP (2015), URL http://dx.doi.org/10.1038/nature14539.

 Glorot and Bengio (2010) X. Glorot and Y. Bengio, in Proceedings of the thirteenth international conference on artificial intelligence and statistics (2010), pp. 249–256.
 Kadmon and Sompolinsky (2016) J. Kadmon and H. Sompolinsky, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016), pp. 4781–4789.
 Meir and Domany (1988) R. Meir and E. Domany, Phys. Rev. A 37, 608 (1988), URL https://link.aps.org/doi/10.1103/PhysRevA.37.608.
 Lee et al. (2014) D.H. Lee, S. Zhang, A. Fischer, and Y. Bengio, arXiv eprints arXiv:1412.7525 (2014), eprint 1412.7525.
 Cho and Saul (2009) Y. Cho and L. K. Saul, in Proceedings of the 22Nd International Conference on Neural Information Processing Systems (Curran Associates Inc., USA, 2009), NIPS’09, pp. 342–350, ISBN 9781615679119, URL http://dl.acm.org/citation.cfm?id=2984093.2984132.
 Novak et al. (2018) R. Novak, L. Xiao, J. Lee, Y. Bahri, G. Yang, D. A. Abolafia, J. Pennington, and J. SohlDickstein, arXiv eprints arXiv:1810.05148 (2018), eprint 1810.05148.
 Welling and Teh (2011) M. Welling and Y. W. Teh, in Proceedings of the 28th International Conference on International Conference on Machine Learning (Omnipress, USA, 2011), ICML’11, pp. 681–688, ISBN 9781450306195, URL http://dl.acm.org/citation.cfm?id=3104482.3104568.
 Mandt et al. (2017) S. Mandt, M. D. Hoffman, and D. M. Blei, arXiv e-prints (2017), eprint 1704.04289.
 Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler, arXiv e-prints (2018), eprint 1806.07572.
 Chaudhari and Soatto (2018) P. Chaudhari and S. Soatto, in International Conference on Learning Representations (2018), URL https://openreview.net/forum?id=HyWrIgW0W.
 Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, J. Mach. Learn. Res. 15, 1929 (2014), ISSN 1532-4435, URL http://dl.acm.org/citation.cfm?id=2627435.2670313.
 Tishby and Zaslavsky (2015) N. Tishby and N. Zaslavsky, arXiv e-prints (2015), eprint 1503.02406.
 Shwartz-Ziv and Tishby (2017) R. Shwartz-Ziv and N. Tishby, arXiv e-prints (2017), eprint 1703.00810.
 Saxe et al. (2018) A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, in International Conference on Learning Representations (2018), URL https://openreview.net/forum?id=ry_WPGA.
 Kolchinsky and Tracey (2017) A. Kolchinsky and B. D. Tracey, Entropy 19 (2017), ISSN 1099-4300, URL http://www.mdpi.com/10994300/19/7/361.
 Goldfeld et al. (2018) Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy, arXiv e-prints arXiv:1810.05728 (2018), eprint 1810.05728.
 Rifkin and Klautau (2004) R. Rifkin and A. Klautau, J. Mach. Learn. Res. 5, 101 (2004), ISSN 1532-4435, URL http://dl.acm.org/citation.cfm?id=1005332.1005336.
 Anonymous (2018) Anonymous, under double blind review (2018), URL https://openreview.net/forum?id=r1Nb5i05tX.
Appendix A Derivation of the DGL functions
Here we consider a multi-label classification dataset ($D_N$) consisting of $N$ data points, each described by a $d$-dimensional vector $x_n$ and a ``one-hot'' label (target) vector $y_n$ with one component per class ($[y_n]_c = \delta_{c,c_n}$). As in Rifkin and Klautau (2004); Lee et al. (2018) we treat classification as a regression task where the network's outputs for a given class are optimized to be close to the one-hot label (MSE loss).
Next we define the left-out dataset ($D_{N\setminus n}$) consisting of all points except the point $(x_n, y_n)$. Our starting point for defining the DGL is the Bayesian prediction formula for the label vector ($\bar{y}(x_n)$) of an unseen datapoint ($x_n$, unseen with respect to $D_{N\setminus n}$)
(9) $\bar{y}(x_n) = \sum_{m,m'\neq n} K_{nm}\,\big[(\tilde{K} + \sigma^2 I_{N-1})^{-1}\big]_{mm'}\, y_{m'}$
where $K$ is the covariance function projected on the dataset $D_N$ ($K_{nm} = K(x_n, x_m)$), $\tilde{K}$ is the $n$-th minor of $K$ or equivalently the covariance function projected onto $D_{N\setminus n}$, $\sigma^2$ is the noise variance, and $I_{N-1}$ is the identity matrix in an $(N-1)$-dimensional space. Note that we choose indices to remain faithful to datapoints, so that the indices of $\tilde{K}$ are chosen to be the set $\{1,\dots,N\}\setminus\{n\}$ rather than $\{1,\dots,N-1\}$.
It would be convenient, both analytically and numerically, to relate $\tilde{K}^{-1}$ and $K^{-1}$. To this end we employ a relation between the inverse of a positive definite matrix ($A = K + \sigma^2 I_N$) and that of its minor ($\tilde{A} = \tilde{K} + \sigma^2 I_{N-1}$)
(10) $[\tilde{A}^{-1}]_{mm'} = [A^{-1}]_{mm'} - \dfrac{[A^{-1}]_{mn}\,[A^{-1}]_{nm'}}{[A^{-1}]_{nn}}$
Notably, since $K$ is positive semi-definite and bounded, $A = K + \sigma^2 I_N$ is positive definite, and so the above denominator ($[A^{-1}]_{nn}$) is always nonzero. The difference between the two sides of the above equation lies solely in the allowed values of the indices ($m, m' \neq n$ on the l.h.s. and unrestricted on the r.h.s.).
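The minor-inverse relation of Eq. (10) is easy to verify numerically; the RBF kernel and the sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 6, 0.1
X = rng.normal(size=(N, 3))
K = np.exp(-np.sum((X[:, None] - X[None, :])**2, axis=-1))  # RBF Gram matrix
A = K + sigma2 * np.eye(N)

n = 2                                    # index of the removed datapoint
keep = [i for i in range(N) if i != n]
A_minor = A[np.ix_(keep, keep)]          # \tilde{A}: n-th row and column removed

Ainv = np.linalg.inv(A)
# Eq. (10): inverse of the minor from the inverse of the full matrix
lhs = np.linalg.inv(A_minor)
rhs = Ainv[np.ix_(keep, keep)] - np.outer(Ainv[keep, n], Ainv[n, keep]) / Ainv[n, n]
assert np.allclose(lhs, rhs)
```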
Following this one can show that
(11) $\bar{y}(x_n) = y_n - \dfrac{[A^{-1} y]_n}{[A^{-1}]_{nn}}$
Furthermore, in the limit $\sigma \to 0$ the inverse decomposes as $A^{-1} = \sigma^{-2} P + K_{\mathrm{im}}^{-1} + O(\sigma^2)$, where $P$ is the projector onto the subspace $\ker(K)$, $\ker(K)$ is the kernel subspace of $K$, and $K_{\mathrm{im}}^{-1}$ is the inverse of $K$ restricted to $\mathrm{im}(K)$, the image subspace of $K$.
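The leave-one-out form of the Bayesian prediction can be checked against a direct evaluation of Eq. (9); again the kernel and dataset are toy choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 8, 0.05
X = rng.normal(size=(N, 4))
y = rng.normal(size=N)
K = np.exp(-np.sum((X[:, None] - X[None, :])**2, axis=-1))
A = K + sigma2 * np.eye(N)
Ainv = np.linalg.inv(A)

n = 3
keep = [i for i in range(N) if i != n]
# direct GP posterior mean at the left-out point x_n, Eq. (9)
direct = K[n, keep] @ np.linalg.inv(A[np.ix_(keep, keep)]) @ y[keep]
# leave-one-out identity, Eq. (11)
shortcut = y[n] - (Ainv @ y)[n] / Ainv[n, n]
assert np.allclose(direct, shortcut)
```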
Turning to the variance in the predicted target vector ($\mathrm{Var}[\bar{y}(x_n)]$), the standard formula gives Rasmussen and Williams (2005)
(12) $\mathrm{Var}[\bar{y}(x_n)] = K_{nn} - \sum_{m,m'\neq n} K_{nm}\,[\tilde{A}^{-1}]_{mm'}\,K_{m'n}$
which using the above relations gives
(13) $\mathrm{Var}[\bar{y}(x_n)] = \dfrac{1}{[A^{-1}]_{nn}} - \sigma^2$
Note that since $A$ is positive definite with minimal eigenvalue of at least $\sigma^2$, its inverse $A^{-1}$ has maximal eigenvalue of at most $\sigma^{-2}$; we thus get that $[A^{-1}]_{nn} \leq \sigma^{-2}$ and therefore the variance is non-negative, as required.
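The compact variance expression of Eq. (13) can be verified against the standard formula of Eq. (12) in the same fashion:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 7, 0.2
X = rng.normal(size=(N, 3))
K = np.exp(-np.sum((X[:, None] - X[None, :])**2, axis=-1))
A = K + sigma2 * np.eye(N)
Ainv = np.linalg.inv(A)

n = 0
keep = list(range(1, N))
# posterior variance at x_n from the standard GP formula, Eq. (12)
direct = K[n, n] - K[n, keep] @ np.linalg.inv(A[np.ix_(keep, keep)]) @ K[keep, n]
# compact form, Eq. (13)
shortcut = 1.0 / Ainv[n, n] - sigma2
assert np.allclose(direct, shortcut)
assert shortcut >= 0          # non-negativity, as argued above
```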
We next define the DGL function as the MSE loss of the Bayesian prediction
(14) $\mathcal{L}_{\mathrm{DGL}}(D_N) = \frac{1}{N}\sum_{n=1}^{N} \big|y_n - \bar{y}(x_n)\big|^2 = \frac{1}{N}\sum_{n=1}^{N} \Big|\frac{[A^{-1} y]_n}{[A^{-1}]_{nn}}\Big|^2$
Notably one can also add the variance ($\mathrm{Var}[\bar{y}(x_n)]$) to this expression, making it a more accurate measure of the expected MSE loss. For simplicity, and since we found that it makes little difference in practice, we did not do so in the text. The Github repository we opened has this option available. In the generic case in which the covariance matrix has no kernel, and taking the limit of zero $\sigma$, we obtain
(15) $\mathcal{L}_{\mathrm{DGL}}(D_N) = \frac{1}{N}\sum_{n=1}^{N} \Big|\frac{[K^{-1} y]_n}{[K^{-1}]_{nn}}\Big|^2$
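For concreteness, a minimal sketch of evaluating the DGL on toy data; the function name `dgl_loss` and the Gaussian kernel are our illustrative choices, not the repository's API:

```python
import numpy as np

def dgl_loss(K, Y, sigma2=0.0):
    """MSE of the Bayesian leave-one-out predictions, per Eq. (14);
    sigma2=0 on an invertible kernel recovers the limit of Eq. (15)."""
    A_inv = np.linalg.inv(K + sigma2 * np.eye(len(K)))
    # residual y_n - ybar(x_n) for every point and every class at once
    residuals = (A_inv @ Y) / np.diag(A_inv)[:, None]
    return float(np.mean(np.sum(residuals**2, axis=1)))

rng = np.random.default_rng(3)
N, C = 10, 3
X = rng.normal(size=(N, 5))                   # activations of some layer
Y = np.eye(C)[rng.integers(0, C, size=N)]     # one-hot labels
K = np.exp(-np.sum((X[:, None] - X[None, :])**2, axis=-1))
assert dgl_loss(K, Y, sigma2=1e-3) >= 0.0
```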
Appendix B Information Bottleneck from the Pair Distribution Function.
The Information Bottleneck (IB) approach asserts Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017) that each layer, having activations $T$, minimizes the loss function $I(X;T) - \beta I(Y;T)$, where $I(X;T)$ ($I(Y;T)$) is the mutual information between the activations and the input (label) and $\beta$ is an undetermined layer-specific constant which is usually of order 100 Shwartz-Ziv and Tishby (2017). Notably, IB was proposed for deterministic networks in which $T$ is a deterministic function of $X$. As commented in many works Saxe et al. (2018); Kolchinsky and Tracey (2017), in such settings mutual information quantities are ill defined and require a regulator. The regulator defines how much information is in one datapoint and how close two points have to be to collapse into one point. One type of regulator several authors recommend Saxe et al. (2018); Anonymous (2018) consists of adding a very small Gaussian random noise of variance $\sigma^2$ to $T$ and using that perturbed $T$ in the above loss.
For $\sigma$ much smaller than the typical inter-datapoint spacing, and at high dimension, one can fairly assume that pairs of datapoints coming close in the space of activations cause the vast majority of the information loss, whereas triplets of datapoints coming close are far more rare. Clearly, for low enough $\sigma$ (i.e. the deterministic limit) this would always be true unless three points happen to collapse exactly on one another. Taking this as our prescription for determining $\sigma$, we show below that mutual information becomes a property of the pair-distribution-function (PDF) of the dataset (defined below), and as a result the IB compression can be measured through knowledge of the pairwise distances between all points alone. Such PDFs were analyzed in Ref. Goldfeld et al. (2018), and indeed compression (following auxiliary noise addition) was linked to a reduction of pairwise distances in these PDFs.
We turn to establish the mapping between mutual information with a small noise regulator and the pair-distribution function. For brevity we focus only on $I(X;T)$. We make the reasonable assumption that the datapoints ($x_i$) have no repetitions and are all equally likely. Using $I(X;T) = H(T) - H(T|X)$ we first find that the second contribution is just the entropy of the noise. The latter is a $d$-dimensional Gaussian distribution with variance $\sigma^2$, whose entropy we denote by $H_\sigma$. The former is the entropy of $T$. In cases where all datapoints in activation space ($t_i$) are much further apart than the scale of $\sigma$, the entropy $H(T)$ becomes that of choosing a datapoint ($\log(N)$, where $N$ is the number of datapoints) plus that of a single noisy datapoint ($H_\sigma$). This implies that $I(X;T) = \log(N)$, as expected in this limit. Next consider the case where most points are far apart but some points are bound into pairs. The entropy is now given by
(16) $H(T) = \log(N) + H_\sigma + \frac{2}{N}\sum_{\langle ij \rangle} \big( H_2(r_{ij}) - \log(2) - H_\sigma \big)$
where $\langle ij \rangle$ runs over all bound pairs, $r_{ij}$ is the distance between the members of the pair, and $H_2(r)$ is the entropy of a mixture of two $d$-dimensional Gaussians with variance $\sigma^2$ whose centers are at distance $r$. Noting that $H_2(r) - \log(2) - H_\sigma$ decays rapidly (as a Gaussian) once $r \gg \sigma$, one can just as well extend this sum over bound pairs to a sum over all pairs of points, finally arriving at
(17) $H(T) = \log(N) + H_\sigma + \frac{1}{N}\sum_{i \neq j} \big( H_2(r_{ij}) - \log(2) - H_\sigma \big)$
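The rapid decay of the pair correction $H_2(r)-\log(2)-H_\sigma$, which justifies extending the sum, can be checked by a small Monte-Carlo computation; for simplicity we take $d=1$, and the width and sample sizes are arbitrary:

```python
import numpy as np

def mixture_entropy(r, sigma=0.05, samples=200_000, seed=7):
    """Monte-Carlo estimate of the entropy of an equal mixture of two
    1-d Gaussians of width sigma whose centers sit at distance r."""
    rng = np.random.default_rng(seed)
    which = rng.integers(0, 2, size=samples)       # pick a component
    x = rng.normal(loc=which * r, scale=sigma)
    p = 0.5 * (np.exp(-x**2 / (2 * sigma**2))
               + np.exp(-(x - r)**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)
    return -np.mean(np.log(p))

H_sigma = 0.5 * np.log(2 * np.pi * np.e * 0.05**2)   # entropy of a single Gaussian
# collapsed pair: the mixture is a single Gaussian, H_2(0) = H_sigma
assert abs(mixture_entropy(0.0) - H_sigma) < 0.01
# well-separated pair: one extra bit of "which point" entropy, H_2 -> log 2 + H_sigma
assert abs(mixture_entropy(1.0) - (np.log(2) + H_sigma)) < 0.01
```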
A summation of two-particle/datapoint terms such as the one above can always be expressed using the pair-distribution-function (PDF), whose standard definition is
(18) $p(r) = \frac{1}{N}\sum_{i \neq j} \delta(r - |t_i - t_j|)$
It is then easy to verify that
(19) $I(X;T) = \log(N) + \int dr\, p(r)\, \big( H_2(r) - \log(2) - H_\sigma \big)$
Similarly, $I(Y;T)$ can be expressed using the opposite-label PDF, given by
(20) $p_{\mathrm{opp}}(r) = \frac{1}{N}\sum_{i \neq j:\, y_i \neq y_j} \delta(r - |t_i - t_j|)$
where $i$ and $j$ scan datapoints with opposite labels. We thus conclude that optimization of the IB functional following noise regularization, either in the limit of $\sigma \to 0$ or in the limit where three points reaching a distance of order $\sigma$ are rare, is simply a particular type of label-dependent pairwise interaction.
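Both PDFs are straightforward to tabulate from the activations alone; a minimal sketch on toy data (sizes and the binary labels are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 200, 10
T = rng.normal(size=(N, d))               # activations of one layer (toy data)
labels = rng.integers(0, 2, size=N)       # binary labels

# all pairwise distances -- the empirical input to the PDF of Eq. (18)
r = np.linalg.norm(T[:, None] - T[None, :], axis=-1)
i, j = np.triu_indices(N, k=1)
dists_all = r[i, j]
# opposite-label pairs only -- the input to the PDF of Eq. (20)
dists_opp = dists_all[labels[i] != labels[j]]

# histogram estimates of the two PDFs on a common binning
hist_all, edges = np.histogram(dists_all, bins=40, density=True)
hist_opp, _ = np.histogram(dists_opp, bins=edges, density=True)
assert dists_all.size == N * (N - 1) // 2
assert dists_opp.size < dists_all.size
```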
Appendix C DGL for the pre-classifier layer
Here we derive in detail the DGL of the pre-classifier layer. The object we need is the inverse of $A = K + \sigma^2 I_N$. The kernel matrix entering it is defined by
(21) $K_{nm} = \sigma_b^2 + \frac{\sigma_w^2}{N_{act}}\,[T T^{\mathsf T}]_{nm}$
where we recall that $T$ is an $N$ by $N_{act}$ matrix given by $T_{na} = t_a(x_n)$, the pre-classifier activations on the dataset. Taking the limit of $\sigma_b \to 0$ one immediately has that
(22) $K = \frac{\sigma_w^2}{N_{act}}\, T T^{\mathsf T}$
Without fine tuning, $T^{\mathsf T} T$ is positive-definite. Notably this statement is equivalent to saying that the matrix $T$ has linearly independent columns. When $N \geq N_{act}$, having two linearly dependent columns requires fine-tuning of the parameters, hence for large $N_{act}$ this becomes extremely unlikely under any reasonable ensemble for $T$.
In this case one can show that $P = I_N - T (T^{\mathsf T} T)^{-1} T^{\mathsf T}$ is the projector onto the kernel of $K$. Indeed
(23) $P^2 = P$
(24) $K P = 0$
(25) $P T = 0$
The first equation implies that $P$ is a projector (in fact an Hermitian projector, as is easy to verify). The second, that its image is in the kernel of $K$. The third, that the kernel of $P$ contains the image of $K$ (the column space of $T$) and hence, counting dimensions, coincides with it. All in all it implies that $P$ is the projector whose image coincides with the kernel of $K$, as required.
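Taking $P = I_N - T(T^{\mathsf T}T)^{-1}T^{\mathsf T}$, the standard least-squares projector (our reading of the construction above), its defining properties can be verified numerically on toy activations:

```python
import numpy as np

rng = np.random.default_rng(5)
N, N_act = 12, 4
T = rng.normal(size=(N, N_act))                 # pre-classifier activations, N_act < N
K = T @ T.T                                     # rank-deficient kernel (rank N_act)
P = np.eye(N) - T @ np.linalg.inv(T.T @ T) @ T.T

assert np.allclose(P @ P, P)                    # Eq. (23): P is a projector
assert np.allclose(K @ P, np.zeros((N, N)))     # Eq. (24): im(P) lies in ker(K)
assert np.allclose(P @ T, np.zeros((N, N_act))) # Eq. (25): col(T) = im(K) lies in ker(P)
assert np.isclose(np.trace(P), N - N_act)       # dim im(P) matches dim ker(K)
```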
Next we consider Eq. (11). The fact that the kernel of $K$ is nontrivial adds several complicated terms to our loss. These terms all depend on $Q = T(T^{\mathsf T}T)^{-1}T^{\mathsf T} = I_N - P$, which we next expand via the decomposition
(26) $A^{-1} = \sigma^{-2} P + Q\,(K + \sigma^2 I_N)^{-1}\,Q$
On the right hand side we note that the image of $Q$ is of dimension $N_{act}$; consequently the trace of $Q$ is $N_{act}$ and its typical matrix element is of order $N_{act}/N$, while the elements of $P$ are $\delta_{nm} - O(N_{act}/N)$. Notably this statement is only accurate elementwise when we assume that the image of $T$ has no particular relation to the basis in which the matrix is written. For this not to hold it would require that at least one $N_{act}$-dimensional row of $T$ is orthogonal to all the remaining rows. This is again exponentially unlikely in the limit of large $N_{act}$ under any reasonable ensemble for $T$.
Accordingly we treat the expansion in $Q$ as an expansion in $N_{act}/N$. For instance we can then expand
(27) $\frac{[A^{-1} y]_n}{[A^{-1}]_{nn}} \xrightarrow{\sigma \to 0} \frac{[P y]_n}{P_{nn}} = [P y]_n\,\big(1 + Q_{nn} + O\big((N_{act}/N)^2\big)\big)$
Plugging this into Eq. (15) we obtain
(28) $\mathcal{L}_{\mathrm{DGL}} = \frac{1}{N}\sum_{n=1}^{N} \big|[P y]_n\big|^2\,\big(1 + O(N_{act}/N)\big) = \frac{1}{N}\,\|P y\|^2\,\big(1 + O(N_{act}/N)\big)$
as in the main text.
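A quick numerical check of this limit: for small $\sigma^2$ the leave-one-out residuals entering the DGL are dominated by the projection of the targets onto the kernel of $K$ (toy sizes, a single regression target):

```python
import numpy as np

rng = np.random.default_rng(6)
N, N_act, sigma2 = 12, 4, 1e-6
T = rng.normal(size=(N, N_act))
y = rng.normal(size=N)                      # a single regression target
K = T @ T.T                                 # rank-deficient pre-classifier kernel
P = np.eye(N) - T @ np.linalg.inv(T.T @ T) @ T.T

Ainv = np.linalg.inv(K + sigma2 * np.eye(N))
exact = (Ainv @ y) / np.diag(Ainv)          # residuals of the Bayesian prediction
leading = (P @ y) / np.diag(P)              # leading small-sigma behaviour
assert np.allclose(exact, leading, rtol=1e-3)
```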