Entropy and mutual information in models of deep neural networks

05/24/2018
by Marylou Gabrié, et al.
École Normale Supérieure

We examine a class of deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) We show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that weight matrices are independent and orthogonally-invariant. (ii) We extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layer networks with Gaussian random weights, using the recently introduced adaptive interpolation method. (iii) We propose an experimental framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.



Code Repositories

learning-synthetic-data: package to run learning experiments on synthetic datasets, using the dnner package for replica entropy computations.

dnner: Deep Neural Networks Entropy from Replicas.

1 Multi-layer model and main theoretical results

A stochastic multi-layer model— We consider a model of multi-layer stochastic feed-forward neural network where each element of the input layer $x^{(0)} \in \mathbb{R}^{n_0}$ is distributed independently according to a prior $p_0$, while hidden units at each successive layer $x^{(\ell)} \in \mathbb{R}^{n_\ell}$ (vectors are column vectors) come from $P_\ell\big(x^{(\ell)}_i \mid w_{\ell,i} \cdot x^{(\ell-1)}\big)$, with $w_{\ell,i}$ denoting the $i$-th row of the matrix of weights $W_\ell$. In other words,

$x^{(0)} \sim \prod_i p_0\big(x^{(0)}_i\big), \qquad x^{(\ell)} \sim \prod_i P_\ell\big(x^{(\ell)}_i \mid w_{\ell,i} \cdot x^{(\ell-1)}\big), \quad \ell = 1, \dots, L,$   (1)

given a set of weight matrices $\{W_\ell\}$ and distributions $\{P_\ell\}$ which encode possible non-linearities and stochastic noise applied to the hidden-layer variables, and a prior $p_0$ that generates the visible variables. In particular, for a non-linearity $x^{(\ell)}_i = \varphi_\ell\big(w_{\ell,i} \cdot x^{(\ell-1)}, \epsilon^{(\ell)}_i\big)$, where $\epsilon^{(\ell)}_i$ is the stochastic noise (independent for each $i$), we have $P_\ell\big(x^{(\ell)}_i \mid w_{\ell,i} \cdot x^{(\ell-1)}\big) = \mathbb{E}_\epsilon\, \delta\big(x^{(\ell)}_i - \varphi_\ell(w_{\ell,i} \cdot x^{(\ell-1)}, \epsilon^{(\ell)}_i)\big)$. Model (1) thus describes a Markov chain, which we denote by $X \to T_1 \to T_2 \to \cdots \to T_L$, with $X = x^{(0)}$, $T_\ell = x^{(\ell)}$, and the activation function $\varphi_\ell$ applied componentwise.
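To make the generative process in (1) concrete, the following minimal NumPy sketch samples one realization of such a stochastic multi-layer network; the layer sizes, the Gaussian weight ensemble and the noisy-ReLU channel are illustrative assumptions, not the settings used in the experiments of Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multilayer(n_samples, sizes, noise_std=1e-2):
    """Sample (X, [T_1, ..., T_L]) from a stochastic feed-forward model of the form (1).

    Each layer applies a random Gaussian weight matrix followed by a noisy ReLU,
    i.e. T_l = relu(W_l T_{l-1} + noise). Purely illustrative choices.
    """
    n0 = sizes[0]
    X = rng.standard_normal((n_samples, n0))          # separable Gaussian prior p_0
    layers, T = [], X
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)   # i.i.d. Gaussian weights
        pre = T @ W + noise_std * rng.standard_normal((n_samples, n_out))
        T = np.maximum(pre, 0.0)                       # componentwise activation phi_l
        layers.append(T)
    return X, layers

X, (T1, T2) = sample_multilayer(n_samples=1000, sizes=[500, 500, 500])
print(T1.shape, T2.shape)
```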

Replica formula—

We shall work in the asymptotic high-dimensional statistics regime where all the layer-size ratios $\tilde{\alpha}_\ell = n_\ell / n_0$ are of order one while $n_0 \to \infty$, and make the important assumption that all matrices $W_\ell$ are orthogonally-invariant random matrices independent from each other; in other words, each matrix can be decomposed as a product of three matrices, $W_\ell = U_\ell S_\ell V_\ell$, where $U_\ell$ and $V_\ell$ are independently sampled from the Haar measure, and $S_\ell$ is a diagonal matrix of singular values. The main technical tool we use is a formula for the entropies of the hidden variables, $H(T_\ell)$, and for the mutual information between adjacent layers, $I(T_\ell; T_{\ell+1})$, based on the heuristic replica method [mezard_spin_1987, mezard_information_2009, kabashima_inference_2008, manoel_multi-layer_2017]:

[Replica formula] Assume model (1) with $L$ layers in the high-dimensional limit, with componentwise activation functions and weight matrices generated from the ensemble described above, and denote by $\lambda_\ell$ the eigenvalues of $W_\ell^\top W_\ell$. Then for any $\ell$ the normalized entropy of $T_\ell$ is given by the minimum among all stationary points of the replica potential:

(2)

which depends on a finite set of scalar order parameters (one group per layer) and is written in terms of mutual informations and conditional entropies of scalar variables as

(3)

where the auxiliary quantities entering (3) are defined layer by layer from the order parameters; their explicit expressions are given in Section 1.4 of the Supplementary Material. In the computation of the conditional entropies in (3), the scalar variables are generated from the prior $p_0$ and from the effective single-layer channels

(4)
(5)

where the auxiliary Gaussian variables are independent. Finally, the remaining function entering the potential depends on the distribution of the eigenvalues $\lambda_\ell$ through

(6)

The computation of the entropy in the large-dimension limit, a computationally difficult task, has thus been reduced to the extremization of a function of a finite number of scalar order parameters, which only requires evaluating one- or two-dimensional integrals. This extremization can be done efficiently by means of a fixed-point iteration started from different initial conditions, as detailed in the Supplementary Material. Moreover, a user-friendly Python package is provided [dnner], which performs the computation for different choices of the prior $p_0$, the activations $\varphi_\ell$ and the spectra of the weight matrices. Finally, the mutual information between successive layers can be obtained from the entropy at the cost of one additional two-dimensional integral, see Section 1.6.1 of the Supplementary Material.
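The extremization strategy itself is generic: iterate the stationarity conditions with damping from several initial conditions and keep the stationary point with the smallest potential. The sketch below illustrates this strategy on a classic toy saddle-point problem (the mean-field magnetization equation); it is not the dnner implementation, whose interface we do not reproduce here.

```python
import numpy as np

def solve_saddle_point(update, potential, dim, n_restarts=10,
                       damping=0.5, tol=1e-10, max_iter=10_000, seed=0):
    """Damped fixed-point iteration q <- (1-damping)*update(q) + damping*q,
    restarted from several random initial conditions; returns the stationary
    point with the smallest value of `potential` (cf. the minimum over
    stationary points in Claim 1)."""
    rng = np.random.default_rng(seed)
    best_q, best_val = None, np.inf
    for _ in range(n_restarts):
        q = rng.uniform(0.0, 1.0, size=dim)
        for _ in range(max_iter):
            q_new = (1 - damping) * update(q) + damping * q
            if np.max(np.abs(q_new - q)) < tol:
                q = q_new
                break
            q = q_new
        val = potential(q)
        if val < best_val:
            best_q, best_val = q, val
    return best_q, best_val

# Toy usage: mean-field magnetization m = tanh(beta*m) at beta = 2, with its
# free energy f(m) = m^2/2 - log(2 cosh(beta*m))/beta as the "potential".
beta = 2.0
update = lambda q: np.tanh(beta * q)
potential = lambda q: 0.5 * q[0] ** 2 - np.log(2 * np.cosh(beta * q[0])) / beta
q_star, f_star = solve_saddle_point(update, potential, dim=1)
print(q_star, f_star)
```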

Our approach in the derivation of (3) builds on recent progress in statistical estimation and information theory for generalized linear models, following the application of methods from the statistical physics of disordered systems [mezard_spin_1987, mezard_information_2009] to communication [tulino_support_2013], statistics [Donoho2016] and machine learning problems [seung_statistical_1992, engel_statistical_2001]. In particular, we use advanced mean-field theory [opper2001advanced] and the heuristic replica method [mezard_spin_1987, kabashima_inference_2008], along with its recent extension to multi-layer estimation [manoel_multi-layer_2017, fletcher_inference_2017], in order to derive the above formula (3). This derivation is lengthy and thus given in the Supplementary Material. In a related contribution, Reeves [reeves_additivity_2017] proposed a formula for the mutual information in the multi-layer setting using heuristic information-theoretic arguments. Like ours, it exhibits layer-wise additivity, and the two formulas are conjectured to be equivalent.

Rigorous statement— We recall the assumptions under which the replica formula of Claim 1 is conjectured to be exact: (i) weight matrices are drawn from an ensemble of random orthogonally-invariant matrices, (ii) matrices at different layers are statistically independent, and (iii) layers have a large dimension and the respective sizes of adjacent layers are such that weight matrices have aspect ratios of order one. While we could not prove the replica prediction in full generality, we stress that it comes with multiple credentials: (i) for a Gaussian prior and Gaussian conditional distributions, it corresponds to the exact analytical solution when weight matrices are independent of each other (see Section 1.6.2 of the Supplementary Material). (ii) In the single-layer case with a Gaussian weight matrix, it reduces to formula (13) in the Supplementary Material, which has been recently rigorously proven for (almost) all activation functions [barbier_phase_2017]. (iii) In the case of Gaussian conditional distributions, it has also been proven for a large ensemble of random matrices [toappear], and (iv) it is consistent with all the results of the AMP [donoho2009message, zdeborova_statistical_2016, GAMP] and VAMP [rangan_vector_2017] algorithms, and of their multi-layer versions [manoel_multi-layer_2017, fletcher_inference_2017], known to perform well for these estimation problems.

In order to go beyond results for the single-layer problem and heuristic arguments, we prove Claim 1 for the more involved multi-layer case, assuming Gaussian i.i.d. matrices and two non-linear layers: [Two-layer Gaussian replica formula] Suppose that the distribution of the input units is separable and has bounded support; that the activations defining the two conditional distributions are bounded with bounded first and second derivatives with respect to their first argument; and that the two weight matrices have Gaussian i.i.d. entries. Then for model (1) with two layers, the high-dimensional limit of the entropy verifies Claim 1.

The theorem, which settles the conjecture presented in [manoel_multi-layer_2017], is proven using the adaptive interpolation method of [barbier2017stochastic, barbier_phase_2017] in a multi-layer setting, as first developed in [2017arXiv170910368B]. The lengthy proof, presented in detail in the Supplementary Material, is of independent interest; it adds further credentials to the replica formula and offers a clear direction for further developments. Note that, following the same approximation arguments as in [barbier_phase_2017], where the proof is given for the single-layer case, the bounded-support hypothesis on the prior can be relaxed to the existence of its second moment, and the Gaussianity of the matrices can be relaxed to i.i.d. entries with zero mean, appropriately scaled variance and finite third moment.

2 Tractable models for deep learning

The multi-layer model presented above can be leveraged to simulate two prototypical settings of deep supervised learning on synthetic datasets, both amenable to the tractable replica computation of entropies and mutual informations.

Figure 1: Two models of synthetic data

The first scenario is the so-called teacher-student setting (see Figure 1, left). Here, we assume that the input $X$ is distributed according to a separable prior distribution, factorized over the components of $X$, and that the corresponding label $Y$ is given by applying a fixed mapping to $X$, called the teacher. After generating a training set and a test set in this manner, we train a deep neural network, the student, on the synthetic dataset. In this case the data themselves have a simple structure, given by the separable prior.
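As an illustration, a teacher-student dataset of this kind can be generated along the following lines; the single-layer sign teacher, the Gaussian prior and the sizes are arbitrary choices for this sketch, not the teachers used in Section 3.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_teacher_student_dataset(n_samples, n_in, teacher, prior=rng.standard_normal):
    """Draw inputs X with i.i.d. components from a separable prior and label them
    with a fixed teacher mapping Y = teacher(X)."""
    X = prior((n_samples, n_in))
    Y = teacher(X)
    return X, Y

# Illustrative teacher: a random single-layer network with sign outputs.
n_in = 500
W_teacher = rng.standard_normal((n_in, 1)) / np.sqrt(n_in)
teacher = lambda X: np.sign(X @ W_teacher)

X_train, Y_train = make_teacher_student_dataset(100_000, n_in, teacher)
X_test,  Y_test  = make_teacher_student_dataset(100_000, n_in, teacher)
```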

In contrast, the second scenario allows generative models (see Figure 1, right) that create more structure, and that are reminiscent of the generative-recognition pair of models of a Variational Autoencoder (VAE). A code vector $Z$ is sampled from a separable prior distribution and a corresponding data point $X$ is generated by a possibly stochastic neural network, the generative model. This setting makes it possible to create input data featuring correlations, unlike the teacher-student scenario. The supervised learning task we study then consists in training a deep neural network, the recognition model, to recover the code $Z$ from $X$.

In both cases, the chain going from the separable variable to any later layer is a Markov chain of the form (1). In the first scenario, model (1) directly maps to the student network. In the second scenario, however, model (1) actually maps to the feed-forward combination of the generative model followed by the recognition model. This shift is necessary to satisfy the assumption that the starting point (now given by the code $Z$) has a separable distribution. In particular, it generates correlated input data while still allowing for the computation of the entropy of any subsequent layer.

At the start of a neural network training, weight matrices initialized as i.i.d. Gaussian random matrices satisfy the necessary assumptions of the formula of Claim 1. In their singular value decomposition

$W_\ell = U_\ell S_\ell V_\ell$,   (7)

the matrices $U_\ell$ and $V_\ell$ are typical independent samples from the Haar measure across all layers. To make sure the weight matrices remain close enough to this ensemble during learning, we define a custom weight constraint which consists in keeping $U_\ell$ and $V_\ell$ fixed while only the matrix $S_\ell$, constrained to be diagonal, is updated. The number of parameters per layer is thus reduced from $n_{\ell-1}\, n_\ell$ to $\min(n_{\ell-1}, n_\ell)$. We refer to layers following this weight constraint as USV-layers. For the replica formula of Claim 1 to be correct, the matrices from different layers should furthermore remain uncorrelated during learning. In Section 3, we consider the training of linear networks, for which information-theoretic quantities can be computed analytically, and confirm numerically that with USV-layers the replica-predicted entropy is correct at all times. In the following, we assume that this is also the case for non-linear networks.
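A minimal tf.keras sketch of such a layer is given below; the class name USVDense and the implementation details are ours, not the code of the provided packages. The orthogonal factors are drawn once at build time and kept fixed, while only the vector of singular values is trainable.

```python
import tensorflow as tf

class USVDense(tf.keras.layers.Layer):
    """Dense layer with weights W = U diag(s) V, where U and V are fixed
    orthogonal matrices and only s is trainable. Illustrative sketch of a
    USV-layer, not the exact implementation used in the paper."""

    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        n_in, n_out = int(input_shape[-1]), self.units
        k = min(n_in, n_out)
        ortho = tf.keras.initializers.Orthogonal()
        # Fixed orthogonal factors: sampled once, never updated by SGD.
        self.U = self.add_weight(name="U", shape=(n_in, k),
                                 initializer=ortho, trainable=False)
        self.V = self.add_weight(name="V", shape=(k, n_out),
                                 initializer=ortho, trainable=False)
        # Trainable singular values: the only learned parameters of the layer.
        self.s = self.add_weight(name="s", shape=(k,),
                                 initializer="ones", trainable=True)

    def call(self, x):
        W = self.U @ tf.linalg.diag(self.s) @ self.V   # W = U S V
        return self.activation(x @ W)
```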

In Section 3.2 of the Supplementary Material we train a neural network with USV-layers on a simple real-world dataset (MNIST), showing that these layers can learn to represent complex functions despite their restriction. We further note that such a product decomposition is reminiscent of a series of works on adaptive structured efficient linear layers (SELLs and ACDC) [Moczulski2015, Yang2015], motivated in that case by speed gains, where only diagonal matrices are learned (in these works the matrices $U$ and $V$ are chosen instead as permutations of Fourier or Hadamard matrices, so that the matrix multiplication can be replaced by fast transforms). In Section 3, we discuss learning experiments with USV-layers on synthetic datasets.

While we have defined model (1) as a stochastic model, traditional feed-forward neural networks are deterministic. In the numerical experiments of Section 3, we train and test networks without injecting noise, and only assume a noise model in the computation of information-theoretic quantities. Indeed, for continuous variables the presence of noise is necessary for mutual informations to remain finite (see the discussion of Appendix C in [saxe_information_2018]). We assume at the layer under consideration an additive white Gaussian noise of small amplitude injected just before passing through its activation function, while keeping the mapping up to the previous layer deterministic. This choice attempts to stay as close as possible to the deterministic neural network, but remains inevitably somewhat arbitrary (see again the discussion of Appendix C in [saxe_information_2018]).
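Concretely, when computing information-theoretic quantities for layer $\ell$ we therefore replace the deterministic unit by its noisy counterpart (with $\sigma^2$ our generic notation for the small injected variance),

$$
T_\ell = \varphi_\ell\big(W_\ell\, T_{\ell-1} + \epsilon_\ell\big), \qquad \epsilon_\ell \sim \mathcal{N}\big(0, \sigma^2 I\big), \qquad \sigma^2 \ll 1,
$$

while the mapping $X \to T_{\ell-1}$ is kept deterministic.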

Other related works— The strategy of studying neural network models with random weight matrices and/or random data, using methods originating in statistical physics heuristics such as the replica and cavity methods [mezard_spin_1987], has a long history. Before the deep learning era, this approach led to pioneering results in learning for the Hopfield model [amit1985storing] and for the random perceptron [gardner1989three, mezard_space_1989, seung_statistical_1992, engel_statistical_2001].

Recently, the successes of deep learning, along with the prohibitive complexity of studying real-world problems, have sparked renewed interest in the direction of random weight matrices. Recent results (without any claim of exhaustivity) were obtained on the spectrum of the Gram matrix at each layer using random matrix theory [louart2017harnessing, pennington2017nonlinear], on the expressivity of deep neural networks [raghu2016expressive], on the dynamics of propagation and learning [saxe2013exact, schoenholz2016deep, Advani2017, Baldassi11079], on the high-dimensional non-convex landscape where the learning takes place [NIPS2014_5486], and on the universal random Gaussian neural nets of [giryes2016deep].

The information bottleneck theory [IB] applied to neural networks consists in computing the mutual information between the data and the learned hidden representations on the one hand, and between the labels and these same hidden representations on the other hand [Tishby2015, shwartz-ziv_opening_2017]. A successful training should maximize the information with respect to the labels and simultaneously minimize the information with respect to the input data, preventing overfitting and leading to good generalization. While this intuition suggests new learning algorithms and regularizers [Chalk2016, Achille2016, Alemi2017, Achille2017, Kolchinsky2017, Belghazi2017, Zhao2017], we can also hypothesize that this mechanism is already at play in a priori unrelated, commonly used optimization methods, such as plain stochastic gradient descent (SGD). This hypothesis was first tested in practice by [shwartz-ziv_opening_2017] on very small neural networks, so as to allow the entropy to be estimated by binning the hidden neurons' activities. Afterwards, the authors of [saxe_information_2018] reproduced the results of [shwartz-ziv_opening_2017] on small networks using the continuous entropy estimator of [Kolchinsky2017], but found that the overall behavior of the mutual information during learning is greatly affected by the nature of the non-linearities. Additionally, they investigated the training of larger linear networks on i.i.d. normally distributed inputs, for which entropies at each hidden layer can be computed analytically under an additive Gaussian noise assumption. The strategy proposed in the present paper allows us to evaluate entropies and mutual informations in non-linear networks larger than those of [saxe_information_2018, shwartz-ziv_opening_2017].

3 Numerical experiments

We present a series of experiments aiming both at further validating the replica estimator and at leveraging its power in noteworthy applications. A first application, presented in paragraph 3.1, consists in using the replica formula, in settings where it is proven to be rigorously exact, as a basis of comparison for other entropy estimators. The same experiment also contributes to the discussion of the information bottleneck theory for neural networks by showing how, without any learning, information-theoretic quantities behave differently for different non-linearities. In paragraph 3.2, we validate the accuracy of the replica formula in a learning experiment with USV-layers (where it is not proven to be exact) by considering the case of linear networks, for which information-theoretic quantities can otherwise be computed in closed form. Finally, in paragraph 3.3, we consider a second application testing the information bottleneck theory for large non-linear networks. To this aim, we use the replica estimator to study compression effects during learning.

3.1 Estimators and activation comparisons— Two non-parametric estimators have already been considered by [saxe_information_2018] to compute entropies and/or mutual informations during learning. The kernel-density approach of Kolchinsky et al. [Kolchinsky2017] consists in fitting a mixture of Gaussians (MoG) to samples of the variable of interest and subsequently computing an upper bound on the entropy of the MoG [Kolchinsky2017a]. The method of Kraskov et al. [Kraskov2004] uses nearest-neighbor distances between samples to directly build an estimate of the entropy. Both methods require the computation of the matrix of distances between samples. Recently, [Belghazi2017] proposed a new non-parametric estimator of mutual informations which involves the optimization of a neural network to tighten a bound. It is unfortunately computationally hard to test how these estimators behave in high dimension since, even for a known distribution, the computation of the entropy is intractable (#P-complete) in most cases. However, the replica method proposed here provides a valuable point of comparison in the cases where it is rigorously exact.
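For reference, the nearest-neighbor approach of [Kraskov2004] builds on the Kozachenko-Leonenko estimate of differential entropy; a minimal sketch of that estimator (our own helper, in nats, without the additional corrections of the full mutual-information estimator) reads:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(samples, k=3):
    """Kozachenko-Leonenko differential entropy estimate (in nats), based on
    distances to the k-th nearest neighbor of each sample, as used in
    Kraskov-type estimators. Minimal sketch for illustration."""
    X = np.asarray(samples, dtype=float)
    n, d = X.shape
    tree = cKDTree(X)
    # k+1 because the closest neighbor of each point is the point itself.
    dist, _ = tree.query(X, k=k + 1)
    eps = 2.0 * dist[:, -1]                                        # diameter of the k-NN ball
    log_c_d = (d / 2.0) * np.log(np.pi) - gammaln(1.0 + d / 2.0)   # log volume of the unit d-ball
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))

# Sanity check on a standard Gaussian, whose entropy is 0.5*d*log(2*pi*e):
d = 5
X = np.random.default_rng(1).standard_normal((1000, d))
print(knn_entropy(X), 0.5 * d * np.log(2 * np.pi * np.e))
```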

In the first numerical experiment we place ourselves in the setting of Theorem 1: a two-layer network with i.i.d. weight matrices, where the formula of Claim 1 is thus rigorously exact in the limit of large networks, and we compare the replica results with the non-parametric estimators of [Kolchinsky2017] and [Kraskov2004]. Note that the requirement for smooth activations in Theorem 1 can be relaxed (see the discussion below the Theorem). Additionally, non-smooth functions can be approximated arbitrarily closely by smooth functions with equal information-theoretic quantities, up to numerical precision.

We consider a neural network $X \to T_1 \to T_2$ with layers of equal size. The input variable components are i.i.d. Gaussian with mean 0 and variance 1. The weight matrix entries are also i.i.d. Gaussian with mean 0; their standard deviation is rescaled by the usual inverse square root of the layer size and then multiplied by a coefficient varied around the value recommended for training initialization. To compute entropies, we consider noisy versions of the latent variables, where an additive white Gaussian noise of very small variance is added right before the activation function; the same is done in the remaining experiments to guarantee that the mutual informations remain finite. The non-parametric estimators [Kolchinsky2017, Kraskov2004] were evaluated using 1000 samples, as the cost of computing pairwise distances is significant in such high dimension, and we checked that the entropy estimate is stable over independent draws of a sample of this size (error bars smaller than the marker size). In Figure 2 we compare the different estimates of the entropies of the two hidden layers for different activation functions: linear, hardtanh or ReLU. The hardtanh activation is a piecewise-linear approximation of the tanh, equal to $-1$ below $-1$, to the identity on $[-1, 1]$, and to $1$ above $1$, for which the integrals in the replica formula can be evaluated faster than for the tanh.

In the linear and hardtanh cases, the non-parametric methods follow the trend of the replica estimate as the weight scaling is varied, but appear to systematically over-estimate the entropy. For linear networks with Gaussian inputs and additive Gaussian noise, every layer is also a multivariate Gaussian, and the entropies can therefore be computed directly in closed form ("exact" in the plot legend). When using the Kolchinsky estimate in the linear case, we also check the consistency of two strategies: either fitting the MoG to the noisy sample, or fitting the MoG to the deterministic part of the latent variable and augmenting the resulting variance with that of the noise, as done in [Kolchinsky2017] ("Kolchinsky et al. parametric" in the plot legend). In the network with hardtanh non-linearities, we check that for small weight values the entropies are the same as in a linear network with the same weights ("linear approx" in the plot legend, computed using the exact analytical result for linear networks and therefore plotted in a similar color to "exact"). Lastly, in the case of the ReLU-ReLU network, we note that the non-parametric methods predict an entropy increasing like that of a linear network with identical weights, whereas the replica computation reflects its knowledge of the cut-off and accurately features a slope equal to half of the linear-network entropy ("1/2 linear approx" in the plot legend). While non-parametric estimators are invaluable tools able to approximate entropies from the mere knowledge of samples, they inevitably introduce estimation errors. The replica method takes the opposite view: while being restricted to a class of models, it can leverage its knowledge of the neural network structure to provide a reliable estimate. To our knowledge, there is no other entropy estimator able to incorporate such information about the underlying multi-layer model.
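As mentioned above, in the linear case every noisy latent variable is a multivariate Gaussian, so the baseline is available in closed form; a short sketch of this computation (our own helper, with the noise variance as a parameter) is:

```python
import numpy as np

def gaussian_layer_entropy(weight_matrices, noise_var=1e-5):
    """Exact differential entropy (in nats, for the full vector) of
    T = W_L ... W_1 X + eps, with X ~ N(0, I) and eps ~ N(0, noise_var * I):
    H(T) = 0.5 * logdet(2*pi*e*Cov(T)). Valid only for linear networks with
    Gaussian inputs and additive Gaussian noise."""
    M = weight_matrices[0]
    for W in weight_matrices[1:]:
        M = W @ M                                      # product W_L ... W_1
    cov = M @ M.T + noise_var * np.eye(M.shape[0])     # covariance of the noisy latent
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    return 0.5 * logdet

rng = np.random.default_rng(0)
n = 500
W1 = rng.standard_normal((n, n)) / np.sqrt(n)
W2 = rng.standard_normal((n, n)) / np.sqrt(n)
print(gaussian_layer_entropy([W1]), gaussian_layer_entropy([W1, W2]))
```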

Beyond informing about estimator accuracy, this experiment also unveils a simple but possibly important distinction between activation functions. For the hardtanh activation, as the magnitude of the random weights increases, the entropies decrease after reaching a maximum, whereas they only increase for the unbounded activation functions we consider, even for the single-side-saturating ReLU. This loss of information for bounded activations was also observed by [saxe_information_2018], where entropies were computed by discretizing the output of each neuron into bins of equal size. In that setting, as the tanh activation starts to saturate for large inputs, the extreme bins (at the saturation values) concentrate more and more probability mass, which explains the information loss. Here we confirm that the phenomenon is also observed when computing the entropy of the hardtanh network (without binning and with a small noise injected before the non-linearity). We check via the replica formula that the same phenomenology arises for the mutual informations between layers (see Section 3.1).

Figure 2: Entropies of the latent variables in stochastic two-layer networks $X \to T_1 \to T_2$ with equally sized layers, i.i.d. standard Gaussian inputs and i.i.d. Gaussian weights, as a function of the weight scaling parameter. An additive white Gaussian noise of small variance is added inside the non-linearity. Left column: linear network. Center column: hardtanh-hardtanh network. Right column: ReLU-ReLU network.

3.2 Learning experiments with linear networks— In the following, and in Section 3.3 of the Supplementary Material, we discuss training experiments on different instances of the deep learning models defined in Section 2. We seek to study the simplest possible training strategies achieving good generalization. Hence, for all experiments we use plain stochastic gradient descent (SGD) with constant learning rates, without momentum and without any explicit form of regularization. The sizes of the training and testing sets are taken equal and scale typically as a few hundred times the size of the input layer. Unless otherwise stated, plots correspond to single runs, yet we checked over a few repetitions that independent runs lead to identical qualitative behaviors. The values of mutual informations are computed by considering noisy versions of the latent variables, where an additive white Gaussian noise of very small variance is added right before the activation function, as in the previous experiment. This noise is neither present at training time, where it could act as a regularizer, nor at testing time. Given that the noise is only assumed at the last layer, the second-to-last layer is a deterministic mapping of the input variable; hence the replica formula yielding mutual informations between adjacent layers directly gives the mutual information with the input. We provide a second Python package [lsd] to implement, in Keras, learning experiments on synthetic datasets, using USV-layers and interfacing the first Python package [dnner] for replica computations.

To start with, we consider the training of a linear network in the teacher-student scenario. The teacher must also be linear to be learnable: we consider a simple single-layer network with additive white Gaussian noise, with an i.i.d. normally distributed input, an i.i.d. normally distributed teacher matrix, and an output of fixed size. We train a student network of three USV-layers plus one fully connected, unconstrained layer on the regression task, using plain SGD on the MSE loss. We recall that in the USV-layers (7) only the diagonal matrix is updated during learning. On the left panels of Figure 3, we report the learning curve and the mutual informations between the hidden layers and the input, in a case where all layers but the output have the same size. Again, this linear setting is analytically tractable and does not require the replica formula; a similar situation was studied in [saxe_information_2018]. In agreement with their observations, we find that the mutual informations keep increasing throughout learning, without compromising the generalization ability of the student. We also use this linear setting to demonstrate (i) that the replica formula remains correct throughout the learning of the USV-layers and (ii) that the replica method gets closer and closer to the exact result in the limit of large networks, as theoretically predicted. To this aim, we repeat the experiment for a range of layer sizes, and report the maximum and the mean value of the squared error on the estimation of the mutual informations over all epochs of 5 independent training runs. We find that even if errors tend to increase with the number of layers, they remain objectively very small and decrease drastically as the size of the layers increases.
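For completeness, with our notation (standard normal input $X$, product $M_\ell$ of the learned weight matrices up to layer $\ell$, and Gaussian noise of variance $\sigma^2$ injected at that layer), the analytically tractable quantity in this linear setting reads

$$
I(X; T_\ell) \;=\; \tfrac{1}{2} \log\det\!\Big( I + \sigma^{-2}\, M_\ell M_\ell^{\top} \Big),
$$

which is what we compare the replica estimate against.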

Figure 3: Training of a 4-layer linear student of varying size on a regression task generated by a linear teacher of fixed output size. Upper left: MSE loss on the training and testing sets during training by plain SGD. Best training loss is 0.004735, best testing loss is 0.004789. Lower left: corresponding evolution of the mutual information between each hidden layer and the input. Center-left, center-right, right: maximum and mean of the squared error of the replica estimation of the mutual information as a function of layer size, over the course of 5 independent trainings, for the first, second and third hidden layer respectively.

3.3 Learning experiments with deep non-linear networks— Finally, we apply the replica formula to estimate mutual informations during the training of non-linear networks on correlated input data.

We consider a simple single-layer generative model with a normally distributed code $Z$, the data $X$ being generated by multiplication by a random matrix with i.i.d. normally distributed entries plus an additive Gaussian noise. We then train a recognition model to solve the binary classification problem of recovering the label $y$, the sign of the first neuron of the code $Z$, using plain SGD, but this time to minimize the cross-entropy loss. Note that the rest of the initial code acts as noise/nuisance with respect to the learning task. We compare two 5-layer recognition models with 4 USV-layers plus one unconstrained layer, of sizes 500-1000-500-250-100-2, and activations either linear-ReLU-linear-ReLU-softmax (top row of Figure 4) or linear-hardtanh-linear-hardtanh-softmax (bottom row). Because USV-layers feature a number of parameters growing only linearly rather than quadratically with the layer size, we observe that they generally require more iterations to train. In the case of the ReLU network, adding interleaved linear layers was key to successful training with two non-linearities, which explains the somewhat unusual architecture proposed. For the recognition model using hardtanh this was actually not an issue (see the Supplementary Material for an experiment using only hardtanh activations); however, we consider a similar architecture for a fair comparison. We further discuss the learning ability of USV-layers in the Supplementary Material.
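Schematically, assembling these recognition models with the USVDense sketch of Section 2 could look as follows; the hypothetical module name, the learning rate and the label encoding are our illustrative assumptions, not the exact training setup of the paper.

```python
import tensorflow as tf
from usv_layers import USVDense  # the illustrative custom layer sketched in Section 2 (hypothetical module)

def make_recognition_model(hidden_activation="relu"):
    """5-layer recognition model: 4 USV-layers plus a final unconstrained dense layer,
    with sizes 500-1000-500-250-100-2 and interleaved linear layers, as in Figure 4."""
    acts = ["linear", hidden_activation, "linear", hidden_activation]
    sizes = [1000, 500, 250, 100]
    model = tf.keras.Sequential([tf.keras.Input(shape=(500,))])
    for units, act in zip(sizes, acts):
        model.add(USVDense(units, activation=act))
    model.add(tf.keras.layers.Dense(2, activation="softmax"))  # unconstrained output layer
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

hardtanh = lambda x: tf.clip_by_value(x, -1.0, 1.0)   # piecewise-linear approximation of tanh
relu_model = make_recognition_model("relu")           # linear-ReLU-linear-ReLU-softmax
hardtanh_model = make_recognition_model(hardtanh)     # linear-hardtanh-linear-hardtanh-softmax
```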

This experiment is reminiscent of the setting of [shwartz-ziv_opening_2017], yet now tractable for networks of larger sizes. For both types of non-linearities, we observe that the mutual information between the input and each hidden layer decreases during learning, except at the very beginning of training, where we can sometimes observe a short phase of increase (see the zooms in the insets). For the hardtanh layers this phase is longer and the initial increase of noticeable amplitude.

In this particular experiment, the claim of [shwartz-ziv_opening_2017] that compression can occur during training even with non-double-saturated activations seems corroborated (a phenomenon that was not observed by [saxe_information_2018]). Yet we do not observe that the compression is more pronounced in deeper layers, and its link to generalization remains elusive. For instance, we do not see a delay of generalization with respect to training accuracy/loss in the recognition model with hardtanh, despite an initial phase without compression in two of its layers. Further learning experiments, including a second run of this last experiment, are presented in the Supplementary Material.

Figure 4: Training of two recognition models on a binary classification task with correlated input data and either ReLU (top) or hardtanh (bottom) non-linearities. Left: training and generalization cross-entropy loss (left axis) and accuracies (right axis) during learning. Best training-testing accuracies are 0.995-0.991 for the ReLU version (top row) and 0.998-0.996 for the hardtanh version (bottom row). Remaining columns: mutual information between the input and successive hidden layers. Insets zoom on the first epochs.

4 Conclusion and perspectives

We have presented a class of deep learning models together with a tractable method to compute entropy and mutual information between layers. This, we believe, offers a promising framework for further investigations, and to this aim we provide Python packages that facilitate both the computation of mutual informations and the training, for an arbitrary implementation of the model. In the future, allowing for biases by extending the proposed formula would improve the fitting power of the considered neural network models.

We observe in our high-dimensional experiments that compression can happen during learning, even when using ReLU activations. While we did not observe a clear link between generalization and compression in our setting, there are many directions to be further explored within the models presented in Section 2. Studying the entropic effect of regularizers is a natural next step towards an entropic interpretation of generalization. Furthermore, while our experiments focused on supervised learning, the replica formula derived for multi-layer models is general and can be applied in unsupervised contexts, for instance in the theory of VAEs. On the rigorous side, the main perspective remains to prove the replica formula in the general case of multi-layer models, and to further confirm that the replica formula remains valid after the learning of the USV-layers. Another question worthy of future investigation is whether the replica method can be used to describe not only entropies and mutual informations for learned USV-layers, but also the optimal learning of the weights itself.

Acknowledgments

The authors would like to thank Léon Bottou, Antoine Maillard, Marc Mézard, Léo Miolane, and Galen Reeves for insightful discussions. This work has been supported by the ERC under the European Union’s FP7 Grant Agreement 307087-SPARCS and the European Union’s Horizon 2020 Research and Innovation Program 714608-SMiLe, as well as by the French Agence Nationale de la Recherche under grant ANR-17-CE23-0023-01 PAIL. Additional funding is acknowledged by MG from “Chaire de recherche sur les modèles et sciences des données”, Fondation CFM pour la Recherche-ENS; by AM from Labex DigiCosme; and by CL from the Swiss National Science Foundation under grant 200021E-175541. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

1 Replica formula for the entropy

1.1 Background

The replica method [sherrington_solvable_1975, mezard1990spin] was first developed in the context of disordered physical systems, where the strengths of the interactions $J$ are randomly distributed. Given the Boltzmann distribution over microstates at a fixed inverse temperature $\beta$, with partition function $Z(J)$, one is typically interested in the average free energy

$f = -\lim_{N \to \infty} \frac{1}{\beta N}\, \mathbb{E}_J \log Z(J),$   (8)

from which the typical macroscopic behavior is obtained. Computing (8) is hard in general, but can be done with the use of specific techniques. The replica method in particular employs the mathematical identity

$\mathbb{E}_J \log Z(J) = \lim_{n \to 0} \frac{1}{n} \log \mathbb{E}_J\big[Z(J)^n\big].$   (9)

Evaluating the average on the r.h.s. leads, under the replica-symmetry assumption, to an extremization of the so-called replica-symmetric free energy over a set of order parameters related to macroscopic quantities of the system. Computing (8) then amounts to solving the corresponding saddle-point equations.

Computing (8) is of interest in many problems outside of physics [nishimori_statistical_2001, mezard2009information]. Early applications of the replica method in machine learning include the evaluation of the optimal capacity and generalization error of the perceptron [gardner_space_1988, gardner_optimal_1988, mezard_space_1989, seung_statistical_1992, engel_statistical_2001]. More recently, it has also been used in the study of problems in telecommunications and signal processing, such as code-division multiple access [tanaka_statistical-mechanics_2002] and compressed sensing [rangan_asymptotic_2009, kabashima_typical_2009, ganguli_statistical_2010, krzakala_statistical-physics-based_2012]. For a review of these developments see [zdeborova_statistical_2016].

These particular examples all share a common probabilistic structure,

$y \sim P_{\rm out}(\,\cdot \mid W x), \qquad x \sim P_0(x),$   (10)

for a fixed matrix $W$ and different choices of $P_0$ and $P_{\rm out}$; in other words, they are all specific instances of generalized linear models (GLMs). Using Bayes' theorem, one writes the posterior distribution of $x$ given $y$ as $P(x \mid y) = P_{\rm out}(y \mid W x)\, P_0(x) / P(y)$; the replica method is then employed to evaluate the average log-marginal likelihood $\mathbb{E}_y \log P(y)$, which gives us typical properties of the model. Note that this quantity is, up to a sign, nothing but the entropy of $y$ given $W$, $H(y) = -\mathbb{E}_y \log P(y)$.

The matrix $W$ (playing the role of the couplings $J$ in the notation above) is usually assumed to have i.i.d. elements. However, one can also use the same techniques when $W$ belongs to an arbitrary orthogonally-invariant ensemble. This approach was pioneered by [marinari_replica_1994, parisi_mean-field_1995, opper_tractable_2001, cherrier_role_2003], and in the context of generalized linear models by [takeda_analysis_2006, muller_vector_2008, kabashima_inference_2008, shinzato_perceptron_2008, shinzato_learning_2009, tulino_support_2013, kabashima_signal_2014].

Generalizing the analysis of (10) to multi-layer models was first considered by [manoel_multi-layer_2017] in the context of Gaussian i.i.d. matrices, and by [fletcher_inference_2017, reeves_additivity_2017] for orthogonally-invariant ensembles. In particular, [reeves_additivity_2017] gives an expression for the replica free energy which should in principle be equivalent to the one we present, although its focus is on the derivation of this expression rather than on applications or explicit computations.

Finally, it is worth mentioning that even though the replica method is usually considered to be non-rigorous, its results have been proven to be exact for different classes of models, including GLMs [talagrand_spin_2003, panchenko2013sherrington, barbier_mutual_2016, lelarge_fundamental_2016, reeves_replica-symmetric_2016, BarbierOneLayerGLM], and are widely conjectured to be exact in general. In fact, in Section 2 we show how to prove the formula in the particular case of two layers with Gaussian matrices.

1.2 Entropy in single/multi-layer generalized linear models

1.2.1 Single-layer

For a single-layer generalized linear model

$y \sim P_{\rm out}(\,\cdot \mid W x), \qquad x \sim P_0(x),$   (11)

with $P_0$ and $P_{\rm out}$ separable in the components of $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, and $W$ Gaussian i.i.d., define the aspect ratio $\alpha = m/n$ and the second moment of the prior $\rho = \mathbb{E}_{P_0}[x^2]$. Then the entropy of $y$ in the limit $n \to \infty$ with $\alpha$ fixed is given by [zdeborova_statistical_2016, BarbierOneLayerGLM]

(12)

where

(13)

with the auxiliary scalar variables normally distributed with zero mean and unit variance (here $\mathbb{E}_\xi$ denotes integration over a standard Gaussian measure).

This can be adapted to orthogonally-invariant ensembles by using the techniques described in [kabashima_inference_2008]. Let $W = U S V$, where $U$ and $V$ are orthogonal, $S$ is diagonal and arbitrary, and the orthogonal factors are Haar distributed. We denote by $\rho_\lambda$ the distribution of the eigenvalues of $W^\top W$, and by $\rho$ the second moment of the prior. The entropy is then written as an extremization of a potential combining the two contributions

(14)

and

(15)

If the matrix $W$ is Gaussian i.i.d., $\rho_\lambda$ is the Marchenko-Pastur distribution and the extremization can be carried out explicitly, so that (13) is recovered. In this precise case, the formula has been proven rigorously in [BarbierOneLayerGLM].

1.2.2 Multi-layer

Consider the following multi-layer generalized linear model

(16)

where the weight matrices $W_\ell$ are fixed and the layer index $\ell$ runs from $1$ to $L$. Using Bayes' theorem we can write

(17)

with the normalization given by the marginal likelihood. Performing posterior inference requires one to evaluate this marginal likelihood,

(18)

which is in general hard to do. Our analysis employs the framework introduced in [manoel_multi-layer_2017] to compute the entropy of the output in the high-dimensional limit with finite layer-size ratios,

(19)

with the replica potential given by

(20)

and the auxiliary variables normally distributed with zero mean and unit variance. The effective scalar variables in the expression above are distributed as

(21)
(22)

where $\mathbb{E}_\xi$ denotes the integration over the standard Gaussian measure.

1.3 A simple heuristic derivation of the multi-layer formula

Formula (20) can be derived using a simple argument. Consider the case of two layers, where the model reads

(23)

with the two channels and weight matrices defined as above. For the problem of estimating the input given the knowledge of the output, we compute the free energy using the single-layer replica expression (14)

(24)
(25)

Note that

(26)

Moreover, the remaining term can be obtained from the replica free energy of another problem: that of estimating the intermediate hidden variable given the knowledge of a noisy observation of it, which can again be written using (14)

(27)

with

(28)
(29)

and the noise being integrated in the computation of the effective channel, see (22). Replacing (26)-(29) in (25) gives our formula (20) for two layers; further repeating this procedure allows one to write the equations for an arbitrary number of layers.

1.4 Formulation in terms of tractable integrals

While expression (20) is more easily written in terms of conditional entropies and mutual informations, evaluating it requires us to state it explicitly in terms of integrals, which we do below. We first consider the Gaussian i.i.d. case, for which the multi-layer formula was derived with the cavity and replica methods by [manoel_multi-layer_2017]; we shall use their results here. Assuming that all layer sizes grow to infinity with finite ratios and using the replica formalism, Claim 1 from the main text becomes, in this case,

(30)

with the replica potential evaluated from

(31)

and

(32)

The constants appearing in (31) and (32) are defined recursively from one layer to the next (note that, due to the central limit theorem, each of them can be evaluated from the one at the previous layer by a Gaussian integral). Moreover,

(33)

for the intermediate layers, and

(34)

where

(35)

and the measures over which expectations are computed are

(36)

We typically pick the likelihoods so that the inner integrals can be computed in closed form, which allows for a number of activation functions (linear, probit, ReLU, etc.). However, our analysis is quite general and can be carried out for arbitrary likelihoods, as long as evaluating (33) and (34) is computationally feasible.

Finally, the replica potential above can be generalized to the orthogonally-invariant case using the framework of [kabashima_inference_2008], which we have described in subsection 1.2

(37)

If the matrix is Gaussian i.i.d., the distribution of the eigenvalues of $W^\top W$ is Marchenko-Pastur and the corresponding terms simplify, so that (31) is recovered. Moreover, in the single-layer case one obtains the replica free energy proposed by [kabashima_inference_2008, shinzato_perceptron_2008, shinzato_learning_2009].

1.4.1 Recovering the formulation in terms of conditional entropies

One can rewrite the formulas above in a simpler way. By manipulating the measures (36) one obtains

(38)

for the intermediate layers. Introducing a standard normal variable and using the invariance of mutual information under reparametrization, this can be written as

(39)

Similarly

(40)

for the remaining layers. Introducing a standard normal variable,

(41)

where

(42)

and