Predicting the outputs of finite networks trained with noisy gradients

04/02/2020
by   Gadi Naveh, et al.

A recent line of studies has focused on the infinite width limit of deep neural networks (DNNs) where, under a certain deterministic training protocol, the DNN outputs are related to a Gaussian Process (GP) known as the Neural Tangent Kernel (NTK). However, finite-width DNNs differ from GPs quantitatively and for CNNs the difference may be qualitative. Here we present a DNN training protocol involving noise whose outcome is mappable to a certain non-Gaussian stochastic process. An analytical framework is then introduced to analyze this resulting non-Gaussian process, whose deviation from a GP is controlled by the finite width. Our work extends upon previous relations between DNNs and GPs in several ways: (a) In the infinite width limit, it establishes a mapping between DNNs and a GP different from the NTK. (b) It allows computing analytically the general form of the finite width correction (FWC) for DNNs with arbitrary activation functions and depth and further provides insight on the magnitude and implications of these FWCs. (c) It appears capable of providing better performance than the corresponding GP in the case of CNNs. We are able to predict the outputs of empirical finite networks with high accuracy, improving upon the accuracy of GP predictions by over an order of magnitude. Overall, we provide a framework that offers both an analytical handle and a more faithful model of real-world settings than previous studies in this avenue of research.


1 Introduction

Deep neural networks (DNNs) have been rapidly advancing the state-of-the-art in machine learning. Their rise to prominence was largely results-driven, with little theoretical support or guarantee. Indeed, their success even defied prevalent notions about over-fitting and over-parameterization (Zhang et al., 2016) and the hardness of high-dimensional non-convex optimization (Choromanska et al., 2015).

While the theory of DNNs is still far behind the application forefront, several exact results were recently obtained in the highly over-parameterized regime ($N \to \infty$, where $N$ controls the over-parameterization) (Daniely et al., 2016; Jacot et al., 2018), where the role played by any specific DNN weight is small. This facilitated the derivation of various bounds (Allen-Zhu et al., 2018; Cao and Gu, 2019, 2019) on generalization for shallow networks and, more relevant for this work, an exact correspondence with Gaussian Processes (GPs) known as the Neural Tangent Kernel (NTK) result (Jacot et al., 2018). The latter holds when highly over-parameterized DNNs are trained in a specific manner involving no stochasticity.

The NTK result has provided the first example of a DNN to GP correspondence valid after end-to-end DNN training. This important theoretical advancement allowed one to reason about DNNs using a more developed theoretical framework, that of inference in GPs (Rasmussen and Williams, 2005). For instance, it provided a quantitative account for how fully connected DNNs, trained in this manner, generalize (Cohen et al., 2019; Rahaman et al., 2018) and train (Jacot et al., 2018; Basri et al., 2019). Roughly speaking, highly over-parameterized DNNs generalize because they have a strong implicit bias to simple functions and train well because a variety of useful functions can be reached by changing the weights in an arbitrarily small amount from their initialization values.

Despite its novelty and importance, the NTK correspondence seems to suffer from a few drawbacks: (a) Its deterministic training protocol is qualitatively different from the stochastic ones used in practice; this, combined with the need to use vanishing learning rates, may increase the tendency of such DNNs to settle at poorer-performing regions of the loss landscape (Keskar et al., 2016). (b) In its range of validity, it seems to under-perform, often by a large margin, convolutional neural networks (CNNs) trained using standard SGD. (c) While precise in the highly over-parameterized regime, extending it to a theory with predictive power at finite $N$ is challenging (Dyer and Gur-Ari, 2020). It is thus desirable to have other correspondences between end-to-end trained DNNs and probabilistic inference models which may have other merits compared to the NTK.

In this work we prove a simple correspondence between DNNs and Stochastic Processes (SPs) which, at $N \to \infty$, tend to GPs. These SPs are those of a DNN with random weights drawn from an iid Gaussian distribution with variances determined by the parameters of the training protocol rather than by the DNN’s initialization. At $N \to \infty$, these are known as Neural Network Gaussian Processes (NNGPs), while at finite $N$ they become generic SPs. In that spirit we call ours the NNSP correspondence. Our proof follows straightforwardly from assuming ergodic training dynamics and recasting the resulting equilibrium distribution of the weights into that of the DNN’s outputs.

We provide an analytical framework for analyzing the resulting inference problem on these NNSPs and use it to predict the outputs of trained finite-width fully connected DNNs and CNNs. The accuracy with which we can predict the empirical DNNs’ outputs serves as a strong verification of our aforementioned ergodicity assumption. We also provide explicit expressions, which can be seen as $1/N$-corrections of the Equivalent Kernel (EK) result from the theory of GPs (Rasmussen and Williams, 2005), for the large-dataset behavior of the trained DNN.

We further provide a mechanism that can explain why CNNs trained on tasks where weight sharing is beneficial, e.g. image processing, and in the regime of the NNSP correspondence, perform worse at larger width. NNGPs associated with average pooling CNNs are oblivious to the presence or absence of weight sharing across each layer, but NNSPs associated with finite CNNs are different between these two cases, where weight sharing yields enhanced performance.

The NNSP correspondence provides a rich analytical and numerical framework for exploring the theory of deep learning, unique in its ability to incorporate finite over-parameterization, stochasticity, and depth. Looking ahead, this physics-style framework will provide a lab setting where one can quantitatively reason about more realistic DNNs, develop an effective language to describe them, and perform analytical “lab” tests and refinements on new theories and algorithms.

2 Related work

The idea of leveraging the time dynamics of the gradient descent algorithm for approximating Bayesian inference has been considered in various works (Welling and Teh, 2011; Mandt et al., 2017; Teh et al., 2016; Maddox et al., 2019; Ye et al., 2017) with many practical tools developed. However, a correspondence with a concrete SP or a non-parametric model was not established, nor was a comparison made of the DNN’s outputs with analytical predictions.

Finite width corrections have been studied recently by several authors. In Ref. (Mei and Montanari, 2019) a random feature regression model was analyzed analytically where random features are generated by a single fully-connected DNN layer. Various predictions as a function of the width and the number of samples were analytically obtained and tested against such DNNs. Our work differs in several aspects: (a) We use a more realistic training protocol; in particular we train the entire DNN, not just the top layer. (b) Being approximately rather than exactly solvable, our formulation is more flexible and applies, with trivial modification, to large DNNs of any depth as well as CNNs without pooling.

Finite width corrections were also studied very recently in the context of the NTK correspondence in Ref. (Dyer and Gur-Ari, 2020). Field-theory tools were used to predict the scaling behavior with $N$ of various quantities. In particular, taking as given the empirical (and weakly random) NTK kernel at initialization, the authors obtained a finite-$N$ correction to the linear integral equation governing the evolution of the predictions on the training set. Our work differs in several aspects: (a) We derive relatively simple formulae for the outputs which become entirely explicit for large training sets (i.e. no matrix inversion or diagonalization needed). (b) We take into account all sources of finite-$N$ corrections, whereas finite-$N$ NTK randomness remained an empirical source of corrections in Ref. (Dyer and Gur-Ari, 2020). (c) We describe a different correspondence with qualitatively different behavior. (d) Our formalism differs considerably: its statistical mechanical nature enables one to import various standard tools for treating randomness (replicas), ergodicity breaking (replica symmetry breaking), and taking into account non-perturbative effects (mean-field, diagrammatic re-summations). (e) We have no smoothness limitation on our activation functions and provide FWCs on a generic data point and not just on the training set.

During the preparation of this work, a manuscript appeared (Yaida, 2019) studying Bayesian inference with weakly non-Gaussian priors. The focus was on using the renormalization group to study the prior induced by deep finite-$N$ DNNs. Unlike here, little emphasis was placed on establishing a correspondence with trained DNNs. The formulation presented here has the conceptual advantage of representing a distribution over function space for arbitrary training and test data, rather than over specific draws of data sets. This is useful for studying the large-dataset behavior of learning curves, where analytical insights into generalization can be gained (Cohen et al., 2019). Lastly, we further find expressions for the 4th cumulant for ReLU activation for 4 randomly chosen points.

3 The NNSP correspondence

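Consider a DNN trained with full-batch Gradient Descent while injecting white Gaussian noise to its gradients and including a weight decay term, so that the discrete-time dynamics of each of the network weights read

$$\Delta w_t := w_{t+1} - w_t = -\left(\gamma w_t + \nabla_w \mathcal{L}(z_w)\right)\eta + \sqrt{2T\eta}\,\xi_t \qquad (1)$$

where $w_t$ are the weights at time step $t$, $\gamma$ is the strength of the weight decay, $\mathcal{L}(z_w)$ is the loss as a function of the output $z_w$, $T$ is the temperature (the magnitude of noise), $\eta$ is the step size and $\xi_t$ is a standard Gaussian random variable. In the limit $\eta \to 0$ these discrete-time dynamics converge to the continuous-time Langevin equation given by $\dot{w}(t) = -\nabla_w\left(\frac{\gamma}{2}\|w(t)\|^2 + \mathcal{L}(z_w)\right) + \sqrt{2T}\,\xi(t)$ with $\langle \xi_i(t)\,\xi_j(t')\rangle = \delta_{ij}\,\delta(t-t')$, so that the equilibrium distribution is (Risken and Frank, 1996)

$$P(w) \propto \exp\left(-\frac{1}{T}\left(\frac{\gamma}{2}\|w\|^2 + \mathcal{L}(z_w)\right)\right) = \exp\left(-\frac{1}{2\sigma_w^2}\|w\|^2 - \frac{1}{2\sigma^2}\mathcal{L}(z_w)\right) \qquad (2)$$

thus we identify $\sigma_w^2 = T/\gamma$ and $\sigma^2 = T/2$. This training protocol resembles SGLD (Welling and Teh, 2011) with two differences: we include a weight decay term that will later scale with $N$, and we use a constant step size rather than a decaying one as in SGLD. One can also derive finite step size corrections to $P(w)$, as suggested by (Mannella, 2004).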
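As a rough illustration of this training protocol, the sketch below runs noisy gradient descent with weight decay on a toy problem. It is not the authors' code: it assumes the standard discretized Langevin update `w <- w - eta * (gamma * w + grad_loss(w)) + sqrt(2 * T * eta) * xi`, and the names `noisy_gd`, `gamma`, `T` and `eta` are illustrative choices. With the loss switched off, only weight decay and noise act, and the weights should equilibrate to a Gaussian with variance close to `T / gamma` for small step size.

```python
import numpy as np

def noisy_gd(grad_loss, w0, gamma, T, eta, n_steps, rng):
    """Full-batch gradient descent with weight decay and injected Gaussian noise.

    Assumed update (discretized Langevin dynamics):
        w <- w - eta * (gamma * w + grad_loss(w)) + sqrt(2 * T * eta) * xi
    Returns the full trajectory of the weights.
    """
    w = np.array(w0, dtype=float)
    trace = np.empty((n_steps,) + w.shape)
    for t in range(n_steps):
        xi = rng.standard_normal(w.shape)  # standard Gaussian noise
        w = w - eta * (gamma * w + grad_loss(w)) + np.sqrt(2.0 * T * eta) * xi
        trace[t] = w
    return trace

rng = np.random.default_rng(0)
gamma, T, eta = 2.0, 0.5, 0.01  # illustrative values, not taken from the paper

# Toy case: zero loss, so each of the 500 scalar "weights" is an independent chain.
zero_grad = lambda w: np.zeros_like(w)
trace = noisy_gd(zero_grad, np.zeros(500), gamma, T, eta, n_steps=4000, rng=rng)

# Discard the transient, then compare the empirical variance with T / gamma.
empirical_var = trace[1000:].var()
print(empirical_var, T / gamma)
```

The small residual mismatch between the two printed numbers is a finite step size effect, which shrinks as `eta` is reduced.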

One can recast the above expression in a more Gaussian-Process like manner by going to function space. Namely, we consider the distribution of implied by the above where for concreteness we consider a DNN with a single scalar output . Denoting by the induced measure on function space we formally write

(3)

where denotes an integral over all weights and we denote by a delta-function in function-space. As common in path-integrals or field-theory formalism, such a delta function is understood as a limit procedure where one chooses a suitable basis for function space, trims it to a finite subset, treats as a product of regular delta-functions, and at the end of the computation takes the size of the subset to infinity.

To proceed we further re-write Eq. 3 as where . The integration over weights now receives a clear meaning: it is the distribution over functions induced by such a DNN with random weights chosen according to the ”prior” (), so that we can relate any correlation function in function space and weight space, for instance

(4)

Conveniently, for highly over-parameterized DNNs the above r.h.s. equals the kernel of the NNGP associated with this DNN (). Moreover becomes Gaussian and can be written as

(5)

Combining Eqs. 3, 5, choosing the MSE loss, and taking $N \to \infty$, one finds that the training-time-averaged outputs of the DNN are given by the predictions of a Gaussian Process with measurement noise equal to $\sigma^2 = T/2$ and a kernel given by the NNGP of that DNN.

We refer to the above expressions for and describing the distribution of outputs of a DNN trained according to our protocol – the NNSP correspondence. Unlike the NTK correspondence, the kernel which appears here is different and no additional initialization dependent terms appear (as should be the case since we assumed ergodicity). Furthermore, given knowledge of at finite , one can predict the DNN’s outputs at finite . Henceforth, we refer to as the prior distribution, as it is the prior distribution of a DNN with random weights drawn from .

The main assumption underlying our derivation is that of ergodicity. The motivation for assuming this is the observation (Dauphin et al., 2014) that in the large-$N$ limit, in particular for , it is unlikely to find local minima (see also (Draxler et al., 2018)), only saddle points. Since our training is noisy, such saddle points cannot cause the above dynamics to stall. In a related manner, optimizing the train loss can be seen as an attempt to find a solution to constraints using far more variables (roughly where is the number of layers) and so the dimension of the solution manifold is very large and likely to percolate throughout weight space. Indeed (Jacot et al., 2018) have shown that wide DNNs can fit the training data while changing their weights only infinitesimally. From a different angle, in a statistical mechanical description of satisfiability problems, one typically expects ergodic behavior when the ratio of the number of variables to the number of constraints becomes much larger than one (Gardner and Derrida, 1988). Our numerical results below further validate these qualitative arguments.

4 Inference on the resulting NNSP

Having mapped the time-averaged outputs of a DNN to inference on the above NNSP, we turn to analyze the predictions of this NNSP in the case where $N$ is large but finite, such that the NNSP is only weakly non-Gaussian.

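The main result of this section is to derive leading FWCs to the standard GP results for the posterior mean $\bar{f}_{\mathrm{GP}}(x_*)$ and variance $\Sigma_{\mathrm{GP}}(x_*)$ on an unseen test point $x_*$ (Rasmussen and Williams, 2005)

$$\bar{f}_{\mathrm{GP}}(x_*) = \sum_{\mu,\nu} K^*_\mu \tilde{K}^{-1}_{\mu\nu} y_\nu, \qquad \Sigma_{\mathrm{GP}}(x_*) = K^{**} - \sum_{\mu,\nu} K^*_\mu \tilde{K}^{-1}_{\mu\nu} K^*_\nu \qquad (6)$$

where $y_\nu$ are the training targets, $\sigma^2$ is the measurement noise, and we define

$$\tilde{K}_{\mu\nu} := K(x_\mu, x_\nu) + \sigma^2 \delta_{\mu\nu}, \qquad K^*_\mu := K(x_*, x_\mu), \qquad K^{**} := K(x_*, x_*) \qquad (7)$$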
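For orientation, the standard GP posterior mean and variance (Rasmussen and Williams, 2005) that these corrections build on can be computed in a few lines. The sketch below is a minimal illustration and not the authors' implementation: the function name `gp_posterior` is hypothetical, and a plain linear kernel stands in for the NNGP kernel purely to keep the example self-contained.

```python
import numpy as np

def gp_posterior(K_train, K_star, K_star_star, y, sigma2):
    """Textbook GP regression posterior at a single test point.

    K_train     : (n, n) kernel matrix on the training inputs
    K_star      : (n,)   kernel between the test point and each training input
    K_star_star : float  kernel of the test point with itself
    y           : (n,)   training targets
    sigma2      : float  measurement-noise variance
    """
    K_tilde = K_train + sigma2 * np.eye(len(y))   # noise-regularized kernel matrix
    mean = K_star @ np.linalg.solve(K_tilde, y)
    var = K_star_star - K_star @ np.linalg.solve(K_tilde, K_star)
    return mean, var

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                        # noiseless linear target
x_star = rng.standard_normal(d)

# Placeholder linear kernel K(x, x') = x . x' (an assumption for this example only).
mean, var = gp_posterior(X @ X.T, X @ x_star, x_star @ x_star, y, sigma2=1e-6)
print(mean, x_star @ w_true, var)
```

With a linear kernel and a linear target, the posterior mean reduces to ridge regression, so at small noise it should recover `x_star @ w_true` with a posterior variance near zero.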

4.1 Edgeworth series and perturbation theory

Our first task is to find how the distribution over outputs changes compared to the Gaussian ($N \to \infty$) scenario. As the data-dependent part of this distribution is independent of the DNNs, this amounts to obtaining corrections to the prior. One way to characterize this is through cumulants. This is especially convenient here since one can show that for all DNNs with a fully-connected layer on top, all odd cumulants are zero and that the $2r$-th cumulant scales as $1/N^{r-1}$. Consequently, at large $N$ we can characterize the prior up to $O(1/N^2)$ by its second and fourth cumulants. Hence we use an Edgeworth series (see e.g. (McCullagh, 2017)) to obtain the form of the prior from its cumulants (see App. A), the final result being

(8)

The GP action is given by

(9)

and the first FWC action is given by

(10)

where is the 4th functional Hermite polynomial

(11)

using the shorthand notations: and with . is the 4th order functional cumulant, which depends on the choice of the activation function

(12)

where and the pre-activations are . Here we distinguished between the scaled and non-scaled weight variances: . The integers in indicate the number of terms of this form (with all possible index permutations). Note that Hermite polynomials form an orthogonal set under the Gaussian integration measure and thus preserve the normalization of the distribution. Our notation for the integration measure means e.g. . In App. B, we carry out these integrals, yielding the leading FWC to the posterior mean and variance on a test point

(13)

with and

(14)

where all repeating indices are implicitly summed over the training set, denoting: , and defining the ”discrepancy operator”:

(15)

where (with no hat) is the usual Kronecker delta, and where runs over the training set and the test point .

Note that our procedure for generating network outputs involves averaging over the training dynamics after reaching equilibrium (when the train loss levels off) and also over seeds of random numbers so that we effectively have an ensemble of networks (see App. E). This reduces the noise and allows for a reliable comparison with our FWC theory. In principle, one could use the network outputs at the end of training without this averaging (as common in practice), in which case there will be fluctuations that will scale with . Following this, one finds that the expected MSE test loss after training saturates is

(16)

where is the size of the test set. Thus, is a measure of how much we can decrease the test loss by averaging.
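The $1/N$ scaling of the fourth cumulant that underlies this expansion can be checked numerically on a toy prior. The sketch below is an illustration under stated assumptions, not the paper's setup: it takes a 2-layer ReLU network with no biases evaluated at a single unit-norm input, unit-variance first-layer weights (so the pre-activations are standard normal and are sampled directly), and readout weights of variance `1/N`. The function name `output_fourth_cumulant` is hypothetical. For this toy case the output is Gaussian conditioned on the hidden layer, which gives a fourth cumulant of `3 * Var(relu(z)**2) / N = 3.75 / N`.

```python
import numpy as np

def output_fourth_cumulant(N, n_samples, rng):
    """Monte Carlo estimate of the 4th cumulant of f = sum_i a_i * relu(z_i).

    z_i ~ N(0, 1) are pre-activations at one unit-norm input (assumed setup),
    a_i ~ N(0, 1/N) are readout weights; each row is an independent random network.
    """
    z = rng.standard_normal((n_samples, N))
    a = rng.standard_normal((n_samples, N)) / np.sqrt(N)
    f = np.sum(a * np.maximum(z, 0.0), axis=1)
    m2 = np.mean(f**2)
    m4 = np.mean(f**4)
    return m4 - 3.0 * m2**2   # 4th cumulant of a zero-mean variable

rng = np.random.default_rng(0)
k4 = {N: output_fourth_cumulant(N, 200_000, rng) for N in (4, 16)}
ratio = k4[4] / k4[16]        # close to 16 / 4 = 4 if the cumulant scales as 1/N
print(k4, ratio)
```

Quadrupling the width should therefore reduce the estimated cumulant by roughly a factor of four, up to Monte Carlo noise.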

4.2 Large data sets: Corrections to the Equivalent Kernel

The expressions in Eq. 14 for the FWC are explicit, but only up to a potentially large matrix inversion. These matrices also have a random component related to the largely arbitrary choice of the particular training points used to characterize the function or concept being learned. An insightful tool, used in the context of GPs, which solves both of these issues is the Equivalent Kernel (EK) (Rasmussen and Williams, 2005). The EK approximates the GP predictions at large , after averaging on all draws of (roughly) training points representing the target function being learned. Even if one is interested in a particular dataset, due to a self-averaging property, the EK results capture the behavior of a specific dataset up to corrections. Here we develop an extension of the EK results for the NNSPs we find at large . In particular, we find the leading non-linear correction to the EK result.

To this end, we consider the average predictions of an NNSP trained on an ensemble of data sets of size , corresponding to independent draws from a distribution over all possible inputs . Following (Malzahn and Opper, 2001; Cohen et al., 2019), we further enrich this ensemble by choosing

randomly from a Poisson distribution with mean

. By a straightforward application of the tools introduced in Ref. (Cohen et al., 2019) (see App. H) we find that the average predictions, to leading order in () are

(17)

where the continuum discrepancy operator acts as

(18)

and an integral is implicit for every product with repeated coordinates. Evidently, the continuum discrepancy operator plays an important role here. Acting on some function, most notably , it yields a function equal to the discrepancy in predicting using a GP based on . The resulting function would thus be large if the GP defined by does a poor job at approximating based on data points.

The above expression is valid for any weakly non-Gaussian process, including ones related to CNNs (where $N$ corresponds to the number of channels). It can also be systematically extended to lower values of by taking into account higher terms in , as in Ref. (Cohen et al., 2019). Despite its generality, several universal statements can still be made. At $N \to \infty$, we obtain a standard result known as the Equivalent Kernel (EK). It shows that the predictions of a Gaussian process at large capture well features of that have support on eigenvalues of larger than . It is basically a high pass linear filter of where features of associated with eigenvalues of that are smaller than are filtered out. We stress that these eigenvalues and eigenfunctions are independent of any particular size dataset but rather are a property of the average dataset. In particular, no computationally costly data dependent matrix inversion is needed to evaluate Eq. 17.
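The filtering behavior of the EK can be made concrete with the textbook form of the result (Rasmussen and Williams, 2005): in the kernel's eigenbasis, each component of the target is multiplied by `lam / (lam + sigma2 / n)`. The sketch below is only an illustration of that standard formula. The symbols `lam` (kernel eigenvalue), `sigma2` (noise variance), `n` (number of training points) and the function name `ek_filter` are assumptions of this example, as are the numerical values.

```python
import numpy as np

def ek_filter(eigenvalues, sigma2, n):
    """Per-eigenmode attenuation of the target in the Equivalent Kernel approximation."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam / (lam + sigma2 / n)

# Illustrative spectrum and target: equal weight on every eigenfunction.
lam = np.array([1.0, 1e-1, 1e-2, 1e-3, 1e-4])
g_coeffs = np.ones_like(lam)

# With sigma2 / n = 1e-3, modes with lam >> 1e-3 pass almost unchanged,
# the mode at lam = 1e-3 is halved, and smaller ones are suppressed.
f_coeffs = ek_filter(lam, sigma2=0.1, n=100) * g_coeffs
print(f_coeffs)
```

Because only the spectrum of the kernel under the input distribution enters, no data-dependent matrix inversion appears anywhere in this computation.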

Turning to our FWC results, they depend on only via the continuum discrepancy operator . Thus these FWCs would be inversely proportional to the performance of the DNN at $N \to \infty$. In particular, perfect performance at $N \to \infty$ implies no FWC. Second, the DNN’s average predictions act as a linear transformation on the target function combined with a cubic non-linearity. Third, for having support only on some finite set of eigenfunctions of , would scale as at very large . Thus the above cubic term would lose its explicit dependence on . In addition, some decreasing behavior with is expected due to the factor which can be viewed as the discrepancy in predicting , at fixed , based on random samples (’s) of .

More detailed statements require one to commit to a specific data set and DNN architecture. First we consider fully-connected DNNs with quadratic or ReLU activation and a uniform with normalized to the hyper-sphere at dimension . As discussed by (Cohen et al., 2019), the eigenfunctions of here are hyperspherical harmonics (Avery, 2010) with eigenvalues which depend only on and scale as . This follows directly from the symmetry where in any orthogonal transformation of the inputs. For , by virtue of the large gaps in the spectrum, most choices of would imply the existence of a threshold angular momentum such that and . As a result, the associated GP would nearly perfectly predict all components of with and project out all the rest.

Furthermore, the rotational symmetry of implies that it can be expanded as a power series in the dot product . It was further shown in Ref. (Cohen et al., 2019), that trimming this expansion at order while compensating by an increase of , provides an excellent approximation for with an error that scales as , since for typical and . Thus the NNGP kernel of a fully-connected DNN can be approximated by very few effective parameters which are these power series coefficients. Examining , it is also symmetric under a joint orthogonal transformation of all and can be expanded in powers of . While several of the resulting terms, such as , are relatively easy to handle analytically (in the sense of carrying out the integration in Eq. 17), others, like are more difficult. The study of their effect is left for future work. A qualitative discussion on the effect of in CNNs trained on images, is given in Sec. 5.3.

4.3 Fourth cumulant for ReLU activation function

The ’s appearing in our FWC results can be derived for several activation functions, and in our numerical experiments we use a quadratic activation and ReLU. Here we give the result for ReLU, which is similar for any other threshold power law activation (see derivation in App. C), and give the result for quadratic activation in App. D. For simplicity, in this section we focus on the case of a 2-layer fully connected network with no biases, input dimension and neurons in the hidden layer, such that is the activation at the th hidden unit with input sampled with a uniform measure from , where is a vector of weights of the first layer. This can be generalized to the more realistic settings of deeper nets and un-normalized inputs, where in the former the linear kernel is replaced by the kernel of the layer preceding the output, and the latter amounts to introducing some scaling factors.

For $N \to \infty$, (Cho and Saul, 2009) give a closed form expression for the kernel which corresponds to the GP. Here we find corresponding to the leading FWC by first finding the fourth moment of the hidden layer (see Eq. 12), taking for simplicity

(19)

where above corresponds to the matrix inverse of the matrix with elements which is the kernel of the previous layer (the linear kernel in the 2-layer case) evaluated on two random points. In App. C we follow the derivation in (Moran, 1948), which yields (with a slight modification noted therein) the following series in the off-diagonal elements of the matrix

(20)

where the coefficients are

(21)

For ReLU activation, these ’s read

(22)

and similar expressions can be derived for other threshold power-law activations of the form . The series Eq. 20 is expected to converge for sufficiently large input dimension since the overlap between random normalized inputs scales as and consequently for two random points from the data sets. However, when we sum over we also have terms with repeating indices and so ’s are equal to . The above Taylor expansion diverges whenever the matrix has eigenvalues larger than . Notably this divergence does not reflect a true divergence of , but rather the failure of representing it using the above expansion. Therefore at large , one can opt to neglect elements of with repeating indices, since there are much fewer of these. Alternatively this can be dealt with by a re-parameterization of the ’s leading to a similar but slightly more involved Taylor series.
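As a point of reference for the Gaussian limit mentioned in this section, the closed-form ReLU kernel of Cho and Saul (2009) can be checked against a direct Monte Carlo average over random first-layer weights. The sketch below is an illustration, not the paper's code: it assumes standard-normal weights `w` and the normalization `E[relu(w.x) relu(w.y)]`, which equals one half of Cho and Saul's degree-1 arc-cosine kernel; the paper's own weight-variance conventions may differ by constant factors. The function name `relu_arccos_kernel` is hypothetical.

```python
import numpy as np

def relu_arccos_kernel(x, y):
    """E_w[relu(w.x) * relu(w.y)] for w ~ N(0, I), in closed form.

    Equals half of the degree-1 arc-cosine kernel of Cho and Saul (2009).
    """
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)   # guard against round-off
    theta = np.arccos(cos_t)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

rng = np.random.default_rng(0)
d = 5
x = rng.standard_normal(d); x /= np.linalg.norm(x)   # two random unit-norm inputs
y = rng.standard_normal(d); y /= np.linalg.norm(y)

# Monte Carlo estimate of the same expectation over random hidden units.
W = rng.standard_normal((200_000, d))
mc = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))
exact = relu_arccos_kernel(x, y)
print(mc, exact)
```

The fourth-moment analogue needed for the FWC involves four inputs at once and has no equally compact closed form, which is why the text resorts to a series expansion.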

5 Numerical experiments

In this section we numerically test our analytical results. We first demonstrate that in the limit $N \to \infty$ the outputs of fully connected DNNs trained in the regime of the NNSP correspondence converge to a GP with a known kernel, and that the MSE between them scales as $1/N^2$, which is the scaling of the leading FWC squared. Second, we show that introducing the leading FWC term further reduces this MSE by more than an order of magnitude. Third, we study the generalization gap between CNNs and their NNGPs.

5.1 Fully connected DNNs on synthetic data

We consider training a fully connected network on a quadratic target where the ’s are sampled with a uniform measure from the hyper-sphere with and the matrix elements are sampled as and fixed for all ’s. We use a noise level of , training points and a learning rate of (in App. F we show results for other learning rates, demonstrating convergence). Notice that for any activation , scales linearly with , thus in order to keep constant as we vary we need to scale the weight decay of the last layer as . This is done in order to keep the prior distribution in accord with the typical values of the target as varies, so that the comparison is fair.
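A data set of this kind is straightforward to generate. The sketch below is a hedged illustration: the function names, the dimension `d`, the number of points `n`, the sphere radius and the Gaussian distribution of the matrix `A` are all placeholder assumptions, since the specific values are not recoverable from the text above. Uniform points on the hypersphere are obtained by normalizing Gaussian vectors.

```python
import numpy as np

def sample_sphere(n, d, radius, rng):
    """Draw n points uniformly from the sphere of the given radius in d dimensions."""
    x = rng.standard_normal((n, d))
    return radius * x / np.linalg.norm(x, axis=1, keepdims=True)

def quadratic_target(X, A):
    """Evaluate g(x) = x^T A x for every row x of X."""
    return np.einsum("ni,ij,nj->n", X, A, X)

rng = np.random.default_rng(0)
d, n = 10, 100                       # hypothetical sizes
A = rng.standard_normal((d, d))      # hypothetical choice; drawn once and kept fixed
X = sample_sphere(n, d, radius=1.0, rng=rng)
y = quadratic_target(X, A)
print(X.shape, y.shape)
```

Keeping `A` fixed across all inputs is what makes `y` a deterministic function of `x`, so the only randomness in the regression task comes from the choice of training points.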

5.2 Comparison with NNSP output predictions

In Fig. 1 we highlight some aspects of the training dynamics (panels (A-C) are for ). Panel (A) shows the MSE losses normalized by vs. normalized time . Our settings are such that there are not enough training points to fully learn the target, hence the large gap between training and test loss. Otherwise, the convergence of the network output to the NNGP as $N$ grows (shown in Fig. 2) would be less impressive, since all reasonable estimators would be close to the target and hence close to each other. Indeed, panel (C) shows that the time-averaged outputs (after reaching equilibrium) are much closer to the GP prediction than to the ground truth . Panel (B) shows the averaged auto-correlation functions (ACFs) of the outputs (averaged over test points) and of the first and second layer weights resp. (each averaged over weights). Panel (D) shows vs. width $N$ where is the auto-correlation time (ACT). We see that they all decrease with and that is always significantly smaller than for all , demonstrating that there are no non-ergodicity issues, at least for ergodicity in the mean, and the faster convergence to equilibrium of the outputs relative to the weights.
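Auto-correlation functions and times of this kind can be estimated from any recorded time series of an output or a weight. The sketch below shows one common convention, the integrated auto-correlation time truncated at the first non-positive ACF value; the paper's exact definition is not stated here, so this choice and the function names are assumptions. The estimator is exercised on a synthetic AR(1) series whose ACF is known to be `rho**k`.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Normalized auto-correlation function of a 1-D time series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x * x)
    return np.array([np.mean(x[: len(x) - k] * x[k:]) / var for k in range(max_lag + 1)])

def integrated_act(acf):
    """Integrated auto-correlation time, summing the ACF up to its first non-positive value."""
    nonpos = np.flatnonzero(acf <= 0)
    cut = nonpos[0] if nonpos.size else len(acf)
    return 1.0 + 2.0 * np.sum(acf[1:cut])

# Synthetic check: AR(1) process x_t = rho * x_{t-1} + noise, with ACF rho**k
# and integrated auto-correlation time (1 + rho) / (1 - rho).
rng = np.random.default_rng(0)
rho, n_steps = 0.9, 200_000
noise = rng.standard_normal(n_steps)
x = np.empty(n_steps)
x[0] = noise[0]
for t in range(1, n_steps):
    x[t] = rho * x[t - 1] + noise[t]

acf = autocorrelation(x, max_lag=200)
tau = integrated_act(acf)
print(acf[1], tau, (1 + rho) / (1 - rho))
```

In a training run, the same two functions would be applied to the post-equilibration portion of each recorded trajectory, and the resulting times compared against the total averaging window.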

Next, in Fig. 2 we plot in log-log scale (with base ) the MSE (normalized by ) between the predictions of the network and the corresponding GP and FWC predictions for quadratic and ReLU activations. We find that indeed for sufficiently large widths () the slope of the GP-DNN MSE approaches $-2$ (for both ReLU and quadratic), which is expected from our theory, since the leading FWC scales as $1/N$. For smaller widths, higher order terms (in $1/N$) in the Edgeworth series Eq. 8 come into play. For quadratic activation, we find that our FWC result reduces the MSE by more than an order of magnitude relative to the GP theory. Further, we recognize a regime where the GP and FWC MSEs intersect at around , below which our FWC actually increases the MSE, which suggests a scale of how large $N$ needs to be for our first order FWC theory to hold.


Figure 1: Training dynamics and auto-correlation functions (ACFs) for a ReLU network with quadratic target, (panels A-C) and . (A) Normalized loss vs. normalized time: note the large generalization gap, due to small . The MSE loss is normalized by and the normalized time is . (B) ACFs of the time series of the 1st and 2nd layer weights, and of the outputs. (C) Network outputs on test points vs. normalized time: note the fluctuations around the time average (dashed line) which is much closer to the GP prediction (dotted line) than to the ground truth (dashed-dotted line). (D) Auto-correlation times (ACTs) of the 1st and 2nd layer weights, and of the outputs: , resp, (vertical axis is in log scale).

Figure 2: Fully connected 2-layer network trained on a regression task. Relative MSE between the network outputs and the labels (triangles), GP predictions (dots), and FWC predictions Eq. 13 (x’s). Shown vs. width on a base log-log scale for quadratic (blue) and ReLU (red) activations. Averaged across seeds and test points with and . For sufficiently large widths () the slope of the GP-DNN MSE approaches and the FWC-DNN MSE is further improved by more than an order of magnitude.

5.3 Performance gap between CNNs and their NNGP or NTK

Several authors have shown that the performance of SGD-trained CNNs surpasses that of the corresponding GPs, be it NTK (Arora et al., 2019) or NNGP (Novak et al., 2018). One notable margin, of about 15% accuracy on CIFAR10, was shown numerically in (Novak et al., 2018) for the case of CNNs with average pooling. It was further pointed out there, that the NNGPs associated with average pooling CNNs, coincide with those of the corresponding Locally Connected Networks (LCNs), the latter being CNNs without weight sharing across each layer. Furthermore, they found the performance of SGD-trained LCNs to be on par with that of their NNGPs.

Since one expects of an LCN to be different from that of a CNN, it should be that higher cumulants of , which come into play at finite $N$, would be different for LCNs and CNNs. In App. G we show that appearing in our FWCs already differentiates between CNNs and LCNs. Common practice in the field strongly suggests that CNNs generate a better prior on the space of images than LCNs. As a result we expect to see a performance which decreases with $N$ when training a large CNN in our setting. This is in contrast to SGD behavior reported in some works where the CNN performance seems to saturate as a function of $N$, to some value better than the NNGP (Novak et al., 2018; Neyshabur et al., 2018). Notably those works used maximum over architecture scans, high learning rates, and early stopping, all of which are absent from our training protocol.

To test the above conjecture we trained, according to our protocol, a CNN with six convolutional layers and two fully connected layers on CIFAR10 with two settings: one with training points and test points and the other with train points and test points. We used MSE loss with a one-hot encoding into a dimensional vector of the categorical label. Further details on the architecture, training, averaging, and error estimation are given in App. F. By comparing different initialization seeds and learning rates, we verified that training was, down to statistical accuracy, ergodic and in the limit of vanishing learning rate (, was sufficient). Results on the larger train-set are shown in Fig. 3. The error bars on the green curves mainly reflect the noise involved in estimating the expected MSE loss using our finite test set. We note that: (a) The CNN can outperform the NNGP by in terms of MSE loss. In particular with channels our accuracy was while the NNGP yielded (both should be taken with finite-test-set uncertainty). (b) As the number of channels grows, the CNN predictions slowly approach those of the NNGP. (c) Judging by the slow convergence to the GP, at channels, the CNN is far away from the perturbative limit where our FWCs dominate the discrepancy. Nonetheless the NNGP approximation matches the CNNs’ outputs fairly well, with accuracy equal to about that of the MSE with the target. We further comment that for channels we used a layer-dependent weight-decay between and , learning-rate of , and the variance of the white noise on the gradients was (prior to being multiplied by ). In addition we tested the classification accuracy using the same training set but for the full CIFAR10 test set. For we obtained accuracy where, as before, we did not use any data-augmentation, dropout, or pooling.
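The ingredients named above (weight decay, a small learning rate, and white noise added to the gradient before multiplication by the learning rate) can be illustrated with a minimal sketch. This is not the authors' training code: the function name `noisy_gd_step`, its arguments, the toy quadratic loss, and all hyperparameter values are assumptions chosen only for illustration, and the values elided in the text above are not reproduced.

```python
import numpy as np

def noisy_gd_step(w, grad_loss, lr, weight_decay, noise_var, rng):
    """One full-batch update with weight decay and white noise on the gradient."""
    # White noise added to the gradient; `noise_var` is its variance
    # before the whole bracket is multiplied by the learning rate.
    xi = rng.normal(0.0, np.sqrt(noise_var), size=w.shape)
    return w - lr * (grad_loss(w) + weight_decay * w + xi)

# Toy usage (hypothetical): quadratic loss 0.5 * ||w - 1||^2, arbitrary values.
rng = np.random.default_rng(0)
w = np.zeros(3)
samples = []
for step in range(20_000):
    w = noisy_gd_step(w, lambda v: v - 1.0, lr=1e-2, weight_decay=0.1,
                      noise_var=1.0, rng=rng)
    if step >= 5_000:          # discard burn-in before averaging
        samples.append(w.copy())
mean_w = np.mean(samples, axis=0)  # time average over the noisy trajectory
```

In this toy problem the time-averaged weights settle near the minimizer of the loss plus the weight-decay term, while individual iterates keep fluctuating around it; this is the sense in which predictions are averaged over the noisy dynamics rather than read off a single set of weights.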

Turning to the smaller train-set experiment, Fig. 4, here we see again that the CNN outperforms its GP when the number of channels is finite, and approaches its GP as the number of channels increases. We note that a similar yet more pronounced trend in performance appears here also when one considers the averaged MSE loss rather than the MSE loss of the average outputs.
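The distinction between the averaged MSE loss and the MSE loss of the average outputs can be made concrete with a short sketch. The array layout (one row per sampled network, one column per test point) and the function names are assumptions made for this illustration, and the data are synthetic.

```python
import numpy as np

def mse_of_mean(outputs, targets):
    """MSE of the output averaged over sampled networks (rows of `outputs`)."""
    return np.mean((outputs.mean(axis=0) - targets) ** 2)

def mean_of_mse(outputs, targets):
    """MSE computed per sampled network, then averaged over the samples."""
    return np.mean((outputs - targets) ** 2)

# Synthetic example: 200 sampled networks, 50 test points.
rng = np.random.default_rng(0)
targets = rng.normal(size=50)
outputs = targets + 0.3 + 0.5 * rng.normal(size=(200, 50))

gap = mean_of_mse(outputs, targets) - mse_of_mean(outputs, targets)
spread = outputs.var(axis=0).mean()  # variance over samples, averaged over test points
```

The two quantities differ exactly by the variance of the outputs across samples (`gap` equals `spread` up to rounding), so the averaged MSE loss is never smaller than the MSE loss of the average outputs.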


Figure 3: CNNs trained on CIFAR10 in the regime of the NNSP correspondence compared with NNGPs, using a larger training set. MSE test loss normalized by target variance of a deep CNN (solid green) and its associated NNGP (dashed green) along with the MSE between the NNGP’s predictions and CNN outputs normalized by the NNGP’s MSE test loss (solid blue, and on a different scale). We used balanced training and test sets of size each. As argued, the performance should deteriorate at large as the NNSP associated with the CNN approaches an NNGP. See further results on CNNs in App. I.

Figure 4: CNNs trained on CIFAR10 in the regime of the NNSP correspondence compared with NNGPs, using a smaller training set. The meaning of the curves is the same as in Fig. 3. Notice that the axis is in log scale and so too is the axis for the blue curve. We used balanced training and test sets of sizes and , respectively. For the largest number of channels we reached, the slope of the discrepancy between the CNN’s GP and the trained DNN on the log-log scale was , placing us close to the perturbative regime where a slope of is expected. Error bars here reflect statistical errors related only to output averaging and not due to the random choice of a test-set.
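A slope on a log-log scale, as quoted in the captions of Figs. 2 and 4, can be estimated with a linear fit in log space. The sketch below uses synthetic data with an arbitrary exponent; the helper name `loglog_slope` and all numbers are assumptions for illustration and are not the values measured in the experiments.

```python
import numpy as np

def loglog_slope(widths, errors):
    """Least-squares slope of log(errors) against log(widths)."""
    slope, _intercept = np.polyfit(np.log(widths), np.log(errors), 1)
    return slope

# Synthetic power law: error = 3 * width^(-1.5), with an arbitrary exponent.
widths = np.array([64.0, 128.0, 256.0, 512.0])
errors = 3.0 * widths ** -1.5
fitted = loglog_slope(widths, errors)  # recovers the exponent of the power law
```

In practice one would restrict the fit to the largest widths, since a perturbative expansion in the inverse width is only expected to fix the slope asymptotically.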

6 Discussion and future work

In this work we presented a correspondence between DNNs trained at small learning rates, with weight-decay, and with noisy gradients and inference on a certain non-parametric-model/stochastic-process (the NNSP). We provided analytical expressions, involving dataset-size matrix inversion, predicting the test outputs of the underlying DNN at large but finite width, . In the limit of a large number of data points, , explicit analytical expressions for the DNNs’ outputs were given, involving no difficult matrix inversions. Our results were tested empirically for two fully connected networks with power-law and ReLU activations. Turning to CNNs without pooling, we argued that, unlike in many recent works, performance should in fact decrease with as the CNN tends to behave as its NNGP. This is because FWCs reflect the weight-sharing property of CNNs which is ignored at the level of the NNGP.

There are a variety of future directions to study. It would be interesting to explore whether the performance discrepancy between CNNs and their NNGPs can be fully explained with our perturbative approach in or whether non-perturbative effects are needed. Similarly it would be interesting to make more explicit the effect of at large , especially how it augments the NNGP prior of CNNs to represent weight sharing. Along these lines we comment that our formalism fits nicely into that of (Cohen et al., 2019) for predicting learning-curves. Dynamical effects can also be analyzed using the fluctuation-dissipation theorem and can provide estimates on how fast specific features are learned (see e.g. (Bordelon et al., 2020; Rahaman et al., 2018)). Since training at vanishing learning rate is costly, it would be interesting to explore finite learning rate corrections (see e.g. (Lewkowycz et al., 2020)) or alternatively find ways to augment the dataset such that learning rates could be increased without going out of the regime of the NNSP correspondence. Future studies can explore the effects of replacing the white noise in our dynamics with colored noise, characteristic of SGD. Naively, one might imagine that when sufficiently small, these two sources of noise would both generate similar ergodic dynamics exploring the nearly zero (small ) train-loss manifold. As long as the limit is stable (Cohen et al., 2019), the difference between small colored and white noise may prove irrelevant. In conclusion, the NNSP correspondence, extended in the above directions, would provide a versatile analytical lab for studying the theory of deep learning.

References

  • Z. Allen-Zhu, Y. Li, and Y. Liang (2018) Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. arXiv e-prints, pp. arXiv:1811.04918. External Links: 1811.04918 Cited by: §1.
  • S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019) On Exact Computation with an Infinitely Wide Neural Net. arXiv e-prints, pp. arXiv:1904.11955. External Links: 1904.11955 Cited by: §5.3.
  • J. S. Avery (2010) Harmonic polynomials, hyperspherical harmonics, and atomic spectra. Journal of Computational and Applied Mathematics 233 (6), pp. 1366 – 1379. Note: Special Functions, Information Theory, and Mathematical Physics. Special issue dedicated to Professor Jesus Sanchez Dehesa on the occasion of his 60th birthday External Links: Document, ISSN 0377-0427, Link Cited by: §4.2.
  • R. Basri, D. Jacobs, Y. Kasten, and S. Kritchman (2019) The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies. arXiv e-prints, pp. arXiv:1906.00425. External Links: 1906.00425 Cited by: §1.
  • B. Bordelon, A. Canatar, and C. Pehlevan (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. External Links: 2002.02561 Cited by: §6.
  • Y. Cao and Q. Gu (2019) Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks. arXiv e-prints, pp. arXiv:1902.01384. External Links: 1902.01384 Cited by: §1.
  • Y. Cao and Q. Gu (2019) Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks. arXiv e-prints, pp. arXiv:1905.13210. External Links: 1905.13210 Cited by: §1.
  • Y. Cho and L. K. Saul (2009) Kernel methods for deep learning. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS’09, USA, pp. 342–350. External Links: ISBN 978-1-61567-911-9, Link Cited by: Appendix A, Appendix D, §4.3.
  • A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun (2015) The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pp. 192–204. Cited by: §1.
  • O. Cohen, O. Malka, and Z. Ringel (2019) Learning Curves for Deep Neural Networks: A Gaussian Field Theory Perspective. arXiv e-prints, pp. arXiv:1906.05301. External Links: 1906.05301 Cited by: Appendix H, §1, §2, §4.2, §4.2, §4.2, §4.2, §6.
  • A. Daniely, R. Frostig, and Y. Singer (2016) Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity. ArXiv e-prints. External Links: 1602.05897 Cited by: §1.
  • Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2933–2941. External Links: Link Cited by: §3.
  • F. Draxler, K. Veschgini, M. Salmhofer, and F. A. Hamprecht (2018) Essentially No Barriers in Neural Network Energy Landscape. arXiv e-prints, pp. arXiv:1803.00885. External Links: 1803.00885 Cited by: §3.
  • E. Dyer and G. Gur-Ari (2020) Asymptotics of wide networks from Feynman diagrams. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
  • E. Gardner and B. Derrida (1988) Optimal storage properties of neural network models. Journal of Physics A Mathematical General 21, pp. 271–284. External Links: Document Cited by: §3.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv e-prints, pp. arXiv:1806.07572. External Links: 1806.07572 Cited by: §1, §1, §3.
  • N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §1.
  • A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari (2020) The large learning rate phase of deep learning: the catapult mechanism. External Links: 2003.02218 Cited by: §6.
  • W. Maddox, T. Garipov, P. Izmailov, D. Vetrov, and A. G. Wilson (2019) A Simple Baseline for Bayesian Uncertainty in Deep Learning. arXiv e-prints, pp. arXiv:1902.02476. External Links: 1902.02476 Cited by: §2.
  • D. Malzahn and M. Opper (2001) A variational approach to learning curves. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, Cambridge, MA, USA, pp. 463–469. External Links: Link Cited by: §4.2.
  • S. Mandt, M. D. Hoffman, and D. M. Blei (2017) Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv e-prints, pp. arXiv:1704.04289. External Links: 1704.04289 Cited by: §2.
  • R. Mannella (2004) Quasisymplectic integrators for stochastic differential equations. Physical Review E 69 (4), pp. 041107. Cited by: §3.
  • P. Mccullagh (2017) Tensor methods in statistics. Dover Books on Mathematics. Cited by: Appendix A, Appendix A, §4.1.
  • S. Mei and A. Montanari (2019) The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv e-prints, pp. arXiv:1908.05355. External Links: 1908.05355 Cited by: §2.
  • P. A. P. Moran (1948) Rank Correlation and Product-Moment Correlation. Biometrika 35 (1), pp. 203–206. Cited by: Appendix C, §4.3.
  • B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro (2018) Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. arXiv e-prints, pp. arXiv:1805.12076. External Links: 1805.12076 Cited by: §5.3.
  • R. Novak, L. Xiao, J. Lee, Y. Bahri, G. Yang, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2018) Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes. arXiv e-prints, pp. arXiv:1810.05148. External Links: 1810.05148 Cited by: §F.2, §5.3, §5.3.
  • N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville (2018) On the Spectral Bias of Neural Networks. arXiv e-prints, pp. arXiv:1806.08734. External Links: 1806.08734 Cited by: §1, §6.
  • C. E. Rasmussen and C. K. I. Williams (2005) Gaussian processes for machine learning (adaptive computation and machine learning). The MIT Press. External Links: ISBN 026218253X Cited by: §B.1, §F.2, §1, §1, §4.2, §4.
  • H. Risken and T. Frank (1996) The Fokker-Planck equation: methods of solution and applications. Springer Series in Synergetics, Springer Berlin Heidelberg. External Links: ISBN 9783540615309, LCCN 96033182, Link Cited by: §3.
  • Y. W. Teh, A. H. Thiery, and S. J. Vollmer (2016) Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17 (1), pp. 193–225. External Links: ISSN 1532-4435 Cited by: §2.
  • M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, USA, pp. 681–688. External Links: ISBN 978-1-4503-0619-5, Link Cited by: §2, §3.
  • S. Yaida (2019) Non-Gaussian processes and neural networks at finite widths. arXiv. External Links: arXiv:1910.00019v1 Cited by: §2.
  • N. Ye, Z. Zhu, and R. K. Mantiuk (2017) Langevin Dynamics with Continuous Tempering for Training Deep Neural Networks. ArXiv e-prints. External Links: 1703.04379 Cited by: §2.
  • A. Zee (2003) Quantum Field Theory in a Nutshell. Nutshell handbook, Princeton Univ. Press, Princeton, NJ. External Links: Link Cited by: §B.1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv e-prints, pp. arXiv:1611.03530. External Links: 1611.03530 Cited by: §1.

Appendix A Edgeworth series

The Central Limit Theorem (CLT) tells us that the distribution of a sum of independent RVs will tend to a Gaussian as . Its relevance for wide fully-connected DNNs (or CNNs with many channels) comes from the fact that every pre-activation averages over uncorrelated random variables, thereby generating a Gaussian distribution at large (Cho and Saul, 2009), augmented by higher order cumulants which decay as , where is the order of the cumulant. When higher order cumulants are small, an Edgeworth series (see e.g. (Mccullagh, 2017)) is a useful practical tool for obtaining the probability distribution from these cumulants. Having the probability distribution and interpreting its logarithm as our action places us closer to standard field-theory formalism.
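The decay of the non-Gaussian part of the prior with width can be checked numerically. The following sketch samples many random 2-layer ReLU networks at a fixed input and estimates the excess kurtosis (the normalized fourth cumulant) of the output. The Gaussian weight distributions, the unit-norm input, and the function name are assumptions made for this illustration rather than the setup used in the experiments.

```python
import numpy as np

def excess_kurtosis_of_output(width, n_nets=50_000, d=3, seed=0):
    """Excess kurtosis of f(x) = sum_i a_i * relu(w_i . x) over random networks."""
    rng = np.random.default_rng(seed)
    x = np.ones(d) / np.sqrt(d)                    # fixed unit-norm input
    W = rng.normal(size=(n_nets, width, d))        # hidden-layer weights
    a = rng.normal(scale=1.0 / np.sqrt(width),     # readout weights, variance 1/width
                   size=(n_nets, width))
    h = np.maximum(W @ x, 0.0)                     # ReLU activations, (n_nets, width)
    f = np.sum(a * h, axis=1)                      # one output per sampled network
    f = f - f.mean()
    return np.mean(f ** 4) / np.mean(f ** 2) ** 2 - 3.0

# The estimate shrinks as the width grows, roughly in inverse proportion to it.
kurt = {n: excess_kurtosis_of_output(n) for n in (4, 16, 64)}
```

A Gaussian output would give an excess kurtosis of zero, so the values in `kurt` measure how far the finite-width prior at this input is from its Gaussian limit.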

For simplicity we focus on a 2-layer network, but the derivation generalizes straightforwardly to networks of any depth. We are interested in the finite corrections to the prior distribution , i.e. the distribution of the DNN output , with and . Because has zero mean and a variance that scales as , all odd cumulants are zero and the ’th cumulant scales as . This holds true for any DNN having a fully-connected last layer with variance scaling as . The derivation of the multivariate Edgeworth series can be found in e.g. (Mccullagh, 2017), and our case is similar, except that instead of a vector-valued RV we have the functional RV , so the cumulants become "functional tensors", i.e. multivariate functions of the input . Thus, the leading FWC to the prior is

(A.1)

where is as in the main text Eq. 9 and the 4th Hermite functional tensor is

(A.2)

This is the functional analogue of the fourth Hermite polynomial: $H_4(x) = x^4 - 6x^2 + 3$, which appears in the scalar Edgeworth series expanded about a standard Gaussian.
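For orientation, the scalar version of this expansion can be written out explicitly. The block below is the standard textbook form for a random variable with zero mean, unit variance and vanishing odd cumulants, truncated after the fourth cumulant; the symbols $\varphi$ and $\kappa_4$ are notation introduced here and are not taken from the main text.

```latex
% Leading-order scalar Edgeworth expansion about a standard Gaussian,
% for a zero-mean, unit-variance RV whose odd cumulants vanish.
\begin{align}
  p(x) &\approx \varphi(x)\left[1 + \frac{\kappa_4}{4!}\, H_4(x)\right],
  \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \\
  H_4(x) &= x^4 - 6x^2 + 3 .
\end{align}
```

Since $\int \varphi(x) H_4(x)\,dx = 0$, the correction leaves the normalization unchanged to this order, and it vanishes as $\kappa_4 \to 0$, which recovers the Gaussian limit.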

Appendix B First order correction to posterior mean and variance

B.1 Posterior mean

The posterior mean with the leading FWC action is given by

(B.1)

where

(B.2)

where the implies that we only treat the first order Taylor expansion of , and where are as in the main text Eqs. 9, 10. The general strategy is to bring the path integral to the front, so that we will get just correlation functions w.r.t. the Gaussian theory (including the data term ), namely the well-known results (Rasmussen and Williams, 2005) for and , and then finally perform the integrals over input space. Expanding both the numerator and the denominator of Eq. B.1, the leading finite width correction for the posterior mean reads

(B.3)

This, as standard in field theory, amounts to omitting all terms corresponding to bubble diagrams, namely we keep only terms with a factor of and ignore terms with a factor of , since these will cancel out. This is a standard result in perturbative field theory (see e.g. (Zee, 2003)).
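The Gaussian-theory results referred to above are the standard GP regression formulas for the posterior mean and variance (Rasmussen and Williams, 2005). A minimal numerical sketch follows; the function name, argument names and toy numbers are assumptions for illustration, with `noise_var` playing the role of the observation-noise variance.

```python
import numpy as np

def gp_posterior(K_train, k_star, k_star_star, y, noise_var):
    """Standard GP regression posterior mean and variance at one test point.

    K_train:     (n, n) kernel matrix on the training inputs
    k_star:      (n,)   kernel between the test input and the training inputs
    k_star_star: scalar kernel of the test input with itself
    y:           (n,)   training targets
    """
    A = K_train + noise_var * np.eye(len(y))
    mean = k_star @ np.linalg.solve(A, y)
    var = k_star_star - k_star @ np.linalg.solve(A, k_star)
    return mean, var

# Toy check: one training point, test point at the same input.
mean, var = gp_posterior(np.array([[1.0]]), np.array([1.0]), 1.0,
                         np.array([1.0]), noise_var=0.5)
```

Both quantities require solving a linear system of the size of the training set, which is the dataset-size matrix inversion mentioned in the discussion; the finite width corrections are then built from these Gaussian correlation functions.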

We now write down the contributions of the quartic, quadratic and constant terms in :

  1. For the quartic term in , we have

    (B.4)

    We dub these terms by and to be referenced shortly. We mention here that they are the source of the linear and cubic terms in the target appearing in Eq. 14 in the main text.

  2. For the quadratic term in , we have

    (B.5)

    We note in passing that these cancel out exactly together with similar but opposite sign terms/diagrams in the quartic contribution, which is a reflection of measure invariance. This is elaborated on in Sect. B.3.

  3. For the constant terms in , we will be left only with bubble diagram terms which will cancel out in the leading order of .

B.2 Posterior variance

The posterior variance is given by

(B.6)

Following similar steps as for the posterior mean, the leading finite width correction for the posterior second moment at reads

(B.7)

As for the posterior mean, the constant terms in cancel out and the contributions of the quartic and quadratic terms are

(B.8)

and

(B.9)

B.3 Measure invariance of the result

The expressions derived above may seem formidable, since they contain many terms and involve integrals over input space which seemingly depend on the measure . Here we show how they may in fact be simplified to the compact expressions in the main text Eq. 14 which involve only discrete sums over the training set and no integrals, and are thus manifestly measure-invariant.

For simplicity, we show here the derivation for the FWC of the mean , and a similar derivation can be done for . In the following, we carry out the integrals, by plugging in the expressions from Eq. 6 and coupling them to . As in the main text, we use the Einstein summation notation, i.e. repeated indices are summed over the training set. The contribution of the quadratic terms is