
Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

02/16/2019
by   Song Mei, et al.

We consider learning two-layer neural networks using stochastic gradient descent. The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions in R^D (where D is the number of parameters associated with each neuron). This evolution can be defined through a partial differential equation or, equivalently, as the gradient flow in the Wasserstein space of probability distributions. Earlier work shows that (under some regularity assumptions) the mean-field description is accurate as soon as the number of hidden units is much larger than the dimension D. In this paper we establish stronger and more general approximation guarantees. First of all, we show that the number of hidden units only needs to be larger than a quantity dependent on the regularity properties of the data, and independent of the dimension. Next, we generalize this analysis to the case of unbounded activation functions, which was not covered by earlier bounds. We extend our results to noisy stochastic gradient descent. Finally, we show that kernel ridge regression can be recovered as a special limit of the mean-field analysis.


1 Introduction

Multi-layer neural networks, and in particular multi-layer perceptrons, present a number of remarkable features: they are effectively trained using stochastic gradient descent (SGD) [LBBH98]; their behavior is fairly insensitive to the number of hidden units or to the input dimension [SHK14]; and their number of parameters is often larger than the number of samples.

In this paper we consider simple neural networks with one layer of hidden units:

$$\hat f_N(\boldsymbol{x};\boldsymbol{\theta}) \;=\; \frac{1}{N}\sum_{i=1}^{N}\sigma_*(\boldsymbol{x};\boldsymbol{\theta}_i)\,. \qquad (1)$$

Here $\boldsymbol{x}\in\mathbb{R}^d$ is a feature vector, $\boldsymbol{\theta}=(\boldsymbol{\theta}_1,\dots,\boldsymbol{\theta}_N)$ comprises the network parameters, $\boldsymbol{\theta}_i\in\mathbb{R}^D$, and $\sigma_*:\mathbb{R}^d\times\mathbb{R}^D\to\mathbb{R}$ is an activation function. The most classical example is $\sigma_*(\boldsymbol{x};\boldsymbol{\theta}_i)=a_i\,\sigma(\langle\boldsymbol{w}_i,\boldsymbol{x}\rangle)$, where $\sigma:\mathbb{R}\to\mathbb{R}$ is a scalar function (and of course $\boldsymbol{\theta}_i=(a_i,\boldsymbol{w}_i)$), but our theory covers a broader set of examples. We assume to be given data $(y_j,\boldsymbol{x}_j)_{j\ge 1}\sim_{iid}\mathbb{P}$, with $\mathbb{P}$ a probability distribution over $\mathbb{R}\times\mathbb{R}^d$, and attempt to minimize the square-loss risk:

$$R_N(\boldsymbol{\theta}) \;=\; \mathbb{E}\Big\{\big[y-\hat f_N(\boldsymbol{x};\boldsymbol{\theta})\big]^2\Big\}\,. \qquad (2)$$

The risk function $R_N(\boldsymbol{\theta})$ can be understood either as the population risk or the empirical risk, depending on whether $\mathbb{P}$ is viewed as a population distribution or assumed to be supported on $n$ data points. If $R_N$ is understood as the population risk, we can rewrite

$$R_N(\boldsymbol{\theta}) \;=\; R_* + \mathbb{E}\Big\{\big[f(\boldsymbol{x})-\hat f_N(\boldsymbol{x};\boldsymbol{\theta})\big]^2\Big\}\,, \qquad (3)$$

where $f(\boldsymbol{x})\equiv\mathbb{E}[y\,|\,\boldsymbol{x}]$ and $R_*\equiv\mathbb{E}\{[y-f(\boldsymbol{x})]^2\}$ is the Bayes error.
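To make the notation concrete, here is a minimal numpy sketch of the finite-$N$ model (1) and the square-loss risk (2); the choice $\sigma=\tanh$ and the helper names are illustrative only and are not taken from the paper.

```python
import numpy as np

def f_hat(x, a, W, sigma=np.tanh):
    # two-layer network (1): f_N(x; theta) = (1/N) * sum_i a_i * sigma(<w_i, x>)
    return np.mean(a * sigma(W @ x))

def risk(a, W, X, y, sigma=np.tanh):
    # empirical counterpart of the square-loss risk (2): E[(y - f_N(x; theta))^2]
    preds = np.array([f_hat(x_i, a, W, sigma) for x_i in X])
    return np.mean((y - preds) ** 2)
```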

Classical theory of universal approximation provides useful insights into the way two-layers networks capture arbitrary input-output relations [Cyb89, Bar93]. In particular, Barron’s theorem [Bar93] guarantees

$$\inf_{\boldsymbol{\theta}}\,\mathbb{E}\Big\{\big[f(\boldsymbol{x})-\hat f_N(\boldsymbol{x};\boldsymbol{\theta})\big]^2\Big\} \;\le\; \frac{(2\,r\,C_f)^2}{N}\,,\qquad C_f\equiv\int_{\mathbb{R}^d}\|\boldsymbol{\omega}\|_2\,|\hat f(\boldsymbol{\omega})|\,\mathrm{d}\boldsymbol{\omega}\,, \qquad (4)$$

where $\hat f(\boldsymbol{\omega})$ is the Fourier transform of $f$, and $r$ is the supremum of $\|\boldsymbol{x}\|_2$ over the support of $\mathbb{P}$. This result is remarkable in that the minimum number of neurons needed to achieve a certain accuracy depends only on intrinsic regularity properties of $f$ and not on the dimension $d$. The proof of this and similar results shows that it is more insightful to think of the representation (1) in terms of the empirical distribution of the neurons, $\hat\rho^{(N)}\equiv\frac{1}{N}\sum_{i=1}^N\delta_{\boldsymbol{\theta}_i}$. With a slight abuse of notation, we have $\hat f_N(\boldsymbol{x};\boldsymbol{\theta})=\hat f(\boldsymbol{x};\hat\rho^{(N)})$, where, for a general distribution $\rho$, we define

$$\hat f(\boldsymbol{x};\rho) \;\equiv\; \int \sigma_*(\boldsymbol{x};\boldsymbol{\theta})\,\rho(\mathrm{d}\boldsymbol{\theta})\,. \qquad (5)$$

The universal approximation property is then related to the fact that an arbitrary distribution $\rho$ can be approximated by one supported on $N$ points (of course, here we are hiding some important technical issues).

Approximation theory provides some insight into the peculiar properties of neural networks. Small population risk is achieved by many networks, since what matters is the distribution $\hat\rho^{(N)}$, not the specific parameters $\boldsymbol{\theta}_1,\dots,\boldsymbol{\theta}_N$. The behavior is insensitive to the number of neurons $N$, as long as this is large enough for $\hat\rho^{(N)}$ to approximate the target distribution $\rho$. Finally, the bound (4) is dimension-free.
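A two-line check of this "distributional" viewpoint: the prediction depends on the parameters only through their empirical measure, so permuting the neurons changes nothing. The concrete choice $\sigma_*(\boldsymbol{x};\boldsymbol{w})=\tanh(\langle\boldsymbol{w},\boldsymbol{x}\rangle)$ below is again only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 100
x, W = rng.normal(size=d), rng.normal(size=(N, d))

def f_hat(x, W):
    # f(x; rho_hat) for rho_hat = (1/N) sum_i delta_{w_i}, with sigma_*(x; w) = tanh(<w, x>)
    return np.mean(np.tanh(W @ x))

# permuting the neurons leaves the empirical measure, hence the prediction, unchanged
assert np.isclose(f_hat(x, W), f_hat(x, W[rng.permutation(N)]))
```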

Of course these insights concern ideal representations, and not necessarily the networks generated by SGD. Recently, an analysis of the SGD dynamics has been developed that connects naturally to the theory of universal approximation [MMN18, SS18, RVE18, CB18b]. The main object of study is the empirical distribution $\hat\rho^{(N)}_k$ of the neurons after $k$ SGD steps. For large $N$, small step size $\varepsilon$, and setting $t=k\varepsilon$, $\hat\rho^{(N)}_k$ turns out to be well approximated by a probability distribution $\rho_t$. The latter evolves according to the following partial differential equation:

$$\partial_t\rho_t \;=\; 2\xi(t)\,\nabla_{\boldsymbol{\theta}}\!\cdot\!\big(\rho_t\,\nabla_{\boldsymbol{\theta}}\Psi(\boldsymbol{\theta};\rho_t)\big)\,,\qquad \Psi(\boldsymbol{\theta};\rho)\;\equiv\;V(\boldsymbol{\theta})+\int U(\boldsymbol{\theta},\tilde{\boldsymbol{\theta}})\,\rho(\mathrm{d}\tilde{\boldsymbol{\theta}})\,, \qquad \mathrm{(DD)}\;(6)$$

with $V(\boldsymbol{\theta})\equiv-\mathbb{E}\{y\,\sigma_*(\boldsymbol{x};\boldsymbol{\theta})\}$ and $U(\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)\equiv\mathbb{E}\{\sigma_*(\boldsymbol{x};\boldsymbol{\theta}_1)\,\sigma_*(\boldsymbol{x};\boldsymbol{\theta}_2)\}$.

(Here $\xi(t)$ is a function that gauges the evolution of the step size and will be defined below; in fact, there is little loss in the following discussion in setting $\xi(t)\equiv 1$.) We will refer to this as the mean field description, or distributional dynamics. This description has the advantage of being explicitly independent of the number of hidden units, and hence accounts for one of the empirical findings described above (the insensitivity to the number of neurons). Further, it allows us to focus on some key elements of the dynamics (global convergence, typical behavior) while neglecting others (local minima, statistical noise).
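A particle-level reading of (DD), standard for Wasserstein gradient flows of this type, may help intuition; the constants below simply match the normalization of Eq. (6) above and should be read as a sketch:

$$\dot{\bar{\boldsymbol{\theta}}}_t \;=\; -2\,\xi(t)\,\nabla_{\boldsymbol{\theta}}\Psi\big(\bar{\boldsymbol{\theta}}_t;\rho_t\big)\,,\qquad \rho_t\;=\;\operatorname{Law}\big(\bar{\boldsymbol{\theta}}_t\big)\,,\qquad \bar{\boldsymbol{\theta}}_0\sim\rho_0\,.$$

In words, each neuron performs gradient descent in a potential that depends on the other neurons only through their current law; the coupling arguments discussed below compare the SGD iterates to independent copies of such a process.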

Several papers have used this approach over the last year to analyze learning in two-layers networks; this line of work is succinctly reviewed in Section 2.

Of course, a crucial question needs to be answered for this approach to be meaningful: in what regime is the distributional dynamics a good approximation to SGD? Quantitative approximation guarantees were established in [MMN18], under certain regularity conditions on the data distribution $\mathbb{P}$, and for bounded activation functions. Under these conditions, and for bounded time horizons, [MMN18] proves that the solution of the distributional dynamics approximates well the actual empirical distribution $\hat\rho^{(N)}_k$, when the number of neurons $N$ is much larger than the problem dimension $D$.

The results of [MMN18] present several limitations, which we overcome in the present paper. We briefly summarize our contributions.

Dimension-free approximation.

As mentioned above, both classical approximation theory and the mean-field analysis of SGD approximate a certain target distribution by the empirical distribution of the network parameters. However, while the approximation bound (4) is dimension-free, the approximation guarantees of [MMN18] are explicitly dimension-dependent. Even for very smooth functions $f$ and well-behaved data distributions, the results of [MMN18] require the number of neurons to grow with the dimension $D$.

Here we prove a new bound that is dimension-independent and therefore more natural. The proof follows a coupling argument which is different from, and more powerful than, the one of [MMN18]. A key improvement consists in isolating different error terms, and in developing a more delicate concentration-of-measure argument which controls the dependence of the error on the dimension $D$.

Let us emphasize that capturing the correct dimension-dependence is an important test of the mean-field theory, and it is crucial in order to compare neural networks to other learning techniques (see Section 4).

Unbounded activations.

The approximation guarantee of [MMN18] only applies to activation functions that are bounded. This excludes the important case of unbounded second-layer coefficients $a_i$, as in Eq. (1). We extend our analysis to that case; this requires developing an a priori bound on the growth of the coefficients $a_i$ along the dynamics. As in the previous point, our approximation guarantee is dimension-free.

Noisy SGD.

Finally, in some cases it is useful to inject noise into SGD. From a practical perspective, this can help avoid local minima. From an analytical perspective, it corresponds to a modified PDE, which contains an additional Laplacian (diffusion) term. This PDE has smoother solutions that are supported everywhere and converge globally to a unique fixed point [MMN18].

In this setting, we prove a dimension-free approximation guarantee for the case of bounded activations. We also obtain a guarantee for noisy SGD with unbounded activations, but the latter is not dimension-free.

Kernel limit.

We analyze the PDE (DD) in a specific short-time limit and show that it is well approximated by a linearized dynamics. This dynamics can be thought of as fitting a kernel ridge regression model with respect to a kernel corresponding to the initial weight distribution $\rho_0$. ('Kernel ridge regression' and 'kernel regression' are used with somewhat different meanings in the literature: kernel ridge regression uses global information and can be defined as ridge regression in a reproducing kernel Hilbert space (RKHS), while kernel regression uses local averages; see Remark H.1 for a definition.) We thus recover, from a different viewpoint, a connection with kernel methods that has been investigated in several recent papers [JGH18, DZPS18, DLL18, AZLS18]. Beyond the short time scale, the dynamics is analogous to a kernel boosting dynamics with a time-varying, data-dependent kernel (a point that already appears in [RVE18]).

Mean-field theory has made it possible to prove global convergence guarantees for SGD in two-layers neural networks [MMN18, CB18b]. Unfortunately, these results do not (in general) provide useful bounds on the network size $N$. We believe that the results in this paper are a required step in that direction.

The rest of this paper is organized as follows. The next section overviews related work, focusing in particular on the distributional dynamics (DD), its variants and applications. In Section 3 we present formal statements of our results. Section 4 develops the connection with kernel methods. Proofs are mostly deferred to the appendices.

2 Related work

As mentioned above, classical approximation theory already uses (either implicitly or explicitly) the idea of lifting the class of $N$-neuron neural networks, cf. Eq. (1), to the infinite-dimensional space (5) parametrized by probability distributions $\rho$; see, e.g., [Cyb89, Bar93, Bar98, AB09]. This idea was also exploited algorithmically, e.g. in [BRV06, NS17].

Only very recently was (stochastic) gradient descent proved to converge (for a large enough number of neurons) to the infinite-dimensional evolution (DD) [MMN18, RVE18, SS18, CB18b]. In particular, [MMN18] proves quantitative bounds for approximating SGD by the mean-field dynamics. Our work is mainly motivated by the objective of obtaining a better scaling with the dimension and of allowing for unbounded second-layer coefficients.

The mean-field description was exploited in several papers to establish global convergence results. In [MMN18], global convergence was proved in special examples, and in a general setting for noisy SGD. The papers [RVE18, CB18b] studied global convergence by exploiting the homogeneity properties of Eq. (1). In particular, [CB18b] proves a general global convergence result: for initial conditions with full support, the PDE (DD) converges to a global minimum, provided the activations are homogeneous in the parameters. Notice that the presence of unbounded second-layer coefficients is crucial in order to achieve homogeneity. Unfortunately, the results of [CB18b] do not provide quantitative approximation bounds relating the PDE (DD) to finite-$N$ SGD. The present paper fills this gap by establishing approximation bounds that apply to the setting of [CB18b].

A different optimization algorithm was studied in [WLLM18] using the mean-field description. The algorithm resamples a positive fraction of the neurons uniformly at random at a constant rate. This allows the authors to establish a global convergence result (under certain assumed smoothness properties of the PDE solution). Again, this paper does not provide quantitative bounds on the difference between the PDE and finite-$N$ SGD. While our theorems do not cover the algorithm of [WLLM18], we believe that their algorithm could be analyzed using the approach developed here. Exponentially fast convergence to a global optimum was proven in [JMM19] for certain radial-basis-function networks, using again the mean-field approach. While the setting of [JMM19] is somewhat different (weights are constrained to a convex compact domain), the technique presented here could be applicable to that problem as well.

Finally, a recent stream of works [JGH18, GJS19, DZPS18, DLL18, AZLS18] argues that, in a certain training regime, two-layers networks are actually performing a type of kernel ridge regression. As shown in [CB18a], this phenomenon is not limited to neural networks, but is generic for a broad class of models. As expected, the kernel regime can indeed be recovered as a special limit of the mean-field dynamics (DD), cf. Section 4. Let us emphasize that here we focus on the population rather than the empirical risk.

A discussion of the difference between the kernel and mean-field regimes was recently presented in [DL19]. However, [DL19] argues that the difference between kernel and mean-field behaviors is due to different initializations of the coefficients $a_i$. We show instead that, for a suitable scaling of the initialization, the kernel and mean-field regimes appear at different time scales: the kernel behavior arises at the beginning of the dynamics, while mean-field behavior characterizes longer time scales. It is also worth mentioning that the connection between the mean-field dynamics and kernel boosting with a time-varying, data-dependent kernel was already present (somewhat implicitly) in [RVE18].

3 Dimension-free mean field approximation

3.1 General results

As mentioned above, we assume to be given data $(y_k,\boldsymbol{x}_k)_{k\ge 1}\sim_{iid}\mathbb{P}$, and we run SGD with step size $s_k$:

$$\boldsymbol{\theta}^{k+1}_i \;=\; \boldsymbol{\theta}^{k}_i + 2s_k\,\big(y_{k+1}-\hat f_N(\boldsymbol{x}_{k+1};\boldsymbol{\theta}^k)\big)\,\nabla_{\boldsymbol{\theta}_i}\sigma_*(\boldsymbol{x}_{k+1};\boldsymbol{\theta}^k_i)\,. \qquad \mathrm{(SGD)}$$

We will work under a one-pass model, that is, each data point is visited once.
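A minimal one-pass SGD loop in the fixed-coefficient setting; the Gaussian input model, the toy target, and all constants are illustrative choices (not from the paper), and the update mirrors (SGD) as written above, which carries no $1/N$ factor in front of the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eps, n_steps = 20, 1000, 0.05, 20000
W = rng.normal(size=(N, d)) / np.sqrt(d)     # w_i^0 ~ rho_0 (illustrative choice)
a = np.ones(N)                               # fixed second-layer coefficients

f_star = lambda x: np.tanh(2.0 * x[0])       # toy target, not from the paper

for k in range(n_steps):
    x = rng.normal(size=d)                   # one fresh sample per step (one-pass model)
    y = f_star(x)
    pre = W @ x                              # <w_i, x> for all neurons
    err = y - np.mean(a * np.tanh(pre))      # y - f_N(x; theta^k)
    # w_i^{k+1} = w_i^k + 2*eps*err*grad_{w_i} sigma_*(x; theta_i^k), with sigma = tanh
    W += 2 * eps * err * (a * (1 - np.tanh(pre) ** 2))[:, None] * x[None, :]
```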

We also consider a noisy version of SGD, with a regularization term:

$$\boldsymbol{\theta}^{k+1}_i \;=\; (1-2\lambda s_k)\,\boldsymbol{\theta}^{k}_i + 2s_k\,\big(y_{k+1}-\hat f_N(\boldsymbol{x}_{k+1};\boldsymbol{\theta}^k)\big)\,\nabla_{\boldsymbol{\theta}_i}\sigma_*(\boldsymbol{x}_{k+1};\boldsymbol{\theta}^k_i) + \sqrt{2s_k/\beta}\;\boldsymbol{g}^k_i\,, \qquad \mathrm{(noisy\text{-}SGD)}$$

where $\boldsymbol{g}^k_i\sim_{iid}\mathsf{N}(0,\mathbf{I}_D)$. The noiseless version is recovered by setting $1/\beta=0$ and $\lambda=0$. The step size is chosen according to $s_k=\varepsilon\,\xi(k\varepsilon)$, for a positive function $\xi$.
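A single noisy-SGD update might look as follows; the placement of the shrinkage factor and the $\sqrt{2s_k/\beta}$ noise scale follow (noisy-SGD) as written above, and the function names are illustrative.

```python
import numpy as np

def noisy_sgd_step(theta, x, y, f_hat, grad_sigma, s_k, lam, beta, rng):
    """One noisy-SGD step: ridge shrinkage, a gradient step on the squared error,
    and Gaussian noise of scale sqrt(2*s_k/beta). Sketch only."""
    err = y - f_hat(x, theta)                # scalar residual y - f_N(x; theta)
    G = grad_sigma(x, theta)                 # (N, D) array: rows are grad_theta sigma_*(x; theta_i)
    noise = np.sqrt(2 * s_k / beta) * rng.normal(size=theta.shape)
    return (1 - 2 * lam * s_k) * theta + 2 * s_k * err * G + noise
```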

The infinite-dimensional evolution corresponding to noisy SGD is given by

$$\partial_t\rho_t \;=\; 2\xi(t)\,\nabla_{\boldsymbol{\theta}}\!\cdot\!\big(\rho_t\,\nabla_{\boldsymbol{\theta}}\Psi_\lambda(\boldsymbol{\theta};\rho_t)\big) + 2\xi(t)\beta^{-1}\,\Delta_{\boldsymbol{\theta}}\rho_t\,,\qquad \Psi_\lambda(\boldsymbol{\theta};\rho)\;\equiv\;\Psi(\boldsymbol{\theta};\rho)+\frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2\,. \qquad \mathrm{(diffusion\text{-}DD)}\;(7)$$

The function $\xi$ is defined as in (DD). At this point it is important to note that the PDE (DD) has to be interpreted in the weak sense, while, for $\beta<\infty$, Eq. (diffusion-DD) has strong solutions, i.e. solutions that are once continuously differentiable in time and twice in space (see [MMN18] and Appendix F).

It is useful to lift the population risk to the space of probability distributions:

$$R(\rho) \;\equiv\; \mathbb{E}\Big\{\big[y-\hat f(\boldsymbol{x};\rho)\big]^2\Big\} \;=\; \mathbb{E}\{y^2\} + 2\int V(\boldsymbol{\theta})\,\rho(\mathrm{d}\boldsymbol{\theta}) + \int U(\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)\,\rho(\mathrm{d}\boldsymbol{\theta}_1)\,\rho(\mathrm{d}\boldsymbol{\theta}_2)\,. \qquad (8)$$

We also note that, given the structure of the activation function in Eq. (1), for $\boldsymbol{\theta}=(a,\boldsymbol{w})$, $a\in\mathbb{R}$, $\boldsymbol{w}\in\mathbb{R}^{d}$, we can write $V(\boldsymbol{\theta})=a\,v(\boldsymbol{w})$ and $U(\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)=a_1a_2\,u(\boldsymbol{w}_1,\boldsymbol{w}_2)$, where $v(\boldsymbol{w})\equiv-\mathbb{E}\{y\,\sigma(\langle\boldsymbol{w},\boldsymbol{x}\rangle)\}$ and $u(\boldsymbol{w}_1,\boldsymbol{w}_2)\equiv\mathbb{E}\{\sigma(\langle\boldsymbol{w}_1,\boldsymbol{x}\rangle)\,\sigma(\langle\boldsymbol{w}_2,\boldsymbol{x}\rangle)\}$.
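With $v$ and $u$ defined as above, these quantities can be estimated from samples by straightforward Monte Carlo averages; the snippet below is a sketch (function names are ours, $\sigma=\tanh$ is illustrative).

```python
import numpy as np

def v_hat(w, X, y, sigma=np.tanh):
    # Monte Carlo estimate of v(w) = -E[ y * sigma(<w, x>) ]
    return -np.mean(y * sigma(X @ w))

def u_hat(w1, w2, X, sigma=np.tanh):
    # Monte Carlo estimate of u(w1, w2) = E[ sigma(<w1, x>) * sigma(<w2, x>) ]
    return np.mean(sigma(X @ w1) * sigma(X @ w2))
```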

In order to establish a non-asymptotic guarantee, we will make the following assumptions:

  • A1. The function $t\mapsto\xi(t)$ is bounded Lipschitz: $\|\xi\|_\infty\le K_1$, $\|\xi\|_{\mathrm{Lip}}\le K_1$.

  • A2. The activation function and the response variables are bounded (in absolute value by a constant $K_2$). Furthermore, its gradient is $K_2$-sub-Gaussian (when $\boldsymbol{x}\sim\mathbb{P}$).

  • A3. The functions $v(\boldsymbol{w})$ and $u(\boldsymbol{w}_1,\boldsymbol{w}_2)$ are differentiable, with bounded and Lipschitz continuous gradients: $\|\nabla v\|_\infty\le K_3$, $\|\nabla u\|_\infty\le K_3$, $\|\nabla v\|_{\mathrm{Lip}}\le K_3$, $\|\nabla u\|_{\mathrm{Lip}}\le K_3$.

  • A4. The initial condition $\rho_0$ is supported on the set $\{\boldsymbol{\theta}=(a,\boldsymbol{w}):\,|a|\le K_4\}$ for a constant $K_4<\infty$.

We will consider two different cases for the SGD dynamics:

General coefficients.

We initialize the parameters as $(\boldsymbol{\theta}^0_i)_{i\le N}=(a^0_i,\boldsymbol{w}^0_i)_{i\le N}\sim_{iid}\rho_0$. Both the $a_i$'s and the $\boldsymbol{w}_i$'s are updated during the dynamics.

Fixed coefficients.

We use the same initialization as described above, but the coefficients $a_i$ are not updated by SGD. The corresponding PDE is given by Eq. (DD) (or (diffusion-DD)), except that the space derivatives are to be interpreted only with respect to $\boldsymbol{w}$, i.e. we replace $\nabla_{\boldsymbol{\theta}}$ by $\nabla_{\boldsymbol{w}}$, and $\Delta_{\boldsymbol{\theta}}$ by $\Delta_{\boldsymbol{w}}$.

While the second setting is less relevant in practice, it is at least as interesting from a theoretical point of view, and some of our guarantees are stronger in that case.

Assume that conditions A1-A4 hold, and fix a time horizon $T\ge 0$. Let $(\rho_t)_{t\ge 0}$ be the solution of the PDE (DD) with initialization $\rho_0$, and let $(\boldsymbol{\theta}^k)_{k\ge 0}$ be the trajectory of SGD (SGD) with initialization $(\boldsymbol{\theta}^0_i)_{i\le N}\sim_{iid}\rho_0$ independently.

  • Consider noiseless SGD with fixed coefficients. Then there exists a constant $K$ (depending only on the constants of assumptions A1-A4) such that

    (9)

    with probability at least .

  • Consider noiseless SGD with general coefficients. Then there exist constants (depending only on the constants of assumptions A1-A4) such that if , we have

    (10)

    with probability at least .

Remark 3.1.

As anticipated in the introduction, provided the step size $\varepsilon$ is small enough, the error terms in Eqs. (9), (10) are small as soon as the number of neurons $N$ is large. In other words, the minimum number of neurons needed for the mean-field approximation to be accurate is independent of the dimension $D$, and only depends on intrinsic features of the activation and of the data distribution.

On the other hand, the dimension $D$ appears explicitly in conjunction with the step size $\varepsilon$: we need $\varepsilon\ll 1/D$ in order for the mean-field approximation to be accurate. This is the same trade-off between step size and dimension that was already achieved in [MMN18].

We next consider noisy SGD, cf. Eq. (noisy-SGD), and the corresponding PDE in Eq. (diffusion-DD). We need to make additional assumptions on the initialization in this case.

  • A5. The initial condition $\rho_0$ is such that, for $\boldsymbol{\theta}^0=(a^0,\boldsymbol{w}^0)\sim\rho_0$, we have that $\boldsymbol{w}^0$ is $K_5$-sub-Gaussian.

  • A6. The functions $v$ and $u$ are sufficiently smooth, with the relevant derivatives uniformly bounded, so that Eq. (diffusion-DD) admits strong solutions (see Appendix F for the precise requirement).

Remark 3.2.

The last condition ensures the existence of strong solutions for Eq. (diffusion-DD). The existence and uniqueness of solutions of the PDE (DD) and of the PDE (diffusion-DD) are discussed in Appendix F.

Assume that conditions A1-A6 hold. Let $(\rho_t)_{t\ge 0}$ be the solution of the PDE (diffusion-DD) with initialization $\rho_0$, and let $(\boldsymbol{\theta}^k)_{k\ge 0}$ be the trajectory of noisy SGD (noisy-SGD) with initialization $(\boldsymbol{\theta}^0_i)_{i\le N}\sim_{iid}\rho_0$ independently. Finally, assume that , , .

  • Consider noisy SGD with fixed coefficients. Then there exists a constant $K$ (depending only on the constants of assumptions A1-A5 and ) such that

    (11)

    with probability at least .

  • Consider noisy SGD with general coefficients. Then there exists a constant $K$ (depending only on the constants of assumptions A1-A5 and ) such that

    (12)

    with probability at least .

Remark 3.3.

Unlike the other results in this paper, the general-coefficients part of the last theorem does not establish a dimension-free bound. Further, while the previous bounds allow us to control the approximation error for any number of neurons $N$, this result requires $N$ to be sufficiently large. The main difficulty in this part is to control the growth of the coefficients $a_i$. This is more challenging than in the noiseless case, since we cannot give a deterministic bound on their magnitude.

Despite these drawbacks, this is the first quantitative bound approximating noisy SGD by the distributional dynamics in the case of unbounded coefficients. It implies that the mean-field theory is accurate once the number of neurons and the inverse step size are large enough, with a requirement that now depends on the dimension.

3.2 Example: Centered anisotropic Gaussians

To illustrate an application of the theorems, we consider the problem of classifying two Gaussians with the same mean and different covariance. This example was studied in [MMN18], but we restate it here for the reader's convenience.

Consider the joint distribution of the data $(y,\boldsymbol{x})$ given by the following:

  • With probability $1/2$: $y=+1$ and $\boldsymbol{x}\sim\mathsf{N}(0,\boldsymbol{\Sigma}_+)$,

  • With probability $1/2$: $y=-1$ and $\boldsymbol{x}\sim\mathsf{N}(0,\boldsymbol{\Sigma}_-)$,

where the covariances $\boldsymbol{\Sigma}_\pm$ are diagonal in the basis given by an unknown orthogonal matrix. In other words, there exists a subspace $\mathcal{V}$ of dimension $s_0$, such that the projection of $\boldsymbol{x}$ onto $\mathcal{V}$ is distributed according to an isotropic Gaussian with variance $(1+\Delta)^2$ (if $y=+1$) or $(1-\Delta)^2$ (if $y=-1$). The projection orthogonal to $\mathcal{V}$ has instead the same variance in the two classes.

We choose an activation function without offset or output weights, namely $\sigma_*(\boldsymbol{x};\boldsymbol{\theta}_i)=\sigma(\langle\boldsymbol{w}_i,\boldsymbol{x}\rangle)$. While qualitatively similar results are obtained for other choices of $\sigma$, we will use a simple piecewise linear function (a truncated ReLU) as a running example.

We introduce a class of good uninformative initializations for which convergence to the optimum takes place. We say that $\rho_0$ belongs to this class if $\rho_0$ is absolutely continuous with respect to the Lebesgue measure, with bounded density, and satisfies an additional normalization condition (we refer to [MMN18] for the precise definition).

The following theorem is an improvement of [MMN18, Theorem 2] obtained using Theorem 3.1; its proof simply replaces the last step of the proof of [MMN18, Theorem 2] with the new bounds developed in Theorem 3.1(A). For any accuracy and confidence level, there exist constants such that the following holds for the problem of classifying anisotropic Gaussians with the model parameters (in particular $\Delta$) held fixed. For any admissible dimension, number of neurons, and step size, consider SGD with a good uninformative initialization $\rho_0$. Then, with high probability, the risk along the SGD trajectory comes close to its minimum value. Compared to [MMN18, Theorem 2], here the required number of neurons is independent of the dimension, whereas [MMN18] required it to grow with the dimension. The number of data samples used is still of the optimal order.
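A sampler for this data model can be written as follows; the variance convention $(1\pm\Delta)^2$ on the hidden subspace follows the description above, and the function name and sampling details are illustrative.

```python
import numpy as np

def sample_anisotropic_gaussians(n, d, s0, Delta, rng):
    """Sample (y, x): y = +/-1 with probability 1/2 each; x is Gaussian with variance
    (1 + Delta)^2 or (1 - Delta)^2 on a hidden s0-dimensional subspace (depending on y)
    and variance 1 on its orthogonal complement."""
    U, _ = np.linalg.qr(rng.normal(size=(d, d)))   # unknown orthogonal matrix
    V = U[:, :s0]                                  # basis of the hidden subspace
    y = rng.choice([1.0, -1.0], size=n)
    scale = np.where(y[:, None] > 0, 1.0 + Delta, 1.0 - Delta)
    z = rng.normal(size=(n, d))
    z_par = (z @ V) @ V.T                          # component of z in the subspace
    x = z + (scale - 1.0) * z_par                  # rescale only the subspace component
    return y, x

# example usage (toy sizes): y, x = sample_anisotropic_gaussians(1000, 100, 20, 0.5, np.random.default_rng(0))
```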

4 Connection with kernel methods

As discussed above, mean-field theory captures the SGD dynamics of two-layers neural networks when the number of hidden units is large. Several recent papers studied a different description, which approximates the neural network as performing a form of kernel ridge regression [JGH18, DZPS18]. This behavior also arises for a large number of hidden units: we will refer to it as the 'kernel regime', or 'kernel limit'. As shown in [CB18a], the existence of a kernel regime is not specific to neural networks, but is a generic feature of overparameterized models, under certain differentiability assumptions.

4.1 A coupled dynamics

We will focus on noiseless gradient flow, and assume $y=f(\boldsymbol{x})$ (a general joint distribution over $(\boldsymbol{x},y)$ is recovered by setting $f(\boldsymbol{x})=\mathbb{E}[y\,|\,\boldsymbol{x}]$). As in [CB18a], we modify the model (1) by introducing an additional scale parameter $\alpha>0$:

$$\hat f_N(\boldsymbol{x};\boldsymbol{\theta}) \;=\; \frac{\alpha}{N}\sum_{i=1}^{N}\sigma_*(\boldsymbol{x};\boldsymbol{\theta}_i)\,. \qquad (13)$$

In the case of general coefficients, this amounts to rescaling the coefficients $a_i\mapsto\alpha\,a_i$. Equivalently, this corresponds to a different initialization for the $a_i$'s (larger by a factor $\alpha$).

We first note that the theorems of the previous section obviously hold for the modified dynamics, with the PDE (DD) generalized to

(14)
(15)

where . It is convenient to redefine time units by letting . This satisfies the rescaled distributional dynamics

(Rescaled-DD)

We next consider the residuals $f(\boldsymbol{x})-\hat f(\boldsymbol{x};\rho_t)$, which we view as an element of $L^2(\mathbb{P})$. As first shown in [RVE18], this quantity satisfies the following mean-field residual dynamics (for further background, we refer to Appendix H):

(RD)
(16)

Coupling the dynamics (Rescaled-DD) and (RD) suggests the following point of view: the gradient flow dynamics of a two-layers neural network is a kernel boosting dynamics with a time-varying kernel. The scaling parameter $\alpha$ controls the speed at which the kernel evolves.
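Concretely, the time-varying kernel alluded to here is the one induced by the current parameter distribution; up to $\alpha$-dependent normalization factors, which we do not track here, it reads

$$K_{\rho_t}(\boldsymbol{x}_1,\boldsymbol{x}_2)\;=\;\int\big\langle\nabla_{\boldsymbol{\theta}}\sigma_*(\boldsymbol{x}_1;\boldsymbol{\theta}),\,\nabla_{\boldsymbol{\theta}}\sigma_*(\boldsymbol{x}_2;\boldsymbol{\theta})\big\rangle\,\rho_t(\mathrm{d}\boldsymbol{\theta})\,.$$

At $t=0$ this is the (neural-tangent-type) kernel of the initialization $\rho_0$; as $\alpha$ grows, $\rho_t$, and hence $K_{\rho_t}$, moves more slowly away from $\rho_0$.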

The mean-field residual dynamics (RD) implies that the risk is non-increasing along the gradient flow dynamics. However, since the kernel is not fixed, it is hard to analyze whether the risk converges to zero (see [MMN18, Theorem 4] and [CB18b, Theorems 3.3 and 3.5] for general convergence results).

4.2 Kernel limit of residual dynamics

The kernel regime corresponds to large $\alpha$ and allows for a simpler treatment of the dynamics. Heuristically, the reason for such a simplification is that, for large $\alpha$, the kernel evolves slowly, cf. (Rescaled-DD). We are therefore tempted to replace $\rho_t$ in Eq. (RD) by the initial condition $\rho_0$. Formally, we define the following linearized residual dynamics:

(17)

We can also define the corresponding predictors from these residuals. The linear operator appearing in Eq. (17) is bounded, and standard semigroup theory [Eva09] implies the following: as $t\to\infty$, the linearized residual converges to the orthogonal projection of the initial residual onto the null space of this operator. In particular, if the null space is empty, then the residual converges to zero, and correspondingly the predictors converge to $f$.
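On a finite sample, the semigroup statement above can be illustrated by exponentiating the kernel matrix; this is a sketch with our own normalization, only meant to show that the residual decays on the range of the operator and is frozen on its null space.

```python
import numpy as np

def linearized_residual(u0, K, t):
    """u_t = exp(-t * K) u0 for a PSD kernel matrix K restricted to a finite sample.
    Eigendirections with zero eigenvalue (the null space) are left unchanged."""
    evals, evecs = np.linalg.eigh(K)
    decay = np.exp(-t * np.clip(evals, 0.0, None))  # clip guards against tiny negative eigenvalues
    return evecs @ (decay * (evecs.T @ u0))
```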

The next theorem shows that the above intuition is correct: for large $\alpha$, the linearized dynamics is a good approximation to the mean-field dynamics. Below, we denote the population risk by $R$. Let the residuals of the mean-field dynamics (RD) and of the linearized dynamics (17) be as defined above. Let assumptions A1, A3, A4 hold, and additionally assume the following:

  • , , and is differentiable.

  • .

  • .

Then there exists a constant, depending on the constants appearing in these assumptions, such that

  • For SGD with fixed coefficients, we have

    (18)
    (19)
  • For SGD with general coefficients, we have

    (20)
    (21)
  • In particular, if under the law , is independent of and . Then is independent of . If the null space of is empty, then under both settings (fixed and variable coefficients)

    (22)
    (23)
Remark 4.1.

Unlike in similar results in the literature, we focus here on the population risk rather than the empirical risk. The recent paper [CB18a] addresses both the overparametrized and the underparametrized regimes. The latter result (namely [CB18a, Theorem 3.4]) is of course relevant for the population risk. However, while [CB18a] proves convergence to a local minimum, here we show that the population risk becomes close to its global minimum.

Remark 4.2.

As stated above, the linearized residual dynamics can be interpreted as performing kernel ridge regression with respect to the kernel determined by the initialization $\rho_0$, see e.g. [JGH18]. A way to clarify the connection is to consider the case in which $\mathbb{P}$ is the empirical distribution of the data. In this case the linearized dynamics converges to a kernel fit of the data points. For the sake of completeness, we review the connection in Appendix H.7.
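For the empirical-distribution case mentioned in this remark, the limiting fit can be computed as follows; here `K_train` is the kernel matrix of the training inputs under the kernel induced by $\rho_0$, and the small ridge `reg` is added only for numerical stability (it is our addition, not part of the statement).

```python
import numpy as np

def kernel_ridge_predict(K_train, K_test_train, y_train, reg=1e-8):
    """Kernel (ridge) regression with the kernel induced by the initialization rho_0.
    As reg -> 0 this approaches the interpolating solution that the linearized
    dynamics converges to when the kernel matrix is non-singular."""
    alpha = np.linalg.solve(K_train + reg * np.eye(len(y_train)), y_train)
    return K_test_train @ alpha
```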

References

Appendix A Notations

  • For future reference, we copy the key definitions from the main text:

    where or depending on the context. Further, we will denote for and :

    In particular,

    In the case of fixed coefficients, without loss of generality, we will fix in the proof for notational simplicity and freely denote ,

  • is the Wasserstein distance between probability measures

  • For , we will denote . With a little abuse of notation, for , we will denote , with the time discretization parameter.

  • K will denote a generic constant depending on for , where the ’s are constants that will be specified from the context.

  • In the proof and the statements of the theorems, we will only consider the leading order in . In particular, we freely use that for a constant .

  • For the reader's convenience, we copy here the two simplified versions of Gronwall's lemma that will be used extensively in the proofs.

    1. Consider an interval $[t_0,t_1]$ and a real-valued function $g$ defined on $[t_0,t_1]$, and assume there exist positive constants $a$, $b$ such that $g$ satisfies the integral inequality
$$g(t) \;\le\; a + b\int_{t_0}^{t} g(s)\,\mathrm{d}s \qquad \text{for all } t\in[t_0,t_1]\,.$$
      Then $g(t)\le a\,e^{b(t-t_0)}$ for all $t\in[t_0,t_1]$.

    2. Consider a non-negative sequence and assume there exist positive constants such that it satisfies the summation inequality