1 Introduction
Multilayer neural networks, and in particular multilayer perceptrons, present a number of remarkable features. They are effectively trained using stochastic gradient descent (SGD) [LBBH98]; their behavior is fairly insensitive to the number of hidden units or to the input dimensions [SHK14]; their number of parameters is often larger than the number of samples. In this paper we consider simple neural networks with one layer of hidden units:
(1) $\hat{f}(x;\theta) = \frac{1}{N}\sum_{i=1}^{N}\sigma_*(x;\theta_i)\,.$
Here $x\in\mathbb{R}^d$ is a feature vector, $\theta=(\theta_1,\dots,\theta_N)$ comprises the network parameters, $\theta_i\in\mathbb{R}^D$, and $\sigma_*:\mathbb{R}^d\times\mathbb{R}^D\to\mathbb{R}$ is a bounded activation function. The most classical example is $\sigma_*(x;\theta_i)=a_i\,\sigma(\langle w_i,x\rangle)$, where $\sigma:\mathbb{R}\to\mathbb{R}$ is a scalar function (and of course $\theta_i=(a_i,w_i)$), but our theory covers a broader set of examples. We assume to be given data $(y_j,x_j)_{j\ge 1}\sim_{iid}\mathbb{P}$, with $\mathbb{P}$ a probability distribution over $\mathbb{R}\times\mathbb{R}^d$, and attempt at minimizing the square loss risk:
(2) $R_N(\theta) = \mathbb{E}\big\{\big[y-\hat{f}(x;\theta)\big]^2\big\}\,.$
The risk function can be either understood as population risk or empirical risk, depending on viewing $\mathbb{P}$ as a population distribution or assuming $\mathbb{P}$ is supported on the data points. If $R_N$ is understood as the population risk, we can rewrite
(3) $R_N(\theta) = R^* + \mathbb{E}\big\{\big[f(x)-\hat{f}(x;\theta)\big]^2\big\}\,,$
where $f(x)=\mathbb{E}[y|x]$ and $R^* = \mathbb{E}\{[y-f(x)]^2\}$ is the Bayes error.
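To fix ideas, the model (1) and the empirical version of the risk (2) can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the paper's code: tanh plays the role of the bounded activation $\sigma$, and all names and sizes below are ours.

```python
import numpy as np

def predict(X, a, W):
    # Two-layers network f(x; theta) = (1/N) * sum_i a_i * sigma(<w_i, x>),
    # with tanh as a stand-in for the bounded activation sigma.
    # X: (n, d) features; a: (N,) second-layer coefficients; W: (N, d) weights.
    return (np.tanh(X @ W.T) * a).mean(axis=1)

def risk(X, y, a, W):
    # Empirical version of the square-loss risk R_N(theta) = E[(y - f(x))^2].
    return np.mean((y - predict(X, a, W)) ** 2)

rng = np.random.default_rng(0)
n, d, N = 200, 10, 50
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                         # toy labels
a = rng.standard_normal(N)
W = rng.standard_normal((N, d)) / np.sqrt(d)
print(risk(X, y, a, W))                      # a non-negative scalar
```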
Classical theory of universal approximation provides useful insights into the way two-layers networks capture arbitrary input-output relations [Cyb89, Bar93]. In particular, Barron's theorem [Bar93] guarantees
(4) $\inf_{\theta}\,\mathbb{E}\big\{\big[f(x)-\hat{f}(x;\theta)\big]^2\big\} \le \frac{(2rC_f)^2}{N}\,,$
where $C_f = \int_{\mathbb{R}^d}\|\omega\|_2\,|\hat{F}(\omega)|\,d\omega$, $\hat{F}$ is the Fourier transform of $f$, and $r$ is the supremum of $\|x\|_2$ in the support of $\mathbb{P}$. This result is remarkable in that the minimum number of neurons needed to achieve a certain accuracy depends only on intrinsic regularity properties of $f$ and not on the dimension $d$. The proof of this and similar results shows that it is more insightful to think of the representation (1) in terms of the empirical distribution of the neurons $\hat{\rho}^{(N)} = \frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i}$. With a slight abuse of notation, we have $\hat{f}(x;\theta) = \hat{f}(x;\hat{\rho}^{(N)})$, where, for a general distribution $\rho$, we define
(5) $\hat{f}(x;\rho) = \int \sigma_*(x;\theta)\,\rho(d\theta)\,.$
The universal approximation property is then related to the fact that an arbitrary distribution $\rho$ can be approximated by one supported on $N$ points (of course, here we are hiding some important technical issues).
Approximation theory provides some insight into the peculiar properties of neural networks. Small population risk is achieved by many networks, since what matters is the distribution $\hat{\rho}^{(N)}$, not the parameters $\theta$. The behavior is insensitive to the number of neurons $N$, as long as this is large enough for $\hat{\rho}^{(N)}$ to approximate the target distribution $\rho$. Finally, the bound (4) is dimension-free.
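The fact that the network (1) depends on its parameters only through the empirical distribution of the neurons can be checked directly: permuting (relabeling) the neurons leaves the predictions unchanged. A minimal sketch under our own conventions (tanh stands in for the activation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 40, 5
X = rng.standard_normal((100, d))
a = rng.standard_normal(N)
W = rng.standard_normal((N, d))

def f_hat(X, a, W):
    # f(x; rho_hat): an average of sigma_*(x; theta_i) over the neurons,
    # i.e. an integral against the empirical distribution (1/N) sum_i delta_{theta_i}.
    return (np.tanh(X @ W.T) * a).mean(axis=1)

perm = rng.permutation(N)                    # relabel the neurons
assert np.allclose(f_hat(X, a, W), f_hat(X, a[perm], W[perm]))
```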
Of course these insights concern ideal representations, and not necessarily the networks generated by SGD. Recently, an analysis of SGD dynamics has been developed that connects naturally to the theory of universal approximation [MMN18, SS18, RVE18, CB18b]. The main object of study is the empirical distribution $\hat{\rho}^{(N)}_k$ after $k$ SGD steps. For large $N$, small step size $\varepsilon$, and setting $t = k\varepsilon$, $\hat{\rho}^{(N)}_k$ turns out to be well approximated by a probability distribution $\rho_t$. The latter evolves according to the following partial differential equation
(DD) $\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)\big)\,,$
(6) $\Psi(\theta;\rho) = V(\theta) + \int U(\theta,\tilde{\theta})\,\rho(d\tilde{\theta})\,,\qquad V(\theta) = -\mathbb{E}\{y\,\sigma_*(x;\theta)\}\,,\quad U(\theta_1,\theta_2) = \mathbb{E}\{\sigma_*(x;\theta_1)\,\sigma_*(x;\theta_2)\}\,.$
(Here $\xi(t)$ is a function that gauges the evolution of the step size and will be defined below. In fact, there is little loss to the following discussion in setting $\xi$ to a constant.) We will refer to this as the mean field description, or distributional dynamics. This description has the advantage of being explicitly independent of the number of hidden units, and hence accounts for one of the empirical findings described above (the insensitivity to the number of neurons). Further, it allows us to focus on some key elements of the dynamics (global convergence, typical behavior) while neglecting others (local minima, statistical noise).
Several papers used this approach over the last year to analyze learning in two-layers networks; this work is succinctly reviewed in Section 2.
Of course, a crucial question needs to be answered for this approach to be meaningful: in what regime is the distributional dynamics a good approximation to SGD? Quantitative approximation guarantees were established in [MMN18], under certain regularity conditions on the data distribution $\mathbb{P}$, and for bounded activation functions. Under these conditions, and for bounded time horizons, [MMN18] proves that the distributional dynamics solution approximates well the actual empirical distribution $\hat{\rho}^{(N)}_k$, when the number of neurons $N$ is much larger than the problem dimension $D$.
The results of [MMN18] present several limitations, which we overcome in the present paper. We briefly summarize our contributions.
Dimension-free approximation.

As mentioned above, both classical approximation theory and the mean-field analysis of SGD approximate a certain target distribution by the empirical distribution of the network parameters $\hat{\rho}^{(N)}$. However, while the approximation bound (4) is dimension-free, the approximation guarantees of [MMN18] are explicitly dimension-dependent. Even for very smooth functions $f$, and well-behaved data distributions, the results of [MMN18] require $N\gg D$.
Here we prove a new bound that is dimension-independent and therefore more natural. The proof follows a coupling argument which is different from, and more powerful than, the one of [MMN18]. A key improvement consists in isolating different error terms, and developing a more delicate concentration-of-measure argument which controls the dependence of the error on $D$.
Let us emphasize that capturing the correct dimension dependence is an important test of the mean-field theory, and it is crucial in order to compare neural networks to other learning techniques (see Section 4).
 Unbounded activations.

The approximation guarantee of [MMN18] only applies to activation functions that are bounded. This excludes the important case of unbounded second-layer coefficients as in Eq. (1). We extend our analysis to that case. This requires developing an a priori bound on the growth of the coefficients $a_i$. As in the previous point, our approximation guarantee is dimension-free.
 Noisy SGD.

Finally, in some cases it is useful to inject noise into SGD. From a practical perspective, this can help avoid local minima. From an analytical perspective, it corresponds to a modified PDE which contains an additional Laplacian term. This PDE has smoother solutions that are supported everywhere and converge globally to a unique fixed point [MMN18].
In this setting, we prove a dimension-free approximation guarantee for the case of bounded activations. We also obtain a guarantee for noisy SGD with unbounded activations, but the latter is not dimension-free.
 Kernel limit.

We analyze the PDE (DD) in a specific short-time limit and show that it is well approximated by a linearized dynamics. This dynamics can be thought of as fitting a kernel ridge regression model with respect to a kernel corresponding to the initial weight distribution $\rho_0$. ('Kernel ridge regression' and 'kernel regression' are used with somewhat different meanings in the literature. Kernel ridge regression uses global information and can be defined as ridge regression in a reproducing kernel Hilbert space (RKHS), while kernel regression uses local averages. See Remark H.1 for a definition.) We thus recover (from a different viewpoint) a connection with kernel methods that has been investigated in several recent papers [JGH18, DZPS18, DLL18, AZLS18]. Beyond the short time scale, the dynamics is analogous to a kernel boosting dynamics with a time-varying, data-dependent kernel (a point that already appears in [RVE18]).
Mean-field theory has allowed proving global convergence guarantees for SGD in two-layers neural networks [MMN18, CB18b]. Unfortunately, these results do not provide (in general) useful bounds on the network size $N$. We believe that the results in this paper are a required step in that direction.
The rest of this paper is organized as follows. The next section overviews related work, focusing in particular on the distributional dynamics (DD), its variants and applications. In Section 3 we present formal statements of our results. Section 4 develops the connection with kernel methods. Proofs are mostly deferred to the appendices.
2 Related work
As mentioned above, classical approximation theory already uses (either implicitly or explicitly) the idea of lifting the class of finite neural networks, cf. Eq. (1), to the infinite-dimensional space (5) parametrized by probability distributions $\rho$, see e.g. [Cyb89, Bar93, Bar98, AB09]. This idea was exploited algorithmically, e.g. in [BRV06, NS17].
Only very recently was (stochastic) gradient descent proved to converge (for a large enough number of neurons) to the infinite-dimensional evolution (DD) [MMN18, RVE18, SS18, CB18b]. In particular, [MMN18] proves quantitative bounds to approximate SGD by the mean-field dynamics. Our work is mainly motivated by the objective to obtain a better scaling with dimension and to allow for unbounded second-layer coefficients.
The mean-field description was exploited in several papers to establish global convergence results. In [MMN18] global convergence was proved in special examples, and in a general setting for noisy SGD. The papers [RVE18, CB18b] studied global convergence by exploiting the homogeneity properties of Eq. (1). In particular, [CB18b] proves a general global convergence result: for initial conditions with full support, the PDE (DD) converges to a global minimum provided the activations are homogeneous in the parameters. Notice that the presence of unbounded second-layer coefficients is crucial in order to achieve homogeneity. Unfortunately, the results of [CB18b] do not provide quantitative approximation bounds relating the PDE (DD) to finite SGD. The present paper fills this gap by establishing approximation bounds that apply to the setting of [CB18b].
A different optimization algorithm was studied in [WLLM18] using the mean-field description. The algorithm resamples a positive fraction of the neurons uniformly at random at a constant rate. This allows the authors to establish a global convergence result (under certain assumed smoothness properties of the PDE solution). Again, this paper does not provide quantitative bounds on the difference between the PDE and finite SGD. While our theorems do not cover the algorithm of [WLLM18], we believe that their algorithm could be analyzed using the approach developed here. Exponentially fast convergence to a global optimum was proven in [JMM19] for certain radial-basis-function networks, using again the mean-field approach. While the setting of [JMM19] is somewhat different (weights are constrained to a convex compact domain), the technique presented here could be applicable to that problem as well. Finally, a recent stream of works [JGH18, GJS19, DZPS18, DLL18, AZLS18] argues that wide two-layers networks are actually performing a type of kernel ridge regression. As shown in [CB18a], this phenomenon is not limited to neural networks, but is generic for a broad class of models. As expected, the kernel regime can indeed be recovered as a special limit of the mean-field dynamics (DD), cf. Section 4. Let us emphasize that here we focus on the population rather than the empirical risk.
A discussion of the difference between the kernel and mean-field regimes was recently presented in [DL19]. However, [DL19] argues that the difference between kernel and mean-field behaviors is due to different initializations of the coefficients $a_i$'s. We show instead that, for a suitable scaling of the initialization, the kernel and mean field regimes appear at different time scales: the kernel behavior arises at the beginning of the dynamics, while mean field behavior characterizes longer time scales. It is also worth mentioning that the connection between mean field dynamics and kernel boosting with a time-varying, data-dependent kernel was already present (somewhat implicitly) in [RVE18].
3 Dimension-free mean field approximation
3.1 General results
As mentioned above, we assume to be given data $(y_k,x_k)_{k\ge 1}\sim_{iid}\mathbb{P}$, and we run SGD with step size $s_k$:
(SGD) $\theta_i^{k+1} = \theta_i^k + 2s_k\,\big(y_{k+1}-\hat{f}(x_{k+1};\theta^k)\big)\,\nabla_{\theta_i}\sigma_*(x_{k+1};\theta_i^k)\,.$
We will work under a one-pass model, that is, each data point is visited once.
We also consider a noisy version of SGD, with a regularization term:
(noisy-SGD) $\theta_i^{k+1} = (1-2\lambda s_k)\,\theta_i^k + 2s_k\,\big(y_{k+1}-\hat{f}(x_{k+1};\theta^k)\big)\,\nabla_{\theta_i}\sigma_*(x_{k+1};\theta_i^k) + \sqrt{2s_k\tau}\;g_i^k\,,$
where $(g_i^k)_{i\le N,k\ge 0}\sim_{iid}\mathsf{N}(0,I_D)$. The noiseless version is recovered by setting $\tau=0$ and $\lambda=0$. The step size is chosen according to $s_k = \varepsilon\,\xi(k\varepsilon)$, for a positive function $\xi$.
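A minimal one-pass implementation of (SGD)/(noisy-SGD) may be sketched as follows. This is a hedged illustration under our own conventions (tanh activation, constant schedule $\xi\equiv 1$, all names ours), not the paper's code; the update uses $N$ times the raw gradient of $R_N$, matching the mean-field scaling of the step.

```python
import numpy as np

def one_pass_sgd(data, a, W, eps, lam=0.0, tau=0.0, seed=0):
    """One-pass (noisy) SGD for f(x) = (1/N) sum_i a_i * tanh(<w_i, x>).
    Each sample (x, y) is visited once.  lam is the ridge coefficient and
    tau the noise temperature; lam = tau = 0 recovers noiseless SGD."""
    rng = np.random.default_rng(seed)
    N = len(a)
    for x, y in data:
        z = np.tanh(W @ x)                   # sigma(<w_i, x>), shape (N,)
        err = y - (a @ z) / N                # residual y - f(x; theta)
        # N times the raw gradients of R_N (mean-field step scaling).
        grad_a = err * z
        grad_W = err * (a * (1 - z ** 2))[:, None] * x[None, :]
        a += 2 * eps * (grad_a - lam * a) \
            + np.sqrt(2 * eps * tau) * rng.standard_normal(N)
        W += 2 * eps * (grad_W - lam * W) \
            + np.sqrt(2 * eps * tau) * rng.standard_normal(W.shape)
    return a, W

rng = np.random.default_rng(0)
d, N, n = 5, 30, 500
data = [(x, np.sign(x[0])) for x in rng.standard_normal((n, d))]
a, W = np.ones(N), rng.standard_normal((N, d)) / np.sqrt(d)
a, W = one_pass_sgd(data, a, W, eps=0.05)
```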
The infinite-dimensional evolution corresponding to noisy SGD is given by
(diffusion-DD) $\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi_\lambda(\theta;\rho_t)\big) + 2\xi(t)\,\tau\,\Delta_\theta\rho_t\,,$
(7) $\Psi_\lambda(\theta;\rho) = \Psi(\theta;\rho) + \frac{\lambda}{2}\|\theta\|_2^2\,.$
The function $\Psi$ is defined as in (DD). At this point it is important to note that the PDE (DD) has to be interpreted in a weak sense, while, for $\tau>0$, Eq. (diffusion-DD) has strong solutions, i.e. solutions that are once continuously differentiable in time and twice in space (see [MMN18] and Appendix F).
It is useful to lift the population risk to the space of distributions:
(8) $R(\rho) = \mathbb{E}\big\{\big[y-\hat{f}(x;\rho)\big]^2\big\}\,.$
We also note that, given the structure of the activation function in Eq. (1), for $\theta=(a,w)$, $\sigma_*(x;\theta)=a\,\sigma(\langle w,x\rangle)$, we can write $V(\theta)=a\,v(w)$, $U(\theta_1,\theta_2)=a_1a_2\,u(w_1,w_2)$, where $v(w)=-\mathbb{E}\{y\,\sigma(\langle w,x\rangle)\}$ and $u(w_1,w_2)=\mathbb{E}\{\sigma(\langle w_1,x\rangle)\,\sigma(\langle w_2,x\rangle)\}$.
In order to establish a non-asymptotic guarantee, we will make the following assumptions:

A1. The function $t\mapsto\xi(t)$ is bounded Lipschitz.

A2. The activation function $\sigma$ and the response variables $y$ are bounded. Furthermore, the gradient $\nabla_w\sigma(\langle w,x\rangle)$ is sub-Gaussian (when $x\sim\mathbb{P}$).

A3. The functions $v$ and $u$ are differentiable, with bounded and Lipschitz continuous gradients.

A4. The initial condition $\rho_0$ is supported on $\{(a,w):|a|\le K\}$ for a constant $K$.
We will consider two different cases for the SGD dynamics:

General coefficients.

We initialize the parameters as $(\theta_i^0)_{i\le N}\sim_{iid}\rho_0$. Both the $a_i$'s and the $w_i$'s are updated during the dynamics.

Fixed coefficients.

We use the same initialization as described above, but the coefficients $a_i$ are not updated by SGD. The corresponding PDE is given by Eq. (DD) (or (diffusion-DD)), except that the space derivatives are to be interpreted only with respect to $w$, i.e. we replace $\nabla_\theta$ by $\nabla_w$, and $\Delta_\theta$ by $\Delta_w$.

While the second setting is less relevant in practice, it is at least as interesting from a theoretical point of view, and some of our guarantees are stronger in that case.
Assume that conditions A1-A4 hold, and let $T\ge 0$. Let $(\rho_t)_{t\ge 0}$ be the solution of the PDE (DD) with initialization $\rho_0$, and let $(\theta^k)_{k\ge 0}$ be the trajectory of SGD (SGD) with initialization $(\theta_i^0)_{i\le N}\sim_{iid}\rho_0$ independently.

(A) Consider noiseless SGD with fixed coefficients. Then there exists a constant $K$ (depending uniquely on the constants of assumptions A1-A4) such that the bound (9) holds with high probability.

(B) Consider noiseless SGD with general coefficients. Then there exist constants $K$ and $\varepsilon_0$ (depending uniquely on the constants of assumptions A1-A4) such that, if $\varepsilon\le\varepsilon_0$, the bound (10) holds with high probability.
Remark 3.1.
As anticipated in the introduction, provided the step size $\varepsilon$ is small enough, the error terms in Eqs. (9), (10) are small as soon as $N$ is large. In other words, the minimum number of neurons needed for the mean-field approximation to be accurate is independent of the dimension $D$, and only depends on intrinsic features of the activation and data distribution.
On the other hand, the dimension $D$ appears explicitly in conjunction with the step size $\varepsilon$. We need $\varepsilon\ll 1/D$ in order for mean field to be accurate. This is the same trade-off between step size and dimension that was already achieved in [MMN18].
We next consider noisy SGD, cf. Eq. (noisy-SGD), and the corresponding PDE in Eq. (diffusion-DD). We need to make additional assumptions on the initialization in this case.

A5. The initial condition $\rho_0$ is such that, for $(a_0,w_0)\sim\rho_0$, the weight vector $w_0$ is sub-Gaussian.

A6. The density of $\rho_0$ exists and is uniformly bounded, together with its gradient, for $\theta\in\mathbb{R}^D$.
Remark 3.2.
The last condition ensures the existence of strong solutions for Eq. (diffusion-DD). The existence and uniqueness of solutions of the PDE (DD) and the PDE (diffusion-DD) are discussed in Appendix F.
Assume that conditions A1-A6 hold. Let $(\rho_t)_{t\ge 0}$ be the solution of the PDE (diffusion-DD) with initialization $\rho_0$, and let $(\theta^k)_{k\ge 0}$ be the trajectory of noisy SGD (noisy-SGD) with initialization $(\theta_i^0)_{i\le N}\sim_{iid}\rho_0$ independently. Finally, fix $T\ge 0$, $\tau>0$, $\lambda\ge 0$.

(A) Consider noisy SGD with fixed coefficients. Then there exists a constant $K$ (depending uniquely on the constants of assumptions A1-A5, and on $\tau$, $\lambda$) such that the bound (11) holds with high probability.

(B) Consider noisy SGD with general coefficients. Then there exists a constant $K$ (depending uniquely on the constants of assumptions A1-A5, and on $\tau$, $\lambda$) such that the bound (12) holds with high probability.
Remark 3.3.
Unlike the other results in this paper, part (B) of Theorem 3.1 does not establish a dimension-free bound. Further, while the previous bounds allow controlling the approximation error for an arbitrary step size, this result requires the step size to be sufficiently small. The main difficulty in part (B) is to control the growth of the coefficients $a_i$. This is more challenging than in the noiseless case, since we cannot give a deterministic bound on the coefficients.
Despite these drawbacks, this is the first quantitative bound approximating noisy SGD by the distributional dynamics in the case of unbounded coefficients. It implies that the mean field theory is accurate for sufficiently large $N$ and small step size.
3.2 Example: Centered anisotropic Gaussians
To illustrate an application of the theorems, we consider the problem of classifying two Gaussians with the same mean and different covariance. This example was studied in [MMN18], but we restate it here for the reader's convenience. Consider the joint distribution of data $(y,x)$ given by the following:

With probability $1/2$: $y=+1$, $x\sim\mathsf{N}(0,\Sigma_+)$;

With probability $1/2$: $y=-1$, $x\sim\mathsf{N}(0,\Sigma_-)$;

where the covariances $\Sigma_\pm$ differ only on a subspace determined by an unknown orthogonal matrix. In other words, there exists a subspace $V$ of dimension $s_0$, such that the projection of $x$ onto $V$ is distributed according to an isotropic Gaussian with variance $(1+\Delta)^2/d$ (if $y=+1$) or $(1-\Delta)^2/d$ (if $y=-1$). The projection orthogonal to $V$ has instead the same variance in the two classes. We choose an activation function without offset or output weights, namely $\sigma_*(x;\theta)=\sigma(\langle w,x\rangle)$. While qualitatively similar results are obtained for other choices of $\sigma$, we will use a simple piecewise linear function (truncated ReLU) as a running example.
We introduce a class of good uninformative initializations for which convergence to the optimum takes place. We say that $\rho_0$ is a good initialization if it is absolutely continuous with respect to the Lebesgue measure, with bounded density. The following theorem is an improvement of [MMN18, Theorem 2], obtained via Theorem 3.1: its proof simply replaces the last step of the proof of [MMN18, Theorem 2] with the new bounds developed in Theorem 3.1 (A). For the problem of classifying anisotropic Gaussians with $\Delta$ and $s_0/d$ fixed, and for any prescribed accuracy, there exist constants such that the following holds: for any dimension $d$ and a sufficiently large number of neurons $N$ (independent of $d$), SGD with a good initialization and a sufficiently small step size achieves, with high probability, a risk within the prescribed accuracy of the minimum. Compared to [MMN18, Theorem 2], here the required number of neurons is independent of the dimension, rather than $N\gg d$ as previously. The number of data samples used is still of the optimal order.
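The data distribution of this example can be sketched as follows. The hidden subspace is taken as the first $s_0$ coordinates (the unknown rotation is irrelevant for sampling), the per-coordinate variances $(1\pm\Delta)^2/d$ are one natural normalization (ours), and the truncated ReLU below is one natural choice of bounded piecewise linear $\sigma$ (the paper's exact breakpoints may differ):

```python
import numpy as np

def truncated_relu(t):
    # Bounded piecewise-linear activation: 0 for t <= 0, t on [0, 1], 1 for t >= 1.
    return np.clip(t, 0.0, 1.0)

def sample_anisotropic_gaussians(n, d, s0, delta, rng):
    """Two classes with zero mean; covariances differ only on a hidden
    subspace of dimension s0 (here the first s0 coordinates)."""
    y = rng.choice([1.0, -1.0], size=n)
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    scale = np.where(y > 0, 1.0 + delta, 1.0 - delta)
    X[:, :s0] *= scale[:, None]              # anisotropy on the subspace
    return X, y

rng = np.random.default_rng(0)
X, y = sample_anisotropic_gaussians(2000, d=20, s0=5, delta=0.5, rng=rng)
# The classes are distinguishable only through the variance on the subspace.
print(X[y > 0][:, :5].var() > X[y < 0][:, :5].var())    # prints True
```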
4 Connection with kernel methods
As discussed above, mean-field theory captures the SGD dynamics of two-layers neural networks when the number of hidden units is large. Several recent papers studied a different description, which approximates the neural network as performing a form of kernel ridge regression [JGH18, DZPS18]. This behavior also arises for large $N$: we will refer to it as the 'kernel regime', or 'kernel limit'. As shown in [CB18a], the existence of a kernel regime is not specific to neural networks, but is a generic feature of overparameterized models, under certain differentiability assumptions.
4.1 A coupled dynamics
We will focus on noiseless gradient flow, and assume $y=f(x)$ (a general joint distribution over $(y,x)$ is recovered by setting $f(x)=\mathbb{E}[y|x]$). As in [CB18a], we modify the model (1) by introducing an additional scale parameter $\alpha$:
(13) $\hat{f}_\alpha(x;\theta) = \frac{\alpha}{N}\sum_{i=1}^{N}\sigma_*(x;\theta_i)\,.$
In the case of general coefficients $\theta_i=(a_i,w_i)$, this amounts to rescaling the coefficients $a_i\mapsto\alpha\,a_i$. Equivalently, this corresponds to a different initialization for the $a_i$'s (larger by a factor $\alpha$).
We first note that the theorems of the previous section obviously hold for the modified dynamics, with the PDE (DD) generalized to
(14) $\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi_\alpha(\theta;\rho_t)\big)\,,$
(15) $\Psi_\alpha(\theta;\rho) = \alpha\,V(\theta) + \alpha^2\int U(\theta,\tilde{\theta})\,\rho(d\tilde{\theta})\,,$
where $V$ and $U$ are as in (6). It is convenient to redefine time units by letting $\bar{\rho}_t = \rho_{t/\alpha^2}$ (we set $2\xi(t)\equiv 1$, as appropriate for gradient flow). This satisfies the rescaled distributional dynamics
(Rescaled-DD) $\partial_t\bar{\rho}_t = \frac{1}{\alpha^2}\,\nabla_\theta\cdot\big(\bar{\rho}_t\,\nabla_\theta\Psi_\alpha(\theta;\bar{\rho}_t)\big)\,.$
We next consider the residuals $u_t(x) = f(x)-\hat{f}_\alpha(x;\bar{\rho}_t)$, which we view as an element of $L^2(\mathbb{P})$. As first shown in [RVE18], this satisfies the following mean field residual dynamics (for further background, we refer to Appendix H):
(RD) $\partial_t u_t(x) = -\int K_{\bar{\rho}_t}(x,x')\,u_t(x')\,\mathbb{P}(dx')\,,$
(16) $K_\rho(x,x') = \int \big\langle \nabla_\theta\sigma_*(x;\theta),\,\nabla_\theta\sigma_*(x';\theta)\big\rangle\,\rho(d\theta)\,.$
Coupling the dynamics (Rescaled-DD) and (RD) suggests the following picture: the gradient flow dynamics of a two-layers neural network is a kernel boosting dynamics with a time-varying kernel. The scaling parameter $\alpha$ controls the speed at which the kernel evolves.
The mean field residual dynamics (RD) implies that
$\partial_t\|u_t\|_{L^2(\mathbb{P})}^2 = -2\int u_t(x)\,K_{\bar{\rho}_t}(x,x')\,u_t(x')\,\mathbb{P}(dx)\,\mathbb{P}(dx') \le 0\,,$
so that the risk is non-increasing along the gradient flow dynamics. However, since the kernel $K_{\bar{\rho}_t}$ is not fixed, it is hard to analyze when the risk converges to $0$ (see [MMN18, Theorem 4], [CB18b, Theorems 3.3 and 3.5] for general convergence results).
4.2 Kernel limit of residual dynamics
The kernel regime corresponds to large $\alpha$ and allows for a simpler treatment of the dynamics. Heuristically, the reason for such a simplification is that the time derivative of $\bar{\rho}_t$ is of order $1/\alpha$, cf. (Rescaled-DD). We are therefore tempted to replace $\bar{\rho}_t$ in Eq. (RD) by $\rho_0$. Formally, we define the following linearized residual dynamics
(17) $\partial_t\bar{u}_t(x) = -\int K_{\rho_0}(x,x')\,\bar{u}_t(x')\,\mathbb{P}(dx')\,,\qquad \bar{u}_0 = u_0\,.$
We can also define the corresponding predictors by $\bar{f}_t = f-\bar{u}_t$. The operator associated to $K_{\rho_0}$ is bounded, and standard semigroup theory [Eva09] implies the following. We have $\bar{u}_t\to\mathsf{P}_0\,u_0$ as $t\to\infty$, where $\mathsf{P}_0$ is the orthogonal projector onto the null space of $K_{\rho_0}$. In particular, if the null space of $K_{\rho_0}$ is empty, then $\bar{u}_t\to 0$. Correspondingly, $\bar{f}_t\to f$ (where convergence holds in $L^2(\mathbb{P})$).
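The linearized dynamics (17) is easy to simulate once the kernel of the initial neurons is estimated by Monte Carlo. In the sketch below (our conventions: fixed coefficients, $\sigma=\tanh$, $\mathbb{P}$ taken as an empirical measure on $n$ points), the feature map is $\nabla_w\sigma(\langle w,x\rangle)=\sigma'(\langle w,x\rangle)\,x$, the resulting kernel matrix is positive semidefinite, and the risk is non-increasing under a small Euler step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 80, 5, 500
X = rng.standard_normal((n, d))
f = np.sin(X[:, 0])                          # target values f(x_j)
W0 = rng.standard_normal((N, d)) / np.sqrt(d)    # neurons sampled from rho_0

# K_{rho_0}(x, x') = E_{w ~ rho_0}[ sigma'(<w,x>) sigma'(<w,x'>) <x, x'> ],
# estimated by averaging over the N sampled neurons (sigma = tanh).
S = 1.0 / np.cosh(W0 @ X.T) ** 2             # sigma'(<w_i, x_j>)
K = (S.T @ S / N) * (X @ X.T)                # (n, n), positive semidefinite

# Euler discretization of u' = -K u; integrating against the empirical
# measure contributes the 1/n factor.  The risk ||u||^2 is non-increasing
# because K is positive semidefinite and the step is small.
u, dt = f.copy(), 0.05
risks = [float(np.mean(u ** 2))]
for _ in range(200):
    u -= dt * (K @ u) / n
    risks.append(float(np.mean(u ** 2)))
assert all(r1 <= r0 + 1e-12 for r0, r1 in zip(risks, risks[1:]))
print(risks[0], risks[-1])                   # risk decreases toward its floor
```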
The next theorem shows that the above intuition is correct: for $\alpha$ large, the linearized dynamics is a good approximation to the mean field dynamics. Below, we denote the population risk by $R$. Let $u_t$ and $\bar{u}_t$ be the residuals in the mean-field dynamics (RD) and the linearized dynamics (17), respectively. Let assumptions A1, A3, A4 hold, and additionally assume the following:

, , and is differentiable.

.

.
Then there exists a constant $K$, depending on the constants in the above assumptions, such that:

(A) For SGD with fixed coefficients, the bounds (18) and (19) hold.

(B) For SGD with general coefficients, the bounds (20) and (21) hold.
In particular, suppose that, under the law $\rho_0$, the coefficients are independent of the weights; then the kernel $K_{\rho_0}$ is independent of the coefficient distribution. If the null space of $K_{\rho_0}$ is empty, then under both settings (fixed and variable coefficients) the limits (22) and (23) hold.
Remark 4.1.
Unlike similar results in the literature, we focus here on the population risk rather than the empirical risk. The recent paper [CB18a] addresses both the overparametrized and the underparametrized regime. The latter result (namely [CB18a, Theorem 3.4]) is of course relevant for the population risk. However, while [CB18a] proves convergence to a local minimum, here we show that the population risk becomes close to its minimum value.
Remark 4.2.
As stated above, the linearized residual dynamics can be interpreted as performing kernel ridge regression with respect to the kernel $K_{\rho_0}$, see e.g. [JGH18]. A way to clarify the connection is to consider the case in which $\mathbb{P}$ is the empirical data distribution. In this case the linearized dynamics converges to the corresponding kernel ridge regression estimator. For the sake of completeness, we review the connection in Appendix H.7.
References
 [AB09] Martin Anthony and Peter L Bartlett, Neural network learning: Theoretical foundations, Cambridge University Press, 2009.
 [AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, arXiv:1811.03962 (2018).

 [Bar93] Andrew R Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory 39 (1993), no. 3, 930–945.
 [Bar98] Peter L Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Transactions on Information Theory 44 (1998), no. 2, 525–536.
 [BRV06] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte, Convex neural networks, Advances in neural information processing systems, 2006, pp. 123–130.
 [CB18a] Lenaic Chizat and Francis Bach, A note on lazy training in supervised differentiable programming, arXiv:1812.07956 (2018).
 [CB18b] Lenaic Chizat and Francis Bach, On the global convergence of gradient descent for over-parameterized models using optimal transport, arXiv:1805.09545 (2018).
 [Cyb89] George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems 2 (1989), no. 4, 303–314.
 [DL19] Xialiang Dou and Tengyuan Liang, Training neural networks as learning dataadaptive kernels: Provable representation and approximation benefits, arXiv:1901.07114 (2019).
 [DLL18] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai, Gradient descent finds global minima of deep neural networks, arXiv:1811.03804 (2018).
 [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes overparameterized neural networks, arXiv:1810.02054 (2018).
 [Eva09] Lawrence C. Evans, Partial differential equations, Springer, 2009.
 [GJS19] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart, Scaling description of generalization with number of parameters in deep learning, arXiv:1901.01608 (2019).
 [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, arXiv:1806.07572 (2018).
 [JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto, The variational formulation of the Fokker–Planck equation, SIAM Journal on Mathematical Analysis 29 (1998), no. 1, 1–17.
 [JMM19] Adel Javanmard, Marco Mondelli, and Andrea Montanari, Analysis of a twolayer neural network via displacement convexity, arXiv:1901.01375 (2019).
 [LBBH98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, Gradientbased learning applied to document recognition, Proceedings of the IEEE 86 (1998), no. 11, 2278–2324.
 [LL18] Yuanzhi Li and Yingyu Liang, Learning overparameterized neural networks via stochastic gradient descent on structured data, Advances in Neural Information Processing Systems, 2018, pp. 8168–8177.
 [MMN18] Song Mei, Andrea Montanari, and PhanMinh Nguyen, A mean field view of the landscape of twolayer neural networks, Proceedings of the National Academy of Sciences (2018).
 [NS17] Atsushi Nitanda and Taiji Suzuki, Stochastic particle gradient descent for infinite ensembles, arXiv:1712.05438 (2017).
 [RVE18] Grant M Rotskoff and Eric VandenEijnden, Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, arXiv:1805.00915 (2018).
 [San15] Filippo Santambrogio, Optimal transport for applied mathematicians: Calculus of variations, pdes, and modeling, vol. 87, Birkhäuser, 2015.

 [SHK14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014), no. 1, 1929–1958.
 [SS18] Justin Sirignano and Konstantinos Spiliopoulos, Mean field analysis of neural networks, arXiv:1805.01053 (2018).
 [Szn91] AlainSol Sznitman, Topics in propagation of chaos, Ecole d’été de probabilités de SaintFlour XIX—1989, Springer, 1991, pp. 165–251.
 [WLLM18] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma, On the margin theory of feedforward neural networks, arXiv:1810.05369 (2018).
Appendix A Notations

For future reference, we copy the key definitions from the main text:
where or depending on the context. Further, we will denote for and :
In particular,
In the case of fixed coefficients, without loss of generality, we will fix $a=1$ in the proof for notational simplicity.

$W_2$ denotes the Wasserstein distance between probability measures.

For , we will denote . With a little abuse of notation, for , we will denote , with the time discretization parameter.

$K$ will denote a generic constant depending on $(K_i)_i$, where the $K_i$'s are constants that will be specified from the context.

In the proof and the statements of the theorems, we will only consider the leading order in . In particular, we freely use that for a constant .

For the reader's convenience, we copy here the two simplified versions of Gronwall's lemma that will be used extensively in the proofs.


(a) Consider an interval $[0,T]$ and a real-valued function $u$ defined on $[0,T]$, and assume there exist positive constants $A$, $B$ such that $u$ satisfies the integral inequality
$u(t) \le A + B\int_0^t u(s)\,ds\,,\qquad t\in[0,T]\,;$
then $u(t) \le A\,e^{Bt}$ for all $t\in[0,T]$.

(b) Consider a non-negative sequence $(u_k)_{k\ge 0}$ and assume there exist positive constants $A$, $B$ such that $u_k$ satisfies the summation inequality
$u_k \le A + B\sum_{j=0}^{k-1} u_j\,,\qquad k\ge 0\,;$
then $u_k \le A\,(1+B)^k$ for all $k\ge 0$.