# Can Shallow Neural Networks Beat the Curse of Dimensionality? A mean field training perspective

We prove that the gradient descent training of a two-layer neural network on empirical or population risk may not decrease population risk at an order faster than $t^{-4/(d-2)}$ under mean field scaling. Thus gradient descent training for fitting reasonably smooth, but truly high-dimensional data may be subject to the curse of dimensionality. We present numerical evidence that gradient descent training with general Lipschitz target functions becomes slower and slower as the dimension increases, but converges at approximately the same rate in all dimensions when the target function lies in the natural function space for two-layer ReLU networks.


## 1 Introduction

Since Barron’s seminal article barron1993universal , artificial neural networks have been celebrated as a tool to beat the curse of dimensionality. Barron proved that two-layer neural networks with $m$ neurons and suitable non-linear activation can approximate a large (infinite-dimensional) class of functions $X$ to within an error of order $m^{-1/2}$ in $L^2(P)$, independently of the dimension $d$, while any sequence of linear function spaces $V_m$ with $\dim V_m \le m$ suffers from the curse of dimensionality if the data distribution is truly high-dimensional. More specifically,

$$\sup_{\|\phi\|_X\le 1}\,\inf_{\psi\in V_m}\|\phi-\psi\|_{L^2(P)}\;\ge\;\frac{c}{d}\,m^{-1/d}$$

for a universal constant $c>0$ if $P$ is the uniform measure on $[0,1]^d$, $X$ is the same function class that is approximated well by neural networks with $m$ parameters, and $\|\cdot\|_X$ denotes its natural norm. Thus, from the perspective of approximation theory, neural networks leave linear approximation in the dust in high dimensions.

The perspective of approximation theory only establishes the existence of neural networks which approximate a given target function well in some sense, while in applications it is important to find optimal (or at least reasonably good) parameter values for the network. The most common approach is to initialize the parameters randomly and optimize them by a gradient descent based method. We focus on the case where the goal is to approximate a target function $f^*$ in $L^2(P)$ for some Radon probability measure $P$ on $[0,1]^d$. To optimize the parameters $\Theta=(a_i,w_i,b_i)_{i=1}^m$ of the two-layer network

$$f_\Theta(x)=\sum_{i=1}^m a_i\,\sigma(w_i^Tx+b_i),$$

we therefore let $\Theta$ evolve by the gradient flow of the risk functional

$$R(\Theta)=\frac12\int_{[0,1]^d}\bigl(f_\Theta(x)-f^*(x)\bigr)^2\,P(dx).$$
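In coordinates, the network and an empirical version of this risk functional take only a few lines of numpy. The width, dimension, and the stand-in Lipschitz target below are arbitrary illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def f_theta(x, a, W, b):
    """Two-layer network sum_i a_i * sigma(w_i^T x + b_i) with ReLU activation.

    x: (n, d) batch of inputs, a: (m,) outer weights, W: (m, d), b: (m,).
    """
    return np.maximum(x @ W.T + b, 0.0) @ a

def empirical_risk(x, y, a, W, b):
    """R_n(Theta) = 1/(2n) * sum_j (f_Theta(x_j) - f*(x_j))^2."""
    residual = f_theta(x, a, W, b) - y
    return 0.5 * np.mean(residual ** 2)

rng = np.random.default_rng(0)
m, d, n = 64, 10, 256                    # width / dimension / sample count (arbitrary)
a = rng.normal(size=m)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
x = rng.uniform(0.0, 1.0, size=(n, d))   # samples from P = Unif([0,1]^d)
y = np.linalg.norm(x - 0.5, axis=1)      # a stand-in Lipschitz target f*
risk = empirical_risk(x, y, a, W, b)
```

Replacing the integral over $P$ by the average over samples yields exactly the empirical risk $R_n$ discussed next.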

In practice, we only have access to data sampled from an unknown underlying distribution $P$. The approximation therefore takes place in $L^2(P_n)$ instead of $L^2(P)$, where $P_n$ is the empirical measure of the data samples. If all data points are sampled independently, the empirical measures $P_n$ converge to the underlying distribution $P$. In this article, we focus on estimates which are uniform in the number of data samples, and on population risk.

While the optimization problem is non-convex, gradient flow based optimization works astonishingly well in applications. The mechanism behind this is not fully understood. In certain scaling regimes in the number of parameters $m$ and the number of data points $n$, the empirical risk has been shown to decay exponentially (with high probability over the initialization), even when the target function values are chosen randomly in a bounded interval du2018gradient ; weinan2019comparative .

Networks which easily fit random data can be expected to have questionable generalization properties. In these regimes, the network parameters are chosen so large at initialization that reasonable control of the path norm, which governs the generalization error, is lost. The large initialization allows the network to fit any data sample with a minimal change in the parameters, behaving much like its linearization around the initial configuration (an infinitely wide random feature model), see weinan2019comparative . This approach explains how very wide two-layer networks behave, but it does not explain why neural networks are more powerful in applications than random feature models.

On the opposite side of the spectrum lies the mean field regime chizat2018global ; mei2018mean ; rotskoff2018neural ; sirignano2018mean . Under mean field scaling, a two-layer network with $m$ neurons and parameters $\Theta=(a_i,w_i,b_i)_{i=1}^m$ is given as

$$f_\Theta(x)=\frac1m\sum_{i=1}^m a_i\,\sigma(w_i^Tx+b_i)\qquad\text{rather than}\qquad f_\Theta(x)=\sum_{i=1}^m a_i\,\sigma(w_i^Tx+b_i).$$

Both concepts of neural network are equivalent from the perspective of approximation theory (statics), but behave entirely differently under gradient descent training (dynamics), see e.g. chizat2018note . In the mean field regime, parameters may move a significant distance from their initialization, making use of the adaptive choice of features in neural networks as compared to random feature models. This regime thus has greater potential to establish the superiority of artificial neural networks over kernel methods.
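The static equivalence and dynamic difference can be checked directly: rescaling the outer weights by $m$ converts one parameterization into the other, while the gradient with respect to each neuron picks up a factor $1/m$ under mean field scaling. A small numpy sketch (width, dimension, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 128, 5
a = rng.normal(size=m)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
x = rng.normal(size=d)

def standard(a, W, b, x):
    return np.maximum(W @ x + b, 0.0) @ a

def mean_field(a, W, b, x):
    return standard(a, W, b, x) / len(a)

# Statics: the rescaling a -> m*a identifies the two function classes.
same_function = np.isclose(mean_field(m * a, W, b, x), standard(a, W, b, x))

# Dynamics: d f_Theta / d w_1 differs by a factor 1/m, so under plain gradient
# descent each neuron of the mean field network moves on a slower time scale
# (hence the time-rescaled gradient flow considered below).
active = float(W[0] @ x + b[0] > 0)
grad_standard = a[0] * active * x       # gradient of the standard network w.r.t. w_1
grad_mean_field = grad_standard / m     # gradient of the mean field network w.r.t. w_1
```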

Mean field gradient flows do not resemble their linearization at the initial condition. The convergence of gradient descent training to minimizers of the often highly non-convex loss functionals is therefore not obvious (and, for poorly chosen initial values, generally not true). Even if empirical and population risk decay to zero along the gradient flow, population risk may do so at very slow rates in high dimension.

###### Theorem 1.

Let $\sigma$ be a Lipschitz-continuous activation function. Consider population and empirical risk expressed by the functionals

$$R(\Theta)=\frac12\int_{[0,1]^d}\bigl(f_\Theta-f^*\bigr)^2(x)\,dx,\qquad R_n(\Theta)=\frac1{2n}\sum_{i=1}^n\bigl(f_\Theta-f^*\bigr)^2(x_i)$$

where $f^*$ is a Lipschitz-continuous target function and the points $x_1,\dots,x_n$ are iid samples from the uniform distribution on $[0,1]^d$. There exists $f^*$ with Lipschitz constant and $L^\infty$-norm bounded by $1$ such that parameters $\Theta_t$ evolving by the gradient flow of either $R_n$ or $R$ itself satisfy $\limsup_{t\to\infty}\bigl[t^\gamma R(\Theta_t)\bigr]=\infty$ for all $\gamma>\frac{4}{d-2}$.

Intuitively, this means that the lower bound $R(\Theta_t)\gtrsim t^{-4/(d-2)}$ is almost true. The result holds uniformly in the number of neurons, and even for infinitely wide networks. An infinitely wide mean field two-layer network (or Barron function) is a function

$$f_\pi(x)=\int_{\mathbb R^{d+2}}a\,\sigma(w^Tx+b)\,\pi(da\otimes dw\otimes db),$$

where $\pi$ is a suitable Radon probability measure on $\mathbb R^{d+2}$. Networks of finite width are included in this definition by setting $\pi=\frac1m\sum_{i=1}^m\delta_{(a_i,w_i,b_i)}$. It has been observed (see e.g. (chizat2018global, , Proposition B.1)) that the vectors $(a_i,w_i,b_i)$ move by the usual gradient flow of $R$ if and only if the associated measure evolves by the time-rescaled Wasserstein gradient flow of

$$R(\pi):=\frac12\int_{[0,1]^d}\bigl(f_\pi-f^*\bigr)^2(x)\,P(dx).$$

We show the following more general result which implies Theorem 1.

###### Theorem 2.

Let $\sigma$ be a Lipschitz-continuous activation function. Consider population and empirical risk expressed by the functionals

$$R(\pi)=\frac12\int_{[0,1]^d}\bigl(f_\pi-f^*\bigr)^2(x)\,dx,\qquad R_n(\pi)=\frac1{2n}\sum_{i=1}^n\bigl(f_\pi-f^*\bigr)^2(x_i)$$

where $f^*$ is a Lipschitz-continuous target function and the points $x_1,\dots,x_n$ are iid samples from the uniform distribution on $[0,1]^d$. There exists $f^*$ with Lipschitz constant and $L^\infty$-norm bounded by $1$ such that parameter measures $\pi_t$ evolving by the $2$-Wasserstein gradient flow of either $R$ or $R_n$ satisfy

$$\limsup_{t\to\infty}\bigl[t^\gamma R(\pi_t)\bigr]=\infty$$

for all $\gamma>\frac{4}{d-2}$.

Theorem 2 provides a more general perspective than Theorem 1. The Wasserstein gradient flow of $R$ is given by the continuity equation

$$\dot\pi_t=\operatorname{div}\bigl(\pi_t\,\nabla(\delta_\pi R)\bigr)\qquad\text{where}\quad(\delta_\pi R)(a,w,b)=\int_{[0,1]^d}\bigl(f_\pi-f^*\bigr)(x)\,a\,\sigma(w^Tx+b)\,P(dx)$$

is the variational gradient of the risk functional. In particular, any other discretization of this PDE experiences the same curse of dimensionality phenomenon. Besides gradient descent training, this also captures stochastic gradient descent with large batch size and small time steps (to leading order). Viewing machine learning through the lens of classical numerical analysis may illuminate the large data and many parameter regime, see E:2019aa .
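A forward-Euler particle discretization of this continuity equation is exactly gradient descent on the neurons, each moving with velocity $-\nabla_\theta(\delta_\pi R)(\theta_i)$. The toy sketch below uses ReLU activation, a hypothetical single-neuron target, and arbitrary width, step size, and data; it only illustrates that the discretized flow decreases an empirical risk:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 200, 2, 512
x = rng.uniform(0.0, 1.0, size=(n, d))
w_target = np.ones(d) / np.sqrt(d)               # hypothetical single-neuron target
y = np.maximum(x @ w_target, 0.0)

a = rng.normal(size=m)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)

def risk(a, W, b):
    pred = np.maximum(x @ W.T + b, 0.0) @ a / m  # mean field network
    return 0.5 * np.mean((pred - y) ** 2)

risk_start = risk(a, W, b)
lr = 0.05
for _ in range(1000):
    pre = x @ W.T + b                            # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    res = act @ a / m - y                        # (n,) residual f_pi - f*
    mask = (pre > 0).astype(float)               # ReLU derivative
    # Velocity of each particle: -grad_theta (delta_pi R)(a_i, w_i, b_i),
    # with the integral over P replaced by the empirical average over x.
    ga = res @ act / n
    gW = ((res[:, None] * mask) * a).T @ x / n
    gb = (res[:, None] * mask * a).sum(axis=0) / n
    a -= lr * ga
    W -= lr * gW
    b -= lr * gb
risk_end = risk(a, W, b)
```

By Lemma 3 below, moving particles with the velocity $-\nabla_\theta(\delta_\pi R)$ corresponds to the time-accelerated gradient flow $m\nabla_{\theta_i}R$.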

The article is structured as follows. In the remainder of the introduction, we discuss some previous works on related questions. In Section 2, we discuss Wasserstein gradient flows for mean-field two-layer neural networks and review a result from approximation theory. Next, we show in Section 3 that Wasserstein gradient flows for two-layer neural network training may experience a curse of dimensionality phenomenon. The analytical result is backed up by numerical evidence in Section 4. We conclude the article by discussing the significance of our result and related open problems in Section 5. In an appendix, we show that a similar phenomenon can be established when training an infinitely wide random feature model on a single neuron target function.

### 1.1 Previous Work

The study of mean field training for neural networks with a single hidden layer has been initiated independently in several works chizat2018global ; rotskoff2018neural ; sirignano2018mean ; mei2018mean . In chizat2018note , the authors compare mean field and classical training. chizat2018global ; Chizat:2020aa ; arbel2019maximum contain an analysis of whether gradient flows starting at a suitable initial condition converge to their global minimum. This analysis is extended to networks with ReLU activation in relutraining .

In hu2019mean , the authors consider a training algorithm where standard Gaussian noise is added to the parameter gradient of the risk functional. The evolution of network parameters is described by the Wasserstein gradient flow of an energy functional which combines the loss functional and an entropy regularization. In this case, the parameter distribution approaches the stationary measure of a Markov process as time approaches infinity, which is close to a minimizer of the mean field risk functional if noise is small. Note, however, that these results do not describe the small batch stochastic gradient descent algorithm used in practice, for which noise may be assumed to be Gaussian, but with a complicated parameter-dependent covariance structure hu2019diffusion ; li2015dynamics .

Some results in chizat2018global also apply to deeper structures with more than one hidden layer. However, the imposition of a linear structure implies that each neuron in the outer layer has its own set of parameters for the deeper layers. A mean field training theory for more realistic deep networks has been developed heuristically in nguyen2019mean and rigorously in araujo2019mean ; nguyen2020rigorous ; sirignano2019mean under the assumption that the parameters in different layers are initialized independently. The distribution of parameters remains a product measure for positive time, so that cross-interactions with infinitely many particles in the following layer (as the width approaches infinity) are replaced by ensemble averages. This ‘propagation of chaos’ is the key ingredient of the analysis.

In abbe2018provable , the author takes a different approach to establish limitations of neural network models in machine learning, see also shamir2018distribution ; raz2018fast . Our approach is different in that we allow networks of infinite width and infinite amounts of data.

## 2 Background

### 2.1 Why Wasserstein?

Let us quickly summarize the rationale behind studying Wasserstein gradient flows of risk functionals. This section only serves as a rough overview; see Chizat:2020aa for a more thorough introduction to Wasserstein gradient flows for machine learning and ambrosio2008gradient ; santambrogio2015optimal ; villani2008optimal for Wasserstein gradient flows and optimal transport in general.

Consider a general function class whose elements can be represented as normalized sums or, more generally, averages

$$f_{\{\theta_1,\dots,\theta_m\}}(x)=\frac1m\sum_{i=1}^m\phi(x,\theta_i),\qquad f_\pi(x)=\int_\Theta\phi(\theta,x)\,\pi(d\theta)$$

of functions in a parameterized family $\{\phi(\cdot,\theta):\theta\in\Theta\}$. In the case of two-layer networks, $\theta=(a,w,b)$ and $\phi(x,\theta)=a\,\sigma(w^Tx+b)$. If the activation function $\sigma$ is Lipschitz-continuous, then $|\phi(x,\theta)|\le C_x\bigl(1+|a|^2+|w|^2+|b|^2\bigr)$ for all $\theta$. Thus $f_\pi$ is well-defined if $\pi$ has finite second moments, i.e. lies in the Wasserstein space $\mathcal P_2$. We consider the risk functional

$$R(\pi)=\frac12\int_{\mathbb R^d}\bigl(f_\pi-f^*\bigr)^2(x)\,P(dx)$$

for some data distribution $P$ on $\mathbb R^d$. Note that $\inf_\pi R(\pi)=0$ if the support of $P$ is compact and the class has the uniform approximation property on compact sets (by which we mean that the class is dense in $C^0(K)$ for every compact set $K$). This is the case for two-layer networks with non-polynomial activation functions – see e.g. cybenko1989approximation ; hornik1991approximation for continuous sigmoidal activation functions. The same result holds for ReLU activation since $x\mapsto\sigma(x+1)-\sigma(x)$ is sigmoidal.

###### Lemma 3.

(chizat2018global, , Proposition B.1) The parameters $\theta_i$ evolve by the time-accelerated gradient flow

$$\frac{d}{dt}\theta_i(t)=-m\,\nabla_{\theta_i}R(\Theta_t)=-\int\bigl(f_{\Theta_t}-f^*\bigr)(x)\,\nabla_\theta\phi(\theta_i,x)\,P(dx)$$

of $R$ if and only if their distribution $\pi_t=\frac1m\sum_{i=1}^m\delta_{\theta_i(t)}$ evolves by the Wasserstein gradient flow

$$\dot\pi_t=\operatorname{div}\Bigl(\pi_t\,\nabla_\theta\tfrac{\delta R}{\delta\pi}(\pi_t;\cdot)\Bigr)\qquad\text{where}\quad\tfrac{\delta R}{\delta\pi}(\pi;\theta)=\int\bigl(f_\pi-f^*\bigr)(x)\,\phi(\theta,x)\,P(dx).$$

The continuity equation describing the gradient flow is understood in the sense of distributions. By the equivalence in Lemma 3, all results below apply to networks with finitely many neurons as well as infinitely wide mean field networks. In this article, we do not concern ourselves with existence for the gradient flow equations. More details can be found in chizat2018global for general activation functions with a higher degree of smoothness and in relutraining for ReLU activation.

### 2.2 Growth of Second Moments

Denote the second moment of $\pi$ by $N(\pi)=\int|\theta|^2\,\pi(d\theta)$. A direct calculation establishes that $\bigl|\frac{d}{dt}\sqrt{N(\pi_t)}\bigr|\le\sqrt{-\frac{d}{dt}R(\pi_t)}$ along the Wasserstein gradient flow, which implies the following.

###### Lemma 4.

(relutraining, , Lemma 3.3) If $\pi_t$ evolves by the Wasserstein gradient flow of $R$, then $\sqrt{N(\pi_t)}\le\sqrt{N(\pi_0)}+\sqrt{t\,R(\pi_0)}$; in particular, the second moments grow at most linearly in time.

###### Remark 5.

If $R$ is a priori known to decrease at a specific rate, a stronger result holds. Under the fairly restrictive assumption that $R(\pi_t)-R(\pi_{t+1})\le C\,t^{-(1+\alpha)}$, the estimate

$$\sqrt{N(\pi_t)}\;\le\;\begin{cases}C\bigl(1+t^{\frac{1-\alpha}{2}}\bigr)&\alpha<1\\ C\log(t+2)&\alpha=1\\ C&\alpha>1\end{cases}$$

holds. In particular, if $R(\pi_t)$ decays like $t^{-1}$ as in the convex case, the most natural decay assumption on the difference quotients is $R(\pi_t)-R(\pi_{t+1})\le C\,t^{-2}$, which corresponds to $\alpha=1$. Thus, in this case we expect the second moments of $\pi_t$ to blow up at most logarithmically, which agrees with the results of berlyand2018convergence .
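A sketch of the summation argument behind this remark, assuming Lemma 4 in the differential form $\frac{d}{dt}\sqrt{N(\pi_t)}\le\sqrt{-\frac{d}{dt}R(\pi_t)}$ (our reading of the direct calculation above):

```latex
% Integrate over [t, t+1] and apply Cauchy-Schwarz in time:
\sqrt{N(\pi_{t+1})} - \sqrt{N(\pi_t)}
  \le \int_t^{t+1} \sqrt{-\tfrac{d}{ds}R(\pi_s)}\,ds
  \le \Bigl(\int_t^{t+1} -\tfrac{d}{ds}R(\pi_s)\,ds\Bigr)^{1/2}
  = \sqrt{R(\pi_t) - R(\pi_{t+1})}.
% Under the assumption R(pi_t) - R(pi_{t+1}) <= C t^{-(1+alpha)}, summing gives
\sqrt{N(\pi_T)} \le \sqrt{N(\pi_0)} + \sqrt{C}\,\sum_{t=1}^{T-1} t^{-\frac{1+\alpha}{2}},
% and the partial sums are O(T^{(1-alpha)/2}) for alpha < 1, O(log T) for
% alpha = 1, and O(1) for alpha > 1, matching the case distinction above.
```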

### 2.3 Slow Approximation Results in High Dimension

In this section, we recall a result from high-dimensional approximation theory. An infinitely wide two-layer network is a function

$$f_\pi(x)=\int_{\mathbb R^{d+2}}a\,\sigma(w^Tx+b)\,\pi(da\otimes dw\otimes db).$$

The choice of the parameter distribution $\pi$ for a given function $f_\pi$ is non-unique, since $f_\pi\equiv0$ for all measures $\pi$ which are invariant under the coordinate reflection $a\mapsto-a$. For ReLU activation, further non-uniqueness stems from the fact that

$$0=x+1-x-1=\sigma(x+1)-\sigma\bigl(-(x+1)\bigr)-\sigma(x)+\sigma(-x)-\sigma(1).$$

The path-norm or Barron norm of a function measures the amount of distortion applied to an input along any path which information takes through the network. Due to the non-uniqueness, it is defined as an infimum over all representing measures,

$$\|f\|_{\mathcal B}=\inf\Bigl\{\int|a|\bigl[|w|_{\ell^1}+|b|\bigr]\,\pi(da\otimes dw\otimes db)\;\Bigm|\;f=f_\pi\Bigr\}.$$

The equality $f=f_\pi$ is understood in the $P$-almost everywhere sense for the data distribution $P$. A more thorough introduction can be found in weinan2019lei ; E:2018ab or bach2017breaking , where a special instance of the same space is referred to as $\mathcal F_1$. Every ReLU Barron function is Lipschitz-continuous with Lipschitz constant at most its Barron norm. In high dimensions, the opposite is far from true.

###### Theorem 6.

(approximationarticle, , Corollary 3.4) Let $d\ge3$. There exists $\phi:[0,1]^d\to\mathbb R$ such that

$$|\phi(x)-\phi(y)|\le|x-y|\quad\forall\,x,y\in[0,1]^d\qquad\text{and}\qquad\limsup_{t\to\infty}\Bigl[t^\gamma\inf_{\|\psi\|_{\mathcal B}\le t}\|\phi-\psi\|_{L^2([0,1]^d)}\Bigr]=\infty$$

for all $\gamma>\frac{2}{d-2}$.

This means that

$$\inf_{\|\psi\|_{\mathcal B}\le t_k}\|\phi-\psi\|_{L^2([0,1]^d)}\;\ge\;t_k^{-\gamma}\qquad\text{for all }\gamma>\frac{2}{d-2}$$

and a sequence of scales $t_k\to\infty$, i.e. in high dimension there are Lipschitz functions which are poorly approximated by Barron functions of low norm. The proof of Theorem 6 is built on the observation that Monte-Carlo integration converges uniformly on Lipschitz functions and on Barron functions at very different rates, suggesting a scale separation.
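The Monte-Carlo rate underlying this scale separation is easy to observe numerically: the root-mean-square error of the empirical average of a fixed function decays like $n^{-1/2}$, independently of $d$. The experiment below (dimension, sample sizes, and integrand are arbitrary choices) compares two sample sizes differing by a factor of 100, so the error should shrink by roughly a factor of 10:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 20
f = lambda x: np.abs(x - 0.5).sum(axis=-1)   # a Lipschitz integrand on [0,1]^d
exact = d * 0.25                             # integral of |x - 1/2| over [0,1] is 1/4

def mc_rmse(n, trials=200):
    """Root-mean-square error of the n-sample Monte-Carlo estimate."""
    errs = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(0.0, 1.0, size=(n, d))
        errs[t] = f(x).mean() - exact
    return np.sqrt(np.mean(errs ** 2))

rmse_small = mc_rmse(100)
rmse_large = mc_rmse(10000)
ratio = rmse_small / rmse_large              # expected to be close to 10
```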

## 3 A Dynamic Curse of Dimensionality

###### Proof of Theorem 2.

The path-norm of a two-layer neural network is

$$\|f\|_{\mathcal B}=\inf\Bigl\{\int|a|\bigl[|w|_{\ell^1}+|b|\bigr]\,\pi(da\otimes dw\otimes db)\;\Bigm|\;\pi\in\mathcal P_2\ \text{s.t.}\ f=f_\pi\Bigr\}\;\le\;c_d\int|a|^2+|w|_{\ell^2}^2+|b|^2\,\bar\pi(da\otimes dw\otimes db)=c_d\,N(\bar\pi)$$

for any $\bar\pi\in\mathcal P_2$ such that $f=f_{\bar\pi}$. The dimension-dependent constant $c_d$ arises as we apply Young’s inequality and invoke the equivalence of the Euclidean norm and the $\ell^1$-norm on $\mathbb R^d$. The result now follows from Lemma 4 and Theorem 6. ∎
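The two inequalities behind the constant $c_d$, spelled out (a routine estimate; the constant is not optimized):

```latex
% Young's inequality with s = |w|_{l^1} + |b|, together with (s+t)^2 <= 2s^2 + 2t^2:
|a|\,\bigl(|w|_{\ell^1} + |b|\bigr) \le \tfrac12 |a|^2 + \tfrac12 \bigl(|w|_{\ell^1} + |b|\bigr)^2
  \le \tfrac12 |a|^2 + |w|_{\ell^1}^2 + |b|^2,
% and the norm equivalence |w|_{l^1} <= sqrt(d) |w|_{l^2} on R^d yields
|a|\,\bigl(|w|_{\ell^1} + |b|\bigr) \le \tfrac12 |a|^2 + d\,|w|_{\ell^2}^2 + |b|^2
  \le c_d\,\bigl(|a|^2 + |w|_{\ell^2}^2 + |b|^2\bigr), \qquad c_d = d.
```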

###### Remark 7.

The result can be improved under additional assumptions. As in Remark 5, assume that the difference quotients of the risk satisfy $R(\pi_t)-R(\pi_{t+1})\le C\,t^{-(1+\alpha)}$ for some $\alpha\ge1$. Then $N(\pi_t)$ grows noticeably slower than linearly (at most logarithmically if $\alpha=1$). If $\gamma$ is such that $R(\pi_t)\le C\,t^{-\gamma}$, the improved bound on the Barron norm of $f_{\pi_t}$ combined with Theorem 6 then forces an even smaller admissible $\gamma$, so fast decay of the difference quotients further strengthens the curse of dimensionality.

## 4 Numerical Results

For parameters $\Theta=(a_i,w_i,b_i)_{i=1}^m$, we consider the associated two-layer network with ReLU activation

$$f_\Theta(x)=\frac1m\sum_{i=1}^m a_i\,\sigma(w_i^Tx+b_i)=\frac1m\sum_{i=1}^m a_i\,\bigl(w_i^Tx+b_i\bigr)_+.$$

As risk functional we choose

$$R(\Theta)=\frac12\,\fint_{[-1,1]^d}\bigl|f_\Theta(x)-f^*(x)\bigr|^2\,dx=2^{-(d+1)}\int_{[-1,1]^d}\bigl|f_\Theta(x)-f^*(x)\bigr|^2\,dx.$$

The target function in our simulations is either

$$f^*(x)=\sqrt{\tfrac32}\,\bigl[\|x-a\|_{\ell^2}-\|x+a\|_{\ell^2}\bigr],\qquad a_i=\frac{2i}{d}-1$$

as an example of a Barron function (which can be represented with weights $w$ distributed uniformly on a sphere, but not with finitely many neurons) or

$$f^*(x)=\sqrt{\tfrac{d}{\pi}}\,\Bigl[\max_{1\le i\le d}(x_i-a_i)-\max_{1\le i\le d}(-x_i-a_i)\Bigr]$$

as an example of a Lipschitz continuous, non-Barron target function. In both cases, we have

$$\int_{[-1,1]^d}f^*(x)\,dx=0,\qquad\|f^*\|_{L^2([-1,1]^d)}\approx1,\qquad[f^*]_{\mathrm{Lip}}\le\sqrt{6d}.$$
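Both target functions are antisymmetric, $f^*(-x)=-f^*(x)$, which is why their mean over the symmetric cube vanishes. A quick numerical check; note that the prefactors and the offset $a_i=2i/d-1$ follow our reading of the formulas above:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 30
offset = 2.0 * np.arange(1, d + 1) / d - 1.0     # a_i = 2i/d - 1 (our reading)

def f_barron(x):
    """sqrt(3/2) * (||x - a|| - ||x + a||), the Barron target."""
    return np.sqrt(1.5) * (np.linalg.norm(x - offset, axis=-1)
                           - np.linalg.norm(x + offset, axis=-1))

def f_lipschitz(x):
    """sqrt(d/pi) * (max_i(x_i - a_i) - max_i(-x_i - a_i)), Lipschitz, non-Barron."""
    return np.sqrt(d / np.pi) * ((x - offset).max(axis=-1)
                                 - (-x - offset).max(axis=-1))

x = rng.uniform(-1.0, 1.0, size=(20000, d))
antisym_barron = np.allclose(f_barron(-x), -f_barron(x))
antisym_lip = np.allclose(f_lipschitz(-x), -f_lipschitz(x))
mean_barron = f_barron(x).mean()                 # Monte-Carlo estimate of the zero mean
mean_lip = f_lipschitz(x).mean()
```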

For a proof that $f^*$ is not a Barron function in the second case, see barron_new . In the first case, the Barron norm of $f^*$ also scales like $\sqrt d$. The offset $a$ from the origin is used to avoid spurious effects, since the initial parameter distribution is symmetric around the origin. In simulations, we considered moderately wide networks. The parameters were initialized iid according to centered Gaussians for $a_i$ and $w_i$, and as constants for $b_i$. They were optimized by (non-stochastic) gradient descent for an empirical risk functional

$$R_n(\Theta)=\frac1{2n}\sum_{j=1}^n\bigl(f_\Theta(x_j)-f^*(x_j)\bigr)^2$$

with independent samples $x_j$. Population risk was approximated by an empirical risk functional evaluated on a larger set of independent samples. On the data samples, the mean and variance of the target functions were estimated to be close to $0$ and $1$ respectively for all simulations.

In Figure 1 we see that both empirical and population risk decay very similarly for Barron target functions in any dimension, while the decay of risk becomes significantly slower in high dimension for target functions which are not Barron. The empirical decay rate

$$\gamma(t):=-\frac{\log R(\Theta_t)}{\log t}\qquad\bigl(\text{which satisfies }R(\Theta_t)=t^{-\gamma(t)}\bigr)$$

becomes smaller for fixed positive time and non-Barron target functions as $d\to\infty$, see Figure 3.

Training appears to proceed in two regimes for Barron target functions: an initial phase in which both the Barron norm and the risk change rapidly, and a longer phase in which the risk decays gradually and the Barron norm remains roughly constant. In the initial ‘radial’ phase, the vector $(a_i,w_i,b_i)$ is subject to a strong radial force driving the parameters towards or away from the origin exponentially fast. Since ReLU is positively $1$-homogeneous, we observe that

$$\frac{(a_i,w_i,b_i)}{\|(a_i,w_i,b_i)\|}\cdot\nabla_{(a_i,w_i,b_i)}R(\Theta)=\fint_{[-1,1]^d}\bigl(f_\Theta-f^*\bigr)(x)\;\frac{(a_i,w_i,b_i)}{\|(a_i,w_i,b_i)\|}\cdot\nabla_{(a_i,w_i,b_i)}f_\Theta(x)\,dx=\frac1m\fint_{[-1,1]^d}\bigl(f_\Theta-f^*\bigr)(x)\,\frac{2\,a_i\,\sigma(w_i^Tx+b_i)}{\|(a_i,w_i,b_i)\|}\,dx$$

with a positively one-homogeneous right-hand side. Thus, while $f_\Theta$ is close to its initialization ($f_{\Theta_0}\approx0$ due to the symmetry in the initial distribution of $a$), the vector $(a_i,w_i,b_i)$ moves towards or away from the origin at an exponential rate, depending on the alignment of $a_i\,\sigma(w_i^Tx+b_i)$ with $f^*$. The exponential growth ceases as $f_\Theta$ becomes sufficiently close to $f^*$ (in the $L^2$-weak topology).

After the initial strengthening of neurons which are generally aligned with the target function, we reach a more stable state. In the following ‘angular’ phase, the Barron norm remains roughly constant and directional adjustments to the parameters dominate over radial adjustments. Using Figures 1 and 2, we can easily spot the transition between the two training regimes.

The gap between empirical risk and population risk increases in high dimensions. When training the same networks on the same problems for empirical risk with only 4,000 data points, the results are very similar in dimension 30, but for a non-Barron target function in dimension 250 the empirical risk decays very quickly while the population risk increases rather than decreases. This is to be expected since a) the Wasserstein distance between the Lebesgue measure and the empirical measure increases with the dimension and b) the number of trainable parameters increases with $d$, making it easier to fit point values. For Barron target functions, the risk decays at an approximately algebraic rate, faster in higher dimension. This is faster than expected for generic convex target functions.

## 5 Discussion

In this article, we have shown that in the mean field regime, training a two-layer neural network on empirical or population risk may not decrease population risk faster than $t^{-\gamma}$ for $\gamma>\frac{4}{d-2}$ when the data distribution is truly $d$-dimensional, we consider $L^2$-loss, and the target function is merely Lipschitz-continuous, but not in Barron space. The key ingredients of the result are the slow growth of path norms during gradient flow training and the observation that in high dimension there exist Lipschitz functions which are badly approximated by Barron functions of low norm.

It is straightforward to extend the main result to general least-squares minimization. All statements remain true if instead of ‘risk decays to zero’ we substitute ‘risk decays to the minimum Bayes risk’.

### 5.1 Interpretation

The curse of dimensionality phenomenon occurs when the target function is not in Barron space, i.e. when the risk functional does not have a minimizer. In this situation, even gradient flows of smooth convex functions in one dimension may be slow. The gradient flow ODE

$$\begin{cases}\dot x_\alpha(t)=-F_\alpha'(x_\alpha(t))&t>0\\ x_\alpha(0)=1&\end{cases}\qquad\text{of}\quad F_\alpha:(0,\infty)\to\mathbb R,\quad F_\alpha(x)=x^{-\alpha}$$

is solved by $x_\alpha(t)=\bigl(1+\alpha(\alpha+2)\,t\bigr)^{\frac1{\alpha+2}}$. The energy decays as $F_\alpha\bigl(x_\alpha(t)\bigr)=\bigl(1+\alpha(\alpha+2)\,t\bigr)^{-\frac{\alpha}{\alpha+2}}$. If $\alpha$ is small, the energy decay is extremely slow. Thus, it should be expected that curse of dimensionality phenomena can occur whenever the risk functional does not have a minimizer in the function space associated with the neural network model under consideration. The numerical evidence of Section 4 suggests that the slow decay phenomenon is visible also in empirical risk if the training sample is large enough (depending on the dimension).
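The closed form follows by separation of variables: $\frac{d}{dt}x^{\alpha+2}=(\alpha+2)x^{\alpha+1}\dot x=\alpha(\alpha+2)$, so $x_\alpha(t)=(1+\alpha(\alpha+2)t)^{1/(\alpha+2)}$ with energy $(1+\alpha(\alpha+2)t)^{-\alpha/(\alpha+2)}$. A quick numerical sanity check of this solution (the value of $\alpha$ is arbitrary):

```python
import numpy as np

alpha = 0.5
c = alpha * (alpha + 2.0)

def x(t):
    """Closed-form solution of x' = alpha * x^(-alpha - 1), x(0) = 1."""
    return (1.0 + c * t) ** (1.0 / (alpha + 2.0))

def F(z):
    return z ** (-alpha)

ts = np.linspace(0.0, 10.0, 201)
h = 1e-5
# Central difference of x against the right-hand side -F'(x) = alpha * x^(-alpha-1).
lhs = (x(ts + h) - x(ts - h)) / (2.0 * h)
rhs = alpha * x(ts) ** (-alpha - 1.0)
max_residual = np.max(np.abs(lhs - rhs))

energy = F(x(ts))   # equals (1 + c*t)^(-alpha/(alpha+2)): slow algebraic decay
```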

### 5.2 Implications for Machine Learning Theory

Understanding the function spaces associated with neural network architectures is of great practical importance. When a minimization problem does not admit a solution in a given function space, gradient descent training may be very slow in high dimension. Unlike for the function spaces typically used in low-dimensional problems of elasticity theory, fluid mechanics etc., no comprehensive theory of Banach spaces of neural networks is available except in very special cases E:2019aa ; weinan2019lei . In the light of our result, a convergence proof for mean field gradient descent training of two-layer neural networks must satisfy one of two criteria: it must assume the existence of a minimizer, or it must allow for slow convergence rates in high dimension.

## Appendix A Random Features and Shallow Neural Networks

Lemma 4 applies to general models with an underlying linear structure, in particular random feature models. Both two-layer neural networks and random feature models have the form $f_\pi(x)=\int a\,\sigma(w^Tx+b)\,\pi(da\otimes dw\otimes db)$, but in random feature models the inner parameters $(w,b)$ are fixed at the (random) initialization. An infinitely wide random feature model is described by

$$f(x)=\int_{\mathbb R^{d+1}}a(w,b)\,\sigma(w^Tx+b)\,\pi^0(dw\otimes db)$$

where $\pi^0$ is a fixed distribution (usually spherical or standard Gaussian), while an infinitely wide two-layer neural network is described by a measure $\pi$ on the whole parameter space $\mathbb R^{d+2}$, as above. (approximationarticle, , Example 4.3) establishes a Kolmogorov-width type separation between random feature models and two-layer neural networks of a similar form as the separation between two-layer neural networks and Lipschitz functions. Thus a curse of dimensionality also affects the training of infinitely wide random feature models when the target function is a generic Barron function. If $\pi^0$ is a smooth omni-directional distribution and the target function is a single neuron activation $x\mapsto\sigma\bigl((w^*)^Tx\bigr)$, then $a(w,b)$ must concentrate a large amount of mass near $w^*$, forcing its norm to blow up. In higher dimension, the blow-up is more pronounced since small balls on the sphere around $w^*$ have faster decaying volume.
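A random feature model in the form above can be sketched by freezing $(w_i,b_i)$ at initialization and training only the outer coefficients. The single-neuron target, width, and step size below are illustrative choices, not the paper's experimental configuration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n = 10, 500, 1000
x = rng.uniform(-1.0, 1.0, size=(n, d))
w_star = np.ones(d) / np.sqrt(d)          # hypothetical single-neuron target direction
y = np.maximum(x @ w_star, 0.0)           # f*(x) = sigma(w*^T x)

# Random features: (w_i, b_i) sampled once from pi^0 and frozen.
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
features = np.maximum(x @ W.T + b, 0.0)   # fixed for the entire training run

a = np.zeros(m)                           # only a(w, b) is trainable

def risk(a):
    return 0.5 * np.mean((features @ a / m - y) ** 2)

risk_start = risk(a)
lr = 0.2
for _ in range(1000):
    res = features @ a / m - y
    # Time-rescaled (mean field) gradient step on the outer layer only.
    a -= lr * (features.T @ res / n)
risk_end = risk(a)
```

Because the features are frozen, this is linear least-squares in `a`; the curse of dimensionality discussed above appears in how large the trained coefficients must become, not in any non-convexity.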

We train a two-layer neural network and a random feature model with gradient descent to approximate a single neuron activation. Both models have the same width. Empirical risk is calculated using a fixed set of independent data samples, and population risk is approximated using a larger independent sample. Both networks are initialized according to a Gaussian distribution as above.