# On the convergence of gradient descent for two layer neural networks

It has been shown that gradient descent can achieve zero training loss in the over-parametrized regime (the width of the neural network is much larger than the number of data points). In this work, combining the ideas of some existing works, we investigate the gradient descent method for training two-layer neural networks to approximate target continuous functions. By making use of the generic chaining technique from probability theory, we show that gradient descent can yield an exponential convergence rate, while the required width of the neural networks is independent of the size of the training data. The result also implies some strong approximation ability of two-layer neural networks without the curse of dimensionality.

## 1 Introduction

The universal approximation theorem tells us that a two-layer neural network can approximate a broad class of functions, provided that the width is sufficiently large, i.e., in the over-parametrized regime [1, 2]. Recently, it has been shown that gradient descent can find the optimal parameters for wide two-layer neural networks, using optimal transport theory [3, 4, 5] and using direct estimates on particular neural network structures [6, 7, 8]. All these results are in the over-parametrized regime. We are interested in whether the loss function can converge to zero in the under-parametrized regime, i.e., when the number of data points is larger than the width of the two-layer neural network.

Following [4, 6], we consider the loss function given by the quadratic function

$$\ell(x,w)=\frac{1}{2}\int_D |f^{(n)}(x,w)-f(x)|^2\,d\nu(x), \tag{1.1}$$

where $D\subset\mathbb{R}^d$ is a compact set, $d$ is the dimension, and $\nu$ is some probability measure on $D$. We aim to show that, under a suitable setup, the convergence is exponentially fast with high probability, and that the required width of the neural network is independent of the training data when a suitable approximation $f^{(n)}$ is used. Note that this does not contradict the result of [9], which shows that under-parametrized networks cannot approximate the function well. The reason is that the loss function here only measures the approximation at the given data points, not the approximation of the whole function.

Here, we mention the two main works that motivate this work. In [4], approximations of the following form, inspired by the weak approximation of measures, were considered:

$$\tilde f^{(n)}(x,t):=\frac{1}{n}\sum_{i=1}^n a_i(t)\varphi(x,w_i(t)), \tag{1.2}$$

where $\varphi$ represents any unit function. For example, it could be a Gaussian, a one-layer neural network or a deep neural network. They were able to study the dynamics as an interacting particle system, so that the decay of the loss function can be understood as a gradient flow in Wasserstein-2 space using optimal transport theory. However, the decay can only be shown in the limiting regime $n\to\infty$, and the decay rate is unknown. The mechanism is that the parameters converge to some minimization regions.

In the work [6] by S. Du et al., two-layer neural networks were studied, i.e.,

$$\varphi(x,w)=\sigma(w\cdot x),\quad w\in\mathbb{R}^d.$$

Here, $\sigma$ is the so-called activation function. In their work, the following form of approximation is used:

$$f^{(n)}(x,t):=\frac{1}{\sqrt{n}}\sum_{i=1}^n a_i(t)\varphi(x,w_i(t))=\frac{1}{\sqrt{n}}\sum_{i=1}^n a_i(t)\sigma(w_i(t)\cdot x). \tag{1.3}$$

Though this seems to be a small change, the dynamics change significantly compared with (1.2). They are actually able to show that the weights $w_i$'s will not converge under the gradient descent method: they stay close to the initialization. They claim that the loss function converges to zero exponentially in the over-parametrized regime. They obtain this result by considering the dynamics of the prediction. In [6], the convergence is largely due to their Assumption 3.1 and Theorem 3.1. The analysis of their Gram matrix needs to exclude the case in which two data vectors are nearly parallel, which is not desirable when the number of training data $M$ is large. We note that if we rely on the dynamics of the $a_i$'s, the matrix can be positive definite simply due to the positivity of $\sigma$, and this is the strategy we adopt here.
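To make the difference between the two scalings concrete, the following is a minimal NumPy sketch of the approximations (1.2) and (1.3) with $\varphi(x,w)=\sigma(w\cdot x)$. The sigmoid activation, the $\pm 1$ initialization of the $a_i$'s and the helper name `two_layer` are illustrative assumptions, not choices confirmed by the text.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation: bounded, positive, with bounded derivative.
    return 1.0 / (1.0 + np.exp(-z))

def two_layer(x, a, w, scaling="sqrt"):
    """Evaluate sum_i a_i * sigma(w_i . x) with 1/sqrt(n) scaling (as in (1.3))
    or 1/n scaling (as in (1.2)).

    x: (m, d) inputs, a: (n,) outer weights, w: (n, d) inner weights.
    Returns an array of shape (m,).
    """
    n = a.shape[0]
    features = sigmoid(x @ w.T)                      # shape (m, n)
    norm = np.sqrt(n) if scaling == "sqrt" else float(n)
    return features @ a / norm

rng = np.random.default_rng(0)
d, n, m = 3, 1000, 5
x = rng.uniform(-1.0, 1.0, size=(m, d))
w = rng.standard_normal((n, d))                      # w_i(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=n)                  # hypothetical choice of a_i(0)

f_sqrt = two_layer(x, a, w, scaling="sqrt")          # form (1.3)
f_mean = two_layer(x, a, w, scaling="mean")          # form (1.2)
print(f_sqrt)
print(f_mean)
```

For the same parameters the two outputs differ exactly by the factor $\sqrt{n}$, which is what changes the time scale of the parameter dynamics.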

Below, we aim to combine and improve the ideas from these two works to show a convergence rate of gradient descent that is independent of the sample size $M$. We use the approximation from [6], i.e. (1.3), to study the dynamics of the training loss directly. We make use of the key observation from [6] that the weights are close to their initial values, but instead rely on the dynamics of the $a_i$'s to obtain the positive definiteness of the Gram matrix. Another important technique is the generic chaining [10] from probability theory, which guarantees that the Gram matrix is close to some positive definite matrix uniformly over the data. In section 2, we set up the mathematical formulation and state the main results. The proof is then performed in section 3. Lastly, we make some discussion in section 4.

## 2 Mathematical setup and the main result

We focus on the following special setup, and consider two layer neural networks as in (1.3). The generalization to other cases will be done soon in subsequent works.

Regarding the data set $D$ and the measure $\nu$ in (1.1), we assume the following:

###### Assumption 1.

The domain is a compact set such that . The measure is a probability measure supported on a countable subset of .

Below, we will denote the support of $\nu$ by $T$:

$$T:=\operatorname{supp}\nu. \tag{2.1}$$

One common example of $\nu$ is the empirical measure

$$\nu:=\frac{1}{M}\sum_{p=1}^M \delta(\cdot-x_p), \tag{2.2}$$

where the $x_p$'s are the training data sampled i.i.d. from some underlying distribution. Of course, in our discussion in this work, the $x_p$'s are fixed.

Moreover, we assume the initial conditions as follows.

###### Assumption 2.

For the weight , we assume

$$w_i(0)\sim N(0,I_d),\quad \text{i.i.d.},$$

and ’s are i.i.d such that .

This condition is important because it guarantees that the approximation (1.3) is bounded with high probability at $t=0$, which closes the estimates later (Lemma 4). The $w_i(0)$'s are assumed to be Gaussian only for technical convenience. The argument can be extended to other distributions with finite second moments.

Moreover, we assume

###### Assumption 3.

The activation function satisfies , and . Moreover, , and thus

$$\|\sigma\|_\infty+\|\sigma'\|_\infty<\infty.$$

The sigmoid functions should satisfy this condition. The condition may be too strong, but we impose it only for simplicity. The generalization to other functions should be doable.

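Under gradient descent with the loss function (1.1), the parameters $w_i$'s satisfy

$$\dot w_i(t)=-\nabla_{w_i}\ell=-\frac{1}{\sqrt{n}}\int_D (f^{(n)}(x)-f(x))\,a_i(t)\,\sigma'(w_i\cdot x)\,x\,d\nu(x), \tag{2.3}$$

where the dot denotes the time derivative. Similarly, the parameters $a_i$'s satisfy

$$\dot a_i(t)=-\frac{\partial \ell}{\partial a_i}=-\frac{1}{\sqrt{n}}\int_D (f^{(n)}(x)-f(x))\,\sigma(w_i\cdot x)\,d\nu(x). \tag{2.4}$$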
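The gradient flow (2.3)-(2.4) can be sketched numerically as plain gradient descent, i.e. an explicit Euler discretization in time. The snippet below is only an illustration under assumptions that are not in the text: $\nu$ is the empirical measure of `M` random points, the activation is the sigmoid, the target is `sin` of the coordinate sum, the $a_i(0)$'s are $\pm 1$, and the step size `eta` is chosen by hand. Note that the width `n` is smaller than the number of data points `M`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(a, w, X, y):
    """Quadratic loss (1.1) for the empirical measure and its gradients.

    a: (n,), w: (n, d), X: (M, d) data points, y: (M,) target values f(x_p).
    """
    M, n = X.shape[0], a.shape[0]
    s = sigmoid(X @ w.T)                        # sigma(w_i . x_p), shape (M, n)
    r = s @ a / np.sqrt(n) - y                  # residual f^(n)(x_p) - f(x_p)
    loss = 0.5 * np.mean(r ** 2)
    grad_a = s.T @ r / (M * np.sqrt(n))         # right-hand side of (2.4), negated
    ds = s * (1.0 - s)                          # sigma'(w_i . x_p)
    grad_w = a[:, None] * ((ds * r[:, None]).T @ X) / (M * np.sqrt(n))  # (2.3), negated
    return loss, grad_a, grad_w

rng = np.random.default_rng(0)
d, n, M = 4, 50, 400                            # width n smaller than data size M
X = rng.uniform(-1.0, 1.0, size=(M, d))
y = np.sin(X.sum(axis=1))                       # hypothetical target function
w = rng.standard_normal((n, d))                 # w_i(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=n)             # hypothetical a_i(0)

eta, steps = 0.5, 300                           # hand-picked step size
losses = []
for _ in range(steps):
    loss, grad_a, grad_w = loss_and_grads(a, w, X, y)
    losses.append(loss)
    a -= eta * grad_a
    w -= eta * grad_w

print(losses[0], losses[-1])
```

The sketch only shows the mechanics of the update; it does not by itself demonstrate the exponential rate claimed in the main result.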

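Instead of studying the prediction as in [6], we study the dynamics of the loss function directly, as motivated by [4]. By (2.3) and (2.4),

$$\frac{d}{dt}\ell=-\iint_{D\times D}(f^{(n)}(x)-f(x))(f^{(n)}(y)-f(y))\,M(x,y)\,d\nu(x)\,d\nu(y). \tag{2.5}$$

The matrix

$$M(x,y):=\frac{1}{n}\sum_{i=1}^n\Big(\sigma(w_i(t)\cdot x)\sigma(w_i(t)\cdot y)+a_i^2(t)\,\sigma'(w_i(t)\cdot x)\sigma'(w_i(t)\cdot y)\,x\cdot y\Big) \tag{2.6}$$

is called the Gram matrix in the terminology of [6]. We aim to use the positive definiteness of this matrix to show the exponential convergence. The second part is positive semi-definite (in fact, in [6] it was argued that this part is also positive definite when the data points are not parallel), so we can control $\frac{d}{dt}\ell$ easily as

$$\frac{d}{dt}\ell\le-\frac{1}{n}\sum_{i=1}^n\iint_{D\times D}(f^{(n)}(x)-f(x))(f^{(n)}(y)-f(y))\,\sigma(w_i(t)\cdot x)\sigma(w_i(t)\cdot y)\,d\nu(x)\,d\nu(y). \tag{2.7}$$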
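The relation between the gradient flow (2.3)-(2.4) and the Gram matrix (2.6) can be checked numerically for an empirical measure: the squared norm of the gradient equals the quadratic form of the residual with the Gram matrix, and the second part of the Gram matrix is positive semi-definite, so dropping it gives an upper bound of the type (2.7). The data, the sigmoid activation and the $\pm 1$ values of the $a_i$'s below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d, n, M = 3, 40, 25
X = rng.uniform(-1.0, 1.0, size=(M, d))        # data points x_p
y = np.cos(X.sum(axis=1))                      # hypothetical target values f(x_p)
w = rng.standard_normal((n, d))
a = rng.choice([-1.0, 1.0], size=n)            # hypothetical a_i

s = sigmoid(X @ w.T)                           # sigma(w_i . x_p), shape (M, n)
ds = s * (1.0 - s)                             # sigma'(w_i . x_p)
r = s @ a / np.sqrt(n) - y                     # residual at the data points

# Gram matrix (2.6) evaluated on the data points, split into its two parts.
G1 = s @ s.T / n
G2 = ((ds * a) @ (ds * a).T) * (X @ X.T) / n
quad_form = r @ (G1 + G2) @ r / M**2           # minus the time derivative of the loss

# Squared gradient norm, computed directly from (2.3) and (2.4).
grad_a = s.T @ r / (M * np.sqrt(n))
grad_w = a[:, None] * ((ds * r[:, None]).T @ X) / (M * np.sqrt(n))
grad_norm_sq = np.sum(grad_a**2) + np.sum(grad_w**2)

min_eig_G2 = np.linalg.eigvalsh(G2).min()      # non-negative up to rounding
print(quad_form, grad_norm_sq, min_eig_G2)
```

The second part `G2` is an entrywise product of two positive semi-definite matrices, which is why its smallest eigenvalue is non-negative up to rounding error.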

One key observation from [6] is that the weights do not change much during the dynamics. The choice (1.3) ensures the form

, so that one has kind of law of large number convergence here. Moreover, we note that

are i.i.d D Gaussian variables regardless the dimension of itself. Using the generic chaining approach [10], this allows us obtain the convergence rate of the loss function independent of and when is large enough ( without exponential dependence in ).

###### Theorem 1.

Under Assumptions 1-3, for any small, when , with probability , it holds that

$$\ell(t)\le C\log\left(\frac{1}{\delta}\right)\exp\left(-\frac{\lambda}{2}t\right), \tag{2.8}$$

where and are independent of , and .

We remark that $\ell(t)\to 0$ as $t\to\infty$ does not mean that $f^{(n)}$ approximates $f$ well. It only approximates the function values on the discrete points in $T$. However, since the convergence is independent of the number of data points, this result suggests that any continuous function can be approximated by two-layer neural networks with high probability.

In fact, it seems that the following approximation ability of two layer neural networks holds.

###### Corollary 1.

Suppose satisfying Assumptions 1 is the union of closures of some domains (open connected sets) so that the Lebesgue measure can be defined. Then, the two layer neural networks satisfying Assumptions 3 with width are able to approximate any functions , , under the norm.

Here, means the norm with Lebesgue measure,. As commented in section 4, it seems that such good approximations are available ”everywhere” in the parameter space.

## 3 The proof of the main result

Below, we use to indicate a constant independent of , whose concrete meaning can change from line to line.

###### Lemma 1.

The derivatives of the parameters satisfy

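$$\frac{1}{2}\frac{d}{dt}\sum_{i=1}^n a_i^2=-\int_D (f^{(n)}(x)-f(x))f^{(n)}(x)\,d\nu(x)\le\|f\|_\infty\int_D|f^{(n)}(x)-f(x)|\,d\nu(x)\le\|f\|_\infty\sqrt{\ell(t)}, \tag{3.1}$$

and

$$|\dot w_i|\le C\frac{|a_i(t)|}{\sqrt{n}}\sqrt{\ell(t)}, \tag{3.2}$$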

where is independent of and .

The proof is straightforward and we omit. Let us just remark that if we consider, for example, . Using (2.3), by Hölder’s inequality, it is straightforward to obtain

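$$|\dot w_i(t)|\le\frac{1}{\sqrt{n}}\left|\int_D (f^{(n)}(x)-f(x))\,a_i(t)\,\sigma'(w_i\cdot x)\,x\,d\nu(x)\right|\le\frac{|a_i(t)|}{\sqrt{n}}\left(\int_D (f^{(n)}(x)-f(x))^2\,d\nu(x)\right)^{1/2}\left(\int_D\|\sigma'\|_\infty^2\,d\nu(x)\right)^{1/2}\le C\frac{|a_i(t)|}{\sqrt{n}}\sqrt{\ell(t)}.$$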

Using Lemma 1, we are able to establish:

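$$\frac{d}{dt}\ell(t)\le-\frac{1}{n}\sum_{i=1}^n\iint_{D\times D}(f^{(n)}(x)-f(x))(f^{(n)}(y)-f(y))\,\sigma(w_i(0)\cdot x)\sigma(w_i(0)\cdot y)\,d\nu(x)\,d\nu(y)+R(t)=:-D_n(t)+R(t), \tag{3.3}$$

where we have introduced

$$D_n(t):=\frac{1}{n}\sum_{i=1}^n\iint_{D\times D}(f^{(n)}(x)-f(x))(f^{(n)}(y)-f(y))\,\sigma(w_i(0)\cdot x)\sigma(w_i(0)\cdot y)\,d\nu(x)\,d\nu(y), \tag{3.4}$$

and

$$R(t):=\frac{C}{n}\sum_{i=1}^n\iint_{D\times D}|f^{(n)}(x)-f(x)|\,|f^{(n)}(y)-f(y)|\int_0^t\frac{|a_i(s)|}{\sqrt{n}}\sqrt{\ell(s)}\,ds\,d\nu(x)\,d\nu(y). \tag{3.5}$$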

We now estimate these two terms separately.

###### Lemma 2.

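The term $R(t)$ satisfies

$$R(t)\le C\left(\frac{1}{\sqrt{n}}\left(\frac{\sum_{i=1}^n a_i(0)^2}{n}\right)^{1/2}b(t)+\frac{\sqrt{\|f\|_\infty}}{n}\,b(t)^{3/2}\right)\ell(t),$$

where

$$b(t):=\int_0^t\sqrt{\ell(s)}\,ds. \tag{3.6}$$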
###### Proof.

First of all, one finds

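$$\frac{1}{n}\sum_{i=1}^n\frac{|a_i(s)|}{\sqrt{n}}\le\frac{1}{n}\left(\sum_{i=1}^n a_i^2(s)\right)^{1/2}.$$

Using (3.1), one has

$$\sum_{i=1}^n a_i^2(s)\le\sum_{i=1}^n a_i^2(0)+2\|f\|_\infty\int_0^s\sqrt{\ell(\tau)}\,d\tau.$$

Hence, due to $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$ for $a,b\ge 0$, one has

$$\frac{1}{n}\sum_{i=1}^n\frac{|a_i(s)|}{\sqrt{n}}\le\frac{1}{n}\left(\sum_{i=1}^n a_i^2(0)\right)^{1/2}+\frac{\sqrt{2\|f\|_\infty}}{n}\left(\int_0^s\sqrt{\ell(\tau)}\,d\tau\right)^{1/2}.$$

Consequently, one has

$$\frac{1}{n}\sum_{i=1}^n\int_0^t\frac{|a_i(s)|}{\sqrt{n}}\sqrt{\ell(s)}\,ds\le\frac{1}{n}\left(\sum_{i=1}^n a_i^2(0)\right)^{1/2}\int_0^t\sqrt{\ell(s)}\,ds+\frac{2\sqrt{2}}{3}\frac{\sqrt{\|f\|_\infty}}{n}\left(\int_0^t\sqrt{\ell(s)}\,ds\right)^{3/2}, \tag{3.7}$$

where we have used the fact that

$$\left(\int_0^s\sqrt{\ell(\tau)}\,d\tau\right)^{1/2}\sqrt{\ell(s)}=\frac{2}{3}\frac{d}{ds}\left(\int_0^s\sqrt{\ell(\tau)}\,d\tau\right)^{3/2}.$$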

The claim then follows. ∎

We are now in a position to take in . In particular, we need to estimate

$$D_n(x,y):=\frac{1}{n}\sum_{i=1}^n\sigma(w_i(0)\cdot x)\sigma(w_i(0)\cdot y).$$

For fixed $x, y$, it is clear that this converges to

$$D(x,y):=\mathbb{E}_{w\sim N(0,I_d)}\,\sigma(w\cdot x)\sigma(w\cdot y),$$

and the rate is independent of . However, to obtain the uniform convergence rate for is challenging. The usual technique is the union bound, which gives a quantity involving , and this further requires the over-parametrized regime. One may also use the convergence of empirical measures in the so called Wasserstein spaces [11], which yields a quantity independent of the training data, but suffers from curse of dimensionality. In our problem here, we are essentially considering the projections of high dimensional Gaussians. Hence, we expect the convergence rate will not suffer from curse of dimensionality. We find that the generic chaining approach in [10, 12] useful.
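The concentration of $D_n(x,y)$ around $D(x,y)$ can be illustrated by Monte Carlo. The sketch below compares the maximum deviation over all pairs of a small data set for two widths. The data set, the sigmoid activation, and the use of a very wide network as a stand-in for the exact expectation $D(x,y)$ are all assumptions made for illustration; the experiment does not replace the uniform bound proved via generic chaining.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def empirical_kernel(X, n, rng):
    """D_n(x_p, x_q) = (1/n) sum_i sigma(w_i . x_p) sigma(w_i . x_q), w_i ~ N(0, I_d)."""
    w = rng.standard_normal((n, X.shape[1]))
    s = sigmoid(X @ w.T)                         # shape (M, n)
    return s @ s.T / n

rng = np.random.default_rng(0)
d, M = 10, 20
X = rng.uniform(-1.0, 1.0, size=(M, d)) / np.sqrt(d)   # hypothetical data points

# Stand-in for the exact kernel D(x, y): a very large independent sample.
D_ref = empirical_kernel(X, 200_000, rng)

errors = {}
for n in (100, 40_000):
    errors[n] = np.abs(empirical_kernel(X, n, rng) - D_ref).max()
    print(n, errors[n])
```

Increasing the width by a factor of 400 should shrink the maximum deviation by roughly a factor of 20 if the $1/\sqrt{n}$ scaling holds, up to the noise in the reference kernel.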

We introduce the following result essentially derived from [12, Proposition 2.2] and the proof of [12, Theorem 1.1]. The proof is then deferred to Appendix A using generic chaining approach.

###### Proposition 1.

Consider

$$\left\{Z_x:=\frac{1}{n}\sum_{i=1}^n Y^i_x\right\}_{x\in D}.$$

Suppose that for any given , ’s are independent, and for any , and are independent for . Moreover, ’s have the same distribution, satisfying the following.

1. They are uniformly bounded, have mean zero, and finite variance.

2. there is a constant such that

$$\mathbb{E}|Y^i_x-Y^i_y|^2\le C_1|x-y|^2,\quad\forall i,x,y,$$

then there is a parameter only depending on with magnitude , such that for any small and that is countable, when , with probability ,

$$\sup_{x\in T}|Z_x|\le C\log\left(\frac{1}{\delta}\right)\frac{\gamma_2}{\sqrt{n}}, \tag{3.8}$$

where is independent of .

We now move to the estimates of . We have

###### Lemma 3.

For any small , when , with probability , it holds that

$$\sup_{x,y\in T}|D_n(x,y)-D(x,y)|\le C\log\left(\frac{1}{\delta}\right)\frac{\gamma_2}{\sqrt{n}}. \tag{3.9}$$
###### Proof.

We only need to verify the conditions in Proposition 1. Here, the domain we work on is while is the countable subset we consider. The distance is defined by .

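Clearly, we need to identify

$$Y^i_{x,y}=\sigma(w_i(0)\cdot x)\sigma(w_i(0)\cdot y)-\mathbb{E}_{w\sim N(0,I_d)}\,\sigma(w\cdot x)\sigma(w\cdot y).$$

Then,

$$|Y^i_{x,y}-Y^i_{x_1,y_1}|^2\le C\left(|w_i(0)\cdot(x-x_1)|^2+|w_i(0)\cdot(y-y_1)|^2\right).$$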

Taking expectation, one finds . The claim follows.

Note that in this case, the dimension is , but clearly, . ∎

Note that the boundedness of $\sigma$ is also used here for the Bernstein inequality in Appendix A. Consequently, when (3.9) holds, one has

$$|D_n(t)-D(t)|\le C\log\left(\frac{1}{\delta}\right)\frac{\gamma_2}{\sqrt{n}}\,\ell(t). \tag{3.10}$$

Now, we move on to the initial values of $f^{(n)}$. If we identify

$$Y^i_x:=a_i(0)\sigma(w_i(0)\cdot x)$$

and use Proposition 1, we can find that when , with probability , it holds that

$$\sup_{x\in T}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^n a_i(0)\sigma(w_i(0)\cdot x)\right|\le C\log\left(\frac{1}{\delta}\right)\gamma_2. \tag{3.11}$$

Here, depends on . However, according to [8, Lemma 1], one has the following.

###### Lemma 4.

For any small , with probability , it holds that

$$\sup_{x\in T}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^n a_i(0)\sigma(w_i(0)\cdot x)\right|\le C\log\left(\frac{1}{\delta}\right). \tag{3.12}$$

The proof is done via the Rademacher complexity, and we refer the readers to [8].

Now, we estimate the Gram matrix.

###### Lemma 5.

There exists independent of such that

$$\mathbb{E}_{w\sim N(0,I_d)}\,\sigma(w\cdot x)\sigma(w\cdot y)\ge\lambda,\quad\forall x,y\in D. \tag{3.13}$$
###### Proof.

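In fact, the expectation only depends on the pair $(w\cdot x, w\cdot y)$, which is a 2D Gaussian with covariance matrix

$$\Sigma=\begin{pmatrix}|x|^2 & x\cdot y\\ x\cdot y & |y|^2\end{pmatrix}.$$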

Consider the interval where . Then, contains for some .

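Now, we should estimate $\mathbb{E}\,\sigma(w\cdot x)\sigma(w\cdot y)$. The expectation can be rewritten as

$$\mathbb{E}\,\sigma(|x|z_1)\,\sigma\big(|y|(r z_1+\sqrt{1-r^2}\,z_2)\big),$$

where $z_1$ and $z_2$ are independent standard normal variables, and

$$r=\frac{x\cdot y}{|x||y|}.$$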

Note that is bounded, hence, we only need and where is a number chosen such that and are in for all . Clearly can be made universal independent of . The probability hence has a lower bound independent of . Then, setting will suffice. ∎
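The reduction to a two-dimensional Gaussian used in this proof can be sanity-checked by Monte Carlo. The sketch assumes the representation $w\cdot x=|x|z_1$, $w\cdot y=|y|(rz_1+\sqrt{1-r^2}\,z_2)$ with independent standard normals $z_1,z_2$ (which reproduces the covariance matrix $\Sigma$), together with a sigmoid activation and two arbitrary test vectors; none of these specific choices is taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, N = 8, 200_000
x = rng.uniform(-1.0, 1.0, size=d)              # hypothetical test vectors
y = rng.uniform(-1.0, 1.0, size=d)

# Direct estimate: average over d-dimensional Gaussian weights.
w = rng.standard_normal((N, d))
direct = np.mean(sigmoid(w @ x) * sigmoid(w @ y))

# Reduced estimate: only two standard normal variables are needed.
nx, ny = np.linalg.norm(x), np.linalg.norm(y)
r = x @ y / (nx * ny)
z1 = rng.standard_normal(N)
z2 = rng.standard_normal(N)
reduced = np.mean(sigmoid(nx * z1) * sigmoid(ny * (r * z1 + np.sqrt(1.0 - r**2) * z2)))

print(direct, reduced)
```

Both estimates target the same expectation, so they should agree up to Monte Carlo error, and both are strictly positive because the assumed activation is positive.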

We are now able to prove the main result.

###### Proof of Theorem 1.

By the previous lemmas, for any small, when , we may pick an event with probability , such that outside this, we have , , and also .

Then, on

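$$\frac{d\ell(t)}{dt}\le-\lambda\ell(t)+C\log\left(\frac{1}{\delta}\right)\gamma_2\frac{\ell(t)}{\sqrt{n}}+C\left(\frac{b(t)}{\sqrt{n}}+\frac{\sqrt{\|f\|_\infty}}{n}\,b(t)^{3/2}\right)\ell(t).$$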

Define

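$$T_b:=\sup\left\{t\ge 0:\ \lambda-C\log\left(\frac{1}{\delta}\right)\frac{\gamma_2}{\sqrt{n}}-\frac{C}{\sqrt{n}}\,b(s)-C\frac{\sqrt{\|f\|_\infty}}{n}\,b(s)^{3/2}>\frac{\lambda}{2},\ \forall\, 0\le s\le t\right\}.$$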

We claim that if for some , then

We first of all note that is nonincreasing. We now consider a fixed . Then, all the quantities below are essentially discussed for such , and we omit this dependence for convenience.

In fact, since , it is then clear that if , . Then, for , then

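$$\frac{d\ell(t)}{dt}\le-\frac{1}{2}\lambda\ell(t).$$

Then,

$$\ell(t)\le\ell(0)\exp\left(-\frac{\lambda}{2}t\right)\le C\log\left(\frac{1}{\delta}\right)\exp\left(-\frac{\lambda}{2}t\right).$$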

Hence, for all

$$b(t)=\int_0^t\sqrt{\ell(s)}\,ds<\frac{4}{\lambda}\sqrt{C\log\left(\frac{1}{\delta}\right)}.$$

Hence, if is chosen such that

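$$C\log(1/\delta)\frac{\gamma_2}{\sqrt{n}}+\frac{4C\sqrt{C}}{\lambda}\frac{\sqrt{\log\left(\frac{1}{\delta}\right)}}{\sqrt{n}}+C^{7/4}\frac{\sqrt{\|f\|_\infty}}{n}\left(\frac{4}{\lambda}\right)^{3/2}\left(\log\left(\frac{1}{\delta}\right)\right)^{3/4}<\frac{\lambda}{2},$$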

also holds, then we must have . In other words, for some ,

$$n\ge C_1\left(\gamma_2^2+\sqrt{\|f\|_\infty}\right)(\log\delta)^2$$

will imply .

Otherwise, suppose . Then,

still holds. Then, by continuity, there exists such that on this still holds. This then contradicts with the definition of .

Now that , holds for all . Lastly, we note , the claim then follows. ∎

We now move to the corollary regarding approximation.

###### Proof of Corollary 1.

We fix . Clearly, one can use a function that is Lipschitz to approximate with small error under norm. Hence, we assume that is Lipschitz.

We first of all have the following estimates:

$$\int_D|f(x)-f^{(n)}(x)|^2\,dx\le\left|\int_D|f(x)-f^{(n)}(x)|^2\,dx-\int_D|f(x)-f^{(n)}(x)|^2\,d\nu_M\right|+\int_D|f(x)-f^{(n)}(x)|^2\,d\nu_M, \tag{3.14}$$

where

$$\nu_M=\frac{1}{M}\sum_{p=1}^M\delta(x-x_p)$$

is the empirical measure for data .

By the standard result of convergence empirical measures [11], there exists such with such that,

$$W_1(\nu_M,dx)<\epsilon,$$

where is the Wasserstein distance.

Then, we set . Then, we take the event such that the claims in Theorem 1 hold for , and such that . Hence, we are able to find the parameters for large enough such that

$$\int_D|f(x)-f^{(n)}(x,t)|^2\,d\nu_M\le\epsilon.$$

The important consequence of Theorem 1 is that is independent of (and thus ).

Now, we fix such and estimate the Lipschitz constant of . The gradient of this function is

$$-2\left(f(x)-f^{(n)}(x,t)\right)\left(\nabla f+\frac{1}{\sqrt{n}}\sum_{i=1}^n a_i(t)\,\sigma'(w_i(t)\cdot x)\,w_i(t)\right).$$

Hence, we need to estimate and .

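By Lemma 1,

$$\sum_{i=1}^n a_i^2(t)\le\sum_{i=1}^n a_i^2(0)+C=n+C.$$

Similarly,

$$|w_i(t)|\le|w_i(0)|+\frac{C}{\sqrt{n}}\int_0^t|a_i(s)|\sqrt{\ell(s)}\,ds.$$

Hence,

$$\sum_{i=1}^n|w_i|\le\sqrt{n}\sqrt{\sum_{i=1}^n|w_i(0)|^2}+C\int_0^t\sqrt{\sum_{i=1}^n a_i^2(s)}\,\sqrt{\ell(s)}\,ds\le C\sqrt{n}.$$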

Hence, and . In other words, the Lipschitz constant of is controlled by a polynomial of and the Lipschitz constant of (but independent of ), since is independent of .

The whole error is thus given by . As is arbitrary, the claim follows. ∎

## 4 Conclusion and discussion

Current works show the convergence of the loss function in the over-parametrized regime for neural networks. By considering the loss function directly, we obtain the decay of the loss function at a rate independent of the number of training data, where the approximation is given specifically by (1.3). The key observation, made by S. Du et al., is that the weights stay close to the initialization. In this sense, the loss function has global minimizers "everywhere" in the parameter space. As we have seen, the proof works for a large number of data $M$, and as Corollary 1 indicates, the parameters that achieve a good approximation seem to be available "everywhere" in the parameter space, which we strongly believe leads to the good performance of stochastic gradient descent. If we instead use the approximation of the form (1.2) inspired by the mean field limit, then one must wait for the weights to converge to some specific regions to reach a global minimum. This probably means that the approximation (1.3) has larger capacity and may be used to explain why SGD behaves well.

The loss function used here is quadratic. However, we argue that quadratic functions are general enough since, near the minima, all functions look like quadratic functions.

There are many immediate directions for subsequent work. For example, one may explore whether the ideas and techniques here can be applied to improve the generalization error estimates in the literature, and to get a better explanation of why stochastic gradient descent works well. These should be challenging but exciting problems.

## Acknowledgement

The work of L. Li was partially sponsored by NSFC 11901389 and 11971314, and Shanghai Sailing Program 19YF1421300.

## Appendix A Proof of Proposition 1

The proof divides into several steps.

Step 1. Basic concentration inequalities.

Define

 φ(v):=√v+1 (A.1)

and define

 Wx:=φ(|Zx|). (A.2)

Hence, using , one has

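$$|W_x-W_y|\le\sqrt{\big||Z_x|-|Z_y|\big|}\le\sqrt{|Z_x-Z_y|}.$$

Note that

$$Z_x-Z_y=\frac{1}{n}\sum_{i=1}^n\left(Y^i_x-Y^i_y\right),$$

where the $Y^i_x-Y^i_y$'s are still independent with mean zero. By the Bernstein inequality [13], one has for any $t>0$

$$P\big(|Z_x-Z_y|\ge t|x-y|\big)\le 2\exp\left(-\frac{n t^2|x-y|^2/\sigma^2}{2(1+Lt|x-y|/\sigma)}\right)\le 2\exp\left(-C_1 n\min(t^2,t)\right), \tag{A.3}$$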

where .

Consequently,

$$P\left(|W_x-W_y|\ge t\sqrt{|x-y|}\right)\le P\left(|Z_x-Z_y|\ge t^2|x-y|\right)\le 2\exp\left(-C_1 n\min(t^4,t^2)\right).$$

Hence, one defines the metric (note that the maximum of two metrics is still a metric)

$$g(x,y)=\max\left(|x-y|,\,|x-y|^{1/2}\right), \tag{A.4}$$

then one has

$$P\big(|Z_x-Z_y|\ge t\,g(x,y)\big)\le 2\exp\left(-C_1 n\min(t^2,t)\right), \tag{A.5}$$

while for $t\ge 1$,

$$P\big(|W_x-W_y|\ge t\,g(x,y)\big)\le 2\exp\left(-C_1 n t^2\right). \tag{A.6}$$

Step 2 Generic chaining

Now consider the domain equipped with the metric in (A.4). Define a number

$$\gamma_\alpha:=\inf\sup_{x\in D}\sum_{s=0}^\infty 2^{s/\alpha}g(x,T_s), \tag{A.7}$$

where is a set containing points. The is taken over the collections of such sets . From to , the number of points in is squared. Hence, the distance decays double exponentially if the sequence is chosen suitably. Hence, is well-defined. For a given , for every , there must be , such that

$$g(x,T_s)=g(x,y),$$

and we define

$$\pi_s(x):=y. \tag{A.8}$$

Let be the usual Euclidean distance, and we find that on the suitable sequence of subsets, . Hence, the number is comparable to

$$\tilde\gamma_\alpha:=\inf\sup_{x\in D}\sum_{s=0}^\infty 2^{s/\alpha}g_e(x,T_s).$$

Below, we focus only on . The number of for the unit ball is like as shown in [10, sec. 2.2], where is the dimension.

By the definition of , we are able to choose the collection of such that

$$\sum_{s=1}^\infty 2^{s/2}g(\pi_s(x),\pi_{s+1}(x))\le 2\gamma_2,$$

which holds for all .
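For a finite point set, the sum in (A.7) can be evaluated for one particular nested sequence of subsets, which gives an upper bound on the infimum. The sketch below builds the subsets greedily by farthest-point selection and uses the metric (A.4). The cardinality constraint $|T_0|=1$, $|T_s|\le 2^{2^s}$ is the standard one from the generic chaining literature and is assumed here, as are the greedy construction and the function name `chaining_upper_bound`.

```python
import numpy as np

def chaining_upper_bound(P, alpha=2.0):
    """Evaluate sup_x sum_s 2^(s/alpha) g(x, T_s) over the rows of P for a greedy
    nested sequence T_0, T_1, ... with |T_0| = 1 and |T_s| <= 2^(2^s) (assumed)."""
    m = len(P)
    dist = np.linalg.norm(P - P[0], axis=1)      # Euclidean distance to T_0 = {P[0]}
    size, s = 1, 0
    total = np.zeros(m)
    while True:
        g = np.maximum(dist, np.sqrt(dist))      # metric (A.4) applied to the distance
        total += 2.0 ** (s / alpha) * g
        if dist.max() == 0.0:                    # every point already belongs to T_s
            break
        s += 1
        target = min(m, 2 ** (2 ** s))
        while size < target and dist.max() > 0.0:
            j = int(dist.argmax())               # add the point farthest from T_s
            dist = np.minimum(dist, np.linalg.norm(P - P[j], axis=1))
            size += 1
    return total.max()

rng = np.random.default_rng(0)
P = rng.uniform(-1.0, 1.0, size=(300, 2))        # hypothetical countable set T
value = chaining_upper_bound(P)
print(value)
```

Because the infimum in (A.7) runs over all admissible sequences, the greedy value is only an upper bound for the finite set, not the quantity $\gamma_2$ itself.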

Step 3 Controlling .

Let be such that . Consider any . By step 1, for any fixed , one has

 P⎛⎝|Wx−Wy|≥2C1√log(|Ts||Ts+1|)ng(x,y)⎞⎠≤2exp(−2log(|Ts||Ts+1|))=2|Ts|2|Ts+1|2≤1|Ts|2|Ts+1