# A Comparative Analysis of the Optimization and Generalization Properties of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to those of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space. Our analysis suggests strongly that in terms of `implicit regularization', two-layer neural network models do not outperform the kernel method.



## 1 Introduction

Optimization and generalization are two central issues in the theoretical analysis of machine learning models. These issues are of special interest for modern neural network models, not only because of their practical success [18, 19], but also because these models are often heavily over-parametrized and traditional machine learning theory does not seem to apply directly [21, 30]. For this reason, there has been a lot of recent theoretical work centered on these issues [15, 16, 12, 11, 2, 8, 10, 31, 29, 28, 25, 27]. One issue of particular interest is whether the gradient descent (GD) algorithm can produce models that optimize the empirical risk and at the same time generalize well for the population risk. In the case of over-parametrized two-layer neural network models, which will be the focus of this paper, it is generally understood that as a result of the non-degeneracy of the associated Gram matrix [29, 12], optimization can be accomplished using the gradient descent algorithm regardless of the quality of the labels, in spite of the fact that the empirical risk function is non-convex. In this regard, one can say that over-parametrization facilitates optimization.

The situation with generalization is a different story. There has been a lot of interest in the so-called "implicit regularization" effect [21], i.e., the idea that by tuning the parameters of the optimization algorithm, one might be able to guide it toward network models that generalize well, without the need to add any explicit regularization terms (see below for a review of the existing literature). But despite these efforts, it is fair to say that a general picture has yet to emerge.

In this paper, we perform a rather thorough analysis of the gradient descent algorithm for training two-layer neural network models. We study the case in which the parameters in both the input and output layers are updated, which is the case found in practice. In the heavily over-parametrized regime, for general initializations, we prove that the results of [12] still hold, namely, the gradient descent dynamics still converges to a global minimum exponentially fast, regardless of the quality of the labels. However, we also prove that the functions obtained are uniformly close to the ones found in an associated kernel method, with the kernel defined by the initialization. In the second part of the paper, we study the more general situation when the assumption of over-parametrization is relaxed. We provide sharp estimates for both the empirical and population risks. In particular, we prove that for target functions in the appropriate reproducing kernel Hilbert space (RKHS) [3], the generalization error can be made small if a certain early-stopping strategy is adopted for the gradient descent algorithm.

Our results imply that, in the absence of explicit regularization, over-parametrized two-layer neural networks behave a lot like kernel methods: they can always fit any set of random labels, but in order to generalize, the target function has to be in the right RKHS. In light of the optimal generalization error bounds proved in [13] for regularized models, one is tempted to conclude that explicit regularization is necessary for two-layer neural network models to fully realize their potential in expressing complex functional relationships.

### 1.1 Related work

The seminal work of [30] presented both numerical and theoretical evidence that over-parametrized neural networks can fit random labels. Building upon earlier work on the non-degeneracy of certain Gram matrices [29], Du et al. went a step further by proving that the GD algorithm can find global minima of the empirical risk for sufficiently over-parametrized two-layer neural networks [12]. This result was extended to multi-layer networks in [11, 2]. A related result for infinitely wide neural networks was obtained in [14], and a similar result in a general setting appears in [9].

The issue of generalization is less clear. [10] established generalization error bounds for solutions produced by the online stochastic gradient descent (SGD) algorithm with early stopping when the target function is in a certain RKHS. Similar results were proved in [20] for the classification problem, and in [8] for offline SGD algorithms. In [1], generalization results were proved for the GD algorithm for target functions that can be represented by the underlying neural network models. More recently, in [4], a generalization bound was derived for GD solutions using a data-dependent norm. This norm is bounded if the target function belongs to the appropriate RKHS. However, their error bounds are not strong enough to rule out the curse of dimensionality. Indeed, the results of the present paper suggest that the curse of dimensionality does occur in their setting (see Theorem 3.4).

## 2 Preliminaries

Throughout this paper, we will use the notation $[n] := \{1, \dots, n\}$ for a positive integer $n$. We use $\|\cdot\|$ and $\|\cdot\|_F$ to denote the $\ell_2$ and Frobenius norms for matrices, respectively. We let $S^{d-1} := \{x \in \mathbb{R}^d : \|x\| = 1\}$, and use $\pi_0$ to denote the uniform distribution over $S^{d-1}$. We use $X \lesssim Y$ to indicate that there exists an absolute constant $C$ such that $X \le CY$; $X \gtrsim Y$ is similarly defined. If $f$ is a function defined on $S^{d-1}$ and $\pi$ is a probability distribution on $S^{d-1}$, we let $\|f\|_{\pi}^2 := \mathbb{E}_{x\sim\pi}[f^2(x)]$.

### 2.1 Problem setup

We focus on the regression problem with a training data set $\{(x_i, y_i)\}_{i=1}^n$, i.i.d. samples drawn from a distribution $\rho$, which is assumed fixed but known only through the samples. In this paper, we assume $x \in S^{d-1}$ and $|y| \le 1$. We are interested in fitting the data by a two-layer neural network:

$$f_m(x;\Theta) = a^T\sigma(Bx), \tag{1}$$

where $a \in \mathbb{R}^m$ and $B = (b_1, \dots, b_m)^T \in \mathbb{R}^{m\times d}$, and $\Theta = (a, B)$ denotes all the parameters. Here $\sigma$ is the nonlinear activation function. We will omit the subscript $m$ in the notation $f_m$ if there is no danger of confusion. In formula (1), we omit the bias term for notational simplicity. The effect of the bias term can be incorporated if we think of $x$ as $(x^T, 1)^T$.

The ultimate goal is to minimize the population risk, defined by

$$R(\Theta) = \frac{1}{2}\mathbb{E}_{x,y}\left[(f(x;\Theta) - y)^2\right].$$

But in practice, we can only work with the following empirical risk:

$$\hat{R}_n(\Theta) = \frac{1}{2n}\sum_{i=1}^n \left(f(x_i;\Theta) - y_i\right)^2.$$

We are interested in analyzing the properties of the gradient descent algorithm $\Theta_{k+1} = \Theta_k - \eta\nabla\hat{R}_n(\Theta_k)$, where $\eta$ is the learning rate. For simplicity, we will focus on its continuous version, the gradient descent (GD) dynamics:

$$\frac{d\Theta_t}{dt} = -\nabla\hat{R}_n(\Theta_t). \tag{2}$$
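As an illustration, the gradient flow (2) can be simulated by a forward-Euler discretization, i.e. plain gradient descent with a small learning rate, with both layers of model (1) updated. The following sketch is not part of the paper's analysis; the ReLU activation, the sizes $d, m, n$, the learning rate, and the step count are all illustrative assumptions.

```python
import numpy as np

# Forward-Euler discretization of the gradient flow (2) for the two-layer
# ReLU network f(x; Theta) = a^T sigma(B x), training BOTH layers.
rng = np.random.default_rng(0)
d, m, n = 5, 200, 20
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.normal(size=n)                          # arbitrary (even random) labels

B = rng.normal(size=(m, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)   # rows of B ~ uniform on the sphere
beta = 1.0 / np.sqrt(m)                         # initialization magnitude
a = beta * rng.choice([-1.0, 1.0], size=m)

def relu(z):
    return np.maximum(z, 0.0)

def empirical_risk(a, B):
    f = relu(X @ B.T) @ a                       # f(x_i; Theta) for all i
    return 0.5 * np.mean((f - y) ** 2)

lr, losses = 0.05, []
for _ in range(500):
    Z = X @ B.T                                 # (n, m) pre-activations
    f = relu(Z) @ a
    e = (f - y) / n                             # residuals, 1/n from the risk
    grad_a = relu(Z).T @ e                      # d R_hat / d a
    grad_B = ((e[:, None] * (Z > 0)) * a).T @ X # d R_hat / d B
    a -= lr * grad_a
    B -= lr * grad_B
    losses.append(empirical_risk(a, B))

print(losses[0], losses[-1])                    # the empirical risk decreases
```

With $m \ge n$ and a stable step size, the risk is driven toward zero even for the random labels used here, in line with Section 3.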
##### Initialization

We assume that $\{b_k(0)\}_{k=1}^m$ are i.i.d. random variables drawn from $\pi_0$, and that $\{a_k(0)\}_{k=1}^m$ are i.i.d. random variables drawn from the uniform distribution over $\{-\beta, \beta\}$. Here $\beta$ controls the magnitude of the initialization, and it may depend on $m$. Other initialization schemes can also be considered (e.g. distributions other than $\pi_0$, other ways of initializing $\{a_k(0)\}$); the arguments needed do not change much from the ones for this special case.

### 2.2 Assumption on the input data

With the activation function $\sigma$ and the distribution $\pi_0$, we can define two positive definite (PD) functions¹

$$k^{(a)}(x,x') := \mathbb{E}_{b\sim\pi_0}\left[\sigma(b^Tx)\sigma(b^Tx')\right], \qquad k^{(b)}(x,x') := \mathbb{E}_{b\sim\pi_0}\left[\sigma'(b^Tx)\sigma'(b^Tx')\langle x, x'\rangle\right].$$

¹We say that a continuous symmetric function $k$ is positive definite if and only if for any distinct $x_1, \dots, x_n \in S^{d-1}$, the kernel matrix $K$ with $K_{i,j} = k(x_i, x_j)$ is positive definite.

For a fixed training sample, the corresponding normalized kernel matrices are defined by

$$K^{(a)}_{i,j} = \frac{1}{n}k^{(a)}(x_i,x_j), \qquad K^{(b)}_{i,j} = \frac{1}{n}k^{(b)}(x_i,x_j). \tag{3}$$

Throughout this paper, we make the following assumption on the training set.

###### Assumption 1.

For the given training set $\{(x_i, y_i)\}_{i=1}^n$, we assume that the smallest eigenvalues of the two kernel matrices defined above are both positive, i.e.

$$\lambda_n^{(a)} := \lambda_{\min}\left(K^{(a)}\right) > 0, \qquad \lambda_n^{(b)} := \lambda_{\min}\left(K^{(b)}\right) > 0.$$

###### Remark 1.

Note that, in general, $\lambda_n^{(a)}$ and $\lambda_n^{(b)}$ depend on the data set. For any PD function $s$, the Hilbert-Schmidt integral operator $T_s$ is defined by

$$T_sf(x) = \int_{S^{d-1}} s(x,x')f(x')\,d\pi_0(x').$$

If $\{x_i\}$ are independently drawn from $\pi_0$, it was proved in [6] that with high probability $\lambda_n^{(a)}$ and $\lambda_n^{(b)}$ are bounded from below in terms of the eigenvalues of the corresponding integral operators. Using a similar idea, [29] provided lower bounds for these quantities based on a geometric discrepancy, which quantifies the degree of uniformity of $\{x_i\}$. In this paper, we take Assumption 1 as our basic assumption.
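For concreteness, the kernels $k^{(a)}, k^{(b)}$ and the matrices in (3) can be estimated by Monte Carlo sampling of $b \sim \pi_0$, and Assumption 1 can then be checked numerically. The sketch below is illustrative (ReLU activation, random data, and the sample sizes are all assumptions, not values from the paper).

```python
import numpy as np

# Monte Carlo estimate of the kernel matrices K^(a), K^(b) from (3) for
# ReLU with b ~ pi_0 (uniform on the sphere), plus a numerical check of
# Assumption 1 on a random data set.
rng = np.random.default_rng(1)
d, n, n_mc = 5, 10, 400_000

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # x_i on S^{d-1}

b = rng.normal(size=(n_mc, d))
b /= np.linalg.norm(b, axis=1, keepdims=True)   # b ~ pi_0

Z = b @ X.T                                     # b^T x_i, shape (n_mc, n)
S = np.maximum(Z, 0.0)                          # sigma(b^T x_i)
Sp = (Z > 0).astype(float)                      # sigma'(b^T x_i)

K_a = (S.T @ S) / n_mc / n                      # (1/n) E[sigma sigma]
K_b = (Sp.T @ Sp) / n_mc * (X @ X.T) / n        # (1/n) E[sigma' sigma' <x_i, x_j>]

lam_a = np.linalg.eigvalsh(K_a).min()
lam_b = np.linalg.eigvalsh(K_b).min()
print(lam_a, lam_b)   # both strictly positive for generic data (Assumption 1)
```

Note that here $n > d$, so the Gram matrix $XX^T$ alone is singular, yet $K^{(b)}$ is still positive definite: it is the Schur product of a positive definite matrix with a PSD matrix that has unit diagonal.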

### 2.3 The random feature model

We introduce the following random feature model [22] as a reference for the two-layer neural network model:

$$f_m(x;\tilde{a}, B_0) := \tilde{a}^T\sigma(B_0x), \tag{4}$$

where $\tilde{a} \in \mathbb{R}^m$. Here $B_0 = B(0)$ is fixed at the corresponding initial value for the neural network model and is not part of the parameters to be trained. The corresponding gradient descent dynamics is given by

$$\frac{d\tilde{a}_t}{dt} = -\frac{1}{n}\sum_{i=1}^n\left(\tilde{a}_t^T\sigma(B_0x_i) - y_i\right)\sigma(B_0x_i). \tag{5}$$

This dynamics is relatively simple to analyze since it is linear.
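Because (5) is linear in $\tilde{a}$, the training residual $e(t) = \Phi\tilde{a}_t - y$ with $\Phi_{ik} = \sigma(b_k(0)^Tx_i)$ satisfies $\dot{e} = -\frac{1}{n}\Phi\Phi^Te$, so $e(t) = e^{-t\,\Phi\Phi^T/n}e(0)$. The following sketch (all sizes and the step size are illustrative assumptions) integrates (5) by forward Euler, starting from $\tilde{a}_0 = 0$, and compares the result with this closed form.

```python
import numpy as np

# Random feature dynamics (5): only the output weights ~a are trained,
# B_0 stays frozen, so the residual dynamics is linear and solvable.
rng = np.random.default_rng(2)
d, m, n = 5, 100, 10
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
B0 = rng.normal(size=(m, d)); B0 /= np.linalg.norm(B0, axis=1, keepdims=True)

Phi = np.maximum(X @ B0.T, 0.0)        # features sigma(B_0 x_i), shape (n, m)
a_t = np.zeros(m)                      # ~a starts from zero (beta = 0)

dt, T = 1e-3, 2000
for _ in range(T):
    e = Phi @ a_t - y                  # residuals
    a_t -= dt * (Phi.T @ e) / n        # Euler step of d~a/dt = -(1/n) Phi^T e

# Closed form: e(t) = exp(-t * Phi Phi^T / n) e(0) for the residual dynamics.
H = Phi @ Phi.T / n
w, V = np.linalg.eigh(H)
e0 = -y                                # residual at ~a = 0
e_exact = V @ (np.exp(-w * dt * T) * (V.T @ e0))
e_euler = Phi @ a_t - y
print(np.max(np.abs(e_exact - e_euler)))   # small discretization error
```

Every eigenvalue of $\Phi\Phi^T/n$ contributes an exponentially decaying mode, which is the linear analogue of the convergence results of Section 3.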

## 3 Analysis of the over-parameterized case

In this section, we consider the optimization and generalization properties of the GD dynamics in the over-parametrized regime. We introduce two Gram matrices $G^{(a)}, G^{(b)} \in \mathbb{R}^{n\times n}$, defined by

$$G^{(a)}_{i,j}(\Theta) = \frac{1}{nm}\sum_{k=1}^m \sigma(b_k^Tx_i)\sigma(b_k^Tx_j), \qquad G^{(b)}_{i,j}(\Theta) = \frac{1}{nm}\sum_{k=1}^m a_k^2\,x_i^Tx_j\,\sigma'(b_k^Tx_i)\sigma'(b_k^Tx_j).$$

Let $G = G^{(a)} + G^{(b)}$ and $e = (e_1, \dots, e_n)^T$ with $e_i = f(x_i;\Theta) - y_i$. It is easy to see that

$$\|\nabla_\Theta\hat{R}_n\|^2 = \frac{m}{n}\,e^TGe. \tag{6}$$

Since $\|e\|^2 = 2n\hat{R}_n$, we have

$$2m\lambda_{\min}(G)\hat{R}_n \le \|\nabla_\Theta\hat{R}_n\|^2 \le 2m\lambda_{\max}(G)\hat{R}_n.$$
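Identity (6) can be verified numerically. The sketch below (illustrative sizes, ReLU activation assumed) compares $\|\nabla_\Theta\hat{R}_n\|^2$ computed directly from the gradients with $(m/n)\,e^TGe$ computed from the Gram matrices.

```python
import numpy as np

# Numerical check of identity (6): ||grad R_hat||^2 = (m/n) e^T G e,
# where G = G^(a) + G^(b) and e_i = f(x_i; Theta) - y_i.
rng = np.random.default_rng(3)
d, m, n = 4, 30, 8
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
B = rng.normal(size=(m, d)); B /= np.linalg.norm(B, axis=1, keepdims=True)
a = rng.normal(size=m) / np.sqrt(m)

Z = X @ B.T                                  # (n, m)
S, Sp = np.maximum(Z, 0.0), (Z > 0).astype(float)  # sigma and sigma' (ReLU)
e = S @ a - y                                # residuals

# Gradient of R_hat = (1/2n) sum_i e_i^2.
grad_a = S.T @ e / n                         # shape (m,)
grad_B = ((e[:, None] * Sp) * a).T @ X / n   # shape (m, d)
lhs = np.sum(grad_a**2) + np.sum(grad_B**2)

# Gram matrices from the definitions above.
G_a = (S @ S.T) / (n * m)
G_b = (Sp * a**2) @ Sp.T * (X @ X.T) / (n * m)
rhs = (m / n) * e @ (G_a + G_b) @ e
print(lhs, rhs)                              # the two agree up to rounding
```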

### 3.1 Properties of the initialization

###### Lemma 1.

For any fixed $\delta \in (0,1)$, with probability at least $1-\delta$ over the random initialization, we have

$$\hat{R}_n(\Theta_0) \le \frac{1}{2}\left(1 + c(\delta)\sqrt{m}\,\beta\right)^2,$$

where $c(\delta)$ is a constant depending only on $\delta$.

The proof of this lemma can be found in Appendix C.

In addition, at the initialization, the Gram matrices satisfy

$$G^{(a)}(\Theta_0) \to K^{(a)}, \qquad G^{(b)}(\Theta_0) \to \beta^2K^{(b)} \quad \text{as } m \to \infty.$$

In fact, we have

###### Lemma 2.

For any $\delta \in (0,1)$, if $m$ is sufficiently large, then with probability at least $1-\delta$ over the random choice of $\Theta_0$,

$$\lambda_{\min}\left(G(\Theta_0)\right) \ge \frac{3}{4}\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right).$$

The proof of this lemma is deferred to Appendix D.

### 3.2 Gradient descent near the initialization

We define a neighborhood of the initialization by

$$I(\Theta_0) := \left\{\Theta : \|G(\Theta) - G(\Theta_0)\|_F \le \frac{1}{4}\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)\right\}. \tag{7}$$

Using the lemma above, we conclude that for any fixed $\delta \in (0,1)$, with probability at least $1-\delta$ over the random choice of $\Theta_0$, we must have

$$\lambda_{\min}\left(G(\Theta)\right) \ge \lambda_{\min}\left(G(\Theta_0)\right) - \|G(\Theta) - G(\Theta_0)\|_F \ge \frac{1}{2}\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)$$

for all $\Theta \in I(\Theta_0)$.

For the GD dynamics, we define the exit time of $I(\Theta_0)$ by

$$t_0 := \inf\left\{t : \Theta_t \notin I(\Theta_0)\right\}. \tag{8}$$
###### Lemma 3.

For any fixed $\delta \in (0,1)$, assume that $m$ is sufficiently large. Then with probability at least $1-\delta$ over the random choice of $\Theta_0$, the following holds for any $t \le t_0$:

$$\hat{R}_n(\Theta_t) \le e^{-m(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)})t}\,\hat{R}_n(\Theta_0).$$

###### Proof.

We have

$$\frac{d\hat{R}_n(\Theta_t)}{dt} = -\|\nabla_\Theta\hat{R}_n\|^2 \le -m\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)\hat{R}_n(\Theta_t),$$

where the last inequality is due to the fact that $\lambda_{\min}(G(\Theta_t)) \ge \frac{1}{2}(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)})$ for $t \le t_0$. Grönwall's inequality then completes the proof. ∎

We define two quantities:

$$p_n := \frac{4\sqrt{\hat{R}_n(\Theta_0)}}{m\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)}, \qquad q_n := p_n^2 + \beta p_n. \tag{9}$$

The following is the most crucial characterization of the GD dynamics.

###### Proposition 3.1.

For any $\delta \in (0,1)$, assume that $m$ is sufficiently large. Then, with probability at least $1-\delta$, the following holds for any $t \le t_0$ and $k \in [m]$:

$$|a_k(t) - a_k(0)| \le 2p_n, \qquad \|b_k(t) - b_k(0)\| \le 2q_n.$$
###### Proof.

First, we have

$$\|\nabla_{a_k}\hat{R}_n\|^2 = \left(\frac{1}{n}\sum_{i=1}^n e_i\sigma(x_i^Tb_k)\right)^2 \le 2\|b_k\|^2\hat{R}_n(\Theta), \qquad \|\nabla_{b_k}\hat{R}_n\|^2 = \left\|\frac{1}{n}\sum_{i=1}^n e_ia_k\sigma'(x_i^Tb_k)x_i\right\|^2 \le 2a_k^2\hat{R}_n(\Theta).$$

To facilitate the analysis, we define the following two quantities:

$$\alpha_k(t) = \max_{s\in[0,t]}|a_k(s)|, \qquad \omega_k(t) = \max_{s\in[0,t]}\|b_k(s)\|.$$

Using Lemma 3, we have

$$\begin{aligned}
\|b_k(t) - b_k(0)\| &\le \int_0^t\|\nabla_{b_k}\hat{R}_n(\Theta_{t'})\|\,dt' \le 2\int_0^t\alpha_k(t)\sqrt{\hat{R}_n(\Theta_{t'})}\,dt' \le \frac{4\sqrt{\hat{R}_n(\Theta_0)}\,\alpha_k(t)}{m\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)} = p_n\alpha_k(t), \\
|a_k(t) - a_k(0)| &\le \int_0^t|\nabla_{a_k}\hat{R}_n(\Theta_{t'})|\,dt' \le 2\int_0^t\omega_k(t)\sqrt{\hat{R}_n(\Theta_{t'})}\,dt' \le \frac{4\sqrt{\hat{R}_n(\Theta_0)}\,\omega_k(t)}{m\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)} = p_n\omega_k(t).
\end{aligned} \tag{10}$$

Combining the two inequalities above and using $\|b_k(0)\| = 1$, we get

$$\alpha_k(t) \le |a_k(0)| + p_n\left(1 + p_n\alpha_k(t)\right).$$

Using Lemma 1 and the assumption that $m$ is sufficiently large, we have

$$p_n \le \frac{4\left(1 + c(\delta)\sqrt{m}\,\beta\right)}{m\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)} \le \frac{4}{m\lambda_n^{(a)}} + \frac{4c(\delta)}{\sqrt{m\lambda_n^{(a)}\lambda_n^{(b)}}} \le \frac{1}{2}. \tag{11}$$

Therefore,

$$\alpha_k(t) \le (1 - p_n^2)^{-1}(p_n + \beta) \le 2(p_n + \beta).$$

Inserting the above estimate back into (10), we obtain

$$\|b_k(t) - b_k(0)\| \le 2p_n^2 + 2\beta p_n = 2q_n.$$

Similarly, since $m$ is sufficiently large, we have

$$2\beta p_n \le \frac{8\beta\left(1 + c(\delta)\sqrt{m}\,\beta\right)}{m\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)} \le \frac{8\beta}{m\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right)} + \frac{8c(\delta)}{\sqrt{m}\,\lambda_n^{(b)}} \le \frac{4}{m\sqrt{\lambda_n^{(a)}\lambda_n^{(b)}}} + \frac{8c(\delta)}{\sqrt{m}\,\lambda_n^{(b)}} \le \frac{1}{2}. \tag{12}$$

Therefore $\omega_k(t) \le 1 + 2q_n \le 2$, which leads to

$$|a_k(t) - a_k(0)| \le p_n\omega_k(t) \le 2p_n. \qquad ∎$$

The following lemma shows how $p_n$ and $q_n$ depend on $m$ and $\beta$.

###### Lemma 4.

For any $\delta \in (0,1)$, assume that $m$ is sufficiently large, and let $C(\delta)$ denote a constant depending only on $\delta$. For small values of $\beta$, we have

$$p_n \le \frac{C(\delta)}{\sqrt{m}\,\lambda_n^{(a)}}\left(\frac{1}{\sqrt{m}} + \beta\right), \qquad q_n \le \frac{C(\delta)}{m\left(\lambda_n^{(a)}\right)^2}\left(\frac{1}{m} + \frac{2\beta}{\sqrt{m}} + \beta^2\right) + \frac{C(\delta)\beta}{m\lambda_n^{(a)}} + \frac{C(\delta)\beta^2}{\sqrt{m}\,\lambda_n^{(a)}}. \tag{13}$$

For general $\beta$, we have

$$p_n \le \frac{C(\delta)}{\sqrt{m\lambda_n^{(a)}\lambda_n^{(b)}}}, \qquad q_n \le \frac{C(\delta)}{\sqrt{m}\,\lambda_n^{(b)}}. \tag{14}$$

### 3.3 Global convergence for arbitrary labels

Proposition 3.1 and Lemma 4 tell us that, no matter how large $t$ is, the parameters stay within a small neighborhood of their initialization once $m$ is sufficiently large. This actually implies that the GD dynamics always stays in $I(\Theta_0)$, i.e. $t_0 = \infty$.

###### Theorem 3.2.

For any $\delta \in (0,1)$, assume that $m$ is sufficiently large. Then with probability at least $1-\delta$ over the random initialization, we have

$$\hat{R}_n(\Theta_t) \le e^{-m(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)})t}\,\hat{R}_n(\Theta_0)$$

for any $t \ge 0$.

###### Proof.

According to Lemma 3, we only need to prove that $t_0 = \infty$. Assume for contradiction that $t_0 < \infty$.

Let us first consider the Gram matrix $G^{(a)}$. Since $\sigma$ is Lipschitz and $\|x_i\| = 1$, we have

$$\begin{aligned}
\left|G^{(a)}_{i,j}(\Theta_{t_0}) - G^{(a)}_{i,j}(\Theta_0)\right| &= \frac{1}{nm}\left|\sum_{k=1}^m\left(\sigma(b_k^T(t_0)x_i)\sigma(b_k^T(t_0)x_j) - \sigma(b_k^T(0)x_i)\sigma(b_k^T(0)x_j)\right)\right| \\
&\le \frac{1}{nm}\sum_{k=1}^m\left(2\|b_k(t_0) - b_k(0)\| + \|b_k(t_0) - b_k(0)\|^2\right) \le \frac{3q_n}{n}.
\end{aligned}$$

Hence

$$\|G^{(a)}(\Theta_{t_0}) - G^{(a)}(\Theta_0)\|_F \le 3q_n. \tag{15}$$

Next we turn to the Gram matrix $G^{(b)}$. Define the event

$$D_{i,k} = \left\{b_k(0) : \|b_k(t_0) - b_k(0)\| \le q_n,\ \sigma'(b_k^T(t_0)x_i) \ne \sigma'(b_k^T(0)x_i)\right\}.$$

Since $\sigma$ is ReLU, this event happens only if $|b_k^T(0)x_i| \le q_n$. By the fact that $\|x_i\| = 1$ and $b_k(0)$ is drawn from the uniform distribution over the sphere, we have $\mathbb{P}(D_{i,k}) \lesssim q_n$. Therefore the entry-wise deviation of $G^{(b)}$ satisfies

$$n\left|G^{(b)}_{i,j}(\Theta_{t_0}) - G^{(b)}_{i,j}(\Theta_0)\right| \le \frac{|x_i^Tx_j|}{m}\left|\sum_{k=1}^m\left(a_k^2(t_0)\sigma'(b_k^T(t_0)x_i)\sigma'(b_k^T(t_0)x_j) - a_k^2(0)\sigma'(b_k^T(0)x_i)\sigma'(b_k^T(0)x_j)\right)\right| \le \frac{1}{m}\sum_{k=1}^m\left(a_k^2(t_0)Q_{k,i,j} + P_k\right),$$

where

$$Q_{k,i,j} = \left|\sigma'(x_i^Tb_k(t_0))\sigma'(x_j^Tb_k(t_0)) - \sigma'(x_i^Tb_k(0))\sigma'(x_j^Tb_k(0))\right|, \qquad P_k = \left|a_k^2(t_0) - a_k^2(0)\right|.$$

Note that $\mathbb{E}[Q_{k,i,j}] \le \mathbb{P}(D_{i,k}) + \mathbb{P}(D_{j,k}) \lesssim q_n$. In addition, by Proposition 3.1, we have

$$P_k \le (\beta + 2p_n)^2 - \beta^2 \lesssim q_n, \qquad a_k^2(t_0) \le a_k^2(0) + P_k \lesssim \beta^2 + q_n.$$

Hence, using $\mathbb{E}[Q_{k,i,j}] \lesssim q_n$, we obtain

$$n\,\mathbb{E}\left[\left|G^{(b)}_{i,j}(\Theta_{t_0}) - G^{(b)}_{i,j}(\Theta_0)\right|\right] \lesssim (\beta^2 + q_n)q_n + q_n \lesssim (1 + \beta^2)q_n. \tag{16}$$

By the Markov inequality, with probability at least $1-\delta$ we have

$$\left|G^{(b)}_{i,j}(\Theta_{t_0}) - G^{(b)}_{i,j}(\Theta_0)\right| \le \frac{(1+\beta^2)q_n}{\delta}.$$

Consequently, with probability at least $1-\delta$ we have

$$\|G^{(b)}(\Theta_{t_0}) - G^{(b)}(\Theta_0)\|_F \lesssim \frac{(1+\beta^2)nq_n}{\delta}. \tag{17}$$

Combining (15) and (17), we get

$$\|G(\Theta_{t_0}) - G(\Theta_0)\|_F \le \|G^{(a)}(\Theta_{t_0}) - G^{(a)}(\Theta_0)\|_F + \|G^{(b)}(\Theta_{t_0}) - G^{(b)}(\Theta_0)\|_F \lesssim 3q_n + \frac{(1+\beta^2)nq_n}{\delta} \lesssim \frac{(n\delta^{-1} + 1)C(\delta)}{\sqrt{m}\,\lambda_n^{(b)}} + \frac{\beta^2n\delta^{-1}C(\delta)}{\sqrt{m}\,\lambda_n^{(b)}},$$

where the last inequality follows from Lemma 4. Taking $m$ large enough, we get

$$\|G(\Theta_{t_0}) - G(\Theta_0)\|_F < \frac{1}{4}\left(\lambda_n^{(a)} + \beta^2\lambda_n^{(b)}\right).$$

This contradicts the definition of $t_0$. Therefore $t_0 = \infty$. ∎
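The mechanism behind Theorem 3.2 can be illustrated empirically: along training, the Gram matrix $G(\Theta_t)$ barely moves away from $G(\Theta_0)$, so $\lambda_{\min}(G)$ stays bounded away from zero and the risk decays even for pure-noise labels. The sketch below is illustrative only (ReLU, random $\pm 1$ labels, and all sizes and step counts are assumptions, not values from the paper).

```python
import numpy as np

# Empirical illustration of Theorem 3.2: the Gram matrix G(Theta_t) stays
# close to G(Theta_0) in Frobenius norm while the risk decays, even for
# random labels.
rng = np.random.default_rng(5)
d, m, n = 5, 1000, 10
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)            # pure-noise labels
B = rng.normal(size=(m, d)); B /= np.linalg.norm(B, axis=1, keepdims=True)
beta = 1.0 / np.sqrt(m)
a = beta * rng.choice([-1.0, 1.0], size=m)

def gram(a, B):
    # G = G^(a) + G^(b) as defined in Section 3 (ReLU case).
    Z = X @ B.T
    S, Sp = np.maximum(Z, 0.0), (Z > 0).astype(float)
    return S @ S.T / (n * m) + (Sp * a**2) @ Sp.T * (X @ X.T) / (n * m)

G0 = gram(a, B)
risk0 = 0.5 * np.mean((np.maximum(X @ B.T, 0.0) @ a - y) ** 2)

dt = 2e-3
for _ in range(2000):
    Z = X @ B.T; S = np.maximum(Z, 0.0)
    e = (S @ a - y) / n
    a_new = a - dt * S.T @ e
    B -= dt * ((e[:, None] * (Z > 0)) * a).T @ X
    a = a_new

risk1 = 0.5 * np.mean((np.maximum(X @ B.T, 0.0) @ a - y) ** 2)
drift = np.linalg.norm(gram(a, B) - G0) / np.linalg.norm(G0)
print(risk0, risk1, drift)   # risk decreases; the Gram matrix moves only slightly
```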

###### Remark 2.

Compared with Proposition 3.1, the above theorem imposes a stronger assumption on the network width $m$. This is due to the lack of continuity of $\sigma'$ when handling $G^{(b)}$. If $\sigma'$ is continuous, we can get rid of this stronger dependence. In addition, it is also possible to remove this assumption when $\beta$ is small, since in that case the Gram matrix $G$ is dominated by $G^{(a)}$.

###### Remark 3.

Theorem 3.2 is closely related to the result of Du et al. [12], where exponential convergence to global minima was first proved for over-parametrized two-layer neural networks. But it improves the result of [12] in two respects. First, as is done in practice, we allow the parameters in both layers to be updated, while [12] chooses to freeze the parameters in the output layer. Second, our analysis does not impose any specific requirement on the scale of the initialization, whereas the proof of [12] relies on a specific scaling.

### 3.4 Characterization of the whole GD trajectory

In the last subsection, we showed that very wide networks can fit arbitrary labels. In this subsection, we study the functions represented by such networks. We show that for highly over-parametrized two-layer neural networks, the solution of the GD dynamics is uniformly close to the solution for the random feature model starting from the same initial function.

###### Theorem 3.3.

Assume that $\sigma$ is ReLU. Denote the solution of the GD dynamics for the random feature model by

$$f^{\mathrm{ker}}_m(x,t) = f_m(x;\tilde{a}_t, B_0),$$

where $\tilde{a}_t$ is the solution of the GD dynamics (5) with $\tilde{a}_0 = a_0$. For any $\delta \in (0,1)$, assume that $m$ is sufficiently large. Then with probability at least $1-\delta$ we have

$$\left|f_m(x;\Theta_t) - f^{\mathrm{ker}}_m(x,t)\right| \lesssim \frac{c^2(\delta)}{\lambda_n^{(a)}}\left(\frac{1}{\sqrt{m}} + \beta + \sqrt{m}\,\beta^3\right), \tag{18}$$

where $c(\delta)$ is a constant depending only on $\delta$.

###### Remark 4.

Again, the stronger requirement on $m$ can be removed if $\sigma'$ is assumed to be smooth or $\beta$ is assumed to be small (see the remark following Theorem 3.2).

###### Remark 5.

If $\beta = o(m^{-1/6})$, the right-hand side of (18) goes to $0$ as $m \to \infty$. For example, if we take $\beta \le 1/\sqrt{m}$, we have

$$\left|f_m(x;\Theta_t) - f^{\mathrm{ker}}_m(x,t)\right| \lesssim \frac{c^2(\delta)}{\lambda_n^{(a)}\sqrt{m}}. \tag{19}$$

Hence this theorem says that the GD trajectory of a very wide network is uniformly close to the GD trajectory of the related kernel method (5).
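The closeness asserted by Theorem 3.3 can be observed directly by running the two dynamics side by side from the same initialization and tracking the uniform deviation on held-out points. The following sketch is illustrative only: the sizes, the step size, the number of steps, and the choice $\beta = 1/\sqrt{m}$ are assumptions, not values from the paper.

```python
import numpy as np

# Side-by-side simulation: two-layer network GD (both layers move) versus
# the random feature model (5) (B_0 frozen), started from the same function.
rng = np.random.default_rng(4)
d, m, n = 5, 2000, 10
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
B0 = rng.normal(size=(m, d)); B0 /= np.linalg.norm(B0, axis=1, keepdims=True)
beta = 1.0 / np.sqrt(m)
a0 = beta * rng.choice([-1.0, 1.0], size=m)

def relu(z):
    return np.maximum(z, 0.0)

# Held-out points on the sphere where the two functions are compared.
Xtest = rng.normal(size=(50, d))
Xtest /= np.linalg.norm(Xtest, axis=1, keepdims=True)

a, B = a0.copy(), B0.copy()       # neural network: both layers move
a_rf = a0.copy()                  # random feature model: only output layer moves
dt, max_dev = 2e-3, 0.0
for _ in range(1000):
    # network step (forward Euler on (2))
    Z = X @ B.T; S = relu(Z)
    e = (S @ a - y) / n
    a_new = a - dt * (S.T @ e)
    B -= dt * ((e[:, None] * (Z > 0)) * a).T @ X
    a = a_new
    # random feature step (forward Euler on (5))
    S0 = relu(X @ B0.T)
    a_rf -= dt * S0.T @ ((S0 @ a_rf - y) / n)
    # uniform deviation between the two functions on the test set
    dev = np.max(np.abs(relu(Xtest @ B.T) @ a - relu(Xtest @ B0.T) @ a_rf))
    max_dev = max(max_dev, dev)

print(max_dev)   # small when m is large, consistent with the m^{-1/2} rate in (19)
```

Rerunning with a smaller width $m$ makes the deviation grow, which is one way to see the $1/\sqrt{m}$ scaling suggested by (19).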

#### Proof of Theorem 3.3

We define

$$g^{(a)}(x,x') = \frac{1}{mn}\sum_{k=1}^m\sigma(b_k(0)^Tx)\sigma(b_k(0)^Tx'), \qquad g(x,x',t) = \frac{1}{mn}\sum_{k=1}^m\left(\sigma(b_k(t)^Tx)\sigma(b_k(t)^Tx') + a_k(t)^2\sigma'(b_k(t)^Tx)\sigma'(b_k(t)^Tx')x^Tx'\right). \tag{20}$$

Recalling the definition of $G$ in Section 3, we see that $g^{(a)}(x_i,x_j) = G^{(a)}_{i,j}(\Theta_0)$ and $g(x_i,x_j,t) = G_{i,j}(\Theta_t)$. For any $x$, let $g^{(a)}(x)$ and $g(x,t)$ be two $n$-dimensional vectors defined by

$$g^{(a)}_i(x) = g^{(a)}(x,x_i), \qquad g_i(x,t) = g(x,x_i,t). \tag{21}$$

For the GD dynamics (2), define $e_i(t) = f_m(x_i;\Theta_t) - y_i$ and $e(t) = (e_1(t),\dots,e_n(t))^T$. Then we have

$$\frac{d}{dt}e(t) = -mG(\Theta_t)e(t), \qquad \frac{d}{dt}f_m(x;a_t,B_t) = -m\,g(x,t)^Te(t). \tag{22}$$

For the GD dynamics (5) of the random feature model, we define $\tilde{e}_i(t) = f_m(x_i;\tilde{a}_t,B_0) - y_i$. Then we have

$$\frac{d}{dt}\tilde{e}(t) = -mG^{(a)}(\Theta_0)\tilde{e}(t). \tag{23}$$