# Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Recent work has revealed that overparameterized networks trained by gradient descent achieve arbitrarily low training error, and sometimes even low test error. The required width, however, is always polynomial in at least one of the sample size n, the (inverse) training error 1/ϵ, and the (inverse) failure probability 1/δ. This work shows that O(1/ϵ) iterations of gradient descent with Ω(1/ϵ^2) training examples on two-layer networks of any width exceeding polylog(n,1/ϵ,1/δ) suffice to achieve a test error of ϵ. The analysis further relies upon a margin property of the limiting kernel, which is guaranteed positive, and can distinguish between true labels and random labels.


## 1 Introduction

Despite the extensive empirical success of deep networks, their optimization and generalization properties are still not well understood. Recently, the neural tangent kernel (NTK) has provided the following insight into the problem. In the infinite-width limit, the NTK converges to a limiting kernel which stays constant during training; on the other hand, when the width is large enough, the function learned by gradient descent follows the NTK (Jacot et al., 2018). This motivates the study of overparameterized networks trained by gradient descent, using properties of the NTK. In fact, parameters related to the NTK, such as the minimum eigenvalue of the limiting kernel, appear to affect optimization and generalization (Arora et al., 2019).

However, in addition to such NTK-dependent parameters, prior work also requires the width to depend polynomially on $n$, $1/\delta$ or $1/\epsilon$, where $n$ denotes the size of the training set, $\delta$ denotes the failure probability, and $\epsilon$ denotes the target error. These large widths far exceed what is used empirically, constituting a significant gap between theory and practice.

#### Our contributions.

In this paper, we narrow this gap by showing that a two-layer ReLU network with $\Omega\big(\ln(n/\delta)+\ln(1/\epsilon)^2\big)$ hidden units trained by gradient descent achieves classification error $\epsilon$ on test data, meaning both optimization and generalization occur. Unlike prior work, the width is fully polylogarithmic in $n$, $1/\delta$, and $1/\epsilon$; the width will additionally depend on the separation margin of the limiting kernel, a quantity which is guaranteed positive (assuming no inputs are duplicated with noisy labels), and can distinguish between true labels and random labels. The organization of the paper, together with some details, is described below.

Section 2 studies gradient descent on the training set. Using the geometry inherent in classification tasks, we prove that with any width at least polylogarithmic and any constant step size no larger than $1$, gradient descent achieves training error $\epsilon$ in $\widetilde{O}(1/\epsilon)$ iterations. As is common in the NTK literature (Chizat and Bach, 2019), we also show the parameters hardly change, which will be essential to our generalization analysis.

Section 3 gives a test error bound. Concretely, using the preceding gradient descent analysis, and standard Rademacher tools but exploiting how little the weights moved, we show that with $\widetilde{\Omega}(1/\epsilon^2)$ samples and $\widetilde{O}(1/\epsilon)$ iterations, gradient descent finds a solution with $\epsilon$ test error. (As discussed at the end of Section 3, $\widetilde{\Omega}(1/\epsilon)$ samples also suffice via a smoothness-based generalization bound, at the expense of large constant factors.)

Section 4 considers stochastic gradient descent (SGD) with access to a standard stochastic online oracle. We prove that with width at least polylogarithmic and sample complexity $\widetilde{O}(1/\epsilon)$, SGD achieves an arbitrarily small test error.

Section 5 discusses the separability condition, which is in general a positive number, but reflects the difficulty of the classification problem. Regarding random labels, we show that starting from a distribution with a good positive margin, but replacing the labels with random noise, the margin can degrade all the way down to $O(1/\sqrt{n})$, which (correctly) removes the possibility of generalization. In this way, our analysis can distinguish between true labels and random labels.

Section 6 concludes with some open problems.

### 1.1 Related work

There has been a large literature studying gradient descent on overparameterized networks via the NTK. The most closely related work is (Nitanda and Suzuki, 2019), which shows that a two-layer network trained by gradient descent with the logistic loss can achieve a small test error, under the same assumption that the neural tangent model with respect to the first layer can separate the data distribution. However, they analyze smooth activations, while we handle the ReLU. They require hidden units, data samples, and steps, while our result only needs polylogarithmic hidden units, $\widetilde{\Omega}(1/\epsilon^2)$ data samples, and $\widetilde{O}(1/\epsilon)$ steps.

Additionally on shallow networks, Du et al. (2018b) prove that on an overparameterized two-layer network, gradient descent can globally minimize the empirical risk with the squared loss. Their result requires hidden units. Oymak and Soltanolkotabi (2019) further reduces the required overparameterization, but it still has a dependency. Using the same amount of overparameterization as (Du et al., 2018b), Arora et al. (2019) further show that the two-layer network learned by gradient descent can achieve a small test error, assuming that on the data distribution the smallest eigenvalue of the limiting kernel is at least some positive constant. They also give a fine-grained characterization of the predictions made by gradient descent iterates; such a characterization makes use of a special property of the squared loss and cannot be applied to the logistic regression setting.

Li and Liang (2018) show that stochastic gradient descent (SGD) with the cross entropy loss can learn a two-layer network with small test error, using hidden units, where is at least the covering number of the unit sphere using balls whose radii are no larger than the smallest distance between two data points with different labels. In a high-dimensional space, could be very large. Allen-Zhu et al. (2018a) consider SGD on a two-layer network, and a variant of SGD on a three-layer network. The three-layer analysis further exhibits some properties not captured by the NTK. They assume a ground truth network with infinite-order smooth activations, and they require the width to depend polynomially on and some constants related to the smoothness of the activations of the ground truth network.

On deep networks, a variety of works have established low training error (Allen-Zhu et al., 2018b; Du et al., 2018a; Zou et al., 2018; Zou and Gu, 2019). Cao and Gu (2019a) assume that the neural tangent model with respect to the second layer of a two-layer network can separate the data distribution, and prove that gradient descent on a deep network can achieve $\epsilon$ test error with samples and hidden units. Cao and Gu (2019b) consider SGD with an online oracle and give a general result. Under the same assumption as in (Cao and Gu, 2019a), their result requires hidden units and sample complexity . By contrast, with the same online oracle, our result only needs polylogarithmic hidden units and sample complexity $\widetilde{O}(1/\epsilon)$.

### 1.2 Notation

The dataset is denoted by $\{(x_i,y_i)\}_{i=1}^n$ where $x_i\in\mathbb{R}^d$ and $y_i\in\{-1,+1\}$. For simplicity, we assume that $\|x_i\|_2=1$ for any $1\le i\le n$, which is standard in the NTK literature.

The two-layer network has weight matrices $W\in\mathbb{R}^{m\times d}$ and $a\in\mathbb{R}^m$. We use the following parameterization, which is also used in (Du et al., 2018b; Arora et al., 2019):

$$
f(x;W,a) := \frac{1}{\sqrt{m}}\sum_{s=1}^m a_s\,\sigma\big(\langle w_s,x\rangle\big),
$$

with initialization

$$
w_{s,0}\sim\mathcal{N}(0,I_d),\quad\text{and}\quad a_s\sim\mathrm{unif}\big(\{-1,+1\}\big).
$$

Note that in this paper, $w_{s,t}$ denotes the $s$-th row of $W$ at step $t$. We fix $a$ and only train $W$, as in (Li and Liang, 2018; Du et al., 2018b; Arora et al., 2019; Nitanda and Suzuki, 2019). We consider the ReLU activation $\sigma(z):=\max\{0,z\}$, though our analysis can be extended easily to Lipschitz continuous, positively homogeneous activations such as leaky ReLU.
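As a minimal sketch of this parameterization (plain Python with the standard library only; the helper names `init_params` and `forward`, the seed, and the toy sizes are illustrative assumptions and not part of the paper), the initialization and forward pass might be written as:

```python
import math
import random


def init_params(m, d, rng):
    """Sample w_{s,0} ~ N(0, I_d) row by row and a_s ~ unif({-1, +1})."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return W, a


def forward(x, W, a):
    """f(x; W, a) = (1 / sqrt(m)) * sum_s a_s * relu(<w_s, x>)."""
    total = 0.0
    for w_s, a_s in zip(W, a):
        pre = sum(w * xi for w, xi in zip(w_s, x))
        total += a_s * max(0.0, pre)  # ReLU activation
    return total / math.sqrt(len(W))


rng = random.Random(0)  # fixed seed, only for reproducibility
W0, a = init_params(m=64, d=3, rng=rng)
x = [1.0, 0.0, 0.0]     # a unit-norm input
print(forward(x, W0, a))
```

Only `W` would be updated during training in this setting; `a` stays at its sampled value.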

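We use the logistic (binary cross entropy) loss $\ell(z):=\ln\big(1+\exp(-z)\big)$ and gradient descent. For any $1\le i\le n$ and any $W$, let $f_i(W):=f(x_i;W,a)$. The empirical risk and its gradient are given by

$$
\widehat{\mathcal{R}}(W) := \frac{1}{n}\sum_{i=1}^n \ell\big(y_i f_i(W)\big),\quad\text{and}\quad \nabla\widehat{\mathcal{R}}(W) = \frac{1}{n}\sum_{i=1}^n \ell'\big(y_i f_i(W)\big)\,y_i\nabla f_i(W).
$$

For any $t\ge 0$, the gradient descent step is given by $W_{t+1}:=W_t-\eta_t\nabla\widehat{\mathcal{R}}(W_t)$. Also define

$$
f_i^{(t)}(W) := \big\langle \nabla f_i(W_t), W\big\rangle,\quad\text{and}\quad \widehat{\mathcal{R}}^{(t)}(W) := \frac{1}{n}\sum_{i=1}^n \ell\big(y_i f_i^{(t)}(W)\big).
$$

Note that $f_i^{(t)}(W_t)=f_i(W_t)$. This property generally holds due to homogeneity: for any $W$ and any $1\le s\le m$,

$$
\frac{\partial f_i}{\partial w_s} = \frac{1}{\sqrt{m}}\,a_s\,\mathbb{1}\big[\langle w_s,x_i\rangle>0\big]\,x_i,\quad\text{and}\quad \Big\langle \frac{\partial f_i}{\partial w_s}, w_s\Big\rangle = \frac{1}{\sqrt{m}}\,a_s\,\sigma\big(\langle w_s,x_i\rangle\big),
$$

and thus $\langle\nabla f_i(W),W\rangle=f_i(W)$.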
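The homogeneity identity and the full-batch update can be checked numerically. The following is a sketch under stated assumptions (standard-library Python; the function names, seed, and toy sizes are illustrative, not from the paper): it builds the row-wise gradient, confirms that the inner product of the gradient with the weights matches the network output up to rounding, and defines one gradient descent step on the logistic empirical risk.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def forward(x, W, a):
    """f(x; W, a) with the 1/sqrt(m) scaling and ReLU activation."""
    return sum(a_s * max(0.0, dot(w_s, x)) for w_s, a_s in zip(W, a)) / math.sqrt(len(W))


def grad_f(x, W, a):
    """Row s of the gradient: (1/sqrt(m)) * a_s * 1[<w_s, x> > 0] * x."""
    scale = 1.0 / math.sqrt(len(W))
    return [[scale * a_s * xi if dot(w_s, x) > 0 else 0.0 for xi in x]
            for w_s, a_s in zip(W, a)]


def logistic_loss_prime(z):
    """Derivative of l(z) = ln(1 + exp(-z)), computed stably."""
    if z >= 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))


def gd_step(W, a, data, eta):
    """One full-batch step W <- W - eta * (gradient of the empirical risk)."""
    n = len(data)
    new_W = [row[:] for row in W]
    for x, y in data:
        coef = logistic_loss_prime(y * forward(x, W, a)) * y / n
        for s, g_row in enumerate(grad_f(x, W, a)):
            for j, g in enumerate(g_row):
                new_W[s][j] -= eta * coef * g
    return new_W


rng = random.Random(1)
m, d = 32, 2
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
x = [0.6, 0.8]  # unit norm

# Homogeneity: <grad f(W), W> should match f(W) up to rounding.
inner = sum(dot(g_row, w_row) for g_row, w_row in zip(grad_f(x, W, a), W))
print(abs(inner - forward(x, W, a)) < 1e-9)
```

The gradient at a point where some pre-activation is exactly zero is set to zero here, matching the strict inequality in the indicator.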

## 2 Empirical risk minimization

In this section, we consider a fixed training set and empirical risk minimization. We first state our assumption on the separability of the neural tangent model, and then give our main result and a proof sketch.

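Here is some additional notation. Let $\mu_{\mathcal{N}}$ denote the Gaussian measure on $\mathbb{R}^d$, given by the Gaussian density with respect to the Lebesgue measure on $\mathbb{R}^d$. We consider the following Hilbert space

$$
\mathcal{H} := \Big\{ w:\mathbb{R}^d\to\mathbb{R}^d \ \Big|\ \int \|w(z)\|_2^2\,\mathrm{d}\mu_{\mathcal{N}}(z)<\infty \Big\}.
$$

For each $1\le i\le n$, define $\phi_i:\mathbb{R}^d\to\mathbb{R}^d$ by

$$
\phi_i(z) := x_i\,\mathbb{1}\big[\langle z,x_i\rangle>0\big].
$$

One can verify that $\|\phi_i(z)\|_2\le\|x_i\|_2=1$ for any $z$, and thus $\phi_i$ is indeed in $\mathcal{H}$.

We make the following separability assumption. There exist $\bar{v}\in\mathcal{H}$ and $\gamma>0$ such that $\|\bar{v}(z)\|_2\le 1$ for any $z\in\mathbb{R}^d$, and for any $1\le i\le n$,

$$
y_i\big\langle \bar{v},\phi_i\big\rangle_{\mathcal{H}} := y_i\int \big\langle \bar{v}(z),\phi_i(z)\big\rangle\,\mathrm{d}\mu_{\mathcal{N}}(z) \ge \gamma.
$$

A more natural assumption is that the infinite-width limit of the NTK can separate the training set with a positive margin. This assumption actually implies the separability assumption above (cf. Section 5). Some other discussion on the separability assumption is also given in Section 5.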
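As a simple illustration of the separability condition, consider the special case (an assumption made only for this example) in which the training set is linearly separable: there is a unit vector $u$ and a margin $\gamma_0>0$ with $y_i\langle u,x_i\rangle\ge\gamma_0$ for all $i$. Taking the constant map $\bar{v}(z):=u$, which satisfies $\|\bar{v}(z)\|_2\le 1$, and using that $\langle z,x_i\rangle>0$ has probability $1/2$ under the Gaussian measure, a short calculation gives:

```latex
\begin{aligned}
y_i \int \big\langle \bar{v}(z), \phi_i(z) \big\rangle \,\mathrm{d}\mu_{\mathcal{N}}(z)
&= y_i \langle u, x_i \rangle \int \mathbb{1}\big[\langle z, x_i\rangle > 0\big] \,\mathrm{d}\mu_{\mathcal{N}}(z) \\
&= \tfrac{1}{2}\, y_i \langle u, x_i \rangle
\;\ge\; \tfrac{\gamma_0}{2}.
\end{aligned}
```

So in this special case the condition holds with $\gamma=\gamma_0/2$; the condition itself is more general, since $\bar{v}$ is allowed to vary with $z$.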

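With this assumption, we state our main empirical risk result. Under the separability assumption above, given any risk target $\epsilon$ and any $\delta$, let

$$
\lambda := \frac{\sqrt{2\ln(4n/\delta)}+\ln(4/\epsilon)}{\gamma/3},\quad\text{and}\quad M := \max\Big\{ \frac{162\ln(n/\delta)}{\gamma^2},\ 25\ln\Big(\frac{2n}{\delta}\Big),\ \frac{324\lambda^2}{\gamma^3} \Big\}.
$$

Then for any $m\ge M$ and any constant step size $\eta\le 1$, with probability $1-3\delta$ over the random initialization,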

 1T∑t

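Moreover for any $0\le t<T$,

$$
\|W_t-W_0\|_F \le \sqrt{6}\,\lambda,\quad\text{and}\quad \|w_{s,t}-w_{s,0}\|_2 \le \frac{3\sqrt{6}\,\lambda}{\gamma\sqrt{m}}\ \text{ for any } 1\le s\le m.
$$

While the number of hidden units required by prior work all have a polynomial dependency on $n$, $1/\delta$ or $1/\epsilon$, our main result only requires $m=\Omega\big(\ln(n/\delta)+\ln(1/\epsilon)^2\big)$. In the rest of Section 2, we give a proof sketch of this result.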

### 2.1 Properties at initialization

In this subsection, we give some nice properties of random initialization. The proofs are given in Appendix A.

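Given an initialization $(W_0,a)$, for any $1\le s\le m$, define

$$
\bar{u}_s := \frac{1}{\sqrt{m}}\,a_s\,\bar{v}(w_{s,0}), \tag{1}
$$

where $\bar{v}$ is given by the separability assumption of Section 2. Collect $\bar{u}_s$ into a matrix $\overline{U}\in\mathbb{R}^{m\times d}$. It holds that $\|\bar{u}_s\|_2\le 1/\sqrt{m}$, and $\|\overline{U}\|_F\le 1$. The following lemma ensures that with high probability $\overline{U}$ has a positive margin at initialization. Under the separability assumption, given any $\delta$ and any $\epsilon_1>0$, if $m\ge 2\ln(n/\delta)/\epsilon_1^2$, then with probability $1-\delta$, it holds simultaneously for all $1\le i\le n$ that

$$
y_i f_i^{(0)}\big(\overline{U}\big) = y_i\big\langle \nabla f_i(W_0),\overline{U}\big\rangle \ge \gamma-\sqrt{\frac{2\ln(n/\delta)}{m}} \ge \gamma-\epsilon_1.
$$

For any $W$, any $\epsilon_2>0$, and any $1\le i\le n$, define

$$
\alpha_i(W,\epsilon_2) := \frac{1}{m}\sum_{s=1}^m \mathbb{1}\big[\,|\langle w_s,x_i\rangle|\le\epsilon_2\,\big].
$$

The next lemma controls $\alpha_i(W_0,\epsilon_2)$. It will help us show that $\overline{U}$ has a good margin during the training process. Under the condition of the previous lemma, for any $\epsilon_2>0$, with probability $1-\delta$, it holds simultaneously for all $1\le i\le n$ that

$$
\alpha_i\big(W_0,\epsilon_2\big) \le \sqrt{\frac{2}{\pi}}\,\epsilon_2+\sqrt{\frac{\ln(n/\delta)}{2m}} \le \epsilon_2+\frac{\epsilon_1}{2}.
$$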
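The bound on the fraction of near-zero pre-activations can be sanity-checked by simulation. The sketch below is illustrative only (standard-library Python; the width, threshold, seed, and the choice of a single input, i.e. $n=1$, are assumptions made for the example): it draws a Gaussian initialization, measures the fraction of rows with $|\langle w_s,x\rangle|\le\epsilon_2$, and compares it with $\sqrt{2/\pi}\,\epsilon_2+\sqrt{\ln(1/\delta)/(2m)}$.

```python
import math
import random


def alpha(W, x, eps2):
    """Fraction of rows w_s with |<w_s, x>| <= eps2."""
    hits = sum(1 for w_s in W
               if abs(sum(w * xi for w, xi in zip(w_s, x))) <= eps2)
    return hits / len(W)


rng = random.Random(0)
m, d, eps2, delta = 20000, 4, 0.5, 1e-6
W0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
x = [0.5, 0.5, 0.5, 0.5]  # unit norm, so <w, x> is a standard Gaussian

estimate = alpha(W0, x, eps2)
# Right-hand side of the lemma with n = 1 (a single input).
bound = math.sqrt(2.0 / math.pi) * eps2 + math.sqrt(math.log(1.0 / delta) / (2.0 * m))
print(estimate, bound)
```

The first term of the bound comes from the maximum of the standard Gaussian density, and the second from Hoeffding's inequality, so the estimate should sit slightly below the bound for small thresholds.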

Finally, the following lemma controls the output of the network at initialization. Given any $\delta$, if $m\ge 25\ln(2n/\delta)$, then with probability $1-\delta$, it holds simultaneously for all $1\le i\le n$ that

$$
\big|f(x_i;W_0,a)\big| \le \sqrt{2\ln\big(4n/\delta\big)}.
$$

### 2.2 Convergence analysis of gradient descent

We analyze gradient descent in this subsection. First, define

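$$
\widehat{\mathcal{Q}}(W) := \frac{1}{n}\sum_{i=1}^n -\ell'\big(y_i f_i(W)\big).
$$

We have the following observations.

• For any $W$ and any $1\le s\le m$, $\|\partial f_i/\partial w_s\|_2\le 1/\sqrt{m}$, and thus $\|\nabla f_i(W)\|_F\le 1$. Therefore by the triangle inequality, $\|\nabla\widehat{\mathcal{R}}(W)\|_F\le\widehat{\mathcal{Q}}(W)$.

• If $\ell$ is $1$-Lipschitz continuous (as with the logistic loss), then $\widehat{\mathcal{Q}}(W)\le 1$.

• If $-\ell'\le\ell$ (as with the logistic loss), then $\widehat{\mathcal{Q}}(W)\le\widehat{\mathcal{R}}(W)$.

With the above observations, we give the following general result which does not require staying close to initialization. It plays an important role in the proof of our main result, and could be useful when analyzing neural networks beyond the NTK setting. For any $t\ge 0$ and any $\overline{W}$, if $\eta_t\le 1$, then

$$
\eta_t\widehat{\mathcal{R}}(W_t) \le \big\|W_t-\overline{W}\big\|_F^2-\big\|W_{t+1}-\overline{W}\big\|_F^2+2\eta_t\widehat{\mathcal{R}}^{(t)}\big(\overline{W}\big).
$$

Consequently, if we use a constant step size $\eta\le 1$ for $0\le\tau<t$, then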

 η\del∑τ
###### Proof.

We have

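$$
\big\|W_{t+1}-\overline{W}\big\|_F^2 = \big\|W_t-\overline{W}\big\|_F^2-2\eta_t\big\langle \nabla\widehat{\mathcal{R}}(W_t),W_t-\overline{W}\big\rangle+\eta_t^2\big\|\nabla\widehat{\mathcal{R}}(W_t)\big\|_F^2. \tag{2}
$$

The first-order term of eq. 2 can be handled using the convexity of $\ell$ and homogeneity of ReLU:

$$
\begin{aligned}
\big\langle \nabla\widehat{\mathcal{R}}(W_t),W_t-\overline{W}\big\rangle
&= \frac{1}{n}\sum_{i=1}^n \ell'\big(y_i f_i(W_t)\big)\,y_i\big\langle \nabla f_i(W_t),W_t-\overline{W}\big\rangle \\
&= \frac{1}{n}\sum_{i=1}^n \ell'\big(y_i f_i(W_t)\big)\Big(y_i f_i(W_t)-y_i f_i^{(t)}\big(\overline{W}\big)\Big) \\
&\ge \frac{1}{n}\sum_{i=1}^n \Big(\ell\big(y_i f_i(W_t)\big)-\ell\big(y_i f_i^{(t)}\big(\overline{W}\big)\big)\Big)
= \widehat{\mathcal{R}}(W_t)-\widehat{\mathcal{R}}^{(t)}\big(\overline{W}\big).
\end{aligned} \tag{3}
$$

The second-order term of eq. 2 can be bounded as follows

$$
\eta_t^2\big\|\nabla\widehat{\mathcal{R}}(W_t)\big\|_F^2 \le \eta_t^2\,\widehat{\mathcal{Q}}(W_t)^2 \le \eta_t\,\widehat{\mathcal{Q}}(W_t) \le \eta_t\,\widehat{\mathcal{R}}(W_t), \tag{4}
$$

because $\|\nabla\widehat{\mathcal{R}}(W_t)\|_F\le\widehat{\mathcal{Q}}(W_t)$, and $\eta_t,\widehat{\mathcal{Q}}(W_t)\le 1$, and $\widehat{\mathcal{Q}}(W_t)\le\widehat{\mathcal{R}}(W_t)$. Combining eqs. 2, 3 and 4 gives

$$
\eta_t\widehat{\mathcal{R}}(W_t) \le \big\|W_t-\overline{W}\big\|_F^2-\big\|W_{t+1}-\overline{W}\big\|_F^2+2\eta_t\widehat{\mathcal{R}}^{(t)}\big(\overline{W}\big).
$$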

Telescoping gives the other claim. ∎
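To spell out the telescoping step, here is a short derivation, assuming a constant step size $\eta_\tau=\eta\le 1$ for all $\tau<t$. Summing the per-step inequality over $\tau=0,\ldots,t-1$, the squared distances cancel in pairs:

```latex
\begin{aligned}
\eta \sum_{\tau<t} \widehat{\mathcal{R}}(W_\tau)
&\le \sum_{\tau<t} \Big( \big\|W_\tau-\overline{W}\big\|_F^2 - \big\|W_{\tau+1}-\overline{W}\big\|_F^2 \Big)
   + 2\eta \sum_{\tau<t} \widehat{\mathcal{R}}^{(\tau)}\big(\overline{W}\big) \\
&= \big\|W_0-\overline{W}\big\|_F^2 - \big\|W_t-\overline{W}\big\|_F^2
   + 2\eta \sum_{\tau<t} \widehat{\mathcal{R}}^{(\tau)}\big(\overline{W}\big).
\end{aligned}
```

Since both $\|W_t-\overline{W}\|_F^2$ and the cumulative risk on the left are nonnegative, this one inequality can be read in two ways: it bounds the cumulative empirical risk, and it bounds the distance from $W_t$ to the reference point $\overline{W}$.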

Using the preceding lemma together with the three initialization properties of Section 2.1, we can prove the main result of Section 2. Below is a proof sketch; the full proof is given in Appendix A.

1. We first show that $\overline{U}$ defined in eq. 1 gives a positive margin at step $t$ as long as the activation patterns do not change too much from the initialization.

2. We then show that such a phase lasts for a long time with a mild overparameterization by giving a strong control of via Section 2.2. Prior work only shows an or upper bound on , which then requires the number of hidden units to be . By contrast, we are able to control by , which allows us to have a overparameterization.

3. Next we use Section 2.2 once again to get the empirical risk guarantee.

4. We also give an upper bound on , or . This will give us a Rademacher complexity bound in Section 3.

## 3 Generalization

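To get a generalization bound, we naturally extend the separability assumption of Section 2 to the following assumption on the data distribution $\mathcal{D}$, which is also made in (Nitanda and Suzuki, 2019) for smooth activations. There exist $\bar{v}\in\mathcal{H}$ and $\gamma>0$ such that $\|\bar{v}(z)\|_2\le 1$ for any $z\in\mathbb{R}^d$, and

$$
y\int \big\langle \bar{v}(z),x\big\rangle\,\mathbb{1}\big[\langle z,x\rangle>0\big]\,\mathrm{d}\mu_{\mathcal{N}}(z) \ge \gamma
$$

for any $(x,y)$ sampled from the data distribution (i.e., almost surely over $\mathcal{D}$).

Here is our test error bound under this assumption. Given any $\epsilon$ and any $\delta$, let $\lambda$ and $M$ be given as in Section 2:

$$
\lambda := \frac{\sqrt{2\ln(4n/\delta)}+\ln(4/\epsilon)}{\gamma/3},\quad\text{and}\quad M := \max\Big\{ \frac{162\ln(n/\delta)}{\gamma^2},\ 25\ln\Big(\frac{2n}{\delta}\Big),\ \frac{324\lambda^2}{\gamma^3} \Big\}.
$$

Then for any $m\ge M$ and any constant step size $\eta\le 1$, with probability $1-4\delta$ over the random initialization and data sampling,

$$
\mathbb{P}_{(x,y)\sim\mathcal{D}}\big(y f(x;W_k,a)\le 0\big) \le 2\epsilon+\frac{24\big(\sqrt{2\ln(4n/\delta)}+\ln(4/\epsilon)\big)}{\gamma^2\sqrt{n}}+6\sqrt{\frac{\ln(2/\delta)}{2n}},
$$

where $k$ denotes the step with the minimum empirical risk before $T$.

Below is a direct corollary of this bound. Under the same assumption, given any $\epsilon$ and $\delta$, using a constant step size $\eta\le 1$ and letting

$$
n = \widetilde{\Omega}\Big(\frac{1}{\gamma^4\epsilon^2}\Big),\quad\text{and}\quad m = \Omega\Big(\frac{\ln(n/\delta)+\ln(1/\epsilon)^2}{\gamma^5}\Big),
$$

it holds with probability $1-\delta$ that $\mathbb{P}_{(x,y)\sim\mathcal{D}}\big(y f(x;W_k,a)\le 0\big)\le\epsilon$, where $k$ denotes the step with the minimum empirical risk in the first $\widetilde{O}(1/\epsilon)$ steps.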

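To prove the test error bound, we consider the sigmoid mapping $-\ell'(z)=1/(1+e^{z})$, the empirical average $\widehat{\mathcal{Q}}(W_k)$, and the corresponding population average $\mathcal{Q}(W_k):=\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[-\ell'\big(y f(x;W_k,a)\big)\big]$. First of all, since $\mathbb{P}_{(x,y)\sim\mathcal{D}}\big(y f(x;W_k,a)\le 0\big)\le 2\mathcal{Q}(W_k)$, it is enough to control $\mathcal{Q}(W_k)$. Next, as $\widehat{\mathcal{Q}}(W_k)\le\widehat{\mathcal{R}}(W_k)$ is controlled by the main result of Section 2, it is enough to control the generalization error $\mathcal{Q}(W_k)-\widehat{\mathcal{Q}}(W_k)$. Moreover, since $-\ell'$ is supported on $[0,1]$ and $1$-Lipschitz, it is enough to bound the Rademacher complexity of the function space explored by gradient descent. Invoking the bound on $\|w_{s,k}-w_{s,0}\|_2$ from Section 2 finishes the proof. The proof details are given in Appendix B.

To get the test error bound above, we use a Lipschitz-based Rademacher complexity bound. One can also use a smoothness-based Rademacher complexity bound (Srebro et al., 2010, Theorem 1) and get a sample complexity $\widetilde{\Omega}(1/\epsilon)$. However, the bound will become complicated and some large constant will be introduced. It is an interesting open question to give a clean analysis based on smoothness.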

## 4 Stochastic gradient descent

There are several different formulations of SGD. In this section, we consider SGD with an online oracle. We randomly sample $W_0$ and $a$, and fix $a$ during training. At step $i$, a data example $(x_i,y_i)$ is sampled from the data distribution. We still let $f_i(W):=f(x_i;W,a)$, and perform the following update

$$
W_{i+1} := W_i-\eta_i\,\ell'\big(y_i f_i(W_i)\big)\,y_i\nabla f_i(W_i).
$$

Note that here $i$ starts from $0$.
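The online update can be sketched in a few lines. This is a minimal illustration in standard-library Python, not the paper's implementation: the names `sgd_step` and `sample_example`, the width, the step count, and in particular the toy data oracle (points on the unit circle labeled by the sign of the first coordinate) are all hypothetical choices made only so the loop runs.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def forward(x, W, a):
    return sum(a_s * max(0.0, dot(w_s, x)) for w_s, a_s in zip(W, a)) / math.sqrt(len(W))


def logistic_loss_prime(z):
    """Derivative of l(z) = ln(1 + exp(-z)), computed stably."""
    if z >= 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))


def sgd_step(W, a, x, y, eta):
    """W_{i+1} = W_i - eta * l'(y f(x; W_i, a)) * y * grad f(x; W_i, a)."""
    coef = eta * logistic_loss_prime(y * forward(x, W, a)) * y / math.sqrt(len(W))
    # Only rows with an active ReLU receive a nonzero gradient.
    return [[w - coef * a_s * xi for w, xi in zip(w_s, x)] if dot(w_s, x) > 0 else w_s[:]
            for w_s, a_s in zip(W, a)]


def sample_example(rng):
    """Hypothetical oracle: unit-norm input, label = sign of first coordinate."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    x = [math.cos(theta), math.sin(theta)]
    return x, (1.0 if x[0] >= 0 else -1.0)


rng = random.Random(0)
m = 256
W = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(m)]
a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
for i in range(500):  # i starts from 0; one fresh example per step
    x_i, y_i = sample_example(rng)
    W = sgd_step(W, a, x_i, y_i, eta=1.0)
```

Each example is used exactly once, so the number of steps equals the number of samples drawn from the oracle.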

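Still with the distributional separability assumption of Section 3, we show the following result. Given any $\epsilon$ and $\delta$, using a constant step size and a width that is at least polylogarithmic, it holds with probability $1-\delta$ that

$$
\frac{1}{n}\sum_{i=1}^n \mathbb{P}_{(x,y)\sim\mathcal{D}}\big(y f(x;W_i,a)\le 0\big) \le \epsilon,\quad\text{for}\quad n = \widetilde{O}\Big(\frac{1}{\gamma^2\epsilon}\Big).
$$

Below is a proof sketch; the details are given in Appendix C. For any $i\ge 0$, define

$$
\mathcal{R}_i(W) := \ell\big(y_i\langle \nabla f_i(W_i),W\rangle\big),\quad\text{and}\quad \mathcal{Q}_i(W) := -\ell'\big(y_i\langle \nabla f_i(W_i),W\rangle\big).
$$

Due to homogeneity, $\mathcal{R}_i(W_i)=\ell\big(y_i f_i(W_i)\big)$ and $\mathcal{Q}_i(W_i)=-\ell'\big(y_i f_i(W_i)\big)$.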

The first step is an extension of Section 2.2 to the SGD setting. The proofs are similar. With a constant step size , for any and any ,

 η\del∑t

With Section 4, we can also extend Section 2 to the SGD setting and get a bound on , using a similar proof. To further get a bound on the cumulative population risk , the key observation is that is a martingale. Using a martingale Bernstein bound, we prove the following lemma; applying it finishes the proof of Section 4. Given any , with probability ,

 ∑t

## 5 On separability

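Given a training set $\{(x_i,y_i)\}_{i=1}^n$, the linear kernel is defined as $K_0(x_i,x_j):=\langle x_i,x_j\rangle$. The maximum margin achievable by a linear classifier is given by

$$
\gamma_0 := \min_{q\in\Delta_n}\sqrt{(q\odot y)^\top K_0\,(q\odot y)}, \tag{5}
$$

where $\Delta_n$ denotes the probability simplex and $q\odot y$ denotes the element-wise product of $q$ and $y$. If the data is not linearly separable, $\gamma_0=0$.

In this paper we train the first layer of a two-layer network, and the kernel we consider is the NTK of the first layer:

$$
K_1(x_i,x_j) := \mathbb{E}\Big[\Big\langle \frac{\partial f(x_i;W_0,a)}{\partial W_0},\frac{\partial f(x_j;W_0,a)}{\partial W_0}\Big\rangle\Big] = \langle x_i,x_j\rangle\,\mathbb{E}_{w\sim\mathcal{N}(0,I_d)}\Big[\mathbb{1}\big[\langle x_i,w\rangle>0\big]\,\mathbb{1}\big[\langle x_j,w\rangle>0\big]\Big].
$$

Similar to the definition of $\gamma_0$, the margin given by $K_1$ is defined as

$$
\gamma_1 := \min_{q\in\Delta_n}\sqrt{(q\odot y)^\top K_1\,(q\odot y)}.
$$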

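Regarding the relation between $\gamma_1$ and the separability assumption of Section 2, we have the following result. If $\gamma_1>0$, then there exists $\hat{v}\in\mathcal{H}$ s.t. $\|\hat{v}\|_{\mathcal{H}}\le 1$, and $y_i\langle\hat{v},\phi_i\rangle_{\mathcal{H}}\ge\gamma_1$ for any $1\le i\le n$, and $\|\hat{v}(z)\|_2\le 1/\gamma_1$ for any $z\in\mathbb{R}^d$. The proof is given in Appendix D, and uses Fenchel duality theory. The rescaled map $\gamma_1\hat{v}$, with $\hat{v}$ given by this result, satisfies the separability assumption with $\gamma=\gamma_1^2$, but there might exist some $\bar{v}$ with a much better $\gamma$, since the upper bound $\|\hat{v}(z)\|_2\le 1/\gamma_1$ might be very loose.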

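We can further ask how large $\gamma_1$ could be. Oymak and Soltanolkotabi (2019, Corollary I.2) show that if for any two feature vectors $x_i$ and $x_j$, we have $\|x_i-x_j\|_2\ge\theta$ and $\|x_i+x_j\|_2\ge\theta$ for some $\theta>0$, then

$$
\lambda_0 := \lambda_{\min}(K_1) \ge \frac{\theta}{100n^2}.
$$

For arbitrary labels $y\in\{-1,+1\}^n$, since $\|q\odot y\|_2\ge 1/\sqrt{n}$ for any $q\in\Delta_n$, we have the worst case bound $\gamma_1^2\ge\theta/(100n^3)$. However, real world labels could give a much better $\gamma_1$. For example, a tighter lower bound on $\gamma_1^2$ is $\lambda_0/n_S$, where $n_S$ denotes the number of support vectors, which might be much smaller than $n$.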

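On the other hand, given any training set $\{(x_i,y_i)\}_{i=1}^n$ which may have a large margin, if we replace $y$ with random labels $\epsilon\sim\mathrm{unif}\big(\{-1,+1\}^n\big)$, with high probability the margin becomes $O(1/\sqrt{n})$. To see this, let $\hat{q}$ denote the uniform probability vector $(1/n,\ldots,1/n)$. Note that

$$
\begin{aligned}
\mathbb{E}_{\epsilon\sim\mathrm{unif}(\{-1,+1\}^n)}\Big[(\hat{q}\odot\epsilon)^\top K_1\,(\hat{q}\odot\epsilon)\Big]
&= \mathbb{E}_{\epsilon\sim\mathrm{unif}(\{-1,+1\}^n)}\Big[\sum_{i,j=1}^n \frac{1}{n^2}\,\epsilon_i\epsilon_j K_1(x_i,x_j)\Big] \\
&= \frac{1}{n^2}\sum_{i,j=1}^n \mathbb{E}_{\epsilon\sim\mathrm{unif}(\{-1,+1\}^n)}\big[\epsilon_i\epsilon_j K_1(x_i,x_j)\big] \\
&= \frac{1}{n^2}\sum_{i=1}^n K_1(x_i,x_i) = \frac{1}{2n}.
\end{aligned}
$$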

Since for any , by Hoeffding’s inequality it holds with high probability that , and thus the margin is .
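The expectation computed above can be confirmed exactly for a small example by enumerating all sign vectors. The sketch below is illustrative (standard-library Python; the inputs on the unit circle and the choice $n=6$ are hypothetical). It also relies on the closed form $\mathbb{P}(\langle x,w\rangle>0,\langle z,w\rangle>0)=(\pi-\theta)/(2\pi)$ for unit vectors at angle $\theta$, a standard fact about Gaussian vectors that is assumed here rather than taken from the text.

```python
import itertools
import math


def ntk_first_layer(x, z):
    """K_1(x, z) = <x, z> * P(<x, w> > 0, <z, w> > 0) for unit-norm x, z.

    Assumes the closed form (pi - angle) / (2 * pi) for the probability.
    """
    c = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, z))))
    return c * (math.pi - math.acos(c)) / (2.0 * math.pi)


n = 6
# Hypothetical unit-norm inputs placed on the unit circle.
xs = [[math.cos(0.7 * i), math.sin(0.7 * i)] for i in range(n)]
K1 = [[ntk_first_layer(xi, xj) for xj in xs] for xi in xs]

# Exact expectation over all 2^n sign vectors with q_hat = (1/n, ..., 1/n).
total = 0.0
for eps in itertools.product((-1.0, 1.0), repeat=n):
    total += sum(eps[i] * eps[j] * K1[i][j]
                 for i in range(n) for j in range(n)) / n**2
expectation = total / 2**n
print(expectation, 1.0 / (2 * n))
```

The off-diagonal terms average out because $\epsilon_i\epsilon_j$ has mean zero for $i\ne j$, leaving only the diagonal entries $K_1(x_i,x_i)=1/2$.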

## 6 Open problems

In this paper, we analyze gradient descent on a two-layer network in the NTK regime, where the weights stay close to the initialization. It is an interesting open question if gradient descent learns something beyond the NTK, after the iterates move far enough from the initial weights. It is also interesting to extend our analysis to other architectures, such as multi-layer networks, convolutional networks, and residual networks. Finally, in this paper we only discuss binary classification; it is interesting to see if it is possible to get similar results for other tasks, such as regression.

## Appendix A Omitted proofs from Section 2

###### Proof of Section 2.1.

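By the separability assumption of Section 2, given any $1\le i\le n$,

$$
\mu := \mathbb{E}_{w\sim\mathcal{N}(0,I_d)}\Big[y_i\big\langle \bar{v}(w),x_i\big\rangle\,\mathbb{1}\big[\langle w,x_i\rangle>0\big]\Big] \ge \gamma.
$$

On the other hand,

$$
y_i f_i^{(0)}\big(\overline{U}\big) = \frac{1}{m}\sum_{s=1}^m y_i\big\langle \bar{v}(w_{s,0}),x_i\big\rangle\,\mathbb{1}\big[\langle w_{s,0},x_i\rangle>0\big]
$$

is the empirical mean of i.i.d. r.v.'s supported on $[-1,+1]$ with mean $\mu$. Therefore by Hoeffding's inequality, with probability $1-\delta/n$,

$$
y_i f_i^{(0)}\big(\overline{U}\big)-\gamma \ge y_i f_i^{(0)}\big(\overline{U}\big)-\mu \ge -\sqrt{\frac{2\ln(n/\delta)}{m}}.
$$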

Applying a union bound finishes the proof. ∎

###### Proof of Section 2.1.

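Given any fixed $\epsilon_2$ and $1\le i\le n$,

$$
\mathbb{E}\big[\alpha_i(W_0,\epsilon_2)\big] = \mathbb{P}\big(|\langle w,x_i\rangle|\le\epsilon_2\big) \le \frac{2\epsilon_2}{\sqrt{2\pi}} = \sqrt{\frac{2}{\pi}}\,\epsilon_2,
$$

because $\langle w,x_i\rangle$ is a standard Gaussian r.v. and the density of standard Gaussian has maximum $1/\sqrt{2\pi}$. Since $\alpha_i(W_0,\epsilon_2)$ is the empirical mean of Bernoulli r.v.'s, by Hoeffding's inequality, with probability $1-\delta/n$,

$$
\alpha_i(W_0,\epsilon_2) \le \mathbb{E}\big[\alpha_i(W_0,\epsilon_2)\big]+\sqrt{\frac{\ln(n/\delta)}{2m}} \le \sqrt{\frac{2}{\pi}}\,\epsilon_2+\sqrt{\frac{\ln(n/\delta)}{2m}}.
$$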

Applying a union bound finishes the proof. ∎

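To prove the bound on the network output at initialization, we need the following technical result.

Consider the random vector $X:=(X_1,\ldots,X_m)$, where $X_i:=\sigma(Z_i)$ for some $\sigma:\mathbb{R}\to\mathbb{R}$ that is $1$-Lipschitz, and $Z_i$ are i.i.d. standard Gaussian r.v.'s. Then the r.v. $\|X\|_2$ is $1$-sub-Gaussian, and thus with probability $1-\delta$,

$$
\|X\|_2-\mathbb{E}\big[\|X\|_2\big] \le \sqrt{2\ln(1/\delta)}.
$$

###### Proof.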

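Given $a\in\mathbb{R}^m$, define

$$
f(a) := \sqrt{\sum_{i=1}^m \sigma(a_i)^2} = \big\|\sigma(a)\big\|_2,
$$

where $\sigma(a)$ is obtained by applying $\sigma$ coordinate-wise to $a$. For any $a,b\in\mathbb{R}^m$, by the triangle inequality, we have

$$
\big|f(a)-f(b)\big| = \Big|\big\|\sigma(a)\big\|_2-\big\|\sigma(b)\big\|_2\Big| \le \big\|\sigma(a)-\sigma(b)\big\|_2 = \sqrt{\sum_{i=1}^m \big(\sigma(a_i)-\sigma(b_i)\big)^2},
$$

and by further using the $1$-Lipschitz continuity of $\sigma$, we have

$$
\big|f(a)-f(b)\big| \le \sqrt{\sum_{i=1}^m \big(\sigma(a_i)-\sigma(b_i)\big)^2} \le \sqrt{\sum_{i=1}^m (a_i-b_i)^2} = \|a-b\|_2.
$$

As a result, $f$ is a $1$-Lipschitz continuous function w.r.t. the $\ell_2$ norm; hence $\|X\|_2$ is $1$-sub-Gaussian, and the bound follows by Gaussian concentration (Wainwright, 2015, Theorem 2.4). ∎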

###### Proof of Section 2.1.

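Given $1\le i\le n$, let $h_i:=\sigma(W_0x_i)/\sqrt{m}$. By the technical result above, $\|h_i\|_2$ is sub-Gaussian with variance proxy $1/m$, and with probability at least $1-\delta/(2n)$ over $W_0$,

$$
\|h_i\|_2-\mathbb{E}\big[\|h_i\|_2\big] \le \sqrt{\frac{2\ln(2n/\delta)}{m}} \le \sqrt{\frac{2\ln(2n/\delta)}{25\ln(2n/\delta)}} \le 1-\frac{\sqrt{2}}{2}.
$$

On the other hand, by Jensen's inequality,

$$
\mathbb{E}\big[\|h_i\|_2\big] \le \sqrt{\mathbb{E}\big[\|h_i\|_2^2\big]} = \frac{\sqrt{2}}{2}.
$$

As a result, with probability $1-\delta/(2n)$, it holds that $\|h_i\|_2\le 1$. By a union bound, with probability $1-\delta/2$ over $W_0$, for all $1\le i\le n$, we have $\|h_i\|_2\le 1$.

For any $W_0$ such that the above event holds, and for any $1\le i\le n$, the r.v. $\langle h_i,a\rangle$ is sub-Gaussian with variance proxy $\|h_i\|_2^2\le 1$. By Hoeffding's inequality, with probability $1-\delta/(2n)$ over $a$,

$$
\big|\langle h_i,a\rangle\big| = \big|f(x_i;W_0,a)\big| \le \sqrt{2\ln\big(4n/\delta\big)}.
$$

By the union bound, with probability $1-\delta/2$ over $a$, for all $1\le i\le n$, we have $|f(x_i;W_0,a)|\le\sqrt{2\ln(4n/\delta)}$.

The probability that the above events all happen is at least $1-\delta$, over $W_0$ and $a$. ∎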

###### Proof of Section 2.

The condition on ensures that Sections 2.1, 2.1 and 2.1 hold with and .

For any $1\le i\le n$ and any step $t$, let $\xi_{i,t}$ denote the proportion of activation patterns for $x_i$ that are different from step $0$ to step $t$. Formally,

 ξi,t:=1mm∑s=1\1