# How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high-degree polynomial of the training sample size n and the inverse target accuracy ϵ^-1, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it was shown that under a certain margin assumption on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, how much over-parameterization is sufficient to guarantee optimization and generalization for deep neural networks still remains an open question. In this work, we establish sharp optimization and generalization guarantees for deep ReLU networks. Under various assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in n and ϵ^-1. Our results push the study of over-parameterized deep neural networks towards more practical settings.


## 1 Introduction

Deep neural networks have become one of the most important and prevalent machine learning models due to their remarkable power in various real-world applications. However, the success of deep learning has not been well explained in theory. It remains mysterious why standard training algorithms tend to find a globally optimal solution, despite the highly non-convex landscape of the training loss function. Moreover, despite the extremely large number of parameters, deep neural networks rarely over-fit, and can often generalize well to unseen data and achieve good test accuracy. Understanding these mysterious phenomena in the optimization and generalization of deep neural networks is one of the most fundamental goals in deep learning theory.

Recent breakthroughs have shed light on the optimization and generalization of deep neural networks under the over-parameterized setting, where the hidden layer width is extremely large. In terms of optimization, a line of work (Du et al., 2019b; Allen-Zhu et al., 2019b; Zou et al., 2018; Oymak and Soltanolkotabi, 2019b; Arora et al., 2019b; Zou and Gu, 2019) proved that (stochastic) gradient descent with random initialization can successfully find a global optimum of the training loss function regardless of the labeling of the data, as long as the width of the network is larger than a high-degree polynomial of the training sample size n. For generalization, Allen-Zhu et al. (2019a); Arora et al. (2019a); Cao and Gu (2019b, a); Nitanda and Suzuki (2019) established generalization bounds of neural networks trained with (stochastic) gradient descent under certain data distribution assumptions, when the network width is at least polynomial in n and ϵ^-1. Although these results have provided important insights into the learning of extremely over-parameterized neural networks, the requirement on the network width is still far from practical settings. Very recently, Ji and Telgarsky (2019b) showed that for two-layer ReLU networks, when the training data are well separated, polylogarithmic width is sufficient to guarantee good optimization and generalization performance of neural networks trained by GD/SGD. However, it remains unclear whether similar results can be developed for deep neural networks.

In fact, most of the aforementioned results can be categorized into the so-called neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2019b) regime or lazy training regime (Chizat et al., 2019), where along the whole training process the neural network function behaves similarly to its first-order Taylor expansion at initialization (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019b; Cao and Gu, 2019a). It is recognized that in order to make the learning of neural networks stay in the NTK regime, a proper scaling with respect to the network width is essential. For example, Cao and Gu (2019a) introduced a scaling factor √m in their definition of the neural network function, where m is the network width. The same scaling factor has also been applied to the initialization of the output weights in Allen-Zhu et al. (2019b); Zou et al. (2019); Cao and Gu (2019b); Zou and Gu (2019). Many other results in the NTK regime used a different type of parameterization, but essentially have the same scaling factor (Jacot et al., 2018; Du et al., 2019b, a; Arora et al., 2019a, b). In fact, without such a scaling factor, it has been shown that the training of two-layer networks falls in a different regime, namely the “mean-field” regime (Mei et al., 2018; Chizat and Bach, 2018; Chizat et al., 2019; Sirignano and Spiliopoulos, 2019; Rotskoff and Vanden-Eijnden, 2018; Wei et al., 2019; Mei et al., 2019; Fang et al., 2019a, b).

In this paper, we study the optimization and generalization of deep ReLU networks for a wider range of scalings. Specifically, for a ReLU network with m hidden nodes per layer, we generalize the scaling factor √m introduced in Cao and Gu (2019a) to m^α, where α > 0 is a constant. Note that similar scaling has been studied in Nitanda and Suzuki (2019). We show that for all such α, as long as there exists a good neural network weight configuration within distance R·m^{-α} of the initialization, the global convergence property as well as good generalization performance can be provably established under a mild condition on the neural network width, which is polylogarithmic in the sample size n and the inverse target accuracy ϵ^{-1}. At the core of our analysis is a milder requirement on the first-order approximation of the neural network function, which allows the algorithm to travel farther from the initialization to find a global minimum. Our contributions are highlighted as follows:


• We establish a global convergence guarantee of GD for training deep ReLU networks for binary classification. Specifically, we prove that for any positive constant α, if there exists a good neural network weight configuration within distance R·m^{-α} of the initialization, and the neural network width satisfies m ≥ m*(δ, R, L, α) = Õ([poly(R, L)]^{1/α}·log^{2/(2-α)}(n/δ)), GD can achieve ϵ training loss within Õ(poly(R, L)·ϵ^{-1}) iterations, where m is the neural network width, m^α is the scaling factor of the neural network, and L is the neural network depth.

• We also establish generalization guarantees for both GD and SGD in the same setting. Specifically, for GD, we establish an Õ(ϵ^{-2}) sample complexity for a wide range of network widths. For SGD, we prove an Õ(ϵ^{-1}) sample complexity. For both algorithms, our results provide tighter sample complexities under milder network width conditions compared with existing results.

• Our theoretical results can be generalized to scenarios with the different data separability assumptions studied in the literature, and therefore can cover and improve many existing results in the NTK regime. Specifically, under the data separability assumptions studied in Cao and Gu (2019a); Ji and Telgarsky (2019b), our results hold with R logarithmic in n, ϵ^{-1} and δ^{-1}, where δ is the failure probability parameter. This suggests that a neural network with width polylogarithmic in n, ϵ^{-1} and δ^{-1} can be learned by GD/SGD with good optimization and generalization guarantees. Moreover, we also show that under a very mild data nondegeneration assumption in Zou et al. (2019), our theoretical result leads to a sharper over-parameterization condition, which improves the existing results in Zou et al. (2019) when the network depth L satisfies a mild condition.

### 1.1 Additional Related Work

In terms of optimization, a line of work focuses on the optimization landscape of neural networks (Haeffele and Vidal, 2015; Kawaguchi, 2016; Freeman and Bruna, 2017; Hardt and Ma, 2017; Safran and Shamir, 2018; Xie et al., 2017; Nguyen and Hein, 2017; Soltanolkotabi et al., 2018; Zhou and Liang, 2017; Yun et al., 2018; Du and Lee, 2018; Venturi et al., 2018; Nguyen, 2019). These works study the properties of the landscape of the optimization problem in deep learning, and demonstrate that in certain settings the local minima are also globally optimal. However, most of the positive results along this line only hold for simplified cases such as linear networks or two-layer networks, under certain assumptions on the input/output dimensions and sample size.

For the generalization of neural networks, a vast amount of work has established uniform-convergence-based generalization error bounds (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2018; Golowich et al., 2018; Arora et al., 2018; Li et al., 2018a). While such results can be applied to the mean-field regime to establish certain generalization bounds (Wei et al., 2019), the bounds are loose when applied to the NTK regime due to the larger scaling of the network parameters. For example, some case studies in Cao and Gu (2019b) showed that the resulting uniform-convergence-based generalization bounds are increasing in the network width m.

Another important topic on neural networks is the implicit bias of training algorithms such as GD and SGD. Overall, the study of implicit bias aims to characterize the specific properties of the solutions given by a certain training algorithm, as the solutions to the optimization problem may not be unique. Along this line of research, many prior works (Gunasekar et al., 2017; Soudry et al., 2018; Ji and Telgarsky, 2019a; Gunasekar et al., 2018a, b; Nacson et al., 2019b; Li et al., 2018b) have studied the implicit regularization/bias of gradient flow, GD, SGD or mirror descent for matrix factorization, logistic regression, and deep linear networks. However, generalizing these results to deep non-linear neural networks turns out to be much more challenging.

Nacson et al. (2019a); Lyu and Li (2019) studied the implicit bias of deep homogeneous models trained by gradient flow, and proved that the convergent direction of the parameters is a KKT point of the max-margin problem. Nevertheless, these results cannot handle practical optimization algorithms such as GD and SGD, and do not characterize how large the resulting margin is.

Several recent results have proved that neural networks can outperform kernel methods or behave differently than NTK-based kernel regression under certain conditions. Wei et al. (2019) studied the convergence of noisy Wasserstein flow in the mean-field regime, while Allen-Zhu and Li (2019) studied three layer ResNets with a scaling similar to the mean-field regime. Moreover, Allen-Zhu et al. (2019a); Bai and Lee (2019) studied fully-connected three-layer or two-layer networks with a scaling similar to the NTK regime, but utilized certain randomization tricks to make the network “almost quadratic” instead of “almost linear” in its parameters, making the network behave differently from the standard NTK regime.

Notation. For two scalars a and b, we use a∨b and a∧b to denote max{a, b} and min{a, b} respectively. For a vector v, we use ‖v‖₂ to denote its Euclidean norm. For a matrix W, we use ‖W‖₂ and ‖W‖_F to denote its spectral norm and Frobenius norm respectively, and denote by W_{jk} the entry of W at the j-th row and k-th column. Given two matrices A and B with the same dimension, we denote ⟨A, B⟩ = tr(A⊤B). Given a collection of matrices W = {W₁, …, W_L} and a function f(W), we define by ∇_{W_l} f(W) the partial gradient of f with respect to W_l, and write ∇f(W) = {∇_{W_l} f(W)}_{l=1,…,L}. Given two collections of matrices W and W′, we denote ⟨W, W′⟩ = Σ_l ⟨W_l, W′_l⟩ and ‖W‖_F = (Σ_l ‖W_l‖²_F)^{1/2}. Given two sequences {a_n} and {b_n}, we denote a_n = O(b_n) if a_n ≤ C₁b_n for some absolute positive constant C₁, a_n = Ω(b_n) if a_n ≥ C₂b_n for some absolute positive constant C₂, and a_n = Θ(b_n) if C₂b_n ≤ a_n ≤ C₁b_n for some absolute constants C₁ and C₂. We also use the notations Õ(·) and Ω̃(·) to hide logarithmic factors in O(·) and Ω(·) respectively. Moreover, given a collection of matrices W = {W₁, …, W_L} and a positive scalar τ, we denote by B(W, τ) = {W′ : ‖W′_l − W_l‖_F ≤ τ, l = 1, …, L} the Frobenius-norm ball of radius τ around W.

## 2 Preliminaries on Learning Neural Networks

In this section we introduce the problem setting studied in this paper, including definitions of the network function and loss function, and the detailed training algorithms, i.e., GD and SGD with random initialization.

Neural network function. Given an input x ∈ ℝ^d, the output of the deep fully-connected ReLU network is defined as follows,

 f_{\mathbf{W}}(\mathbf{x}) = m^{\alpha}\,\mathbf{W}_L\,\sigma(\mathbf{W}_{L-1}\cdots\sigma(\mathbf{W}_1\mathbf{x})\cdots),

where m^α is a scaling parameter, σ(z) = max{0, z} is the entry-wise ReLU activation, W₁ ∈ ℝ^{m×d}, W_l ∈ ℝ^{m×m} for l = 2, …, L−1, and W_L ∈ ℝ^{1×m}. We denote the collection of all weight matrices as W = {W₁, …, W_L}.

Loss function. Given a training dataset {(x_i, y_i)}_{i=1,…,n} with inputs x_i ∈ ℝ^d and labels y_i ∈ {−1, +1}, we define the training loss function as

 L_S(\mathbf{W}) = \frac{1}{n}\sum_{i=1}^{n} L_i(\mathbf{W}),

where L_i(W) = ℓ(y_i·f_W(x_i)) and ℓ(z) = log(1 + exp(−z)) is the cross-entropy loss.

Algorithms. We consider both gradient descent and stochastic gradient descent with Gaussian random initialization. These two algorithms are displayed in Algorithms 1 and 2 respectively. Specifically, the entries of the hidden-layer matrices W₁^{(0)}, …, W_{L−1}^{(0)} are generated independently from the univariate Gaussian distribution N(0, 2/m), and the entries of the output layer W_L^{(0)} are generated independently from N(0, 1/m). For GD, we use the full gradient to update the model parameters. For SGD, we use only one training example in each iteration.
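To make the definitions above concrete, here is a toy-scale sketch (not the paper's code) of the network f_W, the loss L_S, the random initialization, and a few steps of full-batch GD. The initialization variances N(0, 2/m) and N(0, 1/m) and the choice α = 1/2 (the √m scaling) are assumptions of this sketch, and gradients are computed by finite differences purely for brevity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(W, x, alpha=0.5):
    # f_W(x) = m^alpha * W_L relu(... relu(W_1 x) ...); m is the hidden width
    m = W[0].shape[0]
    h = x
    for Wl in W[:-1]:
        h = relu(Wl @ h)
    return (m ** alpha * (W[-1] @ h)).item()

def loss(W, X, y, alpha=0.5):
    # L_S(W) = (1/n) sum_i log(1 + exp(-y_i f_W(x_i))), evaluated stably
    return float(np.mean([np.logaddexp(0.0, -yi * forward(W, xi, alpha))
                          for xi, yi in zip(X, y)]))

def init_weights(d, m, L, rng):
    # Gaussian initialization; variances 2/m (hidden) and 1/m (output) are
    # assumed here, matching common He-style NTK-regime initialization
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L - 2)]
    Ws.append(rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m)))
    return Ws

def numerical_grad(W, X, y, alpha=0.5, eps=1e-6):
    # Central finite differences over every entry -- viable only for toy nets
    grads = []
    for Wl in W:
        G = np.zeros_like(Wl)
        for idx in np.ndindex(*Wl.shape):
            old = Wl[idx]
            Wl[idx] = old + eps
            lp = loss(W, X, y, alpha)
            Wl[idx] = old - eps
            lm = loss(W, X, y, alpha)
            Wl[idx] = old
            G[idx] = (lp - lm) / (2.0 * eps)
        grads.append(G)
    return grads

def train_gd(W, X, y, eta=0.05, steps=10, alpha=0.5):
    # Full-batch gradient descent (the role of Algorithm 1)
    for _ in range(steps):
        grads = numerical_grad(W, X, y, alpha)
        W = [Wl - eta * G for Wl, G in zip(W, grads)]
    return W
```

Replacing the full-batch gradient with the gradient of a single L_i per step plays the role of Algorithm 2 (SGD).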

## 3 Main Theory

In this section, we present the main theoretical results on the optimization and generalization guarantees of GD and SGD for learning deep ReLU networks. We first make the following assumption on the training data points: all training data points satisfy ‖x_i‖₂ = 1, i = 1, …, n. This assumption has been widely made in previous work (Allen-Zhu et al., 2019b, c; Du et al., 2019b, a; Zou et al., 2019) in order to simplify the theoretical analysis. We also make the following assumption, Assumption 3, regarding the loss function: there exist a positive constant R and a weight configuration W* ∈ B(W^{(0)}, R·m^{-α}) such that L_i(W*) ≤ ϵ for all i = 1, …, n. Considering a sufficiently small ϵ, Assumption 3 spells out that there exists a neural network model with parameters W* close to the initialization such that all training data points can be correctly classified, i.e., achieving zero training error. We claim that this is a common empirical observation; thus Assumption 3 can easily be satisfied in practice. Moreover, note that we consider the cross-entropy loss; therefore, Assumption 3 is equivalent to the margin condition y_i·f_{W*}(x_i) ≥ −log(e^ϵ − 1) for all i. In Section 4, we will show that Assumption 3 is implied by a variety of assumptions made in prior work.
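The equivalence between a small cross-entropy loss value and a large classification margin follows from a one-line inversion of ℓ:

```latex
\ell(z) \;=\; \log\!\big(1 + e^{-z}\big) \;\le\; \epsilon
\;\Longleftrightarrow\;
e^{-z} \;\le\; e^{\epsilon} - 1
\;\Longleftrightarrow\;
z \;\ge\; -\log\!\big(e^{\epsilon} - 1\big),
\qquad \text{with } z = y_i\, f_{\mathbf{W}^*}(\mathbf{x}_i).
```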

In what follows, we present our main theoretical results regarding the optimization and generalization guarantees of learning deep ReLU networks. Specifically, we consider two training algorithms: GD and SGD with random initialization, as given in Algorithms 1 and 2. We analyze these two algorithms separately.

The following theorem establishes the global convergence of GD for training deep ReLU networks for binary classification. For any δ > 0, there exists m*(δ, R, L, α) that satisfies

 m^*(\delta, R, L, \alpha) = \widetilde{\mathcal{O}}\big([\mathrm{poly}(R, L)]^{1/\alpha}\cdot\log^{2/(2-\alpha)}(n/\delta)\big),

such that if m ≥ m*(δ, R, L, α), with probability at least 1 − δ over the initialization, GD with an appropriately chosen step size η can train a neural network to achieve at most 3ϵ training loss within T = O(LR²m^{-2α}η^{-1}ϵ^{-1}) iterations.

Theorem 3.1 suggests that the minimum required neural network width, i.e., m*, is polynomially large in R and L and has a logarithmic dependency on the training sample size n and the failure probability parameter δ^{-1}. As will be discussed in Section 4, if the training data can be separated by the neural tangent random feature model or the shallow neural tangent kernel, R is logarithmic in n, ϵ^{-1} and δ^{-1}. This further implies that a width polylogarithmic in n, ϵ^{-1} and δ^{-1} is sufficient to guarantee the global convergence of GD. We would also like to remark that Theorem 3.1 does not hold for arbitrarily large T, which implies that one needs to apply early stopping when running Algorithm 1.

Then we characterize the generalization performance of the neural network trained by GD in the following theorem. Under the same assumptions as Theorem 3.1, with probability at least 1 − δ, the iterate W^{(t)} of Algorithm 1 satisfies

 L^{0\text{-}1}_{\mathcal{D}}(\mathbf{W}^{(t)}) \le 2 L_S(\mathbf{W}^{(t)}) + \widetilde{\mathcal{O}}\Big(\min\Big\{4^{L} L^{2} R \sqrt{m/n},\; L^{3/2} R/\sqrt{n} + L^{11/3} R^{4/3} m^{-\alpha/3}\Big\}\Big) + \mathcal{O}\big(\sqrt{\log(1/\delta)/n}\big)

for all t = 0, …, T.

The theorem above provides an algorithm-independent generalization bound. Note that the second term in the bound distinguishes our result from most of the previous work on algorithm-dependent generalization bounds of over-parameterized neural networks (Allen-Zhu et al., 2019a; Arora et al., 2019a; Cao and Gu, 2019b; Yehudai and Shamir, 2019; Cao and Gu, 2019a; Nitanda and Suzuki, 2019). Specifically, while these previous results mainly focus on establishing a bound that does not explode when the network width goes to infinity, our result covers a wider range of m, and therefore yields different bounds for small and large m. As will be shown in Section 4, under various assumptions made in previous work, Assumption 3 holds with R logarithmic in n, ϵ^{-1} and δ^{-1}, and therefore the theorem guarantees a sample complexity of order Õ(ϵ^{-2}) for networks of polylogarithmic width, which has not been covered by previous results.

A trend can be observed in the theorem above: the generalization error bound first increases with the network width m and then starts to decrease when m becomes even larger. This to a certain extent bears a similarity to the “double descent” phenomenon studied in a recent line of work (Belkin et al., 2019a, b; Hastie et al., 2019; Mei and Montanari, 2019). However, since the theorem only demonstrates a double descent curve for an upper bound of the generalization error, it is not sufficient to give any conclusive result on the double descent phenomenon. In fact, for two-layer networks, Ji and Telgarsky (2019b) proved a generalization error bound that does not depend on m for all sufficiently large m, under certain data separability assumptions. Therefore, it is possible that the double descent curve in our bound is an artifact of our analysis. We believe a further analysis of the generalization error and its relation to the double descent curve is an important future direction.

In this part, we characterize the performance of SGD for training deep ReLU networks. Specifically, the following theorem establishes a generalization error bound for the output of SGD, under certain conditions on the neural network width.

For any δ > 0, there exists m*(δ, R, L, α) that satisfies

 m^*(\delta, R, L, \alpha) = \widetilde{\mathcal{O}}\big([\mathrm{poly}(R, L)]^{1/\alpha}\cdot\log^{2/(2-\alpha)}(n/\delta)\big),

such that if m ≥ m*(δ, R, L, α), with probability at least 1 − δ, SGD with an appropriately chosen step size achieves

 \mathbb{E}\big[L^{0\text{-}1}_{\mathcal{D}}(\widehat{\mathbf{W}})\big] \le \frac{8L^{2}R^{2}}{n} + \frac{8\log(1/\delta)}{n} + 24\epsilon,

where the expectation is taken over the uniform draw of Ŵ from the iterates {W^{(0)}, …, W^{(n−1)}}. Theorem 3.2 gives a sample complexity bound for deep ReLU networks trained with SGD. Treating L as a constant, then as long as R is at most polylogarithmic in n and ϵ^{-1} (which we will verify in Section 4 under various conditions), this is a sample complexity of order Õ(ϵ^{-1}). Our result improves the results given by Allen-Zhu et al. (2019a); Cao and Gu (2019a) in two aspects. First, the sample complexity is improved to Õ(ϵ^{-1}). Moreover, while Allen-Zhu et al. (2019a); Cao and Gu (2019a) require the width m to be polynomial in n and ϵ^{-1}, our result only requires m polylogarithmic in n and ϵ^{-1}.

## 4 Discussions on Data Separability

In this section, we will discuss different data separability assumptions made in existing work. Specifically, we will show that the assumptions on training data made in Cao and Gu (2019a), Ji and Telgarsky (2019b) and Zou et al. (2019) can imply Assumption 3, and thus our theoretical results can be directly applied to these settings.

### 4.1 Data Separability by Neural Tangent Random Feature Model

We formally restate the definition of the Neural Tangent Random Feature (NTRF) model introduced in Cao and Gu (2019a) as follows. Let W^{(0)} be the initialization weights; the NTRF function class is defined as

 \mathcal{F}(\mathbf{W}^{(0)}, R, \alpha) = \big\{f(\cdot) = f_{\mathbf{W}^{(0)}}(\cdot) + \langle\nabla f_{\mathbf{W}^{(0)}}(\cdot), \mathbf{W}\rangle : \mathbf{W} \in \mathcal{B}(\mathbf{0}, R\cdot m^{-\alpha})\big\}.

The NTRF function class is closely related to the neural tangent kernel: for wide enough neural networks, it has been shown that the functions the NTRF model can learn lie in the NTK-induced reproducing kernel Hilbert space (RKHS) (Cao and Gu, 2019a). The following proposition states that if there is a good function in the NTRF function class that achieves small training loss, Assumption 3 is also satisfied. Suppose there is a function f(·) ∈ F(W^{(0)}, R, α) such that ℓ(y_i·f(x_i)) ≤ ϵ for all i = 1, …, n; then Assumption 3 can be satisfied with a radius parameter of order R.
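The NTRF class above is simply the set of first-order Taylor expansions of f_W around the initialization. The following sketch (an illustration, not the paper's code) evaluates such a function numerically; a symmetric finite difference stands in for the exact directional derivative that autodiff would provide.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(W, x, alpha=0.5):
    # f_W(x) = m^alpha * W_L relu(... relu(W_1 x) ...)
    m = W[0].shape[0]
    h = x
    for Wl in W[:-1]:
        h = relu(Wl @ h)
    return (m ** alpha * (W[-1] @ h)).item()

def ntrf(W0, Wdir, x, alpha=0.5, t=1e-5):
    # <grad f_{W^(0)}(x), W>: derivative of f_{W^(0) + t*W}(x) at t = 0,
    # approximated by a symmetric finite difference along the direction W
    Wp = [A + t * B for A, B in zip(W0, Wdir)]
    Wm = [A - t * B for A, B in zip(W0, Wdir)]
    return (forward(Wp, x, alpha) - forward(Wm, x, alpha)) / (2.0 * t)

def ntrf_function(W0, Wdir, x, alpha=0.5):
    # An element f(.) of the NTRF class: first-order expansion at W^(0)
    return forward(W0, x, alpha) + ntrf(W0, Wdir, x, alpha)
```

Note that `ntrf` is linear in the direction `Wdir`, which is exactly why the NTRF class behaves like a kernel (random feature) model.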

Proposition 4.1 states that if the training data can be well classified by a function in the NTRF function class, they can also be well learned by deep ReLU networks. However, one may ask in which cases there exists such a good function in the NTRF function class, and what the corresponding value of R is. We provide such an example by introducing the following assumption on the neural tangent random features, i.e., ∇f_{W^{(0)}}(x_i).

There exists a collection of matrices U* = {U*₁, …, U*_L} satisfying ‖U*‖_F ≤ 1, such that for all i = 1, …, n,

 y_i\,\langle\nabla f_{\mathbf{W}^{(0)}}(\mathbf{x}_i), \mathbf{U}^*\rangle \ge m^{\alpha}\gamma,

where γ is an absolute positive constant. (The factor m^α is introduced here since the Frobenius norm of the gradient ∇f_{W^{(0)}}(x_i) is typically of order m^α.) Based on Proposition 4.1, the following corollary shows that under Assumption 4.1, Assumption 3 can be satisfied with a certain choice of R.

Under Assumption 4.1, Assumption 3 can be satisfied with R = c·log(n/(δϵ)) for some constant c depending only on γ. Corollary 4.1 shows that if the NTRFs of all training data are linearly separable with a constant margin, Assumption 3 can be satisfied with a radius parameter R logarithmic in n, ϵ^{-1} and δ^{-1}. Substituting this result into Theorems 3.1 and 3.2, it can be shown that a neural network with width polylogarithmic in n, ϵ^{-1} and δ^{-1} suffices to guarantee good optimization and generalization performance for both GD and SGD.

### 4.2 Data Separability by Shallow Neural Tangent Model

In this subsection we study the data separation assumption made in Ji and Telgarsky (2019b) and show that our results cover this particular setting. We first restate the assumption as follows. There exist γ > 0 and a mapping ū: ℝ^d → ℝ^d with ‖ū(z)‖₂ ≤ 1 for any z ∈ ℝ^d, such that for all i = 1, …, n,

 y_i\int_{\mathbb{R}^d}\sigma'(\langle\mathbf{z}, \mathbf{x}_i\rangle)\cdot\langle\bar{\mathbf{u}}(\mathbf{z}), \mathbf{x}_i\rangle\,\mathrm{d}\mu_N(\mathbf{z}) \ge \gamma.

Assumption 4.2 is related to the linear separability of the gradients with respect to the first-layer parameters at random initialization, where the randomness is replaced by an integral in the infinite-width limit. Note that similar assumptions have also been studied in Cao and Gu (2019b); Frei et al. (2019), where the gradients with respect to the second-layer weights instead of the first-layer weights are considered. In the following, we mainly focus on Assumption 4.2; however, we remark that our result also covers the setting studied in Cao and Gu (2019b); Frei et al. (2019).
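The integral in the assumption above is an expectation over the infinite-width initialization distribution, so it can be estimated by Monte Carlo. In this sketch μ_N is taken to be the standard Gaussian on ℝ^d, which is an assumption of the sketch; `ubar` can be any mapping with ‖ū(z)‖₂ ≤ 1.

```python
import numpy as np

def shallow_ntk_margin(x, y, ubar, n_samples=10000, seed=0):
    # Monte Carlo estimate of  y * E_{z ~ N(0, I_d)}[ sigma'(<z, x>) * <ubar(z), x> ],
    # the left-hand side of the shallow-neural-tangent separability condition
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, x.shape[0]))
    sigma_prime = (Z @ x > 0).astype(float)          # ReLU derivative sigma'(<z, x>)
    inner = np.array([ubar(z) @ x for z in Z])       # <ubar(z), x>
    return float(y * np.mean(sigma_prime * inner))
```

For a unit vector x and the constant map ū(z) ≡ x, the integrand reduces to σ′(⟨z, x⟩), whose Gaussian expectation is 1/2, so such data have margin γ = 1/2 under this assumption.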

In order to make a fair comparison, we reduce our results for multilayer networks to the one-hidden-layer setting:

 f_{\mathbf{W}}(\mathbf{x}) = m^{\alpha}\,\mathbf{W}_2\,\sigma(\mathbf{W}_1\mathbf{x}).

We then provide the following proposition, which states that Assumption 4.2 also implies Assumption 3 with a certain choice of R.

Suppose the training data satisfy Assumption 4.2. Then if the neural network width m is sufficiently large, Assumption 3 can be satisfied with R = c·γ^{-1}·log(n/(δϵ)) for some absolute constant c. Proposition 4.2 suggests that for two-layer ReLU networks, under Assumption 4.2, Assumption 3 can be satisfied with R logarithmic in n, ϵ^{-1} and δ^{-1}. Plugging this into Theorem 3.1, the condition on the neural network width becomes polylogarithmic in n, ϵ^{-1} and δ^{-1} (similar to Ji and Telgarsky (2019b), the margin parameter γ is treated as a constant and thus does not appear in the condition on m), which matches the condition proved in Ji and Telgarsky (2019b) when choosing α = 1/2.

### 4.3 Class-dependent Data Nondegeneration

In Zou et al. (2019), an assumption on the minimum distance between inputs from different classes is made to guarantee the convergence of gradient descent to a global minimum. We restate this training data assumption as follows: for all i, j = 1, …, n, if y_i ≠ y_j, then ‖x_i − x_j‖₂ ≥ ϕ for some absolute constant ϕ > 0. In contrast to the data nondegeneration assumption (i.e., no duplicate data points) made in Allen-Zhu et al. (2019b); Du et al. (2019b, a); Oymak and Soltanolkotabi (2019a); Zou and Gu (2019) (specifically, Allen-Zhu et al. (2019b); Oymak and Soltanolkotabi (2019a); Zou and Gu (2019) require that any two data points are separated by a positive distance; Zou and Gu (2019) shows that this assumption is equivalent to those made in Du et al. (2019b, a), which require that the composite kernel matrix is strictly positive definite), Assumption 4.3 only requires that the data points from different classes are nondegenerate; thus we call it the class-dependent data nondegeneration assumption. Assumption 4.3 is clearly milder, since it allows data points to be arbitrarily close as long as they are from the same class, while the data nondegeneration assumption requires that any two data points be separated by a constant distance.
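Assumption 4.3 is easy to check on a concrete dataset: the quantity it constrains is the smallest distance between points carrying different labels. A minimal sketch (the helper name is ours, purely for illustration):

```python
import numpy as np

def min_cross_class_gap(X, y):
    # Smallest ||x_i - x_j||_2 over pairs with y_i != y_j; the class-dependent
    # nondegeneration assumption requires this to be at least a constant phi > 0
    gaps = [np.linalg.norm(np.asarray(xi) - np.asarray(xj))
            for i, (xi, yi) in enumerate(zip(X, y))
            for xj, yj in zip(X[i + 1:], y[i + 1:])
            if yi != yj]
    return min(gaps) if gaps else float("inf")
```

In particular, a dataset with exact duplicate points sharing the same label still has a positive cross-class gap, illustrating why this assumption is milder than requiring all pairs of points to be separated.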

We then provide the following proposition, which shows that Assumption 4.3 also implies Assumption 3 for certain choices of R and m. Suppose the training data points satisfy Assumption 4.3; then if

 m = \Omega\big(\big[L^{2} n^{9/2} \phi^{-2} \log\big(n/(\delta\epsilon)\big)\big]^{1/\alpha}\big),

Assumption 3 can be satisfied with a suitable choice of R. Proposition 4.3 suggests that when the neural network is sufficiently wide, as long as there exist no duplicate training data from different classes, Assumption 3 can still be satisfied. We can also plug this result into Theorem 3.1, which gives a corresponding over-parameterization condition (the detailed dependency of m* on R and L in Theorem 3.1 can be found in (4); the Ω̃ notation hides logarithmic dependency on n, ϵ^{-1} and δ^{-1}). Compared with the counterpart proved in Zou et al. (2019), our result is strictly sharper whenever the network depth L satisfies a mild condition.

## 5 Proof of the Main Theory

In this section we present the proofs of our main results in Section 3.

### 5.1 Proof of Theorem 3.1

We first present the following lemma, which states that the neural network function is almost linear in its weights. [Lemma 4.1 in Cao and Gu (2019a)] With probability at least 1 − δ over the randomness of initialization, for all i = 1, …, n and all W, W′ ∈ B(W^{(0)}, τ), it holds that

 \big|f_{\mathbf{W}'}(\mathbf{x}_i) - f_{\mathbf{W}}(\mathbf{x}_i) - \langle\nabla f_{\mathbf{W}}(\mathbf{x}_i), \mathbf{W}' - \mathbf{W}\rangle\big| = O\big(\tau^{1/3} L^{2} m^{\alpha}\sqrt{\log m}\big)\sum_{l=1}^{L-1}\|\mathbf{W}'_l - \mathbf{W}_l\|_2. (1)

We make a slight modification of the original version in Cao and Gu (2019a), as our neural network function carries the additional scaling parameter m^α. Then, assuming that all iterates stay close to the initialization, we establish a convergence guarantee for GD in the following lemma. Set the step size η and the radius τ appropriately. Then for any t′ ≤ T, suppose W^{(t)} ∈ B(W^{(0)}, τ) for all 0 ≤ t < t′; with probability at least 1 − δ over the randomness of initialization it holds that

 \sum_{l=1}^{L}\|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F^{2} - \sum_{l=1}^{L}\|\mathbf{W}^{(t')}_l - \mathbf{W}^{*}_l\|_F^{2} \ge \eta\cdot\Big[\sum_{t=0}^{t'-1} L_S(\mathbf{W}^{(t)}) - 2t'\epsilon\Big].

It then remains to characterize under which condition on m we can guarantee that all iterates stay inside the required region until convergence. Based on the two lemmas above, we complete the proof as follows.

###### Proof of Theorem 3.1.

In the following proof we choose τ = 2√(CL)·R·m^{-α} for an appropriate absolute constant C, and T = ⌈LR²m^{-2α}/(ηϵ)⌉. Note that the two lemmas above each hold with probability at least 1 − δ/2 over the randomness of initialization. Therefore, if the neural network width satisfies

 m = \Omega\big(L^{1/(2-\alpha)}\log^{2/(2-\alpha)}(m)\,\log^{2/(2-\alpha)}(nL^{2}/\delta)\big), (2)

then with probability at least 1 − δ all results in both lemmas hold.

Then we prove the theorem in two parts: 1) we show that all iterates stay inside the region B(W^{(0)}, τ); and 2) we show that gradient descent finds a neural network with at most 3ϵ training loss within T iterations.

All iterates stay inside B(W^{(0)}, τ). We prove this part by induction. Specifically, given t′ ≤ T, we assume W^{(t)} ∈ B(W^{(0)}, τ) for all t < t′ and prove W^{(t′)} ∈ B(W^{(0)}, τ). First, it is clear that W^{(0)} ∈ B(W^{(0)}, τ). Then by Lemma 5.1 and the fact that L_S(W^{(t)}) ≥ 0, we have

 \sum_{l=1}^{L}\|\mathbf{W}^{(t')}_l - \mathbf{W}^{*}_l\|_F^{2} \le \sum_{l=1}^{L}\|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F^{2} + 2\eta t'\epsilon.

Noting that ‖W^{(0)}_l − W*_l‖_F ≤ R·m^{-α} and t′ ≤ T (so that 2ηt′ϵ = O(LR²m^{-2α}) by our choice of T), we have

 \sum_{l=1}^{L}\|\mathbf{W}^{(t')}_l - \mathbf{W}^{*}_l\|_F^{2} \le C L R^{2} m^{-2\alpha},

where C is an absolute constant. Therefore, by the triangle inequality, we further have the following for all l = 1, …, L:

 \|\mathbf{W}^{(t')}_l - \mathbf{W}^{(0)}_l\|_F \le \|\mathbf{W}^{(t')}_l - \mathbf{W}^{*}_l\|_F + \|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F \le \sqrt{CL}\,R\,m^{-\alpha} + R\,m^{-\alpha} \le 2\sqrt{CL}\,R\,m^{-\alpha}. (3)

Therefore, in order to guarantee that W^{(t′)} ∈ B(W^{(0)}, τ), by our choice of τ it suffices to ensure that τ = 2√(CL)·R·m^{-α} satisfies the requirement of Lemma 5.1. Combining this with the condition on m provided in (2), we have that if

 m \ge m^*(\delta, R, L, \alpha) = \widetilde{\mathcal{O}}\big([R^{4}L^{11}]^{1/\alpha}\cdot\log^{2/(2-\alpha)}(n/\delta)\big), (4)

then the iterate W^{(t′)} stays inside the region B(W^{(0)}, τ), which completes the proof of the first part.

Convergence of gradient descent. By Lemma 5.1, we have

 \sum_{l=1}^{L}\|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F^{2} - \sum_{l=1}^{L}\|\mathbf{W}^{(T)}_l - \mathbf{W}^{*}_l\|_F^{2} \ge \eta\Big(\sum_{t=0}^{T-1} L_S(\mathbf{W}^{(t)}) - 2T\epsilon\Big).

Dividing both sides by ηT, we get

 \frac{1}{T}\sum_{t=0}^{T-1} L_S(\mathbf{W}^{(t)}) \le \frac{\sum_{l=1}^{L}\|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F^{2} - \sum_{l=1}^{L}\|\mathbf{W}^{(T)}_l - \mathbf{W}^{*}_l\|_F^{2}}{\eta T} + 2\epsilon \le \frac{LR^{2}m^{-2\alpha}}{\eta T} + 2\epsilon \le 3\epsilon,

where the second inequality is by the fact that ‖W^{(0)}_l − W*_l‖_F ≤ R·m^{-α}, and the last inequality is by our choices of η and T, which ensure that LR²m^{-2α}/(ηT) ≤ ϵ. Notice that min_{0 ≤ t < T} L_S(W^{(t)}) ≤ (1/T)∑_{t=0}^{T−1} L_S(W^{(t)}) ≤ 3ϵ. This completes the proof of the second part, and thus the proof of the theorem. ∎

### 5.2 Proof of Theorem 3.1

Following Cao and Gu (2019b), we first introduce the surrogate loss of the network, which is defined through the derivative of the loss function. We define the empirical surrogate error and the population surrogate error as follows:

 \mathcal{E}_S(\mathbf{W}) := -\frac{1}{n}\sum_{i=1}^{n}\ell'\big[y_i\cdot f_{\mathbf{W}}(\mathbf{x}_i)\big], \qquad \mathcal{E}_{\mathcal{D}}(\mathbf{W}) := \mathbb{E}_{(\mathbf{x}, y)\sim\mathcal{D}}\big\{-\ell'\big[y\cdot f_{\mathbf{W}}(\mathbf{x})\big]\big\}.

The following lemma gives a uniform-convergence type result for the surrogate error, utilizing the fact that −ℓ′(·) is bounded and Lipschitz continuous. For any δ > 0, suppose that m is sufficiently large. Then with probability at least 1 − δ, it holds that

 |\mathcal{E}_{\mathcal{D}}(\mathbf{W}) - \mathcal{E}_S(\mathbf{W})| \le \widetilde{\mathcal{O}}\Big(\min\Big\{4^{L} L^{3/2}\widetilde{R}\sqrt{m/n},\; L\widetilde{R}/\sqrt{n} + L^{3}\widetilde{R}^{4/3} m^{-\alpha/3}\Big\}\Big) + \mathcal{O}\big(\sqrt{\log(1/\delta)/n}\big)

for all W ∈ B(W^{(0)}, R̃·m^{-α}).
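As a quick numerical companion to the definitions above: the empirical surrogate error depends only on the margins y_i·f_W(x_i), and since −ℓ′(z) = 1/(1 + e^z) it is bounded in (0, 1). A minimal sketch:

```python
import numpy as np

def empirical_surrogate_error(margins):
    # E_S(W) = -(1/n) sum_i l'(y_i f_W(x_i)) for l(z) = log(1 + e^{-z});
    # since -l'(z) = 1 / (1 + e^z), this is an average "soft" classification error
    z = np.asarray(margins, dtype=float)
    return float(np.mean(1.0 / (1.0 + np.exp(z))))
```

One can also verify numerically the two elementary facts about −ℓ′ that connect the surrogate error to the training loss and the 0-1 error: −ℓ′(z) ≤ ℓ(z), and −ℓ′(z) ≥ ½·1{z ≤ 0}.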

We are now ready to prove Theorem 3.1, which combines the trajectory distance analysis in the proof of Theorem 3.1 with Lemma 5.2.

###### Proof of Theorem 3.1.

With exactly the same argument as in the proof of Theorem 3.1, by (3) and induction we have W^{(t)} ∈ B(W^{(0)}, R̃·m^{-α}) with R̃ = 2√(CL)·R. Therefore, by Lemma 5.2, we have

 |\mathcal{E}_{\mathcal{D}}(\mathbf{W}^{(t)}) - \mathcal{E}_S(\mathbf{W}^{(t)})| \le \widetilde{\mathcal{O}}\Big(\min\Big\{4^{L} L^{2} R\sqrt{m/n},\; L^{3/2} R/\sqrt{n} + L^{11/3} R^{4/3} m^{-\alpha/3}\Big\}\Big) + \mathcal{O}\big(\sqrt{\log(1/\delta)/n}\big)

for all t = 0, …, T. Note that −ℓ′(z) ≥ ½·1{z ≤ 0} and −ℓ′(z) ≤ ℓ(z), which imply L^{0-1}_D(W) ≤ 2·E_D(W) and E_S(W) ≤ L_S(W). Therefore,

 L^{0\text{-}1}_{\mathcal{D}}(\mathbf{W}^{(t)}) \le 2\mathcal{E}_{\mathcal{D}}(\mathbf{W}^{(t)}) \le 2 L_S(\mathbf{W}^{(t)}) + \widetilde{\mathcal{O}}\Big(\min\Big\{4^{L} L^{2} R\sqrt{m/n},\; L^{3/2} R/\sqrt{n} + L^{11/3} R^{4/3} m^{-\alpha/3}\Big\}\Big) + \mathcal{O}\big(\sqrt{\log(1/\delta)/n}\big).

This finishes the proof. ∎

### 5.3 Proof of Theorem 3.2

In this section we provide the proof of Theorem 3.2. The following result is the counterpart of Lemma 5.1 for SGD. Set the step size η and the radius τ appropriately. Then for any positive integer n′ ≤ n, suppose W^{(i)} ∈ B(W^{(0)}, τ) for all 0 ≤ i < n′; with probability at least 1 − δ over the randomness of initialization it holds that

 \sum_{l=1}^{L}\|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F^{2} - \sum_{l=1}^{L}\|\mathbf{W}^{(n')}_l - \mathbf{W}^{*}_l\|_F^{2} \ge \eta\Big(\sum_{i=1}^{n'} L_i(\mathbf{W}^{(i-1)}) - 2n'\epsilon\Big).

Our proof is based on the application of Lemma 5.3 and an online-to-batch conversion argument (Cesa-Bianchi et al., 2004), which is inspired by Cao and Gu (2019a); Ji and Telgarsky (2019b). We denote E_i(W) := −ℓ′[y_i·f_W(x_i)]. The following lemma is provided in Ji and Telgarsky (2019b); its proof only relies on the boundedness of −ℓ′(·) and is therefore applicable in our setting.

For any δ > 0, with probability at least 1 − δ, the iterates of Algorithm 2 satisfy

 \frac{1}{n}\sum_{i=1}^{n}\mathcal{E}_{\mathcal{D}}(\mathbf{W}^{(i-1)}) \le \frac{4}{n}\sum_{i=1}^{n}\mathcal{E}_i(\mathbf{W}^{(i-1)}) + \frac{4\log(1/\delta)}{n}.
###### Proof of Theorem 3.2.

Similar to the proof of Theorem 3.1, we prove this theorem in two parts: 1) all iterates stay inside ; and 2) convergence of SGD.

All iterates stay inside B(W^{(0)}, τ). Similar to the proof of Theorem 3.1, we prove this part by induction. Assuming W^{(i)} ∈ B(W^{(0)}, τ) for all 0 ≤ i < n′, by Lemma 5.3 we have

 \sum_{l=1}^{L}\|\mathbf{W}^{(n')}_l - \mathbf{W}^{*}_l\|_F^{2} \le \sum_{l=1}^{L}\|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F^{2} + 2n'\eta\epsilon \le LR^{2}m^{-2\alpha} + 2n\eta\epsilon,

where the second inequality is by ‖W^{(0)}_l − W*_l‖_F ≤ R·m^{-α} and n′ ≤ n. Then by the triangle inequality, we further get

 \|\mathbf{W}^{(n')}_l - \mathbf{W}^{(0)}_l\|_F \le \|\mathbf{W}^{(n')}_l - \mathbf{W}^{*}_l\|_F + \|\mathbf{W}^{*}_l - \mathbf{W}^{(0)}_l\|_F \le \mathcal{O}\big(\sqrt{L}\,R\,m^{-\alpha} + \sqrt{n\eta\epsilon}\big).

Then by our choices of η, ϵ and m, it can easily be verified that if m ≥ m*(δ, R, L, α), we have W^{(n′)} ∈ B(W^{(0)}, τ). This completes the proof of the first part.

Convergence of online SGD. By Lemma 5.3 with n′ = n, we have

 \sum_{l=1}^{L}\|\mathbf{W}^{(0)}_l - \mathbf{W}^{*}_l\|_F^{2} - \sum_{l=1}^{L}\|\mathbf{W}^{(n)}_l - \mathbf{W}^{*}_l\|_F^{2} \ge \eta\Big(\sum_{i=1}^{n} L_i(\mathbf{W}^{(i-1)}) - 2n\epsilon\Big).