1 Introduction
Although deep neural networks have achieved revolutionary success across various tasks, e.g., computer vision
(He et al., 2016) and natural language understanding (Hochreiter and Schmidhuber, 1997), they still lack a rigorous theoretical account of their optimization and generalization properties. Regarding optimization in particular, because the loss of a deep neural network is highly nonconvex, local search algorithms like gradient descent are hard to analyze with performance guarantees. Many recent works (Choromanska et al., 2015; Kawaguchi, 2016; Nguyen and Hein, 2017; Soudry and Hoffer, 2017) have studied the loss surface of neural networks, and a common claim is that (deep) neural networks have no bad local minima. However, the scenarios they study often rest on strict assumptions on the network architecture, e.g., deep linear networks, or shallow networks with one hidden layer or differentiable nonlinear activations, and on the input data, e.g., Gaussian or linearly separable input data. In fact, Safran and Shamir (2017) have shown that spurious local minima are common in two-layer ReLU neural networks. Overall, loss surface studies are still far from explaining practical models.
As most neural network models are trained with (stochastic) gradient descent, the optimization property of gradient descent in training deep neural network has also been widely studied.
Soltanolkotabi et al. (2018); Brutzkus et al. (2017) point out that overparameterization might play a key role in the convergence analysis of (stochastic) gradient descent. More recently, Li and Liang (2018); Du et al. (2019) prove that (stochastic) gradient descent converges linearly to the global minimum when training two-layer neural networks, as long as the network is sufficiently overparameterized. The high-level idea is to show that the gradient of the network exhibits benign properties at initialization and then to argue that gradient descent finds a global minimum within a neighborhood of the initialization, in which these benign properties roughly persist. A breakthrough is achieved by Allen-Zhu et al. (2018b); Du et al. (2018), who extend the analysis to deep neural networks (more than two layers). Specifically, Du et al. (2018)
prove that gradient descent finds a global minimum under the assumption that the activation function is smooth and a certain Gram matrix at the last layer has a lower-bounded smallest singular value. Their result requires that the width of the network grow exponentially with its depth for feedforward networks. At the same time,
Allen-Zhu et al. (2018b) prove that a width growing polynomially with the depth suffices to show linear convergence of gradient descent for feedforward networks with ReLU activation. The high-level idea is to first bound the forward and backward stability of deep networks and then apply an argument similar to the convergence proof for the two-layer case. From these theoretical results, it seems that any vanilla feedforward neural network can be successfully trained as long as it is sufficiently overparameterized; under this view, the practical difficulty of training deep networks, i.e., exploding or vanishing gradients, would be due to the network not being wide enough. However, in practice, with skip connections we can successfully train deep networks with hundreds or even thousands of layers without much difficulty. This naturally motivates us to ask
“Is a residual network (ResNet) preferable to a vanilla feedforward network from the viewpoint of the theoretical convergence analysis of gradient descent?”
We note that although Allen-Zhu et al. (2018b); Du et al. (2018) have established convergence results of gradient descent for ResNet, their results do not clearly answer this question. Du et al. (2018) show that the provable number of training steps for ResNet is polynomial in the number of layers, while for a vanilla feedforward network it is exponential. Nonetheless, Allen-Zhu et al. (2018b) show that the provable training time for feedforward networks is in fact polynomial in the number of layers, and that for ResNet is also polynomial, which leaves the benefit of ResNet unclear.
In this paper we establish that for ResNet the overparameterization requirement on the width does not directly depend on the depth, which is the best depth dependence one can possibly expect. Our contributions can be summarized as follows.

We show that the overparameterization requirement for ResNet is almost independent of the depth of the network.

We show that the provable number of training steps does not depend directly on the depth of the network, which implies that training a deep overparameterized ResNet can be almost as easy as training a two-layer network.
Moreover, the overparameterization for ResNet does not depend on the optimization accuracy (the new version of Allen-Zhu et al. (2018b) also achieves this). Technically, we make several critical improvements over the proof in Allen-Zhu et al. (2018b) for analyzing the convergence of gradient descent training overparameterized deep ResNet. Specifically, we exploit the fact that both the output change of each layer and the magnitude of the gradient on the parameters in the residual block become smaller as the depth of the network increases, because the output of the parametric mapping in the residual block is scaled by a small factor depending on the depth and the width, a setting adopted in both Allen-Zhu et al. (2018b) and Du et al. (2018). We note that a small scaling factor (preliminary experiments suggest the scaling may be improved, though a rigorous argument needs further development) is necessary both for the proof and in practice for our ResNet model, which does not include batch normalization layers. We fully exploit this setting of the scaling factor and successfully remove the dependence of the width on the depth. Moreover, we introduce two new proofs on bounding the forward stability and tighten several arguments in Allen-Zhu et al. (2018b). Our theoretical result reflects that, from the optimization perspective, training a deep neural network with skip connections is much easier than training a vanilla feedforward network. Extensive experiments corroborate our finding.
1.1 Related Works
Several papers argue for the benefit of ResNet, but they either lack a rigorous theory or study ResNet without nonlinear activations. Specifically, Veit et al. (2016) interpret a ResNet as behaving like an ensemble of shallower networks, which is imprecise because the shallower networks are trained jointly, not independently (Xie et al., 2017). Zhang et al. (2018) argue for the benefit of skip connections from the perspective of improving the local Hessian, and Hardt and Ma (2016) show that deep linear residual networks have no spurious local optima.
The papers most related to ours are Allen-Zhu et al. (2018b); Zou et al. (2018); Du et al. (2018). Zou et al. (2018) shares the same high-level proof idea as Allen-Zhu et al. (2018b); it studies the binary classification problem and shows that stochastic gradient descent can find a global minimum when training an overparameterized deep ReLU network. In contrast, we improve the condition guaranteeing that gradient descent finds a global minimum for ResNet and achieve an optimal dependence of the overparameterization on the network depth.
People are skeptical about overparameterization partially because of the classic wisdom in learning theory: controlling the complexity of the function space leads to good generalization. However, the great success of deep learning urges us to reconsider the generalization property in the overparameterized regime. Recently, some progress has been made along this line.
Brutzkus et al. (2017) provide both optimization and generalization guarantees for the SGD solution of overparameterized two-layer networks given that the data is linearly separable. Li and Liang (2018); Allen-Zhu et al. (2018a) show that overparameterized neural networks provably generalize in the two-layer and three-layer cases. Neyshabur et al. (2019) use unit-wise capacity to obtain a bound on the empirical Rademacher complexity, which provides an explanation (though not a rigorous argument) of generalization for overparameterized two-layer ReLU networks. Papers studying other overparameterized models and the local geometry of neural networks are also related. Xu et al. (2018) show that overparameterization can help Expectation Maximization avoid spurious local optima. A result with a similar flavor (Li et al., 2017) has been obtained for the matrix sensing problem. Chizat and Bach (2018) use optimal transport theory to analyze continuous-time gradient descent on overparameterized neural networks with a single hidden layer. Oymak and Soltanolkotabi (2018); Fu et al. (2018); Zhou and Liang (2017) study the local geometry of neural networks that is responsible for the behavior of gradient descent.
1.2 Paper Organization
The rest of this paper is organized as follows. Section 2 introduces the model and notations. Section 3 presents the main results, including the theory and the proof roadmap. Section 4 presents the proofs of the theorems and critical lemmas. Section 5 gives experiments that support our theory. Finally, we conclude in Section 6.
2 Model and Notations
There have been many residual network models since the seminal paper of He et al. (2016). Here we study a very simple ResNet model (the same ResNet model has been used in Allen-Zhu et al. (2018b) and Du et al. (2018); many notations are borrowed from Allen-Zhu et al. (2018b), which may help readers better compare the results and proofs) because our goal is to understand how skip connections help optimization rather than to achieve good performance. The ResNet model is described as follows:

Input layer: ;

Residual layers: ;

A fully-connected layer: ;

Output layer: ;
where is the pointwise activation function, and we use the ReLU activation . Specifically, we assume the input dimension is , and hence ; the intermediate layers have the same width , hence for ; and the output has dimension , hence . Denote the values before activation by for and . Use and to denote the values of and , respectively, when the input vector is , and the diagonal sign matrix where . We adopt the following initialization scheme:

Each entry of is sampled independently from ;

Each entry of is sampled independently from for ;

Each entry of is sampled independently from .
Specifically, we set . We note that a small scaling factor is necessary both for the proof and in practice for our ResNet model with the above initialization, because there is no batch normalization layer. For example, without the small scaling the output of the ResNet easily explodes as the depth increases, which can be verified by calculating the expected value and by experiment. However, whether the scaling can be improved requires further consideration.
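As an illustration of the explosion phenomenon described above, the following sketch simulates the forward pass under an assumed residual form h_l = h_{l-1} + tau * ReLU(W_l h_{l-1}) with Gaussian initialization; the function name `hidden_norm`, the initialization variances, and all constants are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def hidden_norm(tau, L=50, m=256, d=10, seed=0):
    """Norm of the last hidden representation of a toy ResNet
    h_0 = ReLU(A x), h_l = h_{l-1} + tau * ReLU(W_l h_{l-1}).
    All variances below are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    relu = lambda v: np.maximum(v, 0.0)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                       # unit-norm input, cf. Assumption 1
    A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))
    h = relu(A @ x)
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
        h = h + tau * relu(W @ h)
    return np.linalg.norm(h)

print(hidden_norm(tau=1.0))       # no scaling: the hidden norm explodes with depth
print(hidden_norm(tau=1.0 / 50))  # small scaling: the norm stays of constant order
```

Running this contrasts the two regimes: without scaling the hidden norm grows exponentially in the depth, while with a small scaling factor it remains bounded.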
The training data set is , where is the feature vector and is the target signal for all . We make the following assumption on the training data.
Assumption 1.
For every pair , we have .
We consider the regression task, and the objective function is
where are the trainable parameters. Specifically, we clarify some notations here. We use to denote the norm of the vector . We further use and to denote the spectral norm and the Frobenius norm of the matrix , respectively. Denote and .
We note that the initialization scheme, the choice of , and the assumption on the data are the same as those in Allen-Zhu et al. (2018b), so that the results are comparable.
The training is conducted by running the gradient descent algorithm. The gradient is computed through backpropagation. Since the layer and the following layers have different forms, we treat them separately. Specifically, for a fixed sample , we have
where is a backpropagation operator to simplify the expression given by
For all , we define
3 Main Result
Given the model introduced in Section 2, our main result for gradient descent is as follows.
Theorem 1.
For the ResNet defined and initialized as in Section 2, if the network width
, then with probability at least
, gradient descent with learning rate finds a point in iterations. This implies that gradient descent converges to a global minimum at a linear rate. The bound on does not directly depend on and if the third term in dominates, which usually should be the case. We make the following two remarks to compare our result with previous works.
Remark 1.
Remark 2.
The network width requirement imposed on in Theorem 1 does not directly depend on the optimization accuracy .
We can also obtain a similar result for mini-batch stochastic gradient descent.
Theorem 2.
In the following, we first present the high-level idea of the proof from a generic nonconvex optimization perspective. We then give the proof roadmap for Theorem 1 and explain why and how we can achieve a stronger result for optimizing overparameterized ResNet.
3.1 High-Level Idea of the Proof
From generic nonconvex optimization, we know that in order to establish linear convergence of the function value to the global minimum, one needs at least a gradient dominance condition. Suppose that is a global minimizer of a generic function , and is a neighborhood around with radius ; then the gradient dominance condition with respect to is depicted as
Suppose further that the gradient of satisfies some smoothness condition, e.g., is Lipschitz continuous
for all . Then the gradient descent update step
gives linear convergence of the function value if one chooses (Karimi et al., 2016).
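The mechanism above can be illustrated on a one-dimensional toy problem. The function below, f(x) = x^2 + 3 sin^2(x), is a standard example from Karimi et al. (2016): it is nonconvex but satisfies a gradient dominance (PL) condition, so gradient descent with step size 1/L_smooth drives the function value to the global minimum at a linear rate. The function and constants are a textbook illustration, not the paper's objective.

```python
import math

# f is nonconvex but gradient-dominated: |f'(x)|^2 >= 2*mu*(f(x) - f*) with f* = 0.
f = lambda x: x * x + 3.0 * math.sin(x) ** 2
grad = lambda x: 2.0 * x + 3.0 * math.sin(2.0 * x)

L_smooth = 8.0          # |f''(x)| = |2 + 6*cos(2x)| <= 8, so f is 8-smooth
x = 2.0                 # arbitrary starting point
vals = [f(x)]
for _ in range(200):
    x -= grad(x) / L_smooth   # gradient descent with step size 1/L_smooth
    vals.append(f(x))

print(vals[-1])  # converges toward the global minimum value f* = 0
```

Despite nonconvexity, the iterates cannot stall at a spurious stationary point because the gradient dominance condition lower-bounds the gradient norm by the suboptimality gap.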
3.2 Proof Roadmap
One then only needs to build a similar gradient dominance condition and gradient smoothness condition for deep ResNet to show the linear convergence of gradient descent.
We first build the gradient upper bound for deep ResNet.
Theorem 3.
With probability at least over the randomness of , the following holds for every , every , and every with for :
(2)  
(3)  
(4) 
We establish a tighter gradient upper bound than Allen-Zhu et al. (2018b) by incorporating the scaling of the residual layers. Specifically, Theorem 3 treats the top layer and the residual layers separately. This gives us the freedom to tighten the smoothness property in Theorem 5.
Theorem 4.
Let . With probability at least over the randomness of , it satisfies for every with ,
(5) 
This gradient lower bound on acts as the gradient dominance condition, and it is the same as in Allen-Zhu et al. (2018b) except that our admissible range of does not depend on the depth .
With the help of Theorem 3 and several improvements, we can obtain a tighter bound on the semi-smoothness condition of the objective function.
Theorem 5.
Let and be at random initialization. With probability at least over the randomness of , we have for every with , and for every with , we have
(6) 
This semi-smoothness condition is stronger than that of Allen-Zhu et al. (2018b) because it removes the dependence of the right-hand side on and it holds over a larger region, i.e., the admissible range of increases.
Our main improvements include the following; they will be made more specific in Section 4.

We provide a tighter bound on , i.e., the representation at layer . Now the bound can be made arbitrarily close to 1 for ResNets of any depth, which is critical for downstream bounding tasks, e.g., the separateness used in proving Theorem 4.

We enlarge the region where the good properties hold. The region now breaks the dependence on the depth .

We improve the bound on the spectral norm of the perturbed intermediate mappings, which is helpful for downstream bounding tasks.
Finally, we can prove Theorem 1 with the help of Theorems 3, 4, and 5, which together produce a bound on the overparameterization requirement of .
Outline Proof of Theorem 1
We note that we remove the dependence of on the solution accuracy by employing the fact that the gradient norm shrinks to 0 exponentially fast along the path of the gradient descent iteration. We also treat and separately to obtain a free bound on . The complete proof is relegated to Appendix D.
Based on the forward stability and the randomness of , we can show that with probability at least , and therefore .
Assume that for every ,
(7)  
(8) 
From Theorem 5 and Theorem 3, we can obtain that for one gradient descent step,
(9) 
where the last inequality uses the gradient lower bound in Theorem 4, the choice of , and the assumption on . That is, after iterations, .
We need to verify that for each , the iterate stays in the region where the good properties hold. Therefore, we calculate
(10) 
where (a) is due to Theorem 3 and (b) is due to an upper bound of the sum of a geometric sequence. Similarly, we have for ,
4 Proofs of Theorems and Critical Lemmas
In this section, we prove the theorems in Section 3
and introduce several lemmas that help to establish the proofs. First we list several useful bounds on the Gaussian distribution.
Lemma 1.
Suppose , then
(11)  
(12) 
Another bound is on the spectral norm of a random matrix (Vershynin, 2012, Corollary 5.35).
Lemma 2.
Let , and entries of
are independent standard Gaussian random variables. Then for every
, with probability at least one has(13) 
where denotes the largest singular value of .
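The cited spectral norm bound can be checked empirically: for an N x n matrix with i.i.d. standard Gaussian entries, the largest singular value should not exceed sqrt(N) + sqrt(n) + t except with probability at most 2 exp(-t^2/2). The dimensions and trial count below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, t = 400, 100, 3.0
violations = 0
for _ in range(50):
    A = rng.standard_normal((N, n))
    s_max = np.linalg.norm(A, ord=2)   # ord=2 gives the largest singular value
    violations += s_max > np.sqrt(N) + np.sqrt(n) + t
print(violations)  # typically 0: the bound holds in every trial
```

In practice the largest singular value concentrates tightly near sqrt(N) + sqrt(n), so the bound with t = 3 is comfortably satisfied.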
Next we give a useful lemma related to ResNet (slightly different from the one in Allen-Zhu et al. (2018b)).
Lemma 3.
For a ResNet initialized as in Section 2, with probability at least , one has
(14) 
for any and can be made arbitrarily small by the choice of .
Next we show the good properties at initialization with the help of randomization and concentration. Then we show that these properties still hold after small perturbations. Finally, we prove that the perturbation is indeed small for gradient descent updates with an appropriate step size.
4.1 Critical Lemmas at Initialization
The main idea is to build forward and backward stability at initialization, i.e., the norms and distances are preserved even after many layers of mapping.
We first bound how the norm changes after layers’ mapping.
Lemma 4.
With probability at least over the randomness of and , we have
(15) 
where can be made arbitrarily small by the choice of and a sufficiently large .
We note that Lemma 4 achieves a stronger result than the argument in Allen-Zhu et al. (2018b), which cannot guarantee a bound arbitrarily close to . The property of being arbitrarily close to is required for downstream bounding tasks; for example, the gradient lower bound (Theorem 4) and the separateness result (Lemma 6) require this property.
Proof.
With property (14), we can derive that for every and . The lower bound on is argued as follows for a fixed input .
Note that the coordinates of are i.i.d. from a distribution which is 0 with probability and with probability (Allen-Zhu et al., 2018b, Fact 4.2). Therefore, with probability , .
The event that for all input samples and all holds with probability at least . Condition on the above event, we have due to the choice of .
Moreover, since is Gaussian with probability and 0 with probability , then
Let , then . If choosing , then we have . Hence . We note that the above constants 1.1, 0.9 and 0.98 can be made arbitrarily close to 1 by choosing appropriately and sufficiently large. ∎
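The distributional fact invoked in this proof (Allen-Zhu et al. (2018b), Fact 4.2) can be checked empirically: with Gaussian weights of variance 2/m, each coordinate of ReLU(W h) is 0 with probability 1/2 and half-Gaussian otherwise, so the squared norm is preserved in expectation. The width and trial count below are arbitrary illustrative choices.

```python
import numpy as np

# With W_ij ~ N(0, 2/m), each coordinate of ReLU(W h) is 0 with probability
# 1/2 and |N(0, 2*||h||^2/m)| otherwise, hence E ||ReLU(W h)||^2 = ||h||^2.
rng = np.random.default_rng(2)
m = 512
h = rng.standard_normal(m)
ratios = []
for _ in range(100):
    W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
    out = np.maximum(W @ h, 0.0)
    ratios.append(out @ out / (h @ h))

print(np.mean(ratios))  # close to 1: the squared norm is preserved on average
```

Roughly half the output coordinates are zeroed by the ReLU, and the factor 2 in the weight variance exactly compensates for this loss, which is why the norm neither explodes nor vanishes at initialization.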
We next bound the norm of a sparse vector after the network mapping.
Lemma 5.
If , then for all and and for all sparse vectors and for all ,
(16) 
holds with probability at least .
Proof.
For any fixed vector , holds with probability at least (over the randomness of ).
On the above event, for a fixed vector and any fixed for , the randomness only comes from , then
is a Gaussian variable with mean 0 and variance no larger than
. Hence . Taking a net over all sparse vectors of and all dimensional vectors of , if then with probability the claim holds for all sparse vectors of and all dimensional vectors of . ∎
We next give a bound on the distance between the representations and at each layer for two input vectors with . In comparison with the similar result in Allen-Zhu et al. (2018b), our distance bound does not depend on the depth .
Lemma 6.
For any and any pair satisfying , with probability at least ,
holds for all .
Proof.
The full proof is relegated to Appendix A. ∎
4.2 Critical Lemmas after Perturbation
Next we establish the forward stability after perturbation. We use to denote the weight matrices at initialization and use to denote the perturbation matrices. Let . Similarly, we define and for , and and . Furthermore, we let and .
Lemma 7.
Suppose for , and for . Then with probability at least , the following bounds on and hold for all and all ,
(17)  
(18) 
Proof.
The proof is relegated to Appendix B. ∎
Lemma 8.
With probability at least over the randomness of , , for every , every diagonal matrix such that for all , and every perturbation matrix with , we have
(19)  
(20) 
Proof.
This is a direct result of applying the same argument as in the proof of Lemma 3. ∎
We note that the spectral norm bound in the above lemma no longer depends on the depth, in sharp contrast to the feedforward case.
4.3 Proofs of Theorems
Proof of Theorem 4 (Gradient Lower Bound)
Because the gradient is pathological and data-dependent, in order to bound the gradient we need to consider all possible points and all cases of data. Hence we first introduce an arbitrary loss vector, and then the gradient bound can be obtained by taking a union bound.
Definition 1 (Definition 6.1 in Allen-Zhu et al. (2018b)).
For any vector tuple (viewed as a fake loss vector), we define
Proof.
The gradient lower bound at initialization is given in (Allen-Zhu et al., 2018b, Section 6.2) via smoothed analysis (Spielman and Teng, 2004): with high probability the gradient is lower bounded, although in the worst case it might be 0. The proof is the same given the two prerequisite results, Lemma 4 and Lemma 6. We shall not repeat the proof here.
Now suppose that we have