Although deep neural networks have achieved remarkable success on various tasks, e.g., computer vision (He et al., 2016) and natural language understanding (Hochreiter and Schmidhuber, 1997), they still lack a rigorous theoretical study of their optimization and generalization properties. Specifically for optimization, because the loss of a deep neural network is highly nonconvex, local search algorithms like gradient descent are hard to analyze with performance guarantees. Many recent works (Choromanska et al., 2015; Kawaguchi, 2016; Nguyen and Hein, 2017; Soudry and Hoffer, 2017) have studied the loss surface of neural networks, and a common claim is that (deep) neural networks have essentially no bad local minima. However, the scenarios they study often rest on strict assumptions on the network architecture, e.g., deep linear networks, or shallow networks with one hidden layer or differentiable nonlinear activations, and on the input data, e.g., Gaussian or linearly separable input data. In fact, Safran and Shamir (2017) have shown that spurious local minima are common in two-layer ReLU neural networks. Overall, the loss surface study is still far from explaining practical models.
As most neural network models are trained with (stochastic) gradient descent, the optimization properties of gradient descent for training deep neural networks have also been widely studied. Soltanolkotabi et al. (2018); Brutzkus et al. (2017) point out that over-parameterization might play a key role in the convergence analysis of (stochastic) gradient descent. More recently, Li and Liang (2018); Du et al. (2019) prove that (stochastic) gradient descent converges linearly to the global minimum when training two-layer neural networks, as long as the network is sufficiently over-parameterized. The high-level idea is to show that the gradient of the network exhibits a benign property at initialization, and then to argue that gradient descent finds the global minimum within a neighborhood of the initialization in which this benign property roughly persists.
A related line of work proves that gradient descent finds the global minimum under the assumptions that the activation function is smooth and that some Gram matrix at the last layer has a lower-bounded singular value. This result requires the width of the network to grow exponentially with the depth for feedforward networks. At the same time, Allen-Zhu et al. (2018b) prove that a width growing polynomially with the depth is enough to show the linear convergence of gradient descent for feedforward networks with ReLU activation. The high-level idea is to first bound the forward and backward stability for deep networks and then apply an argument similar to the convergence result for the two-layer case.
From the above theoretical results, it seems that any vanilla feedforward neural network can be successfully trained as long as it is sufficiently over-parameterized; put differently, the practical difficulty of training deep networks, e.g., exploding or vanishing gradients, would be attributed to the network not being wide enough. However, in practice, with skip connections we can successfully train deep networks with hundreds or even thousands of layers without much difficulty. This naturally motivates us to ask:
“Does the residual network (ResNet) make itself preferable to the vanilla feedforward network from the perspective of the theoretical convergence analysis of gradient descent?”
We note that although Allen-Zhu et al. (2018b); Du et al. (2018) have established convergence results of gradient descent for ResNet, their results do not clearly answer this question. Du et al. (2018) show that the provable number of training steps for ResNet is polynomial in the number of layers, while that for the vanilla feedforward network is exponential. Nonetheless, Allen-Zhu et al. (2018b) show that the provable training time for the feedforward network is also polynomial in the number of layers, as is that for ResNet, which makes the benefit of ResNet unclear.
In this paper we establish that for ResNet the over-parameterization requirement on the width does not directly depend on the depth, which is the best depth dependence we can expect. Our contributions can be summarized as follows.
We show that the over-parameterization requirement for ResNet is almost independent of the depth of the network.
We show that the provable number of training steps does not directly depend on the depth of the network, which suggests that training a deep over-parameterized ResNet can be almost as easy as training a two-layer network.
Moreover, the over-parameterization for ResNet does not depend on the optimization accuracy (the new version of Allen-Zhu et al. (2018b) also achieves this). Technically, we make several critical improvements over the proof in Allen-Zhu et al. (2018b) for analyzing the convergence of gradient descent when training over-parameterized deep ResNet. Specifically, we exploit the fact that both the output change of each layer and the magnitude of the gradient on the parameters in the residual block become smaller as the depth of the network increases, because the output of the parametric mapping in the residual block is scaled by a factor $\tau$ depending on the depth $L$ and the width $m$, a scaling adopted in both Allen-Zhu et al. (2018b) and Du et al. (2018). We note that $\tau$ being small (preliminary experiments suggest the scaling may be further relaxed, but a rigorous argument needs further development) is necessary both for the proof and in practice for our ResNet model, which does not include batch normalization layers. We fully exploit this setting of $\tau$ and successfully remove the dependence of the width $m$ on the depth $L$. Moreover, we also introduce two new proofs for bounding the forward stability and tighten several arguments in Allen-Zhu et al. (2018b). Our theoretical result reflects that, from the optimization perspective, training a deep neural network with skip connections is much easier than training a vanilla feedforward network. Extensive experiments corroborate our finding.
1.1 Related Works
Several papers argue for the benefit of ResNet, but they either lack rigorous theory or study ResNet without nonlinear activation. Specifically, Veit et al. (2016) interpret ResNet as behaving like an ensemble of shallower networks, which is imprecise because the shallower networks are trained jointly, not independently (Xie et al., 2017). Zhang et al. (2018) argue for the benefit of skip connections from the perspective of improving the local Hessian, and Hardt and Ma (2016) show that deep linear residual networks have no spurious local optima.
The most related papers are Allen-Zhu et al. (2018b); Zou et al. (2018); Du et al. (2018). Zou et al. (2018) share the same high-level proof idea as Allen-Zhu et al. (2018b); they study the binary classification problem and show that stochastic gradient descent can find the global minimum when training an over-parameterized deep ReLU network. In contrast, we improve the condition guaranteeing that gradient descent finds the global minimum for ResNet and achieve an optimal dependence of the over-parameterization on the network depth.
People are skeptical about over-parameterization partially because of the classic wisdom in learning theory: controlling the complexity of the function space leads to good generalization. However, the great success of deep learning urges us to reconsider the generalization property in the over-parameterized regime. Recently, some progress has been made along this line. Brutzkus et al. (2017) provide both optimization and generalization guarantees of the SGD solution for over-parameterized two-layer networks given that the data is linearly separable. Li and Liang (2018); Allen-Zhu et al. (2018a) show that over-parameterized neural networks provably generalize for two-layer and three-layer networks. Neyshabur et al. (2019) use unit-wise capacity and obtain a bound on the empirical Rademacher complexity, which can provide an explanation (not a rigorous argument) of the generalization of over-parameterized two-layer ReLU networks.
Papers studying other over-parameterized models and the local geometry of neural networks are also related. Xu et al. (2018)
show that over-parameterization can help Expectation Maximization avoid spurious local optima. A result with a similar flavor (Li et al., 2017) has also been obtained for the matrix sensing problem. Chizat and Bach (2018) use optimal transport theory to analyze continuous-time gradient descent on over-parameterized neural networks with a single hidden layer. Oymak and Soltanolkotabi (2018); Fu et al. (2018); Zhou and Liang (2017) study the local geometry of neural networks, which is responsible for the behavior of gradient descent.
1.2 Paper Organization
The rest of this paper is organized as follows. Section 2 introduces the model and notations. Section 3 presents the main results, including the theory and the proof roadmap. Section 4 presents the proofs of the theorems and critical lemmas. Section 5 gives experiments that support our theory. Finally, we conclude in Section 6.
2 Model and Notations
There are many residual network models since the seminal paper of He et al. (2016). Here we study a very simple ResNet model (the same model has been used in Allen-Zhu et al. (2018b) and Du et al. (2018), and many notations are borrowed from Allen-Zhu et al. (2018b), which may help readers better compare the results and proofs) because we aim at understanding how skip connections help the optimization rather than at achieving good performance. The ResNet model is described as follows:
Input layer: $h_0 = \phi(A x)$;
Residual layers: $h_l = h_{l-1} + \tau\,\phi(W_l h_{l-1})$ for $l = 1, \dots, L$;
A fully-connected layer: $h_{L+1} = \phi(W_{L+1} h_L)$;
Output layer: $y = B h_{L+1}$;
where $\phi$ is the point-wise activation function, and we use the ReLU activation $\phi(z) = \max(z, 0)$. Specifically, we assume the input dimension is $d$ and hence $A \in \mathbb{R}^{m \times d}$, the intermediate layers have the same width $m$ and hence $W_l \in \mathbb{R}^{m \times m}$ for $l = 1, \dots, L+1$, and the output has dimension $d'$ and hence $B \in \mathbb{R}^{d' \times m}$. Denote the values before activation by $g_l$ for $l = 0, 1, \dots, L+1$. Use $h_{i,l}$ and $g_{i,l}$ to denote the value of $h_l$ and $g_l$, respectively, when the input vector is $x_i$, and the diagonal sign matrix $D_{i,l}$ where $(D_{i,l})_{k,k} = \mathbb{1}\{(g_{i,l})_k \ge 0\}$.
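To make the architecture concrete, here is a minimal NumPy sketch of one forward pass, assuming the residual update h ← h + τ·ReLU(W h) (consistent with the scaled residual branch discussed in the introduction); all dimensions, variable names, and initialization scales below are illustrative, not the paper's exact choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def resnet_forward(x, A, Ws, W_top, B, tau):
    """Forward pass of the simple ResNet: input layer, L residual
    blocks with a scaled parametric branch, a fully-connected layer,
    and a linear output layer."""
    h = relu(A @ x)                  # input layer: h_0
    for W in Ws:                     # residual layers, l = 1, ..., L
        h = h + tau * relu(W @ h)    # identity + scaled branch
    h = relu(W_top @ h)              # fully-connected layer
    return B @ h                     # output layer

# Toy dimensions: input d, width m, output d_out, depth L (illustrative).
d, m, d_out, L = 4, 32, 3, 10
rng = np.random.default_rng(0)
A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))
Ws = [rng.normal(0.0, np.sqrt(2.0 / m), (m, m)) for _ in range(L)]
W_top = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
B = rng.normal(0.0, np.sqrt(1.0 / d_out), (d_out, m))
x = rng.normal(size=d)
x /= np.linalg.norm(x)               # unit-norm input
y = resnet_forward(x, A, Ws, W_top, B, tau=1.0 / L)
```

The key structural point is the loop body: the trainable branch is scaled by a small factor τ before being added to the identity path.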
We adopt the following initialization scheme:
Each entry of $A$ is sampled independently from $\mathcal{N}(0, \frac{2}{m})$;
Each entry of $W_l$ is sampled independently from $\mathcal{N}(0, \frac{2}{m})$ for $l = 1, \dots, L+1$;
Each entry of $B$ is sampled independently from $\mathcal{N}(0, \frac{1}{d'})$.
Specifically, we set $\tau$ to be small. We note that a small $\tau$ is necessary both for the proof and in practice for our ResNet model with the above initialization, because there is no batch normalization layer. For example, with a constant $\tau$ the output of the ResNet easily explodes as the depth increases, which can be verified by calculating the expected value and by experiment. However, whether the choice of $\tau$ can be improved requires further investigation.
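The explosion can indeed be checked by a quick experiment. The sketch below (assuming the residual update h ← h + τ·ReLU(W h) with illustrative He-style N(0, 2/m) Gaussian entries) pushes a unit vector through 50 residual blocks and compares a constant scaling τ = 1 with the small scaling τ = 1/L:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_norm_ratio(L, m, tau, seed=0):
    """Push a unit vector through L residual blocks
    h <- h + tau * relu(W h) and return ||h_L|| / ||h_0||."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=m)
    h /= np.linalg.norm(h)
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
        h = h + tau * relu(W @ h)
    return np.linalg.norm(h)

L, m = 50, 64
big = residual_norm_ratio(L, m, tau=1.0)        # constant scaling: norm blows up
small = residual_norm_ratio(L, m, tau=1.0 / L)  # small scaling: norm stays bounded
```

With the constant scaling the output norm grows by many orders of magnitude, while with τ = 1/L it stays within a small constant factor of the input norm, matching the discussion above.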
The training data set is $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the feature vector and $y_i$ is the target signal for all $i \in [n]$. We make the following assumption on the training data.
For every pair , we have .
We consider the regression task, and the objective function is
$$F(\vec{W}) = \sum_{i=1}^{n} \frac{1}{2} \left\| B h_{i,L+1} - y_i \right\|^2,$$
where $\vec{W} = (W_1, \dots, W_{L+1})$ are the trainable parameters. Specifically, we clarify some notations here. We use $\|v\|$ to denote the Euclidean norm of a vector $v$. We further use $\|M\|_2$ and $\|M\|_F$ to denote the spectral norm and the Frobenius norm of a matrix $M$, respectively. Denote and .
We note that the initialization scheme, the choice of $\tau$, and the assumption on the data are the same as those in Allen-Zhu et al. (2018b), so that the results are comparable.
The training is conducted by running the gradient descent algorithm. The gradient is computed through back-propagation. Since the top layer and the residual layers have different forms, we treat them separately. Specifically, for a fixed sample $(x_i, y_i)$, we have
where is a back-propagation operator to simplify the expression given by
For all , we define
3 Main Result
Given the model introduced in Section 2, our main result for gradient descent is as follows.
This implies that gradient descent converges to the global minimum at a linear rate. The bound on does not depend on and directly if the third term in dominates, which usually should be the case. We make the following two remarks to compare our result with previous works.
The requirement imposed on the network width $m$ in Theorem 1 does not directly depend on the optimization accuracy.
We can also have a similar result for mini-batch stochastic gradient descent.
In the following, we first present the high-level idea of the proof from a generic nonconvex optimization perspective. We then give the proof roadmap for Theorem 1 and explain why and how we can achieve a stronger result for optimizing over-parameterized ResNet.
3.1 Proof’s High-level Idea
From generic nonconvex optimization, we understand that in order to establish linear convergence of the function value to the global minimum, one needs at least to establish a gradient dominance condition. Suppose that $\theta^*$ is a global minimizer of a generic function $f$, and $\mathcal{B}(\theta^*, r)$ is a neighborhood around $\theta^*$ with radius $r$; then the $\mu$-gradient dominance condition with respect to $\theta^*$ reads
$$\|\nabla f(\theta)\|^2 \ge 2\mu \left( f(\theta) - f(\theta^*) \right), \quad \forall \theta \in \mathcal{B}(\theta^*, r).$$
Suppose further that the gradient of $f$ satisfies some smoothness condition, e.g., $\nabla f$ is $\ell$-Lipschitz continuous,
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le \ell \|\theta - \theta'\|,$$
for all $\theta, \theta' \in \mathcal{B}(\theta^*, r)$. The gradient descent update step
$$\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)$$
gives linear convergence of the function value if we choose $\eta = 1/\ell$ (Karimi et al., 2016).
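For completeness, the standard one-step argument behind this claim (following Karimi et al., 2016) can be sketched as follows, assuming the $\mu$-gradient dominance condition $\|\nabla f(\theta)\|^2 \ge 2\mu(f(\theta) - f(\theta^*))$ and an $\ell$-Lipschitz gradient:

```latex
% One step of gradient descent with \eta = 1/\ell:
% \theta_{t+1} = \theta_t - \tfrac{1}{\ell}\nabla f(\theta_t).
\begin{align*}
f(\theta_{t+1})
  &\le f(\theta_t) + \langle \nabla f(\theta_t), \theta_{t+1} - \theta_t \rangle
      + \frac{\ell}{2}\,\|\theta_{t+1} - \theta_t\|^2
      && \text{($\ell$-smoothness)} \\
  &=   f(\theta_t) - \frac{1}{2\ell}\,\|\nabla f(\theta_t)\|^2
      && \text{(plug in the update)} \\
  &\le f(\theta_t) - \frac{\mu}{\ell}\bigl(f(\theta_t) - f(\theta^*)\bigr)
      && \text{(gradient dominance)}
\end{align*}
% Subtracting f(\theta^*) from both sides and iterating:
% f(\theta_t) - f(\theta^*) \le (1 - \mu/\ell)^t \bigl(f(\theta_0) - f(\theta^*)\bigr).
```

The suboptimality thus contracts by the factor $1 - \mu/\ell$ at every step, which is exactly the linear convergence used in the sequel.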
3.2 Proof Roadmap
One then needs only to establish similar gradient dominance and gradient smoothness conditions for deep ResNet to show the linear convergence of gradient descent.
We first build the gradient upper bound for deep ResNet.
With probability at least over the randomness of , it satisfies for every , every , and every with for ,
We establish a tighter gradient upper bound than Allen-Zhu et al. (2018b) by involving the scaling factor $\tau$ for the residual layers. Specifically, Theorem 3 treats the top layer and the residual layers separately. This gives us the freedom to tighten the smoothness property in Theorem 5.
Let . With probability at least over the randomness of , it satisfies for every with ,
This gradient lower bound acts like the gradient dominance condition, and it is the same as in Allen-Zhu et al. (2018b) except that our admissible range on the perturbation does not depend on the depth $L$.
With the help of Theorem 3 and several improvements, we can obtain a tighter bound on the semi-smoothness condition of the objective function.
Let and be at random initialization. With probability at least over the randomness of , we have for every with , and for every with , we have
This semi-smoothness condition is stronger than that of Allen-Zhu et al. (2018b) because it removes the dependence of the right-hand side on the depth and it holds over a larger region, i.e., the admissible range of the perturbation increases.
Our main improvements include the following, which will be made more specific in Section 4.
We provide a tighter bound on the norm of the representation at layer $l$. Now the bound can be arbitrarily close to 1 for ResNet of any depth, which is critical for downstream bounding tasks, e.g., the separateness property for proving Theorem 4.
We enlarge the region where the good properties hold. Now the admissible region breaks the dependence on the depth $L$.
We improve the bound on the spectral norm of the perturbed intermediate mappings, which is helpful for downstream bounding tasks.
Outline of the Proof of Theorem 1
We note that we remove the dependence of the width $m$ on the solution accuracy by employing the fact that the gradient norm shrinks to 0 exponentially fast along the path of gradient descent iterations. We also treat and separately to obtain a -free bound on . The complete proof is relegated to Appendix D.
Based on the forward stability and the randomness of , we can show that with probability at least , and therefore .
Assume that for every ,
where the last inequality uses the gradient lower bound in Theorem 4 and the choice of and the assumption on . That is, after iterations, .
We need to verify for each , the iterate stays in the region where good properties hold. Therefore, we calculate
where (a) is due to Theorem 3 and (b) is due to an upper bound on the sum of a geometric series. Similarly, we have for ,
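The geometric decay of the loss exploited in this outline is a generic consequence of gradient dominance plus smoothness, and can be observed on any objective satisfying those conditions. As a self-contained toy illustration (not the ResNet objective), the sketch below runs gradient descent on an over-parameterized least-squares problem, where the gradient dominance condition holds with µ = λ_min(AAᵀ) > 0:

```python
import numpy as np

# Over-parameterized least squares: n equations, d >> n unknowns.
# F(w) = 0.5 * ||A w - y||^2 satisfies the gradient dominance
# (Polyak-Lojasiewicz) condition with mu = lambda_min(A A^T) > 0,
# so gradient descent with step size 1/ell converges linearly to F = 0.
rng = np.random.default_rng(0)
n, d = 10, 100
A = rng.normal(size=(n, d))
y = rng.normal(size=n)

ell = np.linalg.norm(A, 2) ** 2        # smoothness constant (top singular value squared)
mu = np.linalg.eigvalsh(A @ A.T)[0]    # gradient dominance constant
w = np.zeros(d)

losses = []
for _ in range(200):
    r = A @ w - y
    losses.append(0.5 * r @ r)
    w -= (1.0 / ell) * A.T @ r         # gradient step with eta = 1/ell

rate = 1.0 - mu / ell                  # guaranteed per-step contraction factor
```

Each iteration contracts the suboptimality by at least the factor 1 − µ/ℓ, mirroring the linear convergence established for ResNet above.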
4 Proofs of Theorems and Critical Lemmas
In this section, we prove the theorems in Section 3
and introduce several lemmas that help establish the proofs. First, we list several useful bounds related to the Gaussian distribution.
Suppose , then
Let $W$ be a matrix whose entries are independent standard Gaussian random variables. Then for every , with probability at least , one has
where denotes the largest singular value of .
Next we give a useful lemma related to ResNet (slightly different from that in Allen-Zhu et al. (2018b)).
For ResNet initialized as in Section 2, with probability at least , one has
for any and can be made arbitrarily small by the choice of .
Next we show the good properties at initialization with the help of randomization and concentration. Then we show that such properties still hold after small perturbations. Finally, we prove that the perturbation is indeed small for gradient descent updates with an appropriate step size.
4.1 Critical Lemmas at Initialization
The main idea is to establish the forward and backward stability at initialization, i.e., the norm and the distance are preserved even after many layers of mapping.
We first bound how the norm changes after layers’ mapping.
With probability at least over the randomness of and , we have
where can be arbitrarily small for the choice of and a sufficiently large .
We note that Lemma 4 achieves a stronger result than the corresponding argument in Allen-Zhu et al. (2018b), which cannot guarantee the bound arbitrarily close to . The property of being arbitrarily close to is required for downstream bounding tasks. For example, the gradient lower bound (Theorem 4) requires this property, and so does the separateness property (Lemma 6).
With property (14), we can derive that for every and . The lower bound on is argued as follows for a fixed input .
Note that each coordinate of follows i.i.d. a distribution which is 0 with probability 1/2, and the magnitude of a Gaussian with probability 1/2 (Allen-Zhu et al., 2018b, Fact 4.2). Therefore, with probability , .
The event that for all input samples and all holds with probability at least . Conditioning on the above event, we have due to the choice of .
Moreover, since each coordinate is Gaussian with probability 1/2 and 0 with probability 1/2, then
Let , then . If choosing , then we have . Hence . We note that the above constants 1.1, 0.9 and 0.98 can be made arbitrarily close to 1 by choosing appropriately and sufficiently large. ∎
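The distributional facts used in this proof are easy to check by simulation. The sketch below (with illustrative entry variance 2/m for W and a fixed unit input h, matching the He-style initialization above) verifies that each coordinate of ReLU(Wh) is zero with probability about 1/2 and that E‖ReLU(Wh)‖² ≈ ‖h‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 256, 500

h = rng.normal(size=m)
h /= np.linalg.norm(h)                 # fixed unit input, ||h|| = 1

zero_fracs, sq_norms = [], []
for _ in range(trials):
    W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
    v = np.maximum(W @ h, 0.0)         # ReLU(W h)
    zero_fracs.append(np.mean(v == 0.0))
    sq_norms.append(v @ v)

# Each coordinate of W h is a symmetric Gaussian, so ReLU zeroes it
# with probability 1/2; the surviving half carries the full variance
# 2/m * ||h||^2, giving E||ReLU(W h)||^2 = ||h||^2.
avg_zero = float(np.mean(zero_fracs))
avg_sq = float(np.mean(sq_norms))
```

The Monte-Carlo averages land very close to 1/2 and 1 respectively, consistent with the concentration argument in the proof.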
We next prove that the norm of a sparse vector is preserved through the network mapping.
If , then for all and and for all -sparse vectors and for all ,
holds with probability at least .
For any fixed vector , holds with probability at least (over the randomness of ).
On the above event, for a fixed vector and any fixed for , the randomness only comes from , then
is a Gaussian variable with mean 0 and variance no larger than. Hence
Take an -net over all -sparse vectors of and all -dimensional vectors of ; if then with probability the claim holds for all -sparse vectors of and all -dimensional vectors of . ∎
We next give a bound on the distance between the representations and in each layer for two input vectors with . In comparison with a similar result in Allen-Zhu et al. (2018b), our distance bound does not depend on the depth $L$.
For any and any pair satisfying , with probability at least ,
holds for all .
The full proof is relegated to Appendix A. ∎
4.2 Critical Lemmas after Perturbation
Next we establish the forward stability after perturbation. We use to denote the weight matrices at initialization and use to denote the perturbation matrices. Let . Similarly, we define and for , and and . Furthermore, we let and .
Suppose for , and for . Then with probability at least , the following bounds on and hold for all and all ,
The proof is relegated to Appendix B. ∎
With probability at least over the randomness of , , for every , for all diagonal matrices such that for all , for all perturbation matrices with , we have
This follows directly from the argument in the proof of Lemma 3. ∎
We note that the spectral norm bound in the above lemma no longer depends on the depth, in sharp contrast with the feedforward case.
4.3 Proofs of Theorems
Proof of Theorem 4 (Gradient Lower Bound)
Because the gradient is pathological and data-dependent, in order to build a bound on the gradient we need to consider all possible points and all cases of data. Hence we first introduce an arbitrary loss vector, and then the gradient bound can be obtained by taking a union bound.
Definition 1 (Definition 6.1 in Allen-Zhu et al. (2018b)).
For any vector tuple (viewed as a fake loss vector), we define
The gradient lower bound at the initialization is given in (Allen-Zhu et al., 2018b, Section 6.2) via smoothed analysis (Spielman and Teng, 2004): with high probability the gradient is lower bounded, although in the worst case it might be 0. The proof is the same given the two preconditioning results, Lemma 4 and Lemma 6. We shall not repeat the proof here.
Now suppose that we have