Training Over-parameterized Deep ResNet Is almost as Easy as Training a Two-layer Network

It has been proved that gradient descent converges linearly to the global minimum when training deep neural networks in the over-parameterized regime. However, according to Allen-Zhu et al. (2018), to guarantee the linear convergence of gradient descent for a residual network (ResNet), the width of each layer must grow at least polynomially with the depth (the number of layers), which shows no obvious advantage over feedforward networks. In this paper, we remove the dependence of the width on the depth of the network for ResNet, and conclude that training a deep residual network can be as easy as training a two-layer network. This theoretically justifies the benefit of skip connections in facilitating the convergence of gradient descent. Our experiments also show that the width of ResNet required for successful training is much smaller than that of a deep feedforward neural network.


1 Introduction

Although deep neural networks have achieved revolutionary success over various tasks, e.g., computer vision (He et al., 2016) and natural language understanding (Hochreiter and Schmidhuber, 1997), they still lack a rigorous theoretical study of their optimization and generalization properties. Specifically for optimization, because the loss of a deep neural network is highly nonconvex, local search algorithms like gradient descent are hard to analyze with performance guarantees. Many recent works (Choromanska et al., 2015; Kawaguchi, 2016; Nguyen and Hein, 2017; Soudry and Hoffer, 2017) have studied the loss surface of neural networks, and a common claim is that (deep) neural networks have essentially no bad local minima. However, the scenarios they study often rest on strong assumptions on the network architecture, e.g., deep linear networks, or shallow networks with one hidden layer or differentiable nonlinear activations, and on the input data, e.g., Gaussian or linearly separable input data. In fact, Safran and Shamir (2017) have shown that spurious local minima are common in two-layer ReLU neural networks. Overall, the loss surface study is still far from understanding practical models.

As most neural network models are trained with (stochastic) gradient descent, the optimization properties of gradient descent in training deep neural networks have also been widely studied.

Soltanolkotabi et al. (2018); Brutzkus et al. (2017) point out that over-parameterization might play a key role in the convergence analysis of (stochastic) gradient descent. More recently, Li and Liang (2018); Du et al. (2019) prove that (stochastic) gradient descent converges linearly to the global minimum when training two-layer neural networks, as long as the network is sufficiently over-parameterized. The high-level idea is to show that the gradient of the network exhibits benign properties at initialization, and then to argue that gradient descent finds a global minimum within a neighborhood of the initialization in which these benign properties roughly persist.

A breakthrough is achieved by Allen-Zhu et al. (2018b); Du et al. (2018), who extend the analysis to deep neural networks (more than two layers). Specifically, Du et al. (2018) prove that gradient descent finds a global minimum under the assumptions that the activation function is smooth and that a Gram matrix at the last layer has a lower-bounded smallest singular value. Their result requires the width of the network to grow exponentially with the depth for feedforward networks. At the same time, Allen-Zhu et al. (2018b) prove that a width growing polynomially with the depth is enough to show the linear convergence of gradient descent for feedforward networks with ReLU activation. The high-level idea is to first bound the forward and backward stability of deep networks and then apply an argument similar to the convergence proof for the two-layer case.

From the above theoretical results, it seems that any vanilla feedforward neural network can be successfully trained as long as it is sufficiently over-parameterized; in other words, the practical difficulty of training deep networks, e.g., exploding or vanishing gradients, would be attributed to the network not being wide enough. However, in practice, with skip connections we can successfully train deep networks with hundreds or thousands of layers without much difficulty. This naturally motivates us to ask:

Does the residual network (ResNet) make itself preferable to the vanilla feedforward network from the perspective of the theoretical convergence analysis of gradient descent?

We note that although Allen-Zhu et al. (2018b); Du et al. (2018) have established convergence results of gradient descent for ResNet, their results do not clearly answer this question. Du et al. (2018) show that the provable number of training steps for ResNet is polynomial in the number of layers, while that for the vanilla feedforward network is exponential. Nonetheless, Allen-Zhu et al. (2018b) show that the provable training time for feedforward networks is also polynomial in the number of layers, as is that for ResNet, which leaves the benefit of ResNet unclear.

In this paper, we establish that for ResNet the over-parameterization requirement on the width does not directly depend on the depth, which is the best possible depth dependence one can expect. Our contributions are summarized as follows.

• We show that the over-parameterization requirement for ResNet is almost independent of the depth of the network.

• We show that the provable number of training steps does not depend directly on the depth of the network, which means that training a deep over-parameterized ResNet can be almost as easy as training a two-layer network.

Moreover, the over-parameterization requirement for ResNet does not depend on the optimization accuracy ε. (The new version of Allen-Zhu et al. (2018b) also achieves this.) Technically, we make several critical improvements over the proof in Allen-Zhu et al. (2018b) for analyzing the convergence of gradient descent when training over-parameterized deep ResNet. Specifically, we exploit the fact that both the output change of each layer and the magnitude of the gradient on the parameters in the residual block become smaller as the depth of the network increases, because the output of the parametric mapping in the residual block is scaled by a small factor τ that depends on the depth L and the width m; this scaling is adopted in both Allen-Zhu et al. (2018b) and Du et al. (2018). We note that a small τ (preliminary experiments suggest the choice of τ may be further improved, but a rigorous argument needs further development) is necessary both for the proof and in practice for our ResNet model, which does not include batch normalization layers. We fully exploit this setting of τ and successfully remove the dependence of the width m on the depth L. Moreover, we also introduce two new proofs for bounding the forward stability and tighten several arguments in Allen-Zhu et al. (2018b). Our theoretical result reflects that, from the optimization perspective, training a deep neural network with skip connections is much easier than training a vanilla feedforward network. Extensive experiments corroborate our finding.

1.1 Related works

Several papers argue for the benefits of ResNet, but they either lack rigorous theory or study ResNet without nonlinear activations. Specifically, Veit et al. (2016) interpret ResNet as behaving like an ensemble of shallower networks, which is imprecise because the shallower networks are trained jointly, not independently (Xie et al., 2017). Zhang et al. (2018) argue for the benefit of skip connections from the perspective of improving the local Hessian, and Hardt and Ma (2016) show that deep linear residual networks have no spurious local optima.

The most related papers are Allen-Zhu et al. (2018b); Zou et al. (2018); Du et al. (2018). Zou et al. (2018) share the same high-level proof idea as Allen-Zhu et al. (2018b); they study the binary classification problem and show that stochastic gradient descent can find a global minimum when training an over-parameterized deep ReLU network. In contrast, we improve the condition guaranteeing that gradient descent finds a global minimum for ResNet and achieve an optimal dependence of the over-parameterization on the network depth.

People are skeptical about over-parameterization partially because of the classic wisdom in learning theory: controlling the complexity of the function space leads to good generalization. However, the great success of deep learning urges us to reconsider generalization in the over-parameterized regime. Recently, some progress has been made along this line.

Brutzkus et al. (2017) provide both optimization and generalization guarantees for the SGD solution of over-parameterized two-layer networks, given that the data is linearly separable. Li and Liang (2018); Allen-Zhu et al. (2018a) show that over-parameterized two-layer and three-layer networks provably generalize. Neyshabur et al. (2019) use unit-wise capacity to obtain a bound on the empirical Rademacher complexity, which provides an explanation (though not a rigorous argument) of generalization for over-parameterized two-layer ReLU networks.

Papers studying other over-parameterized models and the local geometry of neural networks are also related. Xu et al. (2018) show that over-parameterization can help expectation maximization avoid spurious local optima. A result with a similar flavor (Li et al., 2017) has also been obtained for the matrix sensing problem. Chizat and Bach (2018) use optimal transport theory to analyze continuous-time gradient descent on over-parameterized networks with a single hidden layer. Oymak and Soltanolkotabi (2018); Fu et al. (2018); Zhou and Liang (2017) study the local geometry of neural networks, which is responsible for the behavior of gradient descent.

1.2 Paper Organization

The rest of this paper is organized as follows. Section 2 introduces the model and notation. Section 3 presents the main results, including the theory and the proof roadmap. Section 4 presents the proofs of the theorems and critical lemmas. Section 5 gives experiments that support our theory. Finally, we conclude in Section 6.

2 Model and Notations

There have been many residual network models since the seminal paper of He et al. (2016). Here we study a very simple ResNet model (the same ResNet model has been used in Allen-Zhu et al. (2018b) and Du et al. (2018), and many notations are borrowed from Allen-Zhu et al. (2018b), which may help readers compare the results and proofs) because our target is understanding how skip connections help optimization rather than achieving good performance. The ResNet model is described as follows:

• Input layer: $h_{i,0} = \phi(A x_i)$;

• $L-1$ residual layers: $h_{i,l} = \phi(h_{i,l-1} + \tau W_l h_{i,l-1})$ for $l \in [L-1]$;

• A fully-connected layer: $h_{i,L} = \phi(W_L h_{i,L-1})$;

• Output layer: $y_i = B h_{i,L}$;

where $\phi$ is the point-wise activation function, and we use the ReLU activation $\phi(z) = \max\{z, 0\}$. Specifically, we assume the input dimension is $d$, and hence $A \in \mathbb{R}^{m \times d}$; the intermediate layers have the same width $m$, and hence $W_l \in \mathbb{R}^{m \times m}$ for $l \in [L]$; and the output has dimension $d$, and hence $B \in \mathbb{R}^{d \times m}$. Denote the values before activation by $g_{i,l}$ for $l \in \{0, 1, \dots, L\}$ and $i \in [n]$; that is, $h_{i,l}$ and $g_{i,l}$ denote the post-activation and pre-activation values, respectively, when the input vector is $x_i$. Define the diagonal sign matrix $D_{i,l}$ by $(D_{i,l})_{k,k} = \mathbb{1}\{(g_{i,l})_k \ge 0\}$.

We adopt the following initialization scheme:

• Each entry of $A$ is sampled independently from $\mathcal{N}(0, 2/m)$;

• Each entry of $W_l$ is sampled independently from $\mathcal{N}(0, 2/m)$ for $l \in [L]$;

• Each entry of $B$ is sampled independently from $\mathcal{N}(0, 1/d)$.

Specifically, we set τ to be a sufficiently small factor. We note that a small τ is necessary, both for the proof and in practice, for our ResNet model with the above initialization because there is no batch normalization layer. For example, if τ is not small, the output of the ResNet easily explodes as the depth increases, which can be verified by calculating the expected output norm and by experiment. However, whether the choice of τ can be improved requires further investigation.
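To make the setup concrete, the model and initialization above can be sketched in a few lines of NumPy. This is a minimal illustration only; the layer sizes and the particular value of τ below are hypothetical placeholders, not the paper's settings.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def resnet_forward(x, A, Ws, B, tau):
    """Toy ResNet forward pass: input layer h_0 = relu(A x),
    residual layers h_l = relu(h_{l-1} + tau * W_l h_{l-1}),
    a fully-connected layer h_L = relu(W_L h_{L-1}),
    and a linear output B h_L."""
    h = relu(A @ x)                      # input layer
    for W in Ws[:-1]:                    # L-1 residual layers
        h = relu(h + tau * (W @ h))
    h = relu(Ws[-1] @ h)                 # fully-connected layer L
    return B @ h                         # output layer

rng = np.random.default_rng(0)
d, m, L, tau = 4, 64, 10, 0.01           # hypothetical sizes; tau chosen small
A  = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))
Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]
B  = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, m))

x = rng.normal(size=d)
x /= np.linalg.norm(x)                   # unit-norm input, as assumed below
out = resnet_forward(x, A, Ws, B, tau)
```

The squared-error loss on a sample then matches the regression objective introduced below.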

The training data set is $\{(x_i, y_i^*) : i \in [n]\}$, where $x_i$ is the feature vector and $y_i^*$ is the target signal for each $i \in [n]$. We make the following assumption on the training data.

Assumption 1.

For every $i \in [n]$, $\|x_i\| = 1$; and for every pair $i \ne j$, $\|x_i - x_j\| \ge \delta$.

We consider the regression task, and the objective function is

  $F(\vec{W}) := \sum_{i=1}^{n} F_i(\vec{W}), \quad \text{where } F_i(\vec{W}) := \frac{1}{2}\,\|B h_{i,L} - y_i^*\|^2,$

where $\vec{W} := (W_1, \dots, W_L)$ collects the trainable parameters. We clarify some notation here: $\|v\|$ denotes the $\ell_2$ norm of a vector $v$, while $\|M\|_2$ and $\|M\|_F$ denote the spectral norm and the Frobenius norm of a matrix $M$, respectively. Denote $\|\vec{W}\|_2 := \max_{l \in [L]} \|W_l\|_2$ and $\|\vec{W}\|_F := \big(\sum_{l=1}^{L} \|W_l\|_F^2\big)^{1/2}$.

We note that the initialization scheme, the choice of τ, and the assumption on the data are the same as those in Allen-Zhu et al. (2018b), so that the results are comparable.

The training is conducted by running the gradient descent algorithm. The gradient is computed through back-propagation. Since the top layer $L$ and the residual layers have different forms, we treat them separately. Specifically, for a fixed sample $(x_i, y_i^*)$, we have

  $\nabla_{W_L} F_i(\vec{W}) := D_{i,L}\big(B^T (B h_{i,L} - y_i^*)\big) h_{i,L-1}^T,$
  $\nabla_{W_l} F_i(\vec{W}) := \tau D_{i,l}\big(\mathrm{Back}_{i,l+1}^T (B h_{i,L} - y_i^*)\big) h_{i,l-1}^T, \quad \text{for layer } l \in [L-1],$

where $\mathrm{Back}_{i,l}$ is a back-propagation operator introduced to simplify the expressions, given by

  $\mathrm{Back}_{i,l} := B\, D_{i,L} W_L\, D_{i,L-1}(I + \tau W_{L-1}) \cdots D_{i,l}(I + \tau W_l).$

For all $l \in [L]$, we define

  $\nabla_{W_l} F(\vec{W}) := \sum_{i=1}^{n} \nabla_{W_l} F_i(\vec{W}).$
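The closed-form gradients above can be sanity-checked against finite differences. The sketch below re-implements the forward pass and the top-layer gradient $\nabla_{W_L} F_i$ (all sizes are hypothetical); agreement with a numerical derivative is a standard check that the $D_{i,L}$ and $h_{i,L-1}$ factors are placed correctly.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss_and_grad_WL(x, y, A, Ws, B, tau):
    """Return F_i = 0.5 * ||B h_L - y||^2 and the closed-form gradient
    D_L (B^T (B h_L - y)) h_{L-1}^T with respect to the top layer W_L."""
    h = relu(A @ x)
    for W in Ws[:-1]:
        h = relu(h + tau * (W @ h))
    h_prev = h                                   # h_{L-1}
    g = Ws[-1] @ h_prev                          # pre-activation of layer L
    h_L = relu(g)
    err = B @ h_L - y
    D_L = (g > 0).astype(float)                  # diagonal of sign matrix D_{i,L}
    grad = np.outer(D_L * (B.T @ err), h_prev)
    return 0.5 * err @ err, grad

rng = np.random.default_rng(1)
d, m, L, tau = 3, 16, 4, 0.05
A  = rng.normal(0, np.sqrt(2 / m), (m, d))
Ws = [rng.normal(0, np.sqrt(2 / m), (m, m)) for _ in range(L)]
B  = rng.normal(0, np.sqrt(1 / d), (d, m))
x  = rng.normal(size=d); x /= np.linalg.norm(x)
y  = rng.normal(size=d)

F0, G = loss_and_grad_WL(x, y, A, Ws, B, tau)
eps = 1e-6
Ws_pert = [W.copy() for W in Ws]
Ws_pert[-1][0, 1] += eps                         # perturb one entry of W_L
F1, _ = loss_and_grad_WL(x, y, A, Ws_pert, B, tau)
fd = (F1 - F0) / eps                             # finite-difference estimate
```

Since ReLU is differentiable almost everywhere, the check holds for generic random draws.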

3 Main Result

Given the model introduced in Section 2, our main result for gradient descent is as follows.

Theorem 1.

For the ResNet defined and initialized as in Section 2, if the network width $m$ is sufficiently large (polynomially in $n$, $d$ and $1/\delta$, with no direct dependence on the depth $L$), then with high probability over the initialization, gradient descent with an appropriately chosen learning rate $\eta$ finds a point with $F(\vec{W}) \le \varepsilon$ in $O(\log(1/\varepsilon))$ iterations (up to factors independent of $\varepsilon$).

This implies that gradient descent converges to a global minimum at a linear rate. The bound on $m$ does not depend on $L$ and $\varepsilon$ directly if the third term in the width requirement dominates, which usually should be the case. We make the following two remarks to compare our result with previous works.

Remark 1.

Under the regime stated above, the network width requirement imposed on $m$ in Theorem 1 does not depend on the depth $L$, in sharp contrast with Allen-Zhu et al. (2018b) and Du et al. (2018).

Remark 2.

The network width requirement imposed on $m$ in Theorem 1 does not directly depend on the optimization accuracy $\varepsilon$.

We can also have a similar result for mini-batch stochastic gradient descent.

Theorem 2.

For the ResNet defined and initialized as in Section 2, suppose the network width $m$ is sufficiently large. Suppose we run the stochastic gradient descent update starting from the random initialization $\vec{W}^{(0)}$:

  $\vec{W}^{(t+1)} = \vec{W}^{(t)} - \frac{\eta n}{|S_t|} \sum_{i \in S_t} \nabla F_i(\vec{W}^{(t)}),$   (1)

where $S_t$ is a random subset of $[n]$. Then with high probability, stochastic gradient descent (1) with a suitable learning rate $\eta$ finds a point with $F(\vec{W}) \le \varepsilon$ in a number of iterations proportional to $\log(1/\varepsilon)$.
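For intuition (not the paper's construction), update (1) can be sketched on a stand-in problem where each $F_i$ is a scalar least-squares loss; the $n/|S_t|$ factor makes the mini-batch gradient an unbiased estimate of $\nabla F = \sum_i \nabla F_i$. All sizes and the learning rate below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, dim, batch, eta = 64, 8, 16, 0.002
X = rng.normal(size=(n, dim))
y = X @ rng.normal(size=dim)                 # realizable targets

def grad_Fi(w, i):
    """Gradient of F_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(dim)
for _ in range(3000):
    S = rng.choice(n, size=batch, replace=False)    # random subset S_t
    g = sum(grad_Fi(w, i) for i in S)
    w = w - eta * (n / batch) * g                   # update (1)
```

Under this realizable (interpolation) setting the rescaled SGD drives all sample losses to zero, mirroring the linear convergence claimed in Theorem 2.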

In the following, we first present the high-level idea of the proof from a generic nonconvex optimization perspective. We then give the proof roadmap for Theorem 1 and explain why and how we achieve a stronger result for optimizing over-parameterized ResNet.

3.1 Proof’s High-level Idea

From generic nonconvex optimization, we know that in order to establish linear convergence of the function value to the global minimum, one needs at least to establish a gradient dominance condition. Suppose that $x^*$ is a global minimizer of a generic function $f$, and $B_{x^*}(\rho)$ is a neighborhood around $x^*$ with radius $\rho$; then the $\lambda$-gradient dominance condition with respect to $x^*$ reads

  $\forall x \in B_{x^*}(\rho):\quad \frac{1}{\lambda}\,\|\nabla f(x)\|_2^2 \ge f(x) - f(x^*).$

Suppose further that the gradient of $f$ satisfies some smoothness condition, e.g., $\nabla f$ is $L$-Lipschitz continuous, so that

  $f(x_2) \le f(x_1) + \langle \nabla f(x_1), x_2 - x_1 \rangle + \frac{L}{2}\,\|x_2 - x_1\|^2$

for all $x_1, x_2$. The gradient descent update step

 x(t+1)←x(t)−η⋅∇f(x(t))

gives linear convergence of the function value if one chooses $\eta = 1/L$ (Karimi et al., 2016).
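As a toy illustration of this generic argument (unrelated to the network setting), a least-squares objective satisfies the gradient dominance (Polyak–Łojasiewicz) condition with constant given by the smallest eigenvalue $\mu$ of $A^T A$, and is smooth with constant the largest eigenvalue; gradient descent with $\eta = 1/L$ then contracts the optimality gap geometrically:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # global minimizer
eigs = np.linalg.eigvalsh(A.T @ A)
mu, Lip = eigs[0], eigs[-1]                      # PL constant and smoothness constant
eta = 1.0 / Lip

x = np.zeros(5)
gaps = [f(x) - f(x_star)]                        # optimality gaps f(x_t) - f(x*)
for _ in range(1000):
    x = x - eta * grad(x)                        # gradient descent step
    gaps.append(f(x) - f(x_star))
```

Each step satisfies $f(x_{t+1}) - f^* \le (1 - \mu/L)\,(f(x_t) - f^*)$, the same mechanism that Theorems 3–5 below establish for the ResNet objective.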

One then only needs to establish a similar gradient dominance condition and gradient smoothness condition for deep ResNet in order to show the linear convergence of gradient descent.

We first build the gradient upper bound for deep ResNet.

Theorem 3.

With high probability over the randomness of the initialization, the following holds for every $i \in [n]$, every $l \in [L-1]$, and every $\vec{W}$ with $\|W_l - W_l^{(0)}\|_2 \le \tau\omega$ for $l \in [L-1]$ and $\|W_L - W_L^{(0)}\|_2 \le \omega$:

  $\|\nabla_{W_l} F_i(\vec{W})\|_F^2 \le O\!\Big(\frac{F_i(\vec{W})}{d}\,\tau^2 m\Big),$   (2)
  $\|\nabla_{W_l} F(\vec{W})\|_F^2 \le O\!\Big(\frac{F(\vec{W})}{d}\,\tau^2 m n\Big),$   (3)
  $\|\nabla_{W_L} F(\vec{W})\|_F^2 \le O\!\Big(\frac{F(\vec{W})}{d}\,m n\Big).$   (4)

We establish a tighter gradient upper bound than Allen-Zhu et al. (2018b) by involving the factor τ for the residual layers. Specifically, Theorem 3 treats the top layer $L$ and the residual layers $l \in [L-1]$ separately. This gives us the freedom to tighten the smoothness property in Theorem 5.

Theorem 4.

Let $\omega$ be sufficiently small (independent of the depth $L$). With high probability over the randomness of the initialization, for every $\vec{W}$ with $\max_{l \in [L]} \|W_l - W_l^{(0)}\|_2 \le \omega$, we have

  $\|\nabla_{W_L} F(\vec{W})\|_F^2 \ge \Omega\!\Big(\frac{\max_{i \in [n]} F_i(\vec{W})}{dn/\delta}\, m\Big).$   (5)

This gradient lower bound on $\|\nabla_{W_L} F(\vec{W})\|_F$ acts like the gradient dominance condition, and it is the same as in Allen-Zhu et al. (2018b), except that our admissible range of $\omega$ does not depend on the depth $L$.

With the help of Theorem 3 and several improvements, we can obtain a tighter bound on the semi-smoothness condition of the objective function.

Theorem 5.

Let $\vec{W}^{(0)}$ be at random initialization. With high probability over the randomness of the initialization, for every $\breve{\vec{W}}$ with $\|\breve{\vec{W}} - \vec{W}^{(0)}\|_2 \le \omega$, and for every perturbation $\vec{W}'$ with $\|\vec{W}'\|_2 \le \omega$, we have

  $F(\breve{\vec{W}} + \vec{W}') \le F(\breve{\vec{W}}) + \langle \nabla F(\breve{\vec{W}}), \vec{W}' \rangle + \sqrt{n F(\breve{\vec{W}})} \cdot \frac{\omega^{1/3}\sqrt{m \log m}}{\sqrt{d}} \cdot O\Big(\sum_{l=1}^{L} \|W'_l\|_2\Big) + O\Big(\frac{nm}{d}\Big)\|\vec{W}'\|_2^2.$   (6)

This semi-smoothness condition is stronger than that of Allen-Zhu et al. (2018b) because it removes the dependence of the right-hand side on $L$, and it holds over a larger region, i.e., the admissible range of $\omega$ increases.

Our main improvements include the following; they are made precise in Section 4.

• We provide a tighter bound on $\|h_{i,l}\|$, i.e., the norm of the representation at layer $l$. Now $\|h_{i,l}\|$ can be made arbitrarily close to 1 for ResNets of any depth, which is critical for downstream bounding tasks, e.g., the $\delta$-separateness used in proving Theorem 4.

• We enlarge the region where the good properties hold. Now the admissible perturbation radius $\omega$ breaks the dependence on the depth $L$.

• We improve the bound on the spectral norm of the perturbed intermediate mappings, which is helpful for downstream bounding tasks.

Finally, we can prove Theorem 1 with the help of Theorems 3, 4 and 5, which together produce a bound on the over-parameterization requirement for $m$.

Proof Outline of Theorem 1

We note that we remove the dependence of $m$ on the solution accuracy $\varepsilon$ by employing the fact that the gradient norm shrinks to 0 exponentially fast along the path of the gradient descent iterations. We also treat $W_L$ and $W_l$ for $l \in [L-1]$ separately to obtain an $L$-free bound on $m$. The complete proof is relegated to Appendix D.

Based on the forward stability and the randomness of $B$, we can show that, with high probability, $F(\vec{W}^{(0)}) = O(n \log^2 m)$, and therefore $\sqrt{F(\vec{W}^{(0)})} = O(\sqrt{n} \log m)$.

Assume that for every $t$ and every $l \in [L-1]$,

  $\|W_L^{(t)} - W_L^{(0)}\|_F \le \omega \triangleq O\Big(\frac{\delta^3}{n^9}\Big),$   (7)
  $\|W_l^{(t)} - W_l^{(0)}\|_F \le \tau\omega.$   (8)

From Theorem 5 and Theorem 3, we can obtain that for one gradient descent step,

  $F(\vec{W}^{(t+1)}) \le F(\vec{W}^{(t)}) - \eta\,\|\nabla F(\vec{W}^{(t)})\|_F^2 + O\Big(\frac{\eta n m \omega^{1/3}}{d} + \frac{\eta^2 n^2 m^2}{d^2}\Big) F(\vec{W}^{(t)}) \le \Big(1 - \Omega\Big(\frac{\eta\delta m}{d n^2}\Big)\Big) F(\vec{W}^{(t)}),$   (9)

where the last inequality uses the gradient lower bound in Theorem 4, the choice of $\eta$, and the assumption on $\omega$. That is, after $O\big(\frac{dn^2}{\eta\delta m} \log \frac{F(\vec{W}^{(0)})}{\varepsilon}\big)$ iterations, we have $F(\vec{W}^{(t)}) \le \varepsilon$.

We need to verify that for each $t$, the iterate stays in the region where the good properties hold. Therefore, we calculate

  $\|W_L^{(t)} - W_L^{(0)}\|_F \le \sum_{i=0}^{t-1} \eta\,\|\nabla_{W_L} F(\vec{W}^{(i)})\|_F \overset{(a)}{\le} O\big(\eta\sqrt{nm/d}\big) \sum_{i=0}^{t-1} \sqrt{F(\vec{W}^{(i)})} \overset{(b)}{\le} O\Big(\frac{n^3\sqrt{d}\,\log m}{\delta\sqrt{m}}\Big),$   (10)

where (a) is due to Theorem 3 and (b) is due to an upper bound on the sum of a geometric sequence. Similarly, we have for $l \in [L-1]$,

  $\|W_l^{(t)} - W_l^{(0)}\|_F \le \sum_{i=0}^{t-1} \eta\,\|\nabla_{W_l} F(\vec{W}^{(i)})\|_F \le O\big(\eta\tau\sqrt{nm/d}\big) \sum_{i=0}^{t-1} \sqrt{F(\vec{W}^{(i)})} \le O\Big(\frac{\tau n^3\sqrt{d}\,\log m}{\delta\sqrt{m}}\Big).$

By combining (10) with the requirement (7) on $\omega$, we obtain a lower bound on the width $m$.
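For completeness, step (b) can be unpacked as a geometric-series bound. Writing the per-step contraction in (9) as $1 - \gamma$, we have (a sketch, with constants suppressed):

```latex
\sum_{i=0}^{t-1}\sqrt{F(\vec{W}^{(i)})}
\;\le\; \sqrt{F(\vec{W}^{(0)})}\sum_{i=0}^{\infty}\bigl(1-\gamma\bigr)^{i/2}
\;=\; \frac{\sqrt{F(\vec{W}^{(0)})}}{1-\sqrt{1-\gamma}}
\;\le\; \frac{2\sqrt{F(\vec{W}^{(0)})}}{\gamma},
\qquad
\gamma = \Omega\!\left(\frac{\eta\,\delta\,m}{d\,n^{2}}\right),
```

using $1 - \sqrt{1-\gamma} \ge \gamma/2$. Multiplying by the per-step factor $\eta\sqrt{nm/d}$ from (a) and substituting $\gamma$ and the bound on $F(\vec{W}^{(0)})$ yields the right-hand side of (10).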

4 Proofs of Theorems and Critical Lemmas

In this section, we prove the theorems in Section 3 and introduce several lemmas that help to establish the proofs. First we list several useful bounds on the Gaussian distribution.

Lemma 1.

Suppose $X \sim \mathcal{N}(0, \sigma^2)$; then

  $P\{|X| \le x\} \ge 1 - \exp\Big(-\frac{x^2}{2\sigma^2}\Big),$   (11)
  $P\{|X| \le x\} \le \sqrt{\frac{2}{\pi}}\,\frac{x}{\sigma}.$   (12)
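Both bounds in Lemma 1 are easy to check by direct Monte Carlo simulation ($\sigma$ and $x$ below are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, x = 1.0, 2.0
samples = np.abs(rng.normal(0.0, sigma, size=100_000))

p = np.mean(samples <= x)                         # empirical P{|X| <= x}
lower = 1.0 - np.exp(-x**2 / (2.0 * sigma**2))    # bound (11)
upper = np.sqrt(2.0 / np.pi) * x / sigma          # bound (12)
```

Note that (12) bounds the probability by the peak density times the interval length, so it is informative only for small $x/\sigma$; for $x = 2\sigma$ it exceeds 1.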

Another bound concerns the spectral norm of a random matrix (Vershynin, 2012, Corollary 5.35).

Lemma 2.

Let $A$ be an $N \times n$ matrix whose entries are independent standard Gaussian random variables. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-t^2/2)$, one has

  $s_{\max}(A) \le \sqrt{N} + \sqrt{n} + t,$   (13)

where $s_{\max}(A)$ denotes the largest singular value of $A$.
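Lemma 2 is easy to probe numerically; a single draw with hypothetical dimensions rarely violates the bound (at $t = 4$ the stated failure probability is about $2e^{-8} \approx 6.7 \times 10^{-4}$):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, t = 400, 100, 4.0
A = rng.normal(size=(N, n))                    # i.i.d. standard Gaussian entries

s_max = np.linalg.svd(A, compute_uv=False)[0]  # largest singular value
bound = np.sqrt(N) + np.sqrt(n) + t            # holds w.p. >= 1 - 2 exp(-t^2 / 2)
```

Typically $s_{\max}$ concentrates near $\sqrt{N} + \sqrt{n}$, so the slack $t$ is rarely consumed.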

Next we give a useful lemma related to ResNet (slightly different from that in Allen-Zhu et al. (2018b)).

Lemma 3.

For the ResNet initialized as in Section 2, with high probability one has

  $\|(I + \tau W_b)\, D_{i,b-1}(I + \tau W_{b-1}) \cdots D_{i,a}(I + \tau W_a)\|_2 \le 1 + c$   (14)

for any $1 \le a \le b \le L-1$, where $c$ can be made arbitrarily small by the choice of τ.

Next we show the good properties at initialization with the help of randomization and concentration. Then we show that such properties still hold after small perturbations. Finally, we prove that the perturbation is indeed small for gradient descent updates with an appropriate step size.

4.1 Critical Lemmas at Initialization

The main idea is to establish forward and backward stability at initialization, i.e., the norms and distances of representations are preserved even after many layers' mappings.

We first bound how the norm changes after layers’ mapping.

Lemma 4.

With high probability over the randomness of $A$ and $\{W_l\}_{l \in [L]}$, we have

  $\forall i \in [n],\ l \in \{0, 1, \dots, L\}:\quad \|h_{i,l}\| \in [1 - c,\, 1 + c],$   (15)

where $c$ can be made arbitrarily small by the choice of τ and a sufficiently large $m$.

We note that Lemma 4 achieves a stronger result than the corresponding argument in Allen-Zhu et al. (2018b), which cannot guarantee that $\|h_{i,l}\|$ is arbitrarily close to 1. This property is required for downstream bounding tasks; for example, both the gradient lower bound (Theorem 4) and the $\delta$-separateness (Lemma 6) require it.

Proof.

With property (14), we can derive the upper bound $\|h_{i,l}\| \le 1 + c$ for every $i$ and $l$. The lower bound on $\|h_{i,l}\|$ is argued as follows for a fixed input $x_i$.

Note that each coordinate of $h_{i,0}$ follows i.i.d. a distribution which is 0 with probability $1/2$ and Gaussian $\mathcal{N}(0, 2/m)$ with probability $1/2$ (Allen-Zhu et al., 2018b, Fact 4.2). Therefore, with high probability, $\|h_{i,0}\|$ is close to 1.

By a union bound, the corresponding event holds for all input samples and all layers $l$ with high probability. Conditioned on the above event, the claimed bound follows from the choice of τ.

Moreover, since each coordinate $(g_0)_k$ is Gaussian with probability $1/2$ and 0 with probability $1/2$, we have

  $P\{(h_0)_k \ge \xi\} \ge \frac{1}{2} - \frac{\xi\sqrt{m}}{2\sqrt{\pi}}, \quad \text{and} \quad P\{0 < (h_0)_k < \xi\} \le \frac{\xi\sqrt{m}}{2\sqrt{\pi}}.$

Let $\xi$ be a suitable threshold on the order of $1/\sqrt{m}$; then a constant fraction of the coordinates of $h_0$ exceed $\xi$ with high probability, and a standard concentration argument yields the claimed lower bound on $\|h_{i,0}\|$. We note that the above constants 1.1, 0.9 and 0.98 can be made arbitrarily close to 1 by choosing $\xi$ appropriately and $m$ sufficiently large. ∎
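The conclusion of Lemma 4, and the necessity of a small τ noted in Section 2, can both be observed numerically. The sketch below pushes a unit-norm vector through many residual layers $h \leftarrow \phi(h + \tau W h)$ with the Gaussian initialization of Section 2 (the sizes and the two values of τ are hypothetical): with τ small relative to $1/L$ the norm stays near 1, while with a large τ it explodes.

```python
import numpy as np

def final_norm(L, m, tau, seed=0):
    """Norm of the representation after L residual layers
    h <- relu(h + tau * W h), starting from a unit-norm vector."""
    rng = np.random.default_rng(seed)
    h = np.ones(m) / np.sqrt(m)                      # unit-norm start
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
        h = np.maximum(h + tau * (W @ h), 0.0)
    return np.linalg.norm(h)

stable   = final_norm(L=100, m=128, tau=0.001)   # tau << 1/L: norm stays ~ 1
unstable = final_norm(L=100, m=128, tau=1.0)     # large tau: norm explodes
```

This matches the intuition that each layer multiplies the norm by roughly $1 + O(\tau)$, so the product over $L$ layers stays bounded only when τ shrinks with the depth.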

We next bound how the network mapping acts on sparse vectors.

Lemma 5.

Suppose the width $m$ is sufficiently large. Then for all $i \in [n]$ and all $a$, for all $s$-sparse vectors $u$, and for all vectors $v$,

  $\big|v^T B\, D_{i,L} W_L\, D_{i,L-1}(I + \tau W_{L-1}) \cdots D_{i,a}(I + \tau W_a)\, u\big| \le O\Big(\frac{\sqrt{s \log m}}{\sqrt{d}}\,\|u\|\,\|v\|\Big)$   (16)

holds with high probability.

Proof.

For any fixed vector $u$, the bound $\|D_{i,L} W_L\, D_{i,L-1}(I + \tau W_{L-1}) \cdots D_{i,a}(I + \tau W_a)\, u\| \le O(\|u\|)$ holds with high probability (over the randomness of the weight matrices).

Conditioned on the above event, for a fixed vector $v$ and any fixed $u$, the remaining randomness comes only from $B$; the quantity $v^T B\, D_{i,L} W_L\, D_{i,L-1}(I + \tau W_{L-1}) \cdots D_{i,a}(I + \tau W_a)\, u$ is then a Gaussian variable with mean 0 and variance no larger than $O(\|u\|^2 \|v\|^2 / d)$. Hence

  $P\Big\{\big|v^T B\, D_{i,L} W_L\, D_{i,L-1}(I + \tau W_{L-1}) \cdots D_{i,a}(I + \tau W_a)\, u\big| \ge \sqrt{s \log m} \cdot \Omega\big(\|u\|\|v\|/\sqrt{d}\big)\Big\} = \mathrm{erfc}\big(\Omega(\sqrt{s \log m})\big) \le \exp\big(-\Omega(s \log m)\big).$

Taking an ε-net over all $s$-sparse vectors $u$ and all $d$-dimensional vectors $v$, if $m$ is sufficiently large, then with high probability the claim holds simultaneously for all $s$-sparse vectors $u$ and all vectors $v$. ∎

We next give a bound on the distance between the representations $h_{i,l}$ and $h_{j,l}$ at each layer for two input vectors $x_i$ and $x_j$ with $\|x_i - x_j\| \ge \delta$. In comparison with the similar result in Allen-Zhu et al. (2018b), our distance bound does not depend on the depth $L$.

Lemma 6.

For any pair $(i, j)$ satisfying $\|x_i - x_j\| \ge \delta$, with high probability,

  $\|h_{i,l} - h_{j,l}\| \ge \frac{\delta}{2},$

holds for all layers $l$.

Proof.

The full proof is relegated to Appendix A. ∎

4.2 Critical Lemmas after Perturbation

Next we establish the forward stability after perturbation. We use $W_l^{(0)}$ to denote the weight matrices at initialization and $W_l'$ to denote the perturbation matrices, and let $W_l = W_l^{(0)} + W_l'$. Similarly, we define the representations $h_{i,l}$ and the sign matrices $D_{i,l}$ for the perturbed network, and denote their changes from initialization by $h_{i,l}' := h_{i,l} - h_{i,l}^{(0)}$ and $D_{i,l}' := D_{i,l} - D_{i,l}^{(0)}$.

Lemma 7.

Suppose $\|W_l'\|_2 \le \tau\omega$ for $l \in [L-1]$, and $\|W_L'\|_2 \le \omega$. Then with high probability, the following bounds on $h_{i,l}'$ and $D_{i,l}'$ hold for all $i \in [n]$ and all $l \in [L-1]$:

  $\|h'_{i,l}\| \le O(\tau\omega), \quad \|D'_{i,l}\|_0 \le m\omega^{2/3},$   (17)
  $\|h'_{i,L}\| \le O(\omega), \quad \|D'_{i,L}\|_0 \le m\omega^{2/3}.$   (18)
Proof.

The proof is relegated to Appendix B. ∎

Lemma 8.

With high probability over the randomness of the initialization, for every $i \in [n]$, every collection of diagonal matrices $D''_{i,l}$ such that $\|D''_{i,l}\|_0 \le m\omega^{2/3}$ for all $l$, and every perturbation matrices $W'_l$ with $\|W'_l\|_2 \le \tau\omega$, we have

  $\big\|(I + \tau W_b^{(0)})\,(D_{i,b-1}^{(0)} + D''_{i,b-1}) \cdots (D_{i,a}^{(0)} + D''_{i,a})\,(I + \tau W_a^{(0)})\big\|_2 \le O(1),$   (19)
  $\big\|(I + \tau W_b^{(0)} + \tau W'_b)\,(D_{i,b-1}^{(0)} + D''_{i,b-1}) \cdots (D_{i,a}^{(0)} + D''_{i,a})\,(I + \tau W_a^{(0)} + \tau W'_a)\big\|_2 \le O(1).$   (20)
Proof.

This follows directly from the argument in the proof of Lemma 3. ∎

We note that the spectral norm bound in the above lemma no longer depends on the depth $L$, in sharp contrast with the feedforward case.

4.3 Proofs of Theorems

Proof of Theorem 4 (Gradient Lower Bound)

Because the gradient is pathological and data-dependent, in order to bound the gradient we need to consider all possible points and all cases of data. Hence we first introduce an arbitrary loss vector, and then obtain the gradient bound by taking a union bound.

Definition 1 (Definition 6.1 in Allen-Zhu et al. (2018b)).

For any vector tuple $\vec{v} = (v_1, \dots, v_n)$ (viewed as a fake loss vector), we define

  $\hat{\nabla}^{\vec{v}}_{W_L} F_i(\vec{W}) := D_{i,L}(B^T v_i)\, h_{i,L-1}^T,$
  $\hat{\nabla}^{\vec{v}}_{W_l} F_i(\vec{W}) := \tau D_{i,l}(\mathrm{Back}_{i,l+1}^T v_i)\, h_{i,l-1}^T, \quad \forall l \in [L-1],$
  $\hat{\nabla}^{\vec{v}}_{W_l} F(\vec{W}) := \sum_{i=1}^{n} \hat{\nabla}^{\vec{v}}_{W_l} F_i(\vec{W}), \quad \forall l \in [L].$
Proof.

The gradient lower bound at initialization is given in (Allen-Zhu et al., 2018b, Section 6.2) via smoothed analysis (Spielman and Teng, 2004): with high probability the gradient is lower bounded, although in the worst case it might be 0. The proof is the same given the two preconditioned results, Lemma 4 and Lemma 6; we shall not repeat it here.

Now suppose that we have