# A Convergence Theory for Deep Learning via Over-Parameterization

Deep neural networks (DNNs) have demonstrated dominating performance in many fields, e.g., computer vision, natural language processing, and robotics. Since AlexNet, the neural networks used in practice have been growing wider and deeper. On the theoretical side, a long line of work has focused on why we can train neural networks when there is only one hidden layer. The theory of multi-layer neural networks remains somewhat unsettled. We present a new theory to understand the convergence of training DNNs. We only make two assumptions: the inputs do not degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in $n$, the number of training samples, and in $L$, the number of layers. We show that on the training dataset, starting from randomly initialized weights, simple algorithms such as stochastic gradient descent attain 100% accuracy in classification tasks, or minimize the $\ell_2$ regression loss at a linear convergence rate, with a number of iterations that scales only polynomially in $n$ and $L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss function. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).


## 1 Introduction

Neural networks have demonstrated great success in numerous machine-learning tasks [33, 25, 36, 6, 30, 46, 47]. One of the empirical findings is that neural networks, trained by first-order methods from random initialization, have a remarkable ability to fit training data [57].

From a capacity perspective, the ability to fit training data may not be surprising: modern neural networks are always heavily over-parameterized, that is, they have (much) more parameters than the total number of training samples. Thus, in theory, there always exist parameter choices that achieve zero training error as long as the data do not degenerate.

Yet, from an optimization perspective, the fact that randomly initialized first-order methods can find such an optimal solution on the training data is quite non-trivial: neural networks used in practice are often equipped with the ReLU activation function, which makes the training objective not only non-convex, but even non-smooth. Even the general convergence for finding approximate first- and second-order critical points of a non-convex, non-smooth function is not fully understood [12], and appears to be a challenging question on its own. This is in direct contrast to practice, in which ReLU networks trained by stochastic gradient descent (SGD) from random initialization almost never face the problem of non-smoothness or non-convexity, and can quite easily converge even to a global minimum over the training set.

Recently, quite a few papers have tried to understand the success of neural networks from an optimization perspective. Many of them focus on the case when the inputs are random Gaussian, and work only for two-layer neural networks [11, 50, 55, 35, 19, 22, 43, 60, 59].

In Li and Liang [34], it was shown that for a two-layer network with ReLU activation, SGD finds nearly-global optimal (say, 99% classification accuracy) solutions on the training data, as long as the network is over-parameterized, meaning that the number of neurons is polynomially large compared to the input size. Moreover, if the data is sufficiently structured (say, coming from mixtures of separable distributions), this accuracy can be extended to test data as well. As a separate note, over-parameterization is suggested by Safran and Shamir [44] as the possible key to avoiding bad local minima, even for two-layer neural networks.

There are also results that go beyond two-layer neural networks but with limitations. Some consider deep linear neural networks without any activation functions [26, 7, 9, 31]. The result of Daniely [14] applies to multi-layer neural networks with ReLU activation, but is about the convex training process only with respect to the last layer. Daniely worked in a parameter regime where the weight changes of all layers except the last one make a negligible contribution to the final output (and they form the so-called conjugate kernel). The result of Soudry and Carmon [52] shows that under over-parameterization and under random input perturbation, there are no bad local minima for multi-layer neural networks. Their work did not show any provable convergence rate.

In this paper, we study the following fundamental question:

Can DNN be trained close to zero training error efficiently under mild assumptions?

If so, can the running time depend only polynomially in the number of layers?

Motivation.   In 2012, AlexNet [33] was born with $5$ convolutional layers. Since then, the common trend in the deep learning community has been to build network architectures that go deeper. In 2014, Simonyan and Zisserman [49] proposed a VGG network with $16$ layers. Later, Szegedy et al. [54] proposed GoogleNet with $22$ layers. In practice, we cannot make the network deeper by naively stacking layers together due to the so-called vanishing / exploding gradient issues. For this reason, in 2015, He et al. [30] proposed an ingenious deep network structure called Deep Residual Network (ResNet), with the capability of handling at least $152$ layers. For more overview and variants of ResNet, we refer the readers to [21].

Compared to the practical neural networks that go much deeper, the existing theory has been mostly around two-layer (thus one-hidden-layer) networks even just for the training process alone. It is natural to ask if we can theoretically understand how the training process has worked for multi-layer neural networks.

### 1.1 Our Result

In this paper, we extend the over-parameterization theory to multi-layer neural networks. We show that over-parameterized neural networks can indeed be trained by regular first-order methods to global minima (e.g. zero training error), as long as the dataset is non-degenerate. We say that the dataset is non-degenerate if the data points are distinct. This is a minimal requirement since a dataset with the same input and different labels cannot be trained to zero error. We denote by $\delta$ the minimum (relative) distance between two training data points, and by $n$ the number of samples in the training dataset.

Now, consider an $L$-layer fully-connected feedforward neural network, each layer consisting of $m$ neurons equipped with ReLU activation. We show that,

• As long as $m \geq \mathrm{poly}(n, L, \delta^{-1})$, starting from random Gaussian initialized weights, gradient descent (GD) and stochastic gradient descent (SGD) find an $\varepsilon$-error global minimum in $\ell_2$ regression using at most $T = \mathrm{poly}(n, L, \delta^{-1}) \cdot \log \frac{1}{\varepsilon}$ iterations. This is a linear convergence rate.

• Using the same network, if the task is multi-label classification, then GD and SGD find a $100\%$ accuracy classifier on the training set in $T = \mathrm{poly}(n, L, \delta^{-1})$ iterations.

• Our result also applies to other Lipschitz-smooth loss functions, and some other network architectures including convolutional neural networks (CNNs) and residual networks (ResNet).

###### Remark.

This paper does not cover the generalization of over-parameterized neural networks to the test data. We refer interested readers to some practical evidence [53, 56] that deeper (and wider) neural networks actually generalize better. As for theoretical results, over-parameterized neural networks provably generalize at least for two-layer networks [34, 4] and three-layer networks [4]. (Footnote 1: If data is “well-structured”, two-layer over-parameterized neural networks can learn it using SGD with polynomially many samples [34]. If data is produced by some unknown two-layer (resp. three-layer) neural network, then two-layer (resp. three-layer) neural networks can also provably learn it using SGD and polynomially many samples [4].)

A concurrent but different result.   We acknowledge a concurrent work of Du et al. [18] which has a similar abstract to this paper, but differs from ours in many aspects. Since we noticed many readers cannot tell the two results apart, we compare them carefully below. Du et al. [18] have two main results, one for fully-connected networks and the other for residual networks (ResNet).

For fully-connected networks, their provable training time is exponential in the number of layers, leading to a claim of the form “ResNet has an advantage because ResNet is polynomial-time but fully-connected network is exponential-time.” As we prove in this paper, fully-connected networks do have polynomial training time, so the logic behind their claim is wrong.

For residual networks, their training time scales polynomially in , a parameter that depends on the minimal singular value of a complicated, -times recursively-defined kernel matrix. It is not clear whether is small in practice. From the representation of our paper, it seems their could again be exponential in the number of layers. The authors of [18] have hidden this large factor from their stated complexity in the introduction.

Their result differs from ours in many other aspects. Their result only applies to smooth activation functions and thus cannot apply to the state-of-the-art ReLU activation. Their ResNet requires the value of weight initialization to be a function polynomial in (which is our ); this can heavily depend on the input data. Their result only applies to gradient descent but not to SGD. Their result only applies to the $\ell_2$ loss but not others.

### 1.2 Other Related Works

Li and Liang [34] originally proved their result for the cross-entropy loss. Later, the “training accuracy” (not the testing accuracy) part of [34] was extended to the loss [20]. The result of [20] claims to have adopted a learning rate times larger than [34], but that is unfair because they have re-scaled the network by a factor of . (Footnote 2: Indeed, if one replaces any function with then the gradient decreases by a factor of and the needed movement in increases by a factor of . Thus, you can equivalently increase the learning rate by a factor of .)

Linear networks without activation functions are an important subject in their own right. Besides the already cited references [26, 7, 9, 31], there are a number of works that study linear dynamical systems, which can be viewed as the linear version of recurrent neural networks or reinforcement learning. Recent works in this line of research include [27, 28, 29, 16, 42, 1, 48, 39, 17, 8].

There is a sequence of works on one-hidden-layer (multiple neurons) CNNs [11, 59, 19, 24, 41]. Whether the patches overlap or not plays a crucial role in analyzing algorithms for such CNNs. One category of results requires the patches to be disjoint [11, 59, 19]. The other category [24, 41] works under a weaker assumption or even removes the patch-disjoint assumption. On input data distribution, most relied on inputs being Gaussian [11, 59, 19, 41], and some assumed inputs to be symmetrically distributed with identity covariance and boundedness [24].

As for ResNet, Li and Yuan [35] proved that SGD learns one-hidden-layer residual neural networks under a Gaussian input assumption. The techniques in [60, 59] can also be generalized to one-hidden-layer ResNet under the Gaussian input assumption; they can show that GD starting from a good initialization point (via tensor initialization) learns ResNet. Hardt and Ma [26] proved that deep linear residual networks have no spurious local optima.

If no assumption is allowed, learning neural networks has been shown to be hard from several different perspectives. Thirty years ago, Blum and Rivest [10] first proved that learning the neural network is NP-complete. Stronger hardness results have been proved over the last decade [32, 37, 13, 15, 23, 51, 38].

An Over-Parameterized RNN Theory.   For experts in DNN theory, one may view this present paper as a deeply-simplified version of the recurrent neural network (RNN) paper [5] by the same set of authors. A recurrent neural network executed on input sequences with time horizon $L$ is very similar to a feedforward neural network with $L$ layers. The main difference between the two is that in feedforward neural networks, the weight matrices are different across layers, and thus independently randomly initialized; in contrast, in RNN, the same weight matrix is applied across the entire time horizon, so we do not have fresh new randomness for proofs that involve induction.

So, the over-parameterized convergence theory of DNN is much simpler than that of RNN.

We write this DNN result as a separate paper because: (1) not all the readers can easily notice that DNN is easier to study than RNN; (2) we believe the convergence of DNN is important on its own; (3) the proof in this paper is much simpler (30 vs 80 pages) and could reach out to a wider audience; (4) the simplicity of this paper allows us to tighten parameters in some non-trivial ways; and (5) the simplicity of this paper allows us to also study convolutional networks, residual networks, as well as different loss functions (all of them were missing from [5]).

We also note that the techniques of this paper can be combined with [5] to show the convergence of over-parameterized deep RNN.

## 2 Preliminaries

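We use $\mathcal{N}(\mu, \sigma)$ to denote the Gaussian distribution of mean $\mu$ and variance $\sigma$; and $\mathcal{B}(m, \frac{1}{2})$ to denote the binomial distribution with $m$ trials and $\frac{1}{2}$ success rate. We use $\|v\|$ to denote Euclidean norms of vectors $v$, and $\|\mathbf{M}\|_2, \|\mathbf{M}\|_F$ to denote spectral and Frobenius norms of matrices $\mathbf{M}$. For a tuple $\vec{\mathbf{W}} = (\mathbf{W}_1, \dots, \mathbf{W}_L)$ of matrices, we let $\|\vec{\mathbf{W}}\|_2 = \max_{\ell \in [L]} \|\mathbf{W}_\ell\|_2$ and $\|\vec{\mathbf{W}}\|_F = \big(\sum_{\ell=1}^{L} \|\mathbf{W}_\ell\|_F^2\big)^{1/2}$.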

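We use $\phi(x) = \max\{x, 0\}$ to denote the ReLU function, and extend it to vectors $v \in \mathbb{R}^m$ by letting $\phi(v) = (\phi(v_1), \dots, \phi(v_m))$. We use $\mathbb{1}_{event}$ to denote the indicator function for $event$.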

The training data consist of vector pairs $\{(x_i, y_i^*)\}_{i \in [n]}$, where each $x_i$ is the feature vector and $y_i^*$ is the label of the $i$-th training sample. We assume without loss of generality that data are normalized so that $\|x_i\| = 1$ and its last coordinate . (Footnote 3: Without loss of generality, one can re-scale and assume for every . Again, without loss of generality, one can pad each by an additional coordinate to ensure . Finally, without loss of generality, one can pad each by an additional coordinate to ensure . This last coordinate is equivalent to introducing a (random) bias term, because where . In our proofs, the specific constant does not matter.) We make the following separable assumption on the training data (motivated by [34]):

###### Assumption 2.1.

For every pair of distinct $i, j \in [n]$, we have $\|x_i - x_j\| \geq \delta$.

To present the simplest possible proof, the main body of this paper only focuses on depth-$L$ feedforward fully-connected neural networks with an $\ell_2$-regression task. Therefore, each label is a target vector for the regression task. We explain how to extend it to more general settings in Section 3.3 and the Appendix. For notational simplicity, we assume all the hidden layers have the same number of neurons, and our results trivially generalize to each layer having a different number of neurons. Specifically, we focus on the following network

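$$
\begin{aligned}
g_{i,0} &= \mathbf{A} x_i, & h_{i,0} &= \phi(\mathbf{A} x_i) && \text{for } i \in [n] \\
g_{i,\ell} &= \mathbf{W}_\ell h_{i,\ell-1}, & h_{i,\ell} &= \phi(\mathbf{W}_\ell h_{i,\ell-1}) && \text{for } i \in [n] \text{ and } \ell \in [L] \\
y_i &= \mathbf{B} h_{i,L} & & && \text{for } i \in [n]
\end{aligned}
$$

where $\mathbf{A}$ is the weight matrix for the input layer, $\mathbf{W}_\ell$ is the weight matrix for the $\ell$-th hidden layer, and $\mathbf{B}$ is the weight matrix for the output layer. For notational convenience in the proofs, we may also use $h_{i,-1}$ to denote $x_i$ and $\mathbf{W}_0$ to denote $\mathbf{A}$.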

###### Definition 2.2 (diagonal sign matrix).

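For each $i \in [n]$ and $\ell \in \{0, 1, \dots, L\}$, we denote by $\mathbf{D}_{i,\ell}$ the diagonal sign matrix where $(\mathbf{D}_{i,\ell})_{k,k} = \mathbb{1}_{(\mathbf{W}_\ell h_{i,\ell-1})_k \geq 0}$ for each $k \in [m]$.

As a result, we have $h_{i,\ell} = \mathbf{D}_{i,\ell} \mathbf{W}_\ell h_{i,\ell-1} = \mathbf{D}_{i,\ell}\, g_{i,\ell}$ and $(\mathbf{D}_{i,\ell})_{k,k} = \mathbb{1}_{(g_{i,\ell})_k \geq 0}$.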

We make the following standard choices of random initialization:

###### Definition 2.3.

We say that $\vec{\mathbf{W}} = (\mathbf{W}_1, \dots, \mathbf{W}_L)$, $\mathbf{A}$ and $\mathbf{B}$ are at random initialization if

• for every and ;

• for every ; and

• for every .
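To make the setup concrete, here is a minimal NumPy sketch of the network above together with a Gaussian random initialization. It is an illustration under stated assumptions, not the exact construction used in the proofs: the variances ($2/m$ for the entries of $\mathbf{A}$ and $\mathbf{W}_\ell$, and $1/d$ for $\mathbf{B}$) are a He-style choice assumed here, and the names `random_init`, `forward`, `d_in` and `d_out` are hypothetical.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def random_init(m, d_in, d_out, L, rng):
    # ASSUMPTION: He-style variances (2/m for A and each W_l, 1/d_out for B).
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d_in))
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]
    B = rng.normal(0.0, np.sqrt(1.0 / d_out), size=(d_out, m))
    return A, Ws, B

def forward(A, Ws, B, x):
    """Return the hidden vectors h_0, ..., h_L and the output y = B h_L."""
    hs = [relu(A @ x)]
    for W in Ws:
        hs.append(relu(W @ hs[-1]))
    return hs, B @ hs[-1]

rng = np.random.default_rng(0)
m, d_in, d_out, L = 1000, 10, 3, 8
A, Ws, B = random_init(m, d_in, d_out, L, rng)
x = rng.normal(size=d_in)
x /= np.linalg.norm(x)                      # normalized input, ||x|| = 1
hs, y = forward(A, Ws, B, x)
norms = [float(np.linalg.norm(h)) for h in hs]
print([round(v, 3) for v in norms])         # norms of h_0, ..., h_L
```

Under this assumed scaling, the printed norms should stay close to 1 across layers, which is the behaviour that the forward-propagation analysis of Section 4 formalizes.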

###### Assumption 2.4.

Throughout this paper we assume . To present the simplest proof, we did not try hard to improve such polynomial factors.

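Our regression objective is

$$
F(\vec{\mathbf{W}}) := \sum_{i=1}^{n} F_i(\vec{\mathbf{W}}) \quad \text{where} \quad F_i(\vec{\mathbf{W}}) := \tfrac{1}{2} \|\mathbf{B} h_{i,L} - y_i^*\|^2 \quad \text{for each } i \in [n],
$$

with $y_i^*$ the label of sample $i$. We also denote by $\mathrm{loss}_i := \mathbf{B} h_{i,L} - y_i^*$ the loss vector for sample $i$. For simplicity, we only focus on training $\vec{\mathbf{W}}$ in this paper and thus leave $\mathbf{A}$ and $\mathbf{B}$ at random initialization. Our techniques can be extended to the case when $\mathbf{A}$, $\mathbf{B}$ and $\vec{\mathbf{W}}$ are jointly trained.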

###### Definition 2.5.

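For each $\ell \in \{1, 2, \dots, L\}$, we define $\mathrm{Back}_{i,\ell} := \mathbf{B} \mathbf{D}_{i,L} \mathbf{W}_L \cdots \mathbf{D}_{i,\ell} \mathbf{W}_\ell$ and for $\ell = L+1$, we define $\mathrm{Back}_{i,\ell} = \mathbf{B}$.

Using this notation, one can calculate the gradient of $F(\vec{\mathbf{W}})$ as follows.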

###### Fact 2.6.

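The gradient with respect to the $k$-th row of $\mathbf{W}_\ell$ is

$$
\nabla_{[\mathbf{W}_\ell]_k} F(\vec{\mathbf{W}}) = \sum_{i=1}^{n} \big(\mathrm{Back}_{i,\ell+1}^\top \mathrm{loss}_i\big)_k \cdot h_{i,\ell-1} \cdot \mathbb{1}_{\langle [\mathbf{W}_\ell]_k,\, h_{i,\ell-1} \rangle \geq 0}
$$

The gradient with respect to $\mathbf{W}_\ell$ is

$$
\nabla_{\mathbf{W}_\ell} F(\vec{\mathbf{W}}) = \sum_{i=1}^{n} \mathbf{D}_{i,\ell} \big(\mathrm{Back}_{i,\ell+1}^\top \mathrm{loss}_i\big) h_{i,\ell-1}^\top
$$

We denote by $\nabla F(\vec{\mathbf{W}}) = \big(\nabla_{\mathbf{W}_1} F(\vec{\mathbf{W}}), \dots, \nabla_{\mathbf{W}_L} F(\vec{\mathbf{W}})\big)$.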
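The matrix form of the gradient in Fact 2.6 can be checked numerically. The sketch below is only illustrative and rests on two assumptions that are not spelled out in the surviving text: the objective is taken to be $\frac{1}{2}\sum_i \|\mathbf{B} h_{i,L} - y_i\|^2$, and the backward matrix is taken to be $\mathbf{B}\mathbf{D}_{i,L}\mathbf{W}_L \cdots \mathbf{D}_{i,\ell+1}\mathbf{W}_{\ell+1}$ (which is what the chain rule gives). The helper names `objective`, `gradients` and `finite_diff` are hypothetical. The formula is compared against central finite differences on a small random network.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def objective(A, Ws, B, X, Y):
    # ASSUMPTION: F(W) = 1/2 * sum_i ||B h_{i,L} - y_i||^2
    total = 0.0
    for x, y in zip(X, Y):
        h = relu(A @ x)
        for W in Ws:
            h = relu(W @ h)
        total += 0.5 * np.sum((B @ h - y) ** 2)
    return total

def gradients(A, Ws, B, X, Y):
    """Gradient w.r.t. each W_l via sum_i D_{i,l} (Back_{i,l+1}^T loss_i) h_{i,l-1}^T."""
    grads = [np.zeros_like(W) for W in Ws]
    for x, y in zip(X, Y):
        hs, signs = [relu(A @ x)], []          # hs[l] = h_{i,l}; signs[l-1] = diag of D_{i,l}
        for W in Ws:
            g = W @ hs[-1]
            signs.append((g >= 0).astype(float))
            hs.append(signs[-1] * g)
        loss = B @ hs[-1] - y
        back = B.copy()                         # Back_{i,L+1} = B
        for l in range(len(Ws), 0, -1):
            grads[l - 1] += np.outer(signs[l - 1] * (back.T @ loss), hs[l - 1])
            back = back @ (signs[l - 1][:, None] * Ws[l - 1])   # now Back_{i,l}
    return grads

rng = np.random.default_rng(1)
m, d_in, d_out, L, n = 20, 5, 2, 3, 4
A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d_in))
Ws = [rng.normal(0.0, np.sqrt(2.0 / m), (m, m)) for _ in range(L)]
B = rng.normal(0.0, np.sqrt(1.0 / d_out), (d_out, m))
X, Y = rng.normal(size=(n, d_in)), rng.normal(size=(n, d_out))
grads = gradients(A, Ws, B, X, Y)

def finite_diff(l, r, c, eps=1e-6):
    Wp = [W.copy() for W in Ws]; Wp[l][r, c] += eps
    Wm = [W.copy() for W in Ws]; Wm[l][r, c] -= eps
    return (objective(A, Wp, B, X, Y) - objective(A, Wm, B, X, Y)) / (2 * eps)

max_err = max(abs(finite_diff(l, r, c) - grads[l][r, c])
              for l in range(L) for r in range(0, m, 5) for c in range(0, m, 5))
print(max_err)
```

Because the ReLU is non-differentiable only on a measure-zero set, the finite-difference comparison is meaningful at a generic random point; it is not a substitute for the formal statement.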

## 3 Our Results and Techniques

To present our result in the simplest possible way, we choose to mainly focus on fully-connected $L$-layer neural networks with the $\ell_2$ regression loss. We shall extend it to more general settings (such as convolutional and residual networks and other losses) in Section 3.3. Our main results can be stated as follows:

###### Theorem 1 (GD).

Suppose . Starting from random initialization, with probability at least , gradient descent with learning rate finds a point in iterations.

This is known as the linear convergence rate because drops exponentially fast in . We have not tried to improve the polynomial factors in and , and are aware of several ways to improve these factors (but at the expense of complicating the proof). We note that is the data input dimension and our result is independent of .

###### Theorem 2 (SGD).

Suppose and . Starting from random initialization, with probability at least , SGD with learning rate and mini-batch size finds in iterations.

This is a nearly-linear convergence rate because . The reason for the additional factor is because we have a high confidence bound.

###### Remark.

For experts in optimization theory, one may immediately question the accuracy of Theorem 2, because SGD is known to converge at a slower rate even for convex functions. There is no contradiction here. Imagine a strongly convex function $f(x) = \sum_{i=1}^{n} f_i(x)$ that has a common minimizer $x^* \in \arg\min_x f_i(x)$ for every $i \in [n]$; then SGD is known to converge at a linear rate.
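The remark can be illustrated on a toy problem. The following sketch is our own example, not taken from the paper: it runs SGD on a consistent least-squares system, where every component $f_i(x) = \frac{1}{2}(\langle a_i, x\rangle - b_i)^2$ is minimized at the same point `x_star`. With unit-norm rows and step size 1 this coincides with the randomized Kaczmarz iteration, whose error decays geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.normal(size=(n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm rows a_i
x_star = rng.normal(size=d)
b = A @ x_star        # consistent system: every f_i is minimized at x_star

def sgd(steps, lr=1.0):
    x = np.zeros(d)
    errs = []
    for _ in range(steps):
        i = rng.integers(n)                      # sample one component uniformly
        x -= lr * (A[i] @ x - b[i]) * A[i]       # gradient step on f_i
        errs.append(np.linalg.norm(x - x_star))
    return np.array(errs)

errs = sgd(2000)
print(errs[0], errs[500], errs[-1])              # distance to x_star over time
```

The distance to `x_star` shrinks by many orders of magnitude with a constant step size, in contrast to the sublinear rates one gets when the components do not share a minimizer.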

### 3.1 Technical Theorems

The main difficulty of this paper is to prove the following two technical theorems. The first one is about the gradient bounds for points that are sufficiently close to the random initialization:

###### Theorem 3 (no critical point).

With probability over randomness , it satisfies for every , every , and every with ,

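$$
\|\nabla F(\vec{\mathbf{W}})\|_F^2 \leq O\Big(F(\vec{\mathbf{W}}) \times \frac{L n m}{d}\Big) \quad \text{and} \quad \|\nabla F(\vec{\mathbf{W}})\|_F^2 \geq \Omega\Big(F(\vec{\mathbf{W}}) \times \frac{\delta m}{d n^2}\Big).
$$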

Most notably, the second property of Theorem 3 says that as long as the objective is large, the gradient norm is also large. (See also Figure 1(a).) This means, when we are sufficiently close to the random initialization, there is no saddle point or critical point of any order. This gives us hope to find global minima of the objective .

Unfortunately, Theorem 3 itself is not enough. Even if we follow the negative gradient direction of $F(\vec{\mathbf{W}})$, how can we guarantee that the objective truly decreases? Classically in optimization theory, one relies on the smoothness property (e.g. Lipschitz smoothness [40]) to derive such an objective-decrease guarantee. Unfortunately, the smoothness property at least requires the objective to be twice differentiable, but the ReLU activation is not.

To deal with this issue, we prove the following “semi-smoothness” property of the objective.

###### Theorem 4 (semi-smoothness).

With probability at least over the randomness of , we have for every with , and for every with ,

$$
F(\breve{\vec{\mathbf{W}}} + \vec{\mathbf{W}}') \leq F(\breve{\vec{\mathbf{W}}}) + \langle \nabla F(\breve{\vec{\mathbf{W}}}), \vec{\mathbf{W}}' \rangle + \frac{\mathrm{poly}(L) \sqrt{n m \log m}}{\sqrt{d}} \cdot \|\vec{\mathbf{W}}'\|_2 \big(F(\breve{\vec{\mathbf{W}}})\big)^{1/2} + O\Big(\frac{n L^2 m}{d}\Big) \|\vec{\mathbf{W}}'\|_2^2
$$

Quite different from classical smoothness, we still have a first-order term in $\|\vec{\mathbf{W}}'\|_2$ on the right hand side, whereas classical smoothness only has a second-order term in $\|\vec{\mathbf{W}}'\|_2^2$. As one can see in our final proofs, as $m$ grows larger (so when we over-parameterize), the effect of the first-order term becomes smaller and smaller compared to the second-order term. This brings Theorem 4 closer and closer, but still not identical, to the classical Lipschitz smoothness.

The derivation of our main Theorems 1 and 2 from the technical Theorems 3 and 4 is quite straightforward, and can be found in Sections 9 and 10.

###### Remark.

In our proofs, we show that GD and SGD can converge fast enough and thus the weights stay close to random initialization, by a seemingly small spectral norm bound . In fact this bound is large enough to totally change the outputs and fit the training data, because weights are randomly initialized (per entry) at around for being large.

In practice, we acknowledge that one often goes beyond this theory-predicted spectral-norm boundary. However, quite interestingly, we still observe that Theorems 3 and 4 hold in practice, at least for vision tasks. In Figure 1(a), we show the typical landscape near a point on the SGD training trajectory. The gradient is sufficiently large and going in its direction can indeed decrease the objective; in contrast, though the objective is non-convex, the negative curvature of its “Hessian” is not significant compared to the gradient. From Figure 1(a) we also see that the objective function is sufficiently smooth (at least in the two dimensions of interest that we plot).

### 3.2 Main Techniques

Our proofs of Theorems 3 and 4 mostly consist of the following steps.

Step 1: properties at random initialization.   Let be at random initialization and and be defined with respect to . We first show that forward propagation neither explodes nor vanishes. That is, $\|h_{i,\ell}\| \approx 1$ for all $i \in [n]$ and $\ell \in [L]$. This is basically because for a fixed , we have is around , and if its signs are sufficiently random, then ReLU activation kills half of the norm, that is . Then applying induction finishes the proof.

Analyzing forward propagation is not enough. We also need spectral norm bounds on the backward matrix , and on the intermediate matrix for every . Note that if one naively bounds the spectral norm by induction, then and it will exponentially blow up! Our more careful analysis ensures that even when layers are stacked together, there is no exponential blow up in .

The final lemma in this step proves that, as long as , then for each layer it also satisfies . This can be proved by a careful induction. Details are in Section 4.
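The gap between the naive inductive bound and the actual spectral norm of the intermediate matrix can be observed numerically. The sketch below is an illustration under assumptions: entries of each $\mathbf{W}_\ell$ are drawn from $\mathcal{N}(0, 2/m)$ (an assumed He-style scaling), the sign matrices come from one forward pass on a random unit vector, and the product $\mathbf{D}_L \mathbf{W}_L \cdots \mathbf{D}_1 \mathbf{W}_1$ is used as a stand-in for the intermediate matrices discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, L = 300, 8
# ASSUMPTION: i.i.d. N(0, 2/m) entries for every hidden-layer matrix.
Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]
h = rng.normal(size=m)
h /= np.linalg.norm(h)

prod = np.eye(m)      # accumulates D_l W_l ... D_1 W_1
naive = 1.0           # product of per-layer spectral norms
for W in Ws:
    g = W @ h
    D = (g >= 0).astype(float)             # diagonal of the sign matrix
    h = D * g                              # h_l = D_l W_l h_{l-1}
    prod = (D[:, None] * W) @ prod
    naive *= np.linalg.norm(W, 2)

actual = np.linalg.norm(prod, 2)
print(f"naive bound: {naive:.1f}, actual spectral norm: {actual:.2f}")
```

Each layer has spectral norm well above 1, so the naive bound grows geometrically in $L$, while the spectral norm of the actual product remains moderate.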

Step 2: stability after adversarial perturbation.   We show that for every that is “close” to initialization, meaning for every and for some , the number of sign changes is at most , and the perturbation amount is at most . We emphasize here that may depend on the randomness of , so one cannot use a union bound. We call this “forward stability”, and it is one of the most technical proofs of this paper.

Another main result in this step is to show that the backward matrix does not change by more than in spectral norm. (Recall that in Step 1 we have shown that this matrix is of spectral norm ; thus as long as , this change is somewhat negligible.) Details are in Section 5.
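A rough empirical view of forward stability is given by counting how many ReLU sign patterns flip when every layer is perturbed by a matrix of small spectral norm. The sketch below is only suggestive: it uses a random perturbation rather than an adversarial one (the statement above covers the worst case), assumes $\mathcal{N}(0, 2/m)$ entries at initialization, and the value `omega = 0.05` and the helper `sign_patterns` are hypothetical choices for illustration.

```python
import numpy as np

def sign_patterns(Ws, h0):
    """Return, for each layer, the boolean pattern 1[(W_l h_{l-1})_k >= 0]."""
    h, pats = h0, []
    for W in Ws:
        g = W @ h
        pats.append(g >= 0)
        h = np.where(pats[-1], g, 0.0)
    return pats

rng = np.random.default_rng(0)
m, L, omega = 400, 5, 0.05
Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]
h0 = rng.normal(size=m)
h0 /= np.linalg.norm(h0)

perturbed = []
for W in Ws:
    P = rng.normal(size=(m, m))
    P *= omega / np.linalg.norm(P, 2)      # rescale so the perturbation has spectral norm omega
    perturbed.append(W + P)

before = sign_patterns(Ws, h0)
after = sign_patterns(perturbed, h0)
flip_frac = [float(np.mean(b != a)) for b, a in zip(before, after)]
print([round(f, 4) for f in flip_frac])    # fraction of flipped signs per layer
```

Only a small fraction of coordinates change sign in each layer, and the fraction grows mildly with depth as the perturbations accumulate.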

Step 3: gradient bound.   The hard part of Theorem 3 is to show the gradient lower bound. For this purpose, recall from Fact 2.6 that each term in the gradient can be written as where the backward matrix is applied to a loss vector . To show that this is large, intuitively, one wishes to show and are both vectors with large Euclidean norm. However, the main difficulty is that in calculating the gradient, different samples may form different gradient matrices and, when summed together, they could in principle cancel each other and possibly even form a zero matrix. To deal with this issue, we use from Step 1. In other words, even if the gradient matrix with respect to one sample is fixed, that with respect to other samples still has sufficient randomness so that the final gradient matrix will not be zero. This idea comes from the prior work [34] and helps us prove Theorem 3. (Footnote 4: This is the only technical idea that we borrowed from Li and Liang [34], which is the over-parameterization theory for 2-layer neural networks.) Details in Appendix 6 and 7.

Step 4: smoothness.   In order to prove Theorem 4, one needs to argue, if we are currently at and perturb it by , then how much does the objective change in second and higher order terms. This is different from our stability theory in Step 2, because Step 2 is regarding having a perturbation on ; in contrast, in Theorem 4 we need a (small) perturbation on top of , which may already be a point perturbed from . Nevertheless, we still manage to show that, if is calculated on and is calculated on , then . This is proportional to the small perturbation so, along with other properties to prove, ensures smoothness. This explains Theorem 4 and details are in Section 8.

### 3.3 Notable Extensions

Our Step 1 through Step 4 in Section 3.2 in fact give rise to a general plan for proving the training convergence of any neural network (at least with respect to the ReLU activation). Thus, it is expected that it can be generalized to many other settings. Not only can we have a different number of neurons in each layer, but our theorems can also be extended at least in the following three major directions. (Footnote 5: In principle, each such proof may require a careful rewriting of the main body of this paper. We choose to sketch only the proof difference in order to keep this paper short. If there is sufficient interest from the readers, we can consider adding the full proofs in a future revision of this paper.)

Different loss functions.   There is absolutely no need to restrict our attention only to regression loss. We prove in Appendix A that, for any Lipschitz-smooth loss function :

• If is cross-entropy for multi-label classification, then we achieve training accuracy in at most iterations.

• If is gradient dominant (a.k.a. Polyak-Łojasiewicz) but possibly non-convex, we still have linear convergence. (Footnote 6: Note that the loss function when combined with the neural network together is not gradient dominant. Therefore, one cannot apply classical theory on gradient dominant functions to derive our same result.)

• If is convex, then we have convergence rate .

• If is non-convex, then we have convergence rate for finding . (Footnote 7: Again, this cannot be derived from classical theory of finding approximate saddle points for non-convex functions, because weights with small is a very different (usually much harder) task compared to having small gradient with respect to for the entire composite function .)

Convolutional neural networks (CNN).   There are lots of different ways to design CNN and each of them may require somewhat different proofs. In Appendix B, we study the case when are convolutional while and are fully connected. We assume for notational simplicity that each hidden layer has points each with channels. (In vision tasks, a point is a pixel). In the most general setting, these values and can vary across layers. Our Theorem 5 says that, as long as is polynomially large, GD and SGD find an -error solution for regression in iterations.

Residual neural networks (ResNet).   There are lots of different ways to design ResNet and each of them may require somewhat different proofs. In symbols, between two layers, one may study , , or even . Since the main purpose here is to illustrate the generality of our techniques but not to attack each specific setting, in Appendix C, we choose to consider the simplest residual setting (that was also studied for instance by theoretical work [26]). With appropriately chosen random initialization, our Theorem C shows that one can also have linear convergence rate in the over-parameterized setting.

## 4 Properties at Random Initialization

Throughout this section we assume and are randomly generated according to Def. 2.3. The diagonal sign matrices are also determined according to this random initialization.

### 4.1 Forward Propagation

###### Lemma 4.1 (forward propagation).

If , with probability at least over the randomness of and , we have

$$
\forall i \in [n],\ \ell \in \{0, 1, \dots, L\} \colon \quad \|h_{i,\ell}\| \in [1 - \varepsilon, 1 + \varepsilon].
$$

###### Remark.

Lemma 4.1 is in fact trivial to prove if the allowed failure probability is instead (by applying concentration inequality layer by layer).

Before proving Lemma 4.1 we note a simple mathematical fact:

###### Fact 4.2.

Let be fixed vectors and , be a random matrix with i.i.d. entries , and vector defined as . Then,

• follows i.i.d. from the following distribution: with half probability , and with the other half probability follows from folded Gaussian distributions .

• is in distribution identical to (chi-square distribution of order ) where follows from binomial distribution .

###### Proof of Fact 4.2.

We assume each vector is generated by first generating a Gaussian vector and then setting where the sign is chosen with half-half probability. Now, only depends on , and is in distribution identical to . Next, after the sign is determined, the indicator is with half probability and with another half. Therefore, satisfies the aforementioned distribution. As for , letting be the variable indicating how many indicators are 1, then and . ∎
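The distributional claim of Fact 4.2 can be checked by simulation. The sketch below assumes, as an illustration, that the random matrix has i.i.d. $\mathcal{N}(0, 2/m)$ entries and that the fixed vector has unit norm. Under that assumption about half of the coordinates of $v = \phi(\mathbf{W}h)$ should be zero, $\|v\|^2$ should have mean 1, and $\frac{m}{2}\|v\|^2$ (a chi-square variable whose order is binomial with $m$ trials and success rate $\frac{1}{2}$) should have variance $1.25m$ by the law of total variance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 400, 300
h = rng.normal(size=m)
h /= np.linalg.norm(h)                      # fixed unit vector

zero_fracs, sq_norms = [], []
for _ in range(trials):
    # ASSUMPTION: i.i.d. N(0, 2/m) entries.
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
    v = np.maximum(W @ h, 0.0)              # v = phi(W h)
    zero_fracs.append(np.mean(v == 0.0))
    sq_norms.append(v @ v)

mean_zero_frac = float(np.mean(zero_fracs))
mean_sq_norm = float(np.mean(sq_norms))
var_scaled = float(np.var(np.array(sq_norms) * m / 2))   # variance of (m/2) * ||v||^2
print(mean_zero_frac, mean_sq_norm, var_scaled, 1.25 * m)
```

The variance $1.25m$ splits into $m$ from the chi-square part (the expectation of twice the order) and $m/4$ from the randomness of the binomial order itself.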

###### Proof of Lemma 4.1.

We only prove Lemma 4.1 for a fixed and because we can apply union bound at the end. Below, we drop the subscript for notational convenience, and write and as and respectively.

Letting , we can write

$$
\log \|h_{b-1}\|^2 = \log \|x\|^2 + \sum_{\ell=0}^{b-1} \log \Delta_\ell = \sum_{\ell=0}^{b-1} \log \Delta_\ell .
$$

According to Fact 4.2, fixing any and letting be the only source of randomness, we have where . For such reason, for each , we can write where and . In the analysis below, we condition on the event that ; this happens with probability for each layer . To simplify our notations, if this event does not hold, we set .

Expectation.   One can verify that where is the digamma function. Using the bound of digamma function, we have

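$$
\log \frac{2\omega}{m} - \frac{2}{\omega} \leq \mathbb{E}[\log \Delta_{\ell,\omega} \mid \omega] \leq \log \frac{2\omega}{m} - \frac{1}{\omega} .
$$

Whenever , we can write

$$
\log \frac{2\omega}{m} = \log\Big(1 + \frac{2\omega - m}{m}\Big) \geq \frac{2\omega - m}{m} - \Big(\frac{2\omega - m}{m}\Big)^2
$$

It is easy to verify and . Therefore,

$$
\mathbb{E}_\omega\Big[\log \frac{2\omega}{m}\Big] \geq -\frac{1}{m} - \Pr\big[\omega \notin [0.4m, 0.6m]\big] \cdot \log 2m \geq -\frac{2}{m}
$$

Combining everything together, along with the fact that , we have (when is sufficiently larger than a constant)

$$
-\frac{4}{m} \leq \mathbb{E}[\log \Delta_\ell] \leq 0. \tag{4.1}
$$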
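The digamma-based bounds used in the expectation step can be sanity-checked by Monte Carlo. Since $\frac{m}{2}\Delta_{\ell,\omega}$ is a chi-square variable of order $\omega$, the two-sided bound above is equivalent to $\log \omega - \frac{2}{\omega} \leq \mathbb{E}[\log \chi^2_\omega] \leq \log \omega - \frac{1}{\omega}$. The sketch below estimates the middle quantity for a few small orders; the sample size and the chosen values of $\omega$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 1_000_000
results = {}
for omega in (2, 4, 6):
    # Monte Carlo estimate of E[log chi^2_omega]
    mean_log = float(np.mean(np.log(rng.chisquare(omega, size=samples))))
    lower = float(np.log(omega) - 2.0 / omega)
    upper = float(np.log(omega) - 1.0 / omega)
    results[omega] = (lower, mean_log, upper)
    print(omega, round(lower, 4), round(mean_log, 4), round(upper, 4))
```

Small orders are used because the upper bound becomes very tight as $\omega$ grows, so a sampling-based check would need far more samples to resolve it.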

Subgaussian Tail.   By standard tail bound for chi-square distribution, we know that

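$$
\forall t \in [0, \infty) \colon \quad \Pr\Big[ \big| \tfrac{m}{2} \Delta_{\ell,\omega} - \omega \big| \leq t \,\Big|\, \omega \Big] \geq 1 - 2 e^{-\Omega(t^2/\omega)} - e^{-\Omega(t)} .
$$

Since we only need to focus on , this means

$$
\forall t \in [0, m] \colon \quad \Pr\Big[ \big| \tfrac{m}{2} \Delta_{\ell,\omega} - \omega \big| \leq t \,\Big|\, \omega \geq 0.4m \Big] \geq 1 - O\big(e^{-\Omega(t^2/m)}\big) .
$$

On the other hand, by Chernoff-Hoeffding bound, we also have

$$
\Pr_\omega\Big[ \big| \omega - \tfrac{m}{2} \big| \leq t \Big] \geq 1 - O\big(e^{-\Omega(t^2/m)}\big)
$$

Together, using the definition (or if ), we obtain

$$
\forall t \in [0, m] \colon \quad \Pr\Big[ \big| \tfrac{m}{2} \Delta_\ell - \tfrac{m}{2} \big| \leq t \Big] \geq 1 - O\big(e^{-\Omega(t^2/m)}\big) .
$$

This implies,

$$
\forall t \in \big[0, \tfrac{m}{4}\big] \colon \quad \Pr\Big[ |\log \Delta_\ell| \leq \frac{t}{m} \Big] \geq 1 - O\big(e^{-\Omega(t^2/m)}\big) . \tag{4.2}
$$

Now, let us make another simplification: define if and otherwise. In this way, (4.2) implies that is an -subgaussian random variable.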

Concentration.   Using martingale concentration on subgaussian variables (see for instance [45]), we have for ,

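$$
\Pr\bigg[ \Big| \sum_{\ell=0}^{b-1} \log \hat{\Delta}_\ell - \mathbb{E}[\log \hat{\Delta}_\ell] \Big| > \varepsilon \bigg] \leq O\big(e^{-\Omega(\varepsilon^2 m / L)}\big) .
$$

Since with probability it satisfies for all , combining this with (4.1), we have

$$
\Pr\bigg[ \Big| \sum_{\ell=0}^{b-1} \log \Delta_\ell \Big| > \varepsilon \bigg] \leq O\big(e^{-\Omega(\varepsilon^2 m / L)}\big) .
$$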

In other words, with probability at least . ∎

### 4.2 Intermediate Layers

###### Lemma 4.3 (intermediate layers).

Suppose . With probability at least over the randomness of , for all ,

1. .

2. for all vectors with .

3. for all vectors with .

For any integer with , with probability at least over the randomness of :

1. for all vectors with .

###### Proof.

Again we prove the lemma for fixed and because we can take a union bound at the end. We drop the subscript for notational convenience.

1. Let be any fixed unit vector, and define . According to Fact 4.2 again, fixing any and letting be the only source of randomness, defining , we have that is distributed according to a where . Therefore, we have

$$
\log \|z_{b-1}\|^2 = \log \|z_{a-1}\|^2 + \sum_{\ell=a}^{b-1} \log \Delta_\ell = \sum_{\ell=a}^{b-1} \log \Delta_\ell .
$$

Using exactly the same proof as Lemma 4.1, we have

$$
\|z_{b-1}\|_2 = \|\mathbf{W}_b \mathbf{D}_{b-1} \mathbf{W}_{b-1} \cdots \mathbf{D}_a \mathbf{W}_a z_{a-1}\|_2 \in [1 - 1/3, 1 + 1/3]
$$

with probability at least . As a result, if we fix a subset of cardinality , taking -net, we know that with probability at least , it satisfies

$$
\|\mathbf{W}_b \mathbf{D}_{b-1} \mathbf{W}_{b-1} \cdots \mathbf{D}_a \mathbf{W}_a u\| \leq 2 \|u\| \tag{4.3}
$$

for all vectors whose coordinates are zeros outside . Now, for an arbitrary unit vector , we can decompose it as where , each is non-zero only at coordinates, and the vectors are non-zeros on different coordinates. We can apply (4.3) for each such