Neural networks have demonstrated great success in numerous machine-learning tasks [33, 25, 36, 6, 30, 46, 47]. One of the empirical findings is that neural networks, trained by first-order methods from random initialization, have a remarkable ability to fit training data.
From a capacity perspective, the ability to fit training data may not be surprising: modern neural networks are always heavily over-parameterized, that is, they have (much) more parameters than the total number of training samples. Thus, in theory, there always exist parameter choices that achieve zero training error as long as the data does not degenerate.
Yet, from an optimization perspective, the fact that randomly initialized first-order methods can find such an optimal solution on the training data is quite non-trivial: neural networks used in practice are often equipped with the ReLU activation function, which makes the training objective not only non-convex, but even non-smooth. Even the general convergence for finding approximate first- and second-order critical points of a non-convex, non-smooth function is not fully understood, and appears to be a challenging question on its own. This is in direct contrast to practice, in which ReLU networks trained by stochastic gradient descent (SGD) from random initialization almost never face the problem of non-smoothness or non-convexity, and can converge to even a global minimum over the training set quite easily.
Recently, quite a few papers have tried to understand the success of neural networks from an optimization perspective. Many of them focus on the case when the inputs are random Gaussian, and work only for two-layer neural networks [11, 50, 55, 35, 19, 22, 43, 60, 59].
Li and Liang showed that for a two-layer network with ReLU activation, SGD finds nearly-global optimal (say, 99% classification accuracy) solutions on the training data, as long as the network is over-parameterized, meaning that the number of neurons is polynomially large compared to the input size. Moreover, if the data is sufficiently structured (say, coming from mixtures of separable distributions), this accuracy can be extended to test data as well. As a separate note, over-parameterization is suggested as the possible key to avoid bad local minima by Safran and Shamir, even for two-layer neural networks.
There are also results that go beyond two-layer neural networks but with limitations. Some consider deep linear neural networks without any activation functions [26, 7, 9, 31]. The result of Daniely applies to multi-layer neural networks with ReLU activation, but is about the convex training process only with respect to the last layer. Daniely worked in a parameter regime where the weight changes of all layers except the last one make negligible contribution to the final output (and they form the so-called conjugate kernel). The result of Soudry and Carmon shows that under over-parameterization and under random input perturbation, there are no bad local minima for multi-layer neural networks. Their work did not show any provable convergence rate.
In this paper, we study the following fundamental questions:
Can DNN be trained close to zero training error efficiently under mild assumptions?
If so, can the running time depend only polynomially on the number of layers?
Motivation. In 2012 AlexNet  was born with convolutional layers. Since then, the common trend in the deep learning community is to build network architectures that go deeper. In 2014, Simonyan and Zisserman  proposed a VGG network with layers. Later, Szegedy et al.  proposed GoogleNet with layers. In practice, we cannot make the network deeper by naively stacking layers together due to the so-called vanishing / exploding gradient issues. For this reason, in 2015, He et al.  proposed an ingenious deep network structure called Deep Residual Network (ResNet), with the capability of handling at least layers. For more overview and variants of ResNet, we refer the readers to .
Compared to the practical neural networks that go much deeper, the existing theory has been mostly around two-layer (thus one-hidden-layer) networks even just for the training process alone. It is natural to ask if we can theoretically understand how the training process has worked for multi-layer neural networks.
1.1 Our Result
In this paper, we extend the over-parameterization theory to multi-layer neural networks. We show that over-parameterized neural networks can indeed be trained by regular first-order methods to global minima (e.g. zero training error), as long as the dataset is non-degenerate. We say that the dataset is non-degenerate if the data points are distinct. This is a minimal requirement since a dataset with the same input and different labels cannot be trained to zero error. We denote by the minimum (relative) distance between two training data points, and by the number of samples in the training dataset.
Now, consider an -layer fully-connected feedforward neural network, each layer consisting of neurons equipped with ReLU activation. We show that,
As long as , starting from random Gaussian initialized weights, gradient descent (GD) and stochastic gradient descent (SGD) find -error global minimum in regression using at most iterations. This is a linear convergence rate.
Using the same network, if the task is multi-label classification, then GD and SGD find an accuracy classifier on the training set in iterations.
Our result also applies to other Lipschitz-smooth loss functions, and some other network architectures including convolutional neural networks (CNNs) and residual networks (ResNet).
This paper does not cover the generalization of over-parameterized neural networks to the test data. We refer interested readers to some practical evidence [53, 56] that deeper (and wider) neural networks actually generalize better. As for theoretical results, over-parameterized neural networks provably generalize at least for two-layer networks [34, 4] and three-layer networks . (Footnote 1: If data is “well-structured”, two-layer over-parameterized neural networks can learn it using SGD with polynomially many samples . If data is produced by some unknown two-layer (resp. three-layer) neural network, then two-layer (resp. three-layer) neural networks can also provably learn it using SGD and polynomially many samples .)
A concurrent but different result. We acknowledge a concurrent work of Du et al. which has a similar abstract to this paper, but differs from ours in many aspects. Since we noticed many readers cannot tell the two results apart, we compare them carefully below. Du et al. have two main results, one for fully-connected networks and the other for residual networks (ResNet).
For fully-connected networks, their provable training time is exponential in the number of layers, leading to a claim of the form “ResNet has an advantage because ResNet is polynomial-time but fully-connected network is exponential-time.” As we prove in this paper, fully-connected networks do have polynomial training time, so the logic behind their claim is wrong.
For residual networks, their training time scales polynomially in a parameter that depends on the minimal singular value of a complicated, -times recursively-defined kernel matrix. It is not clear whether this parameter is small in practice. From the representation of our paper, it seems this parameter could again be exponential in the number of layers. The authors of have hidden this large factor from their stated complexity in the introduction.
Their result differs from ours in many other aspects. Their result only applies to smooth activation functions and thus cannot apply to the state-of-the-art ReLU activation. Their ResNet requires the value of weight initialization to be a function polynomial in (which is our ); this can heavily depend on the input data. Their result only applies to gradient descent but not to SGD. Their result only applies to loss but not others.
1.2 Other Related Works
Li and Liang originally proved their result for the cross-entropy loss. Later, the “training accuracy” (not the testing accuracy) part of was extended to the loss . The result of claims to have adopted a learning rate times larger than , but that is unfair because they have re-scaled the network by a factor of . (Footnote 2: Indeed, if one replaces any function with then the gradient decreases by a factor of and the needed movement in increases by a factor of . Thus, one can equivalently increase the learning rate by a factor of .)
There is a sequence of works about one-hidden-layer (multiple neurons) CNN [11, 59, 19, 24, 41]. Whether the patches overlap or not plays a crucial role in analyzing algorithms for such CNN. One category of the results has required the patches to be disjoint [11, 59, 19]. The other category [24, 41] has figured out a weaker assumption or even removed that patch-disjoint assumption. On input data distribution, most relied on inputs being Gaussian [11, 59, 19, 41], and some assumed inputs to be symmetrically distributed with identity covariance and boundedness .
can also be generalized to one-hidden-layer ResNet under the Gaussian input assumption; they can show that GD starting from a good initialization point (via tensor initialization) learns ResNet. Hardt and Ma show that deep linear residual networks have no spurious local optima.
If no assumption is allowed, learning neural networks has been shown to be hard from several different perspectives. Thirty years ago, Blum and Rivest first proved that learning the neural network is NP-complete. Stronger hardness results have been proved over the last decade [32, 37, 13, 15, 23, 51, 38].
An Over-Parameterized RNN Theory. Experts in DNN theory may view the present paper as a deeply-simplified version of the recurrent neural network (RNN) paper by the same set of authors. A recurrent neural network executed on input sequences with time horizon is very similar to a feedforward neural network with layers. The main difference between the two is that in feedforward neural networks, the weight matrices are different across layers, and thus independently randomly initialized; in contrast, in RNN, the same weight matrix is applied across the entire time horizon so we do not have fresh new randomness for proofs that involve induction.
So, the over-parameterized convergence theory of DNN is much simpler than that of RNN.
We write this DNN result as a separate paper because: (1) not all the readers can easily notice that DNN is easier to study than RNN; (2) we believe the convergence of DNN is important on its own; (3) the proof in this paper is much simpler (30 vs 80 pages) and could reach out to a wider audience; (4) the simplicity of this paper allows us to tighten parameters in some non-trivial ways; and (5) the simplicity of this paper allows us to also study convolutional networks, residual networks, as well as different loss functions (all of them were missing from ).
We also note that the techniques of this paper can be combined with  to show the convergence of over-parameterized deep RNN.
We use to denote the Gaussian distribution of mean and variance ; and to denote the binomial distribution with trials and success rate . We use to denote Euclidean norms of vectors , and to denote spectral and Frobenius norms of matrices . For a tuple of matrices, we let and .
We use to denote the ReLU function, and extend it to vectors by letting . We use to denote the indicator function for .
The training data consist of vector pairs , where each is the feature vector and is the label of the -th training sample. We assume without loss of generality that data are normalized so that and its last coordinate . (Footnote 3: Without loss of generality, one can re-scale and assume for every . Again, without loss of generality, one can pad each by an additional coordinate to ensure . Finally, without loss of generality, one can pad each by an additional coordinate to ensure . This last coordinate is equivalent to introducing a (random) bias term, because where . In our proofs, the specific constant does not matter.) We make the following separable assumption on the training data (motivated by ):
For every pair , we have .
To present the simplest possible proof, the main body of this paper only focuses on depth- feedforward fully-connected neural networks with an -regression task. Therefore, each is a target vector for the regression task. We explain how to extend it to more general settings in Section 3.3 and the Appendix. For notational simplicity, we assume all the hidden layers have the same number of neurons; our results trivially generalize to each layer having a different number of neurons. Specifically, we focus on the following network
where is the weight matrix for the input layer, is the weight matrix for the -th hidden layer, and is the weight matrix for the output layer. For notational convenience in the proofs, we may also use to denote and to denote .
Definition 2.2 (diagonal sign matrix).
For each and , we denote by the diagonal sign matrix where for each .
As a result, we have and .
We make the following standard choices of random initialization:
We say that , and are at random initialization if
for every and ;
for every ; and
for every .
Throughout this paper we assume . To present the simplest proof, we did not try hard to improve such polynomial factors.
2.1 Objective and Gradient
Our regression objective is
We also denote by the loss vector for sample . For simplicity, we only focus on training in this paper and thus leave and at random initialization. Our techniques can be extended to the case when , and are jointly trained.
For each , we define and for , we define .
Using this notation, one can calculate the gradient of as follows.
The gradient with respect to the -th row of is
The gradient with respect to is
We denote by .
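As a sanity check of the gradient computation above, the following NumPy sketch builds a small fully-connected ReLU network, evaluates the gradient with respect to each hidden weight matrix through the diagonal sign matrices and the backward products, and compares a few entries against central finite differences. The architecture out = B h_L with h_0 = relu(A x) and h_l = relu(W_l h_{l-1}), the initialization variances (2/m for A and W_l, 1/d_out for B), the helper names and the tiny problem sizes are all assumptions made for illustration; they are not legible in the text above.

```python
import numpy as np

def init_params(d_in, m, d_out, L, rng):
    # Assumed Gaussian initialization (hypothetical choice for this sketch):
    # entries of A and each W_l ~ N(0, 2/m), entries of B ~ N(0, 1/d_out).
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d_in))
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]
    B = rng.normal(0.0, np.sqrt(1.0 / d_out), size=(d_out, m))
    return A, Ws, B

def forward(A, Ws, B, x):
    # hs[l] is the hidden vector h_l; signs[l] is the diagonal of D_l (0/1 entries).
    g = A @ x
    signs = [(g >= 0).astype(float)]
    hs = [signs[0] * g]                     # h_0 = relu(A x) = D_0 (A x)
    for W in Ws:
        g = W @ hs[-1]
        signs.append((g >= 0).astype(float))
        hs.append(signs[-1] * g)            # h_l = relu(W_l h_{l-1}) = D_l W_l h_{l-1}
    return hs, signs, B @ hs[-1]

def objective(A, Ws, B, X, Y):
    # F(W) = 1/2 * sum_i || B h_{i,L} - y_i ||^2
    return 0.5 * sum(np.sum((forward(A, Ws, B, x)[2] - y) ** 2) for x, y in zip(X, Y))

def gradient(A, Ws, B, X, Y):
    grads = [np.zeros_like(W) for W in Ws]
    for x, y in zip(X, Y):
        hs, signs, out = forward(A, Ws, B, x)
        back = B.T @ (out - y)              # B^T loss_i
        for l in range(len(Ws), 0, -1):     # hidden layers L, L-1, ..., 1
            delta = signs[l] * back         # D_l applied to the back-propagated loss
            grads[l - 1] += np.outer(delta, hs[l - 1])
            back = Ws[l - 1].T @ delta      # propagate one layer further back
    return grads

rng = np.random.default_rng(0)
d_in, m, d_out, L, n = 5, 16, 2, 3, 4
A, Ws, B = init_params(d_in, m, d_out, L, rng)
X = rng.normal(size=(n, d_in))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
Y = rng.normal(size=(n, d_out))

grads = gradient(A, Ws, B, X, Y)
eps, max_err = 1e-6, 0.0
for l in range(L):
    for _ in range(5):                          # a few random entries per layer
        i, j = rng.integers(m), rng.integers(m)
        Ws[l][i, j] += eps
        f_plus = objective(A, Ws, B, X, Y)
        Ws[l][i, j] -= 2 * eps
        f_minus = objective(A, Ws, B, X, Y)
        Ws[l][i, j] += eps                      # restore the entry
        fd = (f_plus - f_minus) / (2 * eps)
        max_err = max(max_err, abs(fd - grads[l][i, j]))
print("max |finite difference - analytic gradient| =", max_err)
```

Away from the (measure-zero) points where some pre-activation is exactly zero, the objective is quadratic in any single weight entry, so the central difference should agree with the analytic gradient up to rounding error.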
3 Our Results and Techniques
To present our result in the simplest possible way, we choose to mainly focus on fully-connected -layer neural networks with the regression loss. We shall extend it to more general settings (such as convolutional and residual networks and other losses) in Section 3.3. Our main results can be stated as follows:
Theorem 1 (gradient descent).
Starting from random initialization, with probability at least , gradient descent with learning rate finds a point in iterations.
This is known as the linear convergence rate because drops exponentially fast in . We have not tried to improve the polynomial factors in and , and are aware of several ways to improve these factors (but at the expense of complicating the proof). We note that is the data input dimension and our result is independent of .
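To illustrate what linear convergence looks like in the over-parameterized regime, here is a minimal NumPy sketch that runs plain gradient descent on a small ReLU network with a single trained hidden matrix, keeping the input and output layers at their random initialization. This is only a toy experiment and not the setting of Theorem 1: the sizes, the initialization variances, the scalar output and the step size `eta = 1/(n*m)` are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, steps = 4, 8, 256, 1000          # samples, input dim, width, GD steps

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # distinct unit-norm inputs
y = rng.uniform(-1.0, 1.0, size=n)              # scalar regression targets

# Assumed initialization: A, W ~ N(0, 2/m) entrywise; output weights b ~ N(0, 1).
A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))
W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
b = rng.normal(0.0, 1.0, size=m)

H0 = np.maximum(X @ A.T, 0.0)   # first-layer features; fixed because A is not trained
eta = 1.0 / (n * m)             # hypothetical step size for this toy problem

losses = []
for t in range(steps):
    G = H0 @ W.T                          # pre-activations, shape (n, m)
    D = (G >= 0).astype(float)            # sign patterns, one row per sample
    out = (D * G) @ b                     # network outputs, shape (n,)
    resid = out - y
    losses.append(0.5 * np.sum(resid ** 2))
    # Gradient w.r.t. W: sum_i resid_i * (D_i * b) h_{i,0}^T
    grad = ((D * b) * resid[:, None]).T @ H0
    W -= eta * grad

print("initial loss:", losses[0])
print("final loss:  ", losses[-1])
```

Plotting `np.log(losses)` against the iteration count should give a roughly straight, decreasing line for most of the run, which is the signature of a linear rate.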
Theorem 2 (SGD).
Suppose and . Starting from random initialization, with probability at least , SGD with learning rate and mini-batch size finds in iterations.
This is a nearly-linear convergence rate because . The additional factor arises because we have a high-confidence bound.
Experts in optimization theory may immediately question the accuracy of Theorem 2, because SGD is known to converge at a slower rate even for convex functions. There is no contradiction here. Imagine a strongly convex function that has a common minimizer for every ; then SGD is known to converge at a linear rate.
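The remark about a common minimizer can be checked on a toy problem. The sketch below is an illustration under assumed choices (a consistent least-squares system, unit-norm rows, step size 0.5), not part of the paper's analysis: every component function f_i(x) = 0.5 (a_i · x − b_i)^2 is minimized at the same point `x_star`, and SGD with a constant step size then contracts the distance to `x_star` geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, steps = 20, 5, 2000

A = rng.normal(size=(n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm rows: each f_i is 1-smooth
x_star = rng.normal(size=d)
b = A @ x_star          # consistent system: x_star minimizes every f_i simultaneously

x = np.zeros(d)
eta = 0.5               # constant step size; no decay is needed in this setting
errors = []
for t in range(steps):
    i = rng.integers(n)                       # sample one component uniformly
    x -= eta * (A[i] @ x - b[i]) * A[i]       # stochastic gradient step on f_i
    errors.append(np.linalg.norm(x - x_star))

print("distance to x_star after 100 steps: ", errors[99])
print("distance to x_star after 2000 steps:", errors[-1])
```

Because the stochastic gradients all vanish at `x_star`, their variance shrinks with the error itself, which is why no decaying step size (and hence no sublinear rate) is forced here.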
3.1 Technical Theorems
Figure 1(b): A typical training curve for SGD, where the norm of the (full) gradient decreases as the objective value decreases. The gradient norm does tend to zero because the cross-entropy loss is used for multi-label classification (see Section 3.3). The training accuracy already becomes 99.8%. The dataset used is CIFAR10, and the neural network used is ResNet with 32 layers. Similar landscapes can also be spotted for AlexNet, VGG, DenseNet, etc. If the readers are interested in more details, we can include more experiments in a future revision.
The main difficulty of this paper is to prove the following two technical theorems. The first one is about the gradient bounds for points that are sufficiently close to the random initialization:
Theorem 3 (no critical point).
With probability over randomness , it satisfies for every , every , and every with ,
Most notably, the second property of Theorem 3 says that as long as the objective is large, the gradient norm is also large. (See also Figure 1(a).) This means, when we are sufficiently close to the random initialization, there is no saddle point or critical point of any order. This gives us hope to find global minima of the objective .
Unfortunately, Theorem 3 itself is not enough. Even if we follow the negative gradient direction of , how can we guarantee that the objective truly decreases? Classically in optimization theory, one relies on the smoothness property (e.g. Lipschitz smoothness ) to derive such an objective-decrease guarantee. Unfortunately, the smoothness property at least requires the objective to be twice differentiable, but the ReLU activation is not.
To deal with this issue, we prove the following “semi-smoothness” property of the objective.
Theorem 4 (semi-smoothness).
With probability at least over the randomness of , we have for every with , and for every with ,
Quite different from classical smoothness, we still have a first-order term on the right-hand side, whereas classical smoothness only has a second-order term . As one can see in our final proofs, as grows larger (that is, as we over-parameterize), the effect of the first-order term becomes smaller and smaller compared to the second-order term. This brings Theorem 4 closer and closer, but still not identical, to the classical Lipschitz smoothness.
In our proofs, we show that GD and SGD can converge fast enough and thus the weights stay close to random initialization, by a seemingly small spectral norm bound . In fact this bound is large enough to totally change the outputs and fit the training data, because weights are randomly initialized (per entry) at around for being large.
In practice, we acknowledge that one often goes beyond this theory-predicted spectral-norm boundary. However, quite interestingly, we still observe that Theorems 3 and 4 hold in practice, at least for vision tasks. In Figure 1(a), we show the typical landscape near a point on the SGD training trajectory. The gradient is sufficiently large and going in its direction can indeed decrease the objective; in contrast, though the objective is non-convex, the negative curvature of its “Hessian” is not significant compared to the gradient. From Figure 1(a) we also see that the objective function is sufficiently smooth (at least in the two dimensions of interest that we plot).
3.2 Main Techniques
Step 1: properties at random initialization. Let be at random initialization and and be defined with respect to . We first show that forward propagation neither explodes nor vanishes. That is, for all and . This is basically because for a fixed , we have is around , and if its signs are sufficiently random, then ReLU activation kills half of the norm, that is . Then applying induction finishes the proof.
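The claim that forward propagation neither explodes nor vanishes is easy to observe numerically. The following sketch assumes the Gaussian initialization with per-entry variance 2/m (an assumption of this example, chosen so that the factor 2 compensates for ReLU zeroing out about half of the coordinates) and tracks the Euclidean norm of the hidden vector across layers for one unit-norm input.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, L = 16, 1024, 10            # hypothetical input dim, width, depth

x = rng.normal(size=d)
x /= np.linalg.norm(x)            # unit-norm input

# h_0 = relu(A x) with A ~ N(0, 2/m) entrywise (assumed initialization)
h = np.maximum(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d)) @ x, 0.0)
norms = [float(np.linalg.norm(h))]

for _ in range(L):
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))   # fresh randomness per layer
    h = np.maximum(W @ h, 0.0)                           # h_l = relu(W_l h_{l-1})
    norms.append(float(np.linalg.norm(h)))

print(np.round(norms, 3))
```

With a variance of 1/m instead of 2/m, the same loop would show the norm shrinking by roughly a factor of sqrt(2) per layer, which is the vanishing behavior the scaling is meant to avoid.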
Analyzing forward propagation is not enough. We also need spectral norm bounds on the backward matrix , and on the intermediate matrix for every . Note that if one naively bounds the spectral norm by induction, then and it will exponentially blow up! Our more careful analysis ensures that even when layers are stacked together, there is no exponential blow-up in .
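The gap between the naive inductive bound and the actual spectral norm of the intermediate matrix can be seen in a small experiment. The sketch below forms the product W_L D_{L-1} W_{L-1} ... D_1 W_1 for one input, using the sign patterns from a forward pass, and compares its spectral norm with the product of the individual spectral norms. The initialization variance 2/m, the sizes and the variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, L = 16, 256, 8              # hypothetical input dim, width, depth

x = rng.normal(size=d)
x /= np.linalg.norm(x)
A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))
Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]

# Forward pass to obtain the diagonal sign matrices D_1, ..., D_L (stored as 0/1 vectors).
h = np.maximum(A @ x, 0.0)
signs = []
for W in Ws:
    g = W @ h
    signs.append((g >= 0).astype(float))
    h = signs[-1] * g

# Intermediate matrix P = W_L D_{L-1} W_{L-1} ... D_1 W_1.
P = Ws[0]
for l in range(1, L):
    P = Ws[l] @ (signs[l - 1][:, None] * P)   # left-multiply by W_{l+1} D_l

actual = float(np.linalg.norm(P, 2))                          # true spectral norm
naive = float(np.prod([np.linalg.norm(W, 2) for W in Ws]))    # product of layer norms

print("spectral norm of the product:", actual)
print("naive product of norms:      ", naive)
```

Each square Gaussian matrix with entry variance 2/m has spectral norm close to 2·sqrt(2), so the naive bound grows like (2·sqrt(2))^L, while the norm of the actual product stays orders of magnitude smaller.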
The final lemma in this step proves that, as long as , then for each layer it also satisfies . This can be proved by a careful induction. Details are in Section 4.
Step 2: stability after adversarial perturbation. We show that for every that is “close” to initialization, meaning for every and for some , then the number of sign changes is at most , and the perturbation amount is at most . We emphasize here that may depend on the randomness of so one cannot use a union bound. We call this “forward stability”, and it is one of the most technical proofs of this paper.
Another main result in this step is to show that the backward matrix does not change by more than in spectral norm. (Recall that in Step 1 we showed that this matrix is of spectral norm ; thus as long as , this change is somewhat negligible.) Details are in Section 5.
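A rough feel for forward stability can be obtained by perturbing every hidden weight matrix by a matrix of small spectral norm and counting how many ReLU sign patterns flip. The sketch below uses random perturbations, which is a much weaker test than the adversarial statement in Step 2; the perturbation radius `omega`, the sizes and the initialization variance 2/m are hypothetical choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, L, omega = 16, 512, 5, 0.01   # omega: spectral norm of each perturbation

x = rng.normal(size=d)
x /= np.linalg.norm(x)
A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))
Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]

def sign_patterns(weights):
    # Boolean activation pattern of every hidden layer for the fixed input x.
    h = np.maximum(A @ x, 0.0)
    patterns = []
    for W in weights:
        g = W @ h
        patterns.append(g >= 0)
        h = np.maximum(g, 0.0)
    return patterns

# Random (not adversarial) perturbations, rescaled to spectral norm exactly omega.
perturbed = []
for W in Ws:
    G = rng.normal(size=(m, m))
    perturbed.append(W + omega * G / np.linalg.norm(G, 2))

base = sign_patterns(Ws)
pert = sign_patterns(perturbed)
flip_fraction = [float(np.mean(p != q)) for p, q in zip(base, pert)]

print("fraction of flipped signs per layer:", flip_fraction)
```

Only neurons whose pre-activation is already close to zero can flip, which is why the flipped fraction stays small even though the perturbation touches every entry of every matrix.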
Step 3: gradient bound. The hard part of Theorem 3 is to show the gradient lower bound. For this purpose, recall from Fact 2.6 that each term in the gradient can be written as where the backward matrix is applied to a loss vector . To show that this is large, intuitively, one wishes to show and are both vectors with large Euclidean norm. However, the main difficulty is that in calculating the gradient, different samples may form different gradient matrices and, when summed together, they could in principle cancel each other and possibly even form a zero matrix. To deal with this issue, we use from Step 1. In other words, even if the gradient matrix with respect to one sample is fixed, those with respect to other samples still have sufficient randomness so that the final gradient matrix will not be zero. This idea comes from the prior work and helps us prove Theorem 3. (Footnote 4: This is the only technical idea that we borrowed from Li and Liang , which is the over-parameterization theory for 2-layer neural networks.) Details are in Appendix 6 and 7.
Step 4: smoothness. In order to prove Theorem 4, one needs to argue, if we are currently at and perturb it by , then how much does the objective change in second and higher order terms. This is different from our stability theory in Step 2, because Step 2 is regarding having a perturbation on ; in contrast, in Theorem 4 we need a (small) perturbation on top of , which may already be a point perturbed from . Nevertheless, we still manage to show that, if is calculated on and is calculated on , then . This is proportional to the small perturbation so, along with other properties to prove, ensures smoothness. This explains Theorem 4 and details are in Section 8.
3.3 Notable Extensions
Our Step 1 through Step 4 in Section 3.2 in fact give rise to a general plan for proving the training convergence of any neural network (at least with respect to the ReLU activation). Thus, it is expected that it can be generalized to many other settings. Not only can we have a different number of neurons in each layer, our theorems can also be extended at least in the following three major directions. (Footnote 5: In principle, each such proof may require a careful rewriting of the main body of this paper. We choose to sketch only the proof difference in order to keep this paper short. If there is sufficient interest from the readers, we can consider adding the full proofs in a future revision of this paper.)
Different loss functions. There is absolutely no need to restrict our attention only to regression loss. We prove in Appendix A that, for any Lipschitz-smooth loss function :
If is cross-entropy for multi-label classification, then we achieve training accuracy in at most iterations.
If is gradient dominant (a.k.a. Polyak-Łojasiewicz) but possibly non-convex, we still have linear convergence. (Footnote 6: Note that the loss function when combined with the neural network together is not gradient dominant. Therefore, one cannot apply classical theory on gradient dominant functions to derive our same result.)
If is convex, then we have convergence rate .
If is non-convex, then we have convergence rate for finding . (Footnote 7: Again, this cannot be derived from classical theory of finding approximate saddle points for non-convex functions, because finding weights with small is a very different (usually much harder) task compared to having small gradient with respect to for the entire composite function .)
Convolutional neural networks (CNN). There are lots of different ways to design CNN and each of them may require somewhat different proofs. In Appendix B, we study the case when are convolutional while and are fully connected. We assume for notational simplicity that each hidden layer has points each with channels. (In vision tasks, a point is a pixel). In the most general setting, these values and can vary across layers. Our Theorem 5 says that, as long as is polynomially large, GD and SGD find an -error solution for regression in iterations.
Residual neural networks (ResNet). There are lots of different ways to design ResNet and each of them may require somewhat different proofs. In symbols, between two layers, one may study , , or even . Since the main purpose here is to illustrate the generality of our techniques but not to attack each specific setting, in Appendix C, we choose to consider the simplest residual setting (that was also studied for instance by theoretical work ). With appropriately chosen random initialization, our Theorem C shows that one can also have linear convergence rate in the over-parameterized setting.
4 Properties at Random Initialization
Throughout this section we assume and are randomly generated according to Def. 2.3. The diagonal sign matrices are also determined according to this random initialization.
4.1 Forward Propagation
Lemma 4.1 (forward propagation).
If , with probability at least over the randomness of and , we have
Lemma 4.1 is in fact trivial to prove if the allowed failure probability is instead (by applying concentration inequality layer by layer).
Before proving Lemma 4.1 we note a simple mathematical fact:
Fact 4.2.
Let be fixed vectors and , be a random matrix with i.i.d. entries , and vector defined as . Then,
follows i.i.d. from the following distribution: with half probability , and with the other half probability follows from folded Gaussian distributions .
is in distribution identical to (chi-square distribution of order ) where follows from binomial distribution .
Proof of Fact 4.2.
We assume each vector is generated by first generating a Gaussian vector and then setting where the sign is chosen with half-half probability. Now, only depends on , and is in distribution identical to . Next, after the sign is determined, the indicator is with half probability and with another half. Therefore, satisfies the aforementioned distribution. As for , letting be the variable indicating how many indicators are 1, then and . ∎
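The distributional statement of Fact 4.2 can be checked by simulation in a simplified special case. The sketch below assumes a zero shift vector, so that the vector in question is simply relu(W h), and entries of W drawn from N(0, 2/m); both are assumptions of this example. It verifies that about half of the coordinates are zero and that the squared norm concentrates around the squared norm of h, as a rescaled chi-square variable with a Binomial(m, 1/2) number of degrees of freedom would.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 40000                   # hypothetical dimensions; large m for concentration

h = rng.normal(size=d)            # any fixed nonzero vector
W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))   # assumed N(0, 2/m) entries
v = np.maximum(W @ h, 0.0)        # special case with zero shift: v = relu(W h)

zero_fraction = float(np.mean(v == 0.0))             # should be close to 1/2
omega = int(np.count_nonzero(v))                     # plays the role of the binomial count
ratio = float(np.sum(v ** 2) / np.sum(h ** 2))       # should be close to 1

print("fraction of zero coordinates:", zero_fraction)
print("number of active coordinates:", omega, "out of", m)
print("||v||^2 / ||h||^2:           ", ratio)
```

Each active coordinate contributes a squared Gaussian of variance 2·||h||^2/m, and about m/2 coordinates are active, which is where the ratio close to 1 comes from.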
Proof of Lemma 4.1.
We only prove Lemma 4.1 for a fixed and because we can apply union bound at the end. Below, we drop the subscript for notational convenience, and write and as and respectively.
Letting , we can write
According to Fact 4.2, fixing any and letting be the only source of randomness, we have where . For such reason, for each , we can write where and . In the analysis below, we condition on the event that ; this happens with probability for each layer . To simplify our notations, if this event does not hold, we set .
Expectation. One can verify that where is the digamma function. Using the bound of digamma function, we have
Whenever , we can write
It is easy to verify and . Therefore,
Combining everything together, along with the fact that , we have (when is sufficiently larger than a constant)
Subgaussian Tail. By standard tail bound for chi-square distribution, we know that
Since we only need to focus on , this means
On the other hand, by Chernoff-Hoeffding bound, we also have
Together, using the definition (or if ), we obtain
Now, let us make another simplification: define if and otherwise. In this way, (4.2) implies that is an -subgaussian random variable.
4.2 Intermediate Layers
Lemma 4.3 (intermediate layers).
Again we prove the lemma for fixed and because we can take a union bound at the end. We drop the subscript for notational convenience.
Let be any fixed unit vector, and define . According to Fact 4.2 again, fixing any and letting be the only source of randomness, defining , we have that is distributed according to a where . Therefore, we have
Using exactly the same proof as Lemma 4.1, we have
with probability at least . As a result, if we fix a subset of cardinality , taking -net, we know that with probability at least , it satisfies
for all vectors whose coordinates are zeros outside . Now, for an arbitrary unit vector , we can decompose it as where , each is non-zero only at coordinates, and the vectors are non-zero on different coordinates. We can apply (4.3) for each such