Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. Our bounds also shed light on the advantage of using ResNet over the fully connected feedforward architecture; our bound requires the number of neurons per layer to scale exponentially with depth for feedforward networks, whereas for ResNet the bound only requires the number of neurons per layer to scale polynomially with depth. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
One of the mysteries in deep learning is that randomly initialized first-order methods like gradient descent achieve zero training loss, even if the labels are arbitrary (Zhang et al., 2016). Over-parameterization is widely believed to be the main reason for this phenomenon, since a neural network can fit all the training data only if it has sufficiently large capacity. In practice, many neural network architectures are highly over-parameterized. For example, Wide Residual Networks have 100x more parameters than the number of training data points (Zagoruyko and Komodakis, 2016).
The second mysterious phenomenon in training deep neural networks is “deeper networks are harder to train.” To solve this problem, He et al. (2016) proposed the deep residual network (ResNet) architecture, which enables randomly initialized first-order methods to train neural networks with an order of magnitude more layers. Theoretically, Hardt and Ma (2016) showed that residual links in linear networks prevent gradient vanishing in a large neighborhood of zero, but for neural networks with non-linear activations, the advantages of using residual connections are not well understood.
In this paper, we demystify these two mysterious phenomena. We consider the setting where there are $n$ data points, and the neural network has $H$ layers with width $m$. We focus on the least-squares loss and assume the activation function is Lipschitz and smooth. This assumption holds for many activation functions including the soft-plus. Our contributions are summarized below.
Next, we consider the ResNet architecture. We show as long as , then randomly initialized gradient descent converges to zero training loss at a linear rate. Comparing with the first result, the dependence on the number of layers improves exponentially for ResNet. This theory demonstrates the advantage of using residual connections.
Lastly, we apply the same technique to analyze convolutional ResNet. We show if where is the number of patches, then randomly initialized gradient descent achieves zero training loss.
Our proof builds on two ideas from previous work on gradient descent for two-layer neural networks. First, we use the observation by Li and Liang (2018) that if the neural network is over-parameterized, every weight matrix is close to its initialization. Second, following Du et al. (2018b), we analyze the dynamics of the predictions, whose convergence is determined by the least eigenvalue of the Gram matrix induced by the neural network architecture. To lower bound this least eigenvalue, it is sufficient to bound the distance of each weight matrix from its initialization.
Different from these two works, in analyzing deep neural networks, we need to exploit more structural properties of deep neural networks and develop new techniques. See Section 6 and the Appendix for more details.
This paper is organized as follows. In Section 2, we formally state the problem setup. In Section 3, we give our main result for the deep fully-connected neural network. In Section 4, we give our main result for the ResNet. In Section 5, we give our main result for the convolutional ResNet. In Section 6, we present a unified proof strategy for these three architectures. We conclude in Section 7 and defer all proofs to the appendix.
Recently, many works have studied the optimization problem in deep learning. Since optimizing a neural network is a non-convex problem, one approach is first to develop a general theory for a class of non-convex problems which satisfy desired geometric properties and then identify that the neural network optimization problem belongs to this class. One promising candidate class is the set of functions for which all local minima are global and every saddle point has a direction of negative curvature. For this function class, researchers have shown gradient descent (Jin et al., 2017; Ge et al., 2015; Lee et al., 2016; Du et al., 2017a) can find a global minimum. Many previous works thus try to study the optimization landscape of neural networks with different activation functions (Soudry and Hoffer, 2017; Safran and Shamir, 2018, 2016; Zhou and Liang, 2017; Freeman and Bruna, 2016; Hardt and Ma, 2016; Nguyen and Hein, 2017; Kawaguchi, 2016; Venturi et al., 2018; Soudry and Carmon, 2016; Du and Lee, 2018; Soltanolkotabi et al., 2018; Haeffele and Vidal, 2015). However, even for a deep linear network, there exists a saddle point that does not have a negative curvature (Kawaguchi, 2016), so it is unclear whether this approach can be used to obtain the global convergence guarantee of first-order methods.
Another way to attack this problem is to study the dynamics of a specific algorithm for a specific neural network architecture. Our paper also belongs to this category. Many previous works put assumptions on the input distribution and assume the label is generated according to a planted neural network. Based on these assumptions, one can obtain global convergence of gradient descent for some shallow neural networks (Tian, 2017; Soltanolkotabi, 2017; Brutzkus and Globerson, 2017; Du et al., 2018a; Li and Yuan, 2017; Du et al., 2017b). Some local convergence results have also been proved (Zhong et al., 2017a, b; Zhang et al., 2018). In comparison, our paper does not try to recover the underlying neural network. Instead, we focus on the empirical loss minimization problem and rigorously prove that randomly initialized gradient descent can achieve zero training loss.
The most related papers are Li and Liang (2018) and Du et al. (2018b), who observed that when training a two-layer fully connected neural network, most of the patterns do not change over iterations, which we also use to show the stability of the Gram matrix. They used this observation to obtain the convergence rate of gradient descent on a two-layer over-parameterized neural network for the cross-entropy and least-squares loss. More recently, Allen-Zhu et al. (2018a) generalize ideas from Li and Liang (2018) to derive convergence rates of training recurrent neural networks. Our work extends these previous results in several ways: a) we consider deep networks, b) we generalize to ResNet architectures, and c) we generalize to convolutional networks. To improve the width dependence on sample size $n$, we utilize a smooth activation (e.g. smooth ReLU). For example, our results specialized to depth improve upon Du et al. (2018b) in the required amount of overparametrization from to . See Theorem 3.1 for the precise statement.
Chizat and Bach (2018); Wei et al. (2018); Mei et al. (2018) used optimal transport theory to analyze gradient descent on over-parameterized models. However, their results are limited to two-layer neural networks and may require an exponential amount of over-parametrization.
Daniely (2017) developed the connection between deep neural networks and kernel methods and showed stochastic gradient descent can learn a function that is competitive with the best function in the conjugate kernel space of the network. Andoni et al. (2014) showed that gradient descent can learn networks that are competitive with polynomial classifiers. However, these results do not imply gradient descent can find a global minimum for the empirical loss minimization problem.
Finally, in concurrent work, Allen-Zhu et al. (2018b) also analyze gradient descent on deep neural networks. The primary difference between the two papers is that we analyze general smooth activations, whereas Allen-Zhu et al. (2018b) develop a specific analysis for the ReLU activation. The two papers also differ significantly in their data assumptions. We wish to emphasize that a fair comparison is not possible due to the difference in setting and data assumptions. We view the two papers as complementary since they address different neural net architectures.
For ResNet, the primary focus of this manuscript, the required width per layer for Allen-Zhu et al. (2018b) is and for Theorem 4.1 is . (Footnote 2: In all comparisons, we ignore the polynomial dependency on data-dependent parameters which only depends on the input data and the activation function. The two papers use different measures and in our manuscript and in Allen-Zhu et al. (2018b), so they are not directly comparable. Footnote 3: and does not depend on the depth . We substituted this into Theorem 4.1.) Our paper requires a width that does not depend on the desired accuracy . As a consequence, Theorem 4.1 guarantees the convergence of gradient descent to a global minimizer. The iteration complexity of Allen-Zhu et al. (2018b) is and of Theorem 4.1 is .
For fully-connected networks, Allen-Zhu et al. (2018b) requires width and iteration complexity . Theorem 3.1 requires width and iteration complexity . The primary difference is for very deep fully-connected networks, Allen-Zhu et al. (2018b) requires less width and iterations, but we have much milder dependence on sample size . Commonly used fully-connected networks such as VGG are not deep ( ), yet the dataset size such as ImageNet ( ) is very large.
In a second concurrent work, Zou et al. (2018) also analyzed the convergence of gradient descent on fully-connected networks with ReLU activation. The emphasis is on different loss functions (e.g. hinge loss), so the results are not directly comparable. Both Zou et al. (2018) and Allen-Zhu et al. (2018b) also analyze stochastic gradient descent.
We let $[n] = \{1, \ldots, n\}$. Given a set $S$, we use $\mathrm{unif}\{S\}$ to denote the uniform distribution over $S$. We use $N(0, 1)$ to denote the standard Gaussian distribution. For a matrix $A$, we use $A_{ij}$ to denote its $(i, j)$-th entry. For a vector $v$, we use $\|v\|_2$ to denote the Euclidean norm. For a matrix $A$ we use $\|A\|_F$ to denote the Frobenius norm and $\|A\|_2$ to denote the operator norm. If a matrix $A$ is positive semi-definite, we use $\lambda_{\min}(A)$ to denote its smallest eigenvalue. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product between two vectors or matrices. We use $\sigma(\cdot)$ to denote the activation function, which in this paper we assume is Lipschitz continuous and smooth (Lipschitz gradient). The guiding example is softplus: $\sigma(z) = \log(1 + \exp(z))$, whose Lipschitz and smoothness constants are bounded by $1$. We will use (e.g. ) to denote universal constants that only depend on the activation. Lastly, let $O(\cdot)$ and $\Omega(\cdot)$ denote standard Big-O and Big-Omega notations, only hiding absolute constants.
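The claim that softplus is Lipschitz and smooth can be checked numerically. The sketch below is illustrative only: the function names and the evaluation grid are assumptions introduced here. It evaluates the first derivative (the logistic sigmoid) and the second derivative on a grid and records their maxima, which should not exceed 1 and 1/4 respectively.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable softplus, log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softplus_grad(z: float) -> float:
    """First derivative of softplus: the logistic sigmoid, always in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus_hess(z: float) -> float:
    """Second derivative of softplus: s(z) * (1 - s(z)), which is at most 1/4."""
    s = softplus_grad(z)
    return s * (1.0 - s)

# Hypothetical evaluation grid on [-20, 20] with spacing 0.01.
grid = [i / 100.0 for i in range(-2000, 2001)]
max_grad = max(softplus_grad(z) for z in grid)   # bounds the Lipschitz constant
max_hess = max(softplus_hess(z) for z in grid)   # bounds the smoothness constant

print(f"max |sigma'| on grid  = {max_grad:.6f}")
print(f"max |sigma''| on grid = {max_hess:.6f}")
```

Both maxima stay below 1 on this grid, which is consistent with treating the Lipschitz and smoothness constants as absolute constants in the analysis.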
In this paper, we focus on the empirical risk minimization problem with the quadratic loss function
where are the training inputs, are the labels, is the parameter we optimize over and is the prediction function, which in our case is a neural network. We consider the following architectures.
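For concreteness, one standard way to write a least-squares empirical risk of the kind described here is sketched below. The symbol names ($n$ for the number of samples, $x_i$ and $y_i$ for inputs and labels, $\theta$ for the parameters, $f$ for the prediction function) and the factor $1/2$ are assumptions chosen for illustration, following common convention.

```latex
\min_{\theta} \; L(\theta) \;=\; \frac{1}{2} \sum_{i=1}^{n} \bigl( f(\theta, x_i) - y_i \bigr)^2
```

With this form, the loss is zero exactly when every prediction $f(\theta, x_i)$ matches its label $y_i$.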
Multilayer fully-connected neural networks: Let be the input, is the first weight matrix, is the weight at the -th layer for , is the output layer and is the activation function. (Footnote 4: We assume intermediate layers are square matrices for simplicity. It is not difficult to generalize our analysis to rectangular weight matrices.) The prediction function is defined as
where is a scaling factor to normalize the input in the initialization phase.
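A minimal sketch of such a fully-connected forward pass follows, under stated assumptions: each hidden layer computes sqrt(c_sigma / m) * sigma(W x), the output is an inner product with a fixed vector `a`, and the normalizing constant is taken to be c_sigma = 1 / E[sigma(z)^2] for z ~ N(0, 1). These forms, and all function names, are assumptions made here for illustration. The final lines check that a unit-norm input has first-layer features with norm close to 1 at a Gaussian initialization, which is what a normalizing scaling factor is meant to achieve.

```python
import math
import random

def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def second_moment(act, lo=-10.0, hi=10.0, steps=20000):
    """Approximate E[act(z)^2] for z ~ N(0, 1) with the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * act(z) ** 2 * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def layer(W, x, c_sigma):
    """One hidden layer: sqrt(c_sigma / m) * softplus(W x), with m = number of rows of W."""
    scale = math.sqrt(c_sigma / len(W))
    return [scale * softplus(sum(w * v for w, v in zip(row, x))) for row in W]

def fc_predict(weights, a, x, c_sigma):
    """Apply every hidden layer in turn, then take the inner product with output weights a."""
    for W in weights:
        x = layer(W, x, c_sigma)
    return sum(ai * xi for ai, xi in zip(a, x))

rng = random.Random(0)
d, m = 3, 2000
c_sigma = 1.0 / second_moment(softplus)            # assumed normalization constant
x0 = [1.0 / math.sqrt(d)] * d                      # unit-norm input
W1 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
first = layer(W1, x0, c_sigma)
first_norm = math.sqrt(sum(v * v for v in first))  # expected to be close to 1

# Tiny deterministic example: zero weights, width 4, all-ones output layer.
tiny = fc_predict([[[0.0] * d for _ in range(4)]], [1.0] * 4, x0, 1.0)
print(first_norm, tiny)
```

In the tiny example every pre-activation is 0, so each hidden unit equals sqrt(1/4) * log 2 and the prediction is 2 log 2.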
ResNet: We use the same notation as for the multilayer fully connected neural networks. (Footnote 5: We will refer to this architecture as ResNet, although it differs from the standard ResNet architecture since the skip-connections are at every layer, instead of every two layers. This architecture was previously studied in Hardt and Ma (2016). We study this architecture for the ease of presentation and analysis. It is not hard to generalize our analysis to architectures with skip-connections every two or more layers.) We define the prediction recursively.
where is a constant that only depends on the activation, , and are constants specified in Section 4. Note here we use a scaling. This scaling plays an important role in guaranteeing the width per layer only needs to scale polynomially with . In practice, the small scaling is enforced by a small initialization of the residual connection (Hardt and Ma, 2016; Anonymous, 2018), which obtains state-of-the-art performance for deep residual networks. We choose to use an explicit scaling, instead of altering the initialization scheme, for notational convenience.
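The recursive definition with an explicitly scaled residual branch can be sketched as follows. This is an illustration under assumptions: the first layer is taken to be a plain normalized layer, every later layer adds a residual branch damped by c_res / (H * sqrt(m)), and the names `c_sigma`, `c_res`, and `resnet_predict` are hypothetical placeholders for the constants and prediction function discussed above.

```python
import math

def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def matvec(W, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def resnet_predict(weights, a, x, c_sigma=1.0, c_res=0.5):
    """weights[0] maps the input to width m; weights[1:] are m-by-m residual blocks."""
    H = len(weights)
    m = len(a)
    # First layer: no skip connection, normalized by sqrt(c_sigma / m).
    x = [math.sqrt(c_sigma / m) * softplus(z) for z in matvec(weights[0], x)]
    # Residual layers: identity skip plus a branch damped by c_res / (H * sqrt(m)).
    damp = c_res / (H * math.sqrt(m))
    for W in weights[1:]:
        branch = matvec(W, x)
        x = [xi + damp * softplus(z) for xi, z in zip(x, branch)]
    # Output: inner product with the (fixed) output weights a.
    return sum(ai * xi for ai, xi in zip(a, x))

# Deterministic toy check with all-zero weights, width 4, depth 3.
m, d, H = 4, 3, 3
zero_weights = [[[0.0] * d for _ in range(m)]]
zero_weights += [[[0.0] * m for _ in range(m)] for _ in range(H - 1)]
pred = resnet_predict(zero_weights, [1.0] * m, [1.0, 0.0, 0.0])
print(pred)
```

With zero weights each unit starts at 0.5 * log 2 and each of the two residual layers adds (0.5 / 6) * log 2, so the prediction equals (8/3) * log 2. The damping by 1/H keeps the total contribution of all residual branches bounded as the depth grows.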
Convolutional ResNet: Lastly, we consider the convolutional ResNet architecture. Again we define the prediction function in a recursive way. Let be the input, where is the number of input channels and is the number of pixels. For , we let the number of channels be and number of pixels be . Given for , we first use an operator to divide into patches. Each patch has size and this implies . For example, when the stride is and
where we defined
, i.e., zero-padding. Note this operator has the property
because each element from appears at least once and at most times. In practice, is often small like , so throughout the paper we treat as a constant in our theoretical analysis. To proceed, let , we have
where is a small constant depending only on , , and are constants we will specify in Section 5. Finally, for , the output is defined as
Note here we use a scaling similar to that of ResNet.
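The patch operator with zero-padding described above can be sketched in a few lines. The layout below is an assumption made for illustration: the input is a list of `d` channels with `p` pixels each, the stride is 1, the filter size `q` is odd, and `patches` and `frob` are hypothetical helper names. Because every input entry is copied into at least one patch and at most `q` patches, the Frobenius norm of the output should lie between the norm of the input and sqrt(q) times that norm, which the last lines check.

```python
import math
import random

def patches(x, q=3):
    """Split x (d channels, p pixels each) into p patches of length q * d.

    Assumes stride 1, odd q, and zero-padding outside the valid pixel range.
    """
    d, p = len(x), len(x[0])
    half = q // 2
    out = []
    for center in range(p):
        patch = []
        for c in range(d):
            for off in range(-half, half + 1):
                j = center + off
                patch.append(x[c][j] if 0 <= j < p else 0.0)
        out.append(patch)
    return out

def frob(rows):
    """Frobenius norm of a list-of-lists array."""
    return math.sqrt(sum(v * v for row in rows for v in row))

rng = random.Random(0)
d, p, q = 2, 7, 3
x = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(d)]
phi = patches(x, q)

lower, value, upper = frob(x), frob(phi), math.sqrt(q) * frob(x)
print(lower <= value <= upper)
```

Pixels near the boundary appear in fewer than `q` patches because of the zero-padding, which is why the upper bound is generally not attained.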
To learn the deep neural network, we consider the randomly initialized gradient descent algorithm to find the global minimizer of the empirical loss (1). Specifically, we use the following random initialization scheme. For every level , each entry is sampled from a standard Gaussian distribution, . For multilayer fully-connected neural networks and ResNet, each entry of the output layer is sampled from a Rademacher distribution, . For convolutional ResNet, for , we sample and set , i.e., we make each channel have the same corresponding output weight. In this paper, we fix the output layer and train lower layers by gradient descent, for and
where is the step size. Hoffer et al. (2018) show that fixing the last layer results in little to no loss in test accuracy.
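The training procedure just described (Gaussian hidden weights, Rademacher output weights that stay fixed, full-batch gradient descent on the least-squares loss) can be sketched on a toy problem. To keep the example short it uses a single hidden layer, a 1/sqrt(m) scaling, and hand-picked data, width, and step size; all of these are simplifying assumptions introduced here, not the setting of the theorems.

```python
import math
import random

def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Derivative of softplus."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(W, a, x):
    """f(x) = (1 / sqrt(m)) * sum_r a_r * softplus(<w_r, x>)."""
    m = len(a)
    hidden = (softplus(sum(w * v for w, v in zip(W[r], x))) for r in range(m))
    return sum(a_r * h for a_r, h in zip(a, hidden)) / math.sqrt(m)

def loss(W, a, X, y):
    """Least-squares loss with a 1/2 factor."""
    return 0.5 * sum((predict(W, a, x) - t) ** 2 for x, t in zip(X, y))

def gd_step(W, a, X, y, eta):
    """One full-batch gradient step on W; the output weights a stay fixed."""
    m, d = len(W), len(W[0])
    grad = [[0.0] * d for _ in range(m)]
    for x, t in zip(X, y):
        residual = predict(W, a, x) - t
        for r in range(m):
            pre = sum(w * v for w, v in zip(W[r], x))
            coeff = residual * a[r] * sigmoid(pre) / math.sqrt(m)
            for j in range(d):
                grad[r][j] += coeff * x[j]
    return [[W[r][j] - eta * grad[r][j] for j in range(d)] for r in range(m)]

rng = random.Random(0)
m, d = 64, 3
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]   # unit-norm toy inputs
y = [1.0, -1.0, 0.5]                                       # arbitrary toy labels
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]   # Gaussian hidden weights
a = [rng.choice([-1.0, 1.0]) for _ in range(m)]                    # Rademacher output weights

losses = [loss(W, a, X, y)]
for _ in range(200):
    W = gd_step(W, a, X, y, eta=0.1)
    losses.append(loss(W, a, X, y))

print(losses[0], losses[-1])
```

Printing the first and last entries of `losses` shows how far the training loss has dropped. How fast it drops depends on the least eigenvalue of the Gram matrix for these particular inputs.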
In this section, we show gradient descent with a constant positive step size converges to the global minimum with a linear rate. To formally state our assumption, we first define the following population Gram matrices in a recursive way. For , let
Intuitively, represents the Gram matrix after composing the kernel induced by the activation function times. As will be apparent in the proof, this Gram matrix naturally comes up as goes to infinity. Based on these definitions, we make the following assumption.
The first part of the assumption states that the Gram matrix at the last layer is strictly positive definite and this least eigenvalue determines the convergence rate of the gradient descent algorithm. Note this assumption is a generalization of the non-degeneracy assumption used in Du et al. (2018b) in which they considered a two-layer neural network and assumed the Gram matrix from the second layer is strictly positive definite. The second part of the assumption states every two by two sub-matrix of every layer has a lower bounded eigenvalue. The inverse of this quantity can be viewed as a measure of the stability of the population Gram matrix. As will be apparent in the proof, this condition guarantees that if is large, at the initialization phase, our Gram matrix is close to the population Gram matrix.
Now we are ready to state our main theorem for deep multilayer fully-connected neural networks.
Assume for all , , for some constant and the number of hidden nodes per layer .
If we set the step size , then with probability at least over the random initialization we have for
This theorem states that if the width is large enough and we set step size appropriately then gradient descent converges to the global minimum with zero loss at linear rate. The width depends on , , and . The dependency on and the least eigenvalue is only polynomial, which is the same as previous work on shallow neural networks (Du et al., 2018b; Li and Liang, 2018; Allen-Zhu et al., 2018a). However, the dependency on the number of layers and stability parameter of the Gram matrices is exponential. As will be clear in Section 6 and proofs, this exponential dependency results from the amplification factor of multilayer fully-connected neural network architecture.
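A linear convergence rate translates directly into an iteration count. The sketch below assumes a bound of the form L(k) <= (1 - eta * lam / 2)^k * L(0), where `eta` is the step size and `lam` stands for the least eigenvalue that governs the rate. This exact form, the function name, and the numbers in the example are assumptions used for illustration.

```python
import math

def iterations_to_reach(eps, initial_loss, eta, lam):
    """Smallest k with (1 - eta * lam / 2)^k * initial_loss <= eps.

    Assumes a linear-rate bound L(k) <= (1 - eta * lam / 2)^k * L(0).
    """
    rate = 1.0 - eta * lam / 2.0
    if not 0.0 < rate < 1.0:
        raise ValueError("need 0 < eta * lam / 2 < 1 for a contraction")
    if eps >= initial_loss:
        return 0
    return math.ceil(math.log(eps / initial_loss) / math.log(rate))

# Hypothetical numbers: eta * lam / 2 = 0.05, target accuracy 1e-3.
k = iterations_to_reach(eps=1e-3, initial_loss=1.0, eta=0.1, lam=1.0)
print(k)
```

The count grows only logarithmically in 1/eps, which is what distinguishes a linear (geometric) rate from a sublinear one.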
In this section we consider the convergence of gradient descent for training a ResNet. We focus on how much over-parameterization is needed to ensure the global convergence of gradient descent. Similar to the previous section, we define the population Gram matrices, for
These are the asymptotic Gram matrices as goes to infinity. We make the following assumption which determines the convergence rate and the amount of over-parameterization.
Note defined here is different from that of the deep fully-connected neural network because here only depends on . In general, unless there are two data points that are parallel, is always positive here. Now we are ready to state our main theorem for ResNet.
Assume for all , , for some constant and the number of hidden nodes per layer
If we set the step size , then with probability at least over the random initialization we have for
In sharp contrast to Theorem 3.1, this theorem is fully polynomial in the sense that both the number of neurons and the convergence rate are polynomial in and . Note the amount of over-parameterization depends on which is the smallest eigenvalue of the -th layer’s Gram matrix. In Section D.1, we show , and does not depend inverse exponentially on . The main reason that we do not have any exponential factor here is that the skip connection block makes the overall architecture more stable in both the initialization phase and the training phase. See Section B for details.
In this section we present the convergence result of gradient descent for convolutional ResNet. Again, we focus on how much over-parameterization is needed to ensure the global convergence of gradient descent. Similar to previous sections, we first define the population Gram matrix in order to formally state our assumptions. These are the asymptotic Gram matrices as goes to infinity.
where . We make the following assumption which determines the convergence rate and the amount of over-parameterization.
Note this assumption is basically the same as Assumption 4.1. Now we state our main convergence theorem for the convolutional ResNet.
Assume for all , , for some constant and the number of hidden nodes per layer If we set the step size , then with probability at least over the random initialization we have for
This theorem is similar to that of ResNet. The number of neurons required per layer is only polynomial in the depth and the number of data points and step size is only polynomially small. The analysis is similar to ResNet and we refer readers to Section C for details.
and we denote . Note with this notation, we can write the loss as
Our induction hypothesis is just the following convergence rate of empirical loss.
At the -th iteration, we have
Note this condition implies the conclusions we want to prove. To prove Condition 6.1, we consider one iteration on the loss function.
This equation shows if , the loss decreases. Note both terms involve , which we now analyze more carefully. To simplify notations, we define
We look at one coordinate of .
Using Taylor expansion, we have
Denote and and so . We will show the term, which is proportional to , drives the loss function to decrease, while the term is a perturbation term that is proportional to and is therefore small. We further unpack the term,
where with -th entry being . Now we unpack this matrix. Recall
Define with . Note by definition is a Gram matrix and thus it is positive semi-definite. Therefore we have
Now we analyze . We can write in a more compact form with .
Now observe that
Now recall the progress of loss function in Equation (7):
For the perturbation terms, through standard calculations, we can show both and are proportional to so if we set sufficiently small, this term is smaller than and thus the loss function decreases with a linear rate. See Lemmas A.6 and A.7.
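The mechanism in this argument can be illustrated with an idealized model in which the Gram matrix is frozen and the perturbation terms are dropped, so that the residual evolves as r(k+1) = (I - eta * G) r(k). This frozen-matrix simplification, the 2-by-2 matrix, and the step size below are assumptions chosen for illustration. In this model the loss shrinks by at least a factor (1 - eta * lam_min)^2 per step whenever eta is small enough.

```python
def matvec(G, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]

def half_sq_norm(v):
    """Least-squares loss of a residual vector, with a 1/2 factor."""
    return 0.5 * sum(x * x for x in v)

G = [[2.0, 1.0], [1.0, 2.0]]   # fixed PSD matrix with eigenvalues 1 and 3
lam_min = 1.0                  # least eigenvalue of G
eta = 0.2                      # step size; eta * lam_max = 0.6 < 1
residual = [1.0, -0.5]         # predictions minus labels at iteration 0

losses = [half_sq_norm(residual)]
for _ in range(20):
    step = matvec(G, residual)
    residual = [r - eta * s for r, s in zip(residual, step)]
    losses.append(half_sq_norm(residual))

contraction = (1.0 - eta * lam_min) ** 2   # per-step bound on the loss ratio
print(losses[-1] <= contraction ** 20 * losses[0])
```

The actual proof has to control how far the time-varying Gram matrix drifts from this frozen picture, which is where the bound on the least eigenvalue during training enters.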
Therefore, to prove the induction hypothesis, it suffices to prove for , where is independent of . To analyze the least eigenvalue, we first look at the initialization. Using assumptions of the population kernel matrix and concentration inequalities, we can show at the beginning , which implies
Now for the -th iteration, by matrix perturbation analysis, we know it is sufficient to show . To do this, we use a similar approach as in Du et al. (2018b). We show as long as is large enough, every weight matrix is close to its initialization in a relative error sense. Ignoring all other parameters except , , and thus the average per-neuron distance from initialization is which tends to zero as increases. See Lemma A.5 for precise statements with all the dependencies.
This fact in turn shows is small (see Lemma A.4). The main difference from Du et al. (2018b) is that we are considering deep neural networks, and when translating the small deviation, to , there is an amplification factor which depends on the neural network architecture.
For deep fully connected neural networks, we show this amplification factor is exponential in . On the other hand, for ResNet and convolutional ResNet we show this amplification factor is only polynomial in . We further show the width required is proportional to this amplification factor.
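The contrast between the two amplification behaviours can be caricatured with a toy calculation. It is not the paper's lemma: it assumes that a feedforward layer can multiply a perturbation by a constant factor c > 1, while a residual layer with a branch damped by 1/H multiplies it by (1 + c / H). The function names and the value of `c` are hypothetical.

```python
import math

def feedforward_amplification(c, H):
    """Toy model: each of H layers multiplies a perturbation by c."""
    return c ** H

def resnet_amplification(c, H):
    """Toy model: each of H residual layers multiplies a perturbation by 1 + c / H."""
    return (1.0 + c / H) ** H

c = 2.0
for H in (5, 10, 50):
    ff = feedforward_amplification(c, H)
    res = resnet_amplification(c, H)
    print(f"H={H:3d}  feedforward={ff:.3e}  resnet={res:.3f}  exp(c)={math.exp(c):.3f}")
```

In this toy model the feedforward factor grows exponentially with depth, whereas the residual factor never exceeds exp(c) no matter how deep the network is.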
In this paper, we show that gradient descent on deep overparametrized networks can obtain zero training loss. The key technique is to show that the Gram matrix is increasingly stable under overparametrization, and so every step of gradient descent decreases the loss at a geometric rate.
We list some directions for future research:
The current paper focuses on the training loss, but does not address the test loss. It would be an important problem to show that gradient descent can also find solutions of low test loss. In particular, existing work only demonstrates that gradient descent works under the same situations as kernel methods and random feature methods (Daniely, 2017; Li and Liang, 2018).
The width of the layers is polynomial in all the parameters for the ResNet architecture, but still very large. In realistic networks, the number of parameters, not the width, is a large constant multiple of . We consider improving the analysis to cover commonly utilized networks an important open problem.
The current analysis is for gradient descent, instead of stochastic gradient descent. We believe the analysis can be extended to stochastic gradient, while maintaining the linear convergence rate.
We thank Lijie Chen and Ruosong Wang for useful discussions.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
We first derive the formula of the gradient for the multilayer fully connected neural network:
are the derivative matrices induced by the activation function and
is the output of the -th layer.
Through standard calculation, we can get the expression of of the following form
We first present a lemma showing that, with high probability, the features of each layer are approximately normalized.
If is Lipschitz and , where , then with probability at least over random initialization, for every and , we have
We follow the proof sketch described in Section 6. We first analyze the spectral property of at the initialization phase. The following lemma lower bounds its least eigenvalue. This lemma is a direct consequence of Theorem D.1 and Remark D.4.
If , we have
Now we proceed to analyze the training process. We prove the following lemma which characterizes how the perturbation from weight matrices propagates to the input of each layer. This Lemma is used to prove the subsequent lemmas.
Suppose for every , , and for some constant and . If is Lipschitz, we have
Here the assumption of can be shown using Lemma E.4 and taking union bound over . Next, we show with high probability over random initialization, perturbation in weight matrices leads to small perturbation in the Gram matrix.
Suppose is Lipschitz and smooth. Suppose for , , , if where and