Neural networks trained by first order methods have had a remarkable impact on many applications, but their theoretical properties remain poorly understood. One empirical observation is that even though the optimization objective is non-convex and non-smooth, randomly initialized first order methods like stochastic gradient descent can still find a global minimum. Surprisingly, this property does not depend on the labels. In Zhang et al. (2016), the authors replaced the true labels with randomly generated labels, but still found that randomly initialized first order methods can always achieve zero training loss.
A widely believed explanation for why a neural network can fit all training labels is that the neural network is over-parameterized. For example, Wide ResNet (Zagoruyko and Komodakis) uses 100x more parameters than the number of training data points, and thus there must exist a neural network of this architecture that can fit all training data. However, existence alone does not explain why the network found by a randomly initialized first order method can fit all the data. The objective function is neither smooth nor convex, which makes traditional analysis techniques from convex optimization inapplicable in this setting. To our knowledge, only convergence to a stationary point is known (Davis et al., 2018).
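In this paper we demystify this surprising phenomenon on two-layer neural networks with rectified linear unit (ReLU) activation. Formally, we consider a neural network of the following form.
$$f(W, a, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \sigma(w_r^\top x) \qquad (1)$$
where $x \in \mathbb{R}^d$ is the input, $w_r \in \mathbb{R}^d$ is the weight vector of the first layer, $a_r \in \mathbb{R}$ is the output weight and $\sigma(\cdot)$ is the ReLU activation function: $\sigma(z) = z$ if $z \ge 0$ and $\sigma(z) = 0$ if $z < 0$.
We focus on the empirical risk minimization problem with a quadratic loss. Given a training data set $\{(x_i, y_i)\}_{i=1}^{n}$, we want to minimize
$$L(W, a) = \sum_{i=1}^{n} \frac{1}{2} \left( f(W, a, x_i) - y_i \right)^2. \qquad (2)$$
To do this, we fix the second layer and apply gradient descent (GD) on the first-layer weight matrix
$$W(k+1) = W(k) - \eta \frac{\partial L(W(k), a)}{\partial W(k)} \qquad (3)$$
where $\eta > 0$ is the step size. Here the gradient formula for each weight vector is
$$\frac{\partial L(W, a)}{\partial w_r} = \frac{1}{\sqrt{m}} \sum_{i=1}^{n} \left( f(W, a, x_i) - y_i \right) a_r x_i \mathbb{I}\{ w_r^\top x_i \ge 0 \}. \qquad (4)$$
Footnote 1: Note that ReLU is not continuously differentiable. One can view $\frac{\partial L(W, a)}{\partial w_r}$ as a convenient notation for the right hand side of (4), and this is indeed the update rule used in practice.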
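The setup above can be sketched in a few lines of code. This is a minimal illustration rather than the authors' implementation: it assumes the $1/\sqrt{m}$ output scaling, unit-norm inputs, Gaussian first-layer weights and random-sign output weights, and the helper names (`predict`, `gd_step`, `run_demo`) and problem sizes are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(W, a, x):
    """f(W, a, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    total = sum(a_r * max(0.0, dot(w_r, x)) for w_r, a_r in zip(W, a))
    return total / math.sqrt(len(W))


def loss(W, a, X, y):
    """Quadratic loss: sum_i 0.5 * (f(W, a, x_i) - y_i)^2."""
    return sum(0.5 * (predict(W, a, x) - y_i) ** 2 for x, y_i in zip(X, y))


def gd_step(W, a, X, y, eta):
    """One gradient descent step on the first layer; the output weights a stay fixed."""
    scale = 1.0 / math.sqrt(len(W))
    residuals = [predict(W, a, x) - y_i for x, y_i in zip(X, y)]
    new_W = []
    for w_r, a_r in zip(W, a):
        grad = [0.0] * len(w_r)
        for x, res in zip(X, residuals):
            # The indicator 1{w_r . x >= 0} stands in for the ReLU derivative.
            if dot(w_r, x) >= 0.0:
                for k, x_k in enumerate(x):
                    grad[k] += scale * res * a_r * x_k
        new_W.append([w_k - eta * g_k for w_k, g_k in zip(w_r, grad)])
    return new_W


def run_demo(n=5, d=3, m=100, eta=0.1, steps=100, seed=0):
    """Train on arbitrary labels and return the loss after every step."""
    rng = random.Random(seed)
    X = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(v, v))
        X.append([c / norm for c in v])  # unit-norm inputs (assumption)
    y = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # labels unrelated to inputs
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice([-1.0, 1.0]) for _ in range(m)]
    losses = [loss(W, a, X, y)]
    for _ in range(steps):
        W = gd_step(W, a, X, y, eta)
        losses.append(loss(W, a, X, y))
    return losses


if __name__ == "__main__":
    history = run_demo()
    print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.4f}")
```

Only the first layer is updated, which mirrors the choice of fixing the second layer. The labels in `run_demo` are drawn independently of the inputs, so the run probes whether the network fits arbitrary labels.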
Though this is only a shallow fully connected neural network, the objective function is still non-smooth and non-convex due to the use of the ReLU activation function. Even for this simple function, why a randomly initialized first order method can achieve zero training error is not known. In fact, many previous works have tried to answer this question or similar ones. Attempts include landscape analysis (Soudry and Carmon, 2016; Mei et al.), analysis of the dynamics of the algorithm (Li and Yuan, 2017), and optimal transport theory (Chizat and Bach, 2018), to name a few. These results often rely on strong assumptions on the labels and input distributions, or do not explain why a randomly initialized first order method can achieve zero training loss. See Section 2 for detailed comparisons between our result and previous ones.
In this paper, we rigorously prove that as long as the data set is not degenerate and the number of hidden nodes $m$ is large enough, with properly randomly initialized $a$ and $W(0)$, GD achieves zero training loss at a linear convergence rate, i.e., it finds a solution $W(K)$ with $L(W(K)) \le \epsilon$ in $K = O(\log(1/\epsilon))$ iterations. (Footnote 2: Here we omit the polynomial dependency on $n$ and other data-dependent quantities.) Thus, our theoretical result not only shows the global convergence but also gives a quantitative convergence rate in terms of the desired accuracy.
Analysis Technique Overview
Our proof relies on the following insights. First, we directly analyze the dynamics of each individual prediction $f(W, a, x_i)$ for $i = 1, \ldots, n$. This is different from many previous works (Du et al., 2017b; Li and Yuan, 2017), which tried to analyze the dynamics of the parameter ($W$) we are optimizing. Note that because the objective function is non-smooth and non-convex, analysis of the parameter space dynamics is very difficult. In contrast, we find that the dynamics in prediction space is governed by the spectral property of a Gram matrix (which can vary in each iteration, cf. (5)), and as long as this Gram matrix's least eigenvalue is lower bounded, gradient descent enjoys a linear rate. Furthermore, previous work has shown that in the initialization phase this Gram matrix does have a lower bounded least eigenvalue as long as the data is not degenerate (Xie et al., 2017). Thus the problem reduces to showing that the Gram matrix at later iterations is close to that in the initialization phase. Our second observation is that this Gram matrix is only related to the activation patterns ($\mathbb{I}\{w_r^\top x_i \ge 0\}$), and we can use matrix perturbation analysis to show that if most of the patterns do not change, then this Gram matrix is close to its initialization. Our third observation is that over-parameterization, random initialization and the linear convergence jointly ensure that every weight vector $w_r$ stays close to its initialization. Then we can use this property to show that most of the patterns do not change. Combining these insights we prove the first global quantitative convergence result of gradient descent on ReLU activated neural networks for the empirical risk minimization problem. Notably, our proof only uses linear algebra and standard probability bounds, so we believe it can be easily generalized to analyze deep neural networks.
We use bold-faced letters for vectors and matrices. We let $[n] = \{1, 2, \ldots, n\}$. Given a set $S$, we use $\mathrm{unif}\{S\}$ to denote the uniform distribution over $S$. Given an event $E$, we use $\mathbb{I}\{E\}$ to be the indicator on whether this event happens. We use $N(0, I)$ to denote the standard Gaussian distribution. For a matrix $A$, we use $A_{ij}$ to denote its $(i, j)$-th entry. We use $\|\cdot\|_2$ to denote the Euclidean norm of a vector, and use $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. If a matrix $A$ is positive semi-definite, we use $\lambda_{\min}(A)$ to denote its smallest eigenvalue. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product between two vectors. Lastly, let $O(\cdot)$ and $\Omega(\cdot)$ denote standard Big-O and Big-Omega notations, only hiding absolute constants.
2 Comparison with Previous Results
In this section we survey an incomplete list of previous attempts at analyzing why first order methods can find a global minimum.
A popular way to analyze non-convex optimization problems is to identify whether the optimization landscape has some benign geometric properties. Recently, researchers found that if the objective function is smooth and satisfies (1) all local minima are global and (2) every saddle point has a direction of negative curvature, then noise-injected (stochastic) gradient descent (Jin et al., 2017; Ge et al., 2015; Du et al., 2017a) can find a global minimum in polynomial time. This algorithmic finding encouraged researchers to study whether deep neural networks also admit these properties.
For the objective function defined in (2), some partial results were obtained. Soudry and Carmon (2016) showed that if $md \ge n$, then at every differentiable local minimum, the training error is zero. However, since the objective is non-smooth, it is hard to show that gradient descent actually converges to a differentiable local minimum. Xie et al. (2017) studied the same problem and related the loss to the gradient norm through the least singular value of the “extended feature matrix” $D$ at the stationary points. However, they did not prove the convergence rate of the gradient norm. Interestingly, our analysis relies on the Gram matrix, which is actually $DD^\top$.
Landscape analyses of ReLU activated neural networks for other settings have also been studied in many previous works (Ge et al., 2017; Safran and Shamir, 2018, 2016; Zhou and Liang, 2017; Freeman and Bruna, 2016; Hardt and Ma, 2016). These works establish favorable landscape properties, such as all local minimizers being global, but do not ensure that gradient descent converges to a global minimizer of the empirical risk. For other activation functions, some previous works show that the landscape does have the desired geometric properties (Du and Lee, 2018; Soltanolkotabi et al., 2018; Nguyen and Hein, 2017; Kawaguchi, 2016; Haeffele and Vidal, 2015; Andoni et al., 2014; Venturi et al., 2018). However, it is unclear how to extend their analyses to our setting.
Analysis of Algorithm Dynamics
Another way to prove convergence results is to directly analyze the dynamics of first order methods. Our paper also belongs to this category. Many previous works assumed (1) the input distribution is Gaussian and (2) the label is generated according to a planted neural network. Based on these two (unrealistic) conditions, it can be shown that randomly initialized (stochastic) gradient descent can learn a ReLU (Tian, 2017; Soltanolkotabi, 2017), a single convolutional filter (Brutzkus and Globerson, 2017), a convolutional neural network with one filter and one output layer (Du et al., 2018b), and a residual network with small spectral norm weight matrix (Li and Yuan, 2017). (Footnote 3: Since these works assume the label is realizable, converging to the global minimum is equivalent to recovering the underlying model.) Beyond the Gaussian input distribution, Du et al. (2017b) showed that for learning a convolutional filter, the Gaussian input distribution assumption can be relaxed, but they still require that the label is generated from an underlying true filter. Compared with these works, our paper does not try to recover the underlying true neural network. Instead, we focus on providing theoretical justification for why randomly initialized gradient descent can achieve zero training loss, which is what we can observe and verify in practice.
The most related paper is by Li and Liang (2018), who observed that when training a two-layer fully connected neural network, most of the patterns ($\mathbb{I}\{w_r^\top x_i \ge 0\}$) do not change over iterations, which we also use to show the stability of the Gram matrix. They used this observation to obtain the convergence rate of GD on a two-layer over-parameterized neural network for the cross-entropy loss. They need the number of hidden nodes $m$ to scale with $\mathrm{poly}(1/\epsilon)$, where $\epsilon$ is the desired accuracy. Thus unless the number of hidden nodes $m \to \infty$, their result does not imply that GD can achieve zero training loss. We improve on this by allowing the amount of over-parameterization to be independent of the desired accuracy, and show that GD can achieve zero training loss. Furthermore, our proof is much simpler and more transparent, so we believe it can be easily generalized to analyze other neural network architectures.
Other Analysis Approaches
Chizat and Bach (2018) used optimal transport theory to analyze continuous time gradient descent on over-parameterized models. However, their results on ReLU activated neural networks are only at the formal level. Mei et al. showed that the dynamics of SGD can be captured by a partial differential equation in a suitable scaling limit. They listed some specific examples of input distributions, including mixtures of Gaussians. However, it is still unclear whether this framework can explain why first order methods can minimize the empirical risk. Daniely (2017) built a connection between neural networks and kernel methods, and showed that stochastic gradient descent can learn a function that is competitive with the best function in the conjugate kernel space of the network. Again, this work does not explain why a first order method can achieve zero training loss.
3 Continuous Time Analysis
In this section, we present our result for gradient flow, i.e., gradient descent with infinitesimal step size. In the next section, we will modify the proof and give a quantitative bound for gradient descent with positive step size. Formally, we consider the ordinary differential equation defined by
$$\frac{d w_r(t)}{dt} = -\frac{\partial L(W(t), a)}{\partial w_r(t)}$$
for $r \in [m]$. (Footnote 4: Strictly speaking, this should be a differential inclusion (Davis et al., 2018).) We denote by $u_i(t) = f(W(t), a, x_i)$ the prediction on input $x_i$ at time $t$, and we let $u(t) = (u_1(t), \ldots, u_n(t)) \in \mathbb{R}^n$ be the prediction vector at time $t$. Our main result in this section is the following theorem.
Theorem 3.1 (Convergence Rate of Gradient Flow).
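Assume for all $i \in [n]$, $\|x_i\|_2 = 1$ and $|y_i| \le C$ for some constant $C$, and the matrix $H^\infty \in \mathbb{R}^{n \times n}$ with $H^\infty_{ij} = \mathbb{E}_{w \sim N(0, I)}\left[ x_i^\top x_j \mathbb{I}\{ w^\top x_i \ge 0, w^\top x_j \ge 0 \} \right]$ satisfies $\lambda_{\min}(H^\infty) \triangleq \lambda_0 > 0$. Then if we initialize $w_r \sim N(0, I)$, $a_r \sim \mathrm{unif}\{-1, 1\}$ for $r \in [m]$ and set the number of hidden nodes $m$ sufficiently large (as a function of $n$ and $\lambda_0$), with high probability over random initialization we have
$$\|u(t) - y\|_2^2 \le \exp(-\lambda_0 t) \|u(0) - y\|_2^2.$$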
We first discuss our assumptions. We assume $\|x_i\|_2 = 1$ and $|y_i| \le C$ for simplicity. We can easily modify the proof by properly scaling the initialization. The key assumption is that the least eigenvalue of the matrix $H^\infty$ is strictly positive. Interestingly, various properties of this matrix have been thoroughly studied in previous work (Xie et al., 2017; Tsuchida et al., 2017). In general, unless the data is degenerate, the smallest eigenvalue of $H^\infty$ is strictly positive. We refer readers to Xie et al. (2017) and Tsuchida et al. (2017) for a detailed characterization of this matrix.
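For intuition about this matrix, the expectation has a standard closed form. The sketch below assumes unit-norm inputs and that the matrix in question is the expected Gram matrix $H^\infty_{ij} = \mathbb{E}_{w \sim N(0, I)}[x_i^\top x_j \mathbb{I}\{w^\top x_i \ge 0, w^\top x_j \ge 0\}]$. The angle notation $\theta_{ij}$ is introduced here only for illustration.

```latex
% For a standard Gaussian w, the event {w^T x_i >= 0, w^T x_j >= 0} depends only on
% the angle \theta_{ij} = \arccos(x_i^\top x_j) between the two unit vectors:
\Pr_{w \sim N(0, I)}\left[ w^\top x_i \ge 0,\; w^\top x_j \ge 0 \right]
  = \frac{\pi - \theta_{ij}}{2\pi},
\qquad\text{so}\qquad
H^\infty_{ij} = x_i^\top x_j \cdot \frac{\pi - \arccos(x_i^\top x_j)}{2\pi},
\qquad
H^\infty_{ii} = \frac{1}{2}.
```

In particular the diagonal is always $1/2$, and two inputs contribute a large off-diagonal entry only when they are nearly parallel, which is consistent with the remark that the smallest eigenvalue is positive unless the data is degenerate.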
The number of hidden nodes $m$ required depends on the number of samples $n$ and on $\lambda_0$. As will be apparent in the proof, over-parameterization, i.e., the fact that $m$ is sufficiently large, plays a crucial role in guaranteeing that gradient descent finds the global minimum. We believe that with a more refined analysis, this dependency can be further improved. Lastly, note the convergence rate is linear because $\|u(t) - y\|_2^2$ decreases to $0$ exponentially fast. The specific rate also depends on $\lambda_0$ but is independent of the number of hidden nodes $m$.
3.1 Proof of Theorem 3.1
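Our first step is to calculate the dynamics of each prediction $u_i(t) = f(W(t), a, x_i)$:
$$\frac{d}{dt} u_i(t) = \sum_{r=1}^{m} \left\langle \frac{\partial f(W(t), a, x_i)}{\partial w_r(t)}, \frac{d w_r(t)}{dt} \right\rangle = \sum_{j=1}^{n} (y_j - u_j(t)) \sum_{r=1}^{m} \left\langle \frac{\partial f(W(t), a, x_i)}{\partial w_r(t)}, \frac{\partial f(W(t), a, x_j)}{\partial w_r(t)} \right\rangle \triangleq \sum_{j=1}^{n} (y_j - u_j(t)) H_{ij}(t)$$
where $H(t)$ is an $n \times n$ matrix with $(i, j)$-th entry
$$H_{ij}(t) = \frac{1}{m} x_i^\top x_j \sum_{r=1}^{m} \mathbb{I}\{ x_i^\top w_r(t) \ge 0, x_j^\top w_r(t) \ge 0 \}. \qquad (5)$$
With this matrix, we can write the dynamics of predictions in a compact way:
$$\frac{d}{dt} u(t) = H(t) (y - u(t)).$$
Note $H(t)$ is a time-dependent symmetric matrix. We first analyze its property when $t = 0$. The following lemma shows that if $m$ is large then $H(0)$ has a lower bounded least eigenvalue with high probability. The proof is by a standard concentration bound, so we defer it to the appendix.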
If , we have with probability at least , .
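The concentration behind this lemma can be checked numerically. The sketch below is an illustration under assumptions, not part of the paper: it takes the Gram matrix at initialization to be $H_{ij}(0) = \frac{1}{m} x_i^\top x_j \sum_r \mathbb{I}\{w_r^\top x_i \ge 0, w_r^\top x_j \ge 0\}$ with Gaussian $w_r$, uses the arc-cosine closed form for its expectation, and all function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def gram_matrix(W, X):
    """H_ij = (1/m) * x_i.x_j * #{r : w_r.x_i >= 0 and w_r.x_j >= 0}."""
    m, n = len(W), len(X)
    active = [[dot(w_r, x) >= 0.0 for x in X] for w_r in W]
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for r in range(m) if active[r][i] and active[r][j])
            H[i][j] = dot(X[i], X[j]) * shared / m
    return H


def limit_gram_matrix(X):
    """Closed form of E_w[x_i.x_j * 1{w.x_i >= 0, w.x_j >= 0}] for unit-norm inputs."""
    n = len(X)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c = max(-1.0, min(1.0, dot(X[i], X[j])))  # clamp rounding error for acos
            H[i][j] = c * (math.pi - math.acos(c)) / (2.0 * math.pi)
    return H


def max_entry_gap(m, X, rng):
    """Largest entry-wise gap between a sampled H(0) of width m and its expectation."""
    d = len(X[0])
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    H0, H_inf = gram_matrix(W, X), limit_gram_matrix(X)
    n = len(X)
    return max(abs(H0[i][j] - H_inf[i][j]) for i in range(n) for j in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    X = [unit_vector(3, rng) for _ in range(4)]
    for m in (50, 500, 5000):
        print(m, max_entry_gap(m, X, rng))
```

Each entry of $H(0)$ is an average of $m$ bounded independent terms, so the gap should shrink roughly like $1/\sqrt{m}$ as the width grows.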
Our second step is to show that $H(t)$ is stable in terms of $W(t)$. Formally, the following lemma shows that if $w_r(t)$ is close to the initialization $w_r(0)$, then $H(t)$ is close to $H(0)$ and has a lower bounded least eigenvalue.
Suppose for a given time , for all , for some small positive constant . Then we have with high probability over initialization, .
This lemma plays a crucial role in our analysis, so we give the proof below.
Proof of Lemma 3.2 We define the event
Note this event happens if and only if . Recall . By anti-concentration inequality of Gaussian, we have Therefore, we can bound the entry-wise deviation on in expectation:
Summing over , we have Thus by Markov inequality, with high probability, we have where is a large absolute constant. Next, we use matrix perturbation theory to bound the deviation from the initialization
Lastly, we lower bound the smallest eigenvalue by plugging in
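The Gaussian anti-concentration step used in this proof can be verified directly. The sketch below assumes unit-norm inputs, so that $w_r(0)^\top x_i$ is a standard normal variable, and writes the allowed movement radius as `R`; the exact form of the bound, $2R/\sqrt{2\pi}$, is the textbook one (interval length times peak density) and is an assumption about which constant the proof uses.

```python
import math


def prob_within(R):
    """Exact P(|z| <= R) for z ~ N(0, 1), via the error function."""
    return math.erf(R / math.sqrt(2.0))


def anti_concentration_bound(R):
    """Upper bound 2R / sqrt(2*pi): interval length times the peak Gaussian density."""
    return 2.0 * R / math.sqrt(2.0 * math.pi)


if __name__ == "__main__":
    for R in (0.001, 0.01, 0.1, 1.0):
        print(R, prob_within(R), anti_concentration_bound(R))
```

The bound is nearly tight for small `R`, which is the regime that matters here: an activation pattern can flip only if the pre-activation at initialization lies within the distance the weight vector is allowed to move.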
The next lemma shows two facts that hold if the least eigenvalue of $H(t)$ is lower bounded. First, the loss converges to $0$ at a linear convergence rate. Second, $w_r(t)$ is close to the initialization for every $r \in [m]$. This lemma clearly demonstrates the power of over-parameterization.
Suppose for , . Then we have and for any ,
Proof of Lemma 3.3 Recall we can write the dynamics of predictions as We can calculate the loss function dynamics
Thus we have is a decreasing function with respect to . Using this fact we can bound the loss
Therefore, exponentially fast. Now we bound the gradient. Recall for ,
Integrating the gradient, we can bound the distance from the initialization
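The two steps of this argument can be written out as follows. This is a sketch under stated assumptions rather than a transcription of the original displays: it assumes $\lambda_{\min}(H(s)) \ge \lambda_0 / 2$ for all $0 \le s \le t$, unit-norm inputs, $|a_r| = 1$, and the prediction dynamics $\frac{d}{dt} u(t) = H(t)(y - u(t))$.

```latex
\begin{align*}
\frac{d}{dt} \|y - u(t)\|_2^2
  &= -2\,(y - u(t))^\top H(t)\,(y - u(t))
   \le -2\,\lambda_{\min}(H(t))\,\|y - u(t)\|_2^2
   \le -\lambda_0 \|y - u(t)\|_2^2, \\
\text{hence}\quad \|y - u(t)\|_2^2
  &\le \exp(-\lambda_0 t)\,\|y - u(0)\|_2^2. \\[4pt]
\left\| \frac{d}{ds} w_r(s) \right\|_2
  &= \left\| \frac{1}{\sqrt{m}} \sum_{i=1}^{n} (y_i - u_i(s))\, a_r x_i\,
      \mathbb{I}\{ w_r(s)^\top x_i \ge 0 \} \right\|_2
   \le \frac{1}{\sqrt{m}} \sum_{i=1}^{n} |y_i - u_i(s)| \\
  &\le \frac{\sqrt{n}}{\sqrt{m}} \|y - u(s)\|_2
   \le \frac{\sqrt{n}}{\sqrt{m}} \exp(-\lambda_0 s / 2)\,\|y - u(0)\|_2, \\[4pt]
\|w_r(t) - w_r(0)\|_2
  &\le \int_0^t \left\| \frac{d}{ds} w_r(s) \right\|_2 ds
   \le \frac{2 \sqrt{n}\, \|y - u(0)\|_2}{\sqrt{m}\, \lambda_0}.
\end{align*}
```

The last line shows where over-parameterization enters: the total distance travelled by each weight vector scales like $1/\sqrt{m}$, so a wider network keeps every $w_r$ closer to its initialization.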
If , we have for all , , for all , and .
Thus it is sufficient to show which is equivalent to We bound
Thus by Markov’s inequality, we have with high probability . Plugging in this bound we prove the theorem. ∎
4 Discrete Time Analysis
In this section, we show randomly initialized gradient descent with a constant positive step size converges to the global minimum at a linear rate. We first present our main theorem.
Theorem 4.1 (Convergence Rate of Gradient Descent).
Assume for all , , for some constant and the matrix with satisfies . If we initialize , for and the number of hidden nodes and we set the step size then with high probability over the random initialization we have for
Theorem 4.1 shows that even though the objective function is non-smooth and non-convex, gradient descent with a constant positive step size still enjoys a linear convergence rate. Our assumptions on the least eigenvalue and the number of hidden nodes are exactly the same as in the theorem for gradient flow. Notably, our choice of step size is independent of the number of hidden nodes, in contrast to the previous work (Li and Liang, 2018).
4.1 Proof of Theorem 4.1
We prove Theorem 4.1 by induction. Our induction hypothesis is just the following convergence rate of empirical loss.
At the -th iteration, we have
A direct corollary of this condition is the following bound on the deviation from the initialization. The proof is similar to that of Lemma 3.3, so we defer it to the appendix.
If Condition 4.1 holds for , then we have for every
Our strategy is similar to the proof of Theorem 3.1. We define the event
where for some small positive constant . Different from gradient flow, for gradient descent we need a more refined analysis. We let and . The following Lemma bounds the sum of sizes of . The proof is similar to the analysis used in Lemma 3.2. See Section A for the whole proof.
With high probability over initialization, we have for some positive constant .
Next, we calculate the difference of predictions between two consecutive iterations, analogous to the $\frac{d u_i(t)}{dt}$ term in Section 3.
Here we divide the right hand side into two parts. accounts for terms that the pattern does not change and accounts for terms that pattern may change.
We view the second part as a perturbation and bound its magnitude. Because ReLU is a $1$-Lipschitz function and $|a_r| = 1$, we have
To analyze , by Corollary 4.1, we know and for all . Furthermore, because , we know . Thus we can find a more convenient expression of for analysis
where is just the -th entry of a discrete version of Gram matrix defined in Section 3 and is a perturbation matrix. Let be the matrix with -th entry being . Using Lemma 4.1, with high probability we obtain an upper bound of the operator norm
Similar to the classical analysis of gradient descent, we also need to bound the quadratic term.
With these estimates at hand, we are ready to prove the induction hypothesis.
The third equality we used the decomposition of . The first inequality we used the Lemma 3.2, the bound on the step size, the bound on , the bound on and the bound on . The last inequality we used the bound of the step size and the bound of . Therefore Condition 4.1 holds for . Now by induction, we prove Theorem 4.1. ∎
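The skeleton of such an induction step can be sketched as follows. This is an idealized version under assumptions, not the original chain of inequalities: it ignores the perturbation from units whose pattern may change, takes $u(k+1) - u(k) \approx \eta H(k)(y - u(k))$ with $\lambda_{\min}(H(k)) \ge \lambda_0 / 2$, and assumes unit-norm inputs and $|a_r| = 1$. The step-size condition $\eta \le \lambda_0 / (2 n^2)$ is the one this simplified sketch needs and may differ from the constant in Theorem 4.1.

```latex
\begin{align*}
\|y - u(k+1)\|_2^2
  &= \|y - u(k)\|_2^2
     - 2\,(y - u(k))^\top (u(k+1) - u(k))
     + \|u(k+1) - u(k)\|_2^2
     && \text{(exact expansion)} \\
-2\,(y - u(k))^\top (u(k+1) - u(k))
  &\approx -2\eta\,(y - u(k))^\top H(k)\,(y - u(k))
   \le -\eta \lambda_0 \|y - u(k)\|_2^2, \\
|u_i(k+1) - u_i(k)|
  &\le \frac{1}{\sqrt{m}} \sum_{r=1}^{m} \eta \left\| \frac{\partial L(W(k), a)}{\partial w_r(k)} \right\|_2
   \le \eta \sqrt{n}\, \|y - u(k)\|_2
     && \text{(ReLU is 1-Lipschitz)} \\
\|u(k+1) - u(k)\|_2^2
  &\le \eta^2 n^2 \|y - u(k)\|_2^2, \\
\text{so}\quad \|y - u(k+1)\|_2^2
  &\lesssim \left( 1 - \eta \lambda_0 + \eta^2 n^2 \right) \|y - u(k)\|_2^2
   \le \left( 1 - \frac{\eta \lambda_0}{2} \right) \|y - u(k)\|_2^2
   \quad \text{if } \eta \le \frac{\lambda_0}{2 n^2}.
\end{align*}
```

The expansion makes clear why the step size can be chosen independently of the width: both the linear and the quadratic term are controlled by $n$ and $\lambda_0$ alone.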
5 Conclusion and Discussion
In this paper we show that with over-parameterization, gradient descent provably converges to the global minimum of the empirical loss at a linear convergence rate. The key proof idea is to show that over-parameterization makes the Gram matrix remain positive definite for all iterations, which in turn guarantees the linear convergence. Here we list some future directions.
First, we believe the number of hidden nodes required can be reduced. For example, previous work (Soudry and Carmon, 2016) showed that $m \ge n/d$ is enough to make all differentiable local minima global. In our setting, using advanced tools from probability and matrix perturbation theory to analyze $H(t)$, we may be able to tighten the bound.
Another direction is to prove the global convergence of gradient descent on multi-layer neural networks and convolutional neural networks. We believe our approach is still applicable because for a fixed neural network architecture, when the number of hidden nodes is large and the initialization scheme is random Gaussian with proper scaling, the Gram matrix is also positive definite, which ensures the linear convergence of the empirical loss (at least at the beginning). We believe combining our proof idea with the discovery of balancedness between layers in Du et al. (2018a) is a promising approach.
Lastly, in our paper we used the empirical loss as a potential function to measure the progress. If we use another potential function, we may be able to prove the convergence rate of accelerated methods. This technique has been exploited in Wilson et al. (2016) for analyzing convex optimization. It would be interesting to bring their idea to the analysis of other first order methods.
We thank Wei Hu, Jason D. Lee and Ruosong Wang for useful discussions.
- Andoni et al. (2014) Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In International Conference on Machine Learning, pages 1908–1916, 2014.
- Brutzkus and Globerson (2017) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with gaussian inputs. In International Conference on Machine Learning, pages 605–614, 2017.
- Chizat and Bach (2018) Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
- Daniely (2017) Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
- Davis et al. (2018) Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D Lee. Stochastic subgradient method converges on tame functions. arXiv preprint arXiv:1804.07795, 2018.
- Du and Lee (2018) Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. Proceedings of the 35th International Conference on Machine Learning, pages 1329–1338, 2018.
- Du et al. (2017a) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017a.
- Du et al. (2017b) Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017b.
- Du et al. (2018a) Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018a.
- Du et al. (2018b) Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. Proceedings of the 35th International Conference on Machine Learning, pages 1339–1348, 2018b.
- Freeman and Bruna (2016) C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
- Ge et al. (2015) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
- Ge et al. (2017) Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
- Haeffele and Vidal (2015) Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
- Hardt and Ma (2016) Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.
- Kawaguchi (2016) Kenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information Processing Systems, pages 586–594, 2016.
- Li and Liang (2018) Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
- Li and Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
- Mei et al. Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. Proceedings of the National Academy of Sciences.
- Nguyen and Hein (2017) Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International Conference on Machine Learning, pages 2603–2612, 2017.
- Safran and Shamir (2016) Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
- Safran and Shamir (2018) Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pages 4433–4441, 2018.
- Soltanolkotabi (2017) Mahdi Soltanolkotabi. Learning ReLus via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.
- Soltanolkotabi et al. (2018) Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 2018.
- Soudry and Carmon (2016) Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pages 3404–3413, 2017.
- Tsuchida et al. (2017) Russell Tsuchida, Farbod Roosta-Khorasani, and Marcus Gallagher. Invariance of weight distributions in rectified mlps. arXiv preprint arXiv:1711.09090, 2017.
- Venturi et al. (2018) Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
- Wilson et al. (2016) Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.
- Xie et al. (2017) Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In Artificial Intelligence and Statistics, pages 1216–1224, 2017.
- Zagoruyko and Komodakis Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. NIN, 8:35–67.
- Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Zhou and Liang (2017) Yi Zhou and Yingbin Liang. Critical points of neural networks: Analytical forms and landscape properties. arXiv preprint arXiv:1710.11205, 2017.
Appendix A Technical Proofs
Proof of Lemma 3.1.
Note that for every fixed pair $(i, j)$, $H_{ij}(0)$ is an average of independent random variables. Therefore, by Hoeffding's inequality, we have with probability $1 - \delta'$,
Setting and applying union bound over pairs, we have for every pair with probability at least
Thus we have
Thus if we have the desired result. ∎
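One way to carry out this concentration argument explicitly is sketched below. The constants come from this sketch and are not necessarily those of the original proof. It assumes unit-norm inputs, independent $w_r(0) \sim N(0, I)$, $H_{ij}(0) = \frac{1}{m} x_i^\top x_j \sum_r \mathbb{I}\{w_r(0)^\top x_i \ge 0, w_r(0)^\top x_j \ge 0\}$, and a target deviation of $\lambda_0 / 4$ (an assumption).

```latex
% Each summand x_i^T x_j * indicator takes values in {0, x_i^T x_j}, an interval of
% length at most 1, so Hoeffding's inequality gives for any s > 0:
\Pr\left[ \left| H_{ij}(0) - H^\infty_{ij} \right| \ge s \right] \le 2 \exp(-2 m s^2).
% Choosing s so that the right hand side equals \delta / n^2 and taking a union bound
% over the n^2 pairs, with probability at least 1 - \delta:
\max_{i, j} \left| H_{ij}(0) - H^\infty_{ij} \right|
  \le \sqrt{\frac{\log(2 n^2 / \delta)}{2 m}},
\qquad
\| H(0) - H^\infty \|_2 \le \| H(0) - H^\infty \|_F
  \le n \sqrt{\frac{\log(2 n^2 / \delta)}{2 m}}.
% By Weyl's inequality, \lambda_{\min}(H(0)) \ge \lambda_0 - \| H(0) - H^\infty \|_2, so
m \ge \frac{8 n^2 \log(2 n^2 / \delta)}{\lambda_0^2}
\quad\Longrightarrow\quad
\| H(0) - H^\infty \|_2 \le \frac{\lambda_0}{4}
\quad\text{and}\quad
\lambda_{\min}(H(0)) \ge \frac{3}{4} \lambda_0.
```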
Proof of Lemma 3.4.
For the other case, at time , we know there exists
The rest of the proof is the same as the previous case. ∎
Proof of Corollary 4.1.
We use the norm of gradient to bound this distance.
Proof of Lemma 4.1.
For a fixed and , by anti-concentration inequality, we know . Thus we can bound the size of in expectation.
Summing over , we have
Thus by Markov’s inequality, we have with high probability
for some large positive constant . ∎
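The counting argument in this proof can be sketched as follows. The notation is introduced here for illustration only: $N$ denotes the total number of pairs $(i, r)$ whose activation pattern may change, $R$ is the movement radius, and a pattern is assumed to be at risk only if $|w_r(0)^\top x_i| \le R$ with unit-norm $x_i$.

```latex
\begin{align*}
\mathbb{E}[N]
  &= \sum_{i=1}^{n} \sum_{r=1}^{m} \Pr\left[ |w_r(0)^\top x_i| \le R \right]
   \le \frac{2 m n R}{\sqrt{2\pi}}
   && \text{(Gaussian anti-concentration)} \\
\Pr\left[ N \ge \frac{\mathbb{E}[N]}{\delta} \right]
  &\le \delta
   && \text{(Markov's inequality)} \\
\text{so with probability at least } 1 - \delta:\quad
N &\le \frac{2 m n R}{\sqrt{2\pi}\, \delta}.
\end{align*}
```

Since $N$ grows only linearly in $m R$, a small radius $R$ keeps the fraction of at-risk units, $N / (mn)$, small regardless of the width.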