Gradient Descent Provably Optimizes Over-parameterized Neural Networks

10/04/2018, by Simon S. Du et al.

One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU-activated neural networks. For a shallow neural network with m hidden nodes, ReLU activation, and n training data points, we show that as long as m is large enough and the data is non-degenerate, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis is based on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong-convexity-like property to show that gradient descent converges at a linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first-order methods.


1 Introduction

Neural networks trained by first-order methods have achieved a remarkable impact on many applications, but their theoretical properties are still largely a mystery. One empirical observation is that even though the optimization objective function is non-convex and non-smooth, randomly initialized first-order methods like stochastic gradient descent can still find a global minimum. Surprisingly, this property does not depend on the labels: Zhang et al. (2016) replaced the true labels with randomly generated labels and still found that randomly initialized first-order methods can always achieve zero training loss.

A widely believed explanation for why a neural network can fit all training labels is that the neural network is over-parameterized. For example, Wide ResNet (Zagoruyko and Komodakis) uses 100x more parameters than the number of training data points, so there must exist a neural network of this architecture that can fit all the training data. However, existence alone does not explain why the network found by a randomly initialized first-order method can fit all the data. The objective function is neither smooth nor convex, which makes traditional analysis techniques from convex optimization inapplicable in this setting. To our knowledge, only convergence to a stationary point is known (Davis et al., 2018).

In this paper we demystify this surprising phenomenon for two-layer neural networks with rectified linear unit (ReLU) activation. Formally, we consider a neural network of the following form:

f(W, a, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \sigma\left(w_r^\top x\right), (1)

where x \in \mathbb{R}^d is the input, w_r \in \mathbb{R}^d is the weight vector of the first layer, a_r \in \mathbb{R} is the output weight, and \sigma(\cdot) is the ReLU activation function: \sigma(z) = z if z \ge 0 and \sigma(z) = 0 if z < 0.

We focus on the empirical risk minimization problem with a quadratic loss. Given a training data set \{(x_i, y_i)\}_{i=1}^{n}, we want to minimize

L(W, a) = \sum_{i=1}^{n} \frac{1}{2} \left( f(W, a, x_i) - y_i \right)^2. (2)

To do this, we fix the second-layer weights a and apply gradient descent (GD) to the first-layer weight matrix W:

W(k+1) = W(k) - \eta \frac{\partial L(W(k), a)}{\partial W(k)}, (3)

where \eta > 0 is the step size. Here the gradient formula for each weight vector w_r is (note that ReLU is not continuously differentiable; one can view \partial L(W, a)/\partial w_r as a convenient notation for the right-hand side of (4), and this is indeed the update rule used in practice)

\frac{\partial L(W, a)}{\partial w_r} = \frac{1}{\sqrt{m}} \sum_{i=1}^{n} \left( f(W, a, x_i) - y_i \right) a_r x_i \mathbb{I}\left\{ w_r^\top x_i \ge 0 \right\}. (4)
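To make the setup concrete, here is a minimal NumPy sketch of (1)-(4) (an illustration only, not the authors' code; the data, the width m, the step size, and the iteration count are arbitrary choices): it implements the network f(W, a, x), the quadratic loss L(W, a), the gradient (4), and runs the update (3) on the first-layer weights while keeping a fixed.

```python
import numpy as np

def predict(W, a, X):
    """Two-layer ReLU network, Eq. (1): f(W, a, x) = (1/sqrt(m)) sum_r a_r relu(w_r^T x)."""
    m = W.shape[0]
    return (np.maximum(X @ W.T, 0) @ a) / np.sqrt(m)

def loss(W, a, X, y):
    """Quadratic loss, Eq. (2)."""
    return 0.5 * np.sum((predict(W, a, X) - y) ** 2)

def grad_W(W, a, X, y):
    """Gradient w.r.t. the first-layer weights, Eq. (4)."""
    m = W.shape[0]
    err = predict(W, a, X) - y              # f(W, a, x_i) - y_i, shape (n,)
    act = (X @ W.T >= 0).astype(float)      # activation patterns I{w_r^T x_i >= 0}, shape (n, m)
    # row r of the result is (1/sqrt(m)) sum_i err_i * a_r * x_i * I{w_r^T x_i >= 0}
    return ((err[:, None] * act) * a[None, :]).T @ X / np.sqrt(m)

# Toy data: n unit-norm inputs in R^d with arbitrary labels (for illustration only).
rng = np.random.default_rng(0)
n, d, m = 20, 10, 2000
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

# Random initialization: w_r ~ N(0, I), a_r ~ unif{-1, +1}; a stays fixed during training.
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

eta = 0.1                                   # step size (heuristic choice for this toy example)
print(loss(W, a, X, y))
for k in range(2000):
    W = W - eta * grad_W(W, a, X, y)        # Eq. (3): GD on the first layer only
print(loss(W, a, X, y))                     # much smaller than the initial loss when m is large
```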

Though this is only a shallow fully connected neural network, the objective function is still non-smooth and non-convex due to the use of the ReLU activation function. Even for this simple architecture, it is not known why a randomly initialized first-order method can achieve zero training error. In fact, many previous works have tried to answer this question or similar ones. Attempts include landscape analysis (Soudry and Carmon, 2016), partial differential equations (Mei et al.), analysis of the dynamics of the algorithm (Li and Yuan, 2017), and optimal transport theory (Chizat and Bach, 2018), to name a few. These results often rely on strong assumptions on the labels and input distributions, or do not explain why a randomly initialized first-order method can achieve zero training loss. See Section 2 for detailed comparisons between our result and previous ones.

In this paper, we rigorously prove that as long as the data set is not degenerate and m is large enough, with properly randomly initialized first-layer weights W and output weights a, GD achieves zero training loss at a linear convergence rate, i.e., it finds a solution with loss at most \varepsilon in O(\log(1/\varepsilon)) iterations (here we omit the polynomial dependency on n and other data-dependent quantities). Thus, our theoretical result not only shows global convergence but also gives a quantitative convergence rate in terms of the desired accuracy.

Analysis Technique Overview

Our proof relies on the following insights. First, we directly analyze the dynamics of each individual prediction f(W, a, x_i) for i = 1, \ldots, n. This is different from many previous works (Du et al., 2017b; Li and Yuan, 2017) which analyzed the dynamics of the parameters (W) being optimized. Because the objective function is non-smooth and non-convex, analysis of the parameter-space dynamics is very difficult. In contrast, we find that the dynamics in prediction space are governed by the spectral properties of a Gram matrix (which can vary across iterations, c.f. (5)), and as long as this Gram matrix's least eigenvalue is lower bounded, gradient descent enjoys a linear rate. Furthermore, previous work has shown that in the initialization phase this Gram matrix does have a lower bounded least eigenvalue as long as the data is not degenerate (Xie et al., 2017). Thus the problem reduces to showing that the Gram matrix at later iterations is close to that in the initialization phase. Our second observation is that this Gram matrix only depends on the activation patterns (\mathbb{I}\{w_r^\top x_i \ge 0\}), and we can use matrix perturbation analysis to show that if most of the patterns do not change, then this Gram matrix is close to its initialization. Our third observation is that over-parameterization, random initialization, and the linear convergence jointly restrict every weight vector w_r to be close to its initialization. We can then use this property to show that most of the patterns do not change. Combining these insights, we prove the first global quantitative convergence result of gradient descent on ReLU-activated neural networks for the empirical risk minimization problem. Notably, our proof only uses linear algebra and standard probability bounds, so we believe it can be easily generalized to analyze deep neural networks.

Notations

We use bold-faced letters for vectors and matrices. We let [n] = \{1, 2, \ldots, n\}. Given a set S, we use \mathrm{unif}\{S\} to denote the uniform distribution over S. Given an event E, we use \mathbb{I}\{E\} to denote the indicator of whether this event happens. We use N(0, I) to denote the standard Gaussian distribution. For a matrix A, we use A_{ij} to denote its (i, j)-th entry. We use \|\cdot\|_2 to denote the Euclidean norm of a vector and \|\cdot\|_F to denote the Frobenius norm of a matrix. If a matrix A is positive semi-definite, we use \lambda_{\min}(A) to denote its smallest eigenvalue. We use \langle \cdot, \cdot \rangle to denote the standard Euclidean inner product between two vectors. Lastly, let O(\cdot) and \Omega(\cdot) denote the standard Big-O and Big-Omega notations, which only hide absolute constants.

2 Comparison with Previous Results

In this section we survey a non-exhaustive list of previous attempts to analyze why first-order methods can find a global minimum.

Landscape Analysis

A popular way to analyze non-convex optimization problems is to identify whether the optimization landscape has benign geometric properties. Recently, researchers found that if the objective function is smooth and satisfies (1) all local minima are global and (2) every saddle point has a direction of negative curvature, then noise-injected (stochastic) gradient descent (Jin et al., 2017; Ge et al., 2015; Du et al., 2017a) can find a global minimum in polynomial time. This algorithmic finding encouraged researchers to study whether deep neural networks also admit these properties.

For the objective function defined in (2), some partial results have been obtained. Soudry and Carmon (2016) showed that if the network is sufficiently over-parameterized, then at every differentiable local minimum the training error is zero. However, since the objective is non-smooth, it is hard to show that gradient descent actually converges to a differentiable local minimum. Xie et al. (2017) studied the same problem and related the loss to the gradient norm through the least singular value of the "extended feature matrix" at the stationary points. However, they did not prove a convergence rate for the gradient norm. Interestingly, our analysis relies on a Gram matrix which is exactly the Gram matrix of this extended feature matrix.

Landscape analyses of ReLU-activated neural networks in other settings have also been studied in many previous works (Ge et al., 2017; Safran and Shamir, 2018, 2016; Zhou and Liang, 2017; Freeman and Bruna, 2016; Hardt and Ma, 2016). These works establish favorable landscape properties, such as all local minimizers being global, but do not ensure that gradient descent converges to a global minimizer of the empirical risk. For other activation functions, some previous works show that the landscape does have the desired geometric properties (Du and Lee, 2018; Soltanolkotabi et al., 2018; Nguyen and Hein, 2017; Kawaguchi, 2016; Haeffele and Vidal, 2015; Andoni et al., 2014; Venturi et al., 2018). However, it is unclear how to extend their analyses to our setting.

Analysis of Algorithm Dynamics

Another way to prove convergence results is to directly analyze the dynamics of first-order methods; our paper also belongs to this category. Many previous works assumed (1) the input distribution is Gaussian and (2) the labels are generated by a planted neural network. Based on these two (unrealistic) conditions, it can be shown that randomly initialized (stochastic) gradient descent can learn a single ReLU unit (Tian, 2017; Soltanolkotabi, 2017), a single convolutional filter (Brutzkus and Globerson, 2017), a convolutional neural network with one filter and one output layer (Du et al., 2018b), and a residual network with small-spectral-norm weight matrices (Li and Yuan, 2017). (Since these works assume the labels are realizable, converging to a global minimum is equivalent to recovering the underlying model.) Beyond the Gaussian input distribution, Du et al. (2017b) showed that for learning a convolutional filter the Gaussian input assumption can be relaxed, but they still require the labels to be generated from an underlying true filter. Compared with these works, our paper does not try to recover an underlying true neural network. Instead, we focus on providing a theoretical justification for why randomly initialized gradient descent can achieve zero training loss, which is what we can observe and verify in practice.

The most related paper is by Li and Liang (2018), who observed that when training a two-layer fully connected neural network, most of the activation patterns (\mathbb{I}\{w_r^\top x_i \ge 0\}) do not change over iterations, which we also use to show the stability of the Gram matrix. They used this observation to obtain a convergence rate of GD on a two-layer over-parameterized neural network for the cross-entropy loss. However, they need the number of hidden nodes to scale with the inverse of the desired accuracy, so unless the number of hidden nodes goes to infinity, their result does not imply that GD can achieve zero training loss. We improve on this by allowing the amount of over-parameterization to be independent of the desired accuracy and show that GD can achieve zero training loss. Furthermore, our proof is much simpler and more transparent, so we believe it can be easily generalized to analyze other neural network architectures.

Other Analysis Approaches

Chizat and Bach (2018) used optimal transport theory to analyze continuous-time gradient descent on over-parameterized models. However, their result for ReLU-activated neural networks is only at a formal level. Mei et al. showed that the dynamics of SGD can be captured by a partial differential equation in a suitable scaling limit, and listed some specific examples of input distributions, including mixtures of Gaussians. However, it is still unclear whether this framework can explain why first-order methods can minimize the empirical risk. Daniely (2017) built a connection between neural networks and kernel methods and showed that stochastic gradient descent can learn a function that is competitive with the best function in the conjugate kernel space of the network. Again, this work does not explain why a first-order method can achieve zero training loss.

3 Continuous Time Analysis

In this section, we present our result for gradient flow, i.e., gradient descent with an infinitesimal step size. In the next section, we will modify the proof and give a quantitative bound for gradient descent with a positive step size. Formally, we consider the ordinary differential equation (strictly speaking, this should be a differential inclusion (Davis et al., 2018)) defined by

\frac{d w_r(t)}{dt} = -\frac{\partial L(W(t), a)}{\partial w_r(t)}

for r \in [m]. We denote by u_i(t) = f(W(t), a, x_i) the prediction on input x_i at time t and let u(t) = (u_1(t), \ldots, u_n(t)) \in \mathbb{R}^n be the prediction vector at time t. Our main result in this section is the following theorem.

Theorem 3.1 (Convergence Rate of Gradient Flow).

Assume for all i \in [n] that \|x_i\|_2 = 1 and |y_i| \le C for some constant C, and that the matrix H^\infty \in \mathbb{R}^{n \times n} with H^\infty_{ij} = \mathbb{E}_{w \sim N(0, I)}\left[ x_i^\top x_j \mathbb{I}\{ w^\top x_i \ge 0, w^\top x_j \ge 0 \} \right] satisfies \lambda_{\min}(H^\infty) \triangleq \lambda_0 > 0. If we initialize w_r \sim N(0, I) and a_r \sim \mathrm{unif}\{-1, +1\} for r \in [m], and set the number of hidden nodes m large enough (polynomial in n, 1/\lambda_0, and the inverse failure probability), then with high probability over the random initialization we have

\|u(t) - y\|_2^2 \le \exp(-\lambda_0 t) \, \|u(0) - y\|_2^2.

We first discuss our assumptions. We assume \|x_i\|_2 = 1 and |y_i| \le C for simplicity; we can easily modify the proof by properly scaling the initialization. The key assumption is that the least eigenvalue of the matrix H^\infty is strictly positive. Interestingly, various properties of this matrix have been thoroughly studied in previous works (Xie et al., 2017; Tsuchida et al., 2017). In general, unless the data is degenerate, the smallest eigenvalue of H^\infty is strictly positive. We refer readers to Xie et al. (2017); Tsuchida et al. (2017) for a detailed characterization of this matrix.
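For readers who want a quick numerical sanity check (an illustration, not part of the paper; the data below is arbitrary and the function name is ours), H^\infty can be estimated by Monte Carlo over w \sim N(0, I), and its least eigenvalue \lambda_0 is indeed bounded away from zero for generic unit-norm data:

```python
import numpy as np

def gram_infinity(X, num_w=50000, seed=0):
    """Monte Carlo estimate of H^inf_ij = E_{w~N(0,I)}[x_i^T x_j I{w^T x_i >= 0, w^T x_j >= 0}]."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_w, X.shape[1]))      # samples of w ~ N(0, I)
    act = (X @ W.T >= 0).astype(float)            # (n, num_w) indicators I{w^T x_i >= 0}
    return (X @ X.T) * (act @ act.T) / num_w

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm inputs, as assumed in Theorem 3.1
lambda_0 = np.linalg.eigvalsh(gram_infinity(X)).min()
print(lambda_0)                                   # strictly positive for non-degenerate data
```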

The number of hidden nodes required is polynomial in n and 1/\lambda_0. As will be apparent in the proof, over-parameterization, i.e., the fact that m is much larger than n, plays a crucial role in guaranteeing that gradient descent finds the global minimum. We believe this dependency can be further improved with a more refined analysis. Lastly, note that the convergence rate is linear because \|u(t) - y\|_2^2 decreases to 0 exponentially fast. The specific rate depends on \lambda_0 but is independent of the number of hidden nodes m.

3.1 Proof of Theorem 3.1

Our first step is to calculate the dynamics of each prediction:

\frac{d u_i(t)}{dt} = \sum_{j=1}^{n} (y_j - u_j(t)) \, H_{ij}(t),

where H(t) is an n \times n matrix with (i, j)-th entry

H_{ij}(t) = \frac{1}{m} \sum_{r=1}^{m} x_i^\top x_j \, \mathbb{I}\left\{ x_i^\top w_r(t) \ge 0, \ x_j^\top w_r(t) \ge 0 \right\}. (5)

With this matrix, we can write the dynamics of the predictions in a compact way:

\frac{d u(t)}{dt} = H(t) \, (y - u(t)).

Note that H(t) is a time-dependent symmetric matrix. We first analyze its properties at t = 0. The following lemma shows that if m is large, then H(0) has a lower bounded least eigenvalue with high probability. The proof is by a standard concentration bound, so we defer it to the appendix.
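As an informal numerical check of this identity (an illustration with arbitrary data and an arbitrary small Euler step; the helper names are ours), one can compare a finite-difference approximation of du(t)/dt after one tiny gradient step with H(t)(y - u(t)), where H(t) is computed via (5):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, dt = 10, 5, 4000, 1e-4
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
W = rng.normal(size=(m, d))                     # w_r ~ N(0, I)
a = rng.choice([-1.0, 1.0], size=m)

def forward(W):
    return (np.maximum(X @ W.T, 0) @ a) / np.sqrt(m)

def gram(W):
    # H_ij(t) of Eq. (5): (1/m) x_i^T x_j * #{r : w_r is active on both x_i and x_j}
    act = (X @ W.T >= 0).astype(float)
    return (X @ X.T) * (act @ act.T) / m

u = forward(W)
act = (X @ W.T >= 0).astype(float)
grad = (((u - y)[:, None] * act) * a[None, :]).T @ X / np.sqrt(m)   # dL/dw_r stacked row-wise
u_next = forward(W - dt * grad)                 # one tiny Euler step of the gradient flow

lhs = (u_next - u) / dt                         # finite-difference approximation of du/dt
rhs = gram(W) @ (y - u)                         # H(t)(y - u(t))
print(np.max(np.abs(lhs - rhs)))                # small (up to discretization and rare pattern flips)
```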

Lemma 3.1.

If m = \Omega\left( \frac{n^2}{\lambda_0^2} \log\left( \frac{n}{\delta} \right) \right), we have with probability at least 1 - \delta that \|H(0) - H^\infty\|_2 \le \frac{\lambda_0}{4} and \lambda_{\min}(H(0)) \ge \frac{3}{4} \lambda_0.

Our second step is to show that H(t) is stable in terms of W(t). Formally, the following lemma shows that if W(t) is close to the initialization W(0), then H(t) is close to H(0) and has a lower bounded least eigenvalue.

Lemma 3.2.

Suppose for a given time t, for all r \in [m] we have \|w_r(t) - w_r(0)\|_2 \le \frac{c \lambda_0}{n^2} \triangleq R for some small positive constant c. Then with high probability over the initialization, \|H(t) - H(0)\|_2 < \frac{\lambda_0}{4} and \lambda_{\min}(H(t)) > \frac{\lambda_0}{2}.

This lemma plays a crucial role in our analysis, so we give the proof below.

Proof of Lemma 3.2 We define the event

A_{ir} = \left\{ \exists w : \|w - w_r(0)\|_2 \le R, \ \mathbb{I}\{x_i^\top w \ge 0\} \ne \mathbb{I}\{x_i^\top w_r(0) \ge 0\} \right\}.

Note this event happens if and only if |w_r(0)^\top x_i| < R. Recall that w_r(0) \sim N(0, I), so w_r(0)^\top x_i \sim N(0, 1) because \|x_i\|_2 = 1. By the anti-concentration inequality of the Gaussian, we have \Pr[A_{ir}] \le \frac{2R}{\sqrt{2\pi}}. Therefore, we can bound the entry-wise deviation of H(t) in expectation:

\mathbb{E}\left[ |H_{ij}(t) - H_{ij}(0)| \right] \le \frac{1}{m} \sum_{r=1}^{m} \mathbb{E}\left[ \mathbb{I}\{A_{ir} \cup A_{jr}\} \right] \le \frac{4R}{\sqrt{2\pi}}.

Summing over (i, j), we have \mathbb{E}\left[ \sum_{i,j} |H_{ij}(t) - H_{ij}(0)| \right] \le \frac{4 n^2 R}{\sqrt{2\pi}}. Thus by Markov's inequality, with high probability we have \sum_{i,j} |H_{ij}(t) - H_{ij}(0)| \le C n^2 R, where C is a large absolute constant. Next, we use matrix perturbation theory to bound the deviation from the initialization:

\|H(t) - H(0)\|_2 \le \|H(t) - H(0)\|_F \le \sum_{i,j} |H_{ij}(t) - H_{ij}(0)| \le C n^2 R.

Lastly, we lower bound the smallest eigenvalue by plugging in R:

\lambda_{\min}(H(t)) \ge \lambda_{\min}(H(0)) - \|H(t) - H(0)\|_2 \ge \frac{3}{4}\lambda_0 - C c \lambda_0 > \frac{\lambda_0}{2}

for c chosen sufficiently small. ∎
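The anti-concentration step above is easy to check empirically (an illustration only; the radius R below is an arbitrary small number standing in for the R of Lemma 3.2): since w_r(0)^\top x_i \sim N(0, 1) for a unit-norm x_i, the fraction of neurons whose pattern on x_i can possibly flip within radius R is at most 2R/\sqrt{2\pi}:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, R = 10, 100000, 0.05                  # R is an arbitrary small radius for this check
x = rng.normal(size=d); x /= np.linalg.norm(x)
W0 = rng.normal(size=(m, d))                # w_r(0) ~ N(0, I)

# A pattern I{w_r^T x >= 0} can flip within radius R of w_r(0) only if |w_r(0)^T x| <= R.
frac_possibly_flipped = np.mean(np.abs(W0 @ x) <= R)
print(frac_possibly_flipped, 2 * R / np.sqrt(2 * np.pi))   # empirical fraction vs. the bound
```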

The next lemma establishes two facts when the least eigenvalue of H(t) is lower bounded: first, the loss converges to 0 at a linear convergence rate; second, every w_r(t) stays close to its initialization. This lemma clearly demonstrates the power of over-parameterization.

Lemma 3.3.

Suppose for 0 \le s \le t that \lambda_{\min}(H(s)) \ge \frac{\lambda_0}{2}. Then we have \|y - u(t)\|_2^2 \le \exp(-\lambda_0 t) \|y - u(0)\|_2^2 and, for any r \in [m],

\|w_r(t) - w_r(0)\|_2 \le \frac{2 \sqrt{n} \, \|y - u(0)\|_2}{\sqrt{m} \, \lambda_0}.

Proof of Lemma 3.3 Recall we can write the dynamics of the predictions as \frac{d u(t)}{dt} = H(t)(y - u(t)). We can calculate the loss function dynamics:

\frac{d}{dt} \|y - u(t)\|_2^2 = -2 \, (y - u(t))^\top H(t) \, (y - u(t)) \le -\lambda_0 \|y - u(t)\|_2^2.

Thus \exp(\lambda_0 t) \|y - u(t)\|_2^2 is a decreasing function with respect to t. Using this fact we can bound the loss:

\|y - u(t)\|_2^2 \le \exp(-\lambda_0 t) \|y - u(0)\|_2^2.

Therefore, u(t) \to y exponentially fast. Now we bound the gradient norm. Recall that for 0 \le s \le t,

\left\| \frac{d}{ds} w_r(s) \right\|_2 = \left\| \frac{1}{\sqrt{m}} \sum_{i=1}^{n} (y_i - u_i(s)) \, a_r x_i \, \mathbb{I}\{ w_r(s)^\top x_i \ge 0 \} \right\|_2 \le \frac{1}{\sqrt{m}} \sum_{i=1}^{n} |y_i - u_i(s)| \le \frac{\sqrt{n}}{\sqrt{m}} \|y - u(s)\|_2 \le \frac{\sqrt{n}}{\sqrt{m}} \exp(-\lambda_0 s / 2) \|y - u(0)\|_2.

Integrating the gradient, we can bound the distance from the initialization:

\|w_r(t) - w_r(0)\|_2 \le \int_0^t \left\| \frac{d}{ds} w_r(s) \right\|_2 ds \le \frac{2 \sqrt{n} \, \|y - u(0)\|_2}{\sqrt{m} \, \lambda_0}. \qquad ∎

The next lemma shows that if m is large enough, the conditions in Lemmas 3.2 and 3.3 hold for all t \ge 0. The proof is by contradiction and we defer it to the appendix.

Lemma 3.4.

If \frac{2 \sqrt{n} \, \|y - u(0)\|_2}{\sqrt{m} \, \lambda_0} < R, we have for all t \ge 0 that \lambda_{\min}(H(t)) \ge \frac{1}{2} \lambda_0, that \|w_r(t) - w_r(0)\|_2 \le \frac{2 \sqrt{n} \, \|y - u(0)\|_2}{\sqrt{m} \, \lambda_0} for all r \in [m], and that \|y - u(t)\|_2^2 \le \exp(-\lambda_0 t) \|y - u(0)\|_2^2.

Thus it is sufficient to show \frac{2 \sqrt{n} \, \|y - u(0)\|_2}{\sqrt{m} \, \lambda_0} < R, which is equivalent to m = \Omega\left( \frac{n \, \|y - u(0)\|_2^2}{\lambda_0^2 R^2} \right). We bound

\mathbb{E}\left[ \|y - u(0)\|_2^2 \right] = \sum_{i=1}^{n} \left( y_i^2 + \mathbb{E}\left[ f(W(0), a, x_i)^2 \right] \right) = O(n),

using \mathbb{E}[f(W(0), a, x_i)] = 0. Thus by Markov's inequality, we have with high probability \|y - u(0)\|_2^2 = O(n). Plugging in this bound, we prove the theorem. ∎

4 Discrete Time Analysis

In this section, we show that randomly initialized gradient descent with a constant positive step size converges to the global minimum at a linear rate. We first present our main theorem.

Theorem 4.1 (Convergence Rate of Gradient Descent).

Assume for all i \in [n] that \|x_i\|_2 = 1 and |y_i| \le C for some constant C, and that the matrix H^\infty \in \mathbb{R}^{n \times n} with H^\infty_{ij} = \mathbb{E}_{w \sim N(0, I)}\left[ x_i^\top x_j \mathbb{I}\{ w^\top x_i \ge 0, w^\top x_j \ge 0 \} \right] satisfies \lambda_{\min}(H^\infty) \triangleq \lambda_0 > 0. If we initialize w_r \sim N(0, I) and a_r \sim \mathrm{unif}\{-1, +1\} for r \in [m], take the number of hidden nodes m large enough (polynomial in n, 1/\lambda_0, and the inverse failure probability), and set the step size \eta = O\left( \frac{\lambda_0}{n^2} \right), then with high probability over the random initialization we have, for k = 0, 1, 2, \ldots,

\|u(k) - y\|_2^2 \le \left( 1 - \frac{\eta \lambda_0}{2} \right)^{k} \|u(0) - y\|_2^2.

Theorem 4.1 shows that even though the objective function is non-smooth and non-convex, gradient descent with a constant positive step size still enjoys a linear convergence rate. Our assumptions on the least eigenvalue and the number of hidden nodes are exactly the same as in the theorem for gradient flow. Notably, our choice of step size is independent of the number of hidden nodes m, in contrast to previous work (Li and Liang, 2018).
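As an informal empirical check of this rate (an illustration with arbitrary data and a heuristically chosen step size, not the constants of Theorem 4.1; \lambda_{\min}(H(0)) is used as a proxy for \lambda_0), one can run GD with a constant step size and compare the observed per-iteration loss ratio with the predicted contraction factor 1 - \eta\lambda_0/2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 5000
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

def forward(W):
    return (np.maximum(X @ W.T, 0) @ a) / np.sqrt(m)

# lambda_0 proxy: least eigenvalue of the finite-width Gram matrix at initialization, Eq. (5)
act0 = (X @ W.T >= 0).astype(float)
lam0 = np.linalg.eigvalsh((X @ X.T) * (act0 @ act0.T) / m).min()

eta = 0.5 / n                            # heuristic small step (the theorem's choice scales like lambda_0/n^2)
losses = []
for k in range(300):
    losses.append(np.sum((forward(W) - y) ** 2))
    err = forward(W) - y
    act = (X @ W.T >= 0).astype(float)
    W -= eta * (((err[:, None] * act) * a[None, :]).T @ X) / np.sqrt(m)

ratios = np.array(losses[1:]) / np.array(losses[:-1])
print(ratios.max(), 1 - eta * lam0 / 2)  # observed per-step contraction vs. predicted factor
```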

4.1 Proof of Theorem 4.1

We prove Theorem 4.1 by induction. Our induction hypothesis is just the following convergence rate of the empirical loss.

Condition 4.1.

At the k-th iteration, we have \|y - u(k)\|_2^2 \le \left( 1 - \frac{\eta \lambda_0}{2} \right)^{k} \|y - u(0)\|_2^2.

A direct corollary of this condition is the following bound on the deviation from the initialization. The proof is similar to that of Lemma 3.3, so we defer it to the appendix.

Corollary 4.1.

If Condition 4.1 holds for k' = 0, \ldots, k, then we have for every r \in [m]

\|w_r(k+1) - w_r(0)\|_2 \le \frac{4 \sqrt{n} \, \|y - u(0)\|_2}{\sqrt{m} \, \lambda_0} \triangleq R'. (6)

Now we show that Condition 4.1 holds for every k = 0, 1, \ldots. For the base case k = 0, Condition 4.1 holds by definition. Suppose Condition 4.1 holds for k' = 0, \ldots, k; we want to show that it also holds for k' = k + 1.

Our strategy is similar to the proof of Theorem 3.1. We define the event

A_{ir} = \left\{ \exists w : \|w - w_r(0)\|_2 \le R, \ \mathbb{I}\{x_i^\top w_r(0) \ge 0\} \ne \mathbb{I}\{x_i^\top w \ge 0\} \right\},

where R = \frac{c \lambda_0}{n^2} for some small positive constant c. Different from gradient flow, for gradient descent we need a more refined analysis. We let S_i = \{ r \in [m] : \mathbb{I}\{A_{ir}\} = 1 \} and S_i^\perp = [m] \setminus S_i. The following lemma bounds the sum of the sizes of the sets S_i. The proof is similar to the analysis used in Lemma 3.2; see Section A for the full proof.

Lemma 4.1.

With high probability over the initialization, we have \sum_{i=1}^{n} |S_i| \le C m n R for some positive constant C.

Next, we calculate the difference of predictions between two consecutive iterations, analogous to the \frac{d u_i(t)}{dt} term in Section 3:

u_i(k+1) - u_i(k) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \left( \sigma\left( w_r(k+1)^\top x_i \right) - \sigma\left( w_r(k)^\top x_i \right) \right).

Here we divide the right-hand side into two parts: I_1^i accounts for the terms whose activation pattern does not change, and I_2^i accounts for the terms whose pattern may change:

I_1^i \triangleq \frac{1}{\sqrt{m}} \sum_{r \in S_i^\perp} a_r \left( \sigma\left( w_r(k+1)^\top x_i \right) - \sigma\left( w_r(k)^\top x_i \right) \right), \qquad I_2^i \triangleq \frac{1}{\sqrt{m}} \sum_{r \in S_i} a_r \left( \sigma\left( w_r(k+1)^\top x_i \right) - \sigma\left( w_r(k)^\top x_i \right) \right).

We view I_2^i as a perturbation and bound its magnitude. Because ReLU is a 1-Lipschitz function and |a_r| = 1, we have

|I_2^i| \le \frac{1}{\sqrt{m}} \sum_{r \in S_i} \left| \left( w_r(k+1) - w_r(k) \right)^\top x_i \right| \le \frac{\eta |S_i|}{\sqrt{m}} \max_{r \in [m]} \left\| \frac{\partial L(W(k), a)}{\partial w_r(k)} \right\|_2 \le \frac{\eta |S_i| \sqrt{n}}{m} \|y - u(k)\|_2.

To analyze I_1^i, note that by Corollary 4.1 we know \|w_r(k+1) - w_r(0)\|_2 \le R' and \|w_r(k) - w_r(0)\|_2 \le R' for all r \in [m]. Furthermore, because R' < R, we know that for r \in S_i^\perp the pattern does not change: \mathbb{I}\{ w_r(k+1)^\top x_i \ge 0 \} = \mathbb{I}\{ w_r(k)^\top x_i \ge 0 \}. Thus we can find a more convenient expression of I_1^i for analysis:

I_1^i = -\frac{\eta}{m} \sum_{j=1}^{n} x_i^\top x_j \, (u_j(k) - y_j) \sum_{r \in S_i^\perp} \mathbb{I}\left\{ w_r(k)^\top x_i \ge 0, \ w_r(k)^\top x_j \ge 0 \right\} = -\eta \sum_{j=1}^{n} (u_j(k) - y_j) \left( H_{ij}(k) - H_{ij}^\perp(k) \right),

where H_{ij}(k) = \frac{1}{m} \sum_{r=1}^{m} x_i^\top x_j \, \mathbb{I}\{ w_r(k)^\top x_i \ge 0, w_r(k)^\top x_j \ge 0 \} is just the (i, j)-th entry of a discrete version of the Gram matrix defined in Section 3, and H^\perp(k) is a perturbation matrix with (i, j)-th entry H_{ij}^\perp(k) = \frac{1}{m} \sum_{r \in S_i} x_i^\top x_j \, \mathbb{I}\{ w_r(k)^\top x_i \ge 0, w_r(k)^\top x_j \ge 0 \}. Using Lemma 4.1, with high probability we obtain an upper bound on the operator norm:

\|H^\perp(k)\|_2 \le \sum_{i, j} |H_{ij}^\perp(k)| \le \frac{n \sum_{i=1}^{n} |S_i|}{m} \le C n^2 R.

Similar to the classical analysis of gradient descent, we also need to bound the quadratic term:

\|u(k+1) - u(k)\|_2^2 \le \sum_{i=1}^{n} \left( \frac{\eta}{\sqrt{m}} \sum_{r=1}^{m} \left\| \frac{\partial L(W(k), a)}{\partial w_r(k)} \right\|_2 \right)^2 \le \eta^2 n^2 \|y - u(k)\|_2^2.

With these estimates at hand, we are ready to prove the induction hypothesis.

We compute

\|y - u(k+1)\|_2^2 = \|y - u(k) - (u(k+1) - u(k))\|_2^2
= \|y - u(k)\|_2^2 - 2 (y - u(k))^\top (u(k+1) - u(k)) + \|u(k+1) - u(k)\|_2^2
= \|y - u(k)\|_2^2 - 2 \eta \, (y - u(k))^\top \left( H(k) - H^\perp(k) \right) (y - u(k)) - 2 (y - u(k))^\top I_2 + \|u(k+1) - u(k)\|_2^2
\le \left( 1 - \eta \lambda_0 + 2 \eta \|H^\perp(k)\|_2 + 2 C \eta n^{3/2} R + \eta^2 n^2 \right) \|y - u(k)\|_2^2
\le \left( 1 - \frac{\eta \lambda_0}{2} \right) \|y - u(k)\|_2^2.

In the third equality we used the decomposition of u(k+1) - u(k) into I_1 and I_2. In the first inequality we used Lemma 3.2 (which gives \lambda_{\min}(H(k)) \ge \frac{\lambda_0}{2}), the bound on the step size, the bound on \|H^\perp(k)\|_2, the bound on I_2, and the bound on \|u(k+1) - u(k)\|_2^2. In the last inequality we used the bound on the step size \eta and the bound on R. Therefore Condition 4.1 holds for k' = k + 1. Now by induction, we prove Theorem 4.1. ∎

5 Conclusion and Discussion

In this paper we show that with over-parameterization, gradient descent provably converges to the global minimum of the empirical loss at a linear convergence rate. The key proof idea is to show that over-parameterization makes the Gram matrix remain positive definite for all iterations, which in turn guarantees the linear convergence. Here we list some future directions.

First, we believe the number of hidden nodes required can be reduced. For example, previous work (Soudry and Carmon, 2016) showed that a much smaller amount of over-parameterization is enough to make all differentiable local minima global. In our setting, using advanced tools from probability and matrix perturbation theory to analyze H(t), we may be able to tighten the bound.

Another direction is to prove the global convergence of gradient descent on multi-layer neural networks and convolutional neural networks. We believe our approach is still applicable because, for a fixed neural network architecture, when the number of hidden nodes is large and the initialization scheme is random Gaussian with proper scaling, the Gram matrix is also positive definite, which ensures the linear convergence of the empirical loss (at least at the beginning). We believe combining our proof idea with the discovery of balancedness between layers in Du et al. (2018a) is a promising approach.

Lastly, in our paper we used the empirical loss as a potential function to measure the progress. If we use another potential function, we may be able to prove convergence rates of accelerated methods. This technique has been exploited in Wilson et al. (2016) for analyzing convex optimization. It would be interesting to bring their ideas to the analysis of other first-order methods.

Acknowledgments

We thank Wei Hu, Jason D. Lee and Ruosong Wang for useful discussions.

References

Appendix A Technical Proofs

Proof of Lemma 3.1.

Note that for every fixed pair (i, j), H_{ij}(0) is an average of m independent random variables, each bounded in absolute value by 1, with mean H^\infty_{ij}. Therefore, by Hoeffding's inequality, with probability at least 1 - \delta',

|H_{ij}(0) - H^\infty_{ij}| \le \sqrt{\frac{2 \log(2/\delta')}{m}}.

Setting \delta' = \delta / n^2 and applying a union bound over all pairs, we have for every pair (i, j), with probability at least 1 - \delta,

|H_{ij}(0) - H^\infty_{ij}| \le \sqrt{\frac{2 \log(2 n^2/\delta)}{m}}.

Thus we have

\|H(0) - H^\infty\|_2 \le \|H(0) - H^\infty\|_F \le \left( \sum_{i,j} |H_{ij}(0) - H^\infty_{ij}|^2 \right)^{1/2} \le n \sqrt{\frac{2 \log(2 n^2/\delta)}{m}}.

Thus if m = \Omega\left( \frac{n^2}{\lambda_0^2} \log\left( \frac{n}{\delta} \right) \right), we have \|H(0) - H^\infty\|_2 \le \frac{\lambda_0}{4}, and hence \lambda_{\min}(H(0)) \ge \lambda_0 - \frac{\lambda_0}{4} \ge \frac{3}{4}\lambda_0, the desired result. ∎

Proof of Lemma 3.4.

Suppose the conclusion does not hold at time t. If there exists r \in [m] with \|w_r(t) - w_r(0)\|_2 > \frac{2\sqrt{n}\,\|y - u(0)\|_2}{\sqrt{m}\,\lambda_0}, or \|y - u(t)\|_2^2 > \exp(-\lambda_0 t)\|y - u(0)\|_2^2, then by Lemma 3.3 we know there exists s \le t such that \lambda_{\min}(H(s)) < \frac{1}{2}\lambda_0. By Lemma 3.2, we know there then exists

t_0 = \inf\left\{ s \ge 0 : \max_{r \in [m]} \|w_r(s) - w_r(0)\|_2 \ge R \right\} \le t.

Thus at t_0, there exists r \in [m] with \|w_r(t_0) - w_r(0)\|_2 \ge R. Now by Lemma 3.2, we know \lambda_{\min}(H(s)) \ge \frac{1}{2}\lambda_0 for s \le t_0. However, by Lemma 3.3, we know \|w_r(t_0) - w_r(0)\|_2 \le \frac{2\sqrt{n}\,\|y - u(0)\|_2}{\sqrt{m}\,\lambda_0} < R. Contradiction.

For the other case, namely that at time t we have \lambda_{\min}(H(t)) < \frac{1}{2}\lambda_0, we know there exists

t_0 = \inf\left\{ s \ge 0 : \max_{r \in [m]} \|w_r(s) - w_r(0)\|_2 \ge R \right\} \le t.

The rest of the proof is the same as the previous case. ∎

Proof of Corollary 4.1.

We use the norm of the gradient to bound this distance:

\|w_r(k+1) - w_r(0)\|_2 \le \eta \sum_{k'=0}^{k} \left\| \frac{\partial L(W(k'), a)}{\partial w_r(k')} \right\|_2 \le \eta \sum_{k'=0}^{k} \frac{\sqrt{n}}{\sqrt{m}} \|y - u(k')\|_2 \le \eta \sum_{k'=0}^{k} \frac{\sqrt{n}}{\sqrt{m}} \left( 1 - \frac{\eta \lambda_0}{2} \right)^{k'/2} \|y - u(0)\|_2 \le \frac{\eta \sqrt{n}}{\sqrt{m}} \|y - u(0)\|_2 \cdot \frac{4}{\eta \lambda_0} = \frac{4 \sqrt{n} \, \|y - u(0)\|_2}{\sqrt{m} \, \lambda_0}. \qquad ∎

Proof of Lemma 4.1.

For a fixed i \in [n] and r \in [m], by the anti-concentration inequality of the Gaussian, we know \Pr[A_{ir}] \le \frac{2R}{\sqrt{2\pi}}. Thus we can bound the size of S_i in expectation:

\mathbb{E}\left[ |S_i| \right] = \sum_{r=1}^{m} \Pr[A_{ir}] \le \frac{2 m R}{\sqrt{2\pi}}. (7)

Summing over i \in [n], we have

\mathbb{E}\left[ \sum_{i=1}^{n} |S_i| \right] \le \frac{2 m n R}{\sqrt{2\pi}}.

Thus by Markov's inequality, we have with high probability

\sum_{i=1}^{n} |S_i| \le C m n R (8)

for some large positive constant C. ∎