Gradient-based methods are widely used in optimizing neural networks. One crucial component in gradient methods is the learning rate (a.k.a. step size) hyper-parameter, which determines the convergence speed of the optimization procedure. A large learning rate can speed up the convergence, but if it is larger than a threshold, the optimization algorithm cannot converge. This is by now well-understood for convex problems; excellent works on this topic include nash1991numerical, bertsekas1999nonlinear, nesterov2005smooth, haykin2005cognitive, bubeck2015convex, and the recent review of large-scale stochastic optimization by bottou2018optimization. However, there is still limited work on the convergence analysis for nonsmooth and nonconvex problems, which include over-parameterized neural networks.
Recently, a series of breakthrough papers showed that (stochastic) gradient descent can provably converge to the global minima for over-parameterized neural networks (du2018gradient; du2018deep; li2018learning; allen2018convergence; zou2018stochastic). However, these papers all require the step size to be sufficiently small to guarantee the global convergence. In practice, these optimization algorithms can use a much larger learning rate while still converging to the global minimum. This leads to the following question:
What is the optimal learning rate in optimizing neural networks?
While finding the optimal step size is important theoretically for identifying the optimal convergence rate, the optimal learning rate often depends on certain unknown parameters of the problem. For example, for a convex and $L$-smooth objective function, the optimal learning rate is $O(1/L)$, where $L$ is often unknown to practitioners. To solve this problem, adaptive methods (duchi2011adaptive; mcmahan2010adaptive) have been proposed that change the learning rate on the fly according to the gradient information received along the way. Though these methods often introduce additional hyper-parameters compared to gradient descent with a well-tuned stepsize, they are often robust to those hyper-parameters in the sense that they still converge, albeit at a (slightly) slower rate. For this reason, adaptive gradient methods are widely used by practitioners in neural network optimization.
On the other hand, the theoretical investigation of adaptive methods in optimizing neural networks is limited. Existing analyses only deal with general (non)-convex and smooth functions, and thus only concern convergence to first-order stationary points. However, a neural network is neither smooth nor convex. And yet, adaptive gradient methods are widely used in this setting as they converge without requiring a fine-tuned learning rate schedule. This leads to the following question:
What is the convergence rate of adaptive gradient methods in over-parameterized networks?
In this paper, we make progress on these two problems for the two-layer over-parameterized ReLU-activated neural networks setting.
First, we show the learning rate of gradient descent can be improved to where is a Gram matrix that only depends on the data. Note that this upper bound is independent of the number of parameters. This choice of stepsize directly leads to an improved convergence rate compared to du2018gradient.
We develop an adaptive gradient method, which can be viewed as a variant of the “norm” version of AdaGrad. We prove this adaptive gradient method converges to the global minimum in polynomial time and does so robustly, in the sense that for any choice of hyper-parameters used in this method, our method is guaranteed to converge to the global minimum in polynomial time. The choice of hyper-parameters only affects the rate but not the convergence. To our knowledge, this is the first polynomial time global convergence result for an adaptive gradient method in the non-convex setting.
Challenges and Our Techniques
To verify the improved learning rate of gradient descent, we use a more subtle analysis of the dynamics of predictions considered in du2018gradient. Our analysis shows that the dynamics are close to a linear one. This observation allows us to choose the improved learning rate.
For the adaptive method, there are two big challenges. First, because the learning rate (induced by the hyper-parameters and the dynamics) is changing at every iteration, we need to lower and upper bound the learning rate. The lower bound is required to guarantee the algorithm will converge in polynomial time and the upper bound is required to guarantee the algorithm will not diverge. The second challenge is that if at the beginning the learning rate is too large, the loss may increase at the beginning. The proof of du2018gradient for gradient descent with well-tuned stepsize highly depends on the fact that the loss is decreasing geometrically at each iteration, so that proof cannot be adapted to our setting.
In this paper, we use induction with a carefully constructed hypothesis which implies both the upper and the lower bounds of the learning rate. Furthermore, by a particular property of our proposed adaptive algorithm, the learning rate learns from the feedback of previous iterations and thus keeps the updated weight matrix close to its initialization (Lemma 4.2) while not vanishing to zero (Lemma 4.1). Using this property together with the effect of over-parameterization, we show that the loss may only increase by a bounded amount and then decreases to zero eventually. Resolving these issues, we are able to prove the first global convergence result for an adaptive gradient method in optimizing neural networks.
1.1 Related Work
Global Convergence of Neural Networks
Recently, a series of papers showed that gradient based methods can provably reduce the training error to for over-parameterized neural networks (du2018gradient; du2018deep; li2018learning; allen2018convergence; zou2018stochastic) . In this paper we study the same setting considered in du2018gradient which showed that for learning rate , gradient descent finds an -suboptimal global minimum in iterations for the two-layer over-parameterized ReLU-activated neural network. As a by-product of the analysis in this paper, we show that the learning rate can be improved to which results in faster convergence. We believe that the proof techniques developed in this paper can be extended to deep neural networks, following the recent works (du2018deep; allen2018convergence; zou2018stochastic).
Adaptive Gradient Methods
Adaptive Gradient (AdaGrad) methods, first introduced independently by duchi2011adaptive and mcmahan2010adaptive, are now widely used in practice for online learning due in part to their robustness to the choice of stepsize. The first convergence guarantees, proved in duchi2011adaptive, were for the setting of online convex optimization where the loss function may change from iteration to iteration. Later convergence results for variants of AdaGrad were proved in levy2017online and mukkamala2017variants for offline convex and strongly convex settings. In the general non-convex and smooth setting, ward2018adagrad and li2018convergence prove that the same “norm” version of AdaGrad converges to a stationary point at rate $O(\log(T)/\sqrt{T})$ for stochastic gradient descent and at rate $O(1/T)$ for batch gradient descent, where $T$ is the number of iterations.
Many modifications to AdaGrad have been proposed, namely, RMSprop (hinton2012neural), AdaDelta (zeiler2012adadelta), Adam (kingma2014adam), AdaFTRL (orabona2015scale), SGD-BB (tan2016barzilai), AdaBatch (defossez2017adabatch), signSGD (pmlr-v80-bernstein18a), SC-Adagrad (mukkamala2017variants; Shah2018MinimumNS), WNGrad (wu2018wngrad), AcceleGrad (OnlineLevy2018), Yogi (aheer2018adaptive1), Padam (chen2018closing), to name a few. More recently, accelerated adaptive gradient methods have also been proved to converge to stationary points (barakat2018convergence; chen2018on; ma2018adaptive; zhou2018convergence; zou2018sufficient).
Our work is inspired by the analysis of ward2018adagrad and wu2018wngrad which quantifies the auto-tuning property in the learning rate in AdaGrad. We propose a new adaptive algorithm for the stepsize in the setting of over-parameterized neural networks and show global polynomial convergence guarantee.
2 Problem Setup
In this paper we consider the same setup as du2018gradient. We are given $n$ data points, $\{(x_i, y_i)\}_{i=1}^{n}$. Following du2018gradient, to simplify the analysis, we make the following assumption on the training data.
For $i \in [n]$, $\|x_i\| = 1$ and $|y_i| = O(1)$.
The assumption on the input is only for the ease of presentation and analysis. See discussions in du2018gradient. The second assumption on labels is satisfied in most real world datasets.
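We predict labels using a two-layer neural network of the following form
$$f(W, a, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \sigma(w_r^\top x),$$
where $x \in \mathbb{R}^d$ is the input, for $r \in [m]$, $w_r \in \mathbb{R}^d$ is the weight vector of the first layer and $a_r \in \mathbb{R}$ is the output weight, and $\sigma(z) = \max\{z, 0\}$ is the ReLU activation function. For $r \in [m]$, we initialize the first-layer vector $w_r(0) \sim N(0, I)$ and the output weight $a_r \sim \mathrm{unif}\{-1, +1\}$. We fix the second layer and train the first layer with the quadratic loss
$$L(W) = \frac{1}{2} \sum_{i=1}^{n} \left(f(W, a, x_i) - y_i\right)^2.$$
We will use iterative gradient-based algorithms to train $W$. The gradient of each weight vector has the following form:
$$\frac{\partial L(W)}{\partial w_r} = \frac{1}{\sqrt{m}} \sum_{i=1}^{n} \left(f(W, a, x_i) - y_i\right) a_r x_i \mathbb{1}\{w_r^\top x_i \ge 0\}.$$
We use $W(k)$ to denote the parameters at the $k$-th iteration.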
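The setup above can be sketched in a few lines of NumPy. This is only an illustrative sketch, assuming the standard form used in du2018gradient: predictions $\frac{1}{\sqrt{m}} \sum_r a_r \max(w_r^\top x, 0)$, loss $\frac{1}{2} \sum_i (u_i - y_i)^2$, Gaussian first-layer initialization and random-sign output weights. The function names, the uniform labels and the problem sizes are hypothetical choices made for the example.

```python
import numpy as np

def predict(W, a, X):
    """Predictions u_i = (1/sqrt(m)) * sum_r a_r * relu(<w_r, x_i>).

    W: (m, d) first-layer weights, a: (m,) fixed output weights, X: (n, d) inputs.
    """
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def loss(W, a, X, y):
    """Quadratic loss 0.5 * sum_i (u_i - y_i)^2 (assumed normalization)."""
    return 0.5 * np.sum((predict(W, a, X) - y) ** 2)

def grad_W(W, a, X, y):
    """Row r is dL/dw_r = (1/sqrt(m)) * sum_i (u_i - y_i) * a_r * x_i * 1{<w_r, x_i> >= 0}."""
    m = W.shape[0]
    residual = predict(W, a, X) - y            # shape (n,)
    active = (X @ W.T >= 0).astype(float)      # shape (n, m), ReLU activation pattern
    return ((active * residual[:, None]).T @ X) * a[:, None] / np.sqrt(m)

def init_problem(n=10, d=5, m=200, seed=0):
    """Hypothetical toy data: unit-norm inputs, bounded labels, random initialization."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i|| = 1
    y = rng.uniform(-1.0, 1.0, size=n)              # |y_i| = O(1)
    W = rng.standard_normal((m, d))                 # w_r(0) ~ N(0, I)
    a = rng.choice([-1.0, 1.0], size=m)             # a_r ~ unif{-1, +1}
    return X, y, W, a

X, y, W, a = init_problem()
print("initial loss:", loss(W, a, X, y))
```

Only `W` is meant to be trained here; `a` is kept fixed, matching the statement that the second layer is not updated.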
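In this paper, we will study the dynamics of the predictions $u(k) = (u_1(k), \ldots, u_n(k))^\top$, where $u_i(k) = f(W(k), a, x_i)$. Here we use $k$ for indexing because $u(k)$ is induced by $W(k)$. According to du2018gradient, the matrix below determines the convergence rate of the randomly initialized gradient descent.
Definition 2.1. The matrix $H^\infty \in \mathbb{R}^{n \times n}$ is defined as follows. For $(i, j) \in [n] \times [n]$,
$$H^\infty_{ij} = \mathbb{E}_{w \sim N(0, I)}\left[x_i^\top x_j \mathbb{1}\{w^\top x_i \ge 0, w^\top x_j \ge 0\}\right].$$
This matrix represents the kernel matrix induced by Gaussian initialization and ReLU activation function. We make the following assumption on $H^\infty$.
The matrix $H^\infty$ in Definition 2.1 satisfies $\lambda_{\min}(H^\infty) \triangleq \lambda_0 > 0$.
du2018gradient showed that this condition holds as long as the training data is not degenerate. We also define the following empirical version of this Gram matrix, which will be used in our analysis. For $(i, j) \in [n] \times [n]$:
$$H_{ij}(k) = \frac{1}{m} \sum_{r=1}^{m} x_i^\top x_j \mathbb{1}\{w_r(k)^\top x_i \ge 0, w_r(k)^\top x_j \ge 0\}.$$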
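To make the two Gram matrices concrete, the following sketch computes both on toy data. It assumes the kernel is the Gaussian expectation of $x_i^\top x_j \mathbb{1}\{w^\top x_i \ge 0, w^\top x_j \ge 0\}$ as in du2018gradient, and it uses the standard closed form of that expectation for unit-norm inputs, $x_i^\top x_j (\pi - \arccos(x_i^\top x_j)) / (2\pi)$. The function names and problem sizes are hypothetical.

```python
import numpy as np

def gram_infinity(X):
    """Population Gram matrix for unit-norm rows of X.

    Uses P(w^T x_i >= 0, w^T x_j >= 0) = (pi - theta_ij) / (2 pi) for w ~ N(0, I),
    where theta_ij is the angle between x_i and x_j.
    """
    G = np.clip(X @ X.T, -1.0, 1.0)     # clip guards arccos against rounding error
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def gram_empirical(X, W):
    """Empirical Gram matrix: average over the m neurons of the joint activation indicator."""
    active = (X @ W.T >= 0).astype(float)           # shape (n, m)
    return (X @ X.T) * (active @ active.T) / W.shape[0]

rng = np.random.default_rng(0)
n, d, m = 8, 10, 20000                               # hypothetical sizes
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm inputs
W0 = rng.standard_normal((m, d))                     # Gaussian initialization

H_inf = gram_infinity(X)
H_0 = gram_empirical(X, W0)
eigs = np.linalg.eigvalsh(H_inf)                     # ascending order
lambda_0, norm_H = eigs[0], eigs[-1]
print("max entrywise gap between empirical and population Gram:", np.abs(H_0 - H_inf).max())
print("lambda_0:", lambda_0, "spectral norm:", norm_H)
```

As the width `m` grows, the empirical matrix at initialization concentrates around the population matrix, which is the mechanism the analysis relies on.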
3 Warm up: Improved Learning Rate for Gradient Descent
Before presenting our adaptive method, we first revisit the gradient descent algorithm. At each iteration $k$, we update the weight matrix according to
$$W(k+1) = W(k) - \eta \frac{\partial L(W(k))}{\partial W(k)},$$
where $\eta > 0$ is the learning rate. du2018gradient showed that if $\eta = O(\lambda_0 / n^2)$, then gradient descent achieves zero training loss at a linear rate. We improve the upper bound on the learning rate used in du2018gradient. This improved analysis also gives a tighter bound for the adaptive method we will discuss in the next section. Our main result for gradient descent is the following theorem.
Theorem 3.1 (Convergence Rate of Gradient Descent with Improved Learning Rate).
Compared with du2018gradient, we improve the maximum allowable learning rate from to . Note since , Theorem 3.1 gives an improvement. The improved learning rate also gives a tighter iteration complexity bound compared to the bound in du2018gradient. Empirically, we found that if the data matrix is approximately orthogonal, then (see Figure 1 in Appendix E). Therefore, in certain scenarios, the iteration complexity of gradient descent is independent of .
Note even though gradient descent gives fast convergence, one needs to set the learning rate appropriately to achieve the fast convergence rate. In practice, is unknown to users so it would be better if the learning rate can be automatically adjusted. We address this problem in the next section.
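As a rough numerical illustration of gradient descent with a step size set by the Gram matrix, the sketch below runs full-batch gradient descent on a toy instance with the learning rate chosen as the inverse of the spectral norm of the population Gram matrix. The exact constant in Theorem 3.1 is not reproduced here, so this particular choice, the closed form used for the Gram matrix, and the problem sizes are all assumptions of the example; it does not verify the theorem.

```python
import numpy as np

def predict(W, a, X):
    """Two-layer ReLU network with fixed output weights a."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def grad_W(W, a, X, y):
    """Gradient of 0.5 * ||u - y||^2 with respect to the first-layer weights."""
    residual = predict(W, a, X) - y
    active = (X @ W.T >= 0).astype(float)
    return ((active * residual[:, None]).T @ X) * a[:, None] / np.sqrt(W.shape[0])

def gram_infinity(X):
    """Closed form of the population Gram matrix for unit-norm inputs (assumed)."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def gradient_descent(X, y, W, a, eta, num_iters):
    """Plain gradient descent; returns the final weights and the loss at every iterate."""
    losses = []
    for _ in range(num_iters):
        losses.append(0.5 * np.sum((predict(W, a, X) - y) ** 2))
        W = W - eta * grad_W(W, a, X, y)
    losses.append(0.5 * np.sum((predict(W, a, X) - y) ** 2))
    return W, losses

rng = np.random.default_rng(0)
n, d, m = 10, 20, 2000                               # hypothetical sizes
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.uniform(-1.0, 1.0, size=n)
W0 = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

eta = 1.0 / np.linalg.norm(gram_infinity(X), 2)      # assumed choice: inverse spectral norm
W_final, losses = gradient_descent(X, y, W0, a, eta, num_iters=200)
print("eta:", eta, "initial loss:", losses[0], "final loss:", losses[-1])
```

Note that this step size requires the spectral norm of the Gram matrix, which is exactly the quantity that is unknown in practice and motivates the adaptive method.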
Proof Sketch of Theorem 3.1
Our main observation is the following recursion formula.
In the first approximation we used over-parameterization ($m$ is large enough): the larger the width, the more accurate the approximation. In Section B, we will give a precise perturbation analysis. In the first inequality we used the fact that and the two symmetric matrices and share the same eigenvectors. In the second inequality we used again. Note that this recursion formula shows the loss converges to zero at a linear rate, and if we plug in we prove the theorem. The details are in Section B.
4 An Adaptive Method for Over-parameterized Neural Networks
In this section we present our new adaptive gradient algorithm for optimizing over-parameterized neural networks. At the high level, we use the same paradigm as existing adaptive methods (duchi2011adaptive). There are three positive hyper-parameters, in the algorithm. is to ensure the homogeneity and that the units match. is the initialization of a monotonically increasing sequence such that is updated at -th iteration. To control the rate of this update, we use the parameter . Note is not the learning rate to update the parameter . At -th iteration, we first use and the information received to obtain , then use to update the parameters. Here is the effective learning rate at the -th iteration.
In practice, we would like an adaptive method that is robust to the choices of hyper-parameters. That is, we want this method to be guaranteed to converge in polynomial time for any choice of hyper-parameters. (The convergence rate will, of course, depend on the choices of the hyper-parameters; the convergence of the ideal adaptive algorithm depends only polynomially on these hyper-parameters.) The key challenge for the adaptive method is how to design an appropriate update rule for to achieve the goal. Our algorithm uses the following update rule:
Here one can just view and together as one constant. Using is for matching the scale of and using is for the ease of comparison with other adaptive gradient methods that we further discuss in Section 5. The key for this update is instead of its square. Note this is in sharp contrast to duchi2011adaptive, where the scheme to update the effective learning rate can be equivalently written as . The main reason is that our convergence analysis requires analyzing both over-parameterization and the dynamics of the adaptive stepsize at the same time. See Section 5 for more discussion. We list the pseudocode in Algorithm 1.
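A minimal sketch of an adaptive loop of this kind is given below. The exact update rule is an assumption reconstructed from the prose: the squared accumulator is increased by $\alpha^2 \sqrt{n}$ times the norm (not the squared norm) of the residual $y - u(k)$, and the weights are then updated with the effective learning rate $\eta / b_{k+1}$. The names `adaloss`, `eta`, `b0`, `alpha` and the problem sizes are hypothetical, and the sketch should not be read as the paper's Algorithm 1.

```python
import numpy as np

def predict(W, a, X):
    """Two-layer ReLU network with fixed output weights a."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def grad_W(W, a, X, y):
    """Gradient of 0.5 * ||u - y||^2 with respect to the first-layer weights."""
    residual = predict(W, a, X) - y
    active = (X @ W.T >= 0).astype(float)
    return ((active * residual[:, None]).T @ X) * a[:, None] / np.sqrt(W.shape[0])

def adaloss(X, y, W, a, eta, b0, alpha, num_iters):
    """Adaptive loop (assumed form): b^2 grows with the residual norm, step is eta / b."""
    n = X.shape[0]
    b_sq = b0 ** 2
    losses, b_hist = [], []
    for _ in range(num_iters):
        residual = predict(W, a, X) - y
        losses.append(0.5 * residual @ residual)
        # Assumed update: accumulate the norm of the residual, not its square.
        b_sq = b_sq + alpha ** 2 * np.sqrt(n) * np.linalg.norm(residual)
        b = np.sqrt(b_sq)
        b_hist.append(b)
        # The freshly updated b sets the effective learning rate eta / b.
        W = W - (eta / b) * grad_W(W, a, X, y)
    return W, losses, b_hist

rng = np.random.default_rng(0)
n, d, m = 10, 20, 1000                               # hypothetical sizes
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.uniform(-1.0, 1.0, size=n)
W0 = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

W_final, losses, b_hist = adaloss(X, y, W0, a, eta=1.0, b0=0.1, alpha=1.0, num_iters=3000)
print("initial loss:", losses[0], "final loss:", losses[-1], "final b:", b_hist[-1])
```

Because the increment shrinks as the residual shrinks, the accumulator stops growing once the loss is small, so the effective learning rate does not decay to zero in this toy run.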
The following theorem characterizes the convergence rate of our proposed algorithm.
Theorem 4.1 (Convergence Rate of AdaLoss).
Then Algorithm 1 admits the following convergence results.
To our knowledge, this is the first global convergence guarantee for an adaptive gradient method. Now we unpack the statements of Theorem 4.1. Our theorem applies to two cases. In the first case, the effective learning rate at the beginning is smaller than the threshold that guarantees the global convergence of gradient descent (c.f. Theorem 3.1). In this case, the convergence has two terms; the first term is the standard gradient descent rate if we use as the learning rate. Note this term is the same as Theorem 3.1 if . The second term comes from the upper bound of in the effective learning rate (c.f. Lemma 4.1). This case shows that if is small enough that the second term is smaller than the first term, then we have the same rate as gradient descent. See Remark 4.1 for more discussion.
In the second case, the initial effective learning rate is greater than the threshold that guarantees the convergence of gradient descent. Our algorithm guarantees that one of the following happens after iterations. (1) The loss is already small, so we can stop training. This corresponds to the first term . (2) The loss is still large, which will make the effective stepsize decrease at a good rate. That is, if (2) keeps happening, the stepsize will decrease till and we are in the first case. Note the first term is the same as the second term of the first case. The third term is slightly worse than the rate in the gradient descent. The reason is that the loss may increase due to the large learning rate at the beginning (c.f. Lemma C.1).
To summarize, these two cases together show that our algorithm is robust to hyper-parameter choices. The bad choices of hyper-parameters will only hurt the constant in the convergence rate but the global polynomial time convergence is still guaranteed.
It is difficult to set the parameters to their optimal values because the maximum and minimum eigenvalues of the matrix are computationally costly to obtain and so are generally unknown. According to Theorem 3.1, since is an upper bound of , one may use gradient descent by setting and have the convergence rate of .
However, this choice of step size is not optimal when is much smaller than . Using the adaptive gradient algorithm with a small initialization of the effective learning rate would result in better complexity. Indeed, for instance, let the target training error be , the typical statistical target error, and set , . Now in the scenario that and , the convergence rate of our adaptive method is compared to the convergence rate of gradient descent, which is .
4.1 Proof Sketch of Theorem 4.1
We prove by induction. Our induction hypothesis is the following.
Recall the key Gram matrix at -th iteration
We prove two cases and separately.
Note this represents the number of iterations to make Case (2) reduce to Case (1). We first give an upper bound of . If
applying Lemma E.1 with parameters , and we have after step,
If , we are done. Note this bound incurs the first term of iteration complexity of the Case (2) in Theorem 4.1.
Similar to Case (1), we use induction for the proof. Again the base case holds by the definition. Now suppose for , Condition 4.1 holds and we will show it also holds for . There are two scenarios.
When , we have contraction bound as in Case (1) and then same argument follows but with the different initial values and . We first analyze and . By Lemma C.1, we know only increases an additive factor from . Furthermore, by Lemma 4.2, we know for
Now we consider -th iteration. Applying Lemma 4.1, we have
Now we have proved the induction. The last step is to use Condition 4.1 to prove the convergence rate. Observe that for any , we have
With some algebra, one can show this bound corresponds to the second and the third term of iteration complexity of the Case (2) in Theorem 4.1.
4.1.1 Ingredients of Proof
Suppose Condition 4.1 holds for and is updated by Algorithm 1. Let be the first index such that . Then for every and ,
Thus, the upper bound for ,
As for the upper bound of
Let be the first index such that . Then for every , we have for ,
5 Discussion on Variants of AdaGrad
In this section we compare our proposed algorithm AdaLoss with existing adaptive methods. Algorithm 1 can be viewed as a variant of the standard AdaGrad algorithm proposed by duchi2011adaptive, where the norm version of the update is
Our algorithm AdaLoss is similar to AdaGrad but differs in one key respect: we update using the norm of the loss instead of the squared norm of the gradient. We considered the AdaLoss update instead of AdaGrad because, in the setting considered here, the modification allowed for a dramatically better theoretical convergence rate.
Why the Loss instead of the Gradient?
Indeed, our update of is not too different from the following update rule using the gradient
The AdaLoss update can be upper and lower bounded by and the norm of the gradient, i.e.,
If , then (the proof is given in Appendix D)
However, we use instead of using the gradient to update because our convergence analysis requires lower and upper bounding the dynamics , in terms of . If were instead updated using (11), then
The above lower bound of results in a larger in Case (2) by a factor of . Using the loss instead of the gradient to update is independently useful as reusing the already computed loss information for each iteration can save some computation cost and thus make the update more efficient.
Why the norm and not the squared-norm?
For ease of comparison with Algorithm 1, we switch from gradient information to loss and compare with two close variants:
Equation (12), which uses the “square” update rule, is the standard AdaGrad proposed by duchi2011adaptive and has been widely recognized as an important optimizer in deep learning, especially for training on sparse datasets. For our over-parameterized models, this update rule does give a better convergence result in Case 1 when (the convergence proof is straightforward and similar to the first case in Theorem 4.1). However, when the initialization , we were only able to prove convergence in the case where the level of over-parameterization (i.e., ) depends on the training error , the bottleneck resulting from the attempt to prove the analog of Lemma 4.2 (see Proposition 5.2 below).
Let be the first index such that . Consider the update of in (12). Then for every , we have for ,
On the other hand, the update rule in (13) can resolve the problem because the growth of is larger than in (12), such that the upper bound of , is better than that in Proposition 5.2 and even Lemma 4.2 if for some small . However, the growth of remains too fast once the critical value of has been reached: the upper bound we were able to show is exponential in and also in the hyper-parameters ,, and , resulting in an extremely large compared to Case (2) in Theorem 4.1.
This work was partially done while all three authors were with the Simons Institute for the Theory of Computing at UC Berkeley. We thank the institute for the financial support and the organizers of the program on “Foundation of Data Science”. We would also like to thank Facebook AI Research for partial support of Rachel Ward’s research.
Appendix A Experiments
We first plot the eigenvalues of the matrices and then provide the details.
We use two simulated Gaussian data sets: i.i.d. Gaussian (the red curves) and multivariate Gaussian (the blue curves). Observe from the red curves in Figure 1 that the maximum eigenvalue is around and the minimum eigenvalue is around within 100 iterations, while the maximum and minimum eigenvalues for the blue curves are around and respectively. To some extent, the i.i.d. Gaussian data illustrates the case where the data points are pairwise uncorrelated such that , while the correlated Gaussian data set illustrates the situation where the samples are highly correlated with each other .
In the experiments, we simulate Gaussian data with training sample size and the dimension . Figure 1 plots the histogram of the eigenvalues of the covariances for each dataset. Note that the eigenvalues are different from the eigenvalues in the top plots. We use two-layer neural networks. Although here is far smaller than what Theorem 3.1 requires, we found it sufficient for our purpose, which is just to illustrate the maximum and minimum eigenvalues of for iteration . We set the learning rate for i.i.d. Gaussian and for correlated Gaussian. The training error is also given in Figure 1.
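The qualitative contrast between the two kinds of data can be reproduced with a small sketch. It does not use the paper's experimental settings: the sample size, dimension, correlation level and random seed below are hypothetical, and the Gram matrix is computed with the closed form of the Gaussian expectation for unit-norm inputs, which is an assumption of the example.

```python
import numpy as np

def unit_rows(X):
    """Normalize every row to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def gram_infinity(X):
    """Closed form of the population ReLU Gram matrix for unit-norm rows (assumed)."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def extreme_eigenvalues(H):
    """Return (smallest, largest) eigenvalue of a symmetric matrix."""
    eigs = np.linalg.eigvalsh(H)        # ascending order
    return eigs[0], eigs[-1]

rng = np.random.default_rng(0)
n, d = 50, 200                           # hypothetical sizes, not the paper's
Z = rng.standard_normal((n, d))
shared = rng.standard_normal(d)          # common direction shared by all correlated samples

X_iid = unit_rows(Z)                                          # nearly orthogonal samples
X_corr = unit_rows(0.9 * shared + np.sqrt(1 - 0.9 ** 2) * Z)  # pairwise correlation near 0.81

min_iid, max_iid = extreme_eigenvalues(gram_infinity(X_iid))
min_corr, max_corr = extreme_eigenvalues(gram_infinity(X_corr))
print("i.i.d. Gaussian:     min", min_iid, "max", max_iid)
print("correlated Gaussian: min", min_corr, "max", max_corr)
```

In this toy run the largest eigenvalue for the correlated data grows with the number of samples, whereas for nearly orthogonal data it stays of constant order, which is the regime where a step size based on the Gram matrix norm helps most.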
Appendix B Proof for Theorem 3.1
We prove Theorem 3.1 by induction. (Note that we use the same structure as in du2018gradient. For the sake of completeness in the proof, we will use most of their lemmas, of which the proofs can be found in the technical section or otherwise in their paper.) Our induction hypothesis is the following convergence rate of empirical loss.
At the -th iteration, we have for such that with probability ,
Now we show Condition B.1 for every . For the base case , by definition Condition B.1 holds. Suppose for , Condition B.1 holds and we want to show Condition B.1 holds for . We first prove the order of and then the contraction of .
B.1 The order of at iteration
Note that the contraction for is mainly controlled by the smallest eigenvalue of the sequence of matrices . It requires that the minimum eigenvalues of the matrix are strictly positive, which is equivalent to requiring that the update of is not far away from initialization for . This requirement can be fulfilled by a large number of hidden nodes .
The first lemma (Lemma B.1) gives the smallest in order to have . The next two lemmas conclude the order of so that