Second-order Information in First-order Optimization Methods

by Yuzheng Hu, et al.
Peking University

In this paper, we try to uncover the second-order essence of several first-order optimization methods. For Nesterov Accelerated Gradient, we rigorously prove that the algorithm makes use of the difference between past and current gradients, and thus approximates the Hessian and accelerates training. For adaptive methods, we relate Adam and Adagrad to a powerful technique in computational statistics: Natural Gradient Descent (NGD). These adaptive methods can in fact be treated as relaxations of NGD, with only a slight difference lying in the square root of the denominator in the update rules. Skeptical about the effect of this difference, we design a new algorithm, AdaSqrt, which removes the square root in the denominator and scales the learning rate by sqrt(T). Surprisingly, our new algorithm is comparable to various first-order methods (such as SGD and Adam) on MNIST and even beats Adam on CIFAR-10. This phenomenon casts doubt on the conventional view that the square root is crucial and that training without it will lead to terrible performance. As far as we are concerned, so long as the algorithm tries to explore second-order or even higher-order information about the loss surface, proper scaling of the learning rate alone will guarantee fast training and good generalization performance. To the best of our knowledge, this is the first paper that seriously considers the necessity of the square root among all adaptive methods. We believe that our work can shed light on the importance of higher-order information and inspire the design of more powerful algorithms in the future.





1 Introduction

In modern machine learning, with the explosive growth of computing power and the resurgence of deep learning, people are beginning to design new models with extremely high capacity, some of which reach millions of parameters (e.g., ResNet). Training such models requires the use of gradients or Hessians; however, it is difficult to compute and store the Hessian in high dimensions. Thus, second-order methods such as Newton's method are impractical for deep learning, and first-order methods naturally become the center of modern optimization.

Footnote 1: Co-first authors contributed equally to this paper.

Over the past decades, a number of first-order methods have been proposed and widely used in practice. The most primitive method is gradient descent, which updates the parameters of the model in the opposite direction of the gradient of the objective function. The idea of gradient descent is quite naive and can be viewed as a greedy algorithm: when the learning rate is fixed, gradient descent simply chooses the direction that makes the objective function decrease most at the current position. All other gradient-based algorithms are founded upon this simple intuition and make small but delicate modifications to gradient descent. When adding noise to the gradients, we immediately have SGD. When enforcing the gradients to be stable along the trajectory, we get SGD with momentum and Nesterov Accelerated Gradient. When considering the choice of different learning rates at different stages of training, we have a series of important works, all referred to as adaptive methods, for example Adagrad and Adam.

Although these variants of gradient methods look reasonable at first glance and have decent performance in practice, several questions remain largely open even to theorists. For example, why do SGD and Adam consistently perform better in practice? Another intriguing question is: how does Momentum help with training, and why is Nesterov Accelerated Gradient regarded as the strongest first-order method? On top of that, the key motivation of this work is a simple observation that is often dismissed easily but aroused our curiosity: why do adaptive methods use the square root of the second moment of past gradients instead of the second moment directly?

There are indeed quite a lot of interpretations concerning the pros and cons of these modified versions of GD, which partially answer the questions raised above. For example, it is claimed that SGD works well because the injected noise potentially leads the algorithm to escape from saddle points or bad local minima. Momentum is believed to accelerate the training process since oscillation is neutralized along the optimization trajectory. Also, adaptive methods are popular because they set different learning rates for different coordinates, which makes perfect sense for models such as those used in NLP. Nevertheless, these arguments are mostly vague and debatable, and rigorous theoretical guarantees are badly needed. These heuristic accounts of the advantages of each algorithm might be adequate for programmers, but they are far from satisfactory for mathematicians or statisticians.

In this paper, we aim to uncover part of these mysteries through a novel perspective: second-order information. More precisely, we believe that some first-order methods implicitly explore second-order information. It is well known that higher-order methods can accelerate the rate of convergence, so making use of higher-order information about the objective function should help training and generalization. We consider two types of first-order methods, momentum and adaptive methods, from which we choose three typical algorithms to analyze: Nesterov Accelerated Gradient, Adagrad and Adam. For starters, we consider Nesterov Accelerated Gradient and rigorously prove that this algorithm explores second-order information by making use of the difference between past and current gradients. Furthermore, we rigorously prove that adaptive methods (Adagrad, Adam) can be regarded as relaxations of natural gradient descent, a well-known second-order method in computational statistics. Based on this observation, we design a new algorithm, AdaSqrt, which outperforms Adam on MNIST and CIFAR-10. This result casts doubt on the traditional view that the square root in the denominator of adaptive methods is crucial and that training without it will lead to terrible performance.

We believe that our contributions lie in at least four aspects:

  1. We shed light on the importance of second-order information for training. In spite of the huge success of SGD, we still believe that going beyond the gradient is a promising approach with huge practical value that deserves more attention in the future.

  2. We propose a new algorithm, AdaSqrt, which outperforms Adam, the most popular method for training neural networks, on MNIST and CIFAR-10. For lack of computing resources, we did not have enough time to carry out experiments on ImageNet, but we plan to do this work in the Computing Center after the course.

  3. The success of our new algorithm suggests that the conventional view, namely that the square root in the denominator of adaptive methods is crucial for training, is likely to be untrue. This result inspires us to rethink the essence of these powerful algorithms: we believe that as long as the algorithm seeks to explore second-order information of the objective function, proper scaling of the learning rate alone can guarantee good training and generalization performance. To the best of our knowledge, this is the first paper that considers the necessity of the square root in adaptive methods.

  4. Our new perspective of analyzing first-order optimization methods through second-order information potentially leads to the design of more powerful algorithms that seek to explore higher-order information in the future.

The remainder of this paper unfolds as follows. In Section 2 we summarize related works. In Section 3 we analyze Nesterov Accelerated Gradient and reveal the second-order essence of this algorithm. In Sections 4 and 5 we dig into adaptive methods, uncover the relationship between Adam (Adagrad) and NGD, and state our new algorithm AdaSqrt with a convergence theorem. In Section 6 we present our experimental results. In Section 7 we draw conclusions and propose some future work.

2 Related Works

2.1 First-order optimization methods

First-order optimization methods have been studied extensively since the resurgence of deep learning [1][2][3]. The most classical methods are Gradient Descent and Stochastic Gradient Descent [4]. Gradient Descent works well in traditional machine learning models, especially when the objective function has nice properties such as convexity or strong convexity. For deep neural networks, since the loss surface is highly non-convex, it is intuitive that Gradient Descent will get stuck in saddle points or bad local minima and thus result in bad training and generalization performance. Recently, however, it has been reported that gradient descent with Gaussian initialization can achieve zero loss in ultra-wide neural networks [5]. For Stochastic Gradient Descent, a common understanding is that it has the power of implicit regularization: without explicitly penalizing the norm of the weights, Stochastic Gradient Descent naturally finds a solution with low "complexity" [6]. However, the theoretical understanding of SGD remains largely incomplete. There is little theoretical guarantee that SGD will converge to zero loss on a neural network with moderate width, nor are there sufficient explanations for the good generalization performance of SGD, aside from its highly chaotic trajectory.

In this paper, however, we set aside the huge empirical success of SGD and mainly focus on adaptive methods [7][8][9][10][11] and the Nesterov Accelerated Gradient algorithm [12][13][14]. Adaptive methods are a family of algorithms that all use adaptive learning rates rather than a fixed one. Although people manually adjust the learning rate of SGD or GD in practice, adaptive methods have the power of choosing the learning rate automatically. More specifically, Adagrad [11] uses the inverse square root of the second moment of past gradients as a scaling factor. Noticing that the inverse square root of the second moment tends to zero as the training time approaches infinity, RMSprop [8] alleviates this risk by adjusting the coefficients of past gradients. Adam [9] makes an additional modification to RMSprop by using momentum on past gradients. Among these three algorithms, Adam works best in practice and has become the most popular optimization method since its emergence. In recent years, more adaptive methods have been proposed by researchers in the hope of defeating Adam. For example, some observed that the learning rate of Adam oscillates in the mid-to-late phase of training, and thus proposed to clip the step size in this stage to stabilize the training procedure, which yields Adabound [15].

One should bear in mind that though some adaptive methods employ the idea of using past gradients, the core of these algorithms still lies in being "adaptive": these algorithms explicitly explore the geometric properties of the loss surface and make use of them when choosing new learning rates. On the other hand, momentum [16][17][18] alone already forms a line of research. The intuition behind this method is to avoid the fluctuation of changing gradients, since vanilla SGD potentially leads to a bouncing trajectory within a local valley, thus slowing down the training process. Momentum naturally enforces a smooth training trajectory, as each new update direction involves the history of past gradients. Nesterov Accelerated Gradient is a splendid adaptation of momentum. It not only makes use of past gradients, but also looks ahead into the future. More specifically, NAG estimates the future location by using information from past gradients, and directly uses the foreseen gradient to update the weights. This small modification leads to considerable improvement in certain frameworks.


2.2 Natural Gradient Descent

Natural Gradient Descent can be traced back to Amari's work in 1985 [20], where the notion of information geometry was first put forward. Amari combined statistics with differential geometry, so that methods from geometry can be applied to study the properties of distributions. In fact, a distribution family can be viewed as a Riemannian manifold, whose metric is given by the Fisher Information Matrix [21]. Under the Fisher metric, we can also solve optimization problems. Therefore, given the probabilistic interpretation of any statistical model, gradient methods can be applied to it, where the "gradient" is taken with respect to the Fisher metric. Here, the gradient is called the "Natural Gradient", which can be rigorously defined as the inverse of the Fisher Information Matrix times the ordinary gradient.

Natural Gradient Descent, or NGD, is considered to be an efficient alternative to traditional gradient descent methods. Since the resurgence of deep learning, NGD has been applied to various kinds of problems related to deep learning, such as neural networks [22][23][24][25][26][27], reinforcement learning [28], etc. NGD is thought to be better than ordinary gradient methods mainly for two reasons. The first reason is that NGD possesses nice properties such as parameterization invariance, while ordinary gradient methods depend heavily on the parameterization of the model [29]. To some extent, NGD is more robust. The second reason is that NGD can be viewed as a type of second-order method [30]. NGD reveals information about the distribution space (manifold), which tends to make the optimization procedure quicker and smoother. However, in the high-dimensional case, the inverse of the Fisher Information Matrix is hard to compute, so it is important to design provable algorithms that approximate the Fisher Matrix appropriately in practice [30].

There are also other optimization methods closely related to NGD: Hessian Free Optimization [31] and Krylov Subspace Descent [32]. The relationship between these methods is discussed in [29].

3 A Quick Warmup: Momentum and Nesterov Accelerated Gradient

3.1 Momentum

In this section, we focus on a branch of first-order optimization methods: Momentum. Momentum was first brought up by Geoffrey Hinton in 1986 to address the difficulty of training with traditional algorithms. In 1999, Ning Qian combined Momentum with the most popular first-order method, Gradient Descent, resulting in a dramatic improvement in the convergence rate of training. In deep learning, Momentum updates the weights of a neural network as follows:

Algorithm 1 Momentum

Here, $\theta_t$ represents the weights of the neural network and $v_t$ represents the update direction at time $t$. $\alpha$ is simply the learning rate and $\beta$ signifies the decay rate of past gradients. They are both hyperparameters belonging to $(0, 1)$ that require fine-tuning.
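For reference, the momentum update is commonly written as below, with $\theta_t$ the weights, $v_t$ the update direction, $\alpha$ the learning rate and $\beta$ the decay rate. This is a sketch in the standard form; the symbol $J$ for the objective is an assumption, and the exact indexing used in Algorithm 1 may differ.

```latex
\begin{aligned}
v_t      &= \beta\, v_{t-1} + \alpha\, \nabla_\theta J(\theta_{t-1}), \\
\theta_t &= \theta_{t-1} - v_t .
\end{aligned}
```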

The idea of Momentum is quite simple: by forcing the past gradients to play a part in the decision of the new update direction, the algorithm naturally induces smoothness over the trajectory. One quick example: imagine a person sliding down from the top floor of a tall building; it would be unlikely for him to reverse direction upon reaching a lower floor. But how does smoothing trajectories help with training? Recall that the loss surface contains a huge number of ravines, i.e., areas where the surface curves much more steeply in one dimension than in another. In such cases, GD or SGD will, with high probability, oscillate across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. By using Momentum, however, the oscillation within the ravine is neutralized and the algorithm rushes down to the local minimum steadily and quickly. Figures 1 and 2 show that SGD with momentum indeed converges faster than ordinary SGD at the beginning of training.

Figure 1: SGD Without Momentum
Figure 2: SGD With Momentum
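The ravine intuition can be illustrated on a toy problem. The sketch below is not the experiment behind the figures: it assumes a hypothetical two-dimensional quadratic `f(x, y) = 0.5 * (x**2 + 100 * y**2)`, the standard momentum update, and arbitrarily chosen values of `alpha` and `beta`, and compares the loss reached by plain gradient descent and by momentum after the same number of steps.

```python
def ravine_grad(x, y):
    # Gradient of the toy ravine f(x, y) = 0.5 * (x**2 + 100 * y**2)
    return x, 100.0 * y


def ravine_loss(x, y):
    return 0.5 * (x * x + 100.0 * y * y)


def run_gd(steps, alpha, start=(1.0, 1.0)):
    # Plain gradient descent with a fixed learning rate
    x, y = start
    for _ in range(steps):
        gx, gy = ravine_grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return ravine_loss(x, y)


def run_momentum(steps, alpha, beta, start=(1.0, 1.0)):
    # Momentum in the standard form: v = beta * v + alpha * grad; theta = theta - v
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = ravine_grad(x, y)
        vx = beta * vx + alpha * gx
        vy = beta * vy + alpha * gy
        x, y = x - vx, y - vy
    return ravine_loss(x, y)


gd_loss = run_gd(steps=100, alpha=0.015)
momentum_loss = run_momentum(steps=100, alpha=0.015, beta=0.9)
print("GD loss:", gd_loss)
print("Momentum loss:", momentum_loss)
```

On this toy problem the shallow `x` direction limits plain gradient descent, while the accumulated velocity lets momentum make faster progress along the bottom of the ravine.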

3.2 Nesterov Accelerated Gradient (NAG)

Although Momentum has achieved significant improvements upon Gradient Descent and Stochastic Gradient Descent, it is still far from satisfactory in practice. Nesterov Accelerated Gradient, however, is extremely favored by engineers and often referred to as the best first-order optimization method. Surprisingly, this algorithm only makes a simple modification to the original Momentum, namely the location at which the gradient is evaluated:

Algorithm 2 Nesterov Accelerated Gradient

The high level idea of this algorithm is to apply prescience to the one-step update of parameters. Since the new direction will surely contain the momentum term by the updating rules, we can explore the future gradient in advance by cheating, or equivalently, by using momentum. In other words, NAG employs past information to predict the future. But how do we formalize this advantage in mathematics? Intuitively, since the difference between NAG and the original Momentum is in fact the difference between two gradients, it is natural for us to conjecture that NAG implicitly uses second-order information. Formally, we have the following theorem:

Theorem 1 (Equivalence Theorem).  The original NAG algorithm is equivalent to the following Algorithm:

Algorithm 3 An Equivalent Form of NAG

Proof: Consider the original Momentum:
We have


By (*) we have
Notice that



Combining and yields the desired result.
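The equivalence can be checked numerically. The sketch below assumes the standard NAG update `v = beta * v + alpha * grad(theta - beta * v)`, `theta = theta - v`, and compares its look-ahead points `theta - beta * v` with a recursion that only uses the current gradient plus `beta` times the difference between the current and previous gradients. The toy objective, the function names and the hyperparameter values are illustrative assumptions, and the exact statement of Algorithm 3 may be indexed differently.

```python
def grad(x):
    # Gradient of the toy objective f(x) = x**4 / 4 + x**2 (an arbitrary smooth function)
    return x ** 3 + 2.0 * x


def nag_lookahead_points(x0, alpha, beta, steps):
    # Standard NAG; records the look-ahead point theta_t - beta * v_t after each step
    theta, v = x0, 0.0
    points = []
    for _ in range(steps):
        v = beta * v + alpha * grad(theta - beta * v)
        theta = theta - v
        points.append(theta - beta * v)
    return points


def gradient_difference_points(x0, alpha, beta, steps):
    # Momentum on the current gradient plus beta * (current gradient - previous gradient)
    phi, d, g_prev = x0, 0.0, 0.0
    points = []
    for _ in range(steps):
        g = grad(phi)
        d = beta * d + g + beta * (g - g_prev)
        phi = phi - alpha * d
        g_prev = g
        points.append(phi)
    return points


nag_points = nag_lookahead_points(1.0, alpha=0.05, beta=0.9, steps=50)
diff_points = gradient_difference_points(1.0, alpha=0.05, beta=0.9, steps=50)
max_gap = max(abs(p - q) for p, q in zip(nag_points, diff_points))
print("largest gap between the two trajectories:", max_gap)
```

The two trajectories agree up to floating-point rounding, which is the sense in which the gradient-difference term is already present in NAG.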

On the surface, NAG works better because it uses the forward gradient rather than the current gradient, and this somewhat "cheating" gradient potentially leads to faster training. However, according to our Equivalence Theorem, this advantage in fact plays a minor role in the empirical success of NAG and does not reveal the intrinsic property of this algorithm. As a matter of fact, we believe that the key issue here lies in the order of information: the crucial difference between NAG and the original Momentum is that NAG uses second-order information during training. More specifically, NAG adds an additional term, the difference between the current gradient and the previous gradient, to the update rule, which can be treated as an approximation of the Hessian of the objective function. Thus, NAG can possibly reach a convergence rate of order two, while the original Momentum cannot reach such speed. We believe this is why NAG works so well in practice.
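The link between a gradient difference and the Hessian can be made explicit with a first-order Taylor expansion of the gradient. The following is a sketch under the assumption that the objective $f$ is twice continuously differentiable with a Lipschitz Hessian; the symbols $f$ and $\theta_t$ are used here for illustration.

```latex
\nabla f(\theta_t) - \nabla f(\theta_{t-1})
  = \nabla^2 f(\theta_{t-1})\,(\theta_t - \theta_{t-1})
  + O\!\left(\|\theta_t - \theta_{t-1}\|^2\right)
```

In other words, the gradient difference is a finite-difference estimate of a Hessian-vector product along the most recent step, obtained without ever forming the Hessian.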

4 Natural Gradient Descent

4.1 A Probabilistic Interpretation of Neural Networks

Starting from this section, we dig into another branch of optimization: adaptive methods. For starters, we need to give an interpretation of neural networks from the probabilistic perspective and reveal the similarity between adaptive methods and a powerful technique in computational statistics: natural gradient descent.

4.1.1 Regression Problem

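We first consider the regression problem in machine learning. Given a training data set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, consider $x$ as input and $y$ as output; our goal is to predict the value of $y$ for a new value of $x$. We assume that there is a statistical model

$$y = f(x; \theta) + \epsilon,$$

where $\epsilon$ is a stochastic error, $f$ is the deterministic function we want to derive, and $\theta$ denotes the parameters. For a new value of $x$, our prediction of $y$ will be $f(x; \theta)$.
In the neural network framework, $f$ is supposed to have the form

$$f(x; \theta) = W_L\, \sigma\big(W_{L-1}\, \sigma(\cdots \sigma(W_1 x))\big),$$

where $\sigma$ is the activation function (e.g., ReLU, sigmoid) and $W_1, \dots, W_L$ are the weight matrices, which can be viewed as the parameters $\theta$ of $f$. We can make a further assumption to specify the problem. If we assume that the stochastic error $\epsilon$ is Gaussian noise with zero mean and precision (inverse variance) $\beta$, then we can write

$$p(y \mid x, \theta, \beta) = \mathcal{N}\big(y \mid f(x; \theta), \beta^{-1}\big).$$

Making the assumption that the data points are drawn independently from the distribution above, we obtain the likelihood function

$$p(\mathbf{y} \mid X, \theta, \beta) = \prod_{i=1}^{n} \mathcal{N}\big(y_i \mid f(x_i; \theta), \beta^{-1}\big),$$

where $X = (x_1, \dots, x_n)$ and $\mathbf{y} = (y_1, \dots, y_n)$. Taking the logarithm of the likelihood function and making use of the form of the Gaussian distribution, we have

$$\ln p(\mathbf{y} \mid X, \theta, \beta) = -\beta\, E(\theta) + \frac{n}{2} \ln \beta - \frac{n}{2} \ln(2\pi),$$

where the "square loss" is defined by

$$E(\theta) = \frac{1}{2} \sum_{i=1}^{n} \big(f(x_i; \theta) - y_i\big)^2.$$

Hence, maximizing the log-likelihood function with respect to $\theta$ is equivalent to minimizing the square loss $E(\theta)$, which is exactly what we do to solve ordinary regression problems in machine learning.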
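The equivalence between maximum likelihood and least squares can be checked numerically. The sketch below assumes Gaussian noise with a fixed precision and works directly with residuals (targets minus predictions); the residual values and function names are made up for illustration. It verifies that the log-likelihood plus the precision times the square loss is the same constant for two different fits.

```python
import math


def gaussian_log_likelihood(residuals, precision):
    # Sum of log N(r | 0, 1 / precision) over residuals r = y - f(x)
    return sum(
        0.5 * math.log(precision)
        - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * precision * r * r
        for r in residuals
    )


def square_loss(residuals):
    # E = 0.5 * sum of squared residuals
    return 0.5 * sum(r * r for r in residuals)


precision = 2.0
fit_a = [0.1, -0.3, 0.2]   # residuals of a good fit (illustrative values)
fit_b = [0.5, 0.4, -0.6]   # residuals of a worse fit (illustrative values)

# log-likelihood + precision * square loss does not depend on the fit
const_a = gaussian_log_likelihood(fit_a, precision) + precision * square_loss(fit_a)
const_b = gaussian_log_likelihood(fit_b, precision) + precision * square_loss(fit_b)
print(const_a, const_b)
```

Because the two quantities differ only by a constant that is independent of the fit, the fit with the smaller square loss always has the larger log-likelihood.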

4.1.2 Classification Problem

A similar interpretation can be applied to classification problems. Suppose now we want to solve a K-class classification problem. Analogously, we can assume there is a statistical model:

$$y \mid x \sim \mathrm{Categorical}\big(f(x; \theta)\big),$$

where $f(x; \theta)$ is a neural network with output of size $K$ and the sum of the output's entries is $1$. Our goal is to derive the deterministic function $f$. In addition, we suppose $y$ has a K-point distribution with probabilities $p = (p_1, \dots, p_K)$ and the samples are drawn i.i.d. from this true K-point distribution. Therefore, we can write down the likelihood:

$$L(\theta) = \prod_{i=1}^{n} f_{y_i},$$

where $f_k$ is the $k$-th entry of our predicted function $f$. Taking the logarithm, we obtain the log-likelihood:

$$\log L(\theta) = \sum_{k=1}^{K} n_k \log f_k,$$

where $n_k$ denotes the number of samples belonging to class $k$. Dividing both sides by $n$ and using the law of large numbers, we have

$$\frac{1}{n} \log L(\theta) \to \sum_{k=1}^{K} p_k \log f_k,$$

where $p_k$ is the $k$-th entry of the true function $p$ and $f_k$ is the $k$-th entry of the predicted function $f$. Note that $-\sum_{k} p_k \log f_k$ is exactly the categorical cross-entropy loss function which is commonly used in machine learning for classification. Therefore, when $n$ is large, minimizing the categorical cross-entropy is equivalent to maximizing the likelihood in this setting.
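The finite-sample version of this statement is an exact identity: the average log-likelihood equals the negative cross-entropy between the empirical class frequencies and the predicted probabilities. The sketch below uses made-up labels and a fixed predicted probability vector (both assumptions for illustration) to check it; the law of large numbers then replaces the empirical frequencies by the true class probabilities.

```python
import math
from collections import Counter


def average_log_likelihood(labels, probs):
    # (1 / n) * sum_i log f_{y_i}
    return sum(math.log(probs[y]) for y in labels) / len(labels)


def cross_entropy(p, q):
    # -sum_k p_k * log q_k, skipping classes with zero true probability
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)


labels = [0, 2, 1, 2, 2, 0, 1, 2]   # illustrative class labels for K = 3
predicted = [0.2, 0.3, 0.5]         # illustrative predicted probabilities

counts = Counter(labels)
empirical = [counts[k] / len(labels) for k in range(len(predicted))]

avg_ll = average_log_likelihood(labels, predicted)
ce = cross_entropy(empirical, predicted)
print(avg_ll, -ce)
```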

4.2 Derivation of Natural Gradient Descent

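As stated before, neural networks have a probabilistic interpretation. From this point of view, the optimization of the objective function (loss function) with respect to the parameters can be viewed as optimization with respect to distributions. We have shown that each $\theta$ (the values of the parameters) represents a distribution $p(y \mid x; \theta)$. When we use gradient methods to optimize the objective function $L(\theta)$, we usually choose $-\nabla L(\theta)$ as the descent direction, since $-\nabla L(\theta)$ can be interpreted as the "steepest" descent direction in the sense that it yields the most reduction of $L$ per unit of change in $\theta$. Here, the change in $\theta$ is measured in the standard Euclidean norm $\|\cdot\|_2$. Formally, we have

$$\frac{-\nabla L(\theta)}{\|\nabla L(\theta)\|_2} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \arg\min_{d:\ \|d\|_2 \le \epsilon} L(\theta + d).$$

Notice that this formula depends on the Euclidean metric of the parameter space; we may derive a similar formula by using another metric on the parameter space. Since we have stated that the parameter space can be viewed as a space of distributions, we recall the well-known measure of discrepancy between distributions, the KL-divergence, where

$$\mathrm{KL}\big(p_\theta \,\|\, p_{\theta + d}\big) = \mathbb{E}_{p_\theta}\!\left[\log \frac{p_\theta}{p_{\theta + d}}\right].$$

We can approximate the KL-divergence by Taylor expansion:

$$\mathrm{KL}\big(p_\theta \,\|\, p_{\theta + d}\big) \approx \frac{1}{2}\, d^\top F(\theta)\, d,$$

where $F(\theta)$ denotes the Fisher Information Matrix at $\theta$. More generally, the space of distributions is in fact a Riemannian manifold whose metric is given by the Fisher Information Matrix $F(\theta)$. Thus, we define the Natural Gradient $\tilde{\nabla} L(\theta)$ by

$$\frac{-\tilde{\nabla} L(\theta)}{\|\tilde{\nabla} L(\theta)\|_F} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \arg\min_{d:\ \|d\|_F \le \epsilon} L(\theta + d),$$

where the norm is defined by $\|d\|_F = \sqrt{d^\top F(\theta)\, d}$ (the norm is well-defined since $F(\theta)$ is positive definite). We can show that

$$\tilde{\nabla} L(\theta) = F(\theta)^{-1} \nabla L(\theta).$$

In fact, for small $\epsilon$, minimizing the linearization $\nabla L(\theta)^\top d$ subject to $d^\top F(\theta)\, d \le \epsilon^2$ with a Lagrange multiplier gives $d \propto -F(\theta)^{-1} \nabla L(\theta)$.

If we choose the negative natural gradient as the descent direction, we then derive the algorithm called "natural gradient descent":

$$\theta_{t+1} = \theta_t - \alpha\, F(\theta_t)^{-1} \nabla L(\theta_t),$$

where the Fisher Information Matrix is $F(\theta) = \mathbb{E}\big[\nabla_\theta \log p(y \mid x; \theta)\, \nabla_\theta \log p(y \mid x; \theta)^\top\big]$.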
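The quadratic approximation of the KL-divergence by the Fisher information can be checked on the simplest possible model. The sketch below assumes a one-parameter Bernoulli model with logit parameter `theta`, for which the Fisher information is `p * (1 - p)` with `p = sigmoid(theta)`; the model, the function names and the numerical values are illustrative and do not come from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bernoulli_kl(p, q):
    # KL(Bernoulli(p) || Bernoulli(q))
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def fisher_information(theta):
    # For a Bernoulli model with logit theta, F(theta) = p * (1 - p)
    p = sigmoid(theta)
    return p * (1.0 - p)


def natural_gradient(theta, grad):
    # Scalar version of F(theta)^{-1} * grad
    return grad / fisher_information(theta)


theta, d = 0.3, 0.01
kl = bernoulli_kl(sigmoid(theta), sigmoid(theta + d))
quadratic = 0.5 * fisher_information(theta) * d * d
relative_error = abs(kl - quadratic) / quadratic
print(kl, quadratic, relative_error)
```

For a small step `d` the two quantities agree closely, and `natural_gradient` shows that in one dimension the natural gradient is simply the ordinary gradient rescaled by the inverse Fisher information.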

4.3 Advantages of Natural Gradient Descent

In this section, we summarize some advantages of NGD (Natural Gradient Descent).

First, NGD is "natural", since it optimizes the objective function in the distribution space. As we have shown before, a set of parameter values is actually a representation of a distribution in our statistical model. We may reparametrize the model, in which case the parameters change but the corresponding distribution remains the same. We can say that the natural gradient reflects some geometric properties of the distribution space; for example, by using NGD, we can jump over plateaus of $p_\theta$. Since the loss function $L(\theta)$ is a function of $p_\theta$ for given $(x, y)$, plateaus of $p_\theta$ usually match those of $L(\theta)$, and hence NGD can jump over plateaus of the error function too. On the other hand, NGD not only jumps over plateaus, but also avoids jumping too far in each step, which may prevent overfitting to some extent. (The distributions which overfit the model are often very "rough", so in the sense of KL-divergence, they are likely far from the initial point of optimization.)

Another advantage is that NGD can be viewed as a second-order method, since $F(\theta)$ is the expectation of the Hessian of the negative log-likelihood function. Second-order methods often perform better than first-order methods, since they reveal more information about the loss function. But as we stated in the introduction, the Hessian is hard to compute in high dimensions, which prevents us from directly using second-order methods in fitting neural networks. Fortunately, though superficially second-order, NGD can in fact be written in the form of a first-order method. This is derived from a special property of the Fisher Information Matrix:

$$F(\theta) = \mathbb{E}\big[\nabla_\theta \log p\, \nabla_\theta \log p^\top\big] = -\mathbb{E}\big[\nabla_\theta^2 \log p\big].$$

Thus we do not need to compute the Hessian; instead we compute the gradient of the log-likelihood function, and approximate the expectation by some kind of "average" of $\nabla_\theta \log p\, \nabla_\theta \log p^\top$. The details will be shown in the later sections.
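The property used here, that the expected outer product of the score equals the negative expected Hessian of the log-likelihood, follows from differentiating the normalization condition twice. The following is a sketch under the usual regularity assumption that differentiation and integration can be interchanged; $p_\theta$ is shorthand for the model density.

```latex
\begin{aligned}
\int p_\theta \, dy = 1
  \;&\Longrightarrow\;
  \int \nabla_\theta p_\theta \, dy
  = \mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta\right] = 0, \\
\nabla_\theta \int p_\theta \, \nabla_\theta \log p_\theta^\top \, dy = 0
  \;&\Longrightarrow\;
  \mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta \, \nabla_\theta \log p_\theta^\top\right]
  + \mathbb{E}_{p_\theta}\!\left[\nabla_\theta^2 \log p_\theta\right] = 0 .
\end{aligned}
```

The second line uses $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$ under the integral sign, and rearranging it gives the stated identity.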

5 AdaSqrt: A New Adaptive Optimization Algorithm

5.1 From Previous Adaptive Methods to AdaSqrt

The adaptive methods, though varying from each other in certain aspects, share the same characteristic: the denominator of the update contains a sum (or average) of $g_t^2$, where $g_t$ denotes the gradient of $L$ at step $t$. For example, in Adagrad,

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t} g_i^2 + \epsilon}}\, g_t.$$

The original paper [11] explains the intuition for Adagrad: $\sqrt{\sum_{i} g_i^2}$ in the denominator forces the algorithm to descend faster in the directions where the parameters have scarcely moved. Though reasonable, this statement cannot explain why we need to calculate the square root of $\sum_{i} g_i^2$. The traditional view simply claims that the square root is indeed important and that updating the parameters without it will lead to terrible training performance [an overview of ***]. Both rigorous proofs and heuristic explanations are largely lacking.
In this section, we will give another explanation for $\sum_{i} g_i^2$, and show that the square root in the denominator is actually unnecessary under proper scaling. Recall the formula of NGD:

$$\theta_{t+1} = \theta_t - \alpha\, F(\theta_t)^{-1} \nabla L(\theta_t),$$

where the Fisher Information Matrix is $F(\theta) = \mathbb{E}\big[\nabla_\theta \log p\, \nabla_\theta \log p^\top\big]$. Notice that for a fixed $(x, y)$, $\nabla_\theta \log p(y \mid x; \theta) = -g$ up to a constant factor, thus

$$F(\theta) \propto \mathbb{E}\big[g\, g^\top\big].$$

Hence, to some extent, $g_t^2$ can be viewed as an approximation of the Fisher Matrix, which explains the occurrence of $g_t^2$ in the adaptive methods from the viewpoint of NGD. Moreover, the square root over $\sum_{i} g_i^2$ is unnecessary, since there is no square root over the Fisher Matrix. We will then derive a new algorithm based on this observation. More precisely, an expectation has to be taken over $g\, g^\top$. But in practice, we have no access to the distribution of $g$. Therefore, some average of $g_i^2$ is necessary for approximating the expectation. We may approximate the Fisher Matrix by a rescaled running sum of $g_i^2$, where the rescaling is controlled by a tuning parameter. Choosing it so that the learning rate is scaled by $\sqrt{t}$, we then derive our algorithm, AdaSqrt:

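Algorithm 4 AdaSqrt
Require: $\alpha$: Stepsize
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\epsilon$: A small constant
$\theta_0$ (Initial parameter vector)
$s_0 \leftarrow 0$ (Initialize sum of squares of gradients)
$c_0$ (Initialize the scaling parameter)
$t \leftarrow 0$ (Initialize timestep)
while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
    $s_t \leftarrow s_{t-1} + g_t^2$
    $c_t \leftarrow \sqrt{t}$
    $\theta_t \leftarrow \theta_{t-1} - \alpha\, c_t\, g_t / (s_t + \epsilon)$
end while
return $\theta_t$ (Resulting parameters)

where $\alpha$ is the learning rate, $s_t = \sum_{i=1}^{t} g_i^2$, $g_t$ is the derivative of the loss function at the $t$-th iteration, and $\epsilon$ is a small constant ensuring that the denominator is positive. Besides, all updates in this algorithm are performed entrywise.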

We can see from the update formula that the denominator grows without bound as $t$ becomes large, which is similar to Adagrad. However, one of Adagrad's potential weaknesses is that the learning rate eventually becomes very small due to the accumulation of squared gradients. Thus, in some sense, our algorithm can be interpreted as an adaptive method that adds a $\sqrt{t}$ term in the numerator to counteract the rapidly increasing denominator, which results in a more reasonable and effective learning rate.
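The contrast between the two update rules can be sketched in a few lines. The code below is an illustration, not the authors' implementation: it assumes AdaSqrt divides by the running sum of squared gradients without a square root and multiplies the step by `sqrt(t)`, as described in the abstract, and it assumes Adagrad places `eps` inside the square root. The function names, the toy objective `0.5 * sum(theta_i ** 2)`, and the hyperparameter values are all hypothetical.

```python
import math


def adagrad_step(theta, grad, state, alpha=0.1, eps=1e-8):
    # Adagrad: divide by the square root of the accumulated squared gradients
    state["s"] = [s + g * g for s, g in zip(state["s"], grad)]
    return [
        th - alpha * g / math.sqrt(s + eps)
        for th, g, s in zip(theta, grad, state["s"])
    ]


def adasqrt_step(theta, grad, state, alpha=0.1, eps=1e-8):
    # AdaSqrt (as assumed here): no square root in the denominator, step scaled by sqrt(t)
    state["t"] += 1
    state["s"] = [s + g * g for s, g in zip(state["s"], grad)]
    scale = math.sqrt(state["t"])
    return [
        th - alpha * scale * g / (s + eps)
        for th, g, s in zip(theta, grad, state["s"])
    ]


def minimize(step_fn, theta0, steps):
    # Entrywise updates on the toy objective f(theta) = 0.5 * sum(theta_i ** 2)
    theta = list(theta0)
    state = {"t": 0, "s": [0.0] * len(theta)}
    for _ in range(steps):
        grad = list(theta)  # the gradient of the toy objective is theta itself
        theta = step_fn(theta, grad, state)
    return 0.5 * sum(x * x for x in theta)


initial_loss = 0.5 * (1.0 ** 2 + (-2.0) ** 2)
adagrad_loss = minimize(adagrad_step, [1.0, -2.0], steps=100)
adasqrt_loss = minimize(adasqrt_step, [1.0, -2.0], steps=100)
print(initial_loss, adagrad_loss, adasqrt_loss)
```

One way to read the two rules: if every gradient entry had constant magnitude `c`, Adagrad's effective step would be `alpha / (c * sqrt(t))` and the assumed AdaSqrt step would be `alpha / (c**2 * sqrt(t))`, so both decay like `1 / sqrt(t)` and differ only in how they scale with the gradient magnitude.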

In the following part we prove a theoretical result for AdaSqrt, and in Section 6 experiments are provided to examine the effect of our algorithm empirically and back up our theoretical result.

5.2 Convergence Analysis

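We analyze the convergence of AdaSqrt under the ordinary deep learning framework, with several mild assumptions. Given a convex objective function (loss function) $f_t(\theta)$ which we want to minimize at each time $t$, our goal is to update the parameter $\theta_t$ using AdaSqrt, that is,

$$\theta_{t+1} = \theta_t - \alpha \sqrt{t}\, \frac{g_t}{\sum_{i=1}^{t} g_i^2 + \epsilon},$$

where $g_t = \nabla f_t(\theta_t)$, and $g_t^2$ is the elementwise square of $g_t$. Since the sequence of $f_t$ is unknown in advance, we will evaluate our algorithm using the regret $R(T)$, which is defined as

$$R(T) = \sum_{t=1}^{T} \big[f_t(\theta_t) - f_t(\theta^*)\big],$$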
where . Define . We will show that the asymptotic behavior of is crucial for convergence. If does not vary a lot for large , then with , AdaSqrt has a bound. With different value of , the bound may be different. The theorem and related discussions will be shown in the latter part.

Theorem 2 (Convergence Theorem). Assume that the distance between and is bounded, generated by AdaSqrt. Let , and assume that monotonously decreases to . Then AdaSqrt achieves the following guarantee,

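Proof: Since $f_t$ is convex,

$$f_t(\theta_t) - f_t(\theta^*) \le g_t^\top (\theta_t - \theta^*).$$

Then we have

$$R(T) \le \sum_{t=1}^{T} g_t^\top (\theta_t - \theta^*) = \sum_{i} \sum_{t=1}^{T} g_{t,i}\, (\theta_{t,i} - \theta^*_i),$$

where $g_{t,i}$, $\theta_{t,i}$ and $\theta^*_i$ are the corresponding elements. Since our algorithm, AdaSqrt, is actually elementwise, we then consider each element separately. For convenience, for a fixed $i$, we omit the subscript $i$ in the next several lines. Recall the update rule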

Subtracting the scalar $\theta^*$ and squaring both sides, we have

Rearranging the above equation, we obtain

Summing the above equation over $t = 1, \dots, T$, we have

When , since converges to , the first term converges to a constant. For the second term, since monotonously decreases,

Hence we’ve shown that


which completes the proof.

We can deduce from the proof that different choices of the tuning parameter lead to different upper bounds. It seems that if we choose a smaller value, which means a larger learning rate, we will get a tighter bound. But the crucial point is that the assumptions need to hold. In training, we usually expect the actual learning rate to decrease so that the algorithm converges. Intuitively, the quantity defined above should not vary a lot, which can be seen from our proof. Some related works have shown that adaptive methods such as Adam occasionally use extreme learning rates, either too large or too small, in the late phase of training [15], which results in poor performance. However, according to the experiments, our algorithm seeks to maintain a moderate learning rate. We believe that the tuning parameter carries some information about the geometry of the loss surface, and we leave this to future work.

6 Experimental Results

We perform our experiments on two real-world datasets, MNIST and CIFAR-10. We compare our method with other first-order optimization methods such as SGD, Adagrad and Adam under several neural network structures, including multi-layer fully-connected neural networks and convolutional neural networks. We use the same batches of samples for all these methods and evaluate the training loss and prediction error after training. The step sizes for these methods are chosen such that each algorithm has relatively optimal performance in experiments.

Using these complicated models and large datasets, we demonstrate that our method can efficiently solve certain practical deep learning problems. Furthermore, our method even outperforms others under certain neural network structure for real problems.

6.1 Experiments on MNIST

MNIST is a real-world database containing a training set of 60,000 black and white handwritten digits as well as a test set of 10,000 handwritten digits. Each image has a size of 28×28 pixels and a corresponding label that indicates the true digit in the image. Our task is to use the training set to fit a neural network such that the prediction error is as small as possible. For this ten-class classification problem, we use sparse categorical cross-entropy between the ground-truth label and the predicted probability as the loss function. In the experiments, we construct a simple fully connected neural network with two hidden layers of 300 units and 100 units respectively. The same activation function is used in both layers, and the model is trained on minibatches of fixed size. After choosing the optimal step size for each method, we train the model with each optimization method on the same batches of samples and compare the training loss and prediction error after several epochs. The results are discussed below.
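To make the loss concrete, here is a minimal sketch of sparse categorical cross-entropy for a single sample. The function name, the argument names `probs` and `label`, and the clipping constant are illustrative assumptions, since the original notation is not recoverable from the text.

```python
import math

def sparse_categorical_cross_entropy(probs, label):
    """Return -log(probs[label]) for one sample.

    probs: predicted class probabilities (expected to sum to 1).
    label: integer index of the true class (the "sparse" encoding).
    """
    eps = 1e-12  # illustrative guard against log(0)
    return -math.log(max(probs[label], eps))

# Ten-class example: the model puts probability 0.7 on the true digit 3
# and spreads the remaining 0.3 evenly over the other nine classes.
probs = [0.7 if k == 3 else 0.3 / 9 for k in range(10)]
loss = sparse_categorical_cross_entropy(probs, label=3)
```

Only the probability assigned to the true class enters the loss, so a uniform prediction over ten classes yields log(10) regardless of the label.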

Adam rapidly reduces the training loss and prediction error at the beginning of training (the first three epochs), while the prediction errors of the other methods oscillate more strongly at first. However, once the fitted model is close to the true one (after five epochs), Adam suffers from high variance because its learning rate does not converge. Adagrad’s learning rate, on the other hand, converges to zero, so its loss and error are more stable than those of the other methods. Furthermore, although SGD converges relatively slowly at first, its strong ability to explore local structure makes it more effective when the model is near the minimum.
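The contrast between the two learning-rate behaviours can be illustrated on a toy case. The sketch below applies the standard Adagrad and Adam update rules to a scalar parameter that receives the same gradient at every iteration; the constant gradient, the hyperparameter values and the function names are assumptions chosen for illustration, not the experimental setup.

```python
import math

def adagrad_steps(grads, lr=0.1, eps=1e-8):
    """Step sizes lr * g / (sqrt(sum of g^2) + eps) for a scalar parameter."""
    acc, steps = 0.0, []
    for g in grads:
        acc += g * g  # running sum of squared gradients
        steps.append(lr * g / (math.sqrt(acc) + eps))
    return steps

def adam_steps(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Step sizes of Adam with bias-corrected moment estimates."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps

grads = [0.5] * 1000          # toy input: identical gradient every iteration
ada = adagrad_steps(grads)    # shrinks like lr / sqrt(t)
adam = adam_steps(grads)      # stays close to lr
```

With a constant gradient, the Adagrad step decays like lr / sqrt(t), whereas the Adam step stays near lr because the moving averages do not accumulate.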

As for the proposed method, we can see that it has the best performance in epochs. Since the parameters converge relatively easily, the sum of squared gradients in the denominator of the learning-rate term is sometimes not large enough to counteract the in the numerator, which results in a relatively large step size and high variance in the final few epochs. A natural way to improve our algorithm is to set a threshold on the gradients and use it to fix a in the numerator, or to reduce the exponential coefficient when is large.
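The interplay between numerator and denominator can be sketched in code. This is a minimal sketch rather than the authors' implementation: it assumes a per-coordinate rule in which squared gradients are accumulated as in Adagrad, the square root is dropped from the denominator, and the step is scaled by the iteration count t raised to an exponent of 1/2. The class name `AdaSqrtSketch`, the `eps` term and the optional `max_scale` cap (one possible reading of the thresholding idea) are hypothetical.

```python
class AdaSqrtSketch:
    """Adagrad-like update without the square root in the denominator.

    Assumed rule, per coordinate:
        G_t     = G_{t-1} + g_t**2
        theta_t = theta_{t-1} - lr * min(t**exponent, max_scale) * g_t / (G_t + eps)
    """

    def __init__(self, dim, lr=0.01, exponent=0.5, max_scale=None, eps=1e-8):
        self.lr = lr
        self.exponent = exponent    # exponent applied to the iteration count
        self.max_scale = max_scale  # hypothetical cap on the numerator
        self.eps = eps
        self.t = 0
        self.sq_sum = [0.0] * dim   # running sum of squared gradients

    def step(self, params, grads):
        """Return updated parameters given the current gradients."""
        self.t += 1
        scale = self.t ** self.exponent
        if self.max_scale is not None:
            scale = min(scale, self.max_scale)
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.sq_sum[i] += g * g
            new_params.append(p - self.lr * scale * g / (self.sq_sum[i] + self.eps))
        return new_params

# Two updates of a single parameter with the same gradient.
opt = AdaSqrtSketch(dim=1, lr=0.1)
x = opt.step([1.0], [2.0])  # t = 1, numerator scale 1
x = opt.step(x, [2.0])      # t = 2, numerator scale sqrt(2)
```

When the gradients become small, the accumulated sum stops growing while the numerator keeps increasing, which is the situation in which a cap such as `max_scale` would take effect.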

Although our method has certain drawbacks on this problem, it still performs almost as well as the other optimization methods: it reaches a prediction error below two percent, converges rapidly and has low computational cost. To some extent, our adaptive method can serve as an alternative for certain optimization problems.

6.2 Experiments on CIFAR-10

We then compare our method with other optimization methods on CIFAR-10, a dataset containing a training set of 50,000 color images and a test set of 10,000 color images. Each image belongs to one of ten classes and is labeled with the corresponding digit. Our goal is to minimize the prediction error of this ten-class classification problem by learning from the training set. Although the task is similar to that of the previous section, training on CIFAR-10 is far more difficult than on MNIST, since the color images have a more intricate local structure than handwritten digits. We therefore construct a deep convolutional neural network for this problem.

We construct a neural network with 3 convolutional layers, each followed by a max-pooling layer, and a fully connected layer with 128 hidden units. Due to the intrinsic complexity of this classification problem, we can only achieve approximately predictive accuracy after iterations over the entire dataset. However, this gives us a better opportunity to compare the performance of the various optimization methods, since the model is hard to converge. In the following experiments, we train the model with a minibatch size of 100 and run iterations so that we cover the training set approximately times. Our results are as follows:

It is clear from the figure that in the first 5 epochs Adam reduces the training loss faster than the other three methods. After that, our method reaches an average training loss similar to Adam’s, and it finally outperforms Adam and becomes the best of the four methods at reducing both the training loss and the prediction error. Furthermore, our method performs better than Adagrad and SGD with a fixed step size throughout training. Since our algorithm is adapted from Adagrad, we conclude that on this practical problem it improves on the original version and performs similarly to, or even better than, other powerful methods such as Adam in late epochs.

How can we improve the performance of AdaSqrt in the first few epochs? Although we have proved that the exponential coefficient is optimal in some sense, we can also use or another suitable real number as an alternative. These replacements can accelerate our algorithm at first but would result in high variance. A plausible approach is therefore to vary the exponential coefficient during training (large at first and small later). For lack of time, we leave the design of such algorithms to future work.

In conclusion, given our algorithm’s performance, we claim that the square root in the denominator of Adam, Adagrad and other adaptive methods is not necessary. To obtain an efficient first-order optimization algorithm, it suffices to let the algorithm take advantage of second-order information about the loss function; the absence of the square root can be compensated by proper scaling. Taking the square root does make those adaptive methods homogeneous, so that the parameter updates are invariant under scaling of the loss function, but we can overcome the non-homogeneity by tuning the learning rate manually, as is done for SGD and Nesterov. Furthermore, the learning rate in our algorithm is less likely to explode, since the denominator increases rapidly when some gradients are large, and we can also fix a as numerator when the model has almost converged.
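The homogeneity remark can be checked numerically. The sketch below multiplies every gradient by a constant, which is what scaling the loss function does, and compares the resulting update of Adagrad with that of a square-root-free variant. The variant's rule (a numerator of sqrt(t) over the plain sum of squared gradients) is an assumption made for illustration, and the gradient values are arbitrary.

```python
import math

def last_step(grads, use_sqrt, lr=0.1):
    """Size of the final update for a scalar parameter.

    use_sqrt=True : Adagrad, lr * g / sqrt(sum of g^2)
    use_sqrt=False: assumed variant, lr * sqrt(t) * g / (sum of g^2)
    """
    acc, step = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        acc += g * g
        if use_sqrt:
            step = lr * g / math.sqrt(acc)
        else:
            step = lr * math.sqrt(t) * g / acc
    return step

plain = [0.3, -0.8, 0.5]              # gradients of some loss f
scaled = [10.0 * g for g in plain]    # gradients of the scaled loss 10 * f

adagrad_plain = last_step(plain, use_sqrt=True)
adagrad_scaled = last_step(scaled, use_sqrt=True)   # same update as adagrad_plain
nosqrt_plain = last_step(plain, use_sqrt=False)
nosqrt_scaled = last_step(scaled, use_sqrt=False)   # ten times smaller than nosqrt_plain
```

Scaling the loss by c leaves the Adagrad update unchanged but divides the square-root-free update by c, so multiplying the learning rate by c restores the original behaviour. This is the manual tuning referred to above.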

7 Conclusions and Future Work

In this article, we focus on uncovering second-order information in first-order optimization algorithms. For momentum, we rigorously prove that Nesterov Accelerated Gradient exploits second-order information by using the difference of gradients during training, and hence is better than the original Momentum. For adaptive methods, we rigorously prove that Adagrad and Adam can be regarded as relaxations of Natural Gradient Descent, a well-known second-order technique in computational statistics, with only a slight difference in the square root of the denominator. Based on this observation, we design a new algorithm, AdaSqrt, which is comparable to Adam on MNIST and outperforms it on CIFAR-10. Our algorithm suggests that the traditional view of the importance of the square root might be completely wrong: with proper scaling, we can achieve even better performance.
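The link between gradient differences and second-order information can be seen on a small example. For a quadratic loss the difference between gradients at two points equals the Hessian applied to the displacement between them; for general smooth losses this holds only approximately. The matrix `A`, the vector `b` and the two points below are illustrative assumptions, not taken from the experiments.

```python
# Quadratic loss f(x) = 0.5 * x^T A x - b^T x, whose Hessian is A.
A = [[3.0, 1.0],
     [1.0, 2.0]]
b = [1.0, -1.0]

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def grad(x):
    """Gradient of the quadratic loss: A x - b."""
    return [ax_i - b_i for ax_i, b_i in zip(matvec(A, x), b)]

x_prev = [0.5, -0.2]   # previous iterate
x_curr = [0.4, 0.1]    # current iterate

displacement = [c - p for c, p in zip(x_curr, x_prev)]
grad_diff = [gc - gp for gc, gp in zip(grad(x_curr), grad(x_prev))]
hessian_times_disp = matvec(A, displacement)  # equals grad_diff for a quadratic
```

An optimizer that combines past and current gradients therefore has implicit access to a Hessian-vector product along its own trajectory, without ever forming the Hessian.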

We strongly believe that going beyond the gradient is a promising direction with huge practical value that deserves more attention in the future. By this we mean methods that do not compute the Hessian directly. To the best of our knowledge, there are at least four possible ways to achieve this goal. The first approach is to approximate the Hessian with technical tricks in matrix operations, which is exactly what the BFGS algorithm does. The second approach is to use the idea of Newton’s method or Natural Gradient Descent, which motivates the design of AdaSqrt in this paper. The third approach is to mimic the insight of Nesterov Accelerated Gradient, which predicts the location one step into the future by using past gradients. A longer-range prediction might also be made, at some cost in accuracy, and starting from a two-step prediction sounds plausible. In that case, the new algorithm would explore higher-order information, which would further accelerate training. Due to limited time, we leave the design of such an algorithm to future work. The last approach is probably the most difficult but also the most intriguing one: we encourage readers to uncover high-dimensional geometric properties of the energy landscape of deep neural networks. With a better understanding of the loss surface (for example, the loss surface might be a union of low-dimensional manifolds), we can design provable algorithms that have both faster training speed and better generalization performance.


The authors are partially supported by the elite undergraduate training program of the School of Mathematical Sciences at Peking University. They thank Huiyuan Wang for helpful discussions and Putian Li for suggestions on typesetting.