# First-order and second-order variants of the gradient descent: a unified framework

In this paper, we provide an overview of first-order and second-order variants of the gradient descent methods commonly used in machine learning. We propose a general framework in which 6 of these methods can be interpreted as different instances of the same approach. These methods are the vanilla gradient descent, the classical and generalized Gauss-Newton methods, the natural gradient descent method, the gradient covariance matrix approach, and Newton's method. Besides interpreting these methods within a single framework, we explain their specificities and show under which conditions some of them coincide.


## 1 Introduction

Machine learning generally amounts to solving an optimization problem where a loss function has to be minimized. As the problems tackled are getting more and more complex (nonlinear, nonconvex, etc.), fewer efficient algorithms exist, and the best recourse seems to rely on iterative schemes that exploit first-order or second-order derivatives of the loss function to get successive improvements and converge towards a local minimum. This explains why variants of gradient descent are becoming increasingly ubiquitous in machine learning and have been made widely available in the main deep learning libraries, being the tool of choice to optimize deep neural networks. Other types of local algorithms exist when no derivatives are known (Sigaud & Stulp, 2018), but in this paper we assume that some derivatives are available and only consider first-order gradient-based or second-order Hessian-based methods.

Among these methods, vanilla gradient descent strongly benefits from its computational efficiency as it simply computes a local gradient from a derivative at each step of an iterative process. Though it is widely used, it is limited for two main reasons: it depends on arbitrary parameterizations and it may diverge or converge very slowly if the step size is not properly tuned. To address these issues, several lines of improvement exist. Here, we focus on two of them. On the one hand, first-order methods such as the natural gradient introduce particular metrics to restrict gradient steps and make them independent from parameterization choices (Amari, 1998). On the other hand, second-order methods use the Hessian matrix of the loss or its approximations to take into account the local curvature of that loss.

Both types of approaches enhance the vanilla gradient descent update, multiplying it by the inverse of a particular matrix. We propose a simple framework that unifies these first-order or second-order improvements of the gradient descent, and use it to study precisely the similarities and differences between 6 such methods: vanilla gradient descent itself, the classical and generalized Gauss-Newton methods, the natural gradient descent method, the gradient covariance matrix approach, and Newton's method. The framework uses a first-order approximation of the loss and constrains the step with a quadratic norm. Therefore, each modification δθ of the vector of parameters θ is computed via an optimization problem of the following form:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top M(\theta)\,\delta\theta \le \epsilon^2, \tag{1}$$

where ∇θL(θ) is the gradient of the loss L, and M(θ) a symmetric positive-definite matrix. The 6 methods differ by the matrix M(θ), which has an effect not only on the size of the steps, but also on their direction, as illustrated in Figure 1.

The solution of the minimization problem (1) has the following form (see Appendix A):

$$\delta\theta = -\alpha\, M(\theta)^{-1}\, \nabla_\theta L(\theta), \quad\text{with } \alpha = \frac{\epsilon}{\sqrt{\nabla_\theta L(\theta)^\top M(\theta)^{-1} \nabla_\theta L(\theta)}}.$$
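As an illustration, the solution of problem (1) given in Appendix A can be sketched in a few lines of numpy. This is only a sketch, not part of the original derivation; the function name and interface are assumptions:

```python
import numpy as np

def framework_step(grad, M, eps):
    # Solve problem (1): delta = -alpha * M^{-1} grad,
    # with alpha = eps / sqrt(grad^T M^{-1} grad)  (see Appendix A).
    Minv_grad = np.linalg.solve(M, grad)   # avoids forming M^{-1} explicitly
    alpha = eps / np.sqrt(grad @ Minv_grad)
    return -alpha * Minv_grad
```

With M = I, this reduces to a normalized gradient step of Euclidean length ϵ, i.e. vanilla gradient descent with the step size fixed by the constraint.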

In Section 3, we show how the vanilla gradient descent method, the classical Gauss-Newton method and the natural gradient descent method fit into the proposed framework. It can be noted that these 3 approaches constrain the steps in a way that is independent from the loss function. In Section 4, we consider approaches that depend on the loss, namely the gradient covariance matrix approach, Newton's method and the generalized Gauss-Newton method, and show that they also fit into the framework. Table 1 summarizes the different values of M(θ) for all 6 approaches.

Providing a unifying view of several first-order and second-order variants of the gradient descent, the framework presented in this paper makes the connections between the different approaches more obvious, and can hopefully give new insights into these connections and help clarify parts of the literature. Finally, we believe that it can facilitate the selection between these methods when given a specific problem.

## 2 Problem statement and notations

We consider a general learning framework with parameterized objects hθ that are functions of inputs x. In this context, learning means optimizing a vector of parameters θ so as to minimize a loss function L(θ) estimated over a dataset of samples x. L(θ) is expressed as the expected value of a more atomic loss, l(x, hθ(x)), computed over individual samples:

 L(θ)=IEx[l(x,hθ(x))].

Typically, iterative learning algorithms estimate the expected value with an empirical mean over a batch of samples (x₁, …, x_N), so the loss actually used can be written (1/N) Σᵢ l(xᵢ, hθ(xᵢ)), the gradient of which is directly expressible from the gradients of l. In the remainder of the paper, we keep the expressions based on the expected value 𝔼x[·], knowing that at every iteration it is replaced by an empirical mean over a (new) batch of samples.

### Table of notations

| Notation | Description |
|---|---|
| x | a sample |
| x = (x₁, x₂) | used when samples have two distinct parts (e.g. an input x₁ and a label x₂) |
| dim(x) | dimension of the samples |
| θ | the vector of parameters |
| L(θ) | the scalar loss to minimize |
| hθ | parameterized function that takes samples as input, and on which the loss depends |
| l(x, hθ(x)) | the atomic loss of which L(θ) is the average over the samples |
| δθ | small update of θ computed at every iteration |
| ·ᵀ | transpose operator |
| ‖v‖ | Euclidean norm of the vector v: ‖v‖ = √(vᵀv) |
| J(x, θ) | when hθ(x) is a vector, Jacobian of the function θ ↦ hθ(x) |
| hθ,x | when hθ(x) is a probability density function, we rewrite it hθ,x |
| hθ,x₁ | with samples of the form x = (x₁, x₂), if hθ,x depends only on x₁, we rewrite it hθ,x₁ |
| 𝔼a∼hθ,x[·] | average value of the argument, when a follows the distribution defined by hθ,x |
| 𝔼x[·] | average value of the argument, assuming that x follows the true distribution over samples |
| ∇θ | gradient operator over θ (e.g. ∇θL(θ) is the gradient of L) |
| Ix(θ) | Fisher information matrix for the family of distributions hθ,x (with x fixed) |
| F(θ) | empirical Fisher matrix, F(θ) = 𝔼x[∇θ log hθ,x₁(x₂) ∇θ log hθ,x₁(x₂)ᵀ] |
| KL(p ‖ q) | Kullback-Leibler divergence between the probability distributions p and q |
| H(θ) | Hessian of the loss L, defined by [H(θ)]ᵢ,ⱼ = ∂²L/∂θᵢ∂θⱼ |
| H(x, hθ(x)) | Hessian of the function h ↦ l(x, h) in hθ(x) |

## 3 Vanilla, classical Gauss-Newton and natural gradient descent

All variants of the gradient descent are iterative numerical optimization methods: they start with a random initial vector of parameters θ and attempt to decrease the value of L(θ) over iterations by adding a small increment vector δθ to θ at each step. The core of all these algorithms is to determine the direction and magnitude of δθ. The definition of δθ is always based on local information, with no knowledge of the landscape of optima around, which is why such methods are never guaranteed to converge to a global optimum.

### 3.1 Vanilla gradient descent

The so-called “vanilla” gradient descent is a first-order method that relies on a Taylor approximation of the loss function of degree 1:

 L(θ+δθ)≃L(θ)+∇θL(θ)Tδθ. (2)

Therefore, at each iteration the objective becomes the minimization of ∇θL(θ)ᵀδθ. If the gradient is non-zero, the value of this term is unbounded below: it suffices for instance to set δθ = −β∇θL(θ) with β arbitrarily large. As a result, constraints are needed to avoid making excessively large steps. In vanilla approaches, the Euclidean metric (M(θ) = I) is used to bound the increments δθ. The optimization problem solved at every step of the scheme is:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top \delta\theta \le \epsilon^2, \tag{3}$$

where ϵ is a user-defined upper bound. This is indeed an instance of the general framework (1). As shown in Appendix A, the solution of this system is δθ = −α∇θL(θ), with α = ϵ/‖∇θL(θ)‖.

To set the size of the step, instead of tuning ϵ, the most common approach is to use the expression δθ = −η∇θL(θ) and directly tune η, which is called the learning rate. An interesting property of this approach is that, as θ gets closer to an optimum, the norm of the gradient decreases, so the ϵ corresponding to the fixed η decreases as well. This means that the steps tend to become smaller and smaller, which is a desirable property as far as asymptotic convergence is concerned.
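To make the scheme concrete, here is a minimal vanilla gradient descent loop on a toy quadratic loss L(θ) = ½‖θ‖², whose gradient is simply θ. The function name and the choice of loss are illustrative assumptions, not part of the paper:

```python
import numpy as np

def vanilla_gd(theta0, grad_fn, eta=0.1, n_iters=100):
    # delta = -eta * grad at every iteration; the step norm shrinks
    # as the gradient norm decreases near the optimum.
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        theta -= eta * grad_fn(theta)
    return theta

# toy quadratic: L(theta) = 0.5 * ||theta||^2, so grad_fn(theta) = theta
theta_final = vanilla_gd([5.0, -3.0], lambda th: th, eta=0.1, n_iters=200)
```

On this loss each iteration multiplies θ by (1 − η), so after 200 iterations theta_final is very close to the optimum at the origin.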

### 3.2 Classical Gauss-Newton

The atomic loss function depends on the object hθ(x), not directly on the parameters θ. This object is an output of hθ, which is a function parameterized arbitrarily in θ. It can for instance be a combination of polynomials, a multilayer perceptron, a radial basis function network, or any other type of parameterized function. For relatively similar functions represented in different ways, and thus with possibly very different values of θ, one vanilla gradient descent step can result in completely different modifications of the function. In fact, the constraint relies on the Euclidean distance to measure the parameter changes, so it acts as if all components of θ had the same importance, which is not necessarily the case depending on how hθ is parameterized. Some components of θ might have much smaller effects on hθ than others, and this is not taken into account by the vanilla gradient descent method, which typically performs badly with unbalanced parameterizations.

A way to make the updates independent from the parameterization is to measure and bound the effect of δθ on the object hθ itself. For instance, if hθ(x) is an object of finite dimension (i.e. a vector), we can simply bound the expected squared Euclidean distance between hθ(x) and hθ+δθ(x):

$$\mathbb{E}_x\!\left[\|h_{\theta+\delta\theta}(x) - h_\theta(x)\|^2\right] \le \epsilon^2.$$

Using again a first-order approximation, we have hθ+δθ(x) − hθ(x) ≃ J(x, θ)δθ, where J(x, θ) is the Jacobian of the function θ ↦ hθ(x). The constraint can be rewritten:

 IEx[∥J(x,θ)δθ∥2]=δθTIEx[J(x,θ)TJ(x,θ)]δθ≤ϵ2,

resulting in the optimization problem:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top\, \mathbb{E}_x\!\left[J(x,\theta)^\top J(x,\theta)\right] \delta\theta \le \epsilon^2, \tag{4}$$

which fits into the general framework (1) if the matrix M_CGN(θ) = 𝔼x[J(x, θ)ᵀJ(x, θ)] is symmetric positive-definite.

#### Damping.

The structure of the matrix M_CGN(θ) makes it symmetric and positive semi-definite, but not necessarily positive-definite. To ensure positive-definiteness, a regularization or damping term λI can be added, resulting in the constraint δθᵀ(M_CGN(θ) + λI)δθ ≤ ϵ², which can be rewritten:

 δθTMCGN(θ)δθ+λδθTδθ≤ϵ2.

We see that this kind of damping, often called Tikhonov damping (Martens & Sutskever, 2012), regularizes the constraint with a term proportional to the squared Euclidean norm of δθ. If a large value of λ is chosen (which also requires increasing ϵ), the method becomes similar to vanilla gradient descent.
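A minimal numpy sketch of a damped classical Gauss-Newton step, assuming per-sample Jacobians are available; the helper name and interface are hypothetical:

```python
import numpy as np

def cgn_step(grad, jacobians, lam=1e-3, eta=1.0):
    # M_CGN = E_x[J^T J], with Tikhonov damping lam * I
    M = np.mean([J.T @ J for J in jacobians], axis=0)
    M += lam * np.eye(len(grad))
    return -eta * np.linalg.solve(M, grad)
```

Consistently with the remark above, for large λ the matrix is dominated by λI and the direction approaches the vanilla gradient direction.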

#### The common but more specific definition of the classical Gauss-Newton method.

We can remark that the same step direction is obtained with a second-order approximation of the loss, when its expression has the form of a squared error: l(x, hθ(x)) = ½hθ(x)ᵀhθ(x).

Indeed, in that case:

$$\begin{aligned} l(x, h_{\theta+\delta\theta}(x)) &= \tfrac{1}{2}\big(h_\theta(x) + J(x,\theta)\,\delta\theta + O(\delta\theta^2)\big)^\top \big(h_\theta(x) + J(x,\theta)\,\delta\theta + O(\delta\theta^2)\big)\\ &= l(x, h_\theta(x)) + h_\theta(x)^\top J(x,\theta)\,\delta\theta + \tfrac{1}{2}\,\delta\theta^\top J(x,\theta)^\top J(x,\theta)\,\delta\theta + O(\delta\theta^3). \end{aligned}$$

J(x, θ)ᵀhθ(x) is the gradient of the atomic loss in θ, so the equality can be rewritten:

 l(x,hθ+δθ(x))=l(x,hθ(x))+∇θl(x,hθ(x))Tδθ+12δθTJ(x,θ)TJ(x,θ)δθ+O(δθ3).

For the loss L(θ), by averaging over the samples, we get:

 L(θ+δθ)=L(θ)+∇θL(θ)Tδθ+12δθTIEx[J(x,θ)TJ(x,θ)]δθ+O(δθ3).

Minimizing this second-order Taylor approximation is called the classical Gauss-Newton method (Bottou et al., 2018). Assuming that 𝔼x[J(x, θ)ᵀJ(x, θ)] is positive-definite, as shown in Appendix B the minimum is obtained for

$$\delta\theta = -\,\mathbb{E}_x\!\left[J(x,\theta)^\top J(x,\theta)\right]^{-1} \nabla_\theta L(\theta),$$

which is in the same direction as the step obtained from the optimization problem (4), without damping. The second-order approximation is the usual way to derive the classical Gauss-Newton method, but it works only for this particular type of squared loss function. Our approach based on the optimization problem (4) shows that the same step direction makes sense for any type of loss.

#### Learning rate.

As shown in Appendix A, the general framework (1) has a unique solution δθ = −αM(θ)⁻¹∇θL(θ), with α = ϵ/√(∇θL(θ)ᵀM(θ)⁻¹∇θL(θ)), and in the present case M(θ) = M_CGN(θ) + λI, or M(θ) = M_CGN(θ) if we ignore the damping. We have seen that the more common derivation yields δθ = −M_CGN(θ)⁻¹∇θL(θ), which is indeed in the same direction, but differs by the absence of the coefficient α given in Appendix A, a role played by the learning rate introduced in Section 3.1. This slight difference appears for several of the 6 methods presented in this paper as instances of the proposed framework. However, it is in fact not a very significant difference since in practice, the value of α is usually redefined separately.

For instance, in the common classical Gauss-Newton approach, despite the solution being exactly δθ = −M_CGN(θ)⁻¹∇θL(θ), a learning rate is often introduced to make smaller steps. Another example is when M(θ) is a very large matrix, in which case M(θ)⁻¹∇θL(θ) is often estimated via drastic approximations. If values of L(θ + δθ) can be evaluated with finer approximations, a line search can be used to find a value of α for which it is verified with more precision that the corresponding step size is reasonable. This line search is an important component of the popular reinforcement learning algorithm TRPO (Schulman et al., 2015).

Finally, with the proposed framework, the theoretical value of α leads to steps of constant size ϵ (when measured with the metric associated with M(θ)), and even though this may be interesting in the beginning of the gradient descent, convergence can only be obtained if the step size tends toward zero, which is why defining α without taking into account its theoretical value can be preferable.

### 3.3 Natural gradient

As it can be relevant for many applications to consider stochastic models, a case of significant importance is when the object hθ(x) is the probability density function of a continuous random variable, which we denote more conveniently by hθ,x. It is in this context that Amari proposed and popularized the notion of natural gradient (Amari, 1997, 1998). To bound the modification from hθ,x to hθ+δθ,x in the gradient step, this approach is based on a matrix called the Fisher information matrix, defined by:

 Ix(θ)=IEa∼hθ,x[∇θlog(hθ,x(a))∇θlog(hθ,x(a))T].

It can be used to measure a "distance" between two infinitesimally close probability distributions hθ,x and hθ+δθ,x as follows:

 dℓ2(hθ,x,hθ+δθ,x)=δθTIx(θ)δθ.

Averaging over the samples, we extrapolate a measure of distance between hθ and hθ+δθ:

$$\mathbb{E}_x\!\left[d\ell^2(h_{\theta,x}, h_{\theta+\delta\theta,x})\right] = \delta\theta^\top\, \mathbb{E}_x\!\left[I_x(\theta)\right] \delta\theta,$$

where 𝔼x[Ix(θ)] is the averaged Fisher information matrix.

Often, the samples can be divided into two parts: x = (x₁, x₂), and the probability density function is used to estimate the conditional probability of x₂ given x₁: p(x₂ | x₁). hθ,x depends only on x₁, so we rewrite it hθ,x₁, and hθ,x₁(x₂) is the estimate of p(x₂ | x₁). The goal of the learning algorithm is to make this estimation increasingly accurate. Classical examples include classification (x₁ is the input and x₂ the label) and regression analysis (x₁ is the independent variable and x₂ the dependent variable). In this context, it is common to approximate the mean over the distribution by the empirical mean over the samples, which reduces the above expression to

$$\delta\theta^\top\, \mathbb{E}_x\!\left[\nabla_\theta \log h_{\theta,x_1}(x_2)\, \nabla_\theta \log h_{\theta,x_1}(x_2)^\top\right] \delta\theta.$$

The matrix 𝔼x[∇θ log hθ,x₁(x₂) ∇θ log hθ,x₁(x₂)ᵀ] is called the empirical Fisher matrix (Martens, 2014). We denote it by F(θ). Putting an upper bound on δθᵀF(θ)δθ results in the following optimization problem:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top F(\theta)\,\delta\theta \le \epsilon^2, \tag{5}$$

which yields natural gradient steps of the form

$$\delta\theta = -\alpha\, F(\theta)^{-1}\, \nabla_\theta L(\theta),$$

provided that F(θ) is invertible. F(θ) is always positive semi-definite. Therefore, as in Section 3.2 with the classical Gauss-Newton approach, a damping term can be added to ensure invertibility. In some sense, the Fisher information matrix is defined uniquely by the property of invariance to reparameterization of the metric it induces (Čencov, 1982), and it can be obtained from many different derivations. But a particularly interesting fact is that δθᵀIx(θ)δθ corresponds to the second-order approximation of the Kullback-Leibler divergence KL(hθ,x ‖ hθ+δθ,x) (Kullback, 1997; Akimoto & Ollivier, 2013). Hence, the terms δθᵀIx(θ)δθ and δθᵀF(θ)δθ share some of the properties of the Kullback-Leibler divergence. For instance, when the variance of the probability distribution hθ,x decreases, the same parameter modification δθ tends to result in increasingly large measures (see Figure 2).

Consequently, if the bound ϵ² of Equation (5) is kept constant, the possible modifications of hθ become smaller when the overall variance of the outputs of hθ decreases. Thus the natural gradient iterations slow down when the variance becomes small, which is a desirable property when keeping some amount of variability is important. Typically, in the context of reinforcement learning, this variability can be related to exploration, and it should not vanish too early. This is one of the reasons why several reinforcement learning algorithms using stochastic policies benefit from the use of natural gradient steps (Peters & Schaal, 2008; Schulman et al., 2015; Wu et al., 2017).
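A natural gradient step built from the empirical Fisher matrix can be sketched as follows, assuming the per-sample score vectors ∇θ log hθ,x₁(x₂) have already been computed; the function name and interface are assumptions:

```python
import numpy as np

def natural_gradient_step(grad, scores, lam=1e-3, eta=1.0):
    # empirical Fisher: F = E_x[s s^T] over per-sample score vectors s,
    # damped by lam * I to ensure invertibility
    S = np.stack(scores)
    F = S.T @ S / len(S) + lam * np.eye(len(grad))
    return -eta * np.linalg.solve(F, grad)
```

In practice the damping λ matters: F is only positive semi-definite, and with few samples it is rank-deficient.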

#### Relation between natural gradient and classical Gauss-Newton approaches.

Let us consider a very simple case where hθ,x is a multivariate normal distribution with fixed covariance matrix β²I. The only variable parameter on which the distribution depends is its mean hθ(x), so we can use the mean as representation of the distribution itself and write hθ,x = N(hθ(x), β²I).

It can be shown that the Kullback-Leibler divergence between two normal distributions of equal variance and different means is proportional to the squared Euclidean distance between the means. More precisely, the Kullback-Leibler divergence between N(hθ(x), β²I) and N(hθ+δθ(x), β²I) is equal to (1/2β²)‖hθ+δθ(x) − hθ(x)‖². For small values of δθ, this expression is approximately equal to the measure obtained with the true Fisher information matrix:

 12β2∥hθ+δθ(x)−hθ(x)∥2≈δθTIx(θ)δθ.

Bounding the average over the samples of the right-hand term is the motivation of the natural gradient descent method. Besides, we have seen in Section 3.2 that the classical Gauss-Newton method can be considered as a way to bound 𝔼x[‖hθ+δθ(x) − hθ(x)‖²], which is equal to the average of the left-hand term over the samples, up to a multiplicative constant. Hence, even though both methods introduce slightly different approximations, we conclude that, in this context, the classical Gauss-Newton and natural gradient descent methods are very similar. This property is used in Pascanu & Bengio (2013) to perform natural gradient descent on deterministic neural networks, by interpreting their output as the mean of a conditional Gaussian distribution with a fixed variance.

## 4 Gradient covariance matrix, Newton’s method and generalized Gauss-Newton

For the computation of the direction of the parameter updates, the approaches seen in Section 3 all fit the general framework (1):

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top M(\theta)\,\delta\theta \le \epsilon^2,$$

with matrices M(θ) that do not depend on the loss function. But since the loss is typically based on quantities that are relevant for the task at hand, it can be a good idea to exploit it to constrain the steps. We now present three approaches that fit into the same framework but with matrices M(θ) that depend on the loss, namely the gradient covariance matrix method, Newton's method, and the generalized Gauss-Newton method.

### 4.1 Gradient covariance matrix

The simplest way to use the loss to measure the magnitude of a change due to parameter modifications is to consider the expected squared difference between l(x, hθ(x)) and l(x, hθ+δθ(x)):

 IEx[(l(x,hθ(x))−l(x,hθ+δθ(x)))2].

For a single sample x, slightly changing the object hθ(x) does not necessarily modify the loss l(x, hθ(x)), but in many cases it can be assumed that the loss becomes different for at least some samples, yielding a positive value of this expectation which quantifies in some sense the amount of change introduced by δθ with respect to the objective. It is often a meaningful measure as it usually depends on the most relevant features of hθ(x) for the task at hand. Let us replace l(x, hθ+δθ(x)) by a first-order approximation:

 l(x,hθ+δθ(x))≃l(x,hθ(x))+∇θl(x,hθ(x))Tδθ.

The above expectation simplifies to:

 IEx[(∇θl(x,hθ(x))Tδθ)2]=δθTIEx[∇θl(x,hθ(x))∇θl(x,hθ(x))T]δθ.

The matrix 𝔼x[∇θl(x, hθ(x))∇θl(x, hθ(x))ᵀ] is called the gradient covariance matrix (Bottou & Bousquet, 2008). It can also be called the outer product metric (Ollivier, 2015). Putting a bound on δθᵀ𝔼x[∇θl(x, hθ(x))∇θl(x, hθ(x))ᵀ]δθ, the iterated optimization becomes:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top\, \mathbb{E}_x\!\left[\nabla_\theta l(x, h_\theta(x))\, \nabla_\theta l(x, h_\theta(x))^\top\right] \delta\theta \le \epsilon^2, \tag{6}$$

It results in updates of the form:

 δθ=−αIEx[∇θl(x,hθ(x))∇θl(x,hθ(x))T]−1∇θL(θ).

Again, a regularization term may be added to ensure the invertibility of the matrix.
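A sketch of a step preconditioned by the damped gradient covariance matrix, assuming the per-sample loss gradients ∇θl(x, hθ(x)) are available; the helper name is hypothetical:

```python
import numpy as np

def grad_cov_step(grad, sample_grads, lam=1e-3, eta=1.0):
    # gradient covariance matrix: C = E_x[g_x g_x^T] over per-sample
    # loss gradients g_x, damped by lam * I
    G = np.stack(sample_grads)
    C = G.T @ G / len(G) + lam * np.eye(len(grad))
    return -eta * np.linalg.solve(C, grad)
```

Note that the computation is structurally the same outer-product construction as the empirical Fisher matrix of Section 3.3, which anticipates the identity discussed next.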

Let us assume, as in Section 3.3, that hθ(x) is a probability density function that aims to model the conditional probability of x₂ given x₁, with x = (x₁, x₂). As in Section 3.3, we denote it by hθ,x₁. In this context, it is very common for the atomic loss to be the estimation of the negative log-likelihood of x₂ given x₁:

 l(x,hθ(x))=−log(hθ,x1(x2)).

It follows that the empirical Fisher matrix, as defined in Section 3.3, is equal to 𝔼x[∇θl(x, hθ(x))∇θl(x, hθ(x))ᵀ], which is exactly the definition of the gradient covariance matrix, thus the two approaches are identical in this case. Several algorithms use this identity for natural gradient computation, e.g. George et al. (2018).

### 4.2 Newton’s method

Let us consider a second-order approximation of the loss:

 L(θ+δθ)≈L(θ)+∇θL(θ)Tδθ+12δθTH(θ)δθ,

where H(θ) is the Hessian matrix: [H(θ)]ᵢ,ⱼ = ∂²L/∂θᵢ∂θⱼ. One can argue that the first-order approximation L(θ) + ∇θL(θ)ᵀδθ (which is used as minimization objective for gradient descent) is most likely good as long as the second-order term ½δθᵀH(θ)δθ remains small. Therefore, it makes sense to directly put an upper bound on this quantity to restrict δθ, as follows:

 δθTH(θ)δθ≤ϵ2.

This bound can only define a trust region if the matrix H(θ) is symmetric positive-definite. However, H(θ) is symmetric but not even necessarily positive semi-definite, unlike the matrices obtained with the previous approaches. Therefore, the damping required to make it positive-definite may be larger than with other methods. It leads to the following optimization problem:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top \left(H(\theta) + \lambda I\right) \delta\theta \le \epsilon^2, \tag{7}$$

and to updates of the form:

 δθ=−α(H(θ)+λI)−1∇θL(θ).
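A damped Newton step is a single linear solve. The sketch below applies it to a toy quadratic loss L(θ) = ½θᵀAθ, for which one undamped step reaches the exact minimum; the names and the example loss are illustrative assumptions:

```python
import numpy as np

def newton_step(grad, hessian, lam=0.0, eta=1.0):
    # delta = -eta * (H + lam * I)^{-1} grad
    n = len(grad)
    return -eta * np.linalg.solve(hessian + lam * np.eye(n), grad)

# quadratic example: L(theta) = 0.5 * theta^T A theta, grad = A theta, H = A
A = np.array([[2.0, 0.0], [0.0, 0.5]])
theta = np.array([1.0, 4.0])
delta = newton_step(A @ theta, A)
```

Here theta + delta lands exactly at the minimum of the quadratic, at the origin, regardless of the conditioning of A.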

#### The more usual derivation of Newton’s method.

The same update direction is obtained by directly minimizing the damped second-order approximation:

 L(θ)+∇θL(θ)Tδθ+12δθT(H(θ)+λI)δθ.

When H(θ) + λI is symmetric positive-definite, as shown in Appendix B, the minimum of this expression is obtained for:

 δθ=−(H(θ)+λI)−1∇θL(θ).

### 4.3 Generalized Gauss-Newton

As L(θ) is equal to 𝔼x[l(x, hθ(x))], it actually does not depend directly on θ but on the outputs of hθ. Here we assume that these outputs are vectors of finite dimension. Posing δh = hθ+δθ(x) − hθ(x), a second-order Taylor expansion of l(x, hθ(x) + δh) can be written:

$$l(x, h_\theta(x) + \delta h) = l(x, h_\theta(x)) + \frac{\partial l}{\partial h}(x, h_\theta(x))^\top \delta h + \frac{1}{2}\,\delta h^\top H(x, h_\theta(x))\,\delta h + O(\delta h^3),$$

where H(x, hθ(x)) is the Hessian matrix of the atomic loss with respect to variations of its second argument, and ∂l/∂h(x, hθ(x)) is its gradient w.r.t. variations of this argument. Using the equality δh = J(x, θ)δθ + O(δθ²) (see the definition of J(x, θ) in Section 3.2), we get

$$l(x, h_{\theta+\delta\theta}(x)) = l(x, h_\theta(x)) + \frac{\partial l}{\partial h}(x, h_\theta(x))^\top J(x,\theta)\,\delta\theta + \frac{\partial l}{\partial h}(x, h_\theta(x))^\top O(\delta\theta^2) + \frac{1}{2}\,\delta\theta^\top J(x,\theta)^\top H(x, h_\theta(x))\, J(x,\theta)\,\delta\theta + O(\delta\theta^3).$$

The generalized Gauss-Newton approach is an approximation dropping the term ∂l/∂h(x, hθ(x))ᵀO(δθ²). Averaging over the samples yields:

$$L(\theta + \delta\theta) \approx L(\theta) + \mathbb{E}_x\!\left[\frac{\partial l}{\partial h}(x, h_\theta(x))^\top J(x,\theta)\right] \delta\theta + \frac{1}{2}\,\delta\theta^\top\, \mathbb{E}_x\!\left[J(x,\theta)^\top H(x, h_\theta(x))\, J(x,\theta)\right] \delta\theta.$$

Noticing that ∇θL(θ) = 𝔼x[J(x, θ)ᵀ ∂l/∂h(x, hθ(x))], it results in the following approximation:

 L(θ+δθ)≈L(θ)+∇θL(θ)Tδθ+12δθTIEx[J(x,θ)TH(x,hθ(x))J(x,θ)]δθ.

As for Newton's method, the usual way to derive the generalized Gauss-Newton method is to directly minimize this expression (see Martens (2014)), but we can also put a bound on the quantity δθᵀ𝔼x[J(x, θ)ᵀH(x, hθ(x))J(x, θ)]δθ so as to define a trust region for the validity of the first-order approximation, provided that the matrix 𝔼x[J(x, θ)ᵀH(x, hθ(x))J(x, θ)] is symmetric positive-definite. If the loss is convex in hθ(x) (which is often true), this matrix is at least positive semi-definite, so a small damping term suffices to make it positive-definite. If a non-negligible portion of the matrices H(x, hθ(x)) are full rank, the damping term may be added to H(x, hθ(x)) rather than to the full matrix. See Martens & Sutskever (2012) for an extensive discussion of the different options for damping and their benefits and drawbacks. With the damping on the full matrix, the iterated optimization problem becomes:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top \left(\mathbb{E}_x\!\left[J(x,\theta)^\top H(x, h_\theta(x))\, J(x,\theta)\right] + \lambda I\right) \delta\theta \le \epsilon^2, \tag{8}$$

resulting in updates of the form:

 δθ=−α(IEx[J(x,θ)TH(x,hθ(x))J(x,θ)]+λI)−1∇θL(θ).
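The corresponding computation can be sketched in numpy, assuming per-sample Jacobians J(x, θ) and atomic-loss Hessians H(x, hθ(x)) are available; the helper name and interface are assumptions:

```python
import numpy as np

def ggn_step(grad, jacobians, hessians, lam=1e-3, eta=1.0):
    # M_GGN = E_x[J^T H J], damped on the full matrix by lam * I
    M = np.mean([J.T @ H @ J for J, H in zip(jacobians, hessians)], axis=0)
    M += lam * np.eye(len(grad))
    return -eta * np.linalg.solve(M, grad)
```

With H(x, hθ(x)) = I for every sample, this reduces to the classical Gauss-Newton step of Section 3.2.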

## 5 Summary and conclusion

In Sections 3 and 4, we motivated and derived 6 different ways to compute parameter updates, which can all be interpreted as solving an optimization problem of the type:

$$\min_{\delta\theta}\; \nabla_\theta L(\theta)^\top \delta\theta \quad\text{subject to}\quad \delta\theta^\top M(\theta)\,\delta\theta \le \epsilon^2,$$

resulting in updates of the form:

$$\delta\theta = -\alpha\, M(\theta)^{-1}\, \nabla_\theta L(\theta),$$

M(θ) being symmetric positive-definite. The quadratic term of the inequality corresponds to a specific metric defined by M(θ), used to measure the magnitude of the modification induced by δθ. To evaluate this magnitude, the focus can simply be on the norm of δθ, or on the effect of δθ on the loss, or on the effect of δθ on the objects hθ(x), resulting in various approaches, with various definitions of M(θ). We gave 6 examples corresponding to popular variants of gradient descent, summarized in Table 1. Unifying several first-order and second-order variants of the gradient descent method enabled us to reveal links between these different approaches, and contexts in which some of them are equivalent. The proposed framework gives a new perspective on the common variants of gradient descent, and can hopefully help choosing adequately between them depending on the problem to solve. Perhaps it can also help designing new variants or combining existing ones to obtain new desired features.

## Appendix A Solution of the optimization problem (1)

The Lagrangian of the optimization problem (1) is

 L(δθ)=L(θ)+∇θL(θ)Tδθ+μ(δθTM(θ)δθ−ϵ2),

where the scalar μ is a Lagrange multiplier. An optimal increment δθ cancels the gradient of the Lagrangian w.r.t. δθ, which is equal to ∇θL(θ) + 2μM(θ)δθ. Since M(θ) is symmetric positive-definite, and therefore invertible, the unique solution is given by δθ = −(1/(2μ))M(θ)⁻¹∇θL(θ), which we rewrite as follows:

$$\delta\theta = -\alpha\, M(\theta)^{-1}\, \nabla_\theta L(\theta), \quad\text{with } \alpha = \frac{1}{2\mu}.$$

Plugging this expression in problem (1) yields:

$$\min_{\alpha}\; -\alpha\, \nabla_\theta L(\theta)^\top M(\theta)^{-1}\, \nabla_\theta L(\theta) \quad\text{subject to}\quad \alpha^2\, \nabla_\theta L(\theta)^\top M(\theta)^{-1}\, \nabla_\theta L(\theta) \le \epsilon^2,$$

and assuming that the gradient is non-zero, the optimum is reached for:

$$\alpha = \frac{\epsilon}{\sqrt{\nabla_\theta L(\theta)^\top M(\theta)^{-1}\, \nabla_\theta L(\theta)}}.$$
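This closed form can be checked numerically: with this α, the step saturates the constraint δθᵀM(θ)δθ = ϵ². A quick sketch with an arbitrary positive-definite matrix standing in for M(θ) and a random vector for the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
M = A @ A.T + np.eye(3)      # symmetric positive-definite
g = rng.standard_normal(3)   # plays the role of the (non-zero) gradient
eps = 0.05

Minv_g = np.linalg.solve(M, g)
alpha = eps / np.sqrt(g @ Minv_g)
delta = -alpha * Minv_g      # delta^T M delta = eps^2 by construction
```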

## Appendix B Minimization of a quadratic form

Let us consider a function f(δθ) = c + gᵀδθ + ½δθᵀM(θ)δθ, where c is a scalar, g a vector and M(θ) a symmetric positive-definite matrix. The gradient of f is:

 ∇δθf(δθ)=g+M(θ)δθ.

M(θ) being invertible, this gradient has a unique zero, which corresponds to the global minimum of f:

 δθ∗=−M(θ)−1g.
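A short numerical check of this minimum, with arbitrary illustrative values for M(θ) and g:

```python
import numpy as np

# f(delta) = c + g^T delta + 0.5 * delta^T M delta has gradient g + M delta,
# which vanishes at delta* = -M^{-1} g
M = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive-definite
g = np.array([1.0, -1.0])
delta_star = -np.linalg.solve(M, g)
```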