1 Introduction
Machine learning generally amounts to solving an optimization problem where a loss function has to be minimized. As the problems tackled become more and more complex (nonlinear, nonconvex, etc.), few efficient algorithms exist, and the best recourse seems to be iterative schemes that exploit first-order or second-order derivatives of the loss function to obtain successive improvements and converge towards a local minimum. This explains why variants of gradient descent are becoming increasingly ubiquitous in machine learning and have been made widely available in the main deep learning libraries, being the tool of choice to optimize deep neural networks. Other types of local algorithms exist when no derivatives are known (Sigaud & Stulp, 2018), but in this paper we assume that some derivatives are available and only consider first-order gradient-based or second-order Hessian-based methods.
Among these methods, vanilla gradient descent strongly benefits from its computational efficiency, as it simply computes a local gradient from a derivative at each step of an iterative process. Though it is widely used, it is limited for two main reasons: it depends on arbitrary parameterizations, and it may diverge or converge very slowly if the step size is not properly tuned. To address these issues, several lines of improvement exist. Here, we focus on two of them. On the one hand, first-order methods such as the natural gradient introduce particular metrics to restrict gradient steps and make them independent from parameterization choices (Amari, 1998). On the other hand, second-order methods use the Hessian matrix of the loss or its approximations to take into account the local curvature of that loss.
Both types of approaches enhance the vanilla gradient descent update by multiplying it by the inverse of a particular matrix. We propose a simple framework that unifies these first-order and second-order improvements of the gradient descent, and use it to study precisely the similarities and differences between 6 such methods: vanilla gradient descent itself, the classical and generalized Gauss-Newton methods, the natural gradient descent method, the gradient covariance matrix approach, and Newton's method. The framework uses a first-order approximation of the loss and constrains the step with a quadratic norm. Therefore, each modification $\delta\theta$ of the vector of parameters $\theta$ is computed via an optimization problem of the following form:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T M(\theta)\,\delta\theta \leq \epsilon^2, \tag{1}$$

where $\nabla L(\theta)$ is the gradient of the loss $L(\theta)$, and $M(\theta)$ a symmetric positive-definite matrix. The 6 methods differ by the matrix $M(\theta)$, which has an effect not only on the size of the steps, but also on their direction, as illustrated in Figure 1.
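As a concrete illustration (our own sketch, not part of the paper; the function name `framework_step` and the toy values are ours), the closed-form solution of problem (1) derived in Appendix A, $\delta\theta^* = -\eta\, M(\theta)^{-1}\nabla L(\theta)$ with $\eta = \epsilon / \sqrt{\nabla L(\theta)^T M(\theta)^{-1} \nabla L(\theta)}$, can be computed in a few lines:

```python
import numpy as np

def framework_step(grad, M, eps):
    """Solve min_dtheta grad^T dtheta  s.t.  dtheta^T M dtheta <= eps^2.

    Closed-form solution (Appendix A): dtheta = -eta * M^{-1} grad,
    with eta = eps / sqrt(grad^T M^{-1} grad).
    """
    Minv_grad = np.linalg.solve(M, grad)   # M^{-1} grad, without forming M^{-1}
    eta = eps / np.sqrt(grad @ Minv_grad)
    return -eta * Minv_grad

# With M = I the step reduces to a normalized vanilla gradient descent step.
grad = np.array([3.0, 4.0])
step = framework_step(grad, np.eye(2), eps=0.1)
```

With $M = I$ the step has Euclidean norm exactly $\epsilon$; other choices of $M(\theta)$ from Table 1 reshape both the length and the direction of the step.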
In Section 3, we show how the vanilla gradient descent method, the classical Gauss-Newton method and the natural gradient descent method fit into the proposed framework. It can be noted that these 3 approaches constrain the steps in a way that is independent from the loss function. In Section 4, we consider approaches that depend on the loss, namely the gradient covariance matrix approach, Newton's method and the generalized Gauss-Newton method, and show that they also fit into the framework. Table 1 summarizes the different values of $M(\theta)$ for all 6 approaches.
Table 1: the matrix $M(\theta)$ for each of the 6 approaches ($\lambda I$ denotes a damping term).

$M(\theta)$ | Corresponding approach
$I$ | vanilla gradient descent
$\mathbb{E}_s\left[J^T J\right] + \lambda I$ | classical Gauss-Newton
$\tilde{F}(\theta) + \lambda I$ | natural gradient (with empirical Fisher matrix)
$\mathbb{E}_s\left[\nabla_\theta \ell\, \nabla_\theta \ell^T\right] + \lambda I$ | gradient covariance matrix
$H(\theta) + \lambda I$ | Newton's method
$\mathbb{E}_s\left[J^T H_\ell J\right] + \lambda I$ | generalized Gauss-Newton
Providing a unifying view of several first-order and second-order variants of the gradient descent, the framework presented in this paper makes the connections between the different approaches more obvious, and can hopefully give new insights on these connections and help clarify some of the literature. Finally, we believe that it can facilitate the selection between these methods when facing a specific problem.
2 Problem statement and notations
We consider a general learning framework with parameterized objects $h_\theta$ that are functions of inputs $x$. In this context, learning means optimizing a vector of parameters $\theta$ so as to minimize a loss function $L(\theta)$ estimated over a dataset of samples $s$. $L(\theta)$ is expressed as the expected value of a more atomic loss $\ell(s, \theta)$ computed over individual samples:

$$L(\theta) = \mathbb{E}_s\left[\ell(s, \theta)\right].$$
Typically, iterative learning algorithms estimate the expected value with an empirical mean over a batch of samples $(s_1, \dots, s_N)$, so the loss actually used can be written $\frac{1}{N}\sum_{i=1}^{N} \ell(s_i, \theta)$, the gradient of which is directly expressible from the gradients of $\ell$. In the remainder of the paper, we keep the expressions based on the expected value $\mathbb{E}_s$, knowing that at every iteration it is replaced by an empirical mean over a (new) batch of samples.
Table of notations

Notation | Description
$s$ | a sample
$x$, $y$ | we use $x$ and $y$ when samples have two distinct parts (e.g. an input and a label): $s = (x, y)$
$n$ | dimension of the samples
$\theta$ | the vector of parameters
$L(\theta)$ | the scalar loss to minimize
$h_\theta$ | parameterized function that takes samples as input, and on which the loss depends
$\ell(s, \theta)$ | the atomic loss, of which $L(\theta)$ is the average over the samples: $L(\theta) = \mathbb{E}_s\left[\ell(s, \theta)\right]$
$\delta\theta$ | small update of $\theta$ computed at every iteration
$\cdot^T$ | transpose operator
$\|v\|$ | Euclidean norm of the vector $v$: $\|v\| = \sqrt{v^T v}$
$J$ | when $h_\theta$ is a vector, Jacobian of the function $\theta \mapsto h_\theta$
$p_\theta$ | when $h_\theta$ is a probability density function, we rewrite it $p_\theta$
$p_\theta(y|x)$ | with samples of the form $s = (x, y)$, if $p_\theta$ depends only on $y$, we rewrite it $p_\theta(y|x)$
$\mathbb{E}_{y \sim p_\theta(\cdot|x)}[\cdot]$ | average value of the argument, when $y$ follows the distribution defined by $p_\theta(\cdot|x)$
$\mathbb{E}_s[\cdot]$ | average value of the argument, assuming that $s$ follows the true distribution over samples
$\nabla_\theta$ | gradient operator (e.g. $\nabla_\theta L(\theta)$ is the gradient of $L$)
$I_x(\theta)$ | Fisher information matrix for the family of distributions $p_\theta(\cdot|x)$ (with $x$ fixed)
$\tilde{F}(\theta)$ | empirical Fisher matrix, $\tilde{F}(\theta) = \mathbb{E}_s\left[\nabla_\theta \log p_\theta(y|x)\, \nabla_\theta \log p_\theta(y|x)^T\right]$
$KL(p \,\|\, q)$ | Kullback-Leibler divergence between the probability distributions $p$ and $q$
$H(\theta)$ | Hessian of the loss $L$, defined by $[H(\theta)]_{i,j} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}(\theta)$
$H_\ell$ | Hessian of the atomic loss $\ell$ with respect to the output $h_\theta(x)$
3 Vanilla, classical Gauss-Newton and natural gradient descent
All variants of the gradient descent are iterative numerical optimization methods: they start with a random initial $\theta$ and attempt to decrease the value of $L(\theta)$ over iterations by adding a small increment vector $\delta\theta$ to $\theta$ at each step. The core of all these algorithms is to determine the direction and magnitude of $\delta\theta$. The definition of $\delta\theta$ is always based on local information, with no knowledge of the surrounding landscape of optima, which is why such methods are never guaranteed to converge to a global optimum.
3.1 Vanilla gradient descent
The so-called "vanilla" gradient descent is a first-order method that relies on a degree-1 Taylor approximation of the loss function:

$$L(\theta + \delta\theta) \approx L(\theta) + \nabla L(\theta)^T \delta\theta. \tag{2}$$

Therefore, at each iteration the objective becomes the minimization of $\nabla L(\theta)^T \delta\theta$. If the gradient is nonzero, the value of this term is unbounded below: it suffices for instance to set $\delta\theta = -\alpha \nabla L(\theta)$ with $\alpha$ arbitrarily large. As a result, constraints are needed to avoid making excessively large steps. In vanilla approaches, the Euclidean metric ($\|\delta\theta\| = \sqrt{\delta\theta^T \delta\theta}$) is used to bound the increments $\delta\theta$. The optimization problem solved at every step of the scheme is:

$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T \delta\theta \leq \epsilon^2, \tag{3}$$

where $\epsilon$ is a user-defined upper bound. This is indeed an instance of the general framework (1), with $M(\theta) = I$. As shown in Appendix A, the solution of this system is $\delta\theta = -\eta\, \nabla L(\theta)$, with $\eta = \frac{\epsilon}{\|\nabla L(\theta)\|}$.

To set the size of the step, instead of tuning $\epsilon$, the most common approach is to use the expression $\delta\theta = -\eta\, \nabla L(\theta)$ and directly tune $\eta$, which is called the learning rate. An interesting property of this approach is that, as $\theta$ gets closer to an optimum, the norm of the gradient decreases, so the $\epsilon$ corresponding to the fixed $\eta$ decreases as well. This means that the steps tend to become smaller and smaller, which is a desirable property as far as asymptotic convergence is concerned.
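The equivalence between the $\epsilon$-based formulation and the usual learning-rate formulation can be sketched as follows (a toy gradient of our own choosing, for illustration only):

```python
import numpy as np

def vanilla_step_eps(grad, eps):
    """Step with an upper bound eps on the Euclidean norm of the step."""
    return -(eps / np.linalg.norm(grad)) * grad

def vanilla_step_lr(grad, lr):
    """Usual formulation: step = -lr * grad, with lr the learning rate."""
    return -lr * grad

grad = np.array([3.0, 4.0])        # ||grad|| = 5
eps = 0.5
lr = eps / np.linalg.norm(grad)    # the learning rate implied by this eps
# Both formulations give the same step; conversely, with a fixed lr the
# implied eps shrinks as the gradient norm decreases near an optimum.
```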
3.2 Classical Gauss-Newton
The atomic loss function $\ell$ depends on the object $h_\theta$, not directly on the parameters $\theta$. This object is an output of the mapping $\theta \mapsto h_\theta$, which is a function parameterized arbitrarily in $\theta$. It can for instance be a combination of polynomials, a multilayer perceptron, a radial basis function network, or any other type of parameterized function. For relatively similar functions represented in different ways, and thus with possibly very different values of $\theta$, one vanilla gradient descent step can result in completely different modifications of the function. In fact, the constraint relies on the Euclidean distance to measure the parameter changes, so it acts as if all components of $\theta$ had the same importance, which is not necessarily the case depending on how $h_\theta$ is parameterized. Some components of $\theta$ might have much smaller effects on $h_\theta$ than others, and this is not taken into account by the vanilla gradient descent method. The method typically performs badly with unbalanced parameterizations.

A way to make the updates independent from the parameterization is to measure and bound the effect of $\delta\theta$ on the object $h_\theta$ itself. For instance, if $h_\theta(x)$ is an object of finite dimension (i.e. a vector), we can simply bound the expected squared Euclidean distance between $h_\theta(x)$ and $h_{\theta+\delta\theta}(x)$:

$$\mathbb{E}_s\left[\left\|h_{\theta+\delta\theta}(x) - h_\theta(x)\right\|^2\right] \leq \epsilon^2.$$

Using again a first-order approximation, we have $h_{\theta+\delta\theta}(x) - h_\theta(x) \approx J\,\delta\theta$, where $J$ is the Jacobian of the function $\theta \mapsto h_\theta(x)$. The constraint can be rewritten:

$$\delta\theta^T\, \mathbb{E}_s\left[J^T J\right] \delta\theta \leq \epsilon^2,$$

resulting in the optimization problem:

$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T\, \mathbb{E}_s\left[J^T J\right] \delta\theta \leq \epsilon^2, \tag{4}$$

which fits into the general framework (1) if the matrix $\mathbb{E}_s\left[J^T J\right]$ is symmetric positive-definite.
Damping.
The structure of the matrix $\mathbb{E}_s\left[J^T J\right]$ makes it symmetric and positive semidefinite, but not necessarily positive-definite. To ensure positive-definiteness, a regularization or damping term can be added, resulting in the constraint $\delta\theta^T \left(\mathbb{E}_s\left[J^T J\right] + \lambda I\right) \delta\theta \leq \epsilon^2$, which can be rewritten:

$$\delta\theta^T\, \mathbb{E}_s\left[J^T J\right] \delta\theta + \lambda \|\delta\theta\|^2 \leq \epsilon^2.$$

We see that this kind of damping, often called Tikhonov damping (Martens & Sutskever, 2012), regularizes the constraint with a term proportional to the squared Euclidean norm of $\delta\theta$. If a large value of $\lambda$ is chosen (which also requires increasing $\epsilon$), the method becomes similar to vanilla gradient descent.
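A minimal numerical sketch of this damping (the toy Jacobians and all names are ours, for illustration): adding $\lambda I$ to $\mathbb{E}_s\left[J^T J\right]$ shifts every eigenvalue up by $\lambda$, guaranteeing a symmetric positive-definite constraint matrix even when some Jacobian directions are degenerate:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_dim, out_dim, n_samples = 3, 2, 50

# Toy per-sample Jacobians J of h_theta(x) w.r.t. theta.
Js = rng.normal(size=(n_samples, out_dim, theta_dim))

# M = E_s[J^T J] + lambda * I  (Tikhonov damping).
lam = 1e-3
M = np.mean([J.T @ J for J in Js], axis=0) + lam * np.eye(theta_dim)

# E_s[J^T J] is positive semidefinite, so every eigenvalue of M is >= lam.
eigvals = np.linalg.eigvalsh(M)
```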
The common but more specific definition of the classical Gauss-Newton method.
We can remark that the same step direction is obtained with a second-order approximation of the loss, when its expression has the form of a squared error: $\ell(s, \theta) = \frac{1}{2}\left\|y - h_\theta(x)\right\|^2$.
Indeed, in that case:

$$\ell(s, \theta+\delta\theta) \approx \frac{1}{2}\left\|y - h_\theta(x) - J\,\delta\theta\right\|^2 = \ell(s, \theta) - (y - h_\theta(x))^T J\, \delta\theta + \frac{1}{2}\,\delta\theta^T J^T J\, \delta\theta.$$

$-J^T (y - h_\theta(x))$ is the gradient of the atomic loss in $\theta$, so the equality can be rewritten:

$$\ell(s, \theta+\delta\theta) \approx \ell(s, \theta) + \nabla_\theta \ell(s, \theta)^T \delta\theta + \frac{1}{2}\,\delta\theta^T J^T J\, \delta\theta.$$

For the loss $L(\theta)$, by averaging over the samples, we get:

$$L(\theta+\delta\theta) \approx L(\theta) + \nabla L(\theta)^T \delta\theta + \frac{1}{2}\,\delta\theta^T\, \mathbb{E}_s\left[J^T J\right] \delta\theta.$$

Minimizing this second-order Taylor approximation is called the classical Gauss-Newton method (Bottou et al., 2018). Assuming that $\mathbb{E}_s\left[J^T J\right]$ is positive-definite, as shown in Appendix B the minimum is obtained for

$$\delta\theta = -\mathbb{E}_s\left[J^T J\right]^{-1} \nabla L(\theta),$$

which is in the same direction as the step obtained from the optimization problem (4), without damping. The second-order approximation is the usual way to derive the classical Gauss-Newton method, but it works only for a particular type of squared loss function. Our approach based on the optimization problem (4) shows that the same step direction makes sense for any type of loss.
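A small worked example (our own toy least-squares setup): for a linear model with the squared-error loss, the second-order approximation above is exact, so a single undamped classical Gauss-Newton step lands on the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 3
X = rng.normal(size=(n, d))            # inputs; model h_theta(x) = x^T theta
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

theta = np.zeros(d)
# Atomic loss l = 0.5 * (y - x^T theta)^2, so J = x^T (a row vector) and:
residual = y - X @ theta
grad_L = -(X.T @ residual) / n         # gradient of the averaged loss
M = (X.T @ X) / n                      # E_s[J^T J]

theta_gn = theta - np.linalg.solve(M, grad_L)   # one Gauss-Newton step

# theta_gn coincides with the least-squares solution.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```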
Learning rate.
As shown in Appendix A, the general framework (1) has a unique solution $\delta\theta = -\eta\, M(\theta)^{-1} \nabla L(\theta)$, with $\eta = \frac{\epsilon}{\sqrt{\nabla L(\theta)^T M(\theta)^{-1} \nabla L(\theta)}}$, and in the present case $M(\theta) = \mathbb{E}_s\left[J^T J\right] + \lambda I$, or $M(\theta) = \mathbb{E}_s\left[J^T J\right]$ if we ignore the damping. We have seen that the more common approach yields $\delta\theta = -\mathbb{E}_s\left[J^T J\right]^{-1} \nabla L(\theta)$, which is indeed in the same direction, but differs by the absence of the coefficient $\eta$, which we referred to as the learning rate in Section 3.1. This slight difference appears for several of the 6 methods presented in this paper as instances of the proposed framework. However, it is in fact not a very significant difference since in practice, the value of $\eta$ is usually redefined separately.

For instance, in the common classical Gauss-Newton approach, despite the solution being exactly $\delta\theta = -\mathbb{E}_s\left[J^T J\right]^{-1} \nabla L(\theta)$, a learning rate $\eta$ is often introduced to make smaller steps. Another example is when $M(\theta)$ is a very large matrix, in which case $M(\theta)^{-1} \nabla L(\theta)$ is often estimated via drastic approximations. If values of $L(\theta)$ can be evaluated with finer approximations, a line search can be used to find a value of $\eta$ for which it is verified with more precision that the corresponding step size is reasonable. This line search is an important component of the popular reinforcement learning algorithm TRPO (Schulman et al., 2015).

Finally, with the proposed framework, the theoretical value of $\eta$ leads to steps of constant size (when measured with the metric associated to $M(\theta)$), and even though this may be interesting in the beginning of the gradient descent, convergence can only be obtained if the step size tends toward zero, which is why defining $\eta$ without taking into account its theoretical value can be preferable.
3.3 Natural gradient
As it can be relevant for many applications to consider stochastic models, a case of significant importance is when the object $h_\theta$ is the probability density function of a continuous random variable, which we denote more conveniently by $p_\theta$. It is in this context that Amari proposed and popularized the notion of natural gradient (Amari, 1997, 1998). To bound the modification from $p_\theta$ to $p_{\theta+\delta\theta}$ in the gradient step, this approach is based on a matrix called the Fisher information matrix, defined by:

$$I_x(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot|x)}\left[\nabla_\theta \log p_\theta(y|x)\, \nabla_\theta \log p_\theta(y|x)^T\right].$$

It can be used to measure a "distance" between two infinitesimally close probability distributions $p_\theta$ and $p_{\theta+\delta\theta}$ as follows:

$$d(p_\theta, p_{\theta+\delta\theta})^2 \approx \delta\theta^T I_x(\theta)\, \delta\theta.$$

Averaging over the samples, we extrapolate a measure of distance between the distributions induced by $\theta$ and $\theta + \delta\theta$:

$$\delta\theta^T\, \mathbb{E}_s\left[I_x(\theta)\right] \delta\theta,$$

where $\mathbb{E}_s\left[I_x(\theta)\right]$ is the averaged Fisher information matrix.
Often, the samples can be divided into two parts $s = (x, y)$, and the probability density function is used to estimate the conditional probability of $y$ given $x$. It depends only on $y$, so we rewrite it $p_\theta(y|x)$, which is the estimate of the true conditional density of $y$ given $x$. The goal of the learning algorithm is to make this estimation increasingly accurate. Classical examples include classification ($x$ is the input and $y$ the label) and regression analysis ($x$ is the independent variable and $y$ the dependent variable). In this context, it is common to approximate the mean $\mathbb{E}_{y \sim p_\theta(\cdot|x)}$ by the empirical mean over the samples, which reduces the above expression to

$$\delta\theta^T\, \mathbb{E}_s\left[\nabla_\theta \log p_\theta(y|x)\, \nabla_\theta \log p_\theta(y|x)^T\right] \delta\theta.$$

The term $\tilde{F}(\theta) = \mathbb{E}_s\left[\nabla_\theta \log p_\theta(y|x)\, \nabla_\theta \log p_\theta(y|x)^T\right]$ is called the empirical Fisher matrix (Martens, 2014). Putting an upper bound on $\delta\theta^T \tilde{F}(\theta)\, \delta\theta$ results in the following optimization problem:
$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T \tilde{F}(\theta)\, \delta\theta \leq \epsilon^2, \tag{5}$$
which yields natural gradient steps of the form

$$\delta\theta = -\eta\, \tilde{F}(\theta)^{-1} \nabla L(\theta),$$

provided that $\tilde{F}(\theta)$ is invertible. $\tilde{F}(\theta)$ is always positive semidefinite. Therefore, as in Section 3.2 with the classical Gauss-Newton approach, a damping term $\lambda I$ can be added to ensure invertibility. In some sense, the Fisher information matrix is defined uniquely by the property of invariance to reparameterization of the metric it induces (Čencov, 1982), and it can be obtained from many different derivations. But a particularly interesting fact is that $\delta\theta^T I_x(\theta)\, \delta\theta$ corresponds to the second-order approximation of $2\, KL\!\left(p_\theta(\cdot|x) \,\|\, p_{\theta+\delta\theta}(\cdot|x)\right)$ (Kullback, 1997; Akimoto & Ollivier, 2013). Hence, the terms $\delta\theta^T I_x(\theta)\, \delta\theta$ and $\delta\theta^T \tilde{F}(\theta)\, \delta\theta$ share some of the properties of the Kullback-Leibler divergence. For instance, when the variance of the probability distribution $p_\theta(\cdot|x)$ decreases, the same parameter modification $\delta\theta$ tends to result in increasingly large measures (see Figure 2). Consequently, if the bound $\epsilon^2$ of Equation (5) is kept constant, the possible modifications of $\theta$ become smaller when the overall variance of the outputs of $p_\theta$ decreases. Thus the natural gradient iterations slow down when the variance becomes small, which is a desirable property when keeping some amount of variability is important. Typically, in the context of reinforcement learning, this variability can be related to exploration, and it should not vanish too early. This is one of the reasons why several reinforcement learning algorithms using stochastic policies benefit from the use of natural gradient steps (Peters & Schaal, 2008; Schulman et al., 2015; Wu et al., 2017).
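A minimal sketch of a natural gradient step with the empirical Fisher matrix (the function name and the random score vectors are ours; in practice the scores would be $\nabla_\theta \log p_\theta(y|x)$ computed on a batch):

```python
import numpy as np

def natural_gradient_step(scores, grad_L, eta=0.1, lam=1e-3):
    """Natural gradient step using the (damped) empirical Fisher matrix.

    scores: array (n_samples, theta_dim) of per-sample gradients of
            log p_theta(y|x) w.r.t. theta.
    """
    F_emp = scores.T @ scores / scores.shape[0]      # empirical Fisher
    F_damped = F_emp + lam * np.eye(F_emp.shape[0])  # damping for invertibility
    return -eta * np.linalg.solve(F_damped, grad_L)

rng = np.random.default_rng(2)
scores = rng.normal(size=(100, 4))   # placeholder score vectors
grad_L = rng.normal(size=4)
step = natural_gradient_step(scores, grad_L)
```

Since the damped empirical Fisher is positive-definite, the resulting step is always a descent direction for the first-order model of the loss.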
Relation between natural gradient and classical Gauss-Newton approaches.
Let us consider a very simple case where $p_\theta(\cdot|x)$ is a multivariate normal distribution with fixed covariance matrix $\beta^2 I$. The only variable parameter on which the distribution depends is its mean $\mu_\theta(x)$, so we can use it as representation of the distribution itself and write $h_\theta(x) = \mu_\theta(x)$. It can be shown that the Kullback-Leibler divergence between two normal distributions of equal variance and different means is proportional to the squared Euclidean distance between the means. More precisely, the Kullback-Leibler divergence between $\mathcal{N}(\mu_1, \beta^2 I)$ and $\mathcal{N}(\mu_2, \beta^2 I)$ is equal to $\frac{1}{2\beta^2}\left\|\mu_2 - \mu_1\right\|^2$. For small values of $\delta\theta$, this expression is approximately equal to the measure obtained with the true Fisher information matrix:

$$\frac{1}{2\beta^2}\left\|\mu_{\theta+\delta\theta}(x) - \mu_\theta(x)\right\|^2 \approx \frac{1}{2}\,\delta\theta^T I_x(\theta)\, \delta\theta.$$

Bounding the average over the samples of the right-hand term is the motivation of the natural gradient descent method. Besides, we have seen in Section 3.2 that the classical Gauss-Newton method can be considered as a way to bound $\mathbb{E}_s\left[\left\|h_{\theta+\delta\theta}(x) - h_\theta(x)\right\|^2\right]$, which is equal to the average of the left-hand term over the samples, up to a multiplicative constant. Hence, even though both methods introduce slightly different approximations, we conclude that, in this context, the classical Gauss-Newton and natural gradient descent methods are very similar. This property is used in Pascanu & Bengio (2013) to perform natural gradient descent on deterministic neural networks, by interpreting their output as the mean of a conditional Gaussian distribution with a fixed variance.
4 Gradient covariance matrix, Newton's method and generalized Gauss-Newton
For the computation of the direction of the parameter updates, the approaches seen in Section 3 all fit the general framework (1):

$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T M(\theta)\, \delta\theta \leq \epsilon^2,$$

with matrices $M(\theta)$ that do not depend on the loss function. But since the loss is typically based on quantities that are relevant for the task to achieve, it can be a good idea to exploit it to constrain the steps. We now present three approaches that fit into the same framework but with matrices $M(\theta)$ that depend on the loss, namely the gradient covariance matrix method, Newton's method, and the generalized Gauss-Newton method.
4.1 Gradient covariance matrix
The simplest way to use the loss to measure the magnitude of a change due to parameter modifications is to consider the expected squared difference between $\ell(s, \theta)$ and $\ell(s, \theta+\delta\theta)$:

$$\mathbb{E}_s\left[\left(\ell(s, \theta+\delta\theta) - \ell(s, \theta)\right)^2\right].$$

For a single sample $s$, changing slightly the object $h_\theta$ does not necessarily modify the loss $\ell(s, \theta)$, but in many cases it can be assumed that the loss becomes different for at least some samples, yielding a positive value that quantifies in some sense the amount of change introduced by $\delta\theta$ with respect to the objective. It is often a meaningful measure, as it usually depends on the most relevant features of $h_\theta$ for the task at hand. Let us replace $\ell(s, \theta+\delta\theta)$ by a first-order approximation:

$$\ell(s, \theta+\delta\theta) \approx \ell(s, \theta) + \nabla_\theta \ell(s, \theta)^T \delta\theta.$$

The above expectation simplifies to:

$$\delta\theta^T\, \mathbb{E}_s\left[\nabla_\theta \ell(s, \theta)\, \nabla_\theta \ell(s, \theta)^T\right] \delta\theta.$$

The term $\mathbb{E}_s\left[\nabla_\theta \ell(s, \theta)\, \nabla_\theta \ell(s, \theta)^T\right]$ is called the gradient covariance matrix (Bottou & Bousquet, 2008). It can also be called the outer product metric (Ollivier, 2015). Putting a bound on $\delta\theta^T\, \mathbb{E}_s\left[\nabla_\theta \ell\, \nabla_\theta \ell^T\right] \delta\theta$, the iterated optimization becomes:
$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T\, \mathbb{E}_s\left[\nabla_\theta \ell\, \nabla_\theta \ell^T\right] \delta\theta \leq \epsilon^2. \tag{6}$$
It results in updates of the form:

$$\delta\theta = -\eta\, \mathbb{E}_s\left[\nabla_\theta \ell\, \nabla_\theta \ell^T\right]^{-1} \nabla L(\theta).$$
Again, a regularization term may be added to ensure the invertibility of the matrix.
Link with the natural gradient.
Let us assume, as in Section 3.3, that $h_\theta$ is a probability density function that aims to model the conditional probability of $y$ given $x$, with $s = (x, y)$. As in Section 3.3, we denote it by $p_\theta(y|x)$. In this context, it is very common for the atomic loss to be the estimation of the negative log-likelihood of $y$ given $x$:

$$\ell(s, \theta) = -\log p_\theta(y|x).$$

It follows that the empirical Fisher matrix, as defined in Section 3.3, is equal to $\mathbb{E}_s\left[\nabla_\theta \ell\, \nabla_\theta \ell^T\right]$, which is exactly the definition of the gradient covariance matrix; thus the two approaches are identical in this case. Several algorithms use this identity for natural gradient computation, e.g. George et al. (2018).
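This identity is easy to check numerically on a toy model (our own construction: a Gaussian conditional model $p_\theta(y|x) = \mathcal{N}(y;\, \theta^T x,\, 1)$, for which the score vectors have a closed form):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: p_theta(y|x) = N(y; theta^T x, 1), so
# log p = -0.5 * (y - theta^T x)^2 + const, and the score is (y - theta^T x) * x.
theta = np.array([0.5, -1.0])
X = rng.normal(size=(200, 2))
y = rng.normal(size=200)

residual = y - X @ theta
scores = residual[:, None] * X        # d/dtheta log p_theta(y|x), per sample
loss_grads = -scores                  # d/dtheta (-log p_theta(y|x)), per sample

F_emp = scores.T @ scores / len(y)    # empirical Fisher matrix
C = loss_grads.T @ loss_grads / len(y)  # gradient covariance matrix
```

The sign flip between the score and the loss gradient cancels in the outer product, so the two matrices coincide exactly.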
4.2 Newton’s method
Let us consider a second-order approximation of the loss:

$$L(\theta+\delta\theta) \approx L(\theta) + \nabla L(\theta)^T \delta\theta + \frac{1}{2}\,\delta\theta^T H(\theta)\, \delta\theta,$$

where $H(\theta)$ is the Hessian matrix: $[H(\theta)]_{i,j} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}(\theta)$. One can argue that the first-order approximation, i.e. $L(\theta) + \nabla L(\theta)^T \delta\theta$ (which is used as minimization objective for gradient descent), is most likely good as long as the second-order term $\frac{1}{2}\,\delta\theta^T H(\theta)\, \delta\theta$ remains small. Therefore, it makes sense to directly put an upper bound on this quantity to restrict $\delta\theta$, as follows:

$$\frac{1}{2}\,\delta\theta^T H(\theta)\, \delta\theta \leq \epsilon^2.$$

This bound can only define a trust region if the matrix $H(\theta)$ is symmetric positive-definite. However, $H(\theta)$ is symmetric but not even necessarily positive semidefinite, unlike the matrices obtained with the previous approaches. Therefore, the damping term $\lambda I$ required to make it positive-definite may be larger than with other methods. It leads to the following optimization problem:

$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T \left(H(\theta) + \lambda I\right) \delta\theta \leq \epsilon^2, \tag{7}$$

and to updates of the form:

$$\delta\theta = -\eta \left(H(\theta) + \lambda I\right)^{-1} \nabla L(\theta).$$
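The need for damping is easy to see on a toy indefinite Hessian (our own 2×2 example): the raw Newton step can point uphill, while the damped step is a descent direction for the first-order model:

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -0.5]])           # indefinite Hessian (one negative eigenvalue)
grad = np.array([1.0, 1.0])

naive_step = -np.linalg.solve(H, grad)    # raw Newton step: here an ascent direction

lam = 1.0                                  # damping exceeding |most negative eigenvalue|
step = -np.linalg.solve(H + lam * np.eye(2), grad)

# grad @ step < 0 when H + lam*I is positive-definite (descent direction),
# whereas grad @ naive_step > 0 in this example.
```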
The more usual derivation of Newton’s method.
The same update direction is obtained by directly minimizing the damped second-order approximation:

$$L(\theta) + \nabla L(\theta)^T \delta\theta + \frac{1}{2}\,\delta\theta^T \left(H(\theta) + \lambda I\right) \delta\theta.$$

When $H(\theta) + \lambda I$ is symmetric positive-definite, as shown in Appendix B, the minimum of this expression is obtained for:

$$\delta\theta = -\left(H(\theta) + \lambda I\right)^{-1} \nabla L(\theta).$$
4.3 Generalized Gauss-Newton
As $\ell(s, \theta)$ is equal to $\ell(s, h_\theta(x))$, it actually does not depend directly on $\theta$ but on the output of $h_\theta$. Here we assume that these outputs are vectors of finite dimension. Posing $\Delta h = h_{\theta+\delta\theta}(x) - h_\theta(x)$, a second-order Taylor expansion of $\ell$ can be written:

$$\ell(s, \theta+\delta\theta) \approx \ell(s, \theta) + \nabla_h \ell^T \Delta h + \frac{1}{2}\,\Delta h^T H_\ell\, \Delta h,$$

where $H_\ell$ is the Hessian matrix of the atomic loss $\ell$ with respect to variations of $h_\theta(x)$, and $\nabla_h \ell$ is the gradient of $\ell$ w.r.t. variations of $h_\theta(x)$. Using the equality $\Delta h \approx J\,\delta\theta$ (see the definition of $J$ in Section 3.2), we get

$$\ell(s, \theta+\delta\theta) \approx \ell(s, \theta) + \nabla_h \ell^T J\,\delta\theta + \frac{1}{2}\,\delta\theta^T J^T H_\ell J\,\delta\theta + R(\delta\theta),$$

where $R(\delta\theta)$ gathers the contribution of the second derivatives of $h_\theta(x)$ w.r.t. $\theta$. The generalized Gauss-Newton approach is an approximation dropping the term $R(\delta\theta)$. Averaging over the samples yields:

$$L(\theta+\delta\theta) \approx L(\theta) + \mathbb{E}_s\left[\nabla_h \ell^T J\right] \delta\theta + \frac{1}{2}\,\delta\theta^T\, \mathbb{E}_s\left[J^T H_\ell J\right] \delta\theta.$$

Noticing that $\mathbb{E}_s\left[J^T \nabla_h \ell\right] = \nabla L(\theta)$, it results in the following approximation:

$$L(\theta+\delta\theta) \approx L(\theta) + \nabla L(\theta)^T \delta\theta + \frac{1}{2}\,\delta\theta^T\, \mathbb{E}_s\left[J^T H_\ell J\right] \delta\theta.$$

As for Newton's method, the usual way to derive the generalized Gauss-Newton method is to directly minimize this expression (see Martens (2014)), but we can also put a bound on the quantity $\delta\theta^T\, \mathbb{E}_s\left[J^T H_\ell J\right] \delta\theta$ so as to define a trust region for the validity of the first-order approximation, provided that $\mathbb{E}_s\left[J^T H_\ell J\right]$ is symmetric positive-definite. If the loss $\ell$ is convex in $h_\theta(x)$ (which is often true), the matrix $H_\ell$ is at least positive semidefinite, so a small damping term suffices to make the full matrix positive-definite. If a non-negligible portion of the matrices $H_\ell$ are full rank, the damping term may be added to $H_\ell$ rather than to the full matrix. See Martens & Sutskever (2012) for an extensive discussion on different options for damping and their benefits and drawbacks. With the damping on the full matrix, the iterated optimization problem becomes:

$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T \left(\mathbb{E}_s\left[J^T H_\ell J\right] + \lambda I\right) \delta\theta \leq \epsilon^2, \tag{8}$$

resulting in updates of the form:

$$\delta\theta = -\eta \left(\mathbb{E}_s\left[J^T H_\ell J\right] + \lambda I\right)^{-1} \nabla L(\theta).$$
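A sketch of assembling the generalized Gauss-Newton constraint matrix (all names and toy data are ours; we draw random PSD matrices $H_\ell = B B^T$ to mimic the Hessian of a convex atomic loss):

```python
import numpy as np

def generalized_gauss_newton_matrix(Js, Hls, lam=1e-3):
    """Assemble E_s[J^T H_l J] + lam*I from per-sample Jacobians Js and
    per-sample Hessians Hls of the atomic loss w.r.t. the model output."""
    theta_dim = Js[0].shape[1]
    M = np.mean([J.T @ Hl @ J for J, Hl in zip(Js, Hls)], axis=0)
    return M + lam * np.eye(theta_dim)

rng = np.random.default_rng(4)
n, out_dim, theta_dim = 30, 2, 3
Js = rng.normal(size=(n, out_dim, theta_dim))
# Convex atomic losses give PSD Hessians; build random PSD matrices B B^T.
Bs = rng.normal(size=(n, out_dim, out_dim))
Hls = np.array([B @ B.T for B in Bs])

M = generalized_gauss_newton_matrix(Js, Hls)
step = -np.linalg.solve(M, rng.normal(size=theta_dim))  # GGN update direction
```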
5 Summary and conclusion
In Sections 3 and 4, we motivated and derived 6 different ways to compute parameter updates, which can all be interpreted as solving an optimization problem of the type:

$$\operatorname{argmin}_{\delta\theta} \nabla L(\theta)^T \delta\theta \quad \text{s.t.} \quad \delta\theta^T M(\theta)\, \delta\theta \leq \epsilon^2,$$

resulting in updates of the form:

$$\delta\theta = -\eta\, M(\theta)^{-1} \nabla L(\theta),$$

$M(\theta)$ being symmetric positive-definite. The quadratic term of the inequality corresponds to a specific metric defined by $M(\theta)$, used to measure the magnitude of the modification induced by $\delta\theta$. To evaluate this magnitude, the focus can simply be on the norm of $\delta\theta$, or on the effect of $\delta\theta$ on the loss, or on the effect of $\delta\theta$ on the objects $h_\theta$, resulting in various approaches, with various definitions of $M(\theta)$. We gave 6 examples corresponding to popular variants of gradient descent, summarized in Table 1. Unifying several first-order and second-order variants of the gradient descent method enabled us to reveal links between these different approaches, and contexts in which some of them are equivalent. The proposed framework gives a new perspective on the common variants of gradient descent, and can hopefully help to choose adequately between them depending on the problem to solve. Perhaps it can also help designing new variants, or combining existing ones to obtain new desired features.
References

Akimoto & Ollivier (2013) Youhei Akimoto and Yann Ollivier. Objective improvement in information-geometric optimization. In Proceedings of the Twelfth Workshop on Foundations of Genetic Algorithms XII, pp. 1–10. ACM, 2013.
Amari (1997) Shun-ichi Amari. Neural learning in structured parameter spaces - natural Riemannian gradient. In Advances in Neural Information Processing Systems, pp. 127–133, 1997.
Amari (1998) Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Bottou & Bousquet (2008) Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pp. 161–168, 2008.
Bottou et al. (2018) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
 Čencov (1982) N. N. Čencov. Statistical decision rules and optimal inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, 1982. Translation from the Russian edited by Lev J. Leifman.
George et al. (2018) Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. arXiv preprint arXiv:1806.03884, 2018.
 Kullback (1997) Solomon Kullback. Information theory and statistics. Dover Publications Inc., Mineola, NY, 1997. ISBN 0486696847. Reprint of the second (1968) edition.
 Martens (2014) James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
Martens & Sutskever (2012) James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pp. 479–535. Springer, 2012.
 Ollivier (2015) Yann Ollivier. Riemannian metrics for neural networks I: Feedforward networks. Information and Inference, 4(2):108–153, 2015.
 Pascanu & Bengio (2013) Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
Peters & Schaal (2008) Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7):1180–1190, 2008.
 Schulman et al. (2015) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
 Sigaud & Stulp (2018) Olivier Sigaud and Freek Stulp. Policy search in continuous action domains: an overview. arXiv preprint arXiv:1803.04706, 2018.
Wu et al. (2017) Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5279–5288, 2017.
Appendix A Solution of the optimization problem (1)
The Lagrangian of the optimization problem (1) is

$$\mathcal{L}(\delta\theta, \mu) = \nabla L(\theta)^T \delta\theta + \mu \left(\delta\theta^T M(\theta)\, \delta\theta - \epsilon^2\right),$$

where the scalar $\mu$ is a Lagrange multiplier. An optimal increment $\delta\theta^*$ anneals the gradient of the Lagrangian w.r.t. $\delta\theta$, which is equal to $\nabla L(\theta) + 2\mu\, M(\theta)\, \delta\theta^*$. Since $M(\theta)$ is symmetric positive-definite, and therefore invertible, the unique solution is given by $\delta\theta^* = -\frac{1}{2\mu} M(\theta)^{-1} \nabla L(\theta)$, which we rewrite as follows:

$$\delta\theta^* = -\eta\, M(\theta)^{-1} \nabla L(\theta).$$

Plugging this expression into problem (1) yields:

$$\operatorname{argmin}_{\eta}\; -\eta\, \nabla L(\theta)^T M(\theta)^{-1} \nabla L(\theta) \quad \text{s.t.} \quad \eta^2\, \nabla L(\theta)^T M(\theta)^{-1} \nabla L(\theta) \leq \epsilon^2,$$

and assuming that the gradient $\nabla L(\theta)$ is nonzero, the optimum is reached for:

$$\eta = \frac{\epsilon}{\sqrt{\nabla L(\theta)^T M(\theta)^{-1} \nabla L(\theta)}}.$$
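The closed form can be sanity-checked numerically (our own toy instance): the solution saturates the constraint, and achieves a lower objective than every other point tried on the constraint boundary:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
A = rng.normal(size=(d, d))
M = A @ A.T + np.eye(d)                 # random symmetric positive-definite matrix
grad = rng.normal(size=d)
eps = 0.2

Minv_grad = np.linalg.solve(M, grad)
eta = eps / np.sqrt(grad @ Minv_grad)
dtheta_star = -eta * Minv_grad          # closed-form solution of problem (1)

constraint = dtheta_star @ M @ dtheta_star   # saturates the bound: == eps**2
obj_star = grad @ dtheta_star

# Random points on the constraint boundary never beat the closed-form optimum.
for _ in range(100):
    v = rng.normal(size=d)
    v = eps * v / np.sqrt(v @ M @ v)
    assert grad @ v >= obj_star - 1e-9
```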
Appendix B Minimization of a quadratic form
Let us consider a function $f(x) = c + b^T x + \frac{1}{2}\, x^T A x$, where $c$ is a scalar, $b$ a vector, and $A$ a symmetric positive-definite matrix. The gradient of $f$ is:

$$\nabla f(x) = b + A x.$$

$A$ being invertible, this gradient has a unique zero, which corresponds to the global minimum of $f$:

$$x^* = -A^{-1} b.$$
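A quick numerical check of this minimizer (toy values of $A$, $b$, $c$ chosen by us):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # symmetric positive-definite
b = np.array([1.0, -1.0])
c = 3.0

def f(x):
    return c + b @ x + 0.5 * x @ A @ x

x_star = -np.linalg.solve(A, b)       # the unique minimum of f

# f is strictly larger at nearby perturbed points, since A is positive-definite.
for dx in (np.array([0.1, 0.0]), np.array([0.0, -0.1]), np.array([0.05, 0.05])):
    assert f(x_star + dx) > f(x_star)
```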