Machine learning generally amounts to solving an optimization problem where a loss function has to be minimized. As the problems tackled become more and more complex (nonlinear, nonconvex, etc.), fewer efficient algorithms exist, and the best recourse seems to rely on iterative schemes that exploit first-order or second-order derivatives of the loss function to get successive improvements and converge towards a local minimum. This explains why variants of gradient descent are becoming increasingly ubiquitous in machine learning and have been made widely available in the main deep learning libraries, being the tool of choice to optimize deep neural networks. Other types of local algorithms exist when no derivatives are known (Sigaud & Stulp, 2018), but in this paper we assume that some derivatives are available and only consider first-order gradient-based or second-order Hessian-based methods.
Among these methods, vanilla gradient descent strongly benefits from its computational efficiency as it simply computes a local gradient from a derivative at each step of an iterative process. Though it is widely used, it is limited for two main reasons: it depends on arbitrary parameterizations and it may diverge or converge very slowly if the step size is not properly tuned. To address these issues, several lines of improvement exist. Here, we focus on two of them. On the one hand, first-order methods such as the natural gradient introduce particular metrics to restrict gradient steps and make them independent from parameterization choices (Amari, 1998). On the other hand, second-order methods use the Hessian matrix of the loss or its approximations to take into account the local curvature of that loss.
Both types of approaches enhance the vanilla gradient descent update by multiplying it by the inverse of a particular matrix. We propose a simple framework that unifies these first-order and second-order improvements of the gradient descent, and use it to study precisely the similarities and differences between 6 such methods: vanilla gradient descent itself, the classical and generalized Gauss-Newton methods, the natural gradient descent method, the gradient covariance matrix approach, and Newton's method. The framework uses a first-order approximation of the loss and constrains the step with a quadratic norm. Therefore, each modification $\delta\theta$ of the vector of parameters $\theta$ is computed via an optimization problem of the following form:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta}\ \nabla_\theta L(\theta)^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top M(\theta)\, \delta\theta \leq \epsilon^2, \qquad (1)$$
where $\nabla_\theta L(\theta)$ is the gradient of the loss $L$, and $M(\theta)$ a symmetric positive-definite matrix. The 6 methods differ by the matrix $M(\theta)$, which has an effect not only on the size of the steps, but also on their direction, as illustrated in Figure 1.
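As a concrete illustration, the constrained problem above has a closed-form solution (derived in Appendix A). The following is a minimal NumPy sketch; the function name `constrained_step` is ours:

```python
import numpy as np

def constrained_step(grad, M, eps=0.1):
    """Solve: min_d grad^T d  subject to  d^T M d <= eps^2,
    with M symmetric positive-definite. Closed form (see Appendix A):
        d* = -(eps / sqrt(grad^T M^{-1} grad)) * M^{-1} grad
    """
    Minv_g = np.linalg.solve(M, grad)      # M^{-1} grad
    alpha = eps / np.sqrt(grad @ Minv_g)   # step-size coefficient
    return -alpha * Minv_g

# With M = I, the step is a normalized gradient step of Euclidean length eps.
g = np.array([3.0, 4.0])
d = constrained_step(g, np.eye(2), eps=0.1)
```

Each of the 6 methods below corresponds to a different choice of `M` in this sketch.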
In Section 3, we show how the vanilla gradient descent method, the classical Gauss-Newton method and the natural gradient descent method fit into the proposed framework. It can be noted that these 3 approaches constrain the steps in a way that is independent from the loss function. In Section 4, we consider approaches that depend on the loss, namely the gradient covariance matrix approach, Newton's method and the generalized Gauss-Newton method, and show that they also fit into the framework. Table 1 summarizes the values of $M(\theta)$ for all 6 approaches.
|Method|$M(\theta)$ (without damping)|
|vanilla gradient descent|$I$|
|classical Gauss-Newton|$\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right]$|
|natural gradient (with empirical Fisher matrix)|$F(\theta) = \mathbb{E}_s\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^\top\right]$|
|gradient covariance matrix|$\mathbb{E}_s\!\left[\nabla_\theta \ell(\theta, s)\, \nabla_\theta \ell(\theta, s)^\top\right]$|
|Newton's method|$H(\theta)$|
|generalized Gauss-Newton|$\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\right]$|
Providing a unifying view of several first-order and second-order variants of the gradient descent, the framework presented in this paper makes the connections between the different approaches more obvious, and can hopefully give new insights on these connections and help clarify some of the literature. Finally, we believe that it can facilitate the selection between these methods when given a specific problem.
2 Problem statement and notations
We consider a general learning framework with parameterized objects $f_\theta$ that are functions of input samples $s$. In this context, learning means optimizing a vector of parameters $\theta \in \mathbb{R}^d$ so as to minimize a loss function $L(\theta)$ estimated over a dataset of samples $s$. $L(\theta)$ is expressed as the expected value of a more atomic loss $\ell(\theta, s)$, computed over individual samples:

$$L(\theta) = \mathbb{E}_s\!\left[\ell(\theta, s)\right].$$
Typically, iterative learning algorithms estimate the expected value with an empirical mean over a batch of samples $(s_1, \dots, s_N)$, so the loss actually used can be written $\frac{1}{N}\sum_{i=1}^{N} \ell(\theta, s_i)$, the gradient of which is directly expressible from the gradients of $\ell$. In the remainder of the paper, we keep the expressions based on the expected value $\mathbb{E}_s$, knowing that at every iteration it is replaced by an empirical mean over a (new) batch of samples.
Table of notations
|$s$|a sample; we use $x$ and $y$ when samples have two distinct parts (e.g. an input and a label): $s = (x, y)$|
|$n$|dimension of the samples: $s \in \mathbb{R}^n$|
|$\theta \in \mathbb{R}^d$|the vector of parameters|
|$L(\theta)$|the scalar loss to minimize|
|$f_\theta$|parameterized function that takes samples as input, and on which the loss depends|
|$\ell(\theta, s)$|the atomic loss of which $L(\theta)$ is the average over the samples: $L(\theta) = \mathbb{E}_s[\ell(\theta, s)]$|
|$\delta\theta$|small update of $\theta$ computed at every iteration|
|$\|v\|$|Euclidean norm of the vector $v$: $\|v\| = \sqrt{v^\top v}$|
|$J_g(\theta)$|when $g(\theta)$ is a vector, Jacobian of the function $\theta \mapsto g(\theta)$|
|$p_\theta$|when $f_\theta$ is a probability density function, we rewrite it $p_\theta$|
|$p_\theta(y \mid x)$|with samples of the form $s = (x, y)$, if $p_\theta$ models the conditional density of $y$ given $x$, we rewrite it $p_\theta(y \mid x)$|
|$\mathbb{E}_{z \sim p_\theta}[g(z)]$|average value of $g(z)$, when $z$ follows the distribution defined by $p_\theta$|
|$\mathbb{E}_s[g(s)]$|average value of $g(s)$, assuming that $s$ follows the true distribution over samples|
|$\nabla_\theta$|gradient over $\theta$ (e.g. $\nabla_\theta L(\theta)$ is the gradient of $L$)|
|$I_x(\theta)$|Fisher information matrix for the family of distributions $\theta \mapsto p_\theta(\cdot \mid x)$ (with $x$ fixed)|
|$F(\theta)$|empirical Fisher matrix, $F(\theta) = \mathbb{E}_s\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^\top\right]$|
|$KL(p_1 \,\|\, p_2)$|Kullback-Leibler divergence between the probability distributions $p_1$ and $p_2$|
|$H(\theta)$|Hessian of the loss $L$, defined by $[H(\theta)]_{i,j} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}(\theta)$|
|$\mathcal{H}_\ell$|Hessian of the atomic loss $\ell$ with respect to variations of the output $v = f_\theta(s)$|
3 Vanilla, classical Gauss-Newton and natural gradient descent
All variants of the gradient descent are iterative numerical optimization methods: they start with a random initial $\theta$ and attempt to decrease the value of $L(\theta)$ over iterations by adding a small increment vector $\delta\theta$ to $\theta$ at each step. The core of all these algorithms is the determination of the direction and magnitude of $\delta\theta$. Its definition is always based on local information only, with no knowledge of the surrounding landscape of optima, which is why such methods are never guaranteed to converge to a global optimum.
3.1 Vanilla gradient descent
The so-called "vanilla" gradient descent is a first-order method that relies on a degree-1 Taylor approximation of the loss function:

$$L(\theta + \delta\theta) \approx L(\theta) + \nabla_\theta L(\theta)^\top \delta\theta.$$
Therefore, at each iteration the objective becomes the minimization of $\nabla_\theta L(\theta)^\top \delta\theta$. If the gradient is non-zero, the value of this term is unbounded below: it suffices for instance to set $\delta\theta = -\alpha\, \nabla_\theta L(\theta)$ with $\alpha$ arbitrarily large. As a result, constraints are needed to avoid making excessively large steps. In vanilla approaches, the Euclidean metric ($M(\theta) = I$) is used to bound the increments $\delta\theta$. The optimization problem solved at every step of the scheme is:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta}\ \nabla_\theta L(\theta)^\top \delta\theta \quad \text{s.t.} \quad \|\delta\theta\|^2 \leq \epsilon^2.$$
To set the size of the step, instead of tuning $\epsilon$, the most common approach is to use the expression $\delta\theta = -\eta\, \nabla_\theta L(\theta)$ and directly tune $\eta$, which is called the learning rate. An interesting property of this approach is that, as $\theta$ gets closer to an optimum, the norm of the gradient decreases, so the $\epsilon$ corresponding to the fixed $\eta$ decreases as well. This means that the steps tend to become smaller and smaller, which is a desirable property as far as asymptotic convergence is concerned.
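The learning-rate form of the update can be sketched on a toy quadratic loss, whose minimizer is known, to observe the shrinking steps near the optimum (a minimal sketch; the loss and the name `vanilla_gd` are ours):

```python
import numpy as np

# Vanilla gradient descent with a fixed learning rate eta on the toy loss
# L(theta) = 0.5 * ||theta - target||^2, whose gradient is theta - target.
def vanilla_gd(grad_fn, theta0, eta=0.1, n_steps=200):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad_fn(theta)  # delta_theta = -eta * grad
    return theta

target = np.array([1.0, -2.0])
theta = vanilla_gd(lambda th: th - target, np.zeros(2))
```

Since the gradient norm shrinks geometrically here, the steps become smaller and smaller, matching the asymptotic behavior described above.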
3.2 Classical Gauss-Newton
The atomic loss function $\ell(\theta, s)$ depends on the object $f_\theta$, not directly on the parameters $\theta$. This object is an output of $\theta \mapsto f_\theta$, which is a function parameterized arbitrarily in $\theta$. It can for instance be a combination of polynomials, a multilayer perceptron, a radial basis function network, or any other type of parameterized function. For relatively similar functions represented in different ways, and thus with possibly very different values of $\theta$, one vanilla gradient descent step can result in completely different modifications of the function. Indeed, the constraint relies on the Euclidean distance to measure the parameter changes, so it acts as if all components of $\theta$ had the same importance, which is not necessarily the case depending on how $f_\theta$ is parameterized. Some components of $\theta$ might have much smaller effects on $f_\theta$ than others, and this is not taken into account by the vanilla gradient descent method, which typically performs badly with unbalanced parameterizations.
A way to make the updates independent from the parameterization is to measure and bound the effect of $\delta\theta$ on the object $f_\theta$ itself. For instance, if $f_\theta(s)$ is an object of finite dimension (i.e. a vector), we can simply bound the expected squared Euclidean distance between $f_\theta(s)$ and $f_{\theta+\delta\theta}(s)$:

$$\mathbb{E}_s\!\left[\|f_{\theta+\delta\theta}(s) - f_\theta(s)\|^2\right] \leq \epsilon^2.$$
Using again a first-order approximation, we have $f_{\theta+\delta\theta}(s) - f_\theta(s) \approx J_{f(s)}(\theta)\, \delta\theta$, where $J_{f(s)}(\theta)$ is the Jacobian of the function $\theta \mapsto f_\theta(s)$. The constraint can be rewritten:

$$\delta\theta^\top\, \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right] \delta\theta \leq \epsilon^2,$$
resulting in the optimization problem:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta}\ \nabla_\theta L(\theta)^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top\, \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right] \delta\theta \leq \epsilon^2, \qquad (4)$$

which fits into the general framework (1) with $M(\theta) = \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right]$, provided this matrix is symmetric positive-definite.
The structure of the matrix $\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right]$ makes it symmetric and positive semi-definite, but not necessarily positive-definite. To ensure positive-definiteness, a regularization or damping term can be added, resulting in the constraint $\delta\theta^\top \left(\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right] + \lambda I\right) \delta\theta \leq \epsilon^2$, which can be rewritten:

$$\delta\theta^\top\, \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right] \delta\theta + \lambda \|\delta\theta\|^2 \leq \epsilon^2.$$
We see that this kind of damping, often called Tikhonov damping (Martens & Sutskever, 2012), regularizes the constraint with a term proportional to the squared Euclidean norm of $\delta\theta$. If a large value of $\lambda$ is chosen (which also requires increasing $\epsilon$), the method becomes similar to vanilla gradient descent.
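The damped classical Gauss-Newton direction can be sketched as follows (a minimal sketch; the function name and the learning-rate parameter `eta` are ours, the direction being the one discussed above):

```python
import numpy as np

def classical_gauss_newton_step(jacobians, grad, eta=1.0, damping=1e-4):
    """Step direction -eta * (E_s[J^T J] + damping * I)^{-1} grad,
    i.e. the classical Gauss-Newton direction with Tikhonov damping."""
    d = grad.shape[0]
    M = sum(J.T @ J for J in jacobians) / len(jacobians) + damping * np.eye(d)
    return -eta * np.linalg.solve(M, grad)

# One sample whose Jacobian diag(1, 2) scales the two parameters differently:
# the step rescales each coordinate by its sensitivity, unlike vanilla descent.
step = classical_gauss_newton_step([np.diag([1.0, 2.0])],
                                   np.array([1.0, 4.0]), damping=0.0)
```

With a large `damping`, `M` approaches a multiple of the identity and the step indeed reverts to a vanilla gradient step.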
The common but more specific definition of the classical Gauss-Newton method.
We can remark that the same step direction is obtained with a second-order approximation of the loss, when its expression has the form of a squared error: $\ell(\theta, s) = \frac{1}{2}\|y - f_\theta(x)\|^2$, with $s = (x, y)$.
Indeed, in that case the gradient of the atomic loss is:

$$\nabla_\theta \ell(\theta, s) = J_{f(x)}(\theta)^\top \left(f_\theta(x) - y\right),$$

and, neglecting the second-order derivatives of $f_\theta$, the Hessian of the atomic loss is approximately $J_{f(x)}(\theta)^\top J_{f(x)}(\theta)$, so the second-order approximation can be rewritten:

$$\ell(\theta + \delta\theta, s) \approx \ell(\theta, s) + \nabla_\theta \ell(\theta, s)^\top \delta\theta + \frac{1}{2}\, \delta\theta^\top J_{f(x)}(\theta)^\top J_{f(x)}(\theta)\, \delta\theta.$$

For the loss $L(\theta)$, by averaging over the samples, we get:

$$L(\theta + \delta\theta) \approx L(\theta) + \nabla_\theta L(\theta)^\top \delta\theta + \frac{1}{2}\, \delta\theta^\top\, \mathbb{E}_s\!\left[J_{f(x)}(\theta)^\top J_{f(x)}(\theta)\right] \delta\theta.$$
Minimizing this second-order Taylor approximation is called the classical Gauss-Newton method (Bottou et al., 2018). Assuming that $\mathbb{E}_s\!\left[J_{f(x)}(\theta)^\top J_{f(x)}(\theta)\right]$ is positive-definite, as shown in Appendix B the minimum is obtained for:

$$\delta\theta = -\,\mathbb{E}_s\!\left[J_{f(x)}(\theta)^\top J_{f(x)}(\theta)\right]^{-1} \nabla_\theta L(\theta),$$
which is in the same direction as the step obtained from the optimization problem (4), without damping. The second-order approximation is the usual way to derive the classical Gauss-Newton method, but it works only for a particular type of squared loss function. Our approach based on the optimization problem (4) shows that the same step direction makes sense for any type of loss.
As shown in Appendix A, the general framework (1) has a unique solution $\delta\theta^* = -\alpha\, M(\theta)^{-1} \nabla_\theta L(\theta)$, with $\alpha = \frac{\epsilon}{\sqrt{\nabla_\theta L(\theta)^\top M(\theta)^{-1} \nabla_\theta L(\theta)}}$, and in the present case $M(\theta) = \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right] + \lambda I$, or $\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right]$ if we ignore the damping. We have seen that the more common approach yields $\delta\theta = -M(\theta)^{-1} \nabla_\theta L(\theta)$, which is indeed in the same direction, but differs by the absence of the coefficient $\alpha$ of Appendix A, which plays the role of the learning rate of Section 3.1. This slight difference appears for several of the 6 methods presented in this paper as instances of the proposed framework. However, it is not a very significant difference since in practice, the value of this coefficient is usually redefined separately.
For instance, in the common classical Gauss-Newton approach, despite the solution being exactly $\delta\theta = -\,\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top J_{f(s)}(\theta)\right]^{-1} \nabla_\theta L(\theta)$, a learning rate $\eta$ is often introduced to make smaller steps. Another example is when $M(\theta)$ is a very large matrix, in which case $M(\theta)^{-1} \nabla_\theta L(\theta)$ is often estimated via drastic approximations. If values of $L(\theta)$ can be evaluated with finer approximations, a line search can be used to find a value of the step-size coefficient for which it is verified with more precision that the corresponding step size is reasonable. This line search is an important component of the popular reinforcement learning algorithm TRPO (Schulman et al., 2015).
Finally, with the proposed framework, the theoretical value of $\alpha$ leads to steps of constant size $\epsilon$ (when measured with the metric associated to $M(\theta)$), and even though this may be interesting at the beginning of the gradient descent, convergence can only be obtained if the step size tends toward zero, which is why defining the learning rate without taking into account the theoretical value of $\alpha$ can be preferable.
3.3 Natural gradient
As it can be relevant for many applications to consider stochastic models, a case of significant importance is when the object $f_\theta$ is the probability density function of a continuous random variable, which we denote more conveniently by $p_\theta$. It is in this context that Amari proposed and popularized the notion of natural gradient (Amari, 1997, 1998). To bound the modification from $p_\theta$ to $p_{\theta+\delta\theta}$ in the gradient step, this approach is based on a matrix called the Fisher information matrix, defined by:

$$I(\theta) = \mathbb{E}_{z \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(z)\, \nabla_\theta \log p_\theta(z)^\top\right].$$
It can be used to measure a "distance" between two infinitesimally close probability distributions $p_\theta$ and $p_{\theta+\delta\theta}$ as follows:

$$\delta\theta^\top I(\theta)\, \delta\theta.$$
Averaging over the samples, we extrapolate a measure of distance between $p_\theta$ and $p_{\theta+\delta\theta}$:

$$\delta\theta^\top\, \mathbb{E}_s\!\left[I_s(\theta)\right] \delta\theta,$$

where $\mathbb{E}_s\!\left[I_s(\theta)\right]$ is the averaged Fisher information matrix.
Often, the samples can be divided into two parts: $s = (x, y)$, and the probability density function is used to estimate the conditional probability of $y$ given $x$, which we rewrite $p_\theta(y \mid x)$: it is the estimate of the true conditional density $p(y \mid x)$. The goal of the learning algorithm is to make this estimation increasingly accurate. Classical examples include classification ($x$ is the input and $y$ the label) and regression analysis ($x$ is the independent variable and $y$ the dependent variable). In this context, it is common to approximate the expectation inside the Fisher information matrix with the empirical mean over the samples, which reduces the above expression to:

$$\delta\theta^\top\, \mathbb{E}_s\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^\top\right] \delta\theta.$$
The matrix $\mathbb{E}_s\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^\top\right]$ is called the empirical Fisher matrix (Martens, 2014). We denote it by $F(\theta)$. Putting an upper bound on $\delta\theta^\top F(\theta)\, \delta\theta$ results in the following optimization problem:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta}\ \nabla_\theta L(\theta)^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top F(\theta)\, \delta\theta \leq \epsilon^2, \qquad (5)$$
which yields natural gradient steps of the form:

$$\delta\theta = -\eta\, F(\theta)^{-1} \nabla_\theta L(\theta),$$
provided that $F(\theta)$ is invertible. $F(\theta)$ is always positive semi-definite. Therefore, as in Section 3.2 with the classical Gauss-Newton approach, a damping term can be added to ensure invertibility. In some sense, the Fisher information matrix is defined uniquely by the property of invariance to reparameterization of the metric it induces (Čencov, 1982), and it can be obtained from many different derivations. But a particularly interesting fact is that $\frac{1}{2}\, \delta\theta^\top I(\theta)\, \delta\theta$ corresponds to the second-order approximation of the Kullback-Leibler divergence $KL(p_\theta \,\|\, p_{\theta+\delta\theta})$ (Kullback, 1997; Akimoto & Ollivier, 2013). Hence, the terms $\delta\theta^\top I(\theta)\, \delta\theta$ and $\delta\theta^\top F(\theta)\, \delta\theta$ share some of the properties of the Kullback-Leibler divergence. For instance, when the variance of the probability distribution $p_\theta$ decreases, the same parameter modification $\delta\theta$ tends to result in increasingly large measures (see Figure 2).
Consequently, if the bound $\epsilon^2$ of Equation (5) is kept constant, the possible modifications of $p_\theta$ become smaller when the overall variance of the outputs of $p_\theta$ decreases. Thus the natural gradient iterations slow down when the variance becomes small, which is a desirable property when keeping some amount of variability is important. Typically, in the context of reinforcement learning, this variability can be related to exploration, and it should not vanish too early. This is one of the reasons why several reinforcement learning algorithms using stochastic policies benefit from the use of natural gradient steps (Peters & Schaal, 2008; Schulman et al., 2015; Wu et al., 2017).
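A natural gradient step with the empirical Fisher matrix can be sketched from per-sample gradients of the log-likelihood (a minimal sketch; the function name and the parameters `eta`, `damping` are ours):

```python
import numpy as np

def natural_gradient_step(score_grads, loss_grad, eta=1.0, damping=1e-4):
    """Step -eta * (F + damping * I)^{-1} * loss_grad, where the empirical
    Fisher F = E_s[g g^T] is built from per-sample gradients g of the
    log-likelihood log p_theta(y|x)."""
    d = loss_grad.shape[0]
    F = sum(np.outer(g, g) for g in score_grads) / len(score_grads)
    return -eta * np.linalg.solve(F + damping * np.eye(d), loss_grad)

# Two toy per-sample score gradients give F = 0.5 * I, so the natural
# gradient step is a rescaled gradient step here.
scores = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
step = natural_gradient_step(scores, np.array([1.0, 1.0]), damping=0.0)
```

In realistic settings `F` is rank-deficient, so a positive `damping` is needed, as discussed above.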
Relation between natural gradient and classical Gauss-Newton approaches.
Let us consider a very simple case where $p_\theta$ is a multivariate normal distribution with fixed covariance matrix $\Sigma = \sigma^2 I$. The only variable parameter on which the distribution depends is its mean $\mu$, so we can use it as representation of the distribution itself and write $p_\mu$. It can be shown that the Kullback-Leibler divergence between two normal distributions of equal variance and different means is proportional to the squared Euclidean distance between the means. More precisely, the Kullback-Leibler divergence between $p_\mu$ and $p_{\mu+\delta\mu}$ is equal to $\frac{\|\delta\mu\|^2}{2\sigma^2}$. For small values of $\delta\mu$, this expression is approximately equal to the measure obtained with the true Fisher information matrix:

$$\frac{\|\delta\mu\|^2}{2\sigma^2} \approx \frac{1}{2}\, \delta\mu^\top I(\mu)\, \delta\mu.$$
Bounding the average over the samples of the right term is the motivation of the natural gradient descent method. Besides, we have seen in Section 3.2 that the classical Gauss-Newton method can be considered as a way to bound $\mathbb{E}_s\!\left[\|f_{\theta+\delta\theta}(s) - f_\theta(s)\|^2\right]$, which is equal to the average of the left term over the samples, up to a multiplicative constant. Hence, even though both methods introduce slightly different approximations, we conclude that, in this context, the classical Gauss-Newton and natural gradient descent methods are very similar. This property is used in Pascanu & Bengio (2013)
to perform natural gradient descent on deterministic neural networks, by interpreting their output as the mean of a conditional Gaussian distribution with a fixed variance.
4 Gradient covariance matrix, Newton’s method and generalized Gauss-Newton
Sections 3.1 to 3.3 presented approaches that fit into the framework (1) with matrices $M(\theta)$ that do not depend on the loss function. But since the loss is typically based on quantities that are relevant for the task to achieve, it can be a good idea to exploit it to constrain the steps. We now present three approaches that fit into the same framework but with matrices $M(\theta)$ that depend on the loss, namely the gradient covariance matrix method, Newton's method, and the generalized Gauss-Newton method.
4.1 Gradient covariance matrix
The simplest way to use the loss to measure the magnitude of a change due to parameter modifications is to consider the expected squared difference between $\ell(\theta, s)$ and $\ell(\theta+\delta\theta, s)$:

$$\mathbb{E}_s\!\left[\left(\ell(\theta+\delta\theta, s) - \ell(\theta, s)\right)^2\right].$$
For a single sample $s$, changing slightly the object $f_\theta$ does not necessarily modify the loss $\ell(\theta, s)$, but in many cases it can be assumed that the loss becomes different for at least some of the samples, yielding a positive value for the above expectation, which quantifies in some sense the amount of change introduced by $\delta\theta$ with respect to the objective. It is often a meaningful measure, as it usually depends on the most relevant features of $f_\theta$ for the task at hand. Let us replace $\ell(\theta+\delta\theta, s)$ by a first-order approximation:

$$\ell(\theta+\delta\theta, s) \approx \ell(\theta, s) + \nabla_\theta \ell(\theta, s)^\top \delta\theta.$$
The above expectation then simplifies to:

$$\delta\theta^\top\, \mathbb{E}_s\!\left[\nabla_\theta \ell(\theta, s)\, \nabla_\theta \ell(\theta, s)^\top\right] \delta\theta,$$

where $\mathbb{E}_s\!\left[\nabla_\theta \ell(\theta, s)\, \nabla_\theta \ell(\theta, s)^\top\right]$ is the gradient covariance matrix.
Bounding this quantity fits into the framework (1) and results in updates of the form:

$$\delta\theta = -\eta\, \mathbb{E}_s\!\left[\nabla_\theta \ell(\theta, s)\, \nabla_\theta \ell(\theta, s)^\top\right]^{-1} \nabla_\theta L(\theta).$$
Again, a regularization term may be added to ensure the invertibility of the matrix.
Link with the natural gradient.
Let us assume, as in Section 3.3, that $f_\theta$ is a probability density function that aims to model the conditional probability of $y$ given $x$, with $s = (x, y)$. As in Section 3.3, we denote it by $p_\theta(y \mid x)$. In this context, it is very common for the atomic loss to be the negative log-likelihood of $y$ given $x$:

$$\ell(\theta, s) = -\log p_\theta(y \mid x).$$
It follows that the empirical Fisher matrix, as defined in Section 3.3, is equal to $\mathbb{E}_s\!\left[\nabla_\theta \ell(\theta, s)\, \nabla_\theta \ell(\theta, s)^\top\right]$, which is exactly the definition of the gradient covariance matrix, thus the two approaches are identical in this case. Several algorithms use this identity for natural gradient computation, e.g. George et al. (2018).
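This identity can be checked numerically on a toy model (a sketch under our own assumptions: a logistic model with `p_theta(y=1|x) = sigmoid(theta^T x)` and two hand-picked samples). With the negative log-likelihood as atomic loss, each per-sample loss gradient is the opposite of the score, so the two outer-product matrices coincide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])
data = [(np.array([1.0, 2.0]), 1), (np.array([-1.0, 0.5]), 0)]

loss_grads, score_grads = [], []
for x, y in data:
    p = sigmoid(theta @ x)
    loss_grads.append((p - y) * x)   # gradient of -log p_theta(y|x)
    score_grads.append((y - p) * x)  # gradient of  log p_theta(y|x) (score)

C = sum(np.outer(g, g) for g in loss_grads) / len(data)   # gradient covariance
F = sum(np.outer(g, g) for g in score_grads) / len(data)  # empirical Fisher
```

Since the outer product of a vector equals that of its opposite, `C` and `F` are equal, as stated above.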
4.2 Newton’s method
Let us consider a second-order approximation of the loss:

$$L(\theta + \delta\theta) \approx L(\theta) + \nabla_\theta L(\theta)^\top \delta\theta + \frac{1}{2}\, \delta\theta^\top H(\theta)\, \delta\theta,$$
where $H(\theta)$ is the Hessian matrix: $[H(\theta)]_{i,j} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}(\theta)$. One can argue that the first-order approximation $L(\theta) + \nabla_\theta L(\theta)^\top \delta\theta$ (which is used as minimization objective for gradient descent) is most likely good as long as the second-order term $\frac{1}{2}\, \delta\theta^\top H(\theta)\, \delta\theta$ remains small. Therefore, it makes sense to directly put an upper bound on this quantity to restrict $\delta\theta$, as follows:

$$\frac{1}{2}\, \delta\theta^\top H(\theta)\, \delta\theta \leq \epsilon^2.$$
This bound can only define a trust region if the matrix $H(\theta)$ is symmetric positive-definite. However, $H(\theta)$ is symmetric but not even necessarily positive semi-definite, unlike the matrices obtained with the previous approaches. Therefore, the damping term $\lambda I$ required to make it positive-definite may be larger than with the other methods. It leads to the following optimization problem:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta}\ \nabla_\theta L(\theta)^\top \delta\theta \quad \text{s.t.} \quad \frac{1}{2}\, \delta\theta^\top \left(H(\theta) + \lambda I\right) \delta\theta \leq \epsilon^2,$$
and to updates of the form:

$$\delta\theta = -\eta\, \left(H(\theta) + \lambda I\right)^{-1} \nabla_\theta L(\theta).$$
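The damped Newton update can be sketched as follows (a minimal sketch; the function name and its parameters are ours):

```python
import numpy as np

def newton_step(H, grad, eta=1.0, damping=0.1):
    """Damped Newton direction: -eta * (H + damping * I)^{-1} grad.
    The damping also covers the case where H is not positive-definite."""
    return -eta * np.linalg.solve(H + damping * np.eye(grad.shape[0]), grad)

# Quadratic loss with Hessian diag(2, 4): without damping, the Newton step
# rescales each coordinate of the gradient by the inverse of its curvature.
step = newton_step(np.diag([2.0, 4.0]), np.array([2.0, 4.0]), damping=0.0)
```

On a truly quadratic loss with `eta=1` and no damping, this single step lands exactly on the minimizer.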
The more usual derivation of Newton’s method.
The same update direction is obtained by directly minimizing the damped second-order approximation:

$$L(\theta) + \nabla_\theta L(\theta)^\top \delta\theta + \frac{1}{2}\, \delta\theta^\top \left(H(\theta) + \lambda I\right) \delta\theta.$$

When $H(\theta) + \lambda I$ is symmetric positive-definite, as shown in Appendix B, the minimum of this expression is obtained for:

$$\delta\theta = -\left(H(\theta) + \lambda I\right)^{-1} \nabla_\theta L(\theta).$$
4.3 Generalized Gauss-Newton
As $\ell(\theta + \delta\theta, s)$ is equal to $\ell(f_{\theta+\delta\theta}(s), s)$, it actually does not depend directly on $\delta\theta$ but on the output of $f_{\theta+\delta\theta}$. Here we assume that these outputs are vectors of finite dimension. Posing $\delta v = f_{\theta+\delta\theta}(s) - f_\theta(s)$, a second-order Taylor expansion of $\ell$ w.r.t. $v = f_\theta(s)$ can be written:

$$\ell(f_\theta(s) + \delta v, s) \approx \ell(f_\theta(s), s) + \nabla_v \ell^\top \delta v + \frac{1}{2}\, \delta v^\top \mathcal{H}_\ell\, \delta v,$$
where $\mathcal{H}_\ell$ is the Hessian matrix of the atomic loss with respect to variations of $v$, and $\nabla_v \ell$ is the gradient of $\ell$ w.r.t. variations of $v$. Using the equality $\delta v = J_{f(s)}(\theta)\, \delta\theta + O(\|\delta\theta\|^2)$ (see the definition of $J_{f(s)}(\theta)$ in Section 3.2), we get:

$$\ell(\theta + \delta\theta, s) \approx \ell(\theta, s) + \nabla_v \ell^\top J_{f(s)}(\theta)\, \delta\theta + \nabla_v \ell^\top O(\|\delta\theta\|^2) + \frac{1}{2}\, \delta\theta^\top J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\, \delta\theta.$$
The generalized Gauss-Newton approach is an approximation dropping the term $\nabla_v \ell^\top O(\|\delta\theta\|^2)$, which comes from the curvature of $f_\theta$ itself. Averaging over the samples yields:

$$L(\theta + \delta\theta) \approx L(\theta) + \mathbb{E}_s\!\left[\nabla_v \ell^\top J_{f(s)}(\theta)\right] \delta\theta + \frac{1}{2}\, \delta\theta^\top\, \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\right] \delta\theta.$$
Noticing that $\nabla_\theta L(\theta)^\top = \mathbb{E}_s\!\left[\nabla_v \ell^\top J_{f(s)}(\theta)\right]$ (by the chain rule), it results in the following approximation:

$$L(\theta + \delta\theta) \approx L(\theta) + \nabla_\theta L(\theta)^\top \delta\theta + \frac{1}{2}\, \delta\theta^\top\, \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\right] \delta\theta.$$
As for Newton's method, the usual way to derive the generalized Gauss-Newton method is to directly minimize this expression (see Martens (2014)), but we can also put a bound on the quantity $\delta\theta^\top\, \mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\right] \delta\theta$ so as to define a trust region for the validity of the first-order approximation, provided that $\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\right]$ is symmetric positive-definite. If the loss $\ell$ is convex in $v$ (which is often true), this matrix is at least positive semi-definite, so a small damping term suffices to make it positive-definite. If a non-negligible portion of the matrices $\mathcal{H}_\ell$ are full rank, the damping term may be added to $\mathcal{H}_\ell$ rather than to the full matrix. See Martens & Sutskever (2012) for an extensive discussion on different options for damping and their benefits and drawbacks. With the damping on the full matrix, the iterated optimization problem becomes:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta}\ \nabla_\theta L(\theta)^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top \left(\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\right] + \lambda I\right) \delta\theta \leq \epsilon^2,$$
resulting in updates of the form:

$$\delta\theta = -\eta\, \left(\mathbb{E}_s\!\left[J_{f(s)}(\theta)^\top \mathcal{H}_\ell\, J_{f(s)}(\theta)\right] + \lambda I\right)^{-1} \nabla_\theta L(\theta).$$
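The generalized Gauss-Newton update can be sketched from per-sample Jacobians and output-space Hessians (a minimal sketch; the function name and its parameters are ours):

```python
import numpy as np

def generalized_gauss_newton_step(jacobians, out_hessians, grad,
                                  eta=1.0, damping=1e-4):
    """Step -eta * (E_s[J^T H_l J] + damping * I)^{-1} grad, where H_l is
    the Hessian of the atomic loss w.r.t. the output of f_theta."""
    d = grad.shape[0]
    G = sum(J.T @ H @ J
            for J, H in zip(jacobians, out_hessians)) / len(jacobians)
    return -eta * np.linalg.solve(G + damping * np.eye(d), grad)

# For the squared error 0.5 * ||f - y||^2, H_l is the identity, so the
# generalized step reduces to the classical Gauss-Newton step of Section 3.2.
J = np.diag([1.0, 2.0])
step = generalized_gauss_newton_step([J], [np.eye(2)],
                                     np.array([1.0, 4.0]), damping=0.0)
```

Choosing `out_hessians` as the loss Hessians in output space rather than parameter space is exactly what keeps `G` positive semi-definite for convex atomic losses.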
5 Summary and conclusion
We proposed to unify several first-order and second-order variants of the gradient descent via the iterated optimization problem:

$$\delta\theta^* = \operatorname{argmin}_{\delta\theta}\ \nabla_\theta L(\theta)^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top M(\theta)\, \delta\theta \leq \epsilon^2,$$

with $M(\theta)$ being symmetric positive-definite. The quadratic term of the inequality corresponds to a specific metric defined by $M(\theta)$, used to measure the magnitude of the modification induced by $\delta\theta$. To evaluate this magnitude, the focus can simply be on the norm of $\delta\theta$, or on the effect of $\delta\theta$ on the loss, or on the effect of $\delta\theta$ on the objects $f_\theta$, resulting in various approaches, with various definitions of $M(\theta)$. We gave 6 examples corresponding to popular variants of gradient descent, summarized in Table 1. Unifying several first-order or second-order variants of the gradient descent method enabled us to reveal links between these different approaches, and contexts in which some of them are equivalent. The proposed framework gives a new perspective on the common variants of gradient descent, and can hopefully help to choose adequately between them depending on the problem to solve. Perhaps it can also help to design new variants or combine existing ones to obtain new desired features.
- Akimoto & Ollivier (2013) Youhei Akimoto and Yann Ollivier. Objective improvement in information-geometric optimization. In Proceedings of the Twelfth Workshop on Foundations of Genetic Algorithms XII, pp. 1–10. ACM, 2013.
- Amari (1997) Shun-ichi Amari. Neural learning in structured parameter spaces - natural Riemannian gradient. In Advances in neural information processing systems, pp. 127–133, 1997.
- Amari (1998) Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- Bottou & Bousquet (2008) Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pp. 161–168, 2008.
- Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- Čencov (1982) N. N. Čencov. Statistical decision rules and optimal inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, 1982. Translation from the Russian edited by Lev J. Leifman.
- George et al. (2018) Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker-factored eigenbasis. arXiv preprint arXiv:1806.03884, 2018.
- Kullback (1997) Solomon Kullback. Information theory and statistics. Dover Publications Inc., Mineola, NY, 1997. ISBN 0-486-69684-7. Reprint of the second (1968) edition.
- Martens (2014) James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
- Martens & Sutskever (2012) James Martens and Ilya Sutskever. Training deep and recurrent networks with hessian-free optimization. In Neural networks: Tricks of the trade, pp. 479–535. Springer, 2012.
- Ollivier (2015) Yann Ollivier. Riemannian metrics for neural networks I: Feedforward networks. Information and Inference, 4(2):108–153, 2015.
- Pascanu & Bengio (2013) Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
- Peters & Schaal (2008) Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7):1180 – 1190, 2008.
- Schulman et al. (2015) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
- Sigaud & Stulp (2018) Olivier Sigaud and Freek Stulp. Policy search in continuous action domains: an overview. arXiv preprint arXiv:1803.04706, 2018.
- Wu et al. (2017) Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pp. 5279–5288, 2017.
Appendix A Solution of the optimization problem (1)
The Lagrangian of the optimization problem (1) is:

$$\mathcal{L}(\delta\theta, \mu) = \nabla_\theta L(\theta)^\top \delta\theta + \mu \left(\delta\theta^\top M(\theta)\, \delta\theta - \epsilon^2\right),$$
where the scalar $\mu$ is a Lagrange multiplier. An optimal increment $\delta\theta^*$ cancels the gradient of the Lagrangian w.r.t. $\delta\theta$, which is equal to $\nabla_\theta L(\theta) + 2\mu\, M(\theta)\, \delta\theta$. Since $M(\theta)$ is symmetric positive-definite, and therefore invertible, the unique solution is given by $\delta\theta^* = -\frac{1}{2\mu} M(\theta)^{-1} \nabla_\theta L(\theta)$, which we rewrite as follows:

$$\delta\theta^* = -\alpha\, M(\theta)^{-1} \nabla_\theta L(\theta), \quad \text{with } \alpha = \frac{1}{2\mu}.$$
Plugging this expression in problem (1) yields:

$$\operatorname{argmin}_{\alpha}\ -\alpha\, \nabla_\theta L(\theta)^\top M(\theta)^{-1} \nabla_\theta L(\theta) \quad \text{s.t.} \quad \alpha^2\, \nabla_\theta L(\theta)^\top M(\theta)^{-1} \nabla_\theta L(\theta) \leq \epsilon^2,$$

and assuming that the gradient is non-zero, the optimum is reached for:

$$\alpha = \frac{\epsilon}{\sqrt{\nabla_\theta L(\theta)^\top M(\theta)^{-1} \nabla_\theta L(\theta)}}.$$
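This closed form can be sanity-checked numerically: the solution saturates the constraint and achieves a lower objective value than other feasible candidates on the constraint boundary (a sketch with an arbitrary random positive-definite matrix of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
M = A @ A.T + 3.0 * np.eye(3)  # symmetric positive-definite
g = rng.standard_normal(3)
eps = 0.5

# Closed-form solution d* = -(eps / sqrt(g^T M^{-1} g)) * M^{-1} g.
Minv_g = np.linalg.solve(M, g)
d_star = -(eps / np.sqrt(g @ Minv_g)) * Minv_g

# A competing feasible point: a random direction scaled onto the boundary
# of the constraint d^T M d <= eps^2.
v = rng.standard_normal(3)
d_other = eps * v / np.sqrt(v @ M @ v)
```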
Appendix B Minimization of a quadratic form
Let us consider a function $q(\delta\theta) = c + b^\top \delta\theta + \frac{1}{2}\, \delta\theta^\top A\, \delta\theta$, where $c$ is a scalar, $b$ a vector and $A$ a symmetric positive-definite matrix. The gradient of $q$ is:

$$\nabla q(\delta\theta) = b + A\, \delta\theta.$$
$A$ being invertible, this gradient has a unique zero, which corresponds to the global minimum of $q$:

$$\delta\theta^* = -A^{-1} b.$$
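The statement can be checked numerically on an arbitrary symmetric positive-definite matrix of our choosing: the gradient vanishes at $-A^{-1}b$ and the value of $q$ increases under any perturbation:

```python
import numpy as np

# q(d) = c + b^T d + 0.5 * d^T A d with A symmetric positive-definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
c = 5.0

def q(d):
    return c + b @ d + 0.5 * d @ A @ d

d_star = -np.linalg.solve(A, b)    # the claimed minimizer -A^{-1} b
grad_at_min = b + A @ d_star       # should vanish
```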