1 Introduction and overview
The natural gradient descent approach, pioneered by Amari and collaborators (e.g. Amari, 1998), is a popular alternative to traditional gradient descent methods which has received a lot of attention over the past several decades, motivating many new and related approaches. It has been successfully applied to a variety of problems such as blind source separation (Amari and Cichocki, 1998)Peters and Schaal, 2008)
, and neural network training(Park et al., 2000; Martens and Grosse, 2015).
Natural gradient descent is generally applicable to the optimization of probabilistic models111This includes neural networks, which can be cast as conditional models., and involves the use of the so-called “natural gradient” in place of the standard gradient, which is defined as the gradient times the inverse of the model’s Fisher information matrix (see Section 5). In many applications, natural gradient descent seems to require far fewer total iterations than gradient descent (although the reasons for this have remained somewhat mysterious), making it a potentially attractive alternative method. Unfortunately, for models with very many parameters such as large neural networks, computing the natural gradient is impractical due to the extreme size of the Fisher information matrix (“the Fisher”). This problem can be addressed through the use of one of many possible model-specific approximations to the Fisher (e.g Le Roux et al., 2008; Ollivier, 2015; Grosse and Salakhudinov, 2015; Martens and Grosse, 2015) that are designed to be easier to compute, store and invert than the exact Fisher.
Natural gradient descent is classically motivated as a way of implementing steepest descent in the space of realizable distributions222Those distributions which correspond to some setting of the model’s parameters. instead of the space of parameters, where distance in the distribution space is measured with a special “Riemannian metric” (Amari and Nagaoka, 2007). This metric depends only on the properties of the distributions themselves (and not their parameters), and in particular is defined so that it approximates the square root of the KL divergence within small neighborhoods. Under this interpretation (discussed in detail in Section 6), natural gradient descent is invariant to any smooth and invertible reparameterization of the model, putting it in stark contrast to gradient descent, whose performance is highly parameterization dependent.
In practice however, natural gradient descent still operates within the default parameter space, and works by computing directions in the space of distributions and then translating them back to the default space before taking a discrete step. Because of this, the above discussed interpretation breaks down unless the step size becomes arbitrarily small, and as discussed in Section 9, this breakdown has important implications for designing a natural gradient method that can work well in practice. Given a large step size one also loses the parameterization invariance property of the natural gradient method, although it will still hold approximately under certain conditions, which are described in Section 12. Another problem with this interpretation is that it doesn’t provide any obvious reason why a step of natural gradient descent should make more progress optimizing the objective than a step of standard gradient descent (assuming well chosen step-sizes for both).
In Section 9 we argue for an alternative view of natural gradient descent: as an approximate 2nd-order method which utilizes the Fisher as an approximation to the Hessian (so that the natural gradient approximates a 2nd-order step computed using the Hessian). In this view, natural gradient descent makes more progress per step because it implicitly uses a local quadratic model/approximation of the the objective function which is accurate over longer distances than the 1st-order model implicitly used by gradient descent.
In support of this view is the fact that the Fisher can be cast as an approximation of the Hessian in at least two different ways (provided the objective has the form discussed in Section 4). First, as discussed in Section 5, it corresponds to the expected Hessian of the loss under the model’s distribution over predicted outputs instead of the usual empirical one used to compute the exact Hessian. And second, as we establish in Section 8, it is very often equivalent to the so-called Generalized Gauss-Newton matrix (GGN) (Schraudolph, 2002; Martens and Sutskever, 2012) discussed in Section 7, which is an established and well justified approximation of the Hessian that has been used in practical 2nd-order optimizations such as those of Martens (2010) and Vinyals and Povey (2012).
Viewing natural gradient descent as an approximate 2nd-order method is also prescriptive since it suggests the use of various damping/regularization techniques often used in the optimization literature for dealing with the problem of quadratic model trust. Indeed, such techniques have been successfully applied in 2nd-order methods such as that of Martens (2010) and Martens and Grosse (2015), where they proved crucial in achieving good and robust performance in practice.
The Fisher, which is used in computing the natural gradient direction, is defined as the covariance of the gradient of the model’s log likelihood function with respect to cases sampled from its distribution. Because it is often simpler to implement and somewhat more economical, a commonly used approximation of the Fisher, which we discuss in Section 10, is to use cases sampled from the training set instead. Known as the “empirical Fisher”, this matrix differs from the usual Fisher in subtle but very important ways, which as shown in Section 10.1, make it considerably less useful as an approximation to the Fisher and as a curvature matrix within 2nd-order optimization methods. Using the empirical Fisher also breaks some of the theory regarding natural gradient descent, although it nonetheless preserves the (approximate) parameterization invariance enjoyed by the method (as shown in Section 12). Despite these objections, the empirical Fisher has been used in many works such as Le Roux et al. (2008) and the recent spate of methods based on diagonal approximations of this matrix (which we review and critique in Section 10.2).
A well-known and often quoted result about stochastic natural gradient descent is that it is asymptotically “Fisher efficient”. Roughly speaking, this means that it provides an asymptotically unbiased estimate of the parameters with the lowest possible variance among all unbiased estimators (given the same amount of data), thus achieving the best possible expected objective function value. Unfortunately, as discussed inSection 11.1, this result comes with several important caveats which severely limit its applicability. Moreover, even when it is applicable, this result provides only an asymptotically accurate characterization of the method which may not accurately describe its behavior given a realistic number of iterations.
To address these issues, in Section 11.2 and Section 11.4 we build on the work of Murata (1998) to develop a more powerful convergence theory for natural gradient descent and more general stochastic optimization methods, as applied to convex quadratic objectives. Our results provide a much more accurate expression for the convergence speed of such methods while properly accounting for the effect of the starting point, and imply various interesting consequences about the relative performance of various 1st and 2nd-order stochastic optimization methods. Perhaps the most interesting conclusion of our analysis is that with parameter averaging applied, fixed-learning rate stochastic gradient descent achieves the same asymptotic convergence speed as natural gradient descent (and is thus also “Fisher efficient”), although 2nd-order methods such as the latter can enjoy a more favorable dependence on the starting point, which means that they can make much more progress given a limited iteration budget.
2 Neural Networks
Feed-forward neural networks are structured very similarly to classical circuits. They typically consist of a sequence of
“layers” of units, where each unit in a given layer receive inputs from the units in the previous layer, and computes an affine function of these, followed by a scalar non-linear function called an “activation function”. The input vector to the network, denoted by, is given by the units of the first layer (which is called the input layer, and is not counted towards the total ), and the output vector of the network, denoted by , is given by the units of the network’s last layer (called the “output layer”). The other layers are referred to as the network’s “hidden layers”.
Formally, given input , and parameters which determine weight matrices and biases , the network computes its output according to
where . Here, is the vector of values (“activities”) of the network’s -th layer, and is the vector-valued non-linear function computed at layer , and is often given by some simple monotonic activation function applied coordinate-wise.
Note that most of the results discussed in this document will apply to the more general setting where is an arbitrary differentiable function (in both and ).
3 Supervised learning framework
The goal of optimization/learning is to find some setting of
so that the output of the network (which we will sometimes call its “prediction”) matches certain target outputs as closely as possible. In particular, given a training setconsisting of training pairs , the goal of learning is to minimize the objective function
is a “loss function” which measures the amount of disagreement betweenand .
The prediction may be a guess for , in which case might measure the inaccuracy of this guess (e.g. using the familiar squared error ). Or could encode the parameters of some simple predictive distribution. For example,
could be the set of probabilities which parameterize a multinomial distribution over the possible discrete values of, with being the negative log probability of under this distribution.
4 KL divergence objectives
The natural gradient method of Amari (1998) can be potentially applied to any objective function which measures the performance of some statistical model. However, it enjoys richer theoretical properties when applied to objective functions based on the KL divergence between the model’s distribution and the target distribution, or certain approximations/surrogates of these.
In this section we will establish the basic notation and properties of these objective functions, and discuss the various ways in which they can be formulated. Each of these formulations will be analogous to a particular formulation of the Fisher information matrix and natural gradient (as defined in Section 5), which will differ in subtle but important ways.
In the idealized setting, input vectors are drawn independently from a target distribution with density function , and the corresponding (target) outputs from a conditional target distribution with density function .
We define the goal of learning as the minimization of the KL divergence from target joint distribution, whose density is , to the learned distribution , whose density is . Note that the second is not a typo here, since we are not learning the distribution over , only the conditional distribution of given . Our objective function is thus
This is equivalent to the expected KL divergence
since we have
It is often the case that we only have samples from and no direct knowledge of its density function. Or the expectation w.r.t. in eqn. 2 may be too difficult to compute. In such cases, we can substitute an empirical training distribution in for , which is given by a set of samples from . This gives the objective
Provided that is known for each in and that can be efficiently computed, we can use the above expression as our objective.
Otherwise, as is often the case, we might only have access to a single sample from for each , giving an empirical training distribution . Substituting this in for gives the objective function
which is the same as the objective minimized in standard maximum likelihood learning. Note that we have extended to be the set of the pairs, which agrees with how was defined in Section 3.
This kind of objective function fits into the general supervised learning framework described in Section3 as follows. We define the learned conditional distribution to be the composition of the deterministic neural network function , and an “output” conditional distribution (with associated density function ), so that
We then define the loss function as .
Note that given some loss not necessarily defined in this way, one can find a corresponding where this definition does apply, provided that satisfies certain properties. In particular, if has the same finite integral w.r.t. for each , then one can define by taking , where the proportion is w.r.t. both and .
5 Various definitions of the natural gradient and the Fisher information matrix
The usual definition of the natural gradient (Amari, 1998) which appears in the literature is
where is the Fisher information matrix of w.r.t. . is given by
where gradients and Hessians are taken w.r.t. . For the purposes of brevity we will often refer to the Fisher information matrix simply as the “Fisher”.
It can be immediately seen from the first of these expressions for that it is PSD (since it’s the expectation of something which is trivially PSD, a vector outer-product). And from the second expression we can see that it also has the interpretation of being the negative expected Hessian of .
Because where doesn’t depend on , we have
and so can also be written as the expectation w.r.t. to of the Fisher information matrix of as follows:
In Amari (1998), this version of
is computed explicitly for a basic perceptron model (basically a neural network with 0 hidden layers) in the case whereis given by .
However, in practice the real may be not directly available, or it may be difficult to integrate over . For example, the conditional Hessian corresponding to a multilayer neural network may be far too complicated to be analytically integrated, even for a very simple . In such situations, may be replaced with its empirical version , giving
This is the version of considered in Park et al. (2000).
Note that when as in Section 4, the Fisher has the interpretation of being the expectation under of the Hessian of . Meanwhile, the Hessian of is given by the expectation under of the Hessian of (where is given by the density ), and so can be seen as an approximation of in this sense. Moreover, we can see that by computing using instead of as described above, the quality of this approximation will arguably be better since is also computed using instead of .
6 Geometric interpretation
The negative gradient can be interpreted as the steepest descent direction for in the sense that it yields the most reduction in per unit of change in , where change is measured using the standard Euclidean norm . More formally we have
This interpretation exposes the strong dependence of the gradient on the Euclidean geometry of the parameter space (as defined by the norm ).
One way to motivate the natural gradient is to show that it can be viewed as a steepest descent direction, much like the negative gradient can be, except with respect to a metric that is intrinsic to the distributions being modeled as opposed to the default Euclidean metric in parameter space. In particular, the natural gradient can be derived by adapting the steepest descent formulation to use an alternative definition of (local) distance based on the “information geometry” (Amari and Nagaoka, 2000)
of the space of probability distributions (as induced by the parameters). The particular distance function333Note that this is not a formal “distance” function in the usual sense since it is not symmetric. which gives rise to the natural gradient turns out to be
To make this formal, we will first show how the KL divergence and the Fisher are fundamentally connected. The Taylor series expansion of the above distance is
where “” is short-hand to mean terms that are order 3 or higher in the entries of . Thus defines the local quadratic approximation of this distance, and so gives the mechanism of local translation between the geometry of the space of distributions, and that of the original parameter space with its default Euclidean geometry.
To make use of this connection we first observe that for a general positive definite matrix we have
where the notation is defined by .
Then taking and using the above Taylor series expansion of the KL divergence to show that as , with some extra work (e.g. Arnold et al., 2011) it follows that
Thus the negative natural gradient is indeed the steepest descent direction in the space of distributions where distance is (approximately) measured in local neighborhoods by the KL divergence. While this might seem impossible since the KL divergence is in general not symmetric in its two arguments, it turns out that is locally/asymptotically symmetric as goes to zero, and so will be (approximately) symmetric in a local neighborhood 444This follows from the fact the second order term of the Taylor series of is also given by ..
Note that both and are defined in terms of the standard basis in -space and so obviously depend on the parameterization of . But the KL divergence does not, and instead only depends on the form of the predictive distribution . Thus, the direction in distribution space defined implicitly by will be invariant to our choice of parameterization (whereas the direction defined by will not be).
By using the smoothly varying positive semi-definite (PSD) matrix
to locally define a metric tensor at every point in parameter space, a Riemannian manifold can be generated over the space of distributions. Note that the associated metric of this space won’t be the KL divergence (this isn’t even a valid metric), although it will be “locally equivalent” to the square root of the KL divergence in the sense that the two will approximate each other within a small neighborhood.
When we use the KL divergence objective function discussed in Section 4, the geometric interpretation of the natural gradient becomes particularly nice. This is because the objective function will locally measure distance in distribution space the same way that it is locally measured in the steepest descent interpretation of the natural gradient. In this view, smoothly following the natural gradient is equivalent to following the geodesic path in the Riemannian manifold from the current distribution towards the target distribution (which may never be reached due to the presence of singularities).
7 The generalized Gauss-Newton matrix
The classical Gauss-Newton matrix (or more simply the Gauss-Newton matrix) is the curvature matrix which arises in the Gauss-Newton method for non-linear least squares problems. It is applicable to our standard neural network training objective in the case where , and is given by , where is the Jabobian of w.r.t. the parameters . It is usually defined as the approximation to the Hessian of (w.r.t. ) obtained by dropping the second term inside the sum of the following expression for :
where is the Hessian (w.r.t. ) of the -th component of .
An alternative way to derive the classical Gauss-Newton is to simply replace the non-linear function by its own local linear approximation, centered at the current value of . In particular, we replace by so that becomes a quadratic function of , with derivative and Hessian given by .
Schraudolph (2002) showed how the idea of the Gauss-Newton matrix can be generalized to the situation where is any loss function which is convex in . The generalized formula for is
where is the Hessian of w.r.t. , evaluated at . Because is convex, will be PSD for each , and thus so will . We will call this the Generalized Gauss-Newton matrix (GGN). Analogously to the case of the classical Gauss-Newton matrix (which assumed ), the GGN can be obtained by dropping the second term inside the sum of the following expression for the Hessian :
where is the gradient of w.r.t. , evaluated at .
Like the Hessian, the GGN can be used to define a local quadratic model of given by:
In approximate Newton/2nd-order methods based on the GGN, parameter updates are computed by minimizing w.r.t. . Since the exact minimizer is often too difficult to compute, practical methods like the Hessian-free optimization of Martens (2010), or Krylov Subspace Descent (Vinyals and Povey, 2012), will only approximately minimize .
A key property of which is not shared by the Hessian is that it is positive semi-definite (PSD), and can thus be used to define a local quadratic model to the objective which is bounded. While the unboundedness of local quadratic models defined by the Hessian can be worked around by imposing a trust region, it has nevertheless been observed by various researchers Schraudolph (2002); Martens (2010); Vinyals and Povey (2012) that works much better in practice for neural network optimization.
Since computing the whole matrix explicitly is usually too expensive, the GGN is typically accessed via matrix-vector products. To compute such products efficiently one can use the method of Schraudolph (2002)
, which is a generalization of the well-known method for computing such products with the classical Gauss-Newton. The method is similar in cost and structure to standard backpropagation, although it can sometimes be tricky to implement (seeMartens and Sutskever (2012)).
As pointed out in Martens and Sutskever (2011), the GGN can also be derived by generalizing the previously described alternative derivation of the classical Gauss-Newton matrix to the situation where is an arbitrary convex loss. In particular, if we substitute for in as before, it is not difficult to see that the Hessian of the resulting will be equal to the GGN.
Schraudolph (2002) advocated that when computing the GGN, and be redefined so that as much as possible of the network’s computation is formally performed by instead of , while maintaining the convexity of . This is because, unlike , is not linearly approximated in the GGN, and so its associated second-order derivative terms are faithfully captured. What this almost always means in practice is that what is usually thought of as the final non-linearity of the network (i.e. ) is folded into , and the network itself just computes the identity function at its top layer. Interestingly, in many natural situations which occur in practice, doing this gives a much simpler and more elegant expression for . Exactly when and why this happens will be made clear in Section 8.
Because contributions made to the GGN for each training case and each individual component of are PSD, there can be no cancellation between positive and negative/indefinite contributions. This means that the GGN can be more robustly estimated from subsets of the training data than the Hessian. By analogy, consider how it is harder to estimate the scale of the mean value of a variable when that variable can take on both positive and negative values and has a mean close to .
The GGN is arguably a much more “conservative” curvature matrix for similar reasons. For example, if the curvature associated with some direction is large according to a version of the obtained by averaging over just a few training cases, this will be reflected to some degree in the full version of (obtained by averaging over the whole training set). By contrast, if we instead use the Hessian, cancellation with negative curvature from other training cases in the direction becomes a distinct possibility, and can lead to very small (or negative) overall curvature in that direction. Intuitively, the Hessian-based quadratic model is saying “there is a large quadratic penalty on certain cases for going in this direction but it will be offset by a large quadratic reward on certain other cases”. Since negative curvature is arguably less trustworthy over long distances than positive curvature, it is probably better not to allow it to override positive curvature in this manner, and thus the GGN seems superior in this regard.
8 Computational aspects of the natural gradient and connections to the generalized Gauss-Newton matrix
where is the Jacobian of w.r.t. , and is the gradient of w.r.t. , evaluated at ( is defined at the end of Section 4).
As was first shown by Park et al. (2000), the Fisher information matrix is thus given by
where is the Fisher information matrix of the predictive distribution at .
is given by
where is the Hessian of w.r.t. , evaluated at .
Note that even if ’s density function is known, and is relatively simple, only for certain choices of and ) will it be possible to analytically evaluate the expectation w.r.t. in the above expression for .
For example, if we take , , and to be a simple neural network with no hidden units and a single tan-sigmoid output unit, then both and its inverse can be computed efficiently (Amari, 1998). This situation is exceptional however, and for even slightly more complex models, such as neural networks with one or more hidden layers, it has never been demonstrated how to make such computations feasible in high dimensions.
Fortunately, the situation improves significantly if is replaced by , as this gives
which is easy to evaluate when is. Moreover, this is essentially equivalent to the expression in eqn. 5 for the generalized Gauss-Newton matrix (GGN), except that we have instead of as the “inner” matrix.
It also suggests a straightforward and efficient way of computing matrix-vector products with , using an approach similar to the one in Schraudolph (2002) for computing matrix-vector products with the GGN. In particular, one can multiply by using a linearized forward pass, then multiply by (which will be easy if is sufficiently simple), and then finally multiply by using a standard backwards pass.
As we shall see in the remainder of this section, the connections between the GGN and the Fisher run deeper than just similar expressions and similar algorithms for computing matrix-vector products.
In Park et al. (2000) it was shown that if the density function of has the form where is some univariate density function over , then is equal to a re-scaled555Where the re-scaling constant is determined by properties of . version of the classical Gauss-Newton matrix for non-linear least squares, with regression function given by . And in particular, the choice turns the learning problem into exactly non-linear least squares, and into precisely the classical Gauss-Newton matrix.
Heskes (2000) showed that the Fisher and the classical Gauss-Newton matrix are equivalent in the case of the squared error loss and proposed using the Fisher as an approximation to the Hessian in more general contexts.
Concurrently with this work, Pascanu and Bengio (2014) showed that for several common loss functions like cross-entropy and squared error, the GGN and the Fisher are equivalent.
We will show that in fact there is a much more general equivalence between the two matrices, starting from observation that the expressions for the GGN in eqn. 5 and the Fisher in eqn. 6 are identical up to the equivalence of and .
First, note that may not even be convex in , and so the GGN won’t necessarily be well defined. But even if is convex in , it won’t be true in general that , and so the GGN and the Fisher will differ. However, there is an important class of ’s for which will hold, provided that we have (putting us in the framework of Section 4).
Notice that , and (which follows from ). Thus, the two matrices being equal is equivalent to the condition
While this condition may seem arbitrary, it is actually very natural and holds in the important case where corresponds to an exponential family model with “natural” parameters given by . That is, when we have
for some function , where is the normalizing constant/partition function. In this case we have which doesn’t depend on , and so eqn. 7 holds trivially.
Examples of such ’s include:
multivariate normal distributions whereparameterizes only the mean
multivariate normal distributions where is the concatenation of and the vectorization of
multinomial distributions where the softmax of is the vector of probabilities for each class
Note that the loss function corresponding to the multivariate normal is the familiar squared error, and the one corresponding to the multinomial distribution is the familiar cross-entropy.
As discussed in Section 7, when constructing the GGN one must pay attention to how and are defined with regards to what parts of the neural network’s computation are performed by each function. For example, the softmax computation performed at the final layer of a classification network is usually considered to be part of the network itself and hence to be part of . The output of this computation are normalized probabilities, which are then fed into a cross-entropy loss of the form . But the other way of doing it, which Schraudolph (2002) recommends, is to have the softmax function be part of instead of , which results in a GGN which is slightly closer to the Hessian due to “less” of the computational pipeline being linearized before taking the 2nd-order Taylor series approximation. The corresponding loss function is in this case. As we have established above, doing it this way also has the nice side effect of making the GGN equivalent to the Fisher, provided that is a exponential family model with as its natural parameters.
This (qualified) equivalence between the Fisher and the GGN suggests how the GGN can be generalized to cases where it might not otherwise be well defined. In particular, it suggests formulating the loss as the negative log density for some distribution and then taking the Fisher of this distribution. Sometimes, this might be as simple as defining as per the discussion at the end of Section 4.
9 Constructing practical natural gradient methods, and the role of damping
Assuming that it is easy to compute, the simplest way to use the natural gradient in optimization is to substitute it in place of the standard gradient within a basic gradient descent approach. This gives the iteration
where is a schedule of learning rates.
Choosing the learning rate schedule can be difficult. There are adaptive schemes which are largely heuristic in nature(Amari, 1998) and some non-adaptive prescriptions such as , which have certain theoretical convergence guarantees in the stochastic setting, but which won’t necessarily work well in practice.
Ideally, we would want to apply the natural gradient method with infinitesimally small steps and produce a smooth idealized path through the space of realizable distributions. But since this is usually impossible in practice, and we don’t have access to any other simple description of the class of distributions parameterized by that we could work with more directly, we must take non-negligible discrete steps in the given parameter space666 In principle, we could move to a much more general class of distributions, such as those given by some non-parametric formulation, where we could work directly with the distributions themselves. But even assuming such an approach would be practical from a computational efficiency standpoint, we would lose the various advantages that we get from working with powerful parametric models like neural networks. In particular, we would lose their ability to generalize to unseen data by modeling the “computational process” which explains the data, instead of merely using smoothness and locality to generalize.
In principle, we could move to a much more general class of distributions, such as those given by some non-parametric formulation, where we could work directly with the distributions themselves. But even assuming such an approach would be practical from a computational efficiency standpoint, we would lose the various advantages that we get from working with powerful parametric models like neural networks. In particular, we would lose their ability to generalize to unseen data by modeling the “computational process” which explains the data, instead of merely using smoothness and locality to generalize..
The fundamental problem with simple schemes such as the one in eqn. 8 is that they implicitly assume that the natural gradient is a good direction to follow over non-negligible distances in the original parameter space, which will not be true in general. Traveling along a straight line in the original parameter space will not yield a straight line in distribution space, and so the resulting path may instead veer far away from the target that the natural gradient originally pointed towards. This is illustrated in Figure 1.
Fortunately, we can exploit the (qualified) equivalence between the Fisher and the GGN in order to produce natural gradient-like updates which will often be appropriate to take with . In particular, we know from the discussion in Section 7 that the GGN can serve as a reasonable proxy for the Hessian of , and will often produce smaller and more “conservative” updates as it tends to model the curvature as being higher in most directions than the Hessian does. Meanwhile, the update produced by minimizing the GGN-based local quadratic model is given by , which will be equal to the natural gradient when . Thus, the natural gradient, with scaling factor , can be seen as the optimal update according to an approximate, and perhaps slightly conservative, 2nd-order model of .
But just as in the case of approximate Newton methods, the break-down in the accuracy of the 2nd-order approximation of , combined with the potential for the natural gradient to be very large (e.g. when
contains some very small eigenvalues), can often lead to very large and very poor update proposals. And simply re-scaling the update by reducingmay be too crude a mechanism to deal with this subtle problem, as it will affect all eigen-directions (of ) equally, including those in which the natural gradient is already sensible or even overly conservative.
Instead, the connection between natural gradient descent and 2nd-order methods suggests the use of some of the various update “damping” techniques that have been developed for the latter, which work by constraining or penalization the solution for in various ways during the optimization of . Examples include Tikhonov regularization/damping and the closely related trust-region method (e.g. Nocedal and Wright, 2006), and other more sophisticated ones such as the “structural damping” approach of Martens and Sutskever (2011), or the approach present in Krylov Subspace Descent (Vinyals and Povey, 2012). See Martens and Sutskever (2012) for an in-depth discussion of these and other damping techniques.
This idea is well supported by practical experience since, for example, the Hessian-free optimization approach of Martens (2010) generates its updates using an Tikhonov damping scheme applied to the GGN matrix (which in the experiments considered are equivalent to the Fisher), and these updates are used effectively with and make a lot more progress on the objective than optimally re-scaled updates computed without damping (i.e. the raw natural gradient).
10 The empirical Fisher
An approximation of the Fisher known as the “empirical Fisher” (denoted ), which is often used in practical natural gradient methods, is obtained by taking the inner expectation of eqn. 3 over the target distribution (or its empirical surrogate ) instead of the model’s distribution .
In the case that one uses , this yields the following simple form:
This matrix is often incorrectly referred to as the Fisher, or even the Gauss-Newton, although it is in general not equivalent to those matrices.
10.1 Comparisons to the standard Fisher
Like the Fisher , the empirical Fisher is PSD. But unlike , it is essentially free to compute, provided that one is already computing the gradient of . It can also be applied to objective functions which might not involve a probabilistic model in any obvious way.
Compared to , which is of rank , has a rank of , which can make it easier to work with in practice. For example, the problem of computing the diagonal (or various blocks) is easier for the empirical Fisher than it is for higher rank matrices like the standard Fisher (Martens et al., 2012). This has motivated its use in optimization methods such as TONGA (Le Roux et al., 2008), and as the diagonal preconditioner of choice in the Hessian-free optimization method (Martens, 2010). Interestingly however, there are stochastic estimation methods (Chapelle and Erhan, 2011; Martens et al., 2012) which can be used to efficiently estimate the diagonal (or various blocks) of the standard Fisher , and these work quite well in practice.
Despite the various advantages of using , there are good reasons to use instead of whenever possible. In addition to Amari’s extensive theory developed for the exact natural gradient (which uses ), perhaps the best reason for using over is that turns out to be a reasonable approximation to the Hessian of in certain important special cases, which is a property that lacks in general.
For example, as discussed in Section 5, when the loss is given by (as in Section 4), can be seen as an approximation of , because both matrices have the interpretation of being the expected Hessian of the loss under some distribution. Due to the similarity of the expression for in eqn. 3 and the one above for it might be tempting to think that is given by the expected Hessian of the loss under (which is actually the formula for ) in the same way that is given by eqn. 4, however this is not the case in general.
And as we saw in Section 8, given certain assumptions about how the GGN is computed, and some additional assumptions about the form of the loss function , turns out to be equivalent to the GGN. This is very useful since the GGN can be used to define a local quadratic approximation of , whereas normally doesn’t have such an interpretation. Moreover, Schraudolph (2002) and later Martens (2010) compared to the GGN and observed that the latter performed much better as a curvature matrix within various neural network optimization methods.
As concrete evidence for why the empirical Fisher is, at best, a questionable choice for the curvature matrix, consider the following example. We will set , , , and , so that is a simple convex quadratic function of given by . In this example we have that , , while . If we use as our curvature matrix for some exponent , then it is easy to see that an iteration of the form
will fail to converge to the minimizer () unless and the learning rate goes to sufficiently fast. And even when it does converge, it will only be at a rate comparable to the speed at which goes to , which in typical situations will be either or . Meanwhile, a similar iteration of the form
which uses the exact Fisher as the curvature matrix, will experience very fast linear convergence777Here we mean “linear” in the classical sense that and not in the sense that with rate , for any fixed learning rate satisfying .
It is important to note that this example uses a noise-free version of the gradient, and that this kind of linear convergence is (provably) impossible in most realistic stochastic/online settings. Nevertheless, we would argue that a highly desirable property of any stochastic optimization method should be that it can, in principle, revert to an optimal (or nearly optimal) behavior in the deterministic setting. This might matter a lot in practice, since the gradient may end up being sufficiently well estimated in earlier stages of optimization from only a small amount of data (which is a common occurrence in our experience), or in later stages provided that larger mini-batches or other variance-reducing procedures are employed (e.g. Le Roux et al., 2012; Johnson and Zhang, 2013).
10.2 Recent diagonal methods based on the empirical Fisher
Recently, a spate of stochastic optimization methods have been proposed that are all based on diagonal approximations of the empirical Fisher . These include the diagonal version of AdaGrad (Duchi et al., 2011)
, RMSProp(Tieleman and Hinton, 2012), Adam (Ba and Kingma, 2015), etc. Such methods use iterations of the following form (possibly with some slight modifications):
where the curvature matrix is taken to be a diagonal matrix with adapted to maintain some kind of estimate of the diagonal of (possibly using information from previous iterates/mini-batches), is an estimate of produced from the current mini-batch, is a schedule of learning rates, and and are “fudge factors” (discussed later in this section).
There are also slightly more sophisticated methods (Schaul et al., 2013; Zeiler, 2013) which use preconditioners that combine the diagonal of with other quantities (such as an approximation of the diagonal of the Gauss-Newton/Fisher in the case of Schaul et al. (2013)) in order to correct for how the empirical Fisher doesn’t have the right “scale” (which is ultimately the reason why it does poorly in the example given at the end of Section 10.1).
A diagonal preconditioner of the form used in eqn. 9 was also used to accelerate the CG sub-optimizations performed within HF (Martens, 2010). In the context of CG, the improper scale of is not as serious an issue due to the fact that CG is invariant to the overall scale of its preconditioner (since it computes an optimal “learning rate” at each step which automatically adjusts for the scale). However, it still makes more sense to use the diagonal of as a preconditioner, and thanks to the method proposed by Chapelle and Erhan (2011), this can be estimated efficiently and accurately.
While the idea of using the diagonal of , , or the Gauss-Newton as a preconditioner for stochastic gradient descent (SGD) is sometimes incorrectly attributed to Duchi et al. (2011), it actually goes back much earlier, and was likely first applied to neural networks with the work of Lecun and collaborators (Becker and LeCun, 1989; LeCun et al., 1998), who proposed an iteration of the form in eqn. 9 with where approximates the diagonal of the Hessian or the Gauss-Newton matrix (which as shown in Section 8, is actually equivalent to for the common squared-error loss).
Following the early pioneering work of Lecun, Amari, and their collaborators, various neural network optimization methods have been developed over the last couple of decades that use diagonal, block-diagonal, low-rank, or Krylov-subspace based approximations of or as a curvature matrix/preconditioner. In addition to methods based on diagonal approximations already mentioned, some methods based on non-diagonal approximations include the method of Park et al. (2000), TONGA (Le Roux et al., 2008), Natural Newton (Le Roux and Fitzgibbon, 2010), HF (Martens, 2010), KSD (Vinyals and Povey, 2012) and many more.
The idea of computing an estimate of the (empirical) Fisher using a history of previous iterates/mini-batches appeared in various early works. The particular way of doing this advocated by Duchi et al. (2011), which was to use an equally weighted average of all past gradients and which was done in order to make it possible to prove a particular regret bound888The total regret at iteration is defined as for the optimal , and provides a measure of the speed of convergence of an online optimization algorithm which is particularly popular in the convex optimization community., is – not surprisingly – a poor choice in practice (Tieleman and Hinton, 2012) compared to the older and more intuitive approach of using an exponentially decayed running average (e.g. LeCun et al., 1998; Park et al., 2000), which naturally “forgets” very old contributions to the estimate which are based on stale parameter values.
It is important to observe that the way is estimated can affect the convergence characteristics of an iteration like eqn. 9 in subtle and important ways. For example, if is estimated using gradients from previous iterations, and especially if it is the average of all past gradients as in AdaGrad, it may shrink sufficiently slowly that the convergence issues seen in the example at the end of Section 10.1 are avoided. Moreover, for reasons related to this phenomenon, it seems likely that the proofs of regret bounds in Duchi et al. (2011) and the related work of Hazan et al. (2007) could not be modified work if the exact , computed only at the current , were used. Developing a better understanding of this issue, and the relationship between methods and theories such as AdaGrad developed in the convex optimization literature, and classical stochastic 2nd-order methods and theories (e.g Murata, 1998; Bottou and LeCun, 2005) remains an interesting direction for future research.
The constants and present in eqn. 9 are often thought of as “fudge factors” designed to correct for the “poor conditioning” (Becker and LeCun, 1989) of the curvature matrix, or to guarantee boundedness of the updates and prevent the optimizer from “blowing up” (LeCun et al., 1998). However, these explanations are severe oversimplifications at best. A much more compelling and useful explanation, at least in the case of , comes from viewing the update in eqn. 9 as being the minimizer of a local quadratic approximation to , as discussed in Section 9. In this view, plays the role of a Tikhonov damping parameter (e.g. Martens and Sutskever, 2012) which is added to in order to ensure that the proposed update stays within a certain radius around zero in which remains a reasonable approximation to . Note that this explanation implies that no single fixed value of will be appropriate throughout the entire course of optimization (since the local properties of the objective will change), and so an adaptive adjustment scheme, such as the one present in HF (Martens, 2010) (based on the Levenberg-Marquardt method) should be used.
The use of the exponent first appeared in HF as part of its diagonal preconditioner for CG, and was justified as a way of making the curvature estimate “more conservative” by making it closer to a multiple of the identity, to compensate for the diagonal approximation being made (among other things). Around the same time, Duchi et al. (2011) proposed to use within an update of the form of eqn. 9, which was important in proving a certain regret bound both for the diagonal and non-diagonal versions of the method. However, it is noteworthy that while the use of would invalidate this particular bound (or at least its existing proof), it is relatively simple to prove a different bound for the case by a minor modification of the original argument of Duchi et al. (2011), provided that one appropriately modifies the schedule for the learning rate so that the updates scale as .
To shed some light on the question of , we can consider the work of Hazan et al. (2007), who like Duchi et al. (2011) developed and analyzed an online approximate Newton method within the framework of online convex optimization. Like the non-diagonal version of AdaGrad, the method proposed by Hazan et al. (2007) uses an estimate of the empirical Fisher computed as the average of gradients from all previous iterations. While impractical for high dimensional problems like any non-diagonal method is (or at least, one that doesn’t make some other strong approximation of the curvature matrix), this method achieves a superior upper bound on the regret than Duchi et al. (2011) was able to show for AdaGrad ( instead of , where is the total number of iterations), which was possible in part due to the use of slightly stronger hypotheses about the properties of (e.g. that for each and , is a strongly convex function of ). Notably, this method uses , just as in standard natural gradient descent, which provides support for such a choice, especially since the used in neural networks usually satisfies these stronger assumptions (at least when regularization is used).
However, it is important to note that Hazan et al. (2007) also proves a bound on the regret for a basic version of SGD, and that what actually differentiates the various methods they analyze is the constant hidden in the big-O notation, which is much larger for the version of SGD they consider than for their approximate Newton method. In particular, the former depends on a quantity which grows with the condition number of the Hessian at , while the latter does not, in a way that echos the various analyses performed on stochastic gradient descent and stochastic approximations of Newton’s method in the more classical “local-convergence” setting (e.g. Murata, 1998; Bottou and LeCun, 2005).
11 Asymptotic convergence speed
11.1 Amari’s Fisher efficiency result
A property of natural gradient descent which is frequently referenced in the literature is that it is “Fisher efficient”. In particular, Amari (1998) showed that an iteration of the form
when applied to an objective of the form discussed in Section 4, with shrinking as , and with where is a stochastic estimate of (from a single training case), will produce an estimator which is asymptotically “Fisher efficient”. This means that will tend to an unbiased estimator of the global optimum , and that its expected squared error matrix (which is asymptotically equal to its variance) will satisfy
which is (asymptotically) the smallest999With the usual definition of for PSD matrices: iff is PSD. possible variance matrix that any unbiased estimator based training cases can have, according to the Cramér-Rao lower bound.
This result can also be straightforwardly extended to handle the case where is computed using a mini-batch of size (which uses independently sampled cases at each iteration), in which case the above asymptotic variance bound becomes
which again matches the Cramér-Rao lower bound.
Note that this result, as stated above, applies to the version of natural gradient descent where is computed using the training distribution (see Section 5). If we instead consider the version where is computed using the true data distribution , then a similar result will still apply, provided that we sample of from and from when computing the stochastic gradient , and that is defined as the minimum of the idealized objective (see Section 4).
While this Fisher efficiency result would seem to suggest that natural gradient descent is the best possible optimization method in the stochastic setting, it unfortunately comes with several important caveats, which we discuss below.
Firstly, the proof assumes that the iteration in eqn. 10 eventually converges to the global optimum (at an unspecified speed). While this assumption can be justified when the objective is convex (provided that is chosen appropriately), it won’t be true in general for non-convex objectives, such as those encountered in neural network training. In practice, a reasonable local optimum might be a good surrogate for the global optimum, in which case a property roughly analogous to asymptotic Fisher efficiency may still hold, at least approximately.
Secondly, it is assumed in Amari’s proof that is computed using the full training distribution , which in the case of neural network optimization usually amounts to an entire pass over the training set . So while the proof allows for the gradient to be stochastically estimated from a mini-batch, it doesn’t allow this for the Fisher . This is a serious challenge to the idea that (stochastic) natural gradient descent gives an estimator which makes optimal use of the training data that it sees. And note that while one can approximate using samples of from , which is a solution that often works well in practice (depending on how much data is used, and whether the estimate can feasibly be “accumulated” across multiple iterations), a Fisher efficiency result like the one proved by Amari (1998) will likely no longer hold. Investigating the manner and degree in which it may hold approximately when is estimated in this way is an interesting direction for future research.
A third issue with Amari’s result is that it is given in terms of the convergence of according to its own (arbitrary) euclidean geometry instead of the arguably more relevant objective function value. Fortunately, it is straightforward to obtain the former from the latter. In particular, by applying Taylor’s theorem and using we have
where and is short-hand to mean a function which is cubic in the entries of . From this it follows that
where we have used which follows from the “realizability” hypothesis used to prove the Fisher efficiency result (see below). Note that we have also used , which is an (unjustified) assumption that is used in Amari’s proof. This assumption has intuitive appeal since , and so it makes sense that would shrink faster. However, extreme counterexamples are possible which involve very heavy-tailed distributions on over unbounded regions. By adding some mild hypotheses such as being restricted to some bounded region, which is an assumption frequently used in the convex optimization literature, it is possible to justify this assumption rigorously. Rather than linger on this issue we will refer the reader to Bottou and LeCun (2005), which provides a more rigorous treatment of these kind of asymptotic results, using various generalizations of the big-O notation.
Note that while this is the same convergence rate () as the one which appears in Hazan et al. (2007) (see our Section 10), the constant is much better. However, the comparison is slightly unfair, since Hazan et al. (2007) doesn’t require that the curvature matrix be estimated on the entire dataset (as discussed above).
The forth and final caveat of Amari’s Fisher efficiency result is that Amari’s proof assumes that the training distribution and the optimal model distribution coincide, a condition called “realizability” (which is also required in order for the Cramér-Rao lower bound to apply). This means, essentially, that the model perfectly captures the training distribution at . This assumption is used in Amari’s proof of the Fisher efficiency result to show that the Fisher , when evaluated at , is equal to both the empirical Fisher and the Hessian of .
It is not clear from Amari’s proof what happens when this correspondence fails to hold at , and whether a (perhaps) weaker asymptotic upper bound on the variance might still be provable. Fortunately, various authors (Murata, 1998; Bottou and LeCun, 2005; Bordes et al., 2009) building on the early work of Amari (1967), provide some further insight into this question by studying asymptotic behavior of general iterations of the form101010Note that some authors define to be the matrix that multiplies the gradient, instead of its inverse (as we do instead).