Limitations of the Empirical Fisher Approximation
Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information. Several highly visible works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We dispute this argument by showing that the empirical Fisher---unlike the Fisher---does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies of the empirical Fisher can have undesirable effects.
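Consider a supervised machine learning problem of predicting outputs $y \in \mathbb{Y}$ from inputs $x \in \mathbb{X}$. We assume a probabilistic model for the conditional distribution of the form $p_\theta(y \mid x) = p(y \mid f(x, \theta))$, where $p(y \mid \cdot)$ is an exponential family with natural parameters in $\mathbb{F}$ and $f \colon \mathbb{X} \times \mathbb{R}^D \to \mathbb{F}$ is a prediction function parameterized by $\theta \in \mathbb{R}^D$. Given $N$ iid training samples $(x_n, y_n)_{n=1}^N$, we want to minimize
$$L(\theta) := -\sum_n \log p_\theta(y_n \mid x_n) = -\sum_n \log p(y_n \mid f(x_n, \theta)). \tag{1}$$
This framework covers common scenarios such as least-squares regression ($\mathbb{Y} = \mathbb{F} = \mathbb{R}$ and $p(y \mid f) = \mathcal{N}(y; f, \sigma^2)$ with fixed $\sigma^2$) or $C$-class classification with cross-entropy loss ($\mathbb{Y} = \{1, \dots, C\}$, $\mathbb{F} = \mathbb{R}^C$, and $p(y = c \mid f) = \exp(f_c) / \sum_i \exp(f_i)$) with an arbitrary prediction function $f$.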
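Eq. (1) can be minimized by gradient descent, which updates $\theta_{t+1} = \theta_t - \gamma_t \nabla L(\theta_t)$ with step size $\gamma_t \in \mathbb{R}$. This update can be preconditioned, $\theta_{t+1} = \theta_t - \gamma_t B_t^{-1} \nabla L(\theta_t)$, with a matrix $B_t$ that incorporates additional information such as local curvature. Choosing $B_t$ to be the Hessian yields Newton’s method, but computing it is often burdensome and might not even be desirable for non-convex problems. A prominent variant in machine learning is natural gradient descent (NGD; Amari, 1998), which preconditions with the Fisher information matrix (or simply “Fisher”) of the probabilistic model,
$$F(\theta) := \sum_n \mathbb{E}_{p_\theta(y \mid x_n)}\left[\nabla_\theta \log p_\theta(y \mid x_n) \, \nabla_\theta \log p_\theta(y \mid x_n)^\top\right]. \tag{2}$$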
This adapts to the information geometry of the model. While this motivation is conceptually distinct from approximating the Hessian, the Fisher coincides with a generalized Gauss-Newton (GGN; Schraudolph, 2002) approximation of the Hessian for the problems presented here. We discuss this in detail in Section 2. This gives NGD theoretical grounding as an approximate second-order method.
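A number of recent works in machine learning have relied on a certain approximation of the Fisher, which is often called the empirical Fisher (EF) and is defined as
$$\tilde{F}(\theta) := \sum_n \nabla_\theta \log p_\theta(y_n \mid x_n) \, \nabla_\theta \log p_\theta(y_n \mid x_n)^\top. \tag{3}$$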
At first glance, this approximation is merely replacing the expectation over $y$ in Eq. (2) with a sample $y_n$. However, $y_n$ is a training label and not a sample from the model’s predictive distribution $p_\theta(y \mid x_n)$. Therefore, and contrary to what its name suggests, the empirical Fisher is not an empirical (i.e. Monte Carlo) estimate of the Fisher. Due to the unclear relationship between the model distribution and the data distribution, the theoretical grounding of the EF approximation is dubious. Despite that, the empirical Fisher approximation has seen widespread adoption, possibly because it is convenient to compute as a simple sum of outer products of individual gradients. We will provide a survey of papers using the EF in Section 1.3.
The main purpose of this work is to provide a detailed critical discussion of the empirical Fisher approximation. While the discrepancy between the EF and the Fisher has been mentioned in the literature before (Pascanu and Bengio, 2014; Martens, 2014), we see the need for a detailed elaboration of the subtleties of this important issue. The intricacies of the relationship between the empirical Fisher and the Fisher remain opaque from the current literature. Not all authors using the EF seem to be fully aware of the heuristic nature of this approximation and overlook its shortcomings, which can be seen clearly even on simple linear regression problems, see Fig. 1.
The ubiquity of the EF approximation reaches the extent that the EF is sometimes just called the Fisher (e.g., Chaudhari et al., 2017; Wen et al., 2019). Possibly as a result of this, there are examples of algorithms involving the Fisher, such as Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) and KFAC (Martens and Grosse, 2015), which have been re-implemented by third parties using the empirical Fisher. Interestingly, there is also at least one example of an algorithm that was originally developed using the empirical Fisher and later found to work better with the Fisher (Wierstra et al., 2008; Sun et al., 2009). As the empirical Fisher is now used beyond preconditioning, for example as an approximation of the Hessian in empirical works studying properties of neural network training objectives (Chaudhari et al., 2017; Jastrzebski et al., 2018), the pathologies of the EF approximation may lead the community to subtly erroneous conclusions; an arguably more worrisome outcome than a suboptimal preconditioner.
The poor theoretical grounding stands in stark contrast to the practical success that EF-based methods have seen. This paper is in no way meant to negate these practical advances but rather points out that the existing justifications for the EF approximation are insufficient and do not stand the test of simple examples. This indicates that there are effects at play that currently elude our understanding, which is not only unsatisfying, but might also prevent advancement of these methods. We hope that this paper helps spark interest in understanding these effects; our final chapter explores a possible direction.
We discuss related work in Section 1.3 and provide a short but complete overview of generalized Gauss-Newton and natural gradient in Section 2. We then proceed to the main contribution in Section 3, a critical discussion of the specific arguments used to advocate the empirical Fisher approximation. A principal conclusion is that, while the EF follows the formal definition of a generalized Gauss-Newton matrix, it is not guaranteed to capture any useful second-order information. We propose a clarifying amendment to the definition of a GGN. Furthermore, while there are conditions under which the EF approaches the true Fisher, we argue that these are unlikely to be met in practice. We illustrate that using the EF in such cases can lead to highly undesirable effects; Fig. 1 shows a first example.
This raises the question: Why are EF-based methods empirically successful? In Section 4, we point to an alternative explanation of EF-preconditioning as adapting to gradient noise in stochastic optimization, instead of adapting to curvature.
The generalized Gauss-Newton (Schraudolph, 2002) and natural gradient descent (Amari, 1998) methods have inspired a line of work on approximate second-order optimization (Martens, 2010; Botev et al., 2017; Park et al., 2000; Pascanu and Bengio, 2014; Ollivier, 2015). A successful example in modern deep learning is the KFAC algorithm (Martens and Grosse, 2015), which uses a computationally efficient structural approximation to the Fisher.
Numerous papers have relied on the empirical Fisher approximation for preconditioning and other purposes. Our critical discussion is in no way intended as an invalidation of these works. All of them provide important insights and the use of the empirical Fisher is usually not essential to the main contribution. However, there is a certain degree of vagueness regarding the relationship between the Fisher, the EF, Gauss-Newton matrices and the Hessian. Oftentimes, only limited attention is devoted to possible implications of the empirical Fisher approximation.
The most prominent example of preconditioning with the EF is Adam, which uses a moving average of squared gradients as “an approximation to the diagonal of the Fisher information matrix” (Kingma and Ba, 2015). The EF has been used in the context of variational inference by various authors (Graves, 2011; Zhang et al., 2018; Salas et al., 2018; Khan et al., 2018; Mishkin et al., 2018), some of which have drawn further connections between NGD and Adam. There are also several works building upon KFAC which substitute the EF for the Fisher (George et al., 2018; Osawa et al., 2018).
The empirical Fisher has also been used as an approximation of the Hessian for other purposes than preconditioning. Chaudhari et al. (2017) use it to investigate curvature properties of deep learning training objectives. It has also been employed to explain certain characteristics of SGD (Zhu et al., 2018; Jastrzebski et al., 2018) or as a diagnostic tool during training (Liao et al., 2018).
Le Roux et al. (2007) and Le Roux and Fitzgibbon (2010) have considered the empirical Fisher in its interpretation as the (non-central) covariance matrix of stochastic gradients. While they refer to their method as “Online Natural Gradient”, their goal is explicitly to adapt the update to the stochasticity of the gradient estimate, not to curvature. We will come back to this perspective in Section 4.
This section briefly introduces the generalized Gauss-Newton method and natural gradient descent, summarizes the known relationship between the GGN and the Fisher, and explains in which sense they approximate the Hessian. Details can be found in the appendix.
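The original Gauss-Newton algorithm is an approximation to Newton’s method for nonlinear least squares problems, $L(\theta) = \frac{1}{2} \sum_n (f(x_n, \theta) - y_n)^2$. By the chain rule, the Hessian can be written as
$$\nabla^2 L(\theta) = \sum_n \nabla_\theta f(x_n, \theta) \, \nabla_\theta f(x_n, \theta)^\top + \sum_n r_n \nabla_\theta^2 f(x_n, \theta) =: G(\theta) + R(\theta), \tag{4}$$
where $r_n = f(x_n, \theta) - y_n$ are the residuals. The first part, $G(\theta)$, is the Gauss-Newton matrix. For small residuals, $R(\theta)$ will be small and $G(\theta)$ will approximate the Hessian. In particular, when the model perfectly fits the data, the Gauss-Newton is equal to the Hessian.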
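The decomposition in Eq. (4) can be checked numerically. The sketch below is only an illustration under assumptions that are not from the paper: a hypothetical one-parameter model $f(x, \theta) = \exp(\theta x)$, noiseless targets, and a finite-difference Hessian. It prints the Gauss-Newton term, the residual term, and the finite-difference Hessian away from and at the optimum.

```python
import math

# Hypothetical toy model (an assumption for illustration): f(x, theta) = exp(theta * x)
def f(x, theta):
    return math.exp(theta * x)

def df(x, theta):
    # First derivative of f with respect to theta
    return x * math.exp(theta * x)

def d2f(x, theta):
    # Second derivative of f with respect to theta
    return x * x * math.exp(theta * x)

def loss(data, theta):
    return 0.5 * sum((f(x, theta) - y) ** 2 for x, y in data)

def gauss_newton(data, theta):
    # First term of the Hessian decomposition: sum of squared first derivatives
    return sum(df(x, theta) ** 2 for x, _ in data)

def residual_term(data, theta):
    # Second term: residuals times second derivatives of the model
    return sum((f(x, theta) - y) * d2f(x, theta) for x, y in data)

def hessian_fd(data, theta, h=1e-4):
    # Central finite-difference approximation of the second derivative of the loss
    return (loss(data, theta + h) - 2.0 * loss(data, theta) + loss(data, theta - h)) / h ** 2

theta_true = 0.7
xs = [0.1, 0.5, 1.0, 1.5]
data = [(x, f(x, theta_true)) for x in xs]  # noiseless, so a perfect fit exists

for theta in (0.2, theta_true):
    G = gauss_newton(data, theta)
    R = residual_term(data, theta)
    H = hessian_fd(data, theta)
    print(f"theta={theta}: G={G:.4f}  R={R:.4f}  G+R={G + R:.4f}  H(fd)={H:.4f}")
```

At the parameter that fits the data exactly the residual term is zero, so the Gauss-Newton matrix and the Hessian agree up to finite-difference error.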
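Schraudolph (2002) generalized this idea to objectives of the form $L(\theta) = \sum_n a_n(b_n(\theta))$, with $b_n \colon \mathbb{R}^D \to \mathbb{R}^M$ and $a_n \colon \mathbb{R}^M \to \mathbb{R}$, for which the Hessian can be written as
$$\nabla^2 L(\theta) = \sum_n [J_\theta b_n(\theta)]^\top \, \nabla_b^2 a_n(b_n(\theta)) \, [J_\theta b_n(\theta)] + \sum_{n,m} [\nabla_b a_n(b_n(\theta))]_m \, \nabla_\theta^2 b_n^{(m)}(\theta). \tag{5}$$
[Footnote 1: $J_\theta b_n(\theta) \in \mathbb{R}^{M \times D}$ is the Jacobian of $b_n$; we use the shortened notation $\nabla_b^2 a_n(b_n(\theta)) := \nabla_b^2 a_n(b)|_{b = b_n(\theta)}$; $[\cdot]_m$ selects the $m$-th component of a vector; and $b_n^{(m)}$ denotes the $m$-th component function of $b_n$.]
The generalized Gauss-Newton matrix (GGN) is defined as the part of the Hessian that ignores the second-order information of $b_n$,
$$G(\theta) := \sum_n [J_\theta b_n(\theta)]^\top \, \nabla_b^2 a_n(b_n(\theta)) \, [J_\theta b_n(\theta)]. \tag{6}$$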
If $a_n$ is convex, as is customary, the GGN is positive (semi-)definite even if the Hessian itself is not, making it a popular curvature matrix in non-convex problems such as neural network training. The GGN is ambiguous as it crucially depends on the “split” given by $a_n$ and $b_n$. As an example, consider the two following possible splits for the least-squares problem from above:
$$a_n(b) = \tfrac{1}{2}(b - y_n)^2, \quad b_n(\theta) = f(x_n, \theta), \qquad \text{or} \qquad a_n(b) = \tfrac{1}{2}(f(x_n, b) - y_n)^2, \quad b_n(\theta) = \theta. \tag{7}$$
The first recovers the classical Gauss-Newton, while in the second case, the GGN equals the Hessian. While this is an extreme example, the split will be important for our discussion.
Gradient descent follows the direction of “steepest descent”, the negative gradient. But the definition of steepest depends on a notion of distance and the gradient is defined with respect to the Euclidean distance. The natural gradient is a concept from information geometry (Amari, 1998) and applies when the gradient is taken w.r.t. the parameters $\theta$ of a probability distribution $p_\theta$. Instead of measuring the distance between parameters $\theta$ and $\theta'$ with the Euclidean distance, we use the Kullback–Leibler (KL) divergence between the distributions $p_\theta$ and $p_{\theta'}$. The resulting steepest descent direction is the negative gradient preconditioned with the Hessian of the KL divergence, which is exactly the Fisher information matrix of $p_\theta$,
$$F(\theta) := \mathbb{E}_{p_\theta(z)}\left[\nabla_\theta \log p_\theta(z) \, \nabla_\theta \log p_\theta(z)^\top\right] = \mathbb{E}_{p_\theta(z)}\left[-\nabla_\theta^2 \log p_\theta(z)\right]. \tag{8}$$
The second equality may seem counterintuitive; it crucially depends on the expectation being taken over the model distribution at $\theta$, see Appendix A. This equivalence highlights the relationship of the Fisher to the Hessian. In our setting, where we only model the conditional distribution $p_\theta(y \mid x)$, the Fisher is given by Eq. (2).
While NGD is not explicitly motivated as an approximate second-order method, the following result, noted by several authors, shows that the Fisher captures partial curvature information about the problem defined in Eq. (1). [Footnote 2: Heskes (2000) showed this for regression with squared loss, Pascanu and Bengio (2014) for classification with cross-entropy loss, and Martens (2014) for general exponential families. However, this has been known earlier in the statistics literature in the context of “Fisher Scoring” (see Wang (2010) for a review).]
Proposition 1. If $p(y \mid f)$ is an exponential family distribution with natural parameters $f$, then the Fisher information matrix coincides with the GGN of Eq. (1) using the split
$$a_n(b) = -\log p(y_n \mid b), \qquad b_n(\theta) = f(x_n, \theta), \tag{9}$$
and reads $F(\theta) = G(\theta) = -\sum_n [J_\theta f(x_n, \theta)]^\top \, \nabla_f^2 \log p(y_n \mid f(x_n, \theta)) \, [J_\theta f(x_n, \theta)]$.
For completeness, a proof can be found in Appendix A. The key insight is that $\nabla_f^2 \log p(y \mid f)$ does not depend on $y$ for exponential families. One can see Eq. (9) as the “canonical” split, since it matches the classical Gauss-Newton for the probabilistic interpretation of least-squares. From now on, when referencing “the GGN” without further specification, we mean this particular split.
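The equivalence between the Fisher and the GGN can be illustrated on a small case. The sketch below makes assumptions that are not from the paper: a Bernoulli likelihood with the logit as natural parameter, a hypothetical scalar-parameter prediction function $f(x, \theta) = \sin(\theta x)$, and made-up inputs and labels. It computes the Fisher by explicit expectation over $y \in \{0, 1\}$, the GGN from the canonical split, and the empirical Fisher from the labels.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

# Hypothetical prediction function with a scalar parameter (assumption for illustration)
def f(x, theta):
    return math.sin(theta * x)

def jac(x, theta):
    # Derivative of f with respect to theta
    return x * math.cos(theta * x)

def grad_log_p(y, x, theta):
    # d/dtheta log p(y | f(x, theta)) for a Bernoulli with logit f, y in {0, 1}
    return (y - sigmoid(f(x, theta))) * jac(x, theta)

def fisher(xs, theta):
    # Expectation of the squared gradient over the model's predictive distribution
    total = 0.0
    for x in xs:
        p1 = sigmoid(f(x, theta))
        for y, prob in ((0, 1.0 - p1), (1, p1)):
            total += prob * grad_log_p(y, x, theta) ** 2
    return total

def ggn(xs, theta):
    # J^T H J, where H = p(1 - p) is the Hessian of -log p(y | f) w.r.t. f (label-independent)
    total = 0.0
    for x in xs:
        p1 = sigmoid(f(x, theta))
        total += jac(x, theta) * (p1 * (1.0 - p1)) * jac(x, theta)
    return total

def empirical_fisher(xs, ys, theta):
    # Squared gradients evaluated at the training labels instead of model samples
    return sum(grad_log_p(y, x, theta) ** 2 for x, y in zip(xs, ys))

xs = [-1.0, 0.5, 2.0]
ys = [0, 1, 1]
theta = 0.8
print("Fisher:", fisher(xs, theta))
print("GGN:   ", ggn(xs, theta))
print("EF:    ", empirical_fisher(xs, ys, theta))
```

In this example the Fisher and the GGN agree to floating-point precision, while the empirical Fisher takes a different value because it depends on the labels.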
The GGN, and under the assumptions of Proposition 1 also the Fisher, are well-justified approximations of the Hessian and we can bound their approximation error in terms of the (generalized) residuals, mirroring the motivation behind the classical Gauss-Newton.
The approximation improves as the residuals in $\nabla_f \log p(y_n \mid f(x_n, \theta))$ diminish, and is exact if the data is perfectly fit.
Two arguments have been put forward to advocate the empirical Fisher approximation. Firstly, it has been argued that the EF follows the definition of a generalized Gauss-Newton matrix, making it an approximate curvature matrix in its own right. We examine this relation in §3.1 and show that, while technically correct, it does not entail the approximation guarantee usually associated with the GGN.
Secondly, a popular argument is that the EF approaches the Fisher at a minimum if the model “is a good fit for the data”. We discuss this argument in §3.2 and point out that it requires strong additional assumptions, which are unlikely to be met in practical scenarios. In addition, this argument only applies close to a minimum, which calls into question the usefulness of the empirical Fisher as a preconditioner. We discuss this in §3.3, showing that preconditioning with the EF leads to adverse effects on the scaling and the direction of the updates far from an optimum.
We use simple examples to illustrate our arguments. We want to emphasize that, as these are counter-examples to arguments found in the existing literature, they are designed to be as simple as possible, and deliberately do not involve intricate state-of-the-art models that would complicate analysis. On a related note, while contemporary machine learning often relies on stochastic optimization, we restrict our considerations to the deterministic (full-batch) setting to focus on the adaptation to curvature.
Bottou et al. (2018) point out that the EF matches the construction of a GGN (Eq. 6) using the split
$$a_n(b) = -\log b, \qquad b_n(\theta) = p(y_n \mid f(x_n, \theta)). \tag{11}$$
Although technically correct, we argue that this split does not provide a reasonable approximation. [Footnote 3: The equality can easily be verified by plugging the split (11) into the definition of the GGN (Eq. 6) and observing that $\nabla_b^2 a_n(b) = (\nabla_b a_n(b))^2$ as a special property of the choice $a_n(b) = -\log b$.]
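For example, consider a least-squares problem which corresponds to the log-likelihood $\log p(y \mid f) = \log \exp[-\frac{1}{2}(y - f)^2]$. In this case, Eq. (11) splits the identity function, $\log \exp(\cdot)$, and takes into account the curvature from the $\log$ while ignoring that of $\exp$. This questionable split runs counter to the basic motivation behind the classical Gauss-Newton matrix, that small residuals lead to a good approximation to the Hessian: The empirical Fisher
$$\tilde{F}(\theta) = \sum_n \nabla_\theta \log p_\theta(y_n \mid x_n) \, \nabla_\theta \log p_\theta(y_n \mid x_n)^\top = \sum_n r_n^2 \, \nabla_\theta f(x_n, \theta) \, \nabla_\theta f(x_n, \theta)^\top \tag{12}$$
approaches zero as the residuals become small. In that same limit, the Fisher $F(\theta)$ does approach the Hessian, which we recall from Eq. (4) to be given by $\nabla^2 L(\theta) = F(\theta) + \sum_n r_n \nabla_\theta^2 f(x_n, \theta)$. This argument generally applies for problems where we can fit all training samples such that $\nabla_\theta \log p_\theta(y_n \mid x_n) = 0$ for all $n$. In such cases, the EF goes to zero while the Fisher (and the corresponding GGN) approaches the Hessian (Prop. 2).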
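The argument can be made concrete with a small numerical sketch. It assumes, purely for illustration, a scalar linear model $f(x, \theta) = \theta x$ with unit-variance Gaussian likelihood and noiseless targets generated with a hypothetical `theta_star`; none of these values come from the paper. For a linear model the Fisher equals the Hessian everywhere, while the empirical Fisher shrinks to zero as the fit improves.

```python
xs = [0.5, 1.0, 1.5, 2.0]
theta_star = 2.0                        # hypothetical ground-truth parameter
ys = [theta_star * x for x in xs]       # noiseless targets, so a perfect fit exists

def fisher(theta):
    # For f(x, theta) = theta * x with unit-variance Gaussian likelihood the Fisher is
    # sum_n x_n^2. It does not depend on theta or on the labels, and it equals the
    # Hessian of the least-squares loss because the model is linear.
    return sum(x * x for x in xs)

def empirical_fisher(theta):
    # Sum of squared per-sample gradients: r_n^2 * x_n^2 with r_n = theta * x_n - y_n
    return sum((theta * x - y) ** 2 * x * x for x, y in zip(xs, ys))

for theta in (0.0, 1.0, 1.9, 2.0):
    print(f"theta={theta}: Fisher={fisher(theta):.3f}  EF={empirical_fisher(theta):.3f}")
```

At `theta = theta_star` the empirical Fisher is exactly zero even though the curvature of the loss is not.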
For the generalized Gauss-Newton, the role of the “residual” is played by the gradient $\nabla_b a_n(b)$; compare Equations (4) and (5). To retain the motivation behind the classical Gauss-Newton, the split should be chosen such that this gradient can in principle attain zero, in which case the residual curvature not captured by the GGN in (5) vanishes. The EF split (Eq. 11) does not satisfy this property, as $\nabla_b a_n(b) = -1/b$ can never go to zero for a probability $b \in [0, 1]$. It might be desirable to amend the definition of a generalized Gauss-Newton to enforce this property (the amendment is the final condition):
A split $L(\theta) = \sum_n a_n(b_n(\theta))$ with convex $a_n$ leads to a generalized Gauss-Newton matrix of $L$, defined as
$$G(\theta) = \sum_n [J_\theta b_n(\theta)]^\top \, \nabla_b^2 a_n(b_n(\theta)) \, [J_\theta b_n(\theta)], \tag{13}$$
if the split $a_n, b_n$ is such that there is $b_n^* \in \mathrm{Im}(b_n)$ such that $\nabla_b a_n(b)|_{b = b_n^*} = 0$.
Under suitable smoothness conditions, a split satisfying this condition will have a meaningful error bound akin to Proposition 2. (To avoid confusion, we want to note that this condition does not assume the existence of $\theta$ such that $b_n(\theta) = b_n^*$ for all $n$; only that the residual gradient for each data point can, in principle, go to zero.)
An oft-repeated argument is that the empirical Fisher converges to the true Fisher when the model is a good fit for the data (e.g., Jastrzebski et al., 2018; Zhu et al., 2018). Unfortunately, this is often misunderstood to simply mean “near the minimum”. The above statement has to be carefully formalized and requires additional assumptions, which we detail in the following.
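Assume that the training data consists of iid samples from some data generating distribution $p_{\mathrm{true}}(x, y) = p_{\mathrm{true}}(y \mid x) \, p_{\mathrm{true}}(x)$. If the model is realizable, i.e., there exists a parameter setting $\theta_T$ such that $p_{\theta_T}(y \mid x) = p_{\mathrm{true}}(y \mid x)$, then clearly by an MC sampling argument, as the number of data points $N$ goes to infinity, $\tilde{F}(\theta_T)/N \to F(\theta_T)/N$. If, additionally, the maximum likelihood estimate for $N$ samples, $\theta_N^\star$, is consistent in the sense that $p_{\theta_N^\star}(y \mid x)$ converges to $p_{\mathrm{true}}(y \mid x)$ as $N \to \infty$, then
$$\frac{1}{N} \tilde{F}(\theta_N^\star) \;\xrightarrow{N \to \infty}\; \frac{1}{N} F(\theta_N^\star). \tag{14}$$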
That is, the empirical Fisher converges to the Fisher at the minimum as the number of data points grows. (Both approach the Hessian, as can be seen from the second equality in Eq. 8 and detailed in Appendix C.3.) For the EF to be a useful approximation, we thus need (i) a “correctly-specified” model in the sense of the realizability condition, and (ii) enough data to recover the true parameters.
Even under the assumption that $N$ is sufficiently large, the model needs to be able to realize the true data distribution. This requires that the likelihood is well-specified and that the prediction function captures all relevant information. This is possible in classical statistical modeling of, say, scientific phenomena where the effect of $x$ on $y$ is modeled based on domain knowledge. But it is unlikely to hold when the model is only approximate, as is most often the case in machine learning. Figure 2 shows examples of model misspecification and the effect on the empirical and true Fisher.
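The role of the realizability condition can be illustrated with a small simulation. The setup is an assumption made for illustration and is not one of the paper's experiments: a scalar linear model with unit-variance Gaussian likelihood is fit to data whose true noise standard deviation is either 1 (well-specified) or 2 (misspecified likelihood). The function name and all constants are hypothetical.

```python
import random

def ef_over_fisher_at_mle(noise_std, n=20000, seed=0):
    """Ratio EF / Fisher at the MLE for f(x, theta) = theta * x, model variance fixed to 1."""
    rng = random.Random(seed)
    theta_true = 1.5  # hypothetical data-generating parameter
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [theta_true * x + rng.gauss(0.0, noise_std) for x in xs]
    # Maximum likelihood estimate under the unit-variance Gaussian model (least squares)
    theta_mle = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    # Empirical Fisher: squared per-sample gradients evaluated at the training labels
    ef = sum((y - theta_mle * x) ** 2 * x * x for x, y in zip(xs, ys))
    # Fisher of the unit-variance model: sum of x_n^2, independent of the labels
    fisher = sum(x * x for x in xs)
    return ef / fisher

print("well-specified (noise std 1):", ef_over_fisher_at_mle(1.0))
print("misspecified   (noise std 2):", ef_over_fisher_at_mle(2.0))
```

With many samples the ratio is close to one only in the well-specified case; when the noise level is wrong, the empirical Fisher at the minimum is off by roughly the ratio of the true to the assumed noise variance.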
It is possible to satisfy the realizability condition by using a very flexible prediction function $f$, such as a deep network. However, “enough” data has to be seen relative to the model capacity. The massively overparameterized models typically used in deep learning are able to fit the training data almost perfectly, even when regularized (Zhang et al., 2017). In such settings, the individual gradients, and thus the EF, will be close to zero at a minimum, whereas the Hessian will generally be nonzero.
The relationship discussed in §3.2 only holds close to the minimum. Any similarity between $p_\theta(y \mid x)$ and $p_{\mathrm{true}}(y \mid x)$ is very unlikely when $\theta$ has not been adapted to the data, for example, at the beginning of an optimization procedure. This makes the empirical Fisher a questionable preconditioner.
In fact, the empirical Fisher can cause severe, adverse distortions of the gradient field far from the optimum, as evident even on an elementary linear least-squares problem in Fig. 1. As a consequence, EF-preconditioned gradient descent compares unfavorably to NGD even on simple linear regression and classification tasks, as shown in Fig. 3. The cosine similarity plotted in Fig. 3 shows that the empirical Fisher can be arbitrarily far from the Fisher in that the two preconditioned updates point in almost opposite directions.
One particular issue is the scaling of EF-preconditioned updates. As the empirical Fisher is the sum of “squared” gradients (Eq. 3), multiplying the gradient by the inverse of the EF leads to updates of magnitude almost inversely proportional to that of the gradient, at least far from the optimum. This effect has to be counteracted by adapting the step size, which requires manual tuning and makes the selected step size dependent on the starting point; we explore this aspect further in Appendix D.
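The scaling issue can be seen in a minimal sketch. It assumes a scalar linear least-squares problem with noiseless targets; the data, `theta_star`, and the helper names are hypothetical and chosen only for illustration. The code compares the unit-step-size NGD update with the EF-preconditioned update at increasing distances from the optimum.

```python
xs = [0.5, 1.0, 1.5, 2.0]
theta_star = 2.0                        # hypothetical optimum
ys = [theta_star * x for x in xs]       # noiseless targets

def grad(theta):
    # Gradient of 0.5 * sum_n (theta * x_n - y_n)^2
    return sum((theta * x - y) * x for x, y in zip(xs, ys))

def fisher(theta):
    return sum(x * x for x in xs)

def empirical_fisher(theta):
    return sum(((theta * x - y) * x) ** 2 for x, y in zip(xs, ys))

def ngd_step(theta):
    # Fisher-preconditioned update (step size 1)
    return grad(theta) / fisher(theta)

def ef_step(theta):
    # EF-preconditioned update (step size 1); undefined exactly at the optimum
    return grad(theta) / empirical_fisher(theta)

for dist in (0.1, 1.0, 10.0, 100.0):
    theta = theta_star + dist
    print(f"distance={dist:6.1f}  |grad|={abs(grad(theta)):8.2f}  "
          f"NGD step={ngd_step(theta):8.3f}  EF step={ef_step(theta):8.4f}")
```

In this example the NGD step equals the distance to the optimum, while the EF step shrinks as the gradient grows: the product of gradient magnitude and EF step size stays constant.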
The previous sections have shown that, interpreted as a curvature matrix, the empirical Fisher is a questionable choice at best. Another perspective on the empirical Fisher is that (in contrast to the Fisher) it contains useful information to adapt to the gradient noise in stochastic optimization.
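In stochastic gradient descent (SGD; Robbins and Monro, 1951), we sample $n \in \{1, \dots, N\}$ uniformly at random and use a stochastic gradient $g(\theta) = -N \nabla_\theta \log p_\theta(y_n \mid x_n)$ as an inexpensive but noisy estimate of $\nabla L(\theta)$. The empirical Fisher, as a sum of outer products of individual gradients, coincides with the non-central second moment of this estimate and can be written as
$$N \tilde{F}(\theta) = \Sigma(\theta) + \nabla L(\theta) \, \nabla L(\theta)^\top, \qquad \Sigma(\theta) := \mathrm{cov}[g(\theta)]. \tag{15}$$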
Gradient noise is a major hindrance to SGD and the covariance information encoded in the EF may be used to attenuate its harmful effects, e.g., by scaling back the update in high-noise directions.
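The second-moment identity can be verified directly on a toy problem. The sketch below assumes a two-parameter linear regression with made-up data and an arbitrary parameter value. It takes the stochastic gradient to be $N$ times a uniformly sampled per-sample gradient of the negative log-likelihood, and checks that $N$ times the empirical Fisher equals the gradient covariance plus the outer product of the full gradient.

```python
def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(c, A):
    return [[c * a for a in row] for row in A]

# Hypothetical data and parameters for f(x, theta) = theta[0] + theta[1] * x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 0.5, 2.5, 2.0]
theta = [0.2, 0.4]
N = len(xs)

# Per-sample gradients of the negative log-likelihood 0.5 * (f(x, theta) - y)^2
per_sample = []
for x, y in zip(xs, ys):
    r = theta[0] + theta[1] * x - y
    per_sample.append([r, r * x])

full_grad = [sum(g[i] for g in per_sample) for i in range(2)]

# Empirical Fisher: sum of outer products of per-sample gradients
ef = [[0.0, 0.0], [0.0, 0.0]]
for g in per_sample:
    ef = add(ef, outer(g, g))

# Covariance of the stochastic gradient N * per_sample[n], with n drawn uniformly
stoch = [[N * gi for gi in g] for g in per_sample]
cov = [[0.0, 0.0], [0.0, 0.0]]
for g in stoch:
    d = [g[i] - full_grad[i] for i in range(2)]
    cov = add(cov, scale(1.0 / N, outer(d, d)))

lhs = scale(N, ef)
rhs = add(cov, outer(full_grad, full_grad))
print("N * EF            :", lhs)
print("cov + grad grad^T :", rhs)
```

Both matrices agree entry by entry, so the empirical Fisher carries the covariance of the stochastic gradient in addition to the outer product of the full gradient.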
A small number of works have explored this idea before. Le Roux et al. (2007) showed that the update direction $\Sigma(\theta)^{-1} g(\theta)$ maximizes the probability of decreasing in function value, while Schaul et al. (2013) proposed a diagonal rescaling based on the signal-to-noise ratio of each coordinate, $D_{ii} := [\nabla L(\theta)]_i^2 / \left([\nabla L(\theta)]_i^2 + \Sigma(\theta)_{ii}\right)$. Balles and Hennig (2018) identified these factors as optimal in that they minimize the expected error $\mathbb{E}\left[\|D g(\theta) - \nabla L(\theta)\|_2^2\right]$ for a diagonal matrix $D$.
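For a single coordinate, the optimality of such a signal-to-noise factor can be checked by brute force. The sketch assumes a stochastic gradient coordinate that takes one of a few hypothetical values with equal probability, with mean `mean` playing the role of the full gradient coordinate and variance `var`. It compares the closed-form factor `mean**2 / (mean**2 + var)` with the minimizer of the expected squared error over a grid.

```python
def expected_error(d, grads, mean):
    # E[(d * g - mean)^2] when g is drawn uniformly from grads
    return sum((d * g - mean) ** 2 for g in grads) / len(grads)

grads = [0.5, 1.5, -0.2, 2.2]           # hypothetical stochastic gradient values
mean = sum(grads) / len(grads)           # stands in for the full-gradient coordinate
var = sum((g - mean) ** 2 for g in grads) / len(grads)

# Closed-form signal-to-noise factor
d_star = mean ** 2 / (mean ** 2 + var)

# Brute-force minimizer on a grid with spacing 0.001
candidates = [i / 1000 for i in range(0, 1501)]
d_best = min(candidates, key=lambda d: expected_error(d, grads, mean))

print("closed form:", d_star)
print("grid search:", d_best)
```

The factor is below one whenever the variance is nonzero, which shrinks the update in noisy coordinates.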
A straightforward extension of this argument to full matrices yields the variance adaptation matrix
(16) |
In that sense, preconditioning with the empirical Fisher can be understood as an adaptation to gradient noise instead of an adaptation to curvature. (Note that Eq. (16) involves a multiplication with which will counteract the poor scaling discussed in §3.3.)
This perspective on the empirical Fisher is currently not well studied. Of course, there are obvious difficulties ahead: Computing the matrix in Eq. (16) requires the evaluation of all gradients, which defeats its purpose. It is not obvious how to obtain meaningful estimates of this matrix from, say, a mini-batch of gradients that would provably attenuate the effects of gradient noise. Nevertheless, we believe that variance adaptation is a possible explanation for the practical success of existing methods using the EF, and an interesting avenue for future research. To put it bluntly: It may just be that the name “empirical Fisher” is a fateful historical misnomer, and the quantity should instead just be described as the gradient’s non-central second moment.
As a final comment, it is worth pointing out that some works actually precondition with the square-root of the EF, the prime example being Adam. While this avoids the “inverse gradient” scaling discussed in §3.3, it further widens the conceptual gap between those methods and natural gradient. In fact, such a preconditioning effectively cancels out the gradient magnitude, which has recently been examined more closely as “sign gradient descent” (Balles and Hennig, 2018; Bernstein et al., 2018).
We offered a critical discussion of the empirical Fisher approximation, summarized as follows:
While the EF follows the formal definition of a generalized Gauss-Newton matrix, the underlying split does not retain useful second-order information. We proposed a clarifying amendment to the definition of the GGN.
A clear relationship between the empirical Fisher and the Fisher only exists at a minimum under strong additional assumptions: (i) a correct model and (ii) enough data relative to model capacity. These conditions are unlikely to be met in practice, especially when using overparametrized general function approximators and settling for approximate minima.
Far from an optimum, EF preconditioning leads to update magnitudes which are inversely proportional to that of the gradient, complicating step size tuning and often leading to poor performance even for linear models.
As a possible alternative explanation of the practical success of EF preconditioning, and an interesting avenue for future research, we have pointed to the concept of variance adaptation.
Hence, the existing arguments do not justify the empirical Fisher as a reasonable approximation to the Fisher or the Hessian. Of course, this does not rule out the existence of certain model classes for which the EF might give reasonable approximations. However, as long as we have not clearly identified and understood these cases, the true Fisher is the “safer” choice as a curvature matrix and should be preferred in virtually all cases.
Contrary to conventional wisdom, the Fisher is not inherently harder to compute than the EF. As shown by Martens and Grosse (2015), an unbiased estimate of the true Fisher can be obtained at the same computational cost as the empirical Fisher by replacing the expectation in Eq. (2) with a single sample $\tilde{y}_n$ from the model’s predictive distribution $p_\theta(y \mid x_n)$. Even exact computation of the Fisher is feasible in many cases. We discuss computational aspects further in Appendix B. The apparent reluctance to compute the Fisher might have more to do with the current lack of convenient implementations in deep learning libraries. We believe that it is misguided, and potentially dangerous, to accept the poor theoretical grounding of the EF approximation purely for implementational convenience.
We would like to thank Matthias Bauer, Felix Dangel, Filip de Roos, Diego Fioravanti, Si Kai Lee, and Frank Schneider for their helpful comments on the manuscript. We also would like to acknowledge the constructive feedback of the anonymous reviewers on an earlier version of this paper. Frederik Kunstner would like to thank Emtiyaz Khan, Aaron Mishkin, and Didrik Nielsen for many insightful conversations. Lukas Balles kindly acknowledges the support of the International Max Planck Research School for Intelligent Systems (IMPRS-IS).
Heskes (2000). On “natural” learning and pruning in multilayered perceptrons. Neural Computation, 12(4):881–901.
Sun et al. (2009). In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pages 539–546.
Wang (2010). Fisher scoring: An interpolation family and its Monte Carlo implementations. Computational Statistics & Data Analysis, 54(7):1744–1755.
Appendix A provides additional exposition on the natural gradient and the generalized Gauss-Newton: its relation to distances in probability distribution space (§A.1), a note regarding the treatment of the distribution over inputs (§A.2), the expression of the Fisher for common loss functions (§A.3), and the view of the generalized Gauss-Newton as a linearization of the model (§A.4).
Appendix B discusses the challenges of computing the empirical Fisher and the Fisher, and possible options.
Appendix C provides proofs of the propositions and statements skipped in the main paper: the relation between the expected Hessian and the expected outer product of gradients (Eq. 8), the equivalence between the Fisher and the generalized Gauss-Newton (Prop. 1), and the bound on the difference between the generalized Gauss-Newton and the Hessian (Prop. 2).
Appendix D shows the experiments on different datasets.
Appendix E gives the necessary details to reproduce our experiments.
We give an expanded version of the introduction to natural gradient descent provided in Section 2.2.
Gradient descent minimizes some objective function $L(\theta)$ by greedily updating $\theta$ in the “direction of steepest descent”. But what, precisely, is meant by the direction of steepest descent? Consider the following definition,
$$\lim_{\epsilon \to 0} \frac{1}{\epsilon} \left( \arg\min_{\delta} L(\theta + \delta) \right) \quad \text{s.t.} \quad d(\theta, \theta + \delta) \le \epsilon, \tag{17}$$
where $d(\cdot, \cdot)$ is some distance function. This definition says that we are looking for the update step $\delta$ which minimizes $L(\theta + \delta)$ within an $\epsilon$-ball around the current $\theta$, and subsequently let the radius $\epsilon$ go to zero (to make $\delta$ finite, we have to divide by $\epsilon$). This definition makes clear that the direction of steepest descent is intrinsically tied to the geometry which we impose on the parameter space by the definition of the distance function. If we choose $d(\theta, \theta') = \|\theta - \theta'\|_2$, the Euclidean distance, Eq. (17) reduces to the (normalized) negative gradient.
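Now, assume that $\theta$ parameterizes a statistical model $p_\theta(z)$. The parameter vector $\theta$ is not the main quantity of interest; the distance between $\theta$ and $\theta'$ would be better measured in terms of distance between the distributions $p_\theta(z)$ and $p_{\theta'}(z)$. A canonical distance function for probability distributions is the Kullback–Leibler (KL) divergence. If we choose $d(\theta, \theta') = D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'})$, the steepest descent direction becomes the natural gradient, $F(\theta)^{-1} \nabla L(\theta)$, where
$$F(\theta) = \mathbb{E}_{p_\theta(z)}\left[\nabla_\theta \log p_\theta(z) \, \nabla_\theta \log p_\theta(z)^\top\right] \tag{18}$$
is the Fisher information matrix of the statistical model, which arises in this context as the Hessian of the KL divergence
$$F(\theta) = \nabla_{\theta'}^2 D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) \big|_{\theta' = \theta}. \tag{19}$$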
When going from Eq. (18) (Eq. 8 in the main text) to the expression for the Fisher of a conditional probability distribution $p_\theta(y \mid x)$ with samples $x_1, \dots, x_N$ (Eq. 2 in the main text),
$$F(\theta) = \sum_n \mathbb{E}_{p_\theta(y \mid x_n)}\left[\nabla_\theta \log p_\theta(y \mid x_n) \, \nabla_\theta \log p_\theta(y \mid x_n)^\top\right], \tag{20}$$
we considered our statistical model to be only on the conditional distribution and restricted to the empirical distribution, i.e., a mixture of Dirac distributions centered at the training points $x_n$. An alternative is to consider a statistical model for the joint distribution $p_\theta(x, y)$ given by
$$p_\theta(x, y) = p(x) \, p_\theta(y \mid x), \tag{21}$$
where $p(x)$ is the true distribution over $x$. In this case the Fisher would be
$$F(\theta) = \mathbb{E}_{p(x)}\left[\mathbb{E}_{p_\theta(y \mid x)}\left[\nabla_\theta \log p_\theta(y \mid x) \, \nabla_\theta \log p_\theta(y \mid x)^\top\right]\right]. \tag{22}$$
However, since $p(x)$ is generally unknown (and the expectation over it intractable), this is impractical and is usually approximated using the empirical distribution over the inputs. In the statistics literature, this quantity is occasionally referred to as the empirical Fisher, due to the approximation of $p(x)$ by the empirical distribution, but this is not what is referred to as the empirical Fisher in the main text and in the literature we cite. Going from Eq. (22) to Eq. (20) replaces the expectation over $x$ with samples from that distribution. In contrast to that, going from Eq. (2) to Eq. (3) replaces the expectation over $y$ with a single sample from a different distribution.
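For a probabilistic conditional model of the form $p(y \mid f(x, \theta))$ where $p$ is an exponential family distribution, the equivalence between the Fisher and the generalized Gauss-Newton leads to a straightforward way to compute the Fisher without expectations, as
$$F(\theta) = \sum_n J_n^\top H_n J_n, \tag{23}$$
where $J_n = J_\theta f(x_n, \theta)$ and $H_n = -\nabla_f^2 \log p(y_n \mid f(x_n, \theta))$, and $H_n$ often has an exploitable structure.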
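The squared-loss used in regression, $\frac{1}{2} \sum_n (y_n - f(x_n, \theta))^2$, can be cast in a probabilistic setting with a Gaussian distribution with unit variance, $p(y_n \mid f(x_n, \theta)) = \mathcal{N}(y_n; f(x_n, \theta), 1)$. The Hessian of the negative log-likelihood w.r.t. $f$ is then simply given by
$$\nabla_f^2 \left(-\log p(y_n \mid f)\right) = \nabla_f^2 \left[\tfrac{1}{2}(y_n - f)^2\right] = 1. \tag{24}$$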
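The cross-entropy loss used in $C$-class classification can be cast as an exponential family distribution by using the softmax function on the mapping $f(x_n, \theta)$,
$$p(y_n = c \mid f) = [\mathrm{softmax}(f)]_c = \frac{e^{f_c}}{\sum_i e^{f_i}}.$$
Taking the Hessian of the negative log-likelihood w.r.t. $f$, the term that is linear in $f$ drops out and we can simply write the Hessian as
$$\nabla_f^2 \left(-\log p(y = c \mid f)\right) = \nabla_f^2 \left[-f_c + \log \sum_i e^{f_i}\right] = \nabla_f^2 \log \sum_i e^{f_i}.$$
Checking the partial derivatives individually, we get that
$$\frac{\partial^2}{\partial f_i^2} \log \sum_k e^{f_k} = p_i (1 - p_i) \qquad \text{and} \qquad \frac{\partial^2}{\partial f_i \, \partial f_j} \log \sum_k e^{f_k} = -p_i p_j \quad (i \neq j),$$
where $p_i = e^{f_i} / \sum_k e^{f_k}$. Or, using $p$ as the vector of predicted probabilities,
$$\nabla_f^2 \left(-\log p(y \mid f)\right) = \mathrm{diag}(p) - p p^\top. \tag{25}$$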
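The closed-form Hessian of the softmax cross-entropy, $\mathrm{diag}(p) - p p^\top$, can be checked against finite differences. The sketch below is a minimal verification using only the standard library; the logits `f_example`, the step size `h`, and the function names are arbitrary choices for illustration.

```python
import math

def softmax(f):
    m = max(f)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in f]
    s = sum(e)
    return [v / s for v in e]

def nll(f, c):
    # Negative log-likelihood of class c under softmax(f)
    return -math.log(softmax(f)[c])

def hessian_closed_form(f):
    # diag(p) - p p^T, which does not depend on the class label
    p = softmax(f)
    C = len(f)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(C)] for i in range(C)]

def hessian_fd(f, c, h=1e-4):
    # Central finite differences for the mixed second derivatives of nll(., c)
    C = len(f)
    H = [[0.0] * C for _ in range(C)]
    for i in range(C):
        for j in range(C):
            def shifted(si, sj):
                g = list(f)
                g[i] += si * h
                g[j] += sj * h
                return nll(g, c)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H

f_example = [0.3, -1.2, 2.0]
exact = hessian_closed_form(f_example)
for c in range(len(f_example)):
    approx = hessian_fd(f_example, c)
    err = max(abs(approx[i][j] - exact[i][j]) for i in range(3) for j in range(3))
    print(f"label c={c}: max abs deviation from diag(p) - p p^T = {err:.2e}")
```

The deviation is at the level of finite-difference error for every label, which reflects that the Hessian w.r.t. the natural parameters does not depend on $y$.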
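In §2.1, we mentioned that the generalized Gauss-Newton with a split $L(\theta) = \sum_n a_n(b_n(\theta))$ can be interpreted as an approximation of $L$ where the second-order information of $a_n$ is kept but the second-order information of $b_n$ is ignored as $b_n$ is approximated by a linear function. To see this connection, note that if $b_n$ is a linear function, then the Hessian and the GGN are equal as the Hessian of $b_n$ w.r.t. $\theta$ is zero,
$$\nabla^2 L(\theta) = \underbrace{\sum_n [J_\theta b_n(\theta)]^\top \, \nabla_b^2 a_n(b_n(\theta)) \, [J_\theta b_n(\theta)]}_{G(\theta)} + \sum_{n,m} [\nabla_b a_n(b_n(\theta))]_m \underbrace{\nabla_\theta^2 b_n^{(m)}(\theta)}_{= 0}. \tag{26}$$
Let us write $\bar{b}_n(\theta, \theta') := b_n(\theta) + J_\theta b_n(\theta)(\theta' - \theta)$ for the first-order Taylor approximation of $b_n$ around $\theta$, which is now a function of $\theta'$. We now approximate $L(\theta')$, in the vicinity of $\theta$, by replacing $b_n$ by its linear approximation $\bar{b}_n$. The generalized Gauss-Newton is the Hessian of this approximation, evaluated at $\theta' = \theta$,
$$G(\theta) = \nabla_{\theta'}^2 \sum_n a_n(\bar{b}_n(\theta, \theta')) \Big|_{\theta' = \theta}. \tag{27}$$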
The empirical Fisher approximation is often motivated as an easier-to-compute alternative to the Fisher. While there is some merit to this argument, we argued in the main text that the empirical Fisher computes a wrong quantity. However, it is possible to compute an actual approximation to the Fisher at the same computational complexity and using a very similar implementation: just sample one output $\tilde{y}_n$ from the model distribution $p_\theta(y \mid x_n)$ for each input $x_n$ and compute the outer product of the gradients
$$\sum_n \nabla_\theta \log p_\theta(\tilde{y}_n \mid x_n) \, \nabla_\theta \log p_\theta(\tilde{y}_n \mid x_n)^\top, \qquad \tilde{y}_n \sim p_\theta(y \mid x_n). \tag{28}$$
While noisy, this one-sample Monte Carlo estimate is unbiased and will not suffer from the issues mentioned in the main text. This is the approach used by Martens and Grosse [2015] as well as Zhang et al. [2018].
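The difference between the one-sample Monte Carlo estimate and the empirical Fisher can be seen in a small simulation. The setup is an assumption for illustration only: scalar logistic regression with logit $\theta x$, made-up inputs, and hypothetical labels `ys` that disagree with the current model. The one-sample estimates are averaged over many repetitions solely to exhibit unbiasedness.

```python
import math
import random

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

xs = [-2.0, -1.0, 0.5, 1.5]
ys = [1, 1, 0, 0]          # hypothetical training labels that disagree with the model
theta = 1.0
probs = [sigmoid(theta * x) for x in xs]

def sum_sq_grads(labels):
    # d/dtheta log p(y | theta * x) = (y - sigmoid(theta * x)) * x
    return sum(((y - p) * x) ** 2 for y, p, x in zip(labels, probs, xs))

# Exact Fisher: expectation over the model's predictive distribution
fisher = sum(p * (1.0 - p) * x * x for p, x in zip(probs, xs))

# Empirical Fisher: gradients evaluated at the training labels
empirical_fisher = sum_sq_grads(ys)

# One-sample Monte Carlo estimate, averaged over many independent repetitions
rng = random.Random(0)
repeats = 20000
mc_mean = 0.0
for _ in range(repeats):
    sampled = [1 if rng.random() < p else 0 for p in probs]
    mc_mean += sum_sq_grads(sampled) / repeats

print("Fisher:                ", fisher)
print("mean of MC estimates:  ", mc_mean)
print("empirical Fisher:      ", empirical_fisher)
```

Each individual estimate costs the same as the empirical Fisher, since only the labels fed into the gradient computation change.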
As a side note, some implementations use a biased estimate in which, instead of sampling $\tilde{y}_n$ from $p(y \mid f(x_n, \theta))$, they compute the most likely output $\hat{y}_n = \arg\max_y p(y \mid f(x_n, \theta))$. This scheme could be beneficial in some circumstances, as it reduces variance, but it can also backfire by increasing the bias of the estimation. For the least-squares loss, as $p(y \mid f(x_n, \theta))$ is a Gaussian distribution centered at $f(x_n, \theta)$, the most likely output is $f(x_n, \theta)$ and the gradient is zero, which defeats the purpose of sampling.
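The one-sample estimate and the side note above can be illustrated on a linear regression problem, where the Fisher of the unit-variance Gaussian model is $\sum_n x_n x_n^\top$. This is a minimal NumPy sketch with hypothetical synthetic data; averaging many one-sample estimates is done only to illustrate unbiasedness and is not how the estimate would be used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))                    # hypothetical inputs
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(size=N)        # hypothetical targets
theta = np.zeros(D)                            # current iterate, far from the optimum

def per_sample_grads(targets):
    """Rows are gradients of log N(target_n; x_n^T theta, 1) w.r.t. theta."""
    return (targets - X @ theta)[:, None] * X

# Exact Fisher of the unit-variance Gaussian model.
fisher = X.T @ X

# Empirical Fisher: gradients evaluated at the observed labels.
g = per_sample_grads(y)
empirical_fisher = g.T @ g

def mc_fisher():
    """One-sample Monte Carlo estimate: one label per input, drawn from the model."""
    y_tilde = X @ theta + rng.normal(size=N)
    g_tilde = per_sample_grads(y_tilde)
    return g_tilde.T @ g_tilde

# Average of many independent one-sample estimates (illustration only).
mc_mean = np.mean([mc_fisher() for _ in range(5000)], axis=0)

rel_err_mc = np.linalg.norm(mc_mean - fisher) / np.linalg.norm(fisher)
rel_err_ef = np.linalg.norm(empirical_fisher - fisher) / np.linalg.norm(fisher)

# Using the most likely output (the predicted mean) gives zero gradients.
g_mode = per_sample_grads(X @ theta)
```

In this setup the averaged Monte Carlo estimate is close to the Fisher, whereas the empirical Fisher is scaled by the squared residuals and is far from it when the iterate is far from the optimum.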
For high-quality estimates, however, sampling additional outputs and averaging the results is inefficient. If $M$ MC samples $\tilde{y}_1, \ldots, \tilde{y}_M$ per input $x_n$ are used to compute the gradients $\nabla_\theta \log p(\tilde{y}_m \mid f(x_n, \theta))$, most of the computation is repeated. The gradient is
$$\nabla_\theta \log p(\tilde{y}_m \mid f(x_n, \theta)) = [J_\theta f(x_n, \theta)]^\top \, \nabla_f \log p(\tilde{y}_m \mid f(x_n, \theta)), \tag{29}$$
where the Jacobian of the model output, $J_\theta f(x_n, \theta)$, does not depend on $\tilde{y}_m$. As the Jacobian of the model is much more expensive to compute than the gradient of the log-likelihood w.r.t. the model output, this approach repeats the most difficult computation, especially when the model is a neural network. The expectation can instead be computed in closed form using the generalized Gauss-Newton equation (Eq. 23, or Eq. 6 in the main text), which requires the computation of the Jacobian only once per sample $(x_n, y_n)$.
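To make the closed-form computation concrete, the sketch below compares it with an explicit expectation over all outputs for a softmax classifier. It assumes a hypothetical linear model $f(x, \theta) = Wx$ with $\theta$ the row-major flattening of $W$, so that the Jacobian is available analytically; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 20, 3, 4
X = rng.normal(size=(N, D))      # hypothetical inputs
W = rng.normal(size=(C, D))      # parameters, theta = W.reshape(-1)

def softmax(f):
    """Numerically stable softmax of a 1-D logit vector."""
    z = np.exp(f - np.max(f))
    return z / z.sum()

def jacobian(x):
    """Jacobian of f(x, theta) = W x w.r.t. theta (row-major), shape (C, C*D)."""
    return np.kron(np.eye(C), x[None, :])

fisher_closed = np.zeros((C * D, C * D))
fisher_enum = np.zeros((C * D, C * D))
for x in X:
    J = jacobian(x)              # computed once per input
    pi = softmax(W @ x)
    # Closed form: J^T (diag(pi) - pi pi^T) J.
    fisher_closed += J.T @ (np.diag(pi) - np.outer(pi, pi)) @ J
    # Reference: explicit expectation over all C possible outputs.
    for c in range(C):
        grad = J.T @ (np.eye(C)[c] - pi)   # gradient of log p(y = c | f) w.r.t. theta
        fisher_enum += pi[c] * np.outer(grad, grad)

max_diff = np.max(np.abs(fisher_closed - fisher_enum))
```

The two matrices agree up to floating-point error, and the closed form needs neither sampling nor a loop over outputs once the Jacobian is available.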
The main issue with this approach is that computing Jacobians is currently not well supported by deep learning auto-differentiation libraries, such as TensorFlow or PyTorch. However, the current implementations relying on the empirical Fisher also suffer from this lack of support, as they need access to the individual gradients to compute their outer product. Access to the individual gradients is equivalent to computing the Jacobian of the vector $[-\log p(y_1 \mid f(x_1, \theta)), \ldots, -\log p(y_N \mid f(x_N, \theta))]^\top$. The ability to efficiently compute Jacobians and/or individual gradients in parallel would drastically improve the practical performance of methods based on the Fisher and empirical Fisher, as most of the computation of the backward pass can be shared between samples.

In §2.2 Eq. (8), we mentioned that the two following representations of the Fisher are equivalent:
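$$F(\theta) = \sum_n \mathbb{E}_{p_\theta(y \mid x_n)}\!\left[\nabla_\theta \log p_\theta(y \mid x_n) \, \nabla_\theta \log p_\theta(y \mid x_n)^\top\right] = \sum_n \mathbb{E}_{p_\theta(y \mid x_n)}\!\left[-\nabla_\theta^2 \log p_\theta(y \mid x_n)\right]. \tag{30}$$
To see why, apply the chain rule on the $\log$ to split the equation in terms of the Hessian and the outer product of the gradients of $p_\theta$,
$$\mathbb{E}_{p_\theta(y \mid x)}\!\left[-\nabla_\theta^2 \log p_\theta(y \mid x)\right] = \mathbb{E}_{p_\theta(y \mid x)}\!\left[-\frac{1}{p_\theta(y \mid x)} \nabla_\theta^2 p_\theta(y \mid x)\right] + \mathbb{E}_{p_\theta(y \mid x)}\!\left[\frac{1}{p_\theta(y \mid x)^2} \nabla_\theta p_\theta(y \mid x) \, \nabla_\theta p_\theta(y \mid x)^\top\right]. \tag{31}$$
The first term on the right-hand side is zero, since
$$\mathbb{E}_{p_\theta(y \mid x)}\!\left[-\frac{1}{p_\theta(y \mid x)} \nabla_\theta^2 p_\theta(y \mid x)\right] = -\int \nabla_\theta^2 p_\theta(y \mid x) \, dy = -\nabla_\theta^2 \int p_\theta(y \mid x) \, dy = -\nabla_\theta^2 [1] = 0. \tag{32}$$
The second term of Eq. (31) is the expected outer product of the gradients, as $\nabla_\theta \log p_\theta(y \mid x) = \frac{1}{p_\theta(y \mid x)} \nabla_\theta p_\theta(y \mid x)$,
$$\mathbb{E}_{p_\theta(y \mid x)}\!\left[\frac{1}{p_\theta(y \mid x)^2} \nabla_\theta p_\theta(y \mid x) \, \nabla_\theta p_\theta(y \mid x)^\top\right] = \mathbb{E}_{p_\theta(y \mid x)}\!\left[\nabla_\theta \log p_\theta(y \mid x) \, \nabla_\theta \log p_\theta(y \mid x)^\top\right]. \tag{33}$$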
The same technique also shows that if the empirical distribution over the data is equal to the model distribution $p_\theta(y \mid x)$, then the Fisher, the empirical Fisher and the Hessian are all equal.
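The equivalence between the expected outer product of gradients and the expected negative Hessian can be verified numerically on a model whose Hessian depends on $y$. The sketch below assumes a hypothetical one-dimensional Gaussian $\mathcal{N}(y; \mu, \sigma^2)$ with parameters $(\mu, s)$ and $\sigma = e^s$, and computes the expectations with Gauss-Hermite quadrature, which is exact for the low-degree polynomials involved.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

mu, s = 0.3, -0.5                 # hypothetical parameters theta = (mu, s)
sigma = np.exp(s)

def score(y):
    """Gradient of log N(y; mu, sigma^2) w.r.t. (mu, s), with sigma = exp(s)."""
    z = (y - mu) / sigma
    return np.array([z / sigma, z**2 - 1.0])

def hessian(y):
    """Hessian of log N(y; mu, sigma^2) w.r.t. (mu, s)."""
    z = (y - mu) / sigma
    return np.array([[-1.0 / sigma**2, -2.0 * z / sigma],
                     [-2.0 * z / sigma, -2.0 * z**2]])

# Quadrature nodes and weights for expectations under a standard normal.
nodes, weights = hermegauss(20)
weights = weights / weights.sum()
ys = mu + sigma * nodes           # nodes mapped to y ~ N(mu, sigma^2)

outer_form = sum(w * np.outer(score(y), score(y)) for w, y in zip(weights, ys))
hessian_form = sum(-w * hessian(y) for w, y in zip(weights, ys))
max_diff = np.max(np.abs(outer_form - hessian_form))
```

Both forms evaluate to $\mathrm{diag}(1/\sigma^2, 2)$ for this parameterization, even though the Hessian at any single $y$ differs from the outer product of the score.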
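In §2.3, Prop. 1, we mentioned that the Fisher and the generalized Gauss-Newton are equivalent for the problems considered in the introduction:
Proposition 1 (Martens [2014], §9.2). If $p(y \mid f)$ is an exponential family distribution with natural parameters $f$, then the Fisher information matrix coincides with the GGN of Eq. (1) using the split $a_n(b) = -\log p(y_n \mid b)$, $b_n(\theta) = f(x_n, \theta)$,
and reads $F(\theta) = G(\theta) = -\sum_n [J_\theta f(x_n, \theta)]^\top \, \nabla_f^2 \log p(y_n \mid f(x_n, \theta)) \, [J_\theta f(x_n, \theta)]$.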
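Plugging the split into the definition of the GGN (Eq. 6) yields $G(\theta) = -\sum_n [J_\theta f(x_n, \theta)]^\top \, \nabla_f^2 \log p(y_n \mid f(x_n, \theta)) \, [J_\theta f(x_n, \theta)]$, so we only need to show that the Fisher coincides with this GGN. By the chain rule, we have
$$\nabla_\theta \log p(y \mid f(x_n, \theta)) = [J_\theta f(x_n, \theta)]^\top \, \nabla_f \log p(y \mid f(x_n, \theta)), \tag{34}$$
and we can then apply the following steps, writing $f_n = f(x_n, \theta)$ and $J_n = J_\theta f(x_n, \theta)$ for brevity.
$$F(\theta) = \sum_n \mathbb{E}_{y \sim p(y \mid f_n)}\!\left[J_n^\top \, \nabla_f \log p(y \mid f_n) \, \nabla_f \log p(y \mid f_n)^\top J_n\right] \tag{35}$$
$$\phantom{F(\theta)} = \sum_n J_n^\top \, \mathbb{E}_{y \sim p(y \mid f_n)}\!\left[\nabla_f \log p(y \mid f_n) \, \nabla_f \log p(y \mid f_n)^\top\right] J_n \tag{36}$$
$$\phantom{F(\theta)} = \sum_n J_n^\top \, \mathbb{E}_{y \sim p(y \mid f_n)}\!\left[-\nabla_f^2 \log p(y \mid f_n)\right] J_n \tag{37}$$
Eq. (35) rewrites the Fisher using the chain rule, Eq. (36) takes the Jacobians out of the expectation as they do not depend on $y$, and Eq. (37) is due to the equivalence between the expected outer product of gradients and the expected Hessian shown in the last section.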
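If $p(y \mid f)$ is an exponential family distribution with natural parameters (a linear combination of) $f$, its log density has the form $\log p(y \mid f) = f^\top T(y) - A(f) + \log h(y)$ (where $T$ are the sufficient statistics, $A$ is the cumulant function and $h$ is the base measure). The Hessian w.r.t. $f$ is simply $\nabla_f^2 \log p(y \mid f) = -\nabla_f^2 A(f)$ and is independent of $y$, and hence, with $f_n = f(x_n, \theta)$ and $J_n = J_\theta f(x_n, \theta)$,
$$F(\theta) = \sum_n J_n^\top \, \mathbb{E}_{y \sim p(y \mid f_n)}\!\left[-\nabla_f^2 \log p(y \mid f_n)\right] J_n = \sum_n J_n^\top \left[-\nabla_f^2 \log p(y_n \mid f_n)\right] J_n = G(\theta). \tag{38}$$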
In §2.3, Prop. 2, we show that the difference between the Fisher (or the GGN) and the Hessian can be bounded by the residuals and the smoothness constant of the model $f$:
Proposition 2. Let be defined as in Eq. (1) with . Denote by the -th component function of and assume each is -smooth. Let be the GGN (Eq. 6). Then,
(39) where and denotes the spectral norm.
Dropping from the notation for brevity, the Hessian can be expressed as
where | (40) |
is the derivative of w.r.t. the -th component of , evaluated at .
If all are -smooth, we have and, consequently,
(41) |
Pulling the absolute value inside the double sum gives the upper bound
(42) |
and the statement about the spectral norm (the largest singular value of the matrix) follows.
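A bound of this kind can be checked numerically on a small example. The sketch below assumes a hypothetical scalar-output model $\tanh(x_n^\top \theta)$ with a squared loss, for which the residual is the derivative of the loss w.r.t. the model output and the smoothness constant follows from $|\tanh''| \le 4/(3\sqrt{3})$. The symbols `beta` and `r` are illustrative names for the smoothness constant and the summed absolute residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 30, 3
X = rng.normal(size=(N, D))       # hypothetical inputs
y = rng.normal(size=N)            # hypothetical targets
theta = rng.normal(size=D)

t = np.tanh(X @ theta)            # model outputs
residuals = t - y                 # derivative of 0.5 * (y - f)^2 w.r.t. f
J = (1.0 - t**2)[:, None] * X     # Jacobian of the outputs w.r.t. theta
ggn = J.T @ J

# Exact Hessian of the loss: GGN plus residual-weighted Hessians of the outputs.
second_deriv = -2.0 * t * (1.0 - t**2)    # second derivative of tanh at x_n^T theta
hessian = ggn + (X * (residuals * second_deriv)[:, None]).T @ X

# Smoothness constant of the outputs: sup |tanh''| times the largest ||x_n||^2.
beta = 4.0 / (3.0 * np.sqrt(3.0)) * np.max(np.sum(X**2, axis=1))
r = np.sum(np.abs(residuals))

spectral_gap = np.linalg.norm(hessian - ggn, ord=2)
bound = beta * r
```

In this example the spectral norm of the difference is well below `bound`, and both shrink as the residuals go to zero.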
Fig. 6 repeats the experiment described in Fig. 2 (§3.2), on the effect of model misspecification on the Fisher and empirical Fisher at the minimum, on linear regression problems instead of a classification problem. Similar issues in scaling and directions can be observed.
Fig. 6 repeats the experiment described in Fig. 3 (§3.3) on additional linear regression problems. Those additional examples show that the poor performance of empirical Fisher-preconditioned updates compared to NGD is not isolated to the examples shown in the main text.
Fig. 6 shows the linear regression problem on the Boston dataset, originally shown in Fig. 3, where each line is a different starting point, using the same hyperparameters as in Fig. 3. The starting points are selected from , where is the optimum. When the optimization starts close to the minimum (low loss), the empirical Fisher is a good approximation to the Fisher and there are very few differences with NGD. However, when the optimization starts far from the minimum (high loss), the individual gradients, and thus the sum of the outer products of the gradients, are large, which leads to very small steps, regardless of curvature, and slow convergence. While this could be counteracted with a larger step size in the beginning, this large step size would not work close to the minimum and would lead to oscillations. The selection of the step size therefore depends on the starting point, and would ideally be on a decreasing schedule.
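The dependence of the update size on the starting point can be reproduced on synthetic data. The sketch below uses a hypothetical linear regression problem (not the Boston dataset) and compares the norm of the NGD update with that of the empirical-Fisher-preconditioned update at two distances from the optimum, with a unit step size and no damping as simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                                    # hypothetical inputs
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=N)  # hypothetical targets
theta_star = np.linalg.solve(X.T @ X, X.T @ y)                 # least-squares optimum
direction = np.ones(D) / np.sqrt(D)

def update_norms(theta):
    """Norms of the NGD and empirical-Fisher-preconditioned updates at theta."""
    res = X @ theta - y                   # residuals
    grad = X.T @ res                      # gradient of the squared loss
    fisher = X.T @ X                      # Fisher of the unit-variance Gaussian model
    g = res[:, None] * X                  # per-sample gradients
    emp_fisher = g.T @ g                  # sum of outer products of the gradients
    ngd_step = np.linalg.solve(fisher, grad)
    ef_step = np.linalg.solve(emp_fisher, grad)
    return np.linalg.norm(ngd_step), np.linalg.norm(ef_step)

ngd_near, ef_near = update_norms(theta_star + 1.0 * direction)
ngd_far, ef_far = update_norms(theta_star + 100.0 * direction)
```

For this problem the NGD update equals the offset from the optimum, so its norm grows with the distance, while the empirical Fisher update shrinks as the starting point moves away, since the preconditioner scales with the squared residuals and the gradient only with the residuals.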