# Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis

Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists of tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.


## 1 Introduction

Deep networks have exhibited state-of-the-art performance in many application areas, including image recognition (He et al., 2016) and machine translation (Gehring et al., 2017). However, top-performing systems often require days of training time and a large amount of computational power, so there is a need for efficient training methods.

Stochastic Gradient Descent (SGD) and its variants are the current workhorse for training neural networks. Training consists in optimizing the network parameters $\theta$ (of size $n_\theta$) to minimize a regularized empirical risk $R(\theta)$ through gradient descent. The negative loss gradient is approximated based on a small subset of training examples (a mini-batch). The loss functions of neural networks are highly non-convex functions of the parameters, and the loss surface is known to have highly imbalanced curvature, which limits the efficiency of first-order optimization methods such as SGD.

Methods that employ second-order information have the potential to speed up first-order gradient descent by correcting for imbalanced curvature. The parameters are then updated as $\theta \leftarrow \theta - \eta G^{-1} \nabla_\theta$, where $\eta$ is a positive learning rate and $G$ is a preconditioning matrix capturing the local curvature or related information, such as the Hessian matrix in Newton's method or the Fisher Information Matrix in natural gradient descent (Amari, 1998). Matrix $G$ has a gigantic size $n_\theta \times n_\theta$, which makes it too large to compute and invert in the context of modern deep neural networks with millions of parameters. For practical applications, it is necessary to trade off quality of curvature information for efficiency.

A large family of algorithms used for optimizing neural networks can be viewed as approximating the diagonal of a large preconditioning matrix. Diagonal approximations of the Hessian (Becker et al., 1988) have proven efficient, and algorithms that use the diagonal of the covariance matrix of the gradients are widely used among neural network practitioners, such as Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). We refer the reader to Bottou et al. (2016) for an informative review of optimization methods for deep networks, including diagonal rescalings, and connections with the Batch Normalization (BN) technique (Ioffe & Szegedy, 2015).

More elaborate algorithms do not restrict themselves to diagonal approximations, but instead aim at accounting for some correlations between different parameters (as encoded by non-diagonal elements of the preconditioning matrix). These methods range from Ollivier (2015), who introduces a rank-1 update that accounts for the cross-correlations between the biases and the weight matrices, to quasi-Newton methods (Liu & Nocedal, 1989) that build a running estimate of the exact non-diagonal preconditioning matrix, and also include block-diagonal approaches with blocks corresponding to entire layers (Heskes, 2000; Desjardins et al., 2015; Martens & Grosse, 2015; Fujimoto & Ohira, 2018). Factored approximations such as KFAC (Martens & Grosse, 2015; Ba et al., 2017) approximate each block as a Kronecker product of two much smaller matrices, both of which can be estimated and inverted more efficiently than the full block matrix, since the inverse of a Kronecker product of two matrices is the Kronecker product of their inverses.

In the present work, we draw inspiration from both diagonal and factored approximations. We introduce an Eigenvalue-corrected Kronecker Factorization (EKFAC) that consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis. We show that EKFAC is a provably better approximation of the Fisher Information Matrix than KFAC. In addition, while computing the Kronecker-factored eigenbasis is an expensive operation that needs to be amortized, tracking the diagonal variance is a cheap operation. EKFAC therefore allows partial updates of our curvature estimate at the iteration level. We conduct an empirical evaluation of EKFAC on a deep auto-encoder optimization task using fully-connected networks, and on CIFAR-10 using deep convolutional neural networks, where EKFAC shows improvements over KFAC in optimization.

## 2 Background and notations

We are given a dataset $D_{\text{train}}$ containing (input, target) examples $(x, y)$, and a neural network $f_\theta$ with parameter vector $\theta$ of size $n_\theta$. We want to find a value of $\theta$ that minimizes an empirical risk expressed as an average of a loss $\ell$ incurred over the training set: $R(\theta) = \mathbb{E}_{(x,y) \sim D_{\text{train}}}\!\left[\ell(f_\theta(x), y)\right]$. We will use $\mathbb{E}$ to denote both expectations w.r.t. a distribution or, as here, averages over finite sets, as made clear by the subscript and context. The algorithms we consider for optimizing $R$ use stochastic gradients $\nabla_\theta = \frac{\partial \ell(f_\theta(x), y)}{\partial \theta}$, or their average over a mini-batch of examples sampled from $D_{\text{train}}$. Stochastic gradient descent (SGD) does a first-order update $\theta \leftarrow \theta - \eta \nabla_\theta$, where $\eta$ is a positive learning rate. Second-order methods first multiply $\nabla_\theta$ by a preconditioning matrix $G^{-1}$, yielding the update $\theta \leftarrow \theta - \eta G^{-1} \nabla_\theta$. Preconditioning matrices for natural gradient (Amari, 1998) / generalized Gauss-Newton (Schraudolph, 2001) / TONGA (Le Roux et al., 2008) can all be expressed as either the (centered) covariance or the (un-centered) second moment of $\nabla_\theta$, computed over slightly different distributions of $(x, y)$. Thus natural gradient uses the Fisher Information Matrix, which for a probabilistic classifier can be expressed as $F = \mathbb{E}_{x \sim D_{\text{train}},\, y \sim p_\theta(y \mid x)}\!\left[\nabla_\theta \nabla_\theta^\top\right]$, where the expectation is taken over targets sampled from the model. The "empirical Fisher" approximation, or generalized Gauss-Newton, instead uses $G = \mathbb{E}_{(x,y) \sim D_{\text{train}}}\!\left[\nabla_\theta \nabla_\theta^\top\right]$. Our discussion and development applies regardless of the precise distribution over $(x, y)$ used to estimate a $G$, so we will from here on use $\mathbb{E}$ without a subscript.
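To make these definitions concrete, here is a small self-contained NumPy sketch (illustrative only: the toy model, data and step size are our own, not from the paper) that forms the "empirical Fisher" $G$ from per-example gradients of a squared loss and takes one damped preconditioned step $\theta \leftarrow \theta - \eta (G + \epsilon I)^{-1} \nabla_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: linear model with squared loss l = 0.5 * (theta . x - y)^2,
# so each per-example gradient is (theta . x - y) * x.
n_theta, n_examples = 3, 500
X = rng.normal(size=(n_examples, n_theta))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n_examples)
theta = np.zeros(n_theta)

residuals = X @ theta - y                     # shape (n_examples,)
per_example_grads = residuals[:, None] * X    # shape (n_examples, n_theta)

# "Empirical Fisher": (un-centered) second moment of per-example gradients.
G = per_example_grads.T @ per_example_grads / n_examples
mean_grad = per_example_grads.mean(axis=0)

# Damped preconditioned update: theta <- theta - eta * (G + eps I)^{-1} grad.
eta, eps = 0.5, 1e-3
theta_new = theta - eta * np.linalg.solve(G + eps * np.eye(n_theta), mean_grad)
```

The damping term $\epsilon I$ is the standard device for keeping the system well-conditioned; the paper's experiments likewise tune a damping hyperparameter.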

Matrix $G$ has a gigantic size $n_\theta \times n_\theta$, which makes it too big to compute and invert. In order to get a practical algorithm, we must find approximations of $G$ that keep some of the relevant second-order information while removing the unnecessary and computationally costly parts. A first simplification, adopted by nearly all prior approaches, consists in treating each layer of the neural network separately, ignoring cross-layer terms. This amounts to a first block-diagonal approximation of $G$: each block $G^{(l)}$ caters for the parameters of a single layer $l$. Now $G^{(l)}$ can typically still be extremely large.

A cheap but very crude approximation consists in using a diagonal $G$, i.e. taking into account the variance in each parameter dimension, but ignoring all covariance structure. A less stringent approximation was proposed by Heskes (2000) and later Martens & Grosse (2015). They propose to approximate $G^{(l)}$ as a Kronecker product $G^{(l)} \approx A \otimes B$, which involves two smaller matrices, making it much cheaper to store, compute and invert (since $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$). Specifically, for a layer $l$ that receives an input $h$ of size $d_{\text{in}}$ and computes linear pre-activations $a = W^{(l)} h$ of size $d_{\text{out}}$ (biases omitted for simplicity), followed by some non-linear activation function, let the backpropagated gradient on $a$ be $\delta = \frac{\partial \ell}{\partial a}$. The gradient on the parameters $W^{(l)}$ will be $\nabla_{W^{(l)}} = \delta h^\top$. The Kronecker-factored approximation of the corresponding $G^{(l)}$ will use $A = \mathbb{E}\!\left[h h^\top\right]$ and $B = \mathbb{E}\!\left[\delta \delta^\top\right]$, i.e. matrices of size $d_{\text{in}} \times d_{\text{in}}$ and $d_{\text{out}} \times d_{\text{out}}$, whereas the full $G^{(l)}$ would be of size $d_{\text{in}} d_{\text{out}} \times d_{\text{in}} d_{\text{out}}$. Using this Kronecker approximation (known as KFAC) corresponds to approximating entries of $G^{(l)}$ as follows: $\mathbb{E}\!\left[\delta_i h_j \,\delta_{i'} h_{j'}\right] \approx \mathbb{E}\!\left[h_j h_{j'}\right] \mathbb{E}\!\left[\delta_i \delta_{i'}\right]$.
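The computational appeal of this factorization can be checked numerically. The sketch below (an illustration with simulated $h$ and $\delta$, not the authors' code) builds the KFAC factors $A = \mathbb{E}[hh^\top]$ and $B = \mathbb{E}[\delta\delta^\top]$ and verifies the identity $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$ that makes the approximation cheap to invert:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n = 4, 3, 1000

# Simulated layer inputs h and backpropagated pre-activation gradients delta
# (stand-ins; in practice these come from the forward and backward passes).
H = rng.normal(size=(n, d_in))
D = rng.normal(size=(n, d_out))

A = H.T @ H / n            # A = E[h h^T], size d_in x d_in
B = D.T @ D / n            # B = E[delta delta^T], size d_out x d_out

G_kfac = np.kron(A, B)     # KFAC approximation of the full d_in*d_out block

# The identity that makes KFAC cheap: inv(A kron B) == inv(A) kron inv(B),
# so only the two small factors ever need to be inverted.
lhs = np.linalg.inv(G_kfac)
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(B))
```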

A similar principle can be applied to obtain a Kronecker-factored expression for the covariance of the gradients of the parameters of a convolutional layer (Grosse & Martens, 2016). To obtain matrices $A$ and $B$, one then needs to also sum over spatial locations and corresponding receptive fields, as illustrated in Figure 1.

## 3 Proposed method

### 3.1 Motivation: reflection on diagonal rescaling in different coordinate bases

It is instructive to contrast the "exact" natural gradient preconditioning of the gradient that uses the full Fisher Information Matrix with what we do when approximating it by a diagonal matrix only. Using the full matrix $G$ yields the natural gradient update $\theta \leftarrow \theta - \eta G^{-1} \nabla_\theta$. When resorting to a diagonal approximation we instead use $\theta \leftarrow \theta - \eta G_{\mathrm{diag}}^{-1} \nabla_\theta$, where $(G_{\mathrm{diag}})_{ii} = \mathbb{E}\!\left[\nabla_{\theta_i}^2\right]$. That update amounts to preconditioning the gradient vector by dividing each of its coordinates by an estimated second moment. This diagonal rescaling happens in the initial basis of parameters $\theta$. By contrast, a full natural gradient update can be seen to do a similar diagonal rescaling, not along the initial parameter basis axes, but along the axes of the eigenbasis of the matrix $G$. Let $G = U S U^\top$ be the eigendecomposition of $G$. The operations that yield the full natural gradient update $G^{-1} \nabla_\theta = U S^{-1} U^\top \nabla_\theta$ correspond to the sequence of: a) multiplying the gradient vector by $U^\top$, which corresponds to switching to the eigenbasis ($U^\top \nabla_\theta$ yields the coordinates of the gradient vector expressed in that basis); b) multiplying by the diagonal matrix $S^{-1}$, which rescales each coordinate $i$ (in that eigenbasis) by $1/S_{ii}$; c) multiplying by $U$, which switches the rescaled vector back to the initial basis of parameters. It is easy to show that $S_{ii} = \mathbb{E}\!\left[(U^\top \nabla_\theta)_i^2\right]$ (the proof is given in Appendix A.2). So, similarly to what we do when using a diagonal approximation, we are rescaling by the second moment of the gradient vector components, but rather than doing this in the initial parameter basis, we do it in the eigenbasis of $G$. Note that the variance measured along the leading eigenvector can be much larger than the variance along the axes of the initial parameter basis, so the effects of rescaling by either the full $G$ or its diagonal approximation can be very different.

Now what happens when we use the less crude KFAC approximation instead? We approximate $G \approx A \otimes B$ (this approximation is done separately for each block $G^{(l)}$; we drop the superscript to simplify notation), yielding the update $\theta \leftarrow \theta - \eta (A \otimes B)^{-1} \nabla_\theta$. Let us similarly look at it through its eigendecomposition. The eigendecomposition of the Kronecker product of two real symmetric positive semi-definite matrices can be expressed using their own eigendecompositions $A = U_A S_A U_A^\top$ and $B = U_B S_B U_B^\top$, yielding $A \otimes B = (U_A \otimes U_B)(S_A \otimes S_B)(U_A \otimes U_B)^\top$. $U_A \otimes U_B$ gives the orthogonal eigenbasis of the Kronecker product; we call it the Kronecker-Factored Eigenbasis (KFE). $S_A \otimes S_B$ is the diagonal matrix containing the associated eigenvalues. Note that each such eigenvalue will be a product of an eigenvalue of $A$ stored in $S_A$ and an eigenvalue of $B$ stored in $S_B$. We can view the action of the resulting Kronecker-factored preconditioning in the same way as we viewed the preconditioning by the full matrix: it consists in a) expressing the gradient vector in a different basis $U_A \otimes U_B$, which can be thought of as approximating the directions of $U$; b) doing a diagonal rescaling by $(S_A \otimes S_B)^{-1}$ in that basis; c) switching back to the initial parameter space. Here, however, the rescaling factor is not guaranteed to match the second moment along the associated eigenvector.

In summary (see Figure 2):

• Full matrix preconditioning will scale by variances estimated along the eigenbasis $U$ of $G$.

• Diagonal preconditioning will scale by variances properly estimated, but along the initial parameter basis, which can be very far from the eigenbasis of $G$.

• KFAC preconditioning will scale the gradient along the KFE basis $U_A \otimes U_B$, which will likely be closer to the eigenbasis of $G$, but it doesn't use properly estimated variances along these axes for this scaling (the scales being themselves constrained to be a Kronecker product $S_A \otimes S_B$).
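The eigendecomposition property underlying the KFE can be verified numerically. This sketch (illustrative, with random positive semi-definite stand-ins for $A$ and $B$) checks that $A \otimes B = (U_A \otimes U_B)(S_A \otimes S_B)(U_A \otimes U_B)^\top$ and that $U_A \otimes U_B$ is orthogonal:

```python
import numpy as np

rng = np.random.default_rng(2)

def rand_spd(d):
    # Random symmetric positive semi-definite matrix.
    M = rng.normal(size=(d, d))
    return M @ M.T / d

A, B = rand_spd(4), rand_spd(3)
SA, UA = np.linalg.eigh(A)   # A = UA diag(SA) UA^T
SB, UB = np.linalg.eigh(B)   # B = UB diag(SB) UB^T

# Kronecker-Factored Eigenbasis (KFE) and its eigenvalues.
U_kfe = np.kron(UA, UB)                    # orthogonal eigenbasis of A kron B
S_kron = np.kron(np.diag(SA), np.diag(SB)) # each eigenvalue is a product

reconstructed = U_kfe @ S_kron @ U_kfe.T
```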

### 3.2 Eigenvalue-corrected Kronecker Factorization (EKFAC)

To correct for the potentially inexact rescaling of KFAC, and obtain a better but still computationally efficient approximation, instead of $A \otimes B$ we propose to use an Eigenvalue-corrected Kronecker Factorization: $G_{\mathrm{EKFAC}} = (U_A \otimes U_B)\, S^*\, (U_A \otimes U_B)^\top$, where $S^*$ is the diagonal matrix defined by $S^*_{ii} = \mathbb{E}\!\left[\big((U_A \otimes U_B)^\top \nabla_\theta\big)_i^2\right]$. The vector $s^* = \operatorname{diag}(S^*)$ is the vector of second moments of the gradient vector coordinates expressed in the approximate basis $U_A \otimes U_B$, and can be efficiently estimated and stored.

In Appendix A.1 we prove that this is the optimal diagonal rescaling in that basis, in the sense that $S^* = \arg\min_S \left\| G - (U_A \otimes U_B)\, S\, (U_A \otimes U_B)^\top \right\|_F$ s.t. $S$ is diagonal: it minimizes the approximation error to $G$ as measured by the Frobenius norm (denoted $\|\cdot\|_F$), which KFAC's corresponding $S_A \otimes S_B$ cannot generally achieve. A corollary of this is that we will always have $\|G - G_{\mathrm{EKFAC}}\|_F \le \|G - G_{\mathrm{KFAC}}\|_F$, i.e. EKFAC yields a better approximation of $G$ than KFAC (Theorem 3, proven in the Appendix). Figure 2 illustrates the different rescaling strategies, including EKFAC.
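This Frobenius-norm guarantee can be checked on a small synthetic example. The sketch below (with simulated, deliberately correlated $h$ and $\delta$, so that $G$ is not exactly a Kronecker product) builds $G$, its KFAC approximation, and its EKFAC approximation with $s^*$ estimated in the KFE, and confirms that the EKFAC residual is no larger:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, n = 3, 2, 2000

# Per-example layer gradients vec(h delta^T), with h and delta coupled so that
# G is NOT exactly a Kronecker product (otherwise both would be exact).
H = rng.normal(size=(n, d_in))
D = rng.normal(size=(n, d_out)) + 0.5 * H[:, :d_out]
grads = np.einsum('ni,nj->nij', H, D).reshape(n, -1)  # (n, d_in * d_out)

G = grads.T @ grads / n                               # true second moment

A = H.T @ H / n
B = D.T @ D / n
_, UA = np.linalg.eigh(A)
_, UB = np.linalg.eigh(B)
U = np.kron(UA, UB)                                   # the KFE

G_kfac = np.kron(A, B)
# EKFAC: re-estimate the diagonal second moments in the KFE.
s_star = np.mean((grads @ U) ** 2, axis=0)
G_ekfac = U @ np.diag(s_star) @ U.T

err_kfac = np.linalg.norm(G - G_kfac)    # Frobenius norms of the residuals
err_ekfac = np.linalg.norm(G - G_ekfac)
```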

##### Potential benefits:

Since EKFAC is a better approximation of $G$ than KFAC (in the limited sense of the Frobenius norm of the residual), it could yield a better preconditioning of the gradient for optimizing neural networks (although there is no guarantee: in particular, $G_{\mathrm{EKFAC}}$ being a better approximation of $G$ does not guarantee that $G_{\mathrm{EKFAC}}^{-1} \nabla_\theta$ will be closer to the natural gradient update direction $G^{-1} \nabla_\theta$). Another potential benefit is linked to computational efficiency: even if KFAC yielded a reasonably good approximation, it is costly to re-estimate and invert matrices $A$ and $B$, so this has to be amortized over many updates. Re-estimation of the preconditioning is thus typically done at a much lower frequency than the parameter updates, and may lag behind, no longer accurately reflecting the local second-order information. Re-estimating the Kronecker-factored eigenbasis for EKFAC is similarly costly and must be similarly amortized. But re-estimating the diagonal scaling $s^*$ in that basis is cheap, doable with every mini-batch, so we can hope to reactively track and leverage the changes in second-order information along these directions.

##### Algorithm:

Using the Eigenvalue-corrected Kronecker Factorization (EKFAC) for neural network optimization involves: a) periodically (every few mini-batches) computing the Kronecker-factored eigenbasis $U_A \otimes U_B$ by doing an eigendecomposition of the same $A$ and $B$ matrices as KFAC; b) estimating the scaling vector $s^*$ as second moments of gradient coordinates in that implied basis; c) preconditioning gradients accordingly prior to updating model parameters. Algorithm 1 provides high-level pseudo-code for the case of fully-connected layers (EKFAC for convolutional layers follows the same structure, but requires a more convoluted notation), and when using EKFAC to approximate the "empirical Fisher". In this version, we re-estimate $s^*$ from scratch on each mini-batch. An alternative is to update a running average estimate of the variance (of either individual gradients or mini-batch averaged gradients), denoted EKFAC-ra (for running average) in Section 4.
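The three steps above can be sketched as follows. This is a minimal NumPy illustration of the structure of Algorithm 1 for a single fully-connected layer, not the authors' implementation; the class name, method names and damping scheme are our own assumptions:

```python
import numpy as np

class EKFACLayer:
    """Sketch of EKFAC preconditioning for one fully-connected layer
    (bias omitted), approximating the "empirical Fisher".
    Hypothetical API, for illustration only."""

    def __init__(self, d_in, d_out, eps=1e-3):
        self.UA = np.eye(d_in)           # KFE factors, recomputed periodically
        self.UB = np.eye(d_out)
        self.s = np.ones((d_out, d_in))  # diagonal second moments in the KFE
        self.eps = eps                   # damping

    def recompute_basis(self, H, D):
        # Amortized step: eigenbases of the KFAC factors A = E[hh^T], B = E[dd^T].
        A = H.T @ H / len(H)
        B = D.T @ D / len(D)
        _, self.UA = np.linalg.eigh(A)
        _, self.UB = np.linalg.eigh(B)

    def update_scalings(self, H, D):
        # Cheap per-batch step: second moments of per-example gradients
        # delta h^T expressed in the KFE.
        per_ex = np.einsum('no,ni->noi', D, H)               # (n, d_out, d_in)
        proj = np.einsum('op,npq,qi->noi', self.UB.T, per_ex, self.UA)
        self.s = np.mean(proj ** 2, axis=0)

    def precondition(self, grad_W):
        # Project to the KFE, rescale by second moments, project back.
        g = self.UB.T @ grad_W @ self.UA
        g = g / (self.s + self.eps)
        return self.UB @ g @ self.UA.T
```

Only `recompute_basis` requires eigendecompositions and is meant to be amortized; `update_scalings` and `precondition` involve only small matrix products and can run on every mini-batch.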

##### Dual view by working in the KFE:

Instead of thinking of this new method as an improved factorization of $G$ that we use as a preconditioning matrix, we can adopt the alternate view of applying a diagonal method, but in a different basis where the diagonal approximation is more accurate (an assumption we empirically confirm in Figure 3). This can be seen by reinterpreting the update given by EKFAC as a 3-step process: project the gradient into the KFE (multiply by $(U_A \otimes U_B)^\top$), apply the diagonal natural gradient rescaling in this basis (divide coordinate-wise by $s^*$), then project back to the parameter space (multiply by $U_A \otimes U_B$).

Note that, by writing the gradient of a layer in matrix form $\nabla_W$, its projection into the KFE simplifies to $U_B^\top \nabla_W\, U_A$, so the full Kronecker product never needs to be formed. Figure 3 shows gradient correlation matrices in both the initial parameter basis and in the KFE. Gradient components appear far less correlated when expressed in the KFE, which justifies the use of a diagonal method in that basis.
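This simplification rests on a standard Kronecker/vec identity: with a column-major $\operatorname{vec}$, $(U_A \otimes U_B)^\top \operatorname{vec}(V) = \operatorname{vec}(U_B^\top V\, U_A)$, so the KFE projection costs two small matrix products instead of one huge one. A small numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_out = 4, 3
UA = np.linalg.qr(rng.normal(size=(d_in, d_in)))[0]   # stand-in orthogonal bases
UB = np.linalg.qr(rng.normal(size=(d_out, d_out)))[0]
V = rng.normal(size=(d_out, d_in))                    # a layer gradient, matrix form

def vec(M):
    # Column-major vec, matching the Kronecker-product convention.
    return M.flatten(order='F')

# Naive projection: materialize the (d_in*d_out) x (d_in*d_out) basis.
naive = np.kron(UA, UB).T @ vec(V)
# Cheap projection: two small matrix products.
fast = vec(UB.T @ V @ UA)
```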

This viewpoint brings us close to network reparametrization approaches such as Fujimoto & Ohira (2018), whose proposal – already hinted at by Desjardins et al. (2015) – amounts to a reparametrization equivalent of KFAC. More precisely, while Desjardins et al. (2015) empirically explored a reparametrization that uses only the input covariance $A$ (and thus corresponds to only "half of" KFAC), Fujimoto & Ohira (2018) extend this to also use the backpropagated gradient covariance $B$, making it essentially equivalent to KFAC (with a few extra twists). Our approach differs in that moving to the KFE corresponds to a change of orthonormal basis, and in that we cheaply track and perform a more optimal full diagonal rescaling in that basis, rather than the constrained factored diagonal $S_A \otimes S_B$ that these other approaches are implicitly using.

## 4 Experiments

This section presents an empirical evaluation of our proposed Eigenvalue-corrected KFAC (EKFAC) algorithm in two variants: EKFAC estimates the scalings $s^*$ as second moments of intra-batch gradients (in KFE coordinates) as in Algorithm 1, whereas EKFAC-ra estimates $s^*$ as a running average of squared mini-batch gradients (in KFE coordinates). We compare them with KFAC and other baselines, primarily SGD with momentum, with and without batch normalization (BN). For all our experiments, KFAC and EKFAC approximate the "empirical Fisher" $G$. This research focuses on improving optimization techniques, so except when specified otherwise, we performed model and hyperparameter selection based on the performance of the optimization objective, i.e. on training loss.

### 4.1 Deep auto-encoder

We consider the task of minimizing the reconstruction error of an 8-layer auto-encoder on the MNIST dataset, a standard task used to benchmark optimization algorithms (Hinton & Salakhutdinov, 2006; Martens & Grosse, 2015; Desjardins et al., 2015). The model consists of an encoder composed of 4 fully-connected sigmoid layers of decreasing width, and a symmetric decoder (with untied weights).

We compare EKFAC, which computes the second moment statistics over its mini-batch, and EKFAC-ra, its running average variant, with different baselines (KFAC, SGD, SGD with BN, Adam and Adam with BN). For each algorithm, the best hyperparameters were selected using a mix of grid and random search based on training error. The grid covers the learning rate and the damping parameter, the mini-batch size, and the frequency of reparametrization (i.e. recomputing the inverse or eigendecomposition) for KFAC and EKFAC: either every 50 or 100 updates. In addition, we explored 20 values by random search around each grid point. We found that extra care must be taken when choosing the values of the learning rate and the damping parameter in order to get good performance, as is often observed when working with algorithms that leverage curvature information (see Figure 8 (d)). The learning rate and the damping parameter are kept constant during training.

Figure 8 (a) reports the training loss during training and shows that EKFAC and EKFAC-ra both minimize the training loss faster per epoch than KFAC and the other baselines. In addition, Figure 8 (b) shows that an efficient estimation of the diagonal scaling vector $s^*$, as done by EKFAC, allows faster training in wall-clock time. The use of a running average in EKFAC-ra leads to faster training than KFAC, while EKFAC is on par with the latter. Finally, EKFAC and EKFAC-ra achieve better optimization on this task while maintaining their generalization performance (Figure 8 (c)).

Next we investigate how the frequency of the reparametrizations affects the optimization. In Figure 12, we compare KFAC/EKFAC with different reparametrization frequencies to a strong KFAC baseline where we re-estimate and invert $A$ and $B$ at each update. This baseline outperforms the amortized versions (as a function of the number of epochs), and is likely to leverage a better approximation of $G$ as it recomputes the approximate eigenbasis at each update. However, it comes at a high computational cost, as shown in Figure 12 (b). Amortizing the eigendecomposition strongly decreases the computational cost while only slightly degrading the optimization performance. In addition, Figure 12 (a) shows that amortized EKFAC preserves the optimization performance better than its KFAC counterpart. EKFAC re-estimates at each update the diagonal second moments in the KFE basis, which correspond to the eigenvalues of the EKFAC approximation of $G$. This could reduce its estimation error, as the approximation can better match the true curvature of the loss function. To verify this hypothesis, we investigate how the eigenspectrum of the true empirical Fisher $G$ changes compared to the eigenspectra of its approximations $G_{\mathrm{KFAC}}$ and $G_{\mathrm{EKFAC}}$. In Figure 12 (c), we track their eigenspectra and report the distance between them during training. We compute the KFE once at the beginning and then keep it fixed during training. We focus on one small layer of the auto-encoder, since its size allows estimating the corresponding $G$ and computing its eigenspectrum at a reasonable cost. We observe that the spectrum of $G_{\mathrm{KFAC}}$ quickly diverges from the spectrum of $G$, whereas the cheap frequent re-estimation of the diagonal scaling for $G_{\mathrm{EKFAC}}$ allows its spectrum to stay much closer to that of $G$. This is true for both the running average and intra-batch versions of EKFAC.

### 4.2 Cifar-10

In this section, we evaluate our proposed algorithm on the CIFAR-10 dataset using a VGG11 convolutional neural network (Simonyan & Zisserman, 2015) and a Resnet34 (He et al., 2016). To implement KFAC/EKFAC in a convolutional neural network, we rely on the SUA approximation (Grosse & Martens, 2016), which has been shown to be competitive in practice (Laurent et al., 2018). We highlight that we do not use BN in our models when they are trained using KFAC/EKFAC.

As in the previous experiments, a grid search is performed to select the hyperparameters. Around each grid point, learning rate and damping values are further explored through random search. We experiment with a constant learning rate in this section, but explore learning rate schedules with KFAC/EKFAC in Appendix C.2. In figures reporting the model training loss per epoch or wall-clock time, we report the performance of the hyperparameters attaining the lowest training loss at each epoch. This per-epoch model selection shows which model reaches the lowest cost during training, and also which model optimizes best given any "epoch budget".

In Figure 16, we compare EKFAC/EKFAC-ra to KFAC and SGD with momentum, with or without BN, when training a VGG-11 network. We use a batch size of 500 for the KFAC-based approaches and 200 for the SGD baselines. Figure 16 (a) shows that EKFAC yields better optimization than the SGD baselines and KFAC in training loss per epoch when the computation of the KFE is amortized. Figure 16 (c) also shows that models trained with EKFAC maintain good generalization. EKFAC-ra shows some wall-clock time improvements over the baselines in that setting (Figure 16 (b)). However, we observe that KFAC with a batch size of 200 can catch up to EKFAC in wall-clock time despite being outperformed in terms of optimization per iteration (see Figure C.7 in the Appendix). VGG11 is a relatively small network by modern standards, and KFAC (with the SUA approximation) remains computationally bearable for this model. We hypothesize that with smaller batches, KFAC can be updated often enough per epoch to have a reasonable estimation error while not paying too high a computational cost.

In Figure 20, we report similar results on the Resnet34. We compare EKFAC-ra with KFAC and SGD with momentum (with and without BN). In order to train the Resnet34 without BN, we rely on a careful initialization scheme that ensures good signal propagation during the forward and backward passes (see Appendix B for details). EKFAC outperforms both KFAC (when amortized) and SGD with momentum in terms of optimization per epoch and compute time. This gain is robust across different batch sizes, as shown in Figure C.10.

## 5 Conclusion and future work

In this work, we introduced the Eigenvalue-corrected Kronecker Factorization (EKFAC), an approximate factorization of the (empirical) Fisher Information Matrix that is computationally manageable while still being accurate. We formally proved (in the Appendix) that EKFAC yields a more accurate estimate than its closest competitor KFAC, in the sense of the Frobenius norm. In addition, we showed that our algorithm allows cheaply performing partial updates of our curvature estimate, maintaining an up-to-date estimate of its eigenvalues while keeping the estimate of its eigenbasis fixed. This partial updating proves competitive when applied to standard optimization tasks, both with respect to the number of iterations and to wall-clock time.

Our approach amounts to normalizing the gradient by its second moment, component-wise, in a Kronecker-factored Eigenbasis (KFE). But one could apply other component-wise (diagonal) adaptive algorithms, such as Adagrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012) or Adam (Kingma & Ba, 2015), in the KFE, where the diagonal approximation is much more accurate. This is left for future work. We also intend to explore alternative strategies for obtaining the approximate eigenbasis, and to investigate how to increase the robustness of the algorithm with respect to the damping hyperparameter. We also want to explore novel regularization strategies, so that the advantage of efficient optimization algorithms can more reliably be translated into lower generalization error.

#### Acknowledgments

The experiments were conducted using PyTorch (Paszke et al., 2017). The authors would like to acknowledge the support of Calcul Québec, Compute Canada, CIFAR and Facebook for research funding and computational resources.

## References

• Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 1998.
• Ba et al. (2017) Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using kronecker-factored approximations. In ICLR, 2017.
• Becker et al. (1988) Sue Becker, Yann Le Cun, et al. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann, 1988.
• Bottou et al. (2016) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint, 2016.
• Desjardins et al. (2015) Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In NIPS, 2015.
• Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
• Fujimoto & Ohira (2018) Yuki Fujimoto and Toru Ohira. A neural network model with bidirectional whitening. In International Conference on Artificial Intelligence and Soft Computing, pp. 47–57. Springer, 2018.
• Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In ICLR, 2017.
• Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
• Grosse & Martens (2016) Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In ICML, 2016.
• He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
• He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
• Heskes (2000) Tom Heskes. On "natural" learning and pruning in multilayered perceptrons. Neural Computation, 12(4):881–901, 2000.
• Hinton & Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
• Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
• Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
• Laurent et al. (2018) César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of fisher approximations beyond kronecker factorization. ICLR Workshop, 2018.
• Le Roux et al. (2008) Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NIPS, 2008.
• Liu & Nocedal (1989) Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 1989.
• Martens & Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, 2015.
• Ollivier (2015) Yann Ollivier. Riemannian metrics for neural networks i: feedforward networks. Information and Inference: A Journal of the IMA, 2015.
• Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
• Schraudolph (2001) Nicol N Schraudolph. Fast curvature matrix-vector products. In International Conference on Artificial Neural Networks. Springer, 2001.
• Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
• Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012.
• Zeiler (2012) Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

## Appendix A Proofs

### A.1 Proof that EKFAC does the optimal diagonal rescaling in the KFE

###### Lemma 1.

Let $G$ be a real positive semi-definite matrix, and let $Q$ be a given orthogonal matrix. Among diagonal matrices, the diagonal matrix $D^*$ with diagonal entries $D^*_{ii} = (Q^\top G Q)_{ii}$ minimizes the approximation error $e = \left\| G - Q D Q^\top \right\|_F$ (measured as the Frobenius norm).

###### Proof.

Since the Frobenius norm remains unchanged through multiplication by an orthogonal matrix, we can write

$$e^2 = \left\|G - QDQ^\top\right\|_F^2 = \left\|Q^\top\left(G - QDQ^\top\right)Q\right\|_F^2 = \left\|Q^\top G Q - D\right\|_F^2 = \underbrace{\sum_i \left(Q^\top G Q - D\right)_{ii}^2}_{\text{diagonal}} + \underbrace{\sum_i \sum_{j \neq i} \left(Q^\top G Q\right)_{ij}^2}_{\text{off-diagonal}}$$

Since $D$ is diagonal, it does not affect the off-diagonal terms.

The squared diagonal terms all reach their minimum value of $0$ by setting $D_{ii} = (Q^\top G Q)_{ii}$ for all $i$:

$$D^*_{ii} = \left(Q^\top G Q\right)_{ii} = \left(Q^\top \mathbb{E}\left[\nabla_\theta \nabla_\theta^\top\right] Q\right)_{ii} = \left(\mathbb{E}\left[Q^\top \nabla_\theta \nabla_\theta^\top Q\right]\right)_{ii} = \left(\mathbb{E}\left[Q^\top \nabla_\theta \left(Q^\top \nabla_\theta\right)^\top\right]\right)_{ii} = \mathbb{E}\left[\left(Q^\top \nabla_\theta\right)_i^2\right] \quad \text{since } Q^\top \nabla_\theta \text{ is a vector}$$

We have thus shown that the diagonal matrix $D^*$ with diagonal entries $D^*_{ii} = \mathbb{E}[(Q^\top \nabla_\theta)_i^2]$ minimizes $e^2$. Since the Frobenius norm is non-negative, this implies that $D^*$ also minimizes $e$. ∎
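As a sanity check, the claim of Lemma 1 is easy to verify numerically. The following NumPy sketch (variable names are ours, not the paper's) compares the optimal diagonal against random diagonal rescalings in the same basis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random PSD matrix G and random orthogonal basis Q.
M = rng.standard_normal((n, n))
G = M @ M.T
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Optimal diagonal from Lemma 1: D*_ii = (Q^T G Q)_ii.
D_star = np.diag(np.diag(Q.T @ G @ Q))
err_star = np.linalg.norm(G - Q @ D_star @ Q.T, "fro")

# No other diagonal rescaling in the basis Q does better.
for _ in range(1000):
    D = np.diag(rng.standard_normal(n))
    err = np.linalg.norm(G - Q @ D @ Q.T, "fro")
    assert err_star <= err + 1e-10
```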

###### Theorem 2.

Let $G = \mathbb{E}[\nabla_\theta \nabla_\theta^\top]$ be the matrix we want to approximate, and let $G_{\mathrm{KFAC}} = A \otimes B$ be the approximation of $G$ obtained by KFAC. Let $U_A S_A U_A^\top$ and $U_B S_B U_B^\top$ be the eigendecompositions of $A$ and $B$. The diagonal rescaling that EKFAC performs in the Kronecker-factored Eigenbasis (KFE) is optimal in the sense that it minimizes the Frobenius norm of the approximation error: among diagonal matrices $S$, the approximation error $\|G - (U_A \otimes U_B)\,S\,(U_A \otimes U_B)^\top\|_F$ is minimized by the matrix $S^*$ with diagonal entries $S^*_{ii} = \mathbb{E}[((U_A \otimes U_B)^\top \nabla_\theta)_i^2]$.

###### Proof.

This follows directly by setting $Q = U_A \otimes U_B$ in Lemma 1. Note that the Kronecker product of two orthogonal matrices yields an orthogonal matrix. ∎
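The orthogonality of a Kronecker product of orthogonal matrices, used in this step, follows from the mixed-product property $(A \otimes B)(C \otimes D) = AC \otimes BD$, and can be checked numerically (a small sketch with names of our choosing):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two random orthogonal matrices, obtained from QR decompositions.
UA, _ = np.linalg.qr(rng.standard_normal((3, 3)))
UB, _ = np.linalg.qr(rng.standard_normal((4, 4)))

# Their Kronecker product is again orthogonal: U U^T = I.
U = np.kron(UA, UB)
assert np.allclose(U @ U.T, np.eye(12))
```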

###### Theorem 3.

Let $G_{\mathrm{KFAC}}$ be the KFAC approximation of $G$ and $G_{\mathrm{EKFAC}}$ the EKFAC approximation of $G$. We always have $\|G - G_{\mathrm{EKFAC}}\|_F \le \|G - G_{\mathrm{KFAC}}\|_F$.

###### Proof.

This follows directly from Theorem 2 on the optimality of the EKFAC diagonal rescaling.

Since $S^*$, the diagonal matrix of EKFAC, minimizes $\|G - (U_A \otimes U_B)\,S\,(U_A \otimes U_B)^\top\|_F$ among all diagonal matrices $S$, including $S = S_A \otimes S_B$, it implies that:

$$\begin{aligned} \left\|G - (U_A \otimes U_B)\,S^*\,(U_A \otimes U_B)^\top\right\|_F &\le \left\|G - (U_A \otimes U_B)(S_A \otimes S_B)(U_A \otimes U_B)^\top\right\|_F \\ \left\|G - G_{\mathrm{EKFAC}}\right\|_F &\le \left\|G - \left(U_A S_A U_A^\top\right) \otimes \left(U_B S_B U_B^\top\right)\right\|_F \\ \left\|G - G_{\mathrm{EKFAC}}\right\|_F &\le \left\|G - A \otimes B\right\|_F \\ \left\|G - G_{\mathrm{EKFAC}}\right\|_F &\le \left\|G - G_{\mathrm{KFAC}}\right\|_F \end{aligned}$$

We have thus demonstrated that EKFAC yields a better approximation of $G$ than KFAC (more precisely: at least as good, in Frobenius norm error). ∎
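Theorem 3 can likewise be illustrated empirically. In the sketch below (our construction), $A$ and $B$ stand in for KFAC's Kronecker factors; the inequality holds for any choice of PSD factors, since EKFAC's diagonal is optimal in the shared eigenbasis:

```python
import numpy as np

rng = np.random.default_rng(2)
da, db = 3, 4
d = da * db

# Sampled gradients; G is their empirical second-moment matrix.
g = rng.standard_normal((5000, d)) @ rng.standard_normal((d, d))
G = g.T @ g / len(g)

# Stand-in Kronecker factors A, B (any PSD matrices work for this check).
Ma = rng.standard_normal((da, da)); A = Ma @ Ma.T
Mb = rng.standard_normal((db, db)); B = Mb @ Mb.T
SA, UA = np.linalg.eigh(A)
SB, UB = np.linalg.eigh(B)

G_kfac = np.kron(A, B)

# EKFAC: keep the KFE basis U = UA ⊗ UB but use the optimal diagonal
# S*_ii = E[(U^T grad)_i^2] from Theorem 2.
U = np.kron(UA, UB)
S_star = np.mean((g @ U) ** 2, axis=0)
G_ekfac = (U * S_star) @ U.T  # U diag(S*) U^T

err_ekfac = np.linalg.norm(G - G_ekfac, "fro")
err_kfac = np.linalg.norm(G - G_kfac, "fro")
assert err_ekfac <= err_kfac + 1e-10
```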

### A.2 Proof that $S_{ii} = \mathbb{E}[(U^\top \nabla_\theta)_i^2]$

###### Theorem 4.

Let $G = \mathbb{E}[\nabla_\theta \nabla_\theta^\top]$ and let $G = U S U^\top$ be its eigendecomposition.
Then $S_{ii} = \mathbb{E}[(U^\top \nabla_\theta)_i^2]$.

###### Proof.

Starting from the eigendecomposition $G = U S U^\top$ and the fact that $U$ is orthogonal, so that $U^\top U = I$, we can write

$$G = U S U^\top \implies U^\top G U = U^\top U S U^\top U = S$$

so that

$$S_{ii} = \left(U^\top \mathbb{E}\left[\nabla_\theta \nabla_\theta^\top\right] U\right)_{ii} = \left(\mathbb{E}\left[U^\top \nabla_\theta \nabla_\theta^\top U\right]\right)_{ii} = \left(\mathbb{E}\left[U^\top \nabla_\theta \left(U^\top \nabla_\theta\right)^\top\right]\right)_{ii} = \mathbb{E}\left[\left(U^\top \nabla_\theta\right)_i^2\right]$$

where we obtained the last equality by observing that $U^\top \nabla_\theta$ is a vector, and that the diagonal entries of the matrix $vv^\top$ for any vector $v$ are given by $v^2$, where the square operation is element-wise. ∎
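Theorem 4 says the eigenvalues of $G$ are the second moments of the gradient expressed in the eigenbasis. A small NumPy check of this identity (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 5, 10000

# Sampled gradients and their empirical second-moment matrix G.
g = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))
G = g.T @ g / N

# Eigendecomposition G = U S U^T (eigh, since G is symmetric PSD).
S, U = np.linalg.eigh(G)

# Theorem 4: S_ii = E[(U^T grad)_i^2], here the empirical expectation.
S_hat = np.mean((g @ U) ** 2, axis=0)
assert np.allclose(S_hat, S)
```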

## Appendix B Residual network initialization

To train residual networks without using BN, one needs to initialize them carefully, so we used the following procedure, denoting by $n$ the fan-in of the layer:

1. We use the He initialization (He et al., 2015) for each layer directly preceded by a ReLU: $W \sim \mathcal{N}(0, 2/n)$, $b = 0$.

2. Each layer not directly preceded by an activation function (for example the convolution in a skip connection) is initialized as $W \sim \mathcal{N}(0, 1/n)$, $b = 0$. This can be derived from the He initialization, using the identity as activation function.

3. Inspired by Goyal et al. (2017), we divide the scale of the last convolution in each residual block by a factor of 10: $W \sim \mathcal{N}(0, 2/(100\,n))$, $b = 0$. This not only helps preserve the variance through the network, but also eases optimization at the beginning of training.
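The three rules above can be sketched as follows in NumPy (the function and mode names are ours, and "dividing the scale by 10" in rule 3 is read as dividing the standard deviation by 10):

```python
import numpy as np

rng = np.random.default_rng(4)

def init_weight(shape, fan_in, mode):
    """Sample a weight tensor following the three rules; biases are set to 0.

    mode="relu":   layer directly preceded by a ReLU (He initialization).
    mode="linear": layer with no preceding activation (e.g. skip connection).
    mode="last":   last convolution of a residual block (scale divided by 10).
    """
    std = {
        "relu": np.sqrt(2.0 / fan_in),
        "linear": np.sqrt(1.0 / fan_in),
        "last": np.sqrt(2.0 / fan_in) / 10.0,
    }[mode]
    return rng.standard_normal(shape) * std

# Example: a 3x3 convolution with 3 input channels, feeding a ReLU.
W = init_weight((64, 3, 3, 3), fan_in=3 * 3 * 3, mode="relu")
```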

## Appendix C Additional empirical results

### C.1 Impact of batch size

In this section, we evaluate the impact of the batch size on optimization performance for KFAC and EKFAC. In Figure C.4, we report the training loss per epoch for different batch sizes on VGG11. We observe that the optimization gain of EKFAC relative to KFAC diminishes as the batch size gets smaller.

In Figure C.7, we look at the training loss per iteration and per unit of computation time for different batch sizes, again on VGG11. EKFAC shows optimization benefits over KFAC as we increase the batch size (thus reducing the number of inverses/eigendecompositions per epoch). However, this gain does not translate into faster training in terms of computation time in this setting. VGG11 is a relatively small network by modern standards, and the SUA approximation remains computationally bearable on this model. We hypothesize that with smaller batches, KFAC can be updated often enough per epoch to maintain a reasonable estimation error while not paying too high a computational price.

In Figure C.10, we perform a similar experiment on ResNet34. In this setting, we observe that the optimization gain of EKFAC relative to KFAC remains consistent across batch sizes.

### C.2 Learning rate schedule

In this section we investigate the impact of a learning rate schedule on the optimization of EKFAC. We use a setting similar to the CIFAR-10 experiment, except that we decay the learning rate by a factor of 2 every 20 epochs. Figures C.13 and C.16 show that EKFAC still exhibits an optimization benefit relative to the baseline when combined with a learning rate schedule.