1 Introduction
Deep networks have exhibited state-of-the-art performance in many application areas, including image recognition (He et al., 2016) and natural language processing (Gehring et al., 2017). However, top-performing systems often require days of training time and a large amount of computational power, so there is a need for efficient training methods.
Stochastic Gradient Descent (SGD) and its variants are the current workhorse for training neural networks. Training consists in optimizing the network parameters $\theta$ (of size $n_\theta$) to minimize a regularized empirical risk $R(\theta)$ through gradient descent. The negative loss gradient is approximated based on a small subset of training examples (a mini-batch). The loss functions of neural networks are highly non-convex functions of the parameters, and the loss surface is known to have highly imbalanced curvature, which limits the efficiency of 1st order optimization methods such as SGD.
Methods that employ 2nd order information have the potential to speed up 1st order gradient descent by correcting for imbalanced curvature. The parameters are then updated as $\theta \leftarrow \theta - \eta G^{-1} \nabla_\theta R(\theta)$, where $\eta$ is a positive learning rate and $G$ is a preconditioning matrix capturing the local curvature or related information, such as the Hessian matrix in Newton's method or the Fisher Information Matrix in Natural Gradient (Amari, 1998). Matrix $G$ has a gigantic size $n_\theta \times n_\theta$, which makes it too large to compute and invert in the context of modern deep neural networks with millions of parameters. For practical applications, it is necessary to trade off quality of curvature information for efficiency.
A large family of algorithms used for optimizing neural networks can be viewed as approximating the diagonal of a large preconditioning matrix. Diagonal approximations of the Hessian (Becker et al., 1988) have proven to be efficient, and algorithms that use the diagonal of the covariance matrix of the gradients are widely used among neural network practitioners, such as Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). We refer the reader to Bottou et al. (2016) for an informative review of optimization methods for deep networks, including diagonal rescalings, and connections with the Batch Normalization (BN) (Ioffe & Szegedy, 2015) technique.
More elaborate algorithms are not restricted to diagonal approximations, but instead aim at accounting for some correlations between different parameters (as encoded by non-diagonal elements of the preconditioning matrix). These methods range from Ollivier (2015), who introduces a rank-1 update that accounts for the cross-correlations between the biases and the weight matrices, to quasi-Newton methods (Liu & Nocedal, 1989) that build a running estimate of the exact non-diagonal preconditioning matrix, and also include block-diagonal approaches with blocks corresponding to entire layers (Heskes, 2000; Desjardins et al., 2015; Martens & Grosse, 2015; Fujimoto & Ohira, 2018). Factored approximations such as KFAC (Martens & Grosse, 2015; Ba et al., 2017) approximate each block as a Kronecker product of two much smaller matrices, both of which can be estimated and inverted more efficiently than the full block matrix, since the inverse of a Kronecker product of two matrices is the Kronecker product of their inverses.
In the present work, we draw inspiration from both diagonal and factored approximations. We introduce an Eigenvalue-corrected Kronecker Factorization (EKFAC) that consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis. We show that EKFAC is a provably better approximation of the Fisher Information Matrix than KFAC. In addition, while computing the Kronecker-factored eigenbasis is an expensive operation that needs to be amortized, tracking the diagonal variance is a cheap operation. EKFAC therefore allows us to perform partial updates of our curvature estimate at the iteration level. We conduct an empirical evaluation of EKFAC on the deep autoencoder optimization task using fully-connected networks, and on CIFAR-10 using deep convolutional neural networks, where EKFAC shows improvements over KFAC in optimization.
2 Background and notations
We are given a dataset $D_{\text{train}}$ containing (input, target) examples $(x, y)$, and a neural network $f_\theta(x)$ with parameter vector $\theta$ of size $n_\theta$. We want to find a value of $\theta$ that minimizes an empirical risk $R(\theta)$ expressed as an average of a loss $\ell$ incurred over the training set: $R(\theta) = \mathbb{E}_{(x,y) \in D_{\text{train}}}[\ell(f_\theta(x), y)]$. We will use $\mathbb{E}$ to denote both expectations w.r.t. a distribution or, as here, averages over finite sets, as made clear by the subscript and context. The algorithms considered for optimizing $R(\theta)$ use stochastic gradients $\nabla_\theta = \nabla_\theta(x, y) = \partial \ell(f_\theta(x), y) / \partial \theta$, or their average over a mini-batch of examples sampled from $D_{\text{train}}$. Stochastic gradient descent (SGD) does a 1st order update: $\theta \leftarrow \theta - \eta \nabla_\theta$, where $\eta$ is a positive learning rate. 2nd order methods first multiply $\nabla_\theta$ by a preconditioning matrix $G^{-1}$, yielding the update $\theta \leftarrow \theta - \eta G^{-1} \nabla_\theta$. Preconditioning matrices for Natural Gradient (Amari, 1998) / Generalized Gauss-Newton (Schraudolph, 2001) / TONGA (Le Roux et al., 2008) can all be expressed as either the (centered) covariance or the (un-centered) second moment of $\nabla_\theta$, computed over slightly different distributions of $(x, y)$. Thus natural gradient uses the Fisher Information Matrix, which for a probabilistic classifier can be expressed as $G = \mathbb{E}_{x \in D_{\text{train}},\, y \sim p_\theta(y|x)}[\nabla_\theta \nabla_\theta^\top]$, where the expectation is taken over targets sampled from the model $p_\theta$, whereas the "empirical Fisher" approximation or generalized Gauss-Newton uses $G = \mathbb{E}_{(x,y) \in D_{\text{train}}}[\nabla_\theta \nabla_\theta^\top]$. Our discussion and development apply regardless of the precise distribution over $(x, y)$ used to estimate $G$, so we will from here on use $\mathbb{E}$ without a subscript.
Matrix $G$ has a gigantic size $n_\theta \times n_\theta$, which makes it too big to compute and invert. In order to get a practical algorithm, we must find approximations of $G$ that keep some of the relevant 2nd order information while removing the unnecessary and computationally costly parts. A first simplification, adopted by nearly all prior approaches, consists in treating each layer of the neural network separately, ignoring cross-layer terms. This amounts to a first block-diagonal approximation of $G$: each block $G^{(l)}$ caters for the parameters of a single layer $l$. Even so, $G^{(l)}$ can typically still be extremely large.
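To make the difference between the Fisher Information Matrix and the "empirical Fisher" concrete, here is a minimal NumPy sketch on a softmax-regression model. The model, the sizes and all variable names are assumptions chosen for illustration only; they do not come from the experiments in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 3, 4                      # examples, input size, classes (illustrative)

x = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)           # dataset targets
W = 0.5 * rng.normal(size=(d, k))        # softmax-regression parameters

logits = x @ W
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)        # p[i, c] = p_theta(y = c | x_i)

def per_example_grads(labels):
    """Gradient of -log p_theta(label | x) w.r.t. W, flattened, one row per example."""
    err = p.copy()
    err[np.arange(n), labels] -= 1.0     # softmax output minus one-hot target
    return np.einsum("ni,nc->nic", x, err).reshape(n, d * k)

# Empirical Fisher: second moment of the gradients at the dataset targets.
g_data = per_example_grads(y)
G_empirical = g_data.T @ g_data / n

# Fisher: targets follow the model, so average over classes weighted by p.
G_fisher = np.zeros((d * k, d * k))
mean_model_grad = np.zeros((n, d * k))
for c in range(k):
    g_c = per_example_grads(np.full(n, c))
    G_fisher += (g_c * p[:, [c]]).T @ g_c / n
    mean_model_grad += g_c * p[:, [c]]

print("relative gap:", np.linalg.norm(G_fisher - G_empirical) / np.linalg.norm(G_fisher))
```

Under the model distribution the expected gradient is zero for every input, so for the Fisher the centered covariance and the un-centered second moment coincide; this need not hold for the empirical Fisher.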
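A cheap but very crude approximation consists in using a diagonal $G^{(l)}$, i.e. taking into account the variance in each parameter dimension, but ignoring all covariance structure. A less stringent approximation was proposed by Heskes (2000) and later Martens & Grosse (2015). They propose to approximate $G^{(l)}$ as a Kronecker product $G^{(l)} \approx A \otimes B$, which involves two smaller matrices, making it much cheaper to store, compute and invert (Footnote 1: since $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$). Specifically, for a layer $l$ that receives input $h$ of size $d_{\text{in}}$ and computes linear pre-activations $a = W^\top h$ of size $d_{\text{out}}$ (biases omitted for simplicity) followed by some non-linear activation function, let the backpropagated gradient on $a$ be $\delta = \partial \ell / \partial a$. The gradients on parameters $\theta^{(l)} = W$ will be $\nabla_W = \partial \ell / \partial W = \text{vec}(h \delta^\top)$. The Kronecker-factored approximation of the corresponding $G^{(l)} = \mathbb{E}[\nabla_W \nabla_W^\top]$ will use $A = \mathbb{E}[h h^\top]$ and $B = \mathbb{E}[\delta \delta^\top]$, i.e. matrices of size $d_{\text{in}} \times d_{\text{in}}$ and $d_{\text{out}} \times d_{\text{out}}$, whereas the full $G^{(l)}$ would be of size $d_{\text{in}} d_{\text{out}} \times d_{\text{in}} d_{\text{out}}$. Using this Kronecker approximation (known as KFAC) corresponds to approximating entries of $G^{(l)}$ as follows: $G^{(l)}_{ij,i'j'} = \mathbb{E}[\nabla_{W_{ij}} \nabla_{W_{i'j'}}] = \mathbb{E}[(h_i \delta_j)(h_{i'} \delta_{j'})] \approx \mathbb{E}[h_i h_{i'}]\, \mathbb{E}[\delta_j \delta_{j'}]$.
A similar principle can be applied to obtain a Kronecker-factored expression for the covariance of the gradients of the parameters of a convolutional layer (Grosse & Martens, 2016). To obtain matrices $A$ and $B$ one then needs to also sum over spatial locations and corresponding receptive fields, as illustrated in Figure 1.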
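The Kronecker factors of a fully-connected layer can be sketched in a few lines of NumPy. The synthetic inputs `h` and backpropagated gradients `delta` below are assumptions made for illustration (no network is trained), and gradients are flattened row-major, so that the flattened outer product equals `np.kron(h, delta)`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 2000, 4, 3              # illustrative sizes

# Synthetic layer inputs h and backpropagated gradients delta (delta depends on h).
h = rng.normal(size=(n, d_in))
delta = rng.normal(size=(n, d_out)) * (1.0 + 0.5 * np.abs(h[:, :1]))

# Per-example gradient of W (d_in x d_out) is the outer product h delta^T,
# flattened row-major so that vec(h delta^T) = kron(h, delta).
grads = np.einsum("ni,nj->nij", h, delta).reshape(n, d_in * d_out)

G = grads.T @ grads / n                  # full block: (d_in*d_out) x (d_in*d_out)
A = h.T @ h / n                          # d_in x d_in
B = delta.T @ delta / n                  # d_out x d_out
G_kfac = np.kron(A, B)                   # KFAC approximation of G

# The inverse of a Kronecker product is the Kronecker product of the inverses.
inv_direct = np.linalg.inv(G_kfac)
inv_factored = np.kron(np.linalg.inv(A), np.linalg.inv(B))
kron_inverse_ok = np.allclose(inv_direct, inv_factored)

rel_err = np.linalg.norm(G - G_kfac) / np.linalg.norm(G)
print(kron_inverse_ok, G.shape, A.shape, B.shape, rel_err)
```

Only the two small factors need to be stored and inverted; the dense `G` and `G_kfac` are formed here solely to measure the approximation error on a toy problem.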
3 Proposed method
3.1 Motivation: reflexion on diagonal rescaling in different coordinate bases
It is instructive to contrast what the “exact” natural gradient preconditioning, which uses the full Fisher Information Matrix, would yield with what we do when approximating it by a diagonal matrix only. Using the full matrix $G = \mathbb{E}[\nabla_\theta \nabla_\theta^\top]$ yields the natural gradient update: $\theta \leftarrow \theta - \eta G^{-1} \nabla_\theta$. When resorting to a diagonal approximation we instead use $G_{\text{diag}} = \text{diag}(\sigma_1^2, \ldots, \sigma_{n_\theta}^2)$, where $\sigma_i^2 = G_{i,i} = \mathbb{E}[(\nabla_\theta)_i^2]$. So the update $\theta \leftarrow \theta - \eta G_{\text{diag}}^{-1} \nabla_\theta$ amounts to preconditioning the gradient vector $\nabla_\theta$ by dividing each of its coordinates by an estimated second moment $\sigma_i^2$. This diagonal rescaling happens in the initial basis of parameters $\theta$. By contrast, a full natural gradient update can be seen to do a similar diagonal rescaling, not along the initial parameter basis axes, but along the axes of the eigenbasis of the matrix $G$. Let $G = U S U^\top$ be the eigendecomposition of $G$. The operations that yield the full natural gradient update $G^{-1} \nabla_\theta = U S^{-1} U^\top \nabla_\theta$ correspond to the sequence of a) multiplying the gradient vector $\nabla_\theta$ by $U^\top$, which corresponds to switching to the eigenbasis: $U^\top \nabla_\theta$ yields the coordinates of the gradient vector expressed in that basis; b) multiplying by the diagonal matrix $S^{-1}$, which rescales each coordinate $i$ (in that eigenbasis) by $S_{ii}^{-1}$; c) multiplying by $U$, which switches the rescaled vector back to the initial basis of parameters. It is easy to show that $S_{ii} = \mathbb{E}[(U^\top \nabla_\theta)_i^2]$ (proof is given in Appendix A.2). So, similarly to what we do when using a diagonal approximation, we are rescaling by the second moment of gradient vector components, but rather than doing this in the initial parameter basis, we do this in the eigenbasis of $G$. Note that the variance measured along the leading eigenvector can be much larger than the variance along the axes of the initial parameter basis, so the effects of the rescaling by using either the full $G$ or its diagonal approximation can be very different.
Now what happens when we use the less crude KFAC approximation instead? We approximate $G \approx A \otimes B$ (Footnote 2: this approximation is done separately for each block $G^{(l)}$; we dropped the superscript to simplify notation), yielding the update $\theta \leftarrow \theta - \eta (A \otimes B)^{-1} \nabla_\theta$. Let us similarly look at it through its eigendecomposition. The eigendecomposition of the Kronecker product $A \otimes B$ of two real symmetric positive semi-definite matrices can be expressed using their own eigendecompositions $A = U_A S_A U_A^\top$ and $B = U_B S_B U_B^\top$, yielding $A \otimes B = (U_A S_A U_A^\top) \otimes (U_B S_B U_B^\top) = (U_A \otimes U_B)(S_A \otimes S_B)(U_A \otimes U_B)^\top$. $U_A \otimes U_B$ gives the orthogonal eigenbasis of the Kronecker product; we call it the Kronecker-Factored Eigenbasis (KFE). $S_A \otimes S_B$ is the diagonal matrix containing the associated eigenvalues. Note that each such eigenvalue will be a product of an eigenvalue of $A$ stored in $S_A$ and an eigenvalue of $B$ stored in $S_B$. We can view the action of the resulting Kronecker-factored preconditioning in the same way as we viewed the preconditioning by the full matrix: it consists in a) expressing the gradient vector $\nabla_\theta$ in a different basis $U_A \otimes U_B$, which can be thought of as approximating the directions of $U$; b) doing a diagonal rescaling by $S_A \otimes S_B$ in that basis; c) switching back to the initial parameter space. Here however the rescaling factor $(S_A \otimes S_B)_{ii}$ is not guaranteed to match the second moment along the associated eigenvector, $\mathbb{E}[((U_A \otimes U_B)^\top \nabla_\theta)_i^2]$.
In summary (see Figure 2):

Full matrix preconditioning will scale by variances estimated along the eigenbasis of $G$.

Diagonal preconditioning will scale by variances properly estimated, but along the initial parameter basis, which can be very far from the eigenbasis of $G$.

KFAC preconditioning will scale the gradient along the KFE basis, which will likely be closer to the eigenbasis of $G$, but does not use properly estimated variances along these axes for this scaling (the scales being themselves constrained to be a Kronecker product $S_A \otimes S_B$).
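The view of full-matrix preconditioning as a diagonal rescaling in the eigenbasis can be checked numerically. The sketch below uses synthetic correlated vectors standing in for per-example gradients; the data and the names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 6                           # illustrative sizes

# Correlated synthetic "gradients": one per-example gradient vector per row.
mix = np.eye(d) + 0.3 * rng.normal(size=(d, d))
grads = rng.normal(size=(n, d)) @ mix

G = grads.T @ grads / n                  # un-centered second moment
S, U = np.linalg.eigh(G)                 # G = U diag(S) U^T

# Second moments of the gradient coordinates expressed in the eigenbasis of G.
second_moments = np.mean((grads @ U) ** 2, axis=0)

g = grads.mean(axis=0)                   # a gradient to precondition
step_full = np.linalg.solve(G, g)        # G^{-1} g
step_3stage = U @ ((U.T @ g) / S)        # a) rotate, b) rescale, c) rotate back
step_diag = g / np.diag(G)               # diagonal rescaling in the parameter basis

print(np.allclose(second_moments, S), np.allclose(step_full, step_3stage))
print(np.linalg.norm(step_full - step_diag))
```

The eigenvalues of `G` coincide with the second moments measured in its eigenbasis, and the three-stage computation reproduces the full update, whereas the diagonal rescaling in the parameter basis generally gives a different step.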
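3.2 Eigenvalue-corrected Kronecker Factorization (EKFAC)
To correct for the potentially inexact rescaling of KFAC, and obtain a better but still computationally efficient approximation, instead of $G_{\text{KFAC}} = A \otimes B = (U_A \otimes U_B)(S_A \otimes S_B)(U_A \otimes U_B)^\top$ we propose to use an Eigenvalue-corrected Kronecker Factorization: $G_{\text{EKFAC}} = (U_A \otimes U_B)\, S^*\, (U_A \otimes U_B)^\top$, where $S^*$ is the diagonal matrix defined by $S^*_{ii} = s^*_i = \mathbb{E}[((U_A \otimes U_B)^\top \nabla_\theta)_i^2]$. Vector $s^*$ is the vector of second moments of the gradient vector coordinates expressed in the approximate basis $U_A \otimes U_B$ and can be efficiently estimated and stored.
In Appendix A.1 we prove that this $S^*$ is the optimal diagonal rescaling in that basis, in the sense that $S^* = \arg\min_S \|G - (U_A \otimes U_B) S (U_A \otimes U_B)^\top\|_F$ s.t. $S$ is diagonal: it minimizes the approximation error to $G$ as measured by the Frobenius norm (denoted $\|\cdot\|_F$), which KFAC's corresponding $S = S_A \otimes S_B$ cannot generally achieve. A corollary of this is that we will always have $\|G - G_{\text{EKFAC}}\|_F \leq \|G - G_{\text{KFAC}}\|_F$, i.e. EKFAC yields a better approximation of $G$ than KFAC (Theorem 3, proven in Appendix A.1). Figure 2 illustrates the different rescaling strategies, including EKFAC.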
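The Frobenius-norm comparison between KFAC and EKFAC can be sketched numerically. The synthetic `h` and `delta` below are assumptions (constructed so that the exact Kronecker structure does not hold), and the dense matrices are formed only because the toy problem is tiny.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out = 4000, 5, 4              # illustrative sizes

h = rng.normal(size=(n, d_in)) @ rng.normal(size=(d_in, d_in))
delta = rng.normal(size=(n, d_out)) * np.tanh(h[:, :1])   # delta depends on h
grads = np.einsum("ni,nj->nij", h, delta).reshape(n, -1)  # row-major vec(h delta^T)

G = grads.T @ grads / n
A, B = h.T @ h / n, delta.T @ delta / n

S_A, U_A = np.linalg.eigh(A)
S_B, U_B = np.linalg.eigh(B)
U_kfe = np.kron(U_A, U_B)                # Kronecker-factored eigenbasis

s_kfac = np.kron(S_A, S_B)               # KFAC scalings: products of eigenvalues
s_star = np.mean((grads @ U_kfe) ** 2, axis=0)   # EKFAC scalings: second moments in the KFE

def rebuild(scales):
    """Return U_kfe diag(scales) U_kfe^T."""
    return (U_kfe * scales) @ U_kfe.T

err_kfac = np.linalg.norm(G - rebuild(s_kfac))
err_ekfac = np.linalg.norm(G - rebuild(s_star))
err_diag = np.linalg.norm(G - np.diag(np.diag(G)))
print(err_ekfac, err_kfac, err_diag)
```

By construction `rebuild(s_kfac)` equals `np.kron(A, B)`, so the two errors compare the same basis with two different diagonal scalings.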
Potential benefits:
Since EKFAC is a better approximation of $G$ than KFAC (in the limited sense of the Frobenius norm of the residual), it could yield a better preconditioning of the gradient for optimizing neural networks (Footnote 3: although there is no guarantee. In particular, $G_{\text{EKFAC}}$ being a better approximation of $G$ does not guarantee that $G_{\text{EKFAC}}^{-1} \nabla_\theta$ will be closer to the natural gradient update direction $G^{-1} \nabla_\theta$). Another potential benefit is linked to computational efficiency: even if KFAC yielded a reasonably good approximation, it is costly to re-estimate and invert matrices $A$ and $B$, so this has to be amortized over many updates: re-estimation of the preconditioning is thus typically done at a much lower frequency than the parameter updates, and may lag behind, no longer accurately reflecting the local 2nd order information. Re-estimating the Kronecker-factored Eigenbasis for EKFAC is similarly costly and must be similarly amortized. But re-estimating the diagonal scaling $s^*$ in that basis is cheap, doable with every mini-batch, so we can hope to reactively track and leverage the changes in 2nd order information along these directions.
Algorithm:
Using the Eigenvalue-corrected Kronecker factorization (EKFAC) for neural network optimization involves: a) periodically (every $n$ mini-batches) computing the Kronecker-factored Eigenbasis by doing an eigendecomposition of the same $A$ and $B$ matrices as KFAC; b) estimating the scaling vector $s^*$ as second moments of gradient coordinates in that implied basis; c) preconditioning gradients accordingly prior to updating model parameters. Algorithm 1 provides high-level pseudocode for the case of fully-connected layers (Footnote 4: EKFAC for convolutional layers follows the same structure, but requires a more convoluted notation), and when using EKFAC to approximate the “empirical Fisher”. In this version, we re-estimate $s^*$ from scratch on each mini-batch. An alternative is to update a running average estimate of the variance (of either individual gradients or mini-batch averaged gradients), denoted by EKFAC-ra (for running average) in Section 4.
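Steps a) to c) can be sketched for a single fully-connected layer as follows. This is an illustrative NumPy sketch, not a transcription of Algorithm 1: the function names, the synthetic mini-batch and the way the damping term is added to the scalings are assumptions. It relies on the row-major identity $(U_A \otimes U_B)^\top \text{vec}(X) = \text{vec}(U_A^\top X U_B)$, so the Kronecker product is never formed.

```python
import numpy as np

def compute_kfe(h, delta):
    """Step a (amortized): eigenbases of A = E[h h^T] and B = E[delta delta^T]."""
    n = h.shape[0]
    _, U_A = np.linalg.eigh(h.T @ h / n)
    _, U_B = np.linalg.eigh(delta.T @ delta / n)
    return U_A, U_B

def ekfac_precondition(h, delta, U_A, U_B, damping=1e-3):
    """Steps b and c on one mini-batch: preconditioned gradient of W, shape (d_in, d_out)."""
    n = h.shape[0]
    # Rotating the per-example gradient h delta^T into the KFE gives
    # U_A^T (h delta^T) U_B = (U_A^T h)(U_B^T delta)^T.
    h_kfe, delta_kfe = h @ U_A, delta @ U_B
    # Step b: second moment of every KFE coordinate over the mini-batch.
    s_star = (h_kfe ** 2).T @ (delta_kfe ** 2) / n
    # Mini-batch averaged gradient expressed in the KFE.
    grad_kfe = h_kfe.T @ delta_kfe / n
    # Step c: rescale coordinate-wise, then rotate back to parameter space.
    return U_A @ (grad_kfe / (s_star + damping)) @ U_B.T

rng = np.random.default_rng(0)
h = rng.normal(size=(64, 5))             # synthetic layer inputs
delta = rng.normal(size=(64, 3))         # synthetic backpropagated gradients
U_A, U_B = compute_kfe(h, delta)         # recomputed only every few mini-batches
update = ekfac_precondition(h, delta, U_A, U_B)
print(update.shape)                      # (5, 3)
```

In a training loop, `compute_kfe` would be called only periodically while `ekfac_precondition` runs on every mini-batch, which is where the amortization argument above applies.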
Dual view by working in the KFE:
Instead of thinking of this new method as an improved factorization of $G$ that we use as a preconditioning matrix, we can adopt the alternate view of applying a diagonal method, but in a different basis where the diagonal approximation is more accurate (an assumption we empirically confirm in Figure 3). This can be seen by reinterpreting the update given by EKFAC as a 3-step process: (i) project the gradient in the KFE, (ii) apply natural gradient descent in this basis, then (iii) project back to the parameter space:
$G_{\text{EKFAC}}^{-1} \nabla_\theta = \underbrace{(U_A \otimes U_B)}_{\text{(iii)}}\; \underbrace{S^{*-1}}_{\text{(ii)}}\; \underbrace{(U_A \otimes U_B)^\top \nabla_\theta}_{\text{(i)}}$
Note that by writing $\tilde{\nabla}_\theta = (U_A \otimes U_B)^\top \nabla_\theta$ for the projected gradient in the KFE, the computation of the coefficients $s^*_i$ simplifies to $s^*_i = \mathbb{E}[(\tilde{\nabla}_\theta)_i^2]$. Figure 3 shows gradient correlation matrices in both the initial parameter basis and in the KFE. Gradient components appear far less correlated when expressed in the KFE, which justifies the usage of a diagonal method in that basis.
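The decorrelation effect of moving to the KFE can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not a reproduction of Figure 3: inputs are built with a strong shared component and the backpropagated gradients are drawn independently of them, so the Kronecker structure holds almost exactly and the effect is deliberately pronounced.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5000, 5, 3              # illustrative sizes

# Inputs share a strong common component, so their coordinates are correlated.
h = rng.normal(size=(n, d_in)) + 2.0 * rng.normal(size=(n, 1))
delta = rng.normal(size=(n, d_out))
grads = np.einsum("ni,nj->nij", h, delta).reshape(n, -1)

_, U_A = np.linalg.eigh(h.T @ h / n)
_, U_B = np.linalg.eigh(delta.T @ delta / n)
grads_kfe = grads @ np.kron(U_A, U_B)    # each row is the gradient projected in the KFE

def offdiag_fraction(g):
    """Share of the squared Frobenius norm of E[g g^T] carried by off-diagonal entries."""
    M = g.T @ g / g.shape[0]
    total = np.sum(M ** 2)
    return (total - np.sum(np.diag(M) ** 2)) / total

frac_param = offdiag_fraction(grads)
frac_kfe = offdiag_fraction(grads_kfe)
print(frac_param, frac_kfe)
```

On real networks the Kronecker structure is only approximate, so the reduction in off-diagonal mass would be expected to be less clean than in this toy setting.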
This viewpoint brings us close to network reparametrization approaches such as Fujimoto & Ohira (2018), whose proposal (already hinted at by Desjardins et al. (2015)) amounts to a reparametrization equivalent of KFAC. More precisely, while Desjardins et al. (2015) empirically explored a reparametrization that uses only the input covariance $A$ (and thus corresponds only to "half of" KFAC), Fujimoto & Ohira (2018) extend this to also use the backpropagated gradient covariance $B$, making it essentially equivalent to KFAC (with a few extra twists). Our approach differs in that moving to the KFE corresponds to a change of orthonormal basis, and in that we cheaply track and perform a more optimal full diagonal rescaling in that basis, rather than the constrained factored diagonal that these other approaches implicitly use.
4 Experiments
This section presents an empirical evaluation of our proposed Eigenvalue-corrected KFAC (EKFAC) algorithm in two variants: EKFAC estimates the scalings $s^*$ as second moments of intra-batch gradients (in KFE coordinates) as in Algorithm 1, whereas EKFAC-ra estimates $s^*$ as a running average of squared mini-batch gradients (in KFE coordinates). We compare them with KFAC and other baselines, primarily SGD with momentum, with and without batch-normalization (BN). For all our experiments KFAC and EKFAC approximate the “empirical Fisher”. This research focuses on improving optimization techniques, so except when specified otherwise, we performed model and hyperparameter selection based on the performance of the optimization objective, i.e. on training loss.
4.1 Deep autoencoder
We consider the task of minimizing the reconstruction error of an 8-layer autoencoder on the MNIST dataset, a standard task used to benchmark optimization algorithms (Hinton & Salakhutdinov, 2006; Martens & Grosse, 2015; Desjardins et al., 2015). The model consists of an encoder composed of 4 fully-connected sigmoid layers, with respectively 1000, 500, 250 and 30 hidden units per layer, and a symmetric decoder (with untied weights).
We compare our EKFAC, computing the second moment statistics through its mini-batch, and EKFAC-ra, its running average variant, with different baselines (KFAC, SGD, SGD with BN, ADAM and ADAM with BN). For each algorithm, the best hyperparameters were selected using a mix of grid and random search based on training error. The grid covers the learning rate, the damping, the mini-batch size, and the frequency of reparametrization (i.e. recomputing the inverse or eigendecomposition) for KFAC and EKFAC: either every 50 or 100 updates. In addition, we explored 20 values of learning rate and damping by random search around each grid point. We found that extra care must be taken when choosing the values of the learning rate and damping parameter in order to get good performance, as often observed when working with algorithms that leverage curvature information (see Figure 8 (d)). The learning rate and the damping parameters are kept constant during training.
Figure 8: (a) Training loss vs epochs. Both EKFAC and EKFAC-ra show an optimization benefit compared to amortized-KFAC and the other baselines. (b) Training loss vs wall-clock time. Optimization benefits transfer to faster training for EKFAC-ra. (c) Validation loss. KFAC, EKFAC and BN achieve the same validation performance. (d) Sensitivity to hyperparameter values. Color corresponds to the final loss reached after a fixed number of epochs, for a fixed batch size.
Figure 8 (a) reports the training loss throughout training and shows that EKFAC and EKFAC-ra both minimize the training loss faster per epoch than KFAC and the other baselines. In addition, Figure 8 (b) shows that an efficient estimation of the diagonal scaling vector $s^*$, as done by EKFAC, allows faster training in wall-clock time. The use of a running average in EKFAC-ra leads to faster training than KFAC, while EKFAC is on par with the latter. Finally, EKFAC and EKFAC-ra achieve better optimization on this task while maintaining their generalization performance (Figure 8 (c)).
Next we investigate how the frequency of the reparametrizations affects the optimization. In Figure 12, we compare KFAC/EKFAC with different reparametrization frequencies to a strong KFAC baseline where we re-estimate and compute the inverse at each update. This baseline outperforms the amortized version (as a function of the number of epochs), and is likely to leverage a better approximation of $G$ as it recomputes the approximated eigenbasis at each update. However, it comes at a high computational cost, as shown in Figure 12 (b). Amortizing the eigendecomposition strongly decreases the computational cost while slightly degrading the optimization performance. In addition, Figure 12 (a) shows that the amortized EKFAC preserves the optimization performance better than its KFAC counterpart. EKFAC re-estimates at each update the diagonal second moments $s^*$ in the KFE basis, which correspond to the eigenvalues of the EKFAC approximation of $G$. This could reduce its estimation error, as the approximation can better match the true curvature of the loss function. To verify this hypothesis, we investigate how the eigenspectrum of the true empirical Fisher $G$ changes compared to the eigenspectra of its approximations $G_{\text{KFAC}}$ and $G_{\text{EKFAC}}$. In Figure 12 (c), we track their eigenspectra and report the distance between them during training. We compute the KFE once at the beginning and then keep it fixed during training. We focus on a single layer of the autoencoder: it is small, which allows estimating the corresponding $G$ and computing its eigenspectrum at a reasonable cost. We observe that the spectrum of $G_{\text{KFAC}}$ quickly diverges from the spectrum of $G$, whereas the cheap frequent re-estimation of the diagonal scaling for $G_{\text{EKFAC}}$ allows its spectrum to stay much closer to that of $G$. This is true for both the running average and intra-batch versions of EKFAC.
4.2 CIFAR-10
In this section, we evaluate our proposed algorithm on the CIFAR-10 dataset using a VGG11 convolutional neural network (Simonyan & Zisserman, 2015) and a Resnet34 (He et al., 2016). To implement KFAC/EKFAC in a convolutional neural network, we rely on the SUA approximation (Grosse & Martens, 2016), which has been shown to be competitive in practice (Laurent et al., 2018). We highlight that we do not use BN in our models when they are trained using KFAC/EKFAC.
As in the previous experiments, a grid search is performed to select the hyperparameters. Around each grid point, learning rate and damping values are further explored through random search. We experiment with a constant learning rate in this section, but explore a learning rate schedule with KFAC/EKFAC in Appendix C.2. In figures reporting the model training loss per epoch or wall-clock time, we report the performance of the hyperparameters attaining the lowest training loss for each epoch. This per-epoch model selection shows which model reaches the lowest cost during training and also which model optimizes best given any “epoch budget”.
In Figure 16, we compare EKFAC/EKFAC-ra to KFAC and SGD-Momentum with or without BN when training a VGG11 network. We use a batch size of 500 for the KFAC-based approaches and 200 for the SGD baselines. Figure 16 (a) shows that EKFAC yields better optimization than the SGD baselines and KFAC in training loss per epoch when the computation of the KFE is amortized. Figure 16 (c) also shows that models trained with EKFAC maintain good generalization. EKFAC-ra shows some wall-clock time improvements over the baselines in that setting (Figure 16 (b)). However, we observe that using KFAC with a batch size of 200 can catch up with EKFAC in wall-clock time despite being outperformed in terms of optimization per iteration (see Figure C.7 in the Appendix). VGG11 is a relatively small network by modern standards, and KFAC (with the SUA approximation) remains computationally bearable for this model. We hypothesize that, using smaller batches, KFAC can be updated often enough per epoch to have a reasonable estimation error while not paying too high a computational cost.
In Figure 20, we report similar results on the Resnet34. We compare EKFAC with running averages to KFAC and SGD-Momentum (with and without BN). In order to train the Resnet34 without BN, we need to rely on a careful initialization scheme to ensure good signal propagation during the forward and backward passes (see Appendix B for details). EKFAC outperforms both KFAC (when amortized) and SGD-Momentum in terms of optimization per epoch and compute time. This gain is robust across different batch sizes, as shown in Figure C.10.
5 Conclusion and future work
In this work, we introduced the Eigenvalue-corrected Kronecker factorization (EKFAC), an approximate factorization of the (empirical) Fisher Information Matrix that is computationally manageable while still being accurate. We formally proved (in Appendix A) that EKFAC yields a more accurate estimate than its closest competitor KFAC, in the sense of the Frobenius norm. In addition, we showed that our algorithm makes it cheap to perform partial updates of our curvature estimate, maintaining an up-to-date estimate of its eigenvalues while keeping the estimate of its eigenbasis fixed. This partial updating proves competitive when applied to standard optimization tasks, both with respect to the number of iterations and wall-clock time.
Our approach amounts to normalizing the gradient by its 2nd moment component-wise in a Kronecker-factored Eigenbasis (KFE). But one could apply other component-wise (diagonal) adaptive algorithms such as Adagrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012) or Adam (Kingma & Ba, 2015) in the KFE, where the diagonal approximation is much more accurate. This is left for future work. We also intend to explore alternative strategies for obtaining the approximate eigenbasis and investigate how to increase the robustness of the algorithm with respect to the damping hyperparameter. We also want to explore novel regularization strategies, so that the advantage of efficient optimization algorithms can more reliably be translated to generalization error.
Acknowledgments
References
 Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.
 Ba et al. (2017) Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. In ICLR, 2017.
 Becker et al. (1988) Sue Becker, Yann Le Cun, et al. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann, 1988.
 Bottou et al. (2016) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint, 2016.
 Desjardins et al. (2015) Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In NIPS, 2015.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Fujimoto & Ohira (2018) Yuki Fujimoto and Toru Ohira. A neural network model with bidirectional whitening. In International Conference on Artificial Intelligence and Soft Computing, pp. 47–57. Springer, 2018.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In ICLR, 2017.
 Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Grosse & Martens (2016) Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, 2016.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Heskes (2000) Tom Heskes. On “natural” learning and pruning in multilayered perceptrons. Neural Computation, 12(4):881–901, 2000.
 Hinton & Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Laurent et al. (2018) César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of Fisher approximations beyond Kronecker factorization. ICLR Workshop, 2018.
 Le Roux et al. (2008) Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NIPS, 2008.
 Liu & Nocedal (1989) Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 1989.
 Martens & Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.
 Ollivier (2015) Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 2015.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
 Schraudolph (2001) Nicol N Schraudolph. Fast curvature matrix-vector products. In International Conference on Artificial Neural Networks. Springer, 2001.
 Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012.
 Zeiler (2012) Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Proofs
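A.1 Proof that EKFAC does optimal diagonal rescaling in the KFE
Lemma 1.
Let $G = \mathbb{E}[\nabla_\theta \nabla_\theta^\top]$ be a real positive semi-definite matrix, and let $Q$ be a given orthogonal matrix. Among diagonal matrices, the diagonal matrix $D$ with diagonal entries $D_{ii} = \mathbb{E}[(Q^\top \nabla_\theta)_i^2]$ minimizes the approximation error $e = \|G - Q D Q^\top\|_F$ (measured as Frobenius norm).
Proof.
Since the Frobenius norm remains unchanged through multiplication by an orthogonal matrix, we can write
$e^2 = \|G - Q D Q^\top\|_F^2 = \|Q^\top (G - Q D Q^\top) Q\|_F^2 = \|Q^\top G Q - D\|_F^2 = \sum_i \big((Q^\top G Q)_{ii} - D_{ii}\big)^2 + \sum_i \sum_{j \neq i} (Q^\top G Q)_{ij}^2$
Since $D$ is diagonal, it does not affect the off-diagonal terms (the second sum).
The squared diagonal terms all reach their minimum value $0$ by setting $D_{ii} = (Q^\top G Q)_{ii}$ for all $i$:
$D_{ii} = (Q^\top G Q)_{ii} = (Q^\top \mathbb{E}[\nabla_\theta \nabla_\theta^\top] Q)_{ii} = \mathbb{E}[(Q^\top \nabla_\theta)(Q^\top \nabla_\theta)^\top]_{ii} = \mathbb{E}[(Q^\top \nabla_\theta)_i^2]$
We have thus shown that the diagonal matrix $D$ with diagonal entries $D_{ii} = \mathbb{E}[(Q^\top \nabla_\theta)_i^2]$ minimizes $e^2$. Since the Frobenius norm $e$ is non-negative, this implies that $D$ also minimizes $e$. ∎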
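As a small worked illustration of Lemma 1, consider the following $2 \times 2$ case. The matrix below is an arbitrary choice made for this example and is not taken from the experiments.

```latex
\[
G = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\]
% Case 1: Q = I (parameter basis). Lemma 1 gives D = diag(2, 2), and only the
% off-diagonal terms of G remain in the error:
\[
Q = I: \quad D = \operatorname{diag}(2, 2), \qquad
e^2 = \| G - D \|_F^2 = 1^2 + 1^2 = 2 .
\]
% Any other diagonal choice in the same basis is worse, e.g. D' = diag(3, 1):
\[
\| G - \operatorname{diag}(3, 1) \|_F^2 = (2-3)^2 + (2-1)^2 + 1^2 + 1^2 = 4 .
\]
% Case 2: Q = U, the eigenbasis of G (eigenvalues 3 and 1):
\[
U = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
U^\top G U = \operatorname{diag}(3, 1), \qquad
D = \operatorname{diag}(3, 1), \qquad e^2 = 0 .
\]
```

The example shows both halves of the argument: within a fixed basis the diagonal of $Q^\top G Q$ is the best diagonal scaling, and the remaining error is entirely determined by how far that basis is from the eigenbasis of $G$.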
Theorem 2.
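Let $G = \mathbb{E}[\nabla_\theta \nabla_\theta^\top]$ be the matrix we want to approximate. Let $G_{\text{KFAC}} = A \otimes B$ be the approximation of $G$ obtained by KFAC. Let $A = U_A S_A U_A^\top$ and $B = U_B S_B U_B^\top$ be the eigendecompositions of $A$ and $B$. The diagonal rescaling that EKFAC does in the Kronecker-factored Eigenbasis (KFE) is optimal in the sense that it minimizes the Frobenius norm of the approximation error: among diagonal matrices $D$, the approximation error $e = \|G - (U_A \otimes U_B) D (U_A \otimes U_B)^\top\|_F$ is minimized by the matrix with diagonal entries $D_{ii} = s^*_i = \mathbb{E}[((U_A \otimes U_B)^\top \nabla_\theta)_i^2]$.
Proof.
This follows directly by setting $Q = U_A \otimes U_B$ in Lemma 1. Note that the Kronecker product of two orthogonal matrices yields an orthogonal matrix.
∎
Theorem 3.
Let $G_{\text{KFAC}}$ be the KFAC approximation of $G$ and $G_{\text{EKFAC}}$ the EKFAC approximation of $G$. We always have $\|G - G_{\text{EKFAC}}\|_F \leq \|G - G_{\text{KFAC}}\|_F$.
Proof.
This follows from Theorem 2 on the optimality of the EKFAC diagonal rescaling.
Since $D = S^*$, with the $s^*$ of EKFAC, minimizes $\|G - (U_A \otimes U_B) D (U_A \otimes U_B)^\top\|_F$ over diagonal matrices, and $S_A \otimes S_B$ is itself a diagonal matrix, it implies that:
$\|G - G_{\text{EKFAC}}\|_F = \|G - (U_A \otimes U_B) S^* (U_A \otimes U_B)^\top\|_F \leq \|G - (U_A \otimes U_B)(S_A \otimes S_B)(U_A \otimes U_B)^\top\|_F = \|G - G_{\text{KFAC}}\|_F$
∎
We have thus demonstrated that EKFAC yields a better approximation (more precisely: at least as good in Frobenius norm error) of $G$ than KFAC.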
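A.2 Proof that $S_{ii} = \mathbb{E}[(U^\top \nabla_\theta)_i^2]$
Theorem 4.
Let $G = \mathbb{E}[\nabla_\theta \nabla_\theta^\top]$ and $G = U S U^\top$ its eigendecomposition.
Then $S_{ii} = \mathbb{E}[(U^\top \nabla_\theta)_i^2]$.
Proof.
Starting from the eigendecomposition $G = U S U^\top$ and the fact that $U$ is orthogonal, so that $U^\top U = I$, we can write
$S = U^\top U S U^\top U = U^\top G U = U^\top \mathbb{E}[\nabla_\theta \nabla_\theta^\top] U = \mathbb{E}[(U^\top \nabla_\theta)(U^\top \nabla_\theta)^\top]$
so that
$S_{ii} = \mathbb{E}[(U^\top \nabla_\theta)(U^\top \nabla_\theta)^\top]_{ii} = \mathbb{E}[(U^\top \nabla_\theta)_i^2]$
where we obtained the last equality by observing that $U^\top \nabla_\theta$ is a vector and that the diagonal entries of the matrix $a a^\top$ for any vector $a$ are given by $a^2$, where the square operation is element-wise. ∎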
Appendix B Residual network initialization
To train residual networks without using BN, one needs to initialize them carefully, so we used the following procedure, which scales each layer according to its fan-in:

Each layer not directly preceded by an activation function (for example the convolution in a skip connection) is initialized with the scheme derived from the He initialization (He et al., 2015) by using the identity as activation function.

Inspired by Goyal et al. (2017), we divide the scale of the last convolution in each residual block by a factor of 10. This not only helps preserve the variance through the network but also eases the optimization at the beginning of training.
Appendix C Additional empirical results
C.1 Impact of batch size
In this section, we evaluate the impact of the batch size on the optimization performance of KFAC and EKFAC. In Figure C.4, we report the training loss per epoch for different batch sizes for VGG11. We observe that the optimization gain of EKFAC with respect to KFAC diminishes as the batch size gets smaller.
In Figure C.7, we look at the training loss per iteration and the training loss per computation time for different batch sizes, again on VGG11. EKFAC shows optimization benefits over KFAC as we increase the batch size (thus reducing the number of inverses/eigendecompositions per epoch). This gain does not translate into faster training in terms of computation time in that setting. VGG11 is a relatively small network by modern standards, and the SUA approximation remains computationally bearable for this model. We hypothesize that, using smaller batches, KFAC can be updated often enough per epoch to have a reasonable estimation error while not paying too high a computational price.
In Figure C.10, we perform a similar experiment on the Resnet34. In this setting, we observe that the optimization gain of EKFAC with respect to KFAC remains consistent across batch sizes.
C.2 Learning rate schedule
In this section we investigate the impact of a learning rate schedule on the optimization of EKFAC. We use a setting similar to the CIFAR-10 experiment, except that we decay the learning rate by a factor of 2 every 20 epochs. Figures C.13 and C.16 show that EKFAC still exhibits some optimization benefit, relative to the baseline, when combined with a learning rate schedule.
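The step schedule described above can be written as a one-line function. The sketch below is illustrative: the function name and the base learning rate of 0.1 are assumptions, and only the decay factor of 2 and the period of 20 epochs come from the text.

```python
def step_decay(base_lr, epoch, factor=2.0, every=20):
    """Learning rate at a given epoch when dividing by `factor` every `every` epochs."""
    return base_lr / factor ** (epoch // every)

# Hypothetical base learning rate, chosen only to show the shape of the schedule.
for epoch in (0, 19, 20, 45):
    print(epoch, step_decay(0.1, epoch))
```

The rate stays constant within each block of 20 epochs and is halved at epochs 20, 40, 60, and so on.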