TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion

06/07/2021 · Saeed Soori et al. · University of Toronto, Rutgers University

This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical applicability by reducing the Fisher matrix inversion cost with approximation. However, the approximations do not reduce the overall time significantly and lead to less accurate parameter updates and loss of curvature information. TENGraD improves the time efficiency of NGD by computing Fisher-block inverses with a computationally efficient covariance factorization and reuse method. It computes the inverse of each block exactly using the Woodbury matrix identity, preserving curvature information while admitting fast (linear) convergence rates. Our experiments on image classification tasks for state-of-the-art deep neural architectures on CIFAR-10, CIFAR-100, and Fashion-MNIST show that TENGraD significantly outperforms state-of-the-art NGD methods, and often SGD, in wall-clock time.




1 Introduction

Second-order methods, specifically Natural Gradient Descent (NGD) [2, 1, 6, 17], have gained traction in recent years as they accelerate the training of deep neural networks (DNNs) by capturing the geometry of the optimization landscape [21] with the Fisher Information Matrix (FIM) [23, 17, 16]. NGD's running time depends on both the convergence rate and the computation per iteration, which typically involves computing and inverting the FIM [13]. Exact NGD methods [30] demonstrate improved convergence compared to first-order techniques such as Stochastic Gradient Descent (SGD) [25], and approximate NGD variants attempt to improve the time per iteration, and hence the overall time, with approximation [13, 11, 12]. However, to our knowledge, none of the current approximate NGD approaches outperform or are comparable to tuned SGD in end-to-end wall-clock time when training DNNs [13, 11, 12].

Exact NGD methods such as [30] and [3] improve the convergence of DNN training compared to first-order methods; however, they are typically expensive and do not scale due to their high computational complexity per iteration. The work in [3] derives an analytic formula for linear networks using the generalized inverse of the curvature, which cannot be used for training state-of-the-art models due to the lack of nonlinear activation functions. Zhang et al. [30] extend NGD to deep nonlinear networks with non-smooth activations and show that NGD converges to the global optimum with a linear rate. However, their method fails to scale to large or even moderate-size models, primarily because it relies heavily on backpropagating Jacobian matrices, which scales with the network's output dimension [7]. In [24], the authors use the Woodbury identity for the inversion of the Fisher matrix and propose a unified framework for subsampled Gauss-Newton and NGD methods. Their framework is limited to fully connected networks, relies on the empirical Fisher, and requires extra forward-backward passes to perform parameter updates, which slows down training [19, 7].

Approximate NGD approaches such as [13, 11, 12, 5, 14, 8] attempt to improve the overall execution time of NGD by approximating the FIM inverse; however, due to costly operations in the inverse approximation process, they do not noticeably reduce the overall NGD time. In these methods, the FIM is approximated with a block-diagonal matrix, where blocks correspond to layers. The dimension of each block scales with the input and output dimensions of the layer, and therefore it cannot be computed and inverted efficiently for wide layers. To alleviate this cost, some methods further approximate each block inverse to reduce its size and computation complexity. For example, KFAC [13] approximates each block inverse using the Kronecker product of two smaller matrices, i.e. Kronecker factors. However, these factors are still large for wide layers and hence their inversion is expensive. EKFAC [11] improves the approximation used in KFAC by rescaling the Kronecker factors with a diagonal matrix obtained via costly singular value decompositions. Other work such as KBFGS [12] further estimates the inverse of the Kronecker factors using low-rank BFGS-type updates. WoodFisher [27] estimates the empirical FIM block inverses using rank-one updates; however, this estimation does not contain enough useful curvature information to produce a good search direction [20]. Our proposed method, TENGraD, improves the time efficiency of NGD by reducing the FIM block inversion cost using a computationally efficient covariance factorization and reuse of intermediate values.

Recent work such as [30, 3] shows that exact NGD converges to the global optimum with a linear convergence rate. Work such as [24] analyzes convergence for the Levenberg-Marquardt variant of NGD without specifying the convergence rate. The work in [30] provides a convergence analysis of KFAC for a shallow two-layer fully connected network, where the convergence rate depends on the condition number of the input data. Goldfarb et al. [12] follow the framework for stochastic quasi-Newton methods and prove that KBFGS converges at a sublinear rate for networks with bounded activation functions. TENGraD improves the convergence properties of approximate NGD methods and has a linear convergence rate for DNNs with non-smooth activation functions.


Figure 1: Training loss vs. epoch and time for CIFAR-100 on DenseNet, and a fully connected layer. (a) Loss vs. epochs. (b) Loss vs. time. (c) Fully connected layer.

Motivation and contributions: Figures 1a and 1b show the performance of approximate NGD methods (including TENGraD) and SGD for an image classification benchmark on a DNN. As shown in Figure 1a, the NGD methods provide more accurate updates, resulting in better loss reduction than SGD. However, as Figure 1b shows, despite the efforts made by approximate NGD methods to improve the overall time, none is competitive with tuned SGD with momentum. TENGraD is, to our knowledge, the first time-efficient NGD-based implementation: it outperforms other approximate NGD methods, as well as SGD under certain conditions, in overall time while benefiting from the convergence properties of second-order methods.

In the following, we demonstrate for the fully connected layer shown in Figure 1c why TENGraD is more time-efficient than state-of-the-art approximate NGD methods. Figure 2 shows KFAC, EKFAC, KBFGS, and TENGraD. All methods first use a block-diagonal approximation to create the Fisher blocks. KFAC, EKFAC, and KBFGS approximate each block with a Kronecker product of two Kronecker factors. Because the size of the factors is proportional to the input and output dimensions of a layer, d_i and d_o, their inversion is costly for wide layers and is cubic in the input and output dimensions, i.e. O(d_i^3 + d_o^3). EKFAC creates its factors with singular value decompositions whose complexity is also cubic in the input and output dimensions of a layer, and it computes an additional diagonal scaling factor by performing matrix multiplications with the SVD factors. The rank-2 BFGS updates in KBFGS lead to a computation complexity that is quadratic in the input and output size of a layer.

TENGraD computes the exact inverse of Fisher blocks using the Woodbury matrix identity, factorizing the inverse with a novel method called covariance factorization. The size of the covariance factors is equal to the mini-batch size m, which is typically in the order of hundreds, hence considerably smaller and easier to invert than the Kronecker factors in other NGD methods. The other matrices involved in the Fisher-block inverse in TENGraD, i.e. the inputs and pre-activation derivatives, are reused. This accelerates the inverse computation, as it removes extra passes over the network by storing intermediate values computed during the forward-backward pass. As a result, TENGraD becomes an actual time-efficient approximate NGD method and improves the overall time. Also, to the best of our knowledge, we are the first to prove a linear convergence rate for approximate NGD methods on deep neural networks with ReLU activation functions. Our contributions are:

  • A time-efficient NGD method called TENGraD that computes the exact inverse of Fisher blocks efficiently by using a covariance factorization and reuse technique, and its extension to convolutional layers using tensor algebra.

  • Empirical demonstration that TENGraD outperforms state-of-the-art NGD methods, and often SGD, in overall time on well-known deep learning benchmarks.

  • Proof that TENGraD converges to the global optimum with a linear rate if exact gradients are used, with an iteration complexity of O(log(1/ε)), which is comparable to exact NGD methods.

Figure 2: The TENGraD method vs Kronecker-factored approaches for a fully connected layer.

2 Background and Definitions

A deep neural network transforms its input to an output through a series of L layers, where θ is the vector of network parameters. During the forward pass, layer i performs a linear transformation on its input X_i of m samples, parameterized with the weight matrix W_i, and computes its outputs or pre-activations Z_i = W_i X_i, followed by a nonlinear activation function such as ReLU. The optimal parameters are obtained by minimizing the average loss over the training set during the backward pass:


where is the vector containing all of the network’s parameters concatenated together, i.e. , is the operator which vectorizes matrix by stacking columns together, and measures the accuracy of the prediction (e.g. cross entropy loss).

Gradient and Jacobian in DNN. Optimization methods that aim to minimize the average loss in Equation 1 work with gradient computations. The average gradient over a mini-batch of m samples is g = (1/m) Σ_{i=1}^m ∇_θ ℓ_i, which is obtained using standard backpropagation. In backpropagation, each layer receives the gradient w.r.t. its output, i.e. the pre-activation derivatives ∂ℓ/∂Z_i, and computes the gradient w.r.t. its parameters ∂ℓ/∂W_i. The Jacobian of the loss w.r.t. the parameters for a single-output network is J ∈ R^{n×p}, whose i-th row is the gradient of the loss on sample i w.r.t. the parameters. In the stochastic setting, for a layer the Jacobian of the loss w.r.t. the layer parameters is approximated using the mini-batch samples.

NGD update rule. Natural gradient descent scales the gradient with the inverse of the FIM as follows:

θ_{t+1} = θ_t − η (F + λI)^{-1} ∇L(θ_t),   (2)

where η is the learning rate, λ is the damping parameter, and F is the FIM of the model's conditional distribution, defined as F = E[∇_θ log p(y|x, θ) ∇_θ log p(y|x, θ)^T], where p(y|x, θ) is the model's density function. Since F is singular for over-parameterized models, a non-negative damping term λI is added to make it invertible. As shown in [21], the FIM coincides with the Gauss-Newton matrix if the conditional distribution is in the exponential family, which implies:

F = J^T J,   (3)

where J is the Jacobian of the loss w.r.t. the parameters. Hereafter, we write F = J^T J to simplify the notation.
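As a concrete illustration of the damped natural-gradient update, the following is a minimal NumPy sketch of Equation 2 on a toy least-squares model, where the Fisher matrix reduces to the Gauss-Newton matrix. All names here (`ngd_step`, `theta`, `lam`, `eta`) are illustrative and not from the paper's implementation.

```python
import numpy as np

# Toy least-squares model: loss L(theta) = 0.5 * mean((X theta - y)^2).
rng = np.random.default_rng(0)
n, p = 8, 3                      # samples, parameters
X = rng.standard_normal((n, p))  # inputs
y = rng.standard_normal(n)       # targets
theta = np.zeros(p)

def ngd_step(theta, eta=1.0, lam=1e-3):
    r = X @ theta - y            # residuals
    g = X.T @ r / n              # gradient of the average loss
    F = X.T @ X / n              # Fisher = Gauss-Newton matrix here
    # damped NGD update: theta - eta * (F + lam*I)^{-1} g
    return theta - eta * np.linalg.solve(F + lam * np.eye(p), g)

for _ in range(50):
    theta = ngd_step(theta)
```

Because (F + λI)^{-1} acts only as a preconditioner, the iteration converges to the unregularized least-squares solution; the damping merely keeps the solve well-posed.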

Notations. The following notations are used in the paper: diag(A_1, …, A_L) is the block-diagonal matrix with matrices A_i on its diagonal; X = [x_1, …, x_n] is the input data; λ_min(A) is the smallest eigenvalue of matrix A; A ∘ B is the Hadamard product of two matrices; A ⊗ B is the Kronecker product; ‖·‖ is the Euclidean norm of vectors/matrices; A ⊙ B is the column-wise Khatri-Rao product; and A_{:,i} denotes the i-th column of matrix A.

3 Time-Efficient NGD with Exact Fisher-block Inversion

The computation bottleneck in NGD is (F + λI)^{-1} and its gradient product. Naively computing the Fisher inverse is impractical due to its large size p × p, where p is the number of parameters and can be in the millions. Therefore, the Fisher matrix is typically approximated with a block-diagonal matrix where block i corresponds to layer i [13]. As a result, (F + λI)^{-1} becomes a block-diagonal matrix. TENGraD computes the exact inverse of each block (hereafter, all computations are performed on one block, thus the layer index is dropped from the notation). TENGraD uses the Woodbury identity for each Fisher-block:

w_{t+1} = w_t − η (J^T J + λI)^{-1} g = w_t − (η/λ) (g − J^T (J J^T + λI)^{-1} J g),   (4)

where w, g, and J are the parameters, gradient, and Jacobian associated with the Fisher-block. The main steps of the new update rule are: (A) computing and inverting the Gram Jacobian J J^T, which is costly because the Jacobian and a matrix-matrix multiplication have to be computed; (B) scaling the gradient with the Jacobian to transform it into input space; and (C) rescaling with J^T to transform back to the parameter space. The Gram matrix computation in (A) has high complexity, proportional to both the input and output dimensions of a layer. TENGraD mitigates this cost with a covariance factorization that factors the Gram Jacobian into a low-cost operation between two small matrices. Steps (B) and (C) also involve Jacobian-vector products that are orders of magnitude more computationally expensive than a gradient computation [7]. TENGraD mitigates this cost by reusing the inputs and activations computed during the forward-backward pass.
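Steps (A)-(C) can be sketched numerically. The NumPy fragment below (with illustrative dimensions) checks that inverting only the small m × m Gram matrix via the Woodbury identity yields the same update direction as the direct p × p solve it avoids.

```python
import numpy as np

# Woodbury-based Fisher-block inversion: work with the m x m Gram
# matrix J J^T instead of the p x p block J^T J + lam*I (m << p).
rng = np.random.default_rng(1)
m, p = 4, 200                     # batch size, block parameter count
J = rng.standard_normal((m, p))   # per-sample Jacobian rows
g = rng.standard_normal(p)        # gradient for this block
lam = 0.1

# (A) Gram Jacobian and its small m x m inverse
G_inv = np.linalg.inv(J @ J.T + lam * np.eye(m))
# (B) project the gradient to sample space
v = J @ g
# (C) map back to parameter space via the Woodbury identity
update = (g - J.T @ (G_inv @ v)) / lam

# Direct p x p solve for reference (what the Woodbury form avoids)
direct = np.linalg.solve(J.T @ J + lam * np.eye(p), g)
```

The two results agree to numerical precision, while the Woodbury path only ever inverts an m × m matrix.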

3.1 Covariance Factorization and Reusing of Intermediate Values in TENGraD

TENGraD factors the Gram matrix J J^T into two smaller m × m matrices called covariance factors, where m is the batch size. These factors can be obtained with low overhead as the batch size is typically small. For a single sample and for the fully connected layer in Figure 1c, with input a and pre-activations z, we show the low-rank property of the gradient and how it is leveraged to reduce the computation complexity of the Gram Jacobian. The gradient of the loss w.r.t. the parameters W for input sample i is computed from the input a_i and the pre-activation derivatives b_i as ∇_W ℓ_i = b_i a_i^T. The gradient is the outer product of two vectors and therefore a rank-one matrix. The (i, j) element of the Gram Jacobian is the inner product of the vectorized gradients of two samples i and j, i.e. ⟨vec(b_i a_i^T), vec(b_j a_j^T)⟩ = (a_i^T a_j)(b_i^T b_j). As a result, the (i, j)-th element of the Gram Jacobian is efficiently computed without forming the gradients. Extended to a mini-batch of m samples with input matrix A and pre-activation derivative matrix B, the per-sample gradients are written as the column-wise Khatri-Rao product of inputs and pre-activation derivatives, i.e. J^T = A^T ⊙ B^T. Therefore the compact form of the Gram Jacobian is:

J J^T = (A A^T) ∘ (B B^T),   (5)

From Equation 5, the Gram Jacobian is written as the Hadamard product of two smaller matrices, i.e. the covariance factors. The input covariance A A^T and the pre-activation covariance B B^T are efficiently computed during the forward and backward pass of the neural network using one matrix-matrix multiplication each. The Jacobian does not need to be explicitly formed to compute these factors, reducing the overall computation complexity. These factors are also smaller and have a lower storage cost than Kronecker factors, which have sizes d_i × d_i and d_o × d_o.
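The covariance factorization can be verified numerically. In this NumPy sketch (dimensions and names `A`, `B` are illustrative), the Hadamard product of the two covariances matches the Gram matrix built from explicit per-sample gradients.

```python
import numpy as np

# Gram Jacobian of a fully connected layer via covariance factorization:
# each per-sample gradient is the rank-one outer product b_i a_i^T, so
# the Gram matrix is (A A^T) Hadamard (B B^T).
rng = np.random.default_rng(2)
m, d_in, d_out = 5, 7, 3
A = rng.standard_normal((m, d_in))   # layer inputs, one row per sample
B = rng.standard_normal((m, d_out))  # pre-activation derivatives

# Covariance factorization: an m x m matrix, no gradients materialized
G_fast = (A @ A.T) * (B @ B.T)

# Reference: build per-sample gradients explicitly, take inner products
J = np.stack([np.outer(B[i], A[i]).ravel() for i in range(m)])
G_ref = J @ J.T
```

The fast path never forms the m × (d_in · d_out) Jacobian, which is the source of the complexity reduction.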

The cost of storing the Jacobian for the NGD update in steps (B) and (C) of Equation 4 is considerable because its storage cost is m times the number of parameters. With a reuse method, TENGraD instead recomputes the Jacobian when needed, with a negligible computation cost, by storing small data structures. TENGraD breaks down the computation of the Jacobian-vector products in steps (B) and (C) into two operations involving the input and pre-activation derivative matrices A and B. In step (B) of Equation 4, the vectorized form of the gradient for the layer parameters, g = vec(Ḡ), is propagated. The objective is to compute Jg without explicitly forming the Jacobian. With J^T = A^T ⊙ B^T and using the properties of the column-wise Khatri-Rao product:

Jg = ((A Ḡ^T) ∘ B) 1,   (6)

where 1 is a vector of all ones with appropriate dimension. The Jacobian-vector product can thus be computed with two efficient matrix operations without forming the Jacobian: first A Ḡ^T is computed, followed by a Hadamard product with the pre-activation derivative matrix B and a column-wise summation. In step (C) of Equation 4, the objective is to compute J^T v, with v obtained from steps (A) and (B). Using the column-wise Khatri-Rao structure of J:

J^T v = vec(B^T diag(v) A),   (7)

Similarly, J^T v is computed using the two-step process in Equation 7. By storing A and B, TENGraD can efficiently compute the required operations in the NGD update.
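Both reuse identities can be checked against the explicit Jacobian. The NumPy sketch below (illustrative names `Grad`, `v`) computes the step (B) and step (C) products using only the stored matrices A and B.

```python
import numpy as np

# Jacobian-vector products from the stored input matrix A and
# pre-activation derivative matrix B, without forming the
# m x (d_in*d_out) Jacobian.
rng = np.random.default_rng(3)
m, d_in, d_out = 5, 7, 3
A = rng.standard_normal((m, d_in))
B = rng.standard_normal((m, d_out))
Grad = rng.standard_normal((d_out, d_in))  # layer gradient, matrix form
v = rng.standard_normal(m)

# Step (B): J @ vec(Grad) via Equation 6 (row-wise reduction)
Jg = ((A @ Grad.T) * B).sum(axis=1)
# Step (C): J^T @ v via Equation 7, returned in matrix form
JTv = (v[:, None] * B).T @ A

# Reference: the explicit Jacobian built from rank-one gradients
J = np.stack([np.outer(B[i], A[i]).ravel() for i in range(m)])
```

Storing A and B costs m(d_in + d_out) numbers, versus m · d_in · d_out for the Jacobian itself.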

Storage and Computation Complexity. Table 1 shows the computation and storage complexity of TENGraD vs. other NGD methods. Curvature shows the cost of computing and inverting the Kronecker factors for KFAC, EKFAC, and KBFGS, and the Gram Jacobian for TENGraD. The computation complexity of the covariance factors is O(m^2(d_i + d_o)), while the Kronecker factors in KFAC have a complexity of O(m(d_i^2 + d_o^2)). Inverting the factors in TENGraD is O(m^3), while that of KFAC is O(d_i^3 + d_o^3). Thus, when the batch size is smaller than the layer dimensions, TENGraD's computation complexity is better than the others'. TENGraD also reduces the extra storage complexity with reuse, see the Extra Storage column. Extra Pass refers to the cost of a second backpropagation, and Step is the computation cost of the parameter updates after computing the curvature.

Algorithm Curvature Extra Pass Step Extra Storage
Table 1: Computation and storage complexity of NGD methods.
Figure 3: Convolution shown as matrix-matrix product of unfolded input and weights.

3.2 Extension to Convolution Layers

We extend TENGraD to support convolutional layers; the general form is provided in the Appendix. Convolutional layers operate on high-dimensional tensors. We consider a convolutional layer with an input tensor X with c_in input channels, convolutional filters W with c_out feature maps, and a spatial support of k × k for each kernel filter. Let Y be the output of the convolutional layer, with spatial dimensions determined by the padding and stride parameters.

To extend the covariance factorization to a convolutional layer, we first derive a closed-form equation for the Jacobian and show that the convolution operation can be written as a matrix multiplication for a single sample. From Figure 3, X can be reshaped to a matrix C with T rows, where T is the number of output locations, and each row is an unfolded sub-tensor of X that has the same size as a group of filters. Each filter is also unfolded into a 1-D vector of size c_in k^2, so the weight tensor is reshaped to a matrix W̄ of size c_out × c_in k^2 (see Figure 3). Hence, the convolution can be written as Ȳ = C W̄^T, where the shape of Ȳ is T × c_out; Ȳ is reshaped to the output tensor by folding each column into a feature map. Applying a similar unfolding to the pre-activation derivative tensor results in a matrix D with shape T × c_out.
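The unfolding in Figure 3 is the standard im2col construction; a minimal NumPy sketch (single sample, stride 1, no padding, illustrative sizes) shows the convolution matching a direct convolution loop after unfolding.

```python
import numpy as np

# Convolution as a matrix product between the unfolded (im2col) input
# and the reshaped filter matrix, as in Figure 3.
rng = np.random.default_rng(4)
C_in, C_out, H, W, k = 2, 3, 5, 5, 3
x = rng.standard_normal((C_in, H, W))
w = rng.standard_normal((C_out, C_in, k, k))
Ho, Wo = H - k + 1, W - k + 1   # output spatial dims (stride 1, no pad)

# Unfold: one row per output location, one column per filter weight
cols = np.stack([
    x[:, i:i + k, j:j + k].ravel()
    for i in range(Ho) for j in range(Wo)
])                                  # shape (Ho*Wo, C_in*k*k)
W_mat = w.reshape(C_out, -1)        # shape (C_out, C_in*k*k)
out = (cols @ W_mat.T).T.reshape(C_out, Ho, Wo)

# Reference: direct convolution loop
ref = np.zeros((C_out, Ho, Wo))
for o in range(C_out):
    for i in range(Ho):
        for j in range(Wo):
            ref[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k])
```

Here `cols` plays the role of the unfolded matrix C and `W_mat` the reshaped weight matrix W̄ in the text.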

The compact form of the Jacobian for a batch of m samples is:

J = [vec(D_1^T C_1), …, vec(D_m^T C_m)]^T,   (8)

where C_i and D_i are the unfolded input and pre-activation derivatives for sample i. We derive a closed form for an arbitrary element of the Gram Jacobian matrix and then extend it to the whole batch. Using Equation 8, each element of the Gram Jacobian is written as:

[J J^T]_{ij} = ⟨C_i C_j^T, D_i D_j^T⟩,   (9)

Hence, each element of the Gram Jacobian is obtained via a reduction on the covariance matrices C_i C_j^T and D_i D_j^T, which are defined over the spatial support T.

With tensor notation, we expand the dimensions of the covariance factors. Since each element (i, j) of the Gram Jacobian is represented with two covariance matrices, the Gram Jacobian involves m^2 pairs of T × T covariance matrices. The covariance tensors Ω and Γ are defined as:

Ω_{ij} = C_i C_j^T,   Γ_{ij} = D_i D_j^T,   (10)

where Ω_{ij} denotes the T × T sub-tensor at (i, j). As a result, the Gram Jacobian is written as [J J^T]_{ij} = ⟨Ω_{ij}, Γ_{ij}⟩.

Corollary 1

For a standard convolutional layer with spatial support , overall filter support and batch size , TENGraD reduces the computational complexity of the Gram Jacobian computation from to .
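The spatial-covariance reduction behind this corollary can be checked numerically. In the NumPy sketch below, `C` and `D` stand in for the unfolded inputs and pre-activation derivatives (all dimensions illustrative), and the Frobenius inner product of the two covariances reproduces the Gram matrix of explicit per-sample gradients.

```python
import numpy as np

# Convolutional Gram Jacobian: each element is a Frobenius inner
# product of two T x T spatial covariance matrices.
rng = np.random.default_rng(5)
m, T, K, c_out = 4, 9, 18, 3   # batch, output locations, C_in*k*k, maps
C = rng.standard_normal((m, T, K))      # unfolded inputs
D = rng.standard_normal((m, T, c_out))  # unfolded derivatives

G = np.empty((m, m))
for i in range(m):
    for j in range(m):
        # reduction over the T x T covariances (Equation 9)
        G[i, j] = np.sum((C[i] @ C[j].T) * (D[i] @ D[j].T))

# Reference: explicit per-sample gradients w.r.t. the filter matrix
J = np.stack([(D[i].T @ C[i]).ravel() for i in range(m)])
```

The fast path never forms the per-sample filter gradients D_i^T C_i, only the small spatial covariances.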

For TENGraD's reuse in step (B), the gradient tensor of the convolution layer is reshaped to the c_out × c_in k^2 matrix Ḡ to compute J vec(Ḡ). Using Equation 8, this can be written as a two-step process without explicitly forming the Jacobian matrix: [J vec(Ḡ)]_i = ⟨D_i, C_i Ḡ^T⟩. Similarly, for step (C), the contribution of sample i to J^T v is computed as v_i vec(D_i^T C_i). For better storage, instead of the unfolded matrices C_i we store the input tensor X.

4 Linear Convergence in TENGraD

We provide convergence guarantees for TENGraD with exact gradients (i.e. the full-batch case). We focus on the single-output case, but the general case with multiple outputs is similar. As we show, TENGraD converges linearly at a rate independent of the condition number of the input data matrix X, unlike the convergence rate provided for KFAC in [30]. Consequently, for sufficiently ill-conditioned data, our convergence guarantees improve upon those currently available for KFAC.

Let us introduce the output vector u(θ) = [f(x_1; θ), …, f(x_n; θ)]^T. We consider the squared-error loss on a given dataset, i.e. the objective is to minimize L(θ) = (1/2)‖u(θ) − y‖^2. The update rule (4) of TENGraD with exact gradients becomes

w^{(l)}_{t+1} = w^{(l)}_t − η (J_l^T J_l + λI)^{-1} J_l^T (u(θ_t) − y),   (12)

where J_l^T J_l + λI is the Fisher-block matrix and J_l is the block Jacobian of u w.r.t. the parameters of layer l. It can be seen directly from the TENGraD update rule (4) that if the Gram matrix J_l J_l^T stays positive definite and bounded for each layer l and iteration t, then by the standard theory of preconditioned gradient descent methods [4], it converges globally for sufficiently small stepsizes. In the following, we introduce two assumptions: the first ensures that the Gram matrix is positive definite at initialization, and the second is a stability condition on the Jacobian requiring that it does not vary too rapidly. These assumptions allow us to control the convergence rate.
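As a hedged numeric illustration (not the paper's experiment), the NumPy sketch below applies the damped update on a linear model, where the Gram matrix J J^T is fixed and positive definite; the residual then contracts at a fixed linear rate, as the preconditioned-gradient argument predicts. All names and dimensions are illustrative.

```python
import numpy as np

# Linear(ized) model: u(theta) = J theta, squared-error loss.
rng = np.random.default_rng(6)
n, p = 6, 40                       # outputs, parameters (overparameterized)
J = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam, eta = 0.5, 0.5
theta = np.zeros(p)

errs = []
for _ in range(30):
    u = J @ theta                  # model output
    g = J.T @ (u - y)              # gradient of 0.5*||u - y||^2
    # damped NGD step in the Woodbury form of Equation 4
    theta = theta - eta / lam * (g - J.T @ np.linalg.solve(
        J @ J.T + lam * np.eye(n), J @ g))
    errs.append(np.linalg.norm(J @ theta - y))
```

On this model the residual map is (I − η G(G + λI)^{-1}) with G = J J^T, whose eigenvalues lie strictly inside (0, 1), so ‖u_t − y‖ decays geometrically; a smaller λ pushes the contraction factor closer to 1 − η, matching the discussion of the damping parameter below.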

Assumption 1

The data is normalized, ‖x_i‖ = 1 for all i, and the samples are drawn independently. Lastly, the Gram matrices of the individual layers are positive definite at initialization, i.e. λ_min(J_l(0) J_l(0)^T) > 0 for all l.

Du et al. [9] show that if the input data are i.i.d. samples, then Assumption 1 can often be satisfied.

Assumption 2

There exists that satisfies if .

Notice that Assumption 2 requires that the network behave like a linearized network when the constant in Assumption 2 is small [9]. As a result of Assumptions 1-2, the Fisher-block matrix remains close to its initialization over iterations, and therefore the Gram matrices of the layers stay positive definite during training, paralleling the setting of [30, 10, 22]. In practice, we could also keep the Gram matrices well-conditioned during training by tuning the damping parameter.

Next, we present our convergence result in Theorem 1. Our analysis follows the techniques of [30] and adapts them to TENGraD. Notice that TENGraD differs from NGD and KFAC in the matrix it uses to approximate the FIM; therefore the results of [30] do not directly apply to our setting. As a proof technique, we first use Assumptions 1 and 2 to derive a lower bound on the smallest eigenvalue of the Fisher-block matrix, showing that it is positive, and building on this bound we derive the rate below. The proof can be found in the Appendix.

Theorem 1

Suppose Assumptions 1, 2 hold. Consider the update rule (12) for a network with layers, damping parameter , and learning rate where . Then, we have

Theorem 1 states that TENGraD converges to the global optimum with a linear rate under Assumptions 1 and 2. Therefore, TENGraD requires only O(log(1/ε)) iterations to achieve ε accuracy. Moreover, our result shows that a smaller learning rate is needed for a deeper network to guarantee linear convergence. This behavior is different from that of exact NGD, because the rate of exact NGD provided in [30] does not depend on the network size. We also observe from Theorem 1 that a smaller damping parameter λ improves the upper bound on the learning rate, which suggests faster convergence for TENGraD. This makes the analysis of the smallest eigenvalue of the Gram matrix important for the performance of TENGraD, since the Gram matrix can potentially be ill-conditioned. The work [30] uses matrix concentration inequalities and harmonic analysis techniques to bound the minimum eigenvalue of the Gram matrix of an L-layer ReLU network, which does not provide a tight bound for the Gram matrix of each layer. The authors of [22] obtained tighter bounds in the NTK setting, and their result can be used to derive a lower bound on the smallest eigenvalue. Due to limited space, we present the details of the explicit bound in the Appendix.

We can show that, under some mild conditions on the distribution of the input data, if the weights of each layer are drawn from a Gaussian distribution, then the smallest eigenvalue of the Gram matrix stays bounded away from zero for all iterations with high probability, where the bound depends on the weight variances (see Appendix). This implies that we can choose a smaller damping parameter λ to get a faster rate according to Theorem 1, which in turn improves the performance of TENGraD. Otherwise, λ needs to be larger to guarantee the invertibility in (12).

5 Experimental Results

We test TENGraD, KFAC, EKFAC, KBFGS, and SGD with momentum on three image classification benchmarks, namely Fashion-MNIST [28], CIFAR-10 [18], and CIFAR-100 [18]. For Fashion-MNIST, a network with three convolution layers and one fully connected layer is used, hence called 3C1F. For the other benchmarks, the following DNN architectures are used: DenseNet [15], WideResNet [29], and MobileNetV2 [26]. In all experiments, 50K samples are chosen for training and 10K for test. Experiments are conducted on AWS P3.2xlarge machines with one Volta V100 GPU and 16GB RAM. Each experiment is repeated 5 times and the average is reported. In our theoretical analysis we assumed deterministic gradients, but in the experiments we use stochastic gradients estimated with mini-batches. The Appendix provides additional setup information.

(a) Fashion-MNIST
(b) CIFAR-10
(c) CIFAR-100
Figure 4: Comparing train accuracy between optimization methods for (a) Fashion-MNIST on the 3C1F network, (b) CIFAR-10 on the Wide-ResNet network, and (c) CIFAR-100 on the Wide-ResNet network.

From the training plots in Figure 4, TENGraD clearly outperforms all other approximate NGD methods in both time and train accuracy. Moreover, TENGraD performs better than SGD in both time and accuracy on the Fashion-MNIST dataset, as shown in Figure 4a, and it has a competitive timing compared to SGD on both CIFAR-10 and CIFAR-100, as shown in Figures 4b and 4c. The test accuracy reported in Figure 5 shows that TENGraD also generalizes well and achieves state-of-the-art accuracy on all benchmarks; in particular, it outperforms KBFGS with a margin of about 1.6%. Moreover, TENGraD clearly outperforms SGD in both time and test accuracy on Fashion-MNIST, as shown in Figure 5a, with competitive results on CIFAR-10 and CIFAR-100.

(a) Fashion-MNIST
(b) CIFAR-10
(c) CIFAR-100
Figure 5: Comparing test accuracy between optimization methods for (a) Fashion-MNIST on the 3C1F network, (b) CIFAR-10 on the Wide-ResNet network, and (c) CIFAR-100 on the Wide-ResNet network.

To further demonstrate the performance of TENGraD, we conduct extensive experiments on state-of-the-art DNNs and report the time and test accuracy for the 7 configurations in Table 2. Our results show that TENGraD outperforms KFAC, EKFAC, and KBFGS in overall time in 7 out of 7 (7/7) experiments. Moreover, TENGraD achieves a better accuracy, with a margin between 0.4%-1.41%, in 6/7 experiments compared to KBFGS which is the fastest method among approximate Kronecker-factored NGD methods. Also, compared to SGD, TENGraD achieves a better timing for 3/7 experiments and a better test accuracy in 5/7. In particular, TENGraD shows a 20% improvement over SGD in time for the Fashion-MNIST dataset and a better timing for CIFAR-10 on Wide-ResNet and DenseNet.

Models Dataset Time(s) [Accuracy%]
3C1F F-MNIST 52 [92.60] 43 [93.15] 188 [92.2] 275 [92.01] 113 [92.12]
1128 [94.01]  1004 [75.79]
1108 [94.10]  1124 [75.20]
2454 [94.50]  2419 [75.60]
2936 [94.10]  2766 [76.07]
1132 [92.70]  1151 [74.35]
2278 [93.52]  1909 [72.10]
2077 [93.40]  2002 [73.90]
4292 [93.31]  3542 [74.50]
4820 [93.11]  3788 [74.29]
2412 [93.41]  2292 [73.50]
1350 [92.70]  1370 [72.78]
1841 [92.71]  1406 [72.81]
3221 [92.63]  3152 [72.82]
4304 [92.83]  4687 [72.97]
1728 [91.30]  1695 [72.01]
Table 2: Top test accuracy and time for state-of-the-art models and different natural gradient methods.

To compare the quality of the Fisher matrix approximation, we compare the block-diagonal approximations in TENGraD and KFAC with exact NGD. As Figure 6 shows, TENGraD provides an approximation similar to exact NGD by preserving the structure of the Fisher matrix, whereas KFAC's approximation hardly captures the structure. Similar results are observed for other layers (see Appendix).

To demonstrate robustness, we examine the train accuracy under various hyperparameter (HP) settings and show that TENGraD is stable under a wide range of HPs (see Appendix).

(a) Exact NGD
(b) TENGraD
(c) KFAC
Figure 6: Structure of block inverses for the second convolution layer in 3C1F with Fashion-MNIST.

6 Conclusion

We propose a time-efficient NGD method, called TENGraD, that improves the overall time of second-order methods with computationally efficient factorization and reuse methods, and we prove linear convergence rates under certain assumptions. An extension to convolutional and general layers is provided. Results indicate that TENGraD outperforms or performs comparably to state-of-the-art NGD methods as well as SGD. Our work does not have a direct negative impact as it is mostly theoretical.


This work was supported in part by NSERC Discovery Grants (RGPIN-06516, DGECR-00303), the Canada Research Chairs program, and NSF grants CCF-1814888, NSF DMS-2053485, NSF DMS-1723085 and Office of Naval Research Award Number N00014-21-1-2244.


  • [1] S. Amari (1997) Neural learning in structured parameter spaces - natural riemannian gradient. In Advances in Neural Information Processing Systems, M. C. Mozer, M. Jordan, and T. Petsche (Eds.), Vol. 9, pp. . External Links: Link Cited by: §1.
  • [2] S. Amari (1998) Natural gradient works efficiently in learning. Neural computation 10 (2), pp. 251–276. Cited by: §1.
  • [3] A. Bernacchia, M. Lengyel, and G. Hennequin (2019) Exact natural gradient in deep linear networks and application to the nonlinear case. Cited by: §1, §1.
  • [4] D. P. Bertsekas (1997) Nonlinear programming. Journal of the Operational Research Society 48 (3), pp. 334–334. Cited by: §4.
  • [5] A. Botev, H. Ritter, and D. Barber (2017) Practical gauss-newton optimisation for deep learning. In International Conference on Machine Learning, pp. 557–565. Cited by: §1.
  • [6] T. Cai (2019) A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. Machine learning (1/28). Cited by: §1.
  • [7] F. Dangel, F. Kunstner, and P. Hennig (2019) BackPACK: packing more into backprop. In International Conference on Learning Representations, Cited by: §1, §3.
  • [8] G. Desjardins, K. Simonyan, R. Pascanu, and k. kavukcuoglu (2015) Natural neural networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. . External Links: Link Cited by: §1.
  • [9] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai (2019-09–15 Jun) Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1675–1685. External Links: Link Cited by: §4, §4.
  • [10] S. S. Du, X. Zhai, B. Poczos, and A. Singh (2018) Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054. Cited by: §4.
  • [11] T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent (2018) Fast approximate natural gradient descent in a kronecker factored eigenbasis. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §1, §1.
  • [12] D. Goldfarb, Y. Ren, and A. Bahamou (2020) Practical quasi-newton methods for training deep neural networks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 2386–2396. External Links: Link Cited by: §1, §1, §1.
  • [13] R. Grosse and J. Martens (2016) A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning, pp. 573–582. Cited by: §1, §1, §3.
  • [14] T. Heskes (2000) On “natural” learning and pruning in multilayered perceptrons. Neural Computation 12 (4), pp. 881–901. Cited by: §1.
  • [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §A.4.1, §5.
  • [16] R. Karakida, S. Akaho, and S. Amari (2019) Universal statistics of fisher information in deep neural networks: mean field approach. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1032–1041. Cited by: §1.
  • [17] R. Karakida and K. Osawa (2020) Understanding approximate fisher information for fast convergence of natural gradient descent in wide neural networks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 10891–10901. External Links: Link Cited by: §1.
  • [18] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report . Cited by: §5.
  • [19] F. Kunstner, P. Hennig, and L. Balles (2019) Limitations of the empirical fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §1.
  • [20] J. Martens and R. Grosse (2015) Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pp. 2408–2417. Cited by: §1.
  • [21] J. Martens (2020) New insights and perspectives on the natural gradient method. Journal of Machine Learning Research 21 (146), pp. 1–76. Cited by: §1, §2.
  • [22] Q. Nguyen, M. Mondelli, and G. Montufar (2020) Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks. arXiv preprint arXiv:2012.11654. Cited by: §A.3.2, §A.3.2, §A.3.2, §4, §4.
  • [23] J. Pennington and P. Worah (2018) The spectrum of the fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §1.
  • [24] Y. Ren and D. Goldfarb (2019) Efficient subsampled gauss-newton and natural gradient methods for training neural networks. arXiv preprint arXiv:1906.02353. Cited by: §1, §1.
  • [25] H. Robbins and S. Monro (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1.
  • [26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §A.4.1, §5.
  • [27] S. P. Singh and D. Alistarh (2020) WoodFisher: efficient second-order approximation for neural network compression. Advances in Neural Information Processing Systems 33. Cited by: §1.
  • [28] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.
  • [29] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In British Machine Vision Conference 2016, Cited by: §A.4.1, §5.
  • [30] G. Zhang, J. Martens, and R. B. Grosse (2019) Fast convergence of natural gradient descent for over-parameterized neural networks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §1, §1, §1, §4, §4, §4, §4.

Appendix A Appendix

A.1 Outline

The general form of the covariance factorization discussed in Section 3.2 is provided in Section A.2. The convergence proof of Theorem 1 discussed in Section 4 is provided in Section A.3.1. The bounds on the minimum eigenvalue discussed in Section 4 are provided in Section A.3.2. We provide the experimental setup in Section A.4.1. The plots comparing the approximation quality of the Fisher inverse blocks are provided in Section A.4.2. We provide the hyperparameter-sensitivity figures in Section A.4.3. For the error bars, refer to Section A.5.

A.2 General Form of Covariance Factorization

In this section, we extend the results of Section 3 and show that the covariance factorization and reuse method can be applied to an arbitrary layer. An arbitrary layer can operate on its input tensor in many ways depending on its underlying linear transformation, and these transformations typically act differently on each tensor dimension. For example, a convolution applies a specific filtering over the spatial dimensions. We therefore need to distinguish between dimensions according to the operations performed on them. In the following, we categorize them into free, independent, and common dimensions.

Tensor Dimensions. A common dimension is a dimension of the unfolded input that is shared between the input and output tensors, e.g. the spatial dimensions. An independent dimension is a dimension of the unfolded input that is shared between the input and the parameters, e.g. the input channel and filter dimensions. A free dimension is a dimension of the parameters that is shared between the output and the parameters, e.g. the output channel.
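To make the three categories concrete, here is a minimal NumPy sketch of unfolding (im2col) for a stride-1, unpadded 2D convolution; `unfold_conv2d` and all shapes are our own illustrative choices, not the paper's code or notation. After unfolding, the convolution becomes a plain matrix product in which the output channel is the free dimension, the flattened input-channel-by-filter axis is the independent dimension, and the spatial positions form the common dimension.

```python
import numpy as np

def unfold_conv2d(x, kh, kw):
    """Unfold (im2col) x of shape (N, C, H, W) for a kh x kw, stride-1,
    unpadded convolution. Returns (N, C*kh*kw, P), where C*kh*kw is the
    independent dimension and P (spatial positions) is the common dimension."""
    N, C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    U = np.zeros((N, C, kh, kw, oh, ow))
    for i in range(kh):
        for j in range(kw):
            U[:, :, i, j] = x[:, :, i:i + oh, j:j + ow]
    return U.reshape(N, C * kh * kw, oh * ow)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 6, 6))   # batch, input channels, H, W
w = rng.standard_normal((4, 3, 3, 3))   # output channels (free dim), C, kh, kw

# convolution as a matmul over the unfolded input
U = unfold_conv2d(x, 3, 3)                         # (2, 27, 16)
y = np.einsum('ok,nkp->nop', w.reshape(4, -1), U)  # (2, 4, 16)

# reference: direct sliding-window convolution
y_ref = np.zeros((2, 4, 4, 4))
for a in range(4):
    for b in range(4):
        y_ref[:, :, a, b] = np.einsum('ncij,ocij->no', x[:, :, a:a+3, b:b+3], w)
assert np.allclose(y, y_ref.reshape(2, 4, -1))
```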

Now, we show that the Jacobian matrix can be formed as a linear transformation between two tensors, namely the input and the pre-activation derivatives, which is later used to compute the Gram Jacobian .

Lemma 1

Consider a layer with parameters that applies a linear transformation to the input tensor, where and index the free dimension and the batch samples. Suppose that the output for a sample can be computed without requiring other samples, i.e. there is no dependency among samples in a batch. The pre-activation derivatives are given by the tensor . We refer to the set as the common dimensions, or , and to as the independent dimensions, or . Then the Jacobian tensor can be written as:


where tensor is obtained by unfolding the input tensor :


where depends on the underlying linear transformation.

Proof of Lemma 1. Assume an arbitrary linear transformation of the form:


where and define the underlying linear transformation, such as the input and filter spatial dimensions in a convolutional layer. We can now rewrite Equation 15 by unfolding the input tensor. To do so, for each index we unfold the input tensor as follows:


The above equation unfolds the input along all dimensions shared with the output tensor, i.e. , and along all dimensions in the parameter tensor. As a result, the output can be written as:


Using this result and the fact that the pre-activation derivative tensor has the same dimensions as the output tensor , we obtain the Jacobian tensor as stated in Equation 13.

Lemma 1 shows that the Jacobian tensor can be obtained by a linear transformation between the unfolded input tensor and the pre-activation derivative tensor . Using this result, we can extend the covariance factorization in TENGraD to an arbitrary layer.
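As a concrete illustration, the contraction in Lemma 1 can be sketched in NumPy. Given an unfolded input of shape (batch, independent, common) and pre-activation derivatives of shape (batch, free, common), the per-sample Jacobian with respect to the layer's parameters is a single contraction over the common dimension; the shapes and variable names below are our own illustrative choices, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, O, K, P = 4, 3, 5, 6              # batch, free, independent, common sizes
U = rng.standard_normal((N, K, P))   # unfolded input tensor
G = rng.standard_normal((N, O, P))   # pre-activation derivatives

# Lemma 1: the Jacobian tensor is a linear transformation between the
# unfolded input and the pre-activation derivatives -- contract the
# common dimension P, leaving (batch, free, independent).
J = np.einsum('nop,nkp->nok', G, U)  # shape (N, O, K)

# sanity check against an explicit per-sample matrix product
J_loop = np.stack([G[n] @ U[n].T for n in range(N)])
assert np.allclose(J, J_loop)
```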

Theorem 2

For a linear layer with input tensor and pre-activation derivatives defined in Lemma 1, the Gram matrix is obtained by:

where and are the covariance tensors of unfolded input and pre-activation derivatives defined as:

Proof of Theorem 2. Using Lemma 1 we can write the Gram Jacobian matrix by expanding the common dimensions as follows:

As a result, the Gram Jacobian for any layer can be obtained by a reduction over the Hadamard product of two small covariance tensors. In the following corollary, we state the conditions under which Theorem 2 improves performance in practice by reducing the computational complexity.
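The Hadamard-product reduction in Theorem 2 can be checked numerically on random tensors. The sketch below uses our own shapes and variable names rather than the paper's notation: it forms the Gram matrix once directly from the per-sample Jacobian and once from the two covariance tensors, and confirms the two routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)
N, O, K, P = 4, 3, 5, 6              # batch, free, independent, common sizes
U = rng.standard_normal((N, K, P))   # unfolded inputs
G = rng.standard_normal((N, O, P))   # pre-activation derivatives

# direct route: per-sample Jacobian, then its Gram matrix
J = np.einsum('nop,nkp->nok', G, U)           # (N, O, K)
gram_direct = np.einsum('nok,mok->nm', J, J)  # (N, N)

# factorized route (Theorem 2): covariance tensors contracted over the
# independent and free dimensions, Hadamard product, then a reduction
# over the common dimensions
A = np.einsum('nkp,mkq->nmpq', U, U)  # input covariance tensor
B = np.einsum('nop,moq->nmpq', G, G)  # derivative covariance tensor
gram_fact = (A * B).sum(axis=(2, 3))

assert np.allclose(gram_direct, gram_fact)
```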

Corollary 2

Suppose we denote the sizes of the common, independent, and free dimensions by , , and , respectively. Then Theorem 2 reduces the complexity of the Gram matrix computation from to .

Remark 1

Theorem 2 applies to a wide range of layers in deep neural networks, such as fully connected, convolution, and group convolution layers. It also applies to layers with a nonlinear transformation as long as the nonlinearity acts only on the input data, as in layer normalization.
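For a fully connected layer the common dimension has size one, so the covariance tensors collapse to ordinary N-by-N Gram matrices and the factorization reduces to a single Hadamard product. A minimal sketch with our own illustrative shapes (not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, O, K = 4, 3, 5
X = rng.standard_normal((N, K))   # layer inputs
G = rng.standard_normal((N, O))   # pre-activation derivatives

# per-sample gradient of the (O x K) weight matrix is the outer product
# g_n x_n^T, so the Gram matrix factorizes exactly
J = np.einsum('no,nk->nok', G, X)
gram_direct = np.einsum('nok,mok->nm', J, J)
gram_fact = (X @ X.T) * (G @ G.T)  # Hadamard product of two N x N Grams

assert np.allclose(gram_direct, gram_fact)
```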

A.3 Proofs of Section 4

A.3.1 Convergence Analysis of TENGraD

In this section, we provide the proof of Theorem 1. If Assumption 2 holds, we have . This result helps bound the minimum eigenvalue of the Gram Jacobian at each iteration, as stated in the following lemma.

Lemma 2

If , then we have .

Proof of Lemma 2. Using the inequality and the fact that , we have:

We can now prove the linear convergence rate in Theorem 1.

Proof of Theorem 1. We prove Theorem 1 by induction. Assume . We can see that with and where is the identity matrix of dimension