# A Kronecker-factored approximate Fisher matrix for convolution layers

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large models, and most approximations either require an expensive iterative procedure or make crude approximations to the curvature. We present Kronecker Factors for Convolution (KFC), a tractable approximation to the Fisher matrix for convolutional networks based on a structured probabilistic model for the distribution over backpropagated derivatives. Similarly to the recently proposed Kronecker-Factored Approximate Curvature (K-FAC), each block of the approximate Fisher matrix decomposes as the Kronecker product of small matrices, allowing for efficient inversion. KFC captures important curvature information while still yielding comparably efficient updates to stochastic gradient descent (SGD). We show that the updates are invariant to commonly used reparameterizations, such as centering of the activations. In our experiments, approximate natural gradient descent with KFC was able to train convolutional networks several times faster than carefully tuned SGD. Furthermore, it was able to train the networks in 10-20 times fewer iterations than SGD, suggesting its potential applicability in a distributed setting.

Comments

There are no comments yet.

## Authors

• 47 publications
• 13 publications
• ### Optimizing Neural Networks with Kronecker-factored Approximate Curvature

We propose an efficient method for approximating natural gradient descen...
03/19/2015 ∙ by James Martens, et al. ∙ 0

read it

• ### TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion

This work proposes a time-efficient Natural Gradient Descent method, cal...
06/07/2021 ∙ by Saeed Soori, et al. ∙ 22

read it

• ### Convolutional Neural Network Training with Distributed K-FAC

Training neural networks with many processors can reduce time-to-solutio...
07/01/2020 ∙ by J. Gregory Pauloski, et al. ∙ 0

read it

• ### An iterative K-FAC algorithm for Deep Learning

Kronecker-factored Approximate Curvature (K-FAC) method is a high effici...
01/01/2021 ∙ by Yingshi Chen, et al. ∙ 0

read it

• ### A Coordinate-Free Construction of Scalable Natural Gradient

Most neural networks are trained using first-order optimization methods,...
08/30/2018 ∙ by Kevin Luk, et al. ∙ 6

read it

• ### Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis

Optimization algorithms that leverage gradient covariance information, s...
06/11/2018 ∙ by Thomas George, et al. ∙ 0

read it

• ### Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

Distributed training with synchronous stochastic gradient descent (SGD) ...
07/14/2021 ∙ by Shaohuai Shi, et al. ∙ 4

read it

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Despite advances in optimization, most neural networks are still trained using variants of stochastic gradient descent (SGD) with momentum. It has been suggested that natural gradient descent (Amari, 1998) could greatly speed up optimization because it accounts for the geometry of the optimization landscape and has desirable invariance properties. (See Martens (2014) for a review.) Unfortunately, computing the exact natural gradient is intractable for large networks, as it requires solving a large linear system involving the Fisher matrix, whose dimension is the number of parameters (potentially tens of millions for modern architectures). Approximations to the natural gradient typically either impose very restrictive structure on the Fisher matrix (e.g. LeCun et al., 1998; Le Roux et al., 2008) or require expensive iterative procedures to compute each update, analogously to approximate Newton methods (e.g. Martens, 2010). An ongoing challenge has been to develop a curvature matrix approximation which reflects enough structure to yield high-quality updates, while introducing minimal computational overhead beyond the standard gradient computations.

Much progress in machine learning in the past several decades has been driven by the development of structured probabilistic models whose independence structure allows for efficient computations, yet which still capture important dependencies between the variables of interest. In our case, since the Fisher matrix is the covariance of the backpropagated log-likelihood derivatives, we are interested in modeling the distribution over these derivatives. The model must support efficient computation of the inverse covariance, as this is what’s required to compute the natural gradient. Recently, the Factorized Natural Gradient (FANG)

(Grosse & Salakhutdinov, 2015) and Kronecker-Factored Approximate Curvature (K-FAC) (Martens & Grosse, 2015)

methods exploited probabilistic models of the derivatives to efficiently compute approximate natural gradient updates. In its simplest version, K-FAC approximates each layer-wise block of the Fisher matrix as the Kronecker product of two much smaller matrices. These (very large) blocks can then be can be tractably inverted by inverting each of the two factors. K-FAC was shown to greatly speed up the training of deep autoencoders. However, its underlying probabilistic model assumed fully connected networks with no weight sharing, rendering the method inapplicable to two architectures which have recently revolutionized many applications of machine learning — convolutional networks

(LeCun et al., 1989; Krizhevsky et al., 2012)(Hochreiter & Schmidhuber, 1997; Sutskever et al., 2014).

We introduce Kronecker Factors for Convolution (KFC), an approximation to the Fisher matrix for convolutional networks. Most modern convolutional networks have trainable parameters only in convolutional and fully connected layers. Standard K-FAC can be applied to the latter; our contribution is a factorization of the Fisher blocks corresponding to convolution layers. KFC is based on a structured probabilistic model of the backpropagated derivatives where the activations are modeled as independent of the derivatives, the activations and derivatives are spatially homogeneous, and the derivatives are spatially uncorrelated. Under these approximations, we show that the Fisher blocks for convolution layers decompose as a Kronecker product of smaller matrices (analogously to K-FAC), yielding tractable updates.

KFC yields a tractable approximation to the Fisher matrix of a conv net. It can be used directly to compute approximate natural gradient descent updates, as we do in our experiments. One could further combine it with the adaptive step size, momentum, and damping methods from the full K-FAC algorithm (Martens & Grosse, 2015). It could also potentially be used as a pre-conditioner for iterative second-order methods (Martens, 2010; Vinyals & Povey, 2012; Sohl-Dickstein et al., 2014). We show that the approximate natural gradient updates are invariant to widely used reparameterizations of a network, such as whitening or centering of the activations.

We have evaluated our method on training conv nets on object recognition benchmarks. In our experiments, KFC was able to optimize conv nets several times faster than carefully tuned SGD with momentum, in terms of both training and test error. Furthermore, it was able to train the networks in 10-20 times fewer iterations, suggesting its usefulness in the context of highly distributed training algorithms.

## 2 Background

In this section, we outline the K-FAC method as previously formulated for standard fully-connected feed-forward networks without weight sharing (Martens & Grosse, 2015). Each layer of a fully connected network computes activations as:

 sℓ =Wℓ¯aℓ−1 (1) aℓ =ϕℓ(sℓ), (2)

where indexes the layer, denotes the inputs to the layer, denotes the activations, denotes the matrix of biases and weights, denotes the activations with a homogeneous dimension appended, and

denotes a nonlinear activation function (usually applied coordinate-wise). (Throughout this paper, we will use the index 0 for all homogeneous coordinates.) We will refer to the values

as pre-activations. By convention, corresponds to the inputs and corresponds to the prediction

made by the network. For convenience, we concatenate all of the parameters of the network into a vector

, where denotes the Kronecker vector operator which stacks the columns of a matrix into a vector. We denote the function computed by the network as .

Typically, a network is trained to minimize an objective given by as averaged over the training set, where is a loss function. The gradient of

, which is required by most optimization methods, is estimated stochastically using mini-batches of training examples. (We will often drop the explicit

subscript when the meaning is unambiguous.) For each case, is usually computed using automatic-differentiation aka backpropagation (Rumelhart et al., 1986; LeCun et al., 1998), which can be thought of as comprising two steps: first computing the pre-activation derivatives for each layer, and then computing .

For the remainder of this paper, we will assume the network’s prediction determines the value of the parameter of a distribution over , and the loss function is the corresponding negative log-likelihood .

### 2.1 Second-order optimization of neural networks

Second-order optimization methods work by computing a parameter update that minimize (or approximately minimize) a local quadratic approximation to the objective, given by , where is a matrix which quantifies the curvature of the cost function at . The exact solution to this minimization problem can be obtained by solving the linear system . The original and most well-known example is Newton’s method, where is chosen to be the Hessian matrix; this isn’t appropriate in the non-convex setting because of the well-known problem that it searches for critical points rather than local optima (e.g. Pascanu et al., 2014). Therefore, it is more common to use natural gradient (Amari, 1998) or updates based on the generalized Gauss-Newton matrix (Schraudolph, 2002), which are guaranteed to produce descent directions because the curvature matrix is positive semidefinite.

Natural gradient descent can be usefully interpreted as a second-order method (Martens, 2014) where is the Fisher information matrix , as given by

 F=Ex∼pdatay∼Ry|f(x,θ)[Dθ(Dθ)⊤], (3)

where denotes the training distribution, denotes the model’s predictive distribution, and is the log-likelihood gradient. For the remainder of this paper, all expectations are with respect to this distribution (which we term the model’s distribution), so we will leave off the subscripts. (In this paper, we will use the notation for log-likelihood derivatives; derivatives of other functions will be written out explicitly.) In the case where corresponds to an exponential family model with “natural” parameters given by , is equivalent to the generalized Gauss-Newton matrix (Martens, 2014), which is an approximation of the Hessian which has also seen extensive use in various neural-network optimization methods (e.g. Martens, 2010; Vinyals & Povey, 2012).

is an matrix, where is the number of parameters and can be in the tens of millions for modern deep architectures. Therefore, it is impractical to represent explicitly in memory, let alone solve the linear system exactly. There are two general strategies one can take to find a good search direction. First, one can impose a structure on enabling tractable inversion; for instance LeCun et al. (1998) approximates it as a diagonal matrix, TONGA (Le Roux et al., 2008) uses a more flexible low-rank-within-block-diagonal structure, and factorized natural gradient (Grosse & Salakhutdinov, 2015) imposes a directed Gaussian graphical model structure.

Another strategy is to approximately minimize the quadratic approximation to the objective using an iterative procedure such as conjugate gradient; this is the approach taken in Hessian-free optimization (Martens, 2010), a type of truncated Newton method (e.g. Nocedal & Wright, 2006). Conjugate gradient (CG) is defined in terms of matrix-vector products , which can be computed efficiently and exactly using the method outlined by Schraudolph (2002). While iterative approaches can produce high quality search directions, they can be very expensive in practice, as each update may require tens or even hundreds of CG iterations to reach an acceptable quality, and each of these iterations is comparable in cost to an SGD update.

We note that these two strategies are not mutually exclusive. In the context of iterative methods, simple (e.g. diagonal) curvature approximations can be used as preconditioners, where the iterative method is implicitly run in a coordinate system where the curvature is less extreme. It has been observed that a good choice of preconditioner can be crucial to obtaining good performance from iterative methods (Martens, 2010; Chapelle & Erhan, 2011; Vinyals & Povey, 2012). Therefore, improved tractable curvature approximations such as the one we develop could likely be used to improve iterative second-order methods.

### 2.2 Kronecker-factored approximate curvature

Kronecker-factored approximate curvature (K-FAC; Martens & Grosse, 2015) is a recently proposed optimization method for neural networks which can be seen as a hybrid of the two approximation strategies: it uses a tractable approximation to the Fisher matrix , but also uses an optimization strategy which behaves locally like conjugate gradient. This section gives a conceptual summary of the aspects of K-FAC relevant to the contributions of this paper; a precise description of the full algorithm is given in Appendix A.2.

The block-diagonal version of K-FAC (which is the simpler of the two versions, and is what we will present here) is based on two approximations to which together make it tractable to invert. First, weight derivatives in different layers are assumed to be uncorrelated, which corresponds to being block diagonal, with one block per layer:

 F≈⎛⎜ ⎜ ⎜⎝E[vec(D¯W1)vec(D¯W1)⊤]0⋱0E[vec(D¯WL)vec(D¯WL)⊤]⎞⎟ ⎟ ⎟⎠ (4)

This approximation by itself is insufficient, because each of the blocks may still be very large. (E.g., if a network has 1,000 units in each layer, each block would be of size .) For the second approximation, observe that

 E[D[¯Wℓ]ijD[¯Wℓ]i′j′]=E[D[sℓ]i[¯aℓ−1]jD[sℓ]i′[¯aℓ−1]j′]. (5)

If we approximate the activations and pre-activation derivatives as independent, this can be decomposed as . This can be written algebraically as a decomposition into a Kronecker product of two smaller matrices:

 E[vec(¯Wℓ)vec(¯Wℓ)⊤]≈Ψℓ−1⊗Γℓ≜^Fℓ, (6)

where and

denote the second moment matrices of the activations and pre-activation derivatives, respectively. Call the block diagonal approximate Fisher matrix, with blocks given by Eqn.

6, . The two factors are estimated online from the empirical moments of the model’s distribution using exponential moving averages.

To invert , we use the facts that (1) we can invert a block diagonal matrix by inverting each of the blocks, and (2) the Kronecker product satisfies the identity :

 ^F−1=⎛⎜ ⎜ ⎜⎝Ψ−10⊗Γ−110⋱0Ψ−1L−1⊗Γ−1L⎞⎟ ⎟ ⎟⎠ (7)

We do not represent explicitly, as each of the blocks is quite large. Instead, we keep track of each of the Kronecker factors.

The approximate natural gradient can then be computed as follows:

 (8)

We would often like to add a multiple of the identity matrix to

for two reasons. First, many networks are regularized with weight decay, which corresponds to a penalty of , for some parameter . Following the interpretation of as a quadratic approximation to the curvature, it would be appropriate to use to approximate the curvature of the regularized objective. The second reason is that the local quadratic approximation of implicitly used when computing the natural gradient may be inaccurate over the region of interest, owing to the approximation of by , to the approximation of the Hessian by , and finally to the error associated with approximating as locally quadratic in the first place. A common way to address this issue is to damp the updates by adding to the approximate curvature matrix, for some small value , before minimizing the local quadratic model. Therefore, we would ideally like to compute .

Unfortunately, adding breaks the Kronecker factorization structure. While it is possible to exactly solve the damped system (see Appendix A.2), it is often preferable to approximate in a way that maintains the factorizaton structure. Martens & Grosse (2015) pointed out that

 ^Fℓ+(λ+γ)I≈(Ψℓ−1+πℓ√λ+γI)⊗(Γℓ+1πℓ√λ+γI). (9)

We will denote this damped approximation as . Mathematically, can be any positive scalar, but Martens & Grosse (2015) suggest the formula

 πℓ=√∥Ψℓ−1⊗I∥∥I⊗Γℓ∥, (10)

where denotes some matrix norm, as this value minimizes the norm of the residual in Eqn. 9. In this work, we use the trace norm . The approximate natural gradient is then computed as:

 (11)

The algorithm as presented by Martens & Grosse (2015) has many additional elements which are orthogonal to the contributions of this paper. For concision, a full description of the algorithm is relegated to Appendix A.2.

### 2.3 Convolutional networks

Convolutional networks require somewhat crufty notation when the computations are written out in full. In our case, we are interested in computing correlations of derivatives, which compounds the notational difficulties. In this section, we summarize the notation we use. (Table 1 lists all convolutional network notation used in this paper.) In sections which focus on a single layer of the network, we drop the explicit layer indices.

A convolution layer takes as input a layer of activations , where indexes the input map and indexes the spatial location. (Here,

is the set of spatial locations, which is typically a 2-D grid. For simplicity, we assume convolution is performed with a stride of 1 and padding equal to

, so that the set of spatial locations is shared between the input and output feature maps.) This layer is parameterized by a set of weights and biases , where indexes the output map, indexes the input map, and indexes the spatial offset (from the center of the filter). If the filters are of size , then we would have . We denote the numbers of spatial locations and spatial offsets as and , respectively. The convolution layer computes a set of pre-activations as follows:

 si,t=∑δ∈Δwi,j,δaj,t+δ+bi, (12)

where denotes the bias parameter. The activations are defined to take the value 0 outside of

. The pre-activations are passed through a nonlinearity such as ReLU to compute the output layer activations, but we have no need to refer to this explicitly when analyzing a single layer. (For simplicity, we assume operations such as pooling and response normalization are implemented as separate layers.)

Pre-activation derivatives are computed during backpropagation. One then computes weight derivatives as:

 Dwi,j,δ=∑t∈Taj,t+δDsi,t. (13)

#### 2.3.1 Efficient implementation and vectorized notation

For modern large-scale vision applications, it’s necessary to implement conv nets efficiently for a GPU (or some other parallel architecture). We provide a very brief overview of the low-level efficiency issues which are relevant to K-FAC. We base our discussion on the Toronto Deep Learning ConvNet (TDLCN) package

(Srivastava, 2015), whose convolution kernels we use in our experiments. Like many modern implementations, this implementation follows the approach of Chellapilla et al. (2006), which reduces the convolution operations to large matrix-vector products in order to exploit memory locality and efficient parallel BLAS operators. We describe the implementation explicitly, as it is important that our proposed algorithm be efficient using the same memory layout (shuffling operations are extremely expensive). As a bonus, these vectorized operations provide a convenient high-level notation which we will use throughout the paper.

The ordering of arrays in memory is significant, as it determines which operations can be performed efficiently without requiring (very expensive) transpose operations. The activations are stored as a array , where is the mini-batch size, is the number of spatial locations, and is the number of feature maps.111The first index of the array is the least significant in memory. This can be interpreted as an matrix. (We must assign orderings to and , but this choice is arbitrary.) Similarly, the weights are stored as an array , which can be interpreted either as an matrix or a matrix without reshuffling elements in memory. We will almost always use the former interpretation, which we denote ; the matrix will be denoted .

The naive implementation of convolution, while highly parallel in principle, suffers from poor memory locality. Instead, efficient implementations typically use what we will term the expansion operator and denote . This operator extracts the patches surrounding each spatial location and flattens them into vectors. These vectors become the rows of a matrix. For instance, is a matrix, defined as

 ⟦~Aℓ−1⟧tM+m,j|Δ|+δ=[~Aℓ−1](t+δ)M+m,j=a(m)j,t+δ, (14)

for all entries such that . All other entries are defined to be 0. Here, indexes the data instance within the mini-batch.

In TDLCN, the forward pass is computed as

 ~Aℓ=ϕ(~Sℓ)=ϕ(⟦~Aℓ−1⟧W⊤ℓ+1b⊤ℓ), (15)

where is the nonlinearity, applied elementwise, is a vector of ones, and is the vector of biases. In backpropagation, the activation derivatives are computed as:

 D~Aℓ−1=⟦D~Sℓ⟧˘Wℓ. (16)

Finally, the gradient for the weights is computed as

 DWℓ=D~S⊤ℓ⟦~Aℓ−1⟧ (17)

The matrix products are computed using the cuBLAS function cublasSgemm. In practice, the expanded matrix may be too large to store in memory. In this case, a subset of the rows of are computed and processed at a time.

We will also use the matrix and the matrix to denote the activations and pre-activations for a single training case. and can be substituted for and in Eqns. 15-17.

For fully connected networks, it is often convenient to append a homogeneous coordinate to the activations so that the biases can be folded into the weights (see Section 2.2). For convolutional layers, there is no obvious way to add extra activations such that the convolution operation simulates the effect of biases. However, we can achieve an analogous effect by adding a homogeneous coordinate (i.e. a column of all 1’s) to the expanded activations. We will denote this

. Similarly, we can prepend the bias vector to the weights matrix:

. The homogeneous coordinate is not typically used in conv net implementations, but it will be convenient for us notationally. For instance, the forward pass can be written as:

 ~Aℓ=ϕ(⟦~Aℓ−1⟧H¯W⊤ℓ) (18)

Table 1 summarizes all of the conv net notation used in this paper.

## 3 Kronecker factorization for convolution layers

We begin by assuming a block-diagonal approximation to the Fisher matrix like that of K-FAC, where each block contains all the parameters relevant to one layer (see Section 2.2). (Recall that these blocks are typically too large to invert exactly, or even represent explicitly, which is why the further Kronecker approximation is required.) The Kronecker factorization from K-FAC applies only to fully connected layers. Convolutional networks introduce several kinds of layers not found in fully connected feed-forward networks: convolution, pooling, and response normalization. Since pooling and response normalization layers don’t have trainable weights, they are not included in the Fisher matrix. However, we must deal with convolution layers. In this section, we present our main contribution, an approximate Kronecker factorization for the blocks of corresponding to convolution layers. In the tradition of fast food puns (Ranzato & Hinton, 2010; Yang et al., 2014), we call our method Kronecker Factors for Convolution (KFC).

For this section, we focus on the Fisher block for a single layer, so we drop the layer indices. Recall that the Fisher matrix is the covariance of the log-likelihood gradient under the model’s distribution. (In this paper, all expectations are with respect to the model’s distribution unless otherwise specified.) By plugging in Eqn. 13, the entries corresponding to weight derivatives are given by:

 E[Dwi,j,δDwi′,j′,δ′] =E⎡⎣(∑t∈Taj,t+δDsi,t)⎛⎝∑t′∈Taj′,t′+δ′Dsi′,t′⎞⎠⎤⎦ (19)

To think about the computational complexity of computing the entries directly, consider the second convolution layer of AlexNet (Krizhevsky et al., 2012), which has 48 input feature maps, 128 output feature maps, spatial locations, and filters. Since there are weights and 128 biases, the full block would require 60.5 billion entries to represent explicitly, and inversion is clearly impractical.

Recall that K-FAC approximation for classical fully connected networks can be derived by approximating activations and pre-activation derivatives as being statistically independent (this is the IAD approximation below). Deriving an analogous Fisher approximation for convolution layers will require some additional approximations.

Here are the approximations we will make in deriving our Fisher approximation:

• Independent activations and derivatives (IAD). The activations are independent of the pre-activation derivatives, i.e. .

• Spatial homogeneity (SH). The first-order statistics of the activations are independent of spatial location. The second-order statistics of the activations and pre-activation derivatives at any two spatial locations and depend only on . This implies there are functions , and such that:

 E[aj,t] =M(j) (20) E[aj,taj′,t′] =Ω(j,j′,t′−t) (21) E[Dsi,tDsi′,t′] =Γ(i,i′,t′−t). (22)

Note that under the model’s distribution, so .

• Spatially uncorrelated derivatives (SUD). The pre-activation derivatives at any two distinct spatial locations are uncorrelated, i.e.  for .

We believe SH is fairly innocuous, as one is implicitly making a spatial homogeneity assumption when choosing to use convolution in the first place. SUD perhaps sounds like a more severe approximation, but in fact appeared to describe the model’s distribution quite well in the networks we investigated; this is analyzed empirially in Section 5.1.

We now show that combining the above three approximations yields a Kronecker factorization of the Fisher blocks. For simplicity of notation, assume the data are two-dimensional, so that the offsets can be parameterized with indices and , and denote the dimensions of the activations map as . The formulas can be generalized to data dimensions higher than 2 in the obvious way.

###### Theorem 1.

Combining approximations IAD, SH, and SUD yields the following factorization:

 E[Dwi,j,δDwi′,j′,δ′] =β(δ,δ′)Ω(j,j′,δ′−δ)Γ(i,i′,0), E[Dwi,j,δDbi′] =β(δ)M(j)Γ(i,i′,0) E[DbiDbi′] =|T|Γ(i,i′,0) (23)

where

 β(δ) ≜(T1−|δ1|)(T2−|δ2|) β(δ,δ′) ≜(T1−max(δ1,δ′1,0)+min(δ1,δ′1,0))⋅(T2−max(δ2,δ′2,0)+min(δ2,δ′2,0)) (24)
###### Proof.

See Appendix B. ∎

To talk about how this fits in to the block diagonal approximation to the Fisher matrix , we now restore the explicit layer indices and use the vectorized notation from Section 2.3.1. The above factorization yields a Kronecker factorization of each block, which will be useful for computing their inverses (and ultimately our approximate natural gradient). In particular, if denotes the block of the approximate Fisher for layer , Eqn. 23 yields our KFC factorization of into a Kronecker product of smaller factors:

 ^Fℓ=Ωℓ−1⊗Γℓ, (25)

where

 [Ωℓ−1]j|Δ|+δ,j′|Δ|+δ′ ≜β(δ,δ′)Ω(j,j′,δ′−δ) j|Δ|+δ,0=[Ωℓ−1]0,j|Δ|+δ ≜β(δ)M(j) 0,0 ≜|T| i,i′ ≜Γ(i,i′,0). (26)

(We will derive much simpler formulas for and in the next section.) Using this factorization, the rest of the K-FAC algorithm can be carried out without modification. For instance, we can compute the approximate natural gradient using a damped version of analogously to Eqns. 9 and 11 of Section 2.2:

 ^F(γ)ℓ =Ω(γ)ℓ−1⊗Γ(γ)ℓ (27) ≜(Ωℓ−1+πℓ√λ+γI)⊗(Γℓ+1πℓ√λ+γI). (28) ^∇h=[^F(γ)]−1∇h (29)

Returning to our running example of AlexNet, is a matrix. Therefore the factors and are and , respectively. These matrices are small enough that they can be represented exactly and inverted in a reasonable amount of time, allowing us to efficiently compute the approximate natural gradient direction using Eqn. 29.

### 3.1 Estimating the factors

Since the true covariance statistics are unknown, we estimate them empirically by sampling from the model’s distribution, similarly to Martens & Grosse (2015). To sample derivatives from the model’s distribution, we select a mini-batch, sample the outputs from the model’s predictive distribution, and backpropagate the derivatives.

We need to estimate the Kronecker factors and . Since these matrices are defined in terms of the autocovariance functions and , it would appear natural to estimate these functions empirically. Unfortunately, if the empirical autocovariances are plugged into Eqn. 26, the resulting

may not be positive semidefinite. This is a problem, since negative eigenvalues in the approximate Fisher could cause the optimization to diverge (a phenomenon we have observed in practice). An alternative which at least guarantees PSD matrices is to simply ignore the boundary effects, taking

in Eqn. 26. Sadly, we found this to give very inaccurate covariances, especially for higher layers, where the filters are of comparable size to the activation maps.

Instead, we estimate each directly using the following fact:

###### Theorem 2.

Under assumption SH,

 Ωℓ =E[⟦Aℓ⟧⊤H⟦Aℓ⟧H] (30) Γℓ =1|T|E[DS⊤ℓDSℓ]. (31)

(The notation is defined in Section 2.3.1.)

###### Proof.

See Appendix B. ∎

Using this result, we define the empirical statistics for a given mini-batch:

 ^Ωℓ =1M⟦~Aℓ⟧⊤H⟦~Aℓ⟧H ^Γℓ =1M|T|D~S⊤ℓD~Sℓ (32)

Since the estimates and are computed in terms of matrix inner products, they are always PSD matrices. Importantly, because and are the same matrices used to implement the convolution operations (Section 2.3.1), the computation of covariance statistics enjoys the same memory locality properties as the convolution operations.

At the beginning of training, we estimate and from the full dataset (or a large subset) using Eqn. 32. Subsequently, we maintain exponential moving averages of these matrices, where these equations are applied to each mini-batch, i.e.

 Ωℓ ←ξΩℓ+(1−ξ)^Ωℓ Γℓ ←ξΓℓ+(1−ξ)^Γℓ, (33)

where is a parameter which determines the timescale for the moving average.

### 3.2 Using KFC in optimization

So far, we have defined an approximation to the Fisher matrix which can be tractably inverted. This can be used in any number of ways in the context of optimization, most simply by using as an approximation to the natural gradient . Alternatively, we could use it in the context of the full K-FAC algorithm, or as a preconditioner for iterative second-order methods (Martens, 2010; Vinyals & Povey, 2012; Sohl-Dickstein et al., 2014).

In our experiments, we explored two particular instantiations of KFC in optimization algorithms. First, in order to provide as direct a comparison as possible to standard SGD-based optimization, we used in the context of a generic approximate natural gradient descent procedure; this procedure is like SGD, except that is substituted for the Euclidean gradient. Additionally, we used momentum, update clipping, and parameter averaging — all standard techniques in the context of stochastic optimization.222Our SGD baseline used momentum and parameter averaging as well. Clipping was not needed for SGD, for reasons explained in Appendix A.1. One can also view this as a preconditioned SGD method, where is used as the preconditioner. Therefore, we refer to this method in our experiments as KFC-pre (to distinguish it from the KFC approximation itself). This method is spelled out in detail in Appendix A.1.

We also explored the use of in the context of K-FAC, which (in addition to the techniques of Section 2.2), includes methods for adaptively changing the learning rate, momentum, and damping parameters over the course of optimization. The full algorithm is given in Appendix A.2. Our aim was to measure how KFC can perform in the context of a sophisticated and well-tuned second-order optimization procedure. We found that the adaptation methods tended to choose stable values for the learning rate, momentum, and damping parameters, suggesting that these could be replaced with fixed values (as in KFC-pre). Since both methods performed similarly, we report results only for KFC-pre. We note that this finding stands in contrast with the autoencoder experiments of Martens & Grosse (2015), where the adapted parameters varied considerably over the course of optimization.

With the exception of inverting the Kronecker factors, all of the heavy computation for our methods was performed on the GPU. We based our implementation on CUDAMat (Mnih, 2009) and the convolution kernels provided by the Toronto Deep Learning ConvNet (TDLCN) package (Srivastava, 2015). Full details on our GPU implementation and other techniques for minimizing computational overhead are given in Appendix A.3.

## 4 Theoretical analysis

### 4.1 Invariance

Natural gradient descent is motivated partly by way of its invariance to reparameterization: regardless of how the model is parameterized, the updates are equivalent up to the first order. Approximations to natural gradient don’t satisfy full invariance to parameterization, but certain approximations have been shown to be invariant to more limited, but still fairly broad, classes of transformations. Ollivier (2015) showed that one such approximation was invariant to (invertible) affine transformations of individual activations. This class of transformations includes replacing sigmoidal with tanh activation functions, as well as the centering transformations discussed in the next section. Martens & Grosse (2015) showed that K-FAC is invariant to a broader class of reparameterizations: affine transformations of the activations (considered as a group), both before and after the nonlinearity. In addition to affine transformations of individual activations, this class includes transformations which whiten the activations to have zero mean and unit covariance. The transformations listed here have all been used to improve optimization performance (see next section), so these invariance properties provide an interesting justification of approximations to natural gradient methods. I.e., to the extent that these transformations help optimization, approximate natural gradient descent methods can be expected to achieve such benefits automatically.

For convolutional layers, we cannot expect an algorithm to be invariant to arbitrary affine transformations of a given layer’s activations, as such transformations can change the set of functions which are representable. (Consider for instance, a transformation which permutes the spatial locations.) However, we show that the KFC updates are invariant to homogeneous, pointwise

affine transformations of the activations, both before and after the nonlinearity. This is perhaps an overly limited statement, as it doesn’t use the fact that the algorithm accounts for spatial correlations. However, it still accounts for a broad set of transformations, such as normalizing activations to be zero mean and unit variance either before or after the nonlinearity.

To formalize this, recall that a layer’s activations are represented as a matrix and are computed from that layer’s pre-activations by way of an elementwise nonlinearity, i.e. . We replace this with an activation function which additionally computes affine transformations before and after the nonlinearity. Such transformations can be represented in matrix form:

 A†ℓ=ϕ†ℓ(S†ℓ)=ϕℓ(S†ℓUℓ+1c⊤ℓ)Vℓ+1d⊤ℓ, (34)

where and are invertible matrices, and and are vectors. For convenience, the inputs to the network can be treated as an activation function which takes no arguments. We also assume the final layer outputs are not transformed, i.e.  and . KFC is invariant to this class of transformations:

###### Theorem 3.

Let be a network with parameter vector and activation functions . Given activation functions defined as in Eqn. 34, there exists a parameter vector such that a network with parameters and activation functions computes the same function as . The KFC updates on and are equivalent, in that the resulting networks compute the same function.

###### Proof.

See Appendix B. ∎

Invariance to affine transformations also implies approximate invariance to smooth nonlinear transformations; see Martens (2014) for further discussion.

### 4.2 Relationship with other algorithms

Other neural net optimization methods have been proposed which attempt to correct for various statistics of the activations or gradients. Perhaps the most commonly used are algorithms which attempt to adapt learning rates for individual parameters based on the variance of the gradients (LeCun et al., 1998; Duchi et al., 2011; Tieleman & Hinton, 2012; Zeiler, 2013; Kingma & Ba, 2015). These can be thought of as diagonal approximations to the Hessian or the Fisher matrix.333Some of these methods use the empirical Fisher matrix, which differs from the proper Fisher matrix in that the targets are taken from the training data rather than sampled from the model’s predictive distribution. The empirical Fisher matrix is less closely related to the curvature than is the proper one (Martens, 2014).

Another class of approaches attempts to reparameterize a network such that its activations have zero mean and unit variance, with the goals of preventing covariate shift and improving the conditioning of the curvature (Cho et al., 2013; Vatanen et al., 2013; Ioffe & Szegedy, 2015). Centering can be viewed as an approximation to natural gradient where the Fisher matrix is approximated with a directed Gaussian graphical model (Grosse & Salakhutdinov, 2015). As discussed in Section 4.1

, KFC is invariant to re-centering of activations, so it ought to automatically enjoy the optimization benefits of centering. However, batch normalization

(Ioffe & Szegedy, 2015) includes some effects not automatically captured by KFC. First, the normalization is done separately for each mini-batch rather than averaged across mini-batches; this introduces stochasticity into the computations which may serve as a regularizer. Second, it discourages large covariate shifts in the pre-activations, which may help to avoid dead units. Since batch normalization is better regarded as a modification to the architecture than an optimization algorithm, it can be combined with KFC; we investigated this in our experiments.

Projected Natural Gradient (PRONG; Desjardins et al., 2015)

goes a step further than centering methods by fully whitening the activations in each layer. In the case of fully connected layers, the activations are transformed to have zero mean and unit covariance. For convolutional layers, they apply a linear transformation that whitens the activations

across feature maps

. While PRONG includes clever heuristics for updating the statistics, it’s instructive to consider an idealized version of the method which has access to the exact statistics. We can interpret this idealized PRONG in our own framework as arising from following two additional approximations:

• Spatially uncorrelated activations (SUA). The activations at any two distinct spatial locations are uncorrelated, i.e.  for . Also assuming SH, the correlations can then be written as .

• White derivatives (WD). Pre-activation derivatives are uncorrelated and have spherical covariance, i.e. . We can assume WLOG that the proportionality constant is 1, since any scalar factor can be absorbed into the learning rate.

###### Theorem 4.

Combining approximations IAD, SH, SUA, and WD results in the following approximation to the entries of the Fisher matrix:

 E[Dwi,j,δDwi′,j′,δ′]=β(δ,δ′)~Ω(j,j′,δ′−δ)1i=i′, (35)

where is the indicator function and is the uncentered autocovariance function. ( is defined in Theorem 1. Formulas for the remaining entries are given in Appendix B.) If the term is dropped, the resulting approximate natural gradient descent update rule is equivalent to idealized PRONG, up to rescaling.

As we later discuss in Section 5.1, assumption WD appears to hold up well empirically, while SUA appears to lose a lot of information. Observe, for instance, that the input images are themselves treated as a layer of activations. Assumption SUA

amounts to modeling each channel of an image as white noise, corresponding to a flat power spectrum. Images have a well-characterized

power spectrum with (Simoncelli & Olshausen, 2001), which implies that the curvature may be much larger in directions corresponding to low-frequency Fourier components than in directions corresponding to high-frequency components.

## 5 Experiments

We have evaluated our method on two standard image recognition benchmark datasets: CIFAR-10

(Krizhevsky, 2009), and Street View Housing Numbers

(SVHN;

Netzer et al., 2011)
. Our aim is not to achieve state-of-the-art performance, but to evaluate KFC’s ability to optimize previously published architectures. We first examine the probabilistic assumptions, and then present optimization results.

For CIFAR-10, we used the architecture from cuda-convnet which achieved 18% error in 20 minutes. This network consists of three convolution layers and a fully connected layer. (While cuda-convnet provides some better-performing architectures, we could not use these, since these included locally connected layers, which KFC can’t handle.) For SVHN, we used the architecture of Srivastava (2013). This architecture consists of three convolutional layers followed by three fully connected layers, and uses dropout for regularization. Both of these architectures were carefully tuned for their respective tasks. Furthermore, the TDLCN CUDA kernels we used were carefully tuned at a low level to implement SGD updates efficiently for both of these architectures. Therefore, we believe our SGD baseline is quite strong.

### 5.1 Evaluating the probabilistic modeling assumptions

In defining KFC, we combined three probabilistic modeling assumptions: independent activations and derivatives (IAD), spatial homogeneity (SH), and spatially uncorrelated derivatives (SUD). As discussed above, IAD is the same approximation made by standard K-FAC, and it was investigated in detail both theoretically and empirically by Martens & Grosse (2015). One implicitly assumes SH when choosing to use a convolutional architecture. However, SUD is perhaps less intuitive. Why should we suppose the derivatives are spatially uncorrelated? Conversely, why not go a step further and assume the activations are spatially uncorrelated (as does PRONG; see Section 4.2) or even drop all of the correlations (thereby obtaining a much simpler diagonal approximation to the Fisher matrix)?

We investigated the autocorrelation functions for networks trained on CIFAR-10 and SVHN, each with 50 epochs of SGD. (These models were trained long enough to achieve good test error, but not long enough to overfit.) Derivatives were sampled from the model’s distribution as described in Section

2.2. Figure 1(a) shows the autocorrelation functions of the pre-activation gradients for three (arbitrary) feature maps in all of the convolution layers of both networks. Figure 1(b) shows the correlations between derivatives for different feature maps in the same spatial position. Evidently, the derivatives are very weakly correlated, both spatially and cross-map, although there are some modest cross-map correlations in the first layers of both models, as well as modest spatial correlations in the top convolution layer of the CIFAR-10 network. This suggests that SUD is a good approximation for these networks.

Interestingly, the lack of correlations between derivatives appears to be a result of max-pooling. Max-pooling has a well-known sparsifying effect on the derivatives, as any derivative is zero unless the corresponding activation achieves the maximum within its pooling group. Since neighboring locations are unlikely to simultaneously achieve the maximum, max-pooling weakens the spatial correlations. To test this hypothesis, we trained networks equivalent to those described above, except that the max-pooling layers were replaced with average pooling. The spatial autocorrelations and cross-map correlations are shown in Figure 1(c, d). Replacing max-pooling with average pooling dramatically strengthens both sets of correlations.

In contrast with the derivatives, the activations have very strong correlations, both spatially and cross-map, as shown in Figure 2. This suggests the spatially uncorrelated activations (SUA) assumption implicitly made by some algorithms could be problematic, despite appearing superficially analogous to SUD.

### 5.2 Optimization performance

We evaluated KFC-pre in the context of optimizing deep convolutional networks. We compared against stochastic gradient descent (SGD) with momentum, which is widely considered a strong baseline for training conv nets. All architectural choices (e.g. sizes of layers) were kept consistent with the previously published configurations. Since the focus of this work is optimization rather than generalization, metaparameters were tuned with respect to training error. This protocol was favorable to the SGD baseline, as the learning rates which performed the best on training error also performed the best on test error.555For KFC-pre, we encountered a more significant tradeoff between training and test error, most notably in the choice of mini-batch size, so the presented results do not reflect our best runs on the test set. For instance, as reported in Figure 3, the test error on CIFAR-10 leveled off at 18.5% after 5 minutes, after which the network started overfitting. When we reduced the mini-batch size from 512 to 128, the test error reached 17.5% after 5 minutes and 16% after 35 minutes. However, this run performed far worse on the training set. On the flip side, very large mini-batch sizes hurt generalization for both methods, as discussed in Section 5.3. For both SGD and KFC-pre, we tuned the learning rates from the set separately for each experiment. For KFC-pre, we also chose several algorithmic parameters using the method of Appendix A.3, which considers only per-epoch running time and not final optimization performance.666For SGD, we used a momentum parameter of 0.9 and mini-batches of size 128, which match the previously published configurations. For KFC-pre, we used a momentum parameter of 0.9, mini-batches of size 512, and a damping parameter . In both cases, our informal explorations did not find other values which performed substantially better in terms of training error.

For both SGD and KFC-pre, we used an exponential moving average of the iterates (see Appendix A.1) with a timescale of 50,000 training examples (which corresponds to one epoch on CIFAR-10). This helped both SGD and KFC-pre substantially. All experiments for which wall clock time is reported were run on a single Nvidia GeForce GTX Titan Z GPU board.

As baselines, we also tried Adagrad (Duchi et al., 2011)

(Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2015), but none of these approaches outperformed carefully tuned SGD with momentum. This is consistent with the observations of Kingma & Ba (2015).

Figure 3(a,b) shows the optimization performance on the CIFAR-10 dataset, in terms of wall clock time. Both KFC-pre and SGD reached approximately the previously published test error of 18% before they started overfitting. However, KFC-pre reached 19% test error in 3 minutes, compared with 9 minutes for SGD. The difference in training error was more significant: KFC-pre reaches a training error of 6% in 4 minutes, compared with 30 minutes for SGD. On SVHN, KFC-pre reached the previously published test error of 2.78% in 120 minutes, while SGD did not reach it within 250 minutes. (As discussed above, test error comparisons should be taken with a grain of salt because algorithms were tuned based on training error; however, any biases introduced by our protocol would tend to favor the SGD baseline over KFC-pre.)

Batch normalization (BN Ioffe & Szegedy, 2015) has recently had much success at training a variety of neural network architectures. It has been motivated both in terms of optimization benefits (because it reduces covariate shift) and regularization benefits (because it adds stochasticity to the updates). However, BN is best regarded not as an optimization algorithm, but as a modification to the network architecture, and it can be used in conjunction with algorithms other than SGD. We modified the original CIFAR-10 architecture to use batch normalization in each layer. Since the parameters of a batch normalized network would have a different scale from those of an ordinary network, we disabled the regularization term so that both networks would be optimized to the same objective function. While our own (inefficient) implementation of batch normalization incurred substantial computational overhead, we believe an efficient implementation ought to have very little overhead; therefore, we simulated an efficient implementation by reusing the timing data from the non-batch-normalized networks. Learning rates were tuned separately for all four conditions (similarly to the rest of our experiments).

Training curves are shown in Figure 4. All of the methods achieved worse test error than the original network as a result of regularization being eliminated. However, the BN networks reached a lower test error than the non-BN networks before they started overfitting, consistent with the stochastic regularization interpretation of BN.777Interestingly, the BN networks were slower to optimize the training error than their non-BN counterparts. We speculate that this is because (1) the SGD baseline, being carefully tuned, didn’t exhibit the pathologies that BN is meant to correct for (i.e. dead units and extreme covariate shift), and (2) the regularization effects of BN made it harder to overfit. For both the BN and non-BN architectures, KFC-pre optimized both the training and test error and NLL considerably faster than SGD. Furthermore, it appeared not to lose the regularization benefit of BN. This suggests that KFC-pre and BN can be combined synergistically.

### 5.3 Potential for distributed implementation

Much work has been devoted recently to highly parallel or distributed implementations of neural network optimization (e.g. Dean et al. (2012)). Synchronous SGD effectively allows one to use very large mini-batches efficiently, which helps optimization by reducing the variance in the stochastic gradient estimates. However, the per-update performace levels off to that of batch SGD once the variance is no longer significant and curvature effects come to dominate. Asynchronous SGD partially alleviates this issue by using new network parameters as soon as they become available, but needing to compute gradients with stale parameters limits the benefits of this approach.

As a proxy for how the algorithms are likely to perform in a highly distributed setting888Each iteration of KFC-pre requires many of the same computations as SGD, most notably computing activations and gradients. There were two major sources of additional overhead: maintaining empirical averages of covariance statistics, and computing inverses or eigendecompositions of the Kronecker factors. These additional operations can almost certainly be performed asynchronously; in our own experiments, we only periodically performed these operations, and this did not cause a significant drop in performance. Therefore, we posit that each iteration of KFC-pre requires a comparable number of sequential operations to SGD for each weight update. This is in contrast to other methods which make good use of large mini-batches such as Hessian-free optimization (Martens, 2010), which requires many sequential iterations for each weight update. KFC-pre also adds little communication overhead, as the Kronecker factors need not be sent to the worker nodes which compute the gradients., we measured the classification error as a function of the number of iterations (weight updates) for each algorithm. Both algorithms were run with large mini-batches of size 4096 (in place of 128 for SGD and 512 for KFC-pre). Figure 5 shows training curves for both algorithms on CIFAR-10 and SVHN, using the same architectures as above.999Both SGD and KFC-pre reached a slightly worse test error before they started overfitting, compared with the small-minibatch experiments of the previous section. This is because large mini-batches lose the regularization benefit of stochastic gradients. One would need to adjust the regularizer in order to get good generalization performance in this setting. KFC-pre required far fewer weight updates to achieve good training and test error compared with SGD. For instance, on CIFAR-10, KFC-pre obtained a training error of 10% after 300 updates, compared with 6000 updates for SGD, a 20-fold improvement. Similar speedups were obtained on test error and on the SVHN dataset. These results suggest that a distributed implementation of KFC-pre has the potential to obtain large speedups over distributed SGD-based algorithms.

## References

• Amari (1998) Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
• Chapelle & Erhan (2011) Chapelle, O. and Erhan, D. Improved preconditioner for Hessian-free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
• Chellapilla et al. (2006) Chellapilla, K., Puri, S., and Simard, P.

High performance convolutional neural networks for document processing.

In International Workshop on Frontiers in Handwriting Recognition, 2006.
• Cho et al. (2013) Cho, K., Raiko, T., and Ilin, A.

Enhanced gradient for training restricted Boltzmann machines.

Neural Computation, 25:805–813, 2013.
• Dean et al. (2012) Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Neural Information Processing Systems, 2012.
• Demmel (1997) Demmel, J. W. Applied Numerical Linear Algebra. SIAM, 1997.
• Desjardins et al. (2015) Desjardins, G., Simonyan, K., Pascanu, R., and Kavukcuoglu, K. Natural neural networks. arXiv:1507.00210, 2015.
• Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
• Grosse & Salakhutdinov (2015) Grosse, Roger and Salakhutdinov, Ruslan. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
• Heskes (2000) Heskes, Tom.

On “natural” learning and pruning in multilayered perceptrons.

Neural Computation, 12(4):881–901, 2000.
• Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Neural Computation, 9:1735–1780, 1997.
• Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
• Kingma & Ba (2015) Kingma, D. P. and Ba, J. L. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2015.
• Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
• Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
• Le Roux et al. (2008) Le Roux, Nicolas, Manzagol, Pierre-antoine, and Bengio, Yoshua. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20, pp. 849–856. MIT Press, 2008.
• LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.
• LeCun et al. (1998) LeCun, Y., Bottou, L., Orr, G., and Müller, K. Efficient backprop. Neural networks: Tricks of the trade, pp. 546–546, 1998.
• Martens (2010) Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
• Martens (2014) Martens, J. New insights and perspectives on the natural gradient method, 2014.
• Martens & Grosse (2015) Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015.
• Mnih (2009) Mnih, V. CUDAMat: A CUDA-based matrix class for Python. Technical Report 004, University of Toronto, 2009.
• Moré (1978) Moré, J.J. The Levenberg-Marquardt algorithm: implementation and theory. Numerical analysis, pp. 105–116, 1978.
• Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In Neural Information Processing Systems Deep Learning and Unsupervised Feature Learning Workshop, 2011.
• Nocedal & Wright (2006) Nocedal, Jorge and Wright, Stephen J. Numerical optimization. Springer, 2. ed. edition, 2006.
• Ollivier (2015) Ollivier, Y. Riemannian metrics for neural networks I: feedforward networks. Information and Inference, 4(2):108–153, 2015.
• Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
• Pascanu et al. (2014) Pascanu, R., Dauphin, Y. N., Ganguli, S., and Bengio, Y. On the saddle point problem for non-convex optimization. arXiv:1405.4604, 2014.
• Polyak & Juditsky (1992) Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal of Control and Optimization, 30(4):838–855, 1992.
• Povey et al. (2015) Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of DNNs with natural gradient and parameter averaging. In International Conference on Learning Representations: Workshop track, 2015.
• Ranzato & Hinton (2010) Ranzato, M. and Hinton, G. E.

Modeling pixel means and covariances using factorized third-order Boltzmann machines.

In Computer Vision and Pattern Recognition, 2010.
• Rumelhart et al. (1986) Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
• Schraudolph (2002) Schraudolph, Nicol N. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14, 2002.
• Simoncelli & Olshausen (2001) Simoncelli, E. P. and Olshausen, B. A. Natural image statistics and neural representation. Annual Review of Neuroscience, 24:1193–1216, 2001.
• Sohl-Dickstein et al. (2014) Sohl-Dickstein, J., Poole, B., and Ganguli, S. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In International Conference on Machine Learning, 2014.
• Srivastava (2013) Srivastava, N. Improving neural networks with dropout. Master’s thesis, University of Toronto, 2013.
• Srivastava (2015) Srivastava, N. Toronto Deep Learning ConvNet.
• Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. V. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.
• Swersky et al. (2010) Swersky, K., Chen, Bo, Marlin, B., and de Freitas, N. A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop (ITA), 2010, pp. 1–10, Jan 2010.
• Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5, RMSProp. In Coursera course Neural Networks for Machine Learning, 2012.
• Vatanen et al. (2013) Vatanen, Tommi, Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Pushing stochastic gradient towards second-order methods – backpropagation learning with transformations in nonlinearities. 2013.
• Vinyals & Povey (2012) Vinyals, O. and Povey, D. Krylov subspace descent for deep learning. In

International Conference on Artificial Intelligence and Statistics (AISTATS)

, 2012.
• Yang et al. (2014) Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. arXiv:1412.7149, 2014.
• Zeiler (2013) Zeiler, Matthew D. ADADELTA: An adaptive learning rate method. 2013.

## Appendix A Optimization methods

### a.1 KFC as a preconditioner for SGD

The first optimization procedure we used in our experiments was a generic natural gradient descent approximation, where was used to approximate . This procedure is like SGD with momentum, except that is substituted for the Euclidean gradient. One can also view this as a preconditioned SGD method, where is used as the preconditioner. To distinguish this optimization procedure from the KFC approximation itself, we refer to it as KFC-pre. Our procedure is perhaps more closely analogous to earlier Kronecker product-based natural gradient approximations (Heskes, 2000; Povey et al., 2015) than to K-FAC itself.

In addition, we used a variant of gradient clipping

(Pascanu et al., 2013) to avoid instability. In particular, we clipped the approximate natural gradient update so that , where is estimated using 1/4 of the training examples from the current mini-batch. One motivation for this heuristic is that approximates the KL divergence of the predictive distributions before and after the update, and one wouldn’t like the predictive distributions to change too rapidly. The value can be computed using curvature-vector products (Schraudolph, 2002). The clipping was only triggered near the beginning of optimization, where the parameters (and hence also the curvature) tended to change rapidly.101010This may be counterintuitive, since SGD applied to neural nets tends to take small steps early in training, at least for commonly used initializations. For SGD, this happens because the initial parameters, and hence also the initial curvature, are relatively small in magnitude. Our method, which corrects for the curvature, takes larger steps early in training, when the error signal is the largest. Therefore, one can likely eliminate this step by initializing from a model partially trained using SGD.

Taking inspiration from Polyak averaging (Polyak & Juditsky, 1992; Swersky et al., 2010), we used an exponential moving average of the iterates. This helps to smooth out the variability caused by the mini-batch selection. The full optimization procedure is given in Algorithm 1.

### a.2 Kronecker-factored approximate curvature

The central idea of K-FAC is the combination of approximations to the Fisher matrix described in Section 2.2. While one could potentially perform standard natural gradient descent using the approximate natural gradient , perhaps with a fixed learning rate and with fixed Tikhonov-style damping/reglarization, Martens & Grosse (2015) found that the most effective way to use was within a robust 2nd-order optimization framework based on adaptively damped quadratic models, similar to the one employed in HF (Martens, 2010). In this section, we describe the K-FAC method in detail, while omitting certain aspects of the method which we do not use, such as the block tri-diagonal inverse approximation.

K-FAC uses a quadratic model of the objective to dynamically choose the step size and momentum decay parameter at each step. This is done by taking where is the update computed at the previous iteration, and minimizing the following quadratic model of the objective (over the current mini-batch):

 M(θ+v)=h(θ)+∇h⊤v+12v⊤(F+rI)v. (36)

where we assume the is the expected loss plus an -regularization term of the form . Since behaves like a curvature matrix, this quadratic function is similar to the second-order Taylor approximation to . Note that here we use the exact for the mini-batch, rather than the approximation . Intuitively, one can think of as being itself iteratively optimized at each step in order to better minimize , or in other words, to more closely match the true natural gradient (which is the exact minimum of ). Interestingly, in full batch mode, this method is equivalent to performing preconditioned conjugate gradient in the vicinity of a local optimum (where remains approximately constant).

To see how this minimization over and can be done efficiently, without computing the entire matrix , consider the general problem of minimizing on the subspace spanned by arbitrary vectors . (In our case, , and .) The coefficients can be found by solving the linear system , where and . To compute the matrix , we compute each of the matrix-vector products using automatic differentiation (Schraudolph, 2002).

Both the approximate natural gradient and the update (generated as described above) arise as the minimum, or approximate minimum, of a corresponding quadratic model. In the case of , this model is given by and is designed to be a good local approximation to the objective . Meanwhile, the quadratic model which is implicitly minimized when computing is designed to approximate (by approximating with ).

Because these quadratic models are approximations, naively minimizing them over can lead to poor results in both theory and practice. To help deal with this problem K-FAC employs an adaptive Tikhonov-style damping scheme applied to each of them (the details of which differ in either case).

To compensate for the inaccuracy of as a model of , K-FAC adds a Tikhonov regularization term to which encourages the update to remain small in magnitude, and thus more likely to remain in the region where is a reasonable approximation to . This amounts to replacing with in Eqn. 36. Note that this technique is formally equivalent to performing constrained minimization of within some spherical region around (a “trust-region”). See for example Nocedal & Wright (2006).

K-FAC uses the well-known Levenberg-Marquardt technique (Moré, 1978) to automatically adapt the damping parameter so that the damping is loosened or tightened depending on how accurately predicts the true decrease in the objective function after each step. This accuracy is measured by the so-called “reduction ratio”, which is given by

 ρ=h(θ)−h(θ+v)M(θ)−M(θ+v), (37)

and should be close to 1 when the quadratic approximation is reasonably accurate around the given value of . The update rule for is as follows:

 λ←⎧⎪⎨⎪⎩λ⋅λ−if % ρ>3/4λif 1/4≤ρ≤3/4λ⋅λ+if ρ<1/4 (38)

where and are constants such that .

To compensate for the inaccuracy of , and encourage to be smaller and more conservative, K-FAC similarly adds to before inverting it. As discussed in Section 2.2, this can be done approximately by adding multiples of to each of the Kronecker factors and of before inverting them. Alternatively, an exact solution can be obtained by expanding out the eigendecomposition of each block of , and using the following identity:

 [^Fℓ+γI]−1 =[(QΨ⊗QΓ)(DΨ⊗DΓ)(Q⊤Ψ⊗Q⊤Γ)+γI]−