# Fisher Information and Natural Gradient Learning of Random Deep Networks

A deep neural network is a hierarchical nonlinear model transforming input signals to output signals. Its input-output relation is considered to be stochastic, being described for a given input by a parameterized conditional probability distribution of outputs. The space of parameters consisting of weights and biases is a Riemannian manifold, where the metric is defined by the Fisher information matrix. The natural gradient method uses the steepest descent direction in a Riemannian manifold, so it is effective in learning, avoiding plateaus. It requires inversion of the Fisher information matrix, however, which is practically impossible when the matrix has a huge number of dimensions. Many methods of approximating the natural gradient have therefore been introduced. The present paper uses a statistical neurodynamical method to reveal the properties of the Fisher information matrix in a net of random connections under the mean field approximation. We prove that the Fisher information matrix is unit-wise block diagonal supplemented by small order terms of off-block-diagonal elements, which provides a justification for the quasi-diagonal natural gradient method proposed by Y. Ollivier. A unit-wise block-diagonal Fisher matrix reduces to the direct sum of the Fisher information matrices of single units. We further prove that the Fisher information matrix of a single unit has a simple reduced form, a sum of a diagonal matrix and a rank 2 matrix of weight-bias correlations. We obtain the inverse of the Fisher information matrix explicitly. We then have an explicit form of the natural gradient, without relying on numerical matrix inversion, which drastically speeds up stochastic gradient learning.

## 1 Introduction

In modern deep learning, multilayer neural networks are usually trained by the stochastic gradient-descent method (see Amari, 1967 for one of the earliest proposals of stochastic gradient descent for the purpose of training multilayer networks). The parameter space of multilayer networks forms a Riemannian space equipped with the Fisher information metric. Thus, instead of the usual gradient descent method, the natural gradient or Riemannian gradient method, which takes account of the geometric structure of the Riemannian space, is more effective for learning (Amari, 1998). However, it has been difficult to apply the natural gradient method because it needs the inversion of the Fisher information matrix, which is computationally heavy. Many approximation methods reducing the computational cost have therefore been proposed (see Pascanu & Bengio, 2013; Grosse & Martens, 2016; Martens, 2017).

To resolve the computational difficulty of the natural gradient, we analyze the Fisher information matrix $G$ of a random network, where the connection weights and biases are randomly assigned, by using the mean field approximation (see also our accompanying paper, Amari, Karakida and Oizumi, 2018, for the analysis of feedforward paths). We prove that, when the number of neural units in each layer is sufficiently large, the subblocks of the Fisher information matrix corresponding to different layers are of order $1/\sqrt{n}$, which is negligibly small. Thus, $G$ is approximated by a layer-wise diagonalized matrix. Furthermore, within the same layer, the subblocks among different units are also of order $1/\sqrt{n}$.

This gives a justification for the approximated natural gradient method proposed by Kurita (1994) and studied in detail by Ollivier (2015) and Marceau-Caron and Ollivier (2016), where the unit-wise diagonalized Fisher information matrix was used. We further study the Fisher information matrix of a single unit (that is, a simple perceptron) for the purpose of implementing unit-wise natural gradient learning. We obtain an explicit form of the Fisher information matrix and its inverse under the assumption that the inputs are subject to the standard non-correlated Gaussian distribution with mean 0. The unit-wise natural gradient is then formulated explicitly without numerical matrix inversion, provided the input signals are subject to independent Gaussian distributions with mean 0, making it possible to realize natural gradient learning without the burden of heavy computation. The results justify the quasi-diagonal approximation of the Fisher information matrix proposed by Ollivier (2015), although our results are not exactly the same as Ollivier's. Our approximation method is justified only for random networks under the mean-field assumption. However, we expect it to be effective for training actual deep networks as well, considering the good performance reported in Ollivier (2015) and Marceau-Caron and Ollivier (2016).

The results can be extended to residual deep networks with ReLU. We show that the inputs to each layer are approximately subject to zero-mean independent Gaussian distributions in the case of a resnet, because random linear transformations follow the nonlinear transformations in every layer. Therefore, our method would be particularly effective when residual networks are used.

To understand the structure of the Fisher information matrix, refer to Karakida, Akaho and Amari (2018), which analyzes the characteristics (the distribution of its eigenvalues) of the Fisher information matrix of a random net for the first time.

## 2 Deep neural networks

We consider a deep neural network consisting of $L$ layers. Let ${}^{l-1}\boldsymbol{x}$ be the input vector to the $l$-th layer and ${}^{l}\boldsymbol{x}$ the output vector of the $l$-th layer (see Figure 1).

The input-output relation of the $l$-th layer is written as

$$ {}^{l}x_i = \varphi\Bigl( \sum_j {}^{l}w_{ij}\, {}^{l-1}x_j + {}^{l}b_i \Bigr), \tag{1} $$

where $\varphi$ is an activation function such as the rectified linear unit (ReLU) or the sigmoid function. Let $n_l$ be the number of neurons in the $l$-th layer. We assume that the $n_l$ are large, but the number of neurons in the final layer, $n_L$, can be small; even $n_L = 1$ is allowed. The weights ${}^l w_{ij}$ and biases ${}^l b_i$ are random variables subject to independent Gaussian distributions with mean 0 and variances $\sigma_l^2 / n_{l-1}$ and $\sigma_{b,l}^2$, respectively. Note that each weight is a random variable of order $1/\sqrt{n}$, but the weighted sum is of order 1.
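This scaling can be seen numerically. The sketch below (assumed value $\sigma^2 = 1$) draws many units' weight vectors with variance $\sigma^2/n$ and checks that the variance of the weighted sums stays of order 1, matching $\tau^2$ of equation (3) below without the bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                              # assumed weight-variance parameter
n = 1000
x = rng.standard_normal(n)                # previous-layer activations (fixed)

# 5000 units, each with weights of variance sigma2/n: every weight is O(1/sqrt(n))
W = rng.normal(0.0, np.sqrt(sigma2 / n), size=(5000, n))
u = W @ x                                 # weighted sums, eq. (2) without the bias

A = np.mean(x**2)                         # activity of the input, eq. (4)
# the empirical variance of u is close to sigma2 * A, independently of n
```

The point of the sketch is only the scaling: each weight shrinks like $1/\sqrt{n}$, yet the pre-activations keep an $n$-independent variance.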

We briefly recapitulate the feedforward analysis of input signals given in Poole et al. (2016) and Amari, Karakida and Oizumi (2018), to introduce the activity ${}^l A$ and the enlargement factor ${}^l \chi$. They also play a role in the feedback analysis for obtaining the Fisher information (Schoenholz et al., 2016; Karakida, Akaho and Amari, 2018).

Let us put

$$ {}^l u_i = \sum_j {}^l w_{ij}\, {}^{l-1}x_j + {}^l b_i. \tag{2} $$

Given ${}^{l-1}\boldsymbol{x}$, the ${}^l u_i$ are independently and identically distributed (iid) Gaussian random variables with mean 0 and variance

$$ \tau_l^2 = \frac{\sigma_l^2}{n_{l-1}} \sum_j \bigl( {}^{l-1}x_j \bigr)^2 + \sigma_{b,l}^2 = {}^{l-1}A\, \sigma_l^2 + \sigma_{b,l}^2, \tag{3} $$

where

$$ {}^{l-1}A = \frac{1}{n_{l-1}} \sum_j \bigl( {}^{l-1}x_j \bigr)^2 \tag{4} $$

is the total activity of the input ${}^{l-1}\boldsymbol{x}$.

It is easy to show how ${}^l A$ develops across the layers. Since the $\bigl( {}^l x_i \bigr)^2$ are iid when ${}^{l-1}\boldsymbol{x}$ is fixed, the law of large numbers guarantees that their sum is replaced by the expectation when $n_l$ is large. Putting ${}^l u_i = \tau_l v$, where $v$ is a standard Gaussian variable, we have a recursive equation,

$$ {}^l A = \int \{ \varphi(\tau_l v) \}^2 \, Dv, \tag{5} $$

where $\tau_l$ in equation (3) depends on ${}^{l-1}A$, and

$$ Dv = \frac{1}{\sqrt{2\pi}} \exp\Bigl\{ -\frac{v^2}{2} \Bigr\}\, dv. \tag{6} $$
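As an illustration (not from the paper), the recursion (3)–(5) can be iterated numerically by Monte Carlo integration over $Dv$. The parameter values below ($\sigma^2 = 1.5$, $\sigma_b^2 = 0.1$, ReLU) are assumed for the sketch; for ReLU, $\int \{\varphi(\tau v)\}^2 Dv = \tau^2/2$, so the map is affine, ${}^lA = 0.75\,{}^{l-1}A + 0.05$, with fixed point $A^* = 0.2$.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1_000_000)       # samples of v ~ N(0, 1) for the Dv integral

def next_A(A_prev, sigma2=1.5, sigma2_b=0.1):
    """One step of eq. (5) with ReLU: tau_l from eq. (3), then A_l = E[phi(tau v)^2]."""
    tau = np.sqrt(sigma2 * A_prev + sigma2_b)
    return np.mean(np.maximum(tau * v, 0.0) ** 2)

A = 1.0                                  # activity of the input layer (assumed)
for _ in range(50):                      # iterate the map across layers
    A = next_A(A)

# for ReLU the integral equals tau^2 / 2, so A converges to 0.05 / (1 - 0.75) = 0.2
print(round(A, 2))                       # -> 0.2
```

With these assumed variances the map is a contraction, so the activity settles to a layer-independent fixed point; other choices of $\sigma^2$ make it grow or decay.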

Since equation (1) gives the transformation from ${}^{l-1}\boldsymbol{x}$ to ${}^l\boldsymbol{x}$, we study how a small difference $d\,{}^{l-1}\boldsymbol{x}$ in the input develops to give a difference $d\,{}^l\boldsymbol{x}$ in the output. By differentiating equation (1), we have

$$ d\,{}^l\boldsymbol{x} = {}^l B\, d\,{}^{l-1}\boldsymbol{x}, \tag{7} $$

where

$$ {}^l B = \frac{\partial\, {}^l\boldsymbol{x}}{\partial\, {}^{l-1}\boldsymbol{x}} \tag{8} $$

is a matrix whose $(i_l, i_{l-1})$-th element is given by

$$ B_{i_l i_{l-1}} = \varphi'(u_{i_l})\, w_{i_l i_{l-1}}. \tag{9} $$

It is a random variable of order $1/\sqrt{n}$. Here and hereafter, we denote ${}^l w_{ij}$ by $w_{i_l i_{l-1}}$, eliminating the superfix $l$ and using $i_l$ and $i_{l-1}$ instead of $i$ and $j$. These index notations are convenient for showing that the corresponding indices belong to layer $l$.

We show how the squared Euclidean length of $d\,{}^l\boldsymbol{x}$,

$$ d\,{}^l s^2 = \sum_{i_l} \bigl( dx_{i_l} \bigr)^2, \tag{10} $$

is related to that of $d\,{}^{l-1}\boldsymbol{x}$. This relation can be seen from

$$ d\,{}^l s^2 = \sum_{i_l, i_{l-1}, i'_{l-1}} B_{i_l i_{l-1}} B_{i_l i'_{l-1}}\, dx_{i_{l-1}}\, dx_{i'_{l-1}}. \tag{11} $$

For any pair $i_{l-1}$ and $i'_{l-1}$, the random variables $B_{i_l i_{l-1}} B_{i_l i'_{l-1}}$ are iid for all $i_l$ when ${}^{l-1}\boldsymbol{x}$ is fixed, so the law of large numbers guarantees that

$$ \sum_{i_l} B_{i_l i_{l-1}} B_{i_l i'_{l-1}} = n_l\, \mathrm{E}\bigl[ B_{i_l i_{l-1}} B_{i_l i'_{l-1}} \bigr] + O_p\Bigl( \tfrac{1}{\sqrt{n_l}} \Bigr), \tag{12} $$

where $\mathrm{E}$ is the expectation with respect to the weights and biases and $O_p$ represents terms of small stochastic order. We use the mean field property that $\varphi'(u_{i_l})$ is self-averaging, so the average of the product of $\varphi'$ and $w$ in equation (12) splits as

$$ \mathrm{E}\bigl[ \varphi'(u_{i_l})^2 \bigr]\, \mathrm{E}\bigl[ w_{i_l i_{l-1}} w_{i_l i'_{l-1}} \bigr]. \tag{13} $$

This is justified in Appendix I. By putting

$$ {}^l\chi = \sigma_l^2 \int \{ \varphi'(\tau_l v) \}^2 \, Dv, \tag{14} $$

we have from equation (11)

$$ d\,{}^l s^2 = {}^l\chi\, d\,{}^{l-1} s^2, \tag{15} $$

by using

$$ \mathrm{E}\bigl[ w_{i_l i_{l-1}} w_{i_l i'_{l-1}} \bigr] = \frac{\sigma_l^2}{n_l}\, \delta_{i_{l-1} i'_{l-1}}. \tag{16} $$

Here ${}^l\chi$, which depends on $\tau_l$, is the enlargement factor showing how $d\,{}^{l-1}s^2$ is enlarged or reduced across layer $l$.

From the recursive relation (15), we have

$$ d\,{}^L s^2 = \chi^L_l\, d\,{}^{l-1} s^2, \tag{17} $$

$$ \chi^L_l = {}^L\chi\, {}^{L-1}\chi \cdots {}^l\chi. \tag{18} $$

Assume that all the ${}^l\chi$ are equal to $\chi$. Then $\chi$ gives the Lyapunov exponent of the dynamics of equation (15). When it is larger than 1, the length diverges as the layers proceed, whereas when it is smaller than 1, the length decays to 0. The dynamics of signal propagation is chaotic when $\chi > 1$ (Poole et al., 2016). Interesting information processing takes place at the edge of chaos, where $\chi$ is nearly equal to 1 (Yang & Schoenholz, 2017). We are interested in the case where $\chi$ is nearly equal to 1, but the individual ${}^l\chi$ are distributed, some being smaller than 1 and others larger than 1.
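For a concrete illustration (assumed values, not from the paper): with ReLU, $\varphi'(u)^2 = 1_{u > 0}$, so the integral in equation (14) equals $1/2$ regardless of $\tau_l$, and ${}^l\chi = \sigma_l^2 / 2$. The edge of chaos ${}^l\chi = 1$ then sits at $\sigma_l^2 = 2$, the familiar He-style weight scale. A Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1_000_000)       # samples of v ~ N(0, 1)

def chi(sigma2, tau):
    """Enlargement factor, eq. (14), for ReLU: phi'(tau v)^2 = 1_{v > 0}."""
    return sigma2 * np.mean(tau * v > 0.0)

# independent of tau; chi = 1 (edge of chaos) exactly when sigma2 = 2
print(round(chi(2.0, tau=1.3), 2))       # -> 1.0
```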

## 3 Fisher information of deep networks and natural gradient learning

We study a regression model in which the output of the final layer $L$, ${}^L\boldsymbol{x}$, is disturbed by noise,

$$ \boldsymbol{y} = {}^L\boldsymbol{x} + \boldsymbol{\varepsilon}, \tag{19} $$

where $\boldsymbol{\varepsilon}$ is a multivariate Gaussian random variable with mean 0 and identity covariance matrix $I$. Then the probability of $\boldsymbol{y}$ given input $\boldsymbol{x}$ is

$$ p(\boldsymbol{y} \,|\, \boldsymbol{x}; W) = \frac{1}{(\sqrt{2\pi})^{n_L}} \exp\Bigl\{ -\frac{1}{2} \bigl\| \boldsymbol{y} - {}^L\boldsymbol{x} \bigr\|^2 \Bigr\}, \tag{20} $$

where $W$ consists of all the parameters ${}^l w_{ij}$ and ${}^l b_i$, $l = 1, \cdots, L$. The Fisher information matrix is given by

$$ G = \mathrm{E}_{\boldsymbol{x}, \boldsymbol{y}}\bigl[ (\partial_W \log p)(\partial_W \log p) \bigr], \tag{21} $$

where $\mathrm{E}_{\boldsymbol{x}, \boldsymbol{y}}$ denotes the expectation with respect to a randomly generated input $\boldsymbol{x}$ and the resultant $\boldsymbol{y}$, and $\partial_W$ is the gradient with respect to $W$. By using the error vector $\boldsymbol{\varepsilon}$ in (19), we have

$$ \partial_W \log p = \boldsymbol{\varepsilon} \cdot \partial_W\, {}^L\boldsymbol{x}. \tag{22} $$

For fixed $\boldsymbol{x}$, the expectation with respect to $\boldsymbol{y}$ is replaced by that with respect to $\boldsymbol{\varepsilon}$, where $\mathrm{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}] = I$. Hence, (21) is given by

$$ G = \mathrm{E}_{\boldsymbol{x}}\Bigl[ \sum_{i_L} \bigl\{ \partial_W \varphi(u_{i_L}) \bigr\} \bigl\{ \partial_W \varphi(u_{i_L}) \bigr\} \Bigr]. \tag{23} $$

Here, we use the dyadic or tensor notation in which $\boldsymbol{a}\boldsymbol{b}$ implies a matrix $\bigl( a_i b_j \bigr)$, instead of the vector-matrix notation $\boldsymbol{a}\boldsymbol{b}^T$ for column vectors.

Online learning is a method of modifying the current $W_t$ such that the current loss

$$ l = \frac{1}{2} \bigl\| \boldsymbol{y} - {}^L\boldsymbol{x}_t \bigr\|^2 \tag{24} $$

decreases, where $(\boldsymbol{x}_t, \boldsymbol{y}_t)$ is the current input-output pair. The stochastic gradient descent method (proposed in Amari, 1967) uses the gradient of $l$ to modify $W_t$,

$$ \Delta W = -\eta \frac{\partial l}{\partial W}. \tag{25} $$

Historically, the first simulation results, applied to four-layer networks for pattern classification, were given in a Japanese book (Amari, 1968). The minibatch method uses the average of $\partial l / \partial W$ over minibatch samples.

The negative gradient is a direction that decreases the current loss, but it is not the steepest descent direction in a Riemannian manifold. The true steepest direction is given by

$$ \tilde{\nabla} l = G^{-1} \frac{\partial l}{\partial W}, \tag{26} $$

which is called the natural or Riemannian gradient (Amari, 1998). The natural gradient method is given by

$$ \Delta W = -\eta\, \tilde{\nabla} l. \tag{27} $$

It is known to be Fisher efficient for estimating $W$ (Amari, 1998). Although it gives excellent performance, the inversion of $G$ is computationally very difficult.

To avoid this difficulty, the quasi-diagonal natural gradient method was proposed in Ollivier (2015) and was shown to be very efficient in Marceau-Caron and Ollivier (2016). A recent proposal (Ollivier, 2017) also looks very promising for realizing natural gradient learning. The present paper analyzes the structure of the Fisher information matrix. It gives a justification of the quasi-diagonal natural gradient method. Using this analysis, we propose a new method of realizing natural gradient learning without the burden of inverting $G$.

## 4 Structure of Fisher information matrix

To calculate the elements of $G$, we use a new notation combining the connection weights and the bias into one vector,

$$ {}^l\boldsymbol{w}^* = \bigl( {}^l\boldsymbol{w}, {}^l b \bigr). \tag{28} $$

For the $i$-th unit of layer $l$, it is

$$ \boldsymbol{w}^*_{i_l} = \bigl( w_{i_l i_{l-1}}, b_{i_l} \bigr). \tag{29} $$

For $l > m$, we have the recursive relation

$$ \frac{\partial\, {}^l\boldsymbol{x}}{\partial\, {}^m\boldsymbol{w}^*} = \frac{\partial\, {}^l\boldsymbol{x}}{\partial\, {}^{l-1}\boldsymbol{x}}\, \frac{\partial\, {}^{l-1}\boldsymbol{x}}{\partial\, {}^m\boldsymbol{w}^*} = {}^l B\, \frac{\partial\, {}^{l-1}\boldsymbol{x}}{\partial\, {}^m\boldsymbol{w}^*}. \tag{30} $$

Starting from $l = m$ and using

$$ \frac{\partial\, {}^m\boldsymbol{x}}{\partial\, {}^m\boldsymbol{w}^*} = \varphi'({}^m\boldsymbol{u})\, {}^{m-1}\boldsymbol{x}, \tag{31} $$

we have

$$ \frac{\partial\, {}^L\boldsymbol{x}}{\partial\, {}^m\boldsymbol{w}^*} = {}^L B \cdots {}^{m+1}B\, \varphi'({}^m\boldsymbol{u})\, {}^{m-1}\boldsymbol{x}. \tag{32} $$

Put

$$ B^L_{m+1} = {}^L B \cdots {}^{m+1} B, \tag{33} $$

which is a product of $L - m$ matrices. The elements of $B^L_{m+1}$ are denoted by $B_{i_L i_m}$.

We calculate the Fisher information given in equation (23). The elements of $G$ with respect to layers $m$ and $l$ are written as

$$ G\bigl( {}^m\boldsymbol{w}^*, {}^l\boldsymbol{w}^* \bigr) = \mathrm{E}_{\boldsymbol{x}}\Bigl[ \frac{\partial\, {}^L\boldsymbol{x}}{\partial\, {}^m\boldsymbol{w}^*} \cdot \frac{\partial\, {}^L\boldsymbol{x}}{\partial\, {}^l\boldsymbol{w}^*} \Bigr], \tag{34} $$

where $\cdot$ denotes the inner product with respect to ${}^L\boldsymbol{x}$. The elements of $\partial\, {}^L\boldsymbol{x} / \partial\, {}^m\boldsymbol{w}^*$ are, for fixed $i_L$,

$$ B_{i_L i_m}\, \varphi'(u_{i_m})\, x_{i_{m-1}}. \tag{35} $$

Hence, (34) is written in component form as

$$ \bigl[ G\bigl( {}^m\boldsymbol{w}^*, {}^l\boldsymbol{w}^* \bigr) \bigr]_{i_m i_l i_{m-1} i_{l-1}} = \sum_{i_L} B_{i_L i_m} B_{i_L i_l}\, \varphi'(u_{i_m})\, \varphi'(u_{i_l})\, x_{i_{m-1}} x_{i_{l-1}}. \tag{36} $$

We first consider the case $l = m$, that is, two neurons in the same layer $m$. The following lemma is useful for evaluating $G$.

**Domino Lemma.** We assume that all $n_l$ are of order $n$. Then

$$ \sum_{i_L, i'_L} \delta_{i_L i'_L}\, B_{i_L i_m} B_{i'_L i'_m} = \chi^L_{m+1}\, \delta_{i_m i'_m} + O_p\Bigl( \tfrac{1}{\sqrt{n}} \Bigr). \tag{37} $$
###### Proof.

We first prove the case with $m = L - 1$. We have

$$ \sum_{i_L} B_{i_L i_{L-1}} B_{i_L i'_{L-1}} = \sum_{i_L} \{ \varphi'(u_{i_L}) \}^2\, w_{i_L i_{L-1}}\, w_{i_L i'_{L-1}}. \tag{38} $$

When $i_{L-1} = i'_{L-1}$, this is a sum of $n_L$ iid random variables $\{ \varphi'(u_{i_L}) \}^2 \bigl( w_{i_L i_{L-1}} \bigr)^2$ when the input ${}^{L-1}\boldsymbol{x}$ is fixed. Therefore, the law of large numbers guarantees that, as $n_L$ goes to infinity, their sum converges to the expectation,

$$ n_L\, \mathrm{E}_{\boldsymbol{x}}\bigl[ \{ \varphi'(u_{i_L}) \}^2 \bigl( w_{i_L i_{L-1}} \bigr)^2 \bigr] = {}^L\chi, \tag{39} $$

under the mean field approximation, for any $i_{L-1}$. For $i_{L-1} \neq i'_{L-1}$, the right-hand side of equation (38) is a sum of iid variables with mean 0. Hence, its mean is 0. We evaluate its variance, which is

$$ n_L\, \mathrm{E}\bigl[ \{ \varphi'(u_{i_L}) \}^4 \bigl( w_{i_L i_{L-1}} \bigr)^2 \bigl( w_{i_L i'_{L-1}} \bigr)^2 \bigr], \tag{40} $$

of order $1/n_L$, because $\bigl( w_{i_L i_{L-1}} \bigr)^2 \bigl( w_{i_L i'_{L-1}} \bigr)^2$ is of order $1/n_L^2$. Hence we have

$$ \sum_{i_L, i'_L} \delta_{i_L i'_L}\, B_{i_L i_{L-1}} B_{i'_L i'_{L-1}} = {}^L\chi\, \delta_{i_{L-1} i'_{L-1}} + O_p\Bigl( \tfrac{1}{\sqrt{n_L}} \Bigr). \tag{41} $$

When $m < L - 1$, we repeat the process layer by layer down to $m$. Then $\delta_{i_L i'_L}$ in the left-hand side of equation (37) propagates to give $\delta_{i_m i'_m}$ like a domino effect, leaving the multiplicative factors ${}^l\chi$. This proves the lemma. ∎

Remark: The domino lemma holds irrespective of $m$, provided the layer widths are large. However, the matrix $B^L_m$ is not of full rank, its rank being at most $n_L$.

By using this result, we evaluate the off-diagonal blocks of $G$ under the mean field approximation (39).

###### Theorem 1.

The Fisher information matrix $G$ is unit-wise block diagonal except for terms of stochastic order $1/\sqrt{n}$.

###### Proof.

We first calculate the off-diagonal blocks of the Fisher information matrix within the same layer. The Fisher information submatrix within layer $m$ is

$$ G\bigl( \boldsymbol{w}^*_{i_m}, \boldsymbol{w}^*_{i'_m} \bigr) = \mathrm{E}_{\boldsymbol{x}}\Bigl[ \sum_{i_L, i'_L} \delta_{i_L i'_L}\, B_{i_L i_m} B_{i_L i'_m}\, \varphi'(u_{i_m})\, \varphi'(u_{i'_m})\, x_{i_{m-1}} x_{i'_{m-1}} \Bigr], \tag{42} $$

whose entries are the elements of the submatrix of $G$ corresponding to neurons $i_m$ and $i'_m$, both in the same layer $m$. By the domino lemma, we have

$$ G\bigl( \boldsymbol{w}^*_{i_m}, \boldsymbol{w}^*_{i'_m} \bigr) = \mathrm{E}_{\boldsymbol{x}}\bigl[ \chi^L_m\, \{ \varphi'(u_{i_m}) \}^2\, x_{i_{m-1}} x_{i'_{m-1}} \bigr] \delta_{i_m i'_m} + O_p\Bigl( \tfrac{1}{\sqrt{n}} \Bigr). \tag{43} $$

This shows that the submatrix is unit-wise block diagonal: that is, the blocks of different neurons $i_m$ and $i'_m$ are 0 except for terms of order $1/\sqrt{n}$.

We next study the blocks of different layers $l$ and $m$ ($l > m$),

$$ G\bigl( {}^l\boldsymbol{w}^*, {}^m\boldsymbol{w}^* \bigr) = \mathrm{E}_{\boldsymbol{x}}\Bigl[ \sum_{i_L, i'_L} \delta_{i_L i'_L}\, B_{i_L i_l} B_{i'_L i_m}\, \varphi'(u_{i_l})\, \varphi'(u_{i_m})\, {}^{l-1}\boldsymbol{x}\, {}^{m-1}\boldsymbol{x} \Bigr]. \tag{44} $$

We have

$$ B_{i_L i_m} = \sum_{i_l} B_{i_L i_l} B_{i_l i_m}. \tag{45} $$

By using the domino lemma, (44) is written as

$$ \mathrm{E}_{\boldsymbol{x}}\bigl[ \chi^L_l\, B_{i_l i_m}\, \varphi'(u_{i_l})\, \varphi'(u_{i_m})\, {}^{l-1}\boldsymbol{x}\, {}^{m-1}\boldsymbol{x} \bigr]. \tag{46} $$

When $m = l - 1$,

$$ B_{i_l i_{l-1}} = \varphi'(u_{i_l})\, w_{i_l i_{l-1}}, \tag{47} $$

and hence it is of order $1/\sqrt{n}$. In general, $B_{i_l i_m}$ is a sum of mean-0 iid random variables with variance of order $1/n$. Hence, its mean is 0 and its variance is of order $1/n$, proving that (46) is of order $1/\sqrt{n}$. ∎

Inspired by this, we define a new metric $G^*$ as an approximation of $G$, such that all the off-diagonal block terms of $G$ are discarded, putting them equal to 0. We study the natural (Riemannian) gradient method which uses $G^*$ as the Riemannian metric. Note that $G^*$ is an approximation of $G$, tending to $G$ as $n \to \infty$ in the max-norm, but $(G^*)^{-1}$ is not a good approximation to $G^{-1}$. This is because the max-norm of a matrix is not sub-multiplicative. See the remark below.

Remark:  One should note that the approximately block-diagonal structure is not closed under matrix multiplication and inversion. Even though $G$ is approximately unit-wise block diagonal, its square is not, as is shown in the following. For simplicity, we assume that

$$ G = I + \frac{1}{\sqrt{n}} B, \tag{48} $$

where $I$ is an $n \times n$ identity matrix and $B$ is a random matrix of order 1, its elements $B_{ij}$ being independent random variables subject to $N(0, 1)$. Then

$$ G^2 = I + \frac{2B}{\sqrt{n}} + \frac{1}{n} B^2. \tag{49} $$

Here the $(i, j)$-th element of $B^2$ is

$$ \sum_k B_{ik} B_{kj}, \tag{50} $$

a sum of $n$ independent random variables. Hence, although its mean is 0, it is of order $\sqrt{n}$, so the elements of $\frac{1}{n}B^2$ are comparable to the off-diagonal elements of $G$ itself; the off-diagonal terms are no longer negligible. The same situation holds for $G^{-1}$.

We may also note that the Riemannian magnitude of a vector $\boldsymbol{a}$,

$$ \boldsymbol{a}^T G \boldsymbol{a} = \sum_{i,j} G_{ij}\, a_i a_j, \tag{51} $$

is not approximated by $\boldsymbol{a}^T G^* \boldsymbol{a}$, because we cannot neglect the summed contribution of the off-diagonal elements of $G$.
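A small numerical illustration of this point, using the model (48) from the remark above (assumed sizes, not from the paper): every discarded off-diagonal element is tiny, yet dropping all of them changes the quadratic form by a macroscopic amount.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
B = rng.standard_normal((n, n))
B = (B + B.T) / np.sqrt(2.0)             # symmetric, entries ~ N(0, 1)
G = np.eye(n) + B / np.sqrt(n)           # model (48)
Gstar = np.diag(np.diag(G))              # keep only the diagonal of G

max_off = np.abs(G - Gstar).max()        # largest single discarded element, O(1/sqrt(n))

a = rng.standard_normal((5, n))          # a few test vectors
gaps = np.einsum('kn,nm,km->k', a, G - Gstar, a)  # a^T G a - a^T G* a per vector
# each |gap| is typically O(sqrt(n)): orders of magnitude above any single element
print(max_off < 0.2, np.abs(gaps).max() > 1.0)   # -> True True
```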

Recently, Karakida, Akaho and Amari (2018) analyzed the characteristics of the original metric $G$ (not $G^*$). They evaluated the traces of $G$ and $G^2$ to analyze the distribution of the eigenvalues of $G$, which shows that the small off-diagonal elements cause a long-tailed distribution of eigenvalues. This elucidates the landscape of the error surface in a random deep net. In contrast, the present study focuses on the approximated metric $G^*$. It enables us to give an explicit form of the Fisher information matrix, directly applicable to natural gradient methods, as follows.

## 5 Unit-wise Fisher information

Because $G^*$ is unit-wise block diagonal, it is enough to calculate the Fisher information matrices of single units. We assume that the input vector $\boldsymbol{x}$ of a unit is subject to $N(\boldsymbol{0}, I)$. This does not hold in general. However, it holds approximately for a randomly connected resnet, as is shown in the next section.

Let us introduce new $(n+1)$-dimensional vectors for a single unit:

$$ \boldsymbol{w}^* = (\boldsymbol{w}, w_0), \tag{52} $$

$$ \boldsymbol{x}^* = (\boldsymbol{x}, x_0), \tag{53} $$

where $w_0 = b$ is the bias and $x_0 = 1$. Then, the output of the unit is $y = \varphi(u)$, $u = \boldsymbol{w}^* \cdot \boldsymbol{x}^*$. The Fisher information matrix is an $(n+1) \times (n+1)$ matrix written as

$$ G = \mathrm{E}_{\boldsymbol{x}}\bigl[ \{ \varphi'(u) \}^2\, \boldsymbol{x}^* \boldsymbol{x}^* \bigr]. \tag{54} $$

We introduce a set of new orthonormal basis vectors in the space of $\boldsymbol{x}^*$ as

$$ \boldsymbol{e}^*_0 = (0, \cdots, 0, 1), \tag{55} $$

$$ \boldsymbol{e}^*_i = (\boldsymbol{a}_i, 0), \quad i = 1, 2, \cdots, n-1, \tag{56} $$

$$ \boldsymbol{e}^*_n = \frac{1}{w} (\boldsymbol{w}, 0), \quad w^2 = \boldsymbol{w} \cdot \boldsymbol{w}, \tag{57} $$

where the $\boldsymbol{a}_i$, $i = 1, \cdots, n-1$, are arbitrary orthogonal unit vectors satisfying $\boldsymbol{a}_i \cdot \boldsymbol{a}_j = \delta_{ij}$ and $\boldsymbol{a}_i \cdot \boldsymbol{w} = 0$. That is, $\{ \boldsymbol{e}^*_i \}$ is a rotation of the original basis, and we put $\tilde{\boldsymbol{w}} = (\boldsymbol{w}, 0)$.

Here $\boldsymbol{e}^*_1, \cdots, \boldsymbol{e}^*_{n-1}$ are mutually orthogonal unit vectors and $\boldsymbol{e}^*_n$ is the unit vector in the direction of $\boldsymbol{w}$. Since $\boldsymbol{x}^*$ and $\boldsymbol{w}^*$ are represented in the new basis as

$$ \boldsymbol{x}^* = \sum_{i=0}^{n} x^*_i\, \boldsymbol{e}^*_i, \qquad \boldsymbol{w}^* = b\, \boldsymbol{e}^*_0 + w\, \boldsymbol{e}^*_n, \tag{58} $$

we have

$$ G = \mathrm{E}\bigl[ \{ \varphi'(\boldsymbol{w}^* \cdot \boldsymbol{x}^*) \}^2\, \boldsymbol{x}^* \boldsymbol{x}^* \bigr]. \tag{59} $$

Moreover, $x^*_1, \cdots, x^*_n$ are an orthogonal transformation of $x_1, \cdots, x_n$. Hence, the $x^*_i$, $i = 1, \cdots, n$, are jointly independent Gaussian, subject to $N(0, 1)$, and $x^*_0 = 1$.

In order to obtain $G$, let us put

$$ G = \sum_{i,j=0}^{n} A_{ij}\, \boldsymbol{e}^*_i \boldsymbol{e}^*_j \tag{60} $$

in the dyadic notation. Then, the coefficients $A_{ij}$ are given by

$$ A_{ij} = \boldsymbol{e}^*_i G \boldsymbol{e}^*_j, \tag{61} $$

which are the elements of $G$ in the coordinate system $\{ \boldsymbol{e}^*_i \}$. From $u = \boldsymbol{w}^* \cdot \boldsymbol{x}^* = w x^*_n + w_0$ and equation (59), we have

$$ A_{00} = \boldsymbol{e}^*_0 G \boldsymbol{e}^*_0 = \int \{ \varphi'(w x^*_n + w_0) \}^2\, Dx^*_n, \tag{62} $$

$$ A_{0n} = \boldsymbol{e}^*_0 G \boldsymbol{e}^*_n = \int x^*_n \{ \varphi'(w x^*_n + w_0) \}^2\, Dx^*_n, \tag{63} $$

$$ A_{nn} = \boldsymbol{e}^*_n G \boldsymbol{e}^*_n = \int (x^*_n)^2 \{ \varphi'(w x^*_n + w_0) \}^2\, Dx^*_n, \tag{64} $$

which depend on $w$ and $w_0$. We further have, for $i, j = 1, \cdots, n-1$,

$$ A_{ii} = \boldsymbol{e}^*_i G \boldsymbol{e}^*_i = A_{00}, \tag{65} $$

$$ A_{ij} = \boldsymbol{e}^*_i G \boldsymbol{e}^*_j = 0 \quad (j \neq i), \tag{66} $$

$$ A_{i0} = \boldsymbol{e}^*_i G \boldsymbol{e}^*_0 = 0 \quad (i \neq n). \tag{67} $$

From these, we obtain $G$ in the dyadic form

$$ G = A_{00} \sum_{i=0}^{n-1} \boldsymbol{e}^*_i \boldsymbol{e}^*_i + A_{nn}\, \boldsymbol{e}^*_n \boldsymbol{e}^*_n + A_{0n} \bigl( \boldsymbol{e}^*_0 \boldsymbol{e}^*_n + \boldsymbol{e}^*_n \boldsymbol{e}^*_0 \bigr). \tag{68} $$

The elements of $G$ in the basis $\bigl( \boldsymbol{e}^*_1, \cdots, \boldsymbol{e}^*_{n-1}, \boldsymbol{e}^*_n, \boldsymbol{e}^*_0 \bigr)$ are

$$ G = \begin{bmatrix} A_{00} & & & & \\ & \ddots & & \text{\Large 0} & \\ & & A_{00} & & \\ & \text{\Large 0} & & A_{nn} & A_{n0} \\ & & & A_{n0} & A_{00} \end{bmatrix}, \tag{69} $$

which shows that $G$ is a sum of a diagonal matrix and a rank 2 matrix.

The inverse of $G$ has the same block form as equations (68) and (69). Note that

$$ \sum_{i=0}^{n} \boldsymbol{e}^*_i \boldsymbol{e}^*_i = I, \tag{70} $$

$$ \boldsymbol{e}^*_n \boldsymbol{e}^*_n = \frac{1}{w^2}\, \tilde{\boldsymbol{w}} \tilde{\boldsymbol{w}}, \tag{71} $$

$$ \boldsymbol{e}^*_0 \boldsymbol{e}^*_n + \boldsymbol{e}^*_n \boldsymbol{e}^*_0 = \frac{1}{w} \bigl( \boldsymbol{e}^*_0 \tilde{\boldsymbol{w}} + \tilde{\boldsymbol{w}} \boldsymbol{e}^*_0 \bigr), \tag{72} $$

$$ \boldsymbol{e}^*_0 \boldsymbol{e}^*_0 = \mathrm{diag}(0, \cdots, 0, 1), \tag{73} $$

where $\tilde{\boldsymbol{w}} = (\boldsymbol{w}, 0)$.

By using these relations, $G$ is expressed in the original basis as

$$ G = A_{00} I + \frac{A_{nn} - A_{00}}{w^2}\, \tilde{\boldsymbol{w}} \tilde{\boldsymbol{w}} + \frac{A_{0n}}{w} \bigl( \boldsymbol{e}^*_0 \tilde{\boldsymbol{w}} + \tilde{\boldsymbol{w}} \boldsymbol{e}^*_0 \bigr). \tag{74} $$

The inverse of $G$ has the same form, so we have an explicit form of $G^{-1}$,

$$ G^{-1} = \bar{A}_{00} I + \frac{X}{w^2}\, \tilde{\boldsymbol{w}} \tilde{\boldsymbol{w}} + \frac{Y}{w} \bigl( \boldsymbol{e}^*_0 \tilde{\boldsymbol{w}} + \tilde{\boldsymbol{w}} \boldsymbol{e}^*_0 \bigr) + Z\, \boldsymbol{e}^*_0 \boldsymbol{e}^*_0, \tag{76} $$

where

$$ \bar{A}_{00} = \frac{1}{A_{00}}, \tag{77} $$

$$ X = \frac{A_{00}}{D} - \bar{A}_{00}, \qquad Y = -\frac{A_{n0}}{D}, \qquad Z = \frac{A_{nn}}{D} - \bar{A}_{00}, \tag{78} $$

$$ D = A_{00} A_{nn} - A_{n0}^2. \tag{79} $$
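The closed form (76)–(79) can be verified directly against equation (74): build $G$ and $G^{-1}$ from the same coefficients and confirm that their product is the identity. The sketch below uses hypothetical parameter values, a ReLU activation, and Monte Carlo estimates of $A_{00}, A_{0n}, A_{nn}$ from equations (62)–(64); it is an illustration of the algebra, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                    # input dimension of the unit (illustrative)
w = rng.standard_normal(n)               # hypothetical weight vector
w0 = 0.3                                 # hypothetical bias
wn = np.linalg.norm(w)                   # w = |w| as in eq. (57)

# Monte Carlo estimates of eqs. (62)-(64) with ReLU: phi'(u)^2 = 1_{u > 0}
v = rng.standard_normal(200_000)         # samples of x*_n ~ N(0, 1)
g2 = (wn * v + w0 > 0).astype(float)
A00, A0n, Ann = g2.mean(), (v * g2).mean(), (v**2 * g2).mean()

# G in the original basis, eq. (74); the bias coordinate is last, as in eq. (55)
wt = np.append(w, 0.0)                   # tilde-w = (w, 0)
e0 = np.zeros(n + 1); e0[-1] = 1.0       # e*_0
I = np.eye(n + 1)
G = (A00 * I + (Ann - A00) / wn**2 * np.outer(wt, wt)
     + A0n / wn * (np.outer(e0, wt) + np.outer(wt, e0)))

# explicit inverse, eqs. (76)-(79): no numerical inversion anywhere
D = A00 * Ann - A0n**2
Ab, X, Y, Z = 1 / A00, A00 / D - 1 / A00, -A0n / D, Ann / D - 1 / A00
Ginv = (Ab * I + X / wn**2 * np.outer(wt, wt)
        + Y / wn * (np.outer(e0, wt) + np.outer(wt, e0)) + Z * np.outer(e0, e0))

print(np.allclose(G @ Ginv, I))          # -> True: the closed form inverts G
```

The check passes for any coefficients with $A_{00} > 0$ and $D \neq 0$, since the two dyadic expressions invert each other exactly.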

By using the above equations, $G^{-1}$ is obtained explicitly, so we do not need to compute the back-propagated $G$ and invert it numerically for the natural gradient update of $\boldsymbol{w}^*$.

The natural gradient method for each unit is written, by using the back-propagated error $e$, as

$$ \Delta \boldsymbol{w}^* = -\eta\, e\, G^{-1} \boldsymbol{x}^*, \tag{80} $$

which splits as

$$ \Delta \boldsymbol{w} = -\eta e \Bigl[ \bar{A}_{00}\, \boldsymbol{x} + \Bigl( \frac{X}{w^2}\, \boldsymbol{w} \cdot \boldsymbol{x} + \frac{Y}{w} \Bigr) \boldsymbol{w} \Bigr], \tag{81} $$

$$ \Delta w_0 = -\eta e \Bigl( \bar{A}_{00} + \frac{\boldsymbol{w} \cdot \boldsymbol{x}}{w}\, Y + Z \Bigr). \tag{82} $$
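The split update (81)–(82) can be checked against the matrix form (80). The sketch below uses hypothetical values for $A_{00}, A_{0n}, A_{nn}$, the back-propagated error $e$, and the learning rate $\eta$; it confirms that the componentwise update coincides with $-\eta e\, G^{-1}\boldsymbol{x}^*$ built from equation (76).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
w = rng.standard_normal(n)               # hypothetical unit weights
w0 = -0.2                                # hypothetical bias
x = rng.standard_normal(n)               # current input to the unit
e, eta = 0.7, 0.1                        # assumed back-propagated error and step size
wn = np.linalg.norm(w)

# toy stand-ins for A00, A0n, Ann of eqs. (62)-(64)
A00, A0n, Ann = 0.5, 0.1, 0.8
D = A00 * Ann - A0n**2
Ab, X, Y, Z = 1 / A00, A00 / D - 1 / A00, -A0n / D, Ann / D - 1 / A00

# unit-wise natural gradient, eqs. (81)-(82): no matrix inversion anywhere
dw = -eta * e * (Ab * x + (X / wn**2 * (w @ x) + Y / wn) * w)
dw0 = -eta * e * (Ab + (w @ x) / wn * Y + Z)

# cross-check against eq. (80), dw* = -eta e G^{-1} x*, with G^{-1} from eq. (76)
wt = np.append(w, 0.0); e0 = np.zeros(n + 1); e0[-1] = 1.0
Ginv = (Ab * np.eye(n + 1) + X / wn**2 * np.outer(wt, wt)
        + Y / wn * (np.outer(e0, wt) + np.outer(wt, e0)) + Z * np.outer(e0, e0))
full = -eta * e * (Ginv @ np.append(x, 1.0))
print(np.allclose(full, np.append(dw, dw0)))   # -> True
```

The update costs only $O(n)$ per unit: two inner products and a few scalar coefficients.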

The back-propagated error is calculated as follows. Let $e_{i_m}$ be the back-propagated error of neuron $i_m$ in layer $m$. It is given by the well-known error backpropagation as

$$ e_{i_m} = \sum_{i_L, i_{m+1}} e_{i_L}\, B_{i_L i_{m+1}}\, \varphi'(u_{i_{m+1}})\, w_{i_{m+1} i_m}, \tag{83} $$

$$ {}^L\boldsymbol{e} = \boldsymbol{y} - {}^L\boldsymbol{x}. \tag{84} $$

We can implement the unit-wise natural gradient method using equations (81) and (82) without calculating $G^{-1}$ numerically. However, the unit-wise $G$ is derived under the condition that the input to each neuron is subject to a zero-mean Gaussian distribution. This does not hold in general, so we need to adjust the input by a linear transformation. We will see that a residual network automatically makes the input to each layer subject to a zero-mean Gaussian distribution.

Obviously, the input is no longer random Gaussian with mean 0 after learning. However, since the unit-wise natural gradient proposed here is computationally so cheap, it is worth trying in practical applications even after learning has proceeded.

Except for a rank 1 term and the bias terms, $G^{-1}$ is a diagonal matrix. In other words, as is seen in equation (72), it is diagonal except for the row and column corresponding to the bias terms and the rank 1 term $\tilde{\boldsymbol{w}}\tilde{\boldsymbol{w}}$. Except for the rank 1 term, it has the same structure as that of the quasi-diagonal matrix of Ollivier (2015), justifying the quasi-diagonal method.

## 6 Fisher information of residual network

The residual network has direct paths from input to output in each layer. We treat the following block of layer $l$: the layer transforms input ${}^{l-1}\boldsymbol{x}$ to output ${}^l\boldsymbol{x}$ by

$$ {}^l x_i = \sum_j {}^l v_{ij}\, \varphi({}^l u_j) + \alpha\, {}^{l-1}x_i, \tag{85} $$

$$ {}^l u_j = \sum_k {}^l w_{jk}\, {}^{l-1}x_k + {}^l b_j \tag{86} $$

(see Figure 2).

Here $\alpha$ is a decay factor, $0 < \alpha \le 1$ ($\alpha = 1$ is conventionally used), and the ${}^l v_{ij}$ are randomly generated iid Gaussian variables subject to $N(0, \sigma_v^2 / n)$.

We show how the activity ${}^l A$ develops in a residual network (Yang and Schoenholz, 2017). We easily have the recursive relation,

$$ {}^l A = \frac{1}{n} \sum_i \Bigl( \sum_j v_{ij}\, \varphi({}^l u_j) + \alpha\, {}^{l-1}x_i \Bigr) \Bigl( \sum_k v_{ik}\, \varphi({}^l u_k) + \alpha\, {}^{l-1}x_i \Bigr) \tag{87} $$