In modern deep learning, multilayer neural networks are usually trained by the stochastic gradient descent method (see Amari, 1967, for one of the earliest proposals of stochastic gradient descent for training multilayer networks). The parameter space of multilayer networks forms a Riemannian space equipped with the Fisher information metric. Thus, instead of the usual gradient descent method, the natural gradient or Riemannian gradient method, which takes account of the geometric structure of the Riemannian space, is more effective for learning (Amari, 1998). However, it has been difficult to apply natural gradient descent because it requires inverting the Fisher information matrix, which is computationally heavy. Many approximation methods that reduce the computational cost have therefore been proposed (see Pascanu & Bengio, 2013; Grosse & Martens, 2016; Martens, 2017).
To resolve the computational difficulty of the natural gradient, we analyze the Fisher information matrix $G$ of a random network, where the connection weights and biases are randomly assigned, by using the mean field approximation (see also our accompanying paper, Amari, Karakida and Oizumi, 2018, for the analysis of feedforward paths). We prove that, when the number $n$ of neural units in each layer is sufficiently large, the subblocks of the Fisher information matrix corresponding to different layers are of order $1/\sqrt{n}$, which is negligibly small. Thus, $G$ is approximated by a layer-wise diagonalized matrix. Furthermore, within the same layer, the subblocks among different units are also of order $1/\sqrt{n}$.
This gives a justification for the approximated natural gradient method proposed by Kurita (1994) and studied in detail by Ollivier (2015) and Marceau-Caron and Ollivier (2016), where the unit-wise diagonalized $G$ was used. We further study the Fisher information matrix of a unit, that is, a simple perceptron, for the purpose of implementing unit-wise natural gradient learning. We obtain an explicit form of the Fisher information matrix and its inverse under the assumption that inputs are subject to the standard uncorrelated Gaussian distribution with mean 0. The unit-wise natural gradient is thus formulated explicitly, without numerical matrix inversion, so that natural gradient learning can be realized without the burden of heavy computation. The results justify the quasi-diagonal approximation of the Fisher information matrix proposed by Ollivier (2015), although our results are not exactly the same as Ollivier's. Our approximation method is justified only for random networks under the mean field assumption. However, it is expected to be effective for training actual deep networks, considering the good performance shown in Ollivier (2015) and Marceau-Caron and Ollivier (2016).
The results can be extended to residual deep networks with ReLU. We show that the inputs to each layer are approximately subject to 0-mean independent Gaussian distributions in the case of a resnet, because of random linear transformations after nonlinear transformations in all layers. Therefore, our method would be particularly effective when residual networks are used.
To understand the structure of the Fisher information matrix, refer to Karakida, Akaho and Amari (2018), which analyzes the characteristics (the distribution of its eigenvalues) of the Fisher information matrix of a random net for the first time.
2 Deep neural networks
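We consider a deep neural network consisting of $L$ layers. Let $x^{l-1}$ and $x^{l}$ be the input vector to the $l$-th layer and the output vector of the $l$-th layer, respectively (see Figure 1).
The input-output relation of the $l$-th layer is written as
$$x_i^{l} = \varphi\left( \sum_{j} w_{ij}^{l} x_j^{l-1} + b_i^{l} \right),$$
where $\varphi$ is the activation function. Let $n_l$ be the number of neurons in the $l$-th layer. We assume that $n_0, \ldots, n_{L-1}$ are large, but the number of neurons in the final layer, $n_L$, can be small. Even $n_L = 1$ is allowed. The weights $w_{ij}^{l}$ and biases $b_i^{l}$ are independent Gaussian random variables with mean 0 and variances $\sigma_w^2 / n_{l-1}$ and $\sigma_b^2$, respectively. Note that each weight is a random variable of order $1/\sqrt{n_{l-1}}$, but the weighted sum is of order 1.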
We briefly recapitulate the feedforward analysis of input signals given in Poole et al. (2016) and Amari, Karakida and Oizumi (2018), to introduce the activity $A$ and the enlargement factor $\chi$. They also play a role in the feedback analysis for obtaining the Fisher information (Schoenholz et al., 2016; Karakida, Akaho and Amari, 2018).
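Let us put
$$u_i^{l} = \sum_{j} w_{ij}^{l} x_j^{l-1} + b_i^{l}.$$
Given $x^{l-1}$, the $u_i^{l}$ are independently and identically distributed (iid) Gaussian random variables with mean 0 and variance
$$\tau_l^2 = \sigma_w^2 A^{l-1} + \sigma_b^2, \qquad A^{l-1} = \frac{1}{n_{l-1}} \sum_{j} \left( x_j^{l-1} \right)^2,$$
where $\sigma_w^2 / n_{l-1}$ and $\sigma_b^2$ are the weight and bias variances and $A^{l-1}$ is the total activity of input $x^{l-1}$.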
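The variance statement can be checked numerically. The following is a minimal sketch, assuming weights drawn from $N(0, \sigma_w^2/n)$ and biases from $N(0, \sigma_b^2)$; the function name `preactivation_variance` and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def preactivation_variance(x, sigma_w=1.5, sigma_b=0.5, n_out=20000, seed=0):
    """Compare the empirical variance of u = W x + b with sigma_w^2 * A + sigma_b^2."""
    rng = np.random.default_rng(seed)
    n_in = x.shape[0]
    # Assumption: each weight has standard deviation sigma_w / sqrt(n_in),
    # so a single weight is of order 1/sqrt(n) while the weighted sum is of order 1.
    W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
    b = rng.normal(0.0, sigma_b, size=n_out)
    u = W @ x + b                      # one pre-activation per output unit
    activity = np.mean(x ** 2)         # total activity A of the fixed input x
    predicted = sigma_w ** 2 * activity + sigma_b ** 2
    return float(u.var()), float(predicted)

# A fixed, arbitrary input vector; only its activity matters for the variance.
x = np.random.default_rng(1).uniform(-1.0, 1.0, size=200)
empirical, predicted = preactivation_variance(x)
print(f"empirical variance {empirical:.4f}, predicted {predicted:.4f}")
```

The input is held fixed and only the weights and biases are resampled, which matches the conditioning on the input in the statement above.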
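It is easy to show how $A^{l}$ develops across the layers. Since the $x_i^{l} = \varphi(u_i^{l})$ are iid when $x^{l-1}$ is fixed, the law of large numbers guarantees that their sum is replaced by the expectation when $n_l$ is large. Putting $u_i^{l} = \tau_l v$, where $v$ is a standard Gaussian variable, we have a recursive equation,
$$A^{l} = \int \varphi(\tau_l v)^2 \, Dv,$$
where $\tau_l$ in equation (3) depends on $A^{l-1}$ and $Dv = (2\pi)^{-1/2} e^{-v^2/2} \, dv$.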
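The Jacobian $B^{l} = \partial x^{l} / \partial x^{l-1}$ of the layer map is a matrix whose $(i, j)$-th element is given by
$$B^{l}_{ij} = \varphi'(u_i^{l}) \, w_{ij}^{l}.$$
It is a random variable of order $1/\sqrt{n_{l-1}}$. Here and hereafter, we denote $B^{l}_{i_l i_{l-1}}$ by $B^{i_l}_{i_{l-1}}$, eliminating superfix $l$ and using $i_l$ and $i_{l-1}$ instead of $i$ and $j$. These index notations are convenient for showing that the quantities indexed by $i_l$ belong to layer $l$.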
We show how the square of the Euclidean length of ,
is related to that of . This relation can be seen from
For any pair and , random variables are iid for all when is fixed, so the law of large numbers guarantees that
where is the expectation with respect to the weights and biases and represents small terms of stochastic order . We use the mean field property that has the self-averaging property and the average of the product of and in equation (12) splits as
This is justified in appendix I. By putting
we have from equation (11)
Here which depends on , is the enlargement factor showing how is enlarged or reduced across layer .
From the recursive relation (15), we have
Assume that all the $\chi_l$ are equal to a common value $\chi$. Then, it gives the Lyapunov exponent of the dynamics of equation (15). When $\chi$ is larger than 1, the length diverges as the layers proceed, whereas when it is smaller than 1, the length decays to 0. The dynamics is chaotic when $\chi > 1$ (Poole et al., 2016). Interesting information processing takes place at the edge of chaos, where $\chi$ is nearly equal to 1 (Yang & Schoenholz, 2017). We are interested in the case where the product of the $\chi_l$ is nearly equal to 1 but the individual $\chi_l$ are distributed, some being smaller than 1 and the others larger than 1.
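The layer-to-layer behaviour of the activity and the enlargement factor can be sketched numerically. The code below is an illustration under stated assumptions rather than the analysis of the paper: it takes $\varphi = \tanh$, assumes the mean field forms $A^{l} = E[\varphi(\tau_l v)^2]$ and $\chi_l = \sigma_w^2 E[\varphi'(\tau_l v)^2]$ with $\tau_l^2 = \sigma_w^2 A^{l-1} + \sigma_b^2$, and evaluates Gaussian expectations on a fixed grid. The names `gauss_mean` and `propagate` are hypothetical.

```python
import numpy as np

# Grid quadrature for expectations over a standard Gaussian variable v.
V = np.linspace(-8.0, 8.0, 4001)
PDF = np.exp(-0.5 * V ** 2) / np.sqrt(2.0 * np.pi)
DV = V[1] - V[0]

def gauss_mean(values):
    """Approximate E[f(v)] for v ~ N(0, 1), given f evaluated on the grid V."""
    return float(np.sum(values * PDF) * DV)

def propagate(activity, sigma_w, sigma_b, depth):
    """Iterate the assumed mean field recursion for a tanh network.

    Returns the final activity and the list of enlargement factors chi_l.
    """
    chis = []
    for _ in range(depth):
        tau = np.sqrt(sigma_w ** 2 * activity + sigma_b ** 2)
        out = np.tanh(tau * V)
        # chi_l = sigma_w^2 * E[phi'(tau v)^2], with phi' = 1 - tanh^2
        chis.append(sigma_w ** 2 * gauss_mean((1.0 - out ** 2) ** 2))
        # A^l = E[phi(tau v)^2]
        activity = gauss_mean(out ** 2)
    return activity, chis

_, chis_ordered = propagate(1.0, sigma_w=0.5, sigma_b=0.3, depth=30)
_, chis_chaotic = propagate(1.0, sigma_w=3.0, sigma_b=0.3, depth=30)
print("small sigma_w: last chi =", round(chis_ordered[-1], 3))
print("large sigma_w: last chi =", round(chis_chaotic[-1], 3))
```

With small weight variance every enlargement factor stays below 1, so perturbations shrink with depth; with large weight variance the factor settles above 1, which corresponds to the chaotic regime described above.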
3 Fisher information of deep networks and natural gradient learning
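We study a regression model in which the output of layer $L$, $x^{L}$, is observed with additive noise,
$$y = x^{L} + \varepsilon,$$
where $\varepsilon$ is a multivariate Gaussian random variable with mean 0 and identity covariance matrix $I$. Then the probability of $y$ given input $x$ is
$$p(y \mid x; W) = \frac{1}{(2\pi)^{n_L/2}} \exp\left\{ -\frac{1}{2} \left| y - x^{L} \right|^2 \right\},$$
where $W$ consists of all the parameters $w^{l}$ and $b^{l}$, $l = 1, \ldots, L$. The Fisher information matrix is given by
$$G = E_{x,y}\left[ \left( \partial_W \log p \right) \left( \partial_W \log p \right) \right],$$
where $E_{x,y}$ denotes the expectation with respect to randomly generated input $x$ and resultant $y$, and $\partial_W$ is the gradient with respect to $W$. By using error vector $\varepsilon$ in (19), we have
$$\partial_W \log p = \sum_{i_L} \varepsilon_{i_L} \, \partial_W x_{i_L}.$$
For fixed $x$, expectation with respect to $y$ is replaced by that of $\varepsilon$, where $E[\varepsilon \varepsilon] = I$. Hence, (21) is given by
$$G = E_x\left[ \sum_{i_L} \left( \partial_W x_{i_L} \right) \left( \partial_W x_{i_L} \right) \right].$$
Here, we use the dyadic or tensor notation that $ab$ implies a matrix $ab^{\top}$, instead of vector-matrix notation for column vectors.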
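The step from the score-based definition to the output-gradient form can be checked by Monte Carlo. The sketch below assumes a deliberately tiny model, a single tanh unit without bias and with unit-variance Gaussian noise, which is not the deep network of the paper; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 200_000                      # parameter dimension, number of samples

w = rng.normal(size=n) / np.sqrt(n)    # parameters of a single unit f(x) = tanh(w . x)
x = rng.normal(size=(m, n))            # randomly generated inputs
u = x @ w

# Gradient of the model output with respect to w, one row per sample.
grad_f = (1.0 - np.tanh(u) ** 2)[:, None] * x

# Score of the Gaussian regression model: d/dw log p = (y - f) * grad_f = eps * grad_f.
eps = rng.normal(size=m)
score = eps[:, None] * grad_f

G_from_score = score.T @ score / m     # E[(d log p)(d log p)^T]
G_from_output = grad_f.T @ grad_f / m  # E_x[(d f)(d f)^T]

max_gap = float(np.max(np.abs(G_from_score - G_from_output)))
print("largest entrywise difference:", max_gap)
```

The two estimates agree up to sampling error because the noise has identity covariance and is independent of the input, so it averages out of the outer product.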
Online learning is a method of modifying the current $W$ such that the current loss
$$l = \frac{1}{2} \left| y - x^{L} \right|^2$$
decreases, where $(x, y)$ is the current input-output pair. The stochastic gradient descent method (proposed in Amari, 1967) uses the gradient of $l$ to modify $W$,
$$\Delta W = -\eta \, \partial_W l,$$
where $\eta$ is the learning rate.
Historically, the first simulation results applied to four-layer networks for pattern classification were given in a Japanese book (Amari, 1968). The minibatch method uses the average of $\partial_W l$ over minibatch samples.
The negative gradient is a direction that decreases the current loss, but it is not the steepest direction in a Riemannian manifold. The true steepest direction is given by
$$\tilde{\nabla} l = G^{-1} \partial_W l,$$
which is called the natural or Riemannian gradient (Amari, 1998). The natural gradient method with learning rate $\eta$ is given by
$$\Delta W = -\eta \, \tilde{\nabla} l.$$
It is known to be Fisher efficient for estimating $W$ (Amari, 1998). Although it gives excellent performance, the inversion of $G$ is computationally very difficult.
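The difference between the two update rules is easiest to see on a model whose Fisher information matrix is known exactly. The sketch below assumes a linear regression model with unit noise variance, for which $G = E[x x^{\top}]$; this toy setting and every name in it are illustrative assumptions, not the paper's deep network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2000

# Badly scaled inputs make the loss surface strongly anisotropic.
scales = np.array([10.0, 3.0, 1.0, 0.3, 0.1])
X = rng.normal(size=(m, n)) * scales
w_true = rng.normal(size=n)
y = X @ w_true + rng.normal(size=m)     # unit-variance Gaussian noise

def loss_and_grad(w):
    """Batch loss 0.5 * mean squared residual and its gradient."""
    r = X @ w - y
    return 0.5 * float(np.mean(r ** 2)), X.T @ r / m

# Fisher information matrix of the linear Gaussian model, estimated on the batch.
G = X.T @ X / m

w_start = np.zeros(n)
_, g = loss_and_grad(w_start)

# Plain gradient step: the step size is limited by the largest eigenvalue of G.
eta_plain = 1.0 / np.linalg.eigvalsh(G)[-1]
w_plain = w_start - eta_plain * g

# Natural gradient step: solve G d = g instead of forming the inverse explicitly.
w_natural = w_start - 1.0 * np.linalg.solve(G, g)

loss_plain, _ = loss_and_grad(w_plain)
loss_natural, _ = loss_and_grad(w_natural)
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print("loss after plain step:  ", loss_plain)
print("loss after natural step:", loss_natural)
```

For this quadratic loss a single natural gradient step with unit learning rate lands on the least-squares solution, whereas the plain step only corrects the best-scaled direction. The linear solve costs time cubic in the number of parameters, which is the burden that becomes prohibitive for deep networks.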
To avoid this difficulty, the quasi-diagonal natural gradient method was proposed in Ollivier (2015) and was shown to be very efficient in Marceau-Caron and Ollivier (2016). A recent proposal (Ollivier, 2017) looks very promising for realizing natural gradient learning. The present paper analyzes the structure of the Fisher information matrix, which gives a justification of the quasi-diagonal natural gradient method. By using it, we propose a new method of realizing natural gradient learning without the burden of inverting $G$.
4 Structure of Fisher information matrix
To calculate elements of , we use a new notation combining connection weights and bias into one vector,
For the -th unit of layer , it is
For , we have the recursive relation
Starting from and using
which is a product of matrices. The elements of are denoted by .
We calculate the Fisher information given in equation (23). The elements of with respect to layers and are written as
where denotes the inner product with respect to . The elements of are, for fixed ,
Hence, (34) is written in the component form as
We first consider the case , that is, two neurons are in the same layer . The following lemma is useful for evaluating .
Domino Lemma We assume that all are of order .
We first prove the case with . We have
When , this is a sum of iid random variables , when input is fixed. Therefore, the law of large numbers guarantees that, as goes to infinity, their sum converges to the expectation,
under the mean field approximation for any . For fixed , the right-hand side of equation (38) is also a sum of iid variables with mean 0. Hence, its mean is 0. We evaluate its variance, proving that the variance is
which is of order , because is of order . Hence we have
When , we repeat the process . Then in the left-hand side of equation (37) propagates to give like the domino effect, leaving multiplicative factors . This proves the theorem. ∎
Remark: The domino lemma holds irrespective of or , provided they are large. However, matrix is not of full rank, its rank being .
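A single-layer numerical check can make the law-of-large-numbers step in the proof concrete. The sketch below rests on assumptions not confirmed by the text above: it takes the one-layer Jacobian to be $B_{ij} = \varphi'(u_i) w_{ij}$ with $\varphi = \tanh$, equal layer widths, and $\chi = \sigma_w^2 E[\varphi'(u)^2]$, and it tests whether $\sum_i B_{ij} B_{ij'}$ is close to $\chi \delta_{jj'}$ with off-diagonal terms of order $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 800                                  # width of both layers (assumed equal)
sigma_w, sigma_b = 1.0, 0.3

x = np.tanh(rng.normal(size=n))          # a fixed input with order-1 activity
W = rng.normal(0.0, sigma_w / np.sqrt(n), size=(n, n))
b = rng.normal(0.0, sigma_b, size=n)
u = W @ x + b
dphi = 1.0 - np.tanh(u) ** 2             # phi'(u_i) for phi = tanh

B = dphi[:, None] * W                    # assumed Jacobian: B[i, j] = phi'(u_i) * w_ij
M = B.T @ B                              # M[j, j'] = sum_i B[i, j] * B[i, j']

chi = sigma_w ** 2 * float(np.mean(dphi ** 2))   # empirical enlargement factor
diag_mean = float(np.mean(np.diag(M)))
off = M - np.diag(np.diag(M))
off_rms = float(np.sqrt(np.sum(off ** 2) / (n * (n - 1))))

print("chi:", chi, " mean diagonal:", diag_mean)
print("rms off-diagonal:", off_rms, " 1/sqrt(n):", 1.0 / np.sqrt(n))
```

Increasing `n` should shrink the off-diagonal root mean square roughly like $1/\sqrt{n}$ while the diagonal stays near `chi`.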
By using this result, we evaluate off-diagonal blocks of under the mean field approximation (39).
The Fisher information matrix $G$ is unit-wise block diagonal except for terms of stochastic order $O_p(1/\sqrt{n})$.
We first calculate the off-diagonal blocks of the Fisher information matrix within the same layers. The Fisher information submatrix within layer is
which are elements of submatrix of corresponding to neurons and both in the same layer . By the domino lemma, we have
This shows that the submatrix is unit-wise block diagonal: that is, the blocks of different neurons and are 0 except for terms of order .
We next study the blocks of different layers and ,
By using the domino lemma, is written as
and hence it is of order . In general, is a sum of zero-mean iid random variables with variance of order . Hence, its mean is 0 and variance is of order , proving that (46) is of order . ∎
Motivated by this, we define a new metric $G^{*}$ as an approximation of $G$, such that all the off-diagonal block terms of $G$ are discarded, putting them equal to 0. We study the natural (Riemannian) gradient method which uses $G^{*}$ as the Riemannian metric. Note that $G^{*}$ is an approximation of $G$, tending to $G$ for $n \to \infty$ in the max-norm, but $G^{*-1}$ is not a good approximation to $G^{-1}$. This is because the max-norm of a matrix is not sub-multiplicative. See the remark below.
Remark: One should note that the approximately block diagonal structure is not closed in the matrix multiplication and inversion. Even though is approximately unitwise block diagonal, its square is not, as is shown in the following. For simplicity, we assume that
is an identity matrix and
is a random matrix of order 1, being independent random variables subject to . Then
Here the -th element of is
a sum of independent random variables. Hence, although its mean is 0, it is of order 1. Hence, the off-diagonal elements are no longer small. The same situation holds for .
We may also note that the Riemannian magnitude of vector ,
is not approximated by , because we cannot neglect the off-diagonal elements of .
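The point that entrywise-small off-diagonal terms can still change a quadratic form can be illustrated with a synthetic matrix. The sketch below is not the Fisher information matrix of a network: it assumes a symmetric perturbation of the identity whose entries are of order $1/\sqrt{n}$, and the names `G_full` and `G_star` are hypothetical stand-ins for a metric and its diagonal approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Symmetric perturbation with entries of order 0.4 / sqrt(n).
B = rng.normal(size=(n, n))
E = 0.4 * (B + B.T) / np.sqrt(2.0 * n)

G_full = np.eye(n) + E                   # synthetic metric with small off-diagonal entries
G_star = np.diag(np.diag(G_full))        # approximation that discards them
D = G_full - G_star                      # the discarded part

max_entry = float(np.max(np.abs(D)))     # every discarded entry is small
min_eig = float(np.linalg.eigvalsh(G_full)[0])   # G_full is still positive definite

# Choose the unit vector along which the discarded part matters most.
eigvals, eigvecs = np.linalg.eigh(D)
a = eigvecs[:, -1]
gap = float(a @ G_full @ a - a @ G_star @ a)

print("largest discarded entry:", max_entry)
print("difference of quadratic forms for a unit vector:", gap)
```

Each discarded entry shrinks as `n` grows, yet the difference of the two quadratic forms stays of order 1 because there are of order $n^2$ such entries acting together.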
Recently, Karakida, Akaho, and Amari (2018) analyzed characteristics of the original metric $G$ (not $G^{*}$). They evaluated the traces of $G$ and $G^{2}$ to analyze the distribution of eigenvalues of $G$, which shows that the small off-diagonal elements cause a long-tailed distribution of eigenvalues. This elucidates the landscape of the error surface in a random deep net. In contrast, the present study focuses on the approximated metric $G^{*}$. It enables us to give an explicit form of the Fisher information matrix, directly applicable to natural gradient methods, as follows.
5 Unit-wise Fisher information
Because $G^{*}$ is unit-wise block-diagonal, it is enough to calculate the Fisher information matrices of single units. We assume that the input vector $x$ of a unit is subject to $N(0, I)$. This does not hold in general. However, it holds approximately for a randomly connected resnet, as is shown in the next section.
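Let us introduce new $(n+1)$-dimensional vectors for a single unit:
$$\tilde{w} = (w_0, w), \qquad \tilde{x} = (x_0, x),$$
where $w_0 = b$ and $x_0 = 1$. Then, the output of the unit is $\varphi(u)$, $u = \tilde{w} \cdot \tilde{x}$. The Fisher information matrix is an $(n+1) \times (n+1)$ matrix written as
$$G = E_x\left[ \varphi'(u)^2 \, \tilde{x} \tilde{x} \right].$$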
We introduce a set of new orthonormal basis vectors in the space of as
where , are arbitrary orthogonal unit vectors, satisfying , . That is, is a rotation of and we put .
Here are mutually orthogonal unit vectors and is the unit vector in the direction of . Since and are represented in the new basis as
Moreover, are orthogonal transformation of . Hence, , are jointly independent Gaussian, subject to , and .
In order to obtain , let us put
in the dyadic notation. Then, the coefficients are given by
which are elements of in the coordinate system . From and equation (59), we have
which depend on . We further have, for ,
From these, we obtain in the dyadic form
The elements of in the basis are
which shows that is a sum of a diagonal matrix and a rank 2 matrix.
By using these relations, is expressed in the original basis as
The inverse of also has the same form, so we have an explicit form of
By using the above equations, is obtained explicitly, so we do not need to calculate back-propagated and its inverse for the natural gradient update of .
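The structure described in this section can be reproduced numerically. The sketch below is written under explicit assumptions and does not use the paper's equations verbatim: the input is $x \sim N(0, I)$, the bias is stored as component 0, $\varphi = \tanh$, and the coefficients `a0`, `a1`, `a2` are hypothetical names for the Gaussian moments $E[\varphi'(u)^2 v^k]$, $k = 0, 1, 2$, with $v$ the input component along $w$. The closed form is compared with a Monte Carlo estimate, and the inverse needs only a $2 \times 2$ matrix inversion.

```python
import numpy as np

V = np.linspace(-8.0, 8.0, 4001)                  # grid for 1-D Gaussian expectations
PDF = np.exp(-0.5 * V ** 2) / np.sqrt(2.0 * np.pi)
DV = V[1] - V[0]

def dtanh(u):
    return 1.0 - np.tanh(u) ** 2

def unit_fisher(w, w0, phi_prime=dtanh):
    """Fisher matrix of one unit and its inverse, assuming x ~ N(0, I)."""
    n = w.shape[0]
    norm_w = np.linalg.norm(w)
    e = w / norm_w                                 # unit vector along w
    g2 = phi_prime(norm_w * V + w0) ** 2           # phi'(u)^2 with u = |w| v + w0
    a0 = np.sum(g2 * PDF) * DV
    a1 = np.sum(g2 * V * PDF) * DV
    a2 = np.sum(g2 * V ** 2 * PDF) * DV
    # Two special directions: the bias axis and the direction of w.
    U = np.zeros((n + 1, 2))
    U[0, 0] = 1.0
    U[1:, 1] = e
    M = np.array([[a0, a1], [a1, a2]])
    P_rest = np.eye(n + 1) - U @ U.T               # projector onto the other directions
    G = a0 * P_rest + U @ M @ U.T
    G_inv = P_rest / a0 + U @ np.linalg.inv(M) @ U.T   # only a 2x2 inversion
    return G, G_inv

rng = np.random.default_rng(0)
n, m = 6, 200_000
w = rng.normal(size=n) / np.sqrt(n)
w0 = 0.3

# Monte Carlo estimate of E[phi'(u)^2 x_tilde x_tilde^T] with x_tilde = (1, x).
x = rng.normal(size=(m, n))
x_tilde = np.hstack([np.ones((m, 1)), x])
g2 = dtanh(x @ w + w0) ** 2
G_mc = (x_tilde * g2[:, None]).T @ x_tilde / m

G, G_inv = unit_fisher(w, w0)
mc_error = float(np.max(np.abs(G_mc - G)))
inv_error = float(np.max(np.abs(G_inv @ G - np.eye(n + 1))))
print("closed form vs Monte Carlo:", mc_error)
print("residual of G_inv @ G - I:", inv_error)
```

The dense projector is built here only for readability. Applied to a gradient vector, the same inverse reduces to a scaling by `1 / a0` plus a correction in the two special directions, which costs time linear in the number of inputs.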
The natural gradient method for each unit is written by using the back-propagated error as
which splits as
The back-propagated error is calculated as follows. Let be the back-propagated error of neuron in layer
. It is given by the well-known error backpropagation as
We can implement the unit-wise natural gradient method using equations (81) and (82) without calculating . However, the unit-wise is derived under the condition that the input to each neuron is subject to a 0-mean Gaussian distribution. This does not hold in general, so we need to adjust by a linear transformation. We will see that a residual network automatically makes the input to each layer be subject to a 0-mean Gaussian distribution.
Obviously, is no longer random Gaussian with mean 0 after learning. However, since the unit-wise natural gradient proposed here is so cheap to compute, it is worth trying in practical applications even after learning.
Except for a rank 1 term and bias terms , is a diagonal matrix. In other words, as is seen in equation (72), it is diagonal except for a row and column corresponding to the bias terms and the rank 1 term . Except for the rank 1 term , it has the same structure as that of the quasi-diagonal matrix of Ollivier (2015), justifying the quasi-diagonal method.
6 Fisher information of residual network
The residual network has direct paths from its input to output in each layer. We treat the following block of layer : The layer transforms input to output by
(see Figure 2).
Here is a decay factor, ( is conventionally used), and are randomly generated iid Gaussian variables subject to .
We show how the activity develops in a residual network (Yang and Schoenholz, 2017). We easily have the recursive relation,