# Pathological spectra of the Fisher information metric and its variants in deep neural networks

The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic behavior when the network is sufficiently wide and has random weights and biases. Various FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues take on large values while most of them are close to zero. This implies that the local shape of the parameter space or loss landscape is very steep in a few specific directions and almost flat in the other directions. Similar pathological spectra appear in other variants of FIMs: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. The quantitative understanding of the FIM and its variants provided here offers important perspectives on learning and signal processing in large-scale DNNs.


### 1 Introduction

Deep neural networks (DNNs) have outperformed many standard machine-learning methods in practical applications LeCun et al. (2015). Despite their practical success, many theoretical aspects of DNNs remain to be uncovered, and there are still many heuristics used in deep learning. We need a solid theoretical foundation for elucidating how and under what conditions DNNs and their learning algorithms work well.

The Fisher information matrix (FIM) is a fundamental metric tensor that appears in statistics and machine learning. The FIM determines the Cramér-Rao bound for parameter estimation. An empirical FIM is equivalent to the Hessian of the loss function around a certain global minimum, and it affects the performance of optimization in machine learning. In information geometry, the FIM defines the Riemannian metric tensor of the parameter manifold of a statistical model Amari (2016). The natural gradient method is a first-order gradient method in the Riemannian space, and it is characterized by the FIM and invariance under parameter coordinate transformations Amari (1998); Park et al. (2000); Pascanu and Bengio (2014); Ollivier (2015); Martens and Grosse (2015). It can be used to train various DNNs faster than conventional gradient methods can. The FIM also acts as a regularizer to prevent catastrophic forgetting Kirkpatrick et al. (2017); a DNN trained on one dataset can learn another dataset without forgetting information if the parameter change is regularized with the diagonal of the FIM.

However, our understanding of the FIM for neural networks has so far been limited to empirical studies and theoretical analyses of simple networks. Numerical experiments confirmed that the eigenvalue spectra of the FIM and those of the Hessian are highly distorted; that is, most eigenvalues are close to zero, while others take on large values LeCun et al. (1998); Sagun et al. (2017); Papyan (2019). Focusing on shallow neural networks, Pennington and Worah (2018) theoretically analyzed the FIM’s eigenvalue spectra by using random matrix theory, and Fukumizu (1996) derived a condition under which the FIM becomes singular. Liang et al. (2019) have connected FIMs to the generalization ability of DNNs by using model complexity, but their results are restricted to linear networks. Thus, theoretical evaluations of deeply nonlinear cases seem to be difficult because of iterated nonlinear transformations. To go one step further, it would be helpful if a framework that is widely applicable to various DNNs could be constructed.

Investigating DNNs with random weights has given promising results. When such DNNs are sufficiently wide, we can formulate their behavior by using simpler analytical equations through coarse-graining of the model parameters, as is discussed in mean field theory Amari (1974); Kadmon and Sompolinsky (2016); Poole et al. (2016); Schoenholz et al. (2017); Yang and Schoenholz (2017); Xiao et al. (2018); Yang (2019) and random matrix theory Pennington et al. (2018); Pennington and Bahri (2017); Pennington and Worah (2017). For example, Schoenholz et al. (2017) proposed a mean field theory for backpropagation in fully-connected DNNs. This theory characterizes the amplitudes of gradients by using specific quantities, i.e., order parameters in statistical physics, and enables us to quantitatively predict parameter regions that can avoid vanishing or exploding gradients. This theory is applicable to a wide class of DNNs with various non-linear activation functions and depths. Such DNNs with random weights are substantially connected to Gaussian processes and kernel methods Daniely et al. (2016); Lee et al. (2018); Matthews et al. (2018); Jacot et al. (2018). Furthermore, the theory of the neural tangent kernel (NTK) explains that even trained parameters are close enough to the random initialization in sufficiently wide DNNs and that the performance of trained DNNs is determined by the NTK at initialization Jacot et al. (2018); Lee et al. (2019); Arora et al. (2019).

Karakida et al. (2019a) recently focused on the FIM corresponding to the least-squares loss and proposed a framework to express certain eigenvalue statistics by using order parameters. They revealed that when conventional fully-connected networks with random initialization are sufficiently wide, the FIM’s eigenvalue spectrum asymptotically becomes pathologically distorted. As the network width increases, most of the eigenvalues become asymptotically close to zero while a small number of them take on huge values and become outliers. The distorted shape of the eigenvalue spectrum is consistent with empirical reports LeCun et al. (1998); Sagun et al. (2017). While LeCun et al. (1991) implied that such pathologically large eigenvalues might appear in multi-layered networks and affect the training dynamics, their theoretical elucidation has been limited to a data covariance matrix in a linear regression model. The results of Karakida et al. (2019a) verify the large eigenvalues suggested in LeCun et al. (1991) and enable us to quantify them in wide and multi-layered networks. The obtained eigenvalue statistics are crucial in practice. As we make the network wider, the largest eigenvalue becomes larger and we have to make the learning rate smaller for the gradient methods to converge LeCun et al. (1998); Karakida et al. (2019a). Using statistics obtained in Karakida et al. (2019a), Sun and Nielsen (2019) investigated a new formulation of the minimum description length in DNNs that showed improved generalization performance.

In this paper, we extend the framework of the previous work Karakida et al. (2019a) and reveal that various FIMs and variants show pathological spectra. Our contributions are summarized as follows:

• FIM with softmax outputs: While the FIM analyzed in the previous work Karakida et al. (2019a) corresponds to a squared error loss for regression tasks, the typical loss function used in classification tasks is the cross-entropy loss with softmax outputs. We analyze this FIM for classification tasks and reveal that its spectrum is pathological. There are at least $C$ dominant eigenvalues, where $C$ is the number of output units, which is consistent with a recent experimental report Papyan (2019).

• Diagonal Blocks of FIM: We give a detailed analysis of the diagonal block parts of the FIM for regression tasks. Natural gradient algorithms often use a block diagonal approximation of the FIM Amari et al. (2019). We show that the diagonal blocks also suffer from pathological spectra.

• Connection to NTK: The NTK and FIM inherently share the same non-zero eigenvalues. Paying attention to a specific re-scaling of the parameters assumed in studies of the NTK, we reveal that the NTK’s eigenvalue statistics become independent of the width scale. Instead, the gap between the average and maximum eigenvalues increases with the sample size. This suggests that, as the sample size increases, the training dynamics converge non-uniformly and that calculations with the NTK become ill-conditioned. We also demonstrate a simple method that makes the eigenvalue statistics independent of both the width and the sample size.

• Metric tensors for input and feature spaces: We consider metric tensors for input and feature spaces spanned by neurons in input and hidden layers. These metric tensors potentially enable us to quantitatively evaluate the robustness of DNNs against perturbations in the input and feedforward propagated signals. We show that the averages of the eigenvalues of these metric tensors are asymptotically close to zero while their largest eigenvalues maintain constant values. In the sense that the outlier of the spectrum is far from most of the eigenvalues, the spectrum is pathologically distorted, similar to FIMs.

In summary, this study presents a unified perspective on the asymptotic spectra common to various wide networks.

### 2 Preliminaries

#### 2.1 Model

We investigated the fully-connected feedforward neural network shown in Fig. 1. The network consists of one input layer, $L-1$ hidden layers ($l = 1, \ldots, L-1$), and one output layer. It includes shallow nets ($L = 2$) and arbitrarily deep nets ($L \geq 3$). The width of the $l$-th layer is denoted by $M_l$. The pre-activations $u_i^l$ and activations $h_i^l$ of units in the $l$-th layer are defined recursively in terms of the activations of the previous layer:

$$u_i^l = \sum_{j=1}^{M_{l-1}} W_{ij}^l h_j^{l-1} + b_i^l, \quad h_i^l = \phi(u_i^l), \tag{1}$$

for $l = 1, \ldots, L$, where $\phi$ is the activation function, whose conditions are explained in the following. The input signals are $h_i^0 = x_i$, which propagate layer by layer according to Eq. (1). We define the weight matrices as $W^l \in \mathbb{R}^{M_l \times M_{l-1}}$ and the bias terms as $b^l \in \mathbb{R}^{M_l}$. We will mainly focus on the case in which the activation function in the $L$-th layer (network output) is linear, i.e.,

$$f_i := h_i^L = u_i^L. \tag{2}$$

The softmax output is also discussed in Section 3.3.

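FIM computations require the chain rule of backpropagated signals $\delta_k^l$. The backpropagated signals are defined by $\delta_{k,i}^l := \partial f_k / \partial u_i^l$ and naturally appear in the derivatives of $f_k$ with respect to the parameters:

$$\frac{\partial f_k}{\partial W_{ij}^l} = \delta_{k,i}^l h_j^{l-1}, \quad \frac{\partial f_k}{\partial b_i^l} = \delta_{k,i}^l, \quad \delta_{k,i}^l = \phi'(u_i^l) \sum_j \delta_{k,j}^{l+1} W_{ji}^{l+1}, \tag{3}$$

for $k = 1, \ldots, C$. To avoid complicating the notation, we will omit the index $k$ of the output unit, i.e., $\delta_i^l = \delta_{k,i}^l$. To evaluate the above feedforward and backward signals, we assume the following conditions.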

Random weights and biases: Suppose that the parameter set $\theta$ is an ensemble generated by

$$W_{ij}^l \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_w^2 / M_{l-1}), \quad b_i^l \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_b^2), \tag{4}$$

and is then held fixed, where $\mathcal{N}(0, \sigma^2)$ denotes a Gaussian distribution with zero mean and variance $\sigma^2$. Treating the case in which different layers have different variances is straightforward. Note that the variances of the weights are scaled in the order of $1/M$. In practice, the learning of DNNs usually starts from random initialization with this scaling Glorot and Bengio (2010); He et al. (2015). Regarding the network width, we set

$$M_l = \alpha_l M \quad (0 \leq l \leq L-1), \quad M_L = C, \tag{5}$$

and consider the limiting case of a sufficiently large $M$ with constant coefficients $\alpha_l$. The number of output units is taken to be a constant $C$, as is usually done in practice.

Input samples: We assume that there are $T$ input samples $x(t) \in \mathbb{R}^{M_0}$ ($t = 1, \ldots, T$) generated independently and identically from the input distribution. We generate the samples by using a standard normal distribution, i.e.,

$$x_j(t) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1). \tag{6}$$

Activation functions: Suppose the following two conditions: (i) the activation function $\phi(x)$ has a polynomially bounded weak derivative; (ii) the network is non-centered, which means a DNN with bias terms ($\sigma_b \neq 0$) or activation functions with a non-zero Gaussian mean. The definition of the non-zero Gaussian mean is $\int Dz\, \phi(z) \neq 0$, where $Dz$ denotes integration over the standard Gaussian density.

Condition (i) is used to obtain recurrence relations of backward order parameters Yang (2019). Condition (ii) makes it easy to evaluate the FIM for regression tasks Karakida et al. (2019a, b). The two conditions are valid in various realistic settings, because conventional networks include bias terms, and widely used activation functions, such as the sigmoid function and (leaky-) ReLUs, have bounded weak derivatives and non-zero Gaussian means. Different layers may have different activation functions.
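As an illustration of the setting above, the following NumPy sketch draws the parameters according to Eq. (4), draws inputs according to Eq. (6), and propagates them through Eq. (1) with a linear output layer as in Eq. (2). The function names `init_params` and `forward`, the layer widths, and the values chosen for $\sigma_w$ and $\sigma_b$ are illustrative assumptions and are not taken from the paper's experiments.

```python
import numpy as np

def init_params(widths, sigma_w, sigma_b, rng):
    """Draw W^l ~ N(0, sigma_w^2 / M_{l-1}) and b^l ~ N(0, sigma_b^2), as in Eq. (4)."""
    params = []
    for m_in, m_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(m_in), size=(m_out, m_in))
        b = rng.normal(0.0, sigma_b, size=m_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """Propagate inputs x (shape T x M_0) through Eq. (1); the last layer is linear, Eq. (2)."""
    h = x
    pre_acts, acts = [], [h]
    for l, (W, b) in enumerate(params, start=1):
        u = h @ W.T + b                          # pre-activations u^l for all T samples
        h = u if l == len(params) else phi(u)    # linear output in the L-th layer
        pre_acts.append(u)
        acts.append(h)
    return pre_acts, acts

rng = np.random.default_rng(0)
widths = [500, 500, 500, 10]        # M_0, M_1, M_2, M_L = C (hypothetical sizes)
params = init_params(widths, sigma_w=1.5, sigma_b=0.5, rng=rng)
x = rng.standard_normal((20, widths[0]))         # T = 20 inputs drawn as in Eq. (6)
pre_acts, acts = forward(params, x)
f = acts[-1]                                     # network outputs, shape (T, C)
```

Scaling the weight standard deviation by $1/\sqrt{M_{l-1}}$ keeps the pre-activations of order one as the width grows, which is what makes the large-width limit well defined.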

#### 2.2 Overview of metric tensors

We will analyze two types of metric tensors (metric matrices) that determine the responses of network outputs, i.e., the response to a local change in parameters and the response to a local change in the input and hidden neurons. One can systematically understand these tensors from the perspective of perturbations of variables.

We denote the set of network parameters as $\theta$ and its dimension as $P$. Next suppose we choose one network output unit $k$. If $\theta$ is perturbed by an infinitesimal change $d\theta$, the change in the output is given by a quadratic form after performing a Taylor expansion, i.e.,

$$E\left[|f_k(x; \theta + d\theta) - f_k(x; \theta)|^2\right] \sim d\theta^\top F_k\, d\theta, \tag{7}$$

$$F_k := E\left[\nabla_\theta f_k \nabla_\theta f_k^\top\right], \tag{8}$$

where $\nabla_\theta$ is the derivative with respect to $\theta$ and $E[\cdot]$ denotes the expectation over an input distribution. The matrix $F_k$ acts as a metric tensor for the parameter space. The eigenvalues of $F_k$ determine the robustness of the network output against the perturbation. As will be explained in Section 3.1,

$$F = \sum_{k=1}^{C} F_k \tag{9}$$

has a special meaning because it is the Fisher information matrix (FIM).

Let us take a closer look at the structure of $F_k$ by using block matrices. $F_k$ can be partitioned into $L^2$ block matrices. We denote the $(l, l')$-th block as $F_k^{ll'}$ ($l, l' = 1, \ldots, L$), whose weight part is given by $E[\nabla_{W^l} f_k \nabla_{W^{l'}} f_k^\top]$. One can represent it in matrix form:

$$F_k^{ll'} = E\left[\delta_k^l (\delta_k^{l'})^\top \otimes h^{l-1} (h^{l'-1})^\top\right], \tag{10}$$

where $\otimes$ represents the Kronecker product. The variables $h^l$ and $\delta_k^l$ are functions of $x$, and the expectation is taken over $x$.

In analogy with the FIM, one can introduce a metric tensor that measures the response to a change in the neural activities. Make a vector of all the activations in the input and hidden layers, i.e., $h := \{h^0, h^1, \ldots, h^{L-1}\} \in \mathbb{R}^{M_h}$ with $M_h := \sum_{l=0}^{L-1} M_l$. Next, define an infinitesimal perturbation of $h$, i.e., $dh \in \mathbb{R}^{M_h}$, that is independent of $x$. Then, the response can be written as

$$E\left[|f_k(h + dh; \theta) - f_k(h; \theta)|^2\right] \sim dh^\top A_k\, dh, \tag{11}$$

$$A_k := E\left[\nabla_h f_k \nabla_h f_k^\top\right]. \tag{12}$$

We refer to $A_k$ as the metric tensor for the input and feature spaces because each $h^l$ acts as the input to the next layer and corresponds to the features realized in the network. We take the average over input samples; this operation includes the trivial case of one sample, i.e., $T = 1$.

One can partition $A_k$ into $L^2$ block matrices whose $(l, l')$-th block is expressed by an $M_l \times M_{l'}$ matrix:

$$A_k^{ll'} := (W^{l+1})^\top E\left[\delta_k^{l+1} (\delta_k^{l'+1})^\top\right] W^{l'+1}, \tag{13}$$

for $l, l' = 0, \ldots, L-1$. In particular, the first diagonal block $A_k^{00}$ indicates the robustness of the network output against perturbation of the input:

$$E\left[|f_k(x + dx; \theta) - f_k(x; \theta)|^2\right] \sim dx^\top A_k^{00}\, dx. \tag{14}$$

Similar (but different) quantities have been investigated in terms of the robustness against input noise, such as sensitivity Novak et al. (2018), and robustness against adversarial examples Goodfellow et al. (2014).

#### 2.3 Order parameters in mean field theory

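We use the following four types of order parameter, i.e., $(\hat{q}_t^l, \hat{q}_{st}^l, \tilde{q}_t^l, \tilde{q}_{st}^l)$, which were used in various studies on wide DNNs Amari (1974); Poole et al. (2016); Schoenholz et al. (2017); Yang and Schoenholz (2017); Xiao et al. (2018); Lee et al. (2018). First, let us define the following variables for feedforward signal propagation:

$$\hat{q}_t^l := \frac{1}{M_l} \sum_{i=1}^{M_l} h_i^l(t)^2, \quad \hat{q}_{st}^l := \frac{1}{M_l} \sum_{i=1}^{M_l} h_i^l(s) h_i^l(t), \tag{15}$$

where $h_i^l(t)$ is the output of the $l$-th layer generated by the $t$-th input sample $x(t)$ ($t = 1, \ldots, T$). The variable $\hat{q}_t^l$ describes the total activity in the $l$-th layer, and the variable $\hat{q}_{st}^l$ describes the overlap between the activities for different input samples $x(s)$ and $x(t)$. These variables have been utilized to describe the depth to which signals can propagate from the perspective of order-to-chaos phase transitions Poole et al. (2016). In the large $M$ limit, these variables can be recursively computed by integration over Gaussian distributions (Poole et al., 2016; Amari, 1974):

$$\hat{q}_t^{l+1} = \int Du\, \phi^2\left(\sqrt{q_t^{l+1}}\, u\right), \quad \hat{q}_{st}^{l+1} = I_\phi[q_t^{l+1}, q_{st}^{l+1}],$$

$$q_t^{l+1} := \sigma_w^2 \hat{q}_t^l + \sigma_b^2, \quad q_{st}^{l+1} := \sigma_w^2 \hat{q}_{st}^l + \sigma_b^2, \tag{16}$$

for $l = 0, \ldots, L-1$. Because the input samples generated by Eq. (6) yield $\hat{q}_t^0 = 1$ and $\hat{q}_{st}^0 = 0$ for all $s$ and $t$ ($s \neq t$), $\hat{q}_t^l$ in each layer takes the same value for all $t$; so does $\hat{q}_{st}^l$ for all $s \neq t$. The notation $Du = du\, \exp(-u^2/2)/\sqrt{2\pi}$ means integration over the standard Gaussian density. A two-dimensional Gaussian integral is given by

$$I_\phi[a, b] := \int Dy\, Dx\, \phi(\sqrt{a}\, x)\, \phi\left(\sqrt{a}\left(c x + \sqrt{1 - c^2}\, y\right)\right) \tag{17}$$

with $c = b/a$. One can represent this integral in a slightly simpler form, i.e.,

$$I_\phi[a, b] = \int Dy \left(\int Dx\, \phi\left(\sqrt{a - b}\, x + \sqrt{b}\, y\right)\right)^2.$$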
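The forward recursion of Eq. (16) can be evaluated numerically with Gauss-Hermite quadrature. The sketch below is one possible implementation and not the authors' code: the helper names `second_moment`, `I_phi`, and `forward_order_parameters`, the number of quadrature nodes, and the values of $\sigma_w$ and $\sigma_b$ are assumptions made for illustration. It evaluates the two-dimensional integral in the form of Eq. (17) and only iterates over the hidden layers.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes and weights for the weight function exp(-u^2 / 2); after normalisation,
# sum_i weights[i] * g(nodes[i]) approximates the Gaussian integral of g.
nodes, weights = hermegauss(80)
weights = weights / np.sqrt(2.0 * np.pi)

def second_moment(phi, q):
    """Approximate int Du phi(sqrt(q) u)^2."""
    return np.sum(weights * phi(np.sqrt(q) * nodes) ** 2)

def I_phi(phi, a, b):
    """Approximate the two-dimensional Gaussian integral of Eq. (17) with c = b / a."""
    c = b / a
    s = np.sqrt(max(1.0 - c * c, 0.0))           # guard against round-off when c is close to 1
    x, y = nodes[:, None], nodes[None, :]
    vals = phi(np.sqrt(a) * x) * phi(np.sqrt(a) * (c * x + s * y))
    return weights @ vals @ weights

def forward_order_parameters(phi, sigma_w, sigma_b, n_hidden):
    """Iterate Eq. (16) from hat q^0_t = 1, hat q^0_st = 0 over n_hidden hidden layers."""
    q_hat_t, q_hat_st = [1.0], [0.0]
    for _ in range(n_hidden):
        q_t = sigma_w ** 2 * q_hat_t[-1] + sigma_b ** 2
        q_st = sigma_w ** 2 * q_hat_st[-1] + sigma_b ** 2
        q_hat_t.append(second_moment(phi, q_t))
        q_hat_st.append(I_phi(phi, q_t, q_st))
    return q_hat_t, q_hat_st

q_hat_t, q_hat_st = forward_order_parameters(np.tanh, sigma_w=1.5, sigma_b=0.5, n_hidden=3)
```

For a linear activation the quadrature is exact, which gives a convenient sanity check: the second moment reduces to $q$ and $I_\phi[a, b]$ reduces to $b$.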

Next, let us define the following variables for backpropagated signals:

$$\tilde{q}_t^l := \sum_{i=1}^{M_l} \delta_i^l(t)^2, \quad \tilde{q}_{st}^l := \sum_{i=1}^{M_l} \delta_i^l(s) \delta_i^l(t). \tag{18}$$

Above, we omitted $k$, the index of the output $f_k$, because the symmetry in the layer makes the above variables independent of $k$ in the large $M$ limit. The backward variables are defined by summations, while the feedforward ones in (15) are defined by averages. Because we suppose $C = O(1)$, each $\delta_i^l$ is of $O(1/\sqrt{M})$ and their sums are of $O(1)$ in terms of the order notation $O(\cdot)$. The variable $\tilde{q}_t^l$ is the magnitude of the backward signals and $\tilde{q}_{st}^l$ is their overlap. Previous studies found that $\tilde{q}_t^l$ and $\tilde{q}_{st}^l$ in the large $M$ limit are easily computed using the following recurrence relations Schoenholz et al. (2017); Yang (2019),

$$\tilde{q}_t^l = \sigma_w^2 \tilde{q}_t^{l+1} \int Du \left[\phi'\left(\sqrt{q_t^l}\, u\right)\right]^2, \quad \tilde{q}_{st}^l = \sigma_w^2 \tilde{q}_{st}^{l+1} I_{\phi'}[q_t^l, q_{st}^l], \tag{19}$$

for $l = 1, \ldots, L-1$. A linear network output (2) leads to the following initialization of the recurrences: $\tilde{q}_t^L = \tilde{q}_{st}^L = 1$. The previous studies showed excellent agreement between these backward order parameters and experimental results Schoenholz et al. (2017); Yang and Schoenholz (2017); Xiao et al. (2018). Although those studies required the so-called gradient independence assumption to derive these recurrences, Yang (2019) recently proved that such an assumption is unnecessary when condition (i) of the activation function is satisfied.

The order parameters ($\hat{q}_t^l, \hat{q}_{st}^l, \tilde{q}_t^l, \tilde{q}_{st}^l$) depend only on the type of activation function, depth, and the variance parameters $\sigma_w^2$ and $\sigma_b^2$. The recurrence relations for the order parameters require iterations of one- and two-dimensional numerical integrals. Moreover, we can obtain explicit forms of the recurrence relations for some of the activation functions Karakida et al. (2019a).

### 3 Eigenvalue statistics of FIMs

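This section shows the asymptotic eigenvalue statistics of the FIMs. When we have a $P \times P$ metric tensor whose eigenvalues are $\lambda_i$ ($i = 1, \ldots, P$), we compute the following quantities:

$$m_\lambda := \frac{1}{P} \sum_{i=1}^{P} \lambda_i, \quad s_\lambda := \frac{1}{P} \sum_{i=1}^{P} \lambda_i^2, \quad \lambda_{max} := \max_i \lambda_i.$$

The obtained results are universal for any sample size $T$ and network ranging in size from shallow ($L = 2$) to arbitrarily deep ($L \geq 3$).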
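The three statistics can be computed directly from the eigenvalues of a symmetric matrix. The following sketch uses a synthetic matrix (one strong outlier direction plus a weak bulk) purely to illustrate what a distorted spectrum looks like in terms of these quantities; the helper name `eigen_stats` and the toy matrix are assumptions and do not come from the paper.

```python
import numpy as np

def eigen_stats(G):
    """Return (m_lambda, s_lambda, lambda_max) of a symmetric matrix G."""
    lam = np.linalg.eigvalsh(G)
    return lam.mean(), np.mean(lam ** 2), lam.max()

# Toy matrix with one large outlier eigenvalue and many small ones (illustrative only).
rng = np.random.default_rng(0)
P = 300
v = rng.standard_normal(P) / np.sqrt(P)     # direction of the outlier, norm close to 1
B = rng.standard_normal((P, P)) / P         # weak bulk component
G = P * np.outer(v, v) + B @ B.T
m, s, lam_max = eigen_stats(G)
```

Note that $m_\lambda$ equals the trace of the matrix divided by $P$ and $s_\lambda$ equals its squared Frobenius norm divided by $P$, so both can be obtained without an eigendecomposition when only these two statistics are needed.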

#### 3.1 FIM for regression tasks

This subsection overviews the results obtained in the previous studies Karakida et al. (2019a, b). The metric tensor $F$ is equivalent to the Fisher information matrix (FIM) of neural network models Amari (1998); Pascanu and Bengio (2014); Ollivier (2015); Park et al. (2000); Martens and Grosse (2015), originally defined by

$$F := E\left[\nabla_\theta \log p(x, y; \theta)\, \nabla_\theta \log p(x, y; \theta)^\top\right]. \tag{20}$$

The statistical model is given by $p(x, y; \theta) = p(y|x; \theta)\, q(x)$, where $p(y|x; \theta)$ is the conditional probability distribution of the DNN of output $y$ given input $x$, and $q(x)$ is an input distribution. The expectation $E[\cdot]$ is taken over the input-output pairs $(x, y)$ of the joint distribution $p(x, y; \theta)$. This FIM appears in the Kullback-Leibler divergence between a statistical model and an infinitesimal change to it:

$$\mathrm{KL}\left[p(x, y; \theta) : p(x, y; \theta + d\theta)\right] = \frac{1}{2}\, d\theta^\top F\, d\theta.$$

The parameter space forms a Riemannian manifold and the FIM acts as its Riemannian metric tensor Amari (2016).

Basically, there are two types of FIM for supervised learning, depending on the definition of the statistical model. One type corresponds to the squared error loss for regression tasks; the other corresponds to the cross-entropy loss for classification tasks. The latter is discussed in Section 3.3. Let us consider the following statistical model for the regression task:

$$p(y|x; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} \|y - f(x; \theta)\|^2\right), \tag{21}$$

where we denote the Euclidean norm as $\|\cdot\|$. The squared error loss is given, up to a constant, by the negative log-likelihood of this model. Substituting $p(y|x; \theta)$ into the original definition of FIM (20) and taking the integral over $y$, one can easily confirm that it is equivalent to the metric tensor (9) introduced by the perturbation.

When $T$ input samples $x(t)$ ($t = 1, \ldots, T$) are available, we can replace the expectation of the FIM with the empirical mean:

$$F = \sum_{k=1}^{C} E\left[\nabla_\theta f_k \nabla_\theta f_k^\top\right] = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{C} \nabla_\theta f_k(t) \nabla_\theta f_k(t)^\top, \tag{22}$$

where we have abbreviated the network outputs as $f_k(t) = f_k(x(t); \theta)$ to avoid complicating the notation. This is an empirical FIM in the sense that the average is computed over empirical input samples. We can express it in the matrix form shown in Fig. 2. Let us investigate this type of empirical metric tensor for arbitrary $T$. One can set $T$ as a constant value or make it increase depending on $M$. The empirical FIM (22) converges to the expected FIM as $T \to \infty$. Note that in the context of natural gradient algorithms, one may approximate the FIM (20) by taking an average over empirical input-output pairs ($x(t), y(t)$) where $y(t)$ is a training label. Recently, Kunstner et al. (2019) emphasized that in natural gradient algorithms, the FIM (22) performs better than that of the empirical pairs ($x(t), y(t)$).
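To make Eq. (22) concrete, the sketch below assembles the empirical FIM of a small random tanh network from the backpropagated signals of Eq. (3). It is a minimal illustration under assumed settings: the function name `output_gradients`, the layer widths, the sample size, and the variance parameters are hypothetical, and the widths are kept far smaller than the wide-network regime analyzed in the text so that the full $P \times P$ matrix fits in memory.

```python
import numpy as np

def tanh_grad(u):
    return 1.0 - np.tanh(u) ** 2

def output_gradients(params, x):
    """Return a (C, P) array whose k-th row is grad_theta f_k for one input x, via Eq. (3)."""
    hs, us = [x], []
    n_layers = len(params)
    for l, (W, b) in enumerate(params):
        u = W @ hs[-1] + b
        us.append(u)
        hs.append(u if l == n_layers - 1 else np.tanh(u))
    C = us[-1].shape[0]
    delta = np.eye(C)                      # delta^L_{k,i} = 1 if k == i (linear output)
    blocks = [None] * n_layers
    for l in reversed(range(n_layers)):
        # Weight gradients delta_{k,i} h_j, flattened as i * M_{l-1} + j, followed by bias gradients.
        grad_W = (delta[:, :, None] * hs[l][None, None, :]).reshape(C, -1)
        blocks[l] = np.concatenate([grad_W, delta], axis=1)
        if l > 0:
            # Backward recursion of Eq. (3): multiply by W^T, then by phi'(u) of the layer below.
            delta = (delta @ params[l][0]) * tanh_grad(us[l - 1])
    return np.concatenate(blocks, axis=1)

rng = np.random.default_rng(0)
widths = [20, 20, 20, 3]                   # hypothetical small sizes
sigma_w, sigma_b = 1.5, 0.5                # hypothetical variance parameters
params = [(rng.normal(0.0, sigma_w / np.sqrt(m_in), (m_out, m_in)),
           rng.normal(0.0, sigma_b, m_out))
          for m_in, m_out in zip(widths[:-1], widths[1:])]
T = 10
X = rng.standard_normal((T, widths[0]))

J = np.concatenate([output_gradients(params, x) for x in X], axis=0)   # shape (C*T, P)
F = J.T @ J / T                                                        # empirical FIM, Eq. (22)
lam = np.linalg.eigvalsh(F)
lam_mean, lam_max = lam.mean(), lam.max()
```

Because $F$ is built from only $CT$ gradient vectors, at most $CT$ of its $P$ eigenvalues are non-zero, which already hints at why the mean eigenvalue is much smaller than the largest one.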

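The previous studies Karakida et al. (2019a, b) uncovered the following eigenvalue statistics of the FIM (22):

###### Theorem 3.1 (Karakida et al. (2019a), Karakida et al. (2019b)).

When $M$ is sufficiently large, the eigenvalue statistics of $F$ can be asymptotically evaluated as

$$m_\lambda \sim \kappa_1 \frac{C}{M}, \quad s_\lambda \sim \alpha \left(\frac{T-1}{T} \kappa_2^2 + \frac{\kappa_1^2}{T}\right) C,$$

$$\lambda_{max} \sim \alpha \left(\frac{T-1}{T} \kappa_2 + \frac{\kappa_1}{T}\right) M,$$

where $\alpha := \sum_{l=1}^{L-1} \alpha_l \alpha_{l-1}$, and positive constants $\kappa_1$ and $\kappa_2$ are obtained using order parameters,

$$\kappa_1 := \sum_{l=1}^{L} \frac{\alpha_{l-1}}{\alpha} \tilde{q}_t^l \hat{q}_t^{l-1}, \quad \kappa_2 := \sum_{l=1}^{L} \frac{\alpha_{l-1}}{\alpha} \tilde{q}_{st}^l \hat{q}_{st}^{l-1}.$$

The eigenspace corresponding to the $C$ largest eigenvalues is spanned by the $C$ eigenvectors

$$E[\nabla_\theta f_k] \quad (k = 1, \ldots, C).$$

The mean of the eigenvalue spectrum asymptotically decreases in the order of $1/M$ in the large $M$ limit, while the variance takes a value of $O(1)$ and the largest eigenvalue takes a huge value of $O(M)$. Note that $\kappa_1$ is positive by definition and $\kappa_2$ is positive under the condition (ii) of activation functions.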
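The constants in Theorem 3.1 can be evaluated by combining the forward recursion (16) with the backward recursion (19). The sketch below does this with Gauss-Hermite quadrature under simplifying assumptions that are not part of the theorem: all layers share the same width (so that every $\alpha_l = 1$ and $\alpha = L - 1$), the two-dimensional integral is evaluated in the form $\int Dy\, (\int Dx\, \phi(\sqrt{a-b}\,x + \sqrt{b}\,y))^2$, and the function names and parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

z, w = hermegauss(60)
w = w / np.sqrt(2.0 * np.pi)           # weights of a standard Gaussian expectation

def one_dim(fn, q):
    """Approximate int Du fn(sqrt(q) u)^2."""
    return np.sum(w * fn(np.sqrt(q) * z) ** 2)

def two_dim(fn, a, b):
    """Approximate I_fn[a, b] as int Dy (int Dx fn(sqrt(a - b) x + sqrt(b) y))^2, 0 <= b <= a."""
    inner = fn(np.sqrt(max(a - b, 0.0)) * z[:, None] + np.sqrt(b) * z[None, :])
    return np.sum(w * (w @ inner) ** 2)

def kappas(phi, dphi, sigma_w, sigma_b, L):
    """kappa_1 and kappa_2 of Theorem 3.1, assuming equal widths (alpha_l = 1, alpha = L - 1)."""
    qh_t, qh_st = [1.0], [0.0]         # hat q^0 for i.i.d. standard normal inputs
    q_t, q_st = [None], [None]
    for l in range(1, L):              # forward recursion, Eq. (16)
        q_t.append(sigma_w ** 2 * qh_t[-1] + sigma_b ** 2)
        q_st.append(sigma_w ** 2 * qh_st[-1] + sigma_b ** 2)
        qh_t.append(one_dim(phi, q_t[l]))
        qh_st.append(two_dim(phi, q_t[l], q_st[l]))
    qt_t, qt_st = {L: 1.0}, {L: 1.0}   # tilde q^L = 1 for a linear output
    for l in range(L - 1, 0, -1):      # backward recursion, Eq. (19)
        qt_t[l] = sigma_w ** 2 * qt_t[l + 1] * one_dim(dphi, q_t[l])
        qt_st[l] = sigma_w ** 2 * qt_st[l + 1] * two_dim(dphi, q_t[l], q_st[l])
    alpha = L - 1
    k1 = sum(qt_t[l] * qh_t[l - 1] for l in range(1, L + 1)) / alpha
    k2 = sum(qt_st[l] * qh_st[l - 1] for l in range(1, L + 1)) / alpha
    return k1, k2, alpha

def theorem_3_1(k1, k2, alpha, M, T, C):
    """Asymptotic mean, second moment, and largest eigenvalue stated in Theorem 3.1."""
    m = k1 * C / M
    s = alpha * ((T - 1) / T * k2 ** 2 + k1 ** 2 / T) * C
    lam_max = alpha * ((T - 1) / T * k2 + k1 / T) * M
    return m, s, lam_max

dtanh = lambda u: 1.0 - np.tanh(u) ** 2
k1, k2, alpha = kappas(np.tanh, dtanh, sigma_w=1.5, sigma_b=0.5, L=3)
m, s, lam_max = theorem_3_1(k1, k2, alpha, M=1000, T=100, C=10)
```

With these formulas the ratio $\lambda_{max} / m_\lambda$ grows like $M^2$, which is the quantitative content of the statement that the spectrum becomes pathologically distorted as the width increases.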

Theorem 3.1 has the following implication. Since the eigenvalues are non-negative by definition, the obtained statistics mean that most of the eigenvalues are asymptotically close to zero, while the other eigenvalues are widely distributed; this behavior has been empirically known for decades (LeCun et al., 1998; Sagun et al., 2017; Papyan, 2019). Thus, when the network is sufficiently wide, one can see that the shape of the eigenvalue spectrum asymptotically becomes pathologically distorted. This implies that the parameter space of the DNNs is locally almost flat in most directions but highly distorted in a few specific directions. The following remarks are helpful for understanding further implications of the theorem.

Remark 1: Dependence on the depth. As the depth increases, $\lambda_{max}$ linearly increases in the sense that it is proportional to the sum over $L$ terms. As previous studies have reported Poole et al. (2016); Schoenholz et al. (2017), a large $L$ limit causes a qualitative change in the network state, known as a phase transition. Schoenholz et al. (2017) recommend setting ($\sigma_w^2, \sigma_b^2$) on the critical line of the phase transition. In such a case, the order parameters take finite values and $\lambda_{max}$ scales linearly with the depth. In contrast, the means of the eigenvalues remain unchanged. Therefore, deeper networks have more distorted parameter spaces.

Remark 2: Loss landscape and gradient methods. The empirical FIM (22) is equivalent to the Hessian of the loss around the global minimum with zero training loss Karakida et al. (2019a). This means that the local shape of the loss landscape is also asymptotically almost flat in most directions but very steep in a few specific directions. Karakida et al. (2019b) referred to the steep shape caused by $\lambda_{max}$ as pathological sharpness. The sharpness of the loss landscape is connected to the learning rate of gradient methods for convergence. Experiments conducted by Karakida et al. (2019a) confirmed that a learning rate $\eta$ satisfying $\eta < 2/\lambda_{max}$ is necessary for the steepest gradient method to converge. Because $\lambda_{max}$ diverges as the width increases, we need to carefully choose an appropriately scaled learning rate to train the DNNs. Furthermore, because $\lambda_{max}$ increases as the depth increases, a deeper network has a steeper loss landscape around the minimum, which requires a smaller learning rate.
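The role of the largest eigenvalue in the learning-rate condition can be seen on a quadratic loss, which is the local model of the landscape around a minimum. The sketch below is a toy example with an assumed diagonal Hessian and is not an experiment from the paper: gradient descent on $\frac{1}{2}\theta^\top H \theta$ contracts along an eigen-direction with eigenvalue $\lambda$ only if $|1 - \eta\lambda| < 1$, so the largest eigenvalue sets the bound $\eta < 2/\lambda_{max}$.

```python
import numpy as np

def gd_final_loss(H, eta, steps=200, seed=0):
    """Run gradient descent on the quadratic loss 0.5 * theta^T H theta and return the final loss."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(H.shape[0])
    for _ in range(steps):
        theta = theta - eta * (H @ theta)      # gradient of the quadratic loss is H theta
    return 0.5 * theta @ H @ theta

# Hypothetical spectrum: one steep direction and several nearly flat ones.
lam = np.array([100.0, 1.0, 0.1, 0.01])
H = np.diag(lam)
stable = gd_final_loss(H, eta=1.9 / lam.max())     # just below 2 / lambda_max
unstable = gd_final_loss(H, eta=2.1 / lam.max())   # just above 2 / lambda_max
```

The stable run also shows the other side of a distorted spectrum: with a learning rate limited by the steep direction, progress along the nearly flat directions is very slow.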

Remark 3: $C$ largest eigenvalues. The eigenspace corresponding to $\lambda_{max}$ has the dimension of $C$ Karakida et al. (2019b). Fig. 3 (left) shows a typical spectrum of the FIM. We computed the eigenvalues of the FIM by using random Gaussian weights, biases, and inputs. We used deep Tanh networks with , , , and . The number of input samples was . The red histogram was made from eigenvalues over different networks with different random seeds. The histogram had two populations. The blue dashed histogram was made by eliminating the $C$ largest eigenvalues. It coincides with the smaller population. Thus, one can see that the larger population corresponds to the $C$ largest eigenvalues. The larger population in experiments can be distributed around $\lambda_{max}$ because of finite $M$.

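The eigenvalue statistics of the smaller population were investigated in the previous work Karakida et al. (2019b) in the context of batch normalization in the last layer. Such normalization includes mean subtraction, i.e., $\bar{f}_k := f_k - E[f_k]$. The previous work analyzed the corresponding FIM:

$$\bar{F} := \sum_k E\left[\nabla_\theta \bar{f}_k \nabla_\theta \bar{f}_k^\top\right] = \sum_k E\left[\nabla_\theta f_k \nabla_\theta f_k^\top\right] - \sum_k E[\nabla_\theta f_k]\, E[\nabla_\theta f_k]^\top. \tag{23}$$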

The subtraction (23) means eliminating the $C$ largest eigenvalues from $F$ because we asymptotically have . Thus, the eigenvalues of $\bar{F}$ correspond to the smaller population in the figure. The previous work Karakida et al. (2019b) theoretically confirmed that, under the condition , the mean of the eigenvalues of $\bar{F}$ is of and constructed the lower and upper bounds of the largest eigenvalue. Numerical experiments confirmed that the largest eigenvalue of $\bar{F}$ is of . Note that when , the sample size is sufficiently large but the network satisfies and remains overparameterized.
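The effect of the subtraction in Eq. (23) can be illustrated with synthetic gradient vectors that share a common mean component. The sketch below does not use a neural network: the gradients are random vectors with an assumed non-zero mean, chosen only to show that removing $E[\nabla_\theta f]\, E[\nabla_\theta f]^\top$ is the same as centering the gradients and that it removes the outlier eigenvalue created by the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 200, 50                                  # hypothetical sample size and parameter count
mean_grad = np.ones(P)                          # synthetic common component E[grad f]
grads = mean_grad + 0.1 * rng.standard_normal((T, P))   # rows play the role of grad f(t)

F = grads.T @ grads / T                         # uncentered second moment, one output unit
g_bar = grads.mean(axis=0)
F_bar = F - np.outer(g_bar, g_bar)              # right-hand side of Eq. (23)

centered = grads - g_bar                        # gradients of the mean-subtracted output
lam_F = np.linalg.eigvalsh(F).max()
lam_F_bar = np.linalg.eigvalsh(F_bar).max()
```

In this toy setting the largest eigenvalue of `F` is close to the squared norm of the mean gradient, while the largest eigenvalue of `F_bar` is set by the much smaller fluctuations around the mean.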

#### 3.2 Diagonal blocks of FIM

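In the same way as Theorem 3.1, one can easily obtain eigenvalue statistics for diagonal blocks, that is, $F^{ll}$.

###### Theorem 3.2.

When $M$ is sufficiently large, the eigenvalue statistics of $F^{ll}$ are asymptotically evaluated as

$$m_\lambda^l \sim \frac{\tilde{q}_t^l \hat{q}_t^{l-1}}{\alpha_l} \frac{C}{M}, \quad s_\lambda^l \sim \frac{\alpha_{l-1}}{\alpha_l} \left(\frac{T-1}{T} \left(\tilde{q}_{st}^l \hat{q}_{st}^{l-1}\right)^2 + \frac{\left(\tilde{q}_t^l \hat{q}_t^{l-1}\right)^2}{T}\right) C,$$

$$\lambda_{max}^l \sim \alpha_{l-1} \left(\frac{T-1}{T} \tilde{q}_{st}^l \hat{q}_{st}^{l-1} + \frac{\tilde{q}_t^l \hat{q}_t^{l-1}}{T}\right) M.$$

The eigenspace of $F^{ll}$ corresponding to the $C$ largest eigenvalues is spanned by the $C$ eigenvectors

$$E[\nabla_{\theta^l} f_k] \quad (k = 1, \ldots, C).$$

The theorem is proved in Appendix A. We have a mean of $O(1/M)$ and second moment of $O(1)$ in each hidden layer. The largest eigenvalue is of $O(M)$. Regarding the pathological largest eigenvalues, we have, asymptotically,

$$\lambda_{max} = \sum_{l=1}^{L} \lambda_{max}^l. \tag{24}$$

Fig. 3 (right) empirically confirms that $F^{ll}$ has a similar pathological spectrum to that of $F$. Its experimental setting was the same as in the case of $F$.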

It is helpful to investigate the relation between $F$ and its diagonal blocks when one considers the diagonal block approximation of $F$. Use of a diagonal block approximation can decrease the computational cost of natural gradient algorithms Ollivier (2015); Amari et al. (2019). When a matrix is composed only of diagonal blocks, its eigenvalues are given by those of each diagonal block. Therefore, $F$ approximated in this fashion has the same mean of the eigenvalues as the original $F$ and the largest eigenvalue $\max_l \lambda_{max}^l$, which is of $O(M)$. Thus, the diagonal block approximation also suffers from a pathological spectrum. Eigenvalues that are close to zero can make the inversion of the FIM in the natural gradient methods unstable, whereas using a damping term seems to be an effective way of dealing with this instability Martens and Grosse (2015).
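The statement that a block-diagonal matrix inherits the eigenvalues of its blocks is easy to check numerically. The sketch below uses small synthetic positive semi-definite blocks as stand-ins for the layer-wise blocks $F^{ll}$; the block sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
blocks = []
for size in (4, 6, 3):                       # hypothetical block sizes
    A = rng.standard_normal((size, 2 * size))
    blocks.append(A @ A.T)                   # symmetric positive semi-definite block

# Assemble the block-diagonal matrix by hand.
P = sum(b.shape[0] for b in blocks)
F_bd = np.zeros((P, P))
start = 0
for b in blocks:
    n = b.shape[0]
    F_bd[start:start + n, start:start + n] = b
    start += n

eig_whole = np.sort(np.linalg.eigvalsh(F_bd))
eig_blocks = np.sort(np.concatenate([np.linalg.eigvalsh(b) for b in blocks]))
largest = max(np.linalg.eigvalsh(b).max() for b in blocks)
```

The two sorted spectra coincide, so the largest eigenvalue of the approximation is the maximum over the blocks, and its trace (hence its mean eigenvalue) equals that of the full matrix.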

#### 3.3 FIM for multi-label classification tasks

The cross-entropy loss is typically used in multi-label classification tasks. We define the $C$-dimensional softmax function by

$$g_i(t) := \frac{\exp(f_i(t))}{\sum_k \exp(f_k(t))}, \tag{25}$$

for $i = 1, \ldots, C$. The cross-entropy loss comes from the log-likelihood of the following statistical model:

$$p(y|x; \theta) = \prod_{i=1}^{C} g_i(t)^{y_i}, \tag{26}$$

where $y$ is a $C$-dimensional one-hot vector. The cross-entropy loss is given by $-\sum_i y_i \log g_i$. Substituting the statistical model into the definition of the FIM (20) and taking the summation over $y$, we find that

$$F_{cross} = \frac{1}{T} \sum_{s,t=1}^{T} \sum_{k,k'}^{C} Q_{st}(k, k')\, \nabla_\theta f_k(s) \nabla_\theta f_{k'}(t)^\top, \tag{27}$$

$$Q_{st}(k, k') := \left\{g_k(t) \delta_{kk'} - g_k(t) g_{k'}(t)\right\} \delta_{st}. \tag{28}$$

This is also derived in Pascanu and Bengio (2014); Park et al. (2000). One can also characterize $F_{cross}$ by the robustness of the softmax function against the perturbation,

$$E\left[\|g(x; \theta + d\theta) - g(x; \theta)\|^2\right] \sim d\theta^\top F_{cross}\, d\theta. \tag{29}$$
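Equations (27) and (28) translate directly into array operations. The sketch below builds $F_{cross}$ from synthetic stand-ins for the gradients $\nabla_\theta f_k(t)$ and the softmax outputs $g_k(t)$ (random arrays, not a trained or random network), and compares it with the regression FIM obtained by replacing $Q$ with the identity. All sizes and names are illustrative assumptions.

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max(axis=-1, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, C, P = 8, 4, 30                            # hypothetical sizes
J = rng.standard_normal((T, C, P))            # stand-in for grad_theta f_k(t)
g = softmax(rng.standard_normal((T, C)))      # stand-in for softmax outputs g_k(t)

# Eq. (28): Q_tt(k, k') = g_k(t) [k == k'] - g_k(t) g_k'(t); blocks with s != t vanish.
Q = np.stack([np.diag(gt) - np.outer(gt, gt) for gt in g])      # shape (T, C, C)

F_cross = np.einsum('tkp,tkl,tlq->pq', J, Q, J) / T             # Eq. (27)
F_reg = np.einsum('tkp,tkq->pq', J, J) / T                      # same form with Q = identity
```

Each block of $Q$ is positive semi-definite with eigenvalues at most one and annihilates the all-ones vector, so $F_{cross}$ is positive semi-definite and bounded above by the regression FIM in the positive semi-definite order.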

$F_{cross}$ is linked to $F$ through the matrix representation shown in Fig. 2 (a). One can view $F$ as a matrix representation with $Q = I$, that is, the identity matrix. In contrast, $F_{cross}$ corresponds to the $Q$ defined in (28). This matrix representation is useful for deriving the eigenvalue statistics (see Appendix B) and the following theorem:

###### Theorem 3.3.

When $M$ is sufficiently large, the eigenvalue statistics of $F_{cross}$ are asymptotically evaluated as

 mλ∼β1Cκ1M,  sλ∼α(β2κ22+β3κ21T),
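$$\beta_4\, \alpha \left(\frac{T-1}{T} \kappa_2 + \frac{\kappa_1}{T}\right) M \leq \lambda_{max} \leq \sqrt{\alpha s_\lambda}\, M,$$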

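where the coefficients are given by

$$\begin{aligned}
\beta_1 &:= 1 - \frac{1}{T} \sum_{t} \sum_{i}^{C} g_i(t)^2, \\
\beta_2 &:= \frac{1}{T^2} \sum_{s \neq t} \left\{ \sum_{i}^{C} g_i(t) g_i(s) - 2 \sum_{i}^{C} g_i(t)^2 g_i(s) + \left(\sum_{i}^{C} g_i(t) g_i(s)\right)^2 \right\}, \\
\beta_3 &:= \frac{1}{T} \sum_{t} \left\{ \sum_{i}^{C} (1 - 2 g_i(t))\, g_i(t)^2 + \left(\sum_{i}^{C} g_i(t)^2\right)^2 \right\}, \\
\beta_4 &:= \max_{1 \leq k \leq C} \frac{1}{T} \sum_{t} g_k(t) (1 - g_k(t)).
\end{aligned}$$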

We find that the eigenvalue spectrum shows the same width dependence as the FIM for regression tasks. Although the evaluation of in Theorem 3.3 is based on inequalities, one can see that is of and it linearly increases as the depth increases. The soft-max functions appear in the coefficients . It should be noted that the values of generally depend on the index of each soft-max output. This is because the values of the softmax functions depend on the specific configuration of and .

Fig. 4 shows that our theory predicts experimental results rather well for artificial data. We computed the eigenvalues of $F_{cross}$ with random Gaussian weights, biases, and inputs. We set , , , and () = () in the tanh case, () in the ReLU case, and () in the linear case. The number of input samples was set to . The predictions of Theorem 3.3 coincided with the experimental results for sufficiently large widths.

Exhaustive experiments on the cross-entropy loss have recently confirmed that there are $C$ dominant large eigenvalues (so-called outliers) Papyan (2019). Consistent with the results of this experimental study, we found that there are at least $C$ eigenvalues depending on the width scale:

###### Theorem 3.4.

$F_{cross}$ has the first $C$ largest eigenvalues of $O(M)$.

The theorem is proved in Appendix C. These large eigenvalues are reminiscent of the $C$ largest eigenvalues of $F$ shown in Theorem 3.1. Informally speaking, the matrix $Q$ in $F_{cross}$ scatters the $C$ largest eigenvalues of $F$. It would be interesting to extend the above theorem and theoretically quantify the dominant eigenvalues precisely.

### 4 Connection to Neural Tangent Kernel

#### 4.1 Scale-dependent eigenvalue statistics

The empirical FIM (22) is essentially connected to a recently proposed Gram matrix, i.e., the Neural Tangent Kernel (NTK). Jacot et al. (2018) formulated the NTK as

$$\Theta := \nabla_\theta f^\top \nabla_\theta f, \quad \text{with} \quad \theta = \{\omega^l_{ij}, \beta^l_i\}, \tag{30}$$

where they assumed a special scaling,

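$$W^l_{ij} = \frac{\sigma_w}{\sqrt{M_{l-1}}}\,\omega^l_{ij}, \quad b^l_i = \sigma_b \beta^l_i, \quad \omega^l_{ij}, \beta^l_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1). \tag{31}$$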

This scaling is called the NTK parameterization Lee et al. (2019). Under certain conditions with sufficiently large $M$, the NTK is known to govern the training dynamics of the wide network:

$$\frac{df}{dt} = \frac{\eta}{T}\,\Theta\,(y - f), \tag{32}$$

where $t$ corresponds to the time step of the parameter update and $\eta$ represents the learning rate. Surprisingly, the NTK at random initialization determines the training process, and we can analytically solve for the dynamics of $f$. Specifically, the NTK’s eigenvalues determine the speed of convergence of the training dynamics. Moreover, one can predict the network output on the test samples by using the Gaussian process with the NTK Jacot et al. (2018); Lee et al. (2019).
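As a minimal sketch of why the eigenvalues set the convergence speed, the linear dynamics (32) can be solved in closed form when $\Theta$ is held constant. The code below assumes a single output unit ($C = 1$), so that $\Theta$ is a $T \times T$ matrix, and builds a stand-in positive definite kernel from a random matrix `J`; both the function name `ntk_dynamics` and the toy kernel are illustrative assumptions, not the networks used in the paper.

```python
import numpy as np

def ntk_dynamics(theta, f0, y, eta, t):
    """Closed-form solution of df/dt = (eta / T) * theta @ (y - f) for constant theta."""
    T = len(y)
    lam, V = np.linalg.eigh(theta)        # theta = V diag(lam) V^T
    r0 = V.T @ (f0 - y)                   # initial residual expressed in the eigenbasis
    decay = np.exp(-eta * lam * t / T)    # mode i decays at rate eta * lam_i / T
    return y + V @ (decay * r0)

rng = np.random.default_rng(0)
T = 5                                     # hypothetical number of training samples
J = rng.standard_normal((T, 20))          # hypothetical stand-in for the network Jacobian
theta = J @ J.T                           # toy positive definite kernel, not a real NTK
y = rng.standard_normal(T)                # toy targets
f0 = np.zeros(T)                          # toy initial outputs

f_early = ntk_dynamics(theta, f0, y, eta=0.1, t=1.0)
f_late = ntk_dynamics(theta, f0, y, eta=0.1, t=1e4)
print(np.linalg.norm(f_early - y), np.linalg.norm(f_late - y))
```

In this sketch the residual along each eigenvector shrinks as $\exp(-\eta \lambda_i t / T)$, so directions with small eigenvalues are the last to converge.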

The NTK and empirical FIM share essentially the same eigenvalues. Consider the left-to-right reversal of $F$, denoted by $F^*$ in Fig. 2 (b). We call it the dual of $F$. We have $F^* = \Theta/T$. Karakida et al. (2019a) analyzed $F^*$ under the usual parameterization to derive Theorem 3.1 because $F$ and $F^*$ have the same non-zero eigenvalues. We can obtain the eigenvalue statistics of $\Theta$ by taking the NTK parameterization and leveraging the proof of Theorem 3.1.
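The duality used here is a general property of Gram matrices: for any matrix $R$, the products $RR^\top$ and $R^\top R$ have the same non-zero eigenvalues. The following sketch checks this numerically with a random matrix standing in for the stacked gradients; the sizes `P` and `T` and the Gaussian entries are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T = 50, 4                                   # hypothetical: P parameters, T samples, one output
R = rng.standard_normal((P, T)) / np.sqrt(T)   # random stand-in for the stacked gradients

F = R @ R.T          # P x P matrix, plays the role of the empirical FIM
F_dual = R.T @ R     # T x T matrix, plays the role of the dual

# Sort both spectra in descending order for comparison.
eig_F = np.sort(np.linalg.eigvalsh(F))[::-1]
eig_dual = np.sort(np.linalg.eigvalsh(F_dual))[::-1]

print(eig_F[:T])     # the T largest eigenvalues of F
print(eig_dual)      # all eigenvalues of the dual
```

The remaining `P - T` eigenvalues of `F` are zero up to round-off, which is why the much smaller dual matrix suffices for studying the non-zero part of the spectrum.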

###### Theorem 4.1.

When $M$ is sufficiently large, the eigenvalue statistics of $\Theta$ are asymptotically evaluated as

 mλ
$$\lambda_{\max} \sim \alpha\left((T-1)\kappa'_2 + \kappa'_1\right).$$

The positive constants $\kappa'_1$ and $\kappa'_2$ are obtained using order parameters,

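$$\kappa'_1 := \frac{1}{\alpha}\sum_{l=1}^{L}\left(\sigma_w^2\, \tilde{q}^l_t\, \hat{q}^{l-1}_t + \sigma_b^2\, \tilde{q}^l_t\right), \quad \kappa'_2 := \frac{1}{\alpha}\sum_{l=1}^{L}\left(\sigma_w^2\, \tilde{q}^l_{st}\, \hat{q}^{l-1}_{st} + \sigma_b^2\, \tilde{q}^l_{st}\right).$$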

The proof is given in Appendix D. The NTK parameterization makes the eigenvalue statistics independent of the width scale. This is because the NTK parameterization maintains the scale of the weights but changes the scale of the gradients with respect to the weights. It also makes ($\kappa_1$, $\kappa_2$) shift to ($\kappa'_1$, $\kappa'_2$). This shift occurs because the NTK parameterization makes the order of the weight gradients comparable to that of the bias gradients. The second terms in $\kappa'_1$ and $\kappa'_2$ correspond to a non-negligible contribution from the bias gradients. The coefficients ($\kappa'_1$, $\kappa'_2$) are equivalent to the limiting kernel in Theorem 1 of Jacot et al. (2018).
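The change of gradient scale can be seen in a single layer. The sketch below samples one layer with the scaling of Eq. (31) and compares the gradient of one pre-activation $u_i = \sum_j W_{ij} h_j + b_i$ with respect to $W_{ij}$ (usual parameterization) and with respect to $\omega_{ij}$ (NTK parameterization). The widths, the values of `sigma_w` and `sigma_b`, and the random activations `h` are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
M_prev, M = 1000, 500               # hypothetical layer widths
sigma_w, sigma_b = 1.5, 0.3         # hypothetical variance parameters

omega = rng.standard_normal((M, M_prev))    # omega_ij ~ N(0, 1)
beta = rng.standard_normal(M)               # beta_i ~ N(0, 1)
W = sigma_w / np.sqrt(M_prev) * omega       # weights rescaled as in Eq. (31)
b = sigma_b * beta                          # biases rescaled as in Eq. (31)

h = np.tanh(rng.standard_normal(M_prev))    # hypothetical activations of the previous layer

# Gradients of a single pre-activation u_i = sum_j W_ij h_j + b_i.
grad_W = h                                      # du_i/dW_ij, usual parameterization
grad_omega = sigma_w / np.sqrt(M_prev) * h      # du_i/domega_ij, NTK parameterization
grad_beta = sigma_b                             # du_i/dbeta_i, NTK parameterization

sq_norm_W = grad_W @ grad_W                 # grows in proportion to M_prev
sq_norm_omega = grad_omega @ grad_omega     # stays of order one
print(W.std(), sq_norm_W, sq_norm_omega, grad_beta**2)
```

In this toy setting the weights `W` keep a standard deviation of about `sigma_w / sqrt(M_prev)`, while the squared gradient norm with respect to `omega` is smaller than that with respect to `W` by the factor `sigma_w**2 / M_prev`, which brings it to the same order as the bias gradient.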

While $m_\lambda$ is independent of the sample size $T$, $\lambda_{\max}$ depends on it. This means that the NTK dynamics converge non-uniformly. Most of the eigenvalues are relatively small and the NTK dynamics converge more slowly in the corresponding eigenspace. In addition, a prediction made with the NTK requires the inverse of the NTK to be computed Jacot et al. (2018); Lee et al. (2019). When the sample size is large, the condition number of the NTK, i.e., $\lambda_{\max}/\lambda_{\min}$, is also large and the computation with the inverse NTK is expected to be numerically inaccurate.
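The growth of the condition number with the sample size can be illustrated with a toy matrix. The sketch below uses a $T \times T$ matrix with a constant `kappa1` on the diagonal and a constant `kappa2` elsewhere, which is an assumption chosen so that its largest eigenvalue, `kappa1 + (T - 1) * kappa2`, has the same form as $\lambda_{\max}$ in Theorem 4.1 with $\alpha = 1$. It is not the NTK of an actual network, and the values of `kappa1` and `kappa2` are arbitrary.

```python
import numpy as np

def toy_kernel(T, kappa1, kappa2):
    """T x T matrix with kappa1 on the diagonal and kappa2 on every off-diagonal entry."""
    return (kappa1 - kappa2) * np.eye(T) + kappa2 * np.ones((T, T))

def spectrum_summary(T, kappa1=1.0, kappa2=0.5):
    """Return the mean eigenvalue, the largest eigenvalue, and the condition number."""
    lam = np.linalg.eigvalsh(toy_kernel(T, kappa1, kappa2))
    return lam.mean(), lam.max(), lam.max() / lam.min()

for T in (10, 100, 400):
    mean, lam_max, cond = spectrum_summary(T)
    print(T, mean, lam_max, cond)
```

In this toy model the mean eigenvalue stays at `kappa1` for every `T`, whereas the largest eigenvalue and the condition number grow linearly in `T`.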

#### 4.2 Scale-independent NTK

A natural question is under what condition the NTK’s eigenvalue statistics become independent of both the width and the sample size. As indicated in Eq. (23), the mean subtraction in the last layer with is a simple way to make the FIM’s largest eigenvalue independent of the width. Similarly, one can expect that the mean subtraction makes the NTK’s largest eigenvalue of $O(T)$ disappear and the eigenvalue spectrum take a range of $O(1)$ independent of the width and sample size. Fig. 5 empirically confirms this speculation. We set , , and used Gaussian inputs and weights with . As shown in Fig. 5 (left), the NTK’s eigenvalue spectrum becomes pathologically distorted as the sample size increases. To make the spectra easier to compare, the eigenvalues in this figure are normalized by . As the sample size increases, most of the eigenvalues concentrate close to zero while the largest eigenvalue becomes an outlier. In contrast, Fig. 5 (right) shows that the mean subtraction keeps the NTK’s whole spectrum in the range of under the condition . The spectrum empirically converged to a fixed distribution in the large limit.

### 5 Metric tensor for input and feature spaces

The above framework for evaluating FIMs is also applicable to metric tensors for input and feature spaces, which are expressed in the matrix form in Fig. 2 (c). Here, we can prove the following theorem:

###### Theorem 5.1.

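When $M$ is sufficiently large, the eigenvalue statistics of are asymptotically evaluated as

$$\begin{aligned}
m_\lambda &\sim \frac{\tilde{\kappa}_1}{M}, \quad s_\lambda \sim \tilde{\alpha}\left(\frac{T-1}{T}\tilde{\kappa}_2^2 + \frac{\tilde{\kappa}_1^2}{T}\right), \\
\lambda_{\max} &\sim \tilde{\alpha}\left(\frac{T-1}{T}\tilde{\kappa}_2 + \frac{\tilde{\kappa}_1}{T}\right),
\end{aligned}$$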

where , and the positive constants $\tilde{\kappa}_1$ and $\tilde{\kappa}_2$ are obtained from the order parameters,

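$$\tilde{\kappa}_1 := \frac{\sigma_w^2}{\tilde{\alpha}}\sum_{l=1}^{L}\tilde{q}^l_t, \quad \tilde{\kappa}_2 := \frac{\sigma_w^2}{\tilde{\alpha}}\sum_{l=1}^{L}\tilde{q}^l_{st}.$$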

The eigenvector of corresponding to is

The theorem is proved in Appendix E. The mean of the eigenvalues asymptotically decreases in the order of $O(1/M)$, while the variance and largest eigenvalues are of $O(1)$ for any . Thus, the spectrum of is pathologically distorted in the sense that the mean is far from the edge beyond the order difference of $M$. The local geometry of the whole input and feature spaces is distorted in the direction of this eigenvector. As the depth increases, $\lambda_{\max}$ increases linearly while the mean remains unchanged.

In the same way as the FIM, we can also evaluate the eigenvalue statistics of (see Appendix E). Furthermore, one can obtain the eigenvalue statistics of the diagonal blocks as follows:

###### Theorem 5.2.

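When $M$ is sufficiently large, the eigenvalue statistics of are asymptotically evaluated as

$$\begin{aligned}
m_\lambda &\sim \sigma_w^2\, \frac{\tilde{q}^{l+1}_t}{M_l}, \quad s_\lambda \sim \sigma_w^4\left(\frac{T-1}{T}\left(\tilde{q}^{l+1}_{st}\right)^2 + \frac{\left(\tilde{q}^{l+1}_t\right)^2}{T}\right), \\
\lambda_{\max} &\sim \sigma_w^2\left(\frac{T-1}{T}\tilde{q}^{l+1}_{st} + \frac{\tilde{q}^{l+1}_t}{T}\right).
\end{aligned}$$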

The eigenvector of corresponding to is

The proof is given in Appendix F.

Fig. 6 (left) shows typical spectra of and Fig. 6 (right) those of . We used deep Tanh networks with and . The other experimental settings are the same as those in Fig. 3. The pathological spectra appear as the theory predicts. Note that the ’s were distributed due to the finite value of .

Let us remark on related work in the deep learning literature. First, Pennington et al. (2018) investigated similar metric tensors. Briefly speaking, they used random matrix theory and obtained the eigenvalue spectrum of matrices satisfying , , and . They found that the isometry of the spectrum helps to solve the vanishing gradient problem. Second, DNNs are known to be vulnerable to a specific noise perturbation, i.e., the adversarial example Goodfellow et al. (2014). One can speculate that the eigenvector corresponding to $\lambda_{\max}$ may be related to adversarial attacks, although such a conclusion will require careful consideration.

### 6 Discussion

We evaluated the asymptotic eigenvalue statistics of the FIM and its variants in sufficiently wide DNNs. They have pathological spectra in the conventional setting of random initialization and activation functions. This suggests that we need to be careful about the eigenvalue statistics and their influence on the learning when we use large-scale deep networks in naive settings.

It will be straightforward to prove that similar pathological spectra appear in various network architectures because order parameters have already been developed in ResNets Yang and Schoenholz (2017) and CNNs Xiao et al. (2018), and we can use them. It is interesting to explore the eigenvalue statistics that the current study cannot capture. Although our study captured some of the basic eigenvalue statistics, it remains to derive the whole spectrum analytically. Random matrix theory enables us to analyze the FIM’s eigenvalue spectrum in a shallow and centered network without bias terms Pennington and Worah (2018). Extending the theory to more general DNNs seems to be a prerequisite for further progress. In addition, random matrix theory enables us to analyze the spectrum of a special type of Jacobian matrix for backpropagation Pennington et al. (2018). The spectrum of the Jacobian matrix with random orthogonal weights differs from that with i.i.d. random Gaussian weights. It would be interesting to investigate the spectra of our metric tensors with other types of random weights, although it seems that applying random matrix theory to general DNNs will be a nontrivial exercise. Furthermore, we assumed a finite number of network output units. In order to deal with multi-label classifications with high dimensionality, it would be helpful to investigate eigenvalue statistics in the wide limit of both hidden and network output layers. Finally, although we focused on the finite depth and regarded order parameters as constants, they can exponentially explode in extremely deep networks in the chaotic regime Schoenholz et al. (2017); Yang et al. (2019). The NTK in such a regime has been investigated in Jacot et al. (2019).

It would also be interesting to explore further connections between the eigenvalue statistics and learning. Recent studies have yielded insights into the connection between the generalization performance of DNNs and the eigenvalue statistics of certain Gram matrices Suzuki (2018); Sun and Nielsen (2019). The NTK’s eigenvalues affect the convergence of the training and the performance of the prediction Jacot et al. (2018). We expect that the theoretical foundation of the metric tensors given in this paper will lead to a more sophisticated understanding and development of deep learning in the future.

### A Derivation of Theorem 3.2

First, we briefly overview the derivation of Theorem 3.1 in Karakida et al. (2019a) and Karakida et al. (2019b). The essential point is that a Gram matrix has the same non-zero eigenvalues as its dual. One can represent the empirical FIM (22) as

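$$F = RR^\top, \quad R := \frac{1}{\sqrt{T}}\left[\nabla_\theta f_1 \;\; \nabla_\theta f_2 \;\; \cdots \;\; \nabla_\theta f_C\right]. \tag{A.1}$$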

Its columns are the gradients on each input, i.e., $\nabla_\theta f_k(t)$ ($t = 1, \ldots, T$). Let us refer to the matrix $F^* := R^\top R$ as the dual of the FIM. Matrices $F$ and $F^*$ have the same non-zero eigenvalues by definition. This $F^*$ can be partitioned into $C^2$ block matrices. The $(k, k')$-th block is given by

$$F^*(k, k') = \nabla_\theta f_k^\top \nabla_\theta f_{k'} / T, \tag{A.2}$$

for $k, k' = 1, \ldots, C$. In the large $M$ limit, the previous study Karakida et al. (2019a) showed that $F^*$ asymptotically satisfies

 F∗(k,k′)=αMTKδ