Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron: The Haar Orthogonal Case

03/24/2021
by Benoît Collins, et al. (FUJITSU)

Free Probability Theory (FPT) provides rich tools for handling the mathematical difficulties caused by random matrices that appear in research related to deep neural networks (DNNs), such as dynamical isometry, the Fisher information matrix, and training dynamics. FPT suits these studies because the DNN's parameter-Jacobian and input-Jacobian are polynomials of layerwise Jacobians. However, the critical assumption of asymptotic freeness of the layerwise Jacobians has not been proven completely so far. The asymptotic freeness assumption plays a fundamental role when propagating spectral distributions through the layers. Haar distributed orthogonal matrices are essential for achieving dynamical isometry. In this work, we prove the asymptotic freeness of the layerwise Jacobians of multilayer perceptrons in this case.


1 Introduction

Free Probability Theory (FPT) provides important insight when handling mathematical difficulties caused by random matrices that appear in deep neural networks (DNNs) [PSG18, HN19, HK21]. DNNs have been used to achieve empirically high performance in various machine learning tasks [LBH15, GBC16]. However, their theoretical understanding is limited, and their success relies heavily on heuristic search over settings such as architecture and hyperparameters. To understand and improve the training of DNNs, researchers have developed several theories to investigate, for example, the vanishing/exploding gradient problem [SGGSD17], the shape of the loss landscape [PW18, KAA19b], and the global convergence of training and generalization [JGH18]. The nonlinearity of activation functions, the depth of the DNN, and the lack of commutativity of random matrices result in important mathematical challenges. In this respect, FPT, invented by Voiculescu [Voi85, Voi87, Voi91], is well suited for this kind of analysis.

FPT essentially appears in the analysis of dynamical isometry [PSG17, PSG18]. It is well known that the training error of very deep models is difficult to reduce without carefully preventing the gradients from vanishing or exploding. Naive settings (i.e., of the activation function and the initialization) cause vanishing/exploding gradients once the network is sufficiently deep. Dynamical isometry [SMG14, PSG18] was proposed to solve this problem: it facilitates training by setting the input-output Jacobian's singular values to be close to one, where the input-output Jacobian is the Jacobian matrix of the DNN at a given input. Experiments have shown that models initialized to satisfy dynamical isometry can be trained to great depth without vanishing/exploding gradients; [PSG18, XBSD18, SP20] found that DNNs achieve approximate dynamical isometry over random orthogonal weights, but not over random Gaussian weights. To outline the theory, let $J$ denote the Jacobian of the multilayer perceptron (MLP), which is the fundamental model of DNNs. The Jacobian is given by the product of layerwise Jacobians:

(1.1)   $J = D_L W_L D_{L-1} W_{L-1} \cdots D_1 W_1$,

where each $W_\ell$ is the $\ell$-th weight matrix, each $D_\ell$ is the Jacobian of the $\ell$-th activation function, and $L$ is the number of layers. Under an assumption of asymptotic freeness, the limit spectral distribution of $JJ^\top$ is given in [PSG18].
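The following minimal numerical sketch (not from the paper; the width, depth, and hard-tanh activation are illustrative choices) builds the product of layerwise Jacobians in (1.1) with Haar orthogonal weights and inspects its singular values.

```python
# Minimal sketch: input-output Jacobian J = D_L W_L ... D_1 W_1 of an MLP with
# Haar orthogonal weights and a hard-tanh activation. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(n):
    # QR of a Gaussian matrix, with a sign correction from the diagonal of R,
    # yields a Haar-distributed orthogonal matrix.
    z = rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))

N, L = 400, 10
h = rng.standard_normal(N)                           # input vector
J = np.eye(N)
for _ in range(L):
    W = haar_orthogonal(N)
    pre = W @ h                                      # pre-activation
    D = np.diag((np.abs(pre) <= 1.0).astype(float))  # Jacobian of hard-tanh
    h = np.clip(pre, -1.0, 1.0)                      # hard-tanh activation
    J = D @ W @ J                                    # accumulate layerwise Jacobians

sv = np.linalg.svd(J, compute_uv=False)
print("singular values of J: min %.3f / max %.3f" % (sv.min(), sv.max()))
```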

To examine the training dynamics of MLPs achieving dynamical isometry, [HK21] introduced a spectral analysis of the Fisher information matrix per sample of the MLP. The Fisher information matrix (FIM) is a fundamental quantity for such theoretical understanding: it describes the local metric of the loss surface with respect to the KL divergence [Ama16]. The neural tangent kernel [JGH18], which has the same eigenvalue spectrum as the FIM except for trivial zeros, also describes the learning dynamics of DNNs when the dimension of the last layer is much smaller than that of the hidden layers. In particular, the FIM's eigenvalue spectrum describes the efficiency of optimization methods; for instance, the maximum eigenvalue determines an appropriate learning rate of first-order gradient methods for convergence [LKS91, KAA19b, WME18]. Despite its importance for neural networks, the FIM spectrum has received very little theoretical study, because existing analyses were limited to random matrix theory for shallow networks [PW18] or to mean-field theory for eigenvalue bounds, which may be loose in general [KAA19a]. Thus, [HK21] focused on the FIM per sample and found an alternative approach applicable to DNNs. The FIM per sample equals $J_\theta^\top J_\theta$, where $J_\theta$ is the parameter-Jacobian. Moreover, the eigenvalues of the FIM per sample are equal, except for the trivial zero eigenvalues and a normalization, to the eigenvalues of the matrices defined recursively as follows:

(1.2)

where one factor is the identity matrix and the coefficient is the empirical variance of the $\ell$-th hidden unit. Under an asymptotic freeness assumption, [HK21] derived limit spectral distributions of these matrices.
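As a hedged illustration (assuming the standard squared-loss setting; this is not a formula quoted from the paper): if $J_\theta$ denotes the output-parameter Jacobian, the per-sample FIM $J_\theta^\top J_\theta$ shares its nonzero eigenvalues with the much smaller dual matrix $J_\theta J_\theta^\top$, which is what makes its spectral analysis tractable.

```python
# Sketch (assumption: squared loss, so the per-sample FIM is J_theta^T J_theta).
# The nonzero eigenvalues of the P x P FIM coincide with those of the small
# d_out x d_out dual matrix J_theta J_theta^T. Sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
P, d_out = 2000, 10                          # number of parameters, output dimension
J_theta = rng.standard_normal((d_out, P)) / np.sqrt(P)

F = J_theta.T @ J_theta                      # per-sample FIM (rank <= d_out)
K = J_theta @ J_theta.T                      # dual matrix

eig_F = np.sort(np.linalg.eigvalsh(F))[-d_out:]   # nonzero part of F's spectrum
eig_K = np.sort(np.linalg.eigvalsh(K))
print(np.allclose(eig_F, eig_K))             # True
```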

The asymptotic freeness assumption plays a critical role in these works [PSG18, HK21] in obtaining the propagation of spectral distributions through the layers. However, the proof of asymptotic freeness had not been completed. In the present work, we prove the asymptotic freeness of the layerwise Jacobians of multilayer perceptrons with Haar orthogonal weights.

1.1 Main results

Our results are as follows. Firstly, the following tuple of families is asymptotically free almost surely (see Theorem 4.1):

(1.3)

Secondly, for each layer, the following pair is almost surely asymptotically free (see Theorem 4.2):

(1.4)

This asymptotic freeness is at the heart of the spectral analysis of the Jacobian. Lastly, for each layer, the following pair is almost surely asymptotically free (see Theorem 4.3):

(1.5)

The asymptotic freeness of the pair is the key to the analysis of the conditional Fisher information matrix.

A key ingredient of the proof is the invariance of the MLP described in Lemma 3.1. First, we consider orthogonal matrices that fix the hidden units in each layer. Second, we replace each layer's weight matrix by itself multiplied by such an orthogonal matrix; the MLP is then unchanged. Furthermore, if the original weights are Haar orthogonal, the Jacobian is also unchanged by this replacement. Using this key fact, we can replace each weight with a Haar orthogonal random matrix that is independent of the Jacobian of the activation function; a numerical sketch of the underlying invariance follows below. Asymptotic freeness then follows from well-known properties of Haar orthogonal random matrices.
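A minimal numerical sketch of this invariance (illustrative code, not the paper's): an orthogonal matrix O that fixes a hidden unit h satisfies (WO)h = Wh, so the forward pass is unchanged when W is replaced by WO.

```python
# Sketch of the invariance behind Lemma 3.1: construct an orthogonal O that fixes
# a given vector h and acts as a Haar rotation on its orthogonal complement; then
# replacing W by W @ O does not change W @ h. Names are illustrative.
import numpy as np

rng = np.random.default_rng(2)

def haar_orthogonal(n):
    z = rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))

def orthogonal_fixing(h):
    # Orthonormal basis whose first vector is parallel to h (via QR), combined
    # with a Haar rotation acting only on the orthogonal complement of h.
    n = h.size
    B = np.linalg.qr(np.column_stack([h, rng.standard_normal((n, n - 1))]))[0]
    block = np.eye(n)
    block[1:, 1:] = haar_orthogonal(n - 1)
    return B @ block @ B.T

N = 100
h = rng.standard_normal(N)
W = haar_orthogonal(N)
O = orthogonal_fixing(h)
print(np.allclose(O @ h, h), np.allclose(W @ O @ h, W @ h))  # True True
```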

1.2 Related Works

Asymptotic freeness is weaker than the assumption of forward-backward independence that studies of dynamical isometry have assumed [PSG17, PSG18, KAA19b]. Although studies based on mean-field theory [SMG14, LBH15, GCC19] have succeeded in explaining many experimental results of deep learning, they use an artificial assumption (gradient independence [Yan19]) which is not rigorously true. Asymptotic freeness is weaker than this artificial assumption. Our work clarifies that asymptotic freeness is precisely the property that is both useful for the analysis and rigorously valid.

Several works prove or treat asymptotic freeness under Gaussian initialization [HN19, Yan19, Yan20, Pas20]. However, asymptotic freeness had not been proven for orthogonal initialization. Since dynamical isometry can be achieved under orthogonal initialization but not under Gaussian initialization [PSG18], a proof of asymptotic freeness under orthogonal initialization is essential. Because our proof uses the properties of Haar distributed random matrices in an essential manner, it is transparent: we only need to replace the weights with Haar orthogonal matrices that are independent of the other Jacobians. While [HN19] restricts the activation function to ReLU, our proof covers a comprehensive class of activation functions, including smooth ones.

1.3 Organization of the paper

Section 2 is devoted to preliminaries: the setting of the MLP and notation for random matrices, spectral distributions, and free probability theory. Section 3 presents the two key ingredients of the proof of the main results: the invariance of the MLP and the cutoff of a single dimension. Section 4 is devoted to proving the main results on asymptotic freeness. In Section 5, we show applications of the asymptotic freeness to the spectral analysis of random matrices that appear in the theory of dynamical isometry and the training dynamics of DNNs. Section 6 is devoted to the discussion and future work.

2 Preliminaries

2.1 Setting of MLP

We consider the usual multilayer perceptron setting, as in studies of the FIM [PW18, KAA19b] and dynamical isometry [SMG14, PSG18, HK21]. Fix the width and the number of layers. We consider an $L$-layer multilayer perceptron as a parametrized map with weight matrices as follows. Firstly, consider activation functions on the real line, each assumed continuous and differentiable except at finitely many points. Secondly, for a single input, set the zeroth hidden unit to be the input. In addition, for each layer, set inductively

(2.1)

where acts on as the entrywise operation. Note that we omit bias parameters to simplify the analysis. Write . Denote by the Jacobian of the activation given by

(2.2)

Lastly, we assume that the weight matrices () are independent Haar orthogonal random matrices, and we further assume the following conditions (d1), …, (d4) on the distributions. In Fig. 1, we visualize the dependency of the random variables.

Figure 1:

A graphical model of random matrices and random vectors drawn by the following rules (i–iii). (i) A node's boundary is drawn as a square or a rectangle if it contains a square random matrix; otherwise, it is drawn as a circle. (ii) For each node, each parent node is the source node of a directed arrow, and the node is measurable with respect to the σ-algebra generated by all of its parent nodes. (iii) Nodes with no parent node are independent.
  (d1) For each , the input vector is an -valued random variable such that there is with

    (2.3)

    almost surely.

  (d2) Each weight matrix () satisfies

    (2.4)

    where are independent orthogonal matrices distributed with the Haar probability measure and .

  (d3) The bias vectors () have independent entries distributed with , where .

  (d4) For fixed , the family

    (2.5)

    is independent.

Let us define and by the following recurrence relations:

(2.6)
(2.7)
(2.8)

The inequality holds by the assumptions on the activation functions stated below.

We further assume that each activation function satisfies the following conditions (a1), …, (a5).

  (a1) It is a continuous function on the real line and is not identically zero.

  (a2) For any ,

    (2.9)

  (a3) It is differentiable almost everywhere with respect to the Lebesgue measure. We denote by the derivative defined almost everywhere.

  (a4) The derivative is continuous almost everywhere with respect to the Lebesgue measure.

  (a5) The derivative is bounded.

These conditions (d1), …, (d4) and (a1), …, (a5) are assumed throughout the paper.

Example 2.1 (Activation Functions).

The following activation functions, used in [PSG17, PSG18, HK21], satisfy the above conditions; a sketch of some standard forms follows the list.

  1. (2.10)
  2. (Shifted ReLU)

    (2.11)
  3. (Hard hyperbolic tangent)

    (2.12)
  4. (Hyperbolic tangent)

    (2.13)
  5. (Sigmoid function)

    (2.14)
  6. (Smoothed ReLU)

    (2.15)
  7. (Error function)

    (2.16)
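Since the exact formulas (2.10)–(2.16) are not reproduced above, the following sketch lists standard textbook forms (with derivatives) of some of the named activations; the precise shifted and smoothed ReLU variants used in the cited works are omitted.

```python
# Standard textbook forms of some activations named in Example 2.1, with their
# derivatives; these are illustrative and may differ in shifts/scalings from the
# exact definitions used in the paper.
import numpy as np
from scipy.special import erf

activations = {
    "hard_tanh": (lambda x: np.clip(x, -1.0, 1.0),
                  lambda x: (np.abs(x) <= 1.0).astype(float)),
    "tanh":      (np.tanh,
                  lambda x: 1.0 - np.tanh(x) ** 2),
    "sigmoid":   (lambda x: 1.0 / (1.0 + np.exp(-x)),
                  lambda x: np.exp(-x) / (1.0 + np.exp(-x)) ** 2),
    "erf":       (lambda x: erf(x / np.sqrt(2.0)),
                  lambda x: np.sqrt(2.0 / np.pi) * np.exp(-x ** 2 / 2.0)),
}

# Each derivative is bounded and continuous almost everywhere, in line with (a1)-(a5).
x = np.linspace(-3.0, 3.0, 7)
for name, (f, df) in activations.items():
    print(name, np.max(np.abs(df(x))))
```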

2.2 Notations

Linear Algebra

We denote by the algebra of matrices with entries in a field . Write unnormalized and normalized traces of as follows:

(2.17)
(2.18)

In this work, a random matrix is a valued Borel measurable map from a fixed probability space for an . We denote by the group of orthogonal matrices. It is well-known that is equipped with a unique left and right translation invariant probability measure, called the Haar probability measure.

Spectral Distribution

Recall that the spectral distribution of a linear operator is a probability distribution on the real line whose $k$-th moment, for every $k$, equals the normalized trace of the $k$-th power of the operator. If the operator is an $N \times N$ symmetric matrix, its spectral distribution is the average of the Dirac measures at its $N$ eigenvalues, counted with multiplicity.
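A small illustrative check of this definition: the empirical spectral distribution of a symmetric matrix is the uniform measure on its eigenvalues, so its moments agree with normalized traces of powers.

```python
# Check that moments of the empirical spectral distribution (uniform measure on
# the eigenvalues) equal normalized traces of matrix powers.
import numpy as np

rng = np.random.default_rng(5)
N = 300
A = rng.standard_normal((N, N))
A = (A + A.T) / np.sqrt(2 * N)                  # Wigner-type symmetric matrix

eigs = np.linalg.eigvalsh(A)
for k in range(1, 5):
    moment = np.mean(eigs ** k)                               # k-th moment of the ESD
    trace_power = np.trace(np.linalg.matrix_power(A, k)) / N  # normalized trace
    print(k, np.isclose(moment, trace_power))                 # True
```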

Joint Distribution of All Entries

For random matrices and random vectors , we write

(2.19)

if the joint distributions of all entries of the corresponding matrices and vectors in the two families coincide.

2.3 Asymptotic Freeness

In this section, we summarize the required notions from random matrix theory and free probability theory.

Definition 2.2.

A noncommutative C*-probability space (NCPS, for short) is a pair of a unital C*-algebra and a faithful tracial state on it, defined as follows. A linear space over the complex numbers is said to be a unital C*-algebra if it is a unital *-algebra equipped with an antilinear map and a norm satisfying the following conditions.

  1. .

  2. .

  3. .

  4. .

  5. It is complete with respect to the norm.

A linear map on is said to be a tracial state on if the following four conditions are satisfied.

  1. .

  2. .

  3. .

  4. .

In addition, we say that is faithful if implies .

An example of an NCPS is the pair of the algebra of matrices with complex entries and the normalized trace. Consider the algebra of matrices with real entries and the normalized trace. This pair itself is not an NCPS in the sense of Definition 2.2 since it is not a complex linear space. However, the algebra of complex matrices contains it, and the *-operation is preserved by setting, for a real matrix,

(2.20)

Also, the inclusion preserves the trace. Therefore, we consider the joint distributions of real matrices as those of elements in the complex NCPS.

Definition 2.3.

(Joint Distribution in an NCPS) Let and let be the free algebra of non-commutative polynomials generated by indeterminates . Then the joint distribution of the -tuple is the linear form defined by

(2.21)

where .

Definition 2.4.

Let . Let be sequences of matrices. Then we say that they converge in distribution to if

(2.22)

for any .

Definition 2.5.

(Freeness) Let be an NCPS. Let be subalgebras having the same unit as the whole algebra. They are said to be free if the following holds: for any , any sequence , and any () with

(2.23)
(2.24)

it holds that

(2.25)

Besides, elements in are said to be free if and only if the unital subalgebras that they generate are free.

Here we show an example related to our results.

Example 2.6.

Let and . Then the families are free if and only if the following unital subalgebras of are free:

(2.26)
(2.27)

Let us now introduce the asymptotic freeness of random matrices with compactly supported limit spectral distributions. Since we consider a family of finitely many random matrices, we restrict to a finite index set. Note that a finite index set is not required for the general definition of freeness.

Definition 2.7 (Asymptotic Freeness of Random Matrices).

Consider a nonempty finite index set and a partition of . Consider a sequence of -tuples

(2.28)

of random matrices, where . The sequence is then said to be almost surely asymptotically free as if the following two conditions are satisfied.

  1. There exists a family of elements in such that the following tuple is free:

    (2.29)
  2. For every ,

    (2.30)

    almost surely, where is the number of elements of .

2.4 Haar Distributed Orthogonal Random Matrices

Proposition 2.8.

[CŚ06, Theorem 5.1] For any , let be independent Haar random matrices, and let be random matrices which have an almost sure limit joint distribution. Assume that all entries of are independent of those of

(2.31)

for each . Then the families

(2.32)

are asymptotically free as .
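As a hedged numerical illustration of this kind of result (illustrative code, not the paper's): when a constant matrix A and a Haar-conjugated matrix U B U^T are asymptotically free, their mixed moments factorize; for instance, tr_N(A U B U^T) approaches tr_N(A) tr_N(B) as N grows.

```python
# Numerical sketch of asymptotic freeness under Haar conjugation: the mixed
# moment tr_N(A U B U^T) concentrates around tr_N(A) * tr_N(B) for large N.
import numpy as np

rng = np.random.default_rng(6)

def haar_orthogonal(n):
    z = rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))

for N in (100, 400, 1600):
    A = np.diag(rng.uniform(0.0, 1.0, size=N))   # bounded symmetric matrices
    B = np.diag(rng.uniform(0.0, 1.0, size=N))
    U = haar_orthogonal(N)
    lhs = np.trace(A @ U @ B @ U.T) / N
    rhs = (np.trace(A) / N) * (np.trace(B) / N)
    print(N, abs(lhs - rhs))                     # difference shrinks with N
```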

The following proposition is a direct consequence of Proposition 2.8.

Proposition 2.9.

For , let and be symmetric random matrices, and let be a Haar-distributed orthogonal random matrix. Assume that

  1. The random matrix is independent of for every .

  2. The spectral distribution of (resp. ) converges in distribution to a compactly supported probability measure (resp. ), almost surely.

Then the following pair is asymptotically free as ,

(2.33)

almost surely.

Note that we do not require independence between and in Proposition 2.9. Here we recall the following result, which is a direct consequence of the translation invariance of Haar random matrices.

Lemma 2.10.

Fix . Let be independent Haar random matrices. Let be valued random matrices. Let be valued random matrices. Let be random matrices. Assume that all entries of are independent of

(2.34)

Then,

(2.35)
Proof.

For the readers’ convenience, we include a proof. The characteristic function of (2.35) is given by

(2.36)

where and . By using conditional expectation, (2.36) is equal to

(2.37)

By the property of the Haar measure and the independence, the conditional expectation in (2.37) is equal to

(2.38)

Thus the assertion holds. ∎

3 Key to Asymptotic Freeness

3.1 Invariance

Since the universality of random matrices leads to asymptotic freeness (Proposition 2.8), it is essential to investigate the network’s invariance. The following invariance is the key to the main theorem.

Lemma 3.1.

Let be independent Haar random matrices on . Let be an valued random variable. For each , let be an valued random matrix such that

(3.1)

Further assume that all entries of are independent of those of for each . Then the following holds in the joint distribution of all entries:

(3.2)

Before proving Lemma 3.1, we describe the setting to which it will be applied. For , fix a standard complete orthonormal basis of . Set

(3.3)

Note that is non-zero almost surely. Then the family is a basis of . Secondly, let be the complete orthonormal basis determined by applying the Gram–Schmidt orthogonalization to this family. Thirdly, let be the orthogonal matrix which maps to for , where is the Euclidean norm. Then satisfies the following conditions.

  1. is -measurable.

  2. .

Lastly, let be independent Haar distributed orthogonal random matrices all of whose entries are independent of those of . Set

(3.4)

Then

(3.5)

Further, for any , all entries of are independent of those of , since each is -measurable, where is the -algebra generated by and . This completes the construction.

Each is the random matrix that determines the action of on the orthogonal complement of . Note that the Haar property of is not necessary to construct in Lemma 3.1, but it is used in the proof of Theorem 4.1. Fig. 2 visualizes the dependency of the random variables that appear in the above discussion.

Figure 2: A graphical model of random variables in a specific case using for . See Fig. 1 for the graph’s drawing rule. The node of is an isolated node in the graph.

To prove Lemma 3.1, we prepare some notation for the characteristic functions of the joint distributions. For the reader’s convenience, we visualize in Fig. 3 the dependency of the random variables for the specific construction given above. Note that the proof of Lemma 3.1 does not rely on this specific construction.

Fix and . For each , define a map by

(3.6)

where and . Write

(3.7)
(3.8)

By (3.1) and by , the values of the characteristic functions of the joint distributions at the point are given by and , respectively.

Figure 3: A graphical model of random variables for computing characteristic functions in a specific case using for constructing . See (3.7) and (3.8) for the definition of and . See Fig. 1 for the graph’s drawing rule.
Proof of Lemma 3.1.

Let be arbitrary random matrices satisfying the conditions in Lemma 3.1. We prove that the corresponding characteristic functions of the joint distributions match. We only need to show

(3.9)

Firstly, we claim the following:

(3.10)

for each . To show (3.10), fix and write for a random variable ,

(3.11)

By the tower property of conditional expectations, we have

(3.12)

Let be the Haar measure. Then by the invariance of the Haar measure, we have

(3.13)
(3.14)
(3.15)
(3.16)
(3.17)

In particular, is -measurable. By (3.12), we have (3.10).

Secondly, we claim that for each ,

(3.18)

Denote by the -algebra generated by . By definition, are -measurable. Therefore,

(3.19)

Now we have

(3.20)
(3.21)

since the generators of needed to determine are contained in . Therefore, by (3.10), we have

(3.22)
(3.23)

Therefore, we have proven (3.18).

Lastly, by applying (3.18) iteratively, we have

(3.24)

By (3.10),

(3.25)

We have completed the proof of (3.9). ∎

3.2 Cutoff

The invariance described in Lemma 2.10 fixes the vector , and there is no restriction on the remaining -dimensional space. This section quantifies the fact that cutting off the fixed direction has no significant effect in the large-dimensional limit.

Let be the diagonal matrix given by

(3.26)

If there is no confusion, we omit the index and simply write it . For , we denote by the -norm of defined by

(3.27)

Recall that the following non-commutative Hölder’s inequality holds:

(3.28)

for any with .
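A quick numerical check (illustrative) of this inequality in the case p = q = 2, r = 1, with the norms taken with respect to the normalized trace.

```python
# Check of the non-commutative Hoelder inequality ||AB||_1 <= ||A||_2 ||B||_2,
# where ||X||_p = (tr_N |X|^p)^(1/p) and tr_N is the normalized trace.
import numpy as np

rng = np.random.default_rng(3)
N = 200

def norm_p(X, p):
    s = np.linalg.svd(X, compute_uv=False)     # singular values of X
    return np.mean(s ** p) ** (1.0 / p)

A = rng.standard_normal((N, N)); A = (A + A.T) / np.sqrt(2 * N)
B = rng.standard_normal((N, N)); B = (B + B.T) / np.sqrt(2 * N)
print(norm_p(A @ B, 1) <= norm_p(A, 2) * norm_p(B, 2))   # True
```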

Lemma 3.2.

Fix . Let be random matrices for each . Assume that there is a constant such that almost surely

(3.29)

Let be the orthogonal projection defined in (3.26). Then we have, almost surely,

(3.30)

as .

Proof.

We omit the index if there is no confusion. Let us denote by the left-hand side of (3.30). Then

(3.31)

By Hölder’s inequality (3.28),

(3.32)

Now

(3.33)

By the assumption, we have almost surely, and the assertion follows. ∎
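A quick numerical illustration of the lemma (with illustrative choices: operator-norm-one matrices and a projection cutting off the first coordinate): inserting the projection changes normalized traces of products only at order 1/N.

```python
# Inserting the rank-(N-1) projection P into a product of bounded matrices
# changes the normalized trace only by O(1/N).
import numpy as np

rng = np.random.default_rng(4)
for N in (100, 400, 1600):
    A = np.linalg.qr(rng.standard_normal((N, N)))[0]   # operator norm 1
    B = np.linalg.qr(rng.standard_normal((N, N)))[0]
    P = np.eye(N); P[0, 0] = 0.0                       # cut off one coordinate
    diff = abs(np.trace(A @ B) - np.trace(P @ A @ P @ B @ P)) / N
    print(N, diff)                                     # decreases roughly like 1/N
```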

Next, we check that the cutoff of any orthogonal matrix can be approximated by an orthogonal matrix.

Lemma 3.3.

For any valued random matrix , there is valued random matrix satisfying

(3.34)

for any , almost surely.

Proof.

Consider the singular value decomposition