Deep neural networks (DNNs) have achieved empirically high performance in various machine learning tasks [LBH15, GBC16]. However, their theoretical understanding is limited, and their success relies heavily on heuristic choices of settings such as architecture and hyperparameters. To understand and improve the training of DNNs, researchers have developed several theories to investigate, for example, the vanishing/exploding gradient problem [SGGSD17], the shape of the loss landscape [PW18, KAA19b], and the global convergence of training and generalization [JGH18]. The nonlinearity of activation functions, the depth of DNNs, and the noncommutativity of random matrices pose substantial mathematical challenges. In this respect, free probability theory (FPT), invented by Voiculescu [Voi85, Voi87, Voi91], is well suited for this kind of analysis.
FPT naturally appears in the analysis of dynamical isometry [PSG17, PSG18]. It is well known that the training error is difficult to reduce in very deep models unless the vanishing/exploding of gradients is carefully prevented. Naive settings (i.e., of the activation function and initialization) cause vanishing/exploding gradients once the network is sufficiently deep. Dynamical isometry [SMG14, PSG18] was proposed to solve this problem. It facilitates training by making all singular values of the input-output Jacobian equal to one, where the input-output Jacobian is the Jacobian matrix of the DNN at a given input. Experiments have shown that very deep models satisfying dynamical isometry at initialization can be trained without vanishing/exploding gradients; [PSG18, XBSD18, SP20] found that DNNs achieve approximate dynamical isometry over random orthogonal weights, but not over random Gaussian weights. To sketch the theory, let $J$ be the input-output Jacobian of the multilayer perceptron (MLP), the fundamental model of DNNs. The Jacobian is given by the product of layerwise Jacobians:
$$ J = D_L W_L \cdots D_1 W_1, $$ where each $W_\ell$ is the $\ell$-th weight matrix, each $D_\ell$ is the Jacobian of the $\ell$-th activation function, and $L$ is the number of layers. Under an assumption of asymptotic freeness, the limit spectral distribution of the Jacobian was computed in [PSG18].
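This chain-rule factorization can be illustrated numerically. The sketch below is a minimal illustration, not the paper's formal setting: it assumes `tanh` activations, an illustrative width and depth, and Haar orthogonal weights sampled via QR; it builds the input-output Jacobian from the layerwise factors and checks it against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 50, 3  # illustrative width and depth

def haar_orthogonal(n, rng):
    # QR factor of a Gaussian matrix; sign correction gives the Haar measure
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

phi = np.tanh
dphi = lambda h: 1.0 / np.cosh(h) ** 2  # derivative of tanh

Ws = [haar_orthogonal(n, rng) for _ in range(L)]
x = rng.standard_normal(n)

# Forward pass, recording the layerwise Jacobians D_l = diag(phi'(W_l h))
h, Ds = x, []
for W in Ws:
    pre = W @ h
    Ds.append(np.diag(dphi(pre)))
    h = phi(pre)

# Input-output Jacobian as the ordered product D_L W_L ... D_1 W_1
J = np.eye(n)
for W, D in zip(Ws, Ds):
    J = D @ W @ J

# Sanity check against central finite differences
def mlp(v):
    for W in Ws:
        v = phi(W @ v)
    return v

eps = 1e-5
J_fd = np.column_stack([(mlp(x + eps * e) - mlp(x - eps * e)) / (2 * eps)
                        for e in np.eye(n)])
```

The agreement between the product formula and the finite-difference Jacobian confirms the factorization at finite width.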
To examine the training dynamics of MLPs achieving dynamical isometry, [HK21] introduced a spectral analysis of the Fisher information matrix per sample of the MLP. The Fisher information matrix (FIM) has been a fundamental quantity for such theoretical understanding. The FIM describes the local metric of the loss surface with respect to the KL divergence [Ama16]. The neural tangent kernel [JGH18], which has the same eigenvalue spectrum as the FIM except for trivial zeros, also describes the learning dynamics of DNNs when the dimension of the last layer is much smaller than that of the hidden layers. In particular, the FIM's eigenvalue spectrum describes the efficiency of optimization methods. For instance, the maximum eigenvalue determines an appropriate learning rate of first-order gradient methods for convergence [LKS91, KAA19b, WME18]. Despite its importance in neural networks, the FIM spectrum has received very little theoretical study. The reason is that existing approaches have been limited to random matrix theory for shallow networks [PW18] or mean-field theory for eigenvalue bounds, which may be loose in general [KAA19a]. Thus, [HK21] focused on the FIM per sample and found an alternative approach applicable to DNNs. The FIM per sample is equal to the Gram matrix of the parameter Jacobian. Moreover, the eigenvalues of the FIM per sample are equal, except for trivial zero eigenvalues and normalization, to the eigenvalues of a matrix defined recursively as follows:
Here the first matrix is the identity matrix, and the scalar factor is the empirical variance of the $\ell$-th hidden unit. Under an asymptotic freeness assumption, [HK21] derived limit spectral distributions of these matrices.
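The claim that the nonzero spectrum is preserved when passing from the Gram matrix of the Jacobian to its dual can be checked numerically. The sketch below uses a Gaussian stand-in for the parameter Jacobian; the sizes `m` and `p` are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 30, 120  # m outputs, p parameters (illustrative sizes, p > m)
J = rng.standard_normal((m, p))  # stand-in for the parameter Jacobian

F = J.T @ J   # p x p Gram matrix (FIM-per-sample type, up to normalization)
K = J @ J.T   # m x m dual matrix

ev_F = np.sort(np.linalg.eigvalsh(F))[::-1]
ev_K = np.sort(np.linalg.eigvalsh(K))[::-1]
# The top m eigenvalues of F agree with those of K;
# the remaining p - m eigenvalues of F are trivial zeros.
```

This is the standard fact that $J^\top J$ and $JJ^\top$ share their nonzero eigenvalues, which is what allows the analysis to pass to the smaller dual matrix.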
The asymptotic freeness assumption plays a critical role in these studies [PSG18, HK21] in obtaining the propagation of spectral distributions through the layers. However, the proof of asymptotic freeness had not been completed. In the present work, we prove the asymptotic freeness of the layerwise Jacobians of multilayer perceptrons with Haar orthogonal weights.
1.1 Main results
Our results are as follows. Firstly, the following tuple of families is asymptotically free almost surely (see Theorem 4.1):
Secondly, for each $\ell$, the following pair is almost surely asymptotically free (see Theorem 4.2):
This asymptotic freeness is at the heart of the spectral analysis of the Jacobian. Lastly, for each $\ell$, the following pair is almost surely asymptotically free (see Theorem 4.3):
The asymptotic freeness of the pair is the key to the analysis of the conditional Fisher information matrix.
A key ingredient of the proof is the invariance of the MLP described in Lemma 3.1. First, we consider orthogonal matrices that fix the hidden units in each layer. Second, we replace each layer's parameter matrix with itself multiplied by such an orthogonal matrix; the MLP is then unchanged. Furthermore, if the original weights are Haar orthogonal, the Jacobian is also invariant under this replacement. Using this key fact, we can replace each weight with a Haar orthogonal random matrix independent of the Jacobian of the activation function. Asymptotic freeness then follows from well-known properties of Haar orthogonal random matrices.
1.2 Related Works
Asymptotic freeness is weaker than the forward-backward independence assumed in studies of dynamical isometry [PSG17, PSG18, KAA19b]. Although studies based on mean-field theory [SMG14, LBH15, GCC19] have succeeded in explaining many experimental results of deep learning, they use an artificial assumption (gradient independence [Yan19]) that is not rigorously true. Asymptotic freeness is weaker than this artificial assumption. Our work clarifies that asymptotic free independence is exactly the property that is both useful for the analysis and rigorously valid.
Several works prove or use asymptotic freeness with Gaussian initialization [HN19, Yan19, Yan20, Pas20]. However, asymptotic freeness had not been proven for orthogonal initialization. Since dynamical isometry can be achieved under orthogonal initialization but not under Gaussian initialization [PSG18], a proof of asymptotic freeness under orthogonal initialization is essential. Our proof exploits the properties of Haar distributed random matrices in an essential manner and remains transparent: it suffices to replace each weight with a Haar orthogonal matrix independent of the other Jacobians. While [HN19] restricts the activation function to ReLU, our proof covers a comprehensive class of activation functions, including smooth ones.
1.3 Organization of the paper
Section 2 is devoted to preliminaries. It contains settings of MLP and notations about random matrices, spectral distribution, and free probability theory. Section 3 consists of two keys to prove main results. A key is the invariance of MLP, and the other is to cut off a dimension. Section 4 is devoted to proving the main results on the asymptotic freeness. In Section 5, we show applications of the asymptotic freeness to spectral analysis of random matrices, which appear in the theory of dynamical isometry and training dynamics of DNNs. Section 6 is devoted to the discussion and future works.
2.1 Setting of MLP
We consider the multilayer perceptron setting, as is usual in studies of the FIM [PW18, KAA19b] and dynamical isometry [SMG14, PSG18, HK21]. Fix $L \in \mathbb{N}$. We consider an $L$-layer multilayer perceptron as a parametrized map with weight matrices as follows. Firstly, consider activation functions on $\mathbb{R}$, each of which is assumed to be continuous and differentiable except at finitely many points. Secondly, for a single input, set the zeroth-layer unit to be that input. In addition, for $\ell = 1, \dots, L$, set inductively
where the activation function acts entrywise on vectors. Note that we omit bias parameters to simplify the analysis. Denote by $D_\ell$ the Jacobian of the $\ell$-th activation, given by
Lastly, we assume that the weight matrices ($\ell = 1, \dots, L$) are independent Haar orthogonal random matrices, and we further assume the following conditions (d1), …, (d4) on the distributions. In Fig. 1, we visualize the dependency of the random variables.
For each , the input vector is -valued random variable such that there is with
Each weight matrix () satisfies
where are independent orthogonal matrices distributed with the Haar probability measure and .
The bias vectors() have independent entries distributed with , where .
For fixed , the family
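As a side illustration of the weight condition, a Haar orthogonal random matrix can be sampled numerically as the QR factor of a standard Gaussian matrix with a sign correction on the diagonal of $R$. This is a minimal sketch, not part of the formal setting; `sigma_w` is a hypothetical name for the weight scale, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

def haar_orthogonal(n, rng):
    # QR factor of a standard Gaussian matrix; multiplying each column by
    # the sign of R's diagonal makes Q exactly Haar distributed on O(n)
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

sigma_w = np.sqrt(2.0)                 # hypothetical weight scale
W = sigma_w * haar_orthogonal(n, rng)  # a scaled Haar orthogonal weight
U = W / sigma_w

err = np.max(np.abs(U.T @ U - np.eye(n)))  # orthogonality check
det = np.linalg.det(U)                     # must be +1 or -1
```

The sign correction matters: without it, plain QR of a Gaussian matrix is not exactly Haar distributed.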
Let us define and by the following recurrence relations:
The inequality holds by assumption (a2) on the activation functions.
We further assume that each activation function satisfies the following conditions (a1), …, (a5).
It is a continuous function on and is not the identically zero function.
For any ,
It is differentiable almost everywhere with respect to the Lebesgue measure; the derivative is defined almost everywhere.
The derivative is continuous almost everywhere with respect to the Lebesgue measure.
The derivative is bounded.
These conditions (d1), …, (d4)
Example 2.1 (Activation Functions).
We denote by $M_n(K)$ the algebra of $n \times n$ matrices with entries in a field $K$. Write the unnormalized and normalized traces of $A \in M_n(K)$ as follows: $\operatorname{Tr}(A) = \sum_{i=1}^n A_{ii}$ and $\operatorname{tr}(A) = \frac{1}{n}\operatorname{Tr}(A)$.
In this work, a random matrix is an $M_n(\mathbb{R})$-valued Borel measurable map from a fixed probability space, for some $n \in \mathbb{N}$. We denote by $O(n)$ the group of $n \times n$ orthogonal matrices. It is well known that $O(n)$ is equipped with a unique left and right translation invariant probability measure, called the Haar probability measure.
Recall that the spectral distribution of a linear operator $a$ with respect to a normalized trace $\tau$ is the probability measure $\mu$ on $\mathbb{R}$ such that $\tau(a^k) = \int x^k \, \mu(dx)$ for any $k \in \mathbb{N}$. If $A$ is an $n \times n$ symmetric matrix with eigenvalues $\lambda_1, \dots, \lambda_n$, its spectral distribution is given by $\frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i}$, where $\delta_\lambda$ is the discrete probability distribution whose support is $\{\lambda\}$.
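The moment characterization of the spectral distribution can be checked directly: the $k$-th moment of the empirical eigenvalue distribution of a symmetric matrix equals the normalized trace of its $k$-th power. The Wigner-type matrix below is only an example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
G = rng.standard_normal((n, n))
A = (G + G.T) / np.sqrt(2.0 * n)  # a symmetric (Wigner-type) example matrix

lam = np.linalg.eigvalsh(A)  # eigenvalues; spectral distribution = (1/n) sum_i delta_{lam_i}

def tr_n(M):
    return np.trace(M) / n   # normalized trace

# k-th moment of the empirical spectral distribution vs tr_n(A^k)
Ak, diffs = np.eye(n), []
for k in range(1, 6):
    Ak = Ak @ A
    diffs.append(abs(np.mean(lam ** k) - tr_n(Ak)))
```

For such Wigner-type matrices, the even moments approach the Catalan numbers as the dimension grows, which is the semicircle law in moment form.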
Joint Distribution of All Entries
For random matrices and random vectors , we write
if two joint distributions of all entries of corresponding matrices and vectors in the families match.
2.3 Asymptotic Freeness
In this section, we summarize the required background on random matrices and free probability theory.
A noncommutative $C^*$-probability space (NCPS, for short) is a pair $(\mathcal{A}, \tau)$ of a unital $C^*$-algebra $\mathcal{A}$ and a faithful tracial state $\tau$ on $\mathcal{A}$, which are defined as follows. A linear space $\mathcal{A}$ over $\mathbb{C}$ is said to be a unital $C^*$-algebra if it is a unital $*$-algebra equipped with an antilinear involution $a \mapsto a^*$ and a norm $\|\cdot\|$ satisfying the following conditions.
$\mathcal{A}$ is complete with respect to the norm $\|\cdot\|$.
A linear map $\tau$ on $\mathcal{A}$ is said to be a tracial state on $\mathcal{A}$ if the following four conditions are satisfied.
In addition, we say that $\tau$ is faithful if $\tau(a^*a) = 0$ implies $a = 0$.
An example of an NCPS is the pair of the algebra $M_n(\mathbb{C})$ of $n \times n$ matrices with complex entries and the normalized trace $\operatorname{tr}$. Consider the algebra $M_n(\mathbb{R})$ of $n \times n$ matrices with real entries and the normalized trace. This pair itself is not an NCPS in the sense of Definition 2.2 since it is not a $\mathbb{C}$-linear space. However, $M_n(\mathbb{C})$ contains $M_n(\mathbb{R})$, with the $*$-operation preserved, via the natural embedding given as follows:
Also, the inclusion preserves the trace. Therefore, we consider the joint distributions of matrices in $M_n(\mathbb{R})$ as those of elements in the NCPS $(M_n(\mathbb{C}), \operatorname{tr})$.
(Joint Distribution in an NCPS) Let $(\mathcal{A}, \tau)$ be an NCPS and let $\mathbb{C}\langle X_1, \dots, X_k \rangle$ be the free algebra of non-commutative polynomials generated by $k$ indeterminates $X_1, \dots, X_k$. Then the joint distribution of a $k$-tuple $(a_1, \dots, a_k) \in \mathcal{A}^k$ is the linear form on $\mathbb{C}\langle X_1, \dots, X_k \rangle$ defined by
Let $(a_1, \dots, a_k)$ be a $k$-tuple in an NCPS $(\mathcal{A}, \tau)$, and let $(A_1^{(n)}, \dots, A_k^{(n)})_{n \in \mathbb{N}}$ be sequences of $n \times n$ matrices. Then we say that they converge in distribution to $(a_1, \dots, a_k)$ if
$$\lim_{n \to \infty} \operatorname{tr}\bigl[ P(A_1^{(n)}, \dots, A_k^{(n)}) \bigr] = \tau\bigl[ P(a_1, \dots, a_k) \bigr]$$
for any $P \in \mathbb{C}\langle X_1, \dots, X_k \rangle$.
(Freeness) Let $(\mathcal{A}, \tau)$ be an NCPS. Let $\mathcal{A}_1, \dots, \mathcal{A}_m$ be subalgebras of $\mathcal{A}$ having the same unit as $\mathcal{A}$. They are said to be free if the following holds: for any $n \in \mathbb{N}$, any sequence of indices $i_1 \neq i_2 \neq \cdots \neq i_n$, and any $a_j \in \mathcal{A}_{i_j}$ ($j = 1, \dots, n$) with $\tau(a_j) = 0$, it holds that
$$\tau(a_1 a_2 \cdots a_n) = 0.$$
Besides, elements in $\mathcal{A}$ are said to be free if and only if the unital subalgebras they generate are free.
Here we show an example related to our results.
Let and . Then the families are free if and only if the following unital subalgebras of are free:
Let us now introduce asymptotic freeness of random matrices with compact support limit spectral distributions. Since we consider a family of a finite number of random matrices, we restrict it to a finite index set. Note that the finite index is not required for a general definition of freeness.
Definition 2.7 (Asymptotic Freeness of Random Matrices).
Consider a nonempty finite index set and a partition of . Consider a sequence of -tuples
of random matrices where . The sequence is then said to be asymptotically free almost surely as $n \to \infty$ if the following two conditions are satisfied.
There exists a family of elements in an NCPS such that the following tuple is free:
For every ,
almost surely, where is the number of elements of .
2.4 Haar Distributed Orthogonal Random Matrices
[CŚ06, Theorem 5.1] For any $n$, let the former family consist of independent Haar orthogonal random matrices and the latter of random matrices having an almost sure limit joint distribution. Assume that, for each $n$, all entries of the former are independent of those of the latter. Then the families are asymptotically free as $n \to \infty$.
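The content of this theorem can be observed at finite dimension. In the sketch below (illustrative sizes and diagonal matrices, not taken from the paper), the alternating moments of centered deterministic matrices, one of which is conjugated by a Haar orthogonal matrix, nearly vanish as freeness predicts, while the commuting product shows no such cancellation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400

def haar_orthogonal(n, rng):
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

U = haar_orthogonal(n, rng)
A = np.diag(np.linspace(-1.0, 1.0, n))     # deterministic, tr_n(A) = 0
B = np.diag(np.tile([1.0, -1.0], n // 2))  # deterministic, tr_n(B) = 0

def tr_n(M):
    return np.trace(M) / n

# Freeness of A and U B U^T predicts alternating centered moments vanish:
C = A @ U @ B @ U.T
mixed1 = tr_n(C)      # -> tr_n(A) * tr_n(B) = 0
mixed2 = tr_n(C @ C)  # -> 0 for free centered elements

# Without the random rotation the matrices commute and nothing cancels:
commuting = tr_n((A @ B) @ (A @ B))  # = mean of a_i^2, approximately 1/3
```

The small values of `mixed1` and `mixed2` against the order-one value of `commuting` illustrate why conjugation by an independent Haar matrix produces freeness in the limit.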
The following proposition is a direct consequence of Theorem 2.8.
For , let and be symmetric random matrices, and let be a Haar-distributed orthogonal random matrix. Assume that
The random matrix is independent of for every .
The spectral distribution of (resp. ) converges in distribution to a compactly supported probability measure (resp. ), almost surely.
Then the following pair is asymptotically free as $n \to \infty$:
Note that we do not require independence between the two symmetric matrices in Proposition 2.9. Here we recall the following result, which is a direct consequence of the translation invariance of Haar random matrices.
Fix . Let be independent Haar random matrices. Let be valued random matrices. Let be valued random matrices. Let be random matrices. Assume that all entries of are independent of
For the readers’ convenience, we include a proof. The characteristic function of (2.35) is given by
where and . By using conditional expectation, (2.36) is equal to
By the property of the Haar measure and the independence, the conditional expectation in (2.37) is equal to
Thus the assertion holds. ∎
3 Key to Asymptotic Freeness
Since the universality of random matrices leads to asymptotic freeness (Theorem 2.8), it is essential to investigate the network's invariance. The following invariance is the key to the main theorem.
Let be independent Haar random matrices on . Let be an valued random variable. For each , let be an valued random matrix such that
Further assume that all entries of are independent from that of for each . Then the following holds, in joint distributions of all entries:
Before proving Lemma 3.1, we show its application target. For , fix a standard complete orthonormal basis of . Set
Note that the first vector is non-zero almost surely, and the family is then a basis of the space. Secondly, let the complete orthonormal basis be determined by applying the Gram-Schmidt orthogonalization to this family. Thirdly, take the orthogonal matrix which maps each standard basis vector to the corresponding orthonormal basis vector, where the normalization is with respect to the Euclidean norm. Then it satisfies the following conditions.
Lastly, let be independent Haar distributed orthogonal random matrices such that all entries of them are independent of that of . Set
Further, for any , all entries of are independent from that of since each is -measurable, where is the -algebra generated by and . We have completed the construction of the .
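The construction above can be sketched numerically; this is a minimal illustration with hypothetical names, where `h` stands in for a hidden-layer vector. Complete the vector to a basis, orthonormalize by QR (numerically, a Gram-Schmidt orthogonalization of the columns), and conjugate a block-diagonal matrix to obtain an orthogonal matrix that fixes the vector.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
h = rng.standard_normal(n)  # stand-in for a hidden-layer vector

# Complete h to a basis and orthonormalize with QR (Gram-Schmidt)
M = np.column_stack([h, rng.standard_normal((n, n - 1))])
Q, R = np.linalg.qr(M)
if R[0, 0] < 0:
    Q = -Q  # fix the sign so the first column is +h/||h||; -Q stays orthogonal
V = Q

def haar_orthogonal(n, rng):
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

# Identity on span{h}, Haar orthogonal on the complement, conjugated by V:
block = np.zeros((n, n))
block[0, 0] = 1.0
block[1:, 1:] = haar_orthogonal(n - 1, rng)
O = V @ block @ V.T  # an orthogonal matrix fixing h

e1 = np.zeros(n); e1[0] = 1.0
err_orth = np.max(np.abs(V.T @ V - np.eye(n)))
err_map = np.max(np.abs(V @ e1 - h / np.linalg.norm(h)))
err_fix = np.max(np.abs(O @ h - h))
```

The matrix `O` acts trivially on the fixed vector and by an arbitrary Haar orthogonal matrix on its orthogonal complement, mirroring the decomposition used in the construction.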
Each of these is the random matrix that determines the action on the orthogonal complement. Note that the Haar property is not necessary to construct the matrices in Lemma 3.1, but it is used in the proof of Theorem 4.1. Fig. 2 visualizes the dependency of the random variables that appeared in the above discussion.
To prove Lemma 3.1, we prepare some notation to treat the characteristic functions of the joint distributions of entries. For the reader's convenience, we visualize in Fig. 3 the dependency of the random variables for the specific matrices constructed in the discussion above. Note that the proof of Lemma 3.1 does not rely on this specific construction.
Fix and . For each , define a map by
where and . Write
By (3.1), the values of the characteristic functions of the joint distributions at the point are given by the two expressions above, respectively.
Proof of Lemma 3.1.
Let the matrices be arbitrary random matrices satisfying the conditions in Lemma 3.1. We prove that the corresponding characteristic functions of the joint distributions match. We only need to show
Firstly, we claim the following:
for each . To show (3.10), fix and write for a random variable ,
By the tower property of conditional expectations, we have
Let be the Haar measure. Then by the invariance of the Haar measure, we have
Secondly, we claim that for each ,
Denote by the -algebra generated by . By definition, are -measurable. Therefore,
Now we have
since the generators of needed to determine are coupled into . Therefore, by (3.10), we have
Therefore, we have proven (3.18).
The invariance described in Lemma 2.10 fixes the vector, and there are no restrictions on the remaining $(n-1)$-dimensional space. This section quantifies the fact that cutting off the fixed direction causes no significant effect in the large-dimension limit.
Let be the diagonal matrix given by
If there is no confusion, we omit the index and simply write it. For , we denote by the -norm of defined by
Recall that the following non-commutative Hölder’s inequality holds:
for any exponents $p, q, r \geq 1$ with $1/p + 1/q = 1/r$.
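A numerical sanity check of the inequality can be done in the trace case, with Schatten norms normalized by the dimension; the matrices and the conjugate exponents `p`, `q` below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
G1 = rng.standard_normal((n, n)); A = (G1 + G1.T) / 2.0
G2 = rng.standard_normal((n, n)); B = (G2 + G2.T) / 2.0

def norm_p(M, p):
    s = np.linalg.svd(M, compute_uv=False)  # singular values of M
    return np.mean(s ** p) ** (1.0 / p)     # normalized Schatten p-norm

p, q = 3.0, 1.5  # conjugate exponents: 1/p + 1/q = 1
lhs = abs(np.trace(A @ B)) / n              # |tr_n(AB)|
rhs = norm_p(A, p) * norm_p(B, q)           # Hoelder bound
```

The bound combines the von Neumann trace inequality with the classical Hölder inequality for the singular-value sequences; with the normalized trace, the norms must be normalized consistently, as above.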
Fix . Let be random matrices for each . Assume that there is a constant such that almost surely
Let be the orthogonal projection defined in Eq. 3.26. Then we have almost surely
Next, we check that an orthogonal matrix approximates a cutoff of any orthogonal matrix.
For any valued random matrix , there is valued random matrix satisfying
for any , almost surely.
Consider the singular value decomposition
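The idea behind the SVD step can be sketched numerically under stated assumptions: cut one coordinate off a Haar orthogonal matrix, then replace all singular values by one via the singular value decomposition. Interlacing leaves at most two singular values of the cut-off matrix different from one, so the normalized distance to the resulting orthogonal matrix is at most $\sqrt{2/n}$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50

def haar_orthogonal(n, rng):
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

V = haar_orthogonal(n, rng)
P = np.diag([0.0] + [1.0] * (n - 1))  # projection cutting off one coordinate
M = P @ V @ P                          # cut-off matrix, no longer orthogonal

# SVD M = U diag(s) W^T; replacing every singular value by 1 gives the
# orthogonal (polar) factor U W^T, the closest orthogonal matrix to M
U, s, Wt = np.linalg.svd(M)
V_hat = U @ Wt

err_orth = np.max(np.abs(V_hat.T @ V_hat - np.eye(n)))
# All but two singular values of M equal 1, so the normalized distance
# sqrt(tr_n[(M - V_hat)^T (M - V_hat)]) is at most sqrt(2/n)
dist = np.linalg.norm(M - V_hat, 'fro') / np.sqrt(n)
```

This quantifies why the cutoff is negligible in the large-dimension limit: the normalized distance to an exactly orthogonal matrix vanishes at rate $n^{-1/2}$.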