Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function

09/24/2018 · by Wojciech Tarnowski, et al. (Jagiellonian University)

We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do so by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the limit of large network width and depth. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions by analyzing the signal propagation in the artificial neural network. We corroborate our results with numerical simulations of both random matrices and ResNets applied to the CIFAR-10 classification problem. Moreover, we study the consequences of this universal behavior for the initial and late phases of the learning process. We conclude by drawing attention to the simple fact that initialization acts as a confounding factor between the choice of activation function and the rate of learning. We propose that in ResNets this can be resolved, based on our results, by ensuring the same level of dynamical isometry at initialization.


I The model

In this paper, we consider a deep residual network of L layers of constant width N. We follow the typical nomenclature of the literature and therefore the real-valued synaptic (weight) matrix for the l-th layer is denoted by W^l, whereas the real-valued bias vectors are denoted by b^l. The information propagates in this network according to:

x^l = φ(h^l) + a x^{l-1} ,    h^l = W^l x^{l-1} + b^l ,    (1)

where h^l and x^l are the pre- and post-activations respectively, and φ is the activation function itself, acting entry-wise on the vector of pre-activations. We have introduced the parameter a to track the influence of the skip connections in the calculations; we do not, however, study its influence on the Jacobian's spectrum or on learning in general. By x^0 we denote the input of the network and by x^L its output. Our primary interest lies in exploring the singular value spectral properties of the input-output Jacobian:

J = ∂x^L / ∂x^0 ,    (2)

known to be useful in studying initialization schemes of neural networks at least since the publication of GB . In particular, it holds the information on the severity of the exploding gradients problem, as we explain at the beginning of the next section.

I.1 Relevance of the input-output Jacobian

To understand why we are interested in the Jacobian, consider the neural network adjusting its weights during the learning process. In a simplified, example-by-example scenario, this happens according to

W^l → W^l − η ∂E/∂W^l ,    (3)

(and analogously for the biases),

where E is the error function depending on x^L — the output of the network, on the correct output value associated with that example and, implicitly through x^L, on the parameters of the model, namely the weights and biases; η is the learning rate. Applying the chain rule, we can rewrite this as:

∂E/∂W^l = (∂E/∂x^L) · (∂x^L/∂x^l) · (∂x^l/∂W^l) .    (4)

For the learning process to be stable, all three terms need to be bounded. Out of those, the middle one can become problematic if a poor choice of the initialization scheme is made. We can rewrite it as:

∂x^L / ∂x^l = ∏_{k=l+1}^{L} ( a 1 + D^k W^k ) ,    (5)

and see that the larger the difference between L and l, the more terms we have in the product and (in general) the less control there is over its behavior. Here 1 denotes the identity matrix and D^l is a diagonal matrix such that D^l_{ii} = φ'(h^l_i). Indeed, it was first proposed in GB that learning in deep feed-forward neural networks can be improved by keeping the mean singular value of the Jacobian associated with layer l (in our setup a 1 + D^l W^l) close to 1 for all l. It is also important for the dynamics of learning to be driven by the data, not by the random initialization of the network. The latter may take place if the Jacobian to the l-th layer possesses very large singular values which dominate the learning, or very small singular values which suppress it. In the optimal case all singular values should be concentrated around 1, regardless of how deep the considered layer is. One therefore examines the case of l = 0, namely the input-output Jacobian J, as the most extreme object of (5). The feature that in the limit of large depth all singular values of J concentrate around 1, irrespective of the depth of the network, was coined dynamical isometry in Saxe .
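To make these objects concrete, the following minimal numpy sketch builds a toy residual network of the form (1), collects the diagonal matrices D^l from a forward pass, assembles the layer-wise factors a 1 + D^l W^l of (5), and inspects the singular values of their product. The width, depth, scaling convention and the choice of tanh are illustrative assumptions, not the exact settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

N, L = 200, 50              # width and depth (illustrative values)
a = 1.0                     # skip-connection strength; a = 1 is a purely residual network
sigma_w, sigma_b = 1.0, 0.1

def phi(h):
    return np.tanh(h)

def dphi(h):
    return 1.0 / np.cosh(h) ** 2   # derivative of tanh

x = rng.standard_normal(N)         # network input x^0
J = np.eye(N)                      # accumulates the input-output Jacobian

for _ in range(L):
    W = rng.standard_normal((N, N)) * sigma_w / np.sqrt(L * N)  # entry variance sigma_w^2/(L N)
    b = rng.standard_normal(N) * sigma_b
    h = W @ x + b                    # pre-activations, as in (1)
    D = np.diag(dphi(h))             # D^l_ii = phi'(h^l_i)
    J = (a * np.eye(N) + D @ W) @ J  # one factor of the product (5), applied on the left
    x = phi(h) + a * x               # post-activations, as in (1)

s = np.linalg.svd(J, compute_uv=False)
print(f"singular values of J: min={s.min():.3f}, max={s.max():.3f}, mean={s.mean():.3f}")
```

With the 1/(LN) scaling of the weight variance, the spread of the singular values around 1 is controlled by the layer-averaged magnitude of the entries of D^l rather than by the depth itself, which is the universality made precise in the next section.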

Note that the spectral problem for the full Jacobian

J = ∂x^L / ∂x^0 = ∏_{l=1}^{L} ( a 1 + D^l W^l )    (6)

belongs to the class of matrix-valued diffusion processes GJJN ; JW , leading to a complex spectrum. We note that one of us has derived the large N limit eigenvalue spectral density of a generic model of (6), for different symmetry classes of the weight matrices, already in GJJN . Due to the non-normality of the Jacobian, its singular values cannot be easily related to its eigenvalues. Therefore we follow PSG1 ; PSG2 and tackle the full singular value spectrum of the Jacobian (or, equivalently, the eigenvalue spectrum of J^T J), extending these works to the case of the residual neural network model.

II Spectral properties of the Jacobian

II.1 Spectral analysis

Free Probability Theory, or Free Random Variable (FRV) Theory VOICULESCU , is a powerful tool for the spectral analysis of random matrices in the limit of their large size. It is a counterpart of classical Probability Theory for the case of non-commuting observables. For a pedagogical introduction to the subject, see SpeicherMingo ; here we start by laying out the basics useful in the derivations of this subsection. The fundamental objects of the theory are the Green's functions (a.k.a. Stieltjes transforms in the mathematical literature):

G_X(z) = (1/N) ⟨ Tr ( z 1 − X )^{-1} ⟩ ,    (7)

which generate spectral moments. The eigenvalue density can be recovered via the Sochocki-Plemelj formula

ρ_X(λ) = −(1/π) lim_{ε→0+} Im G_X(λ + iε) .    (8)

The associated free cumulants are generated by the so-called R-transform, which plays the role of the logarithm of the characteristic function in classical probability. By this correspondence, the R-transform is additive under addition, i.e. R_{X+Y}(z) = R_X(z) + R_Y(z) for mutually free, but non-commuting, random ensembles X and Y. Moreover, it is related to the Green's function via the functional equations

G( R(z) + 1/z ) = z ,    R( G(z) ) + 1/G(z) = z .    (9)

On the other hand, the so-called S-transform facilitates calculations of the spectra of products of random matrices, as it satisfies S_{XY}(z) = S_X(z) S_Y(z), provided X or Y is positive definite. If, additionally, the ensemble has a finite mean, the S-transform can be easily obtained from the R-transform, and vice versa, through the following pair of mutually inverse maps BJN . Explicitly:

S(z) = 1 / R( z S(z) ) ,    R(z) = 1 / S( z R(z) ) .    (10)
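As a quick numerical illustration of (7)–(8), and not part of the derivation, the sketch below builds the empirical Green's function of a sample Wishart matrix from its eigenvalues and recovers the spectral density slightly above the real axis; the matrix size and the imaginary offset ε are arbitrary choices, and the agreement with the exact Marchenko-Pastur density is limited by finite-size and finite-ε effects.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 1000
X = rng.standard_normal((N, N))
W = X @ X.T / N                    # Wishart matrix; spectrum tends to Marchenko-Pastur on [0, 4]
eigs = np.linalg.eigvalsh(W)

def green(z):
    """Empirical Green's function G(z) = (1/N) Tr (z - W)^(-1), cf. (7)."""
    return np.mean(1.0 / (z - eigs))

eps = 1e-2                         # small imaginary part in the Sochocki-Plemelj formula (8)
lam = np.linspace(0.05, 4.2, 200)
rho = -np.imag(np.array([green(l + 1j * eps) for l in lam])) / np.pi

# exact Marchenko-Pastur density (aspect ratio 1) for comparison
rho_exact = np.sqrt(np.clip(lam * (4.0 - lam), 0.0, None)) / (2.0 * np.pi * lam)

print("max |empirical - exact| =", np.max(np.abs(rho - rho_exact)))
```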

Denoting now by J_l the Jacobian across the first l layers, one can write the recursion relation J_l = ( a 1 + D^l W^l ) J_{l-1}. The matrix J_l^T J_l is isospectral to ( a 1 + D^l W^l )^T ( a 1 + D^l W^l ) J_{l-1} J_{l-1}^T, which leads to the equation for the S-transform S_{J_l^T J_l}(z) = S_{(a 1 + D^l W^l)^T (a 1 + D^l W^l)}(z) · S_{J_{l-1}^T J_{l-1}}(z). Proceeding inductively, we arrive at

S_{J^T J}(z) = ∏_{l=1}^{L} S_{(a 1 + D^l W^l)^T (a 1 + D^l W^l)}(z) .    (11)

The proof can be found, for example, in BNS . To find the S-transform of the single-layer Jacobian, we first consider its Green's function

(12)

with the averaging performed over the ensemble of weight matrices W^l. To facilitate the study of (12), one linearizes the problem by introducing enlarged block matrices

(13)

Another crucial ingredient is the block trace operation (bTr), which is the trace applied to each block separately. The generalized Green's function is defined as the block trace of the generalized resolvent

(14)

Remarkably, the Green's function we are after can be read off from an entry of the generalized Green's function. This construction is a slight modification of the quaternionization approach to large non-Hermitian matrices developed in JNPZ ; QUAT , and we adapt these concepts here for calculations in the large-width limit of the network. Furthermore, the generalized Green's function (14) is given implicitly by the solution of the Schwinger-Dyson equation

(15)

The kernel appearing in (15) is the generalized R-transform of FRV theory. This construction is a generalization of the standard FRV tools to matrix-valued functions, dubbed freeness with amalgamation AMAL . In particular, (15) is such a generalization of (9) to matrices.

To study two common weight initializations, Gaussian and scaled orthogonal, on the same footing, we assume that W^l belongs to the class of biunitarily invariant random matrices, i.e. its probability density function is invariant under multiplication by two orthogonal matrices, W^l → O_1 W^l O_2, for orthogonal O_1 and O_2. In the large N limit these matrices are known in free probability as R-diagonal operators SPEICHER . A product of an R-diagonal operator with an arbitrary operator remains R-diagonal NS , thus the matrix D^l W^l is R-diagonal too.

The generalized R-transform of R-diagonal operators takes a remarkably simple form NowTarn

(16)

Here, the determining sequence appears, which generates the relevant cumulants. For later use we mention a simple relation between the determining sequence of a scaled matrix and that of the original one, which generalizes the analogous rescaling property of cumulants known from the Hermitian case.

We derive the equation for the Green's function by substituting the R-transform (16) into (15) and eliminating the irrelevant variables. It thus reads:

(17)

In the next step, we make a suitable substitution and use (9) to obtain

(18)

Then, we make a further substitution and use (10), which leads us to

(19)

This equation is exact. To incorporate into our considerations the additional scaling of the weight variances by 1/L, as proposed in Taki1 ; BFLLMM , we rescale accordingly and, since we are interested in deep networks, we keep only the leading term in 1/L (see also JW for similar calculations in unitary matrix models). This simplifies (19) to a quadratic equation for the Green's function. Choosing the appropriate branch of the solution, we see that

(20)

Here,

(21)

is the squared spectral radius of the matrix D^l W^l. In general, it can vary across the depth of the network due to the non-constant variance of the preactivations. Assuming that this variability is bounded, we can consider the logarithm of (11) and write:

(22)

where in the last equality we defined the effective cumulant. This allows us to deduce the form of the S-transform, assuming that the effective cumulant does not scale with the depth L:

(23)

Substituting back and using the above relations, we thus obtain

(24)

Then, after one more substitution, we finally get

(25)

an equation for the Green's function characterizing the squared singular values of the Jacobian, which can be solved numerically. We do that for a range of different activation functions and present the results, together with corroborating numerical simulations, in subsection III.1, using the fact that the solution of the above equation can be expressed in terms of the Lambert function, the properties of which are well documented LAMBERT .

One measure of the spread of the spectrum, describing the departure from isometry, is the ratio between the largest and the smallest singular values of J, which we denote by κ. We shall not provide the detailed derivation; we just present the final, large-depth-limit formula:

(26)
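We do not reproduce the derivation of (26) here, but the dependence it captures can be probed directly by Monte Carlo. The sketch below estimates the ratio of the largest to the smallest singular value of sampled Jacobians for a ReLU-like network, where the diagonal factors reduce to Bernoulli masks (an idealisation that ignores signal propagation); all sizes and the values of σ_w are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

N, L, a = 200, 64, 1.0

def sample_condition_number(sigma_w):
    """Ratio s_max / s_min for one sampled Jacobian of a ReLU-like residual network.

    For ReLU, phi'(h) is a 0/1 mask, so D^l is sampled here directly as a
    Bernoulli(1/2) diagonal instead of being obtained from a forward pass.
    """
    J = np.eye(N)
    for _ in range(L):
        W = rng.standard_normal((N, N)) * sigma_w / np.sqrt(L * N)
        D = np.diag(rng.integers(0, 2, size=N).astype(float))
        J = (a * np.eye(N) + D @ W) @ J
    s = np.linalg.svd(J, compute_uv=False)
    return s.max() / s.min()

for sigma_w in (0.5, 1.0, 2.0):
    kappas = [sample_condition_number(sigma_w) for _ in range(5)]
    print(f"sigma_w = {sigma_w}: condition number ~ {np.mean(kappas):.2f}")
```

The estimated ratio shrinks towards 1 as σ_w (and hence the effective cumulant) decreases, in line with the qualitative message of (26).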

We close this section with a remark that the above analysis is not restricted only to the model (1), but analogous reasoning can be performed for networks in which skip connections bypass more than one fully connected block.

II.2 Signal propagation

The formulas we have derived so far were given in terms of a single parameter, the effective cumulant, which involves the squared derivative of the activation function averaged within each layer and across the depth of the network. Thus, we now need to address the behavior of the preactivations. In what follows, we closely follow a similar derivation done in SGGS for fully connected feed-forward networks.

For the simplicity of our arguments, we consider here the entries of W^l and b^l as independent, identically distributed (iid) Gaussian random variables with zero mean and variances σ_w²/(LN) and σ_b², respectively. Here σ_w is of order one, and the additional scaling is meant to reflect the rescalings introduced in the previous paragraphs. At the end of this section we provide an argument that the same results hold for scaled orthogonal matrices.

In this subsection, we will denote the averaging over the random variables W^l and b^l at a given layer l by ⟨·⟩. By an overline we denote the sample average of some variable over the neurons of the l-th layer. Note that the width N is independent of the layer number; however, the derivation can be easily generalized to the opposite case, when the architecture is more complicated. Unless stated otherwise explicitly, all integrals are calculated over the real line.

We are interested in the distribution of the pre-activations h^l in our model, depending on the input vectors and on the probability distributions of W^l and b^l. If we assume they are normal (as can be argued using the Central Limit Theorem), we just need the first two moments. It is clear that ⟨h^l_i⟩ = 0. Furthermore, we assume ergodicity, i.e. that averaging some quantity over a layer of neurons is equivalent to averaging this quantity for one neuron over an ensemble of neural networks with random initializations. We assume this is true for the pre-activations, the post-activations and the relevant functions thereof. Thus, we can identify the sample averages with the ensemble averages and, moreover, as we work in the limit of wide networks, the averaging over the neurons of a layer can be replaced with an averaging over a normal distribution of variance q^l, the variance of the pre-activations in the l-th layer. This is the crux of the dynamical mean field theory approach developed in PLRSG for feed-forward neural networks. We have in particular:

(27)

where the Gaussian average is taken with variance q^l. Thus we obtain:

(28)

Therefore, to calculate the effective cumulant, we need to know how the variance of the distribution of the preactivations changes as the input information propagates across consecutive layers of the network.

With the scalings of σ_w from section II.1 made explicit, the entries of W^l have variance σ_w²/(LN). Now, based on (1) and the above considerations, we have

(29)

and

(30)

Let us assume that the correlation between the post-activations φ(h^l) and the previous-layer activations x^{l-1} factorizes in the large N limit. This is justified, as the input to h^l comes from all the many elements of x^{l-1}. We can rewrite (30) as

(31)

Turning to the mean of the post-activations, based on (1), we have:

(32)

Iterating the recurrence yields

(33)

Thus, (31), with a shift in the layer index, turns into

(34)

Finally, we use (29) to obtain

(35)

which is a closed recursive equation for the variance of the pre-activations. We note that for a = 0, the known feed-forward-network recursion relation is recovered. Furthermore, for a = 1, in contrast to the feed-forward architecture, the biases do not influence the statistical properties of the pre-activations. Moreover, for ResNets this recursive relation is iteratively additive, namely each successive variance is the result of adding some terms to the previous one. In all the examples studied below, the first term is positive and the second term is non-negative. This in turn means that the variance of the pre-activations grows with the network's depth and there are no non-trivial fixed points of this recursion equation. Finally, here we can see the importance of the 1/L scaling of the weight variance, without which the variance of the pre-activations would grow uncontrollably with the depth.
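The explicit form of (35) is not reproduced above, but the kind of computation it encodes can be sketched as follows. For a purely residual network (a = 1) we propagate the pre-activation variance layer by layer, with the Gaussian averages of φ, φ² and (φ')² evaluated by Gauss-Hermite quadrature, and then average σ_w²⟨(φ')²⟩ over the depth as a proxy for the effective cumulant. The precise form of the update, in particular the cross term involving the running mean of the post-activations, is our reading of (29)–(34) and should be checked against the original equations.

```python
import numpy as np

# Gauss-Hermite nodes and weights for averages over h ~ N(0, q):
# E[f(h)] = sum_k w_k f(sqrt(2 q) t_k) / sqrt(pi)
t, w = np.polynomial.hermite.hermgauss(101)

def gauss_avg(f, q):
    return np.sum(w * f(np.sqrt(2.0 * q) * t)) / np.sqrt(np.pi)

def effective_cumulant(phi, dphi, L, sigma_w, sigma_b, q0=1.0):
    """Mean-field sketch of the a = 1 signal-propagation recursion.

    Assumed update (our reading of (29)-(35)):
        q_{l+1} = q_l + (sigma_w^2 / L) * ( E[phi(h)^2] + 2 * E[phi(h)] * m_l ),
        m_{l+1} = m_l + E[phi(h)],
    with h ~ N(0, q_l); the biases only enter through the initial condition.
    Returns sigma_w^2 * mean_l E[phi'(h_l)^2] as a proxy for the effective cumulant.
    """
    q = sigma_w**2 * q0 / L + sigma_b**2   # variance of the first pre-activations
    m = 0.0                                # running mean of the post-activations
    dphi2 = []
    for _ in range(L):
        dphi2.append(gauss_avg(lambda h: dphi(h) ** 2, q))
        mean_phi = gauss_avg(phi, q)
        q = q + (sigma_w**2 / L) * (gauss_avg(lambda h: phi(h) ** 2, q) + 2.0 * mean_phi * m)
        m = m + mean_phi
    return sigma_w**2 * np.mean(dphi2)

print("tanh:", effective_cumulant(np.tanh, lambda h: 1.0 / np.cosh(h) ** 2,
                                  L=100, sigma_w=1.0, sigma_b=0.1))
print("ReLU:", effective_cumulant(lambda h: np.maximum(h, 0.0), lambda h: (h > 0).astype(float),
                                  L=100, sigma_w=1.0, sigma_b=0.1))
```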

We remark here that the above reasoning concerning signal propagation also holds when the weights are scaled orthogonal matrices, i.e. appropriately rescaled Haar-random orthogonal matrices. In such a case the required moments of the weight entries are known CollSn , and the entries of W^l can be approximated as independent Gaussians Chatterjee .

II.3 Results for various activation functions

We now investigate particular examples of activation functions. For simplicity, we consider purely residual networks, i.e. we set a = 1. The numerical verification of the results presented here follows in the next subsection.

  1. Linear

    In the case of the linear activation function, φ'(h) = 1 and there is no need to consider the way the pre-activations change across the network. Thus we can proceed to calculating the cumulant, which in this case is simply proportional to σ_w².

  2. The example of ReLU is only slightly more involved. Now we have φ'(h) = θ(h), where θ is the Heaviside step function, and thus

    (36)
  3. Leaky ReLU

    The activation function interpolating between the first two examples is the leaky ReLU, φ(h) = h for h > 0 and φ(h) = γh for h ≤ 0, with the leaking constant γ ∈ [0, 1]. In this case

    (37)

Altogether, this leads to the following equation for the Green's function

(38)

where γ = 1 corresponds to the linear activation function and γ = 0 to ReLU. Equation (38) can be easily solved numerically for the spectral probability density of the Jacobian. For completeness, we write down the recursion relation (35) in these three cases:

(39)

For the linear activation function (γ = 1), its solution is readily available and reads

(40)

which explicitly shows the importance of the rescaling introduced earlier.

  4. Hard hyperbolic tangent

    The hard tanh activation function is defined by φ(h) = h for |h| ≤ 1 and φ(h) = sign(h) elsewhere. Thus:

    (41)

    The resulting recurrence equation for the variance of the preactivations reads:

    (42)

    and can be easily solved.
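All of the activation functions above are piecewise linear, so the Gaussian average of φ'(h)² can be written down in closed form. The snippet below cross-checks these elementary identities (our own bookkeeping, with γ denoting the leaking constant and q the pre-activation variance) against a Monte Carlo estimate; multiplied by σ_w² and averaged over the depth, such averages are what enters the effective cumulant in the convention of our sketches.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(3)
q = 2.0                       # pre-activation variance (arbitrary test value)
gamma = 0.3                   # leaking constant of leaky ReLU
h = np.sqrt(q) * rng.standard_normal(1_000_000)

cases = {
    # activation       sampled phi'(h)^2                  closed form of E[phi'(h)^2]
    "linear":    (np.ones_like(h),                        1.0),
    "ReLU":      ((h > 0).astype(float),                  0.5),
    "leakyReLU": (np.where(h > 0, 1.0, gamma) ** 2,       (1.0 + gamma**2) / 2.0),
    "hard tanh": ((np.abs(h) < 1).astype(float),          erf(1.0 / np.sqrt(2.0 * q))),
}
for name, (sample, exact) in cases.items():
    print(f"{name:10s}  Monte Carlo = {sample.mean():.4f}   exact = {exact:.4f}")
```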

In the preceding examples we dealt with piecewise-linear activation functions. For other, genuinely nonlinear activation functions, we need to use the recurrence relation describing the signal propagation in the network in order to obtain the cumulants.

  5. Hyperbolic tangent

    When φ(h) = tanh(h), the activation function is antisymmetric and the last term of (35) vanishes. Thus the recurrence takes the form:

    (43)

    In the large L limit, we can write

    (44)

    where we rely on the smallness of the layer-to-layer increments. Expanding this recursively and keeping only the leading terms, we obtain:

    (45)

    Therefore, the solution of the recursion is:

    (46)

    The variance grows linearly with the depth (this is verified in Appendix A, see Fig. 5). The cumulants, and thus the effective cumulant, are obtained by numerical integration.

In fact, in the above calculations we have only used the antisymmetry of the hyperbolic tangent activation function and the properties of its behavior near the origin. Therefore, these results are valid for other antisymmetric activation functions as well.
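As a concrete instance of the numerical integration mentioned above: for an antisymmetric φ the Gaussian mean ⟨φ(h)⟩ vanishes, so in the same mean-field sketch as before (again an assumption about the exact form of (43)) the variance update reduces to adding (σ_w²/L)⟨φ(h)²⟩ at each layer. The snippet iterates this reduced recursion for tanh and reports the per-layer averages needed for the cumulant; all parameter values are illustrative.

```python
import numpy as np

t, w = np.polynomial.hermite.hermgauss(101)

def gauss_avg(f, q):
    # E[f(h)] for h ~ N(0, q) via Gauss-Hermite quadrature
    return np.sum(w * f(np.sqrt(2.0 * q) * t)) / np.sqrt(np.pi)

L, sigma_w, sigma_b = 100, 1.5, 1.0
q = sigma_w**2 / L + sigma_b**2          # first-layer variance for a unit-variance input
qs, dphi2 = [], []
for _ in range(L):
    qs.append(q)
    dphi2.append(gauss_avg(lambda h: 1.0 / np.cosh(h) ** 4, q))   # E[(tanh'(h))^2]
    q = q + (sigma_w**2 / L) * gauss_avg(lambda h: np.tanh(h) ** 2, q)

print("q at layers 1, L/2, L:", round(qs[0], 3), round(qs[L // 2], 3), round(qs[-1], 3))
print("effective cumulant (our convention):", sigma_w**2 * np.mean(dphi2))
```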

  6. Sigmoid

    The sigmoid activation function, φ(h) = 1/(1 + e^{-h}), is the first example we encounter with a non-vanishing mean of the post-activations; thus it deserves special attention. In particular, in this case one needs to additionally address the last term in (35). It turns out that

    (47)

    irrespective of the value of the pre-activation variance. Therefore, the recurrence relation becomes:

    (48)

    Thus, we can see that, due to the non-zero first moment of the activation function (47), the mean and the second moment of the post-activations grow with depth. Similarly, the variance of the pre-activations increases as the signal propagates, which causes quick saturation of the sigmoid nonlinearity. This in turn precludes training of deep networks GB .

    Analogously to the previous case, one can derive an approximation to the solution of the recursion relation. In this case it becomes:

    (49)

    We verify this result in Fig. 5 in Appendix A.

  7. Scaled Exponential Linear Units

    Our final example is the SELU activation function, introduced recently in KUM with the intent of bypassing the batch normalization procedure. In this case, we have φ(h) = λh for h > 0 and φ(h) = λα(e^h − 1) for h ≤ 0. Thus, it is not antisymmetric and it is nonlinear for negative arguments. It turns out that

    (50)

    Moreover:

    (51)

    and

    (52)

    These yield the recursion relation for the variance of the pre-activations via (35). One can check that for λ = 1 and α = 0, the results for ReLU are recovered.

These theoretical predictions for the recursion relations are tested with numerical simulations using Mathematica. The results are relegated to Appendix A.

II.4 Random matrix experiments with Mathematica

To thoroughly test the theoretical predictions of section II, we run numerical simulations using Mathematica. The initial condition, an input vector of length N filled with iid Gaussian random variables of zero mean and unit variance, is propagated according to the recurrence (1) for various activation functions. The network weights and biases are generated from normal distributions of zero mean and of the variances introduced above (see the plots for the values of the other parameters). The propagation of the variance of pre-activations and post-activations, as well as of the second cumulant, across the network is presented in Appendix A. All numerical simulations corroborate our theoretical results. Here, for clarity and as a generic example, in figure 1 (left) we show the distribution of singular values of the input-output Jacobian (defined in (6)) for the tanh nonlinearity and various network depths. In this example the Jacobian is not independent of the signal propagation, as is the case for piecewise-linear activation functions. Similarly, in the right panel of figure 1, we showcase the outcome of numerical experiments and the associated, matching theoretical results for the most popular ReLU activation function, for various initializations resulting in different values of the effective cumulant.

Figure 1: (Left) Density of singular values of the input-output Jacobian for the residual network with the tanh nonlinearity. Note that the asymptotic theoretical result describes remarkably well even networks that are not very deep. (Right) Asymptotic distribution of singular values for various values of the effective cumulant (dashed), juxtaposed with numerical simulations for the ReLU nonlinearity (solid). Note that the histograms were calculated from a single random initialisation. The smaller the effective cumulant, the narrower the spectrum and the closer the Jacobian is to an isometry.

III Experiments on image classification

The goal of this section is to test our theoretical predictions on real data. To this end we use a single-representation, fully connected residual network, see Fig. 2. This simplified version of the model of (HZRS, ) does not use (i) multiple stages with different dimensions of the hidden representation, and (ii) two layers within a residual block. Further, we restrict our focus to image classification via the popular CIFAR-10 benchmark CIFAR10 . Finally, we focus on fully connected residual networks and leave the study of convolutional layers for future work.

Figure 2: Residual network architecture used in the paper.
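A minimal PyTorch sketch of our reading of the architecture in Fig. 2 is given below: a linear embedding of the flattened CIFAR-10 image into a hidden representation of width N, a stack of L fully connected residual blocks x ← x + φ(Wx + b), and a linear classifier. The widths, the initialization helper and the placement of the embedding and classifier are illustrative assumptions, not the verbatim experimental code.

```python
import math
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One fully connected residual block: x -> x + phi(W x + b)."""
    def __init__(self, width, activation, sigma_w, depth):
        super().__init__()
        self.fc = nn.Linear(width, width)
        self.act = activation
        # residual-branch weights ~ N(0, sigma_w^2 / (depth * width)); biases set to zero
        nn.init.normal_(self.fc.weight, std=sigma_w / math.sqrt(depth * width))
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        return x + self.act(self.fc(x))

class SingleRepresentationResNet(nn.Module):
    def __init__(self, depth=64, width=256, num_classes=10, activation=nn.Tanh(), sigma_w=1.0):
        super().__init__()
        self.embed = nn.Linear(3 * 32 * 32, width)      # flattened CIFAR-10 image
        self.blocks = nn.Sequential(
            *[ResidualBlock(width, activation, sigma_w, depth) for _ in range(depth)]
        )
        self.classifier = nn.Linear(width, num_classes)

    def forward(self, x):
        x = self.embed(x.flatten(start_dim=1))
        x = self.blocks(x)
        return self.classifier(x)

model = SingleRepresentationResNet()
logits = model(torch.randn(8, 3, 32, 32))   # a batch of 8 random stand-in images
print(logits.shape)                         # torch.Size([8, 10])
```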

III.1 Achieving dynamical isometry for any activation function

Perhaps the most interesting prediction of our theory is that ResNets, in contrast to fully connected feed-forward networks, can achieve dynamical isometry for many different activation functions. We study this empirically by looking at the singular spectrum of J at initialization for different activation functions and numbers of residual blocks. Please note that by J we refer to the Jacobian of the output of the last residual block with respect to the input of the first one.

We consider the following popular activation functions: ReLU NH , Tanh, Hard Tanh, Sigmoid, SELU (KUM, ) and Leaky ReLU (MHN, ) (with a fixed leaking constant). For each activation function we consider several numbers of residual blocks L. All weights of the network are initialized from a zero-centered normal distribution, whereas the biases are initialized to zero. The weights of the residual blocks are initialized using the standard deviation dictated by our theory, while the other weights are initialized as in (Taki1, ). For a given activation function and number of blocks L, we set the weight standard deviation so that the effective cumulant takes a fixed common value, which ensures the concentration of the singular values of the Jacobian around one, and hence dynamical isometry (see Sec. II.3 for more details and Fig. 1 for the shape of the singular value density).

For each pair of activation function and number of blocks we compute the empirical spectrum of J at initialization; the results are reported in Fig. 3. Indeed, we observe that upon scaling the initialization's standard deviation in such a way that the effective cumulant is kept constant, the empirical spectrum of J is independent of the number of residual blocks and of the choice of activation function.
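The following sketch, with assumed and purely illustrative settings, mirrors this experiment for ReLU. Under the convention used in the earlier sketches, the effective cumulant of a ReLU network is σ_w² E[(φ')²] = σ_w²/2, so a chosen target value fixes σ_w; the Jacobian of the residual trunk with respect to the input of the first block is then obtained with autograd and its singular values inspected.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 32, 128
target_cumulant = 0.2                       # illustrative target, not the value used in the paper
sigma_w = math.sqrt(2.0 * target_cumulant)  # for ReLU: cumulant = sigma_w^2 / 2 (our convention)

blocks = []
for _ in range(depth):
    fc = nn.Linear(width, width)
    nn.init.normal_(fc.weight, std=sigma_w / math.sqrt(depth * width))
    nn.init.zeros_(fc.bias)
    blocks.append(fc)

def trunk(x):
    # residual trunk only: J is the Jacobian of its output w.r.t. the input of the first block
    for fc in blocks:
        x = x + torch.relu(fc(x))
    return x

x0 = torch.randn(width)                             # stand-in for an embedded CIFAR-10 example
J = torch.autograd.functional.jacobian(trunk, x0)   # (width, width) matrix
s = torch.linalg.svdvals(J)
print(f"singular values of J: min={s.min().item():.3f}, max={s.max().item():.3f}")
```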

Figure 3: Singular spectra obtained for various activation functions and depths. The network was fed with examples from the CIFAR-10 dataset.

III.2 Learning dynamics are not the same under dynamical isometry

Figure 4: Training accuracy during the first 100 iterations (left) and the first 100 epochs (right) of residual networks with various activation functions. The weight initialization was chosen for each activation function in such a way that the effective cumulant takes the same fixed value; the corresponding spectra of the Jacobians are presented in Fig. 3. The leaking constant of leaky ReLU is fixed to a single value.

Our next objective is to investigate whether networks which achieve dynamical isometry share similar learning dynamics. While this is outside the scope of our theoretical investigation, it is inspired by studies such as (PSG1, ), which demonstrate the importance of achieving dynamical isometry at initialization for the subsequent optimization.

We consider the same set of experiments as in the previous section, and follow a training protocol similar to (HZRS, ). We train for a fixed number of epochs and drop the learning rate by a constant factor at three pre-set epochs. We use a fixed batch size and one of two starting learning rates. (We use relatively low learning rates, largely because we omit batch normalization layers in the architecture.)
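For completeness, a schematic of this training protocol in PyTorch follows; the milestones, decay factor, learning rate and momentum below are placeholders, since only the structure of the schedule is specified above.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

def train(model, train_loader, epochs=100, lr=1e-3, milestones=(30, 60, 90), gamma=0.1):
    """Schematic SGD training loop; all hyperparameter values here are placeholders."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=list(milestones), gamma=gamma)
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```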

First, we look at the learning dynamics on the training set. Most of the activation functions exhibit a similar evolution of the training accuracy, see Fig. 4 (top). Using the sigmoid activation, however, led to significantly slower optimization. This is due to the faster growth of the variances of post- and pre-activations (which can be observed in Fig. 5), which exacerbates the neuron saturation problem. The difference between the sigmoid and the rest of the activation functions is even more pronounced on the validation data set, see Fig. 4 (bottom left).

Overall, our results suggest that the probability density of the singular spectrum of J at initialization does not fully determine generalization and training performance, irrespective of the activation function used. Nonetheless, setting the same effective cumulant in experiments with different activation functions results in a markedly coinciding behavior of the networks whose activation functions have similar characteristics. This is in contrast to a setup in which the variances of the weight matrix entries are set to be equal. To demonstrate this, we run another set of training experiments, this time with all the standard deviations set to the same value. The plots depicting the full results are relegated to Fig. 6 in Appendix B. Here, in Fig. 4 (bottom right), we showcase the training accuracy during the initial iterations for these two setups (excluding, for clarity, the networks with the sigmoid activation function). With different effective cumulants, the learning dynamics differ among experiments with different activation functions, especially at the beginning of learning.

This suggests that the spectrum of the input-output Jacobian at initialization can be treated as a confounding variable in such experiments. Ensuring that the level of dynamical isometry, and hence the value of the effective cumulant, is kept the same provides the possibility of a more meaningful comparison of the effect of activation functions on learning dynamics.

IV Comment on batch normalization

A crucial component of practically used ResNets is batch normalization IS . When it is applied to the pre-activations, between each pair of layers, the propagation of information in the network is described by:

(53)

where

(54)

are the mean and the (regularized with some small ε) standard deviation of the i-th coefficient of the pre-activations in layer l, computed over the inputs of the mini-batch. The scale and shift parameters γ and β are optimized during the learning process. Here, the formula for the Jacobian reads:

(55)

where the single-layer factor now contains a diagonal matrix whose entries combine the derivative of the activation function with the batch-normalization rescaling. Therefore, the only difference in the spectral statistics derivation from the previous section is that (21) becomes

(56)

Thus, the universal, large-depth-limit equation for the Green's function of the Jacobian (25) and the formula for the condition number (26) hold also when batch normalization is included. Again, the batch-normalization statistics can be treated as random variables. Unfortunately, the evolution of their probability density functions across the layers is quite complicated and beyond the scope of this paper. We leave their analysis for future work.
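To illustrate how batch normalization enters the single-layer Jacobian factor, the sketch below treats the mini-batch statistics as constants (the simplification behind (55)–(56) as we read it) and simply rescales the diagonal derivative matrix entry-wise by γ_i/σ̂_i; all quantities are randomly generated placeholders and φ = tanh is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, a = 200, 50, 1.0
sigma_w, eps = 1.0, 1e-5

def layer_factor_with_bn(x_batch, W, gamma, beta):
    """Single-layer Jacobian factor a*I + D_tilde W for the first sample of the batch,
    treating the mini-batch statistics mu and sigma_hat as constants."""
    H = x_batch @ W.T                          # pre-activations for the whole mini-batch
    mu, sig = H.mean(axis=0), np.sqrt(H.var(axis=0) + eps)
    H_norm = gamma * (H - mu) / sig + beta     # batch-normalized pre-activations
    dphi = 1.0 / np.cosh(H_norm[0]) ** 2       # phi'(h~) for the first sample (phi = tanh)
    D_tilde = np.diag(dphi * gamma / sig)      # activation derivative times the BN rescaling
    return a * np.eye(N) + D_tilde @ W

x_batch = rng.standard_normal((64, N))         # a mini-batch of 64 inputs
W = rng.standard_normal((N, N)) * sigma_w / np.sqrt(L * N)
gamma, beta = np.ones(N), np.zeros(N)          # BN parameters at initialization
M = layer_factor_with_bn(x_batch, W, gamma, beta)
radius = np.max(np.abs(np.linalg.eigvals(M - a * np.eye(N))))
print("squared spectral radius of the modified D W:", radius ** 2)
```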

V Synopsis and discussion

The main focus of this paper was the singular spectrum of the input-output Jacobian of a simple model of residual neural networks. We have shown that in the limit of large network depth it is described by a single, universal equation. This holds irrespective of the activation function used, for biunitarily invariant weight initialization matrices, a class covering the Gaussian and scaled orthogonal initialization schemes. The singular value density depends on a single parameter, called the effective cumulant, which can be calculated by considering the propagation of information in the network via a dynamical mean field theory approach. This parameter depends on the activation function used, on the variances of the biases and of the weight matrix entries and, for some activation functions, also on the depth of the network. We demonstrated the validity of our theoretical results in numerical experiments, both by generating random matrices adhering to the assumptions of the model and by evaluating the Jacobians of residual networks (at initialization) on the CIFAR-10 dataset.

For a given activation function and/or network depth, it is always possible to set the variances of the weight matrix entries in such a way that the resulting singular spectra of the Jacobians not only fulfill the conditions for dynamical isometry, but are also exactly the same, irrespective of the activation function used. This observation allows us to eliminate the singular spectrum of the Jacobian as a confounding factor in experiments on the learning process of simple residual neural networks with different activation functions. As an example of how this approach can be applied, we examined how the accuracies of simple residual neural networks, employing a variety of activation functions, change during the learning process. When using the same variances of the weight matrix entries, the learning curves of similar activation functions differed from each other more than when the networks were initialized with the same input-output Jacobian spectra. This allows, in our opinion, for a more meaningful comparison of the effects of choosing the activation function. We hope this observation will help with research on deep neural networks.

Finally, we would like to note that, in hindsight, it is not surprising that Free Probability Theory, a generalization of classical Probability Theory to non-commuting variables, is useful in the study of deep neural networks. We expect to see more interesting results in this area, especially looking towards applications of the many new and emerging results on the eigenvalue spectra and (equally if not more important) the eigenvectors of non-symmetric matrices.

VI Acknowledgments

PW would like to acknowledge the funding of the Polish National Science Centre through the project SONATA number 2016/21/D/ST2/01142. WT appreciates the financial support from the Polish Ministry of Science and Higher Education through the "Diamond Grant" 0225/DIA/2015/44. SJ was supported by the Polish National Science Centre through the ETIUDA stipend No. 2017/24/T/ST6/00487.

References

Appendix A Numerical verification of the recurrence relations

To validate the assumptions made and corroborate the theoretical results obtained in subsections II.2 and II.3, we simulate signal propagation in the studied residual neural networks with different activation functions. The outcomes of these experiments are depicted in figure 5.

Figure 5: Verification of the numerical solution of the recurrence equations for the post-activations (31) (left column) and the pre-activations (35) (middle column). Based on this signal propagation, the effective cumulant for each layer was calculated (right column). The solid red lines represent the approximation (46) for the tanh and hard tanh nonlinearities and (49) for the sigmoid. Solutions of the recurrences (solid lines) are confronted with numerical simulations (dots) of residual fully connected networks with L layers of width N. Data points represent a single run of the simulations. The weights are independently sampled from a Gaussian distribution of zero mean and the appropriately scaled variance. The biases and the network input are sampled from the standard normal distribution. The small variability of the per-layer cumulant across the network justifies the assumption made in the derivation of (22).

Appendix B Baseline

We advocate setting the same value of the effective cumulant (and hence keeping the same spectrum of the input-output Jacobian) when comparing the effects of using different activation functions on the learning process. Thus, here in Fig. 6, for comparison, we showcase the learning accuracy when, instead of the effective cumulant, the variances of the weight matrix entries are kept the same across the networks.

Figure 6: Training accuracy during the first 100 iterations (left) and the first 100 epochs (right) of residual networks with various activation functions. The weight initialization was Gaussian with zero mean and the same variance for all networks. The leaking constant of leaky ReLU (LReLU) was fixed to a single value.