Training very deep networks via stochastic gradient descent is an unstable numerical procedure. The end result of that training varies in quality based on metrics such as convergence speed, robustness to random factors, and of course terminal accuracy. Various moving parts of the deep network in question can be interchanged in order to favour one aspect over the other. Of particular interest is the maximum training depth, or number of layers, that a specific architecture can reliably achieve. A key breakthrough in deep learning was achieved when that number got in the region of 10 layers by switching the nonlinear activation unit from traditionally used hyperbolic tangent and sigmoid units to a rectified linear unit(Glorot & Bengio, 2010). Further progress, by an order of magnitude or two, was recently made possible by the combination of a residual network reparameterization (He et al., 2015b) and a milestone process called batch normalization (Ioffe & Szegedy, 2015), constraining the output data statistics after each layer. For all intents and purposes, these two methods have ushered in an era of arbitrarily deep networks, at the expense of increased computational cost (and maybe less theoretical interpretability). Without using them the number of layers that can be trained reliably is still limited (Pennington et al., 2017)
. For instance, we did not manage to train feedforward networks deeper than 30 layers, using rectified linear units, without such stabilising procedures. In this work, we investigate if existing theory available for multi-layer perceptrons enables us to design activation functions more suitable for training deeper networks, by transforming existing activations.
The contributions of this paper are as follows :
first, we use newly discovered applications of random matrix theory to neural networks in order to give a new, general process that transforms activation functions;
second, we compute the transforms of activation functions commonly used in deep learning, then give a closed-form expression for the transformed rectified linear unit, and focus on evaluating it empirically;
third, we explore the empirical eigenvalue spectra of networks trained with these new activations.
2 Related Work
Studies in physics-inspired theoretical properties of the gradient flow in neural networks go, for instance, as far back as Sompolinsky (Sompolinsky et al., 1988). A revival of this literature was sparked by the Stanford school of thought (Poole et al., 2016; Schoenholz et al., 2016), making connections between neural networks and statistical physics, in particular noting the existence of a critical phase separating the two unstable regimes of exploding and vanishing gradients.
On the activation function front, sigmoidal and hyperbolic tangent units have historically been favoured with neural networks. The advent of rectified linear units, or ReLUs (Glorot & Bengio, 2010; Nair & Hinton, 2010) marked a notable shift that enabled the training of deeper networks than previously thought possible. This sparkled much research aimed at maximizing accuracy, with exponential linear units or ELUs (Clevert et al., 2015; Klambauer et al., 2017), Gaussian linear units or GELUs (Hendrycks & Gimpel, 2016), and swish functions Elfwing et al. (2017) all notable units performance-wise, enabling accuracy gains.
) in the large square layer regime for feed-forward neural networks. It opens up the possibility to train stabler, deeper networks without requiring techniques like skip connections(Pennington & Worah, 2017), batch normalization, (Ioffe & Szegedy, 2015) and its siblings weight normalization (Salimans & Kingma, 2016) or group normalization (Wu & He, 2018). A method that would ’push the batch norm inside the activations’, at least in a static manner, would be of computational interest.
3 Review of random matrices and free probability for neural networks
3.1 Dynamical isometry
For clarity, we begin with laying out our theoretical framework for what follows. This expository section is much indebted to the work of Pennington (Pennington et al., 2018; Pennington & Worah, 2017), and the reader familiar with his work might omit it. Let us assume we have an -layer deep multi-layer perceptron, or feed-forward neural network, of width , with weight matrices, pre-activations , and post-activations , for each layer in . The forward propagation equations are
where is a nonlinearity (also referred to in this paper as an activation function), and the input is . The input-output Jacobian is therefore given by
Here is a diagonal matrix with diagonal terms , where is a Kronecker delta (it will in general denote a Dirac delta in the sequel). Intuitively, this input-output Jacobian
, and particularly its spectral content - eigenvalues and conditioning - is critical for the operation of backpropagation that happens during the training of the network : if we knew thatall eigenvalue moduli were tightly concentrated around a mean value of 1, gradient flow in the network would be optimal, as no direction of input data would yield neither vanishing or exploding gradients. This very desirable situation is called dynamical isometry. This way of normalizing networks by construction ensures network training proceeds as fast as possible. The biases don’t come into play in the expression of , so we need consider weights only - as random matrices who belong to an ensemble that depends on how the neural network is initialized. Now we want to study the whole spectral distribution of Jacobian for , in the wide layer limit . In general, given a random matrix , its limiting spectral density is a real function defined as the limit of the eigenvalue empirical measure
Since we only care about the squared absolute values of the eigenvalues, we can consider the limiting spectral density of the self-adjoint and therefore semidefinite positive , so as to compare and the ideal distribution . The Jacobian is a product of random matrices as given by equation 2, so the mathematical problem we are facing is that of computing the eigenvalues of a product of random matrices.
To this end we consider two complex variable transforms and , that turn a random matrix ensemble’s spectral density into a complex variable function. These transforms both encode the same information; we use them interchangeably as per ease of computation. The first one,
, is just the moment generating function,
where is the -th moment of the distribution , or the limit trace moment of random matrix ,
This information that encodes all the coefficients is also contained in the so-called Cauchy-Stieltjes transform turning a random matrix ensemble into a function of the complex variable:
On the other hand, the S-transform is defined as
where is the functional inverse of .
The S-transform is key here, due to its morphism behavior with respect to matrix multiplication. If and are two freely independent111The theory of free independence dates to Voiculescu, and relies on an assumption of commutatitivity of joint trace moments. This assumption holds more and more true in very high dimension where, due to the phenomenon of concentration of measure, random directions are orthogonal with high probability; and as such, eigenspaces of two random matrices are approximately orthogonal and approximately commute. Readers (should there be any) are referred to Speicher and Mingo
dates to Voiculescu, and relies on an assumption of commutatitivity of joint trace moments. This assumption holds more and more true in very high dimension where, due to the phenomenon of concentration of measure, random directions are orthogonal with high probability; and as such, eigenspaces of two random matrices are approximately orthogonal and approximately commute. Readers (should there be any) are referred to Speicher and Mingo(Mingo & Speicher, 2017) for more details. random matrices, then the S-transform of the product random matrix ensemble is the product of their S-transforms :
This property then enables us to write in full for the self-adjoint Jacobian , and under an assumption of homogeneity across layers, that
This gives us a strategy of computing the individual S-transforms of both activations and initializations, multiplying them together, and finally, inverting the result. However, attempting to write closed form S-transforms for all activation functions used in deep learning generally gives rise to intractable complex integrals.
3.2 Static isometry : Nonlinearity M-Transform
First, for any nonlinearity , we have,
The integral over the Gaussian measure reflects a sum over all the activations in a layer, since in the large limit the empirical distribution of activations converges to a Gaussian with limit variance
via central limit theorem. This complex parameter integral is in general intractable, but can be developed in power series involving
Based on this, Gaussian moments of the gradient’ed activation function are crucial controls for the mathematical properties of the activation function in question.
4 Static activation normalization
4.1 Marcenko-Pastur static isometry
Many of the Gaussian integrals in the above are de facto intractable for numerous activation functions, even after approximations. However our main goal is to design activation functions as close as possible to isometry, static or dynamical. For that, we introduce two measures of bias and variance shift respectively for a given activation , that are
and we want for no bias shift (the gradient of the activation function, possibly dilated, is a zero Gaussian mean function). For the variance shift we want the second-order condition
where clearly is desirable.
Under those two conditions, Pennington (Pennington & Worah, 2017) shows that the Stieltjes transform verifies the quadratic
Solving and inverting it shows the eigenvalues distribution of , and are the same: they both follow a Marcenko-Pastur distribution. Intuitively, this comes from the fact is assumed to be a Gaussian datapoint vector, so that is a Wishart random matrix, and the aforementioned conditions ensure that static spectral isometry is conserved. Numerical evidence also shows that this result holds too independently of the Gaussian character of . This is a very desirable property termed static isometry : when it holds, pairwise, fully-connected layers of a neural network maintain a constant (in modulus) eigenvalue distribution. In particular, this means that all higher moments of the data distribution, and its conditioning, get preserved at the beginning of training.
The probability density of the Marcenko-Pastur distribution, depicted figure 1, with parameter is given by :
is the rectangularity ratio of neural network layers, which we pick square so that it is equal to 1. For all that follows, we will also further assume that we are in the limit of large square neural network layers, and that we normalize and to both be equal to ; this can be achieved by the choice of weight initialization for the network, and by accordingly pre-processing the dataset, respectively.
4.2 Hermite polynomial basis expansion
We can now hope to intervene on desirable properties of an activation function via its Gaussian moments. Those in turn come into play in the Hermite polynomial series expansion of the activation function, as Hermite polynomials are orthogonal with respect to the standard Gaussian measure, as follows:
where the Hermite orthogonal polynomials are defined by
Trivially and , so that . Further, .
4.3 Static normalization : Principle
Most activation functions do not satisfy these two conditions by default. However, if we make the simplifying assumption that (that is, the network is initialized close to critical variance of the weights, and inputs have been suitably normalized too), then it is not hard to see that we can normalize any activation function statically by the transform below, scaling and removing the affine part:
which will roughly enforce Marcenko-Pastur isometry, at least at the beginning of the training.
This means that we are looking for activation functions whose Hermite expansion only includes terms for polynomials of order and above - so that is in the vector space . Zeroing out Hermite coefficients of order 0 and 1 is projecting onto . Further, we also project on a sphere by enforcing by scaling with .
In practice, this means that we can simply precompute the Gaussian integrals above as scalars , and then replace any activation function by its Hermite normalization
Here we also note that
which means that this transformation represents fixed centering and scaling of the backpropagation gradients passing through the activations. In this sense, it can be interpreted as a static version of batch normalization that is not backpropagated through (hence runs significantly faster).
4.4 Normalization coefficients for various activation functions
Numerous activation functions are used in deep learning. If the ReLU unit is by now the most common, many others, such as the softplus , exponential linear units, Gaussian exponential linear units, the swish (Elfwing et al., 2017) activation function, the sigmoid activation function, the hyperbolic tangent, and so on and so forth, are also present. We can apply static activation normalization (equation 18) to all of these and compute the , and coefficients. The numerical results are presented in the table below.
|x * Tanh(x)||0.605706||0||0.625308||0.749437||1.01452|
for static activation normalization. Also included are the coefficients (higher order moments) related to the Gaussian kurtosis calculation.
4.5 Tilted ReLU
In what follows we focus exclusively on the static normalization of the ReLU function, which gives a new activation where the associated constants can be derived in closed form. We get to a multiplicative constant factor (chosen to make its Lipschitz constant 1):
In all our testing, this unit seems to yield the best results (excluding the normalized GELU which unfortunately yielded prohibitive computation times). Crucially, it is convex, just like the original ReLU unit. We thereafter refer to this unit as the tilted ReLU, depicted below.
Interestingly, and while our process applies to any unit, the tilted ReLU can be seen as a special case of concatenated rectified linear units, which have previously shown to improve training (Shang et al., 2016). We now proceed to validate its use empirically.
5 Experimental results
Our experimental setup consists in training multi-layer perceptrons (feed-forward neural networks) on the CIFAR10 image classification dataset. The 32x32x3 image tensors are flattened into vectors of 3072 elements, and then fed as inputs to multi-layer perceptrons, consisting of a variable depth number of square layers, either 1024 or 128 units wide, and a final 10-way softmax layer. In general we compare 30-layer deep networks with networks of 60 layers or more. Batch normalization is not used by default, unless explicitly stated. Weight are initialized orthogonally(Saxe et al., 2013)2016)
and Keras for our code, and run on a machine equipped with NVIDIA GTX1080 GPUs. We report on test-time accuracies, anytime performance, and robustness of convergence.
5.2.1 Convergence: trainability in layers
We progressively increase the number of layers in our feedforward network until the training process fails. In this setup, the tilted ReLU, even without batch normalization, provides convergence all the way to 60 layers, where all things considered the same setup with ReLUs stops converging after depth reaches circa 30 layers. This is of note for a simple affine tilting transformation, and provides simple validation for an involved theory.
5.2.2 Convergence: speed, terminal accuracy, and impact of bias shift
Comparison of convergence for a 30 layer, 1024 width, feedforward network on CIFAR10. Test-time accuracy in percentage is plotted as a function of the number of training epochs. The ReLU unit barely converges, whereas the tilted ReLU (middle unit) brings accuracy and convergence speed in line with the hyperbolic tangent unit.
Figure 4 shows our tilted ReLU unit, in the middle, shows terminal accuracy in line with the best performing unit for feedforward networks with orthogonal initialization, the hyperbolic tangent (rightmost). Furthermore, it gets there extremely fast, after about 25 epochs of training, in stark contrast with the standard ReLU unit which across many runs shows difficulty converging.
Since tilting other activations doesn’t seem to provide the same performance gains than on ReLUs, we question whether it’s the tilting part or simply removing the shift part via substracting a constant (the term in equation 19) that is responsible for improving the trainability of the network. We verify (see Figure 5) that moving the tilted ReLU from to still maintains convergence, showing that the tilting is the essential part of the improved convergence.
5.3 Inspection of pre-activation eigenvalues
In order to shed light on what is happening inside the networks, we decide to diagonalize numerically, so as to look at the squared magnitude of the eigenvalues at each layer, . For convenience, we sort these by decreasing value.
Layer-wise, it appears that the spread of squared eigenvalue spectra decreases as we progress deeper into the layer. These spectra progressively flatten to converge to something close to the uniform 1 function.
This has important potential implications for neural network training. It appears that in the orthogonal initialization case, the vast majority of squared eigenvalues is 1 or close to 1 (in magnitude) in the deepest layers of the network (see figure 6
). This is irrespective of the activation function used. In other words, there is a large subspace of data where the network layers are simply acting as a rotation operator. This could inform future work on optimization methods. At the same time, and in a manner compatible with building hierarchical features and representations of the input data, the right tail of the distribution tapers off to zero, so as to forget information. The left tail is most interesting as it looks skewed - seemingly, the exploding gradients problem is of more practical importance than the vanishing gradients one in our application.
Such a flat spectral distribution also provides a nice ex-post interpretation for residual networks
. These force the network training process to look for solutions near the identity matrix with a full flat spectrum. This way, the optimization process focuses on looking for distinctive features (the aforementioned tails). In light of our experimental results, this is particularly important deeper in the network.
In line with intuition, we also verify experimentally that batch normalization generates an eigenvalue spectrum even closer to the idealized Dirac distribution at one, with less vertical spread (but coming at computational expense).
6 Conclusion and further work
We have defined static activation normalization as a projection of activation functions used in deep learning on a Hermite polynomial space. Empirically we observe that simply normalizing the ReLU unit in such a way enables us to train much deeper networks, and promotes robustness and convergence speed of the training process, making a first step towards ’pushing batch normalization within the activation’. Further work will focus on a better understanding of how to mix all the different competing constraints that come into play in activation design, possibly combining both static and dynamic isometry. Hopefully, that work will enable pro-active activation scoring
from such properties as their Gaussian moments. Moreover, the static isometry view also indicates that significant meaning could be found in the large number of pre-activation eigenvalues with moduli 1. Either constraining the optimization process to operate block-wise within the space of rotation matrices, or using random unitary matrix theory to give further predictions for the phase of eigenvalues and output vectors, could be fruitful research directions in the future.
The author wants to thank Bilal Piot, Mo Azar, Tiago Pereira, and Pratik Chaudhari for helpful discussions and comments.
Abadi et al. (2016)
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro,
G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat,
I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,
R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane,
R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke,
V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg,
M. Wicke, Y. Yu, and X. Zheng.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.arXiv e-prints, March 2016.
- Clevert et al. (2015) D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv e-prints, November 2015.
Elfwing et al. (2017)
S. Elfwing, E. Uchibe, and K. Doya.
Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning.arXiv e-prints, February 2017.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
He et al. (2015a)
K. He, X. Zhang, S. Ren, and J. Sun.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.arXiv e-prints, February 2015a.
- He et al. (2015b) K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv e-prints, December 2015b.
- Hendrycks & Gimpel (2016) D. Hendrycks and K. Gimpel. Gaussian Error Linear Units (GELUs). arXiv e-prints, June 2016.
- Ioffe & Szegedy (2015) S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv e-prints, February 2015.
- Klambauer et al. (2017) G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. arXiv e-prints, June 2017.
- Louart et al. (2017) C. Louart, Z. Liao, and R. Couillet. A Random Matrix Approach to Neural Networks. arXiv e-prints, February 2017.
- Mingo & Speicher (2017) James A. Mingo and Roland Speicher. Free probability and random matrices, volume 35. Springer, 2017.
Nair & Hinton (2010)
Vinod Nair and Geoffrey E. Hinton.
Rectified linear units improve restricted boltzmann machines.In ICML, 2010.
- Pennington et al. (2017) J. Pennington, S. S. Schoenholz, and S. Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. arXiv e-prints, November 2017.
- Pennington et al. (2018) J. Pennington, S. S. Schoenholz, and S. Ganguli. The Emergence of Spectral Universality in Deep Networks. arXiv e-prints, February 2018.
- Pennington & Bahri (2017) Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In ICML, 2017.
- Pennington & Worah (2017) Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In NIPS, 2017.
- Poole et al. (2016) B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. arXiv e-prints, June 2016.
- Salimans & Kingma (2016) T. Salimans and D. P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. arXiv e-prints, February 2016.
- Saxe et al. (2013) A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv e-prints, December 2013.
- Schoenholz et al. (2016) S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep Information Propagation. arXiv e-prints, November 2016.
Shang et al. (2016)
W. Shang, K. Sohn, D. Almeida, and H. Lee.
Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units.arXiv e-prints, March 2016.
- Sompolinsky et al. (1988) Sompolinsky, A Crisanti, and Sommers. Chaos in random neural networks. Physical review letters, 61 3:259–262, 1988.
- Wu & He (2018) Y. Wu and K. He. Group Normalization. arXiv e-prints, March 2018.