1 Introduction
Deep neural networks have become extremely popular as they achieve state-of-the-art performance on a variety of important applications including language processing and computer vision; see, e.g., LeCun et al. (1998). The success of these models has motivated the use of increasingly deep networks and stimulated a large body of work to understand their theoretical properties. It is impossible to provide here a comprehensive summary of the large number of contributions within this field. To cite a few results relevant to our contributions, Montufar et al. (2014) have shown that neural networks have exponential expressive power with respect to the depth while Poole et al. (2016) obtained similar results using a topological measure of expressiveness.

We follow here the approach of Poole et al. (2016) and Schoenholz et al. (2017) by investigating the behaviour of random networks in the infinite-width and finite-variance i.i.d. weights context where they can be approximated by a Gaussian process as established by Matthews et al. (2018) and Lee et al. (2018).

In this paper, our contribution is twofold. Firstly, we provide an analysis complementing the results of Poole et al. (2016) and Schoenholz et al. (2017) and show that initializing a network with a specific choice of hyperparameters known as the ‘edge of chaos’ is linked to a deeper propagation of the information through the network. In particular, we establish that for a class of ReLU-like activation functions, the exponential depth scale introduced in Schoenholz et al. (2017) is replaced by a polynomial depth scale. This implies that the information can propagate deeper when the network is initialized on the edge of chaos. Secondly, we outline the limitations of ReLU-like activation functions by showing that, even on the edge of chaos, the limiting Gaussian process admits a degenerate kernel as the number of layers goes to infinity. Our main result (Proposition 4) gives sufficient conditions for activation functions to allow a good ‘information flow’ through the network (in addition to being non-polynomial and not suffering from the exploding/vanishing gradient problem). These conditions are satisfied by the Swish activation used in Hendrycks & Gimpel (2016), Elfwing et al. (2017) and Ramachandran et al. (2017). In recent work, Ramachandran et al. (2017) used automated search techniques to identify new activation functions and found experimentally that functions of the form $\phi(x) = x \cdot \mathrm{sigmoid}(x)$ indeed appear to perform better than many alternative functions, including ReLU. Our paper provides a theoretical grounding for these results. We also complement previous empirical results by illustrating the benefits of an initialization on the edge of chaos in this context. All proofs are given in the Supplementary Material.
2 On Gaussian process approximations of neural networks and their stability
2.1 Setup and notations
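We use similar notations to those of Poole et al. (2016) and Lee et al. (2018). Consider a fully connected random neural network of depth $L$, widths $(N_l)_{1 \le l \le L}$, weights $W^l_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2 / N_{l-1})$ and bias $B^l_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution of mean $\mu$ and variance $\sigma^2$. For some input $a \in \mathbb{R}^d$, the propagation of this input through the network is given for an activation function $\phi : \mathbb{R} \to \mathbb{R}$ by

$$y^1_i(a) = \sum_{j=1}^{d} W^1_{ij} a_j + B^1_i, \qquad y^l_i(a) = \sum_{j=1}^{N_{l-1}} W^l_{ij}\, \phi\big(y^{l-1}_j(a)\big) + B^l_i \quad \text{for } l \ge 2. \tag{1}$$

Throughout the paper we assume that for all $l$ the processes $y^l_i(\cdot)$ are independent (across $i$) centred Gaussian processes with covariance kernels $\kappa^l$ and write accordingly $y^l_i \overset{ind}{\sim} \mathcal{GP}(0, \kappa^l)$. This is an idealized version of the true processes corresponding to choosing $N_{l-1} = +\infty$ (which implies, using the Central Limit Theorem, that $y^l_i(a)$ is a Gaussian variable for any input $a$). The approximation of $y^l_i(\cdot)$ by a Gaussian process was first proposed by Neal (1995) in the single layer case and has been recently extended to the multiple layer case by Lee et al. (2018) and Matthews et al. (2018). We recall here the expressions of the limiting Gaussian process kernels. For any input $a \in \mathbb{R}^d$, $\mathbb{E}[y^l_i(a)] = 0$, so that for any inputs $a, b \in \mathbb{R}^d$

$$\kappa^l(a, b) = \mathbb{E}\big[y^l_i(a)\, y^l_i(b)\big] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(y^{l-1}_i(a))\, \phi(y^{l-1}_i(b))\big] = \sigma_b^2 + \sigma_w^2\, \Psi_\phi\big(\kappa^{l-1}(a, a), \kappa^{l-1}(a, b), \kappa^{l-1}(b, b)\big),$$

where $\Psi_\phi$ is a function that depends only on $\phi$. This gives a recursion to calculate the kernel $\kappa^l$; see, e.g., Lee et al. (2018) for more details. We can also express the kernel $\kappa^l$ in terms of the correlation $c^l_{ab}$ in the $l$-th layer used in the rest of this paper

$$q^l_{ab} := \kappa^l(a, b) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\Big[\phi\big(\sqrt{q^{l-1}_a}\, Z_1\big)\, \phi\Big(\sqrt{q^{l-1}_b}\,\big(c^{l-1}_{ab} Z_1 + \sqrt{1 - (c^{l-1}_{ab})^2}\, Z_2\big)\Big)\Big],$$

where $q^{l-1}_a := q^{l-1}_{aa}$, resp. $c^{l-1}_{ab} := q^{l-1}_{ab} / \sqrt{q^{l-1}_a q^{l-1}_b}$, is the variance, resp. correlation, in the $(l-1)$-th layer and $Z_1$, $Z_2$ are independent standard Gaussian random variables. The variance $q^l_a$ of an input $a$ is updated through the layers, as the input propagates through the network, by the recursive formula $q^l_a = F(q^{l-1}_a)$, where $F$ is the ‘variance function’ given by

$$F(x) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{x}\, Z)^2\big], \qquad Z \sim \mathcal{N}(0, 1). \tag{2}$$

Throughout the paper, $Z, Z_1, Z_2$ will always denote independent standard Gaussian variables.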
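As a rough numerical illustration of the variance recursion, the sketch below iterates a variance function of the form $F(x) = \sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(\sqrt{x} Z)^2]$, which is the form used by Poole et al. (2016) and assumed here. The Gaussian expectation is approximated by Gauss-Hermite quadrature; the helper names (`gaussian_mean`, `variance_function`, `propagate_variance`) and the parameter values in the example call are illustrative choices, not part of the paper.

```python
import numpy as np

# Gauss-Hermite nodes and weights for the weight function exp(-z^2 / 2).
_Z, _W = np.polynomial.hermite_e.hermegauss(100)
_W = _W / _W.sum()  # normalise so the weights define an expectation under N(0, 1)


def gaussian_mean(g):
    """Approximate E[g(Z)] for Z ~ N(0, 1) by Gauss-Hermite quadrature."""
    return float(np.sum(_W * g(_Z)))


def variance_function(x, phi, sigma_b, sigma_w):
    """Assumed form: F(x) = sigma_b^2 + sigma_w^2 * E[phi(sqrt(x) Z)^2]."""
    return sigma_b**2 + sigma_w**2 * gaussian_mean(lambda z: phi(np.sqrt(x) * z) ** 2)


def propagate_variance(q1, phi, sigma_b, sigma_w, depth):
    """Return [q^1, ..., q^depth] obtained from q^l = F(q^{l-1})."""
    qs = [q1]
    for _ in range(depth - 1):
        qs.append(variance_function(qs[-1], phi, sigma_b, sigma_w))
    return qs


relu = lambda x: np.maximum(x, 0.0)

# Hypothetical parameters, chosen only to show the recursion settling down.
print(propagate_variance(1.0, np.tanh, sigma_b=0.5, sigma_w=1.0, depth=10))
```

Quadrature is used instead of Monte Carlo so that repeated calls are deterministic; for activations with a kink such as ReLU the approximation of other expectations may be less accurate than for smooth activations.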
2.2 Limiting behaviour of the variance and covariance operators
We analyze here the limiting behaviour of $q^L_a$ and $c^L_{ab}$ as the network depth $L$ goes to infinity under the assumption that $\phi$ has a second derivative at least in the distribution sense (ReLU admits a Dirac mass in 0 as second derivative and so is covered by our developments). From now onwards, we will also assume without loss of generality that $c^1_{ab} \ge 0$ (similar results can be obtained straightforwardly when $c^1_{ab} \le 0$). We first need to define the Domains of Convergence associated with an activation function $\phi$.
Definition 1.
Let be an activation function, .
(i) is in (domain of convergence for the variance) if there exists , such that for any input with , . We denote by the maximal satisfying this condition.
(ii) is in (domain of convergence for the correlation) if there exists such that for any two inputs with , . We denote by the maximal satisfying this condition.
Remark : Typically, in Definition 1 is a fixed point of the variance function defined in equation 2. Therefore, it is easy to see that for any such that is increasing and admits at least one fixed point, we have where is the minimal fixed point; i.e. . Thus, if we rescale the input data to have , the variance converges to . We can also rescale the variance of the first layer (only) to assume that for all inputs .
The next result gives sufficient conditions on to be in the domains of convergence of .
Proposition 1.
Let . Assume , then for and any , we have and
Let . Assume for some , then for and any , we have and .
The proof of Proposition 1 is straightforward. We prove that and then apply the Banach fixed point theorem; similar ideas are used for .
Example : For ReLU activation function, we have and for any .
In the domain of convergence , for all , almost surely and the outputs of the network are constant functions. Figure 1 illustrates this behaviour for for ReLU and Tanh using a network of depth with neurons per layer. The draws of outputs of these networks are indeed almost constant.
To refine this convergence analysis, Schoenholz et al. (2017) established the existence of and such that and when fixed points exist. The quantities and are called ‘depth scales’ since they represent the depth to which the variance and correlation can propagate without being exponentially close to their limits. More precisely, if we write and then the depth scales are given by and . The equation corresponds to an infinite depth scale of the correlation. It is called the edge of chaos as it separates two phases: an ordered phase where the correlation converges to 1 if and a chaotic phase where and the correlations do not converge to 1. In this chaotic regime, it has been observed in Schoenholz et al. (2017) that the correlations converge to some random value when and that is independent of the correlation between the inputs. This means that very close inputs (in terms of correlation) lead to very different outputs. Therefore, in the chaotic phase, the output function of the neural network is noncontinuous everywhere.
Definition 2.
For $(\sigma_b, \sigma_w) \in D_{\phi,var}$, let $q$ be the limiting variance (it is a function of $(\sigma_b, \sigma_w)$ but we do not emphasize it notationally). The Edge of Chaos, hereafter EOC, is the set of values of $(\sigma_b, \sigma_w)$ satisfying $\chi_1 = \sigma_w^2\, \mathbb{E}[\phi'(\sqrt{q} Z)^2] = 1$.
To further study the EOC regime, the next lemma introduces a function $f$ called the ‘correlation function’ simplifying the analysis of the correlations. It states that the correlations have the same asymptotic behaviour as the time-homogeneous dynamical system $c^{l+1}_{ab} = f(c^l_{ab})$.
Lemma 1.
Let such that , and an activation function such that for all compact sets . Define by and by . Then .
The condition on $\phi$ in Lemma 1 is violated only by activation functions with exponential growth (which are not used in practice), so from now onwards, we use this approximation in our analysis. Note that being on the EOC is equivalent to $(\sigma_b, \sigma_w)$ satisfying $f'(1) = 1$. In the next section, we analyze this phase transition carefully for a large class of activation functions.
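To make the EOC criterion concrete, the following sketch evaluates $\chi_1 = \sigma_w^2 \mathbb{E}[\phi'(\sqrt{q} Z)^2]$ at the limiting variance $q$. This expression for $\chi_1$ is taken from Schoenholz et al. (2017) and is an assumption here, as are the helper names and the Tanh parameters, which are hypothetical. A value below 1 indicates the ordered phase, a value above 1 the chaotic phase, and 1 the EOC.

```python
import numpy as np

# Gauss-Hermite nodes and weights for expectations under N(0, 1).
_Z, _W = np.polynomial.hermite_e.hermegauss(100)
_W = _W / _W.sum()


def gaussian_mean(g):
    """Approximate E[g(Z)] for Z ~ N(0, 1)."""
    return float(np.sum(_W * g(_Z)))


def limiting_variance(phi, sigma_b, sigma_w, q0=1.0, n_iter=500):
    """Iterate q <- sigma_b^2 + sigma_w^2 E[phi(sqrt(q) Z)^2] (assumes the iteration converges)."""
    q = q0
    for _ in range(n_iter):
        q = sigma_b**2 + sigma_w**2 * gaussian_mean(lambda z: phi(np.sqrt(q) * z) ** 2)
    return q


def chi_1(phi_prime, q, sigma_w):
    """Assumed definition: chi_1 = sigma_w^2 * E[phi'(sqrt(q) Z)^2]."""
    return sigma_w**2 * gaussian_mean(lambda z: phi_prime(np.sqrt(q) * z) ** 2)


tanh_prime = lambda x: 1.0 - np.tanh(x) ** 2
relu_prime = lambda x: (x > 0).astype(float)

# Tanh with hypothetical parameters (0.5, 1.0): chi_1 < 1, i.e. ordered phase.
q_tanh = limiting_variance(np.tanh, 0.5, 1.0)
print("tanh chi_1:", chi_1(tanh_prime, q_tanh, 1.0))

# ReLU with (sigma_b, sigma_w) = (0, sqrt(2)): chi_1 does not depend on q.
print("relu chi_1:", chi_1(relu_prime, 1.0, np.sqrt(2.0)))
```

For a given $\sigma_b$, one could search over $\sigma_w$ (for example by bisection) for the value at which `chi_1` crosses 1; that search is left out to keep the sketch short.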
3 Edge of Chaos
To illustrate the effect of the initialization on the EOC, we plot in Figure 2 the output of a ReLU neural network with 20 layers and 100 neurons per layer with parameters $(\sigma_b, \sigma_w) = (0, \sqrt{2})$ (as we will see later, this is the EOC for ReLU). Unlike the output in Figure 1, this output displays much more variability. However, we will prove here that the correlations still converge to 1 even in the EOC regime, albeit at a slower rate.
3.1 ReLU-like activation functions
We consider activation functions of the form: $\phi(x) = \lambda x$ if $x > 0$ and $\phi(x) = \beta x$ if $x \le 0$. ReLU corresponds to $\lambda = 1$ and $\beta = 0$. For this class of activation functions, we see (Proposition 2) that the variance is unchanged ($q^l_a = q^1_a$) on the EOC, so that $q$ does not formally exist in the sense that the limit of $q^l_a$ depends on $a$. However, this does not impact the analysis of the correlations.
Proposition 2.
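Let $\phi$ be a ReLU-like function with $\lambda$ and $\beta$ defined above. Then for any $\sigma_w < \sqrt{2 / (\lambda^2 + \beta^2)}$ and $\sigma_b \ge 0$, we have $(\sigma_b, \sigma_w) \in D_{\phi,var} \cap D_{\phi,corr}$ with $K_{\phi,var}(\sigma_b, \sigma_w) = K_{\phi,corr}(\sigma_b, \sigma_w) = \infty$. Moreover $EOC = \{(0, \sqrt{2 / (\lambda^2 + \beta^2)})\}$ and, on the EOC, $F(x) = x$ for any $x \ge 0$.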
This class of activation functions has the interesting property of preserving the variance across layers when the network is initialized on the EOC. However, we show in Proposition 3 below that, even in the EOC regime, the correlations converge to 1 but at a slower rate. We only present the result for ReLU but the generalization to the whole class is straightforward.
Example (ReLU): The EOC is reduced to the singleton $\{(0, \sqrt{2})\}$, which means we should initialize a ReLU network with the parameters $(\sigma_b, \sigma_w) = (0, \sqrt{2})$. This result coincides with the recommendation of He et al. (2015), whose objective was to make the variance constant as the input propagates but who did not analyze the propagation of the correlations. Klambauer et al. (2017) also performed a similar analysis by using the "Scaled Exponential Linear Unit" activation (SELU), which makes it possible to center the mean and normalize the variance of the post-activation $\phi(y)$. The propagation of the correlation was not discussed therein either. In the next result, we present the correlation function corresponding to ReLU networks. This was first obtained in Cho & Saul (2009). We present an alternative derivation of this result and further show that the correlations converge to 1 at a polynomial rate of $1/l^2$ instead of an exponential rate.
Proposition 3 (ReLU kernel).
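Consider a ReLU network with parameters $(\sigma_b, \sigma_w) = (0, \sqrt{2})$ on the EOC. We have
(i) for $x \in [0, 1]$, $f(x) = \frac{1}{\pi} x \arcsin(x) + \frac{1}{\pi} \sqrt{1 - x^2} + \frac{1}{2} x$,
(ii) for any $(a, b)$, $\lim_{l \to \infty} c^l_{ab} = 1$ and $1 - c^l_{ab} \sim \frac{9 \pi^2}{2 l^2}$ as $l \to \infty$.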
Figure 3 displays the correlation function with two different sets of parameters . The red graph corresponds to the EOC , and the blue one corresponds to an ordered phase . In unreported experiments, we observed that numerical convergence towards for on the EOC. As the variance is preserved by the network () and the correlations converge to 1 as increases, the output function is of the form for a constant (notice that in Figure 2, we start observing this effect for depth 20).
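The polynomial rate can be checked numerically with a short sketch. It assumes the closed form $f(x) = \frac{1}{\pi}(x \arcsin x + \sqrt{1 - x^2}) + \frac{x}{2}$ for the ReLU correlation function on the EOC (the degree-1 arc-cosine kernel of Cho & Saul (2009) with $\sigma_w^2 = 2$) and the asymptotic constant $9\pi^2/2$; the function names and the starting correlation are illustrative.

```python
import math


def relu_correlation_map(c):
    """Assumed closed form of the ReLU correlation function on the EOC."""
    return (c * math.asin(c) + math.sqrt(max(0.0, 1.0 - c * c))) / math.pi + 0.5 * c


def iterate_correlation(c0, depth):
    """Apply the correlation map `depth` times, starting from correlation c0."""
    c = c0
    for _ in range(depth):
        c = min(1.0, relu_correlation_map(c))  # guard against rounding above 1
    return c


# Compare l^2 * (1 - c^l) with the constant 9 * pi^2 / 2 for increasing depth l.
for depth in (10, 100, 1000, 10000):
    c = iterate_correlation(0.0, depth)
    print(depth, 1.0 - c, depth**2 * (1.0 - c))
print("9 * pi^2 / 2 =", 9.0 * math.pi**2 / 2.0)
```

The rescaled gap `depth**2 * (1 - c)` approaches the constant only slowly, which is consistent with a polynomial rather than exponential rate of convergence.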
3.2 A better class of activation functions
We now introduce a set of sufficient conditions for activation functions which ensures that it is then possible to tune to slow the convergence of the correlations to 1. This is achieved by making the correlation function sufficiently close to the identity function.
Proposition 4 (Main Result).
Let be an activation function. Assume that
(i) , and has right and left derivatives in zero and or , and there exists such that .
(ii) There exists such that for any , there exists such that .
(iii) For any , the function with parameters is nondecreasing and where is the minimal fixed point of , .
(iv) For any , the correlation function with parameters introduced in Lemma 1 is convex.
Then, for any , we have , and
Note that ReLU does not satisfy the condition since the EOC in this case is the singleton . The result of Proposition 4 states that we can make close to by considering . However, this is under condition which states that . Therefore, practically, we cannot take too small. One might wonder whether condition is necessary for this result to hold. The next lemma shows that removing this condition results in a useless class of activation functions.
Lemma 2.
The next proposition gives sufficient conditions for bounded activation functions to satisfy all the conditions of Proposition 4.
Proposition 5.
The conditions in Proposition 5 are easy to verify and are, for example, satisfied by Tanh and Arctan. We can also replace the assumption "$\phi$ satisfies (ii) in Proposition 4" by a sufficient condition (see Proposition 7 in the Supplementary Material). Tanh-like activation functions provide better information flow in deep networks compared to ReLU-like functions. However, these functions suffer from the vanishing gradient problem during backpropagation; see, e.g., Pascanu et al. (2013) and Kolen & Kremer (2001). Thus, an activation function that satisfies the conditions of Proposition 4 (in order to have a good ‘information flow’) and does not suffer from the vanishing gradient issue is expected to perform better than ReLU. Swish is a good candidate.
Proposition 6.
The Swish activation function satisfies all the conditions of Proposition 4.
It is clear that Swish does not suffer from the vanishing gradient problem as it has a gradient close to 1 for large inputs like ReLU. Figure 4(a) displays for Swish for different values of . We see that is indeed approaching the identity function when is small, preventing the correlations from converging to 1. Figure 4(b) displays a draw of the output of a neural network of depth 30 and width 100 with Swish activation, and . The output displays much more variability than that of the ReLU network with the same architecture. We present in Table 1 some values of on the EOC as well as the corresponding limiting variance for Swish. As condition of Proposition 4 is satisfied, the limiting variance decreases with .
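The correlation function of Swish can be evaluated numerically along the following lines. This is a sketch under stated assumptions: it takes $f(x) = \big(\sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(\sqrt{q} Z_1) \phi(\sqrt{q}(x Z_1 + \sqrt{1 - x^2} Z_2))]\big) / q$ as the correlation function, uses two-dimensional Gauss-Hermite quadrature, and picks hypothetical parameters $(\sigma_b, \sigma_w) = (0.2, 1.3)$ that are not claimed to lie on the EOC.

```python
import numpy as np

# One-dimensional Gauss-Hermite rule and its tensor product for (Z1, Z2).
Z, W = np.polynomial.hermite_e.hermegauss(60)
W = W / W.sum()
Z1, Z2 = np.meshgrid(Z, Z, indexing="ij")
W2 = np.outer(W, W)


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def limiting_variance(phi, sigma_b, sigma_w, q0=1.0, n_iter=500):
    """Fixed-point iteration for the variance (assumes it converges)."""
    q = q0
    for _ in range(n_iter):
        q = sigma_b**2 + sigma_w**2 * np.sum(W * phi(np.sqrt(q) * Z) ** 2)
    return float(q)


def correlation_function(x, phi, q, sigma_b, sigma_w):
    """Assumed form of f(x), evaluated at the limiting variance q."""
    u1 = np.sqrt(q) * Z1
    u2 = np.sqrt(q) * (x * Z1 + np.sqrt(max(0.0, 1.0 - x * x)) * Z2)
    return float((sigma_b**2 + sigma_w**2 * np.sum(W2 * phi(u1) * phi(u2))) / q)


sigma_b, sigma_w = 0.2, 1.3  # hypothetical values, for illustration only
q = limiting_variance(swish, sigma_b, sigma_w)
for x in np.linspace(0.0, 1.0, 6):
    print(round(float(x), 1), correlation_function(float(x), swish, q, sigma_b, sigma_w))
```

Plotting these values against the identity for several choices of $\sigma_b$ gives a picture of the kind described for Figure 4(a), with the caveat that the parameters used there are not reproduced here.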
Other activation functions that have been shown to empirically outperform ReLU, such as ELU (Clevert et al. (2016)), SELU (Klambauer et al. (2017)) and Softplus, also satisfy the conditions of Proposition 4 (see appendix for ELU). The comparison of activation functions satisfying the conditions of Proposition 4 remains an open question.
4 Experimental Results
We demonstrate empirically our results on the MNIST dataset. In all the figures below, we compare the learning speed (test accuracy with respect to the number of epochs/iterations) for different activation functions and initialization parameters. We use the Adam optimizer with learning rate
. The Python code to reproduce all the experiments will be made available online.

Initialization on the Edge of Chaos. We randomly initialize the deep network by sampling $W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_{l-1})$ and $B^l_i \sim \mathcal{N}(0, \sigma_b^2)$. In Figure 5, we compare the learning speed of a Swish network for different choices of random initialization. As the depth increases, any initialization other than on the edge of chaos results in the optimization algorithm eventually being stuck at a very poor test accuracy of about 10% (equivalent to selecting the output uniformly at random). To understand what is happening in this case, let us recall how the optimization algorithm works. Let $(X_i, Y_i)_{1 \le i \le N}$ be the MNIST dataset. The loss we optimize is given by $\mathcal{L}(w, b) = \frac{1}{N} \sum_{i=1}^{N} \ell(y^L(X_i), Y_i)$, where $y^L(x)$ is the output of the network and $\ell$ is the categorical cross-entropy loss. In the ordered phase, we know that the output converges exponentially to a fixed value (the same value for all $X_i$). A small change in $w$ and $b$ therefore does not significantly change the value of the loss function, so the gradient is approximately zero and the gradient descent algorithm is stuck around the initial value.
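The random initialization described above can be sketched in NumPy as follows. The code assumes the sampling scheme $W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_{l-1})$, $B^l_i \sim \mathcal{N}(0, \sigma_b^2)$ and the forward recursion of equation (1); the function names, the layer sizes and the random seed are illustrative and do not come from the authors' code.

```python
import numpy as np


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def init_params(widths, sigma_b, sigma_w, rng):
    """Sample (W, b) per layer with W_ij ~ N(0, sigma_w^2 / n_in), b_i ~ N(0, sigma_b^2)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params


def forward(params, a, phi):
    """Pre-activations of the last layer: y^1 = W^1 a + B^1, y^l = W^l phi(y^{l-1}) + B^l."""
    W, b = params[0]
    y = W @ a + b
    for W, b in params[1:]:
        y = W @ phi(y) + b
    return y


rng = np.random.default_rng(0)
# Illustrative architecture: 784 inputs, 20 hidden layers of width 100, 10 outputs.
widths = [784] + [100] * 20 + [10]
# (sigma_b, sigma_w) = (0, sqrt(2)) is the ReLU EOC; for Swish it is only a placeholder.
params = init_params(widths, sigma_b=0.0, sigma_w=np.sqrt(2.0), rng=rng)
print(forward(params, rng.normal(size=784), swish))
```

Note that `rng.normal` takes a standard deviation, not a variance, hence the `sigma_w / np.sqrt(n_in)` scaling.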
ReLU versus Tanh. We proved in Section 3.2 that the Tanh activation guarantees better information propagation through the network when initialized on the EOC. However, Tanh suffers from the vanishing gradient problem. Consequently, we expect Tanh to perform better than ReLU for shallow networks, where the vanishing gradient problem is not encountered, but not for deep networks. Numerical results confirm this fact. Figure 6 shows curves of validation accuracy with 90% confidence intervals (30 simulations). For depth 5, the learning algorithm converges faster for Tanh compared to ReLU. However, for deeper networks, Tanh is stuck at a very low test accuracy. This is due to the fact that many parameters remain essentially unchanged because the gradient is very small.

ReLU versus Swish. As established in Section 3.2, Swish, like Tanh, propagates the information better than ReLU and, contrary to Tanh, it does not suffer from the vanishing gradient problem. Hence our results suggest that Swish should perform better than ReLU, especially for deep architectures. Numerical results confirm this fact. Figure 7 shows curves of validation accuracy with 90% confidence intervals (30 simulations). Swish performs clearly better than ReLU, especially for depth 40. A comparative study of final accuracy is shown in Table 2. We observe a clear advantage for Swish, especially for large depths. Additional simulation results on diverse datasets demonstrating better performance of Swish over many other activation functions can be found in Ramachandran et al. (2017) (notice that these authors have already implemented Swish in TensorFlow).
Activation   (10,5)   (20,10)   (40,30)   (60,40)
ReLU         94.01    96.01     96.51     91.45
Swish        94.46    96.34     97.09     97.14
5 Conclusion and Discussion
We have complemented here the analysis of Schoenholz et al. (2017) which shows that initializing networks on the EOC provides a better propagation of information across layers. In the ReLU case, such an initialization corresponds to the popular approach proposed in He et al. (2015). However, even on the EOC, the correlations still converge to 1 at a polynomial rate for ReLU networks. We have obtained a set of sufficient conditions for activation functions which further improve information propagation when the parameters $(\sigma_b, \sigma_w)$ are on the EOC. The Tanh activation satisfies those conditions but, more interestingly, other functions which do not suffer from the vanishing/exploding gradient problems also verify them. This includes the Swish function used in Hendrycks & Gimpel (2016) and Elfwing et al. (2017) and promoted in Ramachandran et al. (2017), but also ELU (Clevert et al. (2016)).

Our results also have interesting implications for Bayesian neural networks, which have received renewed attention lately; see, e.g., Hernandez-Lobato & Adams (2015) and Lee et al. (2018). They show that if one assigns i.i.d. Gaussian prior distributions to the weights and biases, the resulting prior distribution will be concentrated on close to constant functions even on the EOC for ReLU-like activation functions. To obtain much richer priors, our results indicate that we need to select not only parameters on the EOC but also an activation function satisfying Proposition 4.
References

Cho & Saul (2009) Y. Cho and L.K. Saul. Kernel methods for deep learning. Advances in Neural Information Processing Systems, 22:342–350, 2009.
Clevert et al. (2016) D.A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). International Conference on Learning Representations, 2016.
Elfwing et al. (2017) S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv:1702.03118, 2017.
He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 2015.
Hendrycks & Gimpel (2016) D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. arXiv:1606.08415, 2016.
Hernandez-Lobato & Adams (2015) J. M. Hernandez-Lobato and R.P. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. ICML, 2015.
Klambauer et al. (2017) G. Klambauer, T. Unterthiner, and A. Mayr. Self-normalizing neural networks. Advances in Neural Information Processing Systems, 30, 2017.
Kolen & Kremer (2001) J.F. Kolen and S.C. Kremer. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Networks, Wiley-IEEE Press, pp. 464–, 2001.
LeCun et al. (1998) Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. Neural Networks: Tricks of the Trade, Springer, 1998.
Lee et al. (2018) J. Lee, Y. Bahri, R. Novak, S.S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as Gaussian processes. 6th International Conference on Learning Representations, 2018.
Matthews et al. (2018) A.G. Matthews, J. Hron, M. Rowland, R.E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. 6th International Conference on Learning Representations, 2018.
Montufar et al. (2014) G.F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 27:2924–2932, 2014.
Neal (1995) R.M. Neal. Bayesian learning for neural networks. Springer Science & Business Media, 118, 1995.
Pascanu et al. (2013) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning, 28:1310–1318, 2013.
Poole et al. (2016) B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. 30th Conference on Neural Information Processing Systems, 2016.
Ramachandran et al. (2017) P. Ramachandran, B. Zoph, and Q.V. Le. Searching for activation functions. arXiv e-print 1710.05941, 2017.
Schoenholz et al. (2017) S.S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. 5th International Conference on Learning Representations, 2017.
Appendix A Proofs
We provide in the supplementary material the proofs of the propositions presented in the main document, and we give additional theoretical and experimental results. For the sake of clarity we recall the propositions before giving their proofs.
A.1 Convergence to the fixed point: Proposition 1
Proposition 1.
Let . Suppose , then for and any , we have and
Moreover, let . Suppose for some positive , then for and any , we have and .
Proof.
To abbreviate the notation, we use $q^l := q^l_a$ for some fixed input $a$.
Convergence of the variances: We first consider the asymptotic behaviour of . Recall that where,
The first derivative of this function is given by:
(3) 
where we used Gaussian integration by parts , an identity satisfied by any function such that .
Using the condition on , we see that for , the function is a contraction mapping, and the Banach fixedpoint theorem guarantees the existence of a unique fixed point of , with . Note that this fixed point depends only on , therefore, this is true for any input , and .
Convergence of the covariances: Since , then for all there exists such that, for all , . Let , using Gaussian integration by parts, we have
We cannot use the Banach fixed point theorem directly because the integrated function here depends on through . For ease of notation, we write , we have
Therefore, for , is a Cauchy sequence and it converges to a limit . At the limit
The derivative of this function is given by
By assumption on and the choice of , we have , so that is a contraction, and has a unique fixed point. Since , . The above result is true for any , therefore, . ∎
As an illustration we plot in Figure 12 the variance for three different inputs with , as a function of the layer . In this example, the convergence for Tanh is faster than that of ReLU.
Lemma 1.
Let such that , and an activation function such that for all compact sets . Define by and by . Then .
Proof.
For , we have
where . The first term goes to zero uniformly in using the condition on and the Cauchy-Schwarz inequality. As for the second term, it can be written as
again, using Cauchy-Schwarz and the condition on , both terms can be controlled uniformly in by an integrable upper bound. We conclude using the dominated convergence theorem. ∎
A.2 Results for ReLU-like activation functions: proof of Propositions 2 and 3
Proposition 2.
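Let $\phi$ be a ReLU-like function with $\lambda$ and $\beta$ defined above. Then for any $\sigma_w < \sqrt{2 / (\lambda^2 + \beta^2)}$ and $\sigma_b \ge 0$, we have $(\sigma_b, \sigma_w) \in D_{\phi,var} \cap D_{\phi,corr}$ with $K_{\phi,var}(\sigma_b, \sigma_w) = K_{\phi,corr}(\sigma_b, \sigma_w) = \infty$. Moreover $EOC = \{(0, \sqrt{2 / (\lambda^2 + \beta^2)})\}$ and, on the EOC, $F(x) = x$ for any $x \ge 0$.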
Proof.
We write throughout the proof. Note first that the variance satisfies the recursion:
(4) 
For all , is a fixed point. This is true for any input, therefore and (i) is proved.
Now, the EOC equation is given by . Therefore, . Replacing by its critical value in equation 4 yields
Thus if and only if , otherwise diverges to infinity. So the frontier is reduced to a single point , and the variance does not depend on .
∎
Proposition 3 (ReLU kernel).
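Consider a ReLU network with parameters $(\sigma_b, \sigma_w) = (0, \sqrt{2})$ on the EOC. We have
(i) for $x \in [0, 1]$, $f(x) = \frac{1}{\pi} x \arcsin(x) + \frac{1}{\pi} \sqrt{1 - x^2} + \frac{1}{2} x$,
(ii) for any $(a, b)$, $\lim_{l \to \infty} c^l_{ab} = 1$ and $1 - c^l_{ab} \sim \frac{9 \pi^2}{2 l^2}$ as $l \to \infty$.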
Proof.
In this case the correlation function is given by where .

Let , note that is differentiable and satisfies,
which is also differentiable. Simple algebra leads to
Since and ,
We conclude using the fact that and .

We first derive a Taylor expansion of near 1. Consider the change of variable with close to 0, then
so that
and