On the Impact of the Activation Function on Deep Neural Networks Training

02/19/2019 ∙ by Soufiane Hayou, et al. ∙ University of Oxford

The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information about the input during forward propagation and to the exponential vanishing/exploding of gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully, as recently demonstrated by Samuel et al (2017), who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the `Edge of Chaos' can lead to good performance. While the work by Samuel et al (2017) discusses trainability issues, we focus here on training acceleration and overall performance. We give a comprehensive theoretical analysis of the Edge of Chaos and show that we can indeed tune the initialization parameters and the activation function in order to accelerate the training and improve the performance.


1 Introduction

Deep neural networks have become extremely popular as they achieve state-of-the-art performance on a variety of important applications including language processing and computer vision; see, e.g., Goodfellow-et-al-2016. The success of these models has motivated the use of increasingly deep networks and stimulated a large body of work to understand their theoretical properties. It is impossible to provide here a comprehensive summary of the large number of contributions within this field. To cite a few results relevant to our contributions, montufar have shown that neural networks have exponential expressive power with respect to the depth, while poole obtained similar results using a topological measure of expressiveness.

Since the loss function we are minimizing is non-convex for deep neural networks, the weight initialization and the activation function will essentially determine the functional subspace that the optimization algorithm will explore. We follow here the approach of poole and samuel by investigating the behaviour of random networks in the infinite-width and finite-variance i.i.d. weights context, where they can be approximated by a Gaussian process as established by lee and matthews.

In this paper, our contribution is three-fold. Firstly, we provide a comprehensive analysis of the so-called Edge of Chaos (EOC) curve and show that initializing a network on this curve leads to a deeper propagation of the information through the network and accelerates the training. In particular, we show that a feedforward ReLU network initialized on the EOC acts as a simple residual ReLU network in terms of information propagation. Secondly, we introduce a class of smooth activation functions that allows deeper signal propagation (Proposition 3) than ReLU. In particular, this analysis provides a theoretical understanding of why smooth versions of ReLU (such as SiLU or ELU) perform better experimentally for deep neural networks; see, e.g., clevert, pedamonti and ramachandran. Lastly, we show the existence of an optimal point on the EOC curve that depends on the depth of the network. We provide guidelines for the choice of such a point and demonstrate numerically the benefits of this approach. We also complement previous empirical results by illustrating the advantages of an initialization on the EOC in this context. All proofs are given in the Supplementary Material.

2 On Gaussian process approximations of neural networks and their stability

2.1 Setup and notations

Consider a fully connected feedforward random neural network of depth $L$, widths $(N_l)_{1 \le l \le L}$, weights $W^l_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_w^2/N_{l-1})$ and biases $B^l_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_b^2)$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. For some input $x \in \mathbb{R}^d$ (we set $N_0 := d$), the propagation of this input through the network is given, for an activation function $\phi : \mathbb{R} \to \mathbb{R}$, by

$y^1_i(x) = \sum_{j=1}^{d} W^1_{ij}\, x_j + B^1_i, \qquad (1)$

$y^l_i(x) = \sum_{j=1}^{N_{l-1}} W^l_{ij}\, \phi\big(y^{l-1}_j(x)\big) + B^l_i, \quad l \ge 2. \qquad (2)$

Throughout this paper we assume that, for all $l$, the processes $y^l_i(\cdot)$ are independent (across $i$) centred Gaussian processes with covariance kernels $\kappa^l$ and write accordingly $y^l_i \overset{\text{ind}}{\sim} \mathcal{GP}(0, \kappa^l)$. This is an idealized version of the true processes, corresponding to letting the widths of the previous layers go to infinity (which implies, by the Central Limit Theorem, that $y^l_i(x)$ is a Gaussian variable for any input $x$). The approximation of $y^l_i(\cdot)$ by a Gaussian process was first proposed by neal in the single-layer case and has been recently extended to the multiple-layer case in lee, matthews. We recall here the expressions of the limiting Gaussian process kernels. For any input $x$, $\mathbb{E}[y^l_i(x)] = 0$, so that for any inputs $x, x'$

$\kappa^l(x, x') = \mathbb{E}\big[y^l_i(x)\, y^l_i(x')\big] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{f \sim \mathcal{GP}(0, \kappa^{l-1})}\big[\phi(f(x))\, \phi(f(x'))\big],$

where the right-hand side is a function that depends only on $\kappa^{l-1}$ and $\phi$. This gives a recursion to calculate the kernel $\kappa^l$; see, e.g., lee for more details. We can also express the kernel (we use hereafter $q^l_x := \kappa^l(x,x)$ for the variance and $c^l_{x,x'}$ for the corresponding correlation) in terms of the variance and correlation in the previous layer

$\kappa^l(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\Big[\phi\big(\sqrt{q^{l-1}_x}\, Z_1\big)\, \phi\Big(\sqrt{q^{l-1}_{x'}}\,\big(c^{l-1}_{x,x'} Z_1 + \sqrt{1 - (c^{l-1}_{x,x'})^2}\, Z_2\big)\Big)\Big],$

where $q^{l-1}_x$, resp. $c^{l-1}_{x,x'}$, is the variance, resp. correlation, in the $(l-1)$-th layer and where $Z_1, Z_2$ are independent standard Gaussian random variables. When an input $x$ propagates through the network, its variance $q^l_x$ is updated through the layers by the recursive formula $q^l_x = F(q^{l-1}_x)$, where $F$ is the ‘variance function’ given by

$F(x) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{x}\, Z)^2\big], \qquad Z \sim \mathcal{N}(0, 1). \qquad (3)$

Throughout this paper, $Z, Z_1, Z_2$ will always denote independent standard Gaussian variables, and $x, x'$ two inputs for the network.
Before starting our analysis, the transform of a function of domain is defined by for . We have .

Let and be two subsets of . We define the following sets of functions for by

where is the derivative of . When and are not explicitly mentioned, we assume .
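To make the recursion above concrete, here is a small numerical sketch (not from the paper; the Tanh activation, the parameter values and the function names are illustrative choices): it propagates a single input through a wide random network and compares the empirical variance of the pre-activations at each layer with the limit recursion $q^l_x = F(q^{l-1}_x)$ estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def preactivations(x, depth, width, sigma_w, sigma_b, phi=np.tanh):
    """Forward-propagate one input through a random fully connected network
    (weights N(0, sigma_w^2 / fan_in), biases N(0, sigma_b^2)) and return the
    pre-activation vector of every layer."""
    fan_in = x.shape[0]
    y = rng.normal(0.0, sigma_w / np.sqrt(fan_in), (width, fan_in)) @ x \
        + rng.normal(0.0, sigma_b, width)
    layers = [y]
    for _ in range(depth - 1):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), (width, width))
        b = rng.normal(0.0, sigma_b, width)
        y = W @ phi(y) + b
        layers.append(y)
    return layers

sigma_w, sigma_b, width, depth = 1.5, 0.3, 2000, 10   # arbitrary illustrative values
x = rng.normal(size=50)
layers = preactivations(x, depth, width, sigma_w, sigma_b)

# Variance recursion q^l = F(q^{l-1}) with F(q) = sigma_b^2 + sigma_w^2 E[phi(sqrt(q) Z)^2].
z = rng.normal(size=200_000)
q = sigma_b**2 + sigma_w**2 * np.dot(x, x) / x.shape[0]
for l, y in enumerate(layers, start=1):
    print(f"layer {l:2d}: empirical variance {y.var():.3f}   recursion q^l {q:.3f}")
    q = sigma_b**2 + sigma_w**2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2)
```

With a few thousand neurons per layer, the two columns typically agree to within a few percent.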

2.2 Limiting behaviour of the variance and covariance operators

We analyze here the limiting behaviour of $q^l_x$ and $c^l_{x,x'}$ as the depth $l$ goes to infinity. From now onwards, we also assume without loss of generality that the first-layer correlation $c^1_{x,x'}$ is non-negative (similar results can be obtained straightforwardly when it is negative). We first define the Domains of Convergence associated with an activation function $\phi$.

Definition 1.

Let $(\sigma_b, \sigma_w) \in (\mathbb{R}^+)^2$ and let $\phi$ be an activation function.
(i) $(\sigma_b, \sigma_w)$ is in $D_{\phi,\mathrm{var}}$ (domain of convergence for the variance) if there exist $K > 0$ and $q \ge 0$ such that for any input $x$ with $q^1_x \le K$, $\lim_{l \to \infty} q^l_x = q$. We denote by $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w)$ the maximal $K$ satisfying this condition.
(ii) $(\sigma_b, \sigma_w)$ is in $D_{\phi,\mathrm{corr}}$ (domain of convergence for the correlation) if there exists $K > 0$ such that for any two inputs $x, x'$ with $q^1_x, q^1_{x'} \le K$, $\lim_{l \to \infty} c^l_{x,x'} = 1$. We denote by $K_{\phi,\mathrm{corr}}(\sigma_b, \sigma_w)$ the maximal $K$ satisfying this condition.

Remark. Typically, $q$ in Definition 1 is a fixed point of the variance function $F$ defined in (3). Therefore, it is easy to see that for any $(\sigma_b, \sigma_w)$ such that $F$ is non-decreasing and admits at least one fixed point, we have $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) \ge q$ where $q$ is the minimal fixed point, i.e. $q := \min\{x \ge 0 : F(x) = x\}$. Thus, if we re-scale the input data to have $q^1_x \le q$, the variance converges to $q$. We can also re-scale the variance of the first layer (only) so that $q^1_x = q$ for all inputs $x$.

(a) ReLU with
(b) Tanh with
(c) Tanh with
Figure 1: Draws of outputs for ReLU and Tanh networks for different parameters $(\sigma_b, \sigma_w)$. Figures (a) and (b) show the effect of an initialization in the ordered phase: the outputs are nearly constant. Figure (c) shows the effect of an initialization in the chaotic phase.

The next lemma gives sufficient conditions under which $K_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{corr}}$ are infinite.

Lemma 1.

Assume that $\phi''$ exists, at least in the distribution sense. (ReLU admits a Dirac mass at 0 as its second derivative and is thus covered by our developments.)
Let . Assume , then for and any , we have and .
Let . Assume for some , then for and any , we have and .

The proof of Lemma 1 is straightforward and relies on the Banach fixed-point theorem. Similar ideas are used to establish the second part.

Example. For the ReLU activation function, the variance function is affine, $F(x) = \sigma_b^2 + \frac{\sigma_w^2}{2}\,x$, and we have $K_{\phi,\mathrm{var}} = K_{\phi,\mathrm{corr}} = \infty$ for any $(\sigma_b, \sigma_w)$ with $\sigma_w < \sqrt{2}$.
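For completeness, the affine form of $F$ for ReLU follows from a one-line Gaussian moment computation (a standard calculation written out under the notation above; by symmetry of $Z$, the positive part carries half of the second moment):

```latex
\mathbb{E}\big[\max(\sqrt{q}\,Z,\,0)^2\big]
  \;=\; \tfrac{1}{2}\,\mathbb{E}\big[(\sqrt{q}\,Z)^2\big]
  \;=\; \tfrac{q}{2}
  \qquad\Longrightarrow\qquad
  F(q) \;=\; \sigma_b^2 + \frac{\sigma_w^2}{2}\,q .
```

In particular, $F$ is a contraction for $\sigma_w < \sqrt{2}$, divergent for $\sigma_w > \sqrt{2}$, and exactly variance-preserving at $(\sigma_b, \sigma_w) = (0, \sqrt{2})$.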

In the domain of convergence $D_{\phi,\mathrm{corr}}$, for all inputs $x, x'$ we have $\lim_{l \to \infty} c^l_{x,x'} = 1$, so that the outputs of the network are almost surely constant functions in the limit. Figures 1(a) and 1(b) illustrate this behaviour for ReLU and Tanh with inputs in using a network of depth with neurons per layer. The draws of outputs of these networks are indeed almost constant.

Under the conditions of Lemma 1, both the variance and the correlations converge exponentially fast (contraction mapping). To refine this convergence analysis, samuel established the existence of depth scales $\xi_q$ and $\xi_c$ such that $|q^l_x - q| = \mathcal{O}(e^{-l/\xi_q})$ and $|c^l_{x,x'} - 1| = \mathcal{O}(e^{-l/\xi_c})$ when the fixed points exist. The quantities $\xi_q$ and $\xi_c$ are called ‘depth scales’ since they represent the depth to which the variance and the correlation can propagate without being exponentially close to their limits. More precisely, writing $\chi_1 = \sigma_w^2\,\mathbb{E}[\phi'(\sqrt{q}Z)^2]$, the correlation depth scale is given by $\xi_c = -1/\log \chi_1$, with an analogous expression for $\xi_q$. The equation $\chi_1 = 1$ corresponds to an infinite depth scale of the correlation. It is called the EOC as it separates two phases: an ordered phase, where $\chi_1 < 1$ and the correlation converges to 1, and a chaotic phase, where $\chi_1 > 1$ and the correlations do not converge to 1. In this chaotic regime, it has been observed in samuel that the correlations converge to some limiting value $c^* < 1$ as $l \to \infty$ and that $c^*$ is independent of the correlation between the inputs. This means that very close inputs (in terms of correlation) lead to very different outputs. Therefore, in the chaotic phase, in the case of infinite width and depth, the output function of the neural network is discontinuous everywhere. Figure 1(c) shows an example of such behaviour for Tanh.

Definition 2 (Edge of Chaos).

For $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$, let $q$ be the limiting variance (the limiting variance is a function of $(\sigma_b, \sigma_w)$, but we do not emphasize this notationally). The EOC is the set of values of $(\sigma_b, \sigma_w)$ satisfying $\chi_1 = \sigma_w^2\,\mathbb{E}[\phi'(\sqrt{q}Z)^2] = 1$.

To further study the EOC regime, the next lemma introduces the ‘correlation function’ $f$, which simplifies the analysis of the correlations. It states that the correlations $c^l_{x,x'}$ have the same asymptotic behaviour as the time-homogeneous dynamical system $\gamma_{l+1} = f(\gamma_l)$.

Lemma 2.

Let $(\sigma_b, \sigma_w)$ be such that the variances converge to some limit $q > 0$, and let $\phi$ be an activation function whose Gaussian moments above are finite uniformly over compact sets of variances. Define $f$ by $f(c) = \frac{\sigma_b^2 + \sigma_w^2\,\mathbb{E}\big[\phi(\sqrt{q}Z_1)\,\phi\big(\sqrt{q}\,(cZ_1 + \sqrt{1-c^2}\,Z_2)\big)\big]}{q}$ and the sequence $(\gamma_l)$ by $\gamma_1 = c^1_{x,x'}$ and $\gamma_{l+1} = f(\gamma_l)$. Then $\lim_{l \to \infty} |c^l_{x,x'} - \gamma_l| = 0$.

The condition on $\phi$ in Lemma 2 is violated only by activation functions with square-exponential growth (which are not used in practice), so from now onwards we use this approximation in our analysis. Note that being on the EOC is equivalent to $(\sigma_b, \sigma_w)$ satisfying $f'(1) = 1$. In the next section, we analyze this phase transition carefully for a large class of activation functions.
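As an illustration of Definition 2 and Lemma 2 (a numerical sketch, not the authors' code; the parameter pairs below are placeholders rather than exact EOC points), the limiting variance can be obtained by iterating $F$, and the EOC criterion $f'(1) = 1$ can then be checked by a finite-difference estimate of the slope of the correlation function at $c = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 500_000))

def limiting_variance(sigma_b, sigma_w, phi, iters=200, q0=1.0):
    """Iterate q <- F(q) = sigma_b^2 + sigma_w^2 E[phi(sqrt(q) Z)^2] to approximate the limit."""
    q = q0
    for _ in range(iters):
        q = sigma_b**2 + sigma_w**2 * np.mean(phi(np.sqrt(q) * z1) ** 2)
    return q

def correlation_map(c, q, sigma_b, sigma_w, phi):
    """Monte Carlo estimate of the correlation function f(c) of Lemma 2."""
    u1 = np.sqrt(q) * z1
    u2 = np.sqrt(q) * (c * z1 + np.sqrt(1.0 - c**2) * z2)
    return (sigma_b**2 + sigma_w**2 * np.mean(phi(u1) * phi(u2))) / q

for sigma_b, sigma_w in [(0.2, 1.1), (0.2, 0.8)]:     # placeholder parameter pairs
    q = limiting_variance(sigma_b, sigma_w, np.tanh)
    eps = 1e-3
    slope = (correlation_map(1.0, q, sigma_b, sigma_w, np.tanh)
             - correlation_map(1.0 - eps, q, sigma_b, sigma_w, np.tanh)) / eps
    print(f"sigma_b={sigma_b}, sigma_w={sigma_w}:  q ~ {q:.3f},  estimated f'(1) ~ {slope:.3f}")
```

A Gauss-Hermite quadrature would give a more accurate estimate than plain Monte Carlo, but the principle is the same: the closer the estimated slope is to 1, the closer the pair $(\sigma_b, \sigma_w)$ is to the EOC.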

3 Edge of Chaos

To illustrate the effect of an initialization on the EOC, we plot in Figure 2(c) the output of a ReLU neural network with 20 layers and 100 neurons per layer with parameters $(\sigma_b, \sigma_w) = (0, \sqrt{2})$ (the EOC for ReLU, as we will see later). Unlike the output in Figure 1(a), this output displays much more variability. However, we will prove here that the correlations still converge to 1 even in the EOC regime, albeit at a slower rate.

3.1 ReLU-like activation functions

Contrary to classical activation functions (sigmoid, tanh, etc.), ReLU does not suffer from the vanishing gradient problem; see, e.g., glorot and nair. Many variants such as Leaky-ReLU were shown to perform better than ReLU xu. This motivates the analysis of such functions from an initialization point of view. Let us first define this class.

Definition 3 (ReLU-like functions).

A function $\phi$ is ReLU-like if it is of the form

$\phi(x) = \lambda x \ \text{for } x \ge 0, \qquad \phi(x) = \beta x \ \text{for } x < 0,$

where $\lambda, \beta \in \mathbb{R}$.

ReLU corresponds to $\lambda = 1$ and $\beta = 0$. For this class of activation functions, the EOC in the sense of Definition 2 is reduced to a null set. However, we can define a weak version of the EOC for this class. From Lemma 1, when $\sigma_w^2(\lambda^2 + \beta^2)/2 < 1$, the variances converge to $\sigma_b^2/\big(1 - \sigma_w^2(\lambda^2 + \beta^2)/2\big)$ and the correlations converge to 1 exponentially fast. If $\sigma_w^2(\lambda^2 + \beta^2)/2 > 1$, the variances converge to infinity. We then have the following result.

Lemma 3 (Weak EOC).

Let $\phi$ be a ReLU-like function. Then the variance function $F$ is affine, with a slope $\sigma_w^2(\lambda^2 + \beta^2)/2$ that does not depend on the input variance, and having bounded variances while avoiding an exponential convergence of the correlations is only possible for the singleton $(\sigma_b, \sigma_w) = \big(0, \sqrt{2/(\lambda^2 + \beta^2)}\big)$. The weak EOC is defined as this singleton.

The non-existence of an EOC for ReLU-like activations in the sense of Definition 2 is due to the fact that the variance is unchanged on the weak EOC ($q^l_x = q^1_x$ for all $l$), so that the limiting variance does not formally exist in the sense that the limit of $q^l_x$ depends on $x$. However, this does not impact the analysis of the correlations.
This class of activation functions has the interesting property of preserving the variance across layers when the network is initialized on the EOC (see the numerical check below). We show in Proposition 1 below that, in the EOC regime, the correlations converge to 1 at a sub-exponential rate. We only present the result for ReLU, but the generalization to the whole class is straightforward.
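Here is the numerical check referred to above (a toy sketch, not the paper's code): with $(\sigma_b, \sigma_w) = (0, \sqrt{2})$, the pre-activation variance of a finite but wide ReLU network stays essentially constant across layers.

```python
import numpy as np

rng = np.random.default_rng(1)
width, extra_layers = 4000, 50
sigma_w, sigma_b = np.sqrt(2.0), 0.0        # weak EOC for ReLU (He-style initialization)

y = rng.normal(size=width)                  # pre-activations of some layer, variance ~ 1
for _ in range(extra_layers):
    W = rng.normal(0.0, sigma_w / np.sqrt(width), (width, width))
    y = W @ np.maximum(y, 0.0)              # sigma_b = 0, so there is no bias term
print(f"variance after {extra_layers} more layers:", round(float(y.var()), 3))   # stays close to 1
```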

(a) Convergence of the correlation to 1 with
(b) Correlation function
(c) Output of 100x20 ReLU network on the EOC
Figure 2: Impact of the EOC initialization on the correlation and the correlation function. In (a), the correlation converges to 1 at a sub-exponential rate when the network is initialized on the EOC. In (b), the correlation function satisfies on the EOC.

Example: ReLU. The EOC is reduced to the singleton $(\sigma_b, \sigma_w) = (0, \sqrt{2})$, which means we should initialize a ReLU network with these parameter values, i.e. $W^l_{ij} \sim \mathcal{N}(0, 2/N_{l-1})$ and no bias. This result coincides with the recommendation of he2, whose objective was to make the variance constant as the input propagates but who did not analyze the propagation of the correlations. klambauer performed a similar analysis by using the “Scaled Exponential Linear Unit” (SELU) activation, which makes it possible to center the mean and normalize the variance of the post-activations. The propagation of the correlation was not discussed therein either.
Figure 2(b) displays the correlation function $f$ for two different sets of parameters $(\sigma_b, \sigma_w)$. The blue graph corresponds to the EOC, and the red one to an ordered-phase initialization.

In the next result, we show that a fully connected feedforward ReLU network initialized on the EOC (in the weak sense) acts, in terms of correlation propagation, as if it had residual connections. This potentially explains why training ReLU networks is faster on the EOC (see the experimental results). We further show that the correlations converge to 1 at a polynomial rate of $1/\ell^2$ instead of the exponential rate observed in the ordered phase.
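As a quick sanity check on this polynomial rate (an illustration, not the paper's code), the ReLU correlation map on the weak EOC has a closed form, the normalized degree-1 arc-cosine kernel, so the recursion $c^{\ell+1} = f(c^{\ell})$ can be iterated exactly:

```python
import numpy as np

def relu_eoc_corr_map(c):
    """Correlation after one layer of a ReLU network on its weak EOC
    (sigma_b = 0, sigma_w = sqrt(2)): the normalized degree-1 arc-cosine kernel."""
    c = float(np.clip(c, -1.0, 1.0))
    return (np.sqrt(1.0 - c**2) + c * (np.pi - np.arccos(c))) / np.pi

c = 0.1
for layer in range(1, 1001):
    c = relu_eoc_corr_map(c)
    if layer in (10, 100, 1000):
        # 1 - c_l decays polynomially (roughly like 1/l^2 for large l), not exponentially.
        print(f"layer {layer:4d}:  1 - c_l = {1.0 - c:.3e}")
```

Asymptotically, $1 - c_\ell$ decreases roughly like $1/\ell^2$, in contrast with the exponential decay observed in the ordered phase.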

Proposition 1 (EOC acts as Residual connections).

Consider a ReLU network with parameters and let be the corresponding correlation. Consider also a ReLU network with simple residual connections given by

where and . Let be the corresponding correlation. Then, by taking and , there exists a constant such that

3.2 Smooth activation functions

We show here that smooth activation functions provide better signal propagation through the network. We start with a result on the existence of the EOC.

Proposition 2.

Let be non ReLU-like function such that and . Assume is non-decreasing and is non-increasing. Let and for let be the smallest fixed point of the function . Then we have .

Example: Tanh and ELU (defined by $\phi(x) = x$ for $x \ge 0$ and $\phi(x) = e^x - 1$ for $x < 0$) satisfy all the conditions of Proposition 2. SiLU (a.k.a. Swish) does not satisfy the conditions of Proposition 2 but has an EOC (see Appendix).
Proposition 2 suggests Algorithm 1 to find the EOC curves.
Figure 3 shows the EOC curves for different activation functions. For ReLU, the EOC is reduced to a point while smooth activation functions have an EOC curve.

(a) Tanh
(b) ReLU
(c) ELU
Figure 3: EOC curves for different activation functions (red dashed line). For smooth activation functions (Figures 3.a and 3.c), the EOC is a curve in the $(\sigma_b, \sigma_w)$ plane, while it is reduced to a single point for ReLU (Figure 3.b).
Figure 4: Impact of the smoothness of the activation function on the convergence of the correlation on the EOC. The convergence rate is $1/\ell^2$ for ReLU and $1/\ell$ for ELU and Tanh.
Data: satisfying conditions of Proposition 2,
Result: EOC curve
Initialize ;
while  has not converged do
       ;
      
end while
return ()
Algorithm 1 EOC Curve
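Since the body of Algorithm 1 is not fully legible above, here is one plausible implementation consistent with Proposition 2 (a sketch under that reading, not necessarily the authors' exact procedure; all function names are ours): for a given $\sigma_b$, alternate between enforcing the EOC condition $\sigma_w^2\,\mathbb{E}[\phi'(\sqrt{q}Z)^2] = 1$ and updating $q$ through the variance function.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=500_000)

def eoc_point(sigma_b, phi, dphi, iters=100):
    """For a given sigma_b, return (sigma_w, q) such that q is the limiting variance and
    sigma_w^2 E[phi'(sqrt(q) Z)^2] = 1 (the EOC condition), via fixed-point iteration."""
    q, sigma_w = 1.0, 1.0
    for _ in range(iters):
        sigma_w = 1.0 / np.sqrt(np.mean(dphi(np.sqrt(q) * z) ** 2))
        q = sigma_b**2 + sigma_w**2 * np.mean(phi(np.sqrt(q) * z) ** 2)
    return sigma_w, q

# A few points of the Tanh EOC curve (sigma_b, sigma_w(sigma_b)).
dtanh = lambda x: 1.0 / np.cosh(x) ** 2
for sb in (0.05, 0.2, 0.5):
    sw, q = eoc_point(sb, np.tanh, dtanh)
    print(f"sigma_b = {sb:.2f}  ->  sigma_w ~ {sw:.3f},  limiting variance q ~ {q:.3f}")
```

Sweeping $\sigma_b$ over a grid traces out the EOC curve shown in Figure 3 for a smooth activation.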

A natural question that arises from the analysis above is whether we can have . The answer is yes for a large class of ‘Tanh-like’ activation functions. Let us first define this class.

Definition 4 (Tanh-like activation functions).

Let . is Tanh-like if

  1. bounded, , and for all , , and .

  2. There exist such that for large (in norm).

Lemma 4.

Let be a Tanh-like activation function, then satisfies all conditions of Proposition 2 and .

Recall that the convergence rate of the correlation to 1 for ReLU-like activations on the EOC is $1/\ell^2$. We can improve this rate by taking a sufficiently regular activation function. Let us first define a regularity class.

Definition 5.

Let . We say that is in if there exists , a partition of and such that

This class includes activations such as Tanh, SiLU, ELU (with ). Note that for all .
For activation functions in this regularity class, the next proposition shows that the correlation converges to 1 at a rate of $1/\ell$, which is better (i.e. slower) than the $1/\ell^2$ rate of ReLU-like activation functions.

Proposition 3 (Convergence rate for smooth activations).

Let $\phi$ be in this regularity class and non-linear (i.e. $\phi''$ is not identically zero). Then, on the EOC, we have $1 - c^\ell_{x,x'} \sim \beta/\ell$, where $\beta > 0$ is a constant depending on $\phi$ and on the chosen point of the EOC.

Choosing a smooth activation function is therefore better for deep neural networks since it entails deeper information propagation. This could explain, for example, why smooth versions of ReLU such as ELU perform better experimentally; see clevert and Section 4. Figure 4 shows, in log scale, the evolution of the correlation through the network layers for different activation functions. For functions in this class (Tanh and ELU), the graph shows a rate of $1/\ell$, as expected, compared to $1/\ell^2$ for ReLU.

So far, we have only discussed the impact of the EOC and of the smoothness of the activation function on the correlation. But could we further improve this propagation by choosing a specific point on the EOC? The next proposition shows that the coefficient plays an important role in the information propagation process. Indeed, we show that it controls the propagation of the correlation and the backpropagation of the gradients, following the setup of samuel. In particular, similarly to Lillicrap2016 and samuel, we use the approximation that the weights used during forward propagation are independent of the weights used during backpropagation. Let $E$ denote the empirical risk to be minimized, associated with a given loss function.

Proposition 4.

Let be a non-linear activation function such that , . Assume that is non-decreasing and is non-increasing, and let be defined as in Proposition 2. Define the gradient with respect to the layer by , let be the covariance matrix of the gradients during backpropagation. Recall that .

Then, for any , by taking we have

  • For ,

Moreover, we have

The result of Proposition 5 suggests that by taking $\sigma_b$ small, we can achieve two important things. First, we can make the correlation function $f$ close to the identity function, which further slows the convergence of the correlation to 1, i.e., the information propagates deeper inside the network. Second, we can make the trace of the covariance matrix of the gradients approximately constant through the layers, which means that we avoid vanishing of the information (represented by this trace) during backpropagation; see Lillicrap2016, samuel for a discussion of the importance of stabilizing it over the layers. Proposition 5 generalizes these results and provides more precise control on the asymptotic behaviour of the correlations and gradients, allowing us to give more precise recommendations on how to choose the hyperparameters of the initialization step.
However, the limiting variance also becomes small when $\sigma_b$ becomes small. Therefore, in practice, we cannot take $\sigma_b$ too small. On the other hand, only a linear activation would make $f$ exactly equal to the identity for all correlations. Hence, there is a trade-off to be taken into account when initializing on the EOC. Using Proposition 5, we can deduce the maximal depth to which the correlation can propagate without coming within a given distance of 1. Indeed, we have for all , , therefore for , . Assuming that for all inputs , where is a constant, the maximal depth we can reach without losing the information is . Note that (because ).
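As a rough empirical counterpart to this discussion (a toy sketch with hypothetical parameter values; the paper provides no code and does not specify a framework, so PyTorch is assumed here), one can backpropagate a dummy loss through a random Tanh network and compare how the gradient norm with respect to the hidden pre-activations evolves with depth for a near-EOC initialization versus an ordered-phase one:

```python
import math
import torch

torch.manual_seed(0)

def grad_norm_profile(sigma_w, sigma_b, depth=100, width=300):
    """Squared gradient norm of a dummy loss w.r.t. the pre-activations of each layer
    (a crude proxy for the trace of the gradient covariance discussed above)."""
    x = torch.randn(1, width, requires_grad=True)
    h, hiddens = x, []
    for l in range(depth):
        W = torch.randn(width, width) * sigma_w / math.sqrt(width)
        b = torch.randn(width) * sigma_b
        inp = h if l == 0 else torch.tanh(h)
        h = inp @ W.t() + b
        h.retain_grad()
        hiddens.append(h)
    loss = h.pow(2).mean()
    loss.backward()
    return [float(y.grad.pow(2).sum()) for y in hiddens]

for name, (sb, sw) in {"near-EOC": (0.2, 1.1), "ordered": (0.2, 0.8)}.items():
    g = grad_norm_profile(sw, sb)
    print(f"{name:9s}  layer 1: {g[0]:.2e}   layer 50: {g[49]:.2e}   layer 100: {g[99]:.2e}")
```

With the ordered-phase parameters, the gradient norm collapses by tens of orders of magnitude between the last and the first layers, whereas the near-EOC initialization keeps it within a few orders of magnitude across layers.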

Corollary 1.

Under the hypotheses of Proposition 2, choosing $(\sigma_b, \sigma_w)$ on the EOC such that the constant controlling the correlation propagation is of the same order as the inverse of the depth leads to almost constant correlations and gradients across layers.

We verify numerically the optimality of this rule in Section 4.
Note that ReLU-like activations do not satisfy the conditions of Proposition 5. The next lemma gives easy-to-verify sufficient conditions for Proposition 5 to apply.

Lemma 5.

Let such that and for all . Then, satisfies all conditions of Proposition 5.

Example: Tanh and ELU satisfy all the conditions of Lemma 5. This may partly explain why ELU performs better than ReLU (see the next section). Another example is the activation function of the form where . We evaluate it experimentally in the Appendix and show that it performs better than many alternatives.

(a) ELU
(b) ReLU
(c) Tanh
Figure 5:

100 epochs of the training curve (test accuracy) for different activation functions, for depth 200 and width 300, using SGD. The red curves correspond to the EOC, the green ones to an ordered phase, and the blue ones to an initialization on the EOC plus Batch Normalization after each layer. The upper figures show the test accuracies with respect to the epochs, while the lower figures show the accuracies with respect to time.

4 Experiments

We demonstrate here empirically our theoretical results. In order, we show the following:

  • For deep networks, only an initialization on the EOC makes the training successful, and such an initialization performs better than Batch Normalization.

  • Smooth activation functions in the sense of Proposition 3 perform better than ReLU-like activations, especially for very deep networks.

  • Choosing the right point on the EOC further accelerates the training (Proposition 5).

We demonstrate empirically our results on the MNIST and CIFAR10 datasets for depths between 30 and 200 and width 300. We use SGD and RMSProp with the categorical cross-entropy loss for training. We performed a grid search (with an exponential step of size 10) to find the optimal learning rate; for both SGD and RMSProp, the nearly optimal learning rate depends on the depth of the network. All reported values are averages over 10 runs, and we also show the 90% confidence interval.

Initialization on the Edge of Chaos. We initialize the network randomly by sampling $W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2/N_{l-1})$ and $B^l_i \sim \mathcal{N}(0, \sigma_b^2)$ with $(\sigma_b, \sigma_w)$ on the EOC. Figure 5 shows that the initialization on the EOC dramatically accelerates the training for ELU, ReLU and Tanh. The initialization in the ordered phase (here we used the same ordered-phase parameters for all activations) results in the optimization algorithm eventually being stuck at a very poor test accuracy of approximately 10% (equivalent to selecting the output uniformly at random). Figure 5 also shows that the EOC combined with BatchNorm results in a worse learning curve and dramatically increases the training time. Note that it is crucial here to initialize the BatchNorm parameters to 1 (scale) and 0 (shift) in order to keep our analysis of the forward propagation on the EOC valid for networks with BatchNorm.
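For concreteness, initializing a deep feedforward network "on the EOC" in a standard framework only amounts to setting the layer-wise weight and bias standard deviations. The sketch below uses PyTorch (an assumption; the paper does not specify a framework) and placeholder values of $(\sigma_b, \sigma_w)$ that would in practice be taken from the EOC curve of the chosen activation:

```python
import math
import torch
import torch.nn as nn

def eoc_init_(model, sigma_w, sigma_b):
    """Re-initialize every Linear layer so that W ~ N(0, sigma_w^2 / fan_in)
    and b ~ N(0, sigma_b^2), matching the initialization scheme of the paper."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            fan_in = m.weight.shape[1]
            nn.init.normal_(m.weight, mean=0.0, std=sigma_w / math.sqrt(fan_in))
            nn.init.normal_(m.bias, mean=0.0, std=sigma_b)

depth, width = 200, 300
layers = [nn.Flatten(), nn.Linear(28 * 28, width), nn.Tanh()]
for _ in range(depth - 2):
    layers += [nn.Linear(width, width), nn.Tanh()]
layers += [nn.Linear(width, 10)]
net = nn.Sequential(*layers)

# Placeholder Tanh parameters; in practice take them from the EOC curve
# (e.g. as computed by the Algorithm 1 sketch above).
eoc_init_(net, sigma_w=1.1, sigma_b=0.2)
```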

MNIST      | EOC          | EOC + BN     | Ordered phase
ReLU       | 93.57 ± 0.18 | 93.11 ± 0.21 | 10.09 ± 0.61
ELU        | 97.62 ± 0.21 | 93.41 ± 0.30 | 10.14 ± 0.51
Tanh       | 97.20 ± 0.30 | 10.74 ± 0.10 | 10.02 ± 0.13
S-Softplus | 10.32 ± 0.41 |  9.92 ± 0.12 | 10.09 ± 0.53

CIFAR10    | EOC          | EOC + BN     | Ordered phase
ReLU       | 36.55 ± 1.15 | 35.91 ± 1.52 |  9.91 ± 0.93
ELU        | 45.76 ± 0.91 | 44.12 ± 0.93 | 10.11 ± 0.65
Tanh       | 44.11 ± 1.02 | 10.15 ± 0.85 |  9.82 ± 0.88
S-Softplus | 10.13 ± 0.11 |  9.81 ± 0.63 | 10.05 ± 0.71

Table 1: Test accuracies (%) for width 300 and depth 200 with different activations on MNIST and CIFAR10 after 100 epochs using SGD.

Table 1 shows the test accuracy after 100 epochs for different activation functions on MNIST and CIFAR10. For all activation functions but S-Softplus (Softplus shifted so that $\phi(0) = 0$), the EOC initialization leads to the best performance. Adding BatchNorm to the EOC initialization makes the training worse; this can be explained by the fact that the BatchNorm parameters are also modified from the first backpropagation pass onwards, which destroys the EOC results for gradient backpropagation (see the proof of Proposition 5). The poor performance of S-Softplus, even on its EOC, which is reduced to a single point (see Appendix), is due to the fact that the corresponding limiting variance is zero, which leads to the SGD algorithm being stuck at a low test accuracy.

Impact of the smoothness of the activation function on the training. Table 2 shows the test accuracy at different epochs for ReLU, ELU and Tanh. Smooth activation functions perform better than ReLU, which confirms the result of Proposition 3. More experimental results with RMSProp and with other activation functions of the form are provided in the Appendix.


MNIST   | Epoch 10     | Epoch 50     | Epoch 100
ReLU    | 66.76 ± 1.95 | 88.62 ± 0.61 | 93.57 ± 0.18
ELU     | 96.09 ± 1.55 | 97.21 ± 0.31 | 97.62 ± 0.21
Tanh    | 89.75 ± 1.01 | 96.51 ± 0.51 | 97.20 ± 0.30

CIFAR10 | Epoch 10     | Epoch 50     | Epoch 100
ReLU    | 26.46 ± 1.68 | 33.74 ± 1.21 | 36.55 ± 1.15
ELU     | 35.95 ± 1.83 | 45.55 ± 0.91 | 47.76 ± 0.91
Tanh    | 34.12 ± 1.23 | 43.47 ± 1.12 | 44.11 ± 1.02

Table 2: Test accuracies (%) for width 300 and depth 200 with different activations on MNIST and CIFAR10 using SGD.

What is the best point on the EOC? From Corollary 1, a good choice on the EOC is a point $(\sigma_b, \sigma_w)$ adapted to the depth of the network. Figure 6 shows the test accuracy of a Tanh network for different depths, using different points on the EOC. We see that, for depth 50, the red curve is the best. For the other depths between 30 and 90, the corresponding $\sigma_b$ is the value that best matches the rule of Corollary 1 among the values considered, which explains why the red curve is approximately the best for all depths between 30 and 90. To further confirm this finding, we search numerically for the best $\sigma_b$ for several depths. In Table 3, the value of $\sigma_b$ that we get from the rule of Corollary 1 is close to the best value of $\sigma_b$ found numerically.

(a) Epoch 10
(b) Epoch 50
(c) Epoch 100
Figure 6: Test accuracies for Tanh network with depths between 30 and 90 and width 300 using different points on the EOC

Depth                       |       |      |
Best $\sigma_b$ (numerical) | 0.08  | 0.04 | 0.02
$\sigma_b$ from Corollary 1 | 0.071 | 0.03 | 0.022
Table 3: Value of $\sigma_b$ achieving the best test accuracy after 100 epochs with Tanh on MNIST, compared with the value given by Corollary 1.

5 Discussion

The Gaussian process approximation of deep neural networks was used by samuel to show that very deep Tanh networks are only trainable on the EOC. We complement this result by giving a comprehensive analysis of the EOC for a large class of activation functions. We prove that smoothness plays a major role in terms of signal propagation, as confirmed experimentally (Table 2). Moreover, we provide a rule to choose the optimal point on the EOC depending on the depth. As the depth increases, we need a smaller $\sigma_b$ to achieve the best signal propagation. However, the limiting variance becomes close to zero as $\sigma_b$ goes to zero. To avoid this problem, one possible solution would be to change the activation function so that the relevant coefficient becomes large independently of the choice of $(\sigma_b, \sigma_w)$ on the EOC; we leave this for future research.

References

Appendix A Proofs

We provide in this supplementary material the proofs of the theoretical results presented in the main document, and we give additional theoretical and experimental results. For the sake of clarity, we recall the results before giving their proofs.

A.1 Convergence to the fixed point: Proposition 1

Lemma 1.

Let . Suppose , then for and any , we have and

Moreover, let . Suppose for some positive , then for and any , we have and .

Proof.

To abbreviate the notation, we use for some fixed input .

Convergence of the variances: We first consider the asymptotic behaviour of . Recall that where

The first derivative of this function is given by

(4)

where we use Gaussian integration by parts, $\mathbb{E}[Z\,g(Z)] = \mathbb{E}[g'(Z)]$, an identity satisfied by any function $g$ such that these expectations are well defined.

Using the condition on , we see that the function is a contraction mapping for and the Banach fixed-point theorem guarantees the existence of a unique fixed point of , with . Note that this fixed point depends only on , therefore this is true for any input and .

Convergence of the covariances: Since , then for all there exists such that for all . Let , using Gaussian integration by parts, we have

We cannot use the Banach fixed point theorem directly because the integrated function here depends on through . For ease of notation, we write . We have

Therefore, for , is a Cauchy sequence and it converges to a limit . At the limit

The derivative of this function is given by

By assumption on and the choice of , we have so is a contraction and has a unique fixed point. Since then . The above result is true for any , therefore . ∎

Lemma 2.

Let such that , and an activation function such that for all compact sets