On the Selection of Initialization and Activation Function for Deep Neural Networks

05/21/2018 · Soufiane Hayou et al. · University of Oxford

The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the learning procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and the exponential vanishing/exploding of gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully, as recently demonstrated by Schoenholz et al. (2017), who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the 'edge of chaos' can lead to good performance. We complete these recent results by providing quantitative results showing that, for a class of ReLU-like activation functions, the information indeed propagates deeper when the network is initialized on the edge of chaos. By extending our analysis to a larger class of functions, we then identify an activation function, $\phi_{\mathrm{new}}(x) = x \cdot \mathrm{sigmoid}(x)$, which improves the information propagation over ReLU-like functions and does not suffer from the vanishing gradient problem. We demonstrate empirically that this activation function combined with a random initialization on the edge of chaos outperforms standard approaches. This complements recent independent work by Ramachandran et al. (2017), who observed empirically in extensive simulations that this activation function performs better than many alternatives.


1 Introduction

Deep neural networks have become extremely popular as they achieve state-of-the-art performance on a variety of important applications including language processing and computer vision; see, e.g., LeCun et al. (1998). The success of these models has motivated the use of increasingly deep networks and stimulated a large body of work to understand their theoretical properties. It is impossible to provide here a comprehensive summary of the large number of contributions within this field. To cite a few results relevant to our contributions, Montufar et al. (2014) have shown that neural networks have exponential expressive power with respect to the depth while Poole et al. (2016) obtained similar results using a topological measure of expressiveness.

We follow here the approach of Poole et al. (2016) and Schoenholz et al. (2017) by investigating the behaviour of random networks in the infinite-width and finite-variance i.i.d. weights context, where they can be approximated by a Gaussian process as established by Matthews et al. (2018) and Lee et al. (2018).

In this paper, our contribution is two-fold. Firstly, we provide an analysis complementing the results of Poole et al. (2016) and Schoenholz et al. (2017) and show that initializing a network with a specific choice of hyperparameters known as the 'edge of chaos' is linked to a deeper propagation of the information through the network. In particular, we establish that for a class of ReLU-like activation functions, the exponential depth scale introduced in Schoenholz et al. (2017) is replaced by a polynomial depth scale. This implies that the information can propagate deeper when the network is initialized on the edge of chaos. Secondly, we outline the limitations of ReLU-like activation functions by showing that, even on the edge of chaos, the limiting Gaussian process admits a degenerate kernel as the number of layers goes to infinity. Our main result (Proposition 4) gives sufficient conditions for an activation function to allow a good 'information flow' through the network, in addition to being non-polynomial and not suffering from the exploding/vanishing gradient problem. These conditions are satisfied by the Swish activation $\phi_{\mathrm{swish}}(x) = x \cdot \mathrm{sigmoid}(x)$ used in Hendrycks & Gimpel (2016), Elfwing et al. (2017) and Ramachandran et al. (2017). In recent work, Ramachandran et al. (2017) used automated search techniques to identify new activation functions and found experimentally that functions of this form indeed appear to perform better than many alternative functions, including ReLU. Our paper provides a theoretical grounding for these results. We also complement previous empirical results by illustrating the benefits of an initialization on the edge of chaos in this context. All proofs are given in the Supplementary Material.

2 On Gaussian process approximations of neural networks and their stability

2.1 Setup and notations

We use similar notations to those of Poole et al. (2016) and Lee et al. (2018). Consider a fully connected random neural network of depth $L$, widths $(N_l)_{1 \le l \le L}$, weights $W^l_{ij} \overset{\text{iid}}{\sim} \mathcal{N}\big(0, \tfrac{\sigma_w^2}{N_{l-1}}\big)$ and bias $B^l_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_b^2)$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution of mean $\mu$ and variance $\sigma^2$. For some input $x \in \mathbb{R}^d$ (with the convention $N_0 := d$), the propagation of this input through the network is given for an activation function $\phi : \mathbb{R} \to \mathbb{R}$ by

$$y^1_i(x) = \sum_{j=1}^{d} W^1_{ij}\, x_j + B^1_i, \qquad y^l_i(x) = \sum_{j=1}^{N_{l-1}} W^l_{ij}\, \phi\big(y^{l-1}_j(x)\big) + B^l_i, \quad l \ge 2. \tag{1}$$
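To make the setup concrete, here is a minimal NumPy sketch (not the authors' code) of a single forward pass under this random initialization; the input dimension, widths, depth and activation below are arbitrary placeholders.

```python
import numpy as np

def forward(x, depth=10, width=100, sigma_w=1.0, sigma_b=1.0, phi=np.tanh, seed=0):
    """Propagate one input x through a random fully connected network.

    Weights are drawn i.i.d. N(0, sigma_w^2 / fan_in) and biases N(0, sigma_b^2),
    matching the initialization in equation (1).
    """
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    fan_in = h.shape[0]
    # first layer acts on the raw input, subsequent layers on post-activations
    y = rng.normal(0.0, sigma_w / np.sqrt(fan_in), (width, fan_in)) @ h \
        + rng.normal(0.0, sigma_b, width)
    for _ in range(depth - 1):
        a = phi(y)
        y = rng.normal(0.0, sigma_w / np.sqrt(width), (width, width)) @ a \
            + rng.normal(0.0, sigma_b, width)
    return y

x = np.random.default_rng(1).normal(size=784)
relu = lambda t: np.maximum(t, 0.0)
print(forward(x, depth=20, sigma_w=np.sqrt(2.0), sigma_b=0.0, phi=relu).std())
```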

Throughout the paper we assume that, for all $l$, the processes $y^l_i(\cdot)$ are independent (across $i$) centred Gaussian processes with covariance kernels $\kappa^l$ and write accordingly $y^l_i \overset{\text{ind}}{\sim} \mathrm{GP}(0, \kappa^l)$. This is an idealized version of the true processes corresponding to choosing $N_{l-1} = +\infty$ (which implies, using the Central Limit Theorem, that $y^l_i(x)$ is a Gaussian variable for any input $x$). The approximation of $y^l_i(\cdot)$ by a Gaussian process was first proposed by Neal (1995) in the single layer case and has been recently extended to the multiple layer case by Lee et al. (2018) and Matthews et al. (2018). We recall here the expressions of the limiting Gaussian process kernels. For any input $x$, $\mathbb{E}[y^l_i(x)] = 0$, so that for any inputs $x, x'$

$$\kappa^l(x, x') = \mathbb{E}\big[y^l_i(x)\, y^l_i(x')\big] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{y^{l-1}_i \sim \mathrm{GP}(0, \kappa^{l-1})}\big[\phi\big(y^{l-1}_i(x)\big)\, \phi\big(y^{l-1}_i(x')\big)\big] = \sigma_b^2 + \sigma_w^2\, F_\phi\big(\kappa^{l-1}(x, x), \kappa^{l-1}(x, x'), \kappa^{l-1}(x', x')\big)$$

where $F_\phi$ is a function that depends only on $\phi$. This gives a recursion to calculate the kernel $\kappa^l$; see, e.g., Lee et al. (2018) for more details. We can also express the kernel in terms of the correlation $c^l_{x,x'}$ in the $l$-th layer, used in the rest of this paper:

$$\kappa^l(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\Big[\phi\big(\sqrt{q^{l-1}_x}\, Z_1\big)\, \phi\Big(\sqrt{q^{l-1}_{x'}}\,\big(c^{l-1}_{x,x'}\, Z_1 + \sqrt{1 - (c^{l-1}_{x,x'})^2}\, Z_2\big)\Big)\Big]$$

where $q^l_x := \kappa^l(x, x)$, resp. $c^l_{x,x'} := \kappa^l(x, x')/\sqrt{q^l_x\, q^l_{x'}}$, is the variance, resp. correlation, in the $l$-th layer and $Z_1, Z_2$ are independent standard Gaussian random variables. $q^l_x$ is the variance of $y^l_i(x)$ when it propagates through the network.

It is updated through the layers by the recursive formula $q^l_x = F(q^{l-1}_x)$, where $F$ is the 'variance function' given by

$$F(s) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{s}\, Z)^2\big], \qquad Z \sim \mathcal{N}(0, 1). \tag{2}$$

Throughout the paper, $Z, Z_1, Z_2$ will always denote independent standard Gaussian variables.
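The recursions above can be iterated numerically. The following sketch (an illustration under arbitrary hyperparameters, using Gauss–Hermite quadrature for the Gaussian expectations; not the authors' implementation) computes the variance map $F$ of equation (2) and the covariance update.

```python
import numpy as np

# Gauss-Hermite rule: E[g(Z)] ~ sum_i w_i * g(sqrt(2)*t_i) / sqrt(pi) for Z ~ N(0, 1)
T, W = np.polynomial.hermite.hermgauss(80)
Z_NODES, W_NODES = np.sqrt(2.0) * T, W / np.sqrt(np.pi)

def expect(g):
    """Quadrature approximation of E[g(Z)] for a standard Gaussian Z."""
    return np.sum(W_NODES * g(Z_NODES))

def variance_step(q, phi, sigma_b, sigma_w):
    """One application of the variance function F of equation (2)."""
    return sigma_b**2 + sigma_w**2 * expect(lambda z: phi(np.sqrt(q) * z)**2)

def covariance_step(qx, qxp, c, phi, sigma_b, sigma_w):
    """One application of the covariance recursion, using a 2-d quadrature grid."""
    Z1, Z2 = np.meshgrid(Z_NODES, Z_NODES, indexing="ij")
    W2 = np.outer(W_NODES, W_NODES)
    u1 = np.sqrt(qx) * Z1
    u2 = np.sqrt(qxp) * (c * Z1 + np.sqrt(max(1.0 - c**2, 0.0)) * Z2)
    return sigma_b**2 + sigma_w**2 * np.sum(W2 * phi(u1) * phi(u2))

phi, sigma_b, sigma_w = np.tanh, 0.5, 1.5        # placeholder hyperparameters
qx = qxp = 1.0
cov = 0.3 * np.sqrt(qx * qxp)                    # initial correlation of 0.3
for _ in range(50):
    c = cov / np.sqrt(qx * qxp)
    cov = covariance_step(qx, qxp, c, phi, sigma_b, sigma_w)
    qx, qxp = variance_step(qx, phi, sigma_b, sigma_w), variance_step(qxp, phi, sigma_b, sigma_w)
print("variance after 50 layers ~", round(qx, 4),
      " correlation after 50 layers ~", round(cov / np.sqrt(qx * qxp), 4))
```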

2.2 Limiting behaviour of the variance and covariance operators

We analyze here the limiting behaviour of $q^l_x$ and $c^l_{x,x'}$ as the network depth $l$ goes to infinity, under the assumption that $\phi$ has a second derivative at least in the distribution sense (ReLU admits a Dirac mass at 0 as second derivative and so is covered by our developments). From now onwards, we will also assume without loss of generality that $c^1_{x,x'} \ge 0$ (similar results can be obtained straightforwardly when $c^1_{x,x'} < 0$). We first need to define the Domains of Convergence associated with an activation function $\phi$.

Definition 1.

Let $\phi$ be an activation function and $(\sigma_b, \sigma_w) \in (\mathbb{R}^+)^2$.
(i) $(\sigma_b,\sigma_w)$ is in $D_{\phi,\mathrm{var}}$ (domain of convergence for the variance) if there exist $K > 0$ and $q \ge 0$ such that, for any input $x$ with $q^1_x \le K$, $\lim_{l\to\infty} q^l_x = q$. We denote by $K_{\phi,\mathrm{var}}(\sigma_b,\sigma_w)$ the maximal $K$ satisfying this condition.
(ii) $(\sigma_b,\sigma_w)$ is in $D_{\phi,\mathrm{corr}}$ (domain of convergence for the correlation) if there exists $K > 0$ such that, for any two inputs $x, x'$ with $q^1_x, q^1_{x'} \le K$, $\lim_{l\to\infty} c^l_{x,x'} = 1$. We denote by $K_{\phi,\mathrm{corr}}(\sigma_b,\sigma_w)$ the maximal $K$ satisfying this condition.

Remark: Typically, $q$ in Definition 1 is a fixed point of the variance function $F$ defined in equation 2. Therefore, it is easy to see that for any $(\sigma_b,\sigma_w)$ such that $F$ is increasing and admits at least one fixed point, we have $K_{\phi,\mathrm{var}}(\sigma_b,\sigma_w) \ge q$ where $q$ is the minimal fixed point; i.e. $q := \min\{x \ge 0 : F(x) = x\}$. Thus, if we re-scale the input data to have $q^1_x \le q$, the variance $q^l_x$ converges to $q$. We can also re-scale the variance of the first layer (only) so that $q^1_x = q$ for all inputs $x$.

The next result gives sufficient conditions for $(\sigma_b,\sigma_w)$ to be in the domains of convergence of $\phi$.

(a) ReLU network
(b) Tanh network
Figure 1: Two draws of outputs for ReLU and Tanh networks with $(\sigma_b,\sigma_w)$ in the domain of convergence. The output functions are almost constant.
Proposition 1.

Let $M_\phi := \sup_{x \ge 0}\, \mathbb{E}\big[\big|\phi'(\sqrt{x}\,Z)^2 + \phi(\sqrt{x}\,Z)\,\phi''(\sqrt{x}\,Z)\big|\big]$. Assume $M_\phi < \infty$; then, for $\sigma_w^2 < 1/M_\phi$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) = \infty$.

Let $M'_\phi := \sup_{x \ge 0}\, \mathbb{E}\big[\phi'(\sqrt{x}\,Z)^2\big]$. Assume $M'_\phi < \infty$; then, for $\sigma_w^2 < 1/M'_\phi$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{corr}}$ and $K_{\phi,\mathrm{corr}}(\sigma_b, \sigma_w) = \infty$.

The proof of Proposition 1 is straightforward. We prove that $F$ is a contraction mapping, i.e. $\sup_{x \ge 0}|F'(x)| < 1$, and then apply the Banach fixed point theorem; similar ideas are used for the correlation function.

Example: For the ReLU activation function, we have $M_\phi = M'_\phi = 1/2$, so that $(\sigma_b,\sigma_w) \in D_{\phi,\mathrm{var}} \cap D_{\phi,\mathrm{corr}}$ for any $\sigma_w < \sqrt{2}$ and any $\sigma_b$.

In the domain of convergence $D_{\phi,\mathrm{var}} \cap D_{\phi,\mathrm{corr}}$, for all inputs $x, x'$, $c^l_{x,x'} \to 1$, so that $y^l_i(x)$ and $y^l_i(x')$ become equal almost surely as $l$ grows and the outputs of the network are asymptotically constant functions. Figure 1 illustrates this behaviour for ReLU and Tanh networks with parameters $(\sigma_b,\sigma_w)$ in the domain of convergence. The draws of outputs of these networks are indeed almost constant.

To refine this convergence analysis, Schoenholz et al. (2017) established the existence of depth scales $\xi_q$ and $\xi_c$ such that $|q^l_x - q| \sim e^{-l/\xi_q}$ and $|c^l_{x,x'} - 1| \sim e^{-l/\xi_c}$ when fixed points exist. The quantities $\xi_q$ and $\xi_c$ are called 'depth scales' since they represent the depth to which the variance and correlation can propagate without being exponentially close to their limits. More precisely, if we write $\chi_1 := \sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{q}\, Z)^2\big]$, then the depth scales are given by $\xi_q = -\big[\log F'(q)\big]^{-1}$ and $\xi_c = -\big[\log \chi_1\big]^{-1}$. The equation $\chi_1 = 1$ corresponds to an infinite depth scale of the correlation. It is called the edge of chaos as it separates two phases: an ordered phase, where $\chi_1 < 1$ and the correlation converges to 1, and a chaotic phase, where $\chi_1 > 1$ and the correlations do not converge to 1. In this chaotic regime, it has been observed in Schoenholz et al. (2017) that the correlations converge to some limiting value $c < 1$ when $\phi(x) = \mathrm{Tanh}(x)$ and that this limit is independent of the correlation between the inputs. This means that very close inputs (in terms of correlation) lead to very different outputs. Therefore, in the chaotic phase, the output function of the neural network is discontinuous everywhere.
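As an illustration of the edge-of-chaos criterion, the sketch below (not from the paper; Tanh and the hyperparameter values are arbitrary) estimates $\chi_1$ at the variance fixed point and, in the ordered phase, the corresponding correlation depth scale $\xi_c = -1/\log\chi_1$.

```python
import numpy as np

T, W = np.polynomial.hermite.hermgauss(80)
Z_NODES, W_NODES = np.sqrt(2.0) * T, W / np.sqrt(np.pi)
expect = lambda g: np.sum(W_NODES * g(Z_NODES))       # E[g(Z)], Z ~ N(0, 1)

def limiting_variance(phi, sigma_b, sigma_w, q0=1.0, iters=500):
    q = q0
    for _ in range(iters):                            # iterate q <- F(q)
        q = sigma_b**2 + sigma_w**2 * expect(lambda z: phi(np.sqrt(q) * z)**2)
    return q

def chi1(phi, phi_prime, sigma_b, sigma_w):
    q = limiting_variance(phi, sigma_b, sigma_w)
    return sigma_w**2 * expect(lambda z: phi_prime(np.sqrt(q) * z)**2)

tanh_prime = lambda z: 1.0 / np.cosh(z)**2
for sigma_w in (1.0, 1.5, 2.0):
    x = chi1(np.tanh, tanh_prime, sigma_b=0.5, sigma_w=sigma_w)
    phase = "ordered" if x < 1.0 else "chaotic"
    extra = f", depth scale xi_c ~ {-1.0 / np.log(x):.1f}" if x < 1.0 else ""
    print(f"sigma_w = {sigma_w}: chi_1 = {x:.3f} ({phase}){extra}")
```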

Definition 2.

For $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$, let $q$ be the limiting variance (the limiting variance is a function of $(\sigma_b,\sigma_w)$ but we do not emphasize it notationally). The Edge of Chaos, hereafter EOC, is the set of values of $(\sigma_b, \sigma_w)$ satisfying $\chi_1 = \sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{q}\, Z)^2\big] = 1$.

To further study the EOC regime, the next lemma introduces a function $f$, called the 'correlation function', which simplifies the analysis of the correlations. It states that the correlations have the same asymptotic behaviour as the time-homogeneous dynamical system $c^{l+1} = f(c^l)$.

Lemma 1.

Let $(\sigma_b,\sigma_w) \in D_{\phi,\mathrm{var}}$ such that $q > 0$, and $\phi$ an activation function such that $\sup_{x \in K} \mathbb{E}\big[\phi(xZ)^2\big] < \infty$ for all compact sets $K$. Define $f_l$ by $c^{l+1}_{x,x'} = f_l(c^l_{x,x'})$ and $f$ by
$$f(c) = \frac{\sigma_b^2 + \sigma_w^2\,\mathbb{E}\big[\phi(\sqrt{q}\,Z_1)\,\phi\big(\sqrt{q}\,(c\,Z_1 + \sqrt{1-c^2}\,Z_2)\big)\big]}{q}.$$
Then $\lim_{l\to\infty}\, \sup_{c \in [0,1]} |f_l(c) - f(c)| = 0$.

The condition on $\phi$ in Lemma 1 is violated only by activation functions with exponential growth (which are not used in practice), so from now onwards we use the approximation $c^{l+1}_{x,x'} = f(c^l_{x,x'})$ in our analysis. Note that being on the EOC is equivalent to $(\sigma_b,\sigma_w)$ satisfying $f'(1) = 1$. In the next section, we analyze this phase transition carefully for a large class of activation functions.
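The sketch below (an illustration, not the authors' code) evaluates the correlation function $f$ by quadrature for ReLU on its EOC and checks with a one-sided finite difference that $f(1) = 1$ and $f'(1) \approx 1$, i.e. that the EOC condition can be read off $f$.

```python
import numpy as np

T, W = np.polynomial.hermite.hermgauss(80)
Z_NODES, W_NODES = np.sqrt(2.0) * T, W / np.sqrt(np.pi)
Z1, Z2 = np.meshgrid(Z_NODES, Z_NODES, indexing="ij")
W2 = np.outer(W_NODES, W_NODES)

def corr_fn(c, q, phi, sigma_b, sigma_w):
    """Correlation function f(c) of Lemma 1, evaluated at the limiting variance q."""
    u1 = np.sqrt(q) * Z1
    u2 = np.sqrt(q) * (c * Z1 + np.sqrt(max(1.0 - c**2, 0.0)) * Z2)
    return (sigma_b**2 + sigma_w**2 * np.sum(W2 * phi(u1) * phi(u2))) / q

relu = lambda z: np.maximum(z, 0.0)
q, sigma_b, sigma_w = 1.0, 0.0, np.sqrt(2.0)      # ReLU EOC; any q > 0 works here
eps = 1e-4
f_at_1 = corr_fn(1.0, q, relu, sigma_b, sigma_w)
slope = (f_at_1 - corr_fn(1.0 - eps, q, relu, sigma_b, sigma_w)) / eps
print("f(1) ~", round(f_at_1, 3), "  f'(1) ~", round(slope, 3))   # both close to 1 on the EOC
```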

Figure 2: A draw from the output function of a ReLU network with 20 layers, 100 neurons per layer, $(\sigma_b,\sigma_w) = (0, \sqrt{2})$ (edge of chaos)

3 Edge of Chaos

To illustrate the effect of the initialization on the EOC, we plot in Figure 2 the output of a ReLU neural network with 20 layers and 100 neurons per layer with parameters $(\sigma_b,\sigma_w) = (0, \sqrt{2})$ (as we will see later, this is the EOC for ReLU). Unlike the output in Figure 1, this output displays much more variability. However, we will prove here that the correlations still converge to 1 even in the EOC regime, albeit at a slower rate.

3.1 ReLU-like activation functions

We consider activation functions of the form $\phi(x) = \lambda x$ if $x > 0$ and $\phi(x) = \mu x$ if $x \le 0$. ReLU corresponds to $\lambda = 1$ and $\mu = 0$. For this class of activation functions, we see (Proposition 2) that the variance is unchanged ($q^l_x = q^1_x$) on the EOC, so that $q$ does not formally exist in the sense that the limit of $q^l_x$ depends on $x$. However, this does not impact the analysis of the correlations.

Proposition 2.

Let $\phi$ be a ReLU-like function with $\lambda$ and $\mu$ defined above. Then, for any $\sigma_w < \sqrt{2/(\lambda^2+\mu^2)}$ and any $\sigma_b$, we have $(\sigma_b,\sigma_w) \in D_{\phi,\mathrm{var}}$ with $q = \frac{\sigma_b^2}{1 - \sigma_w^2(\lambda^2+\mu^2)/2}$. Moreover $\mathrm{EOC} = \big\{\big(0, \sqrt{2/(\lambda^2+\mu^2)}\big)\big\}$ and, on the EOC, $F(x) = x$ for any $x \ge 0$.

This class of activation functions has the interesting property of preserving the variance across layers when the network is initialized on the EOC. However, we show in Proposition 3 below that, even in the EOC regime, the correlations converge to 1 but at a slower rate. We only present the result for ReLU but the generalization to the whole class is straightforward.

(a) Convergence of the correlations to 1 with depth
(b) Correlation function $f$
Figure 3: Impact of the initialization on the EOC for a ReLU network

Example: ReLU. The EOC is reduced to the singleton $(\sigma_b, \sigma_w) = (0, \sqrt{2})$, which means we should initialize a ReLU network with these parameters, i.e. $W^l_{ij} \sim \mathcal{N}(0, 2/N_{l-1})$ and $B^l_i = 0$. This result coincides with the recommendation of He et al. (2015), whose objective was to make the variance constant as the input propagates through the layers, but who did not analyze the propagation of the correlations. Klambauer et al. (2017) also performed a similar analysis using the "Scaled Exponential Linear Unit" (SELU) activation, which makes it possible to center the mean and normalize the variance of the post-activations. The propagation of the correlation was not discussed therein either. In the next result, we present the correlation function corresponding to ReLU networks. This result was first obtained in Cho & Saul (2009). We present an alternative derivation of this result and further show that the correlations converge to 1 at a polynomial rate of $1/l^2$ instead of an exponential rate.

Proposition 3 (ReLU kernel).

Consider a ReLU network with parameters $(\sigma_b,\sigma_w) = (0, \sqrt{2})$ on the EOC. We have
(i) for any $c \in [-1, 1]$, $f(c) = \frac{1}{\pi}\sqrt{1-c^2} + \frac{c}{\pi}\arcsin(c) + \frac{c}{2}$;
(ii) for any inputs $x, x'$, $c^l_{x,x'} \to 1$, and $1 - c^l_{x,x'} \sim \frac{9\pi^2}{2 l^2}$ as $l \to \infty$.

Figure 3 displays the correlation function $f$ for two different sets of parameters $(\sigma_b,\sigma_w)$. The red graph corresponds to the EOC $(\sigma_b,\sigma_w) = (0,\sqrt{2})$, and the blue one to a choice of parameters in the ordered phase. In unreported experiments, we observed numerical convergence of $c^l_{x,x'}$ towards 1 for parameters on the EOC. As the variance is preserved by the network ($q^l_x = q^1_x$) and the correlations converge to 1 as $l$ increases, the output function is asymptotically of the form $x \mapsto \alpha \sqrt{q^1_x}$ for a constant $\alpha$ (notice that in Figure 2, we start observing this effect already at depth 20).
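The polynomial rate of Proposition 3 can be checked by iterating the closed-form ReLU correlation map directly; in the sketch below (not from the paper), the quantity $l^2\,(1 - c^l)$ stabilizes near $9\pi^2/2 \approx 44.4$ rather than vanishing or exploding.

```python
import numpy as np

def f_relu(c):
    """ReLU correlation function on the EOC, (sigma_b, sigma_w) = (0, sqrt(2))."""
    c = min(max(c, -1.0), 1.0)
    return (np.sqrt(1.0 - c**2) + c * np.arcsin(c)) / np.pi + c / 2.0

c = 0.1                                     # initial correlation between two inputs
for l in range(1, 10001):
    c = f_relu(c)
    if l in (10, 100, 1000, 10000):
        print(f"l = {l:5d}   1 - c^l = {1.0 - c:.3e}   l^2 * (1 - c^l) = {l**2 * (1.0 - c):.2f}")
```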

3.2 A better class of activation functions

We now introduce a set of sufficient conditions on the activation function which ensure that it is possible to tune $(\sigma_b, \sigma_w)$ to slow the convergence of the correlations to 1. This is achieved by making the correlation function $f$ sufficiently close to the identity function.

Proposition 4 (Main Result).

Let $\phi$ be an activation function. Assume that:
(i) $\phi(0) = 0$, $\phi$ admits right and left derivatives at zero, at least one of them being non-zero, and there exists $\beta > 0$ such that $\sup_x |\phi'(x)| \le \beta$;
(ii) there exists $A > 0$ such that, for any $\sigma_b \in [0, A]$, there exists $\sigma_w > 0$ with $(\sigma_b, \sigma_w) \in \mathrm{EOC}$;
(iii) for any $\sigma_b \in [0, A]$, the variance function $F$ with parameters $(\sigma_b, \sigma_w)$ on the EOC is non-decreasing and $\lim_{\sigma_b \to 0} q = 0$, where $q$ is the minimal fixed point of $F$, $q := \min\{x \ge 0 : F(x) = x\}$;
(iv) for any $\sigma_b \in [0, A]$, the correlation function $f$ with parameters $(\sigma_b, \sigma_w)$ introduced in Lemma 1 is convex.

Then, for any $\sigma_b \in (0, A]$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}} \cap D_{\phi,\mathrm{corr}}$, and $\lim_{\sigma_b \to 0}\, \sup_{c \in [0,1]} |f(c) - c| = 0$.

(a) The correlation function $f$ of a Swish network for different values of $\sigma_b$ on the EOC
(b) A draw from the output function of a Swish network with depth 30 and width 100 on the edge of chaos
Figure 4: Correlation function and a draw of the output for a Swish network

Note that ReLU does not satisfy condition (ii) since the EOC in this case is reduced to the singleton $(0, \sqrt{2})$. The result of Proposition 4 states that we can make $f$ close to the identity function by considering a sufficiently small $\sigma_b$. However, this is under condition (iii), which states that $\lim_{\sigma_b \to 0} q = 0$: as $\sigma_b$ becomes small, so does the limiting variance. Therefore, practically, we cannot take $\sigma_b$ too small. One might wonder whether condition (iii) is necessary for this result to hold. The next lemma shows that removing this condition results in a useless class of activation functions.

Lemma 2.

Under the conditions of Proposition 4, the only change being that condition (iii) is replaced by $\lim_{\sigma_b \to 0} q > 0$, the result of Proposition 4 holds if and only if the activation function is linear.

The next proposition gives sufficient conditions for bounded activation functions to satisfy all the conditions of Proposition 4.

Proposition 5.

Let $\phi$ be a bounded function such that $\phi(0) = 0$, $\phi'(0) > 0$, $\phi'(x) \ge 0$ for all $x$, $\phi(-x) = -\phi(x)$, and $\phi''(x) \le 0$ for $x \ge 0$, and such that $\phi$ satisfies (ii) in Proposition 4. Then $\phi$ satisfies all the conditions of Proposition 4.

The conditions in Proposition 5 are easy to verify and are, for example, satisfied by Tanh and Arctan. We can also replace the assumption "$\phi$ satisfies (ii) in Proposition 4" by a sufficient condition (see Proposition 7 in the Supplementary Material). Tanh-like activation functions provide better information flow in deep networks compared to ReLU-like functions. However, these functions suffer from the vanishing gradient problem during back-propagation; see, e.g., Pascanu et al. (2013) and Kolen & Kremer (2001). Thus, an activation function that satisfies the conditions of Proposition 4 (in order to have a good 'information flow') and does not suffer from the vanishing gradient issue is expected to perform better than ReLU. Swish is a good candidate.

Proposition 6.

The Swish activation function satisfies all the conditions of Proposition 4.

It is clear that Swish does not suffer from the vanishing gradient problem as, like ReLU, it has a gradient close to 1 for large positive inputs. Figure 4(a) displays $f$ for Swish for different values of $\sigma_b$. We see that $f$ indeed approaches the identity function when $\sigma_b$ is small, slowing down the convergence of the correlations to 1. Figure 4(b) displays a draw of the output of a neural network of depth 30 and width 100 with Swish activation, initialized on the EOC. The output displays much more variability than that of a ReLU network with the same architecture. We present in Table 1 some values of $(\sigma_b,\sigma_w)$ on the EOC as well as the corresponding limiting variance $q$ for Swish. In agreement with condition (iii) of Proposition 4, the limiting variance decreases as $\sigma_b$ decreases.
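As a numerical illustration (a sketch under the same quadrature approximation as before; it is not the code used to produce Table 1), EOC points for Swish can be obtained by solving jointly the fixed-point equation $q = F(q)$ and the EOC equation $\sigma_w^2\,\mathbb{E}[\phi'(\sqrt{q}\,Z)^2] = 1$ for a given $\sigma_b$:

```python
import numpy as np

T, W = np.polynomial.hermite.hermgauss(100)
Z_NODES, W_NODES = np.sqrt(2.0) * T, W / np.sqrt(np.pi)
expect = lambda g: np.sum(W_NODES * g(Z_NODES))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
swish = lambda z: z * sigmoid(z)
swish_prime = lambda z: sigmoid(z) + z * sigmoid(z) * (1.0 - sigmoid(z))

def swish_eoc(sigma_b, iters=5000):
    """Damped fixed-point iteration for (sigma_w, q) on the Swish EOC at a given sigma_b."""
    q, sigma_w = 1.0, 1.5
    for _ in range(iters):
        sigma_w = 1.0 / np.sqrt(expect(lambda z: swish_prime(np.sqrt(q) * z)**2))     # chi_1 = 1
        q_new = sigma_b**2 + sigma_w**2 * expect(lambda z: swish(np.sqrt(q) * z)**2)  # F(q)
        q = 0.5 * q + 0.5 * q_new                                                     # damping
    return sigma_w, q

for sigma_b in (0.1, 0.3, 0.5):
    sigma_w, q = swish_eoc(sigma_b)
    print(f"sigma_b = {sigma_b}: sigma_w on the EOC ~ {sigma_w:.3f}, limiting variance q ~ {q:.3f}")
```

In this sketch, the computed $q$ shrinks together with $\sigma_b$, in line with condition (iii).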

Table 1: Values of $(\sigma_b, \sigma_w)$ on the EOC and the corresponding limiting variance $q$ for Swish

Other activation functions that have been shown to empirically outperform ReLU, such as ELU (Clevert et al. (2016)), SELU (Klambauer et al. (2017)) and Softplus, also satisfy the conditions of Proposition 4 (see the appendix for ELU). The comparison of activation functions satisfying the conditions of Proposition 4 remains an open question.

4 Experimental Results

We demonstrate our results empirically on the MNIST dataset. In all the figures below, we compare the learning speed (test accuracy with respect to the number of epochs/iterations) for different activation functions and initialization parameters. We use the Adam optimizer with a fixed learning rate. The Python code to reproduce all the experiments will be made available on-line.

Initialization on the Edge of Chaos. We initialize the deep network randomly by sampling $W^l_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_w^2/N_{l-1})$ and $B^l_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_b^2)$. In Figure 5, we compare the learning speed of a Swish network for different choices of random initialization. As the depth increases, any initialization other than on the edge of chaos results in the optimization algorithm eventually getting stuck at a very poor test accuracy of roughly 10% (equivalent to selecting the output uniformly at random). To understand what is happening in this case, let us recall how the optimization algorithm works. Let $(x_i, y_i)_{1 \le i \le N}$ be the MNIST dataset. The loss we optimize is $L = \frac{1}{N}\sum_{i=1}^N \ell\big(\hat{y}(x_i), y_i\big)$, where $\hat{y}(x_i)$ is the output of the network and $\ell$ is the categorical cross-entropy loss. In the ordered phase, we know that the output converges exponentially to a fixed value (the same value for all inputs $x_i$); thus a small change in the weights $W$ and biases $B$ will not change significantly the value of the loss function. Therefore the gradient is approximately zero and the gradient descent algorithm will be stuck around the initial value.
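Concretely, initializing on the EOC only requires sampling each weight with standard deviation $\sigma_w/\sqrt{\text{fan-in}}$ and each bias with standard deviation $\sigma_b$. The experimental framework is not specified in the paper beyond Python; the sketch below shows one possible PyTorch implementation, with the $(\sigma_b, \sigma_w)$ pair being a placeholder to be taken from Table 1.

```python
import math
import torch
import torch.nn as nn

def swish(x):
    return x * torch.sigmoid(x)

def make_mlp(n_in=784, width=40, depth=30, n_out=10):
    """Fully connected network: `depth` hidden layers of size `width`."""
    sizes = [n_in] + [width] * depth + [n_out]
    return nn.ModuleList(nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

def init_on_eoc(layers, sigma_b, sigma_w):
    """Sample W ~ N(0, sigma_w^2 / fan_in) and b ~ N(0, sigma_b^2) for every layer."""
    for m in layers:
        nn.init.normal_(m.weight, mean=0.0, std=sigma_w / math.sqrt(m.in_features))
        nn.init.normal_(m.bias, mean=0.0, std=sigma_b)

def forward(layers, x):
    for m in layers[:-1]:
        x = swish(m(x))
    return layers[-1](x)          # logits fed to the categorical cross-entropy loss

layers = make_mlp()
init_on_eoc(layers, sigma_b=0.2, sigma_w=1.8)   # placeholder EOC pair for Swish
```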

Figure 5: Impact of the initialization on the edge of chaos for a Swish network (two width/depth configurations)

ReLU versus Tanh. We proved in Section 3.2 that the Tanh activation guarantees better information propagation through the network when initialized on the EOC. However, Tanh suffers from the vanishing gradient problem during back-propagation.

Figure 6: Comparison of ReLU and Tanh learning curves for different widths and depths

Consequently, we expect Tanh to perform better than ReLU for shallow networks, where the vanishing gradient problem is not an issue, and worse for deep networks. Numerical results confirm this. Figure 6 shows curves of validation accuracy with 90% confidence intervals (30 simulations). For depth 5, the learning algorithm converges faster for Tanh than for ReLU. However, for deeper networks, Tanh gets stuck at a very low test accuracy; this is due to the fact that many parameters remain essentially unchanged because the gradient is very small.

ReLU versus Swish. As established in Section 3.2, Swish, like Tanh, propagates the information better than ReLU and, contrary to Tanh, it does not suffer from the vanishing gradient problem. Hence our results suggest that Swish should perform better than ReLU, especially for deep architectures. Numerical results confirm this. Figure 7 shows curves of validation accuracy with 90% confidence intervals (30 simulations). Swish clearly performs better than ReLU, especially for depth 40. A comparative study of final accuracies is shown in Table 2. We observe a clear advantage for Swish, especially for large depths. Additional simulation results on diverse datasets demonstrating the better performance of Swish over many other activation functions can be found in Ramachandran et al. (2017) (these authors have also implemented Swish in TensorFlow).

Figure 7: Convergence across iterations of the learning algorithm for ReLU and Swish networks (two width/depth configurations)
(width, depth)   (10, 5)   (20, 10)   (40, 30)   (60, 40)
ReLU             94.01     96.01      96.51      91.45
Swish            94.46     96.34      97.09      97.14
Table 2: Accuracy (%) on the test set for different values of (width, depth)

5 Conclusion and Discussion

We have complemented here the analysis of Schoenholz et al. (2017), which shows that initializing networks on the EOC provides a better propagation of information across layers. In the ReLU case, such an initialization corresponds to the popular approach proposed in He et al. (2015). However, even on the EOC, the correlations still converge to 1, albeit at a polynomial rate, for ReLU networks. We have obtained a set of sufficient conditions for activation functions which further improve information propagation when the parameters $(\sigma_b, \sigma_w)$ are on the EOC. The Tanh activation satisfies those conditions but, more interestingly, other functions which do not suffer from the vanishing/exploding gradient problems also verify them. This includes the Swish function used in Hendrycks & Gimpel (2016) and Elfwing et al. (2017) and promoted in Ramachandran et al. (2017), but also ELU (Clevert et al. (2016)).

Our results also have interesting implications for Bayesian neural networks, which have received renewed attention lately; see, e.g., Hernandez-Lobato & Adams (2015) and Lee et al. (2018). They show that, if one assigns i.i.d. Gaussian prior distributions to the weights and biases, the resulting prior distribution will be concentrated on close-to-constant functions even on the EOC for ReLU-like activation functions. To obtain much richer priors, our results indicate that we need to select not only parameters $(\sigma_b,\sigma_w)$ on the EOC but also an activation function satisfying the conditions of Proposition 4.

References

  • Cho & Saul (2009) Y. Cho and L.K. Saul. Kernel methods for deep learning. Advances in Neural Information Processing Systems, 22:342–350, 2009.
  • Clevert et al. (2016) D.A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). International Conference on Learning Representations, 2016.
  • Elfwing et al. (2017) S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv:1702.03118, 2017.
  • He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. ICCV, 2015.
  • Hendrycks & Gimpel (2016) D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv:1606.08415, 2016.
  • Hernandez-Lobato & Adams (2015) J.M. Hernandez-Lobato and R.P. Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. ICML, 2015.
  • Klambauer et al. (2017) G. Klambauer, T. Unterthiner, and A. Mayr. Self-normalizing neural networks. Advances in Neural Information Processing Systems, 30, 2017.
  • Kolen & Kremer (2001) J.F. Kolen and S.C. Kremer. Gradient flow in recurrent nets: The difficulty of learning longterm dependencies. A Field Guide to Dynamical Recurrent Network, Wiley-IEEE Press, pp. 464–, 2001.
  • LeCun et al. (1998) Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. Neural Networks: Tricks of the trade, Springer, 1998.
  • Lee et al. (2018) J. Lee, Y. Bahri, R. Novak, S.S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. 6th International Conference on Learning Representations, 2018.
  • Matthews et al. (2018) A.G. Matthews, J. Hron, M. Rowland, R.E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. 6th International Conference on Learning Representations, 2018.
  • Montufar et al. (2014) G.F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 27:2924–2932, 2014.
  • Neal (1995) R.M. Neal. Bayesian learning for neural networks. Springer Science & Business Media, 118, 1995.
  • Pascanu et al. (2013) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning, 28:1310–1318, 2013.
  • Poole et al. (2016) B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. 30th Conference on Neural Information Processing Systems, 2016.
  • Ramachandran et al. (2017) P. Ramachandran, B. Zoph, and Q.V. Le. Searching for activation functions. arXiv e-print 1710.05941, 2017.
  • Schoenholz et al. (2017) S.S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. 5th International Conference on Learning Representations, 2017.

Appendix A Proofs

We provide in the supplementary material the proofs of the propositions presented in the main document, and we give additional theoretical and experimental results. For the sake of clarity, we recall the propositions before giving their proofs.

A.1 Convergence to the fixed point: Proposition 1

Proposition 1.

Let $M_\phi := \sup_{x \ge 0}\, \mathbb{E}\big[\big|\phi'(\sqrt{x}\,Z)^2 + \phi(\sqrt{x}\,Z)\,\phi''(\sqrt{x}\,Z)\big|\big]$. Suppose $M_\phi < \infty$; then, for $\sigma_w^2 < 1/M_\phi$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) = \infty$.

Moreover, let $M'_\phi := \sup_{x \ge 0}\, \mathbb{E}\big[\phi'(\sqrt{x}\,Z)^2\big]$. Suppose $M'_\phi < \infty$; then, for $\sigma_w^2 < 1/M'_\phi$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{corr}}$ and $K_{\phi,\mathrm{corr}}(\sigma_b, \sigma_w) = \infty$.

Proof.

To abbreviate the notation, we write $q^l := q^l_x$ for some fixed input $x$.

Convergence of the variances: We first consider the asymptotic behaviour of $q^l$. Recall that $q^l = F(q^{l-1})$, where
$$F(x) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{x}\,Z)^2\big].$$
The first derivative of this function is given by:
$$F'(x) = \sigma_w^2\, \mathbb{E}\Big[\frac{Z}{\sqrt{x}}\,\phi(\sqrt{x}\,Z)\,\phi'(\sqrt{x}\,Z)\Big] = \sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{x}\,Z)^2 + \phi(\sqrt{x}\,Z)\,\phi''(\sqrt{x}\,Z)\big], \tag{3}$$
where we used Gaussian integration by parts, $\mathbb{E}[Z\,g(Z)] = \mathbb{E}[g'(Z)]$, an identity satisfied by any function $g$ such that $\mathbb{E}[|g'(Z)|] < \infty$.

Using the condition on $\phi$, we see that for $\sigma_w^2 < 1/M_\phi$ the function $F$ is a contraction mapping, and the Banach fixed-point theorem guarantees the existence of a unique fixed point $q$ of $F$, with $\lim_{l\to\infty} q^l = q$. Note that this fixed point depends only on $F$; therefore, this is true for any input $x$, and $K_{\phi,\mathrm{var}}(\sigma_b,\sigma_w) = \infty$.

Convergence of the covariances: Since $\lim_{l\to\infty} q^l_x = q$ for every input $x$, for all $\epsilon > 0$ there exists $l_0$ such that, for all $l \ge l_0$, $|q^l_x - q| \le \epsilon$. Let $l \ge l_0$. Using Gaussian integration by parts, we have
$$\frac{\partial q^{l+1}_{x,x'}}{\partial q^{l}_{x,x'}} = \sigma_w^2\, \mathbb{E}\Big[\phi'\big(\sqrt{q^{l}_x}\,Z_1\big)\,\phi'\Big(\sqrt{q^{l}_{x'}}\,\big(c^{l}_{x,x'}\,Z_1 + \sqrt{1-(c^{l}_{x,x'})^2}\,Z_2\big)\Big)\Big].$$
We cannot use the Banach fixed point theorem directly because the integrated function here depends on $l$ through $q^l_x$ and $q^l_{x'}$. For ease of notation, we write $\gamma^l := q^l_{x,x'}$; by the Cauchy–Schwarz inequality, the above derivative is bounded in absolute value by $\sigma_w^2 M'_\phi < 1$ up to a vanishing correction. Therefore, for $\sigma_w^2 < 1/M'_\phi$, $(\gamma^l)_l$ is a Cauchy sequence and it converges to a limit $\gamma$. At the limit,
$$\gamma = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\Big[\phi(\sqrt{q}\,Z_1)\,\phi\Big(\sqrt{q}\,\big(c\,Z_1 + \sqrt{1-c^2}\,Z_2\big)\Big)\Big] = q\, f(c), \qquad c := \gamma/q.$$
The derivative of this function with respect to $c$ is given by
$$f'(c) = \sigma_w^2\, \mathbb{E}\Big[\phi'(\sqrt{q}\,Z_1)\,\phi'\Big(\sqrt{q}\,\big(c\,Z_1 + \sqrt{1-c^2}\,Z_2\big)\Big)\Big].$$
By assumption on $\phi$ and the choice of $\sigma_w$, we have $\sup_{c \in [0,1]}|f'(c)| \le \sigma_w^2 M'_\phi < 1$, so that $f$ is a contraction and has a unique fixed point. Since $f(1) = 1$, this fixed point is 1 and $\lim_{l\to\infty} c^l_{x,x'} = 1$. The above result is true for any $x, x'$; therefore, $K_{\phi,\mathrm{corr}}(\sigma_b,\sigma_w) = \infty$. ∎

As an illustration, we plot in Figure 8 the variance $q^l_x$ for three different inputs as a function of the layer $l$. In this example, the convergence for Tanh is faster than that for ReLU.

(a) ReLU
(b) Tanh
Figure 8: Convergence of the variance $q^l_x$ for three different inputs
Lemma 1.

Let $(\sigma_b,\sigma_w) \in D_{\phi,\mathrm{var}}$ such that $q > 0$, and $\phi$ an activation function such that $\sup_{x \in K} \mathbb{E}\big[\phi(xZ)^2\big] < \infty$ for all compact sets $K$. Define $f_l$ by $c^{l+1}_{x,x'} = f_l(c^l_{x,x'})$ and $f$ by
$$f(c) = \frac{\sigma_b^2 + \sigma_w^2\,\mathbb{E}\big[\phi(\sqrt{q}\,Z_1)\,\phi\big(\sqrt{q}\,(c\,Z_1 + \sqrt{1-c^2}\,Z_2)\big)\big]}{q}.$$
Then $\lim_{l\to\infty}\, \sup_{c \in [0,1]} |f_l(c) - f(c)| = 0$.

Proof.

For $c \in [0,1]$, we decompose $f_l(c) - f(c)$ into two terms: the first accounts for replacing the variances $q^l_x, q^l_{x'}$ by their limit $q$ inside the Gaussian expectation, and the second for the corresponding change of normalization. The first term goes to zero uniformly in $c$ using the condition on $\phi$ and the Cauchy–Schwarz inequality. As for the second term, again using Cauchy–Schwarz and the condition on $\phi$, both of its components can be controlled uniformly in $c$ by an integrable upper bound. We conclude using the dominated convergence theorem. ∎

A.2 Results for ReLU-like activation functions: proofs of Propositions 2 and 3

Proposition 2.

Let $\phi$ be a ReLU-like function with $\lambda$ and $\mu$ defined above. Then, for any $\sigma_w < \sqrt{2/(\lambda^2+\mu^2)}$ and any $\sigma_b$, we have $(\sigma_b,\sigma_w) \in D_{\phi,\mathrm{var}}$ with $q = \frac{\sigma_b^2}{1 - \sigma_w^2(\lambda^2+\mu^2)/2}$. Moreover $\mathrm{EOC} = \big\{\big(0, \sqrt{2/(\lambda^2+\mu^2)}\big)\big\}$ and, on the EOC, $F(x) = x$ for any $x \ge 0$.

Proof.

We write $q^l := q^l_x$ throughout the proof. Note first that the variance satisfies the recursion:
$$q^l = \sigma_b^2 + \sigma_w^2\, \frac{\lambda^2 + \mu^2}{2}\, q^{l-1}, \tag{4}$$
since $\mathbb{E}\big[\phi(\sqrt{s}\,Z)^2\big] = \big(\lambda^2\,\mathbb{E}[Z^2 \mathbb{1}_{Z>0}] + \mu^2\,\mathbb{E}[Z^2 \mathbb{1}_{Z\le 0}]\big)\, s = \frac{\lambda^2+\mu^2}{2}\, s$.

For all $\sigma_w < \sqrt{2/(\lambda^2+\mu^2)}$, $q = \frac{\sigma_b^2}{1 - \sigma_w^2(\lambda^2+\mu^2)/2}$ is a fixed point of this affine contraction. This is true for any input, therefore $K_{\phi,\mathrm{var}}(\sigma_b,\sigma_w) = \infty$ and the first claim is proved.

Now, the EOC equation is given by $\sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{q}\,Z)^2\big] = \sigma_w^2\, \frac{\lambda^2+\mu^2}{2} = 1$. Therefore, $\sigma_w = \sqrt{2/(\lambda^2+\mu^2)}$. Replacing $\sigma_w$ by its critical value in equation 4 yields
$$q^l = \sigma_b^2 + q^{l-1}.$$
Thus $(q^l)_l$ converges if and only if $\sigma_b = 0$, otherwise it diverges to infinity. So the frontier is reduced to the single point $(\sigma_b,\sigma_w) = \big(0, \sqrt{2/(\lambda^2+\mu^2)}\big)$ and, on the EOC, the variance $q^l_x = q^1_x$ does not depend on $l$.
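As a quick numerical sanity check of equation 4 (not part of the original proof), the second moment of the post-activation can be estimated by Monte Carlo and compared with the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu = 1.0, 0.0                         # ReLU corresponds to lambda = 1, mu = 0
sigma_b, sigma_w = 0.3, 1.2                # arbitrary parameters below the critical value
phi = lambda z: np.where(z > 0, lam * z, mu * z)

q_prev = 1.7                               # arbitrary previous-layer variance
z = rng.normal(size=2_000_000)
monte_carlo = sigma_b**2 + sigma_w**2 * np.mean(phi(np.sqrt(q_prev) * z)**2)
closed_form = sigma_b**2 + sigma_w**2 * (lam**2 + mu**2) / 2.0 * q_prev
print(monte_carlo, closed_form)            # the two values should closely agree
```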

Proposition 3 (ReLU kernel).

Consider a ReLU network with parameters $(\sigma_b,\sigma_w) = (0, \sqrt{2})$ on the EOC. We have
(i) for any $c \in [-1, 1]$, $f(c) = \frac{1}{\pi}\sqrt{1-c^2} + \frac{c}{\pi}\arcsin(c) + \frac{c}{2}$;
(ii) for any inputs $x, x'$, $c^l_{x,x'} \to 1$, and $1 - c^l_{x,x'} \sim \frac{9\pi^2}{2 l^2}$ as $l \to \infty$.

Proof.

In this case the correlation function is given by $f(c) = 2\, g(c)$, where $g(c) := \mathbb{E}\big[\mathrm{ReLU}(Z_1)\,\mathrm{ReLU}\big(c\,Z_1 + \sqrt{1-c^2}\,Z_2\big)\big]$ (recall that $\sigma_b = 0$ and $\sigma_w^2 = 2$ on the EOC, and that $f$ does not depend on $q$ by homogeneity of ReLU).

  • Let $c \in [-1, 1]$; note that $g$ is differentiable and satisfies
    $$g'(c) = \mathbb{E}\big[\mathbb{1}_{Z_1 > 0}\,\mathbb{1}_{c Z_1 + \sqrt{1-c^2}\,Z_2 > 0}\big] = \frac{\pi - \arccos(c)}{2\pi},$$
    which is also differentiable. Simple algebra leads to
    $$g''(c) = \frac{1}{2\pi\sqrt{1-c^2}}.$$
    Since $g'(0) = \frac{1}{4}$ and $g(0) = \mathbb{E}[\mathrm{ReLU}(Z)]^2 = \frac{1}{2\pi}$, integrating twice gives $g(c) = \frac{1}{2\pi}\big(\sqrt{1-c^2} + c\,\arcsin(c)\big) + \frac{c}{4}$. We conclude using the fact that $\int_0^c \arcsin(t)\,dt = c\,\arcsin(c) + \sqrt{1-c^2} - 1$ and $f(c) = 2\, g(c)$.

  • We first derive a Taylor expansion of $f$ near 1. Consider the change of variable $c = 1 - \epsilon$ with $\epsilon$ close to 0; then
    $$f(1 - \epsilon) = 1 - \epsilon + \frac{2\sqrt{2}}{3\pi}\,\epsilon^{3/2} + O(\epsilon^2),$$
    so that $\epsilon_l := 1 - c^l$ satisfies the recursion $\epsilon_{l+1} = \epsilon_l - \frac{2\sqrt{2}}{3\pi}\,\epsilon_l^{3/2} + O(\epsilon_l^2)$, and a standard argument for such recursions yields $\epsilon_l \sim \frac{9\pi^2}{2 l^2}$ as $l \to \infty$.