On the Selection of Initialization and Activation Function for Deep Neural Networks
The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the learning procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and the exponential vanishing/exploding of gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully as recently demonstrated by Schoenholz et al. (2017) who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the `edge of chaos' can lead to good performance. We complete these recent results by providing quantitative results showing that, for a class of ReLU-like activation functions, the information propagates indeed deeper when the network is initialized at the edge of chaos. By extending our analysis to a larger class of functions, we then identify an activation function, ϕ_new(x) = x ·sigmoid(x), which improves the information propagation over ReLU-like functions and does not suffer from the vanishing gradient problem. We demonstrate empirically that this activation function combined to a random initialization on the edge of chaos outperforms standard approaches. This complements recent independent work by Ramachandran et al. (2017) who have observed empirically in extensive simulations that this activation function performs better than many alternatives.
READ FULL TEXT