DNNs usually require nonlinear activation functions. Rectified Linear Units (ReLU) ref:RELU are widely used in state-of-the-art DNNs because of their simplicity. Ioffe and Szegedy ref:Batch Normalization proposed the Batch Normalization (BN) technique, which improves the accuracy compared to the original CNNs without BN. Recently, Klambauer et al. ref:SNN introduced the Scaled Exponential Linear Unit (SELU) and demonstrated that, for a specific choice of parameters, it has self-normalizing properties, because it leads to zero mean and unit variance. Another important property of SELU functions, which also applies to Exponential Linear Units (ELU) ref:ELU , is that the gradient is nonzero for negative inputs, so the corresponding weights continue to be updated. This effect gives ELU and SELU a wider learning ability than ReLU, which leads to an accuracy improvement.
A key scientific question is: Can we replace the ReLU layers of a given DNN by ELU or SELU layers to improve the DNN accuracy? If yes, how can we select which of the ReLU layers should be replaced by ELU or SELU layers, without significantly affecting the total training time?
Our novel contributions: To address the above question, we propose a novel methodology that automatically explores different activation functions in each layer of the DNN and converges to the model that leads to further accuracy improvements. Through this methodology, we obtain a Hybrid DNN that can have ReLU, ELU or SELU activations in different layers, interchangeably. Since SELU functions require “alpha” dropout, we introduce the possibility of having dropout at each layer of the DNN, using standard dropout for ReLU and ELU and “alpha” dropout for SELU, to obtain a fair comparison. Our methodology allows the dropout rate to be fine-tuned to achieve a higher accuracy than the original DNN. Instead of replacing the activation functions in all the layers of the network, our methodology adds a degree of freedom to the DNN architecture, because different types of activation functions can be selected in different layers. Since exhaustively exploring all the configurations of activation functions is extremely compute-intensive and may not be feasible in practice, our methodology evaluates different activation functions for each layer at an intermediate step of the training process, called the “Evaluation Point”. Such an Evaluation Point (EP) is automatically selected after a gradient-based analysis of the accuracy at each training epoch, i.e., after each forward and backward pass of all the training data through the DNN. We demonstrate (see Section 5.1) that a comparison at the EP is effective, i.e., it produces results similar to those obtained by comparing the versions at the end of training. This optimization allows our methodology to reduce the exploration time during training by a factor ranging from 4x to 7x for different types of original DNNs. The final outcome of our methodology is a so-called Hybrid DNN that has the best configuration of activation function and dropout rate, among all the possibilities, for each activation layer. Quantitatively, we achieve from 7 % to 15 % Error Rate Reduction in our Hybrid DNNs compared to the original versions. An overview of our novel contributions with inputs/outputs is illustrated in Figure 1.
Paper Organization: Section 2 discusses the related work. Section 3 presents a quick analysis of the activation functions that we are using. Section 4 presents our methodology and novel contributions. Section 5 reports the experimental results on different benchmarks. Section 6 concludes the paper and summarizes the achievements.
2 Related work
and other networks afterwards. Since it has zero derivative for negative inputs, the backpropagated error is blocked in those conditions. This is called the “dead neuron problem”, because, once a neuron reaches this state, it cannot escape and can be considered dead, since it is no longer updated. Many researchers have proposed solutions to this problem. Maas et al. ref:Leaky RELU suggested using Leaky ReLU, where the negative part of the activation function also has a positive (linear) slope. Setting an appropriate value of the slope can be tricky, but He et al. ref:PRELU showed a method to learn the slope automatically during backpropagation. Another important research direction is the Exponential Linear Unit (ELU), proposed by Clevert et al. ref:ELU . ELU is an activation function with exponential behavior in the negative part and linear behavior in the positive one. Afterwards, Ioffe and Szegedy ref:Batch Normalization implemented the Batch Normalization (BN) for ResNet ref:ResNet , which contributes to its accuracy improvement over the previous state-of-the-art DNNs. Every layer using one of the activation functions described above may be associated with a BN layer to increase the accuracy of a given DNN. A recent work by Klambauer et al. ref:SNN showed that Self-Normalizing Neural Networks (SNN) have the intrinsic property of automatically converging to zero mean and unit variance, without requiring explicit Batch Normalization. They propose using Scaled Exponential Linear Units (SELU) as the activation function. Several other novel activation functions can be generated automatically, as shown by the work of Ramachandran et al. ref:search_activation_function . Selecting the appropriate activation function is not an easy task. Recent works by Harmon et al. ref:activation_ensembles and Manessi et al. ref:learning_combinations proposed using a combination of different activation functions in the same layer, with connections learned during training.
This approach, however, increases the memory footprint of the DNN by a significant factor, which is typically a critical parameter in real-world scenarios. Pedamonti ref:act_function_comparison replaced the activation function uniformly throughout the complete network and showed that ELU and SELU outperform the other activation functions, but this is not an exhaustive search for high accuracy improvements. To the best of our knowledge, a Hybrid DNN with different activation functions in different layers has not been explored. Our methodology has the potential to further improve the network accuracy, as we will demonstrate in Section 5. However, an appropriate selection of different activation functions in different layers demands a comprehensive evaluation with a huge trial-and-error effort if it is done in a naive way, thereby requiring an automatic methodology equipped with a fast evaluation strategy.
In this paper, we propose an automatic methodology to systematically select different types of activation functions in different layers, obtaining a Hybrid DNN. The exploration time is significantly reduced with respect to the trial-and-error approach, because our methodology compares the versions at an intermediate training epoch, called the Evaluation Point.
3 Activation Function Analysis
In this section, we briefly discuss, analyze and compare the activation functions (i.e., ReLU, ELU, and SELU) that are explored by our methodology. Other activation functions are not considered in this paper, but can easily be integrated in our methodology, as it takes a library of activation functions as an input (see Figure 1). For each function and its first derivative, we give the analytic expression and describe the behavior; see Figures 3 and 2. Section 3.4 describes the dropout method that fits each activation function.
Compared to ReLU, ELU has an exponential behavior for negative inputs. It introduces a new parameter, $\alpha$, which can be considered as a new hyper-parameter of the network. Clevert et al. ref:ELU propose to select $\alpha = 1$, hence we also use this value in our experiments. The analytic expression of ELU and its first derivative are reported in Equations 4 and 3, respectively.
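As an illustration of the three activation functions compared in this section, the following Python sketch implements ReLU, ELU and SELU together with their first derivatives for scalar inputs. It assumes the standard definitions from the cited works, with $\alpha = 1$ for ELU and the SELU constants published by Klambauer et al.; the function and constant names are chosen here for illustration only.

```python
import math

ELU_ALPHA = 1.0                        # value suggested by Clevert et al.
SELU_ALPHA = 1.6732632423543772        # SELU constants from Klambauer et al.
SELU_LAMBDA = 1.0507009873554805


def relu(x):
    return x if x > 0 else 0.0


def relu_grad(x):
    # Zero derivative for negative inputs: the source of the dead neuron problem.
    return 1.0 if x > 0 else 0.0


def elu(x, alpha=ELU_ALPHA):
    # Linear for positive inputs, exponential saturation towards -alpha otherwise.
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x, alpha=ELU_ALPHA):
    # Strictly positive for every input, so weights keep being updated.
    return 1.0 if x > 0 else alpha * math.exp(x)


def selu(x):
    # SELU is a scaled ELU with fixed alpha and lambda.
    return SELU_LAMBDA * elu(x, SELU_ALPHA)


def selu_grad(x):
    return SELU_LAMBDA * elu_grad(x, SELU_ALPHA)
```

For a negative input such as `x = -2.0`, `relu_grad` returns zero while `elu_grad` and `selu_grad` remain positive, which is the property discussed above.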
3.4 Dropout method
The dropout technique was introduced by Srivastava et al. ref:dropout to improve the regularization and to avoid the overfitting problem in DNNs. It is widely used in the most common state-of-the-art DNNs because of its regularizing property and simple applicability with ReLU activation functions. He et al. ref:PRELU proposed a weight initialization method that is efficient for ReLU activations, because it limits the variance. Clevert et al. ref:ELU applied the same initialization criteria with ELU activations. Kingma et al. ref:variational_dropout analyzed how the variance changes when dropout is applied. Klambauer et al. ref:SNN revised it and proposed a new initialization method and a new dropout technique, specific for SELU. Weights are initialized in such a way that the mean $E(w_i) = 0$ and the variance $Var(w_i) = 1/n$, where $n$ is the number of inputs. This method leads to a global variance (the sum of the variances of all the weights in the same layer) equal to 1. For example, each weight of a layer with 100 inputs should be initialized as a Gaussian variable with zero mean and variance equal to 0.01.
Thus, such a new dropout method, called “alpha” dropout, sets dropped activations to $\alpha'$, where $\alpha' = -\lambda\alpha$ is the negative saturation value of SELU. It is effective with SELU activations because it preserves mean and variance. Hence, in the following sections, we adopt standard dropout when applied to ReLU and ELU activations and “alpha” dropout when dealing with SELU.
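The initialization rule and the “alpha” dropout described in this section can be sketched in plain Python as follows. This is a minimal illustration based on the formulation by Klambauer et al., not the code used in our experiments: it assumes that dropped values are set to $-\lambda\alpha$ and that an affine correction with coefficients `a` and `b` restores zero mean and unit variance. All function names are hypothetical.

```python
import math
import random

SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805
ALPHA_PRIME = -SELU_LAMBDA * SELU_ALPHA   # value assigned to dropped units


def init_std(n_inputs):
    # Standard deviation giving each weight a variance of 1 / n_inputs.
    return math.sqrt(1.0 / n_inputs)


def init_weights(n_inputs, rng):
    # Zero-mean Gaussian weights for one neuron with n_inputs inputs.
    std = init_std(n_inputs)
    return [rng.gauss(0.0, std) for _ in range(n_inputs)]


def alpha_dropout(values, rate, rng):
    # Keep each value with probability q, otherwise replace it by ALPHA_PRIME,
    # then rescale (a) and shift (b) to restore zero mean and unit variance.
    q = 1.0 - rate
    a = 1.0 / math.sqrt(q + ALPHA_PRIME ** 2 * q * (1.0 - q))
    b = -a * (1.0 - q) * ALPHA_PRIME
    return [a * (v if rng.random() < q else ALPHA_PRIME) + b for v in values]
```

Applied to inputs with zero mean and unit variance, `alpha_dropout` returns values whose empirical mean and variance stay close to 0 and 1, whereas standard dropout, which sets dropped values to zero, is the natural choice for ReLU and ELU.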
4 Our Novel Methodology
We propose a simple yet effective methodology to automatically select the activation functions for each layer of a given DNN as well as its associated dropout rate, based on the accuracy obtained at the Evaluation Point.
4.1 Motivation and Key Features
Activation function selection is a quite complex task and has a lot of implications on the performance and the accuracy of a given DNN. A simple selection process based on exploration requires extensive trial-and-error analysis to converge and yet it cannot guarantee to find a high-quality solution. Hence, our methodology, at the very first stage, focuses on an efficient way to extract useful information from the learning curve (i.e., the curve that describes the accuracy of the DNN as a function of the number of epochs) to obtain the Evaluation Point. Then, for each layer, we find the best combination (that produces the maximum test accuracy) of the activation function and the dropout rate. A layer-wise search is efficient for (1) improving the DNN accuracy and (2) not penalizing the computation efficiency, while using parallel processing and SIMD instructions in GPUs. Alternatively, our Hybrid DNN can easily be implemented and integrated in a hardware accelerator for Deep Learning Inference. Moreover, due to the Evaluation Point optimization, (3) we efficiently reduce the exploration time during the training process.
4.2 Evaluation Point
The work by Domhan et al. ref:extrapolation_of_learning_curve showed how to extrapolate useful information from the first epochs of the learning curve to optimize the hyper-parameters. We exploited this method to establish the concept of an Evaluation Point, which is computed during the first stage of our methodology. This allows us to perform early termination of the training process without significantly sacrificing the accuracy. Looking at the learning curve, we can identify a first region, where the accuracy (A) grows fast, and a second region, where the accuracy is almost flat. A re-parametrization that evaluates the gradient of such a curve, i.e., the so-called Accuracy Gradient (AG), allows us to define an analytical function that is able to find the Evaluation Points automatically. We define the AG as the average of the relative accuracy difference over a range (R) of epochs (E). It is expressed by Equation 7.
The Evaluation Point (EP) is defined as the first epoch where the AG falls below 0.1 %. Refer to Figures 6 and 5 to see an example of the learning curve and the Accuracy Gradient to compute the Evaluation Point on the MNIST benchmark.
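The computation of the Accuracy Gradient and of the Evaluation Point can be sketched in Python as follows. Since the exact form of Equation 7 is not restated here, the sketch assumes one plausible reading of the definition above: the AG at epoch E is the mean of the relative accuracy changes between consecutive epochs over the last R epochs. The function names, the default range `r` and the toy accuracy curve are illustrative assumptions, not values from our experiments.

```python
def accuracy_gradient(acc, epoch, r):
    # acc[0] is the accuracy after epoch 1; `epoch` is 1-based.
    # Mean relative accuracy change over the r epochs ending at `epoch`.
    diffs = []
    for e in range(epoch - r + 1, epoch + 1):
        prev, curr = acc[e - 2], acc[e - 1]
        diffs.append((curr - prev) / prev)
    return sum(diffs) / r


def evaluation_point(acc, r=2, threshold=0.001):
    # First epoch whose Accuracy Gradient falls below the threshold (0.1 %).
    for epoch in range(r + 1, len(acc) + 1):
        if accuracy_gradient(acc, epoch, r) < threshold:
            return epoch
    return None  # the curve never flattens within the recorded epochs


# Toy learning curve (invented for illustration only).
toy_curve = [0.80, 0.90, 0.95, 0.97, 0.9705, 0.9706, 0.9707]
print(evaluation_point(toy_curve))
```

Averaging over a range of epochs, rather than using a single difference, makes the criterion less sensitive to the epoch-to-epoch noise of the learning curve.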
4.3 In-depth view of Methodology
The essence of our methodology consists of training different versions of the DNN for the number of epochs specified by the EP. It is an iterative process, where, for each layer (except the last one, where the softmax activation function is typically used for classification purposes), we search for the combination of an activation function and a dropout rate that optimizes the accuracy. We define the library of activation functions, composed of ReLU, ELU and SELU. For each activation function, we select the best dropout rate, according to what we call the “smart” search. Rather than ignoring dropout, such a search explores different values of the dropout rate, keeping track of the optimal dropout rate for the previous layer and moving step-by-step in intervals of [.1, .2, .5]. It moves one step further when the accuracy at the EP increases with respect to the previous version, and it changes the activation function otherwise. Once such a “smart” search is complete for the current layer, the current configuration is saved and we move on to analyzing the next layer. Algorithm 1 describes the flow of our methodology.
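The layer-wise flow described above can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 1: the callback `evaluate` is a hypothetical stand-in for "train the candidate DNN up to the EP and return its accuracy", the starting configuration (ReLU without dropout in every layer) and the grid of dropout rates are assumptions, and for brevity each layer restarts from the lowest rate instead of from the optimum of the previous layer.

```python
ACTIVATIONS = ("relu", "elu", "selu")
DROPOUT_RATES = (0.0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5)  # assumed 1-2-5 grid


def smart_search(num_layers, evaluate,
                 activations=ACTIVATIONS, rates=DROPOUT_RATES):
    # Assumed starting point: ReLU without dropout in every layer.
    config = [("relu", 0.0)] * num_layers
    best_acc = evaluate(config)
    for layer in range(num_layers):
        best_choice = config[layer]
        for act in activations:
            prev_acc = None
            for rate in rates:
                trial = list(config)
                trial[layer] = (act, rate)
                acc = evaluate(trial)      # accuracy at the Evaluation Point
                if acc > best_acc:
                    best_acc, best_choice = acc, (act, rate)
                # Stop stepping the rate once accuracy no longer increases,
                # then move on to the next activation function.
                if prev_acc is not None and acc <= prev_acc:
                    break
                prev_acc = acc
        config[layer] = best_choice        # freeze this layer and go on
    return config, best_acc


def toy_evaluate(config):
    # Invented scoring function that peaks at a known configuration.
    target = [("selu", 0.2), ("elu", 0.05)]
    score = 0.5
    for (act, rate), (t_act, t_rate) in zip(config, target):
        score += (0.1 if act == t_act else 0.0) - abs(rate - t_rate)
    return score


best_config, best_score = smart_search(2, toy_evaluate)
print(best_config)
```

Because every layer is frozen before the next one is analyzed, the number of evaluated versions grows linearly with the number of layers instead of exponentially, which is what keeps the exploration tractable.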
4.4 Hybrid DNN
Once we have processed every layer of the DNN, we train the resulting network, which we call the Hybrid DNN, for the complete number of epochs (beyond the Evaluation Point). Such a Hybrid DNN can have different activation functions in different layers, according to the selection criteria followed in Section 4.3, and has a better accuracy than the original version of the given DNN.
5 Experimental results
We apply our methodology to LeNet-5 on the MNIST dataset, which corresponds to the example provided in Section 5.1, and to other benchmarks: AlexNet on CIFAR-10 (Section 5.2), AlexNet on CIFAR-100 (Section 5.3), VGG-16 on CIFAR-10 (Section 5.4) and VGG-16 on CIFAR-100 (Section 5.5). Accuracy improvement results, expressed in terms of Relative Error Reduction, are reported in Table 2. This table also shows the Training Time Reduction and the power consumption and performance differences between the original and the Hybrid DNN.
The algorithm of our methodology (Algorithm 1) has been implemented using the PyTorch framework ref:pytorch and the experiments have been performed on an Nvidia GTX 1070 GPU (see its specifications in Table 1). The power consumption has been measured using the NVIDIA System Management Interface tool ref:nvidia_smi . A schematic view of the experimental setup is shown in Figure 4.
|NVIDIA GTX 1070 specs|
|Memory||8 GB GDDR5|
|Mem. interface width||256-bit|
|Mem. bandwidth||256 GB/s|
|Single precision Flops||6.5 TeraFLOPS|
|Power requirement||150 W|
|Dataset||DNN||RER||EP (epoch)||TTR||Power difference||Performance difference|
|MNIST||LeNet-5||15.56 %||7||4.29x||+ 5.53 %||- 3.23 %|
|CIFAR-10||AlexNet||7.11 %||18||7.22x||+ 2.37 %||- 3.24 %|
|CIFAR-100||AlexNet||8.34 %||22||5.91x||+ 1.94 %||- 3.24 %|
|CIFAR-10||VGG-16||8.83 %||24||5.42x||+ 3.09 %||- 3.17 %|
|CIFAR-100||VGG-16||9.34 %||29||4.48x||+ 2.40 %||- 3.20 %|
5.1 LeNet-5 on MNIST dataset
The MNIST benchmark is a collection of handwritten digits (size 28x28), divided into 10 categories. The set consists of 60,000 training images and 10,000 test images. As the original model, we use the LeNet-5 architecture ref:LeNet5 . We analyze the learning curve (Figure 5) and the respective Accuracy Gradient curve (Figure 6) in detail. Looking at the latter curve, we are able to identify the Evaluation Point, which in this case corresponds to the seventh epoch. We then compute the Training Time Reduction (TTR) that we are able to achieve by evaluating the accuracy at the EP instead of at the end of training. Its expression is reported in Equation 8. The accuracy improvement is measured as the Relative Error Reduction (RER) of the resulting Hybrid DNN with respect to the original one. The RER is defined in Equation 9. Results, including power and performance differences, are reported in the first row of Table 2.
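The two metrics can be computed as in the following Python sketch. Equations 8 and 9 are not restated here, so the sketch assumes that the TTR is the ratio between the total number of training epochs and the EP, and that the RER is the relative decrease of the error rate. The first assumption is consistent with the values in Table 2 (for example, 130 epochs and an EP of 18 give 7.22x). The function names and the error rates in the usage lines are illustrative.

```python
def training_time_reduction(total_epochs, evaluation_point):
    # Speed-up obtained by comparing versions at the EP instead of at the end.
    return total_epochs / evaluation_point


def relative_error_reduction(error_original, error_hybrid):
    # Relative decrease of the error rate, in percent.
    return 100.0 * (error_original - error_hybrid) / error_original


# AlexNet settings reported in Section 5.2: 130 epochs, EP at epoch 18.
ttr = training_time_reduction(130, 18)

# Hypothetical error rates (in percent), chosen only to illustrate the formula.
rer = relative_error_reduction(2.0, 1.5)
print(f"TTR = {ttr:.2f}x, RER = {rer:.2f} %")
```

Expressing the improvement relative to the original error rate, rather than as an absolute accuracy difference, keeps the metric meaningful on benchmarks such as MNIST where the baseline accuracy is already above 99 %.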
|Version||Layer 1 ACT||Layer 1 DROP. RATE||Layer 2 ACT||Layer 2 DROP. RATE||Layer 3 ACT||Layer 3 DROP. RATE||Accuracy||RER||TTR|
|-||SELU||0.1||ReLU||0.05||ReLU||0.1||99.17 %||7.78 %||6x|
|-||SELU||0.1||ELU||0.02||ReLU||0.05||99.21 %||12.22 %||5x|
|Hybrid||SELU||0.2||ELU||0.02||ELU||0.05||99.24 %||15.56 %||4.29x|
|-||SELU||0.2||ELU||0.05||ELU||0.05||99.24 %||15.56 %||3.75x|
|-||SELU||0.2||ELU||0.05||ELU||0.1||99.26 %||17.78 %||2.31x|
|-||SELU||0.2||ELU||0.05||ELU||0.1||99.26 %||17.78 %||1x|
To prove the efficacy of our methodology, we compare our Hybrid DNN (with versions evaluated at the EP computed with our criteria) with the DNNs that are obtained by comparing the results at different epochs (see Table 3 for the selected activation function and dropout rate combinations for each layer). This analysis exposes a trade-off between accuracy and training speed-up (TTR, Training Time Reduction). Table 3 shows that our EP, corresponding to the seventh epoch, is a good solution to this trade-off, because an evaluation at an earlier epoch significantly reduces the accuracy, while an evaluation at a later epoch reduces the speed-up gain.
Moreover, we analyze power and performance differences between the original and the Hybrid DNN, by measuring the relative differences of power and computation time, respectively. The results are reported in Table 2.
5.2 AlexNet on CIFAR-10 dataset
The CIFAR-10 dataset ref:CIFAR is composed of 50,000 training images and 10,000 test images of size 32x32, divided into 10 different classes. The AlexNet network ref:AlexNet consists of 5 convolutional layers and 3 fully-connected layers. Since it was designed to be trained on input images of size 224x224, the first layer of this network has been adapted to the size of CIFAR-10 images. We trained it for 130 epochs using a batch size of 128, with momentum = 0.9 and weight decay = 0.0005. The initial learning rate of 0.1 has been scaled by a factor of 0.1 after 40, 80 and 120 epochs. We applied our methodology to this benchmark, obtaining a TTR of 7.22x, based on an EP of 18. Our Hybrid DNN achieves 7.11 % RER, with a power penalty of 2.37 % and a performance penalty of 3.24 % with respect to the original AlexNet, as reported in Table 2.
5.3 AlexNet on CIFAR-100 dataset
The CIFAR-100 dataset ref:CIFAR is composed of 50,000 training images and 10,000 test images of size 32x32, divided into 100 different classes. The DNN, AlexNet, is the same as the one described in Section 5.2 and we trained it with the same hyper-parameters. Our methodology selects an EP of 22, which corresponds to a TTR of 5.91x, and the Hybrid DNN achieves 8.34 % RER. All the other result metrics are reported in Table 2.
5.4 VGG-16 on CIFAR-10 dataset
The VGG-16 model ref:VGGNet is a Deep Neural Network with 13 convolutional layers and 3 fully-connected layers. To comply with input images of size 32x32, a modified version of this network has been used in this experiment. We trained it on the same dataset (CIFAR-10) and with the same hyper-parameters as in Section 5.2. Our methodology produces an EP of 24, which leads to a TTR of 5.42x. The respective row of Table 2 collects the other results.
5.5 VGG-16 on CIFAR-100 dataset
6 Conclusion
In this paper, we have presented an effective methodology for selecting, layer by layer, the type of activation function and the dropout rate associated with it. The performance and power consumption measured for our Hybrid DNN are slightly worse than the values measured for the original DNN, because of the lower complexity of ReLU with respect to ELU and SELU. The accuracy, however, can be improved by a larger factor. Another key contribution of our methodology is the Training Time Reduction achieved in the exploration phase by using the Evaluation Points.
- (1) D. A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2015.
- (2) R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. In JMLR, 2011.
- (3) T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2015.
- (4) A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. In Neural Networks, 2005.
- (5) M. Harmon and D. Klabjan. Activation Ensembles for Deep Neural Networks. arXiv e-prints arXiv:1702.07790, 2017.
- (6) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- (7) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- (8) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- (9) D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In NIPS, 2015.
- (10) G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. In NIPS, 2017.
- (11) A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). CIFAR-100 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html
- (12) A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- (13) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
- (14) A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
- (15) F. Manessi and A. Rozza. Learning Combinations of Activation Functions. In CoRR, 2018.
- (16) V. Nair and G. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
- (17) D. Pedamonti. Comparison of non-linear activation functions for deep neural networks on MNIST classification task. In CoRR, 2018.
- (18) P. Ramachandran, B. Zoph, and Q. V. Le. Searching for Activation Functions. In CoRR, 2017.
- (19) T. K. Samuel, S. McNally, and J. Wynkoop. An Analysis of GPU Utilization Trends on the Keeneland Initial Delivery System. In XSEDE, 2012.
- (20) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- (21) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In Journal of Machine Learning Research, 2014.
- (22) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- (23) pyTorch framework: https://github.com/pytorch/pytorch