Empirical Evaluation of Rectified Activations in Convolutional Network

05/05/2015 ∙ by Bing Xu, et al.

In this paper we investigate the performance of different types of rectified activation functions in convolutional neural networks: the standard rectified linear unit (ReLU), the leaky rectified linear unit (Leaky ReLU), the parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear unit (RReLU). We evaluate these activation functions on standard image classification tasks. Our experiments suggest that incorporating a non-zero slope for the negative part in rectified activation units consistently improves the results. Our findings thus contradict the common belief that sparsity is the key to good performance in ReLU. Moreover, on small-scale datasets, both using a deterministic negative slope and learning it are prone to overfitting, and neither is as effective as the randomized counterpart. Using RReLU, we achieved 75.68% accuracy on the CIFAR-100 test set without multiple-view testing or ensembling.




1 Introduction

Convolutional neural networks (CNNs) have achieved great success in various computer vision tasks, such as image classification (Krizhevsky et al., 2012; Szegedy et al., 2014), object detection (Girshick et al., 2014) and tracking (Wang et al., 2015). Apart from depth, one of the key characteristics of modern deep learning systems is the use of non-saturated activation functions (e.g. ReLU) in place of their saturated counterparts (e.g. sigmoid, tanh). Non-saturated activation functions offer two advantages: first, they address the so-called "exploding/vanishing gradient" problem; second, they accelerate convergence.

Among these non-saturated activation functions, the most notable is the rectified linear unit (ReLU) (Nair & Hinton, 2010; Sun et al., 2014). Briefly, it is a piecewise linear function that sets the negative part to zero and retains the positive part. A desirable property is that the activations are sparse after passing through ReLU. It is commonly believed that the superior performance of ReLU comes from this sparsity (Glorot et al., 2011; Sun et al., 2014). In this paper, we ask two questions: first, is sparsity the most important factor for good performance? Second, can we design better non-saturated activation functions that beat ReLU?

We consider a broader class of activation functions, namely the rectified unit family. In particular, we are interested in the leaky ReLU and its variants. In contrast to ReLU, in which the negative part is dropped entirely, leaky ReLU assigns a non-zero slope to it. The first variant is the parametric rectified linear unit (PReLU) (He et al., 2015). In PReLU, the slopes of the negative part are learned from data rather than predefined. The authors claimed that PReLU is the key factor in surpassing human-level performance on the ImageNet classification task (Russakovsky et al., 2015). The second variant is the randomized rectified linear unit (RReLU). In RReLU, the slopes of the negative part are randomized within a given range during training and then fixed during testing. In the recent Kaggle National Data Science Bowl (NDSB) competition (https://www.kaggle.com/c/datasciencebowl), it was reported that RReLU could reduce overfitting due to its randomized nature.

In this paper, we empirically evaluate these four kinds of activation functions. Based on our experiments, we conclude that on small datasets, Leaky ReLU and its variants are consistently better than ReLU in convolutional neural networks. RReLU is favorable because its randomness in training reduces the risk of overfitting. For large datasets, more investigation is needed in the future.

2 Rectified Units

In this section, we introduce the four kinds of rectified units: rectified linear (ReLU), leaky rectified linear (Leaky ReLU), parametric rectified linear (PReLU) and randomized rectified linear (RReLU). We illustrate them in Fig. 1 for comparison. In the sequel, we use $x_{ji}$ to denote the input of the $i$th channel in the $j$th example, and $y_{ji}$ to denote the corresponding output after passing through the activation function. In the following subsections, we introduce each rectified unit formally.

Figure 1: ReLU, Leaky ReLU, PReLU and RReLU. For PReLU, $a_i$ is learned, and for Leaky ReLU, $a_i$ is fixed. For RReLU, $a_{ji}$ is a random variable sampled from a given range during training, and remains fixed in testing.

2.1 Rectified Linear Unit

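The rectified linear unit was first used in Restricted Boltzmann Machines (Nair & Hinton, 2010). Formally, the rectified linear activation is defined as:

$$
y_{ji} = \begin{cases} x_{ji} & \text{if } x_{ji} \ge 0 \\ 0 & \text{if } x_{ji} < 0 \end{cases} \tag{1}
$$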


2.2 Leaky Rectified Linear Unit

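The leaky rectified linear activation was first introduced in acoustic modeling (Maas et al., 2013). Mathematically, we have

$$
y_{ji} = \begin{cases} x_{ji} & \text{if } x_{ji} \ge 0 \\ \dfrac{x_{ji}}{a_i} & \text{if } x_{ji} < 0 \end{cases} \tag{2}
$$

where $a_i$ is a fixed parameter in the range $(1, +\infty)$. In the original paper, the authors suggest setting $a_i$ to a large number such as 100. In addition to this setting, we also experiment with a smaller $a_i = 5.5$ in our paper.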
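
The two definitions above can be sketched in plain Python. This is a minimal illustration and not the CXXNET implementation: the function names `relu` and `leaky_relu` are hypothetical, and the leaky variant assumes the negative part is divided by a fixed `a > 1`, so that a large `a` such as 100 yields a nearly flat negative slope.

```python
def relu(x):
    """Standard ReLU: keep non-negative inputs, zero out negative ones."""
    return x if x >= 0 else 0.0


def leaky_relu(x, a=100.0):
    """Leaky ReLU, assuming the negative part is scaled by 1/a with a > 1."""
    if a <= 1:
        raise ValueError("a is assumed to lie in (1, +inf)")
    return x if x >= 0 else x / a


inputs = [-2.0, -0.5, 0.0, 1.5]
print([relu(x) for x in inputs])
print([leaky_relu(x, a=100.0) for x in inputs])
```

With this convention, a smaller `a` makes the unit "leakier": the negative slope is `1 / a`.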

2.3 Parametric Rectified Linear Unit

The parametric rectified linear unit was proposed by He et al. (2015). The authors reported that it performs much better than ReLU on large-scale image classification. It is the same as leaky ReLU (Eqn. 2), except that $a_i$ is learned during training via backpropagation.
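
To make "learned via backpropagation" concrete, the sketch below computes the gradient of the output with respect to the slope parameter and applies one plain gradient-descent step. It is an illustration under assumptions: it keeps the division convention of leaky ReLU used in this section (He et al. (2015) instead write the negative part with a multiplicative coefficient), and the names `prelu_forward`, `prelu_grad_a` and `sgd_step`, the learning rate, and the upstream gradients are all hypothetical.

```python
def prelu_forward(x, a):
    """Same form as leaky ReLU, but `a` is treated as a trainable value."""
    return x if x >= 0 else x / a


def prelu_grad_a(x, a, upstream=1.0):
    """Gradient of the loss w.r.t. `a`: d(x / a)/da = -x / a**2 for x < 0."""
    return upstream * (-x / (a * a)) if x < 0 else 0.0


def sgd_step(a, xs, upstreams, lr=0.1):
    """One plain gradient-descent update of `a`, shared across inputs `xs`."""
    grad = sum(prelu_grad_a(x, a, g) for x, g in zip(xs, upstreams))
    return a - lr * grad


a = 4.0
xs = [-2.0, 1.0, -1.0]
upstreams = [1.0, 1.0, 1.0]  # hypothetical gradients from the next layer
a_new = sgd_step(a, xs, upstreams)
print(a, "->", a_new)
```

Only negative inputs contribute to the gradient, since the positive branch does not depend on `a`.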

2.4 Randomized Leaky Rectified Linear Unit

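Randomized leaky rectified linear (RReLU) is the randomized version of leaky ReLU. It was first proposed and used in the Kaggle NDSB competition. The highlight of RReLU is that during training, $a_{ji}$ is a random number sampled from a uniform distribution $U(l, u)$. Formally, we have:

$$
y_{ji} = \begin{cases} x_{ji} & \text{if } x_{ji} \ge 0 \\ \dfrac{x_{ji}}{a_{ji}} & \text{if } x_{ji} < 0 \end{cases} \quad \text{where } a_{ji} \sim U(l, u),\ l < u. \tag{3}
$$

In the test phase, we take the average of all the $a_{ji}$ used in training, as in dropout (Srivastava et al., 2014), and thus set $a_{ji}$ to $\frac{l+u}{2}$ to get a deterministic result. As suggested by the NDSB competition winner, $a_{ji}$ is sampled from $U(3, 8)$. We use the same configuration in this paper.

At test time, for negative inputs we use:

$$
y_{ji} = \frac{x_{ji}}{(l+u)/2} \tag{4}
$$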
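
The training and test behaviour described above can be sketched as follows. This is an illustrative sketch, not the competition or CXXNET code: the names `rrelu_train` and `rrelu_test` are hypothetical, the default range of 3 to 8 is an assumed setting, and the negative part is assumed to be divided by the sampled value, consistent with the leaky ReLU convention in this section.

```python
import random


def rrelu_train(x, lower=3.0, upper=8.0, rng=random):
    """Training: draw a fresh a ~ U(lower, upper) for each negative input."""
    if x >= 0:
        return x
    a = rng.uniform(lower, upper)
    return x / a


def rrelu_test(x, lower=3.0, upper=8.0):
    """Testing: use the fixed average (lower + upper) / 2 instead of sampling."""
    return x if x >= 0 else x / ((lower + upper) / 2.0)


rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [rrelu_train(-1.0, rng=rng) for _ in range(5)]
print(samples)           # each value lies between -1/3 and -1/8
print(rrelu_test(-1.0))  # deterministic: -1 divided by 5.5
```

The randomness only affects negative inputs during training; at test time the unit behaves like a leaky ReLU with a fixed divisor.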


3 Experiment Settings

We evaluate classification performance on the same convolutional network structure with different activation functions. Because the hyperparameter search space is large, we use two state-of-the-art convolutional network structures and the same hyperparameters for every activation setting. All models are trained using CXXNET (https://github.com/dmlc/cxxnet).

3.1 CIFAR-10 and CIFAR-100

The CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009) consist of tiny natural images. CIFAR-10 contains images from 10 classes and CIFAR-100 contains images from 100 classes. Each image is an RGB image of size 32x32. Each dataset has 50,000 training images and 10,000 test images. We use the raw images directly, without any pre-processing or augmentation. Results are reported from single-view testing without any ensemble.

The network structure is shown in Table 1. It is taken from Network in Network (NIN) (Lin et al., 2013).

| NIN |
|---|
| 5x5, 192 |
| 1x1, 160 |
| 1x1, 96 |
| 3x3 max pooling, /2 |
| dropout, 0.5 |
| 5x5, 192 |
| 1x1, 192 |
| 1x1, 192 |
| 3x3 avg pooling, /2 |
| dropout, 0.5 |
| 3x3, 192 |
| 1x1, 192 |
| 1x1, 10 |
| 8x8 avg pooling, /1 |
| 10 or 100 softmax |

Table 1: CIFAR-10/CIFAR-100 network structure. Each layer is a convolutional layer unless otherwise specified. Each convolutional layer is followed by an activation function.

In the CIFAR-100 experiment, we also tested RReLU on the Batch Norm Inception Network (Ioffe & Szegedy, 2015). We use a subset of the Inception Network starting from the inception-3a module. This network achieved 75.68% test accuracy without any ensemble or multiple-view testing (CIFAR-100 reproduction code: https://github.com/dmlc/mxnet/blob/master/example/notebooks/cifar-100.ipynb).

3.2 National Data Science Bowl Competition

The task in the National Data Science Bowl competition, which offered USD 170k in prizes, is to classify plankton from images. There are 30,336 labeled grayscale images in 121 classes and 130,400 test images. Since the test set is private, we divide the training set into two parts: 25,000 images for training and 5,336 images for validation. The competition uses multi-class log-loss to evaluate classification performance.
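
For readers unfamiliar with the metric, multi-class log-loss is the mean negative log-probability that the model assigns to the true class. The sketch below is a generic implementation, not the competition's scoring code: the function name `multiclass_log_loss`, the clipping constant `eps`, and the toy probabilities are assumptions made for illustration.

```python
import math


def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    probs  : list of per-example probability lists (each sums to 1)
    labels : list of true class indices
    eps    : clipping constant (an assumption) that avoids log(0)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = min(max(p[y], eps), 1.0 - eps)
        total -= math.log(p_true)
    return total / len(labels)


# Toy 3-class example (made-up numbers, not NDSB data).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
print(multiclass_log_loss(probs, labels))

# A uniform guess over 121 classes scores log(121).
uniform = [[1.0 / 121] * 121]
print(multiclass_log_loss(uniform, [0]), math.log(121))
```

Lower is better, and a uniform guess over the 121 classes gives a useful reference point for reading the log-loss values reported later.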

We take the network and augmentation settings from team AuroraXie (winning documentation: https://github.com/auroraxie/Kaggle-NDSB), one of the competition winners. The network structure is shown in Table 2. We use only single-view testing in our experiment, unlike the original multi-view, multi-scale testing.

| NDSB Net |
|---|
| 3x3, 32 |
| 3x3, 32 |
| 3x3, max pooling, /2 |
| 3x3, 64 |
| 3x3, 64 |
| 3x3, 64 |
| 3x3, max pooling, /2 |
| split: branch 1 / branch 2 |
| 3x3, 96 / 3x3, 96 |
| 3x3, 96 / 3x3, 96 |
| 3x3, 96 / 3x3, 96 |
| 3x3, 96 |
| channel concat, 192 |
| 3x3, max pooling, /2 |
| 3x3, 256 |
| 3x3, 256 |
| 3x3, 256 |
| 3x3, 256 |
| 3x3, 256 |
| SPP (He et al., 2014) {1, 2, 4} |
| 121 softmax |

Table 2: National Data Science Bowl Competition Network. All layers are convolutional layers unless otherwise specified. Each convolutional layer is followed by an activation function.
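
The SPP layer near the end of Table 2 turns a feature map of any spatial size into a fixed-length vector. The sketch below illustrates the idea for a single channel. It is a simplified illustration under assumptions, not the network's actual layer: it assumes that level n of {1, 2, 4} max-pools over an n x n grid of bins, that bin edges are obtained by floor and ceil rounding, and that the map is at least 4 x 4; the names `spp_single_channel` and `spp_output_length` are hypothetical.

```python
import math


def spp_single_channel(fmap, levels=(1, 2, 4)):
    """Max-pool one 2-D feature map into an n x n grid for each level n."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0, r1 = math.floor(i * h / n), math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = math.floor(j * w / n), math.ceil((j + 1) * w / n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def spp_output_length(channels, levels=(1, 2, 4)):
    """Total feature length: channels times the number of bins over all levels."""
    return channels * sum(n * n for n in levels)


small = [[r * 8 + c for c in range(8)] for r in range(8)]     # 8 x 8 map
large = [[r * 13 + c for c in range(13)] for r in range(11)]  # 11 x 13 map
print(len(spp_single_channel(small)), len(spp_single_channel(large)))
print(spp_output_length(256))
```

Under these assumptions each channel contributes 1 + 4 + 16 = 21 values regardless of input size, so the 256-channel layer preceding SPP would feed 5376 features to the softmax layer.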

4 Result and Discussion

Tables 3 and 4 show the results on CIFAR-10 and CIFAR-100, respectively. Table 5 shows the NDSB result. We use the ReLU network as the baseline, and compare its convergence curves with those of the other three activations pairwise in Fig. 2, 3 and 4, respectively. All three leaky ReLU variants are better than the baseline on the test set. We make the following observations based on our experiments:

  1. Not surprisingly, we find that the performance of normal leaky ReLU ($a_i = 100$) is similar to that of ReLU, but very leaky ReLU with a larger negative slope ($a_i = 5.5$) is much better.

  2. On the training set, the error of PReLU is always the lowest, and the errors of Leaky ReLU and RReLU are higher than that of ReLU. This indicates that PReLU may suffer from severe overfitting on small-scale datasets.

  3. On NDSB, the superiority of RReLU is more significant than on CIFAR-10/CIFAR-100. We conjecture that this is because in the NDSB dataset the training set is smaller than that of CIFAR-10/CIFAR-100, while the network we use is even bigger. This supports the effectiveness of RReLU in combating overfitting.

  4. For RReLU, we still need to investigate how the randomness influences the network training and testing process.

| Activation | Training Error | Test Error |
|---|---|---|
| ReLU | 0.00318 | 0.1245 |
| Leaky ReLU, $a = 100$ | 0.0031 | 0.1266 |
| Leaky ReLU, $a = 5.5$ | 0.00362 | 0.1120 |
| PReLU | 0.00178 | 0.1179 |
| RReLU ($y_{ji} = x_{ji} / \frac{l+u}{2}$) | 0.00550 | 0.1119 |

Table 3: Error rate of CIFAR-10 Network in Network with different activation functions

| Activation | Training Error | Test Error |
|---|---|---|
| ReLU | 0.1356 | 0.429 |
| Leaky ReLU, $a = 100$ | 0.11552 | 0.4205 |
| Leaky ReLU, $a = 5.5$ | 0.08536 | 0.4042 |
| PReLU | 0.0633 | 0.4163 |
| RReLU ($y_{ji} = x_{ji} / \frac{l+u}{2}$) | 0.1141 | 0.4025 |

Table 4: Error rate of CIFAR-100 Network in Network with different activation functions

| Activation | Train Log-Loss | Val Log-Loss |
|---|---|---|
| ReLU | 0.8092 | 0.7727 |
| Leaky ReLU, $a = 100$ | 0.7846 | 0.7601 |
| Leaky ReLU, $a = 5.5$ | 0.7831 | 0.7391 |
| PReLU | 0.7187 | 0.7454 |
| RReLU ($y_{ji} = x_{ji} / \frac{l+u}{2}$) | 0.8090 | 0.7292 |

Table 5: Multi-class log-loss of NDSB Network with different activation functions

Figure 2: Convergence curves for training and test sets of different activations on CIFAR-10 Network in Network.
Figure 3: Convergence curves for training and test sets of different activations on CIFAR-100 Network in Network.
Figure 4: Convergence curves for training and test sets of different activations on NDSB Net.
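
One simple way to read the train and test columns together is the gap between them, which is a rough indicator of how much a model's training performance overstates its held-out performance. The sketch below computes that gap from the reported numbers. It is an illustrative calculation, not an analysis from the paper: the names `RESULTS`, `generalization_gap` and `gaps` are hypothetical, and only the ReLU, PReLU and RReLU rows of Tables 3 to 5 are included to keep it short.

```python
# (train, test-or-validation) values copied from Tables 3, 4 and 5.
RESULTS = {
    "CIFAR-10 error": {
        "ReLU": (0.00318, 0.1245),
        "PReLU": (0.00178, 0.1179),
        "RReLU": (0.00550, 0.1119),
    },
    "CIFAR-100 error": {
        "ReLU": (0.1356, 0.429),
        "PReLU": (0.0633, 0.4163),
        "RReLU": (0.1141, 0.4025),
    },
    "NDSB log-loss": {
        "ReLU": (0.8092, 0.7727),
        "PReLU": (0.7187, 0.7454),
        "RReLU": (0.8090, 0.7292),
    },
}


def generalization_gap(train, held_out):
    """Held-out metric minus training metric; larger means a wider gap."""
    return held_out - train


def gaps(results):
    """Map dataset -> activation -> gap, for every entry in `results`."""
    return {
        dataset: {name: generalization_gap(*pair) for name, pair in rows.items()}
        for dataset, rows in results.items()
    }


for dataset, row in gaps(RESULTS).items():
    for name, gap in row.items():
        print(f"{dataset:16s} {name:6s} gap = {gap:+.4f}")
```

Errors and log-losses are on different scales, so the gaps are comparable within a dataset but not across datasets.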

5 Conclusion

In this paper, we analyzed four rectified activation functions using various network architectures on three datasets. Our findings strongly suggest that the most popular activation function, ReLU, is not the end of the story: all three (modified) leaky ReLU variants consistently outperform the original ReLU. However, the reasons for their superior performance still lack rigorous theoretical justification. How these activations perform on large-scale data also remains to be investigated. These are open questions worth pursuing in the future.


We would like to thank Jason Rolfe from D-Wave Systems for helpful discussions on the test network for randomized leaky ReLU.