Improving Deep Neural Network with Multiple Parametric Exponential Linear Units

06/01/2016, by Yang Li, et al.

The activation function is crucial to the recent successes of deep neural networks. In this paper, we first propose a new activation function, Multiple Parametric Exponential Linear Units (MPELU), aiming to generalize and unify the rectified and exponential linear units. As the generalized form, MPELU shares the advantages of the Parametric Rectified Linear Unit (PReLU) and the Exponential Linear Unit (ELU), leading to better classification performance and convergence properties. In addition, weight initialization is very important for training very deep networks. The existing methods laid a solid foundation for networks using rectified linear units but not for exponential linear units. This paper complements the current theory and extends it to a wider range. Specifically, we put forward a way of initialization that enables the training of very deep networks using exponential linear units. Experiments demonstrate that the proposed initialization not only helps the training process but also leads to better generalization performance. Finally, utilizing the proposed activation function and initialization, we present a deep MPELU residual architecture that achieves state-of-the-art performance on the CIFAR-10/100 datasets. The code is available at https://github.com/Coldmooon/Code-for-MPELU.

1 Introduction

Over the past few years, the landscape of computer vision has noticeably shifted from engineered feature architectures to an end-to-end feature learning architecture, deep neural networks, with which much state-of-the-art work has advanced classical tasks such as object detection [1], semantic segmentation [2], and image retrieval [3]. Such a revolutionary change mainly results from several crucial elements, such as big datasets, high-performance hardware, new effective models, and regularization techniques. In this work, we focus on two notable elements: the activation function and the corresponding network initialization.

One of the best-known activation functions is the Rectified Linear Unit (ReLU) [4, 5], which has had a profound effect on the development of deep neural networks. ReLU is a piecewise-linear function that keeps positive inputs and outputs zero for negative inputs. Owing to this form, it can alleviate the problem of vanishing gradients, allowing the supervised training of much deeper neural networks. However, it has a potential disadvantage: a unit will never activate again once its gradients reach zero. Seeing this, Maas et al. [6] presented Leaky ReLU (LReLU), in which the negative part of the activation function is replaced with a linear function. He et al. [7] further extended LReLU to the Parametric Rectified Linear Unit (PReLU), which learns the parameters of the rectifiers, leading to higher classification accuracy with little overfitting risk. In addition, Clevert et al. [8] presented the Exponential Linear Unit (ELU), leading to faster learning and better generalization performance than the rectified unit family on deep networks. The above rectified and exponential linear units are commonly adopted by recent deep learning architectures [5, 9, 10, 11] to achieve good performance. However, there exists a gap in representation space between the two types of activation functions. For the negative part, ReLU and PReLU can represent the linear function family but not the non-linear one, while ELU can represent the non-linear family but not the linear one. This representation gap to some extent undermines the representational power of architectures that adopt a particular activation function. In addition, ELU is at a potential disadvantage when used with Batch Normalization [12]. Clevert et al. [8] showed that using Batch Normalization with ELU could harm the classification accuracy, which is also verified in our experiments.

This work is mainly motivated by PReLU and ELU. Firstly, we present a new Multiple Parametric Exponential Linear Unit (MPELU), a generalization of ELU, to bridge the gap. In particular, an extra learnable parameter, β, is introduced into the inputs of ELU to control the shape of the negative part. By optimizing β through stochastic gradient descent (SGD), MPELU is able to adaptively switch between the rectified and exponential linear units. Secondly, motivated by PReLU, we make the hyper-parameter α of ELU learnable to further improve its representational ability and tune the function shape. This design makes MPELU more flexible than its antecedents ReLU, PReLU, and ELU, which can be seen as special cases of MPELU. Therefore, through learning α and β, the linear and non-linear space of the negative part can be covered by a single activation function module, whereas its existing special cases do not have this property.

The introduction of learnable parameters into ELU may bring an additional benefit. This is inspired by the observation that Batch Normalization does not improve ELU networks but can improve ReLU and PReLU networks. To see this, MPELU can be inherently decomposed into a composition of PReLU and a learnable ELU:

f(x) = ELU_α(PReLU_β(x)),   (1)

where x is the input of the activation function, PReLU_β denotes the PReLU with slope β, and ELU_α denotes the ELU [8] with a learnable parameter α. Applying Batch Normalization to the inputs gives

f(BN(x)) = ELU_α(PReLU_β(BN(x))).   (2)

As we can see, the outputs of Batch Normalization flow into PReLU before ELU, which can result in not only an improvement in classification performance, but also an alleviation of the potential problem of working with ELU. Eqn. (2) suggests that MPELU is also able to share the advantages of PReLU and ELU simultaneously, for example, the superior learning behavior of ELU compared to ReLU and PReLU, as described in [8]. Our experimental results on CIFAR-10 and ImageNet 2012 demonstrate that, by introducing the learnable parameters, MPELU networks provide better classification performance and convergence properties than their counterparts.
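As a quick sanity check of the decomposition in Eqn. (1) as written above, the following NumPy sketch (our own illustration, not the paper's released code) compares the direct MPELU formula with the PReLU-then-ELU composition on random inputs.

```python
import numpy as np

def mpelu(x, alpha, beta):
    # Direct definition: x for x > 0, alpha * (exp(beta * x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(beta * x) - 1.0))

def prelu(x, slope):
    # PReLU with a learnable slope on the negative part.
    return np.where(x > 0, x, slope * x)

def elu_alpha(x, alpha):
    # ELU whose output scale alpha is treated as learnable.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
alpha, beta = 0.25, 1.0

direct = mpelu(x, alpha, beta)
composed = elu_alpha(prelu(x, beta), alpha)   # input -> PReLU(beta) -> ELU(alpha)
print(np.max(np.abs(direct - composed)))      # ~0: the two forms coincide
```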

Because of the introduction of extra parameters, overfitting could be a concern. To address this, we adopt the same strategy as PReLU to reduce the overfitting risk. For each MPELU layer, α and β are initialized as either channel-shared or channel-wise learnable parameters. Therefore, the increase in the number of parameters of the entire network is at most twice the total number of channels, which is negligible compared to the number of weights.

Although many activation functions, e.g., ELU [8], have been proposed recently, few works determine a weight initialization for networks using them. Improper initialization often hampers the learning of very deep networks [9]. Glorot et al. [13] proposed an initialization scheme but only considered linear activation functions. He et al. [7] derived an initialization method that considers the rectified linear units (e.g., ReLU) but does not account for the exponential linear units (e.g., ELU). Even though Clevert et al. [8] applied it to networks using ELU, this lacks theoretical analysis. Furthermore, none of these works is suitable for non-convex activation functions. Observing this, this paper presents a strategy of weight initialization that enables the training of networks using exponential linear units, including ELU and MPELU, and thus extends the current theory to a wider range. In particular, since MPELU is non-convex, the proposed initialization also applies to non-convex activation functions.

The main contributions of this work are:

A new activation function MPELU that covers the solution space of both the rectified and exponential linear units.

A technique of weight initialization, allowing the training of extremely deep networks using ELU and MPELU.

A simple architecture of ResNet with MPELU, achieving state-of-the-art results on the CIFAR [14] dataset with comparable time/memory complexity and parameters to the original versions [11, 15].

The remainder of this paper is organized as follows. Sec. 2 reviews the related work. In Sec. 3, we propose our activation function and initialization method. The experiments and analysis are given in Sec. 4 to show their effectiveness. Utilizing the proposed methods, Sec. 5 presents a deep MPELU residual architecture that provides state-of-the-art performance on CIFAR-10/100. Finally, Sec. 6 concludes. To keep the paper at a reasonable length, the implementation details of our experiments are given in the appendix.

2 Related Work

This paper mainly focuses on activation functions and the weight initialization of deep neural networks; we therefore review the related work in these two fields. Note that training very deep networks can also be achieved by developing new architectures, such as introducing skip connections as in [16, 11], but this is beyond the scope of the paper.


Activation Functions. Even though activation functions are an early invention, they were not formally defined until recently [17]. Activation functions allow deep neural networks to learn a complex non-linear transformation, which is crucial to their modeling power. From the feature point of view, the outputs of activation functions can be used as high-level semantic representations (which can also be obtained by subspace learning, e.g., [18]) that are more robust to variance than low-level ones, which facilitates recognition tasks.

Among recent work is the Rectified Linear Unit (ReLU) [4, 5], one of the keys to the breakthrough of deep neural networks. ReLU keeps positive inputs unchanged and outputs zero for negative inputs; it can therefore avoid the problem of vanishing gradients, enabling the training of much deeper supervised neural networks, whereas the sigmoid nonlinearity cannot. LReLU [6] was proposed to multiply the negative inputs by a slope factor, aiming to avoid zero gradients in ReLU. According to [6], LReLU provides comparable performance to ReLU and is sensitive to the value of the slope. He et al. [7] found that the cost function is differentiable with respect to the slope factor and therefore proposed optimizing the slope through SGD. This parametric rectified linear unit is named PReLU. Experiments showed that PReLU can improve the performance of convolutional neural networks with little overfitting risk. They also proved that PReLU pushes the off-diagonal blocks of the Fisher Information Matrix (FIM) closer to zero, which enables faster convergence than ReLU. None of the above activation functions can learn non-convex functions, since they are themselves convex. To address this, Jin et al. [19] proposed an S-shaped rectified linear activation unit (SReLU) to learn both convex and non-convex functions, inspired by the Webner-Fechner law and the Stevens law. In addition to the above rectified linear units, Clevert et al. [8] presented a novel form of activation function, the Exponential Linear Unit (ELU). ELU is similar to a sigmoid for negative inputs and has the same form as ReLU for positive inputs. It has been shown that ELU brings the gradient closer to the unit natural gradient, which accelerates learning and leads to higher performance. When used with Batch Normalization [12], however, ELU tends to expose an unexpected degradation problem: in this case, Batch Normalization has a negligible (or even harmful) impact on the generalization capability and classification performance. In addition to the above deterministic activation functions, there are also randomized versions. Recently, Xu et al. [20] proposed a randomized leaky rectified linear unit, RReLU. RReLU also has negative outputs, which helps avoid zero gradients. The difference is that the slope of RReLU is neither fixed nor learnable but randomized. Through this strategy, RReLU is able to reduce the overfitting risk to some extent. However, Xu et al. only verified RReLU on small datasets, like CIFAR-10/100; how RReLU performs on large datasets such as ImageNet still needs to be explored.


Initialization. The initialization of parameters is very important, especially for deep networks and large learning rates. If not initialized properly, a network may be very hard to train with SGD. Many efforts have concentrated on this subject. Hinton et al. [21] introduced a learning algorithm that utilizes layer-wise unsupervised pre-training to initialize all layers; before this, there were no suitable algorithms for training deep fully-connected architectures. Shortly after, Bengio et al. [22] studied the pre-training strategy and conducted a series of experiments to substantiate and verify it. Erhan et al. [23] further performed a number of experiments to confirm and clarify the procedure, showing that it can place the starting point in parameter space in a better basin of attraction than picking starting parameters at random. Another important work during the development of deep learning is ReLU [4], which addresses the problem of vanishing gradients. With ReLU, deep networks are able to converge even when randomly initialized from a Gaussian distribution. Krizhevsky et al. [5] applied ReLU to supervised convolutional neural networks with random initialization and won the ILSVRC 2012 challenge. Since then, deeper and deeper networks have been proposed, leading to a sequence of improvements in computer vision. However, Simonyan et al. [9] showed that deep networks still face an optimization problem once the number of layers reaches some value (e.g., 11 layers). This phenomenon is also mentioned in [13, 10, 7, 16]. Glorot et al. [13] proposed a method to initialize weights according to the size of a layer. This strategy holds under the assumption of linear activation functions, which works well in many cases but does not hold for rectified linear units (e.g., ReLU and PReLU). He et al. [7] extended this method to the case of rectified linear units and proposed a new initialization strategy, usually called the MSRA filler, which has been of great help for training very deep networks. Nevertheless, for exponential linear units, there is currently no appropriate strategy to initialize weights. Observing this, we generalize the MSRA filler to a new initialization for deep networks using exponential linear units (e.g., ELU and MPELU), based on the first-order Taylor expansion of MPELU at zero.

Figure 1: Graphical depiction of activation functions. (a) Shapes of the activation functions: the slope of PReLU is initialized with 0.25; the hyper-parameter α of ELU is 1; α and β of MPELU are initialized with 3 and 1, respectively. (b) Other activation functions are special cases of MPELU: with α = 0, MPELU reduces to ReLU; if α = 25.6302 and β = 0.01, MPELU approximates PReLU; when α = 1 and β = 1, MPELU becomes ELU.

3 The Proposed Activation Function and Weight Initialization

This section first presents the Multiple Parametric Exponential Linear Unit (MPELU), then derives the weight initialization for networks using exponential linear units.

3.1 Multiple Parametric Exponential Linear Unit

PReLU and ELU have limited but complementary representational power in their negative parts. This work proposes a general form of activation function that unifies the existing ReLU, LReLU, PReLU, and ELU.


Forward Pass. Formally, the definition of MPELU is:

f(y_i) = y_i,  if y_i > 0;  α_c (e^{β_c y_i} − 1),  if y_i ≤ 0,   (3)

where β_c is constrained to be greater than zero, and c is the index of the channel of the input y_i corresponding to α_c and β_c. Following PReLU, α and β can be channel-wise (c equals the number of feature maps) or channel-shared (c = 1) learnable parameters, which control the value to which and the rate at which MPELU saturates, respectively. Fig. 1(a) shows the shapes of the four activation functions.

By adjusting β, MPELU can switch between the rectified and exponential linear units. To be specific, if β is set to a small number, for example, 0.01, the negative part of MPELU approximates a linear function. In this case, MPELU becomes the Parametric Rectified Linear Unit (PReLU). On the other hand, if β takes a large value, for example, 1.0, the negative part of MPELU is a non-linear function, turning MPELU back into an exponential linear unit.

Introducing α helps further control the form of MPELU, as shown in Fig. 1(b). If α and β are set to 1, MPELU reduces to ELU. Decreasing β in this case lets MPELU go to LReLU. Finally, MPELU is exactly equivalent to ReLU when α = 0.
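To make the special cases concrete, the following NumPy sketch (an illustration of Eqn. (3) as written above, not the authors' released code; the parameter values are illustrative) evaluates MPELU under the settings just described.

```python
import numpy as np

def mpelu(x, alpha, beta):
    # Eqn. (3): identity for positive inputs, alpha*(exp(beta*x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(beta * x) - 1.0))

x = np.linspace(-5.0, 2.0, 8)
print(mpelu(x, alpha=1.0, beta=1.0))     # ELU: alpha = beta = 1
print(mpelu(x, alpha=0.0, beta=1.0))     # ReLU: alpha = 0 zeroes the negative part
print(mpelu(x, alpha=25.0, beta=0.01))   # approx. PReLU with slope alpha*beta = 0.25
print(mpelu(x, alpha=1.0, beta=0.01))    # approx. LReLU with a small slope beta
```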

From the above analysis, it is easy to see that the flexible form of MPELU makes it cover the solution space of its special cases, and therefore grants it more powerful representation. We will show that ResNet [11, 15] could gain significant improvement merely by tuning the usage of activation functions, that is, from ReLU to MPELU.

Another benefit of MPELU is fast learning. Eqn. (2) suggests that MPELU could potentially share the properties of PReLU and ELU. Thus, as an exponential linear unit, MPELU exhibits the same learning behavior as ELU. Readers are referred to [8] for more details.


Backward Pass. Since MPELU is differentiable almost everywhere, deep networks with MPELU can be trained end-to-end. We use the chain rule to derive the update formulas for α and β:

∂f(y_i)/∂α_c = 0,  if y_i > 0;  e^{β_c y_i} − 1,  if y_i ≤ 0,   (4)

∂f(y_i)/∂β_c = 0,  if y_i > 0;  α_c y_i e^{β_c y_i},  if y_i ≤ 0,   (5)

∂f(y_i)/∂y_i = 1,  if y_i > 0,   (6)

∂f(y_i)/∂y_i = α_c β_c e^{β_c y_i},  if y_i ≤ 0.   (7)

Note that ∂f(y_i)/∂α_c and ∂f(y_i)/∂β_c are the gradients of the activation function with respect to α_c and β_c for a single unit. When computing the gradients of the loss function L for the entire layer, the gradients of α_c and β_c should be:

∂L/∂α_c = Σ_{y_i} (∂L/∂f(y_i)) (∂f(y_i)/∂α_c),   (8)

∂L/∂β_c = Σ_{y_i} (∂L/∂f(y_i)) (∂f(y_i)/∂β_c),   (9)

where Σ_{y_i} sums over all the positions corresponding to α_c and β_c. Throughout this paper, we employ the channel-wise version for all the experiments. With this strategy, the increase in the number of parameters of the entire network is at most twice the total number of channels, which is negligible compared to the number of weights. We show in Sec. 5 that the model size of the proposed MPELU ResNet architectures is comparable to (or even less than) that of the ReLU architectures.
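A minimal NumPy sketch of the forward and backward passes of a channel-wise MPELU layer, following Eqns. (3)-(9) as reconstructed above (our own illustration; the released implementation is in Caffe):

```python
import numpy as np

def mpelu_forward(x, alpha, beta):
    # x: (N, C, H, W); alpha, beta: (C,) channel-wise parameters.
    a = alpha[None, :, None, None]
    b = beta[None, :, None, None]
    return np.where(x > 0, x, a * (np.exp(b * x) - 1.0))

def mpelu_backward(x, alpha, beta, grad_out):
    # Returns gradients w.r.t. the input and the per-channel parameters.
    a = alpha[None, :, None, None]
    b = beta[None, :, None, None]
    neg = x <= 0
    exp_bx = np.exp(b * x)
    # Eqns. (6)/(7): df/dx = 1 for x > 0, alpha*beta*exp(beta*x) otherwise.
    grad_x = np.where(neg, a * b * exp_bx, 1.0) * grad_out
    # Eqns. (4), (8): sum df/dalpha over all positions of each channel.
    grad_alpha = np.sum(np.where(neg, exp_bx - 1.0, 0.0) * grad_out, axis=(0, 2, 3))
    # Eqns. (5), (9): sum df/dbeta over all positions of each channel.
    grad_beta = np.sum(np.where(neg, a * x * exp_bx, 0.0) * grad_out, axis=(0, 2, 3))
    return grad_x, grad_alpha, grad_beta

# Toy usage
x = np.random.randn(2, 3, 4, 4)
alpha, beta = np.full(3, 0.25), np.ones(3)
y = mpelu_forward(x, alpha, beta)
gx, ga, gb = mpelu_backward(x, alpha, beta, grad_out=np.ones_like(x))
print(y.shape, ga, gb)
```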

For the actual running time, MPELU is roughly comparable to PReLU if the code is carefully optimized. This will be analyzed in Sec. 4.2.

Initializing α and β with different values has a small but non-negligible impact on classification accuracy. We recommend using α = 0.25 and β = 1 as the initial values, and five times the base learning rate for both of them. Moreover, we highlight that it is important to use weight decay (L2 regularization) on both α and β, which is the opposite of the case of rectified linear units such as PReLU [7] and SReLU [19].

3.2 The Proposed Weight Initialization for Networks with MPELU

The previous works [21, 22, 13, 7] have laid a solid foundation for the initialization of deep neural networks. This paper complements the current theory and extends it to a wider range.


Brief Review of the MSRA Filler. The MSRA filler covers two cases of initialization: the forward-propagation case and the backward-propagation case. He et al. [7] proved that both cases are able to properly scale the backward signal, so it is sufficient to investigate only the forward-propagation case.

For a convolutional layer, a pixel in the output channel is expressed as:

y_l = w_l x_l + b_l,   (10)

where y_l is a random variable, w_l and x_l are random vectors that are independent of each other, and the bias b_l is initialized with zero. The goal is to explore the relationship between the variance of y_l and that of y_{l−1}.

We have:

Var[y_l] = n_l Var[w_l x_l],   (11)

where n_l = k²c, k is the kernel size, and c is the number of input channels. Here, both w_l and x_l are random variables. Eqn. (11) holds under the assumption that the elements in w_l and x_l are independent and identically distributed, respectively. Usually, the weights of a deep network are initialized with zero mean, so Eqn. (11) becomes:

Var[y_l] = n_l Var[w_l] E[x_l²].   (12)

Next, we need to find the relationship between E[x_l²] and Var[y_{l−1}]. Note that there exists an activation function f between x_l and y_{l−1}:

x_l = f(y_{l−1}).   (13)

For different activation functions f, we may derive different relationships, and thus different initialization methods. Specifically, for symmetric activation functions such as the sigmoid non-linearity, Glorot et al. [13] assumed they are linear at initialization and therefore proposed the Xavier method. For the rectified linear units ReLU and PReLU, He et al. [7] removed the linear assumption and extended the Xavier method to the MSRA filler. In the next section, we further extend the MSRA filler to a more general form by taking the first-order Taylor expansion of MPELU at zero and clipping the result to its linear part.


The Proposed Initialization. This section mainly follows the derivations in [13, 7]. Since ELU is a special case of MPELU, we focus on MPELU. As can be seen from Eqn. (3), it is very difficult to obtain the exact relationship between E[x_l²] and Var[y_{l−1}]. Instead, we use the Taylor series of MPELU at zero. For the negative part, MPELU can be expressed as:

f(y) = α(e^{βy} − 1),  y ≤ 0.   (14)

Then, the left side of Eqn. (14) is approximated by its Taylor polynomial of degree 1:

f(y) ≈ αβy,  y ≤ 0.   (15)

Eqn. (15) introduces the linear approximation only for the negative regime. We call this the semi-linear assumption, with which we have:

E[x_l²] = E[f(y_{l−1})²] = ∫ f(y_{l−1})² p(y_{l−1}) dy_{l−1}   (16)
        = ∫_{−∞}^{0} (αβ y_{l−1})² p(y_{l−1}) dy_{l−1} + ∫_{0}^{+∞} y_{l−1}² p(y_{l−1}) dy_{l−1},   (17)

where p is the probability density function of y_{l−1}. Following [13, 7], if w_{l−1} has a symmetric distribution with zero mean, the same holds for y_{l−1}. Then,

E[x_l²] = (1/2)(1 + α²β²) Var[y_{l−1}].   (18)

By Eqn. (18) and (12), we obtain:

Var[y_l] = (1/2) n_l (1 + α²β²) Var[w_l] Var[y_{l−1}].   (19)

Applying this recursively, it is easy to derive the relationship between Var[y_L] and Var[y_1]:

Var[y_L] = Var[y_1] ∏_{l=2}^{L} (1/2) n_l (1 + α²β²) Var[w_l].   (20)

Following [13, 7], to keep the signals of the forward and backward passes flowing correctly, we expect Var[y_l] to equal Var[y_{l−1}], which leads to:

(1/2) n_l (1 + α²β²) Var[w_l] = 1, for every layer l.   (21)

Therefore, for each layer of a deep network using MPELU, we can initialize the weights from a zero-mean Gaussian distribution whose standard deviation is

std(w_l) = sqrt( 2 / ( (1 + α²β²) n_l ) ),   (22)

where l is the index of the layer. Eqn. (22) applies to deep networks using either rectified or exponential linear units. Note that when α = 1 and β = 1, Eqn. (22) becomes the initialization for ELU networks. When α = 0, Eqn. (22) corresponds to the initialization for ReLU networks. Furthermore, when β = 1 and α is set to the slope of the rectifier, Eqn. (22) can be used to initialize PReLU networks. From this point of view, the MSRA filler is a special case of the proposed initialization.
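A short NumPy sketch of Eqn. (22) as written above (our own illustration; the layer shape and parameter values are hypothetical). It computes the per-layer standard deviation and shows how the special cases recover the MSRA filler.

```python
import numpy as np

def mpelu_init_std(fan_in, alpha, beta):
    # Eqn. (22): std = sqrt(2 / ((1 + (alpha*beta)^2) * n_l)), with n_l = k*k*c_in.
    return np.sqrt(2.0 / ((1.0 + (alpha * beta) ** 2) * fan_in))

k, c_in, c_out = 3, 64, 128          # hypothetical 3x3 convolution
fan_in = k * k * c_in

print(mpelu_init_std(fan_in, alpha=0.0, beta=1.0))    # == sqrt(2/n): MSRA for ReLU
print(mpelu_init_std(fan_in, alpha=0.25, beta=1.0))   # MSRA for PReLU with slope 0.25
print(mpelu_init_std(fan_in, alpha=1.0, beta=1.0))    # == sqrt(1/n): the ELU/MPELU case

# Drawing an actual filter bank for one layer:
rng = np.random.default_rng(0)
W = rng.normal(0.0, mpelu_init_std(fan_in, 1.0, 1.0), size=(c_out, c_in, k, k))
```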


Comparison with Xavier, MSRA, and LSUV. The Xavier method is designed for symmetric activation functions under the hypothesis of linearity, and the MSRA filler only applies to the rectified linear units (ReLU and PReLU), while the proposed method addresses the initialization for both rectified and exponential linear units. Recently, Mishkin et al. [24] proposed the LSUV initialization, which is data-driven and thus avoids solving for the relationship between E[x_l²] and Var[y_{l−1}]; in contrast, Eqn. (22) is an analytic solution for ELU and MPELU and therefore runs faster than LSUV.

4 Experiment

This section explores the usage of MPELU in a number of architectures. In Sec. 4.1, we begin with experiments with Network in Network (NIN) [25] on CIFAR-10, showing the benefit of introducing learnable parameters into ELU. Sec. 4.2 further substantiates this benefit in deeper networks and on the larger ImageNet 2012 dataset. Finally, Sec. 4.3 verifies the proposed initialization with a very deep network on ImageNet, showing its ability to train very deep ELU/MPELU networks. In Sec. 4.1 and Sec. 4.3, we also provide a convergence analysis, showing that MPELU, like ELU, possesses convergence properties superior to ReLU and PReLU.

4.1 Experiments with NIN on CIFAR-10

This section conducts the experiments of Network in Network with different activation functions on the CIFAR-10 dataset. The goal is to investigate the benefits of introducing learnable parameters into ELU.

This architecture has nine convolutional layers, including six with 1×1 kernels, and no Fully Connected (FC) layers, which makes it easy to train and sufficient for a comprehensive evaluation of the effectiveness of the learnable parameters. The implementation details are given in the appendix.

NIN | parameter(s) | CIFAR-10 | CIFAR-10 (augmented)
ReLU [25] | - | 10.41 | 8.81
PReLU | | 9.02 (9.19 ± 0.15) | 7.28 (7.49 ± 0.14)
ELU | | 9.39 (9.63 ± 0.23) | 7.77 (7.83 ± 0.05)
MPELU | | 9.06 (9.19 ± 0.11) | 7.37 (7.57 ± 0.16)
MPELU | | 9.10 (9.27 ± 0.12) | 7.30 (7.52 ± 0.18)
Table 1: Test error rate (%) of classification on CIFAR-10. α and β in MPELU are initialized with 1 or 0.25, and they are updated by SGD without weight decay. As in [16, 11], the best (mean ± std) results over five runs are reported for each network.

For fair comparison, we train networks using ReLU, PReLU, ELU, and MPELU from scratch with the same settings. Tab. 1 shows that MPELU consistently outperforms ELU (e.g., 9.06% vs. 9.39% test error without data augmentation, and 7.30% vs. 7.77% with data augmentation). This improvement over ELU comes entirely from α and β, verifying the benefit of the learnable parameters.

Figure 2: Comparison of convergence on CIFAR-10. All the models learn very quickly on this small dataset, so we adopt an evaluation method similar to [8], measuring the number of iterations needed to reach 15% test error. (a) MPELU starts reducing the loss earlier. (b) MPELU reaches 15% error after 9k iterations, while ReLU and PReLU need 25k and 15k iterations, respectively, to reach the same error rate.

Some interesting phenomena can be observed in Tab. 1 and Fig. 2. Firstly, Tab. 1 shows that MPELU (α = 0.25) performs like PReLU (a negligible difference of 0.03% in mean test error when using data augmentation). Secondly, Fig. 2(a)(b) show that its learning curves are closer to ELU's, suggesting a potentially superior learning behavior compared to the rectified linear units, as described in [8]. Note that all the models learn very quickly on this small dataset and reach the same test error rate (15%) within 25k iterations, which makes it very hard to compare learning speed. To deal with this, we adopt an evaluation criterion similar to [8], namely the iteration at which the 15% test error rate is reached. Fig. 2(b) shows that MPELU starts reducing the error (and the loss) earlier and reaches 15% error after 9k iterations, while ReLU and PReLU need 25k and 15k iterations, respectively, to reach the same error rate. This better performance arises from combining the advantages of PReLU and ELU, as suggested by Eqn. (1).

It is also worth noting that MPELU achieves performance comparable to PReLU with a few more parameters. This is not caused by overfitting, since ELU performs much worse than PReLU and MPELU. The underlying reason is still unclear and will be studied in the future. Even though MPELU is slightly less effective than PReLU in this shallower architecture, we will show that MPELU outperforms PReLU in deeper architectures.

4.2 Experiments on ImageNet

α, β for MPELU | Gaussian init.: ReLU | PReLU | ELU | MPELU | MSRA / ours: ReLU | PReLU | ELU | MPELU
0/0, 1/0 | 37.66 | - | - | 39.40 | 37.45 | - | - | -
0/1, 1/0 | - | - | - | 37.92 | - | - | - | -
0/1, 1/1 | - | - | - | 37.61 | - | - | - | 37.41
0.25/0, 1/0 | - | 39.48 | - | 40.94 | - | 38.72 | - | 39.46
0.25/1, 1/1 | - | 39.53 | - | 37.81 | - | 38.57 | - | 37.47
1/0, 1/0 | - | - | 40.36 | 39.53 | - | - | 39.83 | 38.42
1/1, 1/1 | - | - | - | 38.04 | - | - | - | 37.33
α, β: initial value / weight decay multiplier. The second column group uses the MSRA filler for ReLU/PReLU and the proposed initialization for ELU/MPELU.
Table 2: Top-1 error rate (single-view test) on the validation set of ImageNet 2012 with data augmentation. The comparison is made under the same initial values of α; β in MPELU is initialized with 1 in all cases. α and β in MPELU are updated by SGD with/without weight decay. MPELU consistently outperforms its counterparts and obtains the overall best result.

This section evaluates MPELU on the ImageNet 2012 classification task. ImageNet 2012 contains about 1.28 million training examples, 50k validation examples, and 100k test examples belonging to 1000 classes. This enables us to utilize a deeper network with little overfitting risk. Therefore, we build a 15-layer network modified from model E in [26]. The models evaluated in this section are trained on the training set and tested on the validation set.

Network Structure. Based on model E, we add one more convolutional layer, insert Batch Normalization [12] immediately before the activation functions, and remove the dropout [27] layers. Following [26, 5, 28], the networks are divided into three stages by max-pooling layers. The first stage contains only one convolutional layer with 64 filters. The second stage consists of four convolutional layers with 128 filters; we set the stride and padding accordingly so as to maintain the feature map size. The third stage consists of seven convolutional layers with 256 filters, in which the feature map size is reduced. The next layer is a spatial pyramid pooling (SPP) layer [28], which is followed by two 4096-d FC layers, one 1000-d FC layer, and a softmax, successively. The networks are initialized by three methods: a Gaussian distribution with zero mean and 0.01 standard deviation, the MSRA filler [7], and the proposed initialization (see Sec. 3.2). The bias terms are initialized with 0 as usual. α and β in MPELU are initialized with varying values and updated by SGD with or without weight decay. Other implementation details are given in the appendix.

For fair comparison, the competing activation functions are evaluated under the same initial values of α, and Tab. 2 lists the results.

Gaussian Initialization. When compared to ELU, all the MPELU layers are initialized with α = 1. As we can see, the MPELU network outperforms the ELU network by 0.83% top-1 error. If weight decay is used, it significantly outperforms the ELU network, by 2.32%. Since the only difference between them lies in the activation function, this improvement over ELU indeed demonstrates the advantage of the learnable parameters α and β.

To examine MPELU further, we also compare it with PReLU. In this case, α in MPELU is initialized with 0.25. Tab. 2 shows that the MPELU network achieves a top-1 error rate of 40.94%, which is worse than the 39.48% provided by the PReLU network. Nevertheless, using weight decay considerably improves the performance of the MPELU network by 3.13%, reducing the top-1 error rate to 37.81%, which is better than that of the PReLU network by 1.72%.


Other Initialization Methods. Experiments are also conducted with the other initialization methods (see Tab. 2). The results are in line with the Gaussian initialization case: MPELU surpasses all its counterparts. The overall best top-1 error rate of 37.33% achieved by MPELU is significantly lower than those achieved by PReLU and ELU. It is interesting to see that the MPELU networks initialized with the proposed method consistently outperform those initialized with the Gaussian method, demonstrating that our initialization can lead to better generalization capability, which is also verified in Sec. 4.3.

Note that MPELU provides only a slight improvement over ReLU, and that using weight decay on α and β in MPELU tends to decrease the top-1 test error in all three cases. This result is not caused by overfitting, however, since adding more layers (more parameters) to the 15-layer network leads to lower test error, as shown in Sec. 4.3. A possible reason is that using weight decay tends to push α and β towards zero, resulting in smaller-scale activations or sparser representations, like ReLU, that are more likely to be linearly separable in a high-dimensional space [29]. Another explanation may come from sparse feature selection [30].

To provide an empirical interpretation, we performed four extra experiments using LReLU with different slopes, gradually decreasing the scale of the negative activations. All five models (ReLU and LReLU A-D) have the same number of parameters, which eliminates the influence of overfitting; the only difference among them is the scale of the negative activations. A noticeable trend is illustrated in Tab. 3: the top-1/top-5 test error decreases as the slope decreases, which explains why using weight decay on MPELU leads to better results and why ReLU performs better than PReLU and ELU. Nevertheless, this phenomenon is not observed in Sec. 4.1, which might be because small scale or sparsity is less important for the shallower architecture (the ReLU NIN performs worst).

15-layer network | slope | top-1 error rate (%) | top-5 error rate (%)
ReLU | 0 | 37.66 | 15.98
LReLU (A) | | 37.92 | 16.26
LReLU (B) | | 38.54 | 16.65
LReLU (C) | | 42.76 | 20.18
LReLU (D) | | 60.27 | 36.60
Table 3: Classification comparison among different slopes on the ImageNet validation set. The trend is that performance improves as the slope decreases.

Convergence Comparison. Since Batch Normalization has a great influence on the convergence of networks, we leave the comparison of convergence among activation functions to Sec. 4.3.


Running Time. The running time refers to the time taken to perform one training iteration with batch size 64. Essentially, the computational cost of MPELU is greater than that of its counterparts, but this can be properly addressed by a carefully engineered implementation (e.g., faster exponential functions). In our Caffe [31] implementation, the backward pass utilizes the outputs of the forward pass, as shown in Eqns. (4), (6), and (7), which saves a lot of computation. Furthermore, the gradients with respect to the parameters and the inputs can be computed together in each loop. Consequently, the real running time of MPELU is only slightly slower than that of PReLU, as summarized in Tab. 4.

| ReLU | PReLU | ELU | MPELU
running time | 0.2310 | 0.2417 | 0.2299 | 0.2441
Table 4: Running time (seconds/iteration) of ReLU, PReLU, ELU, and MPELU based on the Caffe implementation. The experiments are performed on an NVIDIA Titan X GPU. The reported running time is the mean over 600k iterations.

4.3 Experiments of Initialization

This section conducts experiments on ImageNet 2012. The task is to examine whether the proposed initialization helps very deep networks using exponential linear units to converge. To this end, we add 15 extra convolutional layers to the network in Sec. 4.2, resulting in a 30-layer network that suffices for investigating the effect of the initialization. Note that this network is similar to the 30-layer ReLU network in [7] but differs from it in several aspects such as batch size, padding, and feature map size.

Since BN has a great influence on the convergence of deep networks, it is natural to take it into account. Following [12], we remove the dropout layers when using BN. Finally, four methods are compared: the baseline Gaussian initialization, our initialization, BN + Gaussian initialization, and BN + our initialization. α and β in MPELU are initialized with 1 and updated by SGD without weight decay, with other settings identical to Sec. 4.2.

initialization method | 30-layer ELU | 30-layer MPELU | 15-layer ELU | 15-layer MPELU
Gaussian | † | † | - | -
ours | 37.08 | 36.49 | - | -
Gaussian + BN | - | 44.28 | 40.36 | 39.53
ours + BN | - | 42.96 | 39.83 | 38.42
†: fails to converge
Table 5: Comparison of initialization methods. The top-1 test error (%) on the validation set of ImageNet 2012 is reported. The 30-layer ELU and MPELU networks with the Gaussian method totally stop learning; in contrast, the proposed method makes them converge, verifying the effectiveness of Eqn. (22). When BN is used, performance can still be boosted by the proposed method. Note that the results of 44.28% and 42.96% achieved by the 30-layer MPELU networks with BN are considerably worse than the 39.53% and 38.42% achieved by the 15-layer counterparts, suggesting the emergence of the degradation problem [11].
15 layers | MPELU | ELU
α, β | 0/1, 1/1 | 0.25/0, 1/0 | 0.25/1, 1/1 | 1/1, 1/1 | 1/0, 1/0 | 1/0, 1/0
LSUV [24] | 37.72 | 39.93 | 37.67 | 37.62 | 38.57 | 39.85
ours | 37.41 | 39.46 | 37.47 | 37.33 | 38.42 | 39.83
α, β: initial value / weight decay multiplier
Table 6: Comparison between LSUV and the proposed method on the 15-layer networks. The improvement over LSUV is slight but consistent.

Comparison to Gaussian. Tab. 5 shows that the Gaussian initialization fails to train the 30-layer ELU/MPELU networks, while our method enables them to learn, which justifies the effectiveness of Eqn. (22). Furthermore, the 37.08%/36.49% top-1 test error rates achieved by the 30-layer ELU/MPELU networks are clearly lower than those achieved by the 15-layer counterparts, meaning that the proposed method indeed addresses the diminishing gradients caused by the improper initialization of very deep networks and hence lets them enjoy the benefit of increased depth. When BN is adopted, the proposed method reduces the error consistently compared to the Gaussian initialization, showing its benefit to generalization capability. In addition, the MPELU networks always perform better than the ELU networks and obtain the overall best result, a 36.49% top-1 test error rate, demonstrating the benefit of introducing the learnable parameters. The above results indicate that although Eqn. (22) derives from a first-order Taylor approximation of Eqn. (14), it works rather well in practice.


Comparison to LSUV. Mishkin et al. [24] verified LSUV on the 22-layer GoogLeNet [10] using ReLU. To examine LSUV in deeper networks with exponential linear units, we build another, 52-layer, ELU network and initialize the 30- and 52-layer ELU networks with LSUV. Without BN, LSUV makes both ELU networks explode within only a few iterations, while our method makes them converge. More experiments are also conducted on the 15-layer networks from Sec. 4.2, and the results are given in Tab. 6. The proposed initialization leads to a marginal, but consistent, decrease in top-1 test error. In addition, Eqn. (22) is an analytic solution, while LSUV is a data-driven method, meaning that the proposed method runs faster than LSUV.


Degradation Analysis. It should be noted in Tab. 5 that while the 30-layer network without BN obtains the overall best result, the 30-layer networks with BN perform considerably worse than the 15-layer counterparts. To explain this, we analyze their learning behaviors.

Figure 3: Learning curves of the 15/30-layer MPELU networks on ImageNet. (a) Training loss: all the 30-layer networks tend to converge. (b) Top-1 training error (%). (c) Top-1 test error (%). The 30-layer networks with BN have higher training/test error than the 15-layer network, suggesting the emergence of the degradation problem [11]. Somewhat surprisingly, if BN is removed, the problem is eliminated (see the red dashed line).

Firstly, Fig. 3(a) shows the training loss of all the 30-layer networks at the end of training. As we can see, the networks with BN have training loss comparable to the network without BN, demonstrating that they all converge well. Thus, it is very unlikely that the decrease in accuracy is caused by vanishing gradients. Secondly, Fig. 3(b)(c) show the top-1 training/test error rates. Obviously, the 30-layer networks with BN have higher training/test error than the 15-layer counterpart, suggesting the emergence of the degradation problem described in [11]. Interestingly, the 30-layer network without BN does not suffer from this problem and can enjoy the benefit of increased depth. Note that the only difference among these networks is the usage of BN. Therefore, BN might be an underlying factor causing the degradation problem.

| conv1 | conv7 | conv14 | conv20 | conv27
Mean (ReLU) | 38.95 | 41.25 | 28.37 | 22.52 | 19.61
Mean (MPELU) | 25.31 | 4.77 | 0.13 | 0.03 | 0.003
Var (ReLU) | 4196.36 | 4603.98 | 2594.84 | 2381.22 | 2627.62
Var (MPELU) | 1840.65 | 74.43 | 0.71 | 0.07 | 0.01
Table 7: Statistics (mean and variance) of the activations of conv{1, 7, 14, 20, 27}. As described in [7], the ReLU network roughly preserves its variance, which leads to outputs of large magnitude and thus divergence. In contrast, the MPELU network gradually reduces the magnitude and thus avoids overflow.

Comparison of convergence. Since deeper networks are harder to train, it is instructive to examine the convergence of the activation functions with the 30-layer networks without BN. To this end, four such networks are constructed and initialized by the corresponding method in the FAN_IN, FAN_OUT, and AVERAGE modes. Experimental results show that the ReLU network fails to converge in all three modes, and the PReLU network converges only in the FAN_OUT mode. In contrast, the ELU/MPELU networks converge in all three modes. These results may be due to the robustness to input variations introduced by the left saturation of ELU/MPELU. To verify this, the statistics (mean and variance) of the activations are computed. Tab. 7 shows that the ReLU network roughly preserves the variance of its inputs, which results in very large activations at higher layers and overflow of the softmax, as discussed in [7]. The MPELU network does not suffer from this since it saturates on the left to a small negative value and thereby gradually decreases the variance during forward propagation.
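The following NumPy sketch (a toy simulation under our own assumptions, with fully-connected layers standing in for convolutions and Gaussian inputs standing in for images) illustrates the effect described above: ReLU keeps the pre-activation variance large, while MPELU's left saturation gradually shrinks it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mpelu(x, alpha=1.0, beta=1.0):
    return np.where(x > 0, x, alpha * (np.exp(beta * x) - 1.0))

rng = np.random.default_rng(0)
n, depth = 512, 27
x0 = rng.normal(scale=40.0, size=(256, n))   # large-magnitude inputs, as with raw images

for name, act, std in [("ReLU", relu, np.sqrt(2.0 / n)),        # MSRA filler
                       ("MPELU", mpelu, np.sqrt(1.0 / n))]:     # Eqn. (22), alpha = beta = 1
    x = x0
    for layer in range(1, depth + 1):
        y = x @ rng.normal(0.0, std, size=(n, n))   # pre-activation of this layer
        x = act(y)
        if layer in (1, 7, 14, 20, 27):
            print(name, layer, round(float(np.var(y)), 3))
    # ReLU roughly preserves Var[y]; MPELU shrinks it layer by layer.
```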

4.4 Residual Analysis of the Proposed Initialization

The negative part of MPELU is approximated in Eqn. (15) by its first-order Taylor expansion. This section estimates the residual term of that approximation,

R(y) = α(e^{βy} − 1) − αβy,  y ≤ 0,   (23)

which, by Taylor's theorem, satisfies 0 ≤ R(y) ≤ (αβ²/2) y².

To this end, two cases with and without BN will be considered.


With BN. BN is usually adopted immediately before MPELU. Therefore, it is reasonable to assume that the input y of MPELU has a Gaussian distribution with zero mean at the initialization stage. According to probability theory, over 99.73% of the inputs fall into the range [−3σ, 3σ], and in this range only the negative half of them contribute to the residuals. We consider three inputs taking the values −3σ, −2σ, and −σ, whose corresponding residuals are bounded by:

R(−3σ) ≤ (9/2) αβ²σ²,   (24)
R(−2σ) ≤ 2 αβ²σ²,   (25)
R(−σ) ≤ (1/2) αβ²σ².   (26)

Eqns. (24), (25), and (26) say that, at initialization, more than 99.865%, 97.725%, and 84.135% of the inputs (the probabilities of falling in [−3σ, +∞), [−2σ, +∞), and [−σ, +∞), respectively) have residuals less than (9/2)αβ²σ², 2αβ²σ², and (1/2)αβ²σ², respectively. Here, y has unit variance because of BN. If α and β are initialized with 1, more than 84.135% of the inputs have residuals less than 0.5. Furthermore, consider the negative inputs whose residuals are less than some threshold ε. From R(y) ≤ (αβ²/2) y² ≤ ε,

|y| ≤ sqrt( 2ε / (αβ²) ).   (27)

If α and β are initialized with 1, then for ε = 0.01 we obtain:

|y| ≤ sqrt(0.02) ≈ 0.14.   (28)

This means that about 55.57% of the inputs have residuals less than 0.01. Although the residuals are not negligible, Eqn. (22) still works well in practice. The analysis can be side-verified by the work of Clevert et al. [8], in which they observed that ELU does not show better performance when used with BN: ELU (α = 1, β = 1) behaves more like LReLU with a slope close to 1, i.e., a nearly linear function, for the whole period of training, since most residuals are small; see Tab. 3, LReLU (D).
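The percentages quoted above follow from the standard normal CDF; a small Python check (our own illustration of the numbers, not part of the original experiments):

```python
import math

def norm_cdf(z):
    # CDF of the standard normal distribution.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

alpha = beta = sigma = 1.0
for k, bound in [(3, 4.5), (2, 2.0), (1, 0.5)]:
    # P(y >= -k*sigma): fraction of inputs whose residual is below (alpha*beta^2/2)*(k*sigma)^2
    print(f"residual < {bound}: {100 * norm_cdf(k):.3f}% of inputs")

y_max = math.sqrt(2 * 0.01 / (alpha * beta ** 2))   # Eqns. (27)/(28): |y| <= sqrt(0.02)
print(f"residual < 0.01: {100 * norm_cdf(y_max):.2f}% of inputs")
# prints ~55.6%; the text quotes 55.57%, obtained with the rounded bound |y| <= 0.14
```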


Without BN. In this case, it is difficult to estimate the residuals analytically. Fortunately, the residuals can easily be computed from the outputs of a convolutional layer. For this purpose, the 30-layer MPELU network without BN from Sec. 4.3 is adopted. By Eqn. (27), we consider the inputs whose residuals are less than {0.01, 0.5, 2, 4.5}, or equivalently (with α = β = 1) whose magnitudes are at most {0.14, 1, 2, 3}.

residual bound | conv1 | conv7 | conv14 | conv20 | conv27
0.01 | 51.24 | 46.83 | 56.70 | 78.18 | 89.45
0.5 | 51.65 | 50.00 | 84.75 | 99.60 | 100
2 | 52.11 | 53.78 | 97.09 | 100 | 100
4.5 | 52.53 | 57.54 | 99.71 | 100 | 100
Table 8: Percentage of units falling in each residual bin. The bins are (0, 0.01), (0, 0.5), (0, 2), and (0, 4.5). Conv{1, 7, 14, 20, 27} are picked from the 27 convolutional layers. For each bin (each row), the deeper the layer, the higher the percentage of units that fall in it. Once the depth reaches 14, most units have residuals of 0.5 or less. It is interesting to note that the outputs of the middle layer, conv14, approximately follow a standard normal distribution.

For simplicity, the statistics are computed every 7 layers. As shown in Tab. 8, the deeper layers satisfy the approximation of Eqn. (15) better. Moreover, once the depth reaches the middle of the network, e.g., conv14, most units have residuals less than 0.5. In addition, the statistics of conv14 are very close to those of a standard normal distribution, which suggests that it plays a role similar to BN, ensuring that gradients can be properly propagated to the lower layers at initialization. We argue that the residuals are acceptable for the purpose of initialization; Sec. 4.3 has already demonstrated the effectiveness of the proposed initialization.

5 Deep MPELU Residual Networks

Sec. 4 shows that MPELU and the proposed initialization bring benefits to plain networks. This section presents a deep MPELU ResNet to show that the proposed methods are especially suitable for the ResNet architecture [11] and provide state-of-the-art performance on the CIFAR-10/100 datasets.

5.1 MPELU and Batch Normalization

This section demonstrates that MPELU, as opposed to ELU, can be used with BN. Clevert et al. [8] found that BN can improve ReLU networks, but not (and can even be harmful to) ELU networks. Observing this, Shah et al. [32] proposed removing most BN layers when constructing ResNets with ELU. While removing BN could lower the barrier between the two, it tends to diminish BN's desired regularization properties, which may have an unexpected negative effect on generalization capability. We argue that a proper way to alleviate the problem is to introduce the learnable parameters α and β.

# layers | 20 | 32 | 44 | 56 | 110 | # params
ResNet [11] | 8.75 | 7.51 | 7.17 | 6.97 | 6.43 (6.61 ± 0.16) | 1.73M
ELU ResNet | 7.980 | 7.872 | 7.714 | 7.844 | 8.11 (8.36 ± 0.29) | 1.73M
MPELU ResNet (A) | 8.12 | 7.35 | 6.90 | 6.72 | 6.21 (6.89 ± 0.47) | 1.74M
MPELU ResNet (B) | 8.16 | 7.12 | 6.67 | 6.27 | 5.64 (5.77 ± 0.15) | 1.74M
Table 9: Classification error on CIFAR-10. ReLU is simply replaced with ELU or MPELU. The mean test error over 5 runs is reported, except that we show best (mean ± std) for depth 110. In MPELU ResNet (A), α and β are initialized with 1 and updated by SGD with weight decay. For (B), we pay special attention to the MPELU after the addition and initialize its α and β with 98 and 0.01, respectively.

To examine this, we simply replace ReLU with ELU or MPELU in ResNet, keeping all other settings unchanged. α and β in MPELU (A) are initialized with 1 and updated by SGD with weight decay. Tab. 9 shows that the ELU ResNet performs worse than the original ResNet, again suggesting that BN does not improve ELU networks. In contrast, the MPELU ResNets (A) consistently reduce the test error at all depths.

The improvement over ELU may be explained by Eqn. (2) and thus originates from the learnable parameters in MPELU: Eqn. (2) suggests that the outputs of BN flow into the PReLU submodule first, rather than directly into the ELU submodule. Another possible reason comes from the principle behind ResNet, the hypothesis that it is easier to optimize the residual mapping than the original mapping; the ResNet architecture is derived from the extreme case of this hypothesis, in which the identity mapping is optimal. Compared to ReLU and ELU, MPELU covers a larger solution space, which gives the solvers more opportunities to approximate identity mappings and therefore improves performance. To verify this, we pay special attention to the MPELU layers after the addition, initializing their α and β with 98 and 0.01, respectively. By doing so, the shortcut connection and the MPELU layer after the addition together approximate an identity mapping. Following the philosophy in [11], if an identity mapping were optimal, it would be easier to learn it with a shortcut connection plus such an MPELU layer than with a shortcut plus a ReLU or ELU layer, since neither ReLU nor ELU covers the identity mapping. The results are given as MPELU ResNets (B). Tab. 9 shows that the MPELU ResNets (B) consistently outperform their counterparts by a large margin, demonstrating the benefit of the larger solution space introduced by the learnable parameters.
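A quick numeric illustration of this initialization (our own check, not part of the original experiments): with α = 98 and β = 0.01, the negative branch of MPELU stays close to the identity over a typical range of pre-activations, since α(e^{βx} − 1) ≈ αβx = 0.98x for small βx.

```python
import numpy as np

def mpelu(x, alpha, beta):
    return np.where(x > 0, x, alpha * (np.exp(beta * x) - 1.0))

x = np.linspace(-5.0, 5.0, 11)
y = mpelu(x, alpha=98.0, beta=0.01)
print(np.max(np.abs(y - x)))   # ~0.22: small deviation from the identity on [-5, 5]
```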

5.2 Network Architectures

Figure 4: Various residual blocks. (a) The non-bottleneck block in [11]; (b) the MPELU non-bottleneck block; (c) the full pre-activation bottleneck block in [15]; (d) the MPELU full pre-activation bottleneck block.

He et al. [11, 15] investigated the usage of activation functions in deep residual networks. The resulting ResNet and Pre-ResNet architectures are highly optimized for ReLU. Even though performance can be improved by simply replacing ReLU with MPELU, as shown in Sec. 5.1, we expect further benefit from an adjusted deployment. For this reason, this section proposes a variant of the residual architecture, MPELU ResNet, which includes two types of blocks, non-bottleneck and bottleneck, as described in the following.

MPELU Non-bottleneck Residual Block. This block (Fig. 4(b)) is a simplification of the original non-bottleneck residual block of ResNet [11] (Fig. 4(a)). The experimental results in Sec. 5.1 suggest that a ResNet using MPELU has more opportunities to find a good solution than one using ReLU or ELU. However, introducing a nonlinear unit (e.g., MPELU) after the addition would still affect optimization: if an identity mapping were optimal, to the extreme, it would require the solvers to fit an identity mapping through a stack of nonlinear units in addition to pushing the residual functions to zero. Inspired by [15, 33], the identity mapping is therefore constructed directly, as shown in Fig. 4(b), by removing the MPELU after the addition instead of having it fit by the solvers.

Figure 5: Alternatives for the residual function. (a) MPELU-only pre-activation block ending with a BN; (b) MPELU-only pre-activation block; (c) nopre-activation block with a BN; (d) nopre-activation bottleneck block; (e) nopre-activation block without BN.

MPELU Bottleneck Residual Block. A naive MPELU bottleneck block can be obtained by simply replacing ReLU (Fig. 4(c)) with MPELU (Fig. 4(d)). This full pre-activation structure, however, is highly optimized for ReLU.

This section presents a nopre-activation bottleneck block optimized for MPELU (see Fig. 5(d)). Since the pre-activation part is removed, the complexity and the number of parameters of this block are reduced; as a consequence, the final complexity and parameter count of the entire network are comparable to the original. Besides, we adopt a BN plus an MPELU right after the first convolution layer, and a BN plus an MPELU right after the last element-wise addition of the entire network. These two added BN layers are important for the nopre-activation bottleneck block, as we demonstrate empirically below. In addition to this structure, other alternatives (see Fig. 5) are also investigated.
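A compact PyTorch sketch of an MPELU module and one possible reading of the nopre-activation bottleneck block of Fig. 5(d) (our own illustration, not the authors' released Caffe/Torch code; the exact placement of the BN layers inside the block and the layer widths are assumptions):

```python
import torch
import torch.nn as nn

class MPELU(nn.Module):
    """Channel-wise MPELU: x for x > 0, alpha * (exp(beta * x) - 1) otherwise."""
    def __init__(self, channels, alpha=0.25, beta=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1, 1), alpha))
        self.beta = nn.Parameter(torch.full((1, channels, 1, 1), beta))

    def forward(self, x):
        pos = torch.relu(x)
        neg = self.alpha * (torch.exp(self.beta * torch.clamp(x, max=0.0)) - 1.0)
        return pos + neg

class MPELUNopreBottleneck(nn.Module):
    """conv1x1-BN-MPELU, conv3x3-BN-MPELU, conv1x1, then addition (no activation after it)."""
    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        out_planes = 4 * planes
        self.conv1 = nn.Conv2d(in_planes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.act1 = MPELU(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.act2 = MPELU(planes)
        self.conv3 = nn.Conv2d(planes, out_planes, 1, bias=False)
        self.shortcut = (nn.Conv2d(in_planes, out_planes, 1, stride=stride, bias=False)
                         if stride != 1 or in_planes != out_planes else nn.Identity())

    def forward(self, x):
        out = self.act1(self.bn1(self.conv1(x)))
        out = self.act2(self.bn2(self.conv2(out)))
        out = self.conv3(out)
        return out + self.shortcut(x)   # identity mapping preserved after the addition

block = MPELUNopreBottleneck(64, 16)
print(block(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```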

5.3 Results on CIFAR

This section first evaluates the variants and alternatives of the proposed MPELU ResNet and then compares it with state-of-the-art architectures. The implementation details are given in the appendix.

Method | Fig. | 20 | 32 | 44 | 56 | 110 | # params
ResNet [11] | Fig. 4(a) | 8.75 | 7.51 | 7.17 | 6.97 | 6.43 (6.61 ± 0.16) | 1.73M
ResNet [11]* | Fig. 4(a) | 8.16 | 7.06 | 6.99 | 6.58 | 6.27 (6.40 ± 0.18) | 1.73M
MPELU ResNet (non-bottle.) | Fig. 4(b) | 7.71 | 6.73 | 6.26 | 5.95 | 5.35 (5.47 ± 0.14) | 1.74M
Table 10: Test error (%) of non-bottleneck architectures on CIFAR-10. We try different learning rate and weight decay multipliers for α and β, and pick the setting that gives the best performance. We retrained the original ResNet for 200 epochs and denote those results by *.

Method | Fig. | test error (164 layers) | # params
the original Pre-ResNet [15] | Fig. 4(c) | 5.46 | 1.703M
MPELU full pre-activ. | Fig. 4(d) | 5.20 (5.32 ± 0.13) | 1.728M
MPELU-only pre-activ. with BN | Fig. 5(a) | diverged within a few steps | 1.727M
MPELU-only pre-activ. | Fig. 5(b) | 5.49 | 1.712M
MPELU nopre with BN | Fig. 5(c) | diverged within a few steps | 1.713M
MPELU nopre | Fig. 5(d) | 4.87 (5.04 ± 0.14) | 1.696M
MPELU nopre (both added BN layers removed) | - | diverged within a few steps | 1.696M
MPELU nopre (one added BN layer removed) | - | 5.29 | 1.696M
MPELU nopre without BN | Fig. 5(e) | diverged within a few steps | 1.688M
Table 11: Test error (%) of bottleneck architectures on CIFAR-10. α and β are initialized with 0.25 and 1, respectively, and updated by SGD with weight decay.

Classification Results. For the shallower architectures, the MPELU ResNets (non-bottle.) are considered. Tab. 10 shows that they achieve consistent improvements with a negligible increase in parameters. For example, the 110-layer MPELU ResNet reduces the mean test error rate to 5.47%, which is 1.14% lower than the original ResNet-110. Note that this improvement is obtained merely via a simple strategy, changing the usage of activation functions, demonstrating the benefit of MPELU.

When the networks go deeper (164 layers), we focus on the bottleneck architectures to reduce time/memory complexity, as done in [11]. Tab. 11 shows that the MPELU full pre-activ. block, Fig. 4(d), provides a marginal decrease in mean test error, from 5.46% to 5.32%, compared to the original Pre-ResNet, Fig. 4(c). This is done by simply replacing ReLU with MPELU. For the MPELU-only pre-activ. with BN (Fig. 5(a)), the network fails to converge under the initial learning rate 0.1. Following [11], we warm up the training with learning rate 0.01 for one epoch and then switch back to 0.1. With this policy, the network converges, but to a worse solution than the full pre-activ. architecture. Based on this observation, we keep the pre-activation part and remove the BN before the addition (see Fig. 5(b)). Interestingly, this network converges without warming up, reaching a mean test error of 5.49%, which is also worse than the full pre-activ. architecture. Given these results, the MPELU-only pre-activ. architectures are not considered in the rest of the paper.

We therefore focus on the MPELU nopre architecture (Fig. 5(d)) and its variants. Somewhat surprisingly, as shown in Tab. 11, simply removing the pre-activation brings a lower test error rate with fewer parameters and less complexity, which suggests that deep residual architectures have the potential to benefit from MPELU. In addition, we examine the effect of adding BN layers to and removing BN layers from the MPELU nopre architecture. For the former case (Fig. 5(c)), as shown in Tab. 11, adding one more BN before the addition makes the network diverge within a few steps. Seeing this, we tried warming up and found that the network then converged well. Combining this phenomenon with the observations on Fig. 5(a) and ResNet-110 [11], we suspect that a BN before the addition exerts a negative impact on the gradient signals, forcing us to lower the initial learning rate to warm up the training. For the latter case, removing all the BN from the residual function (see Fig. 5(e)) also leads to divergence, and the same happens when both of the added BN layers are removed from the MPELU nopre architecture. However, if one of them is kept, the network still converges and performs slightly worse (5.29% vs. 5.04% mean test error). These results suggest that the two added BN layers are important to the nopre architecture.

Considering the time/memory complexity and model size, the MPELU nopre block is picked as the proposed bottleneck architecture of this paper and is used to compare with other state-of-the-art methods.

Method | settings | depth | # params | CIFAR-10 | CIFAR-100
NIN [25] | - | - | - | 8.81 | -
DSN [35] | - | - | - | 7.97 | 34.57
All-CNN [36] | - | - | - | 7.25 | 33.71
Highway [16] | - | - | - | 7.72 | 32.39
ELU [8] | - | - | - | 6.55 | 24.28
Fitnets [37] | - | - | - | 8.39 | 35.04
ResNet [11] | - | 110 | 1.7M | 6.61 | -
ResNet [11] | - | 1202 | 19.4M | 7.93 | -
sto. ResNet [38] | - | 110 | 1.7M | 5.23 | 24.58
sto. ResNet [38] | - | 1202 | 10.2M | 4.91 | -
Wide ResNet [39] | k = 8 | 16 | 11.0M | 4.81 | 22.07
Wide ResNet [39] | k = 10 | 28 | 36.5M | 4.17 | 20.50
Pre-ResNet [15] | - | 164 | 1.7M | 5.46 | 24.33
Pre-ResNet [15] | - | 1001 | 10.2M | 4.62 (4.69 ± 0.20) | 22.71 (22.68 ± 0.22)
MPELU nopre ResNet (Fig. 5(d)) | | 164 | 1.696M | 4.58 (4.67 ± 0.06) | 21.35 (21.78 ± 0.33)
MPELU nopre ResNet (Fig. 5(d)) | | 1001 | 10.28M | 3.63 (3.78 ± 0.09) | 18.96 (19.08 ± 0.16)
MPELU nopre ResNet (Fig. 5(d)) | | 164 | 1.696M | 4.87 (5.06 ± 0.14) | 23.16 (23.29 ± 0.11)
MPELU nopre ResNet (Fig. 5(d)) | | 164 | 1.696M | 4.43 (4.53 ± 0.12) | 21.69 (21.88 ± 0.19)
MPELU nopre ResNet (Fig. 5(d)) | | 1001 | 10.28M | 3.57 (3.71 ± 0.11) | 18.81 (18.98 ± 0.19)
Table 12: Comparison with state-of-the-art methods on CIFAR-10/100. MPELU is initialized with α = 0.25 and β = 1 (unless otherwise stated in the text), updated by SGD with weight decay. ‡ denotes that the hyper-parameter settings follow [34] (see appendix). Our results are reported as best of 5 runs, with mean ± std.

Comparison with state-of-the-art methods. To compare with the state-of-the-art methods, we adopt an aggressive training strategy from [34] (see the appendix for details), denoted by the symbol ‡.

The test error rates are given in Tab. 12. It is easy to see that, with the training strategy ‡, the mean test error of the MPELU nopre ResNet-164 is considerably reduced, especially on the CIFAR-100 dataset (21.88% vs. 23.29%). This might be because CIFAR-100 is more challenging than CIFAR-10, and training for more epochs with a large learning rate helps the model learn the more elusive underlying concepts. Interestingly, changing the initial value of α to 1 in MPELU further improves the test error on CIFAR-100 (21.78%) but not on CIFAR-10 (4.67%). For comparison, we also trained the 1001-layer MPELU nopre ResNet. Tab. 12 shows that even though more parameters are introduced, the MPELU ResNet architectures do not suffer from overfitting and still enjoy performance gains from the increased parameters and depth. The best results from the proposed MPELU nopre ResNet-1001 are 3.57% test error on CIFAR-10 and 18.81% on CIFAR-100, which are considerably lower than those of the original Pre-ResNet [15].

6 Conclusions

The activation function is a pivotal component of deep neural networks, and much recent work has addressed this subject. This paper generalized the existing work to a new activation function, Multiple Parametric Exponential Linear Units (MPELU). By introducing learnable parameters, MPELU can become either a rectified or an exponential linear unit and combine their advantages. Comprehensive experiments with networks of varying depth (from the 9-layer NIN [25] to the 1001-layer ResNet [11]) were conducted to examine the performance of MPELU. The results showed that MPELU benefits both the classification performance and the convergence of deep networks. In addition, MPELU works well with Batch Normalization, as opposed to ELU. Weight initialization is another important factor in deep neural networks. This paper proposed an initialization for networks using exponential linear units, which complements the current theory in this field; to our knowledge, this is the first method that gives an analytic solution for networks using exponential linear units. Experimental results demonstrated that the proposed initialization not only enables the training of very deep networks using exponential linear units, but also leads to better generalization performance. These experiments also suggested that Batch Normalization might be one of the factors causing the degradation problem. Finally, this paper investigated the usage of MPELU with ResNet and presented deep MPELU residual networks that achieve state-of-the-art accuracy on the CIFAR-10/100 datasets.

Acknowledgement

We would like to acknowledge NVIDIA Corporation for donating the Titan X GPU and for supporting this research. This work was supported by the National Natural Science Foundation of China (Grants No. NSFC-61402046, NSFC-61471067, NSFC-81671651), the Fund for Beijing University of Posts and Telecommunications (Grants No. 2013XD-04, 2015XD-02), the Fund for the National Great Science Specific Project (Grant No. 2014ZX03002002-004), and the Fund for the Beijing Key Laboratory of Work Safety and Intelligent Monitoring.

Appendix: Implementation Details

NIN on CIFAR-10 (Sec. 4.1). During training, all models are trained using SGD with a batch size of 128 for 120k iterations (around 307 epochs). The learning rate is initially set to 0.1 and then decreased by a factor of 10 after 100k iterations. The weight decay and momentum are 0.0001 and 0.9, respectively. The weights are initialized from a zero-mean Gaussian distribution with a standard deviation of 0.01. α and β in MPELU are initialized with 0.25 or 1, and updated by SGD without weight decay. At test time, we adopt the single-view test. Following [40, 25, 16], the data is preprocessed with global contrast normalization and ZCA whitening. When data augmentation is used, patches are randomly cropped from the preprocessed images and then flipped with a probability of 50%.
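To make these settings concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of an MPELU layer with channel-wise learnable α and β, together with an SGD optimizer that applies weight decay to the weights but not to α and β. The module and function names, the channel-wise parameter shape, and the NCHW input assumption are illustrative.

import torch
import torch.nn as nn

class MPELU(nn.Module):
    """Sketch of an MPELU layer: y = x for x > 0 and
    y = alpha * (exp(beta * x) - 1) for x <= 0, with learnable alpha and beta."""
    def __init__(self, num_channels, alpha_init=0.25, beta_init=1.0):
        super().__init__()
        # One (alpha, beta) pair per channel; the 0.25-or-1 initialization
        # mirrors the setting described above.
        self.alpha = nn.Parameter(torch.full((num_channels,), alpha_init))
        self.beta = nn.Parameter(torch.full((num_channels,), beta_init))

    def forward(self, x):
        a = self.alpha.view(1, -1, 1, 1)
        b = self.beta.view(1, -1, 1, 1)
        # Clamp before exp so the unused branch cannot overflow for large inputs.
        neg = a * (torch.exp(b * torch.clamp(x, max=0.0)) - 1)
        return torch.where(x > 0, x, neg)

def build_optimizer(model):
    # SGD with momentum 0.9; weight decay 1e-4 on the weights but
    # no weight decay on the MPELU parameters, as stated above.
    mpelu_params = [p for m in model.modules() if isinstance(m, MPELU)
                    for p in m.parameters()]
    mpelu_ids = {id(p) for p in mpelu_params}
    other_params = [p for p in model.parameters() if id(p) not in mpelu_ids]
    return torch.optim.SGD(
        [{"params": other_params, "weight_decay": 1e-4},
         {"params": mpelu_params, "weight_decay": 0.0}],
        lr=0.1, momentum=0.9)

Splitting the parameters into two groups is simply one way to realize "updated by SGD without weight decay" for α and β while keeping the usual regularization on the convolutional weights.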

The 15-layer networks on ImageNet (Sec. 4.2). The models are trained by SGD with a mini-batch size of 64 for 750k iterations (about 37.5 epochs). The learning rate is 0.01 initially, then divided by 10 at 100k and 600k iterations. The weight decay and momentum are 0.0005 and 0.9, respectively. All images are scaled to a fixed size. During training, a sub-image is randomly sampled from the original image or its flipped version. No further data augmentation is used. At test time, we adopt the single-view test.
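As a quick sanity check on this schedule (assuming the standard ILSVRC-2012 training set of 1,281,167 images), the iteration and epoch counts are consistent: 750,000 iterations × 64 images per iteration = 48,000,000 images processed, and 48,000,000 / 1,281,167 ≈ 37.5 epochs.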

MPELU ResNet on CIFAR-10/100 (Sec. 5.3). The implementation details mainly follow [11] and fb.resnet.torch [33]. Specifically, the models are trained by SGD with a batch size of 128 for 200 epochs (no warming up). The learning rate is initially set to 0.1 and then decreased by a factor of 10 at epochs 81 and 122. The weight decay is set to 0.0001, and the momentum is set to 0.9. α and β in MPELU are updated by SGD with weight decay. All MPELU models are initialized by the proposed method (Sec. 3.2). For comparison, we follow the standard data augmentation implemented in fb.resnet.torch [33]: each image is padded with 4 pixels, and a 32×32 patch is then randomly cropped from it or its horizontal flip. When the aggressive training strategy from [34] is adopted, the models are trained for 300 epochs with a batch size of 64 on two Titan X GPUs (32 per GPU). The learning rate starts at 0.1 and is decreased by a factor of 10 at epochs 150 and 225, as sketched below.
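The augmentation and the two learning-rate schedules can be summarized in a short PyTorch-style sketch. This only illustrates the recipe above (the authors' experiments use fb.resnet.torch in Lua Torch); the transform and scheduler classes are assumptions of the sketch, not the original code.

import torchvision.transforms as T
from torch.optim.lr_scheduler import MultiStepLR

# fb.resnet.torch-style CIFAR augmentation: pad 4 pixels, random 32x32 crop,
# random horizontal flip (per-channel normalization omitted for brevity).
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def make_scheduler(optimizer, aggressive=False):
    # Default: 200 epochs, lr 0.1 divided by 10 at epochs 81 and 122.
    # Aggressive (the dagger setting from [34]): 300 epochs, batch 64,
    # lr 0.1 divided by 10 at epochs 150 and 225.
    milestones = [150, 225] if aggressive else [81, 122]
    return MultiStepLR(optimizer, milestones=milestones, gamma=0.1)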

References

  • [1] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 580–587
  • [2] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3431–3440
  • [3] Li, Z., Tang, J.: Weakly supervised deep metric learning for community-contributed image retrieval. IEEE Transactions on Multimedia 17(11) (2015) 1989–1999
  • [4] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). (2010) 807–814
  • [5] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
  • [6] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML. Volume 30. (2013)
  • [7] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: The IEEE International Conference on Computer Vision (ICCV). (December 2015)
  • [8] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). ICLR (2016)
  • [9] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2015)
  • [10] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CVPR (2016)
  • [12] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. (2015) 448–456
  • [13] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics. (2010) 249–256
  • [14] Krizhevsky, A.: Learning multiple layers of features from tiny images (2009)
  • [15] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV 2016. Springer International Publishing, Cham (2016) 630–645
  • [16] Srivastava, R.K., Greff, K., Schmidhuber, J.: Training very deep networks. In Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., eds.: Advances in Neural Information Processing Systems 28. Curran Associates, Inc. (2015) 2377–2385
  • [17] Gulcehre, C., Moczulski, M., Denil, M., Bengio, Y.: Noisy activation functions. arXiv preprint arXiv:1603.00391 (2016)
  • [18] Li, Z., Liu, J., Tang, J., Lu, H.: Robust structured subspace learning for data representation. IEEE transactions on pattern analysis and machine intelligence 37(10) (2015) 2085–2098
  • [19] Jin, X., Xu, C., Feng, J., Wei, Y., Xiong, J., Yan, S.: Deep learning with S-shaped rectified linear activation units. AAAI (2016)
  • [20] Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. ICML Deep Learning Workshop (2015)
  • [21] Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural computation 18(7) (2006) 1527–1554
  • [22] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al.: Greedy layer-wise training of deep networks. Advances in neural information processing systems 19 (2007) 153
  • [23] Erhan, D., Manzagol, P.A., Bengio, Y., Bengio, S., Vincent, P.: The difficulty of training deep architectures and the effect of unsupervised pre-training. In: AISTATS. Volume 5. (2009) 153–160
  • [24] Mishkin, D., Matas, J.: All you need is a good init. ICLR (2016)
  • [25] Lin, M., Chen, Q., Yan, S.: Network In Network. ICLR (2014)
  • [26] He, K., Sun, J.: Convolutional neural networks at constrained time cost. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015)
  • [27] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014) 1929–1958
  • [28] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., eds.: Computer Vision – ECCV 2014. Volume 8691 of Lecture Notes in Computer Science. Springer International Publishing (2014) 346–361
  • [29] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: AISTATS. Volume 15. (2011) 275
  • [30] Li, Z., Liu, J., Yang, Y., Zhou, X., Lu, H.: Clustering-guided sparse structural learning for unsupervised feature selection. IEEE Transactions on Knowledge and Data Engineering 26(9) (2014) 2138–2150
  • [31] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22Nd ACM International Conference on Multimedia. MM ’14, New York, NY, USA, ACM (2014) 675–678
  • [32] Shah, A., Kadam, E., Shah, H., Shinde, S.: Deep residual networks with exponential linear unit. arXiv preprint arXiv:1604.04112 (2016)
  • [33] Gross, S., Wilber, M.: Training and investigating residual nets (2016)
  • [34] Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
  • [35] Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: AISTATS. Volume 2. (2015)  6
  • [36] Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 (2014)
  • [37] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. ICLR (2015)
  • [38] Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.: Deep networks with stochastic depth. ECCV 2016 (2016)
  • [39] Zagoruyko, S., Komodakis, N.: Wide residual networks. BMVC 2016 (2016)
  • [40] Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of The 30th International Conference on Machine Learning. (2013) 1319–1327