Parametric Exponential Linear Unit for ResNet in Torch from http://arxiv.org/abs/1605.09332
The activation function is an important component in Convolutional Neural Networks (CNNs). For instance, recent breakthroughs in Deep Learning can be attributed to the Rectified Linear Unit (ReLU). Another recently proposed activation function, the Exponential Linear Unit (ELU), has the supplementary property of reducing bias shift without explicitly centering the values at zero. In this paper, we show that learning a parameterization of ELU improves its performance. We analyzed our proposed Parametric ELU (PELU) in the context of vanishing gradients and provide a gradient-based optimization framework. We conducted several experiments on CIFAR-10/100 and ImageNet with different network architectures, such as NiN, Overfeat, All-CNN and ResNet. Our results show that our PELU has relative error improvements over ELU of 4.45% and 5.68% on CIFAR-10 and CIFAR-100, and as much as 7.28% on ImageNet. We also observed that Vgg using PELU tended to prefer activations saturating closer to zero, as in ReLU, except at the last layer, which saturated near -2. Finally, other presented results suggest that varying the shape of the activations during training along with the other parameters helps control vanishing gradients and bias shift, thus facilitating learning.
Over the past few years, Convolutional Neural Networks (CNNs) have become the leading approach in computer vision (Krizhevsky et al., 2012; LeCun et al., 2015; Vinyals et al., 2015; Jaderberg et al., 2015; Ren et al., 2015; Hosang et al., 2016). Through a series of non-linear transformations, CNNs can process high-dimensional input observations into simple low-dimensional concepts. The key principle of CNNs is that features at each layer are composed of features from the layer below. This creates a hierarchical organization of increasingly abstract concepts. Since levels of organization are often seen in complex biological structures, such a hierarchical organization makes CNNs particularly well-adapted for capturing high-level abstractions from real-world observations.
The activation function plays a crucial role in learning representative features. Defined as $f(h) = \max\{h, 0\}$, the Rectified Linear Unit (ReLU) is one of the most popular activation functions (Nair & Hinton, 2010). It has interesting properties, such as low computational complexity, a non-contracting first-order derivative and sparse activations, which have been shown to improve performance (Krizhevsky et al., 2012). The main drawback of ReLU is its zero derivative for negative arguments. This blocks the back-propagated error signal from the layer above, which may prevent the network from reactivating dead neurons. To overcome this limitation, Leaky ReLU (LReLU) adds a positive slope $a$ to the negative part of ReLU (Maas et al., 2013). Defined as $f(h) = \max\{h, 0\} + a \min\{h, 0\}$, where $a > 0$, LReLU has a non-zero derivative for negative arguments. Unlike ReLU, its parameter $a$ allows a small portion of the back-propagated error signal to pass to the layer below. By using a small enough value $a$, the network can still output sparse activations while preserving its ability to reactivate dead neurons. In order to avoid specifying the slope parameter $a$ by hand, Parametric ReLU (PReLU) directly learns its value during back-propagation (He et al., 2015b). As the training phase progresses, the network can adjust its weights and biases in conjunction with the slopes of all its PReLUs, potentially learning better features. Indeed, He et al. (2015b) have empirically shown that learning the slope parameter gives better performance than manually setting it to a pre-defined value.
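The three rectifier variants discussed above can be sketched on scalars as follows. This is a minimal illustration, not the authors' code: the function names and the default slope `a=0.01` are assumptions made for the example, and the derivative at exactly zero is taken from the negative side by convention.

```python
def relu(h):
    """ReLU: max(h, 0)."""
    return max(h, 0.0)


def relu_grad(h):
    """Derivative of ReLU w.r.t. h: zero for non-positive inputs (dead zone)."""
    return 1.0 if h > 0 else 0.0


def leaky_relu(h, a=0.01):
    """Leaky ReLU: max(h, 0) + a * min(h, 0), with a fixed slope a > 0 (assumed default)."""
    return max(h, 0.0) + a * min(h, 0.0)


def leaky_relu_grad(h, a=0.01):
    """Derivative w.r.t. h: 1 on the positive side, a elsewhere, so the error signal still flows."""
    return 1.0 if h > 0 else a


def prelu_slope_grad(h):
    """PReLU uses the LReLU formula but learns a; d f / d a = min(h, 0)."""
    return min(h, 0.0)


if __name__ == "__main__":
    for h in (-2.0, -0.5, 0.5, 2.0):
        print(h, relu(h), relu_grad(h), leaky_relu(h), leaky_relu_grad(h), prelu_slope_grad(h))
```

The only difference between LReLU and PReLU in this sketch is whether `a` is a constant or is updated with `prelu_slope_grad` during training.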
An important, recently proposed activation function is the Exponential Linear Unit (ELU). It is defined as identity for positive arguments and $a(\exp(h) - 1)$ for negative ones (Clevert et al., 2015). The parameter $a$ can be any positive value, but is usually set to $1$. ELU has the interesting property of reducing bias shift, which is defined as the change of a neuron’s mean value due to weight update. If not taken into account, bias shift leads to oscillations and impeded learning (Clevert et al., 2015). Clevert et al. (2015) have shown that either centering the neuron values at zero or using activation functions with negative values can reduce bias shift. Centering the neuron values can be done with the Batch Normalization (BN) method (Ioffe & Szegedy, 2015), while adding negative values can be done with parameterizations such as LReLU or PReLU.
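As a rough illustration of why negative outputs matter, the sketch below implements ELU on scalars and compares mean activations on a small symmetric set of inputs. The input grid is an arbitrary choice for the example, and a mean closer to zero is only a proxy for reduced bias shift, not a measurement of it.

```python
import math


def elu(h, a=1.0):
    """ELU: identity for h >= 0, a * (exp(h) - 1) for h < 0; saturates at -a."""
    return h if h >= 0 else a * (math.exp(h) - 1.0)


def elu_grad(h, a=1.0):
    """Derivative of ELU w.r.t. h; continuous at 0 only when a == 1."""
    return 1.0 if h >= 0 else a * math.exp(h)


def relu(h):
    return max(h, 0.0)


def mean(values):
    return sum(values) / len(values)


# Arbitrary symmetric inputs, chosen only for illustration.
inputs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
mean_relu = mean([relu(h) for h in inputs])
mean_elu = mean([elu(h) for h in inputs])

if __name__ == "__main__":
    print("mean ReLU output:", mean_relu)
    print("mean ELU output: ", mean_elu)
```

Because ELU returns negative values for negative inputs, its mean output on these inputs sits closer to zero than ReLU's.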
Based on the observation that learning a parameterization of ReLU improves performance (He et al., 2015b), we propose the Parametric ELU (PELU) that learns a parameterization of ELU. We define parameters controlling different aspects of the function and propose learning them during back-propagation. Our parameterization preserves differentiability by acting on both the positive and negative parts of the function. Differentiable activation functions usually give better parameter updates during back-propagation (LeCun et al., 2015). PELU also has the same computational complexity as ELU. Since parameters are defined layer-wise instead of per neuron, the number of added parameters is only $2L$, where $L$ is the number of layers. Our experiments on the CIFAR-10/100 and ImageNet datasets have shown that ResNet (Shah et al., 2016), Network in Network (Lin et al., 2013), All-CNN (Springenberg et al., 2015) and Overfeat (Sermanet et al., 2013) all performed better with PELU than with ELU. We finally show that our PELUs in the CNNs adopt different non-linear behaviors during training, which we believe helps the CNNs learn better features.
Our proposed PELU activation function is related to other parametric approaches in the literature. The Adaptive Piecewise Linear (APL) unit learns a weighted sum of $S$ parametrized Hinge functions (Agostinelli et al., 2014). One drawback of APL is that the number of points at which the function is non-differentiable increases linearly with $S$. Moreover, though APL can be either a convex or non-convex function, the rightmost linear function is forced to have unit slope and zero bias. This may be an inappropriate constraint which could affect the representation ability of the CNNs.
Another activation function is Maxout, which outputs the maximum over $K$ affine functions for each input neuron (Goodfellow et al., 2013). The main drawback of Maxout is that it multiplies by $K$ the number of weights to be learned in each layer. For instance, in the context of CNNs, we would apply a max operator over the feature maps of $K$ convolutional layers. This could become too computationally demanding in cases where the CNNs are very deep. Unlike Maxout, our PELU adds only $2L$ parameters, where $L$ is the number of layers.
Finally, the S-Shaped ReLU (SReLU) imitates the Weber-Fechner law and the Stevens law by learning a combination of three linear functions (Jin et al., 2015). Although this parametric function can be either convex or non-convex, SReLU has two points at which it is non-differentiable. Unlike SReLU, our PELU is fully differentiable, since our parameterization acts on both the positive and negative sides of the function. This in turn improves the back-propagation weight and bias updates.
In this section, we present our proposed PELU function and analyze it in the context of vanishing gradients. We also elaborate on the gradient descent rules for learning the parameterization.
The standard Exponential Linear Unit (ELU) is defined as identity for positive arguments and $a(\exp(h) - 1)$ for negative arguments (Clevert et al., 2015). Although the parameter $a$ can be any positive value, Clevert et al. (2015) proposed using $a = 1$ to have a fully differentiable function. For other values $a \neq 1$, the function is non-differentiable at $h = 0$. For this reason, we do not directly learn parameter $a$ during back-propagation. Updating $a$ with the gradient would break differentiability at $h = 0$, which could impede back-propagation.
We start by adding two additional parameters to ELU as follows:
$$ f(h) = \begin{cases} c\,h & \text{if } h \geq 0 \\ a \left( \exp\left( \frac{h}{b} \right) - 1 \right) & \text{if } h < 0 \end{cases}, \quad a, b, c > 0, \tag{1} $$
for which the original ELU can be recovered when $a = b = c = 1$. As shown in Figure 1, each parameter in (1) controls different aspects of the activation. Parameter $c$ changes the slope of the linear function in the positive quadrant (the larger $c$, the steeper the slope), parameter $b$ affects the scale of the exponential decay (the larger $b$, the smaller the decay), while $a$ acts on the saturation point in the negative quadrant (the larger $a$, the lower the saturation point). We also constrain the parameters to be positive to have a monotonic function. Consequently, reducing the weight magnitude during training always lowers the neuron contribution.
Using this parameterization, the network can control its non-linear behavior throughout the course of the training phase. It may increase the slope with $c$ or the decay with $b$ to counter vanishing gradients, and push the mean activation towards zero by lowering the saturation point with $a$ for better managing bias shift. We now look into gradient descent and define update rules for each parameter, so that the network can adjust its behavior as it sees fit. However, a standard gradient update on parameters $a$, $b$, $c$ would make the function non-differentiable at $h = 0$ and impair back-propagation. Instead of relying on a projection operator to restore differentiability after each update, we constrain our parameterization by forcing $f$ to stay differentiable at $h = 0$. We equate the derivatives on both sides of zero, and solve for $c$:
$$ \left. \frac{\partial\, c\,h}{\partial h} \right|_{h=0} = \left. \frac{\partial\, a \left( \exp(h/b) - 1 \right)}{\partial h} \right|_{h=0}, \tag{2} $$
which gives $c = \frac{a}{b}$ as solution. Incorporating (2) gives the proposed Parametric ELU (PELU):
$$ f(h) = \begin{cases} \frac{a}{b}\,h & \text{if } h \geq 0 \\ a \left( \exp\left( \frac{h}{b} \right) - 1 \right) & \text{if } h < 0 \end{cases}, \quad a, b > 0. \tag{3} $$
With this parameterization, in addition to changing the saturation point and exponential decay respectively, both $a$ and $b$ adjust the slope of the linear function in the positive part to ensure differentiability at $h = 0$.
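A minimal scalar sketch of the resulting activation may help make the differentiability constraint concrete. It assumes the form described above: slope `a / b` on the positive side and `a * (exp(h / b) - 1)` on the negative side, with `a, b > 0`. The function names and the one-sided finite-difference check are illustrative choices, not part of the original work.

```python
import math


def pelu(h, a=1.0, b=1.0):
    """PELU sketch: (a/b) * h for h >= 0, a * (exp(h/b) - 1) for h < 0."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return (a / b) * h if h >= 0 else a * (math.exp(h / b) - 1.0)


def pelu_grad(h, a=1.0, b=1.0):
    """Derivative w.r.t. h; both branches equal a/b at h = 0."""
    return a / b if h >= 0 else (a / b) * math.exp(h / b)


def one_sided_slopes(a, b, eps=1e-6):
    """Finite-difference slopes just left and just right of h = 0."""
    left = (pelu(0.0, a, b) - pelu(-eps, a, b)) / eps
    right = (pelu(eps, a, b) - pelu(0.0, a, b)) / eps
    return left, right


if __name__ == "__main__":
    left, right = one_sided_slopes(a=2.0, b=0.5)
    print("left slope:", left, "right slope:", right, "a/b:", 2.0 / 0.5)
    print("saturation for very negative h:", pelu(-100.0, a=2.0, b=0.5))
```

With `a = b = 1` the function reduces to the standard ELU, and for any positive `a`, `b` the two one-sided slopes at zero agree up to finite-difference error.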
To understand the effect of the proposed parameterization, we now investigate the vanishing gradient for the following simple network, containing one neuron in each of its $L$ layers:
$$ x = z_0, \quad h_l = w_l z_{l-1}, \quad z_l = f(h_l), \quad E = \ell(z_L, y), \tag{4} $$
where we have omitted, without loss of generality, the biases for simplicity. In (4), $\ell$ is the loss function between the network prediction $z_L$ and label $y$, which takes value $E$ at $x$. In this case, it can be shown using the chain rule of derivation that the derivative of $E$ with respect to any weight $w_k$ is:
$$ \frac{\partial E}{\partial w_k} = z_{k-1}\, f'(h_k) \left[ \prod_{j=k+1}^{L} f'(h_j)\, w_j \right] \frac{\partial E}{\partial z_L}, $$
where $f'(h_k)$ is a shortcut for $\partial z_k / \partial h_k$. Vanishing gradient happens when the product term inside the bracket has a very small magnitude, which makes $\partial E / \partial w_k \approx 0$. Since the updates are proportional to the gradients, the weights at lower layers converge more slowly than those at higher layers, due to the exponential decrease as $k$ gets smaller (the product has $L - k$ terms). One way the network can fight vanishing gradients is with $f'(h_j)\, w_j \equiv f'(w_j z_{j-1})\, w_j \geq 1$, so that the magnitude of the product does not tend to zero. Therefore, a natural way to investigate vanishing gradient is by analyzing the interaction between weight $w$ and activation $f(wh)$, after dropping layer index $j$ and writing $h$ for the neuron's input. Specifically, our goal is to find the range of values $h$ for which $f'(wh)\, w \geq 1$. This will indicate how precise the activations must be to manage vanishing gradients.
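The chain-rule expression for this one-neuron-per-layer network can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a squared-error loss `0.5 * (z_L - y)**2` (the text leaves the loss generic), a PELU activation with `a = b = 1`, and the notation `z_0 = x`, `h_l = w_l * z_{l-1}`, `z_l = f(h_l)`. All function names are hypothetical.

```python
import math


def f(h, a=1.0, b=1.0):
    """PELU activation (ELU when a = b = 1)."""
    return (a / b) * h if h >= 0 else a * (math.exp(h / b) - 1.0)


def f_prime(h, a=1.0, b=1.0):
    return a / b if h >= 0 else (a / b) * math.exp(h / b)


def forward(x, weights):
    """Return pre-activations [h_1..h_L] and activations [z_0..z_L]."""
    hs, zs = [], [x]
    for w in weights:
        hs.append(w * zs[-1])
        zs.append(f(hs[-1]))
    return hs, zs


def loss(x, y, weights):
    """Assumed squared-error loss on the last activation."""
    _, zs = forward(x, weights)
    return 0.5 * (zs[-1] - y) ** 2


def product_term(x, weights, k):
    """Bracketed product over j = k+1..L of f'(h_j) * w_j (k is 1-based)."""
    hs, _ = forward(x, weights)
    product = 1.0
    for j in range(k + 1, len(weights) + 1):
        product *= f_prime(hs[j - 1]) * weights[j - 1]
    return product


def grad_wk(x, y, weights, k):
    """Chain-rule derivative of the loss w.r.t. w_k (k is 1-based)."""
    hs, zs = forward(x, weights)
    dE_dzL = zs[-1] - y  # derivative of the assumed squared-error loss
    return zs[k - 1] * f_prime(hs[k - 1]) * product_term(x, weights, k) * dE_dzL


def numeric_grad_wk(x, y, weights, k, eps=1e-6):
    """Central finite difference, used only to sanity-check grad_wk."""
    up, down = list(weights), list(weights)
    up[k - 1] += eps
    down[k - 1] -= eps
    return (loss(x, y, up) - loss(x, y, down)) / (2 * eps)


if __name__ == "__main__":
    weights = [0.5] * 10
    for k in (1, 5, 9):
        print("k =", k, "product term =", product_term(1.0, weights, k))
```

With ten weights of 0.5 and a positive input, every factor `f'(h_j) * w_j` equals 0.5, so the product term shrinks geometrically as `k` decreases, which is the vanishing-gradient effect described above.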
Theorem 1. If $w \geq b/a$ and $h < 0$, then $w = \exp(1)\, b/a$ maximizes the interval length of $h$ for which $f'(wh)\, w \geq 1$, which length takes value $l^* = a \exp(-1)$.
Proof. With our proposed PELU, we have:
$$ f'(wh)\, w = \begin{cases} w \frac{a}{b} & \text{if } h \geq 0 \\ w \frac{a}{b} \exp\left( \frac{w}{b} h \right) & \text{if } h < 0 \end{cases} $$
Assuming $w \geq b/a$ and $h < 0$, and using the fact that $f'$ is monotonically increasing, the interval length of $h$ values for which $f'(wh)\, w \geq 1$ is given by the magnitude of the zero of $f'(wh)\, w - 1$. Setting $f'(wh)\, w - 1$ to zero and solving for $h$ gives $|h| = l(w) = \frac{b}{w} \log\left( \frac{w a}{b} \right)$. Using the fact that $w \geq b/a$, it can be shown that $l(w)$ is pseudo-concave, so it has a unique optimum. Maximizing $l(w)$ with respect to $w$ thus amounts to setting its derivative to zero, which gives $l^* = a \exp(-1)$, at $w^* = \exp(1)\, b/a$. ∎
This result shows that in the optimal scenario where $w = \exp(1)\, b/a$, the length of negative values for which $f'(wh)\, w \geq 1$ is no more than $a \exp(-1)$. Without our proposed parameterization ($a = b = 1$), dealing with vanishing gradient is mostly possible with positive arguments, which makes the negative ones (useful for bias shift) hurtful for back-propagation. With the proposed parameterization, $a$ can be adjusted to increase the length $a \exp(-1)$ and allow more negative activations to counter vanishing gradients. The ratio $a/b$ can also be modified to ensure $a/b \geq 1/w$, so that $f'(wh)\, w \geq 1$ for $h \geq 0$. Based on this analysis, the proposed parameterization gives more flexibility to the network, and the experiments in Section 4 have shown that the networks do indeed take advantage of it.
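The optimum stated in this analysis can be checked with a brute-force search. The sketch below assumes the PELU derivative on the negative side is `(a/b) * exp(w*h/b)`, so that the length of negative `h` values satisfying `f'(w*h) * w >= 1` is `(b/w) * log(w*a/b)` when `w*a/b > 1` and zero otherwise. The grid search and its resolution are illustrative choices.

```python
import math


def slope_times_weight(h, w, a=1.0, b=1.0):
    """f'(w*h) * w for PELU, the quantity that should stay >= 1."""
    if h >= 0:
        return w * a / b
    return w * (a / b) * math.exp(w * h / b)


def interval_length(w, a=1.0, b=1.0):
    """Length of the interval of negative h with f'(w*h) * w >= 1."""
    ratio = w * a / b
    return (b / w) * math.log(ratio) if ratio > 1.0 else 0.0


def best_weight_grid(a, b, steps=20000, w_max=20.0):
    """Brute-force search for the weight maximizing interval_length."""
    best_w, best_len = None, -1.0
    for i in range(1, steps + 1):
        w = w_max * i / steps
        length = interval_length(w, a, b)
        if length > best_len:
            best_w, best_len = w, length
    return best_w, best_len


if __name__ == "__main__":
    for a, b in ((1.0, 1.0), (2.0, 0.5), (3.0, 2.0)):
        w_star, l_star = best_weight_grid(a, b)
        print("a =", a, "b =", b,
              "grid optimum:", (w_star, l_star),
              "closed form:", (math.exp(1.0) * b / a, a * math.exp(-1.0)))
```

In this sketch the maximal length depends only on `a`, which matches the observation that increasing `a` widens the range of negative inputs that do not shrink the gradient.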
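PELU is trained simultaneously with all the network parameters during back-propagation. Using the chain rule of derivation, the derivative of objective $E$ with respect to $a$ and $b$ for one layer is:
$$ \frac{\partial E}{\partial a} = \sum_i \frac{\partial E}{\partial f(h_i)} \frac{\partial f(h_i)}{\partial a}, \qquad \frac{\partial E}{\partial b} = \sum_i \frac{\partial E}{\partial f(h_i)} \frac{\partial f(h_i)}{\partial b}, $$
where $i$ sums over all elements of the tensor on which $f$ is applied. The terms $\partial E / \partial f(h_i)$ are the gradients propagated from the above layers, while $\partial f(h) / \partial a$ and $\partial f(h) / \partial b$ are the gradients of $f$ with respect to $a$ and $b$:
$$ \frac{\partial f(h)}{\partial a} = \begin{cases} \frac{h}{b} & \text{if } h \geq 0 \\ \exp\left( \frac{h}{b} \right) - 1 & \text{if } h < 0 \end{cases}, \qquad \frac{\partial f(h)}{\partial b} = \begin{cases} -\frac{a h}{b^2} & \text{if } h \geq 0 \\ -\frac{a}{b^2}\, h \exp\left( \frac{h}{b} \right) & \text{if } h < 0 \end{cases} $$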
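The layer-wise parameter gradients can be sketched and sanity-checked against finite differences. The code assumes the PELU form `(a/b) * h` for `h >= 0` and `a * (exp(h/b) - 1)` for `h < 0`, and treats the layer's tensor as a flat list of scalars. The objective used in the check, a fixed weighted sum of activations, is an assumption chosen so that the upstream gradients are constants.

```python
import math


def pelu(h, a, b):
    return (a / b) * h if h >= 0 else a * (math.exp(h / b) - 1.0)


def dpelu_da(h, a, b):
    """Partial derivative of PELU w.r.t. a."""
    return h / b if h >= 0 else math.exp(h / b) - 1.0


def dpelu_db(h, a, b):
    """Partial derivative of PELU w.r.t. b."""
    if h >= 0:
        return -a * h / b ** 2
    return -(a / b ** 2) * h * math.exp(h / b)


def layer_param_grads(hs, upstream, a, b):
    """Sum upstream gradients times local partials over all tensor elements."""
    grad_a = sum(g * dpelu_da(h, a, b) for h, g in zip(hs, upstream))
    grad_b = sum(g * dpelu_db(h, a, b) for h, g in zip(hs, upstream))
    return grad_a, grad_b


def objective(hs, upstream, a, b):
    """Assumed toy objective: weighted sum of activations with fixed weights."""
    return sum(g * pelu(h, a, b) for h, g in zip(hs, upstream))


if __name__ == "__main__":
    hs = [-2.0, -0.5, 0.0, 0.7, 1.5]
    upstream = [0.3, -1.0, 0.5, 2.0, -0.4]
    print(layer_param_grads(hs, upstream, a=1.5, b=0.8))
```

Because the parameters are shared across the whole layer, each of `a` and `b` receives a single scalar gradient regardless of the tensor size.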
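To preserve the parameter positivity after the updates, we force them to always be greater than $0.1$. The update rules are the following:
$$ \begin{aligned} \Delta a &\leftarrow \mu \Delta a - \alpha \frac{\partial E}{\partial a}, & a &\leftarrow \max\{ a + \Delta a,\ 0.1 \}, \\ \Delta b &\leftarrow \mu \Delta b - \alpha \frac{\partial E}{\partial b}, & b &\leftarrow \max\{ b + \Delta b,\ 0.1 \}. \end{aligned} \tag{8} $$
In (8), $\mu$ is the momentum and $\alpha$ is the learning rate. When specified by the training regimes, we also use a weight decay regularization on both the weight matrices $W$ and PELU parameters. This is different from PReLU, which did not use weight decay to avoid a shape bias towards ReLU. In our case, weight decay is necessary for $a$ and $b$, otherwise the network could circumvent it for the weights $W$ by adjusting $a$ or $b$, a behavior that would be hurtful for training.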
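A single parameter update with momentum, optional weight decay and a positivity floor can be sketched as below. The floor of `0.1`, the default learning rate and momentum, and the choice to fold L2 weight decay into the gradient are assumptions made for this example; the function name is hypothetical.

```python
def pelu_param_step(param, velocity, grad, lr=0.1, momentum=0.9,
                    weight_decay=0.0, floor=0.1):
    """One momentum step for a PELU parameter (a or b), clamped from below.

    weight_decay adds an L2 penalty gradient (weight_decay * param); the floor
    keeps the parameter strictly positive after the update (assumed value).
    """
    grad = grad + weight_decay * param
    velocity = momentum * velocity - lr * grad
    param = max(param + velocity, floor)
    return param, velocity


if __name__ == "__main__":
    a, va = 1.0, 0.0
    for step in range(5):
        # A constant positive gradient pushes the parameter down until it hits the floor.
        a, va = pelu_param_step(a, va, grad=2.0)
        print(step, a, va)
```

Clamping after the step, rather than projecting the gradient, keeps the update rule identical to plain momentum SGD whenever the parameter stays above the floor.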
In this section, we present our experiments in supervised learning on the CIFAR-10/100 and ImageNet tasks. Our goal is to show that, with the same network architecture, parameterizing ELU improves the performance. We also provide results with the ReLU activation function for reference.
We performed object classification on the CIFAR-10 and CIFAR-100 datasets (60,000 32x32 colored images, 10 and 100 classes respectively) (Krizhevsky et al., 2012). We trained a residual network (ResNet) with the identity function for the skip connection and bottleneck residual mappings with shape (Conv + ACT)x2 + Conv + BN (He et al., 2015a; Shah et al., 2016). The ACT module is either PELU, ELU or BN+ReLU. We performed standard center crop + horizontal flip for data augmentation. Only color-normalized 32x32 images were used during the test phase.
We also evaluated our proposed PELU on a smaller convolutional network. We refer to this network as SmallNet. It contains three convolutional layers followed by two fully connected layers. The convolutional layers were respectively composed of 32, 64 and 128 3x3 filters with 1x1 stride and 1x1 zero padding, each followed by ACT, 2x2 max pooling with a stride of 2x2 and dropout with probability 0.2. The fully connected layers were defined as 2048 → 512, followed by ACT, dropout with probability 0.5, and a final linear layer 512 → 10 for CIFAR-10 and 512 → 100 for CIFAR-100. We performed global pixel-wise mean subtraction, and used horizontal flip as data augmentation.
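The 2048-dimensional input of the first fully connected layer follows from the layer description, as the shape arithmetic below illustrates. The helper names are hypothetical, and the count of PELU parameters assumes one ACT module after each of the three convolutions and one after the first fully connected layer, with two layer-wise parameters each.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial size after a convolution (3x3, stride 1, zero padding 1 by default)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after 2x2 max pooling with stride 2."""
    return (size - kernel) // stride + 1


def smallnet_flat_features(image_size=32, channels=(32, 64, 128)):
    """Number of features fed to the first fully connected layer."""
    size = image_size
    for _ in channels:
        size = pool_out(conv_out(size))
    return channels[-1] * size * size


def smallnet_pelu_params(num_act_modules=4, params_per_module=2):
    """Layer-wise PELU parameters (a and b per ACT module); module count is an assumption."""
    return num_act_modules * params_per_module


if __name__ == "__main__":
    print("flattened features:", smallnet_flat_features())
    print("PELU parameters:", smallnet_pelu_params())
```

Each 3x3 convolution with padding 1 preserves the spatial size, so only the three pooling stages shrink the 32x32 input, down to 4x4 with 128 channels.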
Table 1 presents the test error results (in %) of SmallNet and ResNet110 on both tasks, with ELU, ReLU and PELU. For SmallNet, PELU reduced the error of ELU from 14.81% to 13.54% on CIFAR-10, and from 39.76% to 38.93% on CIFAR-100, which corresponds to a relative improvement of 8.58% and 2.09% respectively. As for ResNet110, PELU reduced the error of ELU from 5.62% to 5.37% on CIFAR-10, and from 26.55% to 25.04% on CIFAR-100, which corresponds to a relative improvement of 4.45% and 5.68% respectively. These results suggest that parameterizing the ELU activation improves its performance.
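The relative improvements quoted above are ratios of the error reduction to the baseline (ELU) error. A short check, using only the error rates reported in the paragraph, could look like this; the function name is an illustrative choice.

```python
def relative_improvement(baseline_error, new_error):
    """Relative error reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# (network, dataset, ELU error %, PELU error %) as reported in the text.
reported = [
    ("SmallNet", "CIFAR-10", 14.81, 13.54),
    ("SmallNet", "CIFAR-100", 39.76, 38.93),
    ("ResNet110", "CIFAR-10", 5.62, 5.37),
    ("ResNet110", "CIFAR-100", 26.55, 25.04),
]

if __name__ == "__main__":
    for network, dataset, elu_err, pelu_err in reported:
        gain = relative_improvement(elu_err, pelu_err)
        print(f"{network} on {dataset}: {gain:.3f}% relative improvement")
```

The computed values agree with the quoted 8.58%, 2.09%, 4.45% and 5.68% to within rounding in the second decimal.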
It is worth noting for ResNet110 that weight decay played an important role in obtaining these performances. Preliminary experiments conducted with a weight decay of 0.0001 showed no significant improvements of PELU over ELU. We observed larger differences between the train and test set error percentages, which indicated possible over-fitting. By increasing the weight decay to 0.001, we obtained the performance improvements shown in Table 1. Importantly, we did not have to increase the weight decay for SmallNet. The PELU, ELU and BN+ReLU SmallNets used the same decay. Although these results suggest that residual networks with PELU activations may be more prone to over-fitting, weight decay can still be used to correctly regularize the ResNets.
We finally tested the proposed PELU on the ImageNet 2012 task (ILSVRC2012) using four different network architectures: ResNet18 (Shah et al., 2016), Network in Network (NiN) (Lin et al., 2013), All-CNN (Springenberg et al., 2015) and Overfeat (Sermanet et al., 2013). We used either PELU, ELU or BN+ReLU for the activation module. Due to NiN’s relatively complex architecture, we added BN after each max pooling layer (every three layers) to further reduce vanishing gradients. Each network was trained with momentum-based stochastic gradient descent, with the training regimes shown in Table 2. Regime #1 starts at a higher learning rate (1e-1) than regime #2 (1e-2), and has a larger learning rate decay of 10 compared to 2 and 5.
Table 2: Training regimes. Regime #1 (ResNet18, NiN); Regime #2 (Overfeat, All-CNN).
Figure 2 presents the TOP-1 error rate (in %) of all four networks on the ImageNet 2012 validation dataset. We see from these figures that PELU consistently obtained the lowest error rates for all networks. The best result was obtained with NiN. In this case, PELU improved the error rate from 40.40% (ELU) to 36.06%, which corresponds to a relative improvement of 7.29%. Importantly, NiN obtained these improvements at little computational cost. It only added 24 additional parameters, i.e. a 0.0003% increase in the number of parameters. This suggests that PELU acts on the network in a different manner than the weights and biases. Such a low number of parameters cannot significantly increase the expressive power of the network. We would not have seen such a large improvement by adding 24 additional weights to a convolutional layer with the ELU activation.
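The parameter overhead quoted for NiN can be put in perspective with a few lines of arithmetic. The sketch assumes two layer-wise PELU parameters per activation module, so 24 added parameters would correspond to 12 modules, and it back-computes an approximate network size from the rounded 0.0003% figure. Both derived numbers are inferences for illustration, not values stated in the text.

```python
def pelu_extra_params(num_pelu_layers, params_per_layer=2):
    """Parameters added by layer-wise PELU: one (a, b) pair per activation module."""
    return params_per_layer * num_pelu_layers


def implied_total_params(extra_params, percent_increase):
    """Approximate base parameter count implied by a percentage increase."""
    return extra_params / (percent_increase / 100.0)


if __name__ == "__main__":
    extra = pelu_extra_params(12)  # 12 modules is inferred from 24 = 2 * L
    print("added parameters:", extra)
    print("implied network size (approx.):", round(implied_total_params(extra, 0.0003)))
```

Since 0.0003% is a rounded figure, the implied network size is only an order-of-magnitude estimate (millions of weights against a couple of dozen PELU parameters).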
We see from the curves in Figure 2 that the training regime has an interesting effect on the convergence of the networks. The performance of PELU is closer to the performance of ELU for regime #2, while it is significantly better than ELU for regime #1. We also see that the error rates of All-CNN and Overfeat with PELU increase by a small amount starting at epoch 44. Since ELU and ReLU do not have this error rate increase, this shows possible over-fitting for PELU. For regime #1, the error rates decrease more steadily and monotonically. Although performing more experiments would improve our understanding, these results suggest that PELU could and should be trained with larger learning rates and decays for obtaining better performance improvements.
In this section, we elaborate on other parameter configurations and perform a visual evaluation of the parameter progression throughout the training phase.
The proposed PELU activation function (3) has two parameters $a$ and $b$, where $a$ is used with a multiplication and $b$ with a division. A priori, any of the four configurations $(a, 1/b)$, $(a, b)$, $(1/a, 1/b)$ or $(1/a, b)$ could be used as parameterization. In this section, we show experimentally that PELU with the proposed configuration $(a, 1/b)$ is the preferred choice, as it achieves the best overall results.
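One way to read the four configurations is as different ways of mapping the learned pair `(a, b)` onto a saturation scale and an exponent rate. The sketch below encodes that reading; it is an interpretation for illustration, with hypothetical names, in which the first entry multiplies the exponential branch and the second multiplies `h` inside the exponent, and the positive slope is always their product so that every configuration stays differentiable at zero.

```python
import math


def pelu_general(h, scale, rate):
    """scale * (exp(rate * h) - 1) for h < 0, and slope scale * rate for h >= 0."""
    return scale * rate * h if h >= 0 else scale * (math.exp(rate * h) - 1.0)


# Assumed mapping from learned (a, b) to (scale, rate) for each configuration.
CONFIGS = {
    "(a, 1/b)": lambda a, b: (a, 1.0 / b),      # proposed form: a * (exp(h / b) - 1)
    "(a, b)": lambda a, b: (a, b),
    "(1/a, 1/b)": lambda a, b: (1.0 / a, 1.0 / b),
    "(1/a, b)": lambda a, b: (1.0 / a, b),
}


def pelu_config(h, a, b, name):
    scale, rate = CONFIGS[name](a, b)
    return pelu_general(h, scale, rate)


if __name__ == "__main__":
    # Small a and b mimic what weight decay encourages.
    for name in CONFIGS:
        print(name, "f(-1) with a = b = 0.1:", pelu_config(-1.0, 0.1, 0.1, name))
```

Under this reading, shrinking `a` and `b` with the `(a, 1/b)` configuration flattens the negative branch towards zero, which is the ReLU-like shape that weight decay would favor; the other configurations respond differently to the same shrinkage.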
To evaluate the effect of parameter configuration, we trained several CNNs on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2012). We used ResNets with depths of 20, 32, 44, 56 and 110. The ResNets had the identity function for the skip connection and two different residual mappings (He et al., 2015a; Shah et al., 2016). We used a basic Conv + PELU + Conv + BN block for depths 20, 32 and 44, and a bottleneck block with shape (Conv + PELU)x2 + Conv + BN for depths 56 and 110. We report the averaged error rate achieved over five tries.
First, we can see from the results presented in Figure 3 that the error rate reduces as the network gets deeper. This is in conformity with our intuition that deeper networks have more representative capability. Also, we can see that the proposed configuration $(a, 1/b)$ obtained the best performance overall. Configuration $(a, 1/b)$ obtained 5.37% error rate on CIFAR-10 and 25.04% error rate on CIFAR-100. We believe this improvement is due to weight decay. When using configuration $(a, 1/b)$ along with weight decay, pushing parameters $a$ and $b$ towards zero encourages PELU to be similar to ReLU. In this case, the CNN is less penalized for using ReLU and more penalized for using other parametric forms. This may help the CNN to use as many PELUs that look like ReLUs as it needs without incurring a large penalty. The experiments in section 5.2 partly support our claim. Although the performance improvement of $(a, 1/b)$ is relatively small in comparison to the other three configurations, configuration $(a, 1/b)$ should be preferred for subsequent experiments.
We perform a visual evaluation of the non-linear behaviors adopted by a Vgg network during training (Simonyan & Zisserman, 2014). To this effect, we trained a Vgg network with PELU activations on the CIFAR-10 dataset. We performed global pixel-wise mean subtraction, and used horizontal flip as data augmentation. The trained network obtained 6.95% and 29.29% error rate on CIFAR-10 and CIFAR-100 respectively.
Figure 3 shows the progression of the slope ($a/b$) and the negative of the saturation point (parameter $a$) for PELU at layers 2, 7, 10 and 14. We can see different behaviors. In layer 2, the slope quickly increases to a large value (around 6) and slowly decreases to its convergence value. We observe a similar behavior for layer 7, except that the slope increases at a later iteration. Layer 2 increases at about iteration 650 while layer 7 at about iteration 1300. Moreover, in contrast to layers 2 and 7, the slope in layer 10 increases to a smaller value (around 2) and does not decrease after reaching it. Layer 14 also displays a similar behavior, but reaches a much higher value (around 15). We believe that adopting these behaviors helps early during training to disentangle redundant neurons. Since peak activations scatter the inputs more than flat ones, spreading neurons at the lower layers may allow the network to unclutter neurons activating similarly. This may help the higher layers to more easily find relevant features in the data.
The saturation point in layers 2, 7 and 10 converges in the same way to a value near zero, while in layer 14 it reaches a value near -2. This is an interesting behavior as using a negative saturation reduces bias shift. In another experiment with Vgg, we tried adding BN at different locations in the network. We saw similar convergence behaviors for the saturation point. It seems that, in this specific case, the network could counter bias shift with only the last layer, and favored sparse activations in the other layers. These results suggest that the network takes advantage of the parameterization by using different non-linear behaviors at different layers.
The activation function is a key element in Convolutional Neural Networks (CNNs). In this paper, we proposed learning a parameterization of the Exponential Linear Unit (ELU) function. Our analysis of our proposed Parametric ELU (PELU) suggests that CNNs with PELU may have more control over bias shift and vanishing gradients. We performed several supervised learning experiments and showed that networks trained with PELU consistently improved their performance over ELU. Our results suggest that the CNNs take advantage of the added flexibility provided by learning the proper activation shape. As training progresses, we have observed that the CNNs change the parametric form of their PELU both across the layers and across the epochs. In terms of possible implications of our results, parameterizing other activation functions could be worth investigating. Functions like Softplus, Sigmoid or Tanh may prove to be successful in some cases with proper parameterizations. Other interesting avenues for future work include applying PELU to other network architectures, such as recurrent neural networks, and to other tasks, such as object detection.
We thankfully acknowledge the support of Nvidia Corporation for providing the Tesla K80 and K20 GPUs for our experiments.
Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814, 2010.