ProbAct: A Probabilistic Activation Function for Deep Neural Networks

05/26/2019 ∙ Joonho Lee, et al. ∙ Kyushu University

Activation functions play an important role in the training of artificial neural networks, and the Rectified Linear Unit (ReLU) has been the mainstream choice in recent years. Most activation functions currently in use are deterministic, with a fixed input-output relationship. In this work, we propose a probabilistic activation function, called ProbAct. The output of ProbAct is sampled from a normal distribution whose mean equals the output of ReLU and whose variance is either fixed or trainable for each element. In the trainable version of ProbAct, the variance of the activation distribution is learned through back-propagation. We also show that the stochastic perturbation introduced by ProbAct is a viable generalization technique that can prevent overfitting. In our experiments, we demonstrate that using ProbAct boosts image classification performance on the CIFAR-10, CIFAR-100, and STL-10 datasets.


1 Introduction

Activation functions add non-linearity to neural networks and thus the ability to learn complex functional mappings from data article2 . Different activation functions with different characteristics have been proposed thus far. Sigmoid Cybenko1989 and hyperbolic tangent (Tanh) were especially popular during the early usage of neural networks schmidhuber2015deep . These functions were used due to their monotonicity, continuity, and boundedness. However, their derivatives become small when the absolute value of the input becomes large, which leads to the vanishing gradient problem when using Stochastic Gradient Descent (SGD).

In recent times, the Rectified Linear Unit (ReLU) nair2010rectified has become an extremely popular activation function for neural networks. ReLU is defined as:

ReLU(x) = max(0, x),    (1)

and its derivative is:

d ReLU(x)/dx = 1 if x > 0, and 0 otherwise.    (2)

ReLU alleviates the vanishing gradient problem because its derivative is the identity for positive inputs. However, the zero output for all negative inputs creates the dead neurons problem. To counter this problem, several variants of ReLU have been proposed, such as Leaky ReLU xu2015empirical , Parametric ReLU (PReLU) he2015delving , and the Exponential Linear Unit (ELU) 2015arXiv151107289C .

In this work, we propose a new activation function, called ProbAct, which is not only trainable but also stochastic. The idea of ProbAct is inspired by the stochastic behaviour of biological neurons: noise in neuronal spikes can arise from uncertain bio-mechanical effects lewicki1998review . We emulate a similar behaviour in the information flow to the neurons by injecting stochastic samples from a normal distribution into the activations. Simply speaking, even for the same input value x, the output of ProbAct varies because of the stochastic perturbation; this is the largest difference from conventional activation functions.

The induced perturbations prove effective in preventing the network from overfitting during training. The operation resembles data augmentation, and hence we call it augmentation-by-activation. Furthermore, we show that ProbAct increases classification accuracy by 2-3% compared to ReLU and other conventional activation functions on established image datasets. The main contributions of our work are as follows:

  • We introduce a novel activation function, called ProbAct, whose output undergoes stochastic perturbation.

  • We propose a method of governing the stochastic perturbation by using a parameter trained through back-propagation.

  • We show that ProbAct improves the performance on various visual object classification tasks.

  • We also show that the improvement by ProbAct is realized by the augmentation-by-activation, which acts as a stochastic regularizer to prevent overfitting of the network.

2 Related Work

2.1 Activation Functions

Various approaches have been applied to improve activation functions. Research on activation functions broadly follows three trends: fixed activation functions, adaptive activation functions, and activation functions with non-parametric inner structures.

Fixed activation functions are functions that are fixed before training, such as sigmoid, tanh, and ReLU. In particular, ReLU has led to significant improvements in neural network research. However, ReLU is a ramp function that outputs zero for all negative inputs, which can cause neurons to die. Variants of ReLU were therefore suggested to solve this problem. For example, Leaky ReLU xu2015empirical uses a small fixed slope for negative inputs. Other activation functions, like Swish Ramachandran2018SearchingFA and Exponential Linear Sigmoid SquasHing (ELiSH) Basirat2018TheQF , take a different approach and use bounded negative regions.

Adaptive activation functions use trainable parameters in order to optimize the activation function. For example, Parametric ReLUs (PReLUs) he2015delving are similar to Leaky ReLUs but with a trainable parameter instead of a fixed value. In addition, the S-shaped ReLU (SReLU) Jin:2016:DLS:3016100.3016142 and the Parametric ELU (PELU) trottier2017parametric were suggested to improve the performance of conventional ReLU functions.

Non-parametric activation functions were suggested to further increase the flexibility of activation functions. The Maxout activation function, suggested by Goodfellow et al. pmlr-v28-goodfellow13 , approximates arbitrary convex functions piece-wise linearly by taking the maximum over a set of affine transforms of the input.

Also, kernel-based non-parametric activation functions for neural networks (Kafnets) were suggested by Scardapane et al. SCARDAPANE201919 ; they allow the non-linear combination of information from various paths in the network. In another work, an activation ensemble method was suggested by Harmon and Klabjan harmon2017activation .

2.2 Generalization and Stochastic Methods

There has been comparatively little research on stochastic activation functions because of the cost of the sampling process. Noisy activation functions pmlr-v48-gulcehre16 address this by adding noise to the non-linearity in proportion to its degree of saturation. RReLU xu2015empirical uses Leaky ReLUs with randomized slopes during training and a fixed slope during testing.

There are many other generalization techniques that use random distributions or stochastic functions. For example, dropout srivastava2014dropout generalizes by removing connections at random during training and can be interpreted as a form of model averaging. Liu et al. suggested Random Self-Ensemble (RSE), which combines randomness and ensemble learning liu2018towards . Inayoshi and Kurita proposed back-propagating noise to the hidden layers Inayoshi . Furthermore, the effects of adding noise to the inputs Bishop_1995 ; An_1996 , the gradients Audhkhasi_2013 ; neelakantan2015adding , and the weights Murray_1993 ; An_1996 ; NIPS2011_4329 ; pmlr-v37-blundell15 have been studied. In this work, we adopt the view that stochastic neurons with sparse representations provide internal regularization, as shown by Bengio bengio2013estimating .

Introducing Bayesian inference on the weights of a network mackay1992practical also yields a regularization effect. However, the exact Bayesian posterior of the network is intractable, and graves2011practical ; hinton1993keeping ; shridhar2019comprehensive use variational inference to approximate it. blundell2015weight showed that this approach attains performance similar to dropout.

ProbAct can be thought of as a way to add noise to the network. The difference from other noise injection methods is that we propose an adaptable, trainable variance that is injected at every layer. To the best of our knowledge, our proposed method is the first approach that incorporates stochastic noise into the activation function itself.

Figure 1: Comparison of (a) ReLU and (b) the proposed activation function, ProbAct. (c) The effect of the stochastic perturbation by ProbAct on the feature spaces in a neural network.

3 ProbAct: A Stochastic Activation Function

In this section, we define the proposed method, ProbAct. In general, each layer of a neural network computes its output y for a given input x:

y = f(w^T x),    (3)

where w is the weight vector of the layer and f is the activation function, such as ReLU. As shown in Figure 1, we introduce a random variable into ReLU to create a stochastic activation function. Specifically, ProbAct is defined as:

f(x) = μ(x) + z,    (4)

where μ(x) is ReLU (i.e., μ(x) = max(0, x)) and the perturbation term z is:

z = σ · ε.    (5)

The perturbation parameter σ is a fixed or trainable value which specifies the range of the stochastic perturbation, and ε is a random value sampled from the standard normal distribution N(0, 1). The value of σ is either determined manually or trained along with the other network parameters (i.e., the weights) with a simple implementation. If σ = 0, ProbAct behaves exactly as ReLU; in this sense, ProbAct is a generalization of ReLU.
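To make Eqs. (4) and (5) concrete, below is a minimal sketch of a fixed-σ ProbAct layer in PyTorch. The framework choice, the class name ProbActFixed, and the default sigma value are our own illustration rather than the authors' released implementation; sampling is kept active at test time as well, consistent with the non-trivial test-time overheads reported in Table 1.

# Minimal sketch of a fixed-sigma ProbAct (Eqs. (4)-(5)); names and defaults are illustrative.
import torch
import torch.nn as nn


class ProbActFixed(nn.Module):
    """y = ReLU(x) + sigma * eps, with eps ~ N(0, 1)."""

    def __init__(self, sigma: float = 0.5):
        super().__init__()
        self.sigma = sigma  # fixed perturbation range (hyper-parameter)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = torch.relu(x)            # mean term mu(x) = max(0, x)
        eps = torch.randn_like(mu)    # eps ~ N(0, 1)
        return mu + self.sigma * eps  # stochastic perturbation around ReLU(x)

A module built this way can be dropped in wherever nn.ReLU() would otherwise be used.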

3.1 Setting the Parameter for Stochastic Perturbation

The parameter σ specifies the range of the stochastic perturbation. In the following, we consider two cases of setting σ: fixed and trainable.

3.1.1 Fixed Case

There are several ways to choose the desired σ. The simplest is to set σ to a constant value as a hyper-parameter. Choosing one constant σ for all units is reasonable because ε is randomly sampled from a normal distribution and σ merely acts as a scaling factor on the sampled value. This can be interpreted as a repeated addition of Gaussian noise to the activation maps, which helps in better optimization of the network bengio2013estimating . The network can be optimized with gradient-based learning because it follows Theorem 1:

Theorem 1

The gradient propagation of a stochastic unit h based on a deterministic function f with inputs c (a vector containing the outputs of other neurons), internal parameters a (weights and bias), and noise z is possible if f has non-zero gradients with respect to c and a bengio2013estimating :

h = f(c, a, z).    (6)

The noise term z = σ · ε is Gaussian, since ε is sampled from a Gaussian distribution and scaling by the constant σ does not affect its Gaussian properties. This ensures that learning with gradient-based methods is possible. The proposed method does not significantly increase the number of parameters in the architecture, so the computational cost is relatively low. However, choosing the best σ is as difficult as setting any other hyper-parameter of a network.

3.1.2 Trainable Case

Using a trainable σ removes the need to determine σ as a hyper-parameter and allows the network to learn an appropriate sampling range. There are two ways of introducing a trainable σ:

  • Single Trainable σ: a single trainable σ shared across the whole network. This introduces one extra parameter that is used by all ProbAct units; it is similar to the fixed case, but with the value of σ trained (a minimal sketch is given after this list).

  • Element-wise Trainable σ: a different trainable σ for each ProbAct unit. This adds the flexibility to learn a different distribution for every neuron.
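As a minimal sketch (again in PyTorch, with illustrative names), the Single Trainable σ variant simply promotes σ to a learnable scalar; back-propagation then updates it together with the weights. To share one σ across the whole network, as described above, the same module instance (or the same nn.Parameter) would be reused in every layer.

# Sketch of the Single Trainable sigma variant; class and parameter names are illustrative.
import torch
import torch.nn as nn


class ProbActSingleTrainable(nn.Module):
    def __init__(self, init_sigma: float = 0.0):
        super().__init__()
        # One shared sigma for all units of this activation (zero initialization,
        # as listed for the single trainable sigma in Table 3).
        self.sigma = nn.Parameter(torch.tensor(init_sigma))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps = torch.randn_like(x)                 # eps ~ N(0, 1)
        return torch.relu(x) + self.sigma * eps   # gradient flows to sigma through eps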

Training

The trainable parameter σ is trained by back-propagation simultaneously with the other model parameters. The gradient of σ is computed using the chain rule. Given an objective function L, the gradient of L with respect to σ_i^(l), the perturbation parameter of the i-th unit in the l-th layer, is:

∂L/∂σ_i^(l) = (∂L/∂y_i^(l)) · (∂y_i^(l)/∂σ_i^(l)),    (7)

where y_i^(l) is the output of the i-th unit in the l-th layer. The term ∂L/∂y_i^(l) is the gradient propagated from the deeper layers. The gradient of the activation with respect to σ is:

∂y_i^(l)/∂σ_i^(l) = ε.    (8)
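Since the perturbation enters linearly, Eq. (8) can be checked numerically with automatic differentiation. The snippet below is an illustrative check rather than part of the original method: for a fixed noise draw, the gradient of the summed outputs with respect to σ equals the sum of the sampled ε values.

# Numerical check of Eq. (8): dy/dsigma = eps for a fixed noise draw.
import torch

sigma = torch.tensor(0.7, requires_grad=True)   # illustrative sigma value
x = torch.tensor([1.5, -0.3, 2.0])              # pre-activations of one layer
eps = torch.randn(3)                            # one draw of eps ~ N(0, 1)

y = torch.relu(x) + sigma * eps                 # ProbAct outputs
y.sum().backward()                              # dL/dy_i = 1 for L = sum_i y_i

print(sigma.grad)   # equals eps.sum(), i.e. sum_i (dL/dy_i * eps_i), as in Eqs. (7)-(8)
print(eps.sum())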
Bounded σ

Training without any bound can create perturbations in a highly unpredictable manner when σ grows large, making training difficult. Taking advantage of the monotonic nature of the sigmoid function, we bound σ between 0 and an upper limit α using:

σ = α · sigmoid(β w) = α / (1 + e^(-β w)),    (9)

where w is the element-wise learnable parameter, and α and β are scaling parameters that can be set as hyper-parameters. The values used in the experiments were found through exploratory testing; the resulting upper bound on σ was 2 (see Section 4.4).
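The sketch below implements the element-wise trainable variant with the sigmoid bound of Eq. (9). It assumes a channel-wise parameter for a 4-D feature map, and the values α = 2 (matching the upper bound observed in Section 4.4) and β = 1 are placeholders; autograd realizes Eqs. (7) and (8) through the sigmoid.

# Sketch of element-wise trainable ProbAct with the bound of Eq. (9).
# Channel-wise parameterization; alpha = 2 and beta = 1 are assumptions.
import torch
import torch.nn as nn


class ProbActElementwise(nn.Module):
    def __init__(self, num_channels: int, alpha: float = 2.0, beta: float = 1.0):
        super().__init__()
        self.alpha = alpha                                 # upper bound on sigma
        self.beta = beta                                   # input scaling of the sigmoid
        self.w = nn.Parameter(torch.zeros(num_channels))   # learnable parameter w of Eq. (9)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sigma = self.alpha * torch.sigmoid(self.beta * self.w)   # sigma in (0, alpha)
        sigma = sigma.view(1, -1, 1, 1)                          # broadcast over (N, C, H, W)
        eps = torch.randn_like(x)                                # eps ~ N(0, 1)
        return torch.relu(x) + sigma * eps

Table 3 lists Xavier initialization for the element-wise parameters; zero initialization is used here only to keep the sketch short.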

3.2 Stochastic Regularizer

Figure 1 (c) illustrates the effect of the stochastic perturbation by ProbAct in the feature space of each neural network layer. Intuitively, ProbAct adds perturbation to each feature vector independently, and this acts as a regularizer on the network. It should be noted that while the noise added by each ProbAct is isotropic, the noise from early layers is propagated to the subsequent layers; hence, the total noise added to a certain layer depends on the noise and weights of the earlier layers. For example, even in the simplest case where the network has two layers with one unit per layer, the distribution of the second layer's output depends on the two layers' weights (w_1 and w_2) and sigmas (σ_1 and σ_2). When both units operate in the positive regime of ReLU, the second layer's output is approximately distributed as:

y_2 ∼ N(w_2 w_1 x, w_2² σ_1² + σ_2²).    (10)

Incidentally, as shown in Figure 1 (c), a small noise variance tends to be learned in the final layer to make the network output stable (see Section 4.4 for a quantitative evaluation).
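The Gaussian form in Eq. (10) relies on both units staying in the positive regime of ReLU, so that the non-linearity acts as the identity on the perturbed values. The short simulation below, with illustrative weights and sigmas chosen so that this regime holds almost surely, checks the predicted mean w_2 w_1 x and variance w_2² σ_1² + σ_2².

# Monte Carlo check of Eq. (10) under the positive-regime assumption; values are illustrative.
import torch

torch.manual_seed(0)
w1, w2 = 1.5, 0.8        # weights of the two single-unit layers
s1, s2 = 0.5, 0.3        # sigmas of the two ProbAct layers
x = 2.0                  # positive input, so both units stay in the linear regime

n = 200_000
eps1, eps2 = torch.randn(n), torch.randn(n)
h1 = torch.relu(torch.tensor(w1 * x)) + s1 * eps1   # first layer output
y2 = torch.relu(w2 * h1) + s2 * eps2                # second layer output

print(y2.mean().item())   # ~ w2 * w1 * x           = 2.40
print(y2.var().item())    # ~ w2**2 * s1**2 + s2**2 = 0.25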

Using Eq. (6) from Theorem 1, assume ψ is a noise injection function that depends on the noise z and on a differentiable transformation d of the inputs c and the model's internal parameters a. The output h can then be written as:

h = ψ(d(a, c), z).    (11)

If we describe other noise addition methods, such as dropout hinton2012improving or the masking noise in denoising auto-encoders vincent2008extracting , in the form of Eq. (11), the noise is multiplied just after the non-linearity of a neuron. In the case of semantic hashing salakhutdinov2007semantic , the noise is added just before the non-linearity. In the case of ProbAct, we sample Gaussian noise and add it while computing μ(x); in other words, the noise is added to the activation value that is passed as the input to the next layer. In doing so, a self-regularization behaviour is induced in the network.
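The difference in where the noise enters can be summarized schematically; the one-liners below are simplified illustrations rather than the exact formulations of the cited methods.

# Schematic comparison of noise placement; x is a pre-activation, z a noise draw.
import torch

x = torch.randn(8)
z = torch.randn(8)
sigma, keep_prob = 0.5, 0.5

dropout_style = torch.relu(x) * torch.bernoulli(torch.full_like(x, keep_prob))  # noise applied after the non-linearity
noise_before  = torch.relu(x + z)                                               # noise injected before the non-linearity
probact       = torch.relu(x) + sigma * z                                       # ProbAct: noise added with the activation output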

Further, the effect of the regularization is proportional to the variance of the distribution. A higher σ value induces a higher variance, allowing samples further away from the mean. In this way the prediction does not over-rely on a single value, which helps to counter overfitting. For the fixed-σ case, the variance of the noise is constant; nevertheless, it still helps in optimizing the weights of the network. We also show this regularization behaviour empirically in the experiments section.

4 Experiments

In the experiments, we empirically evaluate ProbAct on image classification tasks to show the effectiveness of the induced stochastic perturbations. We also show experimentally that ProbAct acts as a regularizer that prevents overfitting. Our results are available on GitHub: https://github.com/kumar-shridhar/ProbAct-Probabilistic-Activation-Function

4.1 Datasets

To evaluate the proposed activation, we used three datasets, CIFAR-10 cifar10 , CIFAR-100 cifar10 , and STL-10 coates2011analysis .

CIFAR-10 Dataset

The CIFAR-10 dataset consists of 60,000 32-by-32-pixel images in 10 classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images.

CIFAR-100 Dataset

The CIFAR-100 dataset has 100 classes containing 600 images each, with 500 training images and 100 test images per class. The resolution of the images is also 32 by 32 pixels.

STL-10 Dataset

The STL-10 dataset has 10 classes with 500 training images and 100 test images per class. The images are 96 by 96 pixels.

4.2 Experimental Setup

To evaluate the performance of the proposed method on classification, we compare ProbAct to the following activation functions: ReLU, Leaky ReLU xu2015empirical , PReLU he2015delving , and Swish Ramachandran2018SearchingFA . We utilize a 16-layer Visual Geometry Group network (VGG-16) 2014arXiv1409.1556S architecture. The specific hyper-parameters and training settings are shown in the Supplementary Materials.

To keep the experimental environment simple, we do not use regularization tricks, pre-training, or data augmentation. The input images are normalized. The STL-10 images are resized to 32 by 32 pixels to match the CIFAR datasets and keep a fixed input shape for the network.

For the fixed-σ evaluations, two values of σ are reported. For trainable σ, we evaluate three variants: Single Trainable σ, Element-wise Trainable σ (unbound), and Element-wise Trainable σ (bound). Element-wise Trainable σ (bound) is the element-wise variant with σ bounded by Eq. (9); Element-wise Trainable σ (unbound) lacks this constraint.

4.3 Quantitative Evaluation

The results of the experiments on CIFAR-10, CIFAR-100, and STL-10 are shown in Table 1. These results are obtained by averaging over three training runs. When using Element-wise Trainable σ (bound) ProbAct, we achieve performance improvements of +2.25% on CIFAR-10, +2.89% on CIFAR-100, and +3.37% on STL-10 compared to the standard ReLU. In addition, the proposed method performs better than all of the other evaluated activation functions.

In order to demonstrate the viability of the proposed method, the training and testing times relative to the standard ReLU are also shown in Table 1. The time comparison shows that we can achieve higher performance with only a relatively small time difference.

The training and testing times for ProbAct are comparable to ReLU and the other activation functions, mainly because the number of learnable σ values is small compared to the number of learnable weights in the network. Hence, no significant extra time is needed to train these parameters. This makes ProbAct a strong replacement for popular activation functions.

Activation function                           CIFAR-10  CIFAR-100  STL-10  Train time   Test time
ReLU                                          86.67     52.94      60.80   1.00× ReLU   1.00× ReLU
Leaky ReLU                                    86.49     49.44      59.16   1.04× ReLU   1.08× ReLU
PReLU                                         86.35     43.30      60.01   1.16× ReLU   1.00× ReLU
Swish                                         86.55     54.01      63.50   1.20× ReLU   1.13× ReLU
ProbAct, Fixed σ (first setting)              88.50     56.85      62.30   1.09× ReLU   1.25× ReLU
ProbAct, Fixed σ (second setting)             88.87     58.45      62.50   1.10× ReLU   1.27× ReLU
ProbAct, Single Trainable σ                   87.40     52.87      63.07   1.23× ReLU   1.30× ReLU
ProbAct, Element-wise Trainable σ (unbound)   86.40     53.10      61.70   1.25× ReLU   1.31× ReLU
ProbAct, Element-wise Trainable σ (bound)     88.92     55.83      64.17   1.26× ReLU   1.33× ReLU
Table 1: Performance comparison of different activation functions. The test accuracy (%) is the average over three runs.

4.4 Parameter Analysis

Figure 2: (a) Transition of the single trainable σ for the VGG-16 architecture on the three datasets. (b) Layer-wise mean σ for the VGG-16 ProbAct layers trained on the CIFAR-10 dataset.

We visualize the training behaviour of the Single Trainable σ in Figure 2 (a) over 200 epochs for the CIFAR-10 dataset. The network was trained for 400 epochs, but the plot is cropped at 200 epochs for better visualization, as the σ value does not change afterwards. After about 100 training epochs, the Single Trainable σ approaches 0. To test the Single Trainable σ at σ = 0, we replaced ProbAct with ReLU in the trained network. We confirmed that even with ReLU, the network trained with Single Trainable σ achieves higher accuracy than a network trained with ReLU from the start. This shows that, during training, σ helps to optimize the other learnable weights better than the standard ReLU architecture does, allowing better model performance. Figure 2 (b) shows the mean Element-wise Trainable σ over 200 epochs for all layers. It demonstrates the ability of the network to train σ element-wise across all layers, even though the number of trainable parameters is increased by the element-wise σ parameters.

Figure 3: Histograms showing how the element-wise learnable parameters are distributed after training the VGG-16 architecture on the CIFAR-10, CIFAR-100, and STL-10 datasets, at each layer. The x-axis denotes the parameter value after training and the y-axis its frequency. Each subfigure represents one ProbAct layer, in layer-wise order from top left to bottom right.

Figure 3 shows the frequency distribution of the bounded element-wise trained values after 400 epochs. We observe two peaks in every distribution, across all three datasets. We attribute this to the derivative of the sigmoid function becoming 0 at both of its boundaries. The points in the left peak lie at the lower boundary of σ (0 in our case), making ProbAct behave like ReLU. The points in the right peak lie at the upper boundary of the sigmoid (2 in our case), giving σ = 2. The values between the peaks correspond to intermediate σ values.

In the case of both CIFAR datasets, the distribution of the parameter in the last layer is quite narrow and concentrated in the negative domain. As shown in Figure 2 (b), the corresponding σ values become 0, which indicates that ProbAct performs a ReLU-like operation in the last layer.

4.5 Overfitting

Figure 4: (a) Test accuracy and (b) train accuracy comparison between ReLU and ProbAct, with and without dropout layers, on the CIFAR-100 dataset.

A high σ value acts as an inbuilt regularizer that improves generalization and prevents overfitting. To quantify the level of overfitting in a network, we use the difference Δ between the training and test accuracy after training:

Δ = accuracy_train - accuracy_test.    (12)

The idea is that if Δ is small, the learning on the training set generalized well to the test set; if Δ is large, the training set was memorized.

In addition, we compare the generalization ability with and without dropout in Figure 4 for CIFAR-100; results for CIFAR-10 can be found in the Supplementary Materials. We use a dropout layer before the linear classification layer with a dropout probability of 0.5. The Δ of ReLU without a dropout layer is large and does not change much when dropout is used. On the other hand, fixed-σ ProbAct attains a smaller Δ even without dropout, and Δ decreases further for larger σ, showing the built-in regularization nature of ProbAct. With an increase in the σ value, overfitting can be controlled largely because the higher variance allows more varied sampling and better model averaging. Moreover, overfitting can be reduced even further by introducing dropout alongside ProbAct.

4.6 Reduced Data

The training data size was reduced to 50% and 25% of the original size for the CIFAR-10 and CIFAR-100 datasets. We maintained the class distribution by randomly choosing 50% and 25% of the images for each class. The process was repeated three times to create three random subsets; we run our experiments on all three and average the results.

Table 2 shows the test accuracy for ReLU and for ProbAct with Element-wise Trainable σ (bound) on 50% and 25% of the data. We achieve a +3% average increase in test accuracy when the data size is halved, and a +2.5% increase when it is halved again. The higher test accuracy of ProbAct suggests its usefulness in real-life use cases where the training data is small.

Activation function CIFAR-10 (50%) CIFAR-100 (50%) CIFAR-10 (25%) CIFAR-100 (25%)
ReLU 82.74 42.36 75.62 30.42
ProbAct 84.73 46.11 79.02 31.67
Table 2: Test accuracy (%) comparison between ReLU and ProbAct on reduced subsets of CIFAR-10 and CIFAR-100 (50% and 25% of the original dataset). The reported accuracy is the average over three runs.

5 Conclusion

In this paper, we introduced ProbAct, a novel probabilistic activation function that adds perturbation to every activation map, allowing better network generalization. Through the experiments, we verified that the stochastic perturbation prevents the network from memorizing the training samples, because the samples effectively change at every pass, resulting in more evenly optimized network weights and a more robust network with lower generalization error. Furthermore, we confirmed that the augmentation-like operation in ProbAct is very effective for classifying images when the number of training images is insufficient. We also showed that ProbAct can act as a regularizer, like dropout, and thus prevent overfitting.

There are some areas that need further exploration. Choosing the desired hyper-parameters for σ (a constant value in the fixed case, or the scaling values of the sigmoid bound for Element-wise Trainable σ) requires several rounds of trial and error. Further, we need to explore the relationship between σ and the learning rate to find better hyper-parameters. With a clearer understanding of that relationship, it may be possible to learn and optimize σ better, which might further boost network performance.

References

6 Appendix

The VGG-16 architecture used in the experiments is defined as follows:

Vgg16: a sequence of convolution blocks interleaved with max-pooling layers, where the numbers (64, 128, 256, ...) represent the filter counts of the convolution layers, each convolution layer is followed by a batch normalization layer and an activation function, M represents a max-pooling layer, and C represents the linear classification layer of dimension (512, number of classes).
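As a sketch of how such a configuration maps to layers, the helper below builds convolution, batch normalization, and activation blocks from a config list in PyTorch. The cfg list shown is the standard VGG-16 layout and is an assumption on our part, since the exact configuration string is not preserved here; the activation argument is where ProbAct would be substituted for ReLU.

# Sketch: building a VGG-style feature extractor from a config list (PyTorch).
# The cfg below is the standard VGG-16 layout and is assumed, not recovered from the paper.
import torch.nn as nn

cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']


def make_features(cfg, activation=nn.ReLU):
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # 'M': max pooling
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.BatchNorm2d(v),
                       activation()]                               # ReLU here; swap in ProbAct
            in_ch = v
    return nn.Sequential(*layers)


features = make_features(cfg)
classifier = nn.Linear(512, 10)   # 'C': linear layer of dimension (512, number of classes)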

Other hyper-parameter settings include:

Hyper-parameter                          Value
Convolution kernel size                  3
Convolution layer padding                1
Max-pooling kernel size                  2
Max-pooling stride                       2
Optimizer                                Adam
Batch size                               256
Fixed σ values                           [0.05, 0.1, 0.25, 0.5, 1, 2]
Learning rate                            0.01 (divided by 10 after every 100 epochs)
Number of epochs                         400
Image resolution                         32 × 32
Single trainable σ initializer           Zero
Element-wise trainable σ initializer     Xavier initialization
Table 3: Hyper-parameters for the experiments
Figure 5: Test accuracy comparison between ReLU and ProbAct, with and without dropout layers, on the CIFAR-10 dataset.
Figure 6: Train accuracy comparison between ReLU and ProbAct, with and without dropout layers, on the CIFAR-10 dataset.