I Introduction
Deep learning is a branch of machine learning that uses multilayer neural networks to identify complex features within the input data and solve complex real-world problems. It can be used for both supervised and unsupervised machine learning tasks [1]. Currently, deep learning is used in areas such as computer vision, video analytics, pattern recognition, anomaly detection, natural language processing, information retrieval, and recommender systems, among other things. It is also widely used in robotics, self-driving cars, and artificial intelligence systems in general
[2]. The activation function is at the heart of any deep neural network. It provides the non-linear property for deep neural networks and controls the information propagation through adjacent layers [3]. Therefore, the design of an activation function is crucial to the learning behavior and performance of neural networks. Currently, the most successful and popular activation function is the rectified linear unit (ReLU) [4], defined as f(x) = max(0, x). On one hand, ReLU propagates all positive inputs identically, which alleviates gradient vanishing and allows for much deeper neural networks. On the other hand, ReLU improves calculation efficiency by outputting only zero for negative inputs. Thanks to its simplicity and effectiveness, ReLU has become the default activation function across the deep learning community.

While ReLU is fantastic, researchers found that it is not the end of the story about the activation function – the challenges of ReLU arise from three main aspects: non-zero mean, negative missing, and unbounded output. 1) Non-zero mean. Apparently, ReLU is non-negative and, therefore, has a mean activation larger than zero. According to [5], units that have a non-zero mean activation are not conducive to network convergence. 2) Negative missing. ReLU simply restrains negative values to hard zero, which provides sparsity but results in negative missing. The variants of ReLU, including leaky ReLU (LReLU) [6], parametric ReLU (PReLU) [7], and randomized leaky ReLU (RReLU) [8], enable a non-zero slope for the negative part. It has been shown that the negative parts are helpful for network learning. 3) Unbounded output. The output range of ReLU may cause the output distribution to be scattered in the non-negative real number space. According to [9], samples with low distribution concentration may make the network difficult to train.
Batch normalization (BN) [10] is generally performed before ReLU to alleviate the internal covariate shift problem. However, the normalized activation corresponding to an input sample depends on the other samples in the mini-batch, e.g., through division by the running variance. Hence, even if BN is added, excessively centrifugal samples will make the features of samples at the center inseparable. In recent years, numerous activation functions have been proposed to replace ReLU, including LReLU, PReLU, RReLU, ELU [5], SELU [11], Swish [12], and Maxout [13], to name a few, but none of them have managed to overcome the above three challenges.

In this paper, we introduce the “Soft-Root-Sign” (SRS) activation function, named for its appearance. The proposed SRS is smooth, non-monotonic, and bounded (see Fig. 1). In fact, the bounded property of SRS distinguishes it from most state-of-the-art activation functions. Compared to ReLU, SRS can adaptively adjust the output through a pair of independent trainable parameters to capture negative information and provide a zero-mean property, which leads to better generalization performance as well as faster learning speed. At the same time, our nonlinearity prevents the output distribution from being scattered across the non-negative real number space. This is desirable during inference, because it makes the activation function more compatible with BN and less sensitive to initialization.
To validate the effectiveness of the proposed activation function, we evaluated SRS on deep networks applied to a variety of tasks such as image classification, machine translation, and generative modelling. Our SRS matches or exceeds models with ReLU and other state-of-the-art nonlinearities, showing that the proposed activation function generalizes and can achieve high performance across tasks. An ablation study further verified its compatibility with BN and its self-adaptability to different initialization schemes.
Our contributions can be summarized as follows:

We revealed potential drawbacks of ReLU and introduced a novel activation function to address them. The proposed activation function, namely “Soft-Root-Sign” (SRS), is smooth, non-monotonic, and bounded; the bounded output is an important aspect of SRS.

We further analyzed and discussed the properties of the proposed SRS, and identified the exact roots of SRS’s success in deep neural network training.

We evaluated SRS on image classification, machine translation, and generative modelling tasks. The proposed activation function is shown to generalize, achieving high performance across tasks.

We conducted an ablation study and observed that SRS is compatible with BN and adaptive to different initial values. This makes it possible to use significantly higher learning rates and more general initialization schemes.
Our paper is organized as follows. Section II discusses related work. In Section III, we introduce the proposed Soft-Root-Sign activation function (SRS) and identify the roots of its success. Then, in Section IV, we present empirical results on image classification, machine translation, and generative modelling tasks. We further conduct an ablation study in Section V and conclude in Section VI.
II Related Work
In a deep neural network, different activation functions have different characteristics. Currently, the most popular and widely used activation function for neural networks is the rectified linear unit (ReLU) [4], defined as f(x) = max(0, x), which was first proposed for restricted Boltzmann machines and then successfully used for neural networks. On one hand, ReLU identically propagates all positive inputs, which alleviates gradient vanishing and allows for much deeper neural networks. On the other hand, ReLU is computationally efficient by just outputting zero for negative inputs. However, because of the non-zero mean, negative missing, and unbounded output, ReLU is at a potential disadvantage during optimization.
In recent years, various activation functions have been proposed to replace ReLU. Leaky ReLU (LReLU) [6] replaces the negative part of ReLU with a linear function and has been shown to be superior to ReLU. Parametric ReLU (PReLU) [7] generalizes LReLU by learning the slope of the negative part, which yielded improved learning behavior on large image benchmark datasets. Randomized leaky ReLU (RReLU) [8] randomly samples the slope of the negative part, which raised performance on image benchmark datasets and convolutional networks. However, the non-hard rectification of these activation functions does not ensure a noise-robust deactivation state and destroys sparsity. Other variants, i.e., shifted ReLU (SReLU) [14] and flexible ReLU (FReLU) [14], have the flexibility of choosing horizontal shifts from learned biases, but they are not continuously differentiable, which may cause undesired problems in gradient-based optimization.
More recently, the exponential linear unit (ELU) [5] has been proposed to capture negative values and allow mean activations close to zero, but it saturates to a negative value for smaller arguments. Compared with LReLU, PReLU, and RReLU, ELU not only provides fast convergence but also has a clear saturation plateau in its negative region, allowing it to learn more important features. Building on this success, the variants of ELU [15] [16] [17] [18] also demonstrate similar performance improvements. However, the incompatibility between these activation functions and batch normalization (BN) [10] has not been well treated. Another alternative to ReLU is the scaled exponential linear unit (SELU) [11], which induces variance stabilization and overcomes gradient-based problems like gradient vanishing or exploding. The main idea is to drive neuron activations across all layers to emit a zero-mean and unit-variance output. But incompatibilities with BN remain. Besides, a special initialization method called LeCun Normal
[19] is required to make a deep neural network with SELU work well. Recently, Swish [12] opened up a new direction of bringing optimization methods, including exhaustive search algorithms [20] [21], to activation function search. One drawback, however, is that the resulting nonlinearity is very dependent on the chosen network architecture.

In this paper, we propose a novel activation function called “Soft-Root-Sign” (SRS), which is designed to solve the above potential disadvantages of ReLU. The proposed SRS is smooth, non-monotonic, and bounded. It cannot be derived from (scaled) sigmoid-like [22] [23] [24], ReLU-like [4] [25] [26], ELU-like, Swish-like [12] [27] [28] [29], or other nonlinearities [13] [30] [31] [32]. In contrast to ReLU, SRS can adaptively adjust the output through a pair of independent trainable parameters to capture negative information and provide a zero-mean property, leading not only to faster learning but also to better generalization performance. In addition, SRS prevents the output distribution from being scattered across the non-negative real number space, improving its compatibility with BN and reducing the sensitivity to initialization. Compared with some variants of ReLU, i.e., LReLU, PReLU, and RReLU, SRS has a clear saturation plateau in its negative region, allowing it to learn more important features. Compared with other variants of ReLU, i.e., FReLU and SReLU, SRS has an infinite order of continuity, which helps with effective optimization and generalization. ELU, SELU, and SRS share similar properties to some extent: they all provide a zero-mean property for fast convergence and sacrifice hard-zero sparsity on gradients for robust learning. Comparatively, however, SRS has better compatibility with BN and stronger adaptability to different initializations. Finally, in contrast to Swish, SRS is a hand-designed activation function that better fits these important properties.
III The Proposed Method
This section presents the proposed Soft-Root-Sign activation function (SRS), then analyzes and discusses its properties, including gradient regression, suitable data distribution, smooth output landscape, and bounded output.
III-A Soft-Root-Sign Activation Function (SRS)
We observe that an effective activation function is required to have 1) negative and positive values for controlling the mean toward zero to speed up learning [5]; 2) saturation regions (derivatives approaching zero) to ensure a noise-robust state; and 3) a continuously differentiable curve that helps with effective optimization and generalization [12]. Based on the above insights, we propose the Soft-Root-Sign activation function (SRS), which is defined as
f(x) = x / (x/α + e^{−x/β})    (1)
where α and β are a pair of trainable non-negative parameters. Fig. 1 shows the graph of the proposed SRS.
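For concreteness, a minimal PyTorch sketch of an SRS layer following (1) is shown below. It is illustrative rather than a reference implementation: the module name, the lower bound eps on the parameters, and the initial values (here the fixed setting α = 5.0, β = 3.0 used later in Sec. III-B) are our own choices.

```python
import torch
import torch.nn as nn

class SoftRootSign(nn.Module):
    """Soft-Root-Sign activation: f(x) = x / (x/alpha + exp(-x/beta)), Eq. (1)."""

    def __init__(self, alpha=5.0, beta=3.0, eps=1e-2):
        super().__init__()
        # A single (alpha, beta) pair shared by the whole layer; per-channel
        # parameters would be an equally valid design.
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))
        self.beta = nn.Parameter(torch.tensor(float(beta)))
        self.eps = eps  # keeps the trainable parameters strictly positive

    def forward(self, x):
        alpha = self.alpha.clamp(min=self.eps)
        beta = self.beta.clamp(min=self.eps)
        return x / (x / alpha + torch.exp(-x / beta))

if __name__ == "__main__":
    srs = SoftRootSign()
    print(srs(torch.linspace(-10.0, 10.0, 5)))
```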
In contrast to ReLU [4], SRS has a non-monotonic region when x < 0, which helps capture negative information and provides a zero-mean property. Meanwhile, SRS gives bounded outputs when x > 0, which prevents the output distribution from being scattered across the non-negative real number space. The derivative of SRS is defined as
f′(x) = (1 + x/β) e^{−x/β} / (x/α + e^{−x/β})²    (2)
Fig. 2 illustrates the first derivative of SRS, which is continuous everywhere.
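As a sanity check on (2), the following NumPy sketch (our own) compares the closed-form derivative with a central finite difference at the fixed setting α = 5.0, β = 3.0.

```python
import numpy as np

alpha, beta = 5.0, 3.0

def srs(x):
    return x / (x / alpha + np.exp(-x / beta))

def srs_grad(x):
    # Eq. (2)
    denom = x / alpha + np.exp(-x / beta)
    return (1.0 + x / beta) * np.exp(-x / beta) / denom**2

x = np.linspace(-5.0, 5.0, 101)
h = 1e-5
numeric = (srs(x + h) - srs(x - h)) / (2.0 * h)
print(np.max(np.abs(numeric - srs_grad(x))))  # agreement to roughly 1e-8 or better
```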
β \ α  0.5  1.0  2.0  3.0  4.0  5.0
1.0  0.2346 (0.4237)  0.0685 (0.2746)  0.2569 (0.4941)  0.3749 (0.7669)  0.4642 (1.0571)  0.5364 (1.3540)
2.0  –  0.3321 (1.0468)  0.0275 (0.5874)  0.1326 (0.6804)  0.1957 (0.7925)  0.2403 (0.9000)
3.0  –  –  0.1177 (0.8254)  0.0120 (0.7565)  0.0765 (0.7947)  0.1179 (0.8461)
4.0  –  –  0.2340 (1.2033)  0.0650 (0.8685)  0.0060 (0.8449)  0.0486 (0.8620)
5.0  –  –  0.3438 (1.8933)  0.1204 (0.9917)  0.0415 (0.9046)  0.0034 (0.8942)
6.0  –  –  –  0.1631 (1.1196)  0.0761 (0.9640)  0.0288 (0.9291)
TABLE I: The output expectation (variance) distribution of SRS. A continuous random variable X is set to be normally distributed with zero mean and unit variance, and the random variable Y is generated by transforming X through SRS. When the integral converges absolutely, the output expectation (variance) of Y is listed above for different α (columns) and β (rows); “–” means the integrand does not converge absolutely. It can be seen that under rational parameter settings, the output distribution of SRS maps to a suitable distribution (near zero mean and unit variance) [11]. This means that SRS can effectively prevent the outputs from being scattered, thus ensuring fast and robust learning through multiple layers.

As shown in Figs. 1 and 2, the proposed SRS activation function is bounded, with outputs in the range [αβ/(β − αe), α). Specifically, the minimum of SRS is attained at x = −β, where it takes the value αβ/(β − αe); and the maximum of SRS approaches α as the network input x → +∞. In fact, the maximum value and the slope of the function can be controlled by changing the parameters α and β, respectively. By further making α and β trainable, SRS can not only control how fast the first derivative asymptotes to saturation, but also adaptively adjust the output to a suitable distribution, which avoids gradient-based problems and ensures fast as well as robust learning across multiple layers [11].
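These extrema can be checked numerically under definition (1); the short sketch below (our own) evaluates SRS on a dense grid and compares the observed minimum and right tail against αβ/(β − αe) and α.

```python
import numpy as np

alpha, beta = 5.0, 3.0
x = np.linspace(-20.0, 100.0, 200001)
y = x / (x / alpha + np.exp(-x / beta))

print("argmin x  ≈", x[np.argmin(y)])            # close to -beta = -3.0
print("min value ≈", y.min(),
      " analytic:", alpha * beta / (beta - alpha * np.e))
print("f(100)    ≈", y[-1], " -> approaches alpha =", alpha)
```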
III-B Analysis and Discussion of SRS
III-B1 Gradient Regression
We consider an activation function to have the gradient regression property if each unit it maps is not in the saturation regime, or can move out of saturation after a few iterations. Deep neural networks with sigmoid-like units, e.g., Sigmoid [22], Softsign [23], and hyperbolic tangent (Tanh) [24], have been shown to slow down optimization convergence, because once a unit reaches the saturation regime, it is almost impossible for it to escape during training. Unsaturated activation functions, e.g., ReLU, are the identity for positive arguments, which provides the gradient regression property and addresses the problems mentioned thus far. But neurons in the negative regime of ReLU are not updated during the training process.
To understand this, we further study the gradient regression property of SRS by looking at the evolution of activations during forward propagation. For simplicity, the parameters α and β are fixed to 5.0 and 3.0, respectively; the extension to the trainable case is more robust. Let t denote the current iteration, and initialize the input, weight, and bias to random values within a fixed interval. The output of each iteration is obtained by applying the affine transformation and then the activation function f(·) to the previous output. After many iterations, the evolution of the activation value (after the nonlinearity) and its discrete derivative value (1st difference) during forward propagation of SRS is shown in Fig. 3(a). As expected, the unit constantly updates through consecutive iterations. We also observe that intermediates with large derivative values decrease, and vice versa. An additional comparison of gradient stability is done for Sigmoid and ReLU, as shown in Fig. 3(b) and 3(c). We see that very quickly at the beginning, the Sigmoid derivative value is pushed below 0.5, and the model with Sigmoid never escapes this high saturation regime during the iterations, as mentioned earlier. ReLU can move out of saturation after a few iterations; however, its activation value and derivative frequently vanish during intermediate iterations. Therefore, networks with SRS have not only gradient regression capability but also additional stability.
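The exact recurrence and initialization interval of this toy experiment are not fully specified here, so the sketch below is only one plausible reading: a scalar unit is repeatedly transformed by a fixed random affine map followed by the nonlinearity, while the activation value and its first difference are tracked.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 5.0, 3.0  # fixed as in the text

def srs(x):     return x / (x / alpha + np.exp(-x / beta))
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(0.0, x)

def evolve(f, steps=200):
    # assumption: x_{t+1} = f(w * x_t + b) with fixed random w, b and input x_0
    x, w, b = rng.uniform(-1.0, 1.0, size=3)
    xs = []
    for _ in range(steps):
        x = f(w * x + b)
        xs.append(x)
    xs = np.array(xs)
    return xs, np.diff(xs)  # activation values and their 1st differences

for name, f in [("SRS", srs), ("Sigmoid", sigmoid), ("ReLU", relu)]:
    acts, diffs = evolve(f)
    print(f"{name:8s} final activation {acts[-1]: .4f}   "
          f"mean |1st diff| {np.abs(diffs).mean():.2e}")
```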
III-B2 Suitable Data Distribution
For the activations of a neural network, it is known that mapping the mean and variance within suitable intervals (near zero mean and unit variance) can not only speed up learning but also avoid both vanishing and exploding gradients [5] [11]. ReLU, however, is obviously non-negative, which is not conducive to network convergence. In contrast, through training a pair of independent non-negative parameters α and β, SRS can adaptively adjust the output to a suitable distribution.
Without loss of generality, consider a continuous random variable X that is normally distributed with zero mean and unit variance, in which case its probability density is
φ(x) = (1/√(2π)) e^{−x²/2}    (3)
And the random variable Y generated by transforming X through SRS is denoted as
Y = f(X) = X / (X/α + e^{−X/β})    (4)
Since the integral converges absolutely, the expectation and variance of Y with respect to x can be calculated as
E(Y) = ∫_{−∞}^{+∞} f(x) φ(x) dx    (5)
D(Y) = ∫_{−∞}^{+∞} (f(x) − E(Y))² φ(x) dx    (6)
where E(·) denotes the expectation and D(·) the variance of a random variable. We can then calculate the expectation and variance of SRS outputs for different α and β, as shown in Table I. It can be seen that under rational parameter settings, the output distribution of SRS maps to near zero mean and unit variance. This means that SRS can effectively prevent the outputs from being scattered, thus ensuring fast and robust learning through multiple layers.
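Given (1), the entries of Table I can be obtained by evaluating (5) and (6) numerically; a SciPy sketch (our own helper, valid only where the integral converges) is shown below.

```python
import numpy as np
from scipy import integrate

def srs_moments(alpha, beta):
    """E[f(X)] and D[f(X)] for X ~ N(0, 1), following Eqs. (5)-(6)."""
    phi = lambda x: np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
    f = lambda x: x / (x / alpha + np.exp(-x / beta))
    mean, _ = integrate.quad(lambda x: f(x) * phi(x), -np.inf, np.inf)
    var, _ = integrate.quad(lambda x: (f(x) - mean) ** 2 * phi(x), -np.inf, np.inf)
    return mean, var

# one setting with near-zero mean and near-unit variance (cf. Table I)
print(srs_moments(alpha=5.0, beta=3.0))
```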
III-B3 Smooth Output Landscape
ReLU outputs zero for negative arguments, thereby making the deep neural network sparse and efficient. However, because of this hard-zero rectification, ReLU has an order of continuity of 0, which means it is not continuously differentiable. That is to say, networks with ReLU will have numerous sharp regions in the output landscape, which causes undesired problems in gradient-based optimization.
An alternative way to retain non-informative deactivation states is to achieve zero gradient only in the limit. SRS and some previous works [12] [27] offer this insight. In addition, the infinite order of continuity of SRS is a benefit over ReLU. This stems from the fact that SRS is a continuously differentiable function, which helps smooth the output landscape for effective optimization and generalization.
We further visualized the output landscapes of a 5-layer randomly initialized neural network with SRS (with α and β fixed for simplicity) and ReLU, as shown in Fig. 4. Specifically, we passed the 2-dimensional coordinates of each position in a grid into the network, and plotted the scalar network output for each grid point. We observed that activation functions have a dramatic effect on the smoothness of output landscapes. The SRS network output landscape is considerably smooth, whereas the ReLU output landscape is chaotic and sharp. Smoother output landscapes directly result in smoother loss landscapes; smoother loss landscapes are easier to optimize and result in better training and test accuracy [27]. These phenomena partially explain why SRS outperforms ReLU. An additional comparison of output landscapes is done for Softplus [4], LReLU [6], PReLU [7], ELU [5], SELU [11], and RReLU [8], among others. Most of them, similar to ReLU, exhibit sharpness in the output landscape and thus prove to be a roadblock to effective gradient-based optimization (see Fig. 11 for further details).
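The visualization can be reproduced along the following lines; the hidden width, weight scale, and grid range in this sketch are our own choices, since only the depth (5 layers) and the 2-dimensional input are stated.

```python
import numpy as np

alpha, beta = 5.0, 3.0

def srs(x):  return x / (x / alpha + np.exp(-x / beta))
def relu(x): return np.maximum(0.0, x)

def random_net(act, widths=(2, 64, 64, 64, 64, 1), seed=1):
    rng = np.random.default_rng(seed)
    layers = [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
              for m, n in zip(widths[:-1], widths[1:])]
    def net(x):
        h = x
        for i, (w, b) in enumerate(layers):
            h = h @ w + b
            if i < len(layers) - 1:      # no nonlinearity on the scalar output
                h = act(h)
        return h[:, 0]
    return net

g = np.linspace(-3.0, 3.0, 200)
xx, yy = np.meshgrid(g, g)
coords = np.stack([xx.ravel(), yy.ravel()], axis=1)
z_srs = random_net(srs)(coords).reshape(xx.shape)
z_relu = random_net(relu)(coords).reshape(xx.shape)
# z_srs and z_relu can be rendered with matplotlib (imshow/contourf) as in Fig. 4
print(z_srs.shape, z_relu.shape)
```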
III-B4 Bounded Output
The output range of ReLU may cause the output distribution to be scattered in the non-negative real number space. This means that a network with ReLU may show high variance, leading to overfitting during optimization. Batch normalization (BN) [10] is generally performed before ReLU to alleviate the internal covariate shift problem. However, the normalized activation corresponding to an input example depends on the other examples in the mini-batch, e.g., through division by the running variance. Hence, even if BN is added, excessively centrifugal samples will make the features at the center inseparable.
Compared to ReLU, SRS gives bounded outputs when x > 0. That means SRS modifies the output distribution and avoids the overfitting problem to some extent. This is desirable during inference because it makes the activation function more compatible with BN and less sensitive to initialization, as verified in our ablation study.
Note that bounded output is an important aspect of SRS, which distinguishes it from most widely used activation functions. But unlike traditional bounded activation functions, i.e., Sigmoid, Softsign, and hyperbolic tangent (Tanh), the slope of SRS can be controlled by changing the parameter β. In other words, through assigning a trainable parameter, SRS can control how fast the first derivative asymptotes to saturation. Fig. 5 plots the first derivative of SRS for different β values with α fixed to 5.0. When the magnitude of β is small, SRS can easily map units beyond a predefined boundary to saturation. Therefore, the representation is noise-robust and of low complexity.
Baselines  ReLU  LReLU  PReLU  Softplus  ELU  SELU  Swish
SRS > Baseline  6  5  6  6  4  5  6
SRS = Baseline  0  0  0  0  2  0  0
SRS < Baseline  0  1  0  0  0  1  0
As β gets larger, the derivative saturation threshold gets larger. This in turn helps saturated units to desaturate during training. Thus SRS behaves differently in terms of saturation, ensuring that vanishing and exploding gradients are not observed.
Summary. Based on the above theoretical analyses and empirical studies, we conclude that the design of the Soft-Root-Sign activation function provides 1) gradient regression; 2) a suitable data distribution; 3) a smooth output landscape; and 4) bounded output. These properties depend on network characteristics that go beyond a function’s mathematical properties, and they should be taken into account in practical activation function design.
IV Experiments
This section presents a series of experiments to demonstrate that our Soft-Root-Sign activation function (SRS) improves performance on different tasks, including image classification, machine translation, and generative modelling. Since many activation functions have been proposed, we choose the most common ones to compare against: ReLU [4], LReLU [6], PReLU [7], Softplus [4], ELU [5], SELU [11], and Swish [12], and adopt the following definitions (a reference sketch of these baselines is given after the list):

Leaky ReLU (LReLU): f(x) = max(αx, x), where α is a small fixed slope. LReLU introduces a non-zero gradient for negative input.

Parametric ReLU (PReLU): PReLU is a modified version of LReLU that makes α trainable. Each channel has a shared α, which is initialized to 0.1.

Softplus: f(x) = log(1 + e^x). Softplus can be regarded as a smooth version of ReLU.

Exponential Linear Unit (ELU): f(x) = x if x > 0, and α(e^x − 1) otherwise, where α > 0. ELU produces negative outputs, which helps the network push mean unit activations closer to zero.

Scaled Exponential Linear Unit (SELU): f(x) = λx if x > 0, and λα(e^x − 1) otherwise, with predetermined constants λ ≈ 1.0507 and α ≈ 1.6733 [11].

Swish: f(x) = x · sigmoid(βx), where β can either be a trainable parameter or fixed to 1.0.
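The following NumPy sketch collects the baseline definitions above for reference. The LReLU slope (0.01) and the ELU α (1.0) are common defaults rather than values stated in this paper, the PReLU slope is shown at its initial value 0.1, and the SELU constants are the approximate values derived in [11].

```python
import numpy as np

def relu(x):          return np.maximum(0.0, x)
def lrelu(x, a=0.01): return np.where(x >= 0, x, a * x)   # slope a: common default, assumed
def prelu(x, a=0.1):  return np.where(x >= 0, x, a * x)   # a is trainable; 0.1 is its initial value
def softplus(x):      return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))
def elu(x, a=1.0):    return np.where(x > 0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))
def selu(x, l=1.0507, a=1.6733):
    return l * np.where(x > 0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))
def swish(x, b=1.0):  return x / (1.0 + np.exp(-b * x))
def srs(x, alpha=5.0, beta=3.0):
    return x / (x / alpha + np.exp(-x / beta))

x = np.linspace(-4.0, 4.0, 9)
for f in (relu, lrelu, prelu, softplus, elu, selu, swish, srs):
    print(f"{f.__name__:9s}", np.round(f(x), 3))
```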
Model  VGG (CIFAR-10)  MobileNet (CIFAR-10)  VGG (CIFAR-100)  MobileNet (CIFAR-100)
LReLU  93.35  87.59  72.84  60.49 
PReLU  92.89  87.87  71.13  57.49 
Softplus  93.18  87.24  72.24  58.58 
ELU  93.21  87.94  73.19  60.59 
SELU  93.09  87.72  72.07  59.76 
Swish  93.16  87.49  72.43  59.05 
ReLU  93.12  85.63  72.24  56.21 
SRS  93.33  87.96  73.24  60.59 
In order to balance model performance and training efficiency, we selected fixed initial values for α and β in sparsely connected networks, and adjusted these initial values to fit the densely connected networks. We also limited the minimum value of the parameters to a small constant to prevent the abnormal situation in which the denominator approaches zero during iteration.
We conduct experiments on a variety of models and datasets. As a summary, the results in Table II are aggregated by comparing the performance of SRS to that of different activation functions applied to a variety of models and datasets. Specifically, the models with aggregated results are a) VGG-16 and MobileNet V1 across the CIFAR-10 and CIFAR-100 results; b) the IWSLT German-English Transformer model across the four TED test sets; and c) the i-ResNet flow model across three toy density samples. (To avoid skewing the comparison, each model type is compared only once; a model with multiple results is represented by the median of its results.)
Note that “SRS > Baseline” indicates that SRS achieved better accuracy than the corresponding baseline, and vice versa. We observed that SRS consistently matches or outperforms ReLU on every model for different tasks. SRS also matches or exceeds the best baseline performance on almost every model. Importantly, the “best baseline” changes between different tasks, which demonstrates the stability of SRS in matching these varying baselines.
IV-A Image Classification
First, we evaluate the proposed SRS on the image classification task. On CIFAR-10 and CIFAR-100, we compare the performance of SRS to that of different activation functions applied to representative CNNs, i.e., VGG-16 [33] and MobileNet V1 [34].
IV-A1 Datasets
The CIFAR datasets [35], CIFAR-10 and CIFAR-100, consist of 32×32 colored images. Both datasets contain 60,000 images, which are split into 50,000 training images and 10,000 test images. The CIFAR-10 dataset has 10 classes, with 6,000 images per class. The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 classes, each of which contains 600 images. The standard data-augmentation scheme, in which the images are zero-padded with 4 pixels on each side, randomly cropped to produce 32×32 images, and horizontally mirrored with probability 0.5, is adopted in our experiments, following usual practice [36] [37] [38]. During training, we randomly sample 5% of the training images for validation.
IV-A2 Training settings
We use exactly the same settings to train these models. All networks are trained using stochastic gradient descent (SGD) with weight decay and a momentum of 0.9. The weights are initialized according to [39], and the biases are initialized with zero. On CIFAR-10 and CIFAR-100, we trained for 300 epochs with a mini-batch size of 128. The initial learning rate is set to 0.1 and decayed by a factor of 0.1 after 120 and 240 epochs. Unless otherwise specified, we adopt batch normalization (BN) [10] right after each convolution, and the nonlinear activation is performed right after BN. Dropout regularization [40] is employed in the fully-connected layers, with a dropout ratio of 0.5.
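A PyTorch sketch of this optimization setup is given below. The weight-decay value is our own assumption, since the exact figure is not legible in this copy, and the conv-BN-activation block only illustrates the stated ordering.

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, act):
    # ordering used in the experiments: convolution -> BN -> nonlinearity
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         act)

def make_optimizer(model):
    # SGD with momentum 0.9; weight decay 1e-4 is our assumption
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    # initial LR 0.1, decayed by a factor of 0.1 after epochs 120 and 240 (300 total)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[120, 240], gamma=0.1)
    return opt, sched

block = conv_bn_act(3, 64, nn.ReLU())   # nn.ReLU() is swapped for SRS or any baseline
opt, sched = make_optimizer(block)
```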
IV-A3 Results
The results shown in Table III report the median of five different runs. As can be seen, our SRS matches or exceeds models with ReLU and other state-of-the-art nonlinearities. In particular, SRS networks perform significantly better than ReLU networks. For example, on VGG, SRS outperforms ReLU by 0.21% on CIFAR-10 and 1.0% on CIFAR-100, respectively. On MobileNet, SRS networks achieve up to 87.96% on CIFAR-10 and 60.59% on CIFAR-100, which are improvements of 2.33% and 4.38% over the ReLU baselines, respectively. Fig. 6 clearly shows the learning behavior of SRS and ReLU networks. Though the learning behavior differs depending on the models and datasets, SRS always converges faster than ReLU. In addition, networks with SRS show relatively lower validation error, demonstrating that our SRS manages to overcome the potential optimization difficulties of ReLU.
Model  tst2011  tst2012  tst2013  tst2014 

LReLU  23.34  19.68  20.02  24.24 
PReLU  23.04  19.43  20.02  23.22 
Softplus  23.77  19.75  20.32  23.79 
ELU  23.35  19.96  20.28  23.78 
SELU  23.63  19.93  20.46  24.04 
Swish  23.61  19.77  20.27  23.59 
ReLU  24.08  19.56  19.78  23.88 
SRS  24.08  20.08  20.37  24.20 
IV-B Machine Translation
Next, we show the effectiveness of our SRS on the IWSLT 2016 German-English translation task. For this task, we use the base setup of the Transformer as the neural machine translation model.
IV-B1 Datasets
The IWSLT 2016 German-English [41] training set consists of subtitles of TED talks, comprising about 196 thousand sentence pairs. Sentences are encoded using byte-pair encoding [42], with a shared source–target vocabulary of 8,000 tokens. We use the IWSLT16.TED.tst2010 set for validation, and the IWSLT16.TED.{tst2011, tst2012, tst2013, tst2014} sets for testing.
IV-B2 Training settings
The base setup of the Transformer [43] model has 6 layers, each of which has a fully-connected feed-forward network. This consists of two linear transformations with a ReLU activation function in between. We simply replace the ReLU with different nonlinearities. All models are trained using the Adam [44] optimizer. We trained for 64 epochs (about 196 thousand iterations) with a mini-batch size of 3,068 tokens. Dropout regularization is employed with a dropout ratio of 0.3. Label smoothing with a ratio of 0.1 is utilized. We measure performance with the standard BLEU metric.
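Swapping the nonlinearity amounts to replacing the activation inside the position-wise feed-forward sublayer; a minimal sketch follows, in which the layer dimensions and dropout placement are illustrative defaults rather than the exact IWSLT configuration.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear transformations with a pluggable nonlinearity in between."""

    def __init__(self, d_model=512, d_ff=2048, activation=None, dropout=0.3):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.act = activation if activation is not None else nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return self.w2(self.drop(self.act(self.w1(x))))

# ReLU is simply replaced by the nonlinearity under test (e.g. an SRS layer)
ffn = PositionwiseFFN(activation=nn.ReLU())
print(ffn(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```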
IV-B3 Results
The consistency of SRS providing better test accuracy compared to the baselines can also be observed on the machine translation task. Table IV shows the BLEU scores of the IWSLT 2016 German-English translation results on four test sets. We observed that the network with SRS always rises to the top. Particularly on the IWSLT16.TED.tst2012 set, the proposed SRS surpasses all baselines by at least 0.12 BLEU, demonstrating the effectiveness of our model. Besides, SRS networks perform significantly better than ReLU networks. Specifically, on IWSLT16.TED.tst2012, IWSLT16.TED.tst2013, and IWSLT16.TED.tst2014, SRS outperforms ReLU by a non-trivial 0.52, 0.59, and 0.32 BLEU, respectively. Fig. 7 shows the learning curves of SRS and ReLU networks on the validation set. Both converge, but SRS converges much faster than its counterpart. For example, SRS reaches 27.0 BLEU at about 32 epochs, while ReLU needs nearly twice as many iterations to reach the same score. More importantly, networks with SRS exhibit considerably better performance and generalize to the test data. This indicates that switching to SRS improves performance with little additional tuning.
IV-C Generative Modelling
We additionally verify the utility of SRS in building generative models. In this experiment, we compare SRS to different activation functions with i-ResNet flows on 2-dimensional toy datasets.
IV-C1 Datasets
The toy datasets consist of 2-dimensional data. Due to the multimodal and discontinuous nature of the real distributions, they are difficult to fit with flow-based models. We evaluate SRS and the baselines on learning densities from the “two moons”, “pinwheel”, and “eight gaussians” samples. The color represents the magnitude of the density function, with brighter values indicating larger values.
IV-C2 Training settings
We trained i-ResNet [45], a flow-based generative model that consists of 100 residual blocks. Each block is a composition of a multilayer perceptron with state sizes of 128–128–128–128 and a nonlinearity (e.g., ReLU, ELU). We adopt activation normalization [46] after each residual block and do not use batch normalization (BN) or dropout in this experiment. The Adam optimizer is used with weight decay. Models are trained for 50,000 steps with a mini-batch size of 500. We used brute-force computation of the log-determinant.
IV-C3 Results
Fig. 8 qualitatively shows the 2-dimensional density distributions learned by the generative model with different nonlinearities. We observed that models with continuous derivatives can fit both multimodal and even discontinuous distributions. Specifically, SRS and ELU are capable of modeling the multimodal distributions and can also learn convincing approximations of discontinuous density functions. Softplus learns to stretch the single-mode base distribution into multiple modes but has trouble modeling the areas of low probability between disconnected regions. Though Swish achieves generative quality comparable to SRS, it cannot fit the details accurately, i.e., it missed the arch at the tail of the pinwheel. On the other hand, nonlinearities with discontinuous derivatives, such as ReLU, LReLU, PReLU, and SELU, can lead to unstable optimization results. As stated in [47], we believe this is due to our model’s ability to avoid partitioning dimensions, so we can train on a density estimation task and still be able to sample from the learned density efficiently.
V Ablation Study
In this section, we conducted further studies on the FashionMNIST dataset and explored two main design choices: 1) compatibility with batch normalization; 2) parameter initialization with SRS. FashionMNIST [48] is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image associated with a label from 10 classes.
For this task, we designed a simple neural network, arranged as three 512-unit fully-connected (FC) layers followed by one 256-unit FC layer and a 10-unit FC output layer. The network input is a 28×28 image, with a softmax logistic regression for the output layer. The cost function is the cross-entropy loss, optimized with stochastic gradient descent (SGD). The SRS initial values are set as mentioned earlier. We trained the network for 10,000 steps with 50 examples per mini-batch, and report the median result of three runs.
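Under our reading of the stated configuration (three 512-unit FC layers, one 256-unit FC layer, and a 10-way output), the ablation network can be sketched as follows; the SRS initial values themselves are not repeated here.

```python
import torch
import torch.nn as nn

def make_mlp(act_factory):
    # 28x28 input -> three 512-unit FC layers -> one 256-unit FC layer -> 10 outputs
    layers, widths = [], [28 * 28, 512, 512, 512, 256]
    for m, n in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(m, n), act_factory()]
    layers.append(nn.Linear(widths[-1], 10))   # softmax is folded into the loss
    return nn.Sequential(nn.Flatten(), *layers)

model = make_mlp(nn.ReLU)                      # nn.ReLU is swapped for SRS or any baseline
criterion = nn.CrossEntropyLoss()              # cross-entropy cost
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # LR of 0.01 or 0.1 (Table V)
print(model(torch.randn(50, 1, 28, 28)).shape)             # mini-batch of 50 examples
```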

Model  LR = 0.01 (w/o BN)  LR = 0.01 (w/ BN)  LR = 0.1 (w/o BN)  LR = 0.1 (w/ BN)
LReLU  13.24  12.64    11.75 
PReLU  13.26  12.82    11.69 
Softplus  13.88  13.11    11.87 
ELU  12.75  12.90    12.32 
SELU  12.89  13.20    12.56 
Swish  12.77  12.36  12.67  11.61 
ReLU  12.96  12.91    11.40 
SRS  12.58  11.40  12.46  11.33 
V-A Compatibility with batch normalization.
We first study the compatibility of SRS with batch normalization (BN) [10]. All weights are initialized from a Gaussian distribution. The learning rate is set to 0.01 and 0.1, respectively. Table V reports the test errors of SRS and the baselines trained with and without BN (for the “LR = 0.1, w/o BN” setup, we trained all baselines, including Swish, with three extra runs, for a total of six, because the first three runs did not converge). Fig. 9 shows the learning curves for SRS and ReLU networks. It can be observed that SRS networks converge with either a learning rate of 0.01 or the larger learning rate of 0.1. We also found that BN can improve the performance of SRS networks. However, due to the training conflict between the representation restoration (learnable scale and bias) in BN and the negative saturation parameter in the activation function, BN does not improve ELU and SELU networks. These results indicate that SRS is more compatible with BN, which avoids gradient-based problems and makes it possible to use significantly higher learning rates.
V-B Parameter initialization with SRS.
Next, we investigate the effects of different parameter initializations for SRS, including Gaussian initialization, Xavier initialization [39], and He initialization [7]. The learning rate is set to 0.01. As shown in Table VI, we observe that no matter which initialization scheme is adopted, SRS achieves a lower test error than ReLU. This indicates that SRS is adaptive to different initial values and thus reduces the sensitivity to initialization.
We additionally plot the evolution of the activation means at each hidden layer under Gaussian initialization, as shown in Fig. 10. Layer 1 refers to the output of the first hidden layer, and there are four hidden layers. We see that very quickly at the beginning, all the ReLU activation values are pushed above zero, causing the output distribution to be scattered in the non-negative real number space. In contrast, SRS activations quickly converge to near zero and remain stable. Therefore, SRS modifies the output distribution and avoids the overfitting problem to some extent.
Model  Gaussian  Xavier  He

LReLU  13.24  13.33  13.03 
PReLU  13.26  13.27  13.33 
Softplus  13.88  17.22  17.17 
ELU  12.75  13.78  13.19 
SELU  12.89  13.15  12.19 
Swish  12.77  16.41  15.36 
ReLU  12.96  13.32  13.08 
SRS  12.58  13.24  12.71 
VI Conclusion and Future Work
An activation function plays a critical role in deep neural networks. Currently, the most effective and widely used activation function is ReLU. However, because of its non-zero mean, negative missing, and unbounded output, ReLU is at a potential disadvantage during optimization. Although various alternatives to ReLU have been proposed, none have successfully overcome all three challenges. In this work, we have introduced a novel activation function called Soft-Root-Sign (SRS) that addresses these issues.
The proposed SRS is smooth, non-monotonic, and bounded. In fact, the bounded property of SRS distinguishes it from most state-of-the-art activation functions. By defining a custom activation layer, SRS can be easily implemented in most deep learning frameworks. We have analyzed and studied many interesting properties of SRS, such as 1) gradient regression; 2) suitable data distribution; 3) soft inactive range; and 4) bounded output. We believe that these properties are the roots of SRS’s success, and suggest they should also be considered in practical activation function design.
In experiments, we benchmarked SRS against several baseline activation functions on image classification, machine translation, and generative modelling tasks. Empirical results demonstrate that our SRS matches or exceeds the baselines on nearly all tasks. In particular, SRS networks perform significantly better than ReLU networks. Ablation studies show that SRS is compatible with batch normalization (BN) and adaptive to different initial values. This makes it possible to use significantly higher learning rates and more general initialization schemes.
Finally, although SRS is more computationally expensive than ReLU because it involves complex mathematical operations, we expect that SRS implementations can be improved, e.g. by faster exponential functions. This remains one of the areas that warrants further exploration.
Appendix A Additional comparison of output landscapes for other nonlinearities
The activation function has a dramatic effect on the smoothness of the output landscape. A smoother output landscape results in a smoother loss landscape, which makes the network easier to optimize and leads to better performance. We have visualized the output landscapes of a 5-layer randomly initialized neural network for SRS and ReLU [4] (see Fig. 4). We also conduct output landscape comparisons for Softplus [4], LReLU [6], PReLU [7], ELU [5], SELU [11], RReLU [8], Sigmoid [22], Softsign [23], Tanh, Hardtanh, Swish [12], and Mish [27]. As shown in Fig. 11, most of them, similar to ReLU, exhibit sharpness in the output landscape and thus prove to be a roadblock to effective gradient-based optimization.
References
 [1] Yoshua Bengio et al., “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
 [2] Jürgen Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
 [3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436, 2015.
 [4] Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML10), 2010, pp. 807–814.
 [5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
 [6] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, 2013, vol. 30, p. 3.

 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
 [8] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
 [9] Sergey Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batchnormalized models,” in Advances in neural information processing systems, 2017, pp. 1945–1953.
 [10] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [11] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter, “Selfnormalizing neural networks,” in Advances in neural information processing systems, 2017, pp. 971–980.
 [12] Prajit Ramachandran, Barret Zoph, and Quoc V Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
 [13] Ian J Goodfellow, David WardeFarley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.

 [14] Suo Qiu, Xiangmin Xu, and Bolun Cai, “FReLU: Flexible rectified linear units for improving convolutional neural networks,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 1223–1228.
 [15] Ludovic Trottier, Philippe Giguère, Brahim Chaib-draa, et al., “Parametric exponential linear unit for deep convolutional neural networks,” in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017, pp. 207–214.
 [16] Yang Li, Chunxiao Fan, Yong Li, Qiong Wu, and Yue Ming, “Improving deep neural network with multiple parametric exponential linear units,” Neurocomputing, vol. 301, pp. 11–24, 2018.
 [17] Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti, and Brian Whitney, “Improving deep learning by inverse square root linear units (isrlus),” arXiv preprint arXiv:1710.09967, 2017.
 [18] Rahul Duggal and Anubha Gupta, “Ptelu: Parametric tan hyperbolic linear unit activation for deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 974–978.
 [19] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller, “Efficient backprop,” in Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.
 [20] Renato Negrinho and Geoff Gordon, “Deeparchitect: Automatically designing and training deep architectures,” arXiv preprint arXiv:1704.08792, 2017.
 [21] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar, “Designing neural network architectures using reinforcement learning,” arXiv preprint arXiv:1611.02167, 2016.

 [22] Yoshifusa Ito, “Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory,” Neural Networks, vol. 4, no. 3, pp. 385–394, 1991.
 [23] James Bergstra, Guillaume Desjardins, Pascal Lamblin, and Yoshua Bengio, “Quadratic polynomials learn better image features,” Technical report, 1337, 2009.
 [24] EJ Parkes and BR Duffy, “An automated tanh-function method for finding solitary wave solutions to nonlinear evolution equations,” Computer physics communications, vol. 98, no. 3, pp. 288–300, 1996.
 [25] Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio, “Noisy activation functions,” in International conference on machine learning, 2016, pp. 3059–3068.
 [26] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in international conference on machine learning, 2016, pp. 2217–2225.
 [27] Diganta Misra, “Mish: A self regularized nonmonotonic neural activation function,” arXiv preprint arXiv:1908.08681, 2019.
 [28] Mina Basirat and Peter M Roth, “The quest for the golden activation function,” arXiv preprint arXiv:1808.00783, 2018.
 [29] Stefan Elfwing, Eiji Uchibe, and Kenji Doya, “Sigmoidweighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, vol. 107, pp. 3–11, 2018.
 [30] Xiaojie Jin, Chunyan Xu, Jiashi Feng, Yunchao Wei, Junjun Xiong, and Shuicheng Yan, “Deep learning with s-shaped rectified linear activation units,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [31] Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi, “Learning activation functions to improve deep neural networks,” arXiv preprint arXiv:1412.6830, 2014.

 [32] Alberto Carini and Giovanni L Sicuranza, “Even mirror fourier nonlinear filters,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 5608–5612.
 [33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
 [34] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
 [35] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
 [36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
 [37] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
 [38] Min Lin, Qiang Chen, and Shuicheng Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
 [39] Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.
 [40] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [41] Mauro Cettolo, Niehues Jan, Stüker Sebastian, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico, “The iwslt 2016 evaluation campaign,” in International Workshop on Spoken Language Translation, 2016.
 [42] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
 [43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
 [44] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [45] Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and JörnHenrik Jacobsen, “Invertible residual networks,” arXiv preprint arXiv:1811.00995, 2018.
 [46] Durk P Kingma and Prafulla Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
 [47] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and LiangChieh Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
 [48] Han Xiao, Kashif Rasul, and Roland Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” 2017.