
Adaptively Customizing Activation Functions for Various Layers

12/17/2021
by Haigen Hu, et al.

Activation functions play a crucial role in enhancing the nonlinearity of neural networks and strengthening the mapping between inputs and response variables, enabling models to capture more complex relationships and patterns in the data. In this work, a novel methodology is proposed to adaptively customize activation functions by adding only a very few parameters to traditional activation functions such as Sigmoid, Tanh, and ReLU. To verify the effectiveness of the proposed methodology, theoretical and experimental analyses on accelerating convergence and improving performance are presented, and a series of experiments are conducted on various network models (AlexNet, VGGNet, GoogLeNet, ResNet, and DenseNet) and various datasets (CIFAR10, CIFAR100, miniImageNet, PASCAL VOC, and COCO). To further verify the validity and suitability under different optimization strategies and usage scenarios, comparison experiments are also carried out among different optimization strategies (SGD, Momentum, AdaGrad, AdaDelta, and ADAM) and different recognition tasks such as classification and detection. The results show that the proposed methodology is very simple yet delivers significant gains in convergence speed, precision, and generalization, and it surpasses popular methods such as ReLU and adaptive functions such as Swish in almost all experiments in terms of overall performance. The code is publicly available at https://github.com/HuHaigen/Adaptively-Customizing-Activation-Functions. The package includes the three proposed adaptive activation functions for reproducibility purposes.



1 Introduction

Activation functions play a key role in the training of neural networks, and considerable attention has been paid to exploring standard activation functions over the past years. In particular, with the remarkable development of Deep Neural Networks (DNNs) in various computer vision applications, such as image classification (he2016deep; krizhevsky2012imagenet; tan2017photograph), image segmentation (chen2017deeplab), object detection (girshick2014rich; jiang2016speed; he2015delving), image enhancement (lin2018image; tang2018joint), image retrieval (yu2014click; yu2016deep) and tracking (wu2016regional), the Rectified Linear Unit (ReLU) (Nair2010) has become extremely popular in the deep learning community in recent years. Owing to the significant improvements brought by ReLU in deep neural networks, extended versions keep springing up. For instance, Leaky ReLU (LReLU) (maas2013rectifier) replaces the negative part of ReLU with a non-zero slope, while Exponential Linear Units (ELUs) (clevert2015fast) tend to drive the cost toward zero faster and produce more accurate results. All these extended versions achieve, to a greater or lesser extent, improvements in their respective fields.

However, there is hardly a generally accepted rule of thumb for the choice of activation functions, since the choice depends solely on the problem at hand. Even ReLU, the most popular and commonly used activation function, is not suitable for all datasets and network architectures. Therefore, adaptive activation functions have drawn more and more attention in recent years. For example, Maxout (goodfellow2013maxout) can approximate any convex function by selecting the maximum output of multiple linear activation functions, but it introduces a large number of extra parameters, which leads to large memory consumption and high computational cost. In the Parametric Rectified Linear Unit (PReLU) (he2015delving), the slope of the negative part is learned from data rather than fixed in advance, so PReLU theoretically retains all the advantages of ReLU while effectively avoiding the dying ReLU problem; in practice, however, it has not been fully confirmed that PReLU always surpasses ReLU. In 2017, an activation function with a "self-normalization" property, named SELU (klambauer2017self), was proposed; it avoids the problems of vanishing and exploding gradients, thereby allowing feedforward neural networks to obtain beyond state-of-the-art performance. However, its effectiveness in Convolutional Neural Networks (CNNs) has not been confirmed. In the same year, Swish (ramachandran2017searching), which is unbounded above, bounded below, smooth and non-monotonic, was shown to perform better than ReLU on many deep models.

Although existing adaptive activation functions are more flexible than traditional activation functions owing to their adaptability, and have already achieved notable improvements, they are limited to specific application scenarios and still face problems such as limited generalization capability and unsatisfactory precision; for example, their performance often depends on particular network models and datasets. In this work, a novel methodology is proposed to explore optimal activation functions with more flexibility and adaptability by adding only a few additional parameters to traditional activation functions such as Sigmoid, Tanh and ReLU. The proposed methodology helps avoid local minima and accelerates convergence by introducing only a very few parameters to the fixed activation functions, thereby increasing precision, reducing training cost and improving generalization performance.

The primary contributions of our work are summarized as follows:

  • A novel methodology is proposed to customize activation functions with more flexibility and adaptability for various layers, introducing only a very few parameters into traditional activation functions such as Sigmoid, Tanh, and ReLU.

  • A theoretical analysis of the acceleration of convergence and the improvement of performance is presented, taking the activation function of a single layer as an example without loss of generality, and an experimental study compares the weight increments between two successive epochs, layer by layer, for the proposed AReLU and ReLU when training VGGNet on CIFAR100.

  • The proposed AReLU is a generalized form of the ReLU-based variants, with ReLU and PReLU as special cases of the proposed AReLU.

The rest of the paper is organised as follows. Section 2 introduces the related work, and the proposed methodology is presented in Section 3. Section 4 presents the analysis for our methodology. Section 5 details the experimental results for comparison and validation. Section 6 concludes the paper.

2 Related work

Over the last few decades, a great variety of activation functions have been proposed in the artificial neural network community. According to whether the parameters or shape of an activation function are learnable or variable during the training phase, activation functions can be divided into two categories: fixed activation functions and adaptive activation functions.

2.1 Fixed activation functions

Fixed activation functions are those whose parameters or shapes cannot be modified during the training phase (shown in Fig. 1), and the most common ones fall into three categories: the logistic function (Sigmoid), the hyperbolic tangent (Tanh) and the rectified linear unit (ReLU).

Sigmoid

The Sigmoid function is a common S-shaped function (S-shaped growth curve) and normally refers specifically to the logistic function. It maps any real value into the range (0, 1), so its output can be interpreted as a probability. It is defined as follows:

\sigma(z) = \frac{1}{1 + e^{-z}}   (1)

It is differentiable, and the derivative is derived as follows:

\sigma'(z) = \sigma(z)\,(1 - \sigma(z))   (2)

Note that the gradient \sigma'(z) \to 0 as z \to +\infty or z \to -\infty, meaning that when the Sigmoid saturates for large positive or negative inputs (i.e., the curve becomes parallel to the horizontal axis, as shown in Fig. 1), the gradients are almost zero. With a zero gradient the weights are no longer updated and the network stops learning: the neuron dies, which causes the vanishing gradient problem. Besides, Sigmoid outputs are not zero-centered, which can indirectly introduce undesirable zig-zagging dynamics into the gradient updates of the weights.

Tanh

The Tanh (hyperbolic tangent) function looks graphically very similar to Sigmoid. In fact, Tanh is simply a scaled and shifted Sigmoid whose outputs range from -1 to 1, defined as follows:

\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1   (3)

Like Sigmoid, Tanh is also affected by the vanishing gradient problem. Unlike Sigmoid, however, its output is zero-centered: strongly negative inputs are mapped to strongly negative outputs, and inputs near zero are mapped to outputs near zero. Therefore, the non-linearity of Tanh is generally preferred to that of Sigmoid, and it has been widely used in deep learning and machine learning, especially in two-class classification scenarios.

ReLU

ReLU is a very simple and efficient activation function that has been widely used in almost all deep learning domains, especially in CNNs, defined as

f(z) = \max(0, z)   (4)

Owing to its simpler mathematical operations, ReLU is far more computationally efficient than Tanh and Sigmoid, and it avoids saturation in the positive region. For negative inputs, however, the output is exactly zero, which yields a sparse representation that accelerates learning and simplifies the model in representation learning, but the corresponding weights and biases are not updated during backpropagation because the gradient is zero, causing the dying ReLU problem.
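For reference, the saturation behaviour described in this subsection is easy to inspect numerically. The following NumPy sketch (our own illustration using the standard textbook forms, not code from the authors' repository) evaluates the fixed activations and their derivatives; the near-zero gradients of Sigmoid at large |z| and the exactly-zero gradient of ReLU for negative inputs correspond to the vanishing-gradient and dying-ReLU problems discussed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # -> 0 for large |z|: vanishing gradient

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2    # also saturates, but Tanh output is zero-centered

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(z.dtype)  # exactly zero for z <= 0: dying ReLU

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(d_sigmoid(z))   # ~[4.5e-05, 0.197, 0.25, 0.197, 4.5e-05]
print(d_tanh(z))      # ~[8.2e-09, 0.42, 1.0, 0.42, 8.2e-09]
print(d_relu(z))      # [0., 0., 0., 1., 1.]
```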

Figure 1: An illustration of fixed activation functions with a fixed shape.

2.2 Adaptive activation functions

Adaptive activation functions refer primarily to functions whose parameters or shapes are trained and learned along with the other parameters of the neural network (shown in Fig. 2), and hence vary adaptively with the training data. In other words, the main idea of this kind of function is to search for a good function shape using the knowledge given by the training data. For example, PReLU (he2015delving) replaces the fixed slope of LReLU (maas2013rectifier) with a trainable parameter in the negative region. Swish (ramachandran2017searching), a recently proposed activation function that is unbounded above, bounded below, smooth and non-monotonic, can be loosely viewed as a bridging function between the linear function and ReLU. Other similar activation functions, such as FReLU (qiu2018frelu) and PELU (trottier2017parametric), have achieved performance improvements in some specific tasks.

Although existing adaptive activation functions have been shown to improve network performance significantly, thanks to properties such as non-saturation, flexibility and adaptivity, exploring optimal and appropriate activation functions is still an open field of research, and there remains room for improvement in various scenarios, especially for complex datasets and different models.

Figure 2: An illustration of various adaptive activation functions. The shapes of activation functions can be controlled and adjusted by some parameters, which are trained along with other parameters in the neural networks.

3 Methodology

The training of neural networks is essentially a non-convex optimization problem in which the optimal weight parameters are searched for with the back-propagation algorithm, while the functional subspace that is explored is determined by the activation function. Adaptive activation functions adapt themselves to the network inputs: they learn hyper-parameters that adjust the affine transformation applied to a given input, thereby increasing the flexibility and representational ability of network models.

In this work, we attempt to construct a new parameter-learning method for each layer by introducing only a few parameters into the fixed activation functions, and the general form of the adaptive activation function in a given layer can be defined as follows

(5)

where the base function is a traditional (fixed) activation function and the four accompanying coefficients are learnable parameters of the layer; they adapt to different tasks according to the complexity of the input data, which helps to avoid falling into local minima. The argument of the activation function, denoted z, is the weighted sum of the inputs, including the bias term, defined as

z = \mathbf{w}^{\mathrm{T}}\mathbf{x} + b   (6)

where \mathbf{w} and b denote the weights and the bias, respectively, and \mathbf{x} is the input vector.

In practice, the proposed adaptive activation function is very simple: it is composed of two nested linear transformations, namely an internal linear equation

(7)

and an external linear equation

(8)

Therefore, Equation (5) can be rewritten as

(9)

In the following sections, the effectiveness and advantages of the proposed methodology will be verified by taking some common fixed activation functions as baselines, such as Sigmoid, Tanh and ReLU; the corresponding adaptive activation functions are named ASigmoid, ATanh and AReLU, respectively. According to Equation (5), these functions are respectively defined as

(10)
(11)
(12)

In ASigmoid and ATanh, two of the learnable parameters scale the inputs of Sigmoid and Tanh, respectively, while the other two simultaneously scale the outputs.

Notably, for a particular setting of the parameters, the negative part of AReLU has zero slope while the slope of the positive part is fixed; in this case, AReLU degenerates into the standard ReLU, given as

(13)

Furthermore, for another setting of the parameters, the slope of the negative part becomes adjustable, meaning that it is learned from the data rather than pre-defined. Under these conditions, AReLU reduces to PReLU, given as

(14)

Therefore, AReLU is a generalized form of the ReLU-based versions, while ReLU and PReLU are the special cases of the proposed AReLU.

From the above, it is clear that our method adds only four parameters per layer, so the total number of additional parameters grows only linearly with the number of layers (4L parameters for an L-layer model). Both this parameter count and the associated computation are negligible compared with the entire network model.
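As a concrete illustration of the methodology, the PyTorch sketch below implements a per-layer adaptive activation with four learnable scalars. Because the exact symbols of Eqs. (5)-(12) are not recoverable from this rendering, the particular parameterization shown, an external scale and shift wrapped around an internal scale and shift, y = a·f(b·z + c) + d, is our assumption based on the description of the internal and external linear equations; the class and attribute names are ours, not those of the authors' released code.

```python
import torch
import torch.nn as nn

class AdaptiveActivation(nn.Module):
    """Per-layer adaptive activation: y = a * f(b * z + c) + d (assumed form).

    f is a fixed base activation (Sigmoid, Tanh or ReLU) and a, b, c, d are
    the four learnable scalars added to each layer.
    """
    def __init__(self, base=torch.relu):
        super().__init__()
        self.base = base
        # Initialise as the identity wrapping: the module starts out exactly
        # equal to the plain base activation.
        self.a = nn.Parameter(torch.tensor(1.0))  # external scale
        self.b = nn.Parameter(torch.tensor(1.0))  # internal scale
        self.c = nn.Parameter(torch.tensor(0.0))  # internal shift
        self.d = nn.Parameter(torch.tensor(0.0))  # external shift

    def forward(self, z):
        return self.a * self.base(self.b * z + self.c) + self.d

# Drop-in usage inside a convolutional block (an "AReLU"-style layer under the
# assumed parameterization):
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    AdaptiveActivation(torch.relu),
)
x = torch.randn(8, 3, 32, 32)
print(block(x).shape)  # torch.Size([8, 16, 32, 32])
```

Only four scalars are trained per activation layer, so the extra parameter count and computation are negligible, consistent with the discussion above.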

Figure 3: The shapes of ASigmoid (green), ATanh (orange) and AReLU (blue) at different layers during the training process on CIFAR100 based on VGG.

4 Analysis

A convex loss function, defined on the linear weighted combination of the activation functions applied to the inputs, is minimized to find the optimal weights by adopting suitable optimization strategies based on the back-propagation algorithm. Thus, the training process of the network model is essentially an iterative optimization of the weight parameters that minimizes the loss function within the functional subspace.

4.1 Theoretical Analysis

To facilitate the analysis, we take the activation function of a single layer as an example, without loss of generality. Suppose that a neural network with a traditional activation function is given as

(15)

For the update process of weights, the partial derivative chain is defined as follows.

(16)
(17)

Meanwhile, the weight is updated as follows.

(18)

where the coefficient in front of the gradient is the learning rate. Substituting Equations (16) and (17) into (18), we obtain the weight-update equation for a common activation function as follows.

(19)

Considering that the proposed adaptive activation functions consist of two linear equations, for simplicity we consider only the internal linear function and omit its intercept term in a given layer, written as

(20)

where the left-hand side is the output of the adaptive activation function. The internal scale hyperparameter provides a general way of scaling the inputs in any layer, and, as derived below, it can be interpreted as fine-tuning the learning rate so as to speed up the weight updates.

For this output, the partial derivative with respect to the weight is given as

(21)

With Equations (16), (18) and (21), the update process of the weight is given as

(22)

Comparing Equations (19) and (22), the effective learning rate of the adaptive activation function can be written as:

(23)

From Equation (23), the learning rate can be adjusted adaptively through the internal scale hyperparameter. Since this hyperparameter is optimized in the same way as the weights, its update is obtained by using the chain rule.

(24)
(25)

Combining Equations (24) and (25), we obtain

(26)

Then, with Equations (6) and (26),

(27)

Therefore, the adaptive activation function achieves rapid convergence through an adaptive learning rate: the weight and the scale parameter adjust each other mutually, which speeds up learning in the neural network and leads to higher classification precision.

Besides, the internal linear equation also has its own intercept term, which contributes to tuning the parameters along another direction during training, thereby helping to avoid local extrema.

Similarly, the external linear equation has the same effect of accelerating convergence and improving performance. More importantly, since the internal equation is embedded within the external one, the optimization can move toward a global optimum more efficiently from all directions.
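The learning-rate interpretation above can be checked with a toy autograd example. The sketch below (our own illustration, not the authors' derivation) compares the gradient of a single ReLU neuron's output with respect to its weight when the pre-activation is rescaled by an internal factor b; in the active region the gradient is multiplied by exactly that factor, which is the effect summarized by Eq. (23).

```python
import torch

x = torch.tensor(1.5)

# Gradient of y = relu(b * (w * x)) with respect to w for two values of the
# internal scale b. In the active region dy/dw = b * x, so the internal scale
# acts as a multiplicative adjustment of the effective learning rate.
for b_val in (1.0, 2.0):
    w = torch.tensor(0.8, requires_grad=True)
    y = torch.relu(torch.tensor(b_val) * (w * x))
    y.backward()
    print(f"b = {b_val}: dy/dw = {w.grad.item():.2f}")
# b = 1.0: dy/dw = 1.50  (= x)
# b = 2.0: dy/dw = 3.00  (= b * x)
```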

4.2 Experimental Analysis

Owing to the fact that each layer has its own independent activation function, the optimal hyperparameters of each layer can be obtained by learning from the respective complexity of the input data, and their values vary with the characteristics of that data. Consequently, the learned weights are optimal and the resulting activation functions differ from layer to layer. Fig. 3 visualizes different layers during the training process on CIFAR100 based on VGG; the different shapes of ASigmoid, ATanh and AReLU at different layers indicate that these functions learn the optimal hyperparameters from the inputs of their respective layers, which enhances the fitting capability and the accuracy of the networks.

Moreover, compared with traditional adaptive activation functions, the two nested linear equations with intercepts accelerate the weight adjustment. For further verification, the change curves of the weight increments between two successive epochs in various layers are visualized over the course of training (shown in Fig. 4). The results clearly show that the amplitudes of the increments obtained with the proposed AReLU are much larger than those of the traditional ReLU in the early training stages, after which the increments of the two methods converge; in other words, the proposed method provides faster weight updates than the traditional one. Consequently, the proposed methods can greatly improve the convergence speed and reduce the computational burden. Meanwhile, the large increment amplitudes also help to avoid falling into a local optimum when training neural networks with gradient-based learning and backpropagation.

Figure 4: The change curves of the weight increments between two successive epochs in different layers during the training process using AReLU and ReLU on CIFAR100 based on VGGNet. Columns 1-3 correspond to three different layers, and rows 1-4 illustrate four different weights for AReLU and ReLU.

5 Experiments

In this section, a series of experiments are implemented to verify and evaluate the effectiveness of the proposed methodology based on three baseline activation functions: Sigmoid, Tanh and ReLU. Considering that ReLU is the most common activation function used in neural networks and has many derivatives, some typical derivatives such as LReLU and PReLU are selected for comparison to highlight the effectiveness of the parameterization of the activation function, while Swish, as an outstanding adaptive activation function, is used as a state-of-the-art reference. First, comparison experiments between the proposed functions and their corresponding baselines are conducted using stochastic gradient descent (SGD) (cramer1946mathematical) on CIFAR10 and CIFAR100 with different network models, namely AlexNet (krizhevsky2012imagenet), VGG (simonyan2014very), GoogleNet (szegedy2015going), ResNet (he2016deep) and DenseNet (huang2017densely). Then, further experiments verify the validity and suitability under various optimization strategies, namely SGD, Momentum (qian1999momentum), AdaGrad (duchi2011adaptive), AdaDelta (zeiler2012adadelta) and ADAM (kingma2014adam). Finally, a series of comparison experiments further verify the effectiveness, suitability and generalization ability on more complicated datasets, namely miniImageNet (Oriol2016), PASCAL VOC (VOC2012) and COCO (COCO2014).

5.1 Experimental setup

We test the proposed adaptive activation functions on CIFAR10 and CIFAR100 based on AlexNet, VGGNet, GoogleNet, ResNet and DenseNet. The detailed experimental setup is illustrated in Fig. 5.

Figure 5: The network architectures of AlexNet, VGGNet, GoogleNet, ResNet and DenseNet on CIFAR10 and CIFAR100.

Note that the four dense blocks contain different numbers of layers, the transition layers follow the configuration shown in the figure, and the growth rate is k = 24 throughout.

5.2 CIFAR10

In these experiments, the proposed adaptive activation functions (10)-(12) are applied to the CIFAR10 dataset based on the models shown in Fig. 5. All models are trained for at least 80 epochs with a batch size of 64 and without data augmentation, using SGD with fixed learning rates of 0.001, 0.0001 and 0.00001 applied in stages over the course of training.
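A minimal sketch of this training configuration (SGD with the three fixed learning rates applied in stages) is shown below. It is an assumption-laden illustration: the stand-in model, the elided data loading, and the epoch milestones at which the learning rate is switched are our own placeholders, since the paper reports only the three learning-rate values, the batch size and the epoch budget.

```python
import torch
import torch.nn as nn

# Stand-in model; in the paper this role is played by AlexNet/VGGNet/GoogleNet/ResNet/DenseNet.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Staged fixed learning rates 1e-3 -> 1e-4 -> 1e-5; the milestone epochs (40, 60)
# are illustrative placeholders, not values reported in the paper.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 60], gamma=0.1)

for epoch in range(80):
    # ... iterate over CIFAR10 batches of size 64, compute the loss,
    #     call loss.backward() and optimizer.step() here ...
    scheduler.step()
```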

Fig. 6 shows the convergence curves (top row) and the area enclosed by the convergence curves (bottom row) during the training process. The smaller the area, the faster the convergence, and the results clearly show that the proposed activation functions surpass the corresponding baseline functions and LReLU across the different network models in terms of convergence speed. Among them, AReLU obtains the fastest convergence; on DenseNet, ResNet and VGG16 in particular, it converges much faster than the other activation functions. Table 1 and Table 2 show the quantified precision results (missing entries indicate that the activation function does not converge in that setting), and the proposed methodology has an overall advantage. Table 1 compares the proposed methods with their respective baselines. The proposed methodology can be effectively applied to the classic fixed activation functions and surpasses the corresponding baselines on most network models. For instance, ASigmoid surpasses Sigmoid on all models, while ATanh obtains better precision than Tanh on VGGNet, ResNet and DenseNet, and the precision of ATanh on AlexNet and GoogleNet is only slightly lower than that of Tanh. Table 2 compares AReLU with other adaptive activation functions. Except for PReLU on AlexNet and VGGNet, AReLU obtains higher precision overall than the other adaptive functions across the various models. Note that some traditional adaptive functions, such as PELU and FReLU, are not suitable for some network models owing to a lack of convergence during training, whereas the proposed methodology applies to various deep learning models and has better generalization performance.

Figure 6: The top row shows the loss-epoch convergence curves of the various activation functions on CIFAR10 using the SGD optimizer during training on the different models, i.e., AlexNet, VGGNet, GoogleNet, ResNet and DenseNet. The bottom row shows the area enclosed by each convergence curve in the top row; a smaller area indicates faster convergence.
Methods AlexNet VGGNet GoogleNet ResNet DenseNet
Sigmoid 0.8141±1.52e-3 0.8680±1.17e-3 0.8027±1.13e-3 0.8079±1.25e-3 0.7611±3.86e-3
ASigmoid(ours) 0.8469±1.68e-3 0.8819±2.28e-4 0.8517±4.07e-4 0.8087±1.36e-3 0.9076±8.16e-4
Tanh 0.8043±9.68e-4 0.8841±7.72e-4 0.8433±6.06e-4 0.5784±9.75e-4 0.8966±1.99e-5
ATanh(ours) 0.7900±5.31e-4 0.8904±3.05e-4 0.8342±7.99e-4 0.6014±8.11e-4 0.9163±1.87e-4
ReLU 0.8351±3.95e-4 0.8800±1.01e-4 0.8733±4.36e-5 0.6641±2.59e-3 0.9242±1.10e-4
LReLU(maas2013rectifier) 0.8255±1.66e-3 0.9253±4.68e-5 0.8742±9.31e-4 0.6698±1.27e-3 0.9289±2.37e-5
AReLU(ours) 0.8331±2.48e-4 0.8328±1.26e-3 0.8773±1.83e-3 0.9230±5.51e-4 0.9538±5.58e-4
Table 1: The classification precision of various fixed activation functions for different models on CIFAR10.
Methods AlexNet VGGNet GoogleNet ResNet DenseNet
PReLU(he2015delving) 0.8558±4.61e-4 0.9344±5.89e-5 0.8551±1.79e-4 0.6522±3.20e-4 0.9231±1.40e-4
Swish(ramachandran2017searching) 0.7557±2.20e-3 0.917±1e-4 0.8710±3.25e-3 0.6446±3.39e-4 0.9276±1.87e-4
PELU(trottier2017parametric) – 0.8128±4.45e-4 0.7891±3.91e-4
FReLU(qiu2018frelu) 0.8558±2.44e-4 0.8694±5.21e-3 0.8726±3.71e-4 0.8214±4.11e-4
AReLU(ours) 0.8331±2.48e-4 0.8328±1.26e-3 0.8773±1.83e-3 0.9230±5.51e-4 0.9538±5.58e-4
Table 2: The classification precision of various adaptive activation functions for different models on CIFAR10. Missing entries (–) indicate that the activation function does not converge in that setting. The top results are highlighted in black bold and the second-best results in blue.

5.3 CIFAR100

To further verify the validity and applicability, CIFAR100 is selected for training on several typical network models, namely AlexNet, DenseNet (shown in Fig. 5) and VGG-v, where VGG-v is an extended version of VGGNet. The models are trained for 150 epochs with a batch size of 250 and fixed learning rates of 0.001, 0.0001 and 0.00001 applied in stages over the course of training.

Table 3 shows the classification results of the proposed methods and their corresponding baselines on AlexNet, VGG-v and DenseNet. Except for AReLU on AlexNet, the proposed methods surpass their respective baseline functions. Table 4 compares AReLU with other adaptive functions (missing entries indicate that the activation function does not converge in that setting). AReLU achieves the best precision on all three network models, while PELU and FReLU again fail to converge to the desired loss values.

Methods AlexNet VGG-v DenseNet
Sigmoid 0.4662±2.49e-3 0.5545±5.89e-4 0.2561±1.23e-3
ASigmoid(ours) 0.5312±8.66e-4 0.6578±7.56e-4 0.6103±3.15e-3
Tanh 0.5058±6.06e-4 0.6007±6.90e-4 0.5960±2.53e-3
ATanh(ours) 0.5236±1.98e-4 0.6166±2.41e-4 0.6734±3.98e-4
ReLU 0.5701±1.18e-4 0.6972±4.99e-4 0.5616±1.09e-3
LReLU(maas2013rectifier) 0.5500±1.09e-3 0.6991±5.53e-4 0.6914±5.12e-4
AReLU(ours) 0.5647±1.71e-4 0.7005±1.04e-3 0.7081±1.64e-3
Table 3: The classification precision of various fixed activation functions for different models on CIFAR100.
Methods AlexNet VGG-v DenseNet
PReLU(he2015delving) 0.5325±5.01e-4 0.6838±3.84e-5 0.5781±8.87e-3
Swish(ramachandran2017searching) 0.5519±1.16e-3 0.6729±8.32e-5 0.7079±3.01e-3
PELU(trottier2017parametric) – – –
FReLU(qiu2018frelu) 0.1534 0.1325±2.16e-4
AReLU(ours) 0.5647±1.71e-4 0.7005±1.04e-3 0.7081±1.64e-3
Table 4: The classification precision of various adaptive activation functions for different models on CIFAR100. Missing entries (–) indicate that the activation function does not converge in that setting.

5.4 Validity and practicability in various optimization strategies

Gradient descent algorithms are often used as black-box optimizers in neural networks, and different optimization strategies have a great influence on the performance of activation functions in practice. Therefore, to further verify the validity and practicability under various optimization strategies, the best-performing method, AReLU, is selected as the activation function on GoogleNet and ResNet, and a series of comparison experiments are conducted on CIFAR10 with various optimizers, namely SGD, Momentum, AdaGrad, AdaDelta and ADAM.
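For reference, the five optimization strategies compared in this subsection map directly onto standard PyTorch optimizers. The helper below is a hedged sketch: the learning rate mirrors the value used in the earlier experiments, and any hyperparameters beyond that (e.g. the momentum coefficient) are common defaults assumed by us, since the paper does not specify them.

```python
import torch

def make_optimizer(name, params, lr=1e-3):
    """Build one of the five optimizers compared in Section 5.4."""
    if name == "SGD":
        return torch.optim.SGD(params, lr=lr)
    if name == "Momentum":
        # A momentum coefficient of 0.9 is a common default, assumed here.
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    if name == "AdaGrad":
        return torch.optim.Adagrad(params, lr=lr)
    if name == "AdaDelta":
        return torch.optim.Adadelta(params, lr=lr)
    if name == "ADAM":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")

# Example: optimizer = make_optimizer("ADAM", model.parameters())
```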

Fig. 7 shows the convergence curves for the various activation functions under the various optimization strategies on GoogleNet and ResNet, respectively. On GoogleNet, AReLU converges faster than all the other activation functions, and on ResNet it also shows an overall convergence advantage, especially under the AdaGrad and AdaDelta optimizers. These results indicate that the proposed AReLU accelerates convergence, thereby reducing the training cost.

Table 5 further shows that the proposed AReLU achieves better overall performance than the other activation functions across the different optimization strategies and network models. Apart from ReLU with the Momentum optimizer on GoogleNet and Swish with the ADAM optimizer on ResNet, where AReLU is only slightly worse, AReLU surpasses the other activation functions under all optimization strategies on both network models, often by a clear margin. Notably, AReLU with SGD generally achieves the best precision among all combinations of activation function and optimizer on both models. Fig. 8 illustrates the convergence of the proposed AReLU with the various optimizers; SGD converges faster than all other optimizers on both models, especially on ResNet.

The above results show that the proposed adaptive activation functions converge faster and achieve higher precision than traditional activation functions, suggesting that the proposed methodology helps avoid local minima and accelerates convergence, thereby increasing precision, reducing training cost and improving generalization performance.

Figure 7: The convergence curves by using various optimization strategies on GoogleNet (top row) and ResNet (bottom row).
Figure 8: The convergence curves of AReLU by using various optimization strategies on GoogleNet and ResNet.
Models Methods SGD Momentum AdaGrad AdaDelta ADAM
GoogleNet ReLU 0.8733±4.36e-5 0.8951±7.30e-5 0.6065±1.53e-4 0.5686±3.31e-4 0.8797±1.08e-3
GoogleNet LReLU(maas2013rectifier) 0.8742±9.31e-4 0.8498±4.20e-4 0.6623±3.22e-3 0.6521±1.92e-3 0.8654±8.78e-5
GoogleNet PReLU(he2015delving) 0.8551±1.79e-4 0.8138±2.14e-4 0.6278±8.44e-6 0.5367±1.45e-3 0.8177±2.34e-6
GoogleNet Swish(ramachandran2017searching) 0.8710±3.25e-3 0.8772±1.39e-3 0.6747±1.58e-3 0.6139±2.66e-4 0.8743±1.92e-3
GoogleNet AReLU(ours) 0.8773±1.83e-3 0.8651±8.44e-5 0.7408±6.38e-3 0.7221±2.57e-3 0.8910±3.37e-4
ResNet ReLU 0.6641±2.59e-3 0.6274±1.99e-3 0.4253±7.38e-4 0.3553±1.57e-3 0.8519±6.20e-4
ResNet LReLU(maas2013rectifier) 0.6698±1.27e-3 0.6635±1.40e-3 0.5162±1.61e-4 0.4046±3.87e-3 0.8065±1.07e-3
ResNet PReLU(he2015delving) 0.6522±3.20e-4 0.6325±2.11e-4 0.3493±2.07e-3 0.2846±4.28e-3 0.7889±8.08e-4
ResNet Swish(ramachandran2017searching) 0.6446±3.39e-4 0.6099±7.02e-5 0.4032±3.41e-3 0.3112±3.31e-3 0.8909±6.84e-5
ResNet AReLU(ours) 0.9230±5.51e-4 0.7862±1.38e-3 0.7341±2.92e-3 0.7155±5.56e-4 0.8805±2.31e-4
Table 5: Classification precision comparisons between various activation functions by using different optimization strategies and models on CIFAR10. The top results are highlighted in black bold and the second-best results in blue.

5.5 More complicated datasets

Two more complicated datasets, miniImageNet (Oriol2016) and PASCAL VOC (VOC2012), are used to further test the validity of the proposed methodology based on ResNet50. Table 6 compares the classification precision of AReLU and other adaptive functions (missing entries indicate that the activation function does not converge). The results show that AReLU and PReLU obtain the best classification precision on PASCAL VOC and miniImageNet, respectively, and overall their performances on the two datasets are nearly equal. Note that PELU and FReLU fail to converge, which reflects how complicated and challenging the two datasets are.

All of the above experiments address classification tasks across various models, datasets and methods, and they show that the proposed adaptive activation functions achieve the best overall classification performance. To test their validity and practicability in other deep learning tasks, PASCAL VOC is used for object detection with Faster R-CNN (RenHGS15) and YOLOv2 (redmon2016yolo9000) built on the proposed AReLU, and the more challenging detection dataset COCO (COCO2014) is used with FCOS (tian2021fcos) to further verify the effectiveness. Table 7 compares the detection precision of the various adaptive functions (missing entries indicate that the activation function does not converge). The proposed AReLU achieves nearly the best detection performance in terms of AP50, AP75 and mAP across the different detectors and datasets, which confirms the validity and practicability of our method.

Across this series of comparison experiments, the proposed methods achieve better performance in a wide range of scenarios, covering different datasets, network models, optimization methods and deep learning tasks. The most significant reason is that our methodology has a nested internal-and-external linear structure: because the internal function is embedded within the external one, the optimization can move toward a global optimum more efficiently from all directions, thereby accelerating convergence and improving performance. More importantly, the proposed methodology adds only a small number of parameters (four per layer), which is negligible compared with the millions of parameters in an entire network model, so the additional computation and the risk of over-fitting increase only marginally.

Methods miniImageNet PASCAL VOC
ReLU 0.7562±4.60e-3 0.5933±5.70e-3
PReLU(he2015delving) 0.7813±1.20e-3 0.6066±2.00e-4
Swish(ramachandran2017searching) 0.7134±2.50e-3 0.5154±2.30e-3
PELU(trottier2017parametric) – –
FReLU(qiu2018frelu) – –
AReLU(ours) 0.7751±1.40e-3 0.6092±3.02e-3
Table 6: The classification precision of various activation functions on miniImageNet and PASCAL VOC based on ResNet50. Missing entries (–) indicate that the activation function does not converge. The top results are highlighted in black bold and the second-best results in blue.
Methods Faster-RCNN (PASCAL VOC) YOLOv2 (PASCAL VOC) FCOS (COCO)
AP50 AP75 mAP AP50 AP75 mAP AP50 AP75 mAP
ReLU 0.8501 0.6266 0.7661 0.5924 0.1459 0.3970 0.4974 0.3426 0.3232
PReLU(he2015delving) – – – 0.1777 0.0176 0.0931 0.4943 0.3399 0.3214
Swish(ramachandran2017searching) 0.7111 0.4981 0.6658 0.6250 0.2432 0.4621 0.4923 0.3418 0.3220
PELU(trottier2017parametric) – – – – – – – – –
FReLU(qiu2018frelu) 0.8486 0.6273 0.7669 0.2604 0.0627 0.1670 0.5030 0.3473 0.3274
AReLU(ours) 0.8530 0.6271 0.7675 0.6384 0.2459 0.4687 0.5009 0.3490 0.3279
Table 7: The detection precision of various activation functions for different methods on various datasets. Missing entries (–) indicate that the activation function does not converge. The top results are highlighted in black bold and the second-best results in blue.

6 Conclusions

In this work, a novel methodology is proposed to adaptively customize activation functions for various layers; it contributes to avoiding local minima and accelerating convergence, thereby increasing precision, reducing training cost and improving generalization performance. In this methodology, a small number of parameters are introduced into traditional activation functions such as Sigmoid, Tanh and ReLU, and theoretical and experimental analyses of the resulting acceleration of convergence and improvement of performance are presented. To verify the effectiveness of the proposed methodology, a series of experiments are implemented on CIFAR10, CIFAR100, miniImageNet, PASCAL VOC and COCO, employing various network models (AlexNet, VGGNet, GoogleNet, ResNet and DenseNet), various optimization strategies (SGD, Momentum, AdaGrad, AdaDelta and ADAM), and various tasks such as classification and detection. The results show that the proposed methodology is very simple yet delivers significant gains in convergence speed, precision and generalization, and it surpasses popular methods such as ReLU and Swish in almost all experiments in terms of overall performance.

7 Acknowledgments

The authors would like to express their appreciation to the referees for their helpful comments and suggestions. This work was supported in part by Zhejiang Provincial Natural Science Foundation of China (Grant No. LGF20H180002 and GF22F037921), and in part by National Natural Science Foundation of China (Grant No. 61802347, 61801428 and 61972354), and the National Key Research and Development Program of China (Grant No. 2018YFB1305202).

References