1 Introduction
Activation functions play a key role in the training of neural networks, and considerable attention has been paid to exploring standard activation functions over the past years. In particular, with the remarkable development of Deep Neural Networks (DNNs) in various computer vision applications, such as image classification
(he2016deep; krizhevsky2012imagenet; tan2017photograph), image segmentation (chen2017deeplab), object detection (girshick2014rich; jiang2016speed; he2015delving), image enhancement (lin2018image; tang2018joint; yu2014click; yu2016deep) and tracking (wu2016regional), the Rectified Linear Unit (ReLU)
(Nair2010) has become extremely popular in the deep learning community in recent years. Owing to the significant improvements brought by ReLU in deep neural networks, extended versions are constantly springing up. For instance, Leaky ReLU (LReLU) (maas2013rectifier) replaces the negative part of ReLU with a nonzero slope, while Exponential Linear Units (ELUs) (clevert2015fast) tend to converge faster and produce more accurate results. All these extended versions can more or less achieve a certain effect in their respective fields. However, there is hardly a generally accepted rule of thumb for the choice of activation function, since the choice solely depends on the problem at hand. Even ReLU, the most popular and commonly used activation function, is not suitable for all datasets and network architectures. Therefore, adaptive activation functions have drawn more and more attention in recent years. For example, Maxout (goodfellow2013maxout) can approximate any convex function by selecting the maximum output of multiple linear activation functions, but it introduces a large number of extra parameters, which causes large storage memory and high computation cost. In the Parametric Rectified Linear Unit (PReLU) (he2015delving), the slope of the negative part is learned from data rather than set to a predefined fixed value; thus PReLU theoretically retains all the advantages of ReLU while effectively avoiding the dying ReLU problem. In practice, however, it has not been fully confirmed that PReLU always surpasses ReLU. In 2017, an activation function with the property of "self-normalization", named SELU (klambauer2017self), was proposed
, which can avoid the problems of vanishing and exploding gradients, thereby enabling feed-forward neural networks to obtain beyond state-of-the-art performance. However, the effectiveness of SELU in Convolutional Neural Networks (CNNs) has not been confirmed. In the same year, Swish
(ramachandran2017searching), with complex characteristics such as being unbounded above, bounded below, smooth and non-monotonic, was shown to perform better than ReLU on many deep models. Although the existing adaptive activation functions are relatively more flexible than traditional activation functions owing to their adaptability, and have already achieved great improvements, they are limited to specific application scenarios, and many problems remain to be solved, such as low generalization capability and poor precision: their performance often depends on specific network models and datasets. In this work, a novel methodology is proposed to explore optimal activation functions with more flexibility and adaptability only by adding a few additional parameters to traditional activation functions such as Sigmoid, Tanh and ReLU. The proposed methodology can avoid local minima and accelerate convergence by introducing very few parameters to the fixed activation functions, thereby increasing the precision, reducing the training cost and improving the generalization performance.
The primary contributions of our work are summarized as follows:

A novel methodology is proposed to customize activation functions with more flexibility and adaptability for various layers only by introducing very few parameters to traditional activation functions such as Sigmoid, Tanh and ReLU.

A theoretical analysis of the accelerated convergence and improved performance is presented by taking the activation function of one layer as an example, without loss of generality, and an experimental study is performed by comparing the weight increments between two successive epochs in different layers during training between the proposed AReLU and ReLU on CIFAR-100 based on VGGNet.

The proposed AReLU is a generalized form of the ReLU-based versions, while ReLU and PReLU are special cases of the proposed AReLU.
The rest of the paper is organized as follows. Section 2 introduces the related work, and the proposed methodology is presented in Section 3. Section 4 presents the analysis of our methodology. Section 5 details the experimental results for comparison and validation. Section 6 concludes the paper.
2 Related work
Over the last few decades, a wide variety of activation functions have been proposed in the artificial neural network community. According to whether the parameters or shape of an activation function are learnable or variable during the training phase, activation functions can be divided into two categories: fixed activation functions and adaptive activation functions.
2.1 Fixed activation functions
Fixed activation functions are those whose parameters or shapes cannot be modified during the training phase (shown in Fig. 1), and the most common fixed activation functions fall into three categories: Logistic (Sigmoid), Hyperbolic Tangent (Tanh) and Rectified Linear Unit (ReLU).
Sigmoid
The Sigmoid function is a common S-like function or S-like growth curve, and normally refers specifically to the logistic function. It maps any real value into the range (0, 1), and can thereby be interpreted as a probability. It is defined as follows:
$$\sigma(x)=\frac{1}{1+e^{-x}}\tag{1}$$
It is differentiable, and the derivative is derived as follows:
$$\sigma'(x)=\sigma(x)\left(1-\sigma(x)\right)\tag{2}$$
Note that the gradient $\sigma'(x)\to 0$ as $x\to+\infty$ or $x\to-\infty$, meaning that, when the output of Sigmoid saturates for large positive or negative inputs (i.e., the curve becomes parallel to the $x$-axis, as shown in Fig. 1), the gradients are almost zero. Due to the zero gradient, the weights are no longer updated and the network stops learning; the neuron dies, causing the vanishing gradient problem. Besides, Sigmoid outputs are not zero-centered, which can indirectly introduce undesirable zigzagging dynamics in the gradient updates for the weights.
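These properties can be checked numerically; the short sketch below (Python, with illustrative input values) evaluates Eqs. (1) and (2) and shows how the gradient vanishes for large inputs:

```python
import math

def sigmoid(x):
    """Logistic function, Eq. (1): maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative, Eq. (2): sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and vanishes for large |x| (saturation).
print(round(sigmoid_grad(0.0), 4))  # 0.25
print(sigmoid_grad(10.0) < 1e-4)    # True: saturated, weights barely update
```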
Tanh
The Tanh function, the hyperbolic tangent, graphically looks very similar to Sigmoid. In fact, Tanh is simply a scaled Sigmoid whose outputs range from -1 to 1, defined as follows:
$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=2\sigma(2x)-1\tag{3}$$
Like Sigmoid, Tanh is also affected by the vanishing gradient problem. But unlike Sigmoid, its output is zero-centered: negative inputs are mapped to strongly negative values and zero inputs are mapped to values near zero. Therefore, the nonlinearity of Tanh is often preferred to that of Sigmoid, and it has been widely used in deep learning and machine learning, especially in binary classification scenarios.
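The "scaled Sigmoid" relation can be stated precisely as $\tanh(x)=2\sigma(2x)-1$, which the following short Python sketch (illustrative sample points only) verifies numerically:

```python
import math

def sigmoid(x):
    # Logistic function, Eq. (1)
    return 1.0 / (1.0 + math.exp(-x))

# Tanh is a scaled and shifted Sigmoid: tanh(x) = 2*sigmoid(2x) - 1,
# stretching the output range from (0, 1) to the zero-centered (-1, 1).
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
```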
ReLU
ReLU is a very simple and efficient activation function that has been widely used in almost all deep learning domains, especially in CNNs. It is defined as
$$\mathrm{ReLU}(x)=\max(0,x)\tag{4}$$
Owing to its simpler mathematical operations, ReLU is far more computationally efficient than Tanh and Sigmoid. Besides, ReLU solves part of the saturation problem, though only in the positive region. For negative inputs, the outputs are exact zeros (called a sparse representation), which accelerates learning and simplifies the model in representation learning; however, the corresponding weights and biases are not updated owing to the zero gradient during the backpropagation process, causing the dying ReLU problem.
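A minimal sketch of Eq. (4) and its (sub)gradient (Python, illustrative inputs) shows both the sparse representation and the zero gradient underlying the dying ReLU problem:

```python
def relu(x):
    """Eq. (4): max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    # Subgradient of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

# Negative pre-activations give exact zeros (sparse representation),
# but their gradient is also zero, so those weights stop updating.
inputs = [-2.0, -0.5, 0.0, 1.5]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 1.5]
print(grads)    # [0.0, 0.0, 0.0, 1.0]
```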
2.2 Adaptive activation functions
Adaptive activation functions refer primarily to functions whose parameters or shapes are trained and learned along with the other parameters of the neural network (shown in Fig. 2), thereby varying adaptively with the training data. In other words, the main idea of this kind of function is to search for a good function shape using the knowledge given by the training data. For example, PReLU (he2015delving) replaces the fixed slope of LReLU (maas2013rectifier) with a trainable parameter in the negative region. Swish (ramachandran2017searching) is a recently proposed activation function that is unbounded above, bounded below, smooth and non-monotonic, and it can be loosely viewed as a bridging function between the linear function and ReLU. Other similar activation functions, such as FReLU (qiu2018frelu) and PELU (trottier2017parametric), have achieved performance improvements in some specific tasks.
Although the existing adaptive activation functions have been shown to improve network performance significantly, thanks to properties such as non-saturation, flexibility and adaptivity, exploring optimal and appropriate activation functions is still an open field of research, and there is still room for improvement in various scenarios, especially for complex datasets and different models.
3 Methodology
The training of neural networks is essentially a nonconvex optimization problem, in which the optimal weight parameters are searched for by the backpropagation algorithm, while the functional subspace to be explored is determined by the activation function. Adaptive activation functions adapt themselves to the network inputs: they can learn hyperparameters that adapt the affine transformation to a given input, thereby increasing the flexibility and representation ability of network models.
In this work, we attempt to construct a new parameter learning method for each layer only by introducing a few parameters to the fixed activation functions, and the general form of the activation function in each layer can be defined as follows:
$$f(z)=\beta\,g(\alpha z+\gamma)+\delta\tag{5}$$
where $g(\cdot)$ represents a traditional (fixed) activation function. $\alpha$, $\beta$, $\gamma$ and $\delta$ are four learnable parameters in each layer, and they can adapt to different tasks according to the complexity of the input data so as to efficiently avoid falling into local minima. $z$ denotes the weighted sum of the inputs, including the bias term, defined as
$$z=\mathbf{w}^{\top}\mathbf{x}+b\tag{6}$$
where $\mathbf{w}$ and $b$ indicate the weights and bias, respectively, and $\mathbf{x}$ is an input vector.
In practice, the proposed adaptive activation function is very simple: it is composed of two nested linear equations, namely the internal linear equation
$$u=\alpha z+\gamma\tag{7}$$
and external linear equation
$$f=\beta v+\delta,\qquad v=g(u)\tag{8}$$
Therefore, Equation (5) can be rewritten as
$$f(z)=\beta\,g(u)+\delta=\beta\,g(\alpha z+\gamma)+\delta\tag{9}$$
In the following sections, the effectiveness and advantages of the proposed methodology are verified by taking some common fixed activation functions as baselines, namely Sigmoid, Tanh and ReLU; the corresponding adaptive activation functions are named ASigmoid, ATanh and AReLU, respectively. According to Equation (5), these functions are respectively defined as
$$\mathrm{ASigmoid}(z)=\beta\,\sigma(\alpha z+\gamma)+\delta\tag{10}$$
$$\mathrm{ATanh}(z)=\beta\tanh(\alpha z+\gamma)+\delta\tag{11}$$
$$\mathrm{AReLU}(z)=\begin{cases}\beta z+\delta, & z>0\\ \alpha z+\gamma, & z\le 0\end{cases}\tag{12}$$
In ASigmoid and ATanh, $\alpha$ and $\gamma$ are used to scale and shift the inputs of Sigmoid and Tanh, while $\beta$ and $\delta$ simultaneously scale and shift the outputs.
Significantly, when $\alpha=\gamma=0$ and $\beta=1$, $\delta=0$, the negative part of AReLU is replaced with a zero slope, while the slope of the positive part is fixed. In this case, AReLU actually degenerates to the standard ReLU, given as
$$\mathrm{AReLU}(z)=\max(0,z)=\mathrm{ReLU}(z)\tag{13}$$
Furthermore, when $\beta=1$ and $\gamma=\delta=0$, the slope of the negative part (i.e., the parameter $\alpha$) is adjustable, which means that $\alpha$ can be learned from data rather than predefined. Under these conditions, AReLU evolves into PReLU, given as
$$\mathrm{AReLU}(z)=\begin{cases}z, & z>0\\ \alpha z, & z\le 0\end{cases}=\mathrm{PReLU}(z)\tag{14}$$
Therefore, AReLU is a generalized form of the ReLU-based versions, while ReLU and PReLU are special cases of the proposed AReLU.
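As a minimal illustrative sketch (Python; it assumes the nested form $f(z)=\beta\,g(\alpha z+\gamma)+\delta$ and a piecewise AReLU with slope/intercept $\beta$, $\delta$ on the positive side and $\alpha$, $\gamma$ on the negative side — one plausible reading of the construction, not a definitive implementation), the two degenerate cases above can be checked directly:

```python
def adaptive(g, z, alpha, beta, gamma, delta):
    # General nested form: the fixed activation g is wrapped by an internal
    # linear transform (alpha*z + gamma) and an external one (beta*(.) + delta);
    # all four parameters are learned per layer.
    return beta * g(alpha * z + gamma) + delta

def arelu(z, alpha, beta, gamma, delta):
    # Piecewise AReLU: external linear branch for z > 0,
    # internal linear branch for z <= 0.
    return beta * z + delta if z > 0 else alpha * z + gamma

# Degenerate cases:
relu = lambda z: arelu(z, alpha=0.0, beta=1.0, gamma=0.0, delta=0.0)    # ReLU
prelu = lambda z, a: arelu(z, alpha=a, beta=1.0, gamma=0.0, delta=0.0)  # PReLU

assert relu(3.0) == 3.0 and relu(-2.0) == 0.0                  # standard ReLU
assert prelu(3.0, 0.25) == 3.0 and prelu(-2.0, 0.25) == -0.5   # PReLU, slope 0.25
assert adaptive(abs, 1.0, 2.0, 1.0, 0.0, 0.5) == 2.5           # nested-form sanity check
```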
From the above, it can be clearly seen that our method only adds four parameters per layer. For an entire network model with $L$ layers, $4L$ parameters are added. This amount of parameters and computation is negligible compared with the entire network model.
4 Analysis
A convex loss function $L(\cdot)$ over the linear weighted combinations passed through each activation function is defined to find the optimal weights by adopting suitable optimization strategies based on the backpropagation algorithm. Thus, the training process of a network model is essentially an iterative optimization of the weight parameters that minimizes the loss function in the functional subspace.
4.1 Theoretical Analysis
To facilitate the analysis, we take the activation function of one layer as an example, without loss of generality. Suppose that a neural network with a traditional activation function is given as
$$y=g(z)=g(\mathbf{w}^{\top}\mathbf{x}+b)\tag{15}$$
For the update process of the weights, the partial derivative chain is defined as follows:
$$\frac{\partial L}{\partial w}=\frac{\partial L}{\partial y}\cdot\frac{\partial y}{\partial w}\tag{16}$$
$$\frac{\partial y}{\partial w}=g'(z)\,x\tag{17}$$
Meanwhile, the weight is updated as follows.
$$w^{t+1}=w^{t}-\eta\,\frac{\partial L}{\partial w}\tag{18}$$
where $\eta$ is the learning rate. Substituting Equations (16) and (17) into (18), we obtain the weight update equation for a common activation function as follows:
$$w^{t+1}=w^{t}-\eta\,\frac{\partial L}{\partial y}\,g'(z)\,x\tag{19}$$
Considering that the proposed adaptive activation functions consist of two linear equations, for simplicity we consider only the internal linear function and omit its intercept term in a certain layer, given as
$$\hat{y}=g(\alpha z)\tag{20}$$
where $\hat{y}$ is the output of the adaptive activation function. The hyperparameter $\alpha$ represents a generalized form for scaling the inputs in any layer, and it can be viewed as fine-tuning the learning rate so as to speed up the update of the weights; the corresponding derivation is given as follows. For the output $\hat{y}$, the partial derivative is given as
$$\frac{\partial \hat{y}}{\partial w}=\alpha\,g'(\alpha z)\,x\tag{21}$$
With Equations (16), (18) and (21), the update process of the weight is given as
$$w^{t+1}=w^{t}-\eta\,\alpha\,\frac{\partial L}{\partial \hat{y}}\,g'(\alpha z)\,x\tag{22}$$
By comparison between Equations (19) and (22), the learning rate of the adaptive activation function can be written as:
$$\hat{\eta}=\alpha\,\eta\tag{23}$$
From Equation (23), the learning rate can be adaptively adjusted through the hyperparameter $\alpha$. Simultaneously, the optimization of the hyperparameter $\alpha$ in neural networks is similar to that of the weight $w$, and the update process of $\alpha$ is achieved by using the chain rule:
$$\frac{\partial L}{\partial \alpha}=\frac{\partial L}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial \alpha}\tag{24}$$
$$\frac{\partial \hat{y}}{\partial \alpha}=g'(\alpha z)\,z\tag{25}$$
With Equations (24) and (25),
$$\alpha^{t+1}=\alpha^{t}-\eta\,\frac{\partial L}{\partial \hat{y}}\,g'(\alpha z)\,z\tag{26}$$
With Equations (6) and (26),
$$\alpha^{t+1}=\alpha^{t}-\eta\,\frac{\partial L}{\partial \hat{y}}\,g'(\alpha z)\left(\mathbf{w}^{\top}\mathbf{x}+b\right)\tag{27}$$
Therefore, the adaptive activation function achieves rapid convergence through an adaptive learning rate: the weight $w$ and the parameter $\alpha$ are adjusted mutually to speed up learning in neural networks, leading to higher classification precision.
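As a numerical illustration of the gradient scaling in Eq. (21) and the effective learning rate of Eq. (23), the following Python sketch (scalar case with $b=0$, $g=\tanh$; all values illustrative) checks the analytic gradient against a finite difference:

```python
import math

def g(x):
    # A smooth fixed activation (here Tanh) standing in for g(.)
    return math.tanh(x)

def g_grad(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

x, w, alpha = 0.7, 0.3, 2.5
z = w * x  # pre-activation, Eq. (6) with b = 0

# Analytic gradient from Eq. (21): d y_hat / d w = alpha * g'(alpha*z) * x
analytic = alpha * g_grad(alpha * z) * x

# Central finite-difference check of the same derivative
eps = 1e-6
numeric = (g(alpha * (w + eps) * x) - g(alpha * (w - eps) * x)) / (2 * eps)

assert abs(analytic - numeric) < 1e-8
# Relative to the fixed activation (alpha = 1), the gradient carries an extra
# factor alpha, which acts like an adaptive learning rate eta_hat = alpha * eta.
```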
Besides, the internal linear equation also has its own intercept, which contributes to tuning the parameters along another, orthogonal direction during the training process, thereby avoiding being trapped in local extrema.
Similarly, the external linear equation has the same effect of accelerating convergence and improving performance. More importantly, the internal equation is nested within the external one, which enables the optimization to move toward a global optimum more efficiently from all directions.
4.2 Experimental Analysis
Owing to the fact that each layer has its own independent activation function, the optimal hyperparameters of each layer can be obtained by learning from the respective complexity of the input data, and their values vary with the input data characteristics. Therefore, the obtained weights will be optimal, and the corresponding activation functions differ from layer to layer. Fig. 3 shows the visualization of different layers during training on CIFAR-100 based on VGG; the different shapes of ASigmoid, ATanh and AReLU at different layers indicate that these functions can learn the optimal hyperparameters from the inputs of the respective layers, which enhances the fitting capability and the accuracy of the networks.
Moreover, compared with traditional adaptive activation functions, the two nested linear equations with intercepts can accelerate the weight adjustment. For further verification, the curves of the weight increments between two successive epochs in various layers are visualized along the training process (shown in Fig. 4). The results clearly show that the amplitudes of the increments obtained with the proposed AReLU are much larger than those of the traditional ReLU in the early training stages, after which the increments of the two methods converge; this means the proposed method provides faster weight updates than the traditional one. Consequently, the proposed methods can greatly improve the convergence speed and reduce the computational burden. Meanwhile, the large amplitudes of the increments also help to avoid falling into a local optimum when training artificial neural networks with gradient-based learning methods and backpropagation.
5 Experiments
In this section, a series of experiments is implemented to verify and evaluate the effectiveness of the proposed methodology based on the three baseline activation functions Sigmoid, Tanh and ReLU. Considering that ReLU is the most common activation function used in neural networks and has many derivatives, some typical derivatives such as LReLU and PReLU are selected for comparison to highlight the effectiveness of the parameterization method. Swish, as an outstanding activation function, is used to demonstrate the state-of-the-art performance of the proposed adaptive activation functions. Firstly, comparison experiments between the proposed functions and their corresponding baseline functions are conducted by using Stochastic Gradient Descent (SGD)
(cramer1946mathematical) on the CIFAR-10 and CIFAR-100 datasets with different network models, such as AlexNet (krizhevsky2012imagenet), VGG (simonyan2014very), GoogleNet (szegedy2015going), ResNet (he2016deep) and DenseNet (huang2017densely). Then, further experiments verify the validity and suitability of various optimization strategies, such as SGD, Momentum (qian1999momentum), AdaGrad (duchi2011adaptive), AdaDelta (zeiler2012adadelta) and ADAM (kingma2014adam). Finally, a series of comparison experiments is conducted to further verify the effectiveness, suitability and generalization ability on other, more complicated datasets such as mini-ImageNet (Oriol2016), PASCAL VOC (VOC2012) and COCO (COCO2014).
5.1 Experimental setup
We test the proposed adaptive activation functions on CIFAR-10 and CIFAR-100 based on AlexNet, VGGNet, GoogleNet, ResNet and DenseNet. The detailed experimental setup is illustrated in Fig. 5.
Note that the Dense blocks and the transition layers are configured as illustrated in Fig. 5, and the growth rate is k=24 for all.
5.2 CIFAR-10
In these experiments, the proposed adaptive activation functions (10)–(12) are applied to the CIFAR-10 dataset based on the models shown in Fig. 5. All trainings are run for no less than 80 epochs with a batch size of 64 and without data augmentation, using SGD with a fixed learning rate schedule of 0.001, 0.0001 and 0.00001 over the course of training.
Fig. 6 shows the convergence curves (top row) and the areas enclosed by the convergence curves (bottom row) during the training process. Obviously, the smaller the area, the faster the convergence speed, and the results clearly show that the proposed activation functions surpass the corresponding baseline functions and LReLU across different network models in terms of convergence speed. Among these activation functions, AReLU obtains the fastest convergence speed; on DenseNet, ResNet and VGG16 in particular, it converges much faster than the other activation functions. Table 1 and Table 2 show the quantified results in precision (a superscript mark or an empty cell indicates that the activation function does not converge in this model), and the proposed methodology has an overall advantage. Table 1 illustrates the comparison between the proposed methods and their respective corresponding baselines. From the results, the proposed methodology can be effectively applied to the classic fixed activation functions, and surpasses the corresponding baseline functions on most network models. For instance, ASigmoid surpasses its baseline Sigmoid on all models, while ATanh obtains better precision than Tanh on VGGNet, ResNet and DenseNet, and the precision of ATanh on AlexNet and GoogleNet is only slightly lower than that of Tanh. Table 2 shows the comparison between AReLU and other adaptive activation functions. Except against PReLU on AlexNet and VGGNet, AReLU overall obtains higher precision than the other adaptive functions on the various models. Note that some traditional adaptive functions such as PELU and FReLU are not suitable for some network models owing to a lack of convergence during training, while the proposed methodology applies to various deep learning models and has better generalization performance than traditional methodologies.
Methods  AlexNet  VGGNet  GoogleNet  ResNet  DenseNet 

Sigmoid  0.8141±1.52e-3  0.8680±1.17e-3  0.8027±1.13e-3  0.8079±1.25e-3  0.7611±3.86e-3
ASigmoid(ours)  0.8469±1.68e-3  0.8819±2.28e-4  0.8517±4.07e-4  0.8087±1.36e-3  0.9076±8.16e-4
Tanh  0.8043±9.68e-4  0.8841±7.72e-4  0.8433±6.06e-4  0.5784±9.75e-4  0.8966±1.99e-5
ATanh(ours)  0.7900±5.31e-4  0.8904±3.05e-4  0.8342±7.99e-4  0.6014±8.11e-4  0.9163±1.87e-4
ReLU  0.8351±3.95e-4  0.8800±1.01e-4  0.8733±4.36e-5  0.6641±2.59e-3  0.9242±1.10e-4
LReLU(maas2013rectifier)  0.8255±1.66e-3  0.9253±4.68e-5  0.8742±9.31e-4  0.6698±1.27e-3  0.9289±2.37e-5
AReLU(ours)  0.8331±2.48e-4  0.8328±1.26e-3  0.8773±1.83e-3  0.9230±5.51e-4  0.9538±5.58e-4
Methods  AlexNet  VGGNet  GoogleNet  ResNet  DenseNet 

PReLU(he2015delving)  0.8558±4.61e-4  0.9344±5.89e-5  0.8551±1.79e-4  0.6522±3.20e-4  0.9231±1.40e-4
Swish(ramachandran2017searching)  0.7557±2.20e-3  0.917±1e-4  0.8710±3.25e-3  0.6446±3.39e-4  0.9276±1.87e-4
PELU(trottier2017parametric)  ^{1}  0.8128±4.45e-4  0.7891±3.91e-4  
FReLU(qiu2018frelu)  0.8558±2.44e-4  0.8694±5.21e-3  0.8726±3.71e-4  0.8214±4.11e-4  
AReLU(ours)  0.8331±2.48e-4  0.8328±1.26e-3  0.8773±1.83e-3  0.9230±5.51e-4  0.9538±5.58e-4
5.3 CIFAR-100
To further verify the validity and applicability, CIFAR-100 is selected for training on several typical network models: AlexNet, DenseNet (shown in Fig. 5) and VGGv, where VGGv is an extended version of VGGNet. The dataset is trained for 150 epochs with a batch size of 250 and a fixed learning rate schedule of 0.001, 0.0001 and 0.00001 over the course of training. Table 3 shows the classification comparison between the proposed methods and their corresponding baselines on AlexNet, VGGv and DenseNet, respectively. Except for AReLU on AlexNet, the proposed methods surpass their respective baseline functions. Table 4 illustrates the comparison between AReLU and other adaptive functions (a superscript mark or an empty cell indicates that the activation function does not converge). From the results, AReLU achieves the best precision on all three network models. Similarly, PELU and FReLU cannot converge to the desired loss values.
Methods  AlexNet  VGGv  DenseNet 

Sigmoid  0.4662±2.49e-3  0.5545±5.89e-4  0.2561±1.23e-3
ASigmoid(ours)  0.5312±8.66e-4  0.6578±7.56e-4  0.6103±3.15e-3
Tanh  0.5058±6.06e-4  0.6007±6.90e-4  0.5960±2.53e-3
ATanh(ours)  0.5236±1.98e-4  0.6166±2.41e-4  0.6734±3.98e-4
ReLU  0.5701±1.18e-4  0.6972±4.99e-4  0.5616±1.09e-3
LReLU(maas2013rectifier)  0.5500±1.09e-3  0.6991±5.53e-4  0.6914±5.12e-4
AReLU(ours)  0.5647±1.71e-4  0.7005±1.04e-3  0.7081±1.64e-3
Methods  AlexNet  VGGv  DenseNet 

PReLU(he2015delving)  0.5325±5.01e-4  0.6838±3.84e-5  0.5781±8.87e-3
Swish(ramachandran2017searching)  0.5519±1.16e-3  0.6729±8.32e-5  0.7079±3.01e-3
PELU(trottier2017parametric)  ^{1}  
FReLU(qiu2018frelu)  0.1534  0.1325±2.16e-4  
AReLU(ours)  0.5647±1.71e-4  0.7005±1.04e-3  0.7081±1.64e-3
5.4 Validity and practicability in various optimization strategies
Gradient descent algorithms are often used as black-box optimizers in neural networks, and different optimization strategies have a great influence on the performance of activation functions in practice. Therefore, to further verify the validity and practicability under various optimization strategies, the best-performing AReLU is selected as the activation function for GoogleNet and ResNet, and a series of comparison experiments is conducted on CIFAR-10 with various optimizers: SGD, Momentum, AdaGrad, AdaDelta and ADAM.
Fig. 7 shows the convergence curves obtained with various activation functions and optimization strategies on GoogleNet and ResNet, respectively. The results show that AReLU converges faster than all the other activation functions on GoogleNet. On ResNet, AReLU also has an obvious overall convergence advantage, especially with AdaGrad and AdaDelta. These results indicate that the proposed AReLU accelerates convergence, thereby reducing the training cost.
Table 5 further reveals that the proposed AReLU achieves better overall performance than the other activation functions across different optimization strategies and network models. Except against ReLU with the Momentum optimizer on GoogleNet and Swish with the ADAM optimizer on ResNet, where AReLU is only slightly worse, the proposed AReLU surpasses the other activation functions under all optimization strategies on both network models, and the obtained precision is far better than that of the other methods. Significantly, AReLU with SGD generally achieves the best precision among these activation functions and optimization strategies on both models. Fig. 8 illustrates the convergence of the proposed AReLU with various optimizers, and the results show that SGD converges faster than all other optimizers on both models, especially on ResNet.
The above results show that the proposed adaptive activation functions have faster convergence speed and higher precision than traditional activation functions, suggesting that the proposed methodology can avoid local minima and accelerate convergence, thereby increasing the precision, reducing the training cost and improving the generalization performance.
Models  Methods  SGD  Momentum  AdaGrad  AdaDelta  ADAM 
ReLU  0.8733±4.36e-5  0.8951±7.30e-5  0.6065±1.53e-4  0.5686±3.31e-4  0.8797±1.08e-3  
LReLU(maas2013rectifier)  0.8742±9.31e-4  0.8498±4.20e-4  0.6623±3.22e-3  0.6521±1.92e-3  0.8654±8.78e-5  
GoogleNet  PReLU(he2015delving)  0.8551±1.79e-4  0.8138±2.14e-4  0.6278±8.44e-6  0.5367±1.45e-3  0.8177±2.34e-6
Swish(ramachandran2017searching)  0.8710±3.25e-3  0.8772±1.39e-3  0.6747±1.58e-3  0.6139±2.66e-4  0.8743±1.92e-3  
AReLU(ours)  0.8773±1.83e-3  0.8651±8.44e-5  0.7408±6.38e-3  0.7221±2.57e-3  0.8910±3.37e-4  
ReLU  0.6641±2.59e-3  0.6274±1.99e-3  0.4253±7.38e-4  0.3553±1.57e-3  0.8519±6.20e-4  
LReLU(maas2013rectifier)  0.6698±1.27e-3  0.6635±1.40e-3  0.5162±1.61e-4  0.4046±3.87e-3  0.8065±1.07e-3  
ResNet  PReLU(he2015delving)  0.6522±3.20e-4  0.6325±2.11e-4  0.3493±2.07e-3  0.2846±4.28e-3  0.7889±8.08e-4
Swish(ramachandran2017searching)  0.6446±3.39e-4  0.6099±7.02e-5  0.4032±3.41e-3  0.3112±3.31e-3  0.8909±6.84e-5  
AReLU(ours)  0.9230±5.51e-4  0.7862±1.38e-3  0.7341±2.92e-3  0.7155±5.56e-4  0.8805±2.31e-4
5.5 More complicated datasets
Two more complicated datasets, mini-ImageNet (Oriol2016) and PASCAL VOC (VOC2012), are used to further test the validity of the proposed methodology based on ResNet50. Table 6 shows the comparison of classification precision between AReLU and other adaptive functions (a superscript mark or an empty cell indicates that the activation function does not converge). The results indicate that AReLU and PReLU obtain the best classification precision on PASCAL VOC and mini-ImageNet, respectively; overall, their performance is nearly equal on the two datasets. Note that PELU and FReLU cannot converge, which shows that the two datasets are very complicated and challenging.
All the above experiments cover various models, datasets and methods on classification tasks, where the proposed adaptive activation functions obtain the best overall classification performance. To test the validity and practicability on other deep learning tasks, PASCAL VOC is used for object detection by adopting Faster R-CNN (RenHGS15) and YOLOv2 (redmon2016yolo9000) with the proposed AReLU. Besides, the more complicated detection dataset COCO (COCO2014) is selected to further verify the effectiveness by employing FCOS (tian2021fcos). Table 7 shows the comparison of detection precision among various adaptive functions (a superscript mark or an empty cell indicates that the activation function does not converge in this model). From the results, the proposed AReLU achieves nearly the best detection performance, including AP50, AP75 and mAP, among the various adaptive functions across methods and datasets, which demonstrates the validity and practicability of our method.
From the results of this series of comparison experiments, the proposed methods achieve better performance across various scenarios: datasets, network models, optimization methods and deep learning tasks. The most significant reason is that our methodology has a nested internal and external bilinear structure; the internal function is embedded within the external one, which enables the optimization to move toward a global optimum more efficiently from all directions, thereby accelerating convergence and improving performance. More importantly, the proposed methodology only adds a small number of parameters (i.e., four per layer), which is negligible compared with the millions of parameters in an entire network model; thus the amount of network computation and the risk of overfitting increase only marginally.
Methods  mini-ImageNet  PASCAL VOC 

ReLU  0.7562±4.60e-3  0.5933±5.70e-3
PReLU(he2015delving)  0.7813±1.20e-3  0.6066±2.00e-4
Swish(ramachandran2017searching)  0.7134±2.50e-3  0.5154±2.30e-3
PELU(trottier2017parametric)  ^{1}  
FReLU(qiu2018frelu)    
AReLU(ours)  0.7751±1.40e-3  0.6092±3.02e-3
Methods  FasterRCNN (PASCAL VOC)  YOLOv2 (PASCAL VOC)  FCOS (COCO)  
AP50  AP75  mAP  AP50  AP75  mAP  AP50  AP75  mAP  
ReLU  0.8501  0.6266  0.7661  0.5924  0.1459  0.3970  0.4974  0.3426  0.3232 
PReLU(he2015delving)  ^{1}  0.1777  0.0176  0.0931  0.4943  0.3399  0.3214  
Swish(ramachandran2017searching)  0.7111  0.4981  0.6658  0.6250  0.2432  0.4621  0.4923  0.3418  0.3220 
PELU(trottier2017parametric)  
FReLU(qiu2018frelu)  0.8486  0.6273  0.7669  0.2604  0.0627  0.1670  0.5030  0.3473  0.3274 
AReLU(ours)  0.8530  0.6271  0.7675  0.6384  0.2459  0.4687  0.5009  0.3490  0.3279 
6 Conclusions
In this work, a novel methodology is proposed to adaptively customize activation functions for various layers; it contributes to avoiding local minima and accelerating convergence, thereby increasing the precision, reducing the training cost and improving the generalization performance. In this methodology, a small number of parameters is introduced to traditional activation functions such as Sigmoid, Tanh and ReLU, and theoretical and experimental analyses of the accelerated convergence and improved performance are presented. To verify the effectiveness of the proposed methodology, a series of experiments is implemented on CIFAR-10, CIFAR-100, mini-ImageNet, PASCAL VOC and COCO by employing various network models such as AlexNet, VGGNet, GoogleNet, ResNet and DenseNet, various optimization strategies such as SGD, Momentum, AdaGrad, AdaDelta and ADAM, and various tasks such as classification and detection. The results show that the proposed methodology is very simple yet delivers significant performance in convergence speed, precision and generalization, surpassing popular methods such as ReLU and Swish in almost all experiments in terms of overall performance.
7 Acknowledgments
The authors would like to express their appreciation to the referees for their helpful comments and suggestions. This work was supported in part by Zhejiang Provincial Natural Science Foundation of China (Grant No. LGF20H180002 and GF22F037921), and in part by National Natural Science Foundation of China (Grant No. 61802347, 61801428 and 61972354), and the National Key Research and Development Program of China (Grant No. 2018YFB1305202).