1 Introduction
Artificial neural networks (ANNs) have occupied center stage in deep learning in the recent past. ANNs are made up of several hidden layers, and each hidden layer consists of several neurons. At each neuron, an affine linear map is composed with a nonlinear function known as an activation function. During the training of an ANN, the linear map is optimized, whereas the activation function is usually fixed at the beginning along with the architecture of the ANN. There has been increasing interest in developing a methodical understanding of activation functions, in particular with regard to the construction of novel activation functions and the identification of mathematical properties that lead to better learning [1]. An activation function is considered good if it speeds up learning and leads to better convergence, which in turn leads to more accurate results. At the early stage of deep learning research, researchers used shallow networks (fewer hidden layers), with tanh or sigmoid as activation functions. As research progressed and deeper networks (multiple hidden layers) came into fashion for challenging tasks, the Rectified Linear Unit (ReLU) [2], [3], [4] emerged as the most popular activation function. Despite its simplicity, deep neural networks with ReLU have learned many complex and highly nonlinear functions with high accuracy. To overcome the shortcomings of ReLU (non-zero mean, negative missing, and unbounded output, to name a few; see [5]) and to increase accuracy considerably in a variety of tasks compared to networks with ReLU, many new activation functions have been proposed over the years. Many of these are variants of ReLU, for example, Leaky ReLU [6], Exponential Linear Unit (ELU) [7], Parametric Rectified Linear Unit (PReLU) [8], Randomized Leaky Rectified Linear Unit (RReLU) [9], and Inverse Square Root Linear Unit (ISRLU) [10]. In the recent past, some activation functions constructed from tanh or sigmoid have achieved state-of-the-art results on a variety of challenging datasets. Most notably, among such activation functions, Swish [11] has emerged as a close favorite to ReLU. Work on some of these novel activation functions has shown that introducing hyperparameters in the argument of the function can yield, for special values of these hyperparameters, activation functions that outperform those obtained for other values; for example, see [11], [5].
In this article, we propose a family of activation functions with four hyperparameters of the form
(1) $\tanh(\alpha x + \beta e^{\gamma x})\,\ln(\delta + e^{x})$
We show that activation functions for some specific values of hyperparameters outperform several well known and conventional activation functions, including ReLU and Swish. Moreover, using a hyperparameterized combination of known activation functions, we attempt to make the search for novel activation functions organized. As indicated above and validated below, such an organized search can often find better performing activation functions in the vicinity of known functions.
2 Related works
We briefly review a few of the most widely used activation functions. All of these functions, along with some members of the TanhSoft family, are plotted in Figure 1.

Sigmoid: The sigmoid activation function, also known as the logistic function, is used in the output layer for binary classification problems, where it produces probability-like outputs. Sigmoid is a smooth, bounded, nonlinear, and differentiable function with range $(0, 1)$, but it suffers from the vanishing gradient problem. Sigmoid is defined as
(2) $\sigma(x) = \frac{1}{1 + e^{-x}}$
Hyperbolic Tangent Function: The hyperbolic tangent function, tanh, is a smooth, nonlinear, and differentiable function with range $(-1, 1)$, defined as
(3) $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
It is used in recurrent neural networks, natural language processing [12], and speech processing tasks, but it also suffers from the vanishing gradient problem.
Rectified Linear Unit (ReLU): The rectified linear unit (ReLU) activation function was first introduced by Nair and Hinton in 2010 [2]. At present, it is one of the most widely used activation functions. ReLU is a piecewise linear function: it is the identity on the positive axis and zero on the negative axis. One of its best properties is that it learns very fast. However, since its gradient is zero on the negative axis, ReLU suffers from the vanishing gradient problem, and in some situations it has been observed that up to 50% of the neurons are dead because they output 0 on the negative axis. ReLU ([4], [3], [2]) is defined as
(4) $f(x) = \max(0, x)$
Leaky Rectified Linear Unit: The Leaky Rectified Linear Unit (Leaky ReLU) was proposed by Maas et al. in 2013 [6]. Leaky ReLU introduces a small non-zero slope on the negative axis to overcome the vanishing gradient and dead neuron problems of ReLU. Leaky ReLU is defined as
(5) $f(x) = \begin{cases} x & x \ge 0 \\ 0.01x & x < 0 \end{cases}$
Parametric ReLU: Parametric ReLU (PReLU) [8] is similar to Leaky ReLU: where Leaky ReLU has a predetermined negative slope, PReLU makes the negative slope a learnable parameter. PReLU is defined as
(6) $f(x) = \begin{cases} x & x \ge 0 \\ ax & x < 0 \end{cases}$
where $a$ is a learnable parameter; for $a \le 1$, $f(x)$ is equivalent to $\max(x, ax)$.

Swish: To construct Swish, researchers from the Google Brain team used a reinforcement-learning-based automatic search technique. It was proposed by Ramachandran et al. in 2017 [11]. Swish is a non-monotonic, smooth function which is bounded below and unbounded above. Swish is defined as
(7) $f(x) = x\,\sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$
where $\sigma$ is the sigmoid function and $\beta$ is either a constant or a trainable parameter.
E-Swish: E-Swish [13] is a slightly modified version of the Swish function, introduced by Alcaide in 2018, defined by
(8) $f(x) = \beta x\,\sigma(x)$
where $\beta$ is a tunable parameter. This function shares the properties of Swish and, as claimed in the paper, can provide better results than Swish for some values of $\beta$.

ELiSH: The Exponential Linear Sigmoid SquasHing (ELiSH) function was proposed by Basirat et al. [14] in 2018. It is an unbounded-above, bounded-below, non-monotonic, and smooth function defined by
(9) $f(x) = \begin{cases} x\,\sigma(x) & x \ge 0 \\ (e^{x} - 1)\,\sigma(x) & x < 0 \end{cases}$
Exponential Linear Units: The Exponential Linear Unit (ELU) was proposed by Clevert et al. in 2015 [7]. ELU is defined in such a way that it overcomes the vanishing gradient problem of ReLU. ELU is a fast learner and generalises better than ReLU and Leaky ReLU. ELU is defined as
(11) $f(x) = \begin{cases} x & x > 0 \\ \alpha(e^{x} - 1) & x \le 0 \end{cases}$
where $\alpha$ is a hyperparameter.
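For concreteness, the definitions above can be written in a few lines of TensorFlow. The following is only a reference sketch based on equations (2)–(11); the parameter defaults are illustrative rather than values recommended by the cited papers.

```python
import tensorflow as tf

def sigmoid(x):                       # eq. (2)
    return 1.0 / (1.0 + tf.exp(-x))

def leaky_relu(x, slope=0.01):        # eq. (5): fixed small negative slope
    return tf.where(x >= 0.0, x, slope * x)

def prelu(x, a):                      # eq. (6): `a` is learned during training
    return tf.where(x >= 0.0, x, a * x)

def swish(x, beta=1.0):               # eq. (7)
    return x * tf.sigmoid(beta * x)

def eswish(x, beta=1.25):             # eq. (8): beta value here is only a placeholder
    return beta * x * tf.sigmoid(x)

def elish(x):                         # eq. (9)
    return tf.where(x >= 0.0, x * tf.sigmoid(x), (tf.exp(x) - 1.0) * tf.sigmoid(x))

def elu(x, alpha=1.0):                # eq. (11)
    return tf.where(x > 0.0, x, alpha * (tf.exp(x) - 1.0))
```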
3 TanhSoft activation function family
The standard ANN training process involves tuning the weights in the linear part of the network; however, there is merit in the ability to custom-design activation functions to better fit the problem at hand. Here, rather than looking at individual activation functions, we propose a family of functions indexed by four hyperparameters. We refer to this family as the TanhSoft family, as it is created by combining hyperparametric versions of the $\tanh(x)$ and Softplus ($\ln(1 + e^{x})$) activation functions. Explicitly, we express it as
(13) $\tanh(\alpha x + \beta e^{\gamma x})\,\ln(\delta + e^{x})$
Any function in this family can be used as an activation function for suitable hyperparameter values $\alpha$, $\beta$, $\gamma$, and $\delta$, though for experimental and practical purposes we restrict the hyperparameters to small ranges. The interplay between the hyperparameters $\alpha$, $\beta$, and $\gamma$ plays a major role in the TanhSoft family and controls the slope of the curve on both the positive and negative axes. The hyperparameter $\delta$ acts as a switch: setting $\delta = 0$ turns the Softplus component $\ln(\delta + e^{x})$ into the linear function $x$, allowing us to cover a larger class of functions.
Note that $\alpha = \beta = 0$ recovers the zero function, and $\alpha = \gamma = \delta = 0$ gives $\tanh(\beta)\,x$, the family of linear functions. For large values of one parameter, with the others held fixed, the TanhSoft family converges pointwise to some known activation functions. For example, for fixed $\beta > 0$, $x\tanh(\beta e^{\gamma x}) \to \max(0, x)$ pointwise as $\gamma \to \infty$, so this subfamily converges to ReLU.
With similar hyperparameter settings, except that one of the parameters takes a very small negative value, the resulting function behaves similarly to the Parametric ReLU activation function.
We remark that the Mish [20] activation function is also built from $\tanh$ and Softplus, but as a composition, $x\tanh(\ln(1 + e^{x}))$, whereas we use a hyperparametric product. It is worth noting that the author of [20] reported unstable training behaviour for a specific function of this kind; however, we did not find any instability during the training process. Also, in [21] the authors study the function $x\tanh(e^{x})$, which arises as an example from the TanhSoft family. In fact, we show that, because of the introduction of hyperparameters, better activation functions of the form $x\tanh(\beta e^{\gamma x})$ can be obtained.
Being a product of two smooth functions, TanhSoft is a family of smooth activation functions. As expected, the monotonicity and boundedness of the functions in the family depend on the specific values of the hyperparameters. The derivative of the TanhSoft family is given by
(14) $\frac{d}{dx}\Big[\tanh(\alpha x + \beta e^{\gamma x})\,\ln(\delta + e^{x})\Big] = (\alpha + \beta\gamma e^{\gamma x})\,\operatorname{sech}^{2}(\alpha x + \beta e^{\gamma x})\,\ln(\delta + e^{x}) + \tanh(\alpha x + \beta e^{\gamma x})\,\frac{e^{x}}{\delta + e^{x}}$
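Equation (14) can be sanity-checked against automatic differentiation with a minimal TensorFlow sketch, assuming the forms in equations (13) and (14); the hyperparameter values below are arbitrary.

```python
import tensorflow as tf

def tanhsoft(x, alpha, beta, gamma, delta):
    # TanhSoft family, eq. (13): tanh(alpha*x + beta*e^(gamma*x)) * ln(delta + e^x)
    return tf.tanh(alpha * x + beta * tf.exp(gamma * x)) * tf.math.log(delta + tf.exp(x))

def tanhsoft_grad(x, alpha, beta, gamma, delta):
    # Closed-form derivative, eq. (14)
    inner = alpha * x + beta * tf.exp(gamma * x)
    sech2 = 1.0 / tf.math.cosh(inner) ** 2
    return ((alpha + beta * gamma * tf.exp(gamma * x)) * sech2 * tf.math.log(delta + tf.exp(x))
            + tf.tanh(inner) * tf.exp(x) / (delta + tf.exp(x)))

x = tf.linspace(-3.0, 3.0, 7)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tanhsoft(x, alpha=0.5, beta=0.5, gamma=1.0, delta=1.0)
auto_grad = tape.gradient(y, x)
# Maximum discrepancy between autodiff and eq. (14); should be close to zero
print(tf.reduce_max(tf.abs(auto_grad - tanhsoft_grad(x, 0.5, 0.5, 1.0, 1.0))))
```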
A detailed study of the mathematical properties of the TanhSoft family will be presented in a later work. In this work, we focus on providing several examples of activation functions from the family which perform well on many challenging datasets.
4 Search Findings
We have performed an organized search of activation functions within the TanhSoft family by varying the values of the hyperparameters and training and testing the resulting functions with DenseNet-121 [22] and SimpleNet [23] on the CIFAR-10 [24] dataset (a rough sketch of such a search loop is given after Table 1). Several functions were tested, and we select eight of them as examples to report their performance. The Top-1 and Top-3 accuracies of these eight functions are given in Table 1. All of these functions either outperform ReLU or come close to its accuracy. Most notably, two of them consistently outperform ReLU even with more complex models; detailed results with more complex models and datasets are given in Section 5.

Table 1: Top-1 and Top-3 test accuracy (%) on CIFAR-10 with DenseNet-121 (first two columns) and Top-1 accuracy with SimpleNet (third column). The first row is the ReLU baseline; the remaining rows are the eight selected TanhSoft candidates.

Top-1 (DenseNet-121)  Top-3 (DenseNet-121)  Top-1 (SimpleNet)
90.73  98.73  91.01  
91.14  98.86  92.23  
90.98  98.80  92.07  
90.84  98.64  91.38  
90.60  98.80  91.20  
90.69  98.42  91.35  
91.11  98.66  91.98  
90.78  98.67  89.42  
90.43  98.62  91.58 
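As a rough illustration of the organized search referred to above, one can sweep the hyperparameters over a small grid and record test accuracy for each candidate. The sketch below is only indicative: the grid, the small placeholder model, and the training schedule are assumptions, not the settings used to produce Table 1 (which were obtained with DenseNet-121 and SimpleNet).

```python
import itertools
import tensorflow as tf

def tanhsoft(x, alpha, beta, gamma, delta):
    # TanhSoft family, eq. (13); delta = 0 would reduce the Softplus factor to x
    return tf.tanh(alpha * x + beta * tf.exp(gamma * x)) * tf.math.log(delta + tf.exp(x))

def build_model(activation):
    # Deliberately small placeholder network, not DenseNet-121 or SimpleNet
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation=activation, input_shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.cifar10.load_data()
x_tr, x_te = x_tr / 255.0, x_te / 255.0

results = {}
# Illustrative grid; the ranges actually searched are not specified here
for a, b, g, d in itertools.product([0.5, 1.0], [0.5, 1.0], [0.5, 1.0], [0.5, 1.0]):
    act = lambda x, a=a, b=b, g=g, d=d: tanhsoft(x, a, b, g, d)
    model = build_model(act)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_tr, y_tr, epochs=5, batch_size=128, verbose=0)
    results[(a, b, g, d)] = model.evaluate(x_te, y_te, verbose=0)[1]

print(max(results, key=results.get))  # best-performing hyperparameter setting
```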
Real-world datasets are noisy and challenging, and it is always difficult to find the activation function that generalizes best on an arbitrary dataset. It is hard to say whether a searched function will generalize successfully and replace ReLU on challenging or noisy datasets. Though there may be merit in having a custom activation function for the problem at hand, it is nevertheless beneficial to identify activation functions that generalize to several real-world datasets, making them easier to adopt. Hence we concentrate on two members of the TanhSoft family and establish their generalizability and usefulness over other conventional activation functions. In particular, we consider the subfamilies $\tanh(\alpha x)\ln(1 + e^{x})$ and $x\tanh(\beta e^{\gamma x})$ and call them TanhSoft1 and TanhSoft2. In what follows, we discuss the properties of these subfamilies, experiments with more complex models, and comparisons with a few other widely used activation functions.
5 TanhSoft1 and TanhSoft2
The functions TanhSoft1 and TanhSoft2 are given as
(15) $\mathrm{TanhSoft1}(x) = \tanh(\alpha x)\,\ln(1 + e^{x})$
(16) $\mathrm{TanhSoft2}(x) = x\tanh(\beta e^{\gamma x})$
The corresponding derivatives (see equation (14)) are
(17) $\frac{d}{dx}\,\mathrm{TanhSoft1}(x) = \alpha\,\operatorname{sech}^{2}(\alpha x)\,\ln(1 + e^{x}) + \tanh(\alpha x)\,\frac{e^{x}}{1 + e^{x}}$
(18) $\frac{d}{dx}\,\mathrm{TanhSoft2}(x) = \tanh(\beta e^{\gamma x}) + \beta\gamma x\,e^{\gamma x}\operatorname{sech}^{2}(\beta e^{\gamma x})$
Figures 4 and 6 show graphs of the TanhSoft1 and TanhSoft2 activation functions for different values of $\alpha$, and of $\beta$ and $\gamma$, respectively. If $\alpha = 0$, then TanhSoft1 becomes the zero function; similarly, for $\beta = 0$, TanhSoft2 is the zero function. Like ReLU and Swish, TanhSoft1 and TanhSoft2 are unbounded above but bounded below. Like Swish, both are smooth, non-monotonic activation functions. Plots of the first derivatives of TanhSoft1 and TanhSoft2 are also given in Figures 4 and 6 for different values of $\alpha$, and of $\beta$ and $\gamma$, respectively. A comparison between TanhSoft1, TanhSoft2, and Swish, together with their first derivatives, is given in Figure 8.
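Plots of this kind can be regenerated with a few lines of matplotlib, assuming the forms in equations (15) and (16); the hyperparameter values below are arbitrary examples, not the tuned values used in the experiments.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 400)
softplus = np.log1p(np.exp(x))

plt.plot(x, np.tanh(0.5 * x) * softplus, label="TanhSoft1 (alpha = 0.5)")
plt.plot(x, x * np.tanh(1.0 * np.exp(0.5 * x)), label="TanhSoft2 (beta = 1, gamma = 0.5)")
plt.plot(x, x / (1.0 + np.exp(-x)), label="Swish")
plt.legend()
plt.show()
```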
TanhSoft1 and TanhSoft2 can each be implemented with a single line of code in the Keras library [25] or TensorFlow v2.3.0 [26], for specific values of $\alpha$, and of $\beta$ and $\gamma$, respectively.
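A minimal sketch of such an implementation, assuming the forms in equations (15) and (16), is given below; the hyperparameter defaults are placeholders rather than the tuned values used in the experiments.

```python
import tensorflow as tf

def tanhsoft1(x, alpha=1.0):
    # TanhSoft1, eq. (15): tanh(alpha * x) * softplus(x)
    return tf.tanh(alpha * x) * tf.math.softplus(x)

def tanhsoft2(x, beta=1.0, gamma=1.0):
    # TanhSoft2, eq. (16): x * tanh(beta * exp(gamma * x))
    return x * tf.tanh(beta * tf.exp(gamma * x))

# Either function can be passed directly as a layer activation, e.g.
# tf.keras.layers.Dense(128, activation=tanhsoft1)
```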
Experiments with TanhSoft1 and TanhSoft2
We tested TanhSoft1 and TanhSoft2 for different values of their hyperparameters against widely used activation functions on the CIFAR and MNIST datasets. In particular, TanhSoft1 and TanhSoft2 with specific choices of $\alpha$, and of $\beta$ and $\gamma$, respectively, produced the best results. We observe that in most cases they beat or match the baseline activation functions, and underperform only marginally on rare occasions. Table 2 gives a comparison with baseline activations such as ReLU, Leaky ReLU, ELU, Softplus, and Swish, counting the number of model/dataset combinations in which each TanhSoft variant beats, ties with, or loses to the baseline. We have tested our activation functions on several models, such as DenseNet-121 [22], DenseNet-169 [22], InceptionNet V3 [27], SimpleNet [23], MobileNet [28], and WideResNet 28-10 [29]. In the next section we provide details of our experimental framework and results.

Baselines  ReLU  Leaky ReLU  ELU  Swish  Softplus
TanhSoft1 > Baseline  10  7  11  10  10 
TanhSoft1 = Baseline  1  2  0  1  0 
TanhSoft1 < Baseline  0  2  0  0  1 
TanhSoft2 > Baseline  11  9  11  11  11 
TanhSoft2 = Baseline  0  1  0  0  0 
TanhSoft2 < Baseline  0  1  0  0  0 
MNIST
The MNIST [30] database contains images of handwritten digits from 0 to 9. It consists of 60k training and 10k testing greyscale images. We ran a custom 5-layer CNN architecture on the MNIST dataset with different activation functions; the results are reported in Table 3 (a plausible sketch of such an architecture is given after the table). We also report accuracy and loss for TanhSoft1 for different values of $\alpha$ in Figure 12.
Activation Function  5-fold Accuracy on MNIST data
TanhSoft1()  99.0 (± 0.05)
TanhSoft2()  99.1 (± 0.05)
ReLU  99.0 (± 0.06)
Swish  98.9 (± 0.08)
Leaky ReLU ($\alpha$ = 0.01)  99.0 (± 0.1)
ELU  98.9 (± 0.06)
Softplus  98.9 (± 0.06)
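The exact 5-layer architecture used for Table 3 is not detailed above, so the following Keras sketch is only one plausible layout with a swappable activation, included to illustrate how such an experiment can be set up.

```python
import tensorflow as tf

def tanhsoft2(x, beta=1.0, gamma=1.0):
    # TanhSoft2, eq. (16); defaults are placeholders, not the tuned values of Table 3
    return x * tf.tanh(beta * tf.exp(gamma * x))

def build_mnist_cnn(activation=tanhsoft2):
    # Hypothetical 5-layer network: three conv layers followed by two dense layers
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation=activation, input_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(64, 3, activation=activation),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation=activation),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation=activation),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0

model = build_mnist_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=5, validation_data=(x_te, y_te))
```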
CIFAR
The CIFAR [24] datasets consist of colored images: 60k images in total, divided into 50k training and 10k test images. There are two variants, CIFAR-10 and CIFAR-100: CIFAR-10 has 10 classes with 6,000 images per class, while CIFAR-100 has 100 classes with 600 images per class. We report the results of TanhSoft1 and TanhSoft2, for specific values of $\alpha$, and of $\beta$ and $\gamma$, respectively, along with ReLU, Leaky ReLU, ELU, Softplus, and Swish, on the CIFAR-10 dataset with DenseNet-121, DenseNet-169, InceptionNet V3, and SimpleNet; for the CIFAR-100 dataset, results are reported with DenseNet-121, DenseNet-169, InceptionNet V3, MobileNet [28], WideResNet [29], and SimpleNet. We trained with the Adam optimizer [31] for 100 epochs for DenseNet-121, DenseNet-169, InceptionNet V3, MobileNet, and WideResNet, and for 200 epochs for SimpleNet. We used weight decay in the SimpleNet model, chosen according to [32]; a sketch of this training configuration is given after Table 4. Tables 5 and 6 contain results for the CIFAR-10 data, while Tables 7 and 8 contain results for the CIFAR-100 data. Table 4 contains accuracy and loss on the CIFAR-10 dataset with the SimpleNet model and the TanhSoft2 activation function for different hyperparameter values:

Hyperparameter value  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
Accuracy  91.05  91.60  92.01  91.75  91.77  92.23  91.79  91.61  92.05  91.87 
Loss  0.5103  0.4631  0.4428  0.4659  0.4480  0.4204  0.4606  0.4688  0.4511  0.4578 
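For reference, the training configuration described above (Adam optimizer, with weight decay realized as an L2 penalty in SimpleNet) can be expressed as in the following hedged sketch; the learning rate and the decay coefficient are assumed values, since they are not specified above.

```python
import tensorflow as tf

# Hedged sketch of the training configuration: Adam optimizer, and weight decay
# realized as an L2 kernel penalty (decay coefficient and learning rate are assumptions).
wd = tf.keras.regularizers.l2(5e-4)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu", kernel_regularizer=wd,
                           input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(100, activation="softmax", kernel_regularizer=wd),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then run for 100 epochs (200 for SimpleNet), as described above.
```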
Table 5: Top-1 and Top-3 test accuracy (%) on CIFAR-10 with DenseNet-121 (first two columns) and DenseNet-169 (last two columns).
TanhSoft1()  90.98  98.80  91.05  98.75  
TanhSoft2()  91.14  98.86  91.10  98.78  
ReLU  90.73  98.73  90.64  98.79  
Leaky ReLU ($\alpha$ = 0.01)  90.77  98.80  90.61  98.78
ELU  90.49  98.61  90.40  98.65  
Swish  90.77  98.80  90.38  98.68  
Softplus  90.45  98.65  90.51  98.69  
Table 6: Test accuracy (%) on CIFAR-10 with SimpleNet (Top-1, first column) and InceptionNet V3 (Top-1 and Top-3, last two columns).
TanhSoft1()  92.07  91.93  98.84  
TanhSoft2()  92.23  91.99  98.85  
ReLU  91.01  91.29  98.80  
Leaky ReLU ($\alpha$ = 0.01)  91.05  91.84  98.93
ELU  91.19  91.01  98.79  
Swish  91.59  91.26  98.75  
Softplus  91.42  91.79  98.84 
Table 7: Test accuracy (%) on CIFAR-100.
TanhSoft1()  66.99  83.76  65.37  82.42  65.20  
TanhSoft2()  67.18  84.01  64.99  82.09  65.01  
ReLU  66.40  83.11  64.15  81.69  62.63  
Leaky ReLU ($\alpha$ = 0.01)  67.18  83.37  63.40  81.16  62.58
ELU  66.52  83.42  64.23  81.45  63.74  
Swish  66.99  83.76  64.95  82.06  64.90  
Softplus  65.93  83.50  64.95  82.05  62.39  
Table 8: Test accuracy (%) on CIFAR-100 (continued).
TanhSoft1()  57.56  76.57  69.19  84.63  69.40  
TanhSoft2()  57.56  76.57  69.28  85.91  69.41  
ReLU  56.87  76.33  69.09  85.41  66.54  
Leaky ReLU ($\alpha$ = 0.01)  57.78  77.32  69.19  85.24  69.23
ELU  57.19  76.03  68.32  85.30  64.48  
Swish  55.40  74.81  67.61  83.59  68.45  
SoftPlus  57.53  76.48  69.24  85.15  61.53  

6 Conclusion
In this study, we have explored a novel hyperparametric family of activation functions, TanhSoft, defined as $\tanh(\alpha x + \beta e^{\gamma x})\,\ln(\delta + e^{x})$, where $\alpha$, $\beta$, $\gamma$, and $\delta$
are tunable hyperparameters. We have shown that, across different complex models, TanhSoft outperforms ReLU, Leaky ReLU, Swish, ELU, and Softplus on the MNIST, CIFAR-10, and CIFAR-100 datasets, so that TanhSoft can be a good replacement for ReLU, Swish, and other widely used activation functions. Future work includes applying the proposed activation functions to more challenging datasets such as ImageNet and COCO and trying other models to achieve state-of-the-art results.
References
 [1] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation functions: Comparison of trends in practice and research for deep learning, 2018.

[2] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, pages 807–814. Omnipress, 2010.
[3] Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 – October 4, 2009, pages 2146–2153. IEEE Computer Society, 2009.
[4] Richard Hahnloser, Rahul Sarpeshkar, Misha Mahowald, Rodney Douglas, and H. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947–951, 2000.
 [5] Yuan Zhou, Dandan Li, Shuwei Huo, and SunYuan Kung. Softrootsign activation function, 2020.
[6] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
 [7] DjorkArné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus), 2015.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015.
 [9] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network, 2015.
 [10] Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti, and Brian Whitney. Improving deep learning by inverse square root linear units (isrlus), 2017.
 [11] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2017.
 [12] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks, 2016.
 [13] Eric Alcaide. Eswish: Adjusting activations to different network depths, 2018.
 [14] Mina Basirat and Peter M. Roth. The quest for the golden activation function, 2018.
 [15] Joseph Turian, James Bergstra, and Yoshua Bengio. Quadratic features and deep architectures for chunking. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 245–248, Boulder, Colorado, June 2009. Association for Computational Linguistics.

[16] Phong Le and Willem Zuidema. Compositional distributional semantics with long short term memory, 2015.
 [17] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep voice 3: Scaling texttospeech with convolutional sequence learning, 2017.
 [18] Hao Zheng, Zhanlei Yang, Wenju Liu, Jizhong Liang, and Yanpeng Li. Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4, 2015.
 [19] Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, and René Garcia. Incorporating secondorder functional knowledge for better option pricing. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’00, page 451–457, Cambridge, MA, USA, 2000. MIT Press.
 [20] Diganta Misra. Mish: A self regularized nonmonotonic activation function, 2019.
 [21] Xinyu Liu and Xiaoguang Di. Tanhexp: A smooth activation function with high convergence speed for lightweight neural networks, 2020.
 [22] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2016.
 [23] Seyyed Hossein Hasanpour, Mohammad Rouhani, Mohsen Fayyaz, and Mohammad Sabokrou. Lets keep it simple, using simple architectures to outperform deeper and more complex architectures, 2016.
 [24] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [25] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
 [26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [27] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015.

[28] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017.
 [29] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016.
 [30] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
 [31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, 2015.

[32] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics, 2010.