
How important are activation functions in regression and classification? A survey, performance comparison, and future directions

by   Ameya D. Jagtap, et al.

Inspired by biological neurons, the activation functions play an essential part in the learning process of any artificial neural network commonly used in many real-world problems. Various activation functions have been proposed in the literature for classification as well as regression tasks. In this work, we survey the activation functions that have been employed in the past as well as the current state-of-the-art. In particular, we present various developments in activation functions over the years and the advantages as well as disadvantages or limitations of these activation functions. We also discuss classical (fixed) activation functions, including rectifier units, and adaptive activation functions. In addition to presenting the taxonomy of activation functions based on characterization, a taxonomy of activation functions based on applications is also presented. To this end, the systematic comparison of various fixed and adaptive activation functions is performed for classification data sets such as the MNIST, CIFAR-10, and CIFAR-100. In recent years, a physics-informed machine learning framework has emerged for solving problems related to scientific computations. To this purpose, we also discuss various requirements for activation functions that have been used in the physics-informed machine learning framework. Furthermore, various comparisons are made among different fixed and adaptive activation functions using various machine learning libraries such as TensorFlow, Pytorch, and JAX.



1 Introduction

1.1 Aim

In recent years, artificial neural networks (ANNs) have achieved remarkable success in both academia and industry. ANNs have found success in a variety of industries, such as cybersecurity Shaukat et al. (2020); Sarker (2021), manufacturing technologies Taha and Rostam (2011); Casalino et al. (2016), healthcare Wiens and Shenoy (2018), financial services Coakley and Brown (2000), the food industry Guiné (2019), and energy Ahmad et al. (2014), to name a few, where they perform a range of tasks, including logistics and inventory management. With various scientific applications, the ANN is now the most sophisticated algorithm for understanding complicated sensory input, such as images Turkmen (2016), video Jiang et al. (2008), and audio Heckmann et al. (2002). The first computational model for neural networks was created by McCulloch and Pitts (1943), whereas the first ANN, also called the Perceptron, was invented by Rosenblatt (1958). Ivakhnenko and Lapa (1967) released the Group Method of Data Handling, which was the first functional neural network with multiple layers. However, a seminal book written by Minsky and Papert (1969) showed that these early neural networks were incapable of processing simple tasks, and computers lacked the capability to process usable neural networks. Early neural networks developed by Rosenblatt only comprised one or two trainable layers. Such simplistic networks cannot mathematically represent complicated real-world phenomena. Deep neural networks (DNNs) are ANNs that have a large number of trainable hidden layers. The DNN was theoretically more versatile. However, researchers spent several years trying, without success, to figure out how to train the DNN, as it could not scale with the first neural network's straightforward hill-climbing algorithm. In a groundbreaking study, Rumelhart et al. (1985) created the backpropagation training algorithm to train the DNN. The backpropagation algorithm effectively trains any ANN using a gradient descent method that takes advantage of the chain rule. Since then, ANN research has opened up all of the exciting and transformative developments in computer vision Sebe et al. (2005), speech recognition Deng and Li (2013), natural language processing Chowdhary (2020), and even in scientific computations called Physics-Informed Machine Learning Karniadakis et al. (2021).

Artificial neurons are a set of connected units or nodes in an ANN that loosely replicate the neurons in a biological brain. The Threshold Logic Unit, also known as the Linear Threshold Unit, was the first artificial neuron and was first put forth by McCulloch and Pitts (1943). The model was intended to serve as a computational representation of the brain's "nerve net". Each artificial neuron receives inputs and generates a single output that can be delivered to a number of other neurons. Artificial neurons do not simply output the raw data they receive. Instead, there is one more step, called the activation function, which is analogous to the rate of action potential firing in the brain. The introduction of the activation function in ANNs was inspired by biological neural networks, where its purpose is to decide whether a particular neuron fires or not. The simple addition of such a nonlinear function can tremendously help the network to exploit more, thereby learning faster. Various activation functions have been proposed in the literature, and it is difficult to find one optimal activation function that can tackle any problem. In this survey, our aim is to discuss the advantages as well as the disadvantages or limitations of classical (fixed) and modern (adaptive) activation functions.

1.2 Contributions of this work

To the best of our knowledge, this is the first comprehensive survey of activation functions for both classification and regression problems. Apart from that, we also present several original contributions, summarized below.

  • First, we have done a comprehensive survey of the classical (fixed) activation functions like the real-valued activation functions, including the rectifier units. This also includes the oscillatory as well as the non-standard activation functions. We further discuss various properties of the classical activation functions that make them best suited for the particular tasks.

  • We present a taxonomy based on applications in addition to discussing a taxonomy based on the characterization of activation functions like fixed, adaptive, and multi-fold activations. In particular, we discuss various complex-valued activation functions, which have many applications in remote sensing, acoustics, opto-electronics, image processing, quantum neural devices, robotics, bioinformatics, etc. Furthermore, the quantized activations, which are a special type of activation function, are also discussed in detail. The quantized activations are typically used to improve the efficiency of the network without degrading the performance of the model.

  • The state-of-the-art adaptive activation functions, which outperform their classical counterparts, are thoroughly discussed. Such adaptive activation not only accelerates the training of the network but also increases the prediction accuracy. Various adaptive activation functions such as stochastic or probabilistic, ensemble, fractional, etc. are discussed in detail. To this end, we also compare various fixed and adaptive activation functions systematically for classification data sets including MNIST, CIFAR-10, and CIFAR-100.

  • Physics-informed machine learning is an intriguing approach that seamlessly incorporates the governing physical laws into the machine learning framework. Such incorporation of physical laws places additional requirements on the activation function. To this end, we discuss these requirements specifically for solving scientific problems using a physics-informed machine learning framework. We use various fixed and adaptive activation functions to solve different PDEs. Furthermore, we compare the predictive accuracy of the solution for different activation functions using various machine learning libraries such as TensorFlow, PyTorch, and JAX on clean and noisy data sets. A run-time comparison is also made among the TensorFlow, PyTorch, and JAX machine learning libraries.

1.3 Organization

This paper is organized as follows. In Section 2, we discuss the historical perspective of the activation functions. In Section 3, we compare biological and artificial neurons in detail, followed by Section 4, where some of the desired features of the neurons are discussed in depth. Section 5 gives a detailed discussion of the taxonomy of the activation functions. Section 6 covers several classical activation functions with their improved versions in detail. Section 7 is devoted to the motivation and historical development of complex-valued activation functions. Similarly, Section 8 discusses the efficient quantized activations for quantized neural networks. In Section 9, we discuss various types of adaptive or trainable activation functions, ranging from stochastic or probabilistic to ensemble activations. Various fixed and adaptive activation functions are compared in terms of accuracy for the MNIST, CIFAR-10, and CIFAR-100 data sets in Section 10 using two convolutional neural network-based models, namely, MobileNet and VGG16. Section 11 presents the discussion on the activation function required for solving regression problems using a physics-informed machine learning framework. Finally, we summarize our findings in Section 12.

2 Activation functions: A historical perspective

The activation function is a function that acts on the output of the hidden layer and, optionally, the output layer, and helps the neural network to learn complex features of the data. The motivation to employ the activation function came from biological neural networks, where it decides whether a particular neuron activates (fires) or not. The important feature of the activation function is its ability to introduce nonlinearity in the network in order to capture nonlinear features, without which the neural network acts as a linear regression model. Cybenko (1989) and Hornik et al. (1989) argue for the activation function's nonlinearity, demonstrating that the activation function must be bounded, non-constant, monotonically increasing, and continuous to ensure the neural network's universal approximation property.
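The claim that a network without a nonlinearity acts as a linear model can be checked on a toy example. The sketch below is only illustrative: the 2-2-1 network, its weights, and the helper names `linear_layer`, `net_linear`, and `net_relu` are assumptions chosen for the demonstration, not taken from the survey.

```python
def linear_layer(weights, bias, x):
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Element-wise ReLU, max(0, v)."""
    return [max(0.0, v) for v in values]


# Hypothetical weights for a 2-2-1 network (illustrative values only).
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.0]


def net_linear(x):
    """Two stacked layers with no activation: still a single linear map."""
    return linear_layer(W2, b2, linear_layer(W1, b1, x))[0]


def net_relu(x):
    """Same weights, with ReLU applied after the hidden layer."""
    return linear_layer(W2, b2, relu(linear_layer(W1, b1, x)))[0]


# Without the activation, the two hidden units cancel exactly.
print(net_linear([3.0, 1.0]))   # (3 - 1) + (1 - 3)
# With ReLU, the same weights compute |x1 - x2|, which no linear map can.
print(net_relu([3.0, 1.0]), net_relu([1.0, 3.0]))
```

With these particular weights the activation-free network returns zero for every input, while the ReLU version represents the absolute difference of its two inputs, a simple nonlinear feature.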

Figure 1 shows some of the notable historical developments related to activation functions that we shall discuss here. In the literature, the activation function has been referred to by different names, such as squashing function Haykin and Lippmann (1994), and output function or transfer function Duch and Jankowski (1999). DasGupta and Schnitger (1993) defined the activation as a real-valued function defined on a subset of $\mathbb{R}$. Goodfellow et al. (2016) described the activation function as a fixed nonlinear function. The activation function was first used in the work of Fukushima (1969) for visual feature extraction in hierarchical neural networks. Later, Fukushima and Miyake (1982) proposed it again for visual pattern recognition. Hahnloser et al. (2000, 2003) argued that the activation function is strongly related to biological neurons. In the seminal work of Glorot et al. (2011), it was found that the presence of the activation function enables the fast learning of neural networks. During the early 1990s, the Sigmoid activation function Han and Moraga (1995) was one of the most popular activation functions. Because of its vanishing gradient problem, improvements such as the improved logistic Sigmoid function Qin et al. (2018) have recently been proposed. In the late 1990s, researchers widely used the hyperbolic tangent function LeCun et al. (2012) as an activation function. Both Sigmoid and hyperbolic tangent face vanishing gradient problems and have difficulties with large inputs, as they saturate for large input values. Later, researchers proposed some alternative functions, and one of the most popular activation functions that resists vanishing gradients was the rectified linear unit (ReLU). As of 2017, the ReLU was the most popular activation function Ramachandran et al. (2017) compared to widely used activation functions such as sigmoid and hyperbolic tangent functions. Although the ReLU was very successful in many applications such as speech recognition Maas et al. (2013) and computer vision Glorot et al. (2011), it suffered from issues such as non-differentiability at zero, unboundedness, and, most famously, the dying ReLU problem Maas et al. (2013). Several other linear as well as non-linear variants of the rectifier unit were proposed. The notable linear variants are leaky ReLU Maas et al. (2013) and parametric ReLU He et al. (2015). Some of the non-linear variants are softplus Dugas et al. (2000), Exponential Linear Units (ELU) Clevert et al. (2015), Gaussian Error Linear Units (GELU) Hendrycks and Gimpel (2016), Swish Ramachandran et al. (2017), and Mish Misra (2019).

Figure 1: Notable historical developments over the years related to the activation function.

Recent years have seen an increase in studies on adaptive activation functions. In the early studies, the generalized hyperbolic tangent function was suggested by Chen and Chang (1996). It is parameterized by two additional positive scalar values. Vecci et al. (1998) proposed an adaptive spline activation function. Trentin (2001) gives empirical evidence that learning the amplitude for each neuron is superior to having a unit amplitude for all activation functions (either in terms of generalization error or speed of convergence). Goh and Mandic (2003) proposed a trainable amplitude activation function. An activation function adaptation algorithm was suggested for sigmoidal feed-forward neural network training by Chandra and Singh (2004). Recently, Agostinelli et al. (2014) proposed learning activation functions. Eisenach et al. (2016) proposed a nonparametric method for learning activation functions. The algebraic activation functions were proposed by Naresh Babu and Edla (2017). Urban et al. (2017) proposed stochastic activation functions based on Gaussian processes. Alcaide (2018) proposed the E-swish activation function. The trained activation function was proposed by Ertuğrul (2018). Convolutional neural networks with adaptive activations were proposed by Qian et al. (2018). In order to discover the ideal quantization scale, Choi et al. (2018) introduced PArameterized Clipping acTivation (PACT), which makes use of an activation clipping parameter that is tuned during training. Apicella et al. (2019) proposed an effective design for trainable activation functions. Several variants of the adaptive ReLU activation function have been proposed recently, such as Parametric ReLU He et al. (2015), S-shaped ReLU Jin et al. (2016), Flexible ReLU Qiu et al. (2018), and Paired ReLU Tang et al. (2018). Similarly, adaptive ELUs such as Parametric ELU Shah et al. (2016), Continuously Differentiable ELU Barron (2017), Shifted ELU Grelsson and Felsberg (2018), Fast ELU Qiumei et al. (2019), and Elastic ELU Kim et al. (2020) have also been proposed in the literature. Recently, Jagtap et al. published a series of papers on global Jagtap et al. (2020b) and local Jagtap et al. (2020a) adaptive activation functions. Another way to introduce adaptivity is through the use of stochastic activation functions. Gulcehre et al. (2016) proposed a noisy activation function where a structured, bounded noise-like effect is added to allow the optimizer to exploit more and learn faster. Building on a similar idea, Shridhar et al. (2019) proposed the probabilistic activation functions, which are not only trainable but also stochastic in nature. In this work, the authors replicated uncertain behavior in the information flow to the neurons by injecting stochastic sampling from a normal distribution into the activations. Fractional activation functions with a trainable fractional parameter were proposed by Ivanov (2018). Later, Zamora Esquivel et al. (2019) provided adaptable activation functions based on fractional calculus. The ensemble technique is another way to adapt activation functions. Earlier work by Chen (2016) used multiple activation functions for each neuron for problems related to stochastic control. Agostinelli et al. (2014) constructed the activation functions during network training. Jin et al. (2016) proposed the combination of a set of linear functions with open parameters. In recent years, a generalized framework for any adaptive activation function, namely Kronecker Neural Networks, has been proposed by Jagtap et al. (2021). Good reviews of classical activation functions Nwankpa et al. (2018); Szandała (2021) and of modern trainable activations Apicella et al. (2021) are available for various classification problems.

3 Biological vs artificial neurons

A neuron in the human brain is a biological cell that processes information. The biological neuron is depicted in the top panel of Figure 2. A biological neuron has three basic elements: the dendrites, which receive signals as input from other neurons; the cell body (nucleus), which controls the activity of the neuron; and the axon, which transmits signals to neighboring neurons. The axon's length may be several times or even tens of thousands of times longer than the cell body. The axon is divided into various branches near its extremity, which are connected to the dendrites of other neurons. There are millions of massively connected neurons (around $10^{11}$), which is approximately equal to the number of stars in the Milky Way Brunak and Lautrup (1990). Every neuron is connected to thousands of neighboring neurons, and these neurons are organized into successive layers in the cerebral cortex.

Figure 2: Biological (top) vs artificial (bottom) neurons.

Such a massively parallel neural network communicates through very short trains of pulses, on the order of milliseconds, and offers parallel processing, fast learning, adaptivity, generalization, very low energy consumption, and fault tolerance. The artificial neuron in the bottom panel of Figure 2 attempts to resemble the biological neuron. Like a biological neuron, the artificial neuron takes an input and is made up of three basic parts: the weights and bias, which act as the dendrites; the activation function, which acts as the cell body and nucleus; and the output. The artificial neuron has been generalized in many ways. Among all the components, the most obvious is the activation function. The activation function plays an important role in replicating the firing behavior of the biological neuron above some threshold value. Unlike biological neurons, which send binary values, artificial neurons send continuous values, and depending on the activation function, the firing behavior of artificial neurons changes significantly.
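The three parts of the artificial neuron can be put together in a few lines. The following is a minimal sketch, assuming the usual weighted-sum form; the symbols `x`, `w`, `b`, the function names, and the numerical values are illustrative choices rather than the survey's notation.

```python
import math


def neuron(inputs, weights, bias, activation):
    """Single artificial neuron: weighted sum of inputs plus bias, then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def step(z):
    """Binary firing: 1 above the threshold (here zero), 0 otherwise."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Continuous firing rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative inputs, weights, and bias (pre-activation is about -0.1).
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, 0.1]
b = -0.2

print(neuron(x, w, b, step))     # binary output, as in a thresholded unit
print(neuron(x, w, b, sigmoid))  # continuous output slightly below 0.5
```

Swapping the `activation` argument changes the firing behavior of the same neuron from binary to continuous, which is the point made in the paragraph above.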

4 Desired characteristics of the activation functions

There is no universal rule for determining the best activation function; it varies depending on the problem under consideration. Nonetheless, some of the desirable qualities of activation functions are well known in the literature. The following are the essential characteristics of any activation function.

  1. Nonlinearity: One of the most essential characteristics of an activation function is nonlinearity. In comparison to linear activation functions, the nonlinearity of the activation function significantly improves the learning capability of neural networks. Cybenko (1989) and Hornik et al. (1989) advocate for the nonlinear property of the activation function, demonstrating that the activation function must be bounded, non-constant, monotonically increasing, and continuous in order to ensure the neural network's universal approximation property. Morita (1993, 1996) later discovered that neural networks with non-monotonic activation functions perform better in terms of memory capacity and retrieval ability.

  2. Computationally cheap: The activation function must be easy to evaluate in terms of computation. This has the potential to greatly improve network efficiency.

  3. The vanishing and exploding gradient problems: The vanishing and exploding gradient problems are important issues for activation functions. Some activation functions, such as the logistic function (Sigmoid), compress a large input space into a much smaller output space, in this case the interval (0, 1), so even a large change in the input produces only a small change in the output. As a result, the back-propagation algorithm has almost no gradient to propagate backward through the network, and whatever gradient remains is diluted further at each layer as it travels from the top layers toward the input. The initial layers are therefore left with almost nothing. For hyperbolic tangent and sigmoid activation functions, it has been observed that the saturation region for large inputs (both positive and negative) is a major reason behind the vanishing of the gradient. One of the important remedies to this problem is the use of non-saturating activation functions, such as ReLU, leaky ReLU, and other variants of ReLU.

  4. Finite range/boundedness: Gradient-based training approaches are more stable when the range of the activation function is finite, because pattern presentations significantly affect only limited weights.

  5. Differentiability: The most desirable quality for using gradient-based optimization approaches is continuously differentiable activation functions. This ensures that the back-propagation algorithm works properly.
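The vanishing gradient effect described in item 3 can be illustrated with a back-of-the-envelope calculation. The sketch below makes strong simplifying assumptions that are not part of the survey: every layer has a single unit, all weights equal one, and every layer sees the same pre-activation `z`, so the back-propagated gradient is just a product of identical local derivatives. The helper name `chained_gradient` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU away from zero (0 is used at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, z, depth):
    """Product of `depth` identical local derivatives (assumption: unit weights)."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad(z)
    return g


# Best case for the sigmoid (z = 0): the gradient still shrinks as 0.25**depth.
print(chained_gradient(sigmoid_grad, 0.0, 10))
# In the saturation region the decay is far more severe.
print(chained_gradient(sigmoid_grad, 5.0, 10))
# A non-saturating unit with positive input passes the gradient unchanged.
print(chained_gradient(relu_grad, 1.0, 10))
```

Even in its most favorable regime the sigmoid loses a factor of four per layer under these assumptions, whereas the active ReLU unit does not attenuate the gradient at all.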

5 Taxonomy of activation functions

5.1 Characterization based taxonomy

The activation functions can be broadly divided into linear and nonlinear functions. For all realistic problems, nonlinear activation functions are often employed.

Figure 3: Taxonomy based on characterization of activation functions.

Figure 3 shows the taxonomy based on the characterization of activation functions, which divides them into three major categories. The first category is the fixed activation functions, which contains all the classical fixed activations, including the rectifier units. The second category is the adaptive or modern activation functions that can adapt themselves, which is further divided into parametric and ensemble activations. The third category is the non-standard activations, such as multi-fold activations from previous layers, which may be adaptive or fixed.

5.2 Application based taxonomy

Although the taxonomy of activation functions based on their characterization encompasses all the activation functions, it is important to also define a taxonomy based on applications, which can further broaden our understanding of activation functions. The activation functions can be divided into real- and complex-valued activation functions; see Fig. 4. The real-valued activations are well known in the literature, and we shall discuss them in detail later. The complex-valued activation functions are another set of activation functions whose output is a complex number. These activation functions are required when the output is complex-valued, which has many applications in science and engineering, such as bioinformatics, acoustics, robotics, opto-electronics, quantum neural devices, and image processing.

Figure 4: Taxonomy based on the application of activation functions.

The process of mapping continuous values from an infinite set to a finite set of discrete values is known as quantization. Both real- and complex-valued activation functions can be quantized in order to reduce memory requirements. Quantized activations are the class of activation functions that are efficient in terms of both memory requirements and compute. The outputs of a quantized activation are integers rather than floating-point values. Note that, just like real-valued activation functions, both complex-valued and quantized activations can be made adaptive by introducing tunable variables. The upcoming sections provide a clear discussion of the various complex-valued and quantized activation functions.
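As a rough illustration of quantization, the sketch below maps a clipped, ReLU-like activation onto a small set of integer levels. It is a generic uniform quantizer written for this example, not a scheme from the survey; the function names and the parameters `num_bits` and `clip` are hypothetical.

```python
def quantize_activation(x, num_bits=2, clip=1.0):
    """Clip x to [0, clip] and map it to an integer level in {0, ..., 2**num_bits - 1}."""
    levels = 2 ** num_bits - 1
    clipped = min(max(x, 0.0), clip)
    return int(round(clipped / clip * levels))


def dequantize(level, num_bits=2, clip=1.0):
    """Map an integer level back to the real value it represents."""
    return level * clip / (2 ** num_bits - 1)


for value in (-3.0, 0.4, 0.9, 5.0):
    level = quantize_activation(value)
    print(value, "->", level, "->", dequantize(level))
```

With `num_bits=2` every activation is stored as one of four integers, which is where the memory saving comes from; the price is the rounding error visible when the level is mapped back to a real value.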

6 Classical activation functions in artificial neural networks

This section gives details about the various fixed activation functions that have been proposed in the literature. These activation functions do not contain any tunable parameters. The following are the most common fixed activation functions used.

6.1 Linear and piece-wise linear functions

The linear function is the simplest form of activation function. It has a constant gradient, and the descent is based on this constant value of the gradient. The range of the linear function is $(-\infty, \infty)$, and it has a $C^{\infty}$ order of continuity.

The piece-wise linear activation can be defined as

where is a constant. The derivative of piece-wise linear activation is not defined at , and it is zero for and . The range of linear function is , and has a order of continuity.

6.2 Step function

Also called the Heaviside or unit step function, it is defined as

$$ f(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases} $$

The step function is one of the most basic forms of activation function. Its derivative is zero when $x \neq 0$, and at $x = 0$ the derivative is not defined. The step function has a range of $\{0, 1\}$ and has a $C^{-1}$ order of continuity.

6.3 Sigmoid function

The sigmoid function Han and Moraga (1995), also called the logistic function, was a very popular choice of activation function until the early 1990s. It is defined as

$$ f(x) = \frac{1}{1 + e^{-x}}. $$

Hinton et al. (2012) used a sigmoid activation function for automatic speech recognition. The major advantage of sigmoid activation is its boundedness. The disadvantages are the vanishing gradient problem, the output not being zero-centered, and the saturation for large input values. Nair and Hinton (2010) showed that as networks became deeper, training with sigmoid activations proved less effective. The range of the sigmoid function is $(0, 1)$, and it has a $C^{\infty}$ order of continuity.
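The listed properties of the sigmoid (boundedness, outputs that are not zero-centered, and saturation) are easy to verify numerically. The sketch below assumes the standard logistic form $1/(1+e^{-x})$; the two-branch evaluation is one common way to avoid overflow in `math.exp` and is an implementation choice made here, not something prescribed by the survey.

```python
import math


def sigmoid(x):
    """Logistic function, evaluated in a form that never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Bounded and not zero-centered: every output lies between 0 and 1.
print([round(sigmoid(x), 4) for x in (-5.0, 0.0, 5.0)])
# Saturation: the derivative peaks at 0.25 and collapses for large |x|.
print(sigmoid_grad(0.0), sigmoid_grad(20.0))
```

The derivative at `x = 20` is many orders of magnitude smaller than at the origin, which is the saturation behavior that drives the vanishing gradient problem.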

Some improvements to Sigmoid functions were proposed in the literature as follows.

  • The Bipolar-Sigmoid activation function Kwan (1992) is defined as

    $$ f(x) = \frac{1 - e^{-x}}{1 + e^{-x}}, $$

    which has a range of $(-1, 1)$. The bipolar sigmoid activation function is widely used in Hopfield neural networks; see Mansor and Sathasivam (2016) for more details.

  • The Elliott function defined in Elliott (1993) has a structure similar to that of the Sigmoid function, and it is given by

    $$ f(x) = \frac{0.5\,x}{1 + |x|} + 0.5, $$

    with range $(0, 1)$. See Farzad et al. (2019) for more details.

  • The parametric sigmoid function Chandra and Singh (2004) is given by

    $$ f(x) = \left( \frac{1}{1 + e^{-x}} \right)^{m}, \quad m \in (0, \infty). $$

    The derivatives for $m \neq 1$ are skewed, and their maxima shift away from the point corresponding to an input value of zero.

  • An improved logistic Sigmoid function is proposed by Qin et al. (2018) to overcome the vanishing gradient problem as

    which has a [] range.

  • The scaled sigmoid function Eger et al. (2019) is defined as

    $$ f(x) = \frac{4}{1 + e^{-x}} - 2, $$

    which has a range of $(-2, 2)$.
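Two of the sigmoid variants above can be sketched directly. The forms used here are assumptions based on their common definitions: the bipolar sigmoid as $(1 - e^{-x})/(1 + e^{-x})$, which coincides with $\tanh(x/2)$, and the Elliott function in its shifted form $0.5\,x/(1+|x|) + 0.5$, which matches the stated range between 0 and 1.

```python
import math


def bipolar_sigmoid(x):
    """Assumed form (1 - e^-x) / (1 + e^-x); intended for moderate |x| only."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))


def elliott(x):
    """Assumed shifted Elliott form; sigmoid-like but free of exponentials."""
    return 0.5 * x / (1.0 + abs(x)) + 0.5


# The bipolar sigmoid is a rescaled hyperbolic tangent.
print(bipolar_sigmoid(1.3), math.tanh(0.65))
# The Elliott function approaches its bounds only polynomially fast.
print(elliott(-100.0), elliott(0.0), elliott(3.0))
```

Because the Elliott function uses only an absolute value and a division, it is cheaper to evaluate than the exponential-based variants, at the cost of much slower saturation.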

6.4 Hyperbolic tangent (tanh) function

The tanh activation function is defined as

$$ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. $$

From the late 1990s until the early 2000s, tanh was extensively used to train neural networks and was preferred over the classical sigmoid activation function. The tanh activation has a range of $(-1, 1)$ and, in general, is mostly used for regression problems. Its zero-centered structure is an advantage. The main problem with the tanh activation is its saturation region. Once the function is saturated, it is very challenging for the learning algorithm to adapt the parameters and learn faster. This is the vanishing gradient problem.

Some improved tanh activation functions have been proposed to eliminate the problems related to the original tanh activation function.

  • A scaled hyperbolic tangent function is defined by LeCun et al. (1998) as

    $$ f(x) = A \tanh(S x). $$

    Here $A$ is the amplitude of the function, and $S$ determines its slope at the origin. The output of the activation function lies in the range $[-A, A]$.

  • A rectified hyperbolic secant activation function Samatin Njikam and Zhao (2016) is proposed as $f(x) = x \operatorname{sech}(x)$.

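  • The Hexpo function Kong and Takatsuka (2017) has a structure similar to that of the hyperbolic tangent function and is defined as

    $$ f(x) = \begin{cases} -a \left( e^{-x/b} - 1 \right), & x \ge 0, \\ c \left( e^{x/d} - 1 \right), & x < 0, \end{cases} $$

    where $a$, $b$, $c$, and $d$ are constants, and it has a range of $[-c, a]$.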

  • The penalized hyperbolic tangent function Eger et al. (2019) is defined as

    $$ f(x) = \begin{cases} \tanh(x), & x > 0, \\ 0.25 \tanh(x), & x \le 0, \end{cases} $$

    which gives output in the range $(-0.25, 1)$.

  • The linearly scaled tanh (LiSHT) activation function Roy et al. (2019) is defined as $f(x) = x \tanh(x)$. The LiSHT scales the hyperbolic tangent function linearly to tackle its diminishing gradient problem.
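A few of the tanh variants above are sketched below under stated assumptions. The scaled tanh is taken as $A \tanh(Sx)$ with the commonly quoted constants $A = 1.7159$ and $S = 2/3$; the penalized tanh uses a factor of 0.25 on the negative side; LiSHT is taken as $x \tanh(x)$. The function and parameter names are illustrative.

```python
import math


def scaled_tanh(x, amplitude=1.7159, slope=2.0 / 3.0):
    """A * tanh(S * x); with these constants f(1) is close to 1."""
    return amplitude * math.tanh(slope * x)


def penalized_tanh(x, a=0.25):
    """tanh for positive inputs, damped by the factor a otherwise."""
    t = math.tanh(x)
    return t if x > 0 else a * t


def lisht(x):
    """x * tanh(x): non-negative, symmetric, and unbounded above."""
    return x * math.tanh(x)


print(scaled_tanh(1.0), scaled_tanh(-1.0))
print(penalized_tanh(2.0), penalized_tanh(-2.0))
print(lisht(2.0), lisht(-2.0))
```

Note that LiSHT gives the same output for `x` and `-x`, so unlike tanh it is not zero-centered; what it gains is an output that keeps growing with `|x|` instead of flattening out.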

6.5 Rectified Linear Unit (ReLU)

ReLU was primarily used to overcome the vanishing gradient problem. ReLU is the most common activation function used for classification problems. It is defined as

$$ f(x) = \max(0, x). $$

The derivative of ReLU is zero when $x < 0$, unity when $x > 0$, and at $x = 0$ the derivative is not defined. The ReLU function has a range of $[0, \infty)$ and has a $C^{0}$ order of continuity. Apart from overcoming the vanishing gradient problem, ReLU is very easy to implement and thus cheaper, unlike tanh and sigmoid, where an exponential function is needed. Despite having some advantages over classical activations, ReLU still has a saturation region, which can prevent the network from learning. In particular, ReLU always discards the negative values. This makes the neurons stop responding to the gradient-based optimizer. This problem is known as the dead or dying ReLU problem Maas et al. (2013); Lu et al. (2019), meaning the neurons stop outputting anything other than zero. This is one of the serious problems for ReLU, where most of the neurons become dead, especially when using a high learning rate. To overcome these problems, various variants of ReLU have been proposed.

  • Leaky ReLU Maas et al. (2013): The leaky ReLU is defined as

    $$ f(x) = \begin{cases} x, & x \ge 0, \\ \alpha x, & x < 0. \end{cases} $$

    The hyperparameter $\alpha$ defines the leakage in the function (the slope of the function for negative inputs). By adding a small slope in the region $x < 0$, the leaky ReLU overcomes the dying ReLU problem. Moreover, it has all the advantages of the ReLU activation. One of the disadvantages of leaky ReLU is the hyperparameter $\alpha$, which needs to be defined appropriately. In most cases, $\alpha = 0.01$ is used. Xu et al. (2015) compared the ReLU and leaky ReLU activations and concluded that the latter always outperforms the former.

  • Randomized leaky ReLU Xu et al. (2015): In this case, $\alpha$ is picked randomly from a given range during training and is fixed to its average value during testing.

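  • Mirror ReLU Zhao and Griffin (2016): It is defined as

    $$ f(x) = \begin{cases} 1 + x, & -1 \le x \le 0, \\ 1 - x, & 0 < x \le 1, \\ 0, & \text{otherwise}, \end{cases} $$

    which can also be defined as $f(x) = \max(0, 1 - |x|)$.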

  • Concatenated ReLU Shang et al. (2016): The concatenated ReLU (CReLU) is given as

    $$ f(x) = [\max(0, x), \max(0, -x)], $$

    which has a range of $[0, \infty)$. The disadvantage of CReLU compared with ReLU is the increased model complexity.

  • Elastic ReLU Jiang et al. (2018): The Elastic ReLU is defined as

    $$ f(x) = \max(R x, 0), $$

    where $R$ is a random number.

  • Bounded ReLU Liew et al. (2016): The ReLU function has unbounded outputs for non-negative inputs. The bounded version of the ReLU function is defined as

    $$\mathrm{BReLU}(x) = \min(\max(0, x), A),$$

    where $A$ is the maximum output value the function can produce.

  • V-shaped ReLU Hu (2018): It is defined as

  • Dual ReLU Godin et al. (2018): The Dual ReLU has two dimensions, and it is defined as

    $$\mathrm{DualReLU}(a, b) = \max(0, a) - \max(0, b),$$

    where $(a, b)$ are the inputs in two different dimensions. The range of Dual ReLU is $(-\infty, \infty)$.

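  • Randomly translated ReLU Cao et al. (2018): It is given by

    $$\mathrm{RTReLU}(x) = \begin{cases} x + a, & x + a > 0, \\ 0, & x + a \leq 0, \end{cases}$$

    which has a range of $[0, \infty)$, where $a$ is randomly sampled from a Gaussian distribution.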

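  • Displaced ReLU (DReLU) Macêdo et al. (2019): DReLU is a diagonally displaced ReLU function that generalizes both ReLU and SReLU Clevert et al. (2015) by allowing its inflection to move diagonally from the origin to any point of the form $(-\delta, -\delta)$. It is defined as

    $$\mathrm{DReLU}(x) = \begin{cases} x, & x \geq -\delta, \\ -\delta, & x < -\delta, \end{cases}$$

    with range $[-\delta, \infty)$. If $\delta = 0$, DReLU becomes ReLU, and if $\delta = 1$, DReLU becomes SReLU.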

  • Natural-Logarithm ReLU Liu et al. (2019): The Natural-Logarithm ReLU uses a logarithmic function to modify the ReLU output for positive input values, which increases the degree of nonlinearity. It is defined as

    $$\mathrm{NLReLU}(x) = \ln(\beta \max(0, x) + 1).$$

    This activation has an output range from 0 to $\infty$, and $\beta$ is a constant.

  • Average Biased ReLU Dubey and Chakraborty (2021): The Average-Biased ReLU is given by

    $$\mathrm{ABReLU}(x) = \max(0, x - \beta),$$

    which has a range of $[0, \infty)$, and the variable $\beta$ is the average of the input activation map to the activation function.
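Several of the rectifier variants listed above can be sketched in a few lines of scalar Python. This is only an illustration under stated assumptions: the default leakage of 0.01, the sampling interval for the randomized slope, and the bound `a=6.0` are illustrative values chosen here, not values prescribed by the cited papers.

```python
import random


def relu(x):
    """Standard ReLU: discards negative inputs."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: alpha is the slope used for negative inputs."""
    return x if x >= 0 else alpha * x


def randomized_leaky_relu(x, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """Randomized leaky ReLU.

    During training the slope is drawn uniformly from [lower, upper]
    (an assumed, illustrative interval); at test time it is fixed to
    the average of that interval.
    """
    alpha = rng.uniform(lower, upper) if training else 0.5 * (lower + upper)
    return x if x >= 0 else alpha * x


def bounded_relu(x, a=6.0):
    """Bounded ReLU: output is clipped to the interval [0, a]."""
    return min(max(0.0, x), a)


def crelu(x):
    """Concatenated ReLU: returns the pair (ReLU(x), ReLU(-x))."""
    return (max(0.0, x), max(0.0, -x))
```

Note that `crelu` doubles the number of outputs per input, which is the source of the increased model complexity mentioned above.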

Figure 5: Classical (fixed) activation functions with their range and continuity.
Figure 6: Derivatives of classical activation functions.

6.6 Gaussian Error Linear Units (GELUs)

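In the GELU activation Hendrycks and Gimpel (2016), the neuron input $x$ is multiplied by $m \sim \mathrm{Bernoulli}(\Phi(x))$, where $\Phi(x) = P(X \leq x)$ and $X \sim \mathcal{N}(0, 1)$. This distribution is chosen because the neuron's input tends to follow a normal distribution, especially after Batch Normalization. The output of any activation function should be deterministic, not stochastic, so the expected value of the transformation is used:

$$\mathbb{E}[m x] = x\,\mathbb{E}[m] = x\,\Phi(x).$$

Since $\Phi$ is the cumulative distribution function of the Gaussian distribution, it is frequently computed with the error function. Thus, the GELU activation is defined as

$$\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right],$$

which can be approximated as $0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right)$.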
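To see how close the tanh-based approximation is to the erf-based definition, the two can be compared numerically. The sketch below uses only the Python standard library; the helper `max_abs_gap` and its grid are hypothetical conveniences introduced here for illustration.

```python
import math


def gelu_exact(x):
    """GELU via the Gaussian CDF, written with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh-based approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))


def max_abs_gap(lo=-5.0, hi=5.0, n=1001):
    """Largest absolute difference between the two forms on a uniform grid."""
    step = (hi - lo) / (n - 1)
    return max(abs(gelu_exact(lo + i * step) - gelu_tanh(lo + i * step))
               for i in range(n))
```

On the interval $[-5, 5]$ the two curves are visually indistinguishable, which is why the approximation is often preferred when an error function is expensive or unavailable.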

To improve its capacity for bidirectional convergence, the GELU was upgraded to the symmetrical GELU (SGELU) by Yu and Su Yu and Su (2019).

6.7 Softplus function

The softplus function Dugas et al. (2000) approximates the ReLU activation function in a smooth way, and it is defined as

$$\mathrm{Softplus}(x) = \ln(1 + e^x).$$

The softplus function is infinitely differentiable, and it has a range of $(0, \infty)$. In Liu and Furber (2016), Liu and Furber proposed the noisy softplus activation. With Noisy Softplus, which is well-matched to the response function of LIF (Leaky Integrate-and-Fire) neurons, the performance of spiking neural networks can be improved, see Liu et al. (2017). The authors of Liu and Furber (2016) proposed the following formula:

$$y = k\sigma \ln\left[1 + \exp\left(\frac{x}{k\sigma}\right)\right],$$

where $x$ is the mean input, $\sigma$ defines the noise level, and $k$ controls the curve scaling, which can be determined by the neuron parameters.
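A direct evaluation of $\ln(1 + e^x)$ overflows for large $x$, so a common rearrangement is used in the sketch below. The noisy softplus function follows the form $k\sigma \ln[1 + \exp(x / (k\sigma))]$; the argument values used in any call are illustrative assumptions and are not fitted to LIF neuron parameters.

```python
import math


def softplus(x):
    """Numerically stable softplus: max(x, 0) + log(1 + exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def noisy_softplus(x, sigma, k):
    """Noisy softplus: k * sigma * softplus(x / (k * sigma)).

    x is the mean input, sigma the noise level, and k the curve scaling.
    """
    s = k * sigma
    return s * softplus(x / s)
```

As the noise level shrinks, `noisy_softplus` approaches the ReLU, while larger `sigma` values give a softer transition around zero.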

6.8 Exponential linear unit

The Exponential linear unit (ELU) was first proposed in Clevert et al. (2015) by Clevert et al., where they show that ELU outperforms all variants of ReLU with reduced training time as well as better accuracy in testing. The ELU is defined as

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0, \\ \alpha(e^x - 1), & x \leq 0. \end{cases}$$

When $x < 0$, it takes on negative values, allowing the unit's average output to be closer to zero and alleviating the vanishing gradient problem. Also, due to the non-zero gradient for $x < 0$, ELU does not suffer from the problem of dead neurons. Unlike ReLU, ELU bends smoothly at the origin, which can be beneficial in the optimization process. Similar to leaky ReLU and parametric ReLU, ELU gives negative outputs, which push the mean value towards zero. Again, $\alpha$ is a parameter that needs to be specified. For $\alpha = 1$, the function is smooth everywhere, which in turn helps to speed up the gradient descent algorithm.

In Klambauer et al. (2017), Klambauer et al., proposed the Scaled ELU (SELU) activation function, where the authors show that for the neural network consisting of a stack of dense layers, the network will self-normalize if all the hidden-layers use the SELU activation function. However, there are some conditions for self-normalization; see Klambauer et al. (2017) for more details.

Activation Function ($f(x)$) Derivatives ($f'(x)$) Range Continuity
Linear : ()
Step : ()
Sigmoid or Logistic : ()
Rectifier Unit (ReLU) : [)
Hyperbolic Tangent : ()
Softplus : ()
Leaky Rectifier Unit (Leaky ReLU) : ()
Exponential Linear Unit (ELU) : ()
Gaussian : (]
Swish () : [)
Oscillatory : []
Table 1: Various classical activation functions, their derivatives, range, and order of continuity.

6.9 Mish function

The Mish Misra (2019) is a self-regularized, non-monotonic activation function defined as

$$\mathrm{Mish}(x) = x \tanh(\mathrm{softplus}(x)) = x \tanh(\ln(1 + e^x)).$$

While not evident at first sight, Mish is closely related to Swish Ramachandran et al. (2017), as its first derivative can be written in terms of the Swish function. Similar to Swish, Mish is unbounded above and bounded below. It is non-monotonic, has $C^\infty$ continuity, and is also self-gated.
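The non-monotonic and bounded-below behavior of Mish is easy to check numerically. The following minimal sketch assumes the usual form $x \tanh(\ln(1 + e^x))$ and reuses a stable softplus helper.

```python
import math


def softplus(x):
    """Numerically stable softplus."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish: the input gated by tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def grid_minimum(lo=-5.0, hi=5.0, n=1001):
    """Smallest value of Mish on a uniform grid (a rough lower bound)."""
    step = (hi - lo) / (n - 1)
    return min(mish(lo + i * step) for i in range(n))
```

Evaluating `mish` at a few negative inputs shows the small dip below zero: the function first decreases and then rises back towards zero as the input becomes more negative.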

6.10 Radial activation functions

The traditional radial basis function (RBF) neural network uses the Gaussian function, which is given as

The RBF neural networks with Gaussian activation functions have previously been successfully applied to a variety of difficult problems such as function approximation Hartman et al. (1990); Leonard et al. (1992), classification Er et al. (2002); Savitha et al. (2012), etc. Other radial activation functions include multiquadrics Lanouette et al. (1999) and polyharmonic splines. The polyharmonic spline is a linear combination of polyharmonic radial basis functions and can be used as an activation function Hryniowski and Wong (2018).

Table 1 summarizes the comparison of different activation functions, their derivatives, range, and the order of continuity. The existence of the derivatives is an important feature for the backpropagation algorithm. Figures 5 and 6 show some of the classical activation functions and their derivatives, respectively.

6.11 Oscillatory activation functions

In Gidon et al. (2020), Gidon et al. found a new form of neuron in the human cortex that can learn the XOR function individually (a task that is difficult for single neurons employing other standard activation functions such as sigmoidal, ReLU, leaky ReLU, Softplus, ELU, Swish, and Mish activations). The reason is that the decision boundary of a neuron that emits an activation $f(z)$, with $z = w^T x + b$, is the set of inputs for which $f(z) = 0$, i.e., it is determined by the zeros of the activation function. If the activation function $f(z)$ has only one zero, then the decision boundary is a single hyperplane $z = w^T x + b = 0$. Two hyperplanes are required to distinguish the classes in the XOR data set Minsky and Papert (1969), hence activation functions with multiple zeros, as described in Lotfi and Akbarzadeh-T (2014), must be considered. The oscillatory activation functions fit this criterion perfectly. It is interesting to note that oscillatory activation functions were already present in the literature before the paper by Gidon et al. Gidon et al. (2020). In Nakagawa (1999), Nakagawa proposed the chaos neural network model applied to the chaotic autoassociation memory using a sinusoidal activation function. Mingo et al. Mingo et al. (2004) proposed the Fourier neural network with a sinusoidal activation function, see also Gashler and Ashmore Gashler and Ashmore (2014). In Parascandolo et al. (2016), Parascandolo et al. used the sine activation for deep neural networks. For learning the XOR function with a single neuron, Noel et al. Noel et al. (2021) proposed the Growing Cosine Unit (GCU), $f(z) = z \cos(z)$.
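The multiple-zeros argument can be made concrete with a single neuron that uses the Growing Cosine Unit $z\cos(z)$. In the sketch below the weights and bias are hand-picked for illustration (an assumption made here, not trained values and not taken from Noel et al.), and the sign of the activation is used as the class label.

```python
import math


def gcu(z):
    """Growing Cosine Unit: z * cos(z), which has infinitely many zeros."""
    return z * math.cos(z)


def neuron(x1, x2, w1=2.0, w2=2.0, b=1.0):
    """Single neuron with hand-picked (hypothetical) weights and bias."""
    return gcu(w1 * x1 + w2 * x2 + b)


def predict_xor(x1, x2):
    """Class 1 when the activation is negative, class 0 otherwise."""
    return 1 if neuron(x1, x2) < 0 else 0
```

With these values the pre-activations are 1, 3, 3, and 5 for the four XOR inputs. The zeros of the GCU at $\pi/2$ and $3\pi/2$ lie between them, so the neuron effectively uses two parallel hyperplanes as its decision boundary.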

Figure 7: The oscillatory activation functions (first column), and their derivatives (second column).

Figure 7 shows various oscillatory activation functions along with their derivatives. Recently, Jagtap et al., Jagtap et al. (2021) proposed the Rowdy activation function, where the oscillatory noise is injected over the monotonic base activation function with adaptable parameters, thereby making them oscillatory as the optimization process starts. This adaptively creates multiple hypersurfaces to better learn the data set.

6.12 Non-standard activation functions

This section will cover the Maxout unit and the Softmax functions, which are two non-standard activation functions.

  • Maxout Goodfellow et al. (2013): The Maxout unit is a piece-wise linear function that gives the maximum of the inputs, and it is designed to be used in conjunction with dropout Srivastava et al. (2014). It generalizes the ReLU and leaky ReLU activation functions. Given the unit's input $x \in \mathbb{R}^d$, the activation of a maxout unit is computed by first computing $k$ linear feature mappings $z \in \mathbb{R}^k$, where

    $$z_i = w_i x + b_i,$$

    and $w_i$ and $b_i$ are weights and biases, whereas $k$ is the number of linear sub-units combined by one maxout unit. The output of the maxout hidden unit is then given as the maximum over the $k$ feature mappings:

    $$h_{\mathrm{maxout}}(x) = \max[z_1, \ldots, z_k].$$

    Maxout's success can be partly attributed to the fact that it supports the optimization process by preventing units from remaining idle, which is a consequence of the thresholding performed by the rectified linear unit. The activation function of the maxout unit, on the other hand, can be viewed as executing a pooling operation across a subspace of $k$ linear feature mappings (referred to as subspace pooling in the following). Each maxout unit is somewhat invariant to changes in its input as a result of this subspace pooling procedure. In Springenberg and Riedmiller (2013), Springenberg and Riedmiller proposed a stochastic generalization of the maxout unit (Probabilistic Maxout Unit) that improves each unit's subspace pooling operation while preserving its desired qualities. They first defined the probability for each of the $k$ linear units in the subspace as follows:

    $$p_i = \frac{e^{\lambda z_i}}{\sum_{j=1}^{k} e^{\lambda z_j}},$$

    where $\lambda$ is a chosen hyperparameter controlling the variance of the distribution. The activation $h_{\mathrm{probout}}(x)$ is then sampled as

    $$h_{\mathrm{probout}}(x) = z_i, \quad i \sim \mathrm{Multinomial}\{p_1, \ldots, p_k\}.$$

    As $\lambda \to \infty$, the above equation reduces to the maxout unit.

  • Softmax function: Also called the softargmax function Goodfellow et al. (2016), the folding activation function, or the normalized exponential function Bishop and Nasrabadi (2006), it is a generalization of the logistic function to multiple dimensions. It normalizes the output by dividing each exponentiated component by the sum of all of them, which forms a probability distribution. The standard softmax function $\sigma: \mathbb{R}^K \to (0, 1)^K$ is defined for $z = (z_1, \ldots, z_K) \in \mathbb{R}^K$ as

    $$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K.$$

    In other words, it applies the standard exponential function to each element $z_i$ of the input vector $z$ and normalizes these values by dividing them by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector is 1.
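The two non-standard activations above share the same exponential normalization, which the following sketch makes explicit. It assumes list-based inputs and a caller-supplied random generator; the function names and the `scale` argument are hypothetical conveniences introduced here.

```python
import math
import random


def softmax(z, scale=1.0):
    """Softmax of a list, shifted by the maximum for numerical stability."""
    m = max(z)
    exps = [math.exp(scale * (v - m)) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def linear_maps(x, weights, biases):
    """The k linear feature mappings z_i = w_i . x + b_i."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def maxout(x, weights, biases):
    """Maxout unit: maximum over the k linear feature mappings."""
    return max(linear_maps(x, weights, biases))


def probout_sample(z, lam, rng):
    """Probabilistic maxout: sample one z_i with probability softmax(lam * z)_i."""
    p = softmax(z, scale=lam)
    index = rng.choices(range(len(z)), weights=p, k=1)[0]
    return z[index]
```

For example, `maxout([-2.0], [[1.0], [-1.0]], [0.0, 0.0])` evaluates the absolute value of the input, and a large `lam` in `probout_sample` makes the sampled unit coincide with the maximum.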

7 Complex-valued activation functions

The complex-valued neural network (CVNN) is a very efficient and powerful modeling tool for domains involving data in the complex number form. Due to its suitability, the CVNN is an attractive model for researchers in various fields such as remote sensing, acoustics, opto-electronics, image processing, quantum neural devices, robotics, bioinformatics, etc.; see Hirose Hirose (2012) for more examples. Choosing a suitable complex activation function for CVNN is a difficult task, as a bounded complex activation function that is also complex-differentiable is not feasible. This is due to Liouville’s theorem, which states that the only complex-valued functions that are bounded and analytic everywhere are constant functions.

For CVNN, various complex activation functions have been proposed in the literature. In the earlier work of Aizenberg et al. Aizenberg et al. (1973), the authors proposed the concept of the multi-valued neuron that uses the activation function

$$P(z) = \exp\left(\frac{i 2\pi j}{k}\right), \quad \text{if } \frac{2\pi j}{k} \leq \arg z < \frac{2\pi (j + 1)}{k},$$

which divides the complex plane into $k$ equal sectors and thereby maps the entire complex plane onto the unit circle. Most approaches to designing CVNN preferred bounded but non-analytic functions, also called split activation functions Nitta (1997), where real-valued activation functions are applied separately to the real and imaginary parts. Leung and Haykin Leung and Haykin (1991) used the sigmoid function $f(z) = 1 / (1 + e^{-z})$, where $z$ is a complex number. Later, the following sigmoid function

$$f(z) = \frac{1}{1 + e^{-\mathrm{Re}(z)}} + i\,\frac{1}{1 + e^{-\mathrm{Im}(z)}}$$

was proposed by Birx and Pipenberg Birx and Pipenberg (1992), as well as Benvenuto and Piazza Benvenuto and Piazza (1992). The real and imaginary types of hyperbolic tangent activation function were proposed by Kechriotis and Manolakos Kechriotis and Manolakos (1994); see also Kinouchi and Hagiwara Kinouchi and Hagiwara (1995). Another class of split activation functions for CVNN is the phase-amplitude functions. Noest Noest (1988) proposed $f(z) = z / |z|$, and Georgiou and Koutsougeras Georgiou and Koutsougeras (1992) proposed the phase-amplitude function $f(z) = z / (c + |z| / r)$. These types of phase-amplitude split activation functions were termed phasor networks by Noest Noest (1988). Hirose Hirose (1994) proposed the activation function $f(z) = \tanh(|z|) \exp(i \arg z)$, see also Hirose (1992); Hirose and Yoshida (2012). An alternative approach, in which analytic and bounded (almost everywhere) fully-complex activation functions with a set of singular points are used, was proposed by Kim and Adali Kim and Adali (2002). With such activations, careful scaling of inputs and initial weights is needed to avoid singular points during network training. Apart from these complex versions of conventional activation functions, other complex activation functions have been proposed.

A different approach to choosing activation functions using conformal mappings was presented by Clarke Clarke (1990). Kuroe and Taniguchi Kuroe and Taniguchi (2005) proposed the following activation function

A Möbius transformation based activation was used in more general real-valued neural networks by Mandic Mandic (2000). A Möbius transformation based activation function was also proposed by Özdemir et al., Özdemir et al. (2011) as

$$f(z) = \frac{a z + b}{c z + d},$$

where $a, b, c, d$ are complex numbers, and $ad - bc \neq 0$. It is a conformal mapping of the complex plane, which is also known as a bilinear transformation. In recent years, the complex-valued ReLU activation was proposed by Guberman Guberman (2016)

$$\mathrm{zReLU}(z) = \begin{cases} z, & \arg z \in [0, \pi/2], \\ 0, & \text{otherwise}, \end{cases}$$

and a modified ReLU was proposed by Arjovsky et al., Arjovsky et al. (2016). In Virtue et al. (2017), Virtue et al. proposed the cardioid activation function as

$$f(z) = \frac{1}{2}\left(1 + \cos(\angle z)\right) z,$$

and it is used for magnetic resonance imaging (MRI) fingerprinting. The cardioid activation function is a complex extension of the ReLU that is phase-sensitive. The complex-valued kernel activation function was proposed by Scardapane et al., Scardapane et al. (2018).
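A few of the complex-valued activations discussed in this section can be sketched with Python's built-in complex type. The formulas used are assumptions stated in the comments: a split sigmoid applied to real and imaginary parts, the phasor form $z/|z|$, the amplitude-phase form $\tanh(|z|)e^{i \arg z}$, and the cardioid $\frac{1}{2}(1 + \cos\angle z)\,z$. The function names are hypothetical.

```python
import cmath
import math


def _sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def split_sigmoid(z):
    """Split activation: real sigmoid applied separately to Re(z) and Im(z)."""
    return complex(_sigmoid(z.real), _sigmoid(z.imag))


def phasor(z):
    """Phase-only activation z / |z| (undefined at z = 0)."""
    return z / abs(z)


def amplitude_phase_tanh(z):
    """Squash the amplitude with tanh while keeping the phase."""
    return math.tanh(abs(z)) * cmath.exp(1j * cmath.phase(z))


def cardioid(z):
    """Cardioid: scales z by (1 + cos(phase)) / 2, a phase-sensitive gate."""
    return 0.5 * (1.0 + math.cos(cmath.phase(z))) * z
```

On the positive real axis the cardioid passes its input unchanged, and on the negative real axis it returns zero, which is the sense in which it extends the ReLU to complex inputs.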

8 Quantized activation functions

The quantized neural network (QNN) has recently attracted researchers around the world, see Courbariaux et al. (2015); Rastegari et al. (2016); Zhou et al. (2017). The aim of quantization is to compact the neural network model in order to improve its efficiency without degrading the performance of the model. Both forward and back-propagation processes can be executed with quantized neural networks using bitwise operations rather than floating-point operations. Weights, activations, and gradients are the three components of a neural network that may be quantized. The purpose of quantizing these components, as well as the methodologies used to quantize them, vary slightly. The size of the model can be substantially reduced by using quantized weights and activations. Moreover, with quantized gradients, the communication costs can be greatly reduced in a distributed training environment. Quantized activations can be used to replace inner-products with binary operations, allowing for faster network training. By eliminating full-precision activations, a substantial amount of memory can be saved. The spiking neural network (SNN) is another type of network where activation levels are quantized into temporally sparse, one-bit values, also called 'spike events', which additionally converts the sum over weight-activity products into a simple addition of weights (one weight for each spike).

The activations were quantized to 8 bits by Vanhoucke et al., in Vanhoucke et al. (2011). In particular, they quantized the activations after training the network using a sigmoid function that confines the activations to the range [0, 1]. In Courbariaux et al. (2015); Rastegari et al. (2016); Zhou et al. (2016) the authors used the binary activation as

$$x^b = \mathrm{sign}(x) = \begin{cases} +1, & x \geq 0, \\ -1, & \text{otherwise}. \end{cases}$$
To approximate the ReLU unit, Cai et al., Cai et al. (2017) proposed a half-wave Gaussian quantizer. They employed a half-wave Gaussian quantization function in the forward approximation.

Figure 8: The expected (left) and real (right) ReLU function in low-precision networks.

In the recent work by Anitha et al., Anitha et al. (2021), a quantized complex-valued activation function was proposed. Quantized activations have also proved successful against adversarial examples, see Rakin et al., Rakin et al. (2018). Mishra et al., Mishra et al. (2017) presented wide reduced-precision networks (WRPN) to quantize activations and weights. They further found that activations take up more memory than weights. To compensate for the loss of precision due to quantization, they implemented an approach that increased the number of filters in each layer.

One well-known difficulty with quantized activations is that they cause a gradient mismatch problem, see Lin and Talathi Lin and Talathi (2016). As an example, Figure 8 shows an expected and real ReLU activation function in low-precision networks. In a fixed point network, the effective activation function is a non-differentiable function. The gradient mismatch problem is the result of this discrepancy between the assumed and real activation functions.
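The gradient mismatch can be illustrated with a uniformly quantized ReLU. The sketch below is a generic uniform quantizer chosen here for illustration (the bit width, clipping value, and helper names are assumptions); it is not the half-wave Gaussian quantizer of Cai et al. or any specific scheme from the cited works.

```python
import math


def binary_sign(x):
    """Binary activation: +1 for non-negative inputs, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0


def quantized_relu(x, bits=2, x_max=1.0):
    """ReLU clipped to [0, x_max] and rounded to 2**bits - 1 uniform steps."""
    levels = 2 ** bits - 1
    clipped = min(max(x, 0.0), x_max)
    step_index = math.floor(clipped / x_max * levels + 0.5)  # round half up
    return step_index / levels * x_max


def relu_grad(x):
    """Gradient of the full-precision ReLU that backpropagation assumes."""
    return 1.0 if x > 0 else 0.0


def finite_diff(f, x, h=1e-4):
    """Central finite-difference slope of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

At an input such as 0.4, the quantized function is locally flat, so its true slope is zero, while the assumed ReLU gradient is one. This discrepancy is the mismatch described above.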

9 A quest towards an optimal activation function

Which activation function should we use? This is one of the most basic and meaningful questions that could be posed. As discussed earlier, there is no rule of thumb for choosing the optimal activation function, which strongly depends on the problem under consideration. This motivates us to ask another meaningful question: do we need an activation function that adapts itself as per the requirements, thereby avoiding local minima by changing the loss landscape dynamically? In this way, the adaptive activation functions can beat any standard (fixed) activation function of the same type. Despite the availability of various activation functions, the quest for the best activation function has driven researchers to propose activation functions that are adaptive, also called adaptable, tunable, or trainable activations. In recent years, this has triggered a surge in papers on adaptive activation functions. The adaptive nature of the activation can be injected in many ways. For example, one can introduce parameters that adapt during training, or one can use ensemble activation functions built from a pre-defined set of functions, which perform better than a single activation. This section discusses various adaptive activation functions that perform better than their fixed counterparts.

9.1 Parametric Activation Functions

In the literature, various adaptive activation functions have been proposed. In their earlier work, Chen and Chang Chen and Chang (1996) proposed the generalized hyperbolic tangent function parameterized by two additional positive scalar values $a$ and $b$ as

$$f(x) = \frac{a\left(1 - e^{-bx}\right)}{1 + e^{-bx}},$$
where the parameters are initialized randomly and then adapted independently for every neuron. Vecci et al., Vecci et al. (1998) suggested a new architecture based on adaptive activation functions that use Catmull-Rom cubic splines. Trentin Trentin (2001) presented empirical evidence that learning the amplitude for each neuron is preferable to having unit amplitude for all activation functions (either in terms of generalization error or speed of convergence). Goh et al., Goh and Mandic (2003) proposed a trainable amplitude activation function. Chandra and Singh Chandra and Singh (2004) proposed the activation function adapting algorithm for sigmoidal feed-forward neural network training. Eisenach et al., Eisenach et al. (2016) proposed parametrically learning activation functions. Babu and Edla Naresh Babu and Edla (2017) proposed the algebraic activation functions. Ramachandran et al., Ramachandran et al. (2017) proposed the Swish activation function defined as:

$$\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(\beta x) = \frac{x}{1 + e^{-\beta x}},$$

where $\beta$ is the tuning parameter. Like ReLU, Swish is unbounded above but bounded below. The Swish is shown to give faster convergence and better generalization for many test problems. In Alcaide (2018), Alcaide proposed the E-swish activation function as

$$\text{E-swish}(x) = \beta x \cdot \mathrm{sigmoid}(x).$$
Mercioni et al., Mercioni and Holban (2020) proposed the trainable parameter based swish (p-swish) activation that can give more flexibility than the original swish activation. Ertuğrul Ertuğrul (2018) proposed the trained activation function. Choi et al. Choi et al. (2018) proposed PArameterized Clipping acTivation (PACT) that uses an activation clipping parameter that is optimized during training to find the right quantization scale. Apicella et al., Apicella et al. (2019) proposed an efficient architecture for trainable activation functions.
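The role of the tuning parameter in Swish can be checked numerically. The sketch below assumes the forms $x \cdot \mathrm{sigmoid}(\beta x)$ for Swish and $\beta x \cdot \mathrm{sigmoid}(x)$ for E-swish; the default value of `beta` in `e_swish` is an arbitrary illustrative choice.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Swish: the input gated by sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def e_swish(x, beta=1.5):
    """E-swish: beta * x * sigmoid(x)."""
    return beta * x * sigmoid(x)


def swish_grid_minimum(beta=1.0, lo=-5.0, hi=5.0, n=1001):
    """Smallest value of Swish on a uniform grid (shows it is bounded below)."""
    step = (hi - lo) / (n - 1)
    return min(swish(lo + i * step, beta) for i in range(n))
```

With `beta=0.0` Swish reduces to the linear function $x/2$, and for large `beta` it approaches the ReLU, so the single parameter interpolates between a linear and a rectified response.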

Figure 9: Adaptive activation functions for different values of adaptive parameter .

Jagtap et al. Jagtap et al. (2020b) proposed the globally adaptive activation function, where they introduced a single slope parameter in the activation function as

$$\sigma(n a x),$$

where $a$ is a trainable parameter and $n \geq 1$ is a pre-defined scaling factor. They initialized $n a = 1$. Figure 9 shows these globally adaptive activation functions for different values of $a$. Based on this idea, the layer-wise and neuron-wise locally adaptive activation functions were proposed Jagtap et al. (2020a) that can be trained faster. The main idea is to introduce a trainable parameter for every hidden layer (layer-wise adaptive activation functions) as well as for every neuron in each layer (neuron-wise adaptive activation functions). Along with these locally adaptive activations, an additional slope recovery term is added to the loss function, which is given by

$$S(a) = \frac{1}{\frac{1}{D - 1} \sum_{k=1}^{D-1} \exp\left(\frac{1}{N_k} \sum_{i=1}^{N_k} a_i^k\right)},$$

where $D$ is the depth of the network and $N_k$ is the number of neurons in the $k$-th hidden layer. The authors in Nader and Azar (2020) proposed self-adaptive evolutionary algorithms for searching new activation functions. The Soft-Root-Sign activation function Zhou et al. (2020) is defined as

$$\mathrm{SRS}(x) = \frac{x}{\frac{x}{\alpha} + e^{-x/\beta}},$$

which has the range $\left[\frac{\alpha\beta}{\beta - \alpha e}, \alpha\right)$. Here, $\alpha$ and $\beta$ are a pair of trainable non-negative parameters. Pratama and Kang Pratama and Kang (2021) proposed trainable neural networks. Universal activation functions are proposed by Yuen et al. in Yuen et al. (2021) as

where and controls the slope, horizontal shift and vertical shift, respectively. The parameter approximates the slope of leaky ReLU, and

introduces additional degrees of freedom.
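The slope-based adaptive activations of Jagtap et al. discussed earlier in this section can be sketched as follows. The sketch assumes a tanh base activation with slope $n a$ and a neuron-wise slope recovery term of the form $1 / \mathrm{mean}_k \exp(\mathrm{mean}_i\, a_i^k)$; the function names and the list-of-lists layout for the slopes are hypothetical choices made here.

```python
import math


def adaptive_tanh(x, a, n=1.0):
    """Adaptive activation tanh(n * a * x): a is trainable, n is a fixed scale."""
    return math.tanh(n * a * x)


def slope_recovery(slopes_per_layer):
    """Slope recovery term for neuron-wise slopes.

    slopes_per_layer[k] holds the slopes a_i^k of hidden layer k.
    The term is the reciprocal of the layer-averaged exponential of the
    mean slope, so larger slopes give a smaller penalty.
    """
    terms = [math.exp(sum(layer) / len(layer)) for layer in slopes_per_layer]
    return 1.0 / (sum(terms) / len(terms))
```

A call such as `adaptive_tanh(0.5, a=0.1, n=10.0)` reproduces the plain `tanh(0.5)`, matching an initialization with $n a = 1$. Since `slope_recovery` decreases as the slopes grow, adding it to a loss encourages the optimizer to increase the slopes.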

9.1.1 Adaptive family of rectifier and exponential units

  • Parametric ReLU He et al. (2015): The parametric ReLU (PReLU) is similar to the leaky ReLU, and it is defined as

    $$\mathrm{PReLU}(x) = \begin{cases} x, & x > 0, \\ a x, & x \leq 0. \end{cases}$$

    In parametric ReLU, $a$ is a learnable parameter, which is learned during the optimization process. Both leaky and parametric ReLU still face the problem of exploding gradients.

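  • S-Shaped ReLU: The S-shaped ReLU, abbreviated as SReLU Jin et al. (2016), is defined as a combination of three linear functions, which perform a mapping with the following formulation

    $$h(x_i) = \begin{cases} t_i^r + a_i^r (x_i - t_i^r), & x_i \geq t_i^r, \\ x_i, & t_i^r > x_i > t_i^l, \\ t_i^l + a_i^l (x_i - t_i^l), & x_i \leq t_i^l, \end{cases}$$

    where $\{t_i^r, a_i^r, t_i^l, a_i^l\}$ are four learnable parameters. The subscript $i$ indicates that we allow SReLU to vary in different channels.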

  • Parametric ELU : The parametric ELU Shah et al. (2016) was proposed in order to remove the need to specify the parameter $\alpha$, which can instead be learned during training to get the proper activation shape at every CNN layer. Klambauer et al. Klambauer et al. (2017) proposed the Scaled exponential linear unit (SELU).

  • Parametric Tanh Linear Unit (P-TELU) Duggal and Gupta (2017): It is defined as

    which has range from [), and both and are trainable parameters.

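  • Continuously Differentiable ELU : The Continuously Differentiable ELU Barron (2017) is simply the ELU activation where the negative values have been modified to ensure that the derivative exists (and is equal to 1) at $x = 0$ for all values of $\alpha$. The Continuously Differentiable ELU is defined as

    $$\mathrm{CELU}(x) = \begin{cases} x, & x \geq 0, \\ \alpha\left(\exp\left(\frac{x}{\alpha}\right) - 1\right), & \text{otherwise}, \end{cases}$$

    where $\alpha$ is a tunable parameter.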

  • Flexible ReLU Qiu et al. (2018): The Flexible ReLU, or FReLU, is defined as

    $$\mathrm{FReLU}(x) = \mathrm{ReLU}(x) + b,$$

    which has a range of $[b, \infty)$. The FReLU captures the negative values with a rectified point.

  • Paired ReLU Tang et al. (2018): The Paired ReLU is defined as

    where and represents scale parameters, which are initialized with the values of and -0.5, respectively . and are a pair of trainable thresholds.

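  • Multiple Parametric ELU : The multiple parametric ELU Li et al. (2018) is given as

    $$\mathrm{MPELU}(x) = \begin{cases} x, & x > 0, \\ \alpha\left(e^{\beta x} - 1\right), & x \leq 0. \end{cases}$$

    Here, $\beta$ is greater than zero.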

  • Parametric Rectified Exponential Unit : The parametric rectified EU Ying et al. (2019) is defined as

  • Fast ELU : The Fast ELU Qiumei et al. (2019) is given as

    The curve trend with fast ELU is consistent with the ELU due to the fast approximation achieved, ensuring that the fast ELU does not alter the original ELU’s accuracy advantage.

  • Multi-bin Trainable Linear Units (MTLU) Gu et al. (2019): The MTLU is given as

    with range (). The MTLU considers different bins with different ranges of hyperparameters. All the parameters are trainable.

  • Mexican ReLU: The Mexican ReLU Maguolo et al. (2021) uses Mexican hat type function and it is defined as:

    where are tunable parameter and are real numbers.

  • Lipschitz ReLU Basirat and ROTH (2020): The L* ReLU is defined as

    where and are nonlinear functions.

  • Parametric Deformable ELU : First defined in Cheng et al. (2020), the parametric deformable ELU is given as

    with range .

  • Elastic ELU : The Elastic ELU Kim et al. (2020) is designed to utilize the properties of both ELU and ReLU together, and it is defined as

    where both and are tunable parameters.
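To make the role of the trainable parameters in this family concrete, the sketch below implements the parametric ReLU together with its gradient with respect to the slope, and compares the one-sided slopes of ELU and the continuously differentiable ELU at the origin. The forms used are assumptions stated in the comments, and `left_slope` is a hypothetical helper based on a one-sided finite difference.

```python
import math


def prelu(x, a):
    """Parametric ReLU: slope a on the negative side is a trainable parameter."""
    return x if x > 0 else a * x


def prelu_grad_a(x):
    """Derivative of PReLU with respect to a: zero for x > 0, x otherwise."""
    return 0.0 if x > 0 else x


def elu(x, alpha=1.0):
    """ELU: alpha * (exp(x) - 1) on the negative side."""
    return x if x > 0 else alpha * math.expm1(x)


def celu(x, alpha=1.0):
    """Continuously differentiable ELU: alpha * (exp(x / alpha) - 1) for x < 0."""
    return x if x >= 0 else alpha * math.expm1(x / alpha)


def left_slope(f, h=1e-6):
    """One-sided finite-difference slope of f just to the left of zero."""
    return (f(0.0) - f(-h)) / h
```

For `alpha=2.0` the left slope of `elu` at zero is close to 2, which does not match the right slope of 1, whereas the left slope of `celu` stays close to 1. This is the property that motivates the continuously differentiable variant.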

9.2 Stochastic/probabilistic adaptive activation functions

The stochastic/probabilistic approach is another way to introduce an adaptive activation function. In Gulcehre et al. (2016), Gulcehre et al. proposed a noisy activation function where a structured and bounded noise-like effect is added to allow the optimizer to explore more and learn faster. This is particularly effective for activation functions that strongly saturate for large values of their inputs. In particular, they learn the level of injected noise in the saturated regime of the activation function. They consider noisy activation functions of the following form:

$$\phi(x, \xi) = h(x) + s,$$

where $s = \mu + \sigma \xi$. Here $\xi$ is an independent and identically distributed random variable drawn from a generating distribution, and the parameters $\mu$ and $\sigma$ are used to generate a location-scale family from $\xi$. When the unit saturates, they pin its output to the threshold value $t$, and the noise is added. The type of noise and the values of $\mu$ and $\sigma$, which can be chosen as functions of $x$ to allow some gradients to propagate even in the saturating regime, determine the exact behavior of the method.

Figure 10: ReLU (left) and Probabilistic ReLU (right) activation functions.

Urban et al., Urban et al. (2017) proposed Gaussian process-based stochastic activation functions. Based on a similar idea of Gulcehre et al. Gulcehre et al. (2016), Shridhar et al. Shridhar et al. (2019) proposed the probabilistic activation functions (ProbAct), which are not only trainable but also stochastic in nature, see Figure 10. The ProbAct is defined as:

$$f(x) = \mu(x) + z,$$

where $\mu(x)$ is a static or learnable mean, say, $\mu(x) = \max(0, x)$ for static ReLU, and the perturbation term is $z = \sigma \epsilon$, where the perturbation parameter $\sigma$ is a fixed or trainable value that specifies the range of the stochastic perturbation and $\epsilon$ is a random value sampled from a normal distribution $\mathcal{N}(0, 1)$. Because the ProbAct generalizes the ReLU activation, it has the same drawbacks as the ReLU activation.
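A minimal sketch of a ProbAct-style unit is given below, assuming a static ReLU mean and a fixed perturbation scale `sigma`. The function names and the explicit random generator argument are hypothetical choices made for reproducibility, and no training of `sigma` is shown.

```python
import random


def relu(x):
    return max(0.0, x)


def probact(x, sigma, rng, mean_fn=relu):
    """ProbAct-style activation: mean_fn(x) plus sigma times a standard normal draw."""
    return mean_fn(x) + sigma * rng.gauss(0.0, 1.0)


def sample_mean(x, sigma, n, seed=0):
    """Average of n stochastic evaluations at the same input."""
    rng = random.Random(seed)
    return sum(probact(x, sigma, rng) for _ in range(n)) / n
```

With `sigma=0.0` the unit reduces to the deterministic ReLU, and for `sigma > 0` repeated evaluations scatter around the ReLU output, which acts as the mean of the distribution.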

The Rand Softplus activation function Chen et al. (2019) models the stochasticity-adaptability as

where is a stochastic hyperparameter.

9.3 Fractional adaptive activation functions

The fractional activation function can encapsulate many existing and state-of-the-art activation functions. In the earlier study of Ivanov Ivanov (2018), the fractional activation functions are presented. This is motivated by the potential benefits of having more tunable hyperparameters in a neural network and achieving different behaviors, which can be more suitable for some kinds of problems. In particular, the author used the following Mittag-Leffler types of functions:

$$E_{\alpha, \beta}(z) = \sum_{k=0}^{\infty} \frac{z^k}{\Gamma(\alpha k + \beta)},$$

which is the class of parametric transcendental functions that generalize the exponential function. Setting $\alpha$ and $\beta$ to unity gives the Taylor series expansion of the exponential function. The author proposed several fractional activation functions by replacing the exponential function in several standard activation functions such as tanh, sigmoid, logistic, etc.
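The claim that the Mittag-Leffler function generalizes the exponential can be verified with a truncated series. The sketch below assumes the two-parameter form $E_{\alpha,\beta}(z) = \sum_k z^k / \Gamma(\alpha k + \beta)$; the truncation length is an illustrative choice that is adequate for moderate $|z|$ only, and `math.gamma` overflows for arguments above roughly 171, which limits `alpha * terms`.

```python
import math


def mittag_leffler(z, alpha, beta, terms=60):
    """Truncated series for the two-parameter Mittag-Leffler function."""
    return sum(z ** k / math.gamma(alpha * k + beta) for k in range(terms))
```

For instance, `mittag_leffler(1.0, 1.0, 1.0)` recovers $e$, and `mittag_leffler(-(1.2 ** 2), 2.0, 1.0)` recovers $\cos(1.2)$, since $E_{2,1}(-z^2) = \cos z$. Other parameter values interpolate between such familiar functions.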

Figure 11: Fractional ReLU (left), and fractional hyperbolic tangent (right) activation functions for different values of fractional derivative .

In Zamora Esquivel et al. (2019), the authors proposed activation functions leveraging the fractional calculus. In particular, they used the Gamma function (). The fractional ReLU is then defined as

where the fractional derivative is given by

Similarly, the fractional derivative of hyperbolic tangent is given by

The fractional derivative produces a family of functions for different values of . Figure  11 depicts the fractional ReLU (left) and fractional hyperbolic tangent (right) functions for different values. At , the ReLU becomes a step function, whereas the hyperbolic tangent function () becomes a quadratic hyperbolic secant function.

9.4 Ensemble adaptive activation functions

An alternative approach to adaptive activation functions is to create a combination of different activation functions. Chen Chen (2016) used multiple activation functions for each neuron for the problems related to stochastic control. Jin et al. Jin et al. (2016) proposed a combination of a set of linear functions with open parameters. Agostinelli et al., Agostinelli et al. (2014) constructed the activation functions during network training. They used the following general framework:

$$h_i(x) = \max(0, x) + \sum_{s=1}^{S} a_i^s \max(0, -x + b_i^s),$$

where $S$ is a predefined hyperparameter, $i$ is the neuron, and $a_i^s$ and $b_i^s$ are trained variables.
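The adaptive piecewise linear framework of Agostinelli et al. can be sketched for a single neuron as a ReLU plus a sum of learned hinge functions. The form $\max(0, x) + \sum_s a^s \max(0, -x + b^s)$ is assumed here, and the parameter values in any call are illustrative rather than trained.

```python
def apl(x, a, b):
    """Adaptive piecewise linear unit for one neuron.

    a and b are equal-length sequences of hinge slopes and hinge locations;
    their length plays the role of the predefined number of hinges.
    """
    hinges = sum(a_s * max(0.0, -x + b_s) for a_s, b_s in zip(a, b))
    return max(0.0, x) + hinges
```

With no hinges the unit is the plain ReLU, and a single hinge with `a=[-0.1]` and `b=[0.0]` reproduces a leaky ReLU with slope 0.1, which shows how the framework contains familiar rectifiers as special cases.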

Ramachandran et al. Ramachandran et al. (2017) used a reinforcement learning controller to combine preset unary and binary functions for learning new activation functions. In Qian et al. (2018), Qian et al. proposed adaptive activations in convolutional neural networks, where they focus on learning activation functions via combining basic activation functions in a data-driven way. In particular, they used ReLU and other variants of ReLU, such as Leaky ReLU and ELU combinations. In Klabjan and Harmon (2019), Klabjan and Harmon show that the ensemble activation function can be created by choosing suitable activation functions from a predefined set. Here, the activations are combined together to get the best performance out of the neural networks. A similar approach was used by Nandi et al. in Nandi et al. (2020), where an ensemble of activation functions is used to improve the performance of the network. The previously discussed Mexican ReLU Maguolo et al. (2021) also comes under the category of ensemble activation functions.

A polynomial activation function was proposed by Piazza et al. (1992), who built the activation function over powers of the activation. They used

$$f(x) = \sum_{j=0}^{n} a_j x^j,$$

where $n$ is a hyper-parameter. Because a polynomial of degree $n$ can pass through $n+1$ points exactly, this polynomial activation function can theoretically approximate any smooth function. The main disadvantage was the global influence of the coefficients $a_j$, which causes the function’s output to grow excessively large. Similarly, Scardapane et al. (2019b) proposed the kernel-based non-parametric activation function. In particular, they model each activation function in terms of a kernel expansion over a finite number of terms as

$$f(s) = \sum_{i=1}^{D} \alpha_i \, \kappa(s, d_i),$$

where $\alpha_i$ are the mixing coefficients, $d_i$ are called the dictionary elements, and $\kappa$ is a one-dimensional kernel function. Further, this idea was extended with a multi-kernel approach in Scardapane et al. (2019a). On similar grounds, the adaptive blending units were proposed by Sütfeld et al. (2020) to combine a set of functions in a linear way. Basirat and Roth (2018) presented a genetic-algorithm-based learning of activation functions, where the hybrid crossover of various operators results in a hybrid activation function.
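The polynomial and kernel-based activations described above can be sketched in a few lines. The Gaussian kernel, its width `gamma`, and the function names are assumptions made for illustration; in practice the coefficients and mixing weights would be trained with the rest of the network.

```python
import math

def polynomial_activation(x, coeffs):
    # Evaluates sum_j coeffs[j] * x**j with Horner's rule.
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result

def gaussian_kernel(s, d, gamma=1.0):
    # One-dimensional Gaussian kernel (an assumed choice of kernel).
    return math.exp(-gamma * (s - d) ** 2)

def kernel_activation(s, alphas, dictionary, kernel=gaussian_kernel):
    # Kernel expansion: mixing coefficients times kernel values
    # evaluated at fixed dictionary elements.
    return sum(a * kernel(s, d) for a, d in zip(alphas, dictionary))

# A cubic term alone already reaches 1000 at x = 10, which illustrates
# how polynomial outputs can grow excessively large.
cubic_at_ten = polynomial_activation(10.0, [0.0, 0.0, 0.0, 1.0])
```

Unlike the polynomial form, each Gaussian term decays away from its dictionary element, so a change in one mixing coefficient mainly affects the activation locally.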

References Parametric Ensemble Stochastic/ Fractional Complex-Valued Quantized
Chen and Chang (1996), Vecci et al. (1998), Trentin (2001)
Goh and Mandic (2003), Chandra and Singh (2004), Shah et al. (2016)
Jin et al. (2016)
He et al. (2015), Barron (2017), Duggal and Gupta (2017)
Li et al. (2018), Tang et al. (2018), Qiu et al. (2018)
Grelsson and Felsberg (2018), Ying et al. (2019), Qiumei et al. (2019)
Gu et al. (2019), Goyal et al. (2019), Jagtap et al. (2020b, a)
Basirat and Roth (2020), Cheng et al. (2020), Kim et al. (2020)
Piazza et al. (1992), Agostinelli et al. (2014), Jagtap et al. (2021)
Scardapane et al. (2019b)
Gulcehre et al. (2016), Shridhar et al. (2019), Urban et al. (2017)
Zamora Esquivel et al. (2019), Ivanov (2018)
Ramachandran et al. (2017), Klabjan and Harmon (2019)
Sütfeld et al. (2020), Basirat and Roth (2018)
Nanni et al. (2020)
Choi et al. (2018), Rakin et al. (2018)
Table 2: Adaptive activation functions under various settings.

In recent years, Nanni et al. (2020) proposed stochastic selection of activation layers for convolutional neural networks. Jagtap et al. (2021) proposed the Kronecker Neural Networks (KNN), a general framework for adaptive activation functions that generalizes a class of existing feed-forward neural networks that utilize adaptive activation functions. In particular, the output of KNN is given as

where the network parameters are: for and . Different classes of adaptive activation functions are produced by varying these parameters and the hyperparameter . For example,

  • If , for all , the Kronecker network becomes a standard feed-forward network.

  • If , the Kronecker network becomes a feed-forward neural network with layer-wise locally adaptive activation functions Jagtap et al. (2020b, a).

  • If , , for all , , and , the Kronecker network becomes a feed-forward network with Parametric ReLU activation He et al. (2015).

  • If , for all , , and , the Kronecker network becomes a feed-forward network with Exponential Linear Unit (ELU) activation Clevert et al. (2015) if for all , and becomes a feed-forward network with Scaled Exponential Linear Unit (SELU) activation Klambauer et al. (2017) if for all .

  • If for all and for all , the Kronecker network becomes a feed-forward neural network with self-learnable activation functions (SLAF) Goyal et al. (2019). Similarly, a FNN with smooth adaptive activation function Hou et al. (2017) can be represented by a Kronecker network.
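The special cases listed above can be illustrated with a short sketch. It assumes that the activation in each layer of a Kronecker network is a weighted sum of scaled activations, $\sum_k \alpha_k \phi_k(\omega_k x)$; this form and the names `alphas`, `omegas`, and `phis` are assumptions made for illustration, not a transcription of the formulation in Jagtap et al. (2021).

```python
import math

def kronecker_activation(x, alphas, omegas, phis):
    # Assumed form: sum_k alphas[k] * phis[k](omegas[k] * x).
    return sum(a * phi(w * x) for a, w, phi in zip(alphas, omegas, phis))

def relu(x):
    return max(0.0, x)

def neg_part(x):
    # Negative part of x, equal to min(0, x).
    return -max(0.0, -x)

def fixed_tanh(x):
    # One term with unit weight and unit scale: a standard fixed activation.
    return kronecker_activation(x, [1.0], [1.0], [math.tanh])

def slope_adaptive_tanh(x, slope):
    # One term with a trainable scale inside the activation.
    return kronecker_activation(x, [1.0], [slope], [math.tanh])

def prelu(x, a):
    # Two terms: the positive part plus `a` times the negative part.
    return kronecker_activation(x, [1.0, a], [1.0, 1.0], [relu, neg_part])
```

Under this reading, switching between activation families amounts to choosing the number of terms, the fixed functions, and which of the weights and scales are trainable.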

Every type of adaptive activation discussed in this section has its own pros and cons. Table 2 lists the references for adaptive activation functions under various settings.

10 Performance of some fixed and adaptive activation functions for classification tasks

This section discusses the performance of different activation functions for classification tasks. In particular, we used the MNIST LeCun et al. (1998), CIFAR-10, and CIFAR-100 Krizhevsky et al. (2009) data sets. The MNIST database of handwritten digits contains a training set of 60k samples and a test set of 10k samples. The CIFAR-10 data set contains 50k training and 10k testing images from 10 object categories, whereas the CIFAR-100 data set contains 50k training and 10k testing images from 100 object categories. For comparison, we chose some fixed as well as some adaptive activation functions that have been discussed previously.

Activation | MNIST | CIFAR-10 | CIFAR-100
Sigmoid | 97.9 Pedamonti (2018) | 89.43 ± 0.51 (M), 85.42 ± 0.47 (V) | 61.64 ± 0.56 (M), 59.25 ± 0.45 (V)
Tanh | 98.21 Eisenach et al. (2016) | 88.19 ± 1.21 (M), 87.53 ± 0.67 (V) | 57.06 ± 2.03 (M), 62.32 ± 0.82 (V)
Swish | 99.75 ± 0.11 (M), 99.45 ± 0.26 (V) | 95.5 Ramachandran et al. (2017) | 83.9 Ramachandran et al. (2017)
ReLU | 99.1 Apicella et al. (2019), 99.15 Scardapane et al. (2019b), 99.53 Jin et al. (2016) | 95.3 Ramachandran et al. (2017), 94.59 Trottier et al. (2017), 92.27 Jin et al. (2016) | 83.7 Ramachandran et al. (2017), 75.45 Trottier et al. (2017), 67.25 Jin et al. (2016)
Leaky ReLU | 98.2 Pedamonti (2018), 99.58 Jin et al. (2016) | 95.6 Ramachandran et al. (2017), 92.32 Qian et al. (2018) | 83.3 Ramachandran et al. (2017), 67.3 Jin et al. (2016)
ELU | 98.3 Pedamonti (2018) | 94.4 Ramachandran et al. (2017), 94.01 Trottier et al. (2017) | 80.6 Ramachandran et al. (2017), 74.92 Trottier et al. (2017)
PReLU | 99.64 ± 0.24 (M), 99.18 ± 0.17 (V) | 92.46 ± 0.44 (M), 91.63 ± 0.31 (V) | 69.46 ± 0.74 (M), 66.53 ± 0.69 (V)
SELU | 98.42 ± 0.53 (M), 99.02 ± 0.37 (V) | 93.52 ± 0.63 (M), 90.53 ± 0.36 (V) | 70.42 ± 0.75 (M), 68.02 ± 1.29 (V)
RReLU | 99.23 ± 0.53 (M), 99.63 ± 0.6 (V) | 90.52 ± 2.14 (M), 90.18 ± 0.91 (V) | 68.62 ± 0.42 (M), 65.32 ± 1.74 (V)
GELU | 99.72 ± 0.35 (M), 99.26 ± 0.42 (V) | 94.76 ± 0.55 (M), 92.67 ± 0.89 (V) | 71.73 ± 1.09 (M), 69.61 ± 1.53 (V)
CELU | 99.36 ± 0.68 (M), 99.37 ± 0.38 (V) | 90.26 ± 0.12 (M), 90.37 ± 0.23 (V) | 70.26 ± 1.53 (M), 68.35 ± 0.87 (V)
Softplus | 98.69 ± 0.5 (M), 97.36 ± 0.77 (V) | 94.9 Ramachandran et al. (2017) | 83.7 Ramachandran et al. (2017)
Mish | 97.9 Pedamonti (2018) | 90.26 ± 0.52 (M), 86.05 ± 0.76 (V) | 68.53 ± 0.86 (M), 67.03 ± 1.39 (V)
Maxout | 99.55 Goodfellow et al. (2013) | 90.62 Goodfellow et al. (2013) | 61.43 Goodfellow et al. (2013)
SRS | 98.04 ± 0.97 (M), 98.06 ± 0.84 (V) | 89.35 ± 0.85 (M), 87.26 ± 1.38 (V) | 65.20 ± 1.53 (M), 63.65 ± 2.63 (V)
Ensemble (Klabjan and Harmon (2019)) | 99.40 Klabjan and Harmon (2019) | 85.05 ± 0.28 (M), 84.96 ± 0.87 (V) | 74.20 Klabjan and Harmon (2019)
LiSHT | 98.74 ± 0.17 (M), 98.32 ± 0.19 (V) | 90.78 ± 0.43 (M), 87.74 ± 0.36 (V) | 56.35 ± 0.58 (M), 58.74 ± 0.99 (V)
Sine | 99.10 ± 0.82 (M), 98.63 ± 0.27 (V) | 91.64 ± 0.36 (M), 90.59 ± 0.88 (V) | 80.98 ± 0.81 (M), 78.42 ± 0.67 (V)
GCU (Noel et al. (2021)) | 97.80 ± 1.02 (M), 90.73 ± 0.87 (V) | 90.24 ± 1.96 (M), 89.56 ± 0.18 (V) | 77.46 ± 2.91 (M), 76.46 ± 1.61 (V)
Gaussian | 98.60 ± 2.22 (M), 96.43 ± 2.57 (V) | 90.64 ± 2.76 (M), 89.59 ± 0.28 (V) | 76.96 ± 0.41 (M), 74.62 ± 2.69 (V)
Adaptive sine | 99.56 ± 0.56 (M), 98.76 ± 0.12 (V) | 92.64 ± 0.36 (M), 92.72 ± 0.37 (V) | 81.15 ± 0.81 (M), 78.73 ± 0.64 (V)
Adaptive tanh | 99.52 ± 0.36 (M), 99.57 ± 0.20 (V) | 91.14 ± 0.45 (M), 90.52 ± 0.82 (V) | 83.82 ± 0.83 (M), 76.21 ± 1.95 (V)
Adaptive ReLU | 99.75 ± 0.25 (M), 98.43 ± 0.32 (V) | 96.89 ± 0.93 (M), 94.77 ± 1.16 (V) | [missing] (M), 80.30 ± 0.60 (V)
Rowdy Sine | 99.74 ± 0.29 (M), 99.24 ± 0.43 (V) | 93.87 ± 0.78 (M), 92.82 ± 0.82 (V) | 83.73 ± 0.83 (M), 82.14 ± 0.13 (V)
Rowdy tanh | 99.45 ± 0.93 (M), 98.65 ± 0.57 (V) | 93.78 ± 0.13 (M), 93.26 ± 0.42 (V) | 82.34 ± 1.01 (M), 80.97 ± 0.32 (V)
Rowdy ReLU | 99.43 ± 0.68 (M), 98.23 ± 0.25 (V) | [missing] (M), 94.47 ± 0.92 (V) | [missing] (M), 79.41 ± 0.75 (V)
Rowdy ELU | [missing] (M), 99.06 ± 1.07 (V) | 95.45 ± 0.92 (M), 93.27 ± 0.13 (V) | 84.29 ± 0.13 (M), 82.40 ± 0.81 (V)
Table 3: Accuracy comparison of different activation functions for MNIST, CIFAR-10, and CIFAR-100 data sets. Here, we report the best accuracy from the literature without considering the architecture used to perform the experiments. Also, we used MobileNet Howard et al. (2017) (denoted by M) and VGG16 Simonyan and Zisserman (2014) (denoted by V) architectures with different activation functions, where we report the mean and standard deviation of 10 different realizations. The Adaptive and Rowdy (with ) activations refer to Jagtap et al. (2020b) and Jagtap et al. (2022a), respectively. The best results are highlighted in bold.

In this work, we report the best accuracy from the literature without considering the architecture used to perform the experiments. We also conducted a performance comparison of two models, namely, MobileNet Howard et al. (2017) and VGG16 Simonyan and Zisserman (2014). Both MobileNet and VGG16 are convolutional neural network (CNN) architectures. MobileNet is a lightweight deep neural network designed for good classification accuracy with fewer parameters, whereas VGG16 has 16 deep layers. In both cases, the learning rate is 1e-4, and the batch sizes for MNIST, CIFAR-10, and CIFAR-100 are 64, 128, and 64, respectively. Data normalization is performed during training and testing. The Adam Kingma and Ba (2014) optimizer is employed with a cross-entropy loss function. All experiments were run on a desktop computer with 16 GB of RAM and an Nvidia GPU card, and performed in the PyTorch Paszke et al. (2019) environment, which is a popular machine learning library.

Table 3 gives the accuracy comparison of different activation functions for all three data sets, where we report the mean and standard deviation of the accuracy over 10 different realizations. For MNIST, almost all activation functions perform well. It can be observed that the Swish, ReLU, Leaky ReLU, ELU, and Sine activation functions work well for the CIFAR-10 and CIFAR-100 data sets. Furthermore, the adaptive and Rowdy activation functions perform best on all three data sets.
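As a small illustration of how a "mean ± standard deviation" entry can be produced from repeated runs, consider the sketch below. The accuracy values are made up for the example, and the use of the sample standard deviation is an assumption, since the estimator used for Table 3 is not stated.

```python
import statistics

def summarize_runs(accuracies):
    # Mean and sample standard deviation over independent realizations.
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)
    return mean, std

def format_entry(accuracies):
    mean, std = summarize_runs(accuracies)
    return f"{mean:.2f} ± {std:.2f}"

# Hypothetical test accuracies (in percent) from five random seeds.
hypothetical_runs = [99.1, 99.3, 99.2, 99.4, 99.0]
entry = format_entry(hypothetical_runs)
```

Reporting the spread alongside the mean makes it easier to judge whether the gap between two activation functions exceeds the run-to-run variation.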

11 Activation functions for physics-informed machine learning

In recent years, physics-informed machine learning (PIML) Karniadakis et al. (2021) algorithms for scientific computing have advanced rapidly owing to the success of automatic differentiation (AD) Baydin et al. (2018). These methods can solve not only forward problems but also the notoriously difficult inverse problems governed by almost any type of differential equation. Various physics-informed machine learning algorithms have been proposed that can seamlessly incorporate governing physical laws into a machine learning framework. The SINDy framework Brunton et al. (2016) for analyzing dynamical systems, physics-informed machine learning for modeling turbulence Wang et al. (2017), the physics-guided machine learning framework Karpatne et al. (2017), the Neural ODE Chen et al. (2018), and most recently physics-informed neural networks (PINNs) Raissi et al. (2019a) are just a few examples of the many physics-informed machine learning algorithms that have been proposed and successfully used for solving problems in computational science and engineering.

Most of our previous discussion about activation functions was centered around various classification problems in the computer science community. In this section, we focus on activation functions for the PIML framework. Apart from an accurate and efficient AD algorithm, PIML methods also need tailored activation functions, because the output of the network must satisfy not only the given data but also the governing physical laws in the form of differential equations involving various spatio-temporal derivative terms. In this work, we use PINNs for our investigation. PINN is a very efficient method for solving both forward and ill-posed inverse problems for differential equations involving sparse and multi-fidelity data. The main feature of the PINN is that it can seamlessly incorporate all available information, such as the governing equation, experimental data, and initial/boundary conditions, into the loss function, thereby recasting the original PDE problem as an optimization problem. PINNs have achieved remarkable success in many applications in science and engineering; see Raissi et al. (2019b); Jagtap et al. (2022b); Raissi et al. (2020); Shukla et al. (2021a); Jagtap and Karniadakis (2020); Jagtap et al. (2020c); Shukla et al. (2021b); Jagtap et al. (2022a); Hu et al., and the references therein for more details. The first comprehensive theoretical analysis of PINNs for a prototypical nonlinear PDE, the Navier-Stokes equations, has been presented in De Ryck et al.

11.1 Differentiability

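Differentiability is an important property of an activation function. In the PIML setting, the activation function must be differentiable everywhere, because derivatives of the network output with respect to its input enter the loss function. Consider the example of the one-dimensional Laplace equation with forcing as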

Here we use a one-hidden-layer neural network with a single neuron, see figure 12.

Figure 12: Left part: One-hidden layer, single neuron, feed-forward neural network, and right part: physics-informed part where neural network’s output is forced to satisfy the governing physical laws. In particular, the automatic differentiation (AD) is employed to find the derivatives present in the governing differential equation.
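To make the role of differentiability concrete, the following sketch evaluates the physics part of the loss for a single-neuron network of the kind shown in Figure 12. It assumes the network $u(x) = w_2\,\sigma(w_1 x + b_1) + b_2$ and a residual of the form $u''(x) - f(x)$; the residual form, the constant forcing, and all function and parameter names are illustrative assumptions. The second derivative is written out analytically here rather than obtained by automatic differentiation.

```python
import math

def tanh_dd(z):
    # Second derivative of tanh: -2 * tanh(z) * (1 - tanh(z)**2).
    t = math.tanh(z)
    return -2.0 * t * (1.0 - t * t)

def relu_dd(z):
    # Second derivative of ReLU: zero away from the kink at z = 0.
    return 0.0

def u_xx(x, w1, b1, w2, act_dd):
    # u(x) = w2 * act(w1 * x + b1) + b2
    # => u''(x) = w2 * w1**2 * act''(w1 * x + b1)
    return w2 * w1 ** 2 * act_dd(w1 * x + b1)

def physics_loss(xs, forcing, w1, b1, w2, act_dd):
    # Mean squared residual of the assumed equation u''(x) = f(x).
    residuals = [u_xx(x, w1, b1, w2, act_dd) - forcing(x) for x in xs]
    return sum(r * r for r in residuals) / len(residuals)

# Hypothetical collocation points and constant forcing.
points = [0.1 * i for i in range(1, 10)]
constant_forcing = lambda x: 1.0

loss_relu = physics_loss(points, constant_forcing, 2.0, -0.5, 3.0, relu_dd)
loss_tanh = physics_loss(points, constant_forcing, 2.0, -0.5, 3.0, tanh_dd)
```

With ReLU, the second derivative vanishes wherever it is defined, so the residual reduces to the forcing term for any choice of parameters and the physics loss cannot be decreased by training. With a smooth activation such as tanh, the residual depends on the parameters and can be driven down.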

The input to the network is the independent variable and the network output is . The network output satisfies both data as well as physics, as shown in the figure. The loss function is given as