1 Introduction
1.1 Aim
In recent years, artificial neural networks (ANNs) have achieved remarkable success in both academia and industry. ANNs have found success in a variety of industries, such as cybersecurity Shaukat et al. (2020); Sarker (2021), manufacturing technologies Taha and Rostam (2011); Casalino et al. (2016), healthcare Wiens and Shenoy (2018), financial services Coakley and Brown (2000), the food industry Guiné (2019), and energy Ahmad et al. (2014), to name a few, where they perform a range of tasks, including logistics, inventory management, etc. With various scientific applications, the ANN is now the most sophisticated algorithm for understanding complicated sensory input, such as images Turkmen (2016), video Jiang et al. (2008), audio Heckmann et al. (2002), etc. The first computational model for neural networks was created by McCulloch and Pitts McCulloch and Pitts (1943), whereas the first ANN, also called the Perceptron, was invented by Rosenblatt Rosenblatt (1958). Ivakhnenko and Lapa Ivakhnenko and Lapa (1967) released the Group Method of Data Handling, which was the first functional neural network with multiple layers. However, a seminal book by Minsky and Papert Minsky and Papert (1969) showed that these early neural networks were incapable of handling even simple tasks, and that the computers of the time lacked the capability to run usable neural networks. The early neural networks developed by Rosenblatt comprised only one or two trainable layers. Such simplistic networks cannot mathematically represent complicated real-world phenomena. Deep neural networks (DNNs) are ANNs that have a large number of trainable hidden layers, which makes them theoretically far more versatile. However, researchers spent several years trying, without success, to figure out how to train DNNs, which could not be scaled with the straightforward hill-climbing algorithm used for the first neural networks. In a groundbreaking study, Rumelhart et al. Rumelhart et al. (1985) created the backpropagation training algorithm to train DNNs. The backpropagation algorithm effectively trains any ANN using gradient descent by exploiting the chain rule. Since then, ANN research has opened up exciting and transformative developments in computer vision Sebe et al. (2005), speech recognition Deng and Li (2013); Chowdhary (2020), and even in scientific computation, in the form of Physics-Informed Machine Learning Karniadakis et al. (2021).
Artificial neurons are a set of connected units or nodes in an ANN that loosely replicate the neurons in a biological brain. The Threshold Logic Unit, also known as the Linear Threshold Unit, was the first artificial neuron and was put forth by McCulloch and Pitts McCulloch and Pitts (1943) in 1943. The model was intended to serve as a computational representation of the brain's 'nerve net'. Each artificial neuron receives inputs and generates a single output that can be delivered to a number of other neurons. Artificial neurons do not simply output the raw data they receive. Instead, there is one more step, analogous to the rate of action potential firing in the brain, called an activation function. The introduction of the activation function in ANNs was inspired by biological neural networks, where its purpose is to decide whether a particular neuron fires or not. The simple addition of such a nonlinear function can tremendously improve the network's expressiveness, thereby helping it learn faster. Various activation functions have been proposed in the literature, and it is difficult to find an optimal activation function that can tackle any problem. In this survey, our aim is to discuss the various advantages and disadvantages or limitations of classical (fixed) as well as modern (adaptive) activation functions.
1.2 Contributions of this work
To the best of our knowledge, this is the first comprehensive survey of activation functions for both classification and regression problems. Apart from that, we also present several original contributions, summarized below.

First, we have done a comprehensive survey of the classical (fixed) activation functions, such as the real-valued activation functions, including the rectifier units. This also includes the oscillatory as well as the non-standard activation functions. We further discuss various properties of the classical activation functions that make them best suited to particular tasks.

We present a taxonomy based on applications in addition to discussing a taxonomy based on the characterization of activation functions as fixed, adaptive, and multifold activations. In particular, we discuss various complex-valued activation functions, which have many applications in remote sensing, acoustics, optoelectronics, image processing, quantum neural devices, robotics, bioinformatics, etc. Furthermore, the quantized activations, which are a special type of activation function, are also discussed in detail. The quantized activations are typically used to improve the efficiency of the network without degrading the performance of the model.

The state-of-the-art adaptive activation functions, which outperform their classical counterparts, are thoroughly discussed. Such adaptive activations not only accelerate the training of the network but also increase the prediction accuracy. Various adaptive activation functions, such as stochastic or probabilistic, ensemble, fractional, etc., are discussed in detail. To this end, we also systematically compare various fixed and adaptive activation functions on classification data sets, including MNIST, CIFAR-10, and CIFAR-100.

Physics-informed machine learning is an intriguing approach that seamlessly incorporates the governing physical laws into the machine learning framework. Such incorporation of physical laws places additional requirements on the activation function. To this end, we discuss these requirements specifically for solving scientific problems using a physics-informed machine learning framework. We use various fixed and adaptive activation functions to solve different PDEs. Furthermore, we compare the predictive accuracy of the solution for different activation functions using various machine learning libraries, such as TensorFlow, PyTorch, and JAX, on clean and noisy data sets. A runtime comparison is also made among the TensorFlow, PyTorch, and JAX machine learning libraries.
1.3 Organization
This paper is organized as follows. In Section 2, we discuss the historical perspective of activation functions. In Section 3, we compare biological and artificial neurons in detail, followed by Section 4, where some of the desired features of neurons are discussed in depth. Section 5 gives a detailed discussion of the taxonomy of activation functions. Section 6 covers several classical activation functions, along with their improved versions, in detail. Section 7 is devoted to the motivation and historical development of complex-valued activation functions. Similarly, Section 8 discusses efficient quantized activations for quantized neural networks. In Section 9, we discuss various types of adaptive or trainable activation functions, ranging from stochastic or probabilistic to ensemble activations. Various fixed and adaptive activation functions are compared in terms of accuracy for the MNIST, CIFAR-10, and CIFAR-100 data sets in Section 10, using two convolutional neural network-based models, namely MobileNet and VGG16. Section 11 presents the discussion on the activation functions required for solving regression problems using a physics-informed machine learning framework. Finally, we summarize our findings in Section 12.
2 Activation functions: A historical perspective
The activation function is a function that acts on the output of the hidden layers and, optionally, the output layer, and it helps the neural network learn complex features of the data. The motivation to employ the activation function came from biological neural networks, where the activation decides whether a particular neuron fires or not. The important feature of the activation function is its ability to introduce nonlinearity into the network in order to capture nonlinear features; without it, the neural network acts as a linear regression model. Cybenko Cybenko (1989) and Hornik Hornik et al. (1989) argue for the activation function's nonlinearity, demonstrating that the activation function must be bounded, nonconstant, monotonically increasing, and continuous to ensure the neural network's universal approximation property.
Figure 1 shows some of the notable historical developments related to activation functions that we shall discuss here. In the literature, the activation function has been referred to by different names, such as squashing function Haykin and Lippmann (1994), and output function or transfer function Duch and Jankowski (1999). In DasGupta and Schnitger (1993), Dasgupta and Schnitger defined the activation as a real-valued function defined on a subset of ℝ. In Goodfellow et al. (2016), Goodfellow et al. described the activation function as a fixed nonlinear function. The activation function was first used in the work of Fukushima Fukushima (1969)
for visual feature extraction in hierarchical neural networks. Later, Fukushima and Miyake
Fukushima and Miyake (1982) proposed it again for visual pattern recognition. Hahnloser et al. Hahnloser et al. (2000, 2003) argued that the activation function is strongly related to biological neurons. In the seminal work of Glorot et al. Glorot et al. (2011), it was found that the presence of the activation function enables fast learning in neural networks. During the early 1990s, the Sigmoid activation function Han and Moraga (1995) was one of the most popular activation functions. Due to its vanishing gradient problem, a notable improvement of the Sigmoid, the improved logistic Sigmoid function Qin et al. (2018), has recently been proposed. In the late 1990s, researchers widely used the hyperbolic tangent function LeCun et al. (2012) as an activation function. Both the Sigmoid and the hyperbolic tangent face vanishing gradient problems and have difficulties with large inputs, as they saturate for large input values. Later, researchers proposed some alternative functions, and one of the most popular gradient-vanishing-proof activation functions was the rectified linear unit (ReLU). As of 2017, the ReLU was the most popular activation function Ramachandran et al. (2017) compared to widely used activation functions such as the sigmoid and hyperbolic tangent functions. Although the ReLU was very successful in many applications, such as speech recognition Maas et al. (2013) and computer vision Glorot et al. (2011), it suffered from issues such as non-differentiability at zero, unboundedness, and, most famously, the dying ReLU problem Maas et al. (2013). Several other linear as well as nonlinear variants of the rectifier unit were proposed. The notable linear variants are the leaky ReLU Maas et al. (2013) and the parametric ReLU He et al. (2015). Some of the nonlinear variants are softplus Dugas et al. (2000), Exponential Linear Units (ELU) Clevert et al. (2015), Gaussian Error Linear Units (GELU) Hendrycks and Gimpel (2016), Swish Ramachandran et al. (2017), and Mish Misra (2019).
Recent years have seen an increase in studies on adaptive activation functions. In the early studies, the generalized hyperbolic tangent function, parameterized by two additional positive scalar values, was suggested by Chen and Chang Chen and Chang (1996). Vecci et al. Vecci et al. (1998) proposed an adaptive spline activation function. Trentin Trentin (2001) gives empirical evidence that learning the amplitude for each neuron is superior to having a unit amplitude for all activation functions (either in terms of generalization error or speed of convergence). Goh et al. Goh and Mandic (2003)
proposed a trainable amplitude activation function. An activation function adaptation algorithm for sigmoidal feedforward neural network training is suggested by Chandra and Singh Chandra and Singh (2004). Recently, Agostinelli et al. Agostinelli et al. (2014) proposed learning activation functions. Eisenach et al. Eisenach et al. (2016) proposed a nonparametric method for learning activation functions. The algebraic activation functions were proposed by Babu and Edla Naresh Babu and Edla (2017). Urban et al. proposed stochastic activation functions based on Gaussian processes in Urban et al. (2017). Alcaide Alcaide (2018) proposed the E-swish activation function. The trained activation function was proposed by Ertuğrul Ertuğrul (2018). Convolutional neural networks with adaptive activations were proposed by Qian et al. Qian et al. (2018). In order to discover the ideal quantization scale, Choi et al. Choi et al. (2018) introduced PArameterized Clipping acTivation (PACT), which makes use of an activation clipping parameter that is tuned during training. Apicella et al. Apicella et al. (2019) proposed an effective design for trainable activation functions. Several variants of the adaptive ReLU activation function have been proposed recently, such as Parametric ReLU He et al. (2015), S-shaped ReLU Jin et al. (2016), Flexible ReLU Qiu et al. (2018), Paired ReLU Tang et al. (2018), etc. Similarly, adaptive ELUs such as Parametric ELU Shah et al. (2016), Continuously Differentiable ELU Barron (2017), Shifted ELU Grelsson and Felsberg (2018), Fast ELU Qiumei et al. (2019), Elastic ELU Kim et al. (2020), etc. have also been proposed in the literature. Recently, Jagtap et al. proposed a series of papers on global Jagtap et al. (2020b) and local Jagtap et al. (2020a) adaptive activation functions. Another way to introduce adaptivity is through the use of stochastic activation functions. Gulcehre et al. Gulcehre et al. (2016) proposed a noisy activation function, where a structured, bounded noise-like effect is added to allow the optimizer to exploit more and learn faster. Building on a similar idea, Shridhar et al. Shridhar et al. (2019) proposed probabilistic activation functions, which are not only trainable but also stochastic in nature. In this work, the authors replicated uncertain behavior in the information flow to the neurons by injecting stochastic sampling from a normal distribution into the activations. Fractional activation functions with a trainable fractional parameter were proposed by Ivanov Ivanov (2018). Later, Esquivel et al. Zamora Esquivel et al. (2019) provided adaptable activation functions based on fractional calculus. The ensemble technique is another way to adapt activation functions. Earlier work of Chen Chen (2016) used multiple activation functions for each neuron for problems related to stochastic control. Agostinelli et al. Agostinelli et al. (2014) constructed the activation functions during network training. Jin et al. Jin et al. (2016) proposed the combination of a set of linear functions with open parameters. In recent years, a generalized framework for arbitrary adaptive activation functions, namely Kronecker Neural Networks, has been proposed by Jagtap et al. Jagtap et al. (2021). Good reviews of the classical activation functions Nwankpa et al. (2018); Szandała (2021) and the modern trainable activations Apicella et al. (2021) are available for various classification problems.
3 Biological vs artificial neurons
A neuron in the human brain is a biological cell that processes information. A biological neuron is depicted in the top panel of Figure 2. There are three basic elements of a biological neuron: the dendrites, which receive signals as input from other neurons; a cell body (with its nucleus), which controls the activity of the neuron; and an axon, which transmits signals to the neighboring neurons. The axon's length may be several times, or even tens of thousands of times, longer than the cell body. Near its extremity, the axon divides into various branches, which connect to the dendrites of other neurons. The brain contains massively connected neurons (around 10^11), approximately equal to the number of stars in the Milky Way Brunak and Lautrup (1990). Every neuron is connected to thousands of neighboring neurons, and these neurons are organized into successive layers in the cerebral cortex.
Such a massively parallel neural network communicates through very short trains of pulses, on the order of milliseconds, and has abilities that include parallel processing, fast learning, adaptivity, generalization, very low energy consumption, and fault tolerance. The artificial neuron in the bottom panel of Figure 2 attempts to resemble the biological neuron. Like a biological neuron, the artificial neuron takes inputs and is made up of three basic parts: the weights and bias, which act as the dendrites; the activation function, which acts as the cell body and nucleus; and the output. The artificial neuron has been generalized in many ways; among all the components, the most obvious is the activation function. The activation function plays an important role in replicating the firing behavior of a biological neuron above some threshold value. Unlike biological neurons, which send binary values, artificial neurons send continuous values, and depending on the activation function, the firing behavior of artificial neurons changes significantly.
4 Desired characteristics of the activation functions
There is no universal rule for determining the best activation function; it varies depending on the problem under consideration. Nonetheless, some of the desirable qualities of activation functions are well known in the literature. The following are the essential characteristics of any activation function.

Nonlinearity: One of the most essential characteristics of an activation function is nonlinearity. In comparison to linear activation functions, the nonlinearity of the activation function significantly improves the learning capability of neural networks. Cybenko Cybenko (1989) and Hornik Hornik et al. (1989) advocate for the nonlinear property of the activation function, demonstrating that the activation function must be bounded, nonconstant, monotonically increasing, and continuous in order to ensure the neural network's universal approximation property. In Morita (1993, 1996), Morita later discovered that neural networks with nonmonotonic activation functions perform better in terms of memory capacity and retrieval ability.

Computationally cheap: The activation function must be easy to evaluate in terms of computation. This has the potential to greatly improve network efficiency.

The vanishing and exploding gradient problems: The vanishing and exploding gradient problems are important issues tied to the choice of activation function. Some activation functions, such as the logistic function (Sigmoid), map large variations of the input into a small output range. To put it another way, they squash a large input space into a small output space between 0 and 1. As a result, a large change in the input causes only a small change in the output, and the derivative becomes very small. Consequently, the backpropagation algorithm has almost no gradient to propagate backward through the network, and whatever residual gradient exists continues to dilute as the algorithm moves down from the top layers, leaving the initial layers with almost nothing to learn from. For the hyperbolic tangent and sigmoid activation functions, it has been observed that the saturation region for large inputs (both positive and negative) is a major reason behind the vanishing of the gradient. One important remedy for this problem is the use of nonsaturating activation functions, such as ReLU, leaky ReLU, and other ReLU variants.
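The shrinking-gradient effect can be made concrete with a short sketch. Assuming an idealized chain of sigmoid layers with unit weights (an illustrative construction, not a model taken from any cited work), the backpropagated gradient decays geometrically with depth:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Backpropagation multiplies one local derivative per layer. Even in the
# best case (every pre-activation at 0, where the derivative is largest),
# the gradient through 20 sigmoid layers shrinks as 0.25**20.
grad = 1.0
for _ in range(20):
    grad *= sigmoid_prime(0.0)

print(grad)  # about 9.1e-13: essentially no signal reaches the first layer
```

With realistic weights and saturated inputs the local derivatives are far smaller than 0.25, so the decay is even faster than this best case suggests.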

Finite range/boundedness: Gradientbased training approaches are more stable when the range of the activation function is finite, because pattern presentations significantly affect only limited weights.

Differentiability: The most desirable quality for using gradientbased optimization approaches is continuously differentiable activation functions. This ensures that the backpropagation algorithm works properly.
5 Taxonomy of activation functions
5.1 Characterization based taxonomy
The activation functions can be broadly divided into linear and nonlinear functions. For all realistic problems, nonlinear activation functions are often employed.
Figure 3 shows the taxonomy based on the characterization of activation functions, which divides the activation functions into three major categories. The first is the fixed activation functions, which contain all the classical fixed activations, including the rectifier units. The second category is the adaptive or modern activation functions, which can adapt themselves and are further divided into parametric and ensemble activations. The third category is non-standard activations, such as multifold activations from previous layers, which may be adaptive or fixed.
5.2 Application based taxonomy
Although the taxonomy of activation functions based on their characterization encompasses all activation functions, it is important to also define a taxonomy based on applications, which can further broaden our understanding of activation functions. The activation functions can be divided into real- and complex-valued activation functions; see Fig. 4. The real-valued activations are well known in the literature, and we shall discuss them in detail later. The complex-valued activation functions are another set of activation functions whose output is a complex number. These activation functions are often required when the output is complex-valued, which has many applications in science and engineering, such as bioinformatics, acoustics, robotics, optoelectronics, quantum neural devices, image processing, etc.
The process of mapping continuous values from an infinite set to discrete finite values is known as quantization. Both real- and complex-valued activation functions can be quantized in order to reduce memory requirements. Quantized activations are a class of activation functions that is efficient in terms of both memory requirements and compute. The output of a quantized activation is an integer rather than a floating-point value. Note that, just like real-valued activation functions, both complex-valued and quantized activations can be made adaptive by introducing tunable variables. The upcoming sections provide a clear discussion of the various complex-valued and quantized activation functions.
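As a minimal sketch of the idea, the snippet below uniformly quantizes a clipped (bounded) ReLU to 4-bit integer codes. The bit width, clipping value, and rounding scheme are illustrative assumptions; practical quantized networks typically learn the clipping scale during training (e.g., PACT).

```python
def quantized_relu(x, bits=4, max_val=6.0):
    """Quantize a clipped ReLU output to 2**bits - 1 uniform integer levels.

    Returns (integer code, de-quantized real value). Illustrative only:
    real quantized networks usually learn max_val rather than fixing it.
    """
    levels = 2 ** bits - 1
    clipped = min(max(x, 0.0), max_val)       # bounded ReLU, as in ReLU6
    code = round(clipped / max_val * levels)  # integer code in [0, levels]
    return code, code * max_val / levels

print(quantized_relu(2.0))    # a 4-bit code and its reconstructed value
print(quantized_relu(100.0))  # inputs above max_val saturate at the top code
```

Storing the integer codes instead of 32-bit floats is what yields the memory and compute savings discussed above.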
6 Classical activation functions in artificial neural networks
This section gives details about the various fixed activation functions that have been proposed in the literature. These activation functions do not contain any tunable parameters. The following are the most common fixed activation functions used.
6.1 Linear and piecewise linear functions
The linear function is the simplest form of activation function, f(x) = x. It has a constant gradient, and the descent is based on this constant gradient. The range of the linear function is (−∞, ∞), and it has C^∞ order of continuity.
The piecewise linear activation can be defined as
f(x) = 0 for x ≤ −c, f(x) = (x + c)/(2c) for −c < x < c, and f(x) = 1 for x ≥ c,
where c is a constant. The derivative of the piecewise linear activation is not defined at x = ±c, and it is zero for x < −c and x > c. The range of the piecewise linear function is [0, 1], and it has C^0 order of continuity.
6.2 Step function
Also called the Heaviside or unit step function, it is defined as
f(x) = 0 for x < 0, and f(x) = 1 for x ≥ 0.
The step function is one of the most basic forms of activation function. The derivative of the step function is zero for x ≠ 0, and at x = 0 the derivative is not defined. The step function has a range of {0, 1} and has C^{−1} order of continuity.
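The two hard-threshold activations above can be sketched in a few lines; the interpolation interval (−c, c) for the piecewise linear form is an assumed convention, since the exact parameterization is not fixed here.

```python
def step(x):
    """Heaviside step activation: 0 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else 0.0

def piecewise_linear(x, c=1.0):
    """Saturating piecewise-linear activation (assumed hard-limit form):
    0 below -c, 1 above c, and linear in between."""
    if x <= -c:
        return 0.0
    if x >= c:
        return 1.0
    return (x + c) / (2.0 * c)

print(step(-0.5), step(2.0))  # 0.0 1.0
print(piecewise_linear(0.0))  # 0.5
```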
6.3 Sigmoid function
The sigmoid function Han and Moraga (1995), also called the logistic function, was a very popular choice of activation function through the early 1990s. It is defined as
σ(x) = 1/(1 + e^{−x}).
In Hinton et al. (2012), Hinton et al. used a sigmoid activation function for automatic speech recognition. The major advantage of the sigmoid activation is its boundedness. The disadvantages are the vanishing gradient problem, the output not being zero-centered, and saturation for large input values. In Nair and Hinton (2010), Nair and Hinton showed that as networks become deeper, training with sigmoid activations proves less effective. The range of the sigmoid function is (0, 1), and it has C^∞ order of continuity. Some improvements to the sigmoid function proposed in the literature are as follows.

An improved logistic Sigmoid function is proposed by Qin et al. (2018) to overcome the vanishing gradient problem of the original Sigmoid.
6.4 Hyperbolic tangent (tanh) function
The tanh activation function is defined as
tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}).
From the late 1990s until the early 2000s, tanh was extensively used to train neural networks and was preferred over the classical sigmoid activation function. The tanh activation has a range of (−1, 1) and, in general, is mostly used for regression problems. It has an advantage due to its zero-centered structure. The main problem with tanh activations is the saturation region: once saturated, it is really challenging for the learning algorithm to adapt the parameters and learn faster. This is the vanishing gradient problem.
Some improved tanh activation functions have been proposed to eliminate the problems related to the original tanh activation function.

A scaled hyperbolic tangent function is defined by LeCun et al. (1998) as
f(x) = A tanh(Sx).
Here A is the amplitude of the function, and S determines its slope at the origin. The output of the activation function is in the range (−A, A).

A rectified hyperbolic secant activation function Samatin Njikam and Zhao (2016) is proposed as f(x) = x·sech(x).

The Hexpo function Kong and Takatsuka (2017) has a structure similar to the hyperbolic tangent function and is defined as
f(x) = −a(e^{−x/b} − 1) for x ≥ 0, and f(x) = c(e^{x/d} − 1) for x < 0,
and has a range of (−c, a).

The penalized hyperbolic tangent function Eger et al. (2019) is defined as
f(x) = tanh(x) for x > 0, and f(x) = a·tanh(x) for x ≤ 0, with 0 < a < 1,
which gives output in the range (−a, 1).

The linearly scaled tanh (LiSHT) activation function Roy et al. (2019) is defined as f(x) = x·tanh(x). The LiSHT scales the hyperbolic tangent function linearly to tackle its gradient-diminishing problem.
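A few of the tanh variants above are easy to state in code. The constants used here (A = 1.7159 and S = 2/3 for the scaled tanh, a = 0.25 for the penalized tanh) follow common choices in the literature and are assumptions, not values quoted from the cited papers.

```python
import math

def scaled_tanh(x, A=1.7159, S=2.0 / 3.0):
    """Scaled tanh: A is the amplitude, S sets the slope at the origin."""
    return A * math.tanh(S * x)

def penalized_tanh(x, a=0.25):
    """Penalized tanh: the negative half is damped by a factor a in (0, 1)."""
    t = math.tanh(x)
    return t if x > 0 else a * t

def lisht(x):
    """LiSHT: linearly scaled hyperbolic tangent, x * tanh(x)."""
    return x * math.tanh(x)
```

Note that LiSHT is an even, non-negative function, which is what removes the saturating negative tail of plain tanh.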
6.5 Rectified Linear Unit (ReLU)
ReLU was primarily introduced to overcome the vanishing gradient problem, and it is the most common activation function used for classification problems. It is defined as
f(x) = max(0, x).
The derivative of ReLU is zero when x < 0, unity when x > 0, and at x = 0 the derivative is not defined. The ReLU function has a range of [0, ∞) and has C^0 order of continuity. Apart from mitigating the vanishing gradient problem, ReLU is very easy and therefore cheap to implement, unlike tanh and sigmoid, where an exponential function is needed. Despite having these advantages over classical activations, ReLU still has a saturation region (the negative half-line), which can prevent the network from learning. In particular, ReLU always discards negative values, which makes the corresponding neurons stop responding to the gradient-based optimizer. This is known as the dead or dying ReLU problem Maas et al. (2013); Lu et al. (2019), meaning the neurons output nothing other than zero. This is a serious problem for ReLU, where many neurons can become dead, especially when using a high learning rate. To overcome these problems, various variants of ReLU have been proposed.

Leaky ReLU Maas et al. (2013): The leaky ReLU is defined as
f(x) = x for x ≥ 0, and f(x) = αx for x < 0.
The hyperparameter α defines the leakage in the function (the slope for negative inputs). By adding this small slope in the region x < 0, the leaky ReLU overcomes the dying ReLU problem. Moreover, it retains all the advantages of the ReLU activation. One disadvantage of the leaky ReLU is the hyperparameter α, which needs to be defined appropriately; in most cases, α = 0.01 is used. In Xu et al. (2015), Xu et al. compared the ReLU and leaky ReLU activations and concluded that the latter consistently outperforms the former.
Randomized leaky ReLU Xu et al. (2015): In this case, the leakage parameter α is picked randomly in a given range during training and is fixed to its average value during testing.

Concatenated ReLU Shang et al. (2016): The concatenated ReLU (CReLU) is given as
CReLU(x) = (max(0, x), max(0, −x)),
which has a range of [0, ∞). The disadvantage of CReLU compared to ReLU is increased model complexity.

Bounded ReLU Liew et al. (2016): The ReLU function has unbounded outputs for nonnegative inputs. The bounded version of the ReLU function is defined as
f(x) = min(max(0, x), A),
where A is the maximum output value the function can produce.

V-shaped ReLU Hu (2018): It is defined as f(x) = x for x ≥ 0 and f(x) = −x for x < 0, i.e., f(x) = |x|.

Dual ReLU Godin et al. (2018): The Dual ReLU has two dimensions and is defined as
f(a, b) = max(0, a) − max(0, b),
where a and b are the inputs in two different dimensions. The range of the Dual ReLU is (−∞, ∞).

Randomly translated ReLU Cao et al. (2018): It is given by
f(x) = max(0, x + ρ),
which has a range of [0, ∞), where the offset ρ is randomly sampled from a Gaussian distribution.

Displaced ReLU (DReLU) Macêdo et al. (2019): DReLU is a diagonally displaced ReLU function that generalizes both ReLU and SReLU Clevert et al. (2015) by allowing its inflection to move diagonally from the origin to any point of the form (−δ, −δ). It is defined as
f(x) = max(x, −δ),
with range [−δ, ∞). If δ = 0, DReLU becomes ReLU, and if δ = 1, DReLU becomes SReLU.

Natural-Logarithm ReLU Liu et al. (2019): The Natural-Logarithm ReLU uses a logarithmic function to modify the ReLU output for positive input values, which increases the degree of nonlinearity. It is defined as
f(x) = ln(a·x + 1) for x ≥ 0, and f(x) = 0 for x < 0.
This activation has an output range from 0 to ∞, and a is a constant.

Average Biased ReLU Dubey and Chakraborty (2021): The Average Biased ReLU is given by
f(x) = max(0, x − β),
which has a range of [0, ∞), where β is the average of the input activation map to the activation function.
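Several of the rectifier variants above differ only by a clip, a slope, or a shift, which the following sketch makes explicit; the default values of α, A, and δ are illustrative assumptions.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small slope alpha for negative inputs avoids dead neurons."""
    return x if x >= 0 else alpha * x

def bounded_relu(x, A=6.0):
    """Bounded ReLU: clips the output at the maximum value A."""
    return min(max(0.0, x), A)

def displaced_relu(x, delta=1.0):
    """Displaced ReLU, written as max(x, -delta): delta = 0 recovers ReLU,
    delta = 1 recovers the shifted ReLU (SReLU)."""
    return max(x, -delta)

print(relu(-3.0), leaky_relu(-3.0), bounded_relu(9.0), displaced_relu(-3.0))
```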
6.6 Gaussian Error Linear Units (GELUs)
In the GELU activation Hendrycks and Gimpel (2016), the neuron input x is multiplied by m ∼ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) for X ∼ N(0, 1) is the standard Gaussian cumulative distribution function. The Bernoulli distribution with this parameter is chosen because the neuron's input tends to follow a normal distribution, especially after Batch Normalization. Since the output of an activation function should be deterministic rather than stochastic, the expected value of the transformation is used:
E[m·x] = Φ(x)·x.
Since Φ(x) is the cumulative distribution function of the Gaussian distribution, it is frequently computed with the error function. Thus, the GELU activation is defined as
GELU(x) = x·Φ(x) = (x/2)[1 + erf(x/√2)],
which can be approximated as 0.5x[1 + tanh(√(2/π)(x + 0.044715x^3))] or as x·σ(1.702x).
To improve its capacity for bidirectional convergence, the GELU was upgraded to the symmetrical GELU (SGELU) by Yu and Su Yu and Su (2019).
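The exact GELU and its common tanh approximation can be compared directly; the sketch below uses the error-function form from the definition above together with the tanh approximation and its constant 0.044715.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# The two forms agree to roughly three decimal places over typical inputs.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, gelu(x), gelu_tanh(x))
```

The approximation exists because erf was historically slower to evaluate than tanh on some accelerators; modern frameworks expose both forms.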
6.7 Softplus function
The softplus function Dugas et al. (2000) approximates the ReLU activation function in a smooth way, and it is defined as
f(x) = ln(1 + e^x).
The softplus function is infinitely differentiable and has a range of (0, ∞). In Liu and Furber (2016), Liu and Furber proposed the noisy softplus activation. Noisy Softplus is well matched to the response function of LIF (Leaky Integrate-and-Fire) neurons and can improve the performance of spiking neural networks; see Liu et al. (2017). The authors of Liu and Furber (2016) proposed the following formula:
f(x, σ) = kσ·ln(1 + e^{x/(kσ)}),
where x is the mean input, σ defines the noise level, and k controls the curve scaling, which can be determined by the neuron parameters.
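Softplus illustrates why it is called a smooth ReLU: its derivative is exactly the sigmoid. The sketch below uses a numerically stable evaluation (an implementation detail assumed here, not taken from the cited papers) and checks the derivative identity with a finite difference.

```python
import math

def softplus(x):
    """Softplus ln(1 + e^x), evaluated in a numerically stable way:
    ln(1 + e^x) = max(x, 0) + log1p(e^(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The derivative of softplus is the sigmoid; a central finite difference
# at x = 1 recovers sigmoid(1.0) to high accuracy.
h = 1e-6
numeric_derivative = (softplus(1.0 + h) - softplus(1.0 - h)) / (2.0 * h)
print(numeric_derivative, sigmoid(1.0))
```

The stable form matters in practice: evaluating ln(1 + e^x) directly overflows for large positive x, while the rewritten form never exponentiates a positive argument.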
6.8 Exponential linear unit
The Exponential Linear Unit (ELU) was first proposed in Clevert et al. (2015) by Clevert et al., who show that the ELU outperforms all variants of ReLU, with reduced training time as well as better test accuracy. The ELU is defined as
f(x) = x for x > 0, and f(x) = α(e^x − 1) for x ≤ 0.
When x < 0, it takes on negative values, allowing the unit's average output to be closer to zero and alleviating the vanishing gradient problem. Also, due to the nonzero gradient for x < 0, the ELU does not suffer from the problem of dead neurons. Unlike ReLU, the ELU bends smoothly at the origin, which can be beneficial in the optimization process. Similar to the leaky ReLU and parametric ReLU, the ELU gives negative outputs, which pushes the mean activation towards zero. Again, α is a parameter that needs to be specified. For α = 1, the function is smooth everywhere, which in turn helps speed up gradient descent.
In Klambauer et al. (2017), Klambauer et al. proposed the Scaled ELU (SELU) activation function, where the authors show that, for a neural network consisting of a stack of dense layers, the network will self-normalize if all the hidden layers use the SELU activation function. However, there are some conditions for self-normalization; see Klambauer et al. (2017) for more details.
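SELU differs from ELU only by the fixed constants λ and α derived in Klambauer et al. (2017); a minimal sketch:

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, smooth exponential decay towards -alpha for x <= 0."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# SELU constants from Klambauer et al. (2017); with these values, a stack of
# dense layers pushes activations towards zero mean and unit variance.
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)
```

The self-normalizing property additionally requires a matching weight initialization (and a compatible dropout variant), as the paper details.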
| Activation function $f(x)$ | Derivative $f'(x)$ | Range | Continuity |
|---|---|---|---|
| Linear: $x$ | $1$ | $(-\infty, \infty)$ | $C^{\infty}$ |
| Step: $0$ for $x < 0$, $1$ for $x \ge 0$ | $0$ for $x \ne 0$ | $\{0, 1\}$ | $C^{-1}$ |
| Sigmoid or logistic: $\frac{1}{1+e^{-x}}$ | $f(x)\left(1 - f(x)\right)$ | $(0, 1)$ | $C^{\infty}$ |
| Rectifier unit (ReLU): $\max(0, x)$ | $1$ for $x > 0$, $0$ for $x < 0$ | $[0, \infty)$ | $C^{0}$ |
| Hyperbolic tangent: $\tanh(x)$ | $1 - f(x)^{2}$ | $(-1, 1)$ | $C^{\infty}$ |
| Softplus: $\ln(1 + e^{x})$ | $\frac{1}{1+e^{-x}}$ | $(0, \infty)$ | $C^{\infty}$ |
| Leaky rectifier unit (leaky ReLU): $\max(\alpha x, x)$, $0 < \alpha < 1$ | $1$ for $x > 0$, $\alpha$ for $x < 0$ | $(-\infty, \infty)$ | $C^{0}$ |
| Exponential linear unit (ELU): $x$ for $x > 0$, $\alpha(e^{x} - 1)$ for $x \le 0$ | $1$ for $x > 0$, $\alpha e^{x}$ for $x \le 0$ | $(-\alpha, \infty)$ | $C^{0}$ ($C^{1}$ for $\alpha = 1$) |
| Gaussian: $e^{-x^{2}}$ | $-2x e^{-x^{2}}$ | $(0, 1]$ | $C^{\infty}$ |
| Swish ($\beta = 1$): $\frac{x}{1+e^{-x}}$ | $f(x) + \frac{1 - f(x)}{1+e^{-x}}$ | $[\approx -0.278, \infty)$ | $C^{\infty}$ |
| Oscillatory (e.g., sine): $\sin(x)$ | $\cos(x)$ | $[-1, 1]$ | $C^{\infty}$ |
6.9 Mish function
The Mish Misra (2019) is a self-regularized, non-monotonic activation function defined as
$$\mathrm{Mish}(x) = x \tanh\left(\mathrm{softplus}(x)\right) = x \tanh\left(\ln\left(1 + e^{x}\right)\right).$$
While not evident at first sight, Mish is closely related to Swish Ramachandran et al. (2017), as its first derivative can be written in terms of the Swish function. Similar to Swish, Mish is unbounded above and bounded below. It is non-monotonic, has $C^{\infty}$ continuity, and is also self-gated.
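A minimal sketch of Mish via a numerically stable softplus:

```python
import math

def _softplus(x):
    """Overflow-safe softplus: log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish (Misra, 2019): x * tanh(softplus(x)); smooth, non-monotonic, self-gated."""
    return x * math.tanh(_softplus(x))
```

The self-gating is visible in the code: the input $x$ multiplies a squashed function of itself, so the output approaches $x$ for large positive inputs and a small bounded negative value (minimum around $-0.31$) for negative inputs.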
6.10 Radial activation functions
Traditional radial basis function (RBF) neural networks use the Gaussian function, which is given as
$$\phi(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{c}\|^{2}}{2\sigma^{2}}\right),$$
where $\mathbf{c}$ is the center and $\sigma$ the width of the basis function. RBF neural networks with Gaussian activation functions have previously been successfully applied to a variety of difficult problems such as function approximation Hartman et al. (1990); Leonard et al. (1992), classification Er et al. (2002); Savitha et al. (2012), etc. Other radial activation functions include multiquadratics Lanouette et al. (1999) and polyharmonic splines. The polyharmonic spline is a linear combination of polyharmonic radial basis functions and can be used as an activation function Hryniowski and Wong (2018).
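The Gaussian and multiquadric radial functions can be sketched as follows; the center and width arguments in the tests are illustrative:

```python
import math

def gaussian_rbf(x, center, sigma=1.0):
    """Gaussian radial basis function: exp(-||x - c||^2 / (2*sigma^2))."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-r2 / (2.0 * sigma ** 2))

def multiquadric(x, center, c=1.0):
    """Multiquadric radial function: sqrt(||x - c||^2 + c^2)."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.sqrt(r2 + c ** 2)
```

Both depend on the input only through the distance to the center, which is the defining property of a radial activation.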
Table 1 summarizes the comparison of different activation functions, their derivatives, ranges, and the order of continuity. The existence of the derivative is an important feature for the backpropagation algorithm. Figures 5 and 6 show some of the classical activation functions and their derivatives, respectively.
6.11 Oscillatory activation functions
In Gidon et al. (2020), Gidon et al. found a new form of neuron in the human cortex that can learn the XOR function individually (a task that is difficult for single neurons employing other standard activation functions such as sigmoidal, ReLU, leaky ReLU, softplus, ELU, Swish, Mish, etc.). This is due to the fact that the set of inputs for which the activation vanishes constitutes the decision boundary for a neuron that emits an activation $\sigma(\mathbf{w}^{T}\mathbf{x} + b)$. If the activation function has only one zero, then the decision boundary is the single hyperplane $\mathbf{w}^{T}\mathbf{x} + b = 0$. Two hyperplanes are required to distinguish the classes in the XOR data set Minsky and Papert (1969); hence, activation functions with multiple zeros, as described in Lotfi and Akbarzadeh-T (2014), must be considered. Oscillatory activation functions fit this criterion perfectly. It is interesting to note that oscillatory activation functions were already present in the literature before the paper by Gidon et al. Gidon et al. (2020). In Nakagawa (1999), Nakagawa proposed a chaos neural network model applied to chaotic auto-association memory using a sinusoidal activation function. Mingo et al. Mingo et al. (2004) proposed the Fourier neural network with sinusoidal activation function; see also Gashler and Ashmore Gashler and Ashmore (2014). In Parascandolo et al. (2016), Parascandolo et al. used the sine activation for deep neural networks. For learning the XOR function with a single neuron, Noel et al. Noel et al. (2021) proposed the Growing Cosine Unit (GCU), defined as $\mathrm{GCU}(x) = x\cos(x)$. Figure 7 shows various oscillatory activation functions along with their derivatives. Recently, Jagtap et al. Jagtap et al. (2021) proposed the Rowdy activation functions, where oscillatory noise is injected over a monotonic base activation function with adaptable parameters, thereby making it oscillatory as the optimization process starts. This adaptively creates multiple hypersurfaces to better learn the data set.
6.12 Nonstandard activation functions
This section will cover the maxout unit and the softmax function, which are two non-standard activation functions.

Maxout Goodfellow et al. (2013): The maxout unit is a piecewise linear function that outputs the maximum of its inputs, and it is designed to be used in conjunction with dropout Srivastava et al. (2014). It generalizes the ReLU and leaky ReLU activation functions. Given the unit's input $\mathbf{x}$, the activation of a maxout unit is computed by first computing $k$ linear feature mappings
$$z_i = \mathbf{w}_i^{T}\mathbf{x} + b_i, \quad i = 1, \dots, k,$$
where $\mathbf{w}_i$ and $b_i$ are weights and biases, and $k$ is the number of linear sub-units combined by one maxout unit. The output of the maxout hidden unit is then given as the maximum over the $k$ feature mappings:
$$h(\mathbf{x}) = \max_{i \in \{1, \dots, k\}} z_i.$$
Maxout's success can be partly attributed to the fact that it supports the optimization process by preventing units from remaining idle, a consequence of the rectified linear unit's thresholding. The activation function of the maxout unit, on the other hand, can be viewed as executing a pooling operation across a subspace of $k$ linear feature mappings (referred to as subspace pooling in the following). Each maxout unit is somewhat invariant to changes in its input as a result of this subspace pooling procedure. In Springenberg and Riedmiller (2013), Springenberg and Riedmiller proposed a stochastic generalization of the maxout unit (the probabilistic maxout unit) that improves each unit's subspace pooling operation while preserving its desired qualities. They first defined the probability for each of the $k$ linear units in the subspace as
$$p_i = \frac{e^{\lambda z_i}}{\sum_{j=1}^{k} e^{\lambda z_j}},$$
where $\lambda$ is a chosen hyperparameter controlling the variance of the distribution. The activation is then sampled as $h(\mathbf{x}) = z_i$ with $i \sim \mathrm{Multinomial}(p_1, \dots, p_k)$. As $\lambda \to \infty$, the above equation reduces to the maxout unit.

Softmax function: Also called the softargmax function Goodfellow et al. (2016), the folding activation function, or the normalized exponential function Bishop and Nasrabadi (2006), it is a generalization of the logistic function to high dimensions. It exponentiates each output and divides by the sum, which forms a probability distribution. The standard softmax function $\sigma : \mathbb{R}^{K} \to (0, 1)^{K}$ is defined for $\mathbf{z} = (z_1, \dots, z_K)$ as
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K.$$
In other words, it applies the standard exponential function to each element $z_i$ of the input vector $\mathbf{z}$ and normalizes these values by dividing them by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector is 1.
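Both non-standard units above can be sketched in a few lines; the weights in the tests are illustrative, and the softmax subtracts the maximum before exponentiating, a standard trick for numerical stability that leaves the result unchanged:

```python
import math

def maxout(x, weights, biases):
    """Maxout unit (Goodfellow et al., 2013): max over k linear feature
    mappings z_i = w_i . x + b_i."""
    z = [sum(w * xi for w, xi in zip(ws, x)) + b
         for ws, b in zip(weights, biases)]
    return max(z)

def softmax(z):
    """Numerically stable softmax: shifting by max(z) cancels in the ratio."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

With $k = 2$, weights $(1)$ and $(0)$, and zero biases, the maxout unit reduces to $\max(x, 0)$, i.e., the ReLU, illustrating how maxout generalizes it.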
7 Complexvalued activation functions
The complex-valued neural network (CVNN) is a very efficient and powerful modeling tool for domains involving data in complex-number form. Due to this suitability, the CVNN is an attractive model for researchers in various fields such as remote sensing, acoustics, optoelectronics, image processing, quantum neural devices, robotics, bioinformatics, etc.; see Hirose Hirose (2012) for more examples. Choosing a suitable complex activation function for a CVNN is a difficult task, as a bounded complex activation function that is also complex-differentiable is not feasible. This is due to Liouville's theorem, which states that the only complex-valued functions that are bounded and analytic everywhere are constant functions.
For CVNNs, various complex activation functions have been proposed in the literature. In the earlier work of Aizenberg et al. Aizenberg et al. (1973), the authors proposed the concept of the multi-valued neuron, which uses the activation function
$$f(z) = e^{i 2\pi j/m} \quad \text{if } \frac{2\pi j}{m} \le \arg z < \frac{2\pi (j+1)}{m},$$
which divides the complex plane into $m$ equal sectors and thereby maps the entire complex plane onto the unit circle. Most approaches to designing CVNNs prefer bounded but non-analytic functions, also called split activation functions Nitta (1997), where real-valued activation functions are applied separately to the real and imaginary parts. Leung and Haykin Leung and Haykin (1991) used the sigmoid function $f(z) = 1/(1 + e^{-z})$, where $z$ is a complex number. Later, the split sigmoid function
$$f(z) = \frac{1}{1 + e^{-\Re(z)}} + \frac{i}{1 + e^{-\Im(z)}}$$
was proposed by Birx and Pipenberg Birx and Pipenberg (1992), as well as Benvenuto and Piazza Benvenuto and Piazza (1992). Real and imaginary types of hyperbolic tangent activation functions were proposed by Kechriotis and Manolakos Kechriotis and Manolakos (1994); see also Kinouchi and Hagiwara Kinouchi and Hagiwara (1995). Another class of split activation functions for CVNNs is the phase-amplitude functions. Noest Noest (1988) proposed $f(z) = z/|z|$, and Georgiou and Koutsougeras Georgiou and Koutsougeras (1992) proposed the phase-amplitude function $f(z) = z/(c + |z|/r)$. These types of phase-amplitude split activation functions were termed phasor networks by Noest Noest (1988). Hirose Hirose (1994) proposed the activation function $f(z) = \tanh(|z|)\,e^{i \arg z}$; see also Hirose (1992); Hirose and Yoshida (2012). An alternative approach, using analytic and bounded (almost everywhere) fully-complex activation functions with a set of singular points, was proposed by Kim and Adali Kim and Adali (2002). With such activations, there is a need for careful scaling of inputs and initial weights to avoid singular points during network training. Apart from these complex versions of conventional activation functions, other complex activation functions have been proposed.
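A minimal sketch of a split activation and a phase-amplitude activation on Python complex numbers; the defaults $c = r = 1$ are illustrative:

```python
import cmath
import math

def split_sigmoid(z):
    """Split activation: the real sigmoid applied separately to the real and
    imaginary parts of z (Birx-Pipenberg / Benvenuto-Piazza style)."""
    sig = lambda t: 1.0 / (1.0 + math.exp(-t))
    return complex(sig(z.real), sig(z.imag))

def phase_amplitude(z, c=1.0, r=1.0):
    """Georgiou-Koutsougeras phase-amplitude function z / (c + |z|/r):
    squashes the magnitude below r while leaving the phase untouched."""
    return z / (c + abs(z) / r)
```

Dividing by a positive real number changes only the magnitude of $z$, which is exactly why the phase-amplitude family is bounded yet phase-preserving.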
A different approach to choosing activation functions, using conformal mappings, was presented by Clarke Clarke (1990), and Kuroe and Taniguchi Kuroe and Taniguchi (2005) proposed a further complex-valued activation function in this direction.
A Möbius-transformation-based activation was used in more general real-valued neural networks by Mandic Mandic (2000). A Möbius-transformation-based activation function was also proposed by Özdemir et al. Özdemir et al. (2011) as
$$f(z) = \frac{az + b}{cz + d},$$
where $a, b, c, d$ are complex numbers and $ad - bc \neq 0$. It is a conformal mapping of the complex plane, which is also known as the bilinear transformation. In recent years, a complex-valued ReLU activation was proposed by Guberman Guberman (2016)
and a modified ReLU (modReLU) was proposed by Arjovsky et al. Arjovsky et al. (2016). In Virtue et al. (2017), Virtue et al. proposed the cardioid activation function
$$f(z) = \frac{1}{2}\left(1 + \cos(\arg z)\right) z,$$
which is used for magnetic resonance imaging (MRI) fingerprinting. The cardioid activation function is a phase-sensitive complex extension of the ReLU. A complex-valued kernel activation function was proposed by Scardapane et al. Scardapane et al. (2018).
8 Quantized activation functions
The quantized neural network (QNN) has recently attracted researchers around the world; see Courbariaux et al. (2015); Rastegari et al. (2016); Zhou et al. (2017). The aim of quantization is to compact the neural network model in order to improve its efficiency without degrading its performance. Both forward and backpropagation passes can be executed in quantized neural networks using bitwise operations rather than floating-point operations. Weights, activations, and gradients are the three components of a neural network that may be quantized. The purpose of quantizing these components, as well as the methodologies used to quantize them, varies slightly. The size of the model can be substantially reduced by using quantized weights and activations. Moreover, with quantized gradients, communication costs can be greatly reduced in a distributed training environment. Quantized activations can be used to replace inner products with binary operations, allowing for faster network training. By eliminating full-precision activations, a substantial amount of memory can be saved. The spiking neural network (SNN) is another type of network where activation levels are quantized into temporally sparse, one-bit values, also called 'spike events', which additionally converts the sum over weight-activity products into a simple addition of weights (one weight for each spike). The activations were quantized to 8 bits by Vanhoucke et al. in Vanhoucke et al. (2011). In particular, they quantized the activations after training the network using a sigmoid function that confines the activations to the range [0, 1]. In Courbariaux et al. (2015); Rastegari et al. (2016); Zhou et al. (2016), the authors used the binary activation
$$f(x) = \mathrm{sign}(x) = \begin{cases} +1, & x \ge 0, \\ -1, & x < 0. \end{cases}$$
To approximate the ReLU unit, Cai et al. Cai et al. (2017) proposed a half-wave Gaussian quantizer. They employed a half-wave Gaussian quantization function in the forward approximation.
In recent work by Anitha et al. Anitha et al. (2021), a quantized complex-valued activation function was proposed. Quantized activations have also proved successful against adversarial examples; see Rakin et al. Rakin et al. (2018). Mishra et al. Mishra et al. (2017) presented wide reduced-precision networks (WRPN) to quantize activations and weights. They further found that activations take up more memory than weights. To compensate for the loss of precision due to quantization, they implemented an approach that increases the number of filters in each layer.
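The deterministic binary (sign) activation used in the works above can be sketched as follows; the $+1$ convention at zero is one common choice, not a universal one:

```python
def binary_activation(x):
    """Deterministic binarization: map a real pre-activation to {-1, +1}.
    Replaces multiply-accumulate with cheap bitwise operations at inference."""
    return 1.0 if x >= 0 else -1.0

def binarize(xs):
    """Binarize a whole pre-activation vector."""
    return [binary_activation(x) for x in xs]
```

In training, such networks typically keep full-precision weights for the gradient update and binarize only in the forward pass, which is precisely where the gradient mismatch problem discussed below originates.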
One well-known difficulty with quantized activations is that they cause a gradient mismatch problem; see Lin and Talathi Lin and Talathi (2016). As an example, Figure 8 shows the expected and actual ReLU activation function in low-precision networks. In a fixed-point network, the effective activation function is a non-differentiable function. The gradient mismatch problem is the result of this discrepancy between the assumed and actual activation functions.
9 A quest towards an optimal activation function
Which activation function should we use? This is one of the most basic and meaningful questions that can be posed. As discussed earlier, there is no rule of thumb for choosing the optimal activation function, which strongly depends on the problem under consideration. This motivates us to ask another meaningful question: do we need an activation function that adapts itself as per the requirements, thereby avoiding local minima by changing the loss landscape dynamically? In this way, adaptive activation functions can beat any standard (fixed) activation function of the same type. Despite the availability of various activation functions, the quest for the best activation function has driven researchers to propose activation functions that are adaptive, also called adaptable, tunable, or trainable activations. In recent years, this has triggered a surge in papers on adaptive activation functions. The adaptive nature of the activation can be injected in many ways. For example, one can introduce parameters that adapt accordingly; another way is to use an ensemble of activation functions from a predefined set, which performs better than a single activation. This section discusses various adaptive activation functions that perform better than their fixed counterparts.
9.1 Parametric Activation Functions
In the literature, various adaptive activation functions have been proposed. In their earlier work, Chen and Chang Chen and Chang (1996) proposed the generalized hyperbolic tangent function, parameterized by two additional positive scalar values $a$ and $b$, as
$$f(x) = \frac{a\left(1 - e^{-bx}\right)}{1 + e^{-bx}},$$
where the parameters are initialized randomly and then adapted independently for every neuron. Vecci et al. Vecci et al. (1998) suggested a new architecture based on adaptive activation functions that use Catmull-Rom cubic splines. Trentin Trentin (2001) presented empirical evidence that learning the amplitude of each neuron is preferable to having unit amplitude for all activation functions (either in terms of generalization error or speed of convergence). Goh et al. Goh and Mandic (2003) proposed a trainable amplitude activation function. Chandra and Singh Chandra and Singh (2004) proposed an activation-function-adapting algorithm for sigmoidal feedforward neural network training. Eisenach et al. Eisenach et al. (2016) proposed parametrically learned activation functions. Babu and Edla Naresh Babu and Edla (2017) proposed algebraic activation functions. Ramachandran et al. Ramachandran et al. (2017) proposed the Swish activation function, defined as
$$f(x) = x \cdot \mathrm{sigmoid}(\beta x) = \frac{x}{1 + e^{-\beta x}},$$
where $\beta$ is a tuning parameter. Like ReLU, Swish is unbounded above but bounded below. Swish has been shown to give faster convergence and better generalization for many test problems. In Alcaide (2018), Alcaide proposed the E-swish activation function
$$f(x) = \beta x \cdot \mathrm{sigmoid}(x).$$
Mercioni and Holban Mercioni and Holban (2020) proposed the trainable-parameter-based swish (p-swish) activation, which can give more flexibility than the original swish activation. Ertuğrul Ertuğrul (2018) proposed the trained activation function. Choi et al. Choi et al. (2018) proposed the PArameterized Clipping acTivation (PACT), which uses an activation clipping parameter that is optimized during training to find the right quantization scale. Apicella et al. Apicella et al. (2019) proposed an efficient architecture for trainable activation functions.
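A minimal sketch of Swish and E-swish; the E-swish default $\beta = 1.375$ is an illustrative choice, not a universal recommendation:

```python
import math

def swish(x, beta=1.0):
    """Swish (Ramachandran et al., 2017): x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def eswish(x, beta=1.375):
    """E-swish (Alcaide, 2018): beta * x * sigmoid(x); beta here is illustrative."""
    return beta * x / (1.0 + math.exp(-x))
```

Like ReLU, Swish is unbounded above; unlike ReLU, it dips slightly below zero (minimum around $-0.28$ for $\beta = 1$), which is the non-monotonic "bump" often credited with its improved optimization behavior.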
Jagtap et al. Jagtap et al. (2020b) proposed a globally adaptive activation function, where they introduced a single slope parameter $a$ in the activation function as
$$\sigma(na\,x),$$
where $a$ is a trainable parameter and $n$ is a predefined scaling factor. They initialized $na = 1$. Figure 9 shows these globally adaptive activation functions for different values of $na$. Based on this idea, layer-wise and neuron-wise locally adaptive activation functions were proposed in Jagtap et al. (2020a) that can be trained faster. The main idea is to introduce a trainable parameter for every hidden layer (layer-wise adaptive activation functions) as well as for every neuron in each layer (neuron-wise adaptive activation functions). Along with these locally adaptive activations, an additional slope recovery term is added to the loss function, based on
$$\frac{1}{D-1} \sum_{k=1}^{D-1} \exp\left(\frac{\sum_{i=1}^{N_k} a_i^{k}}{N_k}\right),$$
where $D$ is the depth of the network and $N_k$ is the number of neurons in the $k$-th hidden layer. The authors in Nader and Azar (2020)
proposed self-adaptive evolutionary algorithms for searching for new activation functions. The SoftRootSign (SRS) activation function Zhou et al. (2020) is defined as
$$f(x) = \frac{x}{\frac{x}{\alpha} + e^{-x/\beta}},$$
which has a bounded output range. Both $\alpha$ and $\beta$ are a pair of trainable non-negative parameters. Pratama and Kang Pratama and Kang (2021) proposed trainable activation functions for neural networks. Universal activation functions were proposed by Yuen et al. in Yuen et al. (2021), with parameters that control the slope, horizontal shift, and vertical shift, respectively; one parameter approximates the slope of leaky ReLU, while the others introduce additional degrees of freedom.
9.1.1 Adaptive family of rectifier and exponential units

Parametric ReLU He et al. (2015): The parametric ReLU (PReLU) is similar to the leaky ReLU, and it is defined as
$$f(x) = \begin{cases} x, & x > 0, \\ \alpha x, & x \le 0. \end{cases}$$
In the parametric ReLU, $\alpha$ is a learnable parameter, which is learned during the optimization process. Both leaky and parametric ReLU still face the problem of exploding gradients.

S-shaped ReLU: Abbreviated SReLU Jin et al. (2016), it is defined as a combination of three linear functions, which perform a mapping with the following formulation:
$$f(x) = \begin{cases} t^{r} + a^{r}\left(x - t^{r}\right), & x \ge t^{r}, \\ x, & t^{l} < x < t^{r}, \\ t^{l} + a^{l}\left(x - t^{l}\right), & x \le t^{l}, \end{cases}$$
where $\{t^{l}, a^{l}, t^{r}, a^{r}\}$ are four learnable parameters. A channel subscript may be added to these parameters, indicating that SReLU is allowed to vary across different channels.

Parametric ELU: The parametric ELU Shah et al. (2016) was proposed in order to remove the need to specify the ELU parameter $\alpha$, which is instead learned during training to obtain the proper activation shape at every CNN layer. Klambauer et al. Klambauer et al. (2017) proposed the scaled exponential linear unit (SELU), $f(x) = \lambda x$ for $x > 0$ and $\lambda\alpha\left(e^{x} - 1\right)$ otherwise, with fixed constants $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$.

Parametric tanh linear unit (PTELU) Duggal and Gupta (2017): It is defined as
$$f(x) = \begin{cases} x, & x > 0, \\ \alpha \tanh(\beta x), & x \le 0, \end{cases}$$
which has range $[-\alpha, \infty)$, and both $\alpha$ and $\beta$ are trainable parameters.

Continuously differentiable ELU: The continuously differentiable ELU (CELU) Barron (2017) is simply the ELU activation where the negative part has been modified to ensure that the derivative exists (and equals 1) at $x = 0$ for all values of $\alpha$. It is defined as
$$f(x) = \begin{cases} x, & x \ge 0, \\ \alpha\left(e^{x/\alpha} - 1\right), & x < 0, \end{cases}$$
where $\alpha$ is a tunable parameter.

Flexible ReLU Qiu et al. (2018): The flexible ReLU, or FReLU, is defined as
$$f(x) = \mathrm{ReLU}(x) + b,$$
which has range $[b, \infty)$. The FReLU captures negative values through the learnable rectified point $b$.

Paired ReLU Tang et al. (2018): The paired ReLU is defined as
$$f(x) = \left[\max(s x - \theta, 0),\ \max(s_p x - \theta_p, 0)\right],$$
where $s$ and $s_p$ represent scale parameters, which are initialized with the values $-0.5$ and $0.5$, respectively. $\theta$ and $\theta_p$ are a pair of trainable thresholds.

Multiple parametric ELU: The multiple parametric ELU Li et al. (2018) is given as
$$f(x) = \begin{cases} x, & x > 0, \\ \alpha\left(e^{\beta x} - 1\right), & x \le 0. \end{cases}$$
Here, $\beta$ is greater than zero, and both $\alpha$ and $\beta$ are learnable.

Parametric rectified exponential unit: The parametric rectified exponential unit (PREU) Ying et al. (2019) is defined as
$$f(x) = \begin{cases} \alpha x, & x > 0, \\ \alpha x e^{\beta x}, & x \le 0, \end{cases}$$
with trainable parameters $\alpha$ and $\beta$.

Fast ELU: The fast ELU Qiumei et al. (2019) replaces the exponential in the ELU with a fast approximation. The curve trend of the fast ELU is consistent with the ELU due to the accuracy of the approximation, ensuring that the fast ELU does not alter the original ELU's accuracy advantage.

Multi-bin trainable linear units (MTLU) Gu et al. (2019): The MTLU divides the input domain into multiple bins and applies a different trainable linear function $f(x) = a_k x + b_k$ in the $k$-th bin, giving range $(-\infty, \infty)$. The bin ranges are hyperparameters, while all the linear coefficients are trainable.

Mexican ReLU: The Mexican ReLU (MeLU) Maguolo et al. (2021) uses Mexican-hat-type functions, and it is defined as
$$f(x) = \mathrm{PReLU}(x) + \sum_{j} c_j\, \phi_{a_j, \lambda_j}(x),$$
where the $c_j$ are tunable parameters and $a_j, \lambda_j$ are fixed real numbers defining the Mexican hat functions $\phi$.

Lipschitz ReLU Basirat and Roth (2020): The L* ReLU replaces the two linear pieces of the ReLU with nonlinear functions,
$$f(x) = \begin{cases} \phi(x), & x > 0, \\ \eta(x), & x \le 0, \end{cases}$$
where $\phi$ and $\eta$ are nonlinear functions.

Parametric deformable ELU: First defined in Cheng et al. (2020), the parametric deformable ELU replaces the exponential in the negative part of the ELU with a deformed exponential,
$$f(x) = \begin{cases} x, & x > 0, \\ \alpha\left(\left[1 + (1-t)x\right]_+^{\frac{1}{1-t}} - 1\right), & x \le 0, \end{cases}$$
with range $(-\alpha, \infty)$.

Elastic ELU: The elastic ELU (EELU) Kim et al. (2020) is designed to utilize the properties of both the ELU and ReLU together, combining an elastic slope in the positive part with an exponential negative part, where both parameters are tunable.
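A few of the parametric units above can be sketched directly; all parameter values below are illustrative stand-ins for quantities that would normally be learned:

```python
import math

def prelu(x, alpha):
    """Parametric ReLU: alpha is a learned slope for the negative part."""
    return x if x > 0 else alpha * x

def ptelu(x, alpha=1.0, beta=1.0):
    """PTELU sketch: identity for x > 0, alpha*tanh(beta*x) otherwise."""
    return x if x > 0 else alpha * math.tanh(beta * x)

def frelu(x, b=-0.5):
    """Flexible ReLU sketch: ReLU shifted by a learnable rectified point b."""
    return max(0.0, x) + b

def celu(x, alpha=1.0):
    """CELU: alpha*(exp(x/alpha) - 1) for x < 0; derivative is 1 at x = 0."""
    return x if x >= 0 else alpha * (math.exp(x / alpha) - 1.0)
```

The family resemblance is clear from the code: each unit keeps the identity on the positive half-line and varies only in how the negative half-line is parameterized.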
9.2 Stochastic/probabilistic adaptive activation functions
The stochastic/probabilistic approach is another way to introduce an adaptive activation function. In Gulcehre et al. (2016), Gulcehre et al. proposed noisy activation functions, where a structured and bounded noise-like effect is added to allow the optimizer to exploit more and learn faster. This is particularly effective for activation functions that strongly saturate for large values of their inputs. In particular, they learn the level of injected noise in the saturated regime of the activation function. They consider noisy activation functions of the form
$$\phi(x, \xi) = h(x) + s, \quad s = \mu + \sigma\xi,$$
where $\xi$ is an independent and identically distributed random variable drawn from a generating distribution, and the parameters $\mu$ and $\sigma$ are used to generate a location-scale family from it. When the unit saturates, they pin its output to the threshold value $t$ and add the noise. The type of noise and the values of $\mu$ and $\sigma$, which can be chosen as functions of $x$ to allow some gradients to propagate even in the saturating regime, determine the exact behavior of the method.
Urban et al. Urban et al. (2017) proposed Gaussian-process-based stochastic activation functions. Based on a similar idea to Gulcehre et al. Gulcehre et al. (2016), Shridhar et al. Shridhar et al. (2019) proposed probabilistic activation functions (ProbAct), which are not only trainable but also stochastic in nature; see Figure 10. The ProbAct is defined as
$$f(x) = \mu(x) + z,$$
where $\mu(x)$ is a static or learnable mean, say $\mu(x) = \max(0, x)$ for the static ReLU case, and the perturbation term is $z = \sigma\varepsilon$, where the perturbation parameter $\sigma$ is a fixed or trainable value that specifies the range of stochastic perturbation, and $\varepsilon$ is a random value sampled from a normal distribution $\mathcal{N}(0, 1)$. Because the ProbAct generalizes the ReLU activation, it has the same drawbacks as the ReLU activation.
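A minimal sketch of ProbAct with a static ReLU mean; the default $\sigma$ and the random generator are illustrative:

```python
import random

def probact(x, sigma=0.5, rng=None):
    """ProbAct sketch: ReLU mean plus a Gaussian perturbation sigma * eps.
    With sigma = 0 this reduces exactly to the ReLU."""
    eps = (rng or random).gauss(0.0, 1.0)
    return max(0.0, x) + sigma * eps
```

Passing an explicit `random.Random(seed)` makes the stochastic activation reproducible, which is useful when comparing training runs.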
The rand softplus activation function Chen et al. (2019) models the stochasticity-adaptability as
$$f(x) = (1 - \rho)\max(0, x) + \rho \log\left(1 + e^{x}\right),$$
where $\rho$ is a stochastic hyperparameter.
9.3 Fractional adaptive activation functions
The fractional activation function can encapsulate many existing and state-of-the-art activation functions. In the earlier study of Ivanov Ivanov (2018), fractional activation functions were presented. This is motivated by the potential benefits of having more tunable hyperparameters in a neural network and achieving different behaviors, which can be more suitable for some kinds of problems. In particular, the author used the following Mittag-Leffler type of function:
$$E_{\alpha, \beta}(x) = \sum_{k=0}^{\infty} \frac{x^{k}}{\Gamma(\alpha k + \beta)},$$
which is a class of parametric transcendental functions that generalize the exponential function. Setting $\alpha$ and $\beta$ to unity gives the Taylor series expansion of the exponential function. The author proposed several fractional activation functions by replacing the exponential function inside standard activation functions, such as tanh, sigmoid, logistic, etc., with this generalization.
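Since the series converges for any finite $x$, it can be evaluated by simple truncation; a minimal sketch, where the 60-term cutoff is an arbitrary illustrative choice:

```python
import math

def mittag_leffler(x, alpha=1.0, beta=1.0, terms=60):
    """Truncated two-parameter Mittag-Leffler series:
    sum_k x^k / Gamma(alpha*k + beta). With alpha = beta = 1 this is exp(x)."""
    return sum(x ** k / math.gamma(alpha * k + beta) for k in range(terms))
```

Two classical special cases serve as sanity checks: $E_{1,1}(x) = e^{x}$ and $E_{2,1}(x) = \cosh(\sqrt{x})$.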
In Zamora Esquivel et al. (2019), the authors proposed activation functions leveraging fractional calculus. In particular, they used the Gamma function $\Gamma(\cdot)$. The fractional ReLU is then defined as
$$f(x) = \begin{cases} \dfrac{x^{1-\alpha}}{\Gamma(2-\alpha)}, & x > 0, \\ 0, & x \le 0, \end{cases}$$
where the fractional derivative of a monomial is given by
$$\frac{d^{\alpha}}{dx^{\alpha}} x^{n} = \frac{\Gamma(n+1)}{\Gamma(n - \alpha + 1)}\, x^{n - \alpha}.$$
Similarly, a fractional hyperbolic tangent is obtained by applying the same fractional derivative to the hyperbolic tangent function.
The fractional derivative produces a family of functions for different values of $\alpha$. Figure 11 depicts the fractional ReLU (left) and fractional hyperbolic tangent (right) functions for different $\alpha$ values. At $\alpha = 1$, the ReLU becomes a step function, whereas the hyperbolic tangent becomes a quadratic hyperbolic secant function ($\mathrm{sech}^{2}$).
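On the positive half-line the fractional ReLU above interpolates between the identity ($\alpha = 0$) and the unit step ($\alpha = 1$); a minimal sketch:

```python
import math

def fractional_relu(x, alpha=0.5):
    """Fractional-order ReLU sketch: x^(1-alpha) / Gamma(2-alpha) for x > 0,
    and 0 otherwise. alpha = 0 recovers x; alpha = 1 recovers the unit step."""
    if x <= 0:
        return 0.0
    return x ** (1.0 - alpha) / math.gamma(2.0 - alpha)
```

Intermediate values of $\alpha$ yield sub-linear power-law growth, which is the extra degree of freedom the fractional family offers.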
9.4 Ensemble adaptive activation functions
An alternative approach to adaptive activation functions is to create a combination of different activation functions. Chen Chen (2016) used multiple activation functions for each neuron for problems related to stochastic control. Jin et al. Jin et al. (2016) proposed a combination of a set of linear functions with open parameters. Agostinelli et al. Agostinelli et al. (2014) constructed the activation functions during network training. They used the following general framework:
$$f_i(x) = \max(0, x) + \sum_{s=1}^{S} a_i^{s} \max\left(0, -x + b_i^{s}\right),$$
where $S$ is a predefined hyperparameter, $i$ indexes the neuron, and $a_i^{s}, b_i^{s}$ are trained variables.
Ramachandran et al. Ramachandran et al. (2017) used a reinforcement learning controller to combine preset unary and binary functions for learning new activation functions. In Qian et al. (2018), Qian et al. proposed adaptive activations in convolutional neural networks, where they focus on learning activation functions by combining basic activation functions in a data-driven way. In particular, they combined ReLU with other variants of ReLU, such as leaky ReLU and ELU. In Klabjan and Harmon (2019), Klabjan and Harmon show that an ensemble activation function can be created by choosing suitable activation functions from a predefined set. Here, the activations are combined together to get the best performance out of the neural network. A similar approach was used by Nandi et al. in Nandi et al. (2020), where an ensemble of activation functions is used to improve the performance of the network. The previously discussed Mexican ReLU Maguolo et al. (2021) also comes under the category of ensemble activation functions. A polynomial activation function was proposed by Piazza et al. Piazza et al. (1992), where they built the activation function over powers of the input. They used
$$f(x) = \sum_{i=0}^{n} a_i x^{i},$$
where $n$ is a hyperparameter. Because a polynomial of degree $n$ can pass through $n+1$ points exactly, this polynomial activation function can theoretically approximate any smooth function. The main disadvantage was the global influence of the coefficients $a_i$, which can cause the function's output to grow excessively large. Similarly, Scardapane et al. Scardapane et al. (2019b) proposed kernel-based non-parametric activation functions. In particular, they model each activation function in terms of a kernel expansion over a finite number of terms as
$$f(x) = \sum_{i=1}^{D} \alpha_i\, \kappa(x, d_i),$$
where $\alpha_i$ are the mixing coefficients, $d_i$ are the dictionary elements, and $\kappa(\cdot, \cdot)$ is a one-dimensional kernel function. Further, this idea was extended with a multi-kernel approach in Scardapane et al. (2019a). On similar grounds, adaptive blending units Sütfeld et al. (2020) were proposed by Sütfeld et al. to combine a set of functions in a linear way. In Basirat and Roth (2018), Basirat and Roth presented a genetic-algorithm-based learning of activation functions, where the hybrid crossover of various operators results in a hybrid activation function.
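A minimal sketch of such a kernel expansion with a Gaussian kernel; the mixing coefficients, dictionary, and bandwidth $\gamma$ below are illustrative placeholders for the trained quantities:

```python
import math

def kaf(x, mixing, dictionary, gamma=1.0):
    """Kernel activation function sketch: f(x) = sum_i alpha_i * k(x, d_i),
    with a Gaussian kernel k(x, d) = exp(-gamma * (x - d)^2). In practice the
    mixing coefficients are trained while the dictionary is a fixed grid."""
    return sum(a * math.exp(-gamma * (x - d) ** 2)
               for a, d in zip(mixing, dictionary))
```

Because each kernel term is local, changing one mixing coefficient reshapes the activation only near its dictionary point, avoiding the global-influence problem of polynomial activations noted above.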
| References | Parametric | Ensemble | Stochastic/Probabilistic | Fractional | Complex-valued | Quantized |
|---|---|---|---|---|---|---|
| Chen and Chang Chen and Chang (1996), Vecci et al. Vecci et al. (1998), Trentin Trentin (2001), Goh and Mandic Goh and Mandic (2003), Chandra and Singh Chandra and Singh (2004), Shah et al. Shah et al. (2016), Jin et al. Jin et al. (2016), He et al. He et al. (2015), Barron Barron (2017), Duggal and Gupta Duggal and Gupta (2017), Li et al. Li et al. (2018), Tang et al. Tang et al. (2018), Qiu et al. Qiu et al. (2018), Grelsson and Felsberg Grelsson and Felsberg (2018), Ying et al. Ying et al. (2019), Qiumei et al. Qiumei et al. (2019), Gu et al. Gu et al. (2019), Goyal et al. Goyal et al. (2019), Jagtap et al. Jagtap et al. (2020b, a), Basirat and Roth Basirat and ROTH (2020), Cheng et al. Cheng et al. (2020), Kim et al. Kim et al. (2020) | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Piazza et al. Piazza et al. (1992), Agostinelli et al. Agostinelli et al. (2014), Jagtap et al. Jagtap et al. (2021), Scardapane et al. Scardapane et al. (2019b) | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Gulcehre et al. Gulcehre et al. (2016), Shridhar et al. Shridhar et al. (2019), Urban et al. Urban et al. (2017) | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Zamora et al. Zamora Esquivel et al. (2019), Ivanov Ivanov (2018) | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
| Ramachandran et al. Ramachandran et al. (2017), Klabjan and Harmon Klabjan and Harmon (2019), Sütfeld et al. Sütfeld et al. (2020), Basirat and Roth Basirat and Roth (2018) | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Nanni et al. Nanni et al. (2020) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Choi et al. Choi et al. (2018), Rakin et al. Rakin et al. (2018) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ |
In recent years, Nanni et al. Nanni et al. (2020) proposed the stochastic selection of activation layers for convolutional neural networks. In Jagtap et al. (2021), Jagtap et al. proposed the Kronecker neural network (KNN), a general framework for adaptive activation functions that generalizes a class of existing feedforward neural networks utilizing adaptive activation functions. In particular, the KNN employs an activation function constructed as an adaptive combination of $K$ base activation functions with trainable parameters. Different classes of adaptive activation functions are produced by varying these parameters and the hyperparameter $K$. For example,

If the combination parameters are chosen so that only a single fixed base activation remains active for all neurons, the Kronecker network becomes a standard feedforward network.

With an appropriate choice of these parameters, the Kronecker network becomes a feedforward network with the parametric ReLU activation He et al. (2015).
Every type of adaptive activation discussed in this section has pros and cons of its own. Table 2 categorizes the referenced adaptive activation functions under various settings.
10 Performance of some fixed and adaptive activation functions for classification tasks
This section discusses the performance of different activation functions for classification tasks. In particular, we used the MNIST LeCun et al. (1998), CIFAR-10, and CIFAR-100 Krizhevsky et al. (2009) data sets. The MNIST database of handwritten digits contains a training set of 60k samples and a test set of 10k examples. The CIFAR-10 data set contains 50k training and 10k testing images from 10 object categories, whereas the CIFAR-100 data set contains 50k training and 10k testing images from 100 object categories. For comparison, we chose some of the fixed as well as adaptive activation functions discussed previously.
| Activation | MNIST | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| Sigmoid | 97.9 Pedamonti (2018) | 89.43 ± 0.51 (M), 85.42 ± 0.47 (V) | 61.64 ± 0.56 (M), 59.25 ± 0.45 (V) |
| Tanh | 98.21 Eisenach et al. (2016) | 88.19 ± 1.21 (M), 87.53 ± 0.67 (V) | 57.06 ± 2.03 (M), 62.32 ± 0.82 (V) |
| Swish | 99.75 ± 0.11 (M), 99.45 ± 0.26 (V) | 95.5 Ramachandran et al. (2017) | 83.9 Ramachandran et al. (2017) |
| ReLU | 99.1 Apicella et al. (2019), 99.15 Scardapane et al. (2019b), 99.53 Jin et al. (2016) | 95.3 Ramachandran et al. (2017), 94.59 Trottier et al. (2017), 92.27 Jin et al. (2016) | 83.7 Ramachandran et al. (2017), 75.45 Trottier et al. (2017), 67.25 Jin et al. (2016) |
| Leaky ReLU | 98.2 Pedamonti (2018), 99.58 Jin et al. (2016) | 95.6 Ramachandran et al. (2017), 92.32 Qian et al. (2018) | 83.3 Ramachandran et al. (2017), 67.3 Jin et al. (2016) |
| ELU | 98.3 Pedamonti (2018) | 94.4 Ramachandran et al. (2017), 94.01 Trottier et al. (2017) | 80.6 Ramachandran et al. (2017), 74.92 Trottier et al. (2017) |
| PReLU | 99.64 ± 0.24 (M), 99.18 ± 0.17 (V) | 92.46 ± 0.44 (M), 91.63 ± 0.31 (V) | 69.46 ± 0.74 (M), 66.53 ± 0.69 (V) |
| SELU | 98.42 ± 0.53 (M), 99.02 ± 0.37 (V) | 93.52 ± 0.63 (M), 90.53 ± 0.36 (V) | 70.42 ± 0.75 (M), 68.02 ± 1.29 (V) |
| RReLU | 99.23 ± 0.53 (M), 99.63 ± 0.6 (V) | 90.52 ± 2.14 (M), 90.18 ± 0.91 (V) | 68.62 ± 0.42 (M), 65.32 ± 1.74 (V) |
| GELU | 99.72 ± 0.35 (M), 99.26 ± 0.42 (V) | 94.76 ± 0.55 (M), 92.67 ± 0.89 (V) | 71.73 ± 1.09 (M), 69.61 ± 1.53 (V) |
| CELU | 99.36 ± 0.68 (M), 99.37 ± 0.38 (V) | 90.26 ± 0.12 (M), 90.37 ± 0.23 (V) | 70.26 ± 1.53 (M), 68.35 ± 0.87 (V) |
| Softplus | 98.69 ± 0.5 (M), 97.36 ± 0.77 (V) | 94.9 Ramachandran et al. (2017) | 83.7 Ramachandran et al. (2017) |
| Mish | 97.9 Pedamonti (2018) | 90.26 ± 0.52 (M), 86.05 ± 0.76 (V) | 68.53 ± 0.86 (M), 67.03 ± 1.39 (V) |
| Maxout | 99.55 Goodfellow et al. (2013) | 90.62 Goodfellow et al. (2013) | 61.43 Goodfellow et al. (2013) |
| SRS | 98.04 ± 0.97 (M), 98.06 ± 0.84 (V) | 89.35 ± 0.85 (M), 87.26 ± 1.38 (V) | 65.20 ± 1.53 (M), 63.65 ± 2.63 (V) |
| Ensemble Klabjan and Harmon (2019) | 99.40 Klabjan and Harmon (2019) | 85.05 ± 0.28 (M), 84.96 ± 0.87 (V) | 74.20 Klabjan and Harmon (2019) |
| LiSHT | 98.74 ± 0.17 (M), 98.32 ± 0.19 (V) | 90.78 ± 0.43 (M), 87.74 ± 0.36 (V) | 56.35 ± 0.58 (M), 58.74 ± 0.99 (V) |
| Sine | 99.10 ± 0.82 (M), 98.63 ± 0.27 (V) | 91.64 ± 0.36 (M), 90.59 ± 0.88 (V) | 80.98 ± 0.81 (M), 78.42 ± 0.67 (V) |
| GCU Noel et al. (2021) | 97.80 ± 1.02 (M), 90.73 ± 0.87 (V) | 90.24 ± 1.96 (M), 89.56 ± 0.18 (V) | 77.46 ± 2.91 (M), 76.46 ± 1.61 (V) |
| Gaussian | 98.60 ± 2.22 (M), 96.43 ± 2.57 (V) | 90.64 ± 2.76 (M), 89.59 ± 0.28 (V) | 76.96 ± 0.41 (M), 74.62 ± 2.69 (V) |
| Adaptive sine | 99.56 ± 0.56 (M), 98.76 ± 0.12 (V) | 92.64 ± 0.36 (M), 92.72 ± 0.37 (V) | 81.15 ± 0.81 (M), 78.73 ± 0.64 (V) |
| Adaptive tanh | 99.52 ± 0.36 (M), 99.57 ± 0.20 (V) | 91.14 ± 0.45 (M), 90.52 ± 0.82 (V) | 83.82 ± 0.83 (M), 76.21 ± 1.95 (V) |
| Adaptive ReLU | 99.75 ± 0.25 (M), 98.43 ± 0.32 (V) | 96.89 ± 0.93 (M), 94.77 ± 1.16 (V) | (M), 80.30 ± 0.60 (V) |
| Rowdy sine | 99.74 ± 0.29 (M), 99.24 ± 0.43 (V) | 93.87 ± 0.78 (M), 92.82 ± 0.82 (V) | 83.73 ± 0.83 (M), 82.14 ± 0.13 (V) |
| Rowdy tanh | 99.45 ± 0.93 (M), 98.65 ± 0.57 (V) | 93.78 ± 0.13 (M), 93.26 ± 0.42 (V) | 82.34 ± 1.01 (M), 80.97 ± 0.32 (V) |
| Rowdy ReLU | 99.43 ± 0.68 (M), 98.23 ± 0.25 (V) | (M), 94.47 ± 0.92 (V) | (M), 79.41 ± 0.75 (V) |
| Rowdy ELU | (M), 99.06 ± 1.07 (V) | 95.45 ± 0.92 (M), 93.27 ± 0.13 (V) | 84.29 ± 0.13 (M), 82.40 ± 0.81 (V) |
Table 3: Classification accuracy for the MobileNet (denoted by M) and VGG16 (denoted by V) architectures with different activation functions, where we report the mean and standard deviation of 10 different realizations. Adaptive and Rowdy (with ) refer to Jagtap et al. (2020b) and Jagtap et al. (2022a), respectively. The best results are highlighted in bold.

In this work, we report the best accuracy from the literature without considering the architecture used to perform the experiments. We also conducted a performance comparison of two models, namely, MobileNet Howard et al. (2017) and VGG16 Simonyan and Zisserman (2014). Both MobileNet and VGG16 are convolutional neural network (CNN) architectures. MobileNet is a lightweight deep neural network with good classification accuracy and fewer parameters, whereas VGG16 has 16 deep layers. In both cases, the learning rate is 1e-4, and the batch sizes for MNIST, CIFAR-10, and CIFAR-100 are 64, 128, and 64, respectively. Data normalization is performed during training and testing. The Adam Kingma and Ba (2014) optimizer is employed with a cross-entropy loss function. All experiments are run on a desktop computer with 16 GB of RAM and an Nvidia GPU card, in the PyTorch Paszke et al. (2019) environment, a popular machine learning library.

Table 3 gives the accuracy comparison of different activation functions for all three data sets, where we report the mean and standard deviation of the accuracy over 10 different realizations. For MNIST, almost all activation functions perform well. The Swish, ReLU, Leaky ReLU, ELU, and Sine activation functions work well for the CIFAR-10 and CIFAR-100 data sets. Furthermore, the adaptive and Rowdy activation functions outperform the fixed ones on all three data sets.
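For reference, the closed forms of several of the compared activation functions can be sketched in plain Python. The definitions follow the standard published formulas; the adaptive variant with a scalar slope parameter `a` is a simplified, scalar illustration of the cited adaptive activations, not the exact layer-wise formulation used in the experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # ReLU: max(0, x)
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: small negative slope alpha for x < 0
    return x if x > 0.0 else alpha * x

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

def gelu(x):
    # GELU (exact form): x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mish(x):
    # Mish: x * tanh(softplus(x))
    return x * math.tanh(math.log1p(math.exp(x)))

def lisht(x):
    # LiSHT: x * tanh(x), non-negative for all x
    return x * math.tanh(x)

def adaptive_tanh(x, a=1.0):
    # simplified adaptive activation: tanh(a * x), with a trainable in practice
    return math.tanh(a * x)
```

The fixed functions differ mainly in their behavior for negative inputs and their smoothness, which is what the comparison in Table 3 probes.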
11 Activation functions for physics-informed machine learning
In recent years, physics-informed machine learning (PIML) Karniadakis et al. (2021) algorithms for scientific computing have advanced at a considerably faster pace owing to the success of automatic differentiation (AD) Baydin et al. (2018). These methods can solve not only forward problems but also notoriously difficult inverse problems governed by almost any type of differential equation. Various physics-informed machine learning algorithms have been proposed that can seamlessly incorporate governing physical laws into the machine learning framework. The SINDy framework Brunton et al. (2016) for analyzing dynamical systems, physics-informed machine learning for modeling turbulence Wang et al. (2017), the physics-guided machine learning framework Karpatne et al. (2017), the Neural ODE Chen et al. (2018), and most recently physics-informed neural networks (PINNs) Raissi et al. (2019a) are just a few examples of the many physics-informed machine learning algorithms that have been proposed and successfully used for solving problems in computational science and engineering.
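To make concrete what AD computes, a minimal forward-mode variant can be sketched with dual numbers: each value carries its derivative, and the chain rule is applied operation by operation. This is only an illustrative sketch; deep learning frameworks such as PyTorch use reverse-mode AD, which is far more efficient for many-parameter networks.

```python
import math

class Dual:
    """Dual number v + d*eps with eps**2 = 0: v is the value, d the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        # product rule: (u*v)' = u'*v + u*v'
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def dsin(x):
    # chain rule for sin: d/dx sin(u) = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def derivative(f, x0):
    # seed the input with dot = 1 and read off the derivative of the output
    return f(Dual(x0, 1.0)).dot
```

For example, `derivative(lambda x: x * dsin(x), x0)` returns sin(x0) + x0*cos(x0), exactly (to machine precision) rather than approximately as a finite difference would.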
Most of our previous discussion of activation functions was centered around various classification problems in the computer science community. In this section, we focus on activation functions for the PIML framework. Apart from an accurate and efficient AD algorithm, PIML methods also need tailored activation functions, since the output of the network must satisfy not only the given data but also the governing physical laws in the form of differential equations, in which various spatio-temporal derivative terms are involved. In this work, we use PINNs for our investigation. PINN is a very efficient method for solving both forward and ill-posed inverse differential equations involving sparse and multi-fidelity data. The main feature of the PINN is that it can seamlessly incorporate all the available information, such as the governing equation, experimental data, and initial/boundary conditions, into the loss function, thereby recasting the original PDE problem into an optimization problem. PINNs have achieved remarkable success in many applications in science and engineering; see Raissi et al. (2019b); Jagtap et al. (2022b); Raissi et al. (2020); Shukla et al. (2021a); Jagtap and Karniadakis (2020); Jagtap et al. (2020c); Shukla et al. (2021b); Jagtap et al. (2022a); Hu et al., and the references therein for more details. The first comprehensive theoretical analysis of PINNs for a prototypical nonlinear PDE, the Navier-Stokes equations, has been presented in De Ryck et al.
11.1 Differentiability
Differentiability is an important property of an activation function. In the PINN setting, the activation function must be differentiable everywhere, up to the order of the derivatives appearing in the governing equation, since the physics residual is built from derivatives of the network output. Consider the example of the one-dimensional Laplace equation with forcing as
Here we use a onehiddenlayer neural network with a single neuron, see figure 12.
The input to the network is the independent variable and the network output is . The network output satisfies both the data and the physics, as shown in the figure. The loss function is given as
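A generic version of such a PINN loss can be sketched in plain Python for a one-hidden-layer, single-neuron tanh network u(x) = w2*tanh(w1*x + b1) + b2 applied to u''(x) = f(x). The parameter values and the forcing here are illustrative placeholders, not the ones from the figure; because tanh is smooth, the second derivative of the network is available in closed form.

```python
import math

# single-neuron network: u(x) = w2 * tanh(w1*x + b1) + b2
# (illustrative fixed parameters; in a PINN these would be trained)
w1, b1, w2, b2 = 1.3, -0.2, 0.7, 0.1

def u(x):
    return w2 * math.tanh(w1 * x + b1) + b2

def u_xx(x):
    # closed form: d^2/dx^2 [tanh(z)] = -2*tanh(z)*(1 - tanh(z)^2) * w1^2
    t = math.tanh(w1 * x + b1)
    return w2 * w1**2 * (-2.0 * t * (1.0 - t * t))

def pinn_loss(f, xs, x_left, u_left, x_right, u_right):
    # mean squared physics residual of u'' = f at the collocation points xs,
    # plus the boundary-data misfit at the two end points
    residual = sum((u_xx(x) - f(x)) ** 2 for x in xs) / len(xs)
    boundary = (u(x_left) - u_left) ** 2 + (u(x_right) - u_right) ** 2
    return residual + boundary
```

Training would then minimize `pinn_loss` over (w1, b1, w2, b2); with a non-smooth activation such as ReLU, u_xx would vanish almost everywhere and the physics residual could not be enforced, which is why differentiability matters here.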