1 Introduction
Better architectures, newer activations, and faster optimization techniques, proposed almost daily, have fueled the huge success of deep learning. Most past work on activations has aimed at superior empirical performance, with justifications that are largely intuitive and lack proper mathematical investigation. As far as the definition of an activation function is concerned, the universal function approximation theorem (UFAT) lays out the most widely accepted one: a “non constant, bounded and continuous function”. While most activations conform to this criterion, NNs with (unbounded) linear units and their variants still satisfy UFAT (unboundedproof). The central goal in most application-oriented research involving neural networks (NNs) is higher performance, whether in terms of representing more complex underlying functions or of better generalization to unseen data. The performance of NNs is heavily affected by the optimization method used to train them, stochastic gradient descent (SGD, sgd). This gives rise to a set of desirable properties for activations that can potentially improve training time or guarantee convergence. Monotonicity, zero-centredness, and an appropriate gradient range are some of the widely studied properties. In this paper, we take a different route, with the objective of viewing activations from a theoretical perspective and characterizing their role in NNs.

When employing NN models, it is valid to ask which activation function would perform best. The deep learning community has attempted to answer this question in a variety of ways. Some analyze the gradients of activations to address the vanishing and exploding gradient problems. Others study the variance arising in NNs due to the shape of the activation function; for example, ELU (elu) saturates for negative input values, which contributes to its robustness to noisy inputs. A more rigorous way could be a grid search over all possible activations. swish uses an automated reinforcement-learning-based search over composite combinations of existing activation functions. However, the resulting search space is finite and therefore small. Here we want to develop analytic and more insightful ways to answer this question. We aim to formalize the concept of an activation function without any prior assumptions and without borrowing motivation from biology. What we ultimately want is a generalized form of activation function spanning a much bigger space.

One more factor, which is relatively sparsely studied, needs to be accounted for: the nature of the task and the training data. Ideally, the NN should be able to adapt to the variability arising in a learning problem from the type of task and the properties of the data. The same should hold for the activation, but evidently not all activations perform equally well on every task. Therefore, we propose approximating the optimal activation function (unique to each task and each data set) by assuming a functional form at each neuron with learnable parameters that are updated during training, thereby entering the domain of adaptive activation functions (AAFs).
prelu; maxout; NinN; apla represent some of the important milestones in past research on AAFs. SAAF proposes a piecewise polynomial activation function specifically for regression tasks and also highlights the difficulties experienced while training NNs with AAFs. If the AAF is too simplistic, it might not learn a good approximation of the optimal activation function (the one with the smallest training error). On the other hand, a highly flexible activation function with many parameters will result in severe overfitting. In light of these issues, a major contribution of this paper is (i) a working adaptive activation function called ‘SLAF’, motivated by polynomial approximation of univariate continuous functions. With respect to optimization, it is difficult to train AAFs with non-monotonic behaviour (tamingsine). Hence, we also provide a model setup for SLAF that allows neural networks activated with SLAF (SLNNs) to be trained with SGD, along with training routines for practical purposes. While we achieve similar performance on many tasks and outperform existing activations on synthetic datasets, the contributions are primarily important in an analytic sense. We provide insights into the nature of SLNNs along with (ii) mathematical bounds on the number of parameters needed to represent an SLNN as a function of its degree (Theorem 3). Since an SLNN can approximate any NN activated with a conventional activation function (appendix A.3), characteristics of conventional NNs can be related to those of SLNNs. We emphasize the multilayered architecture of NNs with SLAF through various experiments and provide theoretical grounds for our inferences. Importantly, we assume no constraints on our activation function except continuity and differentiability, for a generic characterization of activation functions. Note: We do not claim that UFAT holds for SLNNs, which theoretically is not a disadvantage.
We show experiments on standard datasets to demonstrate the approximation capabilities of SLNNs.
2 Motivation and Model Setup
This section describes the idea of defining the best learnable activation as a good approximation of the optimal activation function on a predefined basis. Consider a neural network model with learnable activation $F$ and, for simplicity, let $F^{*}$ be its optimal function for the specific task and data distribution. $\hat{F}$ denotes the projection of $F^{*}$ on a defined set of basis functions $\{\phi_i\}$ with $k$ elements, represented as in eq. (1). Hence, for a fixed basis, the whole problem of learning $F$ boils down to learning the right set of coefficients of the basis elements, which after training should ideally converge to the coefficients $a_i^{*}$ of $\hat{F}$.
(1) $\hat{F}(x) = \sum_{i=0}^{k-1} a_i^{*}\, \phi_i(x)$
Since the performance of the resulting NN is contingent on how close the learned activation is to the optimal one, one would like to select a basis that provides a good approximation of the optimal activation using only a small number of basis elements, thus reducing the computational complexity. Moreover, the goodness of a basis also depends on how easily the resulting SLNN can be optimized. In our work, we explore three different bases, viz. (i) the Even Mirror Fourier (EMFN) basis (emfn) with truncated input, (ii) the Taylor polynomial basis, and (iii) the Chebyshev basis. To compare these bases, we approximate the existing activation functions ReLU, sigmoid and tanh. It turns out that both the EMFN basis and the Taylor polynomial basis provide good enough approximations of existing activation functions; hence an SLNN using either basis would approximate these activations if they were the optimal ones. The EMFN basis elements contain sinusoidal terms, which makes them difficult to train using gradient descent (tamingsine). Taylor polynomials have a simpler analytic form, and their basis elements can be computed in polynomial time. Henceforth, we use only the Taylor approximation for all experimental and theoretical purposes. Figure 1 shows how activations are calculated with SLAF on a hidden layer of a standard NN.
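As a rough illustration of approximating existing activations on a polynomial basis, the sketch below fits a degree-7 polynomial to ReLU, sigmoid and tanh by least squares on a truncated input range. The interval [-3, 3], the degree, the grid size and the function name `fit_polynomial_activation` are assumptions chosen for illustration, and a least-squares fit is used as a stand-in; the fitting procedure used in the paper is not specified here.

```python
import numpy as np

def fit_polynomial_activation(target, degree, lo=-3.0, hi=3.0, n_points=601):
    """Least-squares coefficients a_0..a_degree of sum_i a_i * x**i fitted to `target` on [lo, hi]."""
    x = np.linspace(lo, hi, n_points)
    # Columns of the design matrix are the monomial basis elements x**0 .. x**degree.
    basis = np.vander(x, degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(basis, target(x), rcond=None)
    max_err = float(np.max(np.abs(basis @ coeffs - target(x))))
    return coeffs, max_err

# Activations to approximate (the interval and degree are illustrative assumptions).
activations = {
    "relu": lambda x: np.maximum(x, 0.0),
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh,
}

errors = {name: fit_polynomial_activation(f, degree=7)[1] for name, f in activations.items()}
for name, err in errors.items():
    print(f"{name}: max abs error of degree-7 fit on [-3, 3] = {err:.4f}")
```

The smooth activations are typically matched more closely than ReLU at the same degree, since the kink at the origin is hard to reproduce with a low-degree polynomial.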
2.1 Using Taylor Polynomial Basis
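Although the Taylor basis puts no restriction on the range of the input $x$, it cannot be directly employed in a NN optimized through SGD because of the nature of the gradients corresponding to each basis element. SLAF and its gradients can be written as
(2) $y = \mathrm{SLAF}(x) = \sum_{i=0}^{k-1} a_i\, x^{i}$
(3) $\dfrac{\partial y}{\partial a_i} = x^{i}, \qquad \dfrac{\partial y}{\partial x} = \sum_{i=1}^{k-1} i\, a_i\, x^{i-1}$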
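The following sketch evaluates a polynomial activation of the form $y = \sum_i a_i x^i$ together with its gradients with respect to the coefficients and the input, and checks the input gradient against a central finite difference. The coefficient values, the test inputs and the helper names `slaf` and `slaf_grads` are hypothetical and serve only to illustrate how the gradient scale depends on the powers of the input.

```python
import numpy as np

def slaf(x, a):
    """Evaluate y = sum_i a[i] * x**i elementwise for an array x."""
    powers = np.stack([x ** i for i in range(len(a))], axis=-1)
    return powers @ a

def slaf_grads(x, a):
    """Return (dy/da, dy/dx): dy/da[..., i] = x**i and dy/dx = sum_i i * a[i] * x**(i-1)."""
    dy_da = np.stack([x ** i for i in range(len(a))], axis=-1)
    dy_dx = sum(i * a[i] * x ** (i - 1) for i in range(1, len(a)))
    return dy_da, dy_dx

a = np.array([0.1, 0.5, -0.2, 0.05])          # hypothetical coefficients, four basis elements
x = np.array([-2.0, -0.5, 0.0, 1.5, 10.0])    # hypothetical pre-activations
dy_da, dy_dx = slaf_grads(x, a)

# Central finite difference as an independent check of dy/dx.
h = 1e-5
fd = (slaf(x + h, a) - slaf(x - h, a)) / (2 * h)
print(np.allclose(dy_dx, fd, rtol=1e-5, atol=1e-6))

# Largest magnitude of each coefficient gradient over the batch: it grows with the power i.
print(np.abs(dy_da).max(axis=0))
```

A single large input already dominates the gradient of the highest-order coefficient, which is the scale problem that motivates normalizing the basis functions.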
The proportionality of the gradient to the input and its powers can lead to exploding and vanishing gradients as the scale of the input changes. This effect is more pronounced as the depth or the degree of the SLNN increases. To handle this issue, we perform mean-variance normalization on each basis function. The transformed basis functions are then used in SLAF:
(4) $\hat{x}^{(i)} = \dfrac{x^{i} - \mu_i}{\sigma_i}, \qquad i = 1, \ldots, k-1$
(5) $y = a_0 + \sum_{i=1}^{k-1} a_i\, \hat{x}^{(i)}$
where the mean $\mu_i$ and variance $\sigma_i^{2}$ of the $i$-th basis function are computed over the training data set. The coefficients ($a_i$’s) can serve as a means of recovering the original mean and variance, thereby preserving information. The technique corrects the scale of very large or very small values of the basis functions and helps in faster convergence (effbpp). For CNNs, the $x$’s would denote a channel rather than a single feature. We store exponentially averaged means and variances calculated from training data statistics for performing normalization at test time, similar to batch normalization (bnorm).
3 Representation of SLNN
It is first important to understand the significance of studying SLNNs. We prove that any neural network whose activation function is Lipschitz continuous can be approximated by an SLNN to arbitrarily low error (refer to appendix A.3).
One interesting outcome of employing a polynomial activation is the resulting elegant polynomial representation of an SLNN. Specifically, a fully connected NN activated with SLAF can be completely expressed as a collection of multivariate polynomials of the input features up to the penultimate layer. The same representation holds for the complete NN if the final activation is affine (regression tasks); for classification tasks, where the final activation is a sigmoid or a softmax, the claims on representation are valid only up to the second-last layer of the NN.
Definitions: Consider a set $x = \{x_1, x_2, \ldots, x_n\}$. Let $P_d(x)$ denote the set of elements of the polynomial basis with degree $d$ constructed using the elements of set x. Then,
$P_d(x) = \left\{ \prod_{i=1}^{n} x_i^{p_i} \;:\; p_i \in \{0, 1, 2, \ldots\},\ \sum_{i=1}^{n} p_i \le d \right\}$
We call $P_d(x)$ the polynomial basis defined over set x with degree $d$.
For the sake of readability and clarity, the proofs of the following theorems have been moved to the appendix.
The cardinality of $P_d(x)$, the set of elements of the polynomial basis with degree $d$ constructed using the elements of set x, denoted by $|P_d(x)|$, is equal to $\binom{n+d}{d}$, where $n$ is the cardinality of set x.
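The count of basis elements can be checked by brute force for small cases. The sketch below enumerates all exponent tuples whose sum is at most the degree and compares the count with the binomial coefficient $\binom{n+d}{d}$; the helper name `monomial_exponents` and the chosen values of n and d are illustrative only.

```python
from itertools import product
from math import comb

def monomial_exponents(n, d):
    """All exponent tuples (p_1, ..., p_n) of non-negative integers with p_1 + ... + p_n <= d."""
    return [p for p in product(range(d + 1), repeat=n) if sum(p) <= d]

# Compare the enumerated count with the closed form for a few small cases.
for n, d in [(1, 3), (2, 2), (3, 4), (4, 3)]:
    count = len(monomial_exponents(n, d))
    print(f"n={n}, d={d}: enumerated={count}, closed form={comb(n + d, d)}")
```

For n = 2 and d = 2 the enumeration yields the six monomials 1, x1, x2, x1^2, x1*x2 and x2^2.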
Let $\mathbf{P}_d(x)$ denote the matrix containing the elements of the set $P_d(x)$, the polynomial basis of degree $d$ over x. Now we show that any general neural network having the structure described below can be completely represented in polynomial form.
Theorem 3. Consider a NN with $k$ hidden layers, input denoted as $x \in \mathbb{R}^{n}$ and output as $y \in \mathbb{R}^{m}$. If the activation used at the final layer is SLAF (degree = $d_{k+1}$) and all the hidden layers are activated with SLAF of degree $d_i$, where i is the index of the hidden layer, then the output of this NN can be reparametrized and written as
(6) $y_{m \times 1} = W_{m \times \binom{n+d}{d}}\; \mathbf{P}_d(x)_{\binom{n+d}{d} \times 1}$
where $d = \prod_{i=1}^{k+1} d_i$, called the degree of the SLNN, and $W$ are the new parameters. The subscripts in the notation denote matrix size.
Note: SLAF with degree equal to one is equivalent to linear/no activation. Therefore, this result directly holds for regression tasks.
Theorem 3 shows that the output of an SLNN can be represented as a collection of polynomials whose degree is determined by the SLAF applied at each layer of the NN. Note that the learnable parameters corresponding to all the linear operations and activations are absorbed into one single matrix, $W$. As a result, training an SLNN is theoretically equivalent to finding the optimal value of this new weight matrix. Although we do not provide a generalization of the above theorem for classification tasks, it remains valid up to the penultimate layer of an SLNN (with softmax or sigmoid at the final layer). It is easy to see that the resulting reparametrized form is equivalent to a linear classifier over the polynomial basis of the inputs for classification tasks and to polynomial regression for regression tasks. The direct result of this reparametrization is a hard upper bound on the total number of variables that the SLNN will have, given by $m\binom{n+d}{d}$, where $n$ is the number of input features, $d$ is the degree of the SLNN, and $m$ is one for regression or the number of classes for classification. One should be careful that this bound is only due to the theoretical equivalence and ignores the effect of the optimization algorithm. Different parametrizations (weightshare) and normalization techniques (bnorm) can hugely affect the empirical performance of NNs (SLNNs here).

Though it might seem that SLNNs are redundant for regression tasks, and that one would prefer polynomial regression, which is simpler and much more efficient, scalability issues severely limit its compatibility with high-dimensional data. As the degree increases, there is a surge in the number of features, which bottlenecks its practical applicability. Fortunately, this is much simpler in SLNNs, where the polynomial representation is learned implicitly, relaxing the memory limitations.
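To give a feel for the scalability argument, the sketch below compares the number of explicit polynomial features of a given degree with the parameter count of a small fully connected SLNN whose layers compose to the same degree. The hidden width of 64, the linear output layer and the use of one shared coefficient vector per hidden layer are assumptions made for this illustration, not settings reported in the paper.

```python
from math import comb

def polynomial_feature_count(n, d):
    """Number of monomials of total degree <= d in n variables."""
    return comb(n + d, d)

def dense_slnn_param_count(n, hidden, m, slaf_degree):
    """Weights, biases and per-layer SLAF coefficients of a fully connected SLNN.

    Assumption for this sketch: one shared coefficient vector of length
    slaf_degree + 1 per hidden layer and a linear output layer.
    """
    sizes = [n, *hidden, m]
    linear = sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))
    return linear + len(hidden) * (slaf_degree + 1)

n, m = 13, 1  # thirteen input features and a scalar output
explicit = polynomial_feature_count(n, 16)  # two layers of degree 4 compose to degree 16
implicit = dense_slnn_param_count(n, hidden=[64, 64], m=m, slaf_degree=4)
print(f"explicit degree-16 features: {explicit}")
print(f"SLNN parameters (assumed width 64): {implicit}")
```

The explicit feature count is the quantity bounded in the reparametrization, while the network itself stores only its layer weights and activation coefficients.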
4 Experiments
To demonstrate the effectiveness of SLNNs and gauge their performance with the backpropagation (BP) algorithm, we perform experiments on regression, classification, and learning sparse polynomials. In all the experiments, we apply an L2 regularization penalty on the activation coefficients.
4.1 SLNN as Polynomial regression
This section empirically validates the claim of Theorem 3 on both regression and classification tasks. In addition, it shows the effectiveness of SLNN in learning sparse polynomials.

Regression  Boston Housing:
In a regression setting, an SLNN can be completely reduced to a polynomial. Hence, it becomes equivalent to linear regression applied on the polynomial basis of the original input features. We compare the performance of four algorithms to verify the empirical validity of this statement, taking the optimization method into account as well. For this experiment, we use the Boston housing dataset, which has thirteen features; the task is to predict the house price. We test the following algorithms: (i) a standard neural network with two hidden layers, each activated with ReLU and accelerated with batch normalization (NNRELUBN), along with L2 regularization; (ii) an SLNN with two hidden layers, each activated with SLAF of degree four and optimized with SGD (SLNN); (iii) linear regression on polynomial features with SGD (LRSGD) and L1 regularization; (iv) Lasso linear regression (LLS) optimized with coordinate descent.

Algorithm   Degree/Description        Training RMSE   Testing RMSE
NNRELUBN    2 Hidden Layers           1.32            3.78
SLNN        , ,                       2.09            3.98
LRSGD       Degree=8, Penalty=0.01    22.03           22.69
LLS         Degree=8, Penalty=0.01    1.59            3.06
Table 1: Comparison of the four algorithms on the Boston housing dataset. NNRELUBN, with a model similar to SLNN, provides a baseline for comparison with the other methods. Note: We use the Adam optimizer in place of vanilla SGD for optimization. RMSE stands for root mean squared error.

As a result of Theorem 3, the underlying representation is the same in all three methods SLNN, LRSGD, and LLS. However, the performance of LRSGD is quite poor compared to the other two algorithms. Being theoretically the same, the global minimum of LRSGD is the same as that of the other two. The reason is the suboptimality of SGD in converging to the global minimum. Since SLNN has more parameters than required for its polynomial representation, there are more global minima in the space of learnable parameters, whereas LRSGD has exactly the number of parameters needed for its representation. Due to this redundancy in the network parameters (globalminima), SLNN tends to converge easily with SGD and exhibits performance similar to LLS (a derivative-free optimization method). One should note that, unlike SLNN, neither LLS nor LRSGD scales to higher-dimensional inputs (Theorem 3).

Classification  Two Spiral:
In the previous case, we observed that LLS and SLNN perform similarly in terms of test error, with LLS converging much faster. However, for classification tasks, we observe that logistic regression over the polynomial basis combined with other optimization methods (sklearn) does not turn out to be as beneficial. At the same time, SLNN performs significantly better and converges most of the time. Hence, SLNN is not hindered in learning a polynomial feature space even in a classification setting and therefore becomes advantageous. To demonstrate this, we use the two-spiral classification problem, tested with (i) a conventional NN with batch normalization and two hidden layers with ReLU activation, (ii) an SLNN with two hidden layers, each with SLAF activation of degree seven, and (iii) logistic regression on a polynomial basis of degree fourteen with SAGA (saga) optimization (LRSAGA). We skip the results of other approaches for lack of relevance. Note that we specifically choose a smaller two-layered architecture to demonstrate how a higher-degree SLAF can compensate for a deeper/wider NN with the same number of parameters. The training and test datasets are synthetically generated and randomly chosen. Figure 2 shows the classification boundary learnt by SLNN (a) and NN (b). Owing to the underlying assumption of differentiability of the activation function, the boundary learned by SLNN is itself smooth and provides good generalization (extending both spirals will decrease classification accuracy in the case of NN). In contrast, the boundary learnt by the NN with ReLU activation displays sharp turns and is aesthetically unpleasing. This is again due to the form of ReLU, which has a bend at the origin.
Figure 2: (a) SLNN’s () classification map (b) Standard NN’s classification map.

Algorithm   Train Accuracy (%)   Test Accuracy (%)
NN          80.14                80.98
SLNN        99.41                99.69
LRSAGA      74.69                74.60
Table 2: Comparison
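The text states only that the two-spiral data are synthetically generated and randomly split. A minimal sketch of one common way to generate such data is given below; the number of points, the number of turns, the noise level, the 80/20 split and the function name `make_two_spirals` are all assumptions, not the settings used for Table 2.

```python
import numpy as np

def make_two_spirals(n_per_class=500, turns=2.0, noise=0.05, seed=0):
    """Return (X, y): 2-D points on two interleaved spirals with labels 0 and 1."""
    rng = np.random.default_rng(seed)
    t = np.sqrt(rng.uniform(0.0, 1.0, n_per_class)) * turns * 2.0 * np.pi
    arm = np.column_stack([t * np.cos(t), t * np.sin(t)]) / (turns * 2.0 * np.pi)
    # The second spiral is the first one rotated by 180 degrees about the origin.
    X = np.vstack([arm, -arm]) + rng.normal(0.0, noise, (2 * n_per_class, 2))
    y = np.concatenate([np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)])
    return X, y

X, y = make_two_spirals()

# Random 80/20 train/test split (the split ratio is an assumption).
perm = np.random.default_rng(1).permutation(len(X))
split = int(0.8 * len(X))
X_train, y_train = X[perm[:split]], y[perm[:split]]
X_test, y_test = X[perm[split:]], y[perm[split:]]
print(X_train.shape, X_test.shape)
```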
Regression  Learning Sparse Polynomial:
Now we shift our focus to the task of learning sparse polynomials (which have a small number of monomial terms with non-zero coefficients). The ability of NNs to approximate polynomials has been studied theoretically (polyapprox). nnpolylearn proves that, irrespective of the activation function, a single-layered neural network can learn a k-sparse polynomial (with k monomial terms) of small degree in a finite number of iterations with an appropriate number of hidden nodes. Along similar lines, we design experiments to show that, in practice, the choice of activation affects the generalization of the neural network to unseen data points. We experiment with polynomials of degrees three and four in one hundred variables (with standard normal distribution) having ten monomial terms. A three-layered architecture is used for this experiment. NN uses hyperbolic tangent as activation, accelerated with batch normalization, and SLNN uses the same architecture with SLAF activation. We employ L1 regularization on the first layer of both models to take the sparsity into account. Note that here we do not use the same activation weights/coefficients for each hidden node across a layer.
Model           Degree   Training MSE   Testing MSE
NN(Tanh/ReLU)   3        0.06/0.03      2.25/0.40
NN(Tanh/ReLU)   4        0.12/0.12      16.24/12.90
SLNN            3        0.03           0.03
SLNN            4        0.03           0.03
Table 3: Comparing NN and SLNN on learning polynomials.
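A sparse polynomial target of the kind described in this experiment (one hundred standard normal variables, ten monomial terms, degree three or four) could be generated along the following lines. How the monomials and their coefficients were actually drawn is not stated, so the sampling scheme, the sample size and the helper names `make_sparse_polynomial` and `evaluate` are assumptions.

```python
import numpy as np

def make_sparse_polynomial(n_vars=100, n_terms=10, degree=3, seed=0):
    """Sample a sparse polynomial as (exponents, coeffs); each monomial has total degree `degree`."""
    rng = np.random.default_rng(seed)
    exponents = np.zeros((n_terms, n_vars), dtype=int)
    for row in exponents:
        # Pick `degree` variables with replacement; repeated picks raise the power.
        for j in rng.integers(0, n_vars, size=degree):
            row[j] += 1
    coeffs = rng.normal(size=n_terms)  # assumed coefficient distribution
    return exponents, coeffs

def evaluate(X, exponents, coeffs):
    """Evaluate sum_t coeffs[t] * prod_j X[:, j] ** exponents[t, j] for every row of X."""
    monomials = np.prod(X[:, None, :] ** exponents[None, :, :], axis=-1)
    return monomials @ coeffs

exponents, coeffs = make_sparse_polynomial()
X = np.random.default_rng(1).standard_normal((256, 100))  # standard normal inputs
y = evaluate(X, exponents, coeffs)
print(X.shape, y.shape, exponents.sum(axis=1))
```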
4.2 Standard Classification Tasks
Since the chosen basis restricts the subspace of activations to polynomial approximations of finite degree, it might seem that the model capacity would be greatly reduced. Even though it is a challenge to optimize SLNNs with SGD, we show that SLNNs can perform comparably even with finite-degree polynomial representations. In this section, we compare the performance of SLNNs on standard classification datasets with that of NNs activated with ReLU. Note that we want to showcase the approximation power of SLNNs and therefore avoid experimenting with other activations.

MNIST: MNIST is a standard handwritten digit image classification dataset. We experiment with a custom convolutional neural network (NNRELU) with two convolutional layers involving ReLU activation, batch normalization and max pooling, followed by two fully connected layers, the latter being a standard softmax layer. We replace all the ReLU activations with SLAF and call the resulting model SLNN.
Algorithm   Degree   Test Accuracy (%)
NNRELU      -        99.34
SLNN                 99.55
Table 4: Comparison on MNIST Dataset
CIFAR10: This is another image classification dataset consisting of 60000 images, each labeled with one of ten classes. We use a ResNet architecture with ReLU activation. We show two variants of ResNet: one where the activation of the first layer is replaced by SLAF of degree two, and another where all the activations are replaced with SLAFs of degree two.
Architecture   # Layers   # Activation functions replaced by SLAF   # Parameters   Error (%)
ResNet         32         -                                         0.46M          7.51
ResNet         44         -                                         0.66M          7.17
ResNet         32         1 (SLAF )                                 0.46M          7.12
ResNet         32         31 (SLAF )                                0.46M          8.50
Table 5: Testing error on CIFAR10 using different architectures and activation functions. is the order of Taylor series used.
Fashion MNIST: This is another benchmarking dataset, developed as a drop-in replacement for the MNIST dataset. We use a small residual network with two residual blocks and two fully connected layers followed by a softmax layer. NNRELU uses only ReLU activation at all layers, whereas SLNN uses SLAF of degree 2 at each layer. We also consider the case where only the final activation is replaced by SLAF (NN(ReLU+SLAF)).
Algorithm       Degree   Architecture                                  Acc (%)
NNRELU          -        2 Conv + 2 Res. Blocks + 2 Fc                 93.56
SLNN                     2 Conv + 2 Res. Blocks + 2 Fc                 92.97
NN(ReLU+SLAF)   2*       2 Conv + 2 Res. Blocks + 2 Fc (ReLU, SLAF)    93.71
Table 6: Comparison on Fashion MNIST Dataset with no data augmentation.
* Only describes the sum of degrees of SLAF used and not for the entire NN.
5 Discussion
We perform experiments on three standard benchmarking datasets (MNIST, FMNIST, and CIFAR10) with the same architecture but different activations. The NN with all ReLU activations ranks second among the three approaches chosen for experimentation. The NN with all SLAF activations gives approximately similar classification accuracies. We attribute this to the optimization difficulties experienced while training SLNNs due to the non-monotonic and unbounded nature of SLAF. Also, unlike other activations, polynomials have higher-degree terms, which cause the output to grow very rapidly with the input. This results in exploding activations on test data, which is somewhat mitigated by L2 regularization applied to the network and activation weights. The third method, which replaces only one ReLU activation with SLAF, provides an incremental improvement in classification accuracy. Since the SLAF activation is adaptive, regularization of the activation coefficients is observed to reduce overfitting (empirically verified). Note that we do not experiment with the third method on the MNIST dataset because the architecture used is much smaller, which makes the optimization of the SLNN (all activations replaced) easier and therefore renders the third method irrelevant.
One might expect that employing SLAF activations of higher degree can compensate for a larger number of layers in deep neural networks. In practice, however, we observe that doubling the degree does not yield the same performance as adding one layer with SLAF of degree two. We only provide an intuitive explanation here, based on the assumption that increasing the number of global minima in the parameter space allows SGD to converge to one of them (globalminima). Consider a NN with inputs and with only one hidden layer having hidden units activated with SLAF of degree four s.t . The output is a weighted sum of the activations. Suppose the hidden layer is replaced with two hidden layers, each with hidden nodes having SLAF activation of degree two. The number of extra parameters introduced in the newer architecture would be (much greater than the coefficients for polynomial representation), while the underlying representation is the same, i.e. a polynomial of degree four. This means that any polynomial of degree four must have more than one configuration of the SLNN (such that the resulting polynomial has the same coefficient values). This implies that duplicates of the global minimum (which will also be a polynomial) are introduced in the newer parameter space, thereby making optimization easier.
6 Conclusion
We present a new form of activation function motivated by polynomial approximation of univariate functions. The activation is learned during training by searching a space of finite-degree polynomials. We provide an in-depth analysis of NNs activated with the polynomial activation, referred to as SLAF, along with bounds on the number of parameters theoretically required for the underlying polynomial representation of an SLNN. We show that SLNNs perform on par with standard NNs through experiments on standard benchmarking datasets. Finally, we provide an intuitive explanation of how different parametrizations of SLNNs improve empirical performance, possibly due to properties of the SGD algorithm.
References
Appendix A
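Assumptions: Consider a set $x = \{x_1, x_2, \ldots, x_n\}$. Let $P_d(x)$ denote the set of elements of the polynomial basis with degree $d$ constructed using the elements of set x. Then,
$P_d(x) = \left\{ \prod_{i=1}^{n} x_i^{p_i} \;:\; p_i \in \{0, 1, 2, \ldots\},\ \sum_{i=1}^{n} p_i \le d \right\}$
We call $P_d(x)$ the basis set on x having degree $d$. We also define the monomial set on x, $M_d(x)$, as the set which contains all monomials of degree $d$, i.e.,
$M_d(x) = \left\{ \prod_{i=1}^{n} x_i^{p_i} \;:\; p_i \in \{0, 1, 2, \ldots\},\ \sum_{i=1}^{n} p_i = d \right\}$
Clearly, $P_d(x) = \bigcup_{j=0}^{d} M_j(x)$, where $\bigcup$ denotes the union operator over sets.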
Theorem A.1
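The cardinality of $P_d(x)$, the set of elements of the polynomial basis with degree $d$ constructed using the elements of set x, denoted by $|P_d(x)|$, is equal to $\binom{n+d}{d}$, where $n$ is the cardinality of set x.
To find the cardinality of the set $P_d(x)$, consider the following inequality,
(7) $p_1 + p_2 + \cdots + p_n \le d, \qquad p_i \in \{0, 1, 2, \ldots\}$
The cardinality of the set $P_d(x)$ is equal to the number of non-negative integer solutions of (7). To find the number of solutions, this inequality can be broken down into $d+1$ equalities, as follows,
(8) $p_1 + p_2 + \cdots + p_n = j, \qquad j = 0, 1, \ldots, d$
Now, it is straightforward to see that for a fixed j, the number of solutions to (8) is equal to $\binom{j+n-1}{n-1}$, where $\binom{a}{b}$ is the number of ways of choosing $b$ items from $a$ items. Now, we can write the cardinality as the following summation
(9) $|P_d(x)| = \sum_{j=0}^{d} \binom{j+n-1}{n-1}$
Consider the following recurrence relation, which is true for all integers $a \ge b \ge 1$,
(10) $\binom{a}{b} = \binom{a-1}{b} + \binom{a-1}{b-1}$
Applying (10) repeatedly, starting from $\binom{n-1}{n-1} = \binom{n}{n}$, we can write (9) as
(11) $|P_d(x)| = \binom{n+d}{n} = \binom{n+d}{d}$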
Claim A.1
Let $p(x)$ denote a polynomial of degree $d$ in x. If $p(x)$ is transformed by the function $f(z) = z^{k}$, then the resulting polynomial has degree $kd$ in x.
The function $f(p(x))$ can be easily seen as $p(x)$ multiplied with itself $k$ times.
(12) $f(p(x)) = \big(p(x)\big)^{k} = \underbrace{p(x) \cdot p(x) \cdots p(x)}_{k \text{ times}}$
Now, if a polynomial is multiplied with itself, it must remain a polynomial in the same input. Therefore, $f(p(x))$ will be a polynomial in x. Now, consider the monomial term in $p(x)$ with the highest power $d$; when this is multiplied with itself $k$ times, it results in the power $kd$. Clearly, there cannot exist a monomial term with power higher than $kd$. Hence, the degree of the polynomial $f(p(x))$ must be $kd$.
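The degree argument can be checked numerically in the univariate case. The sketch below composes a polynomial activation with a polynomial input using NumPy's `Polynomial` class and reports the degree of the result; it uses a full polynomial activation rather than a single power, since the highest power determines the degree. All coefficient values and the helper name `apply_polynomial_activation` are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

def apply_polynomial_activation(p, coeffs):
    """Compose sum_i coeffs[i] * z**i with z = p(x) using Horner's rule on Polynomial objects."""
    result = Polynomial([coeffs[-1]])
    for c in reversed(coeffs[:-1]):
        result = result * p + c
    return result

p = Polynomial([1.0, -2.0, 0.5])        # degree d = 2 in x
activation = [0.3, 0.0, -1.0, 2.0]      # degree k = 3, hypothetical coefficients
q = apply_polynomial_activation(p, activation)
print(q.degree())                       # degree of the composition, k * d

# A second polynomial layer of degree 2 multiplies the degree again.
r = apply_polynomial_activation(q, [0.0, 1.0, 0.25])
print(r.degree())
```

Stacking layers in this way is what makes the degree of the SLNN the product of the per-layer degrees.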
Theorem A.2
Consider an SLNN with $k$ hidden layers and input denoted as $x \in \mathbb{R}^{n}$ and output as $y \in \mathbb{R}^{m}$. If the activation used at the final layer is linear and all the hidden layers are activated with SLAF of degree $d_i$, where $i$ is the index of the hidden layer, then the output of this NN can be reparametrized and written as
(13) $y_{m \times 1} = W_{m \times \binom{n+d}{d}}\; \mathbf{P}_d(x)_{\binom{n+d}{d} \times 1}$
where $d = \prod_{i=1}^{k} d_i$, called the degree of the SLNN, $W$ are the new parameters, and $\mathbf{P}_d(x)$ is the vector containing the polynomial features of degree $d$ in $x$. The subscripts in the notation denote matrix size.
Note: SLAF with degree equal to one is equivalent to linear/no activation. Therefore, this result directly holds for regression tasks.
As a direct result of Claim A.1, it is easy to see that any layer of an SLNN can be expressed as a collection of polynomials in the SLNN’s inputs. Without loss of generality, let the degree of the polynomial obtained as the output of the $(i-1)$th layer be $D_{i-1}$. Now, if the degree of the SLAF used in the $i$th layer is $d_i$, then its output will be a polynomial of degree $d_i D_{i-1}$ (using Claim A.1). Now, given that $d = \prod_{i=1}^{k} d_i$, each output node of the SLNN, denoted by $y_j$, is expressible as a polynomial of degree $d$ and can therefore be reparametrized as
$y_j = w_j^{\top}\, \mathbf{P}_d(x), \qquad j = 1, \ldots, m$
Or,
$y = W\, \mathbf{P}_d(x)$
where $W$ is a matrix of constants which can be easily obtained from the weights of the SLNN.
Theorem A.3
A neural network with SLAF can approximate any neural network architecture, given that its input domain is bounded and its activation function is Lipschitz continuous, to any desired error as a function of the degree of SLAF.
First, let us look at the Weierstrass Approximation Theorem. It states that for any continuous real-valued function $f$ defined on the interval $[a, b]$ and for every $\epsilon > 0$, there exists a polynomial $p$ s.t. for all $x \in [a, b]$, we have
(14) $|f(x) - p(x)| < \epsilon$
It is also well known that if the function f(x) is not a polynomial, then the degree of the polynomial $p$ approaches infinity as $\epsilon$ approaches zero. Let us denote the approximation error by $\epsilon_d$ if the polynomial $p$ has degree less than or equal to $d$. Then for a fixed $f$ on the interval $[a, b]$, it is easy to see that,
(15) 
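The dependence of the approximation error on the polynomial degree can be illustrated numerically. The sketch below fits polynomials of increasing degree to ReLU on a bounded interval and reports the maximum error on a grid. A least-squares fit in the Chebyshev basis is used here as a convenient stand-in, since the theorem only guarantees that a suitable polynomial exists; the interval [-3, 3], the degrees and the helper name `max_fit_error` are assumptions.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def max_fit_error(f, degree, a=-3.0, b=3.0, n_points=2001):
    """Max absolute error on [a, b] of a degree-`degree` least-squares polynomial fit to f."""
    x = np.linspace(a, b, n_points)
    p = Chebyshev.fit(x, f(x), degree)  # polynomial expressed in the Chebyshev basis
    return float(np.max(np.abs(p(x) - f(x))))

relu = lambda x: np.maximum(x, 0.0)  # Lipschitz continuous, not a polynomial
errors = {d: max_fit_error(relu, d) for d in (3, 7, 15, 31)}
for d, e in errors.items():
    print(f"degree {d:2d}: max error {e:.4f}")
```

On a wider interval the same degree gives a larger error, which is consistent with the role the interval length plays later in the proof.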
Consider a neural network with activation function $\sigma$ which is Lipschitz continuous. Let K be the Lipschitz constant of $\sigma$, s.t. for all $x_1, x_2$ it follows:
(16) $|\sigma(x_1) - \sigma(x_2)| \le K\, |x_1 - x_2|$
The layer of the NN has hidden units and its linear component be denoted by followed by activation which follows:
(17)  
(18) 
Now consider an Approximate NN (ANN) with all activations replaced by different approximations based on polynomial approximation denoted by . The th layer with hidden units has a linear component followed by activation which follows:
(19) 
(20) 
Now, to get the recursion in error propagated from layer to layer, assume that the approximation error at layer at activation is upper bounded by . Then we can write,
(21)  
(22) 
Now,
(23)  
(24)  
(25) 
Now, consider ,
(26) 
Now from Weierstrass approximation theorem we know that for every there exists a polynomial of degree denoted by which will satisfy :
(27)  
(28)  
(29) 
The interval over which approximation holds is easy to calculate for bounded activations. For , we can write
(30)  
(31) 
We will now drop the super scripts and sub scripts for sake of clarity. The length of , depends upon the weights in layer, width of the hidden layer and the range of . It is easy to see that as the width increases the length of increases requiring higher polynomial degree to maintain approximation error. Therefore, the degree of the polynomial in SLNNs acts as a sort of proxy for width in standard NNs. Note that this bound would still hold even if the activation function is unbounded (for ReLU, SeLU, ELU etc) since the input domain is restricted. Now, we can write:
(32) 
Hence, we get
(33) 
From eq. (33), we can see the recursive expression of approximation error is a function of , the degree of polynomial used for approximation . Since, (for the input layer), the expression for would be proportional to . This means that by varying , any approximation error can be achieved.
Remainder omitted in this sample. See http://www.jmlr.org/papers/ for full paper.