Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks

07/15/2019 ∙ by Alejandro Molina, et al. ∙ Technische Universität Darmstadt

The performance of deep network learning strongly depends on the choice of the non-linear activation function associated with each neuron. However, deciding on the best activation is non-trivial, and the choice depends on the architecture, the hyper-parameters, and even on the dataset. Typically these activations are fixed by hand before training. Here, we demonstrate how to eliminate the reliance on first picking fixed activation functions by using flexible parametric rational functions instead. The resulting Padé Activation Units (PAUs) can both approximate common activation functions and learn new ones while providing compact representations. Our empirical evidence shows that end-to-end learning of deep networks with PAUs can increase the predictive performance and reduce the training time of common deep architectures. Moreover, PAUs pave the way to approximations with provable robustness. The source code can be found at




1 Introduction

An important building block of deep learning is the non-linearity introduced by the activation functions. They play a major role in the success of training deep neural networks, both in terms of training time and predictive performance. Consider, e.g., the Rectified Linear Unit (ReLU) due to Nair and Hinton [2010]. The demonstrated benefits in training deep networks, see e.g. [Glorot et al., 2011], brought renewed attention to the development of new activation functions. Since then, several ReLU variations with different properties have been introduced, such as LeakyReLUs [Maas et al., 2013], PReLUs [He et al., 2015], ELUs [Clevert et al., 2016], and RReLUs [Xu et al., 2015], among others. However, the shape of the activation function is rather rigid, except for some cases where minor parameters introduce variations in the class of the activation.

Therefore, another line of research, such as [Ramachandran et al., 2018], automatically searches for activation functions. It identified the Swish unit empirically as a good candidate. However, for a given dataset, there is no guarantee that the Swish unit behaves well, and the proposed search algorithm is computationally quite demanding. Consequently, learnable activation functions have been proposed. They exploit parameterized activation functions that adapt in an end-to-end fashion to the datasets at hand during parameter training. For instance, Goodfellow et al. [2013] and Zhao et al. [2017] used a fixed set of piecewise linear components and optimized their parameters. Although they are theoretically universal function approximators, they strongly depend on hyper-parameters such as the number of components to realize this potential. Vercellino and Wang [2017] used a meta-learning approach for learning task-specific activation functions (hyperactivations). However, as the authors described, the implementation of hyperactivations, while easy to express notationally, can be frustrating to implement for generalizability over any given activation network. Recently, Goyal et al. [2019] proposed a learnable activation function based on a Taylor approximation and suggested a transformation strategy to avoid exploding gradients in deep networks. However, relying on polynomials suffers from well-known limitations such as exploding values in the limits and a tendency to oscillate [Trefethen, 2012]. Furthermore, and more importantly, it constrains the network so that it is no longer a universal function approximator [Leshno et al., 1993].

Here, we introduce a learnable activation function based on the Padé approximation, i.e., the “best” approximation of a function by a rational function of a given order. In contrast to approximations for high-accuracy hardware implementations of the hyperbolic tangent and the sigmoid activation functions [Hajduk, 2018], we do not assume fixed coefficients. The resulting Padé Activation Units (PAUs) can be learned using standard stochastic gradient descent and, hence, seamlessly integrated into the deep learning stack. This provides high flexibility, faster training and better performance of deep learning, as we will demonstrate.

We proceed as follows. We start off by introducing PAUs. Then we sketch that they are universal approximators. Before concluding, we present our empirical evaluation on image classification.

2 Padé Activation Units (PAU)

The line of research investigating the ability of neural networks to approximate functions dates back at least to the 1980s. The universal approximation theorem states that depth-2 neural networks with a suitable activation function can approximate any continuous function on a compact domain to any desired accuracy, see e.g. [Hornik et al., 1989]. Unfortunately, polynomial activation functions are not enough [Leshno et al., 1993]. Even practically speaking, polynomial approximations tend to oscillate and overshoot [Trefethen, 2012]. Indeed, a higher order of polynomials could considerably reduce the oscillation, but doing so also increases the computational cost. In contrast, neural networks and rational functions efficiently approximate each other [Telgarsky, 2017]. Motivated by this, we propose a new type of activation function: the Padé Activation Unit (PAU). As Fig. 1 illustrates, common activation functions can be well represented using PAUs.

Figure 1: Approximations of common activation functions (ReLU, Sigmoid, Tanh, Swish and Leaky ReLU ()) using PAUs (marked with *). As one can see, PAUs can encode common activation functions very well.

2.1 Padé Approximation of Activation Functions

Assume for the moment that we start with a fixed activation function $f(x)$. The Padé approximant is then the “best” approximation of a function by a rational function of given orders $m$ and $n$. More precisely, given $f(x)$ and the orders $m$ and $n$, the Padé approximant [Brezinski and Van Iseghem, 1994] of order $(m, n)$ is the rational function $F(x)$ over polynomials $P(x)$, $Q(x)$ of order $m$, $n$ of the form

$$F(x) = \frac{P(x)}{Q(x)} = \frac{\sum_{j=0}^{m} a_j x^j}{1 + \sum_{k=1}^{n} b_k x^k} = \frac{a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m}{1 + b_1 x + b_2 x^2 + \cdots + b_n x^n},$$

which agrees with $f(x)$ the best. The Padé approximant often gives a better approximation of a function than truncating its Taylor series, and it may still work where the Taylor series does not converge. For these reasons it has been used before in the context of graph convolutional networks [Chen et al., 2018]. For general deep networks, however, Padé approximants have not been considered so far.
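To make the comparison with a truncated Taylor series concrete, the following sketch evaluates the textbook [2/2] Padé approximant of the exponential function next to its degree-4 Taylor polynomial; both match the same number of derivatives at zero. The exponential is only an illustrative choice and the helper names `polyval` and `rational` are hypothetical; neither comes from the paper.

```python
import math

def polyval(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + ... with Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def rational(a, b, x):
    """F(x) = P(x)/Q(x), P = sum_j a[j] x^j, Q = 1 + sum_k b[k-1] x^k."""
    return polyval(a, x) / polyval([1.0] + list(b), x)

# [2/2] Padé approximant of exp(x) around 0 (standard textbook coefficients).
a = [1.0, 1.0 / 2.0, 1.0 / 12.0]
b = [-1.0 / 2.0, 1.0 / 12.0]

# Degree-4 Taylor polynomial of exp(x) around 0.
taylor = [1.0 / math.factorial(k) for k in range(5)]

x = 1.0
pade_err = abs(rational(a, b, x) - math.exp(x))
taylor_err = abs(polyval(taylor, x) - math.exp(x))
print(f"Pade error:   {pade_err:.6f}")
print(f"Taylor error: {taylor_err:.6f}")
```

At x = 1 the rational form has an error of roughly 0.004, against roughly 0.010 for the polynomial. Note that this unrestricted denominator has a real root (a pole) further out on the positive axis, which is exactly the kind of behaviour a learnable activation has to guard against.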

Indeed, the flexibility of Padé is not only a blessing but might also be a curse: it can model processes that contain poles. For a learnable activation function, however, a pole may produce NaN values depending on the input, as well as instabilities at learning and inference time. Therefore we consider a restriction, called safe Padé approximation, that guarantees that the denominator $Q(x)$ is larger than or equal to 1, i.e., $Q(x) \geq 1$ for all $x$, preventing poles and allowing for safe computation on $\mathbb{R}$:

$$F(x) = \frac{P(x)}{Q(x)} = \frac{\sum_{j=0}^{m} a_j x^j}{1 + \left|\sum_{k=1}^{n} b_k x^k\right|}.$$


2.2 Learning Safe Padé Approximations using Backpropagation

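In contrast to the standard way of fitting Padé approximants, where the coefficients are found via derivatives and algebraic manipulation against a pre-defined function, we are interested in optimizing the polynomials via (stochastic) gradient descent, so that we can put the rational approximation onto the standard differentiable programming stack and simply learn the coefficients from data. To this end, we now provide all the partial derivatives needed for the update of the coefficients using backpropagation. Write $P(x) = \sum_{j=0}^{m} a_j x^j$, $A(x) = \sum_{k=1}^{n} b_k x^k$ and $Q(x) = 1 + |A(x)|$, so that $F(x) = P(x)/Q(x)$. Based on the polynomial gradients

$$\frac{\partial P(x)}{\partial x} = a_1 + 2 a_2 x + \cdots + m a_m x^{m-1}, \qquad \frac{\partial Q(x)}{\partial x} = \frac{A(x)}{|A(x)|}\left(b_1 + 2 b_2 x + \cdots + n b_n x^{n-1}\right),$$

the partial derivatives required for backpropagation are:

$$\frac{\partial F}{\partial x} = \frac{\partial P(x)}{\partial x}\frac{1}{Q(x)} - \frac{\partial Q(x)}{\partial x}\frac{P(x)}{Q(x)^2}, \qquad \frac{\partial F}{\partial a_j} = \frac{x^j}{Q(x)}, \qquad \frac{\partial F}{\partial b_k} = -x^k \frac{A(x)}{|A(x)|}\frac{P(x)}{Q(x)^2}.$$

To avoid divisions by zero in the computation of the gradients, we replace operations of the form $z/|z|$ by the sign of $z$.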
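The forward pass and its partial derivatives can be sketched for a single scalar input in plain Python. This is only an illustration under the assumption that the safe denominator is one plus the absolute value of the polynomial in the coefficients b; the function name `safe_pau`, the helper names and the coefficient values are hypothetical, and the paper's actual implementation is a CUDA extension for PyTorch.

```python
def poly(coeffs, x, start=0):
    """Return sum_i coeffs[i] * x**(i + start)."""
    return sum(c * x ** (i + start) for i, c in enumerate(coeffs))

def sign(z):
    """Sign of z (-1, 0 or 1); used in place of z/|z| to avoid dividing by zero."""
    return (z > 0) - (z < 0)

def safe_pau(x, a, b):
    """Safe PAU for scalar x: returns F(x), dF/dx, [dF/da_j], [dF/db_k]."""
    P = poly(a, x)                    # numerator, powers 0..m
    A = poly(b, x, start=1)           # inner polynomial, powers 1..n
    Q = 1.0 + abs(A)                  # denominator, always >= 1
    s = sign(A)
    dP = sum(j * a[j] * x ** (j - 1) for j in range(1, len(a)))
    dA = sum(k * bk * x ** (k - 1) for k, bk in enumerate(b, start=1))
    dQ = s * dA
    dF_dx = dP / Q - dQ * P / Q ** 2
    dF_da = [x ** j / Q for j in range(len(a))]
    dF_db = [-s * x ** k * P / Q ** 2 for k in range(1, len(b) + 1)]
    return P / Q, dF_dx, dF_da, dF_db

# Arbitrary illustrative coefficients (not taken from the paper).
a = [0.1, 0.5, -0.3, 0.2]
b = [0.4, -0.2]
x = 0.7
F, dF_dx, dF_da, dF_db = safe_pau(x, a, b)

# Central finite differences as an independent check of the analytic gradients.
eps = 1e-6
num_dx = (safe_pau(x + eps, a, b)[0] - safe_pau(x - eps, a, b)[0]) / (2 * eps)
num_db0 = (safe_pau(x, a, [b[0] + eps, b[1]])[0]
           - safe_pau(x, a, [b[0] - eps, b[1]])[0]) / (2 * eps)
print(dF_dx, num_dx)
print(dF_db[0], num_db0)
```

The finite-difference check is meaningful here because the inner polynomial is well away from zero at the chosen input; exactly at a zero of that polynomial the absolute value is not differentiable and the sign convention picks the subgradient 0.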

Having the function and the gradients, we can put PAUs onto the differentiable programming stack. Indeed, as every PAU contains additional tunable parameters, the number of activations increases the complexity of the model and the learning time. To ameliorate this, and inspired by the idea of weight-sharing as introduced by Teh and Hinton [2001], we propose to share the PAU parameters across all neurons in a layer, significantly reducing the extra number of parameters required.

2.3 Initializing Padé Activation Units (PAUs)

Although one can initialize the coefficients randomly and allow the optimizer to train the network end-to-end, we obtained better results after initializing the activation function to approximate previously known activation functions. This initialization involves a preceding optimization step. For continuous activation functions, we employ the standard Padé fitting methods. For functions with discontinuities, we optimize the coefficients with a least-squares loss over a line range (see Appendix B).

3 Padé Networks are Universal Approximators

Before presenting the results of our image classification experiments, let us touch upon the expressive power of PAUs. A standard multi-layer perceptron (MLP) with enough hidden units and nonpolynomial activation functions is a universal approximator, see e.g. [Hornik et al., 1989, Leshno et al., 1993]. Similarly, Padé networks, i.e., feedforward networks with (potentially unsafe) PAUs that may include convolutional and residual architectures with max- or sum-pooling layers, are universal approximators. This can be sketched as follows. Lu et al. [2017] have shown a universal approximation theorem for width-bounded ReLU networks: width-$(d+4)$ ReLU networks, where $d$ is the input dimension, are universal approximators. ReLU networks, however, can be $\epsilon$-approximated using rational functions, requiring a representation whose size is polynomial in $\ln(1/\epsilon)$ [Telgarsky, 2017]. Thus, it follows that any continuous function can be approximated arbitrarily well on a compact domain by a Padé network with one (potentially unsafe) PAU. Since ReLU networks also $\epsilon$-approximate rational functions [Telgarsky, 2017], Padé networks can also be reduced to ReLU networks. This link paves the way to globally optimal training [Arora et al., 2018], under certain conditions, as well as to provable robustness [Croce et al., 2019] of Padé networks.

Figure 2: Estimated activation functions after training the VGG network with PAU on MNIST. As one can see, PAUs differ from common activation functions but capture characteristics of them. Illustrations of the learned PAUs of other networks can be found in Figs. 5 and 6 in the Appendix.

4 Image Classification Experiments

Our intention here is to investigate the performance of PAUs, both in terms of running time and predictive performance, compared to standard deep neural networks. To this end, we took well-established deep architectures with different activation functions. Then, we replaced the activation functions by PAUs with layer-wise weight sharing and trained both variants. All our experiments are implemented in PyTorch, with PAU implemented as a CUDA extension. The computations were executed on an NVIDIA DGX-2 system.

More precisely, we considered the datasets MNIST [LeCun et al., 2010] and Fashion-MNIST [Xiao et al., 2017] and the following deep architectures for image classification. For more details see Tab. 2 in the Appendix.

  • LeNet [LeCun et al., 1998] with 61746 parameters for the network and 40 for PAU,

  • VGG [Simonyan and Zisserman, 2015] with 9224508 parameters and 50 for PAU, and,

  • CONV with 562728 parameters for the network and 30 for PAU. This convolutional network uses batch-normalization and dropout, cf. Tab. 2 in the Appendix.

For each network architecture, we replaced all activation functions either by PAUs or by one of the following common activation functions:

  • ReLU [Nair and Hinton, 2010]: $\max(0, x)$.

  • ReLU6 [Krizhevsky and Hinton, 2010]: $\min(\max(0, x), 6)$, a variation of ReLU with an upper bound.

  • Leaky ReLU [Maas et al., 2013]: $\max(0, x) + \alpha \min(0, x)$, with the negative slope defined by the parameter $\alpha$. Leaky ReLU enables a small amount of information to flow when $x < 0$.

  • Tanh: $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$.

  • Swish [Ramachandran et al., 2018]: $x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function. It tends to work better than ReLU on deeper models across a number of challenging datasets.

  • Parametric ReLU (PReLU) [He et al., 2015]: $\max(0, x) + \alpha \min(0, x)$, where the leaky parameter $\alpha$ is a learnable parameter of the network.
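For reference, the fixed baselines listed above can be written as scalar functions using their standard textbook definitions. This is a minimal sketch rather than the code used in the experiments; the default slope `alpha=0.01` is an assumed illustrative value, and PReLU has the same form as Leaky ReLU except that the slope is trained together with the network weights.

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def relu6(x):
    """ReLU clipped from above at 6."""
    return min(max(0.0, x), 6.0)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU; alpha is the slope for x < 0 (assumed default)."""
    return max(0.0, x) + alpha * min(0.0, x)

def tanh(x):
    """Hyperbolic tangent."""
    return math.tanh(x)

def swish(x):
    """Swish: x * sigmoid(x). Overflows for very negative x in this naive form."""
    return x / (1.0 + math.exp(-x))

for name, fn in [("relu", relu), ("relu6", relu6), ("leaky_relu", leaky_relu),
                 ("tanh", tanh), ("swish", swish)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0, 8.0)])
```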

The parameters of the networks, both the layer weights and the coefficients of the PAUs, were trained over 100 epochs using Adam [Kingma and Ba, 2015] with a learning rate of or SGD [Qian, 1999] with a learning rate of . In all experiments we used a batch size of samples. The weights of the networks were initialized randomly and the coefficients of the PAUs were initialized with the initialization constants of Leaky ReLU, see Tab. 3. We report the mean of 5 different runs for both the accuracy on the test set and the loss on the training set after each training epoch.

4.1 Results on MNIST and Fashion-MNIST Benchmarks

As can be seen in Fig. 3 and Fig. 4, PAU consistently outperforms the baseline activations on every network in terms of predictive performance and training speed. Furthermore, PAUs also enable the networks to achieve a lower loss during training compared to all baselines on all networks, see the second column of Fig. 3 and Fig. 4. These results are more prominent on the VGG network, and even more so on our own network (CONV), which achieves the best performance of all networks.

An important observation is that, compared to baseline activation functions on the MNIST dataset on the different architectures (Fig. 3), there is no clear choice of activation that achieves the best performance. However, PAU always matches or even outperforms the best performing baseline activation function. This shows that a learnable activation function relieves the network designer of having to commit to a potentially underperforming choice.

Figure 3: PAU compared to baseline activation function units on 5 runs of MNIST using the VGG, LeNet and CONV architectures: first column mean test-accuracy, second column mean train-loss. PAU consistently outperforms or matches the best performances of the baseline activations. Moreover, PAUs enable the networks to achieve a lower loss during training compared to all baselines.
Figure 4: PAU compared to baseline activation function units on 5 runs of Fashion-MNIST using the VGG, LeNet and CONV architectures: first column mean test-accuracy, second column mean train-loss. PAU consistently outperforms the baseline activation functions in terms of performance and training time, especially on the VGG and CONV architectures.
Table 1: Summarized benchmark comparison of PAU against the baseline activation functions on different deep neural networks. Best results on average per benchmark and architecture are shown in bold. The best result out of 5 runs is denoted using “”. As one can see, PAU consistently outperforms the other activation functions on average. (Top) On MNIST, where the performance of most activation functions has minor deviations, PAU consistently achieves a stable performance (cf. mean ± std) and outperforms the baseline activation functions on average. (Bottom) On Fashion-MNIST, PAU consistently achieves the best performance, both on average and in terms of the best result over all runs.

The reported results are further summarized in Tab. 1. As one can see, PAU consistently outperforms the baseline activation functions on average. On MNIST, where the performance of most activation functions has minor deviations, PAU consistently achieves a stable performance (cf. mean ± std) and outperforms the other functions on average. On Fashion-MNIST, PAU consistently achieves the best performance on average and also provides the best results over all runs.

4.2 Illustration of the Activation Functions learned by PAUs

Fig. 2 shows the learned PAUs of the trained VGG network (MNIST). The first activation function, Fig. 2(a), is akin to a ReLU with a negative slope in the range of and to a Log-Sigmoid, starting at . The following activation units (Fig. 2(a), 2(b), 2(c), 2(d)) behave similarly. The last learned activation function before the classification layer behaves like a Leaky ReLU with a negative value. This shows that PAUs can learn new activation functions that differ from the common ones but also capture some of their characteristics.

5 Conclusions

We have presented a novel learnable activation function, called Padé Activation Unit (PAU). PAUs encode activation functions as rational functions, trainable in an end-to-end fashion using backpropagation. The results of our empirical evaluation for image classification show that PAUs can indeed learn new activation functions. More importantly, the resulting Padé networks can outperform classical deep networks that use fixed activation functions, both in terms of training time and predictive performance. In fact, across all activation functions and architectures, Padé networks achieved the best performance. This shows that the reliance on first picking fixed, hand-engineered activation functions can be eliminated and that learning activation functions is beneficial. Moreover, our results provide the first empirical evidence that the open question “Can rational functions be used to design algorithms for training neural networks?” raised by Telgarsky [2017] can be answered affirmatively for common deep architectures.


Acknowledgements

PS and KK acknowledge the support by funds of the German Federal Ministry of Food and Agriculture (BMEL) based on a decision of the Parliament of the Federal Republic of Germany via the Federal Office for Agriculture and Food (BLE) under the innovation support program, FKZ 2818204715.


References

  • Arora et al. [2018] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.
  • Brezinski and Van Iseghem [1994] C. Brezinski and J. Van Iseghem. Padé approximations. Handbook of numerical analysis, 3:47–222, 1994.
  • Chen et al. [2018] Z. Chen, F. Chen, R. Lai, X. Zhang, and C.-T. Lu. Rational neural networks for approximating graph convolution operator on jump discontinuities. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018.
  • Clevert et al. [2016] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In 4th International Conference on Learning Representations (ICLR), 2016.
  • Croce et al. [2019] F. Croce, M. Andriushchenko, and M. Hein. Provable robustness of relu networks via maximization of linear regions. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2057–2066, 2019.
  • Glorot et al. [2011] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323, 2011.
  • Goodfellow et al. [2013] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1319–1327, 2013.
  • Goyal et al. [2019] M. Goyal, R. Goyal, and B. Lall. Learning activation functions: A new paradigm of understanding neural networks. arXiv preprint arXiv:1906.09529, 2019.
  • Hajduk [2018] Z. Hajduk. Hardware implementation of hyperbolic tangent and sigmoid activation functions. Bulletin of the Polish Academy of Sciences. Technical Sciences, 66(5), 2018.
  • He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • Hornik et al. [1989] K. Hornik, M. B. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
  • Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015.
  • Krizhevsky and Hinton [2010] A. Krizhevsky and G. Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40(7), 2010.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of IEEE, 86(11):2278–2324, 1998.
  • LeCun et al. [2010] Y. LeCun, C. Cortes, and C. Burges. Mnist handwritten digit database. at&t labs, 2010.
  • Leshno et al. [1993] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
  • Lu et al. [2017] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 6232–6240, 2017.
  • Maas et al. [2013] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc, volume 30, page 3, 2013.
  • Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • Qian [1999] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
  • Ramachandran et al. [2018] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. In Proceedings of the Workshop Track of the 6th International Conference on Learning Representations (ICLR), 2018.
  • Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
  • Teh and Hinton [2001] Y. W. Teh and G. E. Hinton. Rate-coded restricted boltzmann machines for face recognition. In Proceedings of Neural Information Processing Systems (NIPS), pages 908–914, 2001.
  • Telgarsky [2017] M. Telgarsky. Neural networks and rational functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3387–3393, 2017.
  • Trefethen [2012] L. N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2012. ISBN 978-1-611-97239-9.
  • Vercellino and Wang [2017] C. J. Vercellino and W. Y. Wang. Hyperactivations for activation function exploration. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Workshop on Meta-learning. Long Beach, USA, 2017.
  • Xiao et al. [2017] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, 2017.
  • Xu et al. [2015] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. CoRR, 2015.
  • Zhao et al. [2017] H.-Z. Zhao, F.-X. Liu, and L.-Y. Li. Improving deep convolutional neural networks with mixed maxout units. PloS one, 12(7), 2017.

Appendix A Network architectures

Here we describe the architectures of the networks VGG, LeNet and CONV, along with the number of trainable parameters. The number of parameters of the activation function is reported for the case of using PAU. Common fixed activation functions have no trainable parameters, while PReLU has one trainable parameter. In total, the VGG network has 9224508 parameters with 50 for PAU. The LeNet network has 61746 parameters with 40 for PAU, and the CONV network has 562728 parameters with 30 for PAU.

| No. | VGG | # params | LeNet | # params | CONV | # params |
|---|---|---|---|---|---|---|
| 1 | Convolutional 3x3x64 | 640 | Convolutional 5x5x6 | 156 | Convolutional 5x5x128 | 3328 |
| 2 | Activation | 10 | Activation | 10 | Batch-normalization | 256 |
| 3 | Max-Pooling | 0 | Max-Pooling | 0 | Activation | 10 |
| 4 | Convolutional 3x3x128 | 73856 | Convolutional 5x5x16 | 2416 | Dropout 0.2 | 0 |
| 5 | Activation | 10 | Activation | 10 | Max-Pooling | 0 |
| 6 | Max-Pooling | 0 | Max-Pooling | 0 | Convolutional 5x5x128 | 409728 |
| 7 | Convolutional 3x3x256 | 295168 | Convolutional 5x5x120 | 48120 | Batch-normalization | 256 |
| 8 | Convolutional 3x3x256 | 590080 | Activation | 10 | Activation | 10 |
| 9 | Activation | 10 | Linear 84 | 10164 | Dropout 0.2 | 0 |
| 10 | Max-Pooling | 0 | Activation | 10 | Max-Pooling | 0 |
| 11 | Convolutional 3x3x512 | 1180160 | Linear 10 | 850 | Convolutional 3x3x128 | 147584 |
| 12 | Convolutional 3x3x512 | 2359808 | Softmax | 0 | Batch-normalization | 256 |
| 13 | Activation | 10 | | | Activation | 10 |
| 14 | Max-Pooling | 0 | | | Dropout 0.2 | 0 |
| 15 | Convolutional 3x3x512 | 2359808 | | | Avg-Pooling | 0 |
| 16 | Convolutional 3x3x512 | 2359808 | | | Linear 10 | 1290 |
| 17 | Activation | 10 | | | Softmax | 0 |
| 18 | Max-Pooling | 0 | | | | |
| 19 | Linear 10 | 5130 | | | | |
| 20 | Softmax | 0 | | | | |

Table 2: Architecture of Simple Convolutional Neural Network
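The parameter counts in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below assumes single-channel input images, a bias term in every convolutional and linear layer, two trainable parameters per channel in batch-normalization, and 10 coefficients per PAU; these assumptions are inferred from the reported numbers rather than stated explicitly, and the helper names are hypothetical.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a kernel x kernel convolution."""
    return (kernel * kernel * in_ch + 1) * out_ch

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return (in_features + 1) * out_features

def batchnorm_params(channels):
    """Scale and shift per channel (assumed)."""
    return 2 * channels

PAU_PARAMS = 10  # per activation layer, as listed in Table 2

lenet = (conv_params(5, 1, 6) + conv_params(5, 6, 16) + conv_params(5, 16, 120)
         + linear_params(120, 84) + linear_params(84, 10)
         + 4 * PAU_PARAMS)

conv = (conv_params(5, 1, 128) + conv_params(5, 128, 128) + conv_params(3, 128, 128)
        + 3 * batchnorm_params(128)
        + linear_params(128, 10)
        + 3 * PAU_PARAMS)

vgg = (conv_params(3, 1, 64) + conv_params(3, 64, 128)
       + conv_params(3, 128, 256) + conv_params(3, 256, 256)
       + conv_params(3, 256, 512) + 3 * conv_params(3, 512, 512)
       + linear_params(512, 10)
       + 5 * PAU_PARAMS)

print("LeNet:", lenet)
print("CONV: ", conv)
print("VGG:  ", vgg)
```

Under these assumptions the three totals reproduce the figures quoted in the text (61746, 562728 and 9224508), which also confirms that layer-wise sharing adds only 40, 30 and 50 PAU parameters, respectively.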

Appendix B Initialization coefficients

As shown in Table 3, we compute initial coefficients for PAU approximations to different known activation functions. We predefined the orders to be [5,4]. For Sigmoid, Tanh and Swish, we computed the Padé approximant using the standard techniques. For the different variants of PReLU, Leaky ReLU and ReLU, we optimized the coefficients using least squares over the line range [-3,3] in steps of 0.000001.

Sigmoid Tanh Swish ReLU LReLU(0.01) LReLU(0.20) LReLU(0.25) LReLU(0.30) LReLU(-0.5)
0 0 0.02996348 0.02979246 0.02557776 0.02423485 0.02282366 0.02650441
1 0.61690165 0.61837738 0.66182815 0.67709718 0.69358438 0.80772912
0 2.37539147 2.32335207 1.58182975 1.43858363 1.30847432 13.56611639
3.06608078 3.05202660 2.94478759 2.95497990 2.97681599 7.00217900
0 1.52474449 1.48548002 0.95287794 0.85679722 0.77165297 11.61477781
0.25281987 0.25103717 0.23319681 0.23229612 0.23252265 0.68720375
0 0 0 1.19160814 1.14201226 0.50962605 0.41014746 0.32849543 13.70648993
4.40811795 4.39322834 4.18376890 4.14691964 4.11557902 6.07781733
0 0 0 0.91111034 0.87154450 0.37832090 0.30292546 0.24155603 12.32535229
0.34885983 0.34720652 0.32407314 0.32002850 0.31659365 0.54006880
Table 3: Initial coefficients to approximate different activation functions.

A visualization of the different approximations can be found in Fig. 1.

Appendix C Illustration of the Activation Functions learned by PAUs

When looking at the activation functions learned from the data, we can clearly see similarities to standard functions. In particular, we can see in Fig. 5 and Fig. 6 that many learned activations seem to be smoothed versions of Leaky ReLUs. Note that piecewise V-shaped activation functions are simply Leaky ReLUs with negative slope parameters. Furthermore, we see that, unlike the standard activations, PAUs also choose to move the center of what would be the piecewise transition in the ReLU family. As shown in Fig. 5j, the transition between the two “linear” modes is negative. However, this can only be achieved by a Leaky ReLU with a negative slope parameter and then applying a shift to the right, a non-standard procedure.

Figure 5: Resulting activation functions after training the networks VGG, LeNet and CONV with PAU on MNIST.
Figure 6: Resulting activation functions after training the networks VGG, LeNet and CONV with PAU on Fashion-MNIST.