Padé Activation Units: End-to-end Learning of Activation Functions in Deep Neural Network
The performance of deep network learning strongly depends on the choice of the non-linear activation function associated with each neuron. However, deciding on the best activation is non-trivial and the choice depends on the architecture, hyper-parameters, and even on the dataset. Typically these activations are fixed by hand before training. Here, we demonstrate how to eliminate the reliance on first picking fixed activation functions by using flexible parametric rational functions instead. The resulting Padé Activation Units (PAUs) can both approximate common activation functions and also learn new ones while providing compact representations. Our empirical evidence shows that end-to-end learning deep networks with PAUs can increase the predictive performance and reduce the training time of common deep architectures. Moreover, PAUs pave the way to approximations with provable robustness. The source code can be found at https://github.com/ml-research/pau
An important building block of deep learning is the non-linearities introduced by the activation functions, such as the Rectified Linear Unit (ReLU) [Nair and Hinton, 2010]. The demonstrated benefits of ReLU in training deep networks, see e.g. [Glorot et al., 2011], brought renewed attention to the development of new activation functions. Since then, several ReLU variations with different properties have been introduced, such as LeakyReLUs [Maas et al., 2013], PReLUs [He et al., 2015], ELUs [Clevert et al., 2016], and RReLUs [Xu et al., 2015], among others. However, the shape of these activation functions is rather rigid, except for some cases where minor parameters introduce variations within the class of the activation.
Therefore, another line of research, such as [Ramachandran et al., 2018], automatically searches for activation functions. It identified the Swish unit empirically as a good candidate. However, for a given dataset, there is no guarantee that the Swish unit behaves well, and the proposed search algorithm is computationally quite demanding. Consequently, learnable activation functions have been proposed. They exploit parameterized activation functions that adapt in an end-to-end fashion to the datasets at hand during parameter training. For instance, Goodfellow et al. and Zhao et al. used a fixed set of piecewise linear components and optimized their parameters. Although these are theoretically universal function approximators, they strongly depend on hyper-parameters such as the number of components to realize this potential. Vercellino and Wang used a meta-learning approach for learning task-specific activation functions (hyperactivations). However, as the authors described, the implementation of hyperactivations, while easy to express notationally, can be frustrating to implement for generalizability over any given activation network. Recently, Goyal et al. proposed a learnable activation function based on a Taylor approximation and suggested a transformation strategy to avoid exploding gradients in deep networks. However, relying on polynomials suffers from well-known limitations such as exploding values in the limits and a tendency to oscillate [Trefethen, 2012]. Furthermore, and more importantly, it constrains the network so that it is no longer a universal function approximator [Leshno et al., 1993].
Here, we introduce a learnable activation function based on the Padé approximation, i.e., the “best” approximation of a function by a rational function of a given order. In contrast to approximations for high-accuracy hardware implementation of the hyperbolic tangent and the sigmoid activation functions [Hajduk, 2018], we do not assume fixed coefficients. The resulting Padé Activation Units (PAUs) can be learned using standard stochastic gradient descent and, hence, be seamlessly integrated into the deep learning stack. This provides high flexibility, faster training and better performance of deep learning, as we will demonstrate.
We proceed as follows. We start off by introducing PAUs. Then we sketch why they are universal approximators. Before concluding, we present our empirical evaluation on image classification.
The line of research investigating the ability of neural networks to approximate functions dates back at least to the 1980s. The universal approximation theorem states that depth-2 neural networks with a suitable activation function can approximate any continuous function on a compact domain to any desired accuracy, see e.g. [Hornik et al., 1989]. Unfortunately, polynomial activation functions are not enough [Leshno et al., 1993]. Even practically speaking, polynomial approximations tend to oscillate and overshoot [Trefethen, 2012]. Indeed, a higher order of polynomials could considerably reduce the oscillation, but doing so also increases the computational cost. In contrast, neural networks and rational functions efficiently approximate each other [Telgarsky, 2017]. Motivated by this, we propose a new type of activation function: the Padé Activation Unit (PAU). As Fig. 1 illustrates, common activation functions can be well represented using PAUs.
Assume for the moment that we start with a fixed activation function $f(x)$. The Padé approximant is then the “best” approximation of $f(x)$ by a rational function of given orders $m$ and $n$. More precisely, given $f(x)$ and the orders $m$ and $n$, the Padé approximant [Brezinski and Van Iseghem, 1994] of order $[m/n]$ is the rational function $F(x)$ over polynomials $P(x)$, $Q(x)$ of order $m$, $n$ of the form

$$F(x) = \frac{P(x)}{Q(x)} = \frac{\sum_{j=0}^{m} a_j x^j}{1 + \sum_{k=1}^{n} b_k x^k} = \frac{a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m}{1 + b_1 x + b_2 x^2 + \cdots + b_n x^n},$$

which agrees with $f(x)$ the best, i.e., to the highest possible order of its Taylor expansion. The Padé approximant often gives a better approximation of a function than truncating its Taylor series, and it may still work where the Taylor series does not converge. For these reasons it has been used before in the context of graph convolutional networks [Chen et al., 2018]. For general deep networks, however, it has not been considered so far.
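To make the comparison with a truncated Taylor series concrete, the following minimal sketch contrasts the degree-5 Taylor polynomial of tanh with its [3/2] Padé approximant, x(15 + x²)/(15 + 6x²), which matches the same Taylor terms. The function names are illustrative choices and not part of the paper's code; the evaluation points are arbitrary.

```python
import math

def tanh_taylor5(x):
    """Degree-5 Taylor polynomial of tanh around 0: x - x^3/3 + 2x^5/15."""
    return x - x**3 / 3 + 2 * x**5 / 15

def tanh_pade32(x):
    """[3/2] Pade approximant of tanh around 0 (same Taylor terms up to x^5)."""
    return x * (15 + x**2) / (15 + 6 * x**2)

# Compare absolute errors inside and outside the Taylor series' radius
# of convergence (pi/2 for tanh).
for x in (0.5, 1.0, 2.0):
    exact = math.tanh(x)
    taylor_err = abs(tanh_taylor5(x) - exact)
    pade_err = abs(tanh_pade32(x) - exact)
    print(f"x={x}: Taylor error={taylor_err:.5f}, Pade error={pade_err:.5f}")
```

At x = 2, which lies outside the radius of convergence of the Taylor series of tanh, the polynomial has already diverged from the target while the rational form stays close to it.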
Indeed, the flexibility of Padé approximants is not only a blessing but might also be a curse: they can model processes that contain poles. For a learnable activation function, however, a pole may produce NaN values depending on the input, as well as instabilities at learning and inference time. Therefore we consider a restriction, called safe Padé approximation, that guarantees that the polynomial $Q(x)$ is larger than or equal to 1, i.e., $\forall x: Q(x) \geq 1$, preventing poles and allowing for safe computation on $\mathbb{R}$:

$$F(x) = \frac{P(x)}{Q(x)} = \frac{\sum_{j=0}^{m} a_j x^j}{1 + \left|\sum_{k=1}^{n} b_k x^k\right|}.$$
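In contrast to the standard way of fitting Padé approximants, where the coefficients are found via derivatives and algebraic manipulation against a pre-defined function, we are interested in optimizing the polynomials via (stochastic) gradient descent, so that we can put the rational approximation onto the standard differentiable programming stack and simply learn the coefficients from data. To this end, we now provide all the partial derivatives needed for the update of the coefficients using backpropagation. Based on the polynomial gradients:

$$\frac{\partial P(x)}{\partial x} = a_1 + 2 a_2 x + \cdots + m a_m x^{m-1} \quad \text{and} \quad \frac{\partial Q(x)}{\partial x} = \frac{A(x)}{|A(x)|} \left( b_1 + 2 b_2 x + \cdots + n b_n x^{n-1} \right),$$

where $A(x) = \sum_{k=1}^{n} b_k x^k$, so that $Q(x) = 1 + |A(x)|$.
Then the partial derivatives required for backpropagation are:

$$\frac{\partial F}{\partial x} = \frac{\partial P(x)}{\partial x} \frac{1}{Q(x)} - \frac{\partial Q(x)}{\partial x} \frac{P(x)}{Q(x)^2}, \qquad \frac{\partial F}{\partial a_j} = \frac{x^j}{Q(x)}, \qquad \frac{\partial F}{\partial b_k} = -x^k \frac{A(x)}{|A(x)|} \frac{P(x)}{Q(x)^2}.$$

To avoid divisions by zero in the computation of the gradients, we replace operations of the form $z / |z|$ by the sign of $z$.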
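The derivatives can be sanity-checked numerically. The sketch below is a scalar illustration, not the CUDA extension used in the experiments: it assumes the safe form F(x) = P(x) / (1 + |A(x)|) with A(x) = b_1 x + ... + b_n x^n, implements the closed-form derivatives with the sign replacement, and compares them to central finite differences. The function name, test point and coefficients are arbitrary choices for illustration.

```python
def pau_and_grads(x, a, b):
    """Return F(x), dF/dx, [dF/da_j] and [dF/db_k] for a safe PAU."""
    p = sum(a_j * x**j for j, a_j in enumerate(a))
    big_a = sum(b_k * x**k for k, b_k in enumerate(b, start=1))
    sign = (big_a > 0) - (big_a < 0)  # stands in for A(x) / |A(x)|
    q = 1.0 + abs(big_a)

    dp_dx = sum(j * a_j * x**(j - 1) for j, a_j in enumerate(a) if j > 0)
    dq_dx = sign * sum(k * b_k * x**(k - 1) for k, b_k in enumerate(b, start=1))

    f = p / q
    df_dx = dp_dx / q - dq_dx * p / q**2
    df_da = [x**j / q for j in range(len(a))]
    df_db = [-(x**k) * sign * p / q**2 for k in range(1, len(b) + 1)]
    return f, df_dx, df_da, df_db

# Finite-difference check at a point where A(x) != 0.
x0, a0, b0, h = 0.7, [0.1, 1.0, 0.3], [0.5, -0.2], 1e-6
f0, df_dx, df_da, df_db = pau_and_grads(x0, a0, b0)

numeric_dx = (pau_and_grads(x0 + h, a0, b0)[0]
              - pau_and_grads(x0 - h, a0, b0)[0]) / (2 * h)
numeric_db1 = (pau_and_grads(x0, a0, [b0[0] + h, b0[1]])[0]
               - pau_and_grads(x0, a0, [b0[0] - h, b0[1]])[0]) / (2 * h)

print("dF/dx  analytic vs numeric:", df_dx, numeric_dx)
print("dF/db1 analytic vs numeric:", df_db[0], numeric_db1)
```

At inputs where A(x) = 0 the sign is taken to be 0, so the denominator coefficients receive no gradient from that input.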
Having the function and the gradients, we can put PAUs onto the differentiable programming stack. However, since every PAU contains additional tunable parameters, the number of activation units increases the complexity of the model and the learning time. To ameliorate this, and inspired by the idea of weight-sharing as introduced by Teh and Hinton, we propose to share the PAU parameters across all neurons in a layer, significantly reducing the number of extra parameters required.
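The parameter overhead of layer-wise sharing can be checked with simple arithmetic. The sketch below assumes the orders [5, 4] reported in the appendix, i.e., six numerator coefficients a_0, ..., a_5 and four denominator coefficients b_1, ..., b_4 per PAU; the number of shared PAUs per network is inferred from the reported totals and is not stated explicitly in the text.

```python
def pau_coefficients(m, n):
    """Coefficients of one PAU of order [m/n]: a_0..a_m plus b_1..b_n."""
    return (m + 1) + n

per_pau = pau_coefficients(5, 4)

# (network parameters, PAU parameters) as reported in the appendix.
reported = {
    "LeNet": (61746, 40),
    "VGG": (9224508, 50),
    "CONV": (562728, 30),
}

implied_units = {}
for name, (net_params, pau_params) in reported.items():
    # With one shared PAU per activation layer, the PAU total should be
    # a multiple of the per-unit coefficient count.
    implied_units[name] = pau_params // per_pau
    overhead = pau_params / net_params
    print(f"{name}: {implied_units[name]} shared PAUs, "
          f"overhead {overhead:.6%} of the network parameters")
```

Under these assumptions the reported PAU totals divide evenly by the per-unit count, and the overhead stays far below one percent of the network parameters in each case.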
Although one can initialize the coefficients randomly and allow the optimizer to train the network end-to-end, we obtained better results after initializing the activation function to approximate previously known activation functions. This initialization involves a preceding optimization step. For continuous activation functions, we employ the standard Padé fitting methods. For functions with discontinuities, we optimize the coefficients with a least-squares loss over a line range (see the appendix).
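The initialization step for piecewise functions can be sketched as a small least-squares fit. The example below is an assumption-laden illustration rather than the procedure used for Tab. 3: it fits a safe PAU of order [5, 4] to a Leaky ReLU with slope 0.01 on a coarse grid over [-3, 3] using plain gradient descent with a backtracking step size, whereas the appendix only states that least squares over a much finer grid was used. The starting coefficients, grid spacing and iteration count are arbitrary.

```python
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def loss_and_grads(xs, a, b, target):
    """Mean squared error of a safe PAU against target, with analytic gradients."""
    loss = 0.0
    grad_a = [0.0] * len(a)
    grad_b = [0.0] * len(b)
    for x in xs:
        p = sum(a_j * x**j for j, a_j in enumerate(a))
        big_a = sum(b_k * x**k for k, b_k in enumerate(b, start=1))
        sign = (big_a > 0) - (big_a < 0)
        q = 1.0 + abs(big_a)
        err = p / q - target(x)
        loss += err * err / len(xs)
        for j in range(len(a)):
            grad_a[j] += 2 * err * x**j / q / len(xs)
        for k in range(1, len(b) + 1):
            grad_b[k - 1] += 2 * err * (-(x**k) * sign * p / q**2) / len(xs)
    return loss, grad_a, grad_b

xs = [i / 10 for i in range(-30, 31)]   # coarse grid over [-3, 3]
a = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]      # arbitrary start: P(x) = x
b = [0.0, 0.1, 0.0, 0.0]                # small non-zero start for the denominator
initial_loss = loss_and_grads(xs, a, b, leaky_relu)[0]

lr = 1e-3
for _ in range(300):
    loss, grad_a, grad_b = loss_and_grads(xs, a, b, leaky_relu)
    # Backtracking: shrink the step until the loss actually decreases.
    while lr > 1e-12:
        new_a = [v - lr * g for v, g in zip(a, grad_a)]
        new_b = [v - lr * g for v, g in zip(b, grad_b)]
        if loss_and_grads(xs, new_a, new_b, leaky_relu)[0] < loss:
            a, b = new_a, new_b
            lr *= 1.2
            break
        lr *= 0.5

final_loss = loss_and_grads(xs, a, b, leaky_relu)[0]
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The denominator coefficients start from a small non-zero value because the sign term makes their gradient vanish when the denominator polynomial is identically zero.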
Before presenting the results of our image classification experiments, let us touch upon the expressive power of PAUs. A standard multi-layer perceptron (MLP) with enough hidden units and nonpolynomial activation functions is a universal approximator, see e.g. [Hornik et al., 1989, Leshno et al., 1993]. Similarly, Padé networks, i.e., feedforward networks with (potentially unsafe) PAUs that may include convolutional and residual architectures with max- or sum-pooling layers, are universal approximators. This can be sketched as follows. Lu et al. have shown a universal approximation theorem for width-bounded ReLU networks: width-$(n+4)$ ReLU networks, where $n$ is the input dimension, are universal approximators. ReLU networks, however, can be $\epsilon$-approximated using rational functions, requiring a representation whose size is polynomial in $\ln(1/\epsilon)$ [Telgarsky, 2017]. Thus, it follows that any continuous function can be approximated arbitrarily well on a compact domain by a Padé network with one (potentially unsafe) PAU. Since ReLU networks also $\epsilon$-approximate rational functions [Telgarsky, 2017], Padé networks can also be reduced to ReLU networks. This link paves the way to globally optimal training [Arora et al., 2018], under certain conditions, as well as to provable robustness [Croce et al., 2019] of Padé networks.
Our intention here is to investigate the performance of PAUs, both in terms of running time and predictive performance, compared to standard deep neural networks. To this end, we took well-established deep architectures with different activation functions. Then, we replaced the activation functions by PAUs with layer-wise weight sharing and trained both variants. All our experiments are implemented in PyTorch (https://pytorch.org) with PAU implemented as an extension in CUDA. The computations were executed on an NVIDIA DGX-2 system.
More precisely, we considered the datasets MNIST [LeCun et al., 2010] and Fashion-MNIST [Xiao et al., 2017] and the following deep architectures for image classification. For more details see Tab. 2 in the Appendix.
LeNet [LeCun et al., 1998] with 61746 parameters for the network and 40 for PAU,
VGG [Simonyan and Zisserman, 2015] with 9224508 parameters and 50 for PAU, and
CONV, our own network, with 562728 parameters and 30 for PAU.
For each network architecture, we replaced all activation functions in turn by PAUs and by each of the following common activation functions:
ReLU [Nair and Hinton, 2010]: $y = \max(x, 0)$.
ReLU6 [Krizhevsky and Hinton, 2010]: $y = \min(\max(x, 0), 6)$, a variation of ReLU with an upper bound.
Leaky ReLU [Maas et al., 2013]: $y = \max(0, x) + \alpha \min(0, x)$ with the negative slope, which is defined by the parameter $\alpha$. Leaky ReLU enables a small amount of information to flow when $x < 0$.
Swish [Ramachandran et al., 2018]: $y = x \cdot \mathrm{sigmoid}(x)$, which tends to work better than ReLU on deeper models across a number of challenging datasets.
Parametric ReLU (PReLU) [He et al., 2015]: $y = \max(0, x) + \alpha \min(0, x)$, where the leaky parameter $\alpha$ is a learnable parameter of the network.
The parameters of the networks, both the layer weights and the coefficients of the PAUs, were trained over 100 epochs using Adam [Kingma and Ba, 2015] or SGD [Qian, 1999], with the same batch size in all experiments. The weights of the networks were initialized randomly and the coefficients of the PAUs were initialized with the initialization constants of Leaky ReLU, see Tab. 3. We report the mean of 5 different runs for both the accuracy on the test set and the loss on the training set after each training epoch.
As can be seen in Fig. 3 and Fig. 4, PAU consistently outperforms the baseline activations on every network in terms of predictive performance and training speed. Furthermore, PAUs also enable the networks to achieve a lower loss during training compared to all baselines on all networks, see the second column of Fig. 3 and Fig. 4. These results are more prominent on the VGG network and even more so on our own network (CONV), which achieves the best performance of all networks.
An important observation is that, among the baseline activation functions on the MNIST dataset across the different architectures (Fig. 3), there is no clear choice of activation that achieves the best performance. However, PAU always matches or even outperforms the best performing baseline activation function. This shows that a learnable activation function relieves the network designer of having to commit to a potentially underperforming choice.
Table 1: Results of the compared activation functions, reported as mean ± std and best in each of three column groups.
The reported results are further summarized in Tab. 1. As one can see, PAU consistently outperforms the baseline activation functions on average. On MNIST, where the performance of most activation functions shows only minor deviations, PAU consistently achieves a stable performance (cf. mean ± std) and outperforms the other functions on average. On Fashion-MNIST, PAU consistently achieves the best performance on average and also provides the best results over all runs.
Fig. 2 shows the learned PAUs of the trained VGG network (MNIST). The first activation function, Fig. 2(a), is akin to a ReLU with a negative slope over part of the input range and to a Log-Sigmoid beyond a certain point. The following activation units (Fig. 2(a), 2(b), 2(c), 2(d)) behave similarly. The last learned activation function before the classification layer behaves like a Leaky ReLU with a negative value. This shows that PAUs can learn new activation functions that differ from the common ones but also capture some of their characteristics.
We have presented a novel learnable activation function, called Padé Activation Unit (PAU). PAUs encode activation functions as rational functions, trainable in an end-to-end fashion using backpropagation. The results of our empirical evaluation for image classification show that PAUs can indeed learn new activation functions. More importantly, the resulting Padé networks can outperform classical deep networks that use fixed activation functions, both in terms of training time and predictive performance. In fact, across all activation functions and architectures, Padé networks achieved the best performance. This clearly shows that the reliance on first picking fixed, hand-engineered activation functions can be eliminated and that learning activation functions is actually beneficial. Moreover, our results provide the first empirical evidence that the open question “Can rational functions be used to design algorithms for training neural networks?” raised by Telgarsky can be answered affirmatively for common deep architectures.
PS and KK acknowledge support by funds of the German Federal Ministry of Food and Agriculture (BMEL) based on a decision of the Parliament of the Federal Republic of Germany via the Federal Office for Agriculture and Food (BLE) under the innovation support program, FKZ 2818204715.
[Chen et al., 2018] Rational neural networks for approximating graph convolution operator on jump discontinuities. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018.
[Croce et al., 2019] In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2057–2066, 2019.
[He et al., 2015] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[Krizhevsky and Hinton, 2010] Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40(7), 2010.
[Nair and Hinton, 2010] Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[Teh and Hinton, 2001] Rate-coded restricted Boltzmann machines for face recognition. In Proceedings of Neural Information Processing Systems (NIPS), pages 908–914, 2001.
[Zhao et al., 2017] Improving deep convolutional neural networks with mixed maxout units. PloS one, 12(7), 2017.
Here we describe the architectures of the networks VGG, LeNet and CONV along with the number of trainable parameters. The number of parameters of the activation function is reported for the case of using PAU. The common fixed activation functions have no trainable parameters, and PReLU has one trainable parameter. In total, the VGG network has 9224508 parameters with 50 for PAU, the LeNet network has 61746 parameters with 40 for PAU, and the CONV network has 562728 parameters with 30 for PAU.
As shown in Table 3, we compute initial coefficients for PAU approximations to different known activation functions. We predefined the orders to be [5,4]. For Sigmoid, Tanh and Swish, we computed the Padé approximant using the standard techniques. For the different variants of PReLU, Leaky ReLU and ReLU, we optimized the coefficients using least squares over the line range [-3,3] in steps of 0.000001.
A visualization of the different approximations can be found in Fig. 1.
When looking at the activation functions learned from the data, we can clearly see similarities to standard functions. In particular, we can see in Fig. 5 and Fig. 6 that many learned activations seem to be smoothed versions of Leaky ReLUs. Note that piecewise V-shaped activation functions are simply Leaky ReLUs with negative slope parameters. Furthermore, we see that, unlike the standard activations, PAUs also choose to move the center of what would be the piecewise transition in the ReLU family. As shown in Fig. 5j, the transition between the two "linear" modes is negative. However, this can only be achieved by a Leaky ReLU with a negative slope parameter followed by a shift to the right, a non-standard procedure.