Rational neural networks

04/04/2020 ∙ by Nicolas Boullé, et al. ∙ University of Oxford cornell university 33

We consider neural networks with rational activation functions. The choice of the nonlinear activation function in deep learning architectures is crucial and heavily impacts the performance of a neural network. We establish optimal bounds in terms of network complexity and prove that rational neural networks approximate smooth functions more efficiently than ReLU networks. The flexibility and smoothness of rational activation functions make them an attractive alternative to ReLU, as we demonstrate with numerical experiments.



There are no comments yet.


page 6

page 8

page 18

page 20

Code Repositories


Code for the paper "Rational neural networks", NeurIPS 2020

view repo


Implementation of rational neural graph fingerprints for feature extraction from graph-based input data in tf.keras with tensorflow >= 2.0.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep learning has become an important topic across many domains of science due to its recent success in image recognition, speech recognition, and drug discovery Hinton et al. (2012); Krizhevsky et al. (2012); LeCun et al. (2015); Ma et al. (2015). Deep learning techniques are based on neural networks, which contain a certain number of layers to perform several mathematical transformations to the input. A nonlinear transformation of the input determines the output of each layer in the neural network: , where is a matrix called the weight matrix,

is a bias vector, and

is a nonlinear function called the activation function (also called activation unit). The computational cost of training a neural network depends on the total number of nodes (size) and the number of layers (depth). A key question in designing deep learning architectures is the choice of the activation function to reduce the number of trainable parameters of the network while keeping the same approximation power Goodfellow et al. (2016).

While smooth activation functions such as sigmoid, logistic, or hyperbolic tangent are widely used in machine learning, they suffer from the “vanishing gradient problem” 

Bengio et al. (1994) because their derivatives are zero for large inputs. Neural networks based on polynomial activation functions are an alternative Cheng et al. (2018); Daws Jr. and Webster (2019); Goyal et al. (2019); Guarnieri et al. (1999); Ma and Khorasani (2005); Vecci et al. (1998), but can be numerically unstable due to large gradients for large inputs Bengio et al. (1994). Moreover, polynomials do not approximate non-smooth functions efficiently Trefethen (2013)

, which can yield optimization issues in classification problems. A popular choice of activation function is the Rectified Linear Unit (ReLU) defined as

 Jarrett et al. (2009); Nair and Hinton (2010). It has numerous advantages, such as being fast to evaluate and zero for many inputs Glorot et al. (2011). Many theoretical studies are characterizing and understanding the expressivity of shallow and deep ReLU neural networks from the perspective of approximation theory DeVore et al. (1989); Liang and Srikant (2016); Mhaskar (1996); Telgarsky (2016); Yarotsky (2017).

ReLU networks also suffer from drawbacks, which are most evident during training. The main disadvantage is that the gradient of ReLU is zero for negative real numbers. Therefore, its derivative is zero if the activation function is saturated Maas et al. (2013). To tackle these issues, several adaptions to ReLU are proposed such as Leaky ReLU Maas et al. (2013), Exponential Linear Unit (ELU) Clevert et al. (2015), Parametric Linear Unit (PReLU) He et al. (2015), and Scaled Exponential Linear Unit (SELU) Klambauer et al. (2017)

. These modifications outperform ReLU in image classification applications, and some of these activation functions have trainable parameters, which are learned by gradient descent at the same time as the weights and biases of the network. To obtain significant benefits for image classification and partial differential equation (PDE) solver, one can even perform an exhaustive search over trainable activation functions constructed from standard units 

Jagtap et al. (2020); Ramachandran et al. (2017). However, most of the “exotic” activation functions in the literature are motivated by empirical results and are not supported by theoretical statements on their potentially improved approximation power over ReLU.

In this work, we study so-called rational neural networks, which are neural networks with activation functions that are trainable rational functions. In Section 3, we provide theoretical statements quantifying the advantages of rational neural networks over ReLU networks. In particular, we remark that a composition of low-degree rational functions has a good approximation power but a relatively small number of trainable parameters. Therefore, we show that rational neural networks require fewer nodes than ReLU networks to approximate smooth functions to within a certain accuracy. The experiments conducted in Section 4 demonstrate the potential applications of these rational networks for solving PDEs and Generative Adversarial Networks (GANs).111All code and hyper-parameters are available at Boullé (2020).

The practical implementation of rational networks is straightforward in the TensorFlow framework and consists of replacing the activation functions by trainable rational functions. Finally, we highlight the main benefits provided by rational networks: the fast approximation of functions, the trainability of the activation parameters, and the smoothness of the activation function.

2 Rational neural networks

We consider neural networks whose activation functions consist of rational functions with trainable coefficients and , i.e., functions of the form:


where and are the polynomial degrees of the numerator and denominator, respectively. We say that is of type and degree .

The use of rational functions in deep learning is motivated by the theoretical work of Telgarsky, who proved error bounds on the approximation of ReLU neural networks by high-degree rational functions and vice versa Telgarsky (2017). On the practical side, neural networks based on rational activation functions are considered by Molina et al. Molina et al. (2019)

, who defined a safe Padé Activation Unit (PAU) as

The denominator is selected so that does not have poles located on the real axis. PAU networks can learn new activation functions and are competitive with state-of-the-art neural networks for image classification. However, this choice results in a non-smooth activation function and makes the gradient expensive to evaluate during training. In a closely related work, Chen et al. Chen et al. (2018) propose high-degree rational activation functions in a neural network, which have benefits in terms of approximation power. However, it can significantly increase the number of parameters in the network, causing the training stage to be computationally expensive.

In this paper, we use low-degree rational functions as activation functions, which are then composed together by the neural network to build high-degree rational functions. In this way, we can leverage the approximation power of high-degree rational function without making training expensive. We highlight the approximation power of rational networks and provide optimal error bounds to demonstrate that rational neural networks theoretically outperform ReLU networks. Motivated by our theoretical results, we consider rational activation functions of type , i.e., and . A low-degree activation function keeps the number of trainable parameters small, while the implicit composition in a neural network gives us the approximation power of high-degree rationals. Our experiments on the approximation of smooth functions and GANs suggest that rational neural networks are an attractive alternative to ReLU networks (see Section 4). We observe that a good initialization, motivated by the theory of rational functions, prevents rational neural networks from having arbitrary large values.

3 Theoretical results on rational neural networks

Here, we demonstrate the theoretical benefit of using neural networks based on rational activation functions due to their superiority over ReLU in approximating functions. We derive optimal bounds in terms of the number of parameters (or size) needed by rational networks to approximate ReLU networks as well as functions in the Sobolev space . Throughout this paper, we take to be a small parameter with . We show that an -approximation of a ReLU network by a rational neural network must have the following size (indicated in brackets):


where the constants only depend on the size and depth of the ReLU network. Here, the upper bound means that all ReLU networks can be approximated to within by a rational network of size . The lower bound means that there is an ReLU network that cannot be -approximated by a rational network of size less that . In comparison, the size needed by a ReLU network to approximate a rational neural network within the tolerance of is given by the following inequalities:


This means that all rational networks can be approximated to within by a ReLU network of size , while there is a rational network that cannot be -approximated by a ReLU network of size less than . A comparison between (2) and (3) suggests that rational networks could be more versatile than ReLU.

3.1 Approximation of ReLU networks by rational neural networks

Telgarsky showed that neural networks and rational functions can approximate each other in the sense that there exists a rational function of degree222A polylogarithmic function in , denoted by , is any polynomial in . that is -close to a ReLU network Telgarsky (2017), where is a small number. To prove this statement, Telgarsky used a rational function constructed with Newman polynomials Newman (1964) to obtain a rational approximation to the ReLU function that converges with square-root exponential accuracy. That is, Telgarsky needed a rational of degree to achieve a tolerance of . A degree rational function can be represented in at most coefficients, i.e., and in Equation 1. Therefore, the rational approximation to a ReLU network constructed by Telgarsky requires at least parameters. In contrast, for any rational function, Telgarsky showed that there exists a ReLU network of size that is an -approximation.

Our key observation is that by composing low-degree rational functions together, we can approximate a ReLU network much more efficiently in terms of the size (rather than the degree) of the rational network. Our theoretical work is based on a family of rationals called Zolotarev functions, which are the best rational approximation to the sign function Achieser (2013); Petrushev and Popov (2011), defined as

A composition of Zolotarev functions of type has type but can be represented with at most parameters instead of . This property enables the construction of a rational approximation to ReLU using compositions of low-degree Zolotarev functions with parameters in Lemma .


Let . There exists a rational network of size such that

Moreover, no rational network of size smaller than can achieve this.

The proof of Lemma (see Supplementary Material) shows that the given bound is optimal in the sense that a rational network requires at least parameters to approximate the ReLU function on to within the tolerance . The convergence of the Zolotarev functions to the ReLU function is much faster, with respect to the number of parameters, than the rational constructed with Newman polynomials (see Figure 1 (left)).

[width=0.9]Figure/fig/relu_init.pdf Number of parametersx

Figure 1: Left: Approximation error of the Newman (blue) and Zolotarev functions (red) to ReLU with respect to the number of parameters required to represent . Right: Best rational function of type (red) that approximates the ReLU function (blue). We use this to initialize the rational activation functions when training a rational neural network.

The converse of Lemma , which is a consequence of a theorem proved by Telgarsky (Telgarsky, 2017, Theorem 1.1), shows that any rational function can be approximated by a ReLU network of size at most .


Let . If is a rational function, then there exists a ReLU network of size such that .

To demonstrate the improved approximation power of rational neural networks over ReLU networks ( versus ), it is known that a ReLU networks that approximates , which is rational, to within on must be of size at least  (Liang and Srikant, 2016, Theorem 11).

We can now state our main theorem based on Lemmas and 3.1. Theorem provides bounds on the approximation power of ReLU networks by rational neural networks and vice versa. We regard Theorem as an analogue of (Telgarsky, 2017, Theorem 1.1) for our Zolotarev functions, where we are counting the number of training parameters instead of the degree of the rational functions. In particular, our rational networks have a high degree but can be represented with few parameters due to compositions, making training more computationally efficient. While Telgarsky required a rational function that needed parameters to approximate a ReLU network with fewer than nodes in each of layers to within a tolerance of , we construct a rational network that only has size .


Let and denote the matrix 1-norm. The following two statements hold:

  1. [leftmargin=0cm,itemindent=.5cm,noitemsep]

  2. Let be a rational network with layers and at most nodes per layer, where each node computes and is a rational function with Lipschitz constant (, , and are possibly distinct across nodes). Suppose further that and . Then, there exists a ReLU network of size

    such that .

  3. Let be a ReLU network with layers and at most nodes per layer, where each node computes and the pair (possibly distinct across nodes) satisfies . Then, there exists a rational network of size

    such that .

Theorem highlights the improved approximation power of rational neural networks over ReLU networks. ReLU networks of size are required to approximate rational networks while rational networks of size only are sufficient to approximate ReLU networks.

3.2 Approximation of functions by rational networks

A popular question is the required size and depth of deep neural networks to approximate smooth functions Liang and Srikant (2016); Montanelli et al. (2019); Yarotsky (2017). In this section, we consider the approximation theory of rational network. In particular, we consider the approximation of functions in the Sobolev space , where is the regularity of the functions and . The norm of a function is defined as

where stands for the multi-index notation , and is the corresponding weak derivative of . In this section, we consider the approximation of functions from

By the Sobolev embedding theorem Brezis (2010), this space contains the functions in , which is the class of functions whose first derivatives are Lipschitz continuous. Yarotsky derived upper bounds on the size of neural networks with piecewise linear activation functions needed to approximate functions in  (Yarotsky, 2017, Theorem 1). In particular, he constructed an -approximation to functions in with a ReLU network of size at most . The term is introduced by a local Taylor approximation, while the term is the size of the ReLU network needed to approximate monomials, i.e., for , in the Taylor series expansion.

We now present an analogue of Yarotsky’s theorem for a rational neural network.


Let , , , and . There exists a rational neural network of size

and maximum depth such that .

The proof of Theorem consists of approximating a function by a local Taylor expansion. One needs to approximate the piecewise linear functions and monomials arising in the Taylor expansion by rational networks using Lemma and Proposition (see Supplementary Material). The main distinction between Yarotsky’s argument and the proof of Theorem is that monomials can be represented by rational neural networks with a size that does not depend on the accuracy of . In contrast, ReLU networks require parameters. Meanwhile, while ReLU neural networks can exactly approximate piecewise linear functions with a constant number of parameters, rational networks can approximate them with a size of a most (see Lemma ). That is, rational neural networks approximate piecewise linear functions much faster than ReLU networks approximate polynomials.

A theorem proved by DeVore et al. DeVore et al. (1989) gives a lower bound of on the number of parameters needed by a neural network to express any function in with an error , under the assumption that the weights are chosen continuously. Comparing and , we find that rational neural networks require exponentially fewer nodes than ReLU networks with respect to the optimal bound of to approximate functions in .

4 Experiments using rational neural networks

In this section, we consider neural networks with trainable rational activation functions of type . We select the type based on empirical performance; roughly, a low-degree (but higher than

) rational function is ideal for generating high-degree rational functions by composition, with a small number of parameters. The rational activation units can be easily implemented in the open-source TensorFlow library 

Abadi et al. (2016) by using the polyval and divide

commands for function evaluations. The coefficients of the numerators and denominators of the rational activation functions are trainable parameters, determined at the same time as the weights and biases of the neural network, by backpropagation, and a gradient descent optimization algorithm.

One crucial question is the initialization of the coefficients of the rational activation functions Chen et al. (2018); Molina et al. (2019). A badly initialized rational function might contain poles on the real axis, leading to exploding values, or converge to a local minimum in the optimization process. Our experiments, supported by the empirical results of Molina et al. Molina et al. (2019), show that initializing each rational function with the best rational approximation to the ReLU function (as described in Lemma ) produces good performance. The underlying idea is to initialize rational networks near a network with ReLU activation functions, which are widely used for deep learning. Then, the adaptivity of the rational functions allows for further improvements during the training phase. We represent the initial rational function used in our experiments in Figure 1 (right). The coefficients of this function are obtained by using the minimax command, available in the Chebfun software Driscoll et al. (2014); Filip et al. (2018) for numerically computing rational approximations (see Table 1 in the Supplementary Material).

In the following experiments, we use a single rational activation function of type at each layer, instead of different functions at each node to reduce the number of trainable parameters and the computational training expense.

4.1 Approximation of functions

Raissi, Perdikaris, and Karniadakis Raissi (2018); Raissi et al. (2019) introduce a framework called deep hidden physics models

for discovering nonlinear partial differential equations (PDEs) from observations. This technique requires to solving the following interpolation problem: given the observation data

at the spatio-temporal points , find a neural network

(called the identification network), that minimizes the loss function


This technique has successfully discovered hidden models in fluid mechanics Raissi et al. (2020), solid mechanics Haghighat et al. (2020), and nonlinear partial differential equations such as the Korteweg–de Vries (KdV) equation Raissi et al. (2019). Raissi et al. use an identification network, consisting of layers and nodes per layer, to interpolate samples from a solution to the KdV equation. Moreover, they observe that networks based on smooth activation functions, such as the hyperbolic tangent () or the sinusoid (), outperform ReLU neural networks Raissi (2018); Raissi et al. (2019). However, the performance of these smooth activation functions highly depends on the application.

Moreover, these functions might not be adapted to approximate non-smooth or highly oscillatory solutions. Recently, Jagtap, Kawaguchi, and Karnidakis Jagtap et al. (2020) proposed and analyzed different adaptive activation functions for the approximation of smooth and discontinuous functions with physics-informed neural networks. More specifically, they use an adaptive version of classical activation functions such as sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU. The choice of these trainable activation functions introduces another parameter in the design of the neural network architecture, which may not be ideal for use for a black-box data-driven PDE solver.

[width=0.46]Figure/2d/solution.eps tx[width=0.4]Figure/fig/PDE_loss.pdf Epochs

Figure 2: Solution to the KdV equation used as training data (left) and training loss of a ReLU (blue), sine (green), and rational (red) neural networks with respect to the number of optimization steps (right).

We illustrate that rational neural networks can address the issues mentioned above due to their adaptivity and approximation power (see Section 3). Similarly to Raissi Raissi (2018), we use a solution to the KdV equation:

as training data for the identification network (see the left panel of Figure 2). We train and compare three neural networks, which contain ReLU, sinusoid, and rational activation functions, respectively.333Details of the parameters used for this experiment are available in the Supplementary Material. The mean squared error (MSE) of the neural networks throughout the training phase is reported in the right panel of Figure 2. We observe that the rational neural network outperforms the sinusoid network, despite having the same asymptotic convergence rate. The ReLU, sinusoid, and rational networks achieve the following mean square errors after epochs:

The absolute approximation errors between the different neural networks and the exact solution to the KdV equation is illustrated in Figure 5

of the Supplementary Material. The rational neural network is approximatively five times more accurate than the sinusoid network used by Raissi and twenty times more accurate than the ReLU network. Moreover, the approximation errors made by the ReLU network are not uniformly distributed in space and time and located in specific regions, indicating that a network with non-smooth activation functions is not appropriate to resolve smooth solutions to PDEs.

4.2 Generative adversarial networks

Generative adversarial networks (GANs) are used to generate fake examples from an existing dataset Goodfellow et al. (2014). They usually consist of two networks: a generator to produce fake samples and a discriminator to evaluate the samples of the generator with the training dataset. Radford et al. Radford et al. (2015)

describe deep convolutional generative adversarial networks (DCGANs) to build good image representations using convolutional architectures. They evaluate their model on the MNIST and Imagenet image datasets 

Deng et al. (2009); LeCun et al. (1998)

. In this section, we highlight the simplicity of using rational activation functions in existing neural network architectures by training an Auxiliary Classifier GAN (ACGAN) 

Odena et al. (2017) on the MNIST dataset. In particular, the neural network444We use the TensorFlow implementation available at 27 and provide extended details and results of the experiment in the Supplementary Material., denoted by ReLU network in this section, consists of convolutional generator and discriminator networks with ReLU and Leaky ReLU Maas et al. (2013) activation units (respectively) and is used as a reference GAN. As in the experiment described in Section 4.1, we replace the activation units of the generative and discriminator networks by a rational function with trainable coefficients (see Figure 1). We initialize the activation functions in the training phase with the best rational function that approximates the ReLU function on .

[width=0.225trim=0 952px 140px 0,clip]Figure/gan/relu/plot_epoch_005_generated.png epoch 5[width=0.225trim=0 952px 140px 0,clip]Figure/gan/relu/plot_epoch_010_generated.png epoch 10[width=0.225trim=0 952px 140px 0,clip]Figure/gan/relu/plot_epoch_015_generated.png epoch 15[width=0.225trim=0 952px 140px 0,clip]Figure/gan/relu/plot_epoch_020_generated.png epoch 20 [width=0.225trim=0 952px 140px 0,clip]Figure/gan/rat/plot_epoch_005_generated.png epoch 5[width=0.225trim=0 952px 140px 0,clip]Figure/gan/rat/plot_epoch_010_generated.png epoch 10[width=0.225trim=0 952px 140px 0,clip]Figure/gan/rat/plot_epoch_015_generated.png epoch 15[width=0.225trim=0 952px 140px 0,clip]Figure/gan/rat/plot_epoch_020_generated.png epoch 20 [width=trim=0 756px 140px 0,clip]Figure/gan/mnist_images.png MNIST images
Figure 3: Digits generated by a ReLU (top) and rational (bottom) auxiliary classifier generative adversarial network. The right panel contains samples from the first five classes of the MNIST dataset for comparison.

We show images of digits from the first five classes generated by a ReLU and rational GANs at different epochs of the training in Figure 3 (the samples are generated randomly and are not manually selected). We observe that a rational network can generate realistic images with a broader range of features than the ReLU network, as illustrated by the presence of bold numbers at the epoch 20 in the bottom panel of Figure 3. However, the digits one generated by the rational network are identical, suggesting that the rational GAN suffers from mode collapse. It should be noted that generative adversarial networks are notoriously tricky to train Goodfellow et al. (2016). The hyper-parameters of the reference model are intensively tuned for a piecewise linear activation function (as shown by the use of Leaky ReLU in the discriminator network). Moreover, many stabilization methods have been proposed to resolve the mode collapse and nonconvergence issues in training, such as Wasserstein GAN Arjovsky et al. (2017), Unrolled Generative Adversarial Networks Metz et al. (2016)

, and batch normalization 

Ioffe and Szegedy (2015). These techniques could be explored and combined with rational networks to address the mode collapse issue observed in this experiment.

5 Conclusions

We have investigated rational neural networks, which are neural networks with smooth trainable activation functions based on rational functions. Theoretical statements demonstrate the improved approximation power of rational networks in comparison with ReLU networks. In practice, it seems beneficial to select the activation function to be very low-degree rationals, making training more computationally efficient. We emphasize that it is simple to implement rational networks in existing deep learning architectures, such as TensorFlow, together with the ability to have trainable activation functions.

There are many future research directions in exploring the potential applications of rational networks in fields such as image classification, time series forecasting, and generative adversarial networks. These applications already employ nonstandard activation functions to overcome various drawbacks of ReLU. Another exciting and promising field is the numerical solution and data-driven discovery of partial differential equations with deep learning. We believe that popular techniques such as physics-informed neural networks Raissi et al. (2019) could benefit from rational neural networks to improve the robustness and performances of PDE solvers, both from a theoretical and practical viewpoint.

Broader Impact

Neural networks have applications in diverse fields such as facial recognition, credit-card fraud, speech recognition, and medical diagnosis. There is a growing understanding of the approximation power of neural networks, which is adding theoretical justification to their use in societal applications. We are particularly interested in the future applicability of rational neural networks in the discovery and solution of partial differential equations (PDEs). Neural networks, in particular rational neural networks, have the potential to revolutionize fields where PDE models derived by mechanistic principles are lacking.

The authors thank the National Institute of Informatics (Japan) for funding a research visit, during which this project was initiated. We thank Gilbert Strang for making us aware of Telgarsky’s paper Telgarsky (2017). This work is supported by the EPSRC Centre For Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) in collaboration with Simula Research Laboratory. The work of the third author is supported by the National Science Foundation grant no. 1818757.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Conference on Operating Systems Design and Implementation, pp. 265–283. Cited by: §4.
  • [2] N. I. Achieser (2013) Theory of Approximation. Courier Corporation. Cited by: §3.1.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 214–223. Cited by: §4.2.
  • [4] B. Beckermann and A. Townsend (2017)

    On the Singular Values of Matrices with Displacement Structure

    SIAM J. Matrix Anal. A. 38 (4), pp. 1227–1248. Cited by: Proof.
  • [5] Y. Bengio, P. Simard, and P. Frasconi (1994) Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE T. Neural Networ. 5 (2), pp. 157–166. Cited by: §1.
  • [6] N. Boullé (2020) GitHub repository. GitHub. Note: https://github.com/NBoulle/RationalNets/ Cited by: footnote 1.
  • [7] H. Brezis (2010) Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer Science & Business Media. Cited by: §3.2.
  • [8] Z. Chen, F. Chen, R. Lai, X. Zhang, and C. Lu (2018)

    Rational Neural Networks for Approximating Graph Convolution Operator on Jump Discontinuities

    In IEEE International Conference on Data Mining (ICDM), pp. 59–68. Cited by: §2, §4.
  • [9] X. Cheng, B. Khomtchouk, N. Matloff, and P. Mohanty (2018) Polynomial Regression As an Alternative to Neural Nets. arXiv preprint arXiv:1806.06850. Cited by: §1.
  • [10] D. Clevert, T. Unterthiner, and S. Hochreiter (2015) Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv preprint arXiv:1511.07289. Cited by: §1.
  • [11] J. Daws Jr. and C. G. Webster (2019) A Polynomial-Based Approach for Architectural Design and Learning with Deep Neural Networks. arXiv preprint arXiv:1905.10457. Cited by: §1.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: A large-scale hierarchical image database. In

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    pp. 248–255. Cited by: §4.2.
  • [13] R. A. DeVore, R. Howard, and C. Micchelli (1989) Optimal nonlinear approximation. Manuscripta Math. 63 (4), pp. 469–478. Cited by: §1, §3.2.
  • [14] T. A. Driscoll, N. Hale, and L. N. Trefethen (2014) Chebfun Guide. Pafnuty Publications, Oxford. Cited by: §A.3, §4.
  • [15] S. Filip, Y. Nakatsukasa, L. N. Trefethen, and B. Beckermann (2018) Rational Minimax Approximation via Adaptive Barycentric Representations. SIAM J. Sci. Comput. 40 (4), pp. A2427–A2455. Cited by: §4.
  • [16] X. Glorot, A. Bordes, and Y. Bengio (2011) Deep Sparse Rectifier Neural Networks. In

    Proceedings of the 14th International Conference on Artificial Intelligence and Statistics

    pp. 315–323. Cited by: §1.
  • [17] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep Learning. MIT press. Cited by: §1, §4.2.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680. Cited by: §4.2.
  • [19] M. Goyal, R. Goyal, and B. Lall (2019) Learning Activation Functions: A new paradigm of understanding Neural Networks. arXiv preprint arXiv:1906.09529. Cited by: §1.
  • [20] S. Guarnieri, F. Piazza, and A. Uncini (1999) Multilayer feedforward networks with adaptive spline activation function. IEEE Trans. Neural Networ. 10 (3), pp. 672–683. Cited by: §1.
  • [21] E. Haghighat, M. Raissi, A. Moure, H. Gomez, and R. Juanes (2020) A deep learning framework for solution and discovery in solid mechanics: linear elasticity. arXiv preprint arXiv:2003.02751. Cited by: §4.1.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. Cited by: §1.
  • [23] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97. Cited by: §1.
  • [24] S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456. Cited by: §4.2.
  • [25] A. D. Jagtap, K. Kawaguchi, and G. E. Karniadakis (2020) Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. J. Comput. Phys. 404, pp. 109136. Cited by: §1, §4.1.
  • [26] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009) What is the best multi-stage architecture for object recognition?. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2146–2153. Cited by: §1.
  • [27] (2019) Keras Documentation. Note: https://keras.io/examples/mnist_acgan/ Cited by: §A.4, footnote 4.
  • [28] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.4.
  • [29] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter (2017) Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 971–980. Cited by: §1.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012)

    ImageNet Classification with Deep Convolutional Neural Networks

    In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105. Cited by: §1.
  • [31] H. W. Kuhn and A. W. Tucker (1951) Nonlinear Programming. In Proc. Second Berkeley Symp. on Math. Statist. and Prob., pp. 481–492. Cited by: Proof.
  • [32] V. I. Lebedev (1977) On a Zolotarev problem in the method of alternating directions. USSR Comp. Math. Math+ 17 (2), pp. 58–76. Cited by: Proof, Proof.
  • [33] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • [34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2.
  • [35] S. Liang and R. Srikant (2016) Why Deep Neural Networks for Function Approximation?. arXiv preprint arXiv:1610.04161. Cited by: §1, §3.1, §3.2.
  • [36] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik (2015) Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model. 55 (2), pp. 263–274. Cited by: §1.
  • [37] L. Ma and K. Khorasani (2005) Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Trans. Neural Networ. 16 (4), pp. 821–833. Cited by: §1.
  • [38] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML), Vol. 30, pp. 3. Cited by: §1, §4.2.
  • [39] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2016) Unrolled Generative Adversarial Networks. arXiv preprint arXiv:1611.02163. Cited by: §4.2.
  • [40] H. N. Mhaskar (1996) Neural Networks for Optimal Approximation of Smooth and Analytic Functions . Neural Comput. 8 (1), pp. 164–177. Cited by: §1.
  • [41] A. Molina, P. Schramowski, and K. Kersting (2019) Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks. arXiv preprint arXiv:1907.06732. Cited by: §2, §4.
  • [42] H. Montanelli, H. Yang, and Q. Du (2019)

    Deep ReLU networks overcome the curse of dimensionality for bandlimited functions

    arXiv preprint arXiv:1903.00735. Cited by: §3.2.
  • [43] V. Nair and G. E. Hinton (2010)

    Rectified Linear Units Improve Restricted Boltzmann Machines

    In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814. Cited by: §1.
  • [44] Y. Nakatsukasa and R. W. Freund (2016) Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev’s functions. SIAM Rev. 58 (3), pp. 461–493. Cited by: Proof, Proof.
  • [45] D. J. Newman (1964) Rational approximation to .. Mich. Math. J. 11 (1), pp. 11–14. Cited by: §3.1.
  • [46] A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70, pp. 2642–2651. Cited by: §4.2.
  • [47] P. P. Petrushev and V. A. Popov (2011) Rational Approximation of Real Functions. Vol. 28, Cambridge University Press. Cited by: §3.1.
  • [48] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434. Cited by: §A.4, §4.2.
  • [49] M. Raissi, P. Perdikaris, and G. E. Karniadakis (2019) Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, pp. 686–707. Cited by: §4.1, §5.
  • [50] M. Raissi, A. Yazdani, and G. E. Karniadakis (2020) Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science 367 (6481), pp. 1026–1030. Cited by: §4.1.
  • [51] M. Raissi (2018) Deep hidden physics models: Deep learning of nonlinear partial differential equations. J. Mach. Learn. Res. 19 (1), pp. 932–955. Cited by: §A.3, §A.3, §4.1, §4.1.
  • [52] M. Raissi (2020) GitHub repository. GitHub. Note: https://github.com/maziarraissi/DeepHPMs/ Cited by: footnote 5.
  • [53] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for Activation Functions. arXiv preprint arXiv:1710.05941. Cited by: §1.
  • [54] M. Telgarsky (2016) Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485. Cited by: §1.
  • [55] M. Telgarsky (2017) Neural networks and rational functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70, pp. 3387–3393. Cited by: §A.1, §2, §3.1, §3.1, §3.1, Broader Impact, Proof, Proof.
  • [56] L. N. Trefethen (2013) Approximation theory and approximation practice. SIAM. Cited by: §1.
  • [57] L. Vecci, F. Piazza, and A. Uncini (1998) Learning and approximation capabilities of adaptive spline activation function neural networks. Neural Networks 11 (2), pp. 259–270. Cited by: §1.
  • [58] N. S. Vyacheslavov (1975) On the uniform approximation of by rational functions.. Sov. Math. Dokl. 16, pp. 100–104. Cited by: §A.1.
  • [59] D. Yarotsky (2017) Error bounds for approximations with deep ReLU networks. Neural Networks 94, pp. 103–114. Cited by: §1, §3.2, Proof, Proof, Proof.

Appendix A Supplementary Material

a.1 Deferred proofs of Section 3.1

We first show that a rational function can approximate the absolute value function on with square-root exponential convergence.


For any integer , we have

where is the space of rational functions of type at most . Thus, is a rational approximant to of type at most .

Moreover, if for some and integers , then can be written as , where .


Let be a real number and consider the sign function on the domain , i.e.,

By [4, Equation (33)], we find that for any ,

Let be the rational function of type that attains the minimum [4, Equation (12)]. We refer to such has the Zolotarev functions. Since we have the following inequality,

The last inequality follows because on . Moreover, since for (see [4, Equation (12)]) we have


Now, select to minimize this upper bound. One finds that and the result follows immediately.

For the final claim, let be the Zolotarev function of type on , with . By definition, is the best rational approximation of degree to the sign function on . We know from [32, 44] that there exist Zolotarev functions , where each is of type , such that


The proof of Lemma is a direct consequence of the previous lemma and the properties of Zolotarev functions.

Proof (Proof of Lemma )

Let , , , and be the Zolotarev function of type . Again from [32, 44], we see that there exist Zolotarev functions of type such that their composition equals , i.e.,


Following the proof of Lemma , we have the inequality


where we chose . Now, we take


so that the right-hand side of Equation 7 is bounded by . Finally, we use the identity

to define a rational approximation to the ReLU function on the interval as

Therefore, we have the following inequalities for ,

Then, is a composition of rational functions of type and can be represented using at most coefficients (see Equation 5). Moreover, using Equation 8, we see that , which means that is representable by a rational network of size . Finally, for .

The upper bound on the complexity of the neural network obtained in Lemma is optimal, as proved by Vyacheslavov [58].

Theorem (Vyacheslavov)

The following inequalities hold:


where is the best rational approximation to in from . Here, are constants that are independent of .

We first deduce the following corollary, giving lower and upper bounds on the optimal rational approximation to the ReLU function.


The following inequalities hold:


where is the best rational approximation to ReLU on in and are constants given by Theorem .


Let be an integer and let be any rational function of degree . Now, define . Since , we have

where the inequality is from Theorem . Now, let be the best rational approximation to on . Now, define . We find that

which proves that the best approximation to ReLU satisfies the upper bound.

We now show that a rational neural network must be at least in size (total number of nodes) to approximate the ReLU function to within .


Let . A rational neural network that approximates the ReLU function on to within has size of at least .


Let be a rational neural network with nodes at each of its layers, and assume that its activation functions are rational functions of type at most . Let be the maximum of the degrees of the activation functions of . Such a network has size . Note that itself is a rational function of degree , where from additions and compositions of rational functions we have . If is an -approximation to the ReLU function on , we know by Corollary that


The statement follows by minimizing the size of , i.e., subject to

That is,


We introduce a Lagrange multiplier and define the Lagrangian of this optimization problem as

One finds using the Karush–Kuhn–Tucker conditions [31] that . Then, using Equation 12, we find that satisfies


Therefore, the rational network with layers that approximates the ReLU function to within on has a size of at least , where is given by Equation 13 and depends on . We now minimize with respect to the number of layers . We remark that minimizing is equivalent of minimizing , where

One finds that one should take and . The result follows.

We now show that ReLU neural networks can approximate rational functions.

Proof (Proof of Lemma )

Let and be a rational function. Take , which is still a rational function. Without loss of generality, we can assume that is an irreducible rational function (otherwise cancel factors till it is irreducible). Since is a rational, it can be written as with . Moreover, we know that for so we can assume that for (it is either positive or negative by continuity). Since is continuous on , there is an integer such that for . Furthermore, we find that for because and for . By [55, Theorem 1.1], there exists a ReLU network of size such that

We now define a scaled ReLU network such that for . Therefore, for all ,

Therefore, is a ReLU neural network of size that is an -approximation to on .

We can now prove Theorem that shows how rational neural networks can approximate ReLU networks and vice versa. The structure of the proof closely follows [55, Lemma 1.3].

Proof (Proof of Theorem )

The statement of Theorem comes in two parts and we prove them separately. 1.  Consider the subnetwork of the rational network , consisting of the layers of up to the th layer for some . Let denote the ReLU network obtained by replacing each rational function in by a ReLU network approximation at a given tolerance for and , such that for (see Lemma 3.1). Let be the output of the rational network at layer and node for . Now, approximate node in the st layer by a ReLU network with tolerance (see Lemma 3.1). The approximation error between the rational and the approximating ReLU network at layer and node satisfies

The first term is bounded by

since by assumption. The second term is bounded as the Lipschitz constant of is at most . That is,

where we used the fact that and for . We find that we have the following set of inequalities:

with . If we select , then we find that . When , the ReLU network approximates the original rational network, , and the ReLU network has size