1 Introduction
For hundreds of years, humans have sought mechanistic descriptions of the rate of change of physical quantities, often in the form of Partial Differential Equations (PDEs). Historically, scientists discovered these PDEs using a first-principles approach, seeking an axiomatic chain of reasoning to explain observed phenomenology. This approach has yielded myriad scientific and engineering theories, from Newtonian Mechanics to Relativity. Notwithstanding the many successes of this approach in basic and applied sciences, progress has been slow on many important fronts: notably power grids, financial markets, protein modeling, and drug interactions. PDE discovery is a frontier field which uses machine learning techniques to discover scientific laws, expressed as governing equations, directly from data. PDE discovery promises to yield predictive models in these contexts by augmenting first-principles approaches with data-driven methods.
Neural networks are a natural tool for PDE discovery. In his landmark paper [pinkus1999approximation], Allan Pinkus showed that the set of neural networks with continuous, but non-polynomial, activation functions is dense in the set of compactly supported continuous functions from $\mathbb{R}^n$ to $\mathbb{R}$. Today, this result is referred to as the universal approximation property. Since the continuous functions with compact support are dense in $L^2$, the set of neural networks must be dense in $L^2$ as well. Since both classical and weak-form PDE solutions live in $L^2$, the universal approximation property suggests neural networks can approximate the solution of any well-posed PDE.
In 2017, [rudy2017data] made a seminal contribution to PDE discovery when they introduced PDE-FIND: an algorithm that discovers a PDE from a grid of samples of a system response function, $u$. PDE-FIND assumes that the time derivative of $u$ lies within the span of a set of library terms. Each library term is a function of $u$ and its spatial partial derivatives; in most cases, each library term is a product of powers of $u$ and its spatial partial derivatives. PDE-FIND uses finite difference techniques to approximate the partial derivatives of $u$ and evaluate the library terms on a suitably specified grid. This engenders a system of linear equations,
(1)  $U_t = \Theta(U)\,\xi$
where $\Theta(U)$ is a matrix whose $(i, j)$ element holds the value of the $j$th library term at the $i$th grid point, $U_t$ is a vector whose $i$th element holds the value of the time derivative of $u$ at the $i$th grid point, and $\xi$ is a vector of unknown coefficients. Finding a $\xi$ which satisfies equation 1 reveals the PDE that governs $u$: $\Theta(U)\,\xi$ must approximately equal the time derivative of $u$. PDE-FIND determines $\xi$ using STRidge, which solves the ridge-penalized least-squares problem $\min_{\xi} \|\Theta(U)\,\xi - U_t\|_2^2 + \lambda \|\xi\|_2^2$ (where $\lambda$ is a user-defined constant). It then eliminates every component of $\xi$ whose magnitude is smaller than a user-defined threshold, and repeats. In their paper, [rudy2017data] showed that PDE-FIND can identify several PDEs, including Burgers' equation, the Korteweg–De Vries equation, and Schrödinger's equation, directly from data.
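The sequential thresholding at the heart of STRidge can be sketched in a few lines. The following is an illustrative NumPy implementation, not the authors' code; the function name `stridge` and the default values of the ridge penalty `lam` and threshold `tol` are our own placeholders, chosen to show how the two user-defined constants enter.

```python
import numpy as np

def stridge(theta, u_t, lam=1e-5, tol=0.1, iters=10):
    """Sequentially thresholded ridge regression (a sketch of STRidge).

    theta : (m, n) library matrix; u_t : (m,) time-derivative vector.
    lam is the ridge penalty and tol the magnitude threshold; both are
    user-defined constants that must be tuned, as the text notes.
    """
    n = theta.shape[1]
    keep = np.ones(n, dtype=bool)          # active library terms
    xi = np.zeros(n)
    for _ in range(iters):
        A = theta[:, keep]
        # Ridge solve: (A^T A + lam I) xi = A^T u_t
        xi_k = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ u_t)
        xi[:] = 0.0
        xi[keep] = xi_k
        small = np.abs(xi) < tol           # threshold small coefficients
        if not np.any(small & keep):
            break                          # no more terms to eliminate
        keep &= ~small
    return xi
```

On a synthetic library where the time derivative is an exact two-term combination of the columns, this procedure recovers the correct support; its sensitivity to `lam` and `tol` is precisely the fine-tuning issue discussed next.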
PDE-FIND suffers from two key limitations. First, since it uses finite differences to approximate the derivatives of $u$, the algorithm only works if the data are collected on a regular grid, which is cumbersome in many practical settings. Second, for the algorithm to work, the user must fine-tune $\lambda$ and the sparsity threshold. Adjusting these constants changes the PDE that PDE-FIND uncovers, and selecting appropriate values is difficult in practice when the correct equation is not known a priori.
[both2021deepmod] solved the former issue by approximating the sampled function with a neural network, $U$. Their approach, called DeepMoD, learns the PDE coefficients $\xi$ while simultaneously training $U$ to match the samples of $u$. DeepMoD then uses Automatic Differentiation [baydin2018automatic] to calculate the spatial partial derivatives of the neural network, and evaluates the library terms using these values. Critically, this means that the derivatives they calculate are accurate to within machine precision, something that is impractical with finite differences. Further, since automatic differentiation does not require a stencil, the samples do not need to be collected on a regular grid; they can be arbitrarily dispersed throughout the problem domain. This approach also allows DeepMoD to evaluate the library terms anywhere in the domain (not just at the points with samples, as PDE-FIND requires). The associated loss function consists of three parts. The first quantifies how well $U$ matches the samples of the system response function, $u$. The second quantifies how well the time derivative of $U$ approximates the library expansion $\Theta\,\xi$. The third is the $L^1$ norm of $\xi$, which encourages $\xi$ to be sparse. The second and third terms effectively embed the LASSO loss function within DeepMoD's. This three-part loss function couples $U$ and $\xi$ such that $\xi$ encodes a PDE which $U$ must satisfy. $U$ learns to approximate the samples of $u$, and $\xi$ learns to be sparse, but both must do so while simultaneously satisfying the encoded PDE. This inductive bias yields a remarkably robust algorithm. In their paper, [both2021deepmod] demonstrate that DeepMoD can extract PDEs from remarkably noisy and sparse data sets.
Despite this achievement, however, DeepMoD suffers from a fine-tuning issue. In particular, the sparsity term in the loss function contains a user-specified constant, $\lambda$. Adjusting $\lambda$ affects which PDE the algorithm extracts. Thus, DeepMoD only works if the user picks a suitable value of $\lambda$, which may be difficult in practical settings.
Separately, Raissi [raissi2018deep] proposed an innovative approach closely related to PDE discovery: given a set of sparse and noisy samples of a system response function, $u$, simultaneously train two neural networks, $U$ and $N$. $U$ learns an approximation to $u$, while $N$ uses $U$ and its spatial partial derivatives (evaluated using automatic differentiation) to learn the PDE itself. To evaluate $N$, $U$ and its spatial partial derivatives are evaluated at a randomly selected set of collocation points within the problem domain. These values are the inputs to $N$, which learns the time derivative of $U$ as a function of $U$ and its spatial partial derivatives. The associated loss function consists of two parts: the first measures how well $U$ conforms to the samples of the system response function, $u$; the second evaluates how well $N$ matches the time derivative of $U$. This loss function couples $U$ and $N$. $U$ must learn to match the samples while also satisfying the learned PDE, so $U$ cannot simply memorize the samples and overfit to their noise. The principal deficiency of this approach is that the learned PDE, $N$, is not explicit, but lives within a deep neural network, which makes it difficult to study. This limits the practicality of the approach for discovering new, human-readable PDEs.
Since PDE-FIND, the field of PDE discovery has proliferated, with many algorithms proposed in the literature. In addition to those discussed above, some notable examples include [schaeffer2017learning], whose algorithm is similar to PDE-FIND, [chen2020physics], which builds upon DeepMoD, and [atkinson2019data], which approximates the system response function using Gaussian Processes and uses a genetic algorithm to extract a PDE.
Many PDE discovery techniques use a sparse regression algorithm to extract a human-readable PDE from a library of candidate terms. Sparse regression seeks a sparse vector $\xi$ which satisfies $A\xi = b$ in the least-squares sense. Here, $A$ is a matrix and $b$ is a vector. Ideally, $\xi$ satisfies
(2)  $\xi^* = \arg\min_{\xi} \|A\xi - b\|_2^2 + \lambda \|\xi\|_0$
for some wisely chosen $\lambda$. The $\|\xi\|_0$ term encourages sparsity by penalizing the number of nonzero components of $\xi$. There are, however, a few problems with this formulation. First, it relies on picking an appropriate $\lambda$ value. If $\lambda$ is too small, $\xi^*$ will have too many nonzero components, while if it is too large then $\xi^*$ may be too sparse to sufficiently satisfy $A\xi \approx b$. Picking an appropriate $\lambda$ can be difficult in practice. Second, solving equation 2 is NP-hard. In particular, the only way to find such a $\xi^*$ is to solve the least-squares problem for every subset of the library index set, and then determine which solution minimizes equation 2. Techniques such as STRidge or LASSO effectively weaken the constraints of equation 2 (namely the $\|\xi\|_0$ term). These changes often yield suboptimal solutions, but make the problem computationally feasible.
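To make the combinatorial cost concrete, here is a brute-force solver for equation 2. The name `l0_regression` is our own; the routine is feasible only for tiny libraries, since it performs one least-squares solve for each of the $2^n$ column subsets, which is exactly why the exact problem is intractable in general.

```python
import numpy as np
from itertools import combinations

def l0_regression(A, b, lam):
    """Brute-force solver for min ||A xi - b||_2^2 + lam * ||xi||_0.

    Enumerates every subset of columns of A (2^n least-squares solves),
    so it is only usable when the library is very small.
    """
    m, n = A.shape
    best_xi = np.zeros(n)
    best_cost = float(b @ b)                  # cost of the empty subset
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            xi_s, *_ = np.linalg.lstsq(A[:, list(cols)], b, rcond=None)
            r = A[:, list(cols)] @ xi_s - b
            cost = float(r @ r) + lam * k     # residual plus l0 penalty
            if cost < best_cost:
                best_cost = cost
                best_xi = np.zeros(n)
                best_xi[list(cols)] = xi_s
    return best_xi
```

When $b$ is an exact multiple of a single column, the penalty term correctly drives the solver to the one-column subset; doubling the library size doubles nothing about the answer but squares the running time.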
Sparse regression is an old problem, however. Long before the birth of PDE discovery, [guyon2002gene] introduced a sparse regression algorithm called Recursive Feature Elimination (RFE) to identify genes that predict cancer. RFE uses a user-defined importance metric to find a sequence of progressively sparser candidate solutions. The importance metric, roughly speaking, specifies how important a certain component of $\xi$ is when minimizing $\|A\xi - b\|_2$. The first candidate is the least-squares solution. RFE uses an inductive process to find the other candidates: given a candidate, $\xi^k$, RFE identifies the least important feature (LIF) of $\xi^k$, and subsequently defines $\xi^{k+1}$ as the least-squares solution with the LIF removed. Thus, RFE derives each successive candidate from the previous one.
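A minimal sketch of RFE follows, assuming coefficient magnitude as the importance metric (one common choice; [guyon2002gene] leave the metric user-defined, and the names below are ours):

```python
import numpy as np

def rfe(A, b):
    """Recursive Feature Elimination: returns a list of progressively
    sparser least-squares candidates, each derived from the previous
    one by dropping its least important feature (here, the active
    feature with the smallest coefficient magnitude)."""
    n = A.shape[1]
    active = list(range(n))
    candidates = []
    while active:
        xi_a, *_ = np.linalg.lstsq(A[:, active], b, rcond=None)
        xi = np.zeros(n)
        xi[active] = xi_a
        candidates.append(xi)
        # Remove the least important feature from the active set.
        lif = active[int(np.argmin(np.abs(xi_a)))]
        active.remove(lif)
    return candidates
```

The returned list runs from the dense least-squares solution down to a single-feature candidate; a separate criterion must still choose among them, which is the role our ranking step plays in section 3.2.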
Existing PDE discovery tools use conventional activation functions such as $\tanh$ or sigmoid.^{1} (^{1}The PDE learning community often avoids ReLU. A ReLU network is a piecewise linear function, which means that its second derivative is zero almost everywhere. Thus, a ReLU network cannot approximate a function's second derivative.) Recently, [boulle2020rational] introduced Rational Neural Networks (RNNs), whose activation functions are trainable type (3, 2) rational functions. They showed that RNNs have excellent theoretical approximating power. More specifically, let $f$ be a function defined on a compact set and let $\epsilon > 0$. [boulle2020rational] showed that there exists a Rational Neural Network that approximates $f$ to within $\epsilon$ using far fewer trainable parameters than any ReLU network achieving the same accuracy. This suggests that Rational Neural Networks have superior approximating capabilities. This result is not purely theoretical; in a recent paper on learning Green's functions, [boulle2021data] demonstrated that RNNs learn faster than equivalently sized networks with conventional activation functions. Further, unlike ReLU, rational activation functions are almost-everywhere smooth, meaning that they are suitable for use in PDE discovery.
Further, using rational activation functions in place of conventional ones does not add considerable overhead. Each type (3, 2) rational activation function contains just seven trainable parameters: four numerator and three denominator coefficients. With one rational activation function per hidden layer, switching a conventional network to an RNN adds only a handful of parameters, a fraction of a percent of the total for typical architectures. Importantly, however, the RNN's activation functions adapt over time, which gives them considerable flexibility over networks with conventional activation functions.
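As a rough sanity check on the overhead claim, the snippet below counts parameters for a hypothetical fully connected network (two inputs, five hidden layers of 50 units, one output; these sizes are our assumption for illustration, not the paper's architecture) and compares the seven-coefficient cost of one type (3, 2) rational activation per hidden layer:

```python
def dense_param_count(widths):
    """Weights plus biases of a fully connected network whose layer
    widths are given in order, e.g. [2, 50, ..., 1]."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

# Hypothetical architecture: R^2 -> R with five hidden layers of 50 units.
widths = [2, 50, 50, 50, 50, 50, 1]
base = dense_param_count(widths)   # parameter count of a tanh network

# One type (3,2) rational activation per hidden layer:
# 4 numerator + 3 denominator coefficients = 7 parameters each.
rational_extra = 7 * (len(widths) - 2)

overhead_percent = 100.0 * rational_extra / base
print(base, rational_extra, overhead_percent)
```

For this hypothetical architecture the base network has 10,401 parameters and the rational activations add 35, an overhead well under one percent.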
One advantage of using RNNs is that their poles can reveal information about discontinuities in the network. [boulle2021data] discusses this in detail. Consider an RNN, $R$, with two input variables, $x$ and $t$. Fix $t = t_0$ for some $t_0$. The function $x \mapsto R(x, t_0)$ must be rational, since the composition of a sequence of rational functions is rational. As the network trains, the poles of this function move. Their locations inform us about the discontinuities in the learned solution at $t = t_0$.
In this paper, we propose a novel PDE discovery technique that builds upon [raissi2018deep]'s two-network approach. We also introduce a principled sparse regression technique, based on [guyon2002gene]'s RFE, to extract a human-readable PDE from the learned network. We subsequently demonstrate that our approach copes well with sparse, noisy data sets, making it more practical for discovering new PDEs from complex, real-world scientific data. In section 3.1, we introduce a two-network approach which utilizes RNNs and periodic reselection of collocation points. In section 3.2, we introduce a parameter-free sparse regression algorithm that is based on RFE and is tailored to PDE discovery. In section 4, we demonstrate that our approach can discover linear and nonlinear PDEs from sparse and noisy data sets. Finally, in section 5, we discuss the rationale behind our algorithm, explore theoretical aspects of our sparse regression technique, and highlight some additional aspects of our approach.
2 Problem Statement and Assumptions
Let $\Omega$ be an open, connected set and let $u$ be a function on $\Omega$ describing the spatiotemporal evolution of a physical system. For example, $u$ could represent the heat distribution in a structure, or the depth of water within a shallow body. Suppose we have a set of noisy measurements $\tilde{u}_1, \dots, \tilde{u}_{N_{data}}$ that sample $u$ at a collection of points $(x_1, t_1), \dots, (x_{N_{data}}, t_{N_{data}}) \in \Omega$. We call these points data points. We assume the samples are noisy; thus, in general, $\tilde{u}_i \neq u(x_i, t_i)$. We define the noise in a sample as the difference between $\tilde{u}_i$ and $u(x_i, t_i)$. We assume the noise is additive at the data points and is i.i.d. Gaussian with mean zero. We define the level of noise in a collection of noisy samples as the ratio of the standard deviation of the noise to the standard deviation of the samples without noise. We discuss this further in section 4. We also assume there is some $n \in \mathbb{N}$ such that, in $\Omega$, $u$ satisfies an $n$th order PDE:
(3)  $D_t u = N\left(u, D_x u, \dots, D_x^n u\right)$
where $D_x^k F$ denotes the $k$th partial derivative of the function $F$ with respect to the variable $x$. For brevity, we abbreviate $D_x^1$ as $D_x$. $N$ is, in general, a nonlinear function of $u$ and its first $n$ spatial derivatives. However, we assume that $N$ can be expressed as a multivariable polynomial of $u$, $D_x u$, …, $D_x^n u$. In other words, we assume there exists some $M$ (the degree of the multivariable polynomial) and constants $c_1, \dots, c_K$ such that
(4)  $D_t u = \sum_{\alpha_0 + \alpha_1 + \cdots + \alpha_n \le M} c_{\alpha}\, u^{\alpha_0} (D_x u)^{\alpha_1} \cdots (D_x^n u)^{\alpha_n}$
which can be expressed more succinctly as
(5)  $D_t u = \sum_{k=1}^{K} c_k\, f_k\left(u, D_x u, \dots, D_x^n u\right)$
where $f_1, \dots, f_K$ enumerate the monomials of degree at most $M$ in $u, D_x u, \dots, D_x^n u$, and the $c_k$ relabel the corresponding coefficients $c_{\alpha}$.
We assume that this representation is sparse in the sense that most of the $c_k$'s are zero. Thus, only a few terms are present in equation 4.
For reference, the table below lists the notation in this section:
Symbol  Meaning 

$\Omega$  Problem domain.  
$u$  The system response function we are trying to approximate.  
$\tilde{u}_i$  Noisy sample of $u$ at the $i$th data point.  
$N_{data}$  Number of data points.  
$D_x^k$  The $k$th derivative of a function with respect to the variable $x$.  
$N$  PDE that $u$ satisfies.  
$n$  The PDE order of $N$.  
$M$  Polynomial degree of the representation of $N$.
3 Methodology
Our goal is to learn the coefficients, $c_1, \dots, c_K$, in equation 5. Our algorithm accomplishes this using a two-step procedure, depicted in figures 1 and 2. In the first step, we train neural networks $U$ and $N$ to approximate $u$ and the hidden PDE, respectively. Critically, we use the approach pioneered by [raissi2018deep], in that we require $U$ to satisfy the PDE learned by $N$. Section 3.1 discusses this step in detail.
In the second step, we use an adaptation of the Recursive Feature Elimination algorithm [guyon2002gene] to identify a human-readable PDE from $N$. In line with our assumptions in section 2, the identified PDE is a linear combination of polynomial terms. Section 3.2 describes this step in detail.
3.1 Neural Network approximations to $u$ and $N$
We define two neural networks, $U$ and $N$. $U$ learns to approximate the system response function, $u$, using the sparse and noisy samples $\tilde{u}_i$, while $N$ learns an approximation to the hidden PDE. We select a set of $N_{coll}$ collocation points, where we enforce the learned PDE. Here, $N_{coll}$ is a user-specified constant. To create the collocation points, we sample a uniform distribution supported on the problem domain.
We reselect the collocation points every $N_{re}$ training epochs, where $N_{re}$ is another user-specified constant.^{2} (^{2}A training epoch is the set of computations required for the networks to train on the entire set of training data once.) To reselect the collocation points, we use the same uniform distribution over the problem domain to draw a new set of coordinates. In all of our experiments, $N_{re}$ is one of two small values (see section 4 for more details). Periodically reselecting the collocation points helps regularize $U$, and dramatically improves our algorithm's robustness. We discuss this point in section 5.1.
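Drawing and redrawing collocation points is straightforward in code. A minimal sketch follows, assuming a box-shaped problem domain; the function name and signature are illustrative rather than taken from our library.

```python
import numpy as np

def draw_collocation_points(n_coll, bounds, rng):
    """Sample n_coll collocation points uniformly over a box-shaped
    problem domain.  bounds is a list of (low, high) pairs, one per
    coordinate, e.g. [(x_low, x_high), (t_low, t_high)]."""
    low = np.array([lo for lo, _ in bounds])
    high = np.array([hi for _, hi in bounds])
    return rng.uniform(low, high, size=(n_coll, len(bounds)))

# Reselecting the points every few epochs is simply another call:
rng = np.random.default_rng(42)
points = draw_collocation_points(1000, [(-1.0, 1.0), (0.0, 10.0)], rng)
```

Because each call draws a fresh independent sample, the learned PDE is enforced on different coordinates over the course of training, which is the regularizing effect described above.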
We evaluate $U$ at each data point and compare these values to the samples, $\tilde{u}_i$. We then evaluate $U$, its first time derivative, and its first $n$ spatial derivatives at each collocation point. We use these values to evaluate $N$ at the collocation points. This allows us to evaluate the following quantity at the $j$th collocation point $(x_j, t_j)$:
(6)  $R(x_j, t_j) = D_t U(x_j, t_j) - N\left(U, D_x U, \dots, D_x^n U\right)(x_j, t_j)$
which we call the PDE residual. We then use the PDE residual, the samples of $u$, and $U$ evaluated at the data points to construct the following loss function:
(7)  $Loss = \frac{1}{N_{data}} \sum_{i=1}^{N_{data}} \left(U(x_i, t_i) - \tilde{u}_i\right)^2 + \frac{1}{N_{coll}} \sum_{j=1}^{N_{coll}} R(x_j, t_j)^2$
For future reference, we define the first term on the right-hand side of equation 7 as the Data Loss, denoted $Loss_{data}$, and the second as the Collocation Loss, denoted $Loss_{coll}$. Thus, $Loss = Loss_{data} + Loss_{coll}$.
Our approach thus far mirrors the one established by [raissi2018deep], except that we resample the collocation points. We train the networks using a combination of the Adam [kingma2014adam] and LBFGS [liu1989limited] optimizers. We generally train the networks first with the Adam optimizer, followed by further training with the LBFGS optimizer at a smaller learning rate. The number of epochs for each optimizer is problem specific, and is driven by how quickly the networks converge. We discuss this in greater detail in sections 4 and 5.2.
For each hidden layer in our networks, we use a common rational activation function on each hidden unit in that layer. Networks with these activation functions are called Rational Neural Networks (RNNs) [boulle2020rational]. At the time of writing, no neural network framework has native support for rational activation functions; thus, we implement them explicitly in PyTorch. Our rational functions are of type (3, 2): the numerator is a polynomial of degree 3, and the denominator is a polynomial of degree 2. They are also adaptive, meaning that the coefficients that define the numerator and denominator train along with the network weights and biases. In our experiments, we did not couple the activation functions of different layers; thus, each layer's activation function evolves independently. We initialize our rational activation functions using the coefficients proposed in [boulle2020rational], which give the best type (3, 2) rational approximation to ReLU.
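For illustration, a type (3, 2) rational activation can be written in a few lines. The NumPy class below is a sketch, not our PyTorch implementation: in the library the seven coefficients are trainable parameters, and the initial values below are placeholders rather than the best rational approximation to ReLU from [boulle2020rational].

```python
import numpy as np

class RationalActivation:
    """Type (3, 2) rational activation: a cubic numerator over a
    quadratic denominator, seven coefficients in total.  In the actual
    library these are PyTorch parameters trained alongside the network
    weights; here they are fixed illustrative values."""

    def __init__(self):
        # Placeholder coefficients (NOT the published initialization).
        self.num = np.array([0.0, 1.0, 0.0, 0.1])  # a0 + a1 x + a2 x^2 + a3 x^3
        self.den = np.array([1.0, 0.0, 0.1])       # b0 + b1 x + b2 x^2 (> 0 here)

    def __call__(self, x):
        p = self.num[0] + self.num[1] * x + self.num[2] * x**2 + self.num[3] * x**3
        q = self.den[0] + self.den[1] * x + self.den[2] * x**2
        return p / q
```

Note that the illustrative denominator $1 + 0.1 x^2$ is strictly positive, so this particular activation has no real poles; during training the coefficients move freely, which is how poles can appear and migrate as described earlier.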
Figure 1 depicts an overview of this step. The table below summarizes the notation in this section.
Symbol  Meaning 

$U$  Neural network that approximates $u$.  
$N$  Neural network that approximates the hidden PDE.  
$(x_j, t_j)$  Collocation points.  
$N_{coll}$  Number of collocation points.  
$N_{re}$  Number of epochs between reselecting the collocation points.  
$R(x_j, t_j)$  PDE residual at the $j$th collocation point.  
$Loss$  The loss function.
3.2 Sparse regression via Recursive Feature Elimination
After training $U$ and $N$, we identify a human-readable PDE from $N$. Our approach is inspired by that of [rudy2017data] and [schaeffer2017learning]. Recall that we assume $u$ satisfies equation 5, which we restate here for convenience:
(8)  $D_t u = \sum_{k=1}^{K} c_k\, f_k\left(u, D_x u, \dots, D_x^n u\right)$
Once $N$ trains, we expect it to approximately satisfy equation 8 too.
To determine the coefficients $c_k$, we use a uniform distribution, supported on the problem domain, to sample a collection of coordinates which we call extraction points. Let $N_{ext}$ denote the number of extraction points. At each extraction point, we evaluate $U$, along with its first $n$ spatial derivatives. Using these values, we calculate each term occurring on the right-hand side of equation 8. We subsequently construct the library of candidate terms at these locations. This library is an $N_{ext}$ by $K$ matrix, $A$, whose $(i, k)$ entry holds the value of the $k$th candidate term at the $i$th extraction point. Let $b$ be the vector whose $i$th component holds the value of $N$ at the $i$th extraction point. Finally, let $\xi = (c_1, \dots, c_K)$. Then,
(9)  $A\,\xi \approx b$
We seek a sparse vector, $\xi$, which satisfies this equation in the least-squares sense. Finding the coefficient vector therefore boils down to a sparse regression problem. To find $\xi$, we propose a new method that builds on Guyon et al.'s Recursive Feature Elimination method [guyon2002gene]. First, we normalize $A$ such that each of its columns has unit norm. To do this, we define $\hat{A}$ as follows:
$\hat{A}_{:,j} = A_{:,j} / \|A_{:,j}\|_2$
where, in general, $M_{:,j}$ denotes the $j$th column of an arbitrary matrix $M$. With this, equation 9 becomes the following:
(10)  $\hat{A}\,\hat{\xi} \approx b$
where $\hat{\xi}_j = \|A_{:,j}\|_2\, \xi_j$. We apply the Recursive Feature Elimination algorithm to this linear system and base our feature importance metric on the least-squares residual of a candidate solution. If $\xi$ is a candidate, then the least-squares residual of $\xi$, denoted $r(\xi)$, is given by
$r(\xi) = \|\hat{A}\,\xi - b\|_2^2$
We define a feature's importance as the amount the residual increases if we set that feature to zero, while holding the other features fixed.^{3} (^{3}Note that the least important feature is not necessarily the feature whose removal would give the smallest residual if we re-solved the least-squares problem subject to the constraint that the feature is zero. We do, however, expect these two features to coincide in most cases. We discuss this point further in section 5.1.2.) Mathematically, the $j$th feature's importance is $r(\xi\,|\,\xi_j = 0) - r(\xi)$. In the discussion section, we show that because the columns of $\hat{A}$ have unit norm, the least important feature of a candidate is the one whose corresponding component of the candidate has the smallest magnitude. This critical result is why we normalize $A$.
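The importance metric is easy to compute directly, and computing it also illustrates the claim above numerically: at a least-squares solution with unit-norm columns, zeroing feature $j$ increases the squared residual by exactly the square of that coefficient. The names below are illustrative.

```python
import numpy as np

def importance(A_hat, xi, b):
    """Increase in the squared least-squares residual when feature j is
    zeroed while the other features are held fixed."""
    r0 = np.sum((A_hat @ xi - b) ** 2)
    scores = np.empty(len(xi))
    for j in range(len(xi)):
        xi_j = xi.copy()
        xi_j[j] = 0.0                      # zero one feature, hold the rest
        scores[j] = np.sum((A_hat @ xi_j - b) ** 2) - r0
    return scores
```

For a least-squares candidate, the residual is orthogonal to every column of the matrix, so with unit-norm columns the score of feature $j$ reduces to the squared coefficient; the least important feature is therefore the one with the smallest magnitude, with no further computation needed.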
Applying Recursive Feature Elimination with this feature importance metric yields a sequence of increasingly sparse candidate solutions. Let $\xi^k$ denote the candidate solution from the $k$th step of this procedure (equivalently, the $k$th least sparse solution). Since $\xi^{k+1}$ solves the least-squares problem subject to the constraint that the least important feature of $\xi^k$ is removed, we must have $r(\xi^{k+1}) \geq r(\xi^k)$. Thus,
$r(\xi^1) \leq r(\xi^2) \leq \cdots$
With this in mind, for each $k$, we evaluate the following:
(11)  $\dfrac{r(\xi^{k+1}) - r(\xi^k)}{r(\xi^k)}$
Note that when $\xi^k$ contains just one feature, we set $r(\xi^{k+1}) = \|b\|_2^2$.^{4} (^{4}This is consistent with the other cases, since the $k$th candidate has just one feature, meaning that if we remove its least important feature, we get the zero vector, whose residual is $\|b\|_2^2$.) We rank the candidates according to the values given by equation 11. In principle, the highest-ranking candidates contain only features necessary to describe the hidden dynamics. Critically, this approach uses no tunable parameters. Our algorithm then reports the top five candidate solutions as human-readable PDEs. It also reports the least-squares residual and the value of equation 11 (reported as a percent) for each of these top five candidates.
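The parameter-free ranking can be sketched as follows, assuming the RFE candidates are already available; `rank_candidates` and its signature are our illustrative names, not the library's API.

```python
import numpy as np

def rank_candidates(A_hat, b, candidates):
    """Rank candidates by the relative jump in squared residual that
    removing one more feature would cause.  The sparsest candidate is
    compared against the zero vector, whose residual is ||b||^2."""
    res = [np.sum((A_hat @ xi - b) ** 2) for xi in candidates]
    res.append(np.sum(b ** 2))                 # residual of the zero vector
    jumps = [(res[k + 1] - res[k]) / res[k] for k in range(len(candidates))]
    order = np.argsort(jumps)[::-1]            # largest jump ranked first
    return order, jumps
```

A large jump at step $k$ means the feature removed from candidate $k$ was essential, so candidate $k$ is the sparsest solution that still fits the data well; no threshold or penalty weight appears anywhere in the ranking.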
Figure 2 depicts an overview of this step. The table below summarizes this section’s notation.
Symbol  Meaning 

$N_{ext}$  The number of extraction points.  
$b$  An $N_{ext}$-element vector whose $i$th entry contains the value of $N$ evaluated at the $i$th extraction point.  
$A$  A matrix whose $(i, k)$ entry holds the value of the $k$th library term at the $i$th extraction point.  
$\hat{A}$  A matrix whose $j$th column satisfies $\hat{A}_{:,j} = A_{:,j} / \|A_{:,j}\|_2$.  
$r(\xi)$  The least-squares residual of a candidate $\xi$.
4 Results
We implement the approach detailed in section 3 as an open-source Python library. We call our implementation PDE Robust Extraction Algorithm for Data, or PDE-READ for short. Our entire library is available at https://github.com/punkduckable/PDEExtraction, along with auxiliary Matlab scripts that we use to generate our data sets.
In this section, we test PDE-READ on three equations: the Heat equation, Burgers' equation, and the Korteweg–De Vries (KdV) equation. These PDEs are popular in the PDE discovery community, and include both linear and nonlinear equations. For each example PDE, we run a sequence of experiments that explore a particular aspect of PDE-READ. In section 4.1, we test PDE-READ on the heat equation. These experiments also reveal a non-uniqueness issue that is fundamental to PDE discovery. In section 4.2, we test PDE-READ on Burgers' equation. These experiments demonstrate PDE-READ's robustness to both sparsity and noise. Finally, in section 4.3, we test PDE-READ on the Korteweg–De Vries equation. These experiments demonstrate that our sparse regression algorithm works well, even when the library of candidate terms is quite large.
In all of our experiments, we use the following procedure to introduce noise of a given level:

1. Calculate the standard deviation of the set of all data points.

2. Select a data point from this set. Sample a Gaussian distribution with mean zero whose standard deviation is the noise level times the standard deviation obtained in step 1. Add this sampled value to the value of $u$ at the current data point.

3. Repeat step 2 for each data point.

These steps yield a noisy data set. To introduce sparsity into a data set, we use a uniform distribution to randomly select a subset of the data points to use when training the networks.
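The noise and subsampling procedure above can be sketched as follows (the function name and signature are illustrative, not the library's API):

```python
import numpy as np

def corrupt_and_subsample(u, noise_level, n_train, rng):
    """Add mean-zero Gaussian noise whose standard deviation is
    noise_level times the standard deviation of the clean data, then
    uniformly select n_train training points without replacement."""
    noisy = u + noise_level * np.std(u) * rng.normal(size=u.shape)
    idx = rng.choice(u.size, size=n_train, replace=False)
    return noisy, idx
```

At noise level 1.0, for example, the injected noise has the same standard deviation as the clean signal, which matches the definition of noise level given in section 2.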
Further, we use the same architecture in all our experiments: $U$ and $N$ are both RNNs, each with a fixed number of hidden layers and hidden units per layer. Furthermore, in all our experiments, once we train $N$, we identify a PDE using the same number of extraction points.
For most of our experiments, we include an accompanying plot to visualize the experiment's results. In these plots, we depict the learned solution, the noisy data set (the entire data set, not just the subset we train on), as well as two measures of error in the trained networks. The first, which we call the absolute error, is the absolute value of the difference between the learned solution and the noisy data set. The second is the PDE residual (defined by equation 6 in section 3.1).
In these experiments, the true PDE is the one we use to generate the data set, while the identified PDE is the highest-ranked candidate that PDE-READ uncovers. Our principal goal is for the identified PDE to contain the same terms as the true PDE. Matching the coefficients in the identified PDE to those in the true PDE is our secondary goal. Critically, we can only satisfy the second goal if we satisfy the first. Assuming the identified and true PDEs have the same form, we measure our second goal by comparing the coefficients of the identified PDE to the corresponding coefficients in the true PDE. In particular, for a given coefficient, if the coefficient's value in the true PDE is $c$, and its value in the identified PDE is $\hat{c}$, then we define the relative error between the two as $|c - \hat{c}| / |c|$. In each experiment, we explore the conditions under which PDE-READ identifies the correct PDE (primary goal) and the relative error between the coefficients in the identified and true PDEs (secondary goal).
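The coefficient comparison is a one-line computation; a small helper (with illustrative names) makes the definition concrete:

```python
def relative_error(c_true, c_identified):
    """Relative error |c - c_hat| / |c| between a true coefficient and
    the corresponding coefficient in the identified PDE."""
    return abs(c_true - c_identified) / abs(c_true)
```

For instance, if a true diffusion coefficient of 0.1 is identified as 0.095, the relative error is 5%.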
4.1 Heat equation
First, we consider the heat equation, which describes how heat diffuses through a body over time. In one dimension, with constant thermal diffusivity $\alpha$, the heat equation is given as follows:
(12)  $D_t u = \alpha\, D_x^2 u$
Here, the temperature, $u$, is a function of two variables: $x$ and $t$. We test two data sets for this equation, both with the same diffusivity. We generate our heat equation data sets using the Chebfun [driscoll2014chebfun] Matlab package. Our repository includes scripts to generate these data sets.
For both data sets, $x$ and $t$ take values in fixed intervals, and both data sets use periodic boundary conditions: at each time, the solution takes the same value at the two spatial boundary points. We discretize the spatial and temporal domains into equally sized subintervals, and Chebfun solves the problem on the resulting regular space-time grid. The grid points serve as the data points for both data sets.
First data set: After generating the first data set from its initial condition, we contaminate it with noise using the procedure at the start of section 4. By subsampling the noisy data set with a uniform distribution, we create a training set of data points, along with the temperatures at those points. Figure 3 depicts the first heat equation data set.
For this experiment, we reselect the collocation points every epoch. We train both networks using the Adam optimizer. Figure 4 depicts the data set with added noise, the learned solution $U$, the absolute error of the learned solution, and the PDE residual. Using the sparse regression algorithm of section 3.2, we identify the following PDE from the trained network:
(13) 
Interestingly, this is not the heat equation. PDE-READ did not fail, however. In particular, the learned PDE almost perfectly describes the data set without noise. In section 5.2.1, we demonstrate that the data set in this problem satisfies a specialized form of the heat equation that is almost identical to equation 13. This result highlights a fundamental problem with PDE discovery in general: it is an ill-posed problem. In particular, multiple PDEs can describe the same data set. We discuss this point, as well as what exactly happens with this data set, in section 5.2.1.
Second data set: After generating the second data set from its initial condition, we add noise to it using the procedure at the start of section 4. By subsampling the noisy data set with a uniform distribution, we create a training set of data points, along with the temperatures at those points. Figure 3 depicts the second heat equation data set.
For this example, we reselect the collocation points every epoch. We train the networks using the Adam optimizer. Figure 6 depicts the data set with added noise, the learned solution $U$, the absolute error of the learned solution, and the PDE residual.
PDE-READ identifies the following PDE from the trained network:
This is the heat equation. Moreover, the coefficient in the learned PDE exhibits only a small relative error with respect to that in the true PDE. We expect some deviation, given the extreme noise level in the data set.
4.2 Burgers’ equation
Next, we consider Burgers' equation, which arises in many technological contexts, including Fluid Mechanics, Nonlinear Acoustics, Gas Dynamics, and Traffic Flow [basdevant1986spectral]. In one dimension, Burgers' equation takes the following form:
(14)  $D_t u = \nu\, D_x^2 u - u\, D_x u$
Here, the velocity, $u$, is a function of two variables: $x$ and $t$, and $\nu$ is the diffusion coefficient. Burgers' equation is a nonlinear second-order PDE. Significantly, solutions to Burgers' equation can develop shocks (discontinuities). We test PDE-READ on two data sets for this equation. Both are from [raissi2018deep] and use the same diffusion coefficient. The first exhibits shock formation while the latter does not.
First data set: The domain of the first data set is a fixed spatial interval crossed with a fixed temporal interval. The data set uses periodic boundary conditions. It contains the pointwise velocity evaluated on a spatiotemporal grid with equally spaced spatial and temporal grid lines. Figure 7 depicts the first data set without noise.
We use this data set to test how robust PDE-READ is to sparsity and noise. To do this, we use a uniform distribution over the data set to create subsets of data points of varying sizes, along with the velocities at those data points, to subsequently train our RNNs. We then test the maximum noise level at which PDE-READ reliably discovers Burgers' equation. Further, the PDE terms in the library are restricted to multivariable polynomials of bounded degree in $u$ and its spatial derivatives.
To illustrate how robust PDE-READ is to noise, we corrupt the first data set with a high level of noise. Then, using a uniform distribution, we create a subset of data points, along with the velocities at those data points, to subsequently train our RNNs. We reselect the collocation points every few epochs and train the networks using the Adam optimizer. After training $N$, PDE-READ discovers the following PDE:
This is Burgers' equation, though both coefficients exhibit a modest relative error with respect to those in the true PDE, equation 14. Figure 8 depicts the data set with added noise, the learned solution $U$, the absolute error of the learned solution, and the PDE residual.
To illustrate how robust PDE-READ is to sparsity and noise together, we hold the noise level fixed and substantially decrease the number of training data points. We reselect the collocation points every epoch and train the networks using the Adam optimizer, followed by further training with the LBFGS optimizer. This yields the following PDE:
This is Burgers' equation, though both coefficients display a noticeable relative error with respect to those in the true PDE, equation 14. Therefore, PDE-READ successfully identifies Burgers' equation, a nonlinear PDE, from a heavily corrupted data set that exhibits a discontinuity, using velocities at only a small fraction of the data points; a dramatic improvement over what has been reported in the literature. Figure 9 depicts the data set with added noise, the learned solution $U$, the absolute error of the learned solution, and the PDE residual.
Next, we reduce the number of data points to . In this case, PDE-READ fails to identify Burgers’ equation with % noise or more. However, PDE-READ successfully identifies Burgers’ equation with % noise and velocities at just data points. To achieve this result, we use collocation points (), which the program reselects every epoch (). We train the networks for epochs using the Adam optimizer with a learning rate of . This yields the following PDE:
This is Burgers’ equation, though the coefficients differ markedly from those in the true PDE, equation 14.
Next, we repeat the experiment above with just data points. At this level of sparsity, PDE-READ fails to identify Burgers’ equation if the data set contains % noise or more. However, it correctly identifies Burgers’ equation with % noise and velocities at just data points. In this case, we use collocation points (), which the program reselects every epoch (). We train the networks for epochs using the Adam optimizer with a learning rate of . This yields the following PDE:
This is Burgers’ equation, though the coefficients differ substantially from those in the true PDE, equation 14.
Second data set: The domain of the second data set is also the set of all such that and . The data set uses periodic boundary conditions and the initial condition
The data set contains the velocity evaluated on a spatiotemporal grid. There are spatial grid lines and temporal grid lines, each of which is equally spaced, yielding velocities at a total of data points. Figure 7 depicts the second data set without noise.
PDE-READ successfully discovers Burgers’ equation from this data set with % noise, using just data points to do so. To achieve these results, we use collocation points, which the program reselects every epoch (). We train the networks for epochs using the Adam optimizer with a learning rate of , followed by epochs using the L-BFGS optimizer with a learning rate of . After training, PDE-READ yields the following PDE:
This is Burgers’ equation, and the coefficients closely match those in the true PDE, equation 14. Figure 11 depicts the data set with added noise, the learned solution, the absolute error of the learned solution, and the PDE residual. This figure shows that the learned solution is strikingly similar to the noiseless data set in figure 10. These results are remarkable given that we train the networks on exceptionally noisy data.
4.3 Korteweg–De Vries equation
Next, we consider the Korteweg–De Vries (KdV) equation. In one dimension, this third-order equation takes the following form:
(15) 
Here, the wave amplitude, , is a function of two variables: and . This nonlinear equation describes the evolution of one-dimensional waves in various settings, including shallow-water conditions.
We test PDE-READ on this equation using a data set from [raissi2018deep]. The domain of this data set is the set of all such that and . This data set starts from the initial condition
The data set contains the wave amplitude at every point on a grid discretization of the problem domain. uniformly spaced grid lines partition the spatial subdomain into equally sized subintervals. Likewise, uniformly spaced grid lines partition the temporal subdomain into equally sized subintervals, each with a width of seconds. Thus, this data set contains samples at a total of data points. Figure 12 depicts the KdV data set.
To test PDE-READ on the KdV equation, we corrupt the KdV data set with % noise. By subsampling the data set with a uniform distribution, we create a subset of data points () to train on, along with the associated wave amplitudes at those data points. We train both networks on these noisy samples using epochs of the Adam optimizer with a learning rate of , followed by epochs of the L-BFGS optimizer with a learning rate of . We then identify a human-readable PDE from . To demonstrate that our sparse regression algorithm is quite robust, we expand our library to include all multivariable polynomials of degree or less that involve the quantities , , , and . In this case, there are terms in the library. Notwithstanding this enormous library, PDE-READ still identifies the following PDE:
This is the KdV equation, though the coefficients have a relative error of about % with respect to the corresponding values in the true PDE, equation 15. Figure 13 depicts the data set with added noise, the learned solution, the absolute error of the learned solution, and the PDE residual. From this figure, we can see that the learned solution is strikingly similar to the noise-free data set in figure 12.
To demonstrate that the discrepancy in the coefficients of the identified PDE is a product of the large noise level, we repeat the experiment above, this time with just % noise. We once again use and train the networks using epochs of the Adam optimizer with a learning rate of , followed by epochs of L-BFGS with a learning rate of . Once trained, we identify a human-readable PDE from . In this case, our library is again restricted to multivariable polynomials of degree or less that involve the quantities , , , and . We identify the following PDE:
This is the KdV equation. This time, the coefficients have a relative error of about % with respect to the corresponding values in the true PDE, equation 15.
5 Discussion
5.1 Methodology
5.1.1 Neural Network approximations to and
Learning More General PDEs: Throughout this paper, we have assumed that , the function which approximates, satisfies a PDE of the form of equation 3. This assumption holds for the PDEs we study in this paper, but it by no means encapsulates all possible PDEs. For example, consider the one-dimensional wave equation:
There is no way to express this equation in the form of equation 3 (due to its second derivative in time), which means that PDE-READ cannot learn it. With that said, we can modify PDE-READ to learn a larger class of PDEs. Given some , we can learn a PDE of the form
by modifying to the following:
Using this loss function with should allow PDE-READ to learn the wave equation. Determining the efficacy of PDE-READ in this more general setting represents a potential area of future research.
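To make the modification concrete, the second-order-in-time collocation loss can be sketched in PyTorch. This is an illustrative sketch, not the PDE-READ implementation: the function name, the two-column coordinate layout, and the assumption that the second network accepts (u, u_x, u_xx) are ours.

```python
import torch

def collocation_loss_2nd_order(U, N, coords):
    """Collocation loss for a hidden PDE of the form u_tt = N(u, u_x, u_xx).

    U      : network mapping (t, x) -> u
    N      : network mapping (u, u_x, u_xx) -> u_tt
    coords : (M, 2) tensor of collocation points, columns ordered (t, x)
    """
    coords = coords.clone().requires_grad_(True)
    u = U(coords).squeeze(-1)

    # First derivatives of u with respect to t and x, via autograd.
    grad = torch.autograd.grad(u.sum(), coords, create_graph=True)[0]
    u_t, u_x = grad[:, 0], grad[:, 1]

    # Second derivatives: u_tt and u_xx.
    u_tt = torch.autograd.grad(u_t.sum(), coords, create_graph=True)[0][:, 0]
    u_xx = torch.autograd.grad(u_x.sum(), coords, create_graph=True)[0][:, 1]

    # Residual of the learned PDE at the collocation points.
    residual = u_tt - N(torch.stack([u, u_x, u_xx], dim=-1)).squeeze(-1)
    return (residual ** 2).mean()
```

Because `create_graph=True` is set on both differentiation passes, the returned scalar remains differentiable with respect to the parameters of both networks, so it can be combined with the data loss during training.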
Rational Activation Functions: The PDE-READ source code lets the user choose whether U and N use rational, , or sigmoid activation functions. However, RNNs seem to universally outperform their and sigmoid counterparts. For example, with data points, PDE-READ using activation functions can only reliably identify Burgers’ equation if the data set contains % noise or less, just half of what PDE-READ using RNNs can handle. This is why PDE-READ uses RNNs by default.
We took great care to implement our rational activation functions as efficiently as possible. In particular, we use Horner’s rule to minimize the number of operations needed to evaluate them. We also experimented with dozens of small tweaks to improve the speed with which the Python interpreter evaluates them. Despite this, our RNNs run exceptionally slowly. In particular, a forward pass through one of our RNNs takes about an order of magnitude longer than one through a network with the same architecture. However, we believe our RNNs run slowly because they are written in Python, not because RNNs are fundamentally slow.
A type rational function is fundamentally simpler than , in that it requires far fewer arithmetic operations to evaluate. The key difference is that PyTorch’s function is built-in and fully compiled, while our rational activation function has to be interpreted. When Python encounters , the function evaluates in machine code. This allows Python to evaluate very efficiently, notwithstanding the function’s fairly complicated nature. By contrast, when Python encounters our rational activation function, it has to interpret our Python code line by line before evaluating the function. Most of the runtime goes into interpreting the lines of code, rather than executing computations. As a result, our rational activation functions evaluate slowly, despite their relatively simple nature.
The problem gets even worse when we differentiate RNNs. PyTorch has to use its differentiation rules to evaluate the derivative of our rational activation functions, which takes time. By contrast, PyTorch can use built-in routines to rapidly evaluate the derivative of . Thus, the runtime penalty associated with using RNNs increases with the order of the hidden PDE.
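For reference, a trainable rational activation of the kind described above can be sketched in PyTorch as follows. The type-(3, 2) form and the ReLU-approximating initial coefficients follow the rational-network literature; the exact values and the absolute-value guard on the denominator are illustrative choices, not necessarily what PDE-READ uses.

```python
import torch

class Rational(torch.nn.Module):
    """Trainable type-(3, 2) rational activation P(x) / Q(x), evaluated
    with Horner's rule so each polynomial costs a handful of fused
    multiply-adds. The initial coefficients make the function roughly
    approximate ReLU, as in the rational-network literature."""

    def __init__(self):
        super().__init__()
        # P(x) = a0 + a1 x + a2 x^2 + a3 x^3
        self.a = torch.nn.Parameter(torch.tensor([0.0218, 0.5000, 1.5957, 1.1915]))
        # Q(x) = 1 + |b1 x + b2 x^2|  (kept positive so Q never vanishes)
        self.b = torch.nn.Parameter(torch.tensor([0.0000, 2.3830]))

    def forward(self, x):
        # Horner's rule: nested form instead of explicit powers of x.
        p = self.a[0] + x * (self.a[1] + x * (self.a[2] + x * self.a[3]))
        q = 1.0 + torch.abs(x * (self.b[0] + x * self.b[1]))
        return p / q
```

Even written this compactly, every forward pass through this module is dispatched through the Python interpreter, which is the overhead discussed above.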
All of this highlights one of the primary limitations of using an interpreted language, like Python, for computationally demanding tasks. These observations also suggest a natural solution: implement our rational activation functions in a fully compiled language, like C++. Indeed, PyTorch has a C++ backend for its Python frontend. Nonetheless, most of our experiments still run in a reasonable amount of time (a few hours or less, sometimes much less). In particular, most of the Burgers’ experiments run in less than 5 minutes on a high-memory E2 GCP node. With that said, the computational cost of evaluating our rational activation functions in Python may be unacceptably large when trying to uncover high-order PDEs from large data sets. As such, we believe our rational activation functions should be implemented in C++ before attempting to use PDE-READ on computationally demanding data sets.
Collocation Points and Reselection: We periodically reselect the collocation points because doing so enhances their regularizing power and increases the computational efficiency of our approach. We discuss the latter point in section 5.2. Here, we focus on the former: collocation points regularize , and periodically reselecting them enhances this effect. The collocation loss prevents from simply memorizing the noisy samples of the system response function; must also learn to satisfy the learned PDE, . In other words, the two-network approach introduces an inductive bias that embeds the fact that is a PDE solution into the loss function. We believe this inductive bias is the biggest strength of the two-network approach and explains its robustness to sparsity and noise. Without it, would be free to overfit the noisy samples, leading to poor generalization. The collocation points thus help “denoise” the samples and recover the underlying function.
This naturally raises the question: how many collocation points should we use? As many as possible, we believe. Increasing makes PDE-READ more robust to noise and sparsity. If there are only a few collocation points, then and can “conspire” such that satisfies just at those points. In this case, is free to overfit the noisy samples, and fails to make learn a solution to the hidden PDE. By contrast, if is sufficiently large, then the easiest way to make the collocation loss small is for to learn the hidden PDE and for to learn a solution to it. What “sufficiently large” means is problem dependent, but we believe that having more collocation points than network parameters qualifies. In our case, that meant using about collocation points (which is what we use in most experiments). However, we believe you should use as many collocation points as is practical. We discuss this further in section 5.2.
Periodically reselecting the collocation points effectively adds new collocation points without increasing the computational load. Collocation points are remarkable in that we do not need training data for them: creating new collocation points is as simple as generating new coordinates. This means that periodically reselecting the collocation points is very cheap. At the same time, periodic reselection forces to satisfy the learned PDE at a different set of points every few epochs. In essence, reselection simulates having more collocation points. This is the principal reason why we periodically reselect them.
Reselection creates a new question: how often should we reselect the collocation points? In our experiments, we reselect them at least once every 20 epochs. The cost of reselecting collocation points is tiny compared with that of backpropagation. Thus, the frequency of reselection has little impact on PDE-READ’s runtime, but it can significantly improve PDE-READ’s efficacy. In our experiments, PDE-READ works best when is as small as possible (preferably 1). This is why in most of our experiments. We strongly recommend setting .
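A minimal sketch of the reselection scheme, assuming a rectangular problem domain. The function and parameter names are ours, and `compute_loss` stands in for the full PDE-READ loss:

```python
import torch

def sample_collocation(n_points, bounds):
    """Draw n_points collocation coordinates uniformly from a rectangular
    problem domain. bounds is a list of (low, high) pairs, one per
    coordinate, e.g. [(t_min, t_max), (x_min, x_max)]."""
    lows = torch.tensor([lo for lo, _ in bounds])
    highs = torch.tensor([hi for _, hi in bounds])
    return lows + (highs - lows) * torch.rand(n_points, len(bounds))

def train(U, N, optimizer, compute_loss, n_epochs, n_coll, bounds,
          reselect_every=1):
    """Training loop with periodic collocation-point reselection.
    compute_loss(U, N, coll) is a placeholder for the full loss."""
    coll = sample_collocation(n_coll, bounds)
    for epoch in range(n_epochs):
        # Fresh collocation points need no training data: just new coordinates.
        if epoch % reselect_every == 0:
            coll = sample_collocation(n_coll, bounds)
        optimizer.zero_grad()
        loss = compute_loss(U, N, coll)
        loss.backward()
        optimizer.step()
    return U, N
```

Because sampling new coordinates is orders of magnitude cheaper than a backward pass, setting `reselect_every=1` adds essentially no runtime.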
Dropout: Since many of our experiments use sparse data sets, overfitting is a major concern. Periodically reselecting the collocation points dramatically reduces overfitting, and is one of the defining features of PDE-READ. However, we also tried dropout [srivastava2014dropout] and batch normalization. Dropout is a technique that randomly removes hidden units from the network during each forward pass. A user-defined constant controls the proportion of hidden units that dropout removes. Dropout effectively reduces the network’s size; since small networks cannot overfit to the same degree as big ones, dropout limits overfitting. Further, proponents of dropout argue that it makes the network more robust: the network cannot rely on any one hidden unit (since dropout may remove that unit on some forward passes), so it learns to give predictive results even when some hidden units are missing.
Batch normalization adjusts the inputs to each hidden layer. Before a minibatch passes through a hidden layer, the network evaluates the minibatch’s mean and variance. The network then scales and translates the data such that it has zero mean and unit variance (or, when that is not possible, a variance arbitrarily close to one). This adjusted data becomes the input to the hidden layer. Proponents of batch normalization argue that it reduces activation function saturation, which should improve the network’s ability to learn.
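For concreteness, the normalization step described above can be written in a few lines; this is the generic transformation, not our training code, and the epsilon guard is the standard trick for the case where exact unit variance is impossible:

```python
import numpy as np

def batch_normalize(X, eps=1e-5):
    """Normalize each feature (column) of a minibatch to zero mean and
    near-unit variance. X has shape (batch, features). The small eps
    keeps the division well-defined when a feature is constant."""
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    return (X - mean) / np.sqrt(var + eps)
```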
In practice, we found that these techniques dramatically reduce the performance of PDE-READ. With them, our networks fail to learn the hidden PDE. We believe we know why: the networks cannot learn an approximation of the solution and its derivatives (with respect to the network inputs) that maintains its predictive power when dropout randomly removes hidden units. Even using batch normalization without dropout appears to reduce PDE-READ’s performance. Because of these results, neither dropout nor batch normalization is part of our approach.
5.1.2 Sparse Regression via Recursive Feature Elimination
Extraction points: Automatic differentiation allows us to differentiate (and, by extension, evaluate the library terms) anywhere. This means we can place the extraction points anywhere in the problem domain, and we can have as many extraction points as we desire. We generate extraction points by sampling a uniform distribution over the spatial domain, . Our approach keeps the collocation, data, and extraction points independent.
Increasing the number of extraction points gives us a better picture of how behaves throughout . Adding extraction points should, therefore, improve how well the solution to equation 9 approximates the solution to equation 8. Thus, should be as large as is practical. Adding extraction points adds rows to the library matrix. Since our sparse regression algorithm solves a sequence of progressively sparser leastsquares problems, increasing the number of extraction points increases runtime. However, in our experience, our sparse regression algorithm runs quickly (a few seconds) even if is quite large ( or more). As such, we use extraction points in all our experiments.
Alternative Sparse Regression Algorithms: Our sparse regression algorithm is one of the greatest strengths of PDE-READ. In particular, PDE-READ does not work nearly as well if we replace our sparse regression algorithm with STRidge, LASSO, or thresholded LASSO (first, use LASSO to find a sparse solution; determine which features are zero in the LASSO solution; discard those features and then find the least-squares solution using the remaining features). We believe this is because the other algorithms suffer from two main problems: parameter fine-tuning and undesirable bias. With this in mind, let us consider STRidge. To make STRidge work, we must select a value for . However, if is too big, then the solution contains unnecessary terms. By contrast, if is too small, then the solution excludes too many terms and has little predictive power. PDE-READ with STRidge only identifies the correct PDE if lies in a very narrow range. Worse yet, the set of appropriate values appears to be problem dependent. If we fine-tune to identify the correct PDE in one experiment, PDE-READ with STRidge and the same value will often not identify the correct PDE in another experiment. As a result, picking an appropriate threshold is difficult without knowing the correct PDE a priori. We call this the fine-tuning problem. Second, overfitted least-squares solutions often exhibit large coefficients, which makes them less likely to be thresholded by STRidge. In this sense, STRidge rewards overfitting. While appropriate preconditioning can, to some degree, assuage the second issue, the fine-tuning problem is inescapable. Both LASSO and thresholded LASSO suffer from similar deficiencies.
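To make the fine-tuning problem concrete, here is a sketch of STRidge as described in [rudy2017data]; the ridge penalty, iteration count, and variable names are illustrative defaults of ours. The threshold’s outsized influence on the recovered support is easy to observe:

```python
import numpy as np

def stridge(A, b, threshold, ridge=1e-5, n_iter=10):
    """Sequentially thresholded ridge regression (STRidge), sketched from
    its published description. Coefficients whose magnitude falls below
    `threshold` are zeroed each iteration, and ridge regression is
    re-run on the surviving features."""
    n = A.shape[1]
    keep = np.ones(n, dtype=bool)
    c = np.zeros(n)
    for _ in range(n_iter):
        if not keep.any():              # threshold wiped out every feature
            return np.zeros(n)
        # Ridge regression restricted to the surviving features.
        G = A[:, keep].T @ A[:, keep] + ridge * np.eye(int(keep.sum()))
        c = np.zeros(n)
        c[keep] = np.linalg.solve(G, A[:, keep].T @ b)
        keep = np.abs(c) >= threshold   # hard-threshold small coefficients
        c[~keep] = 0.0
    return c
```

With a well-chosen threshold this recovers a sparse model; with a threshold that is slightly too large it returns the zero vector, which is precisely the sensitivity discussed above.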
Avoiding the fine-tuning problem is the principal reason we developed a parameter-free sparse regression algorithm. We want a practical algorithm that can work in real-world settings, and we believe this goal is unattainable without a parameter-free algorithm.
Fine-tuning aside, our sparse regression algorithm appears to be far more robust than either STRidge or LASSO. Consider the first Burgers’ equation data set from section 4.2. PDE-READ with STRidge cannot reliably identify Burgers’ equation from this data set (and data points) if it contains more than % noise. PDE-READ with LASSO is even worse; it cannot identify Burgers’ equation if the data set contains more than % noise. This is in stark contrast to PDE-READ with our sparse regression algorithm, which identifies Burgers’ equation with up to % noise using just data points.
Theoretical Examination of Our Sparse Regression Algorithm: Let us now explore why PDE-READ is so robust to noise. Let be an by matrix and be a vector. We want to find the sparse vector which approximately solves . In this section, we consider how our sparse regression algorithm handles this problem. Given a candidate solution, , we want to determine the component, , of , which does the least to minimize the least-squares residual. In other words, we want to find
(16) 
As in [guyon2002gene], we define the least important feature (LIF) of as the component of with the smallest magnitude. In general, this is not a principled approach, as the LIF may not correspond to the solution of equation 16. The situation changes, however, if we appropriately normalize . To begin, the least-squares residual, , attains its global minimum at the least-squares solution. A little algebra shows that is a quadratic function of each component of . In particular,
Therefore, is a smooth function of . Thus, we can use a second-order Taylor approximation of as a function of
to estimate the value of the least-squares residual of
with the th feature removed (). In particular, since the least-squares solution corresponds to a minimum of , we must have . Therefore,
A little algebra shows that
In other words, is the square of the norm of (the th column of ). If we divide each column of by its norm, then each column of the resulting matrix has unit norm. If we replace by this new, normalized matrix, then
Thus, in this case, the magnitude of a feature indicates how much removing that feature increases the least-squares residual. This is exactly the quantity our sparse regression algorithm uses. The above argument proves the following critical result:
Theorem: If is a least-squares solution to , and if each column of has unit norm, then the LIF of corresponds to the solution of equation 16.
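The theorem is easy to verify numerically. The following NumPy sketch (with a random system, purely for illustration) confirms that, after column normalization, the smallest-magnitude coefficient of the least-squares solution is exactly the one whose removal increases the residual the least:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 6))
A = A / np.linalg.norm(A, axis=0)       # give every column unit norm
b = rng.normal(size=50)
c, *_ = np.linalg.lstsq(A, b, rcond=None)

# Increase in the least-squares residual from zeroing each component of c
# while holding the others fixed (the quantity in equation 16).
base = np.linalg.norm(A @ c - b) ** 2
increase = []
for i in range(len(c)):
    c_i = c.copy()
    c_i[i] = 0.0
    increase.append(np.linalg.norm(A @ c_i - b) ** 2 - base)

lif = int(np.argmin(np.abs(c)))         # least important feature by magnitude
assert lif == int(np.argmin(increase))  # matches the least residual increase
```

Consistent with the derivation, the residual increase from zeroing the th component equals that component squared, so ranking by magnitude and ranking by residual increase coincide.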
We should note, however, that the LIF under our importance metric may not be optimal, in the sense of being the feature whose removal leaves the smallest possible least-squares residual. Let us define the optimal importance of a feature, , as follows:
(17) 
The LIF with optimal importance is the feature, , that gives the smallest residual if we re-solve the least-squares problem subject to the constraint . In other words,
(18) 
Recall that in RFE, after we identify the LIF, we re-solve the least-squares problem without the LIF. Thus, the LIF under optimal importance is the feature of the candidate solution, , whose removal makes the residual of the next candidate solution as small as possible.
The LIF under our importance metric is not necessarily the LIF under optimal importance. In equation 16, we remove the th component but keep the others fixed. By contrast, in equation 18, we remove the th feature and allow the others to change. In particular, the solution to equation 17 may not be . With that said, evaluating a feature’s optimal importance requires a least-squares solve. Therefore, if our library contains terms, then RFE using optimal importance requires least-squares solves. For anything other than a tiny library, such an approach is computationally impractical. By contrast, RFE using our importance metric requires just least-squares solves. Further, since our importance metric is based on the second-order Taylor expansion of the residual, it should approximate optimal importance. Thus, in most cases, we expect the LIF under our importance metric (equation 16) to coincide with the LIF under optimal importance (equation 18).
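Putting the pieces together, the resulting parameter-free RFE procedure can be sketched as follows. This is an illustrative reimplementation from the description above, not the PDE-READ source; selecting among the candidate solutions (one per sparsity level) is left to the caller:

```python
import numpy as np

def rfe_sparse_regression(A, b):
    """Recursive feature elimination with the magnitude-based importance
    metric. Columns of A are normalized to unit norm so that coefficient
    magnitude measures each feature's contribution to the residual.
    Returns one candidate solution per sparsity level, in the original
    (un-normalized) coordinates."""
    norms = np.linalg.norm(A, axis=0)
    A = A / norms
    active = list(range(A.shape[1]))
    candidates = []
    while active:
        # Least-squares solve restricted to the surviving features.
        c, *_ = np.linalg.lstsq(A[:, active], b, rcond=None)
        full = np.zeros(A.shape[1])
        full[active] = c / norms[active]       # undo the normalization
        candidates.append(full)
        active.pop(int(np.argmin(np.abs(c))))  # drop the least important feature
    return candidates
```

Note that the entire procedure has no tunable parameters: the sweep over sparsity levels replaces the threshold that STRidge and LASSO require.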
5.2 Results
Picking the Number of Data Points: In most experiments, we somewhat arbitrarily use data points. However, we do not claim PDE-READ needs this many data points to work. In particular, PDE-READ works quite well even if is far smaller than . For example, in section 4.2, we demonstrate that PDE-READ discovers Burgers’ equation with % noise using just data points. Further, PDE-READ discovers Burgers’ equation with up to % noise using just data points.
Computational Costs of Collocation Points: In section 5.1, we state that should be as large as is practical. In our experiments, dominates the runtime of PDE-READ. Further, the computational cost of each collocation point rises with the order of the learned PDE. These results are a direct consequence of the fact that PDE-READ must evaluate and its spatial partial derivatives at each collocation point. By contrast, at each data point, PDE-READ must only evaluate . Differentiating is expensive. Thus, evaluating at a collocation point is far more expensive than evaluating at a data point, which explains why dominates runtime. The order of the learned PDE exacerbates this problem, since higher-order equations require higher-order derivatives. As such, the computational cost associated with each collocation point increases with the order of the learned PDE. Because of these fundamental issues, there is a practical limit to how large can become without making the runtime unacceptably large.
With that said, we can hardly overstate the importance of using as many collocation points as possible. Increasing forces to learn the hidden PDE and incentivizes to learn a function that satisfies both the learned PDE and the samples.
As such, selecting a good value of requires balancing computational demands with approximation quality. We ran all of our experiments on one of three computers:

An aging 2015 MacBook Pro with a dual-core Intel(R) Core(TM) i7-5557U CPU @ 3.10 GHz. This computer has an integrated GPU and does not support CUDA. As such, all experiments on this computer run on the CPU. Most of our Burgers’ and heat equation experiments ran on this machine.

A high-memory E2 GCP instance. This machine features a dual-core CPU (with similar specs to the one in the MacBook Pro) but no GPU. Some of our Burgers’ equation experiments ran on this machine, along with most of our KdV experiments.

A brand-new high-performance desktop from Lambda Labs with a 32-core AMD Ryzen Threadripper PRO 3975WX CPU and four Nvidia RTX A4000 graphics cards. All experiments on this machine use GPUs and run more than ten times faster than experiments on the other machines. Unfortunately, our group received this computer after we ran most of our experiments. As such, we only used it to run a few Burgers’ and KdV experiments.
The aging hardware on the first two computers, which ran the overwhelming majority of our experiments, limits us to just a few thousand collocation points in most experiments. This limitation highlights the benefits of collocation point reselection, which effectively adds collocation points without increasing runtime. However, increasing may enable PDE-READ to learn the PDEs in section 4 at higher noise and sparsity levels than we report.
PDE-READ Works if Accepts Too Many Derivatives: In our experiments, the number of spatial partial derivatives of that accepts matches the order of the hidden PDE. If the hidden PDE is of order , then is a function of and its first spatial partial derivatives. In practice, however, the order of the hidden PDE may be unknown. To be practical, PDE-READ must work in this case, and it does. To demonstrate this, we use Burgers’ equation (a second-order equation). In this experiment, accepts and its first three spatial partial derivatives. We corrupt the first Burgers’ equation data set with % noise and then uniformly subsample points from the noisy data set (). We set and . We train for epochs using the Adam optimizer with a learning rate of , followed by epochs using the L-BFGS optimizer with a learning rate of . After training, PDE-READ identifies the following PDE:
This is Burgers’ equation. In this experiment, learns to ignore the third derivative of . Therefore, PDE-READ works even if the order of the hidden PDE is unknown. With that said, PDE-READ appears to be less robust in this case. If we repeat this experiment with % noise, then PDE-READ identifies the following PDE: