Learning Partial Differential Equations from Data Using Neural Networks

by   Ali Hasan, et al.

We develop a framework for estimating unknown partial differential equations from noisy data, using a deep learning approach. Given noisy samples of a solution to an unknown PDE, our method interpolates the samples using a neural network, and extracts the PDE by equating derivatives of the neural network approximation. Our method applies to PDEs which are linear combinations of user-defined dictionary functions, and generalizes previous methods that only consider parabolic PDEs. We introduce a regularization scheme that prevents the function approximation from overfitting the data and forces it to be a solution of the underlying PDE. We validate the model on simulated data generated by the known PDEs and added Gaussian noise, and we study our method under different levels of noise. We also compare the error of our method with a Cramer-Rao lower bound for an ordinary differential equation. Our results indicate that our method outperforms other methods in estimating PDEs, especially in the low signal-to-noise regime.



There are no comments yet.


page 1

page 2

page 3

page 4


Data-Driven Deep Learning of Partial Differential Equations in Modal Space

We present a framework for recovering/approximating unknown time-depende...

Weak SINDy For Partial Differential Equations

We extend the WSINDy (Weak SINDy) method of sparse recovery introduced p...

Differential radial basis function network for sequence modelling

We propose a differential radial basis function (RBF) network termed RBF...

Error bounds for PDE-regularized learning

In this work we consider the regularization of a supervised learning pro...

Solving the Dirichlet problem for the Monge-Ampère equation using neural networks

The Monge-Ampère equation is a fully nonlinear partial differential equa...

CUDACLAW: A high-performance programmable GPU framework for the solution of hyperbolic PDEs

We present cudaclaw, a CUDA-based high performance data-parallel framewo...

Stability selection enables robust learning of partial differential equations from limited noisy data

We present a statistical learning framework for robust identification of...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: Schematic of training process. The black line is the function trained by the neural network. forces this function to match the data and forces it to be smooth and recovers the equation

Partial differential equations (PDEs) are widely used in several quantitative disciplines, from physics to economics, in an attempt to describe the evolution and dynamics of phenomena. Often, such equations are derived using first principle approaches in conjunction with data observations. However, as datasets are often cumbersome and difficult to analyze through traditional means, there arises a need for the automation of such processes. We present an approach that uses neural function approximation with function regularization to recover the governing partial differential equations a wide range of possible PDEs.

Recent work [rudy2017data, brunton2016discovering] used finite difference and polynomial approximation to extract the governing equation from data. However, the methods perform poorly in the presence of noise. In [long2017pde]

, the authors use convolutional neural networks with filters constrained to finite difference approximations to learn the form of a PDE, but no sparsity constraint is enforced on the learned PDE, which may lead to verbose solutions. Although the approaches described in

[berg2019data] and [xu2019dl] consider deep learning approaches for function approximation, these do not analyze the low signal-to-noise (SNR) regime. All previous methods [rudy2017data, berg2019data, xu2019dl] consider parabolic PDEs, that is, PDEs that depend on first order derivatives of the function in time. Our work is, to the best of our knowledge, the first to consider more general PDEs.

Similar to previous methods [rudy2017data, berg2019data, xu2019dl], we consider PDEs which are linear combinations of dictionary functions, but instead of assuming this linear combination has a dependence on time, we recover the underlying PDE from the null space of the dictionary functions. We interpolate the observed data using a neural network, and include a regularization term that forces the neural network to follow the PDE that best describes the data. In order to calculate this regularization term, we use automatic differentiation to calculate derivatives of the neural network. We finally determine the underlying PDE by checking the learned regularization term. A byproduct of this dual optimization is the increase in performance of the method in the low SNR regime. To summarize, our main contributions are:

  1. Establish a deep learning framework for the identification of PDEs which are linear combinations of user-defined dictionary functions.

  2. Introduce a regularization scheme for preventing function approximators from overfitting to noise.

  3. Compare our method with the Cramer-Rao lower bound for a simple ordinary differential equation (ODE).

Our paper is organized as follows. In Section 2 we introduce the PDE estimation problem. In Section 3, we present our approach to estimating the underlying PDE using neural networks. In Section 4 we present a Cramer-Rao lower bound [cramer1946mathematical] for a simple ODE and compare it with the results obtained by our methods. Finally, in Section 5 we present numerical results obtained with simulated data and in section 6 we summarize our contributions and discuss future applications.

2 Problem

A PDE is a equation that relates a function with its partial derivatives. For a function , where

, and a vector of non-negative integers

, we denote and . Using this notation, we can define any PDE by


where is a function that relates with and its partial derivatives at . We say is a solution to the PDE if (1) holds for every point . In this paper, we observe data points , where are noisy function values at , that is,


where the noise

is assumed to be Gaussian, with variance

, and is a solution to (1) for some unknown function . Since the space of possible functions is prohibitively large, we restrict our attention to PDEs that are linear combinations of dictionary functions, relating and its derivatives. More formally, let be such a dictionary, we consider PDEs of the form


where is a vector of linear coefficients to be determined. For brevity, we use the notation for the left-hand side in (3).

Name Equation Dictionary
Table 1: List of equations and dictionaries used.

2.1 Wave Equation Illustration

As a motivating example, consider the 1D wave equation. For brevity, denote by , where is any variable within the domain of . Then , where , is a solution to the 1D wave equation if


If we then define our dictionary as and , then and the PDE is of the form (3) with .

3 Methods

Suppose we observe defined as in (2), where is a solution to (3). Our approach is to approximate by a neural network function , where

is the input/training data. The loss function contains two main components,

and , which we explain in the following subsections.

3.1 Fitting the Data

The first component of the loss function is a mean square error (MSE) loss between the observed data points and the value of the neural network at these points. That is,


where are the neural network parameters. As a consequence of universal approximation theorems [hornik1991approximation], not only can be approximated with arbitrary accuracy, by a neural network , but also its derivatives are approximated by up to arbitrary accuracy. More formally, if we define the Sobolev norm as


then for any there exists a neural network such that . Since is a solution to (3) and is arbitrarily close to in Sobolev norm, is an approximate solution to the same PDE, and the parameters can be determined from . However, the sampled data contains noise, and minimizing may lead to overfitting. We circumvent this by introducing a regularization term.

3.2 PDE Estimation

To estimate the underlying PDE, we first construct a dictionary with terms, and evaluate these functions at points , sampled from , obtaining a matrix with entries

From (3), , and therefore lies in the null space of . Equivalently, is a singular vector of

, with associated singular value

. If , we have, assuming some regularity conditions on , that

for some constant that depends on . Therefore the singular values and associated singular vectors of both matrices are close, and the singular vector of , associated with its smallest singular value, is an approximation of . In order to calculate the smallest singular vector of , we invoke the min-max theorem for singular values [van1983matrix, Theorem 8.6.1], which states the minimum of , subject to , is the smallest singular value of . Therefore, we introduce the loss term


and add the constraint . We note that enforcing these constraint is necessary, otherwise the minimizer of (7) would be . The contribution of is twofold: while minimizing (7) over recovers the PDE, minimizing (7) over forces it to be a solution to the PDE, and prevents it from overfitting to noise.

Finally, as we believe that laws of nature are inherently simpler and depend on fewer terms, we impose an additional loss term to promote sparsity in the recovered law.


3.3 Combining Losses

We combine the different loss terms in the following manner to obtain the final loss function we optimize.


The multiplicative scaling of on the other loss terms acts as an additional regularizer for the function. When is large, this means the noise variance is larger, and thus the contribution of the other terms increases in order to force to be smooth and prevent it from overfitting to noise. On the other hand, when is small, the noise variance is smaller and it is more important for to interpolate the data.

Noise Level 1 0.01 0
Table 2:

Table showing average and standard deviation of error for different noise levels and equations. Here the noise level represents

in (2) and the error is defined by (17).

4 A Cramer-Rao bound for a simple ODE

In this section we present a lower bound on estimating a simple ordinary differential equation (ODE), using the Cramér-Rao bound [cramer1946mathematical], in order to assess the performance of our method. We consider the ODE


Our goal is to estimate by observing noisy data points of a solution to the ODE. All solutions of this ODE are of the form


for some latent variable . Consider we observe data points , with each

drawn independently from the uniform distribution in

, defined as in (2), and defined by (11

). The probability distribution of

is given by:


where we used , since is drawn from the uniform distribution in , and is a centered Gaussian variable with variance . If and

are unbiased estimators for

and , respectively, the Cramer-Rao bound [cramer1946mathematical] states that


where is the Fisher information matrix, is the number of data points and we write if is a positive semi-definite matrix. The entries of are defined by


where is any of , or , and is defined in (12). If we define , or equivalently , since is an unbiased estimator, (13) implies


We now use (14) to evaluate (15). For brevity, we omit the evaluation, and just present the final expression:


In order to compare our proposed method to the Cramer-Rao lower bound, we consider two dictionary functions, and , we generate data points as in (2), with , and check if our algorithm recovers (10) with . We run the experiment 3 times and compare the error with (16). Finally, we plot the results in Fig. 2.

Figure 2: Plot comparing the derived Cramer Rao lower bound (CRLB) to observed empirical results of the described algorithm for 5 noise levels. Algorithms performance is comparable to CRLB shown in orange.

5 Numerical Simulations

As motivating examples, we consider the wave equation, the Helmholtz equation, the Korteweg-de Vries (KdV) equation, a simulated vortex, and Hamilton-Jacobi-Bellman (HJB) equation. A summary of these equations and of the dictionaries used is presented in Table 1

. Our neural network is a feed forward fully connected neural network, 4 hidden layers, 50 neurons per hidden layer similar to

[berg2019data], using the softplus [glorot2011deep]activation function as the nonlinearity. For optimization, we use the Adam optimizer with learning rate of for the parameter and for . The learning rate for both decays exponentially by

each epoch. All hyperparameters are maintained constant through all different experiments. Finally, we sample 10000 points randomly within the domain for the function approximation.

In order to measure the error of our PDE recovery, we use the following formula


We note this quantity is always non-negative, and is if and only if and are colinear. Moreover, this formula also gives an estimate on how many digits of the recovered coefficients are correct: if the error is of the order of , then has correct digits.

For all the equations in Table 1 and noise levels , we run our method 3 times and present the results in Table  2. We include the associated code at https://github.com/alluly/pde-estimation.

6 Discussion

We present a method for reconstructing the underlying PDE for a set of data while being robust to noise and generalizing to a variety of PDEs. We consider the method’s applicability to multiple equations at various noise levels. Finally, we show that the method approaches the theoretical bound for parameter estimation for noisy cases for a simple ODE. In the future, extensions to stochastic differential equations and higher dimensional equations should be considered. While we present all results for a fixed set of algorithm hyperparameters, it is unclear if these hyperparameters are optimal, and the effects of different hyperparameter regimes should be studied. Different algorithm hyperparameters may aid in reducing the variability of the recovered solutions. The results present an opening to apply our method to problems where the estimation of an underlying PDE is necessary.