DeepMoD: Deep learning for Model Discovery in noisy data

by   Gert-Jan Both, et al.

We introduce DeepMoD, a deep learning based model discovery algorithm which seeks the partial differential equation underlying a spatio-temporal data set. DeepMoD employs sparse regression on a library of basis functions and their corresponding spatial derivatives. A feed-forward neural network approximates the data set and automatic differentiation is used to construct this function library and perform regression within the neural network. This construction makes it extremely robust to noise and applicable to small data sets and, contrary to other deep learning methods, does not require a training set and is impervious to overfitting. We illustrate this approach on several physical problems, such as the Burgers', Korteweg-de Vries, advection-diffusion and Keller-Segel equations, and find that it requires as few as O(10^2) samples and works at noise levels up to 75 at very few samples highlights the potential of this method to be applied on experimental data. Code and examples available at


page 1

page 2

page 3

page 5

page 6

page 9

page 13

page 14


Sparsely constrained neural networks for model discovery of PDEs

Sparse regression on a library of candidate features has developed as th...

Model discovery in the sparse sampling regime

To improve the physical understanding and the predictions of complex dyn...

Sparsistent Model Discovery

Discovering the partial differential equations underlying a spatio-tempo...

Deep-learning of Parametric Partial Differential Equations from Sparse and Noisy Data

Data-driven methods have recently made great progress in the discovery o...

PDE-READ: Human-readable Partial Differential Equation Discovery using Deep Learning

PDE discovery shows promise for uncovering predictive models for complex...

Torch-Struct: Deep Structured Prediction Library

The literature on structured prediction for NLP describes a rich collect...

An Empirical Study of Incremental Learning in Neural Network with Noisy Training Set

The notion of incremental learning is to train an ANN algorithm in stage...

Code Repositories


Pytorch implementation of the DeepMoD algorithm: [arXiv:1904.09406]

view repo


This implementation of DeePyMoD is no longer maintained! We switched to a PyTorch based implementation:

view repo


DeepMod is a deep learning based model discovery algorithm which seeks the partial differential equation underlying a spatio-temporal data set. DeepMoD employs sparse regression on a library of basis functions and their corresponding spatial derivatives. This code is based on the paper: [arXiv:1904.09406](

view repo


Our goal is to develop a fully-automated procedure which discovers the partial differential equation (PDE) underlying a measured data set. Given a data set , we can write this problem as,


where we seek the function . To find , we generate a large set of possible models by considering all permutations of a library of candidate terms. The choice of the library depends on the problem at hand but generally consists of polynomial basis functions and their corresponding spatial derivatives. For example, in the one dimensional examples we present in this paper, the library consists of all polynomials in up to second order, the derivatives of with respect to the coordinates (e.g. ) up to third order and any possible combinations of these two sets (e.g. ), totalling just 12 terms. However, one can construct more than 4000 unique models from this limited set of functions, rendering an information theory approach computationally unfeasible mangan2016.

We circumvent this problem by utilizing a sparse regression approach, in which the model discovery problem is rewritten as


where is a column vector of size containing the time derivative of each sample and contains all possible terms for each of the samples, so that we can write it as


Since contains significantly more terms than required, most coefficients in will be zero and hence we are looking for a sparse solution of the vector . In the next section, we discuss how this regression task is solved using Lasso, a sparsity promoting regression method, within a neural network.

Figure 2: (a) Cost functions and (b) coefficient values,

, as function of the number of epochs for the Burgers’ dataset consisting of 2000 points and

white noise. Initially, the neural network optimizes the MSE and only after the MSE is converged the coefficient vector is optimized by the network. (c) The values in the grid indicate the accuracy of the algorithm tested on the Burgers’ equation, defined as the mean relative error over the coefficients, as function of the sample size of the data set and level of noise. The coloring represents the fraction of correct runs (Red indicates that in none of the five iterations the correct PDE is discovered).

.1 Lasso in neural network

We employ a densely-connected feed-forward neural network which takes the spatial and temporal coordinates of the problem, i.e. as input, and outputs , an approximation of at . In other words, the neural network approximates the function . The network is trained by optimizing the cost function,


Here, , is the mean squared error (MSE) of the output of the neural network with respect to the dataset ,


The last two terms of Eq. 5 correspond to the Lasso regularization: performs regression to find the coefficient vector and is an regularizer on . In order to implement the regression problem (Eq. 3) within the neural network, we introduce the regression based cost function,


Note that the coefficient vector is updated alongside the weights and biases of the neural network, while the terms in are computed from the output of the neural network (i.e. ). Automatic differentiation is used to calculate all the spatial and temporal derivatives in , returning machine-precision derivatives. This approach is considerably more accurate than any form of numerical differentiation. Moreover, acts as a regularizer on , preventing overfitting of the noisy data set, even though our library contains a large amount of terms.

Finally, an regularization on the vector is added to ensure its sparsity,


Here is a constant setting the strength of the regularization (further discussed in the SI).

The total cost of the neural network is then minimized using the Adam optimizer. The combination of the MSE term and the regression term in the cost function constrain the network in such a way that it converges to the right solution. To determine if the network has converged, we introduce a convergence criterion. As we show in Fig. 2(a,b), the MSE converges before does, so that our criterion is based on the convergence of :


This criterion states that the maximum value of the gradient of the loss function with respect to the coefficients must be smaller than a given tolerance. Note here that we have scaled the gradients as we discuss in the next paragraph. Since it is not guaranteed the network will reach this tolerance, we train the network until the convergence criterion is satisfied, or for a maximum amount of iterations.

.2 Normalization and thresholding

When the neural network has finished training, we obtain the sparse vector . Despite the regularization, most terms will be non-zero and hence we need to threshold the small coefficients to obtain the true sparse representation. Since each term has different dimensions, Eq. 2 is rendered dimensionless,


where is the norm of each column of and the norm of the time-derivative vector. As a result of this transformation, components of will typically be . This normalization allows DeepMoD to select component of different orders of magnitude (shown in the SI). We then train the network one final time without

penalty and with the regression term only containing the terms selected in the first cycle, to find an unbiased estimate of the coefficients of the underlying PDE.


We highlight the performance of DeepMoD on a set of case studies: the Burgers’ equation with and without shock, the Korteweg - de Vries equation, the 2D advection-diffusion equation and the Keller-Segel model for chemotaxis. These examples show the ability of DeepMoD to handle (1) non-linear equations, (2) solutions containing a shock wave, (3) higher dimensional equations and (4) coupled PDEs.

.3 Non-linear PDEs

We apply DeepMoD to recover various non-linear and higher order differential equations. As examples we consider Burgers’ equation (in the SI we the Korteweg-de Vries equation, which contains a third-order derivative). The Burgers’ equation occurs in various areas of gas dynamics, fluid mechanics and applied mathematics and is evoked as a prime example to benchmark model discovery Rudy et al. (2017); long2018pde and coefficient inference algorithms Raissi et al. (2017a, b); raissi2019physics, as it contains a non-linear term as well as second order spatial derivative,


Here is the viscosity of the fluid and its velocity field. We use the dataset produced by Rudy et al. Rudy et al. (2017), where . The numerical simulations for the synthetic data were performed on a dense grid for numerical accuracy. DeepMoD requires significantly less datapoints than this grid and we hence construct a smaller dataset for DeepMoD by randomly sampling the results through space and time. From now on, we will refer to randomly sampling from this dense grid simply as sampling. Also note that this shows that our method does not require the data to be regularly spaced or stationary in time. For the data in Fig. 2 we add white noise and sampled 2000 points for DeepMoD to be trained on.

Figure 3: Ground truth, Noisy and Inferred data at for the Burgers’ equation with a shock wave (10 noise and 2000 sample points).

We train the neural network using an Adam optimizer (see SI for details) and plot the different contributions of the cost function as a function of the training epoch in Fig. 2a and we show the value of each component of as a function of the training epoch in Fig. 2b. Note that for this example, after approximately 2000 epochs, the MSE is converged, while at the same time we observe the components of only start to converge after this point. We can thus identify three ’regimes’: in the initial regime (0 - 2000 epochs), the MSE is trained. Since the output of the neural network is far from the real solution, so is , and the regression task cannot converge (See first 2000 epochs in Fig. 2b). After the MSE has converged, is sufficiently accurate to perform the regression task and starts to converge. After this second regime (2000 - 6000 epochs), all components of the cost converged (6000 epochs) and we can determine the solution. From this, we obtain a reconstructed solution (see SI and next section for example) and at the same time recover the underlying PDE, with coefficients as little as error in the obtained coefficients:


This accuracy can be increased even further by using a larger network or a lower convergence tolerance.

Next, we characterize the robustness of DeepMoD in Fig. 2c, where we run DeepMod for five times for a range of sample sizes and noise levels. Each of the five runs was made with a differently sampled data set so that each is independent from another. The color in Fig. 2c shows how many of the five runs return the correct equation and the value in the grid displays the mean error over all correct runs. Observe that at vanishing noise levels, we recover Eq. S12 with as little as 100 data-points, while for 5000 data points we recover the PDE with noise levels of up to . Between the domain where we recover the correct equation for all five runs and the domain where we do not recover a single correct equation, we observe an intermediate domain where only a fraction of the runs return the correct equation, indicating the importance of sampling (See SI 2 for further discussion). Note that DeepMoD allows up to two orders of magnitude higher noise-levels and smaller sample sizes with respect to state-of-the-art model discovery algorithms Rudy et al. (2017). We show in the SI that DeepMod has similar performance for the KdV equation, which contains a third order spatial derivative.

.4 Shock wave solutions

If the viscosity is too low, the Burgers’ equation develops a discontinuity called a shock (See Fig. 3). Shocks are numerically hard to handle due to divergences in the numerical derivatives. Since DeepMoD uses automatic differentiation we circumvent this issue. We adapt the data from Raissi et al Raissi et al. (2017b), which has , sampling 2000 points and adding 10 white noise (See Fig. 3). We recover ground truth solution of the Burgers’ equation as well as the corresponding PDE,


with a relative error of 5 on the coefficients. In Fig. 3 we show the inferred solution for .

Figure 4: Ground truth (left), Sampled (middle) and reconstructed (right) solution for the 2D advection diffusion equation at (10 white noise and 5000 samples). Note that, even for an extremely low sampling rate, the full concentration profile is recovered, as well as the corresponding PDE of the process.

.5 Higher dimensional input

To showcase the robustness of DeepMoD on high-dimensional input data, we consider the 2D advection diffusion process described by,


where is the velocity vector describing the advection and is the diffusion coefficient. Our initial condition is a 2D Gaussian and set and . For as little as 5000 randomly sampled points we recover the correct form of the PDE as well as the vector for noise levels up to (See SI 2). Our library consists of all spatial derivatives (including cross terms) as well as first order polynomial terms, totalling 12 terms. At vanishing noise levels we obtain the correct equations with as little as 200 sample points through space and time. This number is surprisingly small considering this is an 2D equation. To illustrate this, Fig. 4 shows the ground truth, sampled points with noise and the reconstructed profile at two different times. Note that while it is impossible to identify the underlying process by sight, the algorithm correctly discovers the underlying PDE. After the network learned the mapping on 5000 sample points, we can reconstruct the solution on the entire domain (Fig. 4 right), highlighting the ability of DeepMoD to act as a highly-accurate physics-informed interpolater and denoiser.

.6 Coupled differential equations

Finally, we apply DeepMod to a set of coupled PDE’s in the form of the Keller-Segel (KS) equations, a classical model for chemotaxis keller1970; chavanis2010. Chemotactic attraction is one of the leading mechanisms that accounts for the morphogenesis and self-organization of biological systems. The KS model describes the evolution of the density of cells and the secreted chemical ,


Here the first equation represents the drift-diffusion equation with a diffusion coefficient of the cells, and a chemotactic sensitivity , which is a measure for the strength of their sensitivity to the gradient of the secreted chemical . The second equation represents the reaction diffusion equation of the secreted chemical , produced by the cells at a rate and degraded with a rate . For a 1D system, we sample 10000 points of and for parameter values of , , , and and add 5 white noise. We choose a library consisting of all spatial derivatives (including cross terms) as well as first order polynomial terms, totalling 36 terms. For these conditions we recover the correct set of PDEs,

Figure 5: Ground truth, noisy and reconstructed solutions for the density of cells, (top row) and the density of secreted chemicals (bottom row) in the Keller Segel model for 5 white noise and 10000 samples.

as well as the reconstructed fields for and (See Fig. 5). Note that even the coupled term, , which becomes vanishingly small over most of the domain, is correctly identified by the algorithm, even in the presence of considerable noise levels.


In this paper we presented DeepMoD, a novel model-discovery algorithm, which utilizes neural networks to discover the underlying PDE of a spatio-temporal dataset. We demonstrate the algorithm on five case studies: Burgers’, Korteweg-de Vries, advection diffusion and Keller-Segel equations, but DeepMod extends far beyond the case studies described here. In contrast to many of the state of the art model discovery algorithms DeepMoD is very robust with respect to elevated noise levels and resilient to small data set sizes. DeepMoD allows higher dimensional input/output as well as coupled PDEs as demonstrated with the 2D advection diffusion and Keller-Segel equation.

Through the use of automatic differentiation, combined with a regression-based regularization, the approximation of the spatio-temporal derivatives in noisy data is strongly enhanced. DeepMoD combines two previously established ideas, (i) a regression-based approach to model discovery (pioneered by e.g. Rudy et al. Rudy et al. (2017); rudy2018) and (ii) the ability of neural networks to infer system parameters in the context of Physics Informed Neural Networks (Raissi et al. Raissi et al. (2017a, b); raissi2019physics. We show that combining both approaches strongly improves the model discovery task at hand and results in an increased robustness with respect to noise-levels and sample size for model discovery tasks. This approach allows model selections on highly noisy numerical data and potentially also on experimental data, which to date is one of the prime challenges of this field. DeepMoD also allows to infer the various type of diffusive, chemo-tactic equations, from single particle tracking (SPT) data by following a similar approach as Rudy et al. (2017), which will advance existing approaches to infer potentially anomalous diffusive processes from SPT data el2015; Granik2019.

The success of this approach however strongly relies on the completeness of the library functions in

. If the underlying functions are not present in the library, DeepMoD will not return the correct underlying equation, nor will the regression based cost function converge. Conversely, since the predetermined library functions in DeepMoD can be tailored to the problem at hand, it can contain highly non-linear functions such as exponential functions or the sigmoidal functions in genetic activation networks


. We have empirically found that including these extensive libraries does not result in over-fitting of the data, even though the optimization contains more degrees of freedom.

Besides the model selection capabilities, DeepMoD demonstrates its usefulness to denoise data and allows accurately approximating derivatives from noisy data, a notoriously difficult task to solve with classical interpolation and finite difference schemes. Employing this ”function library based” regulation of neural network architecture may boost the enhancement of e.g. super resolution images through physics informed regularization

yang2010image; von2019.


We acknowledge the support of NVidia through their academic GPU grant program. We thank Jonathan Grizou for his valuable input on the manuscript.