1 Introduction
Modeling and extracting governing equations from complex timeseries can provide useful information for analyzing data. An accurate governing system could be used for making datadriven predictions, extracting largescale patterns, and uncovering hidden structures in the data. In this work, we present an approach for modeling timedependent data using differential equations which are parameterized by shallow neural networks, but retain their intrinsic (continuous) differential structure.
For timeseries data, recurrent neural networks (RNN) is often employed for encoding temporal data and forecasting future states. Part of the success of RNN are due to the internal memory architecture which allows these networks to better incorporate state information over the length of a given sequence. Although widely successful for language modeling, translation, and speech recognition, their use in highfidelity scientific computing applications is limited. One can observe that a sequence generated by an RNN may not preserve temporal regularity of the underlying signals (see, for example
[2] or Figure 2.3) and thus may not represent the true continuous dynamics.For imaging tasks, deep neural networks (DNN) such as ResNet [10, 11], FractalNet [18], and DenseNet [13] have been successful in extracting complex hierarchical spatial information. These networks utilize intralayer connectivity to preserve feature information over the network depth. For example, the ResNet architecture uses convolutional layers and skip connections. The hidden layers take the form where represents the features at layer and
is a convolutional neural network (or more generally, any universal approximator) with trainable parameters
. The evolution of the features over the network depth is equivalent to applying the forward Euler method to the ordinary differential equation (ODE):
. The connection between ResNet’s architecture, numerical integrators for differential equations, and optimal control has been presented in [4, 22, 9, 30, 40].Recently, DNNbased approaches related to differential equations have been proposed for data mining, forecasting, and approximation. Examples of algorithms which use DNN for learning ODE and PDE include: learning from data using a PDEbased network [21, 20]
, deep learning for advection equations
[3], approximating dynamics using ResNet with recurrent layers [25], and learning and modeling solutions of PDE using networks [28]. Other approaches for learning governing systems and dynamics involve sparse regularizers ( or hardthresholding approaches in [1, 29, 32, 33] and problems in [39, 35, 34]) or models based on Gaussian processes [27, 26].Note that in [21, 20] it was shown that adding more blocks of the PDEbased network improved (experimentally) the model’s predictive capabilities. Using ODEs to represent the network connectivity, [2] proposed a ‘continuousdepth’ neural network called ODENet. Their approach essentially replaces the layers in ResNetlike architectures with a trainable ODE. In [2], the authors state that their approach has several advantages, including the ability to better connect ‘layers’ due to the continuity of the model and a lower memory cost when training the parameters using the adjoint method. The adjoint method proposed in [2] may not be stable for a general problem. In [7], a memory efficient and stable approach for training a neural ODE was given.
1.1 Contributions of this Work.
We present a machine learning approach for constructing approximations to governing equations of timedependent systems that blends physicsinformed candidate functions with neural networks. In particular, we construct a network approximation to an ODE which takes into account the connectivity between components (using a dictionary of monomials) and the differential structure of spatial terms (using finite difference kernels). If the user has prior knowledge on the structure or source of the data, i.e. fluids, mechanics, etc., one can incorporate commonly used physical models into the dictionary. We show that our approach can be used to extract ODE or PDE models from timedependent data, improve the spatial accuracy of reduced order models, and reduce the parameter cost for image classification (for the MNIST and Fashion MNIST datasets).
2 Modeling Via Ordinary Differential Equations
Given discretetime measurements generated from an unknown dynamic process, we model the timeseries using a (firstorder) ordinary differential equation, , with . The problem is to construct an approximation to the unknown generating function , i.e. we will learn networks such that . Essentially, we are learning a neural network approximation to the velocity field. Following the approach in [2], the system is trained by a ‘continuous’ model and the function
is parameterized by multilayer perceptrons (MLP). Since a twolayer MLP may require a large width to approximate a generic (nonlinear) function
, we purpose a different parameterization. Specifically, to better capture higherorder correlations between components of the data and to lower the number of parameters needed in the MLP (see for example, Figure 2.2), a dictionary of candidate inputs is added. Let be the collection (as a matrix) of the th order monomial terms depending on and , i.e. each element in can be written as:One has freedom to determine the particular dictionary elements; however, the choice of monomial terms provides a model for the interactions between each of the components of the timeseries and is used for model identification of dynamical systems in the general setting [1, 39, 34]. For simplicity, we will suppress the (userdefined) parameter .
, regularized optimization with polynomial dictionaries is used to approximate the generating function of some unknown dynamic process. When the dictionary is large enough so that the ‘true’ function is in the span of the candidate space, the solutions produced by sparse optimization are guaranteed to be accurate. To avoid defining a sufficiently rich dictionary, we propose using an MLP (with a nonpolynomial activation function) in combination with the monomial dictionary, so that general functions may be wellapproximated by the network. Note that the idea of using products of the inputs appears in other network architectures for example, the highorder neural networks
[8, 38].In many DNN architectures, batch normalization
[14] or weight normalization [31] are used to improve the performance and stability during training. For the training of NeuPDE, a simple (uniform) normalization layer, , is added between the input and dictionary layers, which mapsto a vector in
(using the range over the all components). Specifically, let and be the maximum (and minimum) value of the data, over all components and samples and define the vector as:This normalization is applied to each component uniformly and enforces that each component of the dictionary is bounded by one (in magnitude). We found that this normalization was sufficient for stabilizing training and speeding up optimization in the regression examples. Without this step, divergence in the training phase was observed.
To train the network: let be the vector of learnable parameters in the MLP layer, then the optimization problem is:
(2.1)  
where are regularization parameters set by the user and is an MLP. Specifically, let be a smooth activation function, for example, the exponential linear unit (ELU)
or the hyperbolic tangent, tanh, which will be sufficiently smooth for integration using RungeKutta schemes. The righthand side of the ODE is parameterized by a fully connected layer  activation layer  fully connected layer, i.e.
where , i.e. the vectorization of all components of the matrices and and biases and . Therefore, the first layer of the MLP in the form takes a linear combination of candidate functions (applied to normalized data). Note that the dictionary does not include the constant term since we include a bias in the first fully connected layer. The function is a regularizer on the parameters (for example, the norm) and the timederivative is penalized by the norm. When used, the parameters are set to and (no tuning is performed).
The constraints in Eqn. (2.1) are written in continuoustime, i.e. the value of is defined by the ODE and thus can be evaluated at any time . For a given set of parameters , the values are obtained by numerical integration (for example, using a RungeKutta scheme). To optimize Eqn. (2.1) using a gradientbased method, the backpropagation algorithm or the adjoint method (following [2]) can be used. The adjoint method requires solving the ODE (and its adjoint) backwardintime, which can lead to numerical instabilities. Following the approach in [7], checkpointing can be used to mitigate this issue.
For all experiments, we take the ‘discretizethenoptimize’ approach. The objective function, Eqn. (2.1), is discretized as follows:
(2.2)  
where is an ODE solver (i.e. a RungeKutta scheme) applied times, is rescaled by the timestep, and the timederivative is discretized on the timeinterval with the integral approximated by piecewise constant quadrature. The constraint that the ODE is satisfied at every timestamp has been transformed to the constraint that the sequence for is generated by the forward evolution of an ODE solver. The ODE solver takes (as its inputs) the initial data and the function that defines the RHS of the ODE. Note that the ODE solver can be ‘subgrid’ in the sense that, over a time interval , we can set the solver to take multiple (smaller) timesteps. This will increase storage cost needed for backpropagation; however, taking multiple timesteps can better resolve complex dynamics embedded by (see examples below). Additionally, the timederivative regularizer helps to control the growth of the generative model, which yields control over the solution and its regularity.
Eqn. (2.2) is solved using the Adam algorithm [15] with temporal minibatches. Each minibatch is constructed by taking a random sampling of the timestamps for and then collecting all datapoints from to for some . This can be easily extended to multiple timeseries, by sampling over each trajectory as well. Note that, in experiments, taking nonoverlapping subintervals did not change the results. The backpropagation algorithm [23] applied to each of the subintervals can be done in parallel. For all of the regression examples, we set where is the given (sequential) data. For the image classification examples, we used the standard crossentropy for .
Remark 2.1.
Layers: The righthand side of the ODE is parameterized by one set of parameters . Therefore, in terms of DNN layers, we are technically only training one “layer”. However, changes in the structure between timesteps are taken into account by the timedependence, i.e. the dictionary terms that contain . Thus, we are embedding multiplelayers into the timedependent dictionary.
Remark 2.2.
‘Continuousdepth’: Eqn. (2.2) is fully discrete when using a RungeKutta solver for and its gradient can be calculated using the backpropagation algorithm. If we used an adaptive ODE solver, such as RungeKutta 45, the forward propagation would generate a new set of timestamps (which always contain the timestamps ) in order to approximate the forward evolution , given an error tolerance and a set of parameters . We tested the continuousdepth versions, using backpropagation and the adjoint method (see Appendix A). Although the backward evolution of the ODE constraint may not be wellposed (i.e. numerical unstable or unbounded), our experiments lead to similar results. This could be due to the temporal minibatches which enforce shortintegration windows. Following [7], a checkpointing approach should be used, which would also control the memory footprint. It should be noted that, the adjoint approach was tested for the ODE examples but not for PDE examples, since the PDE examples are not timereversible and will lead to backwardintegration issues.
In addition, when the network is discrete, one may still consider it as a ‘continuousdepth’ approximation, since the underlying approximation can be refined by adding more timestamps, without the need to retrain the MLP.
2.1 Autonomous ODE.
When the timeseries data is known to be autonomous, i.e. the ODE takes the form , one can drop the dependency in the dictionary. In this case, the monomials take the form . We train this model by minimizing:
(2.3)  
where is the given data (corrupted by noise) over the timestamps for . The true governing equation is given by the 3d Lorenz system:
(2.4) 
which emits chaotic trajectories.
In Figure 2.1(a), we train the model with 20 hidden nodes per layer using a quadratic dictionary, i.e. there are 9 terms in the dictionary, with 20 bias parameters, with 3 bias parameters, for a total of 263 trainable parameters. The solid curves are the timeseries generated by a forward pass of the trained model. The learned system generates a highfidelity trajectory for the first part of the timeinterval. In Figure 2.1(bc), we investigate the effect of the degree in the dictionary. In Figure 2.1(b), using a degree 1 monomial dictionary with 38 hidden nodes per layers, i.e. 3 terms in the dictionary, with 38 bias parameters, with 3 bias parameters (for a total of 269 trainable parameters), the generated curves trace a similar part of phase space, but are pointwise inaccurate. By increasing the hidden nodes to 100 per layer (3 terms in the dictionary, with 100 bias parameters, with 3 bias parameter, for a total of 703 trainable parameters), we see in Figure 2.1(c) that the method (using a degree 1 dictionary) is able to capture the correct pointwise information (on the same order of accuracy as Figure 2.1(a)) but requires more than double the number of parameters.
2.2 NonAutonomous ODE and Noise.
To investigate the effects of noise and regularization, we fit the data to a nonlinear spiral:
(2.5) 
corrupted by noise. The third coordinate of Eqn. (2.5) is timedependent, which can be challenging for many recovery algorithms. This is partly due to the redundancy introduced into the dictionary by the timedependent terms. To generate Figure 2.2, we set:

the dictionary degree to 1, with 4 terms, with 46 bias parameters, with 3 bias parameters (371 trainable parameters in total);

the degree to 4, with 69 terms, with 4 bias parameter, with 3 bias parameters (295 trainable parameters in total), and the regularization parameter to ;

the degree to 4, with 69 terms, with 5 bias parameter, with 3 bias parameters (368 trainable parameters in total).
For cases (bc), we set the degree of the dictionary to be larger than the known degree of the governing ODE in order to verify that we do not overfit using a higherorder dictionary and that we are not tailoring the dictionary to the problem. In Figure 2.2(a), the dictionary of linear monomials with a moderately sized MLP seems to be insufficient for capturing the true nonlinear dynamics. This can be observed by the oversmoothing caused by the linearlike dynamics. In Figure 2.2(c), a nonlinear dictionary can fit the data and extract the correct pattern (the ‘squared’off corners). Figure 2.2(b) shows that we are able to decrease the total number of parameters and fit the trajectory within the same tolerance as (c) by penalizing the derivative. Both (b) and (c) have achieved a meansquared loss under .
2.3 Comparison for Extracting Governing Models.
Comparison with SINDy. We compare the results of Figure 2.2 with an approximation using the SINDy algorithm from [1] (theoretical results of convergence and relationship to the problem appear in [41]). These approaches differ, since the SINDy algorithm seeks to recover a sparse approximation to the governing system given one tuning parameter and is restricted to the span of the dictionary elements. To make the dictionary sufficiently rich, the degree is set to 4 as was done for Figure 2.2 (bc). Since the sparsity of the first two components is equal to one, we search over all parameterspace (up to 6 decimals) that yields the smallest nonzero sparsity. The smallest nonzero sparsity for the first component is 12 and for the second component is 3 with:
(2.6) 
which is independent of and and does not lead to an accurate approximation to the nonlinear spiral. This is likely due to the level of noise present in the data and the incorporation of the timecomponent.
Comparison with LASSObased methods. We compare the results of Figure 2.2 with LASSObased approximations for learning governing equations [35]. The LASSO parameter is chosen so that the sparsity of the solution matches the sparsity of the true dynamics (with respect to a dictionary of degree 4). In addition, the coefficients are ‘debiased’ following the approach in [35]. The learned system is:
(2.7) 
which matches the profile of the data in the plane; however, it does not predict the correct dynamics for (emitting seemingly periodic orbits). While the LASSObased approach better resolves the statespace dependence, it does not correctly identify the timecomponent.
Comparison with RNN. In Figure 2.3 the Lorenz system (see Figure 2.1) is approximated by our proposed approach and a standard LSTM (RNN), with the same number of parameters. Although the RNN learns internal hidden states, the RNN does not learn the correct regularity of the trajectories thus leading to sharp corners. It is worth noting that, in experiments, as the number of parameters increases, both the RNN and our network will produce sequences that approach the true timeseries.
2.4 ODE from LowRank Approximations
For certain spatiotemporal systems, reducedorder models can be used to transform complex dynamics into lowdimensional timeseries (with stationary spatial modes). One of the popular methods for extracting the spatial modes and identifying the corresponding temporaldynamics is the dynamic mode decomposition (DMD) introduced in [37]. The projected DMD method [36] makes use of the SVD approximation to construct the modes and the linear dynamical system. Another reducedorder model, known as the proper orthogonal decomposition (POD) [12], can be used to construct spatial modes which best represent a given spatiotemporal dataset. The projected DMD and the POD methods leverage lowrank approximations to reduce the dimension of the system and to construct a linear approximation to the dynamics (related to the spectral analysis of the Koopman operator), see [17] and the citations within.
We apply our approach to construct a neural network approximation to the timeseries generated by a lowrank approximation of the von Kármán vortex sheet. We explain the construction for this example here but for more details, see [17]. Given a collection of measurements , where are the spatial coordinates and are the timestamps, define as the matrix whose columns are the vectorization of each , i.e. and where is the number of grid points used to discretize . The SVD of the data is given by , where and are unitary matrices and is a diagonal matrix. The best rank approximation of is given by where is the restriction of to the top singular values and and are the corresponding singular vectors. The columns of the matrix represent the spatial modes that can be used as a lowdimensional representation of the data. In particular, we define the vector by the projection of the data (i.e. the columns of ) onto the span of , that is:
Thus, we can construct the timestamps from the measurements and can train the system using a version Eqn. (2.1) with the constraint that the ODE is of the form:
The additional matrix resembles the standard linear structure from the DMD approximation and the function can be seen as a nonlinear closure for the linear dynamics. The function is approximated, as before, by . To train the model, we minimize:
(2.8)  
where and also includes the trainable parameters from . Note that, to recover an approximation to the original measurements , the vector is mapped back to the correct spatial ordering (inverting the vectorization process).
In Figure 2.4, our approach with an 8 mode decomposition is compared to an 8 mode DMD approximation. The DMD approximation in Figure 2.4(a) introduces two erroneous vortices near the bottom boundary. Our approach matches the test data with higher accuracy, specifically, the relative error between our generated solution at the terminal time is compared to DMD’s relative error of . It is worth noting that this example shows the benefit of the additional term in the lowmode limit; however, using more modes, the DMD method becomes a very accurate approximation. Unlike the standard DMD method, our model does not require the data to be uniformly spaced in time.
3 Partial Differential Equations
A general form for a firstorder in time, th order in space, nonlinear PDE is:
where denotes the collection of all th order spatial partial derivatives of for . We form a dictionary as done in Sec. 2, where the monomial terms now apply to , , , and for . The spatial derivatives as well as can be calculated numerically from data using finite differences. We then use an MLP, , to parametrize the governing equation:
(3.1) 
see also [35, 29]. In particular, the function can be written as:
(3.2) 
where and are collections of convolutions, and are biases, are all the parameters from and , and is ELU activation function. The input channels are the monomials determined by , , , and , where is extended to a constant 2d array. The first linear layer maps the dictionary terms to multiple hidden channels, each defined by their own convolution. Thus, each hidden channel is a linear combination of input layers. Then we apply the ELU activation, followed by a convolution, which is equivalent to taking linear combinations of the activated hidden channels. Note that this differs from [35, 29] in several ways. In the first linear layer, our network uses multiple linear combinations rather than the single combination as in [35, 29]. Additionally, by using a (nonlinear) MLP we can approximate a general function on the coordinates and derivative; however, previous work defined approximations that model functions within the span of the dictionary elements.
To illustrate this approach, we apply the method to two examples: a regression problem using data from a 2d Burgers’ simulation (with noise) and the image classification problem using the MNIST and MNISTFashion datasets.
3.1 Burgers’ Equation
We consider the 2d Burgers’ equation,
The training and test data are generated on , with timestep and a uniform grid. To make the problem challenging, the training data is generated using a sine function in as the initial condition, while the test data uses a sine function in as the initial condition. We generate 5 training trajectories by adding noise to the initial condition. Our training set is of size and our test data is of size . To train the parameters we minimize:
(3.3)  
where is a discretization of .
Training, Minibatching, and Results. The minibatches used during training are constructed with minibatches in time and the fullbatch in space. For our experiment, we set a (temporal) batch size of 16 with a length of , i.e. each batch is of size containing 16 short trajectories. The points are chosen at random, without overlapping. The initial points of each minibatch are treated as the initial conditions for the batch, and our predictions are performed over the length of the trajectory. This is done at each iteration of the Adam optimizer with a learning rate of .
In Figure 3.1, we take 2000 iterations of training, and evaluate our results on both the training and test sets. Each of the convolutional layers have 50 hidden units, for a total of 2301 learnable parameters. For visualization, we plot the learned solution at the terminal time on both the training and test set. The meansquared error on the full training set is and on the test set is (for reference, the meansquared value of the test data is over ).
3.2 Image Classification: MNIST Data.
Another application of our approach is in reducing the number of parameters in convolutional networks for image classification. We consider a linear (spatiallyinvariant) dictionary for Eqn. (3.1
). In particular, the righthand side of the PDE is in the form of normalization, ReLU activation, two convolutions, and then a final normalization step. Each convolutional layer uses a
kernel of the form with 6 trainable parameters, where are kernels that represent the identity and the five finite difference approximations to the partial derivatives , , , , and . In CNN [19, 10, 11], the early features (relative to the network depth) typically appear to be images that have been filtered by edge detectors [30, 40]. The early/midlevel trained kernels often represent edge and shape filters, which are connected to secondorder spatial derivatives. This motivates us to replace the convolutions in ODENet [2] by finite difference kernels.Result shows that even though the trainable set of parameters are decreased by a third (each kernel has 6 trainable parameters, rather than 9), the overall accuracy is preserved (see Table 3.1). We follow the same experimental setup as in [2], except that the convolutional layers are replaced by finite differences. We first downsample the input data using a downsampling block with 3 convolutional layers. Specifically, we take a convolutional layer with 64 output channels and then apply two
convolutional layers with 64 output channels and a stride of 2. Between each convolutional layer, we apply batchnormalization and a ReLU activation function. The output of the downsampling block is of size
with 64 channels. We then construct our PDE block using 6 ‘PDE’ layers, taking the form:(3.4) 
We call each subinterval (indexed by ) a PDE layer since it is the evolution of a semidiscete approximation of a coupled system of PDE (the particular form of the convolutions act as approximations to differential operators). The function takes the form:
(3.5) 
where is batchnormalization, is a collection of kernels of the form , contains all the learnable parameters, and is the ReLU activation function. The PDE block is followed by batchnormalization, the ReLU activation function, and a 2d pooling layer. Lastly, a fully connected layer is used to transform the terminal state (activated and averaged) of the PDE blocks to a 10 component vector.
For the optimization, the crossentropy loss is used to compare the predicted outputs and the true label. We use the SGD optimizer with momentum set to 0.9. There are 160 total training epochs; we set the learning rate to 0.1 and decrease it by 1/10 after epoch 60, 100 and 140. The training is stopped after 160 epochs. All of the convolutions performed after the downsampling block are linear combinations of the 6 finite difference operators rather than the traditional
convolution.Method  

Name  #Params. (M)  Error() 
MLP [19]  0.24  1.6 
ResNet  0.60  0.41 
ODENet  0.22  0.51 
Our  0.18  0.51 
3.3 Image Classification: Fashion MNIST.
For a second test, we apply our network to the Fashion MNIST dataset. The Fashion MNIST dataset has 50000 training figures and 10000 test figures that can be classified into 10 different classes. Each data is of size
. For our experiment, we did not use any data augmentation, except for normalization before training. The structure of NeuPDE in this example differs from the MNIST experiment as follows: (1) we downsample the data once instead of twice and (2) after the 1st PDE block, which has 64 hidden units, we use a kernel with 64 input channels and 128 output channel, and then use one more PDE block with 128 hidden units followed by a fully connected layer. There are numerous reported test results on the Fashion MNIST dataset [5]; we compare our result to ResNet18 [6] and a simple MLP [24]. The test results are presented in Table 3.2 and all algorithms are tested without the use of data augmentation. In both the MNIST and Fashion MNIST experiments, we show that the NeuPDE approach allows for less parameters by enforcing (continuous) structure through the network itself.4 Discussion
We propose a method for learning approximations to nonlinear dynamical systems (ODE and PDE) using DNN. The network we use has an architecture similar to ResNet and ODENet, in the sense that it approximates the forward integration of a firstorder in time differential equation. However, we replace the forcing function (i.e. the layers) by an MLP with higherorder correlations between the spatiotemporal coordinates, the states, and derivatives of the states. In terms of convolution neural networks, this is equivalent to enforcing that the kernels approximate differential operators (up to some degree). This was shown to produce more accurate approximations to complex timeseries and spatiotemporal dynamics. As an additional application, we showed that when applied to image classification problems, our approach reduced the number of parameters needed while maintaining the accuracy. In scientific applications, there is more emphasis on accuracy and models that can incorporate physical structures. We plan to continue to investigate this approach for physical systems.
In imaging, one should consider the computational cost for training the networks versus the number of parameters used. While we argue that our architecture and structural conditions could lead to models with fewer parameter, it could be potentially slower in terms of training (due to the trainable nonlinear layer defined by the dictionary). Additionally, we leave the scalability of our approach for larger imaging data set, such as ImageNet, to future work. For larger classification problems, we suspect that higherorder derivatives (beyond secondorder) may be needed. Also, while higherorder integration methods (RungeKutta 4 or 45) may be better at capturing features in the ODE/PDE examples, more testing is required to investigate their benefits for imaging classification.
5 Acknowledgements
The authors would like to acknowledge the support of AFOSR, FA95501710125 and the support of NSF CAREER grant . We would like to thank Scott McCalla for providing feedback on this manuscript.