Modeling and extracting governing equations from complex time-series can provide useful information for analyzing data. An accurate governing system could be used for making data-driven predictions, extracting large-scale patterns, and uncovering hidden structures in the data. In this work, we present an approach for modeling time-dependent data using differential equations which are parameterized by shallow neural networks, but retain their intrinsic (continuous) differential structure.
For time-series data, recurrent neural networks (RNN) is often employed for encoding temporal data and forecasting future states. Part of the success of RNN are due to the internal memory architecture which allows these networks to better incorporate state information over the length of a given sequence. Although widely successful for language modeling, translation, and speech recognition, their use in high-fidelity scientific computing applications is limited. One can observe that a sequence generated by an RNN may not preserve temporal regularity of the underlying signals (see, for example or Figure 2.3) and thus may not represent the true continuous dynamics.
For imaging tasks, deep neural networks (DNN) such as ResNet [10, 11], FractalNet , and DenseNet  have been successful in extracting complex hierarchical spatial information. These networks utilize intra-layer connectivity to preserve feature information over the network depth. For example, the ResNet architecture uses convolutional layers and skip connections. The hidden layers take the form where represents the features at layer and
is a convolutional neural network (or more generally, any universal approximator) with trainable parameters
. The evolution of the features over the network depth is equivalent to applying the forward Euler method to the ordinary differential equation (ODE):. The connection between ResNet’s architecture, numerical integrators for differential equations, and optimal control has been presented in [4, 22, 9, 30, 40].
Recently, DNN-based approaches related to differential equations have been proposed for data mining, forecasting, and approximation. Examples of algorithms which use DNN for learning ODE and PDE include: learning from data using a PDE-based network [21, 20]
, deep learning for advection equations, approximating dynamics using ResNet with recurrent layers , and learning and modeling solutions of PDE using networks . Other approaches for learning governing systems and dynamics involve sparse regularizers ( or hard-thresholding approaches in [1, 29, 32, 33] and problems in [39, 35, 34]) or models based on Gaussian processes [27, 26].
Note that in [21, 20] it was shown that adding more blocks of the PDE-based network improved (experimentally) the model’s predictive capabilities. Using ODEs to represent the network connectivity,  proposed a ‘continuous-depth’ neural network called ODE-Net. Their approach essentially replaces the layers in ResNet-like architectures with a trainable ODE. In , the authors state that their approach has several advantages, including the ability to better connect ‘layers’ due to the continuity of the model and a lower memory cost when training the parameters using the adjoint method. The adjoint method proposed in  may not be stable for a general problem. In , a memory efficient and stable approach for training a neural ODE was given.
1.1 Contributions of this Work.
We present a machine learning approach for constructing approximations to governing equations of time-dependent systems that blends physics-informed candidate functions with neural networks. In particular, we construct a network approximation to an ODE which takes into account the connectivity between components (using a dictionary of monomials) and the differential structure of spatial terms (using finite difference kernels). If the user has prior knowledge on the structure or source of the data, i.e. fluids, mechanics, etc., one can incorporate commonly used physical models into the dictionary. We show that our approach can be used to extract ODE or PDE models from time-dependent data, improve the spatial accuracy of reduced order models, and reduce the parameter cost for image classification (for the MNIST and Fashion MNIST datasets).
2 Modeling Via Ordinary Differential Equations
Given discrete-time measurements generated from an unknown dynamic process, we model the time-series using a (first-order) ordinary differential equation, , with . The problem is to construct an approximation to the unknown generating function , i.e. we will learn networks such that . Essentially, we are learning a neural network approximation to the velocity field. Following the approach in , the system is trained by a ‘continuous’ model and the function
is parameterized by multilayer perceptrons (MLP). Since a two-layer MLP may require a large width to approximate a generic (nonlinear) function, we purpose a different parameterization. Specifically, to better capture higher-order correlations between components of the data and to lower the number of parameters needed in the MLP (see for example, Figure 2.2), a dictionary of candidate inputs is added. Let be the collection (as a matrix) of the th order monomial terms depending on and , i.e. each element in can be written as:
One has freedom to determine the particular dictionary elements; however, the choice of monomial terms provides a model for the interactions between each of the components of the time-series and is used for model identification of dynamical systems in the general setting [1, 39, 34]. For simplicity, we will suppress the (user-defined) parameter .
, regularized optimization with polynomial dictionaries is used to approximate the generating function of some unknown dynamic process. When the dictionary is large enough so that the ‘true’ function is in the span of the candidate space, the solutions produced by sparse optimization are guaranteed to be accurate. To avoid defining a sufficiently rich dictionary, we propose using an MLP (with a non-polynomial activation function) in combination with the monomial dictionary, so that general functions may be well-approximated by the network. Note that the idea of using products of the inputs appears in other network architectures for example, the high-order neural networks[8, 38].
In many DNN architectures, batch normalization or weight normalization  are used to improve the performance and stability during training. For the training of NeuPDE, a simple (uniform) normalization layer, , is added between the input and dictionary layers, which maps
to a vector in(using the range over the all components). Specifically, let and be the maximum (and minimum) value of the data, over all components and samples and define the vector as:
This normalization is applied to each component uniformly and enforces that each component of the dictionary is bounded by one (in magnitude). We found that this normalization was sufficient for stabilizing training and speeding up optimization in the regression examples. Without this step, divergence in the training phase was observed.
To train the network: let be the vector of learnable parameters in the MLP layer, then the optimization problem is:
where are regularization parameters set by the user and is an MLP. Specifically, let be a smooth activation function, for example, the exponential linear unit (ELU)
or the hyperbolic tangent, tanh, which will be sufficiently smooth for integration using Runge-Kutta schemes. The right-hand side of the ODE is parameterized by a fully connected layer - activation layer - fully connected layer, i.e.
where , i.e. the vectorization of all components of the matrices and and biases and . Therefore, the first layer of the MLP in the form takes a linear combination of candidate functions (applied to normalized data). Note that the dictionary does not include the constant term since we include a bias in the first fully connected layer. The function is a regularizer on the parameters (for example, the norm) and the time-derivative is penalized by the norm. When used, the parameters are set to and (no tuning is performed).
The constraints in Eqn. (2.1) are written in continuous-time, i.e. the value of is defined by the ODE and thus can be evaluated at any time . For a given set of parameters , the values are obtained by numerical integration (for example, using a Runge-Kutta scheme). To optimize Eqn. (2.1) using a gradient-based method, the back-propagation algorithm or the adjoint method (following ) can be used. The adjoint method requires solving the ODE (and its adjoint) backward-in-time, which can lead to numerical instabilities. Following the approach in , checkpointing can be used to mitigate this issue.
For all experiments, we take the ‘discretize-then-optimize’ approach. The objective function, Eqn. (2.1), is discretized as follows:
where is an ODE solver (i.e. a Runge-Kutta scheme) applied -times, is rescaled by the time-step, and the time-derivative is discretized on the time-interval with the integral approximated by piece-wise constant quadrature. The constraint that the ODE is satisfied at every time-stamp has been transformed to the constraint that the sequence for is generated by the forward evolution of an ODE solver. The ODE solver takes (as its inputs) the initial data and the function that defines the RHS of the ODE. Note that the ODE solver can be ‘sub-grid’ in the sense that, over a time interval , we can set the solver to take multiple (smaller) time-steps. This will increase storage cost needed for back-propagation; however, taking multiple time-steps can better resolve complex dynamics embedded by (see examples below). Additionally, the time-derivative regularizer helps to control the growth of the generative model, which yields control over the solution and its regularity.
Eqn. (2.2) is solved using the Adam algorithm  with temporal mini-batches. Each mini-batch is constructed by taking a random sampling of the time-stamps for and then collecting all data-points from to for some . This can be easily extended to multiple time-series, by sampling over each trajectory as well. Note that, in experiments, taking non-overlapping sub-intervals did not change the results. The back-propagation algorithm  applied to each of the subintervals can be done in parallel. For all of the regression examples, we set where is the given (sequential) data. For the image classification examples, we used the standard cross-entropy for .
Layers: The right-hand side of the ODE is parameterized by one set of parameters . Therefore, in terms of DNN layers, we are technically only training one “layer”. However, changes in the structure between time-steps are taken into account by the time-dependence, i.e. the dictionary terms that contain . Thus, we are embedding multiple-layers into the time-dependent dictionary.
‘Continuous-depth’: Eqn. (2.2) is fully discrete when using a Runge-Kutta solver for and its gradient can be calculated using the back-propagation algorithm. If we used an adaptive ODE solver, such as Runge-Kutta 45, the forward propagation would generate a new set of time-stamps (which always contain the time-stamps ) in order to approximate the forward evolution , given an error tolerance and a set of parameters . We tested the continuous-depth versions, using back-propagation and the adjoint method (see Appendix A). Although the backward evolution of the ODE constraint may not be well-posed (i.e. numerical unstable or unbounded), our experiments lead to similar results. This could be due to the temporal mini-batches which enforce short-integration windows. Following , a checkpointing approach should be used, which would also control the memory footprint. It should be noted that, the adjoint approach was tested for the ODE examples but not for PDE examples, since the PDE examples are not time-reversible and will lead to backward-integration issues.
In addition, when the network is discrete, one may still consider it as a ‘continuous-depth’ approximation, since the underlying approximation can be refined by adding more time-stamps, without the need to retrain the MLP.
2.1 Autonomous ODE.
When the time-series data is known to be autonomous, i.e. the ODE takes the form , one can drop the -dependency in the dictionary. In this case, the monomials take the form . We train this model by minimizing:
where is the given data (corrupted by noise) over the time-stamps for . The true governing equation is given by the 3d Lorenz system:
which emits chaotic trajectories.
In Figure 2.1(a), we train the model with 20 hidden nodes per layer using a quadratic dictionary, i.e. there are 9 terms in the dictionary, with 20 bias parameters, with 3 bias parameters, for a total of 263 trainable parameters. The solid curves are the time-series generated by a forward pass of the trained model. The learned system generates a high-fidelity trajectory for the first part of the time-interval. In Figure 2.1(b-c), we investigate the effect of the degree in the dictionary. In Figure 2.1(b), using a degree 1 monomial dictionary with 38 hidden nodes per layers, i.e. 3 terms in the dictionary, with 38 bias parameters, with 3 bias parameters (for a total of 269 trainable parameters), the generated curves trace a similar part of phase space, but are point-wise inaccurate. By increasing the hidden nodes to 100 per layer (3 terms in the dictionary, with 100 bias parameters, with 3 bias parameter, for a total of 703 trainable parameters), we see in Figure 2.1(c) that the method (using a degree 1 dictionary) is able to capture the correct point-wise information (on the same order of accuracy as Figure 2.1(a)) but requires more than double the number of parameters.
2.2 Non-Autonomous ODE and Noise.
To investigate the effects of noise and regularization, we fit the data to a non-linear spiral:
corrupted by noise. The third coordinate of Eqn. (2.5) is time-dependent, which can be challenging for many recovery algorithms. This is partly due to the redundancy introduced into the dictionary by the time-dependent terms. To generate Figure 2.2, we set:
the dictionary degree to 1, with 4 terms, with 46 bias parameters, with 3 bias parameters (371 trainable parameters in total);
the degree to 4, with 69 terms, with 4 bias parameter, with 3 bias parameters (295 trainable parameters in total), and the regularization parameter to ;
the degree to 4, with 69 terms, with 5 bias parameter, with 3 bias parameters (368 trainable parameters in total).
For cases (b-c), we set the degree of the dictionary to be larger than the known degree of the governing ODE in order to verify that we do not overfit using a higher-order dictionary and that we are not tailoring the dictionary to the problem. In Figure 2.2(a), the dictionary of linear monomials with a moderately sized MLP seems to be insufficient for capturing the true nonlinear dynamics. This can be observed by the over-smoothing caused by the linear-like dynamics. In Figure 2.2(c), a nonlinear dictionary can fit the data and extract the correct pattern (the ‘squared’-off corners). Figure 2.2(b) shows that we are able to decrease the total number of parameters and fit the trajectory within the same tolerance as (c) by penalizing the derivative. Both (b) and (c) have achieved a mean-squared loss under .
2.3 Comparison for Extracting Governing Models.
Comparison with SINDy. We compare the results of Figure 2.2 with an approximation using the SINDy algorithm from  (theoretical results of convergence and relationship to the problem appear in ). These approaches differ, since the SINDy algorithm seeks to recover a sparse approximation to the governing system given one tuning parameter and is restricted to the span of the dictionary elements. To make the dictionary sufficiently rich, the degree is set to 4 as was done for Figure 2.2 (b-c). Since the sparsity of the first two components is equal to one, we search over all parameter-space (up to 6 decimals) that yields the smallest non-zero sparsity. The smallest non-zero sparsity for the first component is 12 and for the second component is 3 with:
which is independent of and and does not lead to an accurate approximation to the nonlinear spiral. This is likely due to the level of noise present in the data and the incorporation of the time-component.
Comparison with LASSO-based methods. We compare the results of Figure 2.2 with LASSO-based approximations for learning governing equations . The LASSO parameter is chosen so that the sparsity of the solution matches the sparsity of the true dynamics (with respect to a dictionary of degree 4). In addition, the coefficients are ‘debiased’ following the approach in . The learned system is:
which matches the profile of the data in the -plane; however, it does not predict the correct dynamics for (emitting seemingly periodic orbits). While the LASSO-based approach better resolves the state-space dependence, it does not correctly identify the time-component.
Comparison with RNN. In Figure 2.3 the Lorenz system (see Figure 2.1) is approximated by our proposed approach and a standard LSTM (RNN), with the same number of parameters. Although the RNN learns internal hidden states, the RNN does not learn the correct regularity of the trajectories thus leading to sharp corners. It is worth noting that, in experiments, as the number of parameters increases, both the RNN and our network will produce sequences that approach the true time-series.
2.4 ODE from Low-Rank Approximations
For certain spatio-temporal systems, reduced-order models can be used to transform complex dynamics into low-dimensional time-series (with stationary spatial modes). One of the popular methods for extracting the spatial modes and identifying the corresponding temporal-dynamics is the dynamic mode decomposition (DMD) introduced in . The projected DMD method  makes use of the SVD approximation to construct the modes and the linear dynamical system. Another reduced-order model, known as the proper orthogonal decomposition (POD) , can be used to construct spatial modes which best represent a given spatio-temporal dataset. The projected DMD and the POD methods leverage low-rank approximations to reduce the dimension of the system and to construct a linear approximation to the dynamics (related to the spectral analysis of the Koopman operator), see  and the citations within.
We apply our approach to construct a neural network approximation to the time-series generated by a low-rank approximation of the von Kármán vortex sheet. We explain the construction for this example here but for more details, see . Given a collection of measurements , where are the spatial coordinates and are the time-stamps, define as the matrix whose columns are the vectorization of each , i.e. and where is the number of grid points used to discretize . The SVD of the data is given by , where and are unitary matrices and is a diagonal matrix. The best -rank approximation of is given by where is the restriction of to the top singular values and and are the corresponding singular vectors. The columns of the matrix represent the spatial modes that can be used as a low-dimensional representation of the data. In particular, we define the vector by the projection of the data (i.e. the columns of ) onto the span of , that is:
Thus, we can construct the time-stamps from the measurements and can train the system using a version Eqn. (2.1) with the constraint that the ODE is of the form:
The additional matrix resembles the standard linear structure from the DMD approximation and the function can be seen as a nonlinear closure for the linear dynamics. The function is approximated, as before, by . To train the model, we minimize:
where and also includes the trainable parameters from . Note that, to recover an approximation to the original measurements , the vector is mapped back to the correct spatial ordering (inverting the vectorization process).
In Figure 2.4, our approach with an 8 mode decomposition is compared to an 8 mode DMD approximation. The DMD approximation in Figure 2.4(a) introduces two erroneous vortices near the bottom boundary. Our approach matches the test data with higher accuracy, specifically, the relative error between our generated solution at the terminal time is compared to DMD’s relative error of . It is worth noting that this example shows the benefit of the additional term in the low-mode limit; however, using more modes, the DMD method becomes a very accurate approximation. Unlike the standard DMD method, our model does not require the data to be uniformly spaced in time.
3 Partial Differential Equations
A general form for a first-order in time, -th order in space, nonlinear PDE is:
where denotes the collection of all -th order spatial partial derivatives of for . We form a dictionary as done in Sec. 2, where the monomial terms now apply to , , , and for . The spatial derivatives as well as can be calculated numerically from data using finite differences. We then use an MLP, , to parametrize the governing equation:
where and are collections of convolutions, and are biases, are all the parameters from and , and is ELU activation function. The input channels are the monomials determined by , , , and , where is extended to a constant 2d array. The first linear layer maps the dictionary terms to multiple hidden channels, each defined by their own convolution. Thus, each hidden channel is a linear combination of input layers. Then we apply the ELU activation, followed by a convolution, which is equivalent to taking linear combinations of the activated hidden channels. Note that this differs from [35, 29] in several ways. In the first linear layer, our network uses multiple linear combinations rather than the single combination as in [35, 29]. Additionally, by using a (nonlinear) MLP we can approximate a general function on the coordinates and derivative; however, previous work defined approximations that model functions within the span of the dictionary elements.
To illustrate this approach, we apply the method to two examples: a regression problem using data from a 2d Burgers’ simulation (with noise) and the image classification problem using the MNIST and MNIST-Fashion datasets.
3.1 Burgers’ Equation
We consider the 2d Burgers’ equation,
The training and test data are generated on , with time-step and a uniform grid. To make the problem challenging, the training data is generated using a sine function in as the initial condition, while the test data uses a sine function in as the initial condition. We generate 5 training trajectories by adding noise to the initial condition. Our training set is of size and our test data is of size . To train the parameters we minimize:
where is a discretization of .
Training, Mini-batching, and Results. The mini-batches used during training are constructed with mini-batches in time and the full-batch in space. For our experiment, we set a (temporal) batch size of 16 with a length of , i.e. each batch is of size containing 16 short trajectories. The points are chosen at random, without overlapping. The initial points of each mini-batch are treated as the initial conditions for the batch, and our predictions are performed over the length of the trajectory. This is done at each iteration of the Adam optimizer with a learning rate of .
In Figure 3.1, we take 2000 iterations of training, and evaluate our results on both the training and test sets. Each of the convolutional layers have 50 hidden units, for a total of 2301 learnable parameters. For visualization, we plot the learned solution at the terminal time on both the training and test set. The mean-squared error on the full training set is and on the test set is (for reference, the mean-squared value of the test data is over ).
3.2 Image Classification: MNIST Data.
Another application of our approach is in reducing the number of parameters in convolutional networks for image classification. We consider a linear (spatially-invariant) dictionary for Eqn. (3.1
). In particular, the right-hand side of the PDE is in the form of normalization, ReLU activation, two convolutions, and then a final normalization step. Each convolutional layer uses akernel of the form with 6 trainable parameters, where are kernels that represent the identity and the five finite difference approximations to the partial derivatives , , , , and . In CNN [19, 10, 11], the early features (relative to the network depth) typically appear to be images that have been filtered by edge detectors [30, 40]. The early/mid-level trained kernels often represent edge and shape filters, which are connected to second-order spatial derivatives. This motivates us to replace the convolutions in ODE-Net  by finite difference kernels.
Result shows that even though the trainable set of parameters are decreased by a third (each kernel has 6 trainable parameters, rather than 9), the overall accuracy is preserved (see Table 3.1). We follow the same experimental setup as in , except that the convolutional layers are replaced by finite differences. We first downsample the input data using a downsampling block with 3 convolutional layers. Specifically, we take a convolutional layer with 64 output channels and then apply two
convolutional layers with 64 output channels and a stride of 2. Between each convolutional layer, we apply batch-normalization and a ReLU activation function. The output of the downsampling block is of sizewith 64 channels. We then construct our PDE block using 6 ‘PDE’ layers, taking the form:
We call each subinterval (indexed by ) a PDE layer since it is the evolution of a semi-discete approximation of a coupled system of PDE (the particular form of the convolutions act as approximations to differential operators). The function takes the form:
where is batch-normalization, is a collection of kernels of the form , contains all the learnable parameters, and is the ReLU activation function. The PDE block is followed by batch-normalization, the ReLU activation function, and a 2d pooling layer. Lastly, a fully connected layer is used to transform the terminal state (activated and averaged) of the PDE blocks to a 10 component vector.
For the optimization, the cross-entropy loss is used to compare the predicted outputs and the true label. We use the SGD optimizer with momentum set to 0.9. There are 160 total training epochs; we set the learning rate to 0.1 and decrease it by 1/10 after epoch 60, 100 and 140. The training is stopped after 160 epochs. All of the convolutions performed after the downsampling block are linear combinations of the 6 finite difference operators rather than the traditionalconvolution.
3.3 Image Classification: Fashion MNIST.
For a second test, we apply our network to the Fashion MNIST dataset. The Fashion MNIST dataset has 50000 training figures and 10000 test figures that can be classified into 10 different classes. Each data is of size. For our experiment, we did not use any data augmentation, except for normalization before training. The structure of NeuPDE in this example differs from the MNIST experiment as follows: (1) we downsample the data once instead of twice and (2) after the 1st PDE block, which has 64 hidden units, we use a kernel with 64 input channels and 128 output channel, and then use one more PDE block with 128 hidden units followed by a fully connected layer. There are numerous reported test results on the Fashion MNIST dataset ; we compare our result to ResNet18  and a simple MLP . The test results are presented in Table 3.2 and all algorithms are tested without the use of data augmentation. In both the MNIST and Fashion MNIST experiments, we show that the NeuPDE approach allows for less parameters by enforcing (continuous) structure through the network itself.
We propose a method for learning approximations to nonlinear dynamical systems (ODE and PDE) using DNN. The network we use has an architecture similar to ResNet and ODENet, in the sense that it approximates the forward integration of a first-order in time differential equation. However, we replace the forcing function (i.e. the layers) by an MLP with higher-order correlations between the spatio-temporal coordinates, the states, and derivatives of the states. In terms of convolution neural networks, this is equivalent to enforcing that the kernels approximate differential operators (up to some degree). This was shown to produce more accurate approximations to complex time-series and spatio-temporal dynamics. As an additional application, we showed that when applied to image classification problems, our approach reduced the number of parameters needed while maintaining the accuracy. In scientific applications, there is more emphasis on accuracy and models that can incorporate physical structures. We plan to continue to investigate this approach for physical systems.
In imaging, one should consider the computational cost for training the networks versus the number of parameters used. While we argue that our architecture and structural conditions could lead to models with fewer parameter, it could be potentially slower in terms of training (due to the trainable nonlinear layer defined by the dictionary). Additionally, we leave the scalability of our approach for larger imaging data set, such as ImageNet, to future work. For larger classification problems, we suspect that higher-order derivatives (beyond second-order) may be needed. Also, while higher-order integration methods (Runge-Kutta 4 or 45) may be better at capturing features in the ODE/PDE examples, more testing is required to investigate their benefits for imaging classification.
The authors would like to acknowledge the support of AFOSR, FA9550-17-1-0125 and the support of NSF CAREER grant . We would like to thank Scott McCalla for providing feedback on this manuscript.