1 Introduction
Dynamical systems (strogatz2018nonlinear) describe the evolution of phenomena occurring in nature, in which a differential equation, dx_t/dt = f(x_t; θ), models the dynamics of a d-dimensional state x_t ∈ R^d. Here f is a nonlinear operator parameterized by θ. Learning dynamical systems means searching for a good model of a dynamical system in the hypothesis space, guided by some performance criterion. In this work, we study the forecasting problem of predicting a sequence of future states given a sequence of historic states.
A plethora of work has been devoted to learning dynamical systems. When f is known, physics-based methods based on numerical integration are commonly used for parameter estimation (Houska2012). mazier2; Ali2018Equ; Sirignano2018DGM propose to directly solve the differential equation by approximating its solution with neural networks that take the coordinates and time as input. When f is unknown and data are abundant, data-driven methods are preferred. For example, deep learning (DL), especially deep sequence models (Benidis2020NeuralFI; Sezer2020Survey; Flunkert2017DeepARPF; Rangapuram2018DeepSS), has demonstrated success in time series forecasting. In addition, Wang2020TF; ayed2019learning; Wang2020symmetry; chen19 have developed hybrid DL models based on differential equations for spatiotemporal dynamics forecasting.
Our study was initially motivated by the need to forecast COVID-19 dynamics. Since the actual dynamics of COVID-19 are unknown, we perform a comprehensive benchmark study of various methods for forecasting the cumulative numbers of confirmed, removed and death cases.¹
¹ Reproducibility: We open-source our code at https://github.com/Rose-STL-Lab/AutoODE-DSL; the COVID-19 data we use is from the Johns Hopkins dataset https://github.com/CSSEGISandData/COVID-19.
To our surprise, we found that physics-based models significantly outperform deep learning methods for COVID-19 forecasting. We adopt automatic differentiation for parameter estimation. Specifically, we use numerical integration to generate state estimates and minimize the difference between the estimates and the ground truth with automatic differentiation (Baydin2017AutomaticDI; Paszke17diff). This leads to a novel hybrid method for COVID-19 forecasting, AutoODE-COVID, which obtains a 57.4% reduction in mean absolute error for 7-day-ahead prediction.
To understand the inferior performance of DL in forecasting COVID-19 dynamics, we experiment with several other dynamical systems: Sine, SEIR, Lotka-Volterra and FitzHugh–Nagumo. We observe that DL models cannot cope with two distribution shift scenarios that naturally occur in dynamical system learning: non-stationary dynamics and dynamics with different parameters. Our findings highlight the unique challenge of using DL models for learning dynamical systems and call for more robust methods that generalize well on the forecasting task. To summarize, our contributions are the following:

We study the learning of dynamical systems for forecasting, even with unobserved variables. We perform a benchmark study of both physics-based and data-driven models for predicting the COVID-19 dynamics of the cumulative confirmed, removed and death cases.

We propose a hybrid model, AutoODE-COVID, with a novel compartmental model and automatic differentiation for forecasting COVID-19. Our method obtains a 57.4% reduction in mean absolute error for 7-day-ahead prediction compared with the best DL competitor.

We provide empirical evidence to explain the inferior performance of DL by learning the Lotka-Volterra, FitzHugh–Nagumo and SEIR dynamics. We find that widely-used DL models fail to learn the correct dynamics when there is a distribution shift either in the data space or in the parameters of the dynamical system.
2 Related Work
Learning Dynamical Systems
The seminal work by Brunton2015Sparse proposed to solve ODEs by creating a dictionary of possible terms and applying sparse regression to select appropriate terms, but it assumes that the chosen library is sufficient. Physics-informed deep learning directly solves differential equations with neural nets given space and time as input (mazier2; Ali2018Equ; Sirignano2018DGM). This type of method cannot be used for forecasting, since the future always lies outside of the training domain and neural nets cannot extrapolate to unseen domains (Kouw2018domain; Amodei2016Safety). Local methods, such as ARIMA and Gaussian SSMs (Salina19gaussian; Rasmussen16Gaussian; Du16point), learn the parameters individually for each time series. Hybrid DL models, e.g., ayed2019learning; Wang2020symmetry; chen19; Ayed2019LearningDS, integrate differential equations into DL for temporal dynamics forecasting.
Deep Sequence Models
Since accurate numerical computation requires substantial manual engineering and the theoretical properties of a system may not be well understood, deep sequence models have been widely used for learning dynamical systems. Sequence-to-sequence models and the Transformer have an encoder-decoder structure that can directly map input sequences to output sequences with different lengths (Vaswani2017AttentionIA; Wu2020Deep; Li2020Enhancing; Rangapuram2018DeepSS; Flunkert2017DeepARPF). Fully connected neural networks can also be used autoregressively to produce multiple time-step forecasts (Benidis2020NeuralFI; Lim2020TimeSF). Neural ODE (chen19) is based on the assumption that the data is governed by an ODE system and is able to generate continuous predictions. When the data is spatially correlated, deep graph models, such as graph convolution networks and graph attention networks (Petar2017Attention), have also been used.
Epidemic Forecasting
Compartmental models are commonly used for modeling epidemics. Chen20SIR proposes a time-dependent SIR model that uses ridge regression to predict the transmission and recovery rates over time. A potential limitation of this method is that it does not consider the incubation period and unreported cases. Pei20Initial modified the compartments in the SEIR model into subpopulations commuting among different places, and estimated the model parameters using iterated filtering methods. Wang20Survival proposes a population-level survival-convolution method to model the number of infectious people as a convolution of newly infected cases and the proportion of individuals remaining infectious over time. zou20 proposes the SuEIR model that incorporates the unreported cases and the effect of the exposed group on susceptibles. Davis20transmission shows the importance of simultaneously modeling the transmission rates among the fifty U.S. states, since transmission between states is significant.
3 Learning Dynamical Systems
3.1 Problem Formulation
Denote x ∈ R^{d_1} as the observed variables and u ∈ R^{d_2} as the unobserved variables. We aim to learn a dynamical system given as
(3.1)  dx_t/dt = f(x_t, u_t; θ),  du_t/dt = g(x_t, u_t; θ).
In practice, we have observations {x_0, …, x_T} as inputs. The task of learning dynamical systems is to learn f, g and θ, and to produce accurate forecasts {x_{T+1}, …, x_{T+H}}, where H is called the forecasting horizon.
3.2 Data-Driven Modeling
For data-driven models, we assume both f and θ are unknown. We are given training and test samples either as sliced subsequences from a long sequence (same parameters, different initial conditions) or as independent samples from the system (different parameters, same initial conditions). In particular, let P_train be the training data distribution and P_test be the test data distribution. DL seeks a hypothesis h that maps a sequence of past values to future values:
(3.2)  (x̂_{t+1}^{(i)}, …, x̂_{t+q}^{(i)}) = h(x_{t−p+1}^{(i)}, …, x_t^{(i)}),
where i denotes an individual sample, p is the input length and q is the output length.
Following the standard statistical learning setting, a deep sequence model minimizes the training loss (1/n) Σ_{i=1}^{n} ℓ(h(x^{(i)}), y^{(i)}), where (x^{(i)}, y^{(i)}) is the i-th training sample and ℓ is a loss function. For example, for the square loss, we have ℓ(h(x), y) = ‖h(x) − y‖². The test error is given as E_{(x,y)∼P_test}[ℓ(h(x), y)]. The goal is to achieve small test error; a small gap between training and test error indicates good generalization ability.
A fundamental difficulty of forecasting in dynamical systems is the distribution shift that naturally occurs in learning dynamical systems (Kouw2018domain; Amodei2016Safety). In forecasting, the data in the future often lie outside the training domain, which requires methods to extrapolate to the unseen domain. This is in contrast to classical machine learning theory, where generalization refers to a model adapting to unseen data drawn from the same distribution (Hastie17element; Poggio12theory).
3.3 Physics-Based Modeling
Physics-based modeling assumes we already have an appropriate system of ODEs to describe the underlying dynamics. We know the functions f and g, but not the parameters. We can use automatic differentiation to estimate the unknown parameters θ and the initial values of the unobserved variables u_0. We call this procedure AutoODE. Similar approaches have been used in other papers (Rackauckas2020UniversalDE; zou20) but have not been well formalized. The main procedure is described in Algorithm 1. Meanwhile, we need to ensure that x and u are sufficiently correlated that AutoODE can correctly learn all the parameters based on the observable x only. If x and u are uncorrelated or only loosely correlated, we may not be able to estimate u solely based on the observations of x.
Algorithm 1: AutoODE
0: Initialize the unknown parameters θ in Eqn. 3.1 randomly.
1: Discretize Eqn. 3.1 and apply the 4th-order Runge-Kutta (RK4) method.
2: Generate estimates x̂_1, …, x̂_T for the observations x_1, …, x_T.
3: Minimize the forecasting loss Σ_{t=1}^{T} ‖x̂_t − x_t‖² with the Adam optimizer.
4: After convergence, use the estimated θ and the 4th-order Runge-Kutta method to generate the final prediction x̂_{T+1}, …, x̂_{T+H}.
NeuralODE (chen19) uses the adjoint method to differentiate through the numerical solver. Adjoint methods are more efficient for higher-dimensional neural network models that require complex numerical integration. In our case, since we are dealing with low-dimensional ordinary differential equations, RK4 is sufficient to generate accurate predictions. We can directly implement RK4 in PyTorch and make it fully differentiable.
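As a concrete sketch of Algorithm 1, the loop below fits the single unknown parameter of a toy system dx/dt = r·x with RK4. The paper implements RK4 in PyTorch and differentiates through it with Adam; to keep this example self-contained, it uses NumPy and replaces autodiff with a finite-difference gradient and plain gradient descent — the structure of the procedure is the same.

```python
import numpy as np

def rk4_step(f, x, t, dt, theta):
    """One 4th-order Runge-Kutta step for dx/dt = f(x, t; theta)."""
    k1 = f(x, t, theta)
    k2 = f(x + 0.5 * dt * k1, t + 0.5 * dt, theta)
    k3 = f(x + 0.5 * dt * k2, t + 0.5 * dt, theta)
    k4 = f(x + dt * k3, t + dt, theta)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, x0, theta, n_steps, dt=0.1):
    xs = [x0]
    for i in range(n_steps):
        xs.append(rk4_step(f, xs[-1], i * dt, dt, theta))
    return np.array(xs)

# Toy system with one unknown parameter r: dx/dt = r * x.
f = lambda x, t, theta: theta * x

# "Observed" trajectory generated with the true value r = 0.5.
true_traj = integrate(f, np.array([1.0]), 0.5, 20)

def loss(theta):
    return np.mean((integrate(f, np.array([1.0]), theta, 20) - true_traj) ** 2)

# Step 3 of Algorithm 1 (the paper uses Adam + autodiff; plain gradient
# descent with a finite-difference gradient is shown here instead).
theta, lr, eps = 0.1, 0.05, 1e-5
for _ in range(500):
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
    theta -= lr * grad

print(round(theta, 3))  # recovered growth rate, close to 0.5
```

After convergence, the fitted θ is plugged back into the same integrator to produce the forecast (step 4 of Algorithm 1).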
4 Case Study: COVID-19 Forecasting
We benchmark different methods for predicting the COVID-19 dynamics of the cumulative confirmed, removed and death cases. We instantiate a special instance of AutoODE with a novel SuEIR compartmental model, called AutoODE-COVID.
(4.1) 
4.1 AutoODE-COVID
We present the AutoODE-COVID model given in Eqn. (4.1); the details are listed below. We estimate the unknown parameters β, σ, μ and γ, which correspond to the transmission, incubation, discovery and recovery rates, respectively. We also need to estimate the unobserved variables, the cumulative numbers of susceptible (S), exposed (E) and unreported (U) cases, and predict the cumulative numbers of infected (I), removed (R) and death (D) cases. The total population is assumed to be constant for each U.S. state.
Low-Rank Approximation to the Sparse Transmission Matrix
We introduce a sparse transmission matrix A to model the transmission rates among the 50 U.S. states. A is the element-wise product of the U.S. states adjacency matrix M and a transmission matrix B, that is, A = M ⊙ B. M is a sparse 0-1 matrix that indicates whether two states are adjacent to each other. B learns the transmission rates among all 50 states from the data. A = M ⊙ B means that we omit transmission between states that are not adjacent to each other. We make A sparse with M because B contains too many parameters and we want to avoid overfitting. To further reduce the number of parameters and improve the computational efficiency, we use a low-rank approximation to generate the transmission matrix, B = UV^⊤, where U, V ∈ R^{50×k} for k ≪ 50.
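A minimal sketch of the masked low-rank construction, with a random adjacency mask standing in for the true U.S. state adjacency and an illustrative rank k = 5:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5  # 50 U.S. states, rank k << 50 (k = 5 is illustrative)

# M: 0-1 adjacency mask (random here; the model uses actual state adjacency).
M = (rng.random((n, n)) < 0.1).astype(float)
np.fill_diagonal(M, 1.0)

# Low-rank factors U, V generate the dense transmission matrix B = U V^T.
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
B = U @ V.T

# Element-wise masking keeps only transmission between adjacent states.
A = M * B

print(B.size, U.size + V.size)  # 2500 free parameters reduced to 500
assert np.all(A[M == 0] == 0)   # non-adjacent pairs carry no transmission
```

In training, U and V (not B) are the trainable parameters, which is where the parameter reduction comes from.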
Piecewise Linear Transmission Rate
Most compartmental models assume the transmission rate is constant. For COVID-19, the transmission rate changes over time due to government regulations, such as school closures and social distancing. Even though we focus on short-term forecasting (7 days ahead), it is possible that the transmission rate changes during the training period. Instead of a constant approximation to the transmission rate, we use a piecewise linear function of time t, and set the breakpoints, slopes and biases as trainable parameters.
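A sketch of such a transmission rate, with hypothetical breakpoints and values (in AutoODE-COVID these are trainable parameters rather than fixed numbers):

```python
import numpy as np

# Hypothetical breakpoints and rate values; trainable in AutoODE-COVID.
breakpoints = np.array([0.0, 10.0, 20.0])  # days at which the slope changes
values = np.array([0.30, 0.18, 0.15])      # transmission rate at each breakpoint

def beta(t):
    """Piecewise linear transmission rate beta(t) via linear interpolation."""
    return np.interp(t, breakpoints, values)

print(beta(5.0))  # halfway between 0.30 and 0.18, i.e. 0.24
```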
Death Rate Modeling
The relationship between the numbers of accumulated removed and death cases can be close to linear, exponential or concave in different states. We assume the death rate is a linear combination of basis functions chosen to cover both convex and concave shapes, where the combination coefficients are set as learnable parameters.
Weighted Loss Function
We set the unknown parameters in Eqn. (4.1) as trainable, and apply AutoODE to minimize a weighted loss function of the form
L = Σ_t w_t [α_I ℓ(Î_t, I_t) + α_R ℓ(R̂_t, R_t) + α_D ℓ(D̂_t, D_t)],
with weights α_I, α_R, α_D and loss function ℓ, which we will specify in the following section. We utilize these weights to balance the losses of the three series due to scaling differences, and also to reweight the loss at different time steps: we give larger weights w_t to more recent data points. The weighting constants are tuned on the validation set.
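The weighting scheme can be sketched as follows; the exponential form of the time weights and the MAE base loss here are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def weighted_loss(pred, target, series_weights, decay=0.9):
    """Weighted MAE over (T, 3) arrays holding the I, R, D series.

    series_weights balances the three series' scales; time weights
    decay**(T-1-t) emphasize more recent steps.  Both the exponential
    time weighting and the MAE base loss are illustrative assumptions.
    """
    T = pred.shape[0]
    time_w = decay ** np.arange(T - 1, -1, -1)  # largest weight at t = T-1
    per_step = np.abs(pred - target) @ np.asarray(series_weights)
    return float(np.sum(time_w * per_step) / np.sum(time_w))

pred = np.zeros((7, 3))
target = np.ones((7, 3))
print(weighted_loss(pred, target, [1.0, 1.0, 1.0]))  # every step errs by 3
```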
Table 1: 7-day-ahead forecasting MAE of I, R and D for the weeks of July 13, Aug 23 and Sept 6.

MAE          July 13            Aug 23             Sept 6
             I     R     D      I     R     D      I     R     D
FC           8379  5330  257    559   701   30     775   654   33
Seq2Seq      5172  2790  99     781   700   40     728   787   35
Transformer  8225  2937  2546   1282  1308  46     1301  1253  41
NeuralODE    7283  5371  173    682   661   43     858   791   35
GCN          6843  3107  266    1066  923   55     1605  984   44
GAT          4155  2067  153    1003  898   51     1065  833   40
SuEIR        1746  1984  136    639   778   39     888   637   47
AutoODE      818   1079  109    514   538   41     600   599   39
4.2 Experiments on Forecasting COVID-19 Dynamics
We use the COVID-19 data from Apr 14 to Sept 12 provided by Johns Hopkins University (covid19jhu). It contains the cumulative numbers of infected (I), recovered (R) and death (D) cases.
We investigate six DL models for forecasting COVID-19 trajectories: sequence-to-sequence with LSTMs (Seq2Seq), Transformer, autoregressive fully connected neural nets (FC), NeuralODE, graph convolution networks (GCN) and graph attention networks (GAT). We standardize the I, R and D time series of each state individually to avoid one set of features dominating another. We use sliding windows to generate samples of sequences before the target week and split them into training and validation sets. We perform an exhaustive search to tune the hyperparameters on the validation set.
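The sliding-window slicing can be sketched as follows; the input and output lengths are illustrative:

```python
import numpy as np

def sliding_windows(series, input_len, output_len):
    """Slice a (T, d) series into (input, target) training pairs."""
    X, Y = [], []
    for start in range(len(series) - input_len - output_len + 1):
        X.append(series[start:start + input_len])
        Y.append(series[start + input_len:start + input_len + output_len])
    return np.array(X), np.array(Y)

# 100 days of a 3-feature (I, R, D) series; predict 7 days from 14 days.
series = np.arange(300, dtype=float).reshape(100, 3)
X, Y = sliding_windows(series, input_len=14, output_len=7)
print(X.shape, Y.shape)  # (80, 14, 3) (80, 7, 3)
```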
We also compare with SuEIR (zou20) and AutoODE-COVID. We rescale the trajectories of the cumulative case counts of each state by the population of that state. We use the quantile regression loss (MQCNN2018Wen) for all models. All the DL models are trained to predict the number of daily new cases instead of the number of cumulative cases, because we want to detrend the time series and put the training and test samples in approximately the same range. All experiments were conducted on Amazon SageMaker (sagemaker).
Table 1 shows the 7-day-ahead forecasting mean absolute errors of the three features I, R and D for the weeks of July 13, Aug 23 and Sept 6. We can see that AutoODE-COVID overall performs better than SuEIR and all the DL models. FC and Seq2Seq have better prediction accuracy on death counts, but all DL models have much bigger errors on the prediction for the week of July 13. Figure 1 visualizes the 7-day-ahead COVID-19 predictions of I, R and D in Massachusetts by AutoODE-COVID and the best-performing DL model, FC. The prediction by AutoODE-COVID is closer to the target and has smaller confidence intervals. This demonstrates the effectiveness of our hybrid model, as well as the benefits of our novel compartmental model design.
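The quantile (pinball) loss used for training can be sketched as follows; the quantile level q = 0.9 is illustrative:

```python
import numpy as np

def pinball_loss(pred, target, q):
    """Quantile (pinball) loss: penalizes under-/over-prediction asymmetrically."""
    diff = target - pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# At quantile q = 0.9, under-prediction costs q/(1-q) = 9x more than
# over-prediction, pushing the forecast toward the 90th percentile.
print(pinball_loss(0.0, 1.0, 0.9))  # target above prediction: 0.9 * |diff|
print(pinball_loss(2.0, 1.0, 0.9))  # target below prediction: 0.1 * |diff|
```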
5 Generalization in Learning Dynamical Systems
We explore the potential reasons behind the inadequate performance of DL models on COVID-19 forecasting by studying their generalization ability. We experimentally explore two cases of distribution shift: shift in the data domain, where the range of the observations changes; and shift in the parameter domain, where the range of the system parameters differs between training and test.
5.1 Dynamical Systems
Apart from COVID-19 trajectories, we also investigate the following three nonlinear dynamical systems.
Lotka-Volterra (LV)
The Lotka-Volterra system (Ahmad1993OnTN) of Eqn. (5.1) describes the dynamics of biological systems in which predators and prey interact, where p denotes the number of interacting species and x_i denotes the population size of species i at time t. The unknown parameters r_i, K_i and a_ij denote the intrinsic growth rate of species i, the carrying capacity of species i when the other species are absent, and the interspecies competition between two different species, respectively.
FitzHugh–Nagumo (FHN)
FitzHugh and, independently, Nagumo derived Eqn. (5.2) to qualitatively describe the behaviour of spike potentials in the giant axon of squid neurons. The system describes the reciprocal dependencies of the voltage v across an axon membrane and a recovery variable w summarizing outward currents. The unknown parameters a, b and τ are dimensionless and positive, and τ determines how fast w changes relative to v.
SEIR
The SEIR system of Eqn. (5.3) models the spread of infectious diseases (Tillett1992Dynamics). It has four compartments: Susceptible (S) denotes those who potentially have the disease, Exposed (E) models the incubation period, Infected (I) denotes the infectious who currently have the disease, and Removed/Recovered (R) denotes those who have recovered from the disease or have died. The total population N is assumed to be constant and equal to the sum of these four states. The unknown parameters β, σ and γ denote the transmission, incubation, and recovery rates, respectively.
(5.1)  dx_i/dt = r_i x_i (1 − Σ_{j=1}^{p} a_ij x_j / K_i),  i = 1, …, p
(5.2)  dv/dt = v − v³/3 − w + I_ext,  τ dw/dt = v + a − b w
(5.3)  dS/dt = −βSI/N,  dE/dt = βSI/N − σE,  dI/dt = σE − γI,  dR/dt = γI
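A minimal RK4 simulation of the SEIR system (5.3), with illustrative parameter values, confirms that the total population is conserved:

```python
import numpy as np

def seir(state, beta, sigma, gamma, N):
    """Right-hand side of the SEIR system in Eqn. (5.3)."""
    S, E, I, R = state
    return np.array([-beta * S * I / N,
                     beta * S * I / N - sigma * E,
                     sigma * E - gamma * I,
                     gamma * I])

def rk4(f, state, dt, *args):
    k1 = f(state, *args)
    k2 = f(state + 0.5 * dt * k1, *args)
    k3 = f(state + 0.5 * dt * k2, *args)
    k4 = f(state + dt * k3, *args)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Illustrative rates; N is the (constant) total population.
beta, sigma, gamma, N = 0.3, 0.2, 0.1, 1000.0
state = np.array([990.0, 0.0, 10.0, 0.0])  # S, E, I, R at t = 0
for _ in range(1000):                       # 100 days with dt = 0.1
    state = rk4(seir, state, 0.1, beta, sigma, gamma, N)

print(round(state.sum()))  # total population stays at 1000
```

Since the four right-hand sides sum to zero, S + E + I + R is a linear invariant, which RK4 preserves exactly up to round-off.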
5.2 Interpolation vs. Extrapolation
We define P_data^train and P_data^test as the training and test data distributions, and P_θ^train and P_θ^test as the parameter distributions of the training and test sets, where the parameters here refer to the coefficients and the initial values of the dynamical systems. A distribution is a function that maps a sample space to the interval [0, 1] if it is a continuous distribution, or to a subset of that interval if it is a discrete distribution. The domain of a distribution P, denoted Dom(P), refers to the set of values (the sample space) on which that distribution is defined.
We define two types of interpolation and extrapolation tasks. Regarding the data domain, we define a task as an interpolation task when the data domain of the test data is a subset of the domain of the training data, i.e., Dom(P_data^test) ⊆ Dom(P_data^train); extrapolation occurs when Dom(P_data^test) ⊄ Dom(P_data^train). Regarding the parameter domain, an interpolation task indicates that Dom(P_θ^test) ⊆ Dom(P_θ^train), and an extrapolation task indicates that Dom(P_θ^test) ⊄ Dom(P_θ^train).
5.3 Generalization in dynamical systems: unseen data in a different data domain
Through a simple experiment on learning Sine curves, we show that deep sequence models generalize poorly on extrapolation tasks in the data domain, i.e., Dom(P_data^test) ⊄ Dom(P_data^train). Specifically, we generate 2k Sine samples of length 60 with different frequencies and phases, and randomly split them into training, validation and interpolation-test sets. The extrapolation-test set is the interpolation-test set shifted up by 1. We investigate four models: Seq2Seq (sequence-to-sequence with LSTMs), Transformer, FC (autoregressive fully connected neural nets) and NeuralODE. All models are trained to make 30-step-ahead predictions given the previous 30 steps.
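The data generation for this experiment can be sketched as follows; the frequency and phase ranges are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sine_samples(n, length=60):
    """Sine curves with random frequencies and phases, shape (n, length)."""
    t = np.arange(length)
    freq = rng.uniform(0.05, 0.2, size=(n, 1))   # illustrative ranges
    phase = rng.uniform(0.0, 2 * np.pi, size=(n, 1))
    return np.sin(2 * np.pi * freq * t + phase)

samples = make_sine_samples(2000)
train, val, interp_test = samples[:1600], samples[1600:1800], samples[1800:]

# Extrapolation-test set: the same curves shifted up by 1, so the test
# data domain no longer lies inside the training domain [-1, 1].
extrap_test = interp_test + 1.0

assert train.min() >= -1.0 and train.max() <= 1.0
assert extrap_test.max() > 1.0  # outside the training data domain
```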
Table 3 shows that all models have substantially larger errors on the extrapolation-test set. Figure 3 shows Seq2Seq predictions on an interpolation (left) and an extrapolation (right) test sample. We can see that Seq2Seq makes accurate predictions on the interpolation-test sample, while it fails to generalize when the same sample is shifted up by only 1.
5.4 Generalization in dynamical systems: unseen data with different system parameters
Even when Dom(P_data^test) ⊆ Dom(P_data^train), deep sequence models can still fail to learn the correct dynamics if there is a distribution shift in the parameter domain, i.e., Dom(P_θ^test) ⊄ Dom(P_θ^train).
For each of the three dynamics in Section 5.1, we generate 6k synthetic time series samples with different system parameters and initial values. The training/validation/interpolation-test sets for each dataset share the same range of system parameters, while the extrapolation-test set contains samples from a different range. Table 2 shows the parameter distributions of the test sets. For each dynamics, we perform two experiments to evaluate the models' extrapolation ability with respect to initial values and system parameters. All samples are normalized to a common range.
Table 2: Parameter distributions of the test sets.

       System Parameters                  Initial Values
       Interpolation    Extrapolation     Interpolation    Extrapolation
LV
FHN
SEIR
Table 3: Prediction RMSE on interpolation (Int) and extrapolation (Ext) test sets. Each system has two Int/Ext column pairs, one per experiment (initial values and system parameters).

RMSE         LV                          FHN                         SEIR
             Int    Ext    Int    Ext    Int    Ext    Int    Ext    Int   Ext   Int   Ext
Seq2Seq      0.050  0.215  0.028  0.119  0.093  0.738  0.079  0.152  1.12  4.14  2.58  7.89
FC           0.078  0.227  0.044  0.131  0.057  0.402  0.057  0.120  1.04  3.20  1.82  5.85
Transformer  0.074  0.231  0.067  0.142  0.102  0.548  0.111  0.208  1.09  4.23  2.01  6.13
NeuralODE    0.091  0.196  0.050  0.127  0.163  0.689  0.124  0.371  1.25  3.27  2.01  5.82
AutoODE      0.057  0.054  0.018  0.028  0.059  0.058  0.066  0.069  0.89  0.91  0.96  1.02
Table 3 shows the prediction RMSEs of the models on the initial-value and system-parameter interpolation and extrapolation test sets. We observe that the models' prediction errors on the extrapolation test sets are much larger than their errors on the interpolation test sets. Figures 4 and 5 show that Seq2Seq and FC fail to make accurate predictions when tested outside of the parameter distribution, even though they make accurate predictions for parameter-interpolation test samples.
Our AutoODE consistently obtains the lowest errors, as it is not affected by the range of the parameters or the initial values. However, it is a local method, and we need to train one model for each sample. In contrast, DL models can only mimic the behaviors of the SEIR, LV and FHN dynamics rather than understand the underlying mechanisms.
We are still at an early or middle stage of the COVID-19 epidemic. The recovery rate γ may increase as we gain more treatment experience. If the situation does not improve, government regulations will become stricter, leading to a smaller contact rate β. Thus, there is a great chance that new test samples lie outside of the parameter domain of the training data. In that case, DL models would not make accurate predictions for COVID-19.
6 Conclusion
We study the problem of forecasting nonlinear dynamical systems and propose a hybrid method, AutoODE-COVID, for forecasting COVID-19 dynamics. From a benchmark study of DL and physics-based models, we find that FC and Seq2Seq have better prediction accuracy on the number of deaths, while the DL methods generally perform much worse than the physics-based models on the numbers of infected and removed cases. This is mainly due to the distribution shift in the COVID-19 dynamics. On several other nonlinear dynamical systems, we experimentally show that four DL models fail to generalize under shifted distributions in both the data and the parameter domains, even though these models are powerful enough to memorize the training data and perform well on interpolation tasks. Our study provides important insights on learning real-world dynamical systems: to achieve accurate forecasts with DL, we need to ensure that both the data and the dynamical system parameters in the training set cover the domains of the test set. Future work includes incorporating compartmental models into deep learning models and deriving theoretical generalization bounds for deep learning models on dynamics forecasting.