1. Introduction
1.1. Background and Literature Review
The modeling and prediction of dynamical systems and time series is an important goal in numerous domains, including biomedicine, climatology, robotics, and the social sciences. Traditional approaches to modeling these systems appeal to careful study of their mechanisms and the design of targeted equations to represent them. These carefully built mechanistic models have impacted humankind in numerous arenas, including our ability to land spacecraft on celestial bodies, provide high-fidelity numerical weather prediction, and artificially regulate physiologic processes through the use of pacemakers and artificial pancreases, for example. This paper focuses on the learning of model error: we assume that an imperfect mechanistic model is known and that data are used to improve it. We introduce a framework for this problem, focusing on distinctions between Markovian and non-Markovian model error, providing a unifying review of relevant literature, developing some underpinning theory related to both the Markovian and non-Markovian settings, and presenting numerical experiments which illustrate our key findings.
To set our work in context, we first review the use of data-driven methods for time-dependent problems, organizing the literature review around four themes comprising Sections 1.1.1, 1.1.2, 1.1.3 and 1.1.4; these are devoted, respectively, to pure data-driven methods, hybrid methods that build on mechanistic models, non-Markovian models that describe memory, and applications of the various approaches. Having set the work in context, in Section 1.2 we detail the contributions we make and describe the organization of the paper.
1.1.1. Data-Driven Modeling of Dynamical Systems
A recent wave of machine learning successes in data-driven modeling, especially in the imaging sciences, has shown that we can demand even more from existing models, or that we can design models of more complex phenomena than heretofore. Traditional models built from, for example, low-order polynomials and/or linearized model reductions may appear limited when compared to the flexible function approximation frameworks provided by neural networks and kernel methods. Neural networks, for example, have a long history of success in modeling dynamical systems (Narendra and Parthasarathy, 1992; González-García et al., 1998; Krischer et al., 1993; Rico-Martinez et al., 1994; Rico-Martines et al., 1993; Rico-Martínez et al., 1992; Lagaris et al., 1998), and recent developments in deep learning for operators continue to propel this trend (Lu et al., 2020; Bhattacharya et al., 2020; Li et al., 2021). The success of neural networks arguably relies on balanced expressivity and generalizability, but other methods also excel in learning parsimonious and generalizable representations of dynamics. A particularly popular methodology is to perform sparse regression over a dictionary of vector fields, including the use of thresholding approaches (SINDy) (Brunton et al., 2016) and regularized polynomial regression (Tran and Ward, 2017; Schaeffer et al., 2017, 2018, 2020). Nonparametric methods, like Gaussian process models (Rasmussen and Williams, 2006), have also been used widely for modeling nonlinear dynamics (Wang et al., 2005; Frigola et al., 2014; Kocijan, 2016; Chkrebtii et al., 2016). While a good choice of kernel is often essential for the success of these methods, recent progress has been made towards automatic hyperparameter tuning via parametric kernel flows (Hamzi and Owhadi, 2021). Successes with Gaussian process models were also extended to high-dimensional problems by using random feature map approximations (Rahimi and Recht, 2008a) within the context of data-driven learning of parametric partial differential equations and solution operators
(Nelsen and Stuart, 2020). Advancements to data-driven methods based on Koopman operator theory and dynamic mode decomposition have also had significant impact; they offer exciting new possibilities for predicting nonlinear dynamics from data (Tu et al., 2014; Korda et al., 2020; Alexander and Giannakis, 2020).

It is important to consider whether to model in discrete or continuous time, as both have potential advantages. The primary advantage of continuous-time modeling lies in its flexibility and interpretability. It allows for training on irregularly sampled time series; moreover, the learned right-hand side can be solved numerically at any time step, and hence used in a variety of applications. Traditional implementations of continuous-time learning require accurate estimation of time derivatives of the state, but this may be circumvented using approaches that leverage auto-differentiation software (Rubanova et al., 2019; Kaheman et al., 2020) or methods which learn from statistics derived from time series, such as moments or correlation functions (Schneider et al., 2020). Discrete-time approaches, on the other hand, are easily deployed when train and test data sample rates are the same. Moreover, they allow for "non-intrusive" model correction, as additions are applied outside of the numerical integrator; this may be relevant for practical integration with complex simulation software.

Both nonparametric and parametric model classes are used in the learning of dynamical systems, with the latter connecting to the former via the representer theorem when Gaussian process regression (Rasmussen and Williams, 2006) is used (Burov et al., 2020; Gilani et al., 2021; Harlim et al., 2021).

1.1.2. Hybrid Mechanistic and Data-Driven Modeling
Attempts to transform domains that have relied on traditional mechanistic models, by using purely data-driven (i.e. de novo or "learn from scratch") approaches, often fall short. There is now a growing recognition among machine learning researchers that these mechanistic models are very valuable (Miller et al., 2021), as they represent the distillation of centuries of data collected in countless studies and interpreted by domain experts. Recent studies have consistently found advantages in hybrid methods that blend mechanistic knowledge and data-driven techniques. Not only do these methods improve predictive performance (Pathak et al., 2018b), but they also reduce data demands (Rackauckas et al., 2020) and improve interpretability and trustworthiness, which is essential for many applications. This is exemplified by work in autonomous drone landing (Shi et al., 2018) and helicopter flying (Rackauckas et al., 2021), as well as predictions of COVID-19 mortality risk (Sottile et al., 2021) and COVID-19 treatment response (Qian et al., 2021).
The question of how best to use the power of data and machine learning to leverage and build upon our existing mechanistic knowledge is thus of widespread current interest. This question and research direction has been anticipated over the last thirty years of research activity at the interface of dynamical systems with machine learning (Rico-Martinez et al., 1994; Wilson and Zorzetto, 1997; Lovelett et al., 2020), and now a critical mass of effort is developing. A variety of studies have been tackling these questions in weather and climate modeling (Kashinath et al., 2021; Farchi et al., 2021); even in the imaging sciences, where pure machine learning has been spectacularly successful, emerging work shows that incorporating knowledge of underlying physical mechanisms improves performance in a variety of image recognition tasks (Ba et al., 2019).
As noted and studied by Ba et al. (2019), Freno and Carlberg (2019), and others, there are a few common high-level approaches for blending machine learning with mechanistic knowledge: (1) use machine learning to learn additive residual corrections for the mechanistic model (Saveriano et al., 2017; Shi et al., 2018; Kaheman et al., 2019; Harlim et al., 2021; Farchi et al., 2021); (2) use the mechanistic model as an input or feature for a machine learning model (Pathak et al., 2018b; Lei et al., 2020; Borra et al., 2020); (3) use mechanistic knowledge in the form of a differential equation as a final layer in a neural network representation of the solution, or equivalently define the loss function to include approximate satisfaction of the differential equation (Raissi et al., 2019, 2018; Chen et al., 2021; Smith et al., 2020); and (4) use mechanistic intuition to constrain or inform the machine learning architecture (Haber and Ruthotto, 2017; Maulik et al., 2021). Many other successful studies have developed specific designs that further hybridize these and other perspectives (Hamilton et al., 2017; Freno and Carlberg, 2019; Yi Wan et al., 2020; Jia et al., 2021). In addition, parameter estimation for mechanistic models is a well-studied topic in data assimilation, inverse problems, and other mechanistic modeling communities, but recent approaches that leverage machine learning for this task may create new opportunities for accounting for temporal parameter variations (Miller et al., 2020) and unknown observation functions (Linial et al., 2021).

An important distinction should be made between physics-informed surrogate modeling and what we refer to as hybrid modeling. Surrogate modeling primarily focuses on replacing high-cost, high-fidelity mechanistic model simulations with similarly accurate models that are cheap to evaluate. These efforts have shown great promise by training machine learning models on expensive high-fidelity simulation data, and have been especially successful when the underlying physical (or other domain-specific) mechanistic knowledge and equations are incorporated into the model training (Raissi et al., 2019) and architecture (Maulik et al., 2021). We use the term hybrid modeling, on the other hand, to indicate when the final learned system involves interaction (and possibly feedback) between mechanism-based and data-driven models (Pathak et al., 2018b).
In this work, we focus primarily on hybrid methods that learn residuals to an imperfect mechanistic model. The benefits of this form of hybrid modeling, which we and many others have observed, are not yet fully understood in a theoretical sense. Intuitively, nominal mechanistic models are most useful when they encode key nonlinearities that are not readily inferred using general model classes and modest amounts of data. Indeed, classical approximation theorems for fitting polynomials, Fourier modes, and other common function bases directly reflect this relationship by bounding the error with respect to a measure of complexity of the target function (e.g. Lipschitz constants, moduli of continuity, Sobolev norms, etc.) (DeVore and Lorentz, 1993, Chapter 7). Recent work by E et al. (2019) provides a priori error bounds for two-layer neural networks and kernel-based regressions, with constants that depend explicitly on the norm of the target function in the model-hypothesis space (a Barron space and a reproducing kernel Hilbert space, respectively). At the same time, problems for which mechanistic models only capture low-complexity trends (e.g. linear) may still be good candidates for hybrid learning (over purely data-driven approaches), as an accurate linear model reduces the parametric burden for the machine-learning task; this effect is likely accentuated in data-starved regimes. Furthermore, even in cases where data-driven models perform satisfactorily, a hybrid approach may improve interpretability, trustworthiness, and controllability without sacrificing performance.
Hybrid models are often cast in Markovian, memory-free settings where the learned dynamical system (or its learned residuals) depends solely on the observed states. This approach can be highly effective when measurements of all relevant states are available, or when the influence of the unobserved states is adequately described by a function of the observables. This is the perspective employed by Shi et al. (2018), who learn corrections to physical equations of motion for an autonomous vehicle in regions of state space where the physics perform poorly; these residual errors are driven by unmodeled turbulence during landing, but can be predicted using the observable states of the vehicle (i.e. position, velocity, and acceleration). This is also the perspective taken in applications involving high-dimensional multiscale dynamical systems, wherein subgrid closure models parameterize the effects of expensive fine-scale interactions (e.g. cloud simulations) as functions of the coarse variables (Grabowski, 2001; Khairoutdinov and Randall, 2001; Tan et al., 2018; Brenowitz and Bretherton, 2018; O'Gorman and Dwyer, 2018; Rasp et al., 2018; Schneider et al., 2020; Beucler et al., 2021). The result is a hybrid dynamical system with a physics-based equation defined on the coarse variables and a Markovian correction term that accounts for the effects of the expensive fine-scale dynamics.
1.1.3. Non-Markovian Data-Driven Modeling
Unobserved and unmodeled processes are often responsible for model errors that cannot be represented in a Markovian fashion within the observed variables alone; this has driven substantial advances in memory-based modeling. One approach is the use of delay embeddings (Takens, 1981). These methods are inherently tied to discrete-time representations of the data and, although very successful in many applied contexts, are of less value when the goal of data-driven learning is to fit continuous-time models; this is a desirable modeling goal in many settings.
An alternative route to understanding memory is the Mori-Zwanzig formalism, which is a fundamental building block in the treatment of memory and hidden variables and may be employed for both discrete-time and continuous-time models. Although initially developed primarily in the context of statistical mechanics, it provides the basis for understanding hidden variables in dynamical systems, and thus underpins many generic computational tools applied in this setting (Chorin et al., 2000; Zhu et al., 2018; Gouasmi et al., 2017). It has been successfully applied to problems in fluid turbulence (Duraisamy et al., 2019; Parish and Duraisamy, 2017) and molecular dynamics (Li et al., 2017; Hijón et al., 2010). Studies by Ma et al. (2019) and Wang et al. (2020) demonstrated how the Mori-Zwanzig formalism motivates the use of recurrent neural networks (RNNs) (Rumelhart et al., 1986; Goodfellow et al., 2016) as a deep learning approach to non-Markovian closure modeling. Indeed, closure modeling using RNNs has become an exciting new way to learn memory-based closures (Kani and Elsheikh, 2017; Chattopadhyay et al., 2020b; Harlim et al., 2021).
In contrast, Lin and Lu (2021) use the Mori-Zwanzig formalism to directly justify a delay-embedding approach to closure modeling. Harlim et al. (2021) performed extensive comparisons between these delay-embedding and RNN-based closure approaches. Although the original formulation of Mori-Zwanzig as a general-purpose approach to modeling partially observed systems was in continuous-time (Chorin et al., 2000), many practical implementations adopt a discrete-time picture (Darve et al., 2009; Lin and Lu, 2021). This causes the learned memory terms to depend on sampling rates, which, in turn, can inhibit the flexibility and interpretability of the model.
Recent advances in continuous-time memory-based modeling, however, may be applicable to these non-Markovian hybrid model settings. The theory of continuous-time RNNs (i.e. RNNs formulated as differential equations, rather than as recurrence relations) was studied in the 1990s (Funahashi and Nakamura, 1993; Beer, 1995), albeit for equations with a specific additive structure. This structure was exploited in a continuous-time reservoir computing approach by Lu et al. (2018) for reconstructing chaotic attractors from data; we note that comparisons between RNNs and reservoir computing (a subclass of RNNs with random parameters fixed in the recurrent state) in discrete-time have yielded mixed conclusions in terms of their relative efficiencies and ability to retain memory (Vlachas et al., 2020; Chattopadhyay et al., 2020a). Recent formulations of continuous-time RNNs have departed slightly from the additive structure, and have focused on constraints and architectures that ensure stability of the resulting dynamical system (Chang et al., 2019; Erichson et al., 2020; Niu et al., 2019; Rubanova et al., 2019; Sherstinsky, 2020). In addition, significant theoretical work has been performed for linear RNNs in continuous-time (Li et al., 2020). Nevertheless, these various methods have not yet been formulated within a hybrid modeling framework, nor has their approximation power been carefully evaluated in that context. A recent step in this direction, however, is the work by Gupta and Lermusiaux (2021), which tackles non-Markovian hybrid modeling in continuous-time with neural network-based delay differential equations.
1.1.4. Applications of Data-Driven Modeling
In order to deploy hybrid methods in real-world scenarios, we must also be able to cope with noisy, partial observations. Accommodating the learning of model error in this setting, as well as state estimation, is an active area of research in the data assimilation (DA) community (Pulido et al., 2018; Farchi et al., 2021; Bocquet et al., 2020). Learning dynamics from noisy data is generally nontrivial for nonlinear systems: there is a chicken-and-egg problem in which accurate state estimation typically relies on the availability of correct models, and correct models are most readily identified using accurate state estimates. Recent studies have addressed this challenge by attempting to jointly learn the noise and the dynamics. Gottwald and Reich (2021) approach this problem from a data assimilation perspective, employing an Ensemble Kalman Filter (EnKF) to iteratively update the parameters of their dynamics model and then filter the current state using the updated dynamics. Similar studies were performed by Brajard et al. (2021), and applied specifically in model error scenarios (Brajard et al., 2020; Farchi et al., 2021; Wikner et al., 2020). Kaheman et al. (2019) approached this problem from a variational perspective, performing a single optimization over all noise sequences and dynamical parameterizations. Nguyen et al. (2019) used an Expectation-Maximization perspective to compare these variational and ensemble-based approaches; further study is needed to understand the tradeoffs between these styles of optimization.
We note that these data assimilators are themselves dynamical systems, which can be tuned (using optimization and machine learning) to provide more accurate state updates and more efficient state identification. However, while learning improved DA schemes is sometimes viewed as a strategy for coping with model error (Zhu and Kamachi, 2000), we see the optimization of DA and the correction of model errors as two separate problems that should be addressed individually.
When connecting models of dynamical systems to real-world data, it is also essential to recognize that the available observables may live in a very different space than the underlying dynamics. Recent studies have shown ways to navigate this challenge by using autoencoders and dynamical systems models to jointly learn a latent embedding and dynamics in that latent space (Champion et al., 2019). Proof-of-concept studies for similar approaches primarily focus on image-based inputs, but have potential for applications in medicine (Linial et al., 2021) and reduction of nonlinear partial differential equations (Maulik et al., 2021).

1.2. Our Contributions
Despite this large and recent body of work on data-driven learning methods and hybrid modeling strategies, many challenges remain in understanding how best to combine mechanistic and machine-learned models; indeed, the answer is highly dependent on the application. Here, we construct a mathematical framework that unifies many of the common approaches for blending mechanistic and machine learning models; having done so, we provide strong evidence for the value of hybrid approaches. Our contributions are as follows:

We provide an overarching framework for learning model error from data in dynamical systems settings, studying both discrete- and continuous-time models, together with Markovian and memory-dependent representations of the model error. This formulation is agnostic to the choice of mechanistic model and the class of machine learning functions.

We study the Markovian learning problem in the context of ergodic continuous-time dynamical systems, proving bounds on the excess risk and generalization error.

We present a simple approximation theorem for learning memory-dependent (non-Markovian) model error in continuous-time using recurrent neural networks.

We describe numerical experiments which demonstrate the utility of learning model error in comparison both with purely data-driven learning and with pure (but slightly imperfect) mechanistic modeling.

We describe numerical experiments comparing the benefits of learning discrete- versus continuous-time models.
In Section 2, we address contribution 1 by defining the general settings of interest for dynamical systems in both continuous- and discrete-time. We then link these underlying systems to a machine learning framework in Sections 3 and 4; in the former we formulate the problem in the setting of statistical learning, and in the latter we define concrete optimization problems found from finite parameterizations of the hypothesis class in which the model error is sought. Section 5 is focused on specific choices of architectures, and on underpinning theory for machine learning methods with these choices: we analyze linear methods from the perspective of learning theory in the context of ergodic dynamical systems (contribution 2); and we describe an approximation theorem for continuous-time hybrid recurrent neural networks (contribution 3). Finally, Section 6 presents our detailed numerical experiments; we apply the methods of Section 5 to exemplar dynamical systems of the forms outlined in Section 2, and highlight our findings (contributions 4 and 5).
2. Dynamical Systems Setting
We present a general framework for modeling a dynamical system with Markovian model error, first in continuous-time (Section 2.1) and then in discrete-time (Section 2.2). We then extend the framework to the setting of non-Markovian model error (Section 2.3), including a parameter $\varepsilon$ which enables us to smoothly transition from scale-separated problems (where Markovian closure is likely to be accurate) to problems where the unobserved variables are not scale-separated from those observed (where Markovian closure is likely to fail and memory needs to be accounted for). It is important to note that the continuous-time formulation necessarily assumes an underlying data-generating process that is continuous in nature. The discrete-time formulation can be viewed as a discretization of an underlying continuous system, but can also represent systems that are truly discrete.
The settings that we present are all intended to represent and classify common situations that arise in modeling and predicting dynamical systems. In particular, we stress two key features. First, we point out that mechanistic models (later referred to as a vector field $f_0$ or flow map $\Psi_0$) are often available and may provide predictions with reasonable fidelity. However, these models are often simplifications of the true system, and thus can be improved with data-driven approaches. Nevertheless, they provide a useful starting point that can reduce the complexity and data-hunger of the learning problems. In this context, we study tradeoffs between discrete- and continuous-time framings. While we begin with fully observed contexts in which the dynamics are Markovian with respect to the observed state $x$, we later note that we may only have access to partial observations $x$ of a larger system $(x, y)$. By restricting our interest to prediction of these observables, we show how a latent dynamical process (e.g. a RNN) has the power to reconstruct the correct dynamics for our observables.

2.1. Continuous-Time
Consider the following dynamical system:

(2.1) $\dot{x} = f^\dagger(x), \quad x(0) = x_0,$

and define $\mathcal{X} := \mathbb{R}^{d_x}$. If $f^\dagger \in C^1(\mathcal{X}; \mathcal{X})$, then (2.1) has solution $x(t; x_0)$ for any $t \in [0, T^\dagger(x_0))$, the maximal interval of existence.
The primary model error scenario we envisage in this section is one in which the vector field $f^\dagger$ can only be partially known or accessed: we assume that

$f^\dagger(x) = f_0(x) + m^\dagger(x),$

where $f_0$ is known to us and $m^\dagger$ is not known. For any $f_0$ (regardless of its fidelity), there exists a function $m^\dagger$ such that eq. 2.1 can be rewritten as

(2.2) $\dot{x} = f_0(x) + m^\dagger(x).$
However, for this paper, it is useful to think of $m^\dagger$ as being small relative to $f_0$; the function $m^\dagger$ accounts for model error. While the approach in (2.2) is targeted at learning residuals of $f_0$, $f^\dagger$ can alternatively be reconstructed from $f_0$ through a different function $h^\dagger$ using the form

(2.3) $\dot{x} = h^\dagger\bigl(x, f_0(x)\bigr).$
Both approaches are defined on spaces that allow perfect reconstruction of $f^\dagger$. However, the first formulation hypothesizes that the missing information is additive; the second formulation provides no such indication. Because the first approach ensures substantial usage of $f_0$, it has advantages in settings where $f_0$ is trusted by practitioners and model explainability is important. The second approach will likely see advantages in settings where there is a simple non-additive form of model error, including coordinate transformations and other (possibly state-dependent) nonlinear warping functions of the nominal physics $f_0$. Note that the use of $x$ in representing the model error in the augmented-input setting of eq. 2.3 includes the case of not leveraging $f_0$ at all. It is, hence, potentially more useful than simply adopting an $f_0$-dependent model error; but it requires learning a more complex function.
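To make the two parameterizations concrete, the following minimal sketch uses a toy pendulum: the linearized physics plays the role of the known nominal vector field and the full nonlinear model plays the role of the true vector field. All function names here (`f0`, `f_dagger`, `m_dagger`, `h_dagger`) are illustrative choices, not objects from our experiments.

```python
import numpy as np

# Nominal (imperfect) physics and the "true" vector field for a toy pendulum;
# the discrepancy between them is the model error.
def f0(x):
    # known part: small-angle (linearized) pendulum
    return np.array([x[1], -x[0]])

def f_dagger(x):
    # truth: full nonlinear pendulum
    return np.array([x[1], -np.sin(x[0])])

def m_dagger(x):
    # additive model error, as in eq. (2.2): truth = nominal + residual
    return f_dagger(x) - f0(x)

# Augmented-input form, as in eq. (2.3): a corrector that sees both the state
# and the nominal prediction; shown here with the trivial exact choice.
def h_dagger(x, f0x):
    return f0x + m_dagger(x)

x = np.array([0.3, -0.1])
assert np.allclose(f0(x) + m_dagger(x), f_dagger(x))   # eq. (2.2) form
assert np.allclose(h_dagger(x, f0(x)), f_dagger(x))    # eq. (2.3) form
```

In a learning context, `m_dagger` or `h_dagger` would of course be unknown and replaced by a regression model; the sketch only makes the two decompositions explicit.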
The augmented-input method also has connections to model stacking (Wolpert, 1992); this perspective can be useful when there are $M$ model hypotheses $f_1, \dots, f_M$:

$\dot{x} = h^\dagger\bigl(x, f_1(x), \dots, f_M(x)\bigr).$
Our goal is to use machine learning to approximate these corrector functions using our nominal knowledge $f_0$ and observations of a trajectory $\{x(t)\}_{t \in [0,T]}$, for some $T > 0$, from the true system (2.1). In this work, we consider only the case of learning $m^\dagger$ in equation (2.2). For now the reader may consider $x(t)$ given without noise so that, in principle, $\dot{x}$ is known and may be leveraged. In practice this will not be the case, for example if the data are high-frequency but discrete in time; we address this issue in what follows.
2.2. Discrete-Time
Consider the following dynamical system:

(2.4) $x_{k+1} = \Psi^\dagger(x_k),$

and define $\mathcal{X} := \mathbb{R}^{d_x}$. If $\Psi^\dagger \in C(\mathcal{X}; \mathcal{X})$, the map yields solution $\{x_k\}_{k \in \mathbb{Z}^+}$, where $\mathbb{Z}^+ := \{0, 1, 2, \dots\}$ denotes the non-negative integers, including zero. As in the continuous-time setting, we assume we only have access to an approximate mechanistic model $\Psi_0$. Again, we can correct $\Psi_0$ using an additive residual term $m^\dagger$ in a model of the form

(2.5) $x_{k+1} = \Psi_0(x_k) + m^\dagger(x_k),$
or by feeding $\Psi_0$ as an input to a corrective warping function $h^\dagger$:

$x_{k+1} = h^\dagger\bigl(x_k, \Psi_0(x_k)\bigr);$

we focus our experiments on the additive residual framing in eq. 2.5.
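As a minimal illustration of the additive residual framing in eq. 2.5, the sketch below fits the one-step residual of an imperfect scalar map by least squares over polynomial features. The maps, the feature choice, and all names are illustrative assumptions standing in for an arbitrary regression class.

```python
import numpy as np

# True one-step map and an imperfect nominal map (scalar toy problem).
Psi_dagger = lambda x: 0.9 * x + 0.1 * np.tanh(x)
Psi_0 = lambda x: 0.9 * x

# Generate a trajectory from the true map.
K = 200
xs = np.empty(K + 1)
xs[0] = 1.0
for k in range(K):
    xs[k + 1] = Psi_dagger(xs[k])

# Fit the residual m(x) ~ x_{k+1} - Psi_0(x_k), as in eq. (2.5), using
# cubic polynomial features as a stand-in for any hypothesis class.
X = np.vander(xs[:-1], 4)        # features [x^3, x^2, x, 1]
r = xs[1:] - Psi_0(xs[:-1])      # observed one-step residuals
coef, *_ = np.linalg.lstsq(X, r, rcond=None)

# Hybrid one-step prediction: nominal map plus learned residual.
hybrid = lambda x: Psi_0(x) + np.vander(np.atleast_1d(x), 4) @ coef
err = abs(hybrid(xs[37])[0] - xs[38])
assert err < 1e-3
```

Note the "non-intrusive" character discussed earlier: the learned correction is applied outside the nominal map, which could equally be a black-box numerical integrator.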
Note that the discrete-time formulation can be made compatible with continuous-time data sampled uniformly at rate $\Delta t$. To see this, consider eq. 2.1 and its solution operator $\Phi^\dagger(x_0; t)$. We then define

(2.6) $\Psi^\dagger(x) := \Phi^\dagger(x; \Delta t),$

which can be obtained via numerical integration of $f^\dagger$. Similarly, we can consider the solution operator $\Phi_0$ defined by the vector field $f_0$ to define a discrete map

(2.7) $\Psi_0(x) := \Phi_0(x; \Delta t).$

In these settings, $x_k = x(k \Delta t)$ for $k \in \mathbb{Z}^+$.
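Concretely, the discrete maps in eqs. 2.6 and 2.7 can be realized by numerical integration over one sampling interval. The sketch below uses a classical Runge-Kutta step and a harmonic oscillator (an illustrative, exactly solvable stand-in for the nominal vector field), so the numerically induced map can be checked against the true time-$\Delta t$ flow.

```python
import numpy as np

def rk4_step(f, x, dt):
    # One classical fourth-order Runge-Kutta step: realizes the discrete map
    # Psi(x) := solution of x' = f(x) advanced over one sampling interval dt.
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Harmonic oscillator as the nominal vector field; its exact time-dt flow
# is a rotation, which lets us verify the induced discrete map.
f0 = lambda x: np.array([x[1], -x[0]])
dt = 0.01
Psi_0 = lambda x: rk4_step(f0, x, dt)

x0 = np.array([1.0, 0.0])
exact = np.array([np.cos(dt), -np.sin(dt)])  # exact flow of x0 over time dt
assert np.allclose(Psi_0(x0), exact, atol=1e-8)
```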
2.3. Partially Observed Systems (Continuous-Time)
The framework in Sections 2.1 and 2.2 assumes that the system dynamics are Markovian with respect to the observable $x$. Most of our experiments are performed in this fully observed Markovian case. However, this assumption rarely holds in real-world systems. Consider a block-on-a-spring experiment conducted in an introductory physics laboratory. In principle, the system is strictly governed by the position and momentum of the block, along with a few scalar parameters. However (as most students' error analysis reports will note), the dynamics are also driven by a variety of external factors, like a wobbly table or a poorly greased track. The magnitude, timescale, and structure of the influence of these different factors are rarely known; and yet, they are somehow encoded in the discrepancy between the nominal equations of motion and the (noisy) observations of this multiscale system.
Thus we also consider the setting in which the dynamics of $x$ is not Markovian. If we consider $x$ to be the observable states of a Markovian system in a dimension higher than $d_x$, then we can write the full system as

(2.8a) $\dot{x} = f^\dagger(x, y),$
(2.8b) $\dot{y} = \frac{1}{\varepsilon} g^\dagger(x, y).$

Here $x \in \mathbb{R}^{d_x}$, $y \in \mathbb{R}^{d_y}$, and $\varepsilon > 0$ is a constant measuring the degree of scale-separation (which is large when $\varepsilon$ is small). The system yields solution $(x(t), y(t))$ for any $t \in [0, T^\dagger(x_0, y_0))$, with $T^\dagger(x_0, y_0)$, defined analogously to $T^\dagger(x_0)$, the maximal interval of existence. We view $y$ as the complicated, unresolved, or unobserved aspects of the true underlying system.
For any $f_0$ (regardless of its fidelity), there exists a function $m^\dagger$ such that eq. 2.8 can be rewritten as

(2.9a) $\dot{x} = f_0(x) + m^\dagger(x, y),$
(2.9b) $\dot{y} = \frac{1}{\varepsilon} g^\dagger(x, y).$
Now observe that, by considering the solution of equation (2.9b) as a function of the history of $x$, the influence of $y$ on the solution can be captured by a parameterized (w.r.t. $\varepsilon$) family of operators $m^\dagger_\varepsilon$ acting on the historical trajectory $\{x(s)\}_{s \in [0,t]}$, the unobserved initial condition $y_0$, and the scale-separation parameter $\varepsilon$, such that

(2.10) $\dot{x}(t) = f_0(x(t)) + m^\dagger_\varepsilon\bigl(\{x(s)\}_{s \in [0,t]}, y_0\bigr).$
Our goal is to use machine learning to find a Markovian model, in which $x$ is part of the state variable, using our nominal knowledge $f_0$ and observations of a trajectory $\{x(t)\}_{t \in [0,T]}$, for some $T > 0$, from the true system (2.8); note that $y$ is not observed and nothing is assumed known about the vector field $g^\dagger$ or the parameter $\varepsilon$.
Note that equations (2.8), (2.9) and (2.10) are all equivalent formulations of the same problem and have identical solutions. The third formulation points towards two intrinsic difficulties: first, the unknown "function" to be learned is in fact defined by a family of operators mapping the Banach space of path history $C([0,t]; \mathbb{R}^{d_x})$ into $\mathbb{R}^{d_x}$; secondly, the operator is parameterized by $y_0$, which is unobserved. We will address the first issue by showing that the operators $m^\dagger_\varepsilon$ can be arbitrarily well-approximated from within a family of differential equations in a dimension higher than $d_x$; the second issue may be addressed by techniques from data assimilation (Asch et al., 2016; Law et al., 2015; Reich and Cotter, 2015) once this approximating family is learned. We emphasize, however, that we do not investigate the practicality of this approach to learning non-Markovian systems, and much remains to be done in this area.
It is also important to note that these non-Markovian operators can sometimes be adequately approximated by invoking a Markovian model for $x$ and simply learning a function of $x$ alone, as in Section 2.1. For example, when $\varepsilon \ll 1$ and the $y$ dynamics, with $x$ fixed, are sufficiently mixing, the averaging principle (Bensoussan et al., 2011; Vanden-Eijnden and others, 2003; Pavliotis and Stuart, 2008) may be invoked to deduce that

$m^\dagger_\varepsilon\bigl(\{x(s)\}_{s \in [0,t]}, y_0\bigr) \approx \overline{m}(x(t))$

for some $\overline{m}$ as in Section 2.1.
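The averaging picture can be seen in a two-dimensional fast-slow toy problem (an illustrative system, not one from our experiments): for small scale-separation parameter the fast variable slaves to a function of the slow one, so the correction becomes effectively Markovian in the observable.

```python
import numpy as np

# Fast-slow system in the shape of eq. (2.8):
#   x' = -x + y,       y' = (cos(x) - y) / eps.
# For eps << 1, y rapidly equilibrates to cos(x), so the non-Markovian
# correction is well approximated by the Markovian closure x -> cos(x).
eps, dt = 1e-3, 1e-4
x, y = 1.0, 0.0
for _ in range(50_000):   # forward Euler to t = 5
    x, y = x + dt * (-x + y), y + dt * (np.cos(x) - y) / eps

assert abs(y - np.cos(x)) < 1e-2   # fast variable has slaved to cos(x)
```

Repeating the experiment with a larger value of `eps` (e.g. 0.5) degrades the slaving relation, which is the regime where memory-dependent closures become necessary.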
It is highly advantageous to identify settings where Markovian modeling is sufficient, as it is a simpler learning problem. We find that learning the memory-dependent operators $m^\dagger_\varepsilon$ is necessary when there is significant memory required to explain the dynamics of $x$; learning a Markovian closure $\overline{m}$ is sufficient when memory effects are minimal. In Section 6, we show that Markovian closures can perform well for certain tasks even when the scale-separation factor $\varepsilon$ is not small. In Section 3 we demonstrate how the family of operators $m^\dagger_\varepsilon$ may be represented through ordinary differential equations, appealing to ideas which blend continuous-time RNNs with an assumed known vector field $f_0$.
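One way to see why an ODE-based latent state can supply the needed memory is the following linear sketch (all constants and function names are illustrative, and this is not the architecture analyzed later): a latent variable governed by a filter equation is exactly an exponentially weighted integral of the past observable trajectory, i.e. an explicit memory term.

```python
import numpy as np

# Minimal continuous-time "hybrid RNN" closure: augment the observable x
# with a latent state r that filters the history of x, and let the model
# error correction depend on r:
#   x' = f0(x) + c * r,     r' = (x - r) / tau.
# The r-equation is linear, so r(t) equals an exponentially weighted
# integral of the past trajectory of x.
f0 = lambda x: -0.5 * x
c, tau, dt = 0.3, 0.5, 1e-3
x, r = 1.0, 0.0
hist = []                       # record x(s) to verify r against its history
for _ in range(5000):           # forward Euler to t = 5
    hist.append(x)
    x, r = x + dt * (f0(x) + c * r), r + dt * (x - r) / tau

# r(t) should match (1/tau) * int_0^t exp(-(t - s)/tau) x(s) ds.
hist = np.array(hist)
s = np.arange(len(hist)) * dt
t = len(hist) * dt
r_hist = np.sum(np.exp(-(t - s) / tau) / tau * hist) * dt
assert abs(r - r_hist) < 1e-2
```

In a trainable version, the latent dynamics and the coupling back into the observable equation would be parameterized and learned jointly with the residual correction.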
2.4. Partially Observed Systems (Discrete-Time)
The discrete-time analog of the previous setting is as follows. We consider the mapping

(2.11a) $x_{k+1} = \Psi^\dagger_1(x_k, y_k),$
(2.11b) $y_{k+1} = \Psi^\dagger_2(x_k, y_k),$

with $x_k \in \mathbb{R}^{d_x}$, $y_k \in \mathbb{R}^{d_y}$, yielding solutions $\{x_k\}_{k \in \mathbb{Z}^+}$ and $\{y_k\}_{k \in \mathbb{Z}^+}$.
As in the continuous-time setting, we assume that while we do not know the full equations (2.11), we do have an approximate model $\Psi_0$ that is Markovian on the observable $x$. Hence, we use our available physics and rewrite eq. 2.11 as

(2.12a) $x_{k+1} = \Psi_0(x_k) + m^\dagger(x_k, y_k),$
(2.12b) $y_{k+1} = \Psi^\dagger_2(x_k, y_k),$
and can again, analogously to (2.10), write a solution in the space of observables as

(2.13) $x_{k+1} = \Psi_0(x_k) + m^\dagger_k\bigl(\{x_j\}_{j=0}^{k}, y_0\bigr),$

with $m^\dagger_k$ a function of the historical trajectory $\{x_j\}_{j=0}^{k}$ and the unobserved initial condition $y_0$. As shown for continuous-time in (2.10) and for discrete-time in (2.13), the true residual corrector for $\Psi_0$ is now a non-Markovian term that depends on the history of the observable state and the initial condition $y_0$. While this dependence on the initial condition may appear daunting, many applications exhibit substantial memory-fading properties that allow this dependence to be dropped. Indeed, if this discrete-time system is computed from the time-$\Delta t$ map for (2.1) then, for $\varepsilon \ll \Delta t$ and when averaging scenarios apply, a memoryless model as in Section 2.2 may be used.
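The appearance of memory upon eliminating the unobserved variable can be made fully explicit in a linear toy version of eq. 2.11 (coefficients chosen purely for illustration): observing only the first coordinate, the exact dynamics become a two-step recurrence, i.e. Markovian only after augmenting the state with one step of history.

```python
# Partially observed linear map in the shape of eq. (2.11); observe x only:
#   x_{k+1} = a*x_k + b*y_k,    y_{k+1} = c*x_k + d*y_k.
# Eliminating y (substitute b*y_k = x_{k+1} - a*x_k) yields a recurrence
# with exactly one step of memory:
#   x_{k+2} = (a + d)*x_{k+1} + (b*c - a*d)*x_k,
# so no Markovian model in x alone can be exact, but a short history can be.
a, b, c, d = 0.6, 0.3, -0.2, 0.5
x, y = 1.0, -1.0
traj = [x]
for _ in range(50):
    x, y = a * x + b * y, c * x + d * y
    traj.append(x)

for k in range(len(traj) - 2):
    pred = (a + d) * traj[k + 1] + (b * c - a * d) * traj[k]
    assert abs(pred - traj[k + 2]) < 1e-12
```

For nonlinear maps the elimination is no longer closed-form, which is precisely where learned memory terms (e.g. RNN closures) come in.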
Remark 2.1.
While our entire work has focused on immutable mechanistic models $f_0$ and $\Psi_0$, these models typically have tunable parameters. Of course, we can jointly learn parameters for the mechanistic model and the closure term. However, the lack of identifiability between modifying the closure and modifying the physics raises an interesting question about explainability. Future work might focus on decoupling the learning of parameters and closure terms so that maximal expressivity is first squeezed out of the mechanistic model (Plumlee and Joseph, 2017). This decoupling can be especially important when we are interested in inferring the parameters themselves, whose estimates are biased by unknown model error (Plumlee, 2017).
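The identifiability issue in this remark can be seen already in a linear toy problem: if a tunable physics parameter and the closure act on the same feature, the data alone cannot separate them. The following sketch (all dynamics and numbers are invented for illustration, not taken from the paper's experiments) exhibits the resulting rank deficiency:

```python
import numpy as np

# Toy illustration of the identifiability issue in Remark 2.1. Suppose the
# mechanistic model has a tunable parameter theta multiplying the feature x,
# and the learned closure also contains the feature x. Jointly regressing
# xdot on both copies of x is rank-deficient: the data cannot distinguish
# "tuning the physics" from "absorbing the error into the closure".
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
xdot = -1.2 * x                          # assumed truth: xdot = -1.2 x

G = np.column_stack([x, x])              # col 1: physics term, col 2: closure term
rank = np.linalg.matrix_rank(G)          # rank 1 < 2 unknowns

# Two very different (parameter, closure) pairs fit the data equally well:
res_physics = np.linalg.norm(G @ np.array([-1.2, 0.0]) - xdot)  # all error in physics
res_closure = np.linalg.norm(G @ np.array([0.0, -1.2]) - xdot)  # all error in closure
```

Both residuals vanish, so no amount of this data resolves the split between parameter and closure; decoupling the two learning stages breaks the tie.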
3. Statistical Learning for Ergodic Dynamical Systems
Here, we present a learning theory framework within which to consider methods for discovering model error from data. We outline the learning theory in a continuous-time Markovian setting (using possibly discretely sampled data), and point to its analogs in discrete-time and non-Markovian settings.
In the discrete-time settings, we assume access to discretely sampled training data $\{x_k\}_{k=0}^{K}$, where $\Delta t$ is a uniform sampling rate and we assume that $T = K \Delta t$. In the continuous-time settings, we assume access to continuous-time training data $\{x_t\}_{t \in [0, T]}$; Section 6.2 discusses the important practical question of estimating $\dot{x}$ from discrete (but high-frequency) data. In either case, consider the problem of identifying $m \in \mathcal{H}$ (where $\mathcal{H}$ represents the model, or hypothesis, class) that minimizes a loss function quantifying closeness of $m$ to $m^\dagger$. In the Markovian setting we choose a measure $\mu$ on $\mathbb{R}^{d_x}$ and define the loss
$I_\infty(m) = \int \big\| m^\dagger(x) - m(x) \big\|^2 \, \mu(dx).$
If we assume that, at the true $m^\dagger$, equation (2.2) is ergodic with invariant density $\mu$, then we can exchange time and space averages to see that, for an infinitely long trajectory,
$I_\infty(m) = \lim_{T \to \infty} \frac{1}{T} \int_0^T \big\| m^\dagger(x_t) - m(x_t) \big\|^2 \, dt.$
Since we may only have access to a trajectory dataset of finite length $T$, it is natural to define
$I_T(m) = \frac{1}{T} \int_0^T \big\| m^\dagger(x_t) - m(x_t) \big\|^2 \, dt$
and note that, by ergodicity, $I_T(m) \to I_\infty(m)$ as $T \to \infty$.
Finally, we can use eq. 2.2 to replace the unknown $m^\dagger(x_t)$ by $\dot{x}_t - f_0(x_t)$ and obtain
(3.1) $I_T(m) = \frac{1}{T} \int_0^T \big\| \dot{x}_t - f_0(x_t) - m(x_t) \big\|^2 \, dt.$
This, possibly regularized, is a natural loss function to employ when continuous-time data is available, and should be viewed as approximating $I_\infty(m)$.
We can use these definitions to frame the problem of learning model error in the language of statistical learning (Vapnik, 2013). If we let $\mathcal{H}$ denote the hypothesis class over which we seek to minimize $I_T$, then we may define
$m^* = \arg\min_{m \in \mathcal{H}} I_\infty(m), \qquad \hat{m}_T = \arg\min_{m \in \mathcal{H}} I_T(m).$
The risk associated with seeking $m^\dagger$ from the class $\mathcal{H}$ is defined by $R = I_\infty(m^*)$, noting that this is $0$ if $m^\dagger \in \mathcal{H}$. The risk measures the intrinsic error incurred by seeking to learn $m^\dagger$ from the restricted class $\mathcal{H}$, which typically does not include $m^\dagger$; it is an approximation-theoretic concept which encodes the richness of the hypothesis class $\mathcal{H}$. The risk may be decreased by increasing the expressiveness of $\mathcal{H}$; thus risk is independent of the data employed. To quantify the effect of data volume on learning $m^\dagger$, it is helpful to introduce the following two concepts. The excess risk is defined by
$E_T = I_\infty(\hat{m}_T) - I_\infty(m^*)$
and represents the additional approximation error incurred by using data defined over a finite time horizon in the estimate of the function $m^\dagger$. The generalization error is
$G_T = I_T(\hat{m}_T) - I_\infty(\hat{m}_T)$
and represents the discrepancy between training error, which is defined using a finite trajectory, and idealized test error, which is defined using an infinite-length trajectory (or, equivalently, the invariant measure $\mu$), when evaluated at the estimate $\hat{m}_T$ of the function obtained from finite data. We return to study excess risk and generalization error in the context of linear models for $m$, and under ergodicity assumptions on the data-generating process, in Section 5.2.
We have introduced a machine learning framework in the continuous-time Markovian setting, but it may be adopted in discrete-time and in non-Markovian settings. In Section 4, we define appropriate objective functions for each of these cases.
Remark 3.1.
The developments we describe here for learning in ordinary differential equations can be extended to the case of learning stochastic differential equations; see (Bento et al., 2011; Kutoyants, 2004). In that setting, consistency in the large-$T$ limit is well understood. It would be interesting to build on the learning theory perspective described here to study statistical consistency for ordinary differential equations; the approaches developed in the work by McGoff et al. (2015) and Su and Mukherjee (2021) are potentially useful in this regard.
4. Parameterization of the Loss Function
In this section, we define explicit optimizations for learning (approximate) model error functions in the Markovian settings, and model error operators in the non-Markovian settings; both continuous- and discrete-time formulations are given. We defer discussion of specific approximation architectures to the next section. Here we make a notational transition from optimization over (possibly nonparametric) functions to optimization over functions parameterized by $\theta$ that characterize the class $\mathcal{H}$.
In all the numerical experiments in this paper, we study the use of continuous- and discrete-time approaches to model data generated by a continuous-time process. The setup in this section reflects this setting, in which two key parameters appear: $T$, the continuous-time horizon for the data; and $\Delta t$, the sampling frequency of the data. The latter parameter always appears in the discrete-time models, but it may also be implicit in continuous-time models which need to infer continuous-time quantities from discretely sampled data. We relate the two by $T = K \Delta t$. We present the general forms of the objective functions (with optional regularization terms). Optimization via derivative-based methodology requires either analytic differentiation of the dynamical system model with respect to parameters, or the use of auto-differentiable ordinary differential equation solvers (Rubanova et al., 2019).
4.1. Continuous-Time Markovian Learning
Here, we approximate the Markovian closure term $m^\dagger$ in (2.2) with a parameterized function $m(\cdot; \theta)$. Assuming full knowledge of $f_0$, we learn the correction term for the flow field by minimizing the following objective function of $\theta$:
(4.1) $J_T(\theta) = \frac{1}{T} \int_0^T \big\| \dot{x}_t - f_0(x_t) - m(x_t; \theta) \big\|^2 \, dt + R(\theta)$
Note that this objective is precisely the empirical loss (3.1), restricted to the parametric class and augmented by the optional regularization term.
Notable examples that leverage this framing include: the paper (Kaheman et al., 2019), where the parameters are coefficients for a library of low-order polynomials and the regularization is a sparsity-promoting term defined by the SINDy framework; the paper (Shi et al., 2018), where the parameters are weights of a deep neural network and the regularization encodes constraints on the Lipschitz constant of the learned term, provided by spectral normalization; and the paper (Watson, 2019), which applies this approach to the Lorenz '96 Multiscale system using neural networks with regularization on the weights.
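As a concrete (hypothetical) instance of minimizing (4.1) with a polynomial library, in the spirit of the SINDy-style example above, consider a scalar system whose imperfect model misses a cubic term; the residual regression then recovers it from trajectory data. The dynamics, coefficients, and ridge penalty below are all invented for illustration:

```python
import numpy as np

# Hypothetical 1-D example: the true vector field has a cubic term that the
# known mechanistic model f0 lacks; we fit the residual with a polynomial
# library by least squares, a discretization of objective (4.1).
f_true = lambda x: -x - 0.3 * x**3          # assumed truth (illustrative)
f0 = lambda x: -x                           # known imperfect model

t = np.linspace(0.0, 10.0, 2001)
dt = t[1] - t[0]
x = np.empty_like(t)
x[0] = 1.5
for k in range(len(t) - 1):                 # fine-step forward Euler "data"
    x[k + 1] = x[k] + dt * f_true(x[k])
xdot = np.gradient(x, dt)                   # finite-difference derivative estimate

residual = xdot - f0(x)                     # empirical model-error samples
library = np.vander(x, 5, increasing=True)  # features 1, x, x^2, x^3, x^4
lam = 1e-6                                  # small ridge penalty
theta = np.linalg.solve(library.T @ library + lam * np.eye(5),
                        library.T @ residual)
# theta[3] estimates the missing cubic coefficient (-0.3)
```

The learned cubic coefficient is accurate up to the finite-difference error in `xdot`, which is the practical caveat raised in Section 6.2.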
4.2. Discrete-Time Markovian Learning
Here, we learn the Markovian correction term $m(\cdot; \theta)$ in (2.5) for the discrete solution map by minimizing:
(4.2) $J_K(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \big\| x_{k+1} - \Psi_0(x_k) - m(x_k; \theta) \big\|^2 + R(\theta)$
(Recall that $T = K \Delta t$.) This is the natural discrete-time analog of (4.1) and may be derived analogously, starting from a discrete analog of the loss, where now $\mu$ is assumed to be an ergodic measure for (2.4). If a discrete analog of (3.1) is defined, then parameterization of $m$, and regularization, leads to (4.2). This is the underlying model assumption in the work by Farchi et al. (2021).
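A minimal sketch of (4.2): the one-step residuals (data minus the imperfect model's one-step prediction) define an ordinary supervised regression problem for the discrete model error. The true map, imperfect model, and feature choice below are assumptions made purely for illustration:

```python
import numpy as np

# Hypothetical 1-D instance of (4.2): the true one-step map has a sin term
# missing from the imperfect discrete model; the one-step residuals give a
# standard linear regression problem for the discrete model error.
Psi_true = lambda x: 0.9 * x + 0.1 * np.sin(x)   # assumed truth (illustrative)
Psi0 = lambda x: 0.9 * x                         # known imperfect one-step map

K = 500
x = np.empty(K + 1)
x[0] = 2.0
for k in range(K):                               # sampled trajectory x_0 ... x_K
    x[k + 1] = Psi_true(x[k])

inputs = x[:-1]
targets = x[1:] - Psi0(inputs)                   # one-step residuals
features = np.column_stack([np.sin(inputs), np.cos(inputs)])
coef, *_ = np.linalg.lstsq(features, targets, rcond=None)
# coef recovers [0.1, 0.0]: the sin term missing from Psi0
```

Unlike the continuous-time case, no derivative estimation is needed here; the discrete residuals are exactly observed, which is one practical appeal of the discrete formulation.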
4.3. Continuous-Time Non-Markovian Learning
We can attempt to recreate the dynamics of the observable in eq. 2.10 by modeling the non-Markovian residual term. A common approach is to augment the dynamics with a latent memory variable $r$ as follows:
(4.3a) $\dot{x} = f_0(x) + m(x, r; \theta)$
(4.3b) $\dot{r} = g(x, r; \theta)$
where we seek a latent dimension large enough, and parametric models expressive enough, such that the dynamics of the observable are reproduced by eq. 4.3. We learn a good parameterization for eq. 4.3 by minimizing the following objective:
(4.4) $J_T(\theta) = \frac{1}{T} \int_0^T \big\| \dot{x}_t - f_0(x_t) - m(x_t, r_t; \theta) \big\|^2 \, dt + R(\theta)$
Specific functional forms of $m$ and $g$ (and their corresponding parameter inference strategies) reduce eq. 4.3 to various approaches. For the continuous-time RNN structure we discuss in the next section, we choose $m$ to be linear in $r$ and independent of $x$, whilst $g$ is a neural network; we may generalize and consider latent ordinary differential equations in which $m$ and $g$ are both neural networks. It is intuitive that the latter may be more expressive and afford a smaller latent dimension than the former. To our knowledge, studies of continuous-time non-Markovian closures using RNNs have not yet been published; this appears to be a direction that would be interesting to pursue further, and we discuss it from a theoretical standpoint in Section 5.
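To make the structure of eq. 4.3 concrete, here is a minimal forward simulation of such a hybrid closure, Euler-discretized: the observed state is driven by the known vector field plus a linear readout of a latent memory state. The weights are random placeholders (training by backpropagation through time is not shown), and the known model is an invented contraction:

```python
import numpy as np

# Sketch of the augmented system (4.3): observed state x driven by the known
# vector field f0 plus a linear readout C r of a latent memory state r, whose
# own dynamics are a saturating (tanh) function of (x, r). All weights are
# illustrative placeholders, not trained values.
rng = np.random.default_rng(1)
d_x, d_r = 2, 8
A = 0.1 * rng.standard_normal((d_r, d_r))     # latent-latent weights
B = 0.1 * rng.standard_normal((d_r, d_x))     # input weights from x
C = 0.1 * rng.standard_normal((d_x, d_r))     # linear readout: closure = C r
f0 = lambda x: -x                             # assumed known Markovian part

def step(x, r, dt=0.01):
    """One forward-Euler step of the coupled (x, r) system in (4.3)."""
    x_new = x + dt * (f0(x) + C @ r)          # dx/dt = f0(x) + C r
    r_new = r + dt * np.tanh(A @ r + B @ x)   # dr/dt = tanh(A r + B x)
    return x_new, r_new

x, r = np.ones(d_x), np.zeros(d_r)
for _ in range(1000):
    x, r = step(x, r)
```

The linear-readout choice mirrors the "linear in $r$, independent of $x$" closure described above; swapping both right-hand sides for neural networks gives the latent-ODE generalization.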
Remark 4.1.
The recent paper (Gupta and Lermusiaux, 2021) proposes an interesting, and more computationally tractable, approach to learning model error in the presence of memory. They propose to learn a closure operator for a delay differential equation with finite memory:
4.4. Discrete-Time Non-Markovian Learning
Here, we aim to recreate the dynamics of the observable in eq. 2.13 by modeling the non-Markovian residual term, and do this by augmenting the dynamics with a latent variable $r$ as follows:
(4.6a) $x_{k+1} = \Psi_0(x_k) + m(x_k, r_k; \theta)$
(4.6b) $r_{k+1} = g(x_k, r_k; \theta)$
where we seek a latent dimension large enough, and maps expressive enough, such that the dynamics of the observable are reproduced by eq. 4.6. Learning is performed by minimizing the following objective:
(4.7) $J_K(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \big\| x_{k+1} - \Psi_0(x_k) - m(x_k, r_k; \theta) \big\|^2 + R(\theta)$
(Recall that $T = K \Delta t$.) The functional forms of $m$ and $g$ (and their corresponding parameter inference strategies) reduce eq. 4.6 to various approaches, including recurrent neural networks, latent ordinary differential equations, and delay-embedding maps (e.g., to obtain a delay-embedding map, $g$ is taken to be a shift operator on the history of observations). Pathak et al. (2018b) used reservoir computing (a random features approach to RNNs, described in the next section) with regularization to study an approach similar to eq. 4.6, but included the imperfect model's output as a feature in the learned maps instead of using it as the central model upon which to learn residuals. The data-driven superparameterization approach in (Chattopadhyay et al., 2020b) also appears to follow the underlying assumption of eq. 4.6.
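The simplest instance of this discrete non-Markovian setup is the delay-embedding choice mentioned above, in which the memory variable is a shift register of past observations; the residual closure then reduces to ordinary regression on the history. The two-lag truth, imperfect model, and coefficients below are invented for illustration:

```python
import numpy as np

# Delay-embedding instance of (4.6): r_k stores the last p observations and is
# updated by a shift operator; the closure m becomes a linear regression on the
# history. The truth, Psi0, and lag structure are illustrative assumptions.
p = 2
Psi0 = lambda x: 0.5 * x                     # known imperfect one-step map

# True scalar dynamics with one extra lag that Psi0 misses:
K = 12
x = np.zeros(K + 2)
x[0], x[1] = 1.0, 0.8
for k in range(1, K + 1):
    x[k + 1] = 0.5 * x[k] + 0.2 * x[k - 1]   # assumed truth (two lags)

# r_k = (x_k, x_{k-1}) is the shift-register memory; stack it row-wise and
# regress the one-step residuals on the history.
H = np.column_stack([x[1:K + 1], x[0:K]])    # rows are r_k for k = 1..K
targets = x[2:K + 2] - Psi0(x[1:K + 1])      # residuals x_{k+1} - Psi0(x_k)
coef, *_ = np.linalg.lstsq(H, targets, rcond=None)
# coef ~ [0.0, 0.2]: the missing lag term is identified from the history
```

When the required memory is not a fixed small number of lags, the learned recurrence $g$ of a genuine RNN replaces the hand-specified shift operator.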
4.5. Connections to Physics-Informed Neural Networks
It is also of interest to consider how physics-based regularization ideas from Physics-Informed Neural Networks (PINNs) (Raissi et al., 2019), which were designed to improve partial differential equation surrogate modeling, can be leveraged for modeling ordinary differential equations. Investigation into this topic is ongoing (Tipireddy et al., 2019; Yazdani et al., 2020; Antonelo et al., 2021). A natural starting point is to consider the objective function
(4.8) $J_T(\theta) = \frac{1}{T} \int_0^T \big\| \dot{x}_t - f_\theta(x_t) \big\|^2 \, dt + \frac{\lambda}{T} \int_0^T \big\| f_\theta(x_t) - f_0(x_t) \big\|^2 \, dt;$
however, if the trajectory is not long enough to cover all relevant parts of state space required for prediction, an alternative is to consider the objective function
(4.9) $J_T(\theta) = \frac{1}{T} \int_0^T \big\| \dot{x}_t - f_\theta(x_t) \big\|^2 \, dt + \lambda \int_D \big\| f_\theta(x) - f_0(x) \big\|^2 \, dx;$
here $D$ defines a bounded region in Euclidean space to which the dynamics to be learned is assumed to be confined. In both these settings the regularization parameter $\lambda$ can be learned, e.g., via cross-validation. It may be desirable to fit a discrete-time model, given an imperfect continuous-time vector field, in which case a natural objective function is
(4.10) $J_K(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \big\| x_{k+1} - \Psi_\theta(x_k) \big\|^2 + \frac{\lambda}{K} \sum_{k=0}^{K-1} \big\| \Psi_\theta(x_k) - \Psi_{f_0}^{\Delta t}(x_k) \big\|^2,$ where $\Psi_{f_0}^{\Delta t}$ denotes the time-$\Delta t$ flow map of the imperfect vector field $f_0$.
Again, the second term may be generalized to an integral over space rather than time, if the trajectory data do not explore enough of the state space. These physics-based regularizations may be especially useful when available data are limited or when derivatives are poorly estimated from data; regularizing with respect to deviations from the mechanistic model provides cheap, approximate data augmentation.
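A sketch of the space-regularized objective from this subsection: a data-misfit term on the trajectory plus a penalty for deviating from the mechanistic field over a bounded region (here Monte Carlo sampled). The model class, region, weight, and data are all illustrative assumptions:

```python
import numpy as np

# Sketch of the regularized objective (4.9)-style trade-off: data misfit plus a
# penalty keeping the learned field f_theta close to the mechanistic field f0
# over a bounded region D (Monte Carlo averaged). Everything here is invented
# for illustration; f_theta is a two-parameter polynomial model.
f0 = lambda x: -x                                  # assumed mechanistic field
f_theta = lambda x, theta: theta[0] * x + theta[1] * x**3

def objective(theta, x_data, xdot_data, lam=0.1, n_mc=256, seed=0):
    rng = np.random.default_rng(seed)
    data_misfit = np.mean((xdot_data - f_theta(x_data, theta)) ** 2)
    x_mc = rng.uniform(-2.0, 2.0, n_mc)            # samples of the region D
    physics_penalty = np.mean((f_theta(x_mc, theta) - f0(x_mc)) ** 2)
    return data_misfit + lam * physics_penalty

# Scarce data: only two derivative observations from a truth with a cubic term.
x_data = np.array([0.5, 0.6])
xdot_data = -x_data - 0.3 * x_data**3
loss_close = objective(np.array([-1.0, -0.3]), x_data, xdot_data)  # near-truth model
loss_far   = objective(np.array([-1.0, -2.0]), x_data, xdot_data)  # wild extrapolator
```

With so little data the penalty is what rules out models that fit the two observations yet deviate wildly from the physics off the data support; this is the "cheap, approximate data augmentation" role described above.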
5. Underpinning Theory
In this section we identify specific hypothesis classes. We do this using random feature maps (Rahimi and Recht, 2008a) in the Markovian settings (Section 5.1), and using recurrent neural networks in the memory-dependent setting. We then discuss these problems from a theoretical standpoint. In Section 5.2 we study excess risk and generalization error in the context of linear models (a setting which includes the random features model as a special case). We conclude by discussing the use of RNNs (Goodfellow et al., 2016, Chapter 10) for the non-Markovian settings (discrete- and continuous-time) in Section 5.3; we present an approximation theorem for continuous-time hybrid RNN models. Throughout this section, the specific use of random feature maps and of recurrent neural networks is for illustrative purposes only; other models could, of course, be used.
5.1. Markovian Modeling with Random Feature Maps
In principle, any hypothesis class can be used to learn the model error. However, we focus on models that are easily trained on large-scale complex systems and yet have proven approximation power for functions between finite-dimensional Euclidean spaces. For the Markovian modeling case, we use random feature maps; like traditional neural networks, they possess arbitrary approximation power (Rahimi and Recht, 2008c, b), but they further benefit from a quadratic minimization problem in the training phase, as do kernel and Gaussian process methods. In our case studies, we found random feature models sufficiently expressive, their optimization easily implementable, and the learned models capable of generalizing well. Moreover, their linearity with respect to the unknown parameters enables a straightforward analysis of excess risk and generalization error in Section 5.2. Details on the derivation and specific design choices for our random feature modeling approach can be found in Section 8.3, where we explain how we sample random feature functions and stack them to form a vector-valued feature map $\phi$. Given this random function $\phi$, we define the hypothesis class
(5.1) $\mathcal{H} = \big\{ C \phi(\cdot) \, : \, C \in \mathbb{R}^{d_x \times D} \big\}.$
When studying statistical learning, we will also consider the hypothesis class
(5.2) $\mathcal{H} = \Big\{ \sum_{j=1}^{p} \theta_j \phi_j(\cdot) \, : \, \theta \in \mathbb{R}^{p} \Big\},$
where the $\phi_j$ are i.i.d. draws of a random function.
5.1.1. Continuous-Time
In the continuous-time framing, our Markovian closure model uses hypothesis class (5.1) and thus takes the form $m(x) = C \phi(x)$.
We rewrite eq. 4.1 for this particular case with an $L_2$ regularization parameter $\lambda$:
(5.3) $J_T(C) = \frac{1}{T} \int_0^T \big\| \dot{x}_t - f_0(x_t) - C \phi(x_t) \big\|^2 \, dt + \lambda \| C \|^2.$
We employ the notation $a \otimes b$ for the outer product between vectors $a$ and $b$, and write $\langle g \rangle_T := \frac{1}{T} \int_0^T g(x_t) \, dt$ for the time average of a function $g$ along the trajectory. Equation 5.3 is quadratic and convex in $C$ and thus is globally minimized at the unique $\hat{C}$ which makes the derivative of the objective zero. Consequently, the minimizer satisfies the following linear equation (derived in the Appendix, Section 8.4):
(5.4) $\hat{C} \big( Z + \lambda I \big) = Y.$
Here, $I$ is the identity and
(5.5) $Z := \big\langle \phi \otimes \phi \big\rangle_T, \qquad Y := \big\langle (\dot{x} - f_0(x)) \otimes \phi \big\rangle_T.$
Of course the model error is not known, but, by (2.2), $\dot{x} - f_0(x)$ coincides with it along the trajectory, and so $Y$ can be computed from data.
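A small numerical sketch of this random features fit, replacing the time average by an empirical average over samples and using a scalar-valued residual for simplicity. The feature choice, target function, and regularization strength are illustrative assumptions; the coefficient solve mirrors the structure of (5.4)-(5.5) up to transposition:

```python
import numpy as np

# Random-features fit of a scalar model-error function: random cosine features,
# empirical second-moment matrix Z and cross-moment Y (cf. (5.5)), and a
# ridge-regularized linear solve (cf. (5.4)). All choices are illustrative.
rng = np.random.default_rng(0)
D = 200                                        # number of random features
w = rng.standard_normal(D)                     # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, D)           # random phases
phi = lambda x: np.cos(np.outer(x, w) + b)     # feature matrix, shape (N, D)

m_true = lambda x: 0.3 * np.sin(x)             # model error to be learned (assumed)
x = rng.uniform(-3.0, 3.0, 2000)               # stand-in for trajectory samples
residual = m_true(x)                           # in practice: xdot - f0(x)

lam = 1e-6
Z = phi(x).T @ phi(x) / len(x)                 # empirical analog of Z in (5.5)
Y = phi(x).T @ residual / len(x)               # empirical analog of Y in (5.5)
C = np.linalg.solve(Z + lam * np.eye(D), Y)    # linear solve, cf. (5.4)
m_hat = lambda xq: phi(xq) @ C                 # learned closure
```

The training phase is a single linear solve; this is the quadratic-optimization advantage over generic neural networks emphasized above.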
5.1.2. Discrete-Time
In the discrete-time framing, our Markovian closure model takes the form $m(x) = C \phi(x)$.
We rewrite eq. 4.2 for this particular case with an $L_2$ regularization parameter $\lambda$:
(5.6) $J_K(C) = \frac{1}{K} \sum_{k=0}^{K-1} \big\| x_{k+1} - \Psi_0(x_k) - C \phi(x_k) \big\|^2 + \lambda \| C \|^2.$
Equation 5.6 is quadratic in $C$ and thus is globally minimized at $\hat{C}$, which satisfies the following linear equation:
(5.7) $\hat{C} \big( Z + \lambda I \big) = Y.$
Here, $I$ is the identity and
(5.8) $Z := \frac{1}{K} \sum_{k=0}^{K-1} \phi(x_k) \otimes \phi(x_k), \qquad Y := \frac{1}{K} \sum_{k=0}^{K-1} \big( x_{k+1} - \Psi_0(x_k) \big) \otimes \phi(x_k).$
The derivation is analogous to that for the continuous-time setting, as detailed in Section 8.4. Of course the model error is not known, but the one-step residuals $x_{k+1} - \Psi_0(x_k)$ coincide with it along the training data, and so $Y$
may be calculated from data. This formulation closely mirrors the fully data-driven linear regression approach in
(Gottwald and Reich, 2021). By computing $Z$ and $Y$ and solving the linear equation eq. 5.7 for $\hat{C}$, we obtain our approximation $\hat{C} \phi(\cdot)$.
5.2. Learning Theory for Markovian Models with Linear Hypothesis Class
In this subsection we provide estimates of the excess risk and generalization error in the context of learning the model error in (2.2) from a trajectory over time horizon $T$. We study ergodic continuous-time models in the setting of Section 4.1 and seek to learn from the hypothesis class defined in (5.2). For the learning theory presented in this subsection, it is not necessary to use the random features construction of the $\phi_j$; however, we note that, in that setting, universal approximation for random features (Rahimi and Recht, 2008a) shows that the loss function may be made arbitrarily small by choosing $p$ large enough and choosing the parameters appropriately. We also note that the theory presented in this subsection is readily generalized to the hypothesis class (5.1).
We make the following ergodicity assumption about the data generation process:
Assumption A1.
Equation (2.2) possesses a compact global attractor $\mathcal{A}$ supporting an invariant measure $\mu$. Furthermore, the dynamical system on $\mathcal{A}$ is ergodic with respect to $\mu$
and satisfies a central limit theorem of the following form: for all Hölder continuous bounded $g : \mathcal{A} \to \mathbb{R}$, there is $\sigma(g) \ge 0$ such that
(5.9) $\sqrt{T} \Big( \frac{1}{T} \int_0^T g(x_t) \, dt - \int_{\mathcal{A}} g \, d\mu \Big) \Rightarrow \mathcal{N}\big(0, \sigma(g)^2\big) \quad \text{as } T \to \infty,$
where $\Rightarrow$ denotes convergence in distribution with respect to $\mu$. Furthermore, almost surely with respect to $\mu$,
(5.10) $\limsup_{T \to \infty} \Big( \frac{T}{\log \log T} \Big)^{1/2} \Big| \frac{1}{T} \int_0^T g(x_t) \, dt - \int_{\mathcal{A}} g \, d\mu \Big| < \infty.$
Remark 5.1.
In the work of Melbourne and coworkers, such an assumption is proven to hold for a class of differential equations, including the Lorenz '63 model at, and in a neighbourhood of, the classical parameter values: in (Holland and Melbourne, 2007) the central limit theorem is established; and in (Bahsoun et al., 2020) the continuity of $\sigma(g)$ in $g$ is proven. Whilst it is in general very difficult to prove such results for any given chaotic dynamical system, there is strong empirical evidence for them in many chaotic systems that arise in practice. This combination of theory and empirical evidence justifies studying the learning of model error under Assumption A1. Tran and Ward (2017) were the first to make use of the theory of Melbourne and coworkers to study learning of chaotic differential equations from time series.
Given $m = \sum_{j=1}^{p} \theta_j \phi_j$ from hypothesis class (5.2), we define
(5.11) $I_\infty(\theta) = \int_{\mathcal{A}} \Big\| m^\dagger(x) - \sum_{j=1}^{p} \theta_j \phi_j(x) \Big\|^2 \, \mu(dx)$
and
(5.12) $I_T(\theta) = \frac{1}{T} \int_0^T \Big\| m^\dagger(x_t) - \sum_{j=1}^{p} \theta_j \phi_j(x_t) \Big\|^2 \, dt.$
(Regularization is not needed in this setting because the data is plentiful (a continuous-time trajectory) and the number of parameters is finite.) Then the respective minimizers $\theta^*$ and $\hat{\theta}_T$ solve the linear systems
$A_\infty \theta^* = b_\infty, \qquad A_T \hat{\theta}_T = b_T,$
where
$(A_\infty)_{ij} = \int_{\mathcal{A}} \langle \phi_i(x), \phi_j(x) \rangle \, \mu(dx), \qquad (b_\infty)_i = \int_{\mathcal{A}} \langle \phi_i(x), m^\dagger(x) \rangle \, \mu(dx),$
and $A_T$, $b_T$ are the corresponding finite-time averages along the trajectory.
These facts can be derived analogously to the derivation in Section 8.4. Given $\theta^*$ and $\hat{\theta}_T$, we also define $m^* = \sum_{j} \theta^*_j \phi_j$ and $\hat{m}_T = \sum_{j} (\hat{\theta}_T)_j \phi_j$.
Recall that the underlying vector fields are assumed to be sufficiently regular. We make the following assumption regarding the vector fields defining the hypothesis class:
Assumption A2.
The functions $\{\phi_j\}_{j=1}^{p}$ are Hölder continuous on $\mathcal{A}$. In addition, the matrix $A_\infty$ is invertible.
Theorem 5.2.
Let Assumptions A1 and A2 hold. Then the scaled excess risk $\sqrt{T}\,E_T$ (resp. scaled generalization error $\sqrt{T}\,G_T$) is bounded above by $|e_T|$ (resp. $|g_T|$), where the random variable $e_T$ (resp. $g_T$) converges in distribution, w.r.t. $\mu$, to a limiting random variable $e_\infty$ (resp. $g_\infty$). Furthermore, there is a constant $c > 0$ such that, almost surely w.r.t. $\mu$,
$\limsup_{T \to \infty} \Big( \frac{T}{\log \log T} \Big)^{1/2} \max\big( E_T, \, G_T \big) \le c.$
The proof is provided in the appendix, Section 8.1.
Remark 5.3.
The convergence in distribution shows that, with high probability with respect to the initial data, the excess risk and the generalization error are bounded above by terms of size $T^{-1/2}$.
This can be improved to give an almost sure result, at the cost of a factor of $\sqrt{\log \log T}$. The theorem shows that (ignoring log factors, and acknowledging the probabilistic nature of any such statements) trajectories of length $T = \mathcal{O}(\varepsilon^{-2})$ are required to produce bounds on the excess risk and generalization error of size $\varepsilon$. In turn, the bounds on excess risk and generalization error also show that empirical risk minimization (of $I_T$) approaches the theoretically analyzable concept of risk minimization (of $I_\infty$) over hypothesis class (5.2). The sum of the excess risk and the generalization error gives
$E_T + G_T = I_T(\hat{m}_T) - I_\infty(m^*).$
We note that $I_T(\hat{m}_T)$ is computable once the approximate solution $\hat{m}_T$ has been identified; thus, when combined with an estimate for $E_T + G_T$, this leads to an estimate for the risk associated with the hypothesis class used. Work by Zhang et al. (2021) demonstrates that error bounds on learned model error terms can be extended to bound errors in the reproduction of invariant statistics for ergodic Itô diffusions (a generalization of the ordinary differential equation setting we have presented). Moreover, E et al. (2019) provide a direction for proving similar bounds on model error learning using nonlinear function classes (e.g., two-layer neural networks).
Finally, we remark on the dependence of the risk and generalization error bounds on the size of the model error. It is intuitive that the amount of data required to learn model error should decrease as the size of the model error decreases. This is demonstrated numerically in Section 6.4 (cf. Figures 1(b) and 1(a)). Here we comment that Theorem 5.2 also exhibits this feature: examination of the proof in Appendix 8.1 shows that all upper bounds on terms appearing in the excess and generalization error are proportional to $m^\dagger$ itself or to $m^*$, its approximation given an infinite amount of data; note that $m^* = m^\dagger$ if the hypothesis class contains the truth.
5.3. Non-Markovian Modeling with Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are one of the de facto tools for modeling systems with memory. Here, we show straightforward residual implementations of RNNs for continuous- and discrete-time, with the goal of modeling non-Markovian model error. For example, in discrete-time, the form in eq. 4.6 can be specified as
(5.13a) $x_{k+1} = \Psi_0(x_k) + C r_k$
(5.13b) $r_{k+1} = \sigma(A r_k + B x_k + c)$
The parameters $A$, $B$, $C$ and $c$ can be learned to describe memory-dependent model error. Note that inherent in choosing these matrices and vector is a choice of embedding dimension for the variable $r$; this will typically be larger than the dimension of $x$ itself. The goal is to choose them so that the output $\{x_k\}$ matches the output of (2.12), without observation of the unobserved state or knowledge of the true underlying maps.
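A forward pass of this residual RNN can be sketched as follows; the weights are random placeholders rather than trained values, and the imperfect model is an invented contraction (in practice the parameters would be fit by backpropagation through time):

```python
import numpy as np

# Forward pass of the residual RNN (5.13): the imperfect map Psi0 advances the
# observable and a linear readout C r of the hidden state corrects it; the
# hidden state is updated through a tanh nonlinearity. Weights are random
# placeholders for illustration, not trained values.
rng = np.random.default_rng(2)
d_x, d_r = 1, 16
A = 0.5 * rng.standard_normal((d_r, d_r)) / np.sqrt(d_r)
B = rng.standard_normal((d_r, d_x))
c = rng.standard_normal(d_r)
C = 0.01 * rng.standard_normal((d_x, d_r))
Psi0 = lambda x: 0.9 * x                   # assumed known imperfect one-step map

def rnn_step(x, r):
    x_new = Psi0(x) + C @ r                # residual readout, cf. (5.13a)
    r_new = np.tanh(A @ r + B @ x + c)     # hidden-state update, cf. (5.13b)
    return x_new, r_new

x, r = np.ones(d_x), np.zeros(d_r)
for _ in range(200):                       # roll the recurrence forward
    x, r = rnn_step(x, r)
```

Because the tanh saturates, the hidden state stays bounded and the small readout acts as a bounded perturbation of the contractive imperfect model; training then shapes this perturbation to match the memory-dependent residual.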
We note here that more sophisticated variants, such as Long Short-Term Memory RNNs (Hochreiter and Schmidhuber, 1997), are often more effective, but similar in nature. In continuous-time, we specify the form in eq. 4.3 as
(5.14a) $\dot{x} = f_0(x) + C r$
(5.14b) $\dot{r} = \sigma(A r + B x + c)$
The goal is to choose $A$, $B$, $C$ and $c$ so that the output $x(t)$ matches the output of (2.9), without observation of the unobserved state or knowledge of the true underlying vector fields. As in discrete-time, inherent in choosing these matrices and vector is a choice of embedding dimension for the variable $r$, which will typically be larger than the dimension of $x$ itself.
In both continuous- and discrete-time settings, the idea is to create a recurrent state $r$ of sufficiently large dimension whose evolution equation takes $x$ as input and, after a final linear transformation, approximates the missing dynamics.
A full discussion of the implementation of this model is beyond the scope of this paper. We refer readers to (Goodfellow et al., 2016) for background on discrete RNN implementations and backpropagation through time (BPTT). For implementations of continuous-time RNNs, it is common to leverage the success of the automatic BPTT code written in PyTorch and TensorFlow by discretizing eq. 5.14 with an ordinary differential equation solver that is compatible with these auto-differentiation tools (e.g., torchdiffeq (Rubanova et al., 2019)). This compatibility can also be achieved by use of explicit Runge-Kutta schemes (Queiruga et al., 2020). Note that the discretization of eq. 5.14 can (and perhaps should) be much finer than the data sampling rate $\Delta t$, but that this requires reliable estimation of $\dot{x}$ from discrete data. Existing approximation theory for discrete-time RNNs (Schäfer and Zimmermann, 2007) can be used to show that eq. 5.13 can approximate discrete-time systems arbitrarily well. However, the approximation theorem for continuous-time RNNs was proved for additive-style equations (Funahashi and Nakamura, 1993); we extend the result to a more common (non-additive) form, and do so in the context of residual-based learning as in eq. 5.14.
We state the theorem after making three assumptions upon which it rests:
Assumption A3.
Functions are all globally Lipschitz.
Note that this implies that is also globally Lipschitz.
Assumption A4.
Let