A Framework for Machine Learning of Model Error in Dynamical Systems

The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from data. We compare pure data-driven learning with hybrid models which incorporate imperfect domain knowledge. We cast the problem in both continuous- and discrete-time, for problems in which the model error is memoryless and in which it has significant memory, and we compare data-driven and hybrid approaches experimentally. Our formulation is agnostic to the chosen machine learning model. Using Lorenz '63 and Lorenz '96 Multiscale systems, we find that hybrid methods substantially outperform solely data-driven approaches in terms of data hunger, demands for model complexity, and overall predictive performance. We also find that, while a continuous-time framing allows for robustness to irregular sampling and desirable domain-interpretability, a discrete-time framing can provide similar or better predictive performance, especially when data are undersampled and the vector field cannot be resolved. We study model error from the learning theory perspective, defining excess risk and generalization error; for a linear model of the error used to learn about ergodic dynamical systems, both errors are bounded above by terms that diminish with the square root of $T$, the length of the training trajectory. We also illustrate scenarios that benefit from modeling with memory, proving that continuous-time recurrent neural networks (RNNs) can, in principle, learn memory-dependent model error and reconstruct the original system arbitrarily well; numerical results depict challenges in representing memory by this approach. We also connect RNNs to reservoir computing and thereby relate the learning of memory-dependent error to recent work on supervised learning between Banach spaces using random features.

1. Introduction

1.1. Background and Literature Review

The modeling and prediction of dynamical systems and time-series is an important goal in numerous domains, including biomedicine, climatology, robotics, and the social sciences. Traditional approaches to modeling these systems appeal to careful study of their mechanisms, and the design of targeted equations to represent them. These carefully built mechanistic models have impacted humankind in numerous arenas, including our ability to land spacecraft on celestial bodies, provide high-fidelity numerical weather prediction, and artificially regulate physiologic processes, through the use of pacemakers and artificial pancreases, for example. This paper focuses on the learning of model error: we assume that an imperfect mechanistic model is known, and that data is used to improve it. We introduce a framework for this problem, focusing on distinctions between Markovian and non-Markovian model error, providing a unifying review of relevant literature, developing some underpinning theory related to both the Markovian and non-Markovian settings, and presenting numerical experiments which illustrate our key findings.

To set our work in context, we first review the use of data-driven methods for time-dependent problems, organizing the literature review around four themes comprising Sections 1.1.1, 1.1.2, 1.1.3 and 1.1.4; these are devoted, respectively, to pure data-driven methods, hybrid methods that build on mechanistic models, non-Markovian models that describe memory, and applications of the various approaches. Having set the work in context, in Section 1.2 we detail the contributions we make, and describe the organization of the paper.

1.1.1. Data-Driven Modeling of Dynamical Systems

A recent wave of machine learning successes in data-driven modeling, especially in imaging sciences, has shown that we can demand even more from existing models, or that we can design models of more complex phenomena than heretofore. Traditional models built from, for example, low order polynomials and/or linearized model reductions, may appear limited when compared to the flexible function approximation frameworks provided by neural networks and kernel methods. Neural networks, for example, have a long history of success in modeling dynamical systems (Narendra and Parthasarathy, 1992; González-García et al., 1998; Krischer et al., 1993; Rico-Martinez et al., 1994; Rico-Martines et al., 1993; Rico-Martínez et al., 1992; Lagaris et al., 1998), and recent developments in deep learning for operators continue to propel this trend (Lu et al., 2020; Bhattacharya et al., 2020; Li et al., 2021).

The success of neural networks arguably relies on balanced expressivity and generalizability, but other methods also excel in learning parsimonious and generalizable representations of dynamics. A particularly popular methodology is to perform sparse regression over a dictionary of vector fields, including the use of thresholding approaches (SINDy) (Brunton et al., 2016) and $\ell_1$-regularized polynomial regression (Tran and Ward, 2017; Schaeffer et al., 2017, 2018, 2020). Non-parametric methods, like Gaussian process models (Rasmussen and Williams, 2006), have also been used widely for modeling nonlinear dynamics (Wang et al., 2005; Frigola et al., 2014; Kocijan, 2016; Chkrebtii et al., 2016). While a good choice of kernel is often essential for the success of these methods, recent progress has been made towards automatic hyperparameter tuning via parametric kernel flows (Hamzi and Owhadi, 2021). Successes with Gaussian process models were also extended to high dimensional problems by using random feature map approximations (Rahimi and Recht, 2008a) within the context of data-driven learning of parametric partial differential equations and solution operators (Nelsen and Stuart, 2020). Advancements to data-driven methods based on Koopman operator theory and dynamic mode decomposition have also had significant impact; they offer exciting new possibilities for predicting nonlinear dynamics from data (Tu et al., 2014; Korda et al., 2020; Alexander and Giannakis, 2020).

It is important to consider whether to model in discrete- or continuous-time, as both have potential advantages. The primary advantages of continuous-time modeling are its flexibility and interpretability: it allows for training on irregularly sampled time-series; moreover, the learned right-hand-side can be solved numerically at any timestep, and hence used in a variety of applications. Traditional implementations of continuous-time learning require accurate estimation of time-derivatives of the state, but this may be circumvented using approaches that leverage autodifferentiation software (Rubanova et al., 2019; Kaheman et al., 2020) or methods which learn from statistics derived from time-series, such as moments or correlation functions (Schneider et al., 2020). Discrete-time approaches, on the other hand, are easily deployed when train and test data sample rates are the same. Moreover, they allow for “non-intrusive” model correction, as additions are applied outside of the numerical integrator; this may be relevant for practical integration with complex simulation software.

Both non-parametric and parametric model classes are used in the learning of dynamical systems, with the latter connecting to the former via the representer theorem, when Gaussian process regression (Rasmussen and Williams, 2006) is used (Burov et al., 2020; Gilani et al., 2021; Harlim et al., 2021).

1.1.2. Hybrid Mechanistic and Data-Driven Modeling

Attempts to replace traditional mechanistic models with purely data-driven (i.e. de novo or “learn from scratch”) approaches often fall short. There is now a growing recognition by machine learning researchers that these mechanistic models are very valuable (Miller et al., 2021), as they represent the distillation of centuries of data collected in countless studies and interpreted by domain experts. Recent studies have consistently found advantages of hybrid methods that blend mechanistic knowledge and data-driven techniques. Not only do these methods improve predictive performance (Pathak et al., 2018b), but they also reduce data demands (Rackauckas et al., 2020) and improve interpretability and trustworthiness, which is essential for many applications. This is exemplified by work in autonomous drone landing (Shi et al., 2018) and helicopter flying (Rackauckas et al., 2021), as well as predictions for COVID-19 mortality risk (Sottile et al., 2021) and COVID-19 treatment response (Qian et al., 2021).

The question of how best to use the power of data and machine learning to leverage and build upon our existing mechanistic knowledge is thus of widespread current interest. This question and research direction has been anticipated over the last thirty years of research activity at the interface of dynamical systems with machine learning (Rico-Martinez et al., 1994; Wilson and Zorzetto, 1997; Lovelett et al., 2020), and now a critical mass of effort is developing. A variety of studies have been tackling these questions in weather and climate modeling (Kashinath et al., 2021; Farchi et al., 2021); even in the imaging sciences, where pure machine learning has been spectacularly successful, emerging work shows that incorporating knowledge of underlying physical mechanisms improves performance in a variety of image recognition tasks (Ba et al., 2019).

As noted and studied by Ba et al. (2019); Freno and Carlberg (2019) and others, there are a few common high-level approaches for blending machine learning with mechanistic knowledge: (1) use machine learning to learn additive residual corrections for the mechanistic model (Saveriano et al., 2017; Shi et al., 2018; Kaheman et al., 2019; Harlim et al., 2021; Farchi et al., 2021); (2) use the mechanistic model as an input or feature for a machine learning model (Pathak et al., 2018b; Lei et al., 2020; Borra et al., 2020); (3) use mechanistic knowledge in the form of a differential equation as a final layer in a neural network representation of the solution, or equivalently define the loss function to include approximate satisfaction of the differential equation (Raissi et al., 2019, 2018; Chen et al., 2021; Smith et al., 2020); and (4) use mechanistic intuition to constrain or inform the machine learning architecture (Haber and Ruthotto, 2017; Maulik et al., 2021). Many other successful studies have developed specific designs that further hybridize these and other perspectives (Hamilton et al., 2017; Freno and Carlberg, 2019; Yi Wan et al., 2020; Jia et al., 2021). In addition, parameter estimation for mechanistic models is a well-studied topic in data assimilation, inverse problems, and other mechanistic modeling communities, but recent approaches that leverage machine learning for this task may create new opportunities for accounting for temporal parameter variations (Miller et al., 2020) and unknown observation functions (Linial et al., 2021).

An important distinction should be made between physics-informed surrogate modeling and what we refer to as hybrid modeling. Surrogate modeling primarily focuses on replacing high-cost, high-fidelity mechanistic model simulations with similarly accurate models that are cheap to evaluate. These efforts have shown great promise by training machine learning models on expensive high-fidelity simulation data, and have been especially successful when the underlying physical (or other domain-specific) mechanistic knowledge and equations are incorporated into the model training (Raissi et al., 2019) and architecture (Maulik et al., 2021). We use the term hybrid modeling, on the other hand, to indicate when the final learned system involves interaction (and possibly feedback) between mechanism-based and data-driven models (Pathak et al., 2018b).

In this work, we focus primarily on hybrid methods that learn residuals to an imperfect mechanistic model. The benefits of this form of hybrid modeling, which we and many others have observed, are not yet fully understood in a theoretical sense. Intuitively, nominal mechanistic models are most useful when they encode key nonlinearities that are not readily inferred using general model classes and modest amounts of data. Indeed, classical approximation theorems for fitting polynomials, Fourier modes, and other common function bases directly reflect this relationship by bounding the error with respect to a measure of complexity of the target function (e.g. Lipschitz constants, moduli of continuity, Sobolev norms, etc.) (DeVore and Lorentz, 1993, Chapter 7). Recent work by E et al. (2019) provides a priori error bounds for two-layer neural networks and kernel-based regressions, with constants that depend explicitly on the norm of the target function in the model-hypothesis space (a Barron space and a reproducing kernel Hilbert space, resp.). At the same time, problems for which mechanistic models only capture low-complexity trends (e.g. linear) may still be good candidates for hybrid learning (over purely data-driven), as an accurate linear model reduces the parametric burden for the machine-learning task; this effect is likely accentuated in data-starved regimes. Furthermore, even in cases where data-driven models perform satisfactorily, a hybrid approach may improve interpretability, trustworthiness, and controllability without sacrificing performance.

Hybrid models are often cast in Markovian, memory-free settings where the learned dynamical system (or its learned residuals) are solely dependent on the observed states. This approach can be highly effective when measurements of all relevant states are available or when the influence of the unobserved states is adequately described by a function of the observables. This is the perspective employed by Shi et al. (2018), where they learn corrections to physical equations of motion for an autonomous vehicle in regions of state space where the physics perform poorly—these residual errors are driven by un-modeled turbulence during landing, but can be predicted using the observable states of the vehicle (i.e. position, velocity, and acceleration). This is also the perspective taken in applications of high-dimensional multiscale dynamical systems, wherein sub-grid closure models parameterize the effects of expensive fine-scale interactions (e.g. cloud simulations) as functions of the coarse variables (Grabowski, 2001; Khairoutdinov and Randall, 2001; Tan et al., 2018; Brenowitz and Bretherton, 2018; O’Gorman and Dwyer, 2018; Rasp et al., 2018; Schneider et al., 2020; Beucler et al., 2021). The result is a hybrid dynamical system with a physics-based equation defined on the coarse variables with a Markovian correction term that accounts for the effects of the expensive fine scale dynamics.

1.1.3. Non-Markovian Data-Driven Modeling

Unobserved and unmodeled processes are often responsible for model errors that cannot be represented in a Markovian fashion within the observed variables alone. This need has driven substantial advances in memory-based modeling. One approach to this is the use of delay embeddings (Takens, 1981). These methods are inherently tied to discrete time representations of the data and, although very successful in many applied contexts, are of less value when the goal of data-driven learning is to fit continuous-time models; this is a desirable modeling goal in many settings.

An alternative to understanding memory is via the Mori-Zwanzig formalism, which is a fundamental building block in the presentation of memory and hidden variables and may be employed for both discrete-time and continuous-time models. Although initially developed primarily in the context of statistical mechanics, it provides the basis for understanding hidden variables in dynamical systems, and thus underpins many generic computational tools applied in this setting (Chorin et al., 2000; Zhu et al., 2018; Gouasmi et al., 2017). It has been successfully applied to problems in fluid turbulence (Duraisamy et al., 2019; Parish and Duraisamy, 2017) and molecular dynamics (Li et al., 2017; Hijón et al., 2010). Studies by Ma et al. (2019); Wang et al. (2020) demonstrated how the Mori-Zwanzig formalism motivates the use of recurrent neural networks (RNNs) (Rumelhart et al., 1986; Goodfellow et al., 2016) as a deep learning approach to non-Markovian closure modeling. Indeed, closure modeling using RNNs has become an exciting new way to learn memory-based closures (Kani and Elsheikh, 2017; Chattopadhyay et al., 2020b; Harlim et al., 2021).

In contrast, Lin and Lu (2021) use the Mori-Zwanzig formalism to directly justify a delay-embedding approach to closure modeling. Harlim et al. (2021) performed extensive comparisons between these delay-embedding and RNN-based closure approaches. Although the original formulation of Mori-Zwanzig as a general purpose approach to modeling partially observed systems was in continuous-time (Chorin et al., 2000), many practical implementations adopt a discrete-time picture (Darve et al., 2009; Lin and Lu, 2021). This causes the learned memory terms to depend on sampling rates, which, in turn, can inhibit flexibility and interpretability of the model.

Recent advances in continuous-time memory-based modeling, however, may be applicable to these non-Markovian hybrid model settings. The theory of continuous-time RNNs (i.e. formulated as differential equations, rather than a recurrence relation) was studied in the 1990s (Funahashi and Nakamura, 1993; Beer, 1995), albeit for equations with a specific additive structure. This structure was exploited in a continuous-time reservoir computing approach by Lu et al. (2018) for reconstructing chaotic attractors from data; we note that comparisons between RNNs and reservoir computing (a subclass of RNNs with random parameters fixed in the recurrent state) in discrete-time have yielded mixed conclusions in terms of their relative efficiencies and ability to retain memory (Vlachas et al., 2020; Chattopadhyay et al., 2020a). Recent formulations of continuous-time RNNs have departed slightly from the additive structure, and have focused on constraints and architectures that ensure stability of the resulting dynamical system (Chang et al., 2019; Erichson et al., 2020; Niu et al., 2019; Rubanova et al., 2019; Sherstinsky, 2020). In addition, significant theoretical work has been performed for linear RNNs in continuous-time (Li et al., 2020). Nevertheless, these various methods have not yet been formulated within a hybrid modeling framework, nor has their approximation power been carefully evaluated in that context. A recent step in this direction, however, is the work by Gupta and Lermusiaux (2021), which tackles non-Markovian hybrid modeling in continuous-time with neural network-based delay differential equations.

1.1.4. Applications of Data-Driven Modeling

In order to deploy hybrid methods in real-world scenarios, we must also be able to cope with noisy, partial observations. Accommodating the learning of model error in this setting, as well as state estimation, is an active area of research in the data assimilation (DA) community (Pulido et al., 2018; Farchi et al., 2021; Bocquet et al., 2020). Learning dynamics from noisy data is generally non-trivial for nonlinear systems—there is a chicken-and-egg problem in which accurate state estimation typically relies on the availability of correct models, and correct models are most readily identified using accurate state estimates. Recent studies have addressed this challenge by attempting to jointly learn the noise and the dynamics. Gottwald and Reich (2021) approach this problem from a data assimilation perspective, and employ an Ensemble Kalman Filter (EnKF) to iteratively update the parameters for their dynamics model, then filter the current state using the updated dynamics. Similar studies were performed by Brajard et al. (2021), and applied specifically in model error scenarios (Brajard et al., 2020; Farchi et al., 2021; Wikner et al., 2020). Kaheman et al. (2019) approached this problem from a variational perspective, performing a single optimization over all noise sequences and dynamical parameterizations. Nguyen et al. (2019) used an Expectation-Maximization perspective to compare these variational and ensemble-based approaches, and further study is needed to understand the trade-offs between these styles of optimization.

We note that these data assimilators are themselves dynamical systems, which can be tuned (using optimization and machine learning) to provide more accurate state updates and more efficient state identification. However, while learning improved DA schemes is sometimes viewed as a strategy for coping with model error (Zhu and Kamachi, 2000), we see the optimization of DA and the correction of model errors as two separate problems that should be addressed individually.

When connecting models of dynamical systems to real-world data, it is also essential to recognize that available observables may live in a very different space than the underlying dynamics. Recent studies have shown ways to navigate this by using autoencoders and dynamical systems models to jointly learn a latent embedding and dynamics in that latent space (Champion et al., 2019). Proofs of concept for similar approaches primarily focus on image-based inputs, but have potential for applications in medicine (Linial et al., 2021) and reduction of nonlinear partial differential equations (Maulik et al., 2021).

1.2. Our Contributions

Despite this large and recent body of work in data-driven learning methods and hybrid modeling strategies, many challenges remain for understanding how to best combine mechanistic and machine-learned models; indeed, the answer is highly dependent on the application. Here, we construct a mathematical framework that unifies many of the common approaches for blending mechanistic and machine learning models; having done so we provide strong evidence for the value of hybrid approaches. Our contributions are listed as follows:

  1. We provide an overarching framework for learning model error from data in dynamical systems settings, studying both discrete- and continuous-time models, together with Markovian and memory-dependent representations of the model error. This formulation is agnostic to choice of mechanistic model and class of machine learning functions.

  2. We study the Markovian learning problem in the context of ergodic continuous-time dynamical systems, proving bounds on the excess risk and generalization error.

  3. We present a simple approximation theorem for learning memory-dependent (non-Markovian) model error in continuous-time using recurrent neural networks.

  4. We describe numerical experiments which demonstrate the utility of learning model error in comparison both with pure data-driven learning and with pure (but slightly imperfect) mechanistic modeling.

  5. We describe numerical experiments comparing the benefits of learning discrete- versus continuous-time models.

In Section 2, we address contribution 1. by defining the general settings of interest for dynamical systems in both continuous- and discrete-time. We then link these underlying systems to a machine learning framework in Sections 3 and 4; in the former we formulate the problem in the setting of statistical learning, and in the latter we define concrete optimization problems found from finite parameterizations of the hypothesis class in which the model error is sought. Section 5 is focused on specific choices of architectures, and underpinning theory for machine learning methods with these choices: we analyze linear methods from the perspective of learning theory in the context of ergodic dynamical systems (contribution 2.); and we describe an approximation theorem for continuous-time hybrid recurrent neural networks (contribution 3.). Finally, Section 6 presents our detailed numerical experiments; we apply the methods in Section 5 to exemplar dynamical systems of the forms outlined in Section 2, and highlight our findings (contributions 4. and 5.).

2. Dynamical Systems Setting

We present a general framework for modeling a dynamical system with Markovian model error, first in continuous-time (Section 2.1) and then in discrete-time (Section 2.2). We then extend the framework to the setting of non-Markovian model error (Section 2.3), including a scale-separation parameter $\varepsilon$ which enables us to smoothly transition from scale-separated problems (where Markovian closure is likely to be accurate) to problems where the unobserved variables are not scale-separated from those observed (where Markovian closure is likely to fail and memory needs to be accounted for). It is important to note that the continuous-time formulation necessarily assumes an underlying data-generating process that is continuous in nature. The discrete-time formulation can be viewed as a discretization of an underlying continuous system, but can also represent systems that are truly discrete.

The settings that we present are all intended to represent and classify common situations that arise in modeling and predicting dynamical systems. In particular, we stress two key features. First, we point out that mechanistic models (later referred to as a vector field $f_0$ or flow map $\Psi_0$) are often available and may provide predictions with reasonable fidelity. However, these models are often simplifications of the true system, and thus can be improved with data-driven approaches. Nevertheless, they provide a useful starting point that can reduce the complexity and data-hunger of the learning problems. In this context, we study trade-offs between discrete- and continuous-time framings. While we begin with fully-observed contexts in which the dynamics are Markovian with respect to the observed state $x$, we later note that we may only have access to partial observations of a larger system in variables $(x, y)$. By restricting our interest to prediction of the observables $x$, we show how a latent dynamical process (e.g. an RNN) has the power to reconstruct the correct dynamics for our observables.

2.1. Continuous-Time

Consider the following dynamical system

(2.1) $\dot{x} = f^\dagger(x), \qquad x(0) = x_0,$

and define $X := C([0,\infty); \mathbb{R}^d)$. If $f^\dagger \in C^1(\mathbb{R}^d; \mathbb{R}^d)$ then (2.1) has solution $x(\cdot) \in X$ for any $x_0 \in \mathbb{R}^d$, on $[0, T_{\max})$, the maximal interval of existence.

The primary model error scenario we envisage in this section is one in which the vector field $f^\dagger$ can only be partially known or accessed: we assume that

$f^\dagger(x) = f_0(x) + m^\dagger(x),$

where $f_0$ is known to us and $m^\dagger$ is not known. For any $f_0$ (regardless of its fidelity), there exists a function $m^\dagger : \mathbb{R}^d \to \mathbb{R}^d$ such that eq. 2.1 can be rewritten as

(2.2) $\dot{x} = f_0(x) + m^\dagger(x).$

However, for this paper, it is useful to think of $m^\dagger$ as being small relative to $f_0$; the function $m^\dagger$ accounts for model error. While the approach in (2.2) is targeted at learning residuals of $f_0$, the true vector field $f^\dagger$ can alternatively be reconstructed from $f_0$ through a different function $h^\dagger : \mathbb{R}^{2d} \to \mathbb{R}^d$ using the form

(2.3) $\dot{x} = h^\dagger\bigl(x, f_0(x)\bigr).$

Both approaches are defined on spaces that allow perfect reconstruction of $f^\dagger$. However, the first formulation hypothesizes that the missing information is additive; the second formulation provides no such indication. Because the first approach ensures substantial usage of $f_0$, it has advantages in settings where $f_0$ is trusted by practitioners and model explainability is important. The second approach will likely see advantages in settings where there is a simple non-additive form of model error, including coordinate transformations and other (possibly state-dependent) nonlinear warping functions of the nominal physics $f_0$. Note that the use of $h^\dagger$ in representing the model error in the augmented-input setting of eq. 2.3 includes the case of not leveraging $f_0$ at all. It is, hence, potentially more useful than simply adopting an $x$-dependent additive model error; but it requires learning a more complex function.

The augmented-input method also has connections to model stacking (Wolpert, 1992); this perspective can be useful when there are $K$ model hypotheses $f_1, \ldots, f_K$:

$\dot{x} = h^\dagger\bigl(x, f_1(x), \ldots, f_K(x)\bigr).$

Our goal is to use machine learning to approximate these corrector functions using our nominal knowledge $f_0$ and observations of a trajectory $\{x(t)\}_{t \in [0,T]}$, for some $T > 0$, from the true system (2.1). In this work, we consider only the case of learning $m^\dagger$ in equation (2.2). For now the reader may consider $x(\cdot)$ given without noise so that, in principle, $\dot{x}(\cdot)$ is known and may be leveraged. In practice this will not be the case, for example if the data are high-frequency but discrete in time; we address this issue in what follows.
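To make the additive decomposition concrete, the following minimal sketch (our own illustration, not code from the paper) instantiates $f^\dagger$, $f_0$ and $m^\dagger$ for a Lorenz '63 truth whose nominal model carries a perturbed parameter; the names `f_true`, `f0` and `m_true` are ours.

```python
import numpy as np

# Truth: a Lorenz '63 vector field f†(x); the nominal model f0 uses a
# perturbed parameter, so the error m†(x) = f†(x) - f0(x) is small but nonzero.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def f_true(x):
    """The (in practice unknown) vector field f† in eq. (2.1)."""
    return np.array([SIGMA * (x[1] - x[0]),
                     x[0] * (RHO - x[2]) - x[1],
                     x[0] * x[1] - BETA * x[2]])

def f0(x, rho_approx=0.9 * RHO):
    """The imperfect mechanistic model available to the practitioner."""
    return np.array([SIGMA * (x[1] - x[0]),
                     x[0] * (rho_approx - x[2]) - x[1],
                     x[0] * x[1] - BETA * x[2]])

def m_true(x):
    """The Markovian model error of eq. (2.2); the learning target."""
    return f_true(x) - f0(x)
```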

2.2. Discrete-Time

Consider the following dynamical system

(2.4) $x_{k+1} = \Psi^\dagger(x_k),$

and define $\mathsf{X}$ to be the space of one-sided sequences $\{x_k\}_{k \in \mathbb{Z}^+}$ in $\mathbb{R}^d$. If $\Psi^\dagger \in C(\mathbb{R}^d; \mathbb{R}^d)$, the map yields solution $\{x_k\}_{k \in \mathbb{Z}^+} \in \mathsf{X}$. (Here we define $\mathbb{Z}^+ := \{0, 1, 2, \ldots\}$, the non-negative integers, including zero.) As in the continuous-time setting, we assume we only have access to an approximate mechanistic model $\Psi_0$. Again, we can correct $\Psi_0$ using an additive residual term $m^\dagger$ in a model of the form

(2.5) $x_{k+1} = \Psi_0(x_k) + m^\dagger(x_k),$

or by feeding $\Psi_0$ as an input to a corrective warping function $h^\dagger$:

$x_{k+1} = h^\dagger\bigl(x_k, \Psi_0(x_k)\bigr);$

we focus our experiments on the additive residual framing in eq. 2.5.

Note that the discrete-time formulation can be made compatible with continuous-time data sampled uniformly at rate $\Delta t$. To see this, consider eq. 2.1 and its solution operator $\Phi^\dagger_t$, so that $x(t) = \Phi^\dagger_t(x_0)$. We then define

(2.6) $\Psi^\dagger(v) := \Phi^\dagger_{\Delta t}(v),$

which can be obtained via numerical integration of $f^\dagger$ over a time window of length $\Delta t$. Similarly, we can consider the solution operator $\Phi^0_t$ defined by the vector field $f_0$ to define a discrete map

(2.7) $\Psi_0(v) := \Phi^0_{\Delta t}(v).$

In these settings, $x_k = x(k\Delta t)$ for $k \in \mathbb{Z}^+$.
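The compatibility of eqs. 2.6 and 2.7 with numerical integration can be sketched as follows; this is our own illustration, with `rk4_step` a generic fourth-order integrator rather than the paper's particular choice of solver.

```python
import numpy as np

def rk4_step(f, x, dt):
    """One classical fourth-order Runge-Kutta step for x' = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def make_flow_map(f, dt, n_substeps=10):
    """Discrete map v -> Phi_dt(v), as in eqs. (2.6)-(2.7), by integrating
    the vector field f over one sampling interval of length dt."""
    h = dt / n_substeps
    def psi(v):
        for _ in range(n_substeps):
            v = rk4_step(f, v, h)
        return v
    return psi

# Psi_dagger = make_flow_map(f_true, dt)  # sampled truth, eq. (2.6)
# Psi_0      = make_flow_map(f0, dt)      # nominal discrete model, eq. (2.7)
```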

2.3. Partially Observed Systems (Continuous-Time)

The framework in Sections 2.1 and 2.2 assumes that the system dynamics are Markovian with respect to the observable $x$. Most of our experiments are performed in the fully-observed Markovian case. However, this assumption rarely holds in real-world systems. Consider a block-on-a-spring experiment conducted in an introductory physics laboratory. In principle, the system is strictly governed by the position and momentum of the block (i.e. $x = (q, p)$), along with a few scalar parameters. However (as most students’ error analysis reports will note), the dynamics are also driven by a variety of external factors, like a wobbly table or a poorly greased track. The magnitude, timescale, and structure of the influence of these different factors are rarely known; and yet, they are somehow encoded in the discrepancy between the nominal equations of motion and the (noisy) observations of this multiscale system.

Thus we also consider the setting in which the dynamics of $x$ is not Markovian. If we consider $x$ to be the observable states of a Markovian system in dimension higher than $d$, then we can write the full system as

(2.8a) $\dot{x} = f^\dagger(x, y),$
(2.8b) $\dot{y} = \frac{1}{\varepsilon}\, g^\dagger(x, y).$

Here $x \in \mathbb{R}^d$, $y \in \mathbb{R}^m$, and $\varepsilon > 0$ is a constant measuring the degree of scale-separation (which is large when $\varepsilon$ is small). The system yields solution $x(\cdot) \in X$, $y(\cdot) \in Y$ (with $Y$ defined analogously to $X$), for any initial condition $(x_0, y_0) \in \mathbb{R}^d \times \mathbb{R}^m$, on the maximal interval of existence. We view $y$ as the complicated, unresolved, or unobserved aspects of the true underlying system.

For any $f_0$ (regardless of its fidelity), there exists a function $m^\dagger$ such that eq. 2.8 can be rewritten as

(2.9a) $\dot{x} = f_0(x) + m^\dagger(x, y),$
(2.9b) $\dot{y} = \frac{1}{\varepsilon}\, g^\dagger(x, y).$

Now observe that, by considering the solution of equation (2.9b) as a function of the history of $x$, the influence of $y$ on the solution $x$ can be captured by a parameterized (w.r.t. $t$) family of operators $m_t^\dagger$ on the historical trajectory $\{x(s)\}_{s \in [0,t]}$, unobserved initial condition $y_0$, and scale-separation parameter $\varepsilon$ such that

(2.10) $\dot{x}(t) = f_0\bigl(x(t)\bigr) + m_t^\dagger\bigl(\{x(s)\}_{s \in [0,t]},\, y_0,\, \varepsilon\bigr).$

Our goal is to use machine learning to find a Markovian model, in which $x$ is part of the state variable, using our nominal knowledge $f_0$ and observations of a trajectory $\{x(t)\}_{t \in [0,T]}$, for some $T > 0$, from the true system (2.8); note that $y$ is not observed and nothing is assumed known about the vector field $g^\dagger$ or the parameter $\varepsilon$.

Note that equations (2.8), (2.9) and (2.10) are all equivalent formulations of the same problem and have identical solutions. The third formulation points towards two intrinsic difficulties: first, the unknown “function” to be learned is in fact defined by a family of operators $m_t^\dagger$ mapping the Banach space of path history into $\mathbb{R}^d$; secondly, the operator is parameterized by $y_0$, which is unobserved. We will address the first issue by showing that the operators $m_t^\dagger$ can be arbitrarily well-approximated from within a family of differential equations in dimension higher than $d$; the second issue may be addressed by techniques from data assimilation (Asch et al., 2016; Law et al., 2015; Reich and Cotter, 2015) once this approximating family is learned. We emphasize, however, that we do not investigate the practicality of this approach to learning non-Markovian systems and much remains to be done in this area.

It is also important to note that these non-Markovian operators can sometimes be adequately approximated by invoking a Markovian model for $x$ and simply learning a function $m^\dagger(x)$ as in Section 2.1. For example, when $\varepsilon \to 0$ and the $y$ dynamics, with $x$ fixed, are sufficiently mixing, the averaging principle (Bensoussan et al., 2011; Vanden-Eijnden and others, 2003; Pavliotis and Stuart, 2008) may be invoked to deduce that

$m_t^\dagger\bigl(\{x(s)\}_{s \in [0,t]},\, y_0,\, \varepsilon\bigr) \approx \overline{m}\bigl(x(t)\bigr)$

for some $\overline{m}$ as in Section 2.1.

It is highly advantageous to identify settings where Markovian modeling is sufficient, as it is a simpler learning problem. We find that learning $m_t^\dagger$ is necessary when there is significant memory required to explain the dynamics of $x$; learning $\overline{m}$ is sufficient when memory effects are minimal. In Section 6, we show that Markovian closures can perform well for certain tasks even when the scale-separation factor $\varepsilon$ is not small. In Section 5 we demonstrate how the family of operators $m_t^\dagger$ may be represented through ordinary differential equations, appealing to ideas which blend continuous-time RNNs with an assumed known vector field $f_0$.

2.4. Partially Observed Systems (Discrete-Time)

The discrete-time analog of the previous setting is as follows. We consider the mapping

(2.11a) $x_{k+1} = \Psi_1^\dagger(x_k, y_k),$
(2.11b) $y_{k+1} = \Psi_2^\dagger(x_k, y_k),$

with $x_k \in \mathbb{R}^d$, $y_k \in \mathbb{R}^m$, yielding solutions $\{x_k\}_{k \in \mathbb{Z}^+}$ and $\{y_k\}_{k \in \mathbb{Z}^+}$.

As in the continuous-time setting, we assume that while we do not know the full equations $(\Psi_1^\dagger, \Psi_2^\dagger)$, we do have an approximate model $\Psi_0$ that is Markovian on the observable $x$. Hence, we use our available physics and rewrite eq. 2.11 as

(2.12a) $x_{k+1} = \Psi_0(x_k) + m^\dagger(x_k, y_k),$
(2.12b) $y_{k+1} = \Psi_2^\dagger(x_k, y_k),$

and can again, analogously to (2.10), write a solution in the space of observables as

(2.13) $x_{k+1} = \Psi_0(x_k) + m_k^\dagger\bigl(\{x_j\}_{j=0}^k,\, y_0\bigr),$

with $m_k^\dagger$ a function of the historical trajectory $\{x_j\}_{j=0}^k$ and the unobserved initial condition $y_0$. As shown for continuous-time in (2.10) and for discrete-time in (2.13), the true residual corrector function for $\Psi_0$ is now a non-Markovian term that depends on the history of the observable state and the initial condition $y_0$. While this dependence on initial condition may appear daunting, many applications show substantial memory fading properties that allow for this dependence to be dropped. Indeed, if this discrete-time system is computed from the time-$\Delta t$ map for (2.8) then, for $\varepsilon \to 0$ and when averaging scenarios apply, a memoryless model as in Section 2.2 may be used.

Remark 2.1.

While our entire work has focused on immutable mechanistic models $f_0$ and $\Psi_0$, these models typically have tunable parameters. Of course, we can jointly learn parameters for the mechanistic model and closure term. However, the lack of identifiability between modifying the closure and modifying the physics brings up an interesting question in explainability. Future work might focus on decoupling the learning of parameters and closure terms so that maximal expressivity is first squeezed out of the mechanistic model (Plumlee and Joseph, 2017). This decoupling can be especially important when we are interested in inferring the parameters, whose estimates are biased by unknown model error (Plumlee, 2017).

3. Statistical Learning for Ergodic Dynamical Systems

Here, we present a learning theory framework within which to consider methods for discovering model error from data. We outline the learning theory in a continuous-time Markovian setting (using possibly discretely sampled data), and point to its analogs in discrete-time and non-Markovian settings.

In the discrete-time settings, we assume access to discretely sampled training data $\{x_k\}_{k=0}^K$, where $\Delta t$ is a uniform sampling rate and we assume that $x_k = x(k\Delta t)$. In the continuous-time settings, we assume access to continuous-time training data $\{x(t)\}_{t \in [0,T]}$; Section 6.2 discusses the important practical question of estimating $\dot{x}$ from discrete (but high frequency) data. In either case, consider the problem of identifying $m \in \mathcal{M}$ (where $\mathcal{M}$ represents the model, or hypothesis, class) that minimizes a loss function quantifying closeness of $m$ to $m^\dagger$. In the Markovian setting we choose a measure $\mu$ on $\mathbb{R}^d$ and define the loss

$I_\infty(m) := \int_{\mathbb{R}^d} \bigl\| m^\dagger(x) - m(x) \bigr\|^2\, \mu(dx).$

If we assume that, at the true $m^\dagger$, the dynamics is ergodic with invariant measure $\mu$, then we can exchange time and space averages to see that, for an infinitely long trajectory $\{x(t)\}_{t \geq 0}$,

$I_\infty(m) = \lim_{T \to \infty} \frac{1}{T} \int_0^T \bigl\| m^\dagger\bigl(x(t)\bigr) - m\bigl(x(t)\bigr) \bigr\|^2\, dt.$

Since we may only have access to a trajectory dataset of finite length $T$, it is natural to define

$I_T(m) := \frac{1}{T} \int_0^T \bigl\| m^\dagger\bigl(x(t)\bigr) - m\bigl(x(t)\bigr) \bigr\|^2\, dt$

and note that, by ergodicity, $\lim_{T \to \infty} I_T(m) = I_\infty(m)$. Finally, we can use eq. 2.2 to get

(3.1) $I_T(m) = \frac{1}{T} \int_0^T \bigl\| \dot{x}(t) - f_0\bigl(x(t)\bigr) - m\bigl(x(t)\bigr) \bigr\|^2\, dt.$

This, possibly regularized, is a natural loss function to employ when continuous-time data is available, and should be viewed as approximating $I_\infty(m)$.

We can use these definitions to frame the problem of learning model error in the language of statistical learning (Vapnik, 2013). If we let $\mathcal{M}$ denote the hypothesis class over which we seek to minimize $I_T$, then we may define

$m_T := \operatorname{arg\,min}_{m \in \mathcal{M}} I_T(m), \qquad m_\infty := \operatorname{arg\,min}_{m \in \mathcal{M}} I_\infty(m).$

The risk associated with seeking $m$ from the class $\mathcal{M}$ is defined by $I_\infty(m_\infty)$, noting that this is $0$ if $m^\dagger \in \mathcal{M}$. The risk measures the intrinsic error incurred by seeking to learn $m^\dagger$ from the restricted class $\mathcal{M}$, which typically does not include $m^\dagger$; it is an approximation theoretic concept which encodes the richness of the hypothesis class $\mathcal{M}$. The risk may be decreased by increasing the expressiveness of $\mathcal{M}$. Thus risk is independent of the data employed. To quantify the effect of data volume on learning $m^\dagger$, it is helpful to introduce the following two concepts. The excess risk is defined by

$E_T := I_\infty(m_T) - I_\infty(m_\infty)$

and represents the additional approximation error incurred by using data defined over a finite time horizon $[0,T]$ in the estimate of the function $m^\dagger$. The generalization error is

$G_T := I_T(m_T) - I_\infty(m_T)$

and represents the discrepancy between training error, which is defined using a finite trajectory, and idealized test error, which is defined using an infinite length trajectory (or, equivalently, the invariant measure $\mu$), when evaluated at the estimate $m_T$ of the function $m^\dagger$ obtained from finite data. We return to study excess risk and generalization error in the context of linear models for $m$, and under ergodicity assumptions on the data generating process, in Section 5.2.

We have introduced a machine learning framework in the continuous-time Markovian setting, but it may be adopted in discrete-time and in non-Markovian settings. In Section 4, we define appropriate objective functions for each of these cases.

Remark 3.1.

The developments we describe here for learning in ordinary differential equations can be extended to the case of learning stochastic differential equations; see (Bento et al., 2011; Kutoyants, 2004). In that setting, consistency in the large $T$ limit is well-understood. It would be interesting to build on the learning theory perspective described here to study statistical consistency for ordinary differential equations; the approaches developed in the work by McGoff et al. (2015); Su and Mukherjee (2021) are potentially useful in this regard.

4. Parameterization of the Loss Function

In this section, we define explicit optimizations for learning (approximate) model error functions for the Markovian settings, and model error operators for the non-Markovian settings; both continuous- and discrete-time formulations are given. We defer discussion of specific approximation architectures to the next section. Here we make a notational transition from optimization over (possibly non-parametric) functions $m$ to functions $m(\cdot; \theta)$ parameterized by $\theta$ that characterize the class $\mathcal{M}$.

In all the numerical experiments in this paper, we study the use of continuous- and discrete-time approaches to model data generated by a continuous-time process. The set-up in this section reflects this setting, in which two key parameters appear: $T$, the continuous-time horizon for the data; and $\Delta t$, the frequency of the data. The latter parameter will always appear in the discrete-time models; but it may also be implicit in continuous-time models which need to infer continuous-time quantities from discretely sampled data. We relate $T$ and $\Delta t$ by $T = K \Delta t$. We present the general forms of the objective functions $\mathcal{J}(\theta)$ (with optional regularization terms $R(\theta)$). Optimization via derivative-based methodology requires either analytic differentiation of the dynamical system model with respect to parameters, or the use of autodifferentiable ordinary differential equation solvers (Rubanova et al., 2019).

4.1. Continuous-Time Markovian Learning

Here, we approximate the Markovian closure term $m^\dagger$ in (2.2) with a parameterized function $m(\cdot; \theta)$. Assuming full knowledge of $f_0$, we learn the correction term for the flow field $f_0$ by minimizing the following objective function of $\theta$:

(4.1) $\mathcal{J}(\theta) := \frac{1}{T} \int_0^T \bigl\| \dot{x}(t) - f_0\bigl(x(t)\bigr) - m\bigl(x(t); \theta\bigr) \bigr\|^2\, dt + R(\theta).$

Note that $\mathcal{J}(\theta) = I_T\bigl(m(\cdot; \theta)\bigr) + R(\theta)$, with $I_T$ as in (3.1).

Notable examples that leverage this framing include: the paper (Kaheman et al., 2019), where $\theta$ are coefficients for a library of low-order polynomials and $R(\theta)$ is a sparsity-promoting regularization defined by the SINDy framework; the paper (Shi et al., 2018), where $\theta$ are parameters of a deep neural network and $R(\theta)$ encodes constraints on the Lipschitz constant of $m(\cdot; \theta)$ provided by spectral normalization; and the paper (Watson, 2019), which applies this approach to the Lorenz ’96 Multiscale system using neural networks with an $\ell_2$ regularization on the weights.
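As a sketch of the data preparation implied by (4.1), assuming only discretely sampled, noise-free, high-frequency states, one can estimate $\dot{x}$ by central differences and regress the pointwise residual targets; the helper below is hypothetical, and any regressor can consume its output.

```python
import numpy as np

def residual_targets(X, dt, f0):
    """Given samples X[k] ~ x(k*dt) on a fine grid, estimate xdot by central
    differences and return (input, target) pairs for the model error, using
    m†(x(t)) = xdot(t) - f0(x(t)).  Any regressor for m(.; theta) -- random
    features, a neural network, a polynomial library -- can be fit to these
    pairs to approximately minimize the empirical objective (4.1)."""
    xdot = (X[2:] - X[:-2]) / (2.0 * dt)   # second-order central differences
    X_in = X[1:-1]
    targets = xdot - np.array([f0(x) for x in X_in])
    return X_in, targets
```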

4.2. Discrete-Time Markovian Learning

Here, we learn the Markovian correction term $m^\dagger$ in (2.5) for the discrete solution map $\Psi_0$ by minimizing:

(4.2) $\mathcal{J}(\theta) := \frac{1}{K} \sum_{k=0}^{K-1} \bigl\| x_{k+1} - \Psi_0(x_k) - m(x_k; \theta) \bigr\|^2 + R(\theta).$

(Recall that $x_k := x(k\Delta t)$.) This is the natural discrete-time analog of (4.1) and may be derived analogously, starting from a discrete analog of the loss $I_\infty$, where now $\mu$ is assumed to be an ergodic measure for (2.4). If a discrete analog of (3.1) is defined, then parameterization of $m$, and regularization, leads to (4.2). This is the underlying model assumption in the work by Farchi et al. (2021).
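A minimal sketch of minimizing (4.2) with a neural network residual, assuming `Psi0` is available as a batched, differentiable torch function (an assumption about the implementation, not a requirement of the framework):

```python
import torch

d = 3  # observed state dimension (illustrative)
m_theta = torch.nn.Sequential(              # residual model m(.; theta)
    torch.nn.Linear(d, 64), torch.nn.Tanh(), torch.nn.Linear(64, d))

def discrete_markovian_loss(X, Psi0, lam=1e-6):
    """Empirical objective (4.2): mean-squared one-step residuals plus an
    l2 penalty on the network weights playing the role of R(theta)."""
    x_k, x_next = X[:-1], X[1:]
    residual = x_next - Psi0(x_k) - m_theta(x_k)
    data_term = residual.pow(2).sum(dim=1).mean()
    reg = lam * sum(p.pow(2).sum() for p in m_theta.parameters())
    return data_term + reg

# opt = torch.optim.Adam(m_theta.parameters(), lr=1e-3)
# for _ in range(num_epochs):
#     opt.zero_grad(); discrete_markovian_loss(X, Psi0).backward(); opt.step()
```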

4.3. Continuous-Time Non-Markovian Learning

We can attempt to recreate the dynamics in $x$ for eq. 2.10 by modeling the non-Markovian residual term. A common approach is to augment the dynamics space with a variable $r \in \mathbb{R}^p$ as follows:

(4.3a) $\dot{x} = f_0(x) + m(x, r; \theta),$
(4.3b) $\dot{r} = g(x, r; \theta),$

where we seek a large enough $p$ and parametric models $m, g$ expressive enough such that the dynamics in $x$ are reproduced by eq. 4.3. We learn a good parameterization for eq. 4.3 by minimizing the following objective:

(4.4) $\mathcal{J}(\theta) := \frac{1}{T} \int_0^T \bigl\| \hat{x}(t; \theta) - x(t) \bigr\|^2\, dt + R(\theta),$

where $\hat{x}(\cdot; \theta)$ denotes the $x$ component of the solution of eq. 4.3. Specific functional forms of $m, g$ (and their corresponding parameter inference strategies) reduce eq. 4.3 to various approaches. For the continuous-time RNN structure we discuss in the next section we choose $m$ to be linear in $r$ and independent of $x$, whilst $g$ is a neural network; we may generalize and consider latent ordinary differential equations in which $m$ and $g$ are both neural networks. It is intuitive that the latter may be more expressive and afford a smaller $p$ than the former. To our knowledge, studies of continuous-time non-Markovian closures using RNNs have not yet been published; this appears to be a direction that would be interesting to pursue further and we will discuss it from a theoretical standpoint in Section 5.
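The following sketch (ours; explicit Euler is chosen purely for brevity) shows how the augmented system eq. 4.3 can be rolled out and scored against data, so that gradients of the objective (4.4) flow through the whole trajectory:

```python
import torch

def rollout_augmented(x0, r0, f0, m, g, dt, n_steps):
    """Explicit-Euler rollout of the augmented system (4.3):
    xdot = f0(x) + m(x, r),  rdot = g(x, r)."""
    xs, x, r = [x0], x0, r0
    for _ in range(n_steps):
        x, r = x + dt * (f0(x) + m(x, r)), r + dt * g(x, r)
        xs.append(x)
    return torch.stack(xs)

def trajectory_misfit(x_obs, x0, r0, f0, m, g, dt):
    """Objective (4.4): mean-squared mismatch between the observed trajectory
    and the x component generated by (4.3); autograd backpropagates through
    the whole rollout."""
    x_hat = rollout_augmented(x0, r0, f0, m, g, dt, len(x_obs) - 1)
    return (x_hat - x_obs).pow(2).sum(dim=-1).mean()
```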

Remark 4.1.

The recent paper (Gupta and Lermusiaux, 2021) proposes an interesting, and more computationally tractable, approach to learning model error in the presence of memory. They propose to learn a closure operator for a delay differential equation with finite memory $\tau$:

(4.5) $\dot{x}(t) = f_0\bigl(x(t)\bigr) + m\bigl(\{x(s)\}_{s \in [t - \tau,\, t]};\, \theta\bigr);$

neural networks are used to learn the operator $m$. Alternatively, Gaussian processes were used to fit a specific stochastic generalization of the delay differential equation (4.5) in (Schneider et al., 2020).

4.4. Discrete-Time Non-Markovian Learning

Here, we aim to recreate the dynamics in $x$ for eq. 2.13 by modeling the non-Markovian residual term, and do this by augmenting the dynamics space with a variable $r_k \in \mathbb{R}^p$ as follows:

(4.6a) $x_{k+1} = \Psi_0(x_k) + m(x_k, r_k; \theta),$
(4.6b) $r_{k+1} = g(x_k, r_k; \theta),$

where we seek a large enough $p$ and $m, g$ expressive enough such that the dynamics in $x$ are reproduced by eq. 4.6. Learning is performed by minimizing the following objective

(4.7) $\mathcal{J}(\theta) := \frac{1}{K} \sum_{k=0}^{K} \bigl\| \hat{x}_k(\theta) - x_k \bigr\|^2 + R(\theta),$

where $\hat{x}_k(\theta)$ denotes the $x$ component of the solution of eq. 4.6. (Recall that $x_k := x(k\Delta t)$.) The functional forms of $m, g$ (and their corresponding parameter inference strategies) reduce eq. 4.6 to various approaches, including recurrent neural networks, latent ordinary differential equations, and delay-embedding maps (e.g. to get a delay embedding map, $g$ is a shift operator). Pathak et al. (2018b) used reservoir computing (a random features approach to RNNs, described in the next section) with $\ell_2$ regularization to study an approach similar to eq. 4.6, but included $\Psi_0$ as a feature in $m$ and $g$ instead of using it as the central model upon which to learn residuals. The data-driven super-parameterization approach in (Chattopadhyay et al., 2020b) also appears to follow the underlying assumption of eq. 4.6.

4.5. Connections to Physics-Informed Neural Networks

It is also of interest to consider how physics-based regularization ideas from Physics-Informed Neural Networks (PINNs) (Raissi et al., 2019), which were designed to improve partial differential equation surrogate modeling, can be leveraged for modeling ordinary differential equations. Investigation into this topic is still ongoing (Tipireddy et al., 2019; Yazdani et al., 2020; Antonelo et al., 2021). A natural starting point is to consider the objective function

(4.8) $\mathcal{J}(\theta) := \frac{1}{T} \int_0^T \bigl\| \dot{x}(t) - f_0\bigl(x(t)\bigr) - m\bigl(x(t); \theta\bigr) \bigr\|^2\, dt + \frac{\lambda}{T} \int_0^T \bigl\| m\bigl(x(t); \theta\bigr) \bigr\|^2\, dt;$

however, if the trajectory is not long enough to cover all relevant parts of state-space required for prediction, an alternative is to consider the objective function

(4.9) $\mathcal{J}(\theta) := \frac{1}{T} \int_0^T \bigl\| \dot{x}(t) - f_0\bigl(x(t)\bigr) - m\bigl(x(t); \theta\bigr) \bigr\|^2\, dt + \lambda \int_\Omega \bigl\| m(x; \theta) \bigr\|^2\, dx;$

here $\Omega$ defines a bounded region in Euclidean space to which the dynamics to be learned is assumed to be confined. In both these settings the regularization parameter $\lambda$ can be learned, e.g. via cross-validation. It may be desirable to fit a discrete-time model, given an imperfect continuous-time vector field, in which case a natural objective function is

(4.10) $\mathcal{J}(\theta) := \frac{1}{K} \sum_{k=0}^{K-1} \bigl\| x_{k+1} - \Psi_0(x_k) - m(x_k; \theta) \bigr\|^2 + \frac{\lambda}{K} \sum_{k=0}^{K-1} \bigl\| m(x_k; \theta) \bigr\|^2.$

Again the second term may be generalized to an integral over space rather than time, if the trajectory data does not explore enough of the state-space. These physics-based regularizations may be especially useful when available data are limited or when derivatives are poorly estimated from data; regularization with respect to deviations from $f_0$ (or $\Psi_0$) provides cheap, approximate data augmentation.
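A sketch consistent with the (4.9)-style objective above: a derivative-matching data term plus a penalty on the learned correction evaluated at samples from a region $\Omega$. The construction of `omega_samples` (e.g. uniform draws from a bounding box around the attractor) is our assumption.

```python
import torch

def physics_regularized_loss(X, Xdot, f0, m_theta, lam, omega_samples):
    """A (4.9)-style objective: a derivative-matching data term along the
    trajectory plus a penalty on the learned correction m(.; theta) at points
    sampled from a bounded region Omega, discouraging deviation from f0 where
    there are no data."""
    data_term = (Xdot - f0(X) - m_theta(X)).pow(2).sum(dim=1).mean()
    physics_term = m_theta(omega_samples).pow(2).sum(dim=1).mean()
    return data_term + lam * physics_term

# omega_samples: e.g. uniform draws from a bounding box,
# omega_samples = lo + (hi - lo) * torch.rand(1024, d)
```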

5. Underpinning Theory

In this section we identify specific hypothesis classes $\mathcal{M}$. We do this using random feature maps (Rahimi and Recht, 2008a) in the Markovian settings (Section 5.1), and using recurrent neural networks in the memory-dependent setting. We then discuss these problems from a theoretical standpoint. In Section 5.2 we study excess risk and generalization error in the context of linear models (a setting which includes the random features model as a special case). And we conclude by discussing the use of RNNs (Goodfellow et al., 2016, Chapter 10) for the non-Markovian settings (discrete- and continuous-time) in Section 5.3; we present an approximation theorem for continuous-time hybrid RNN models. Throughout this section, the specific use of random feature maps and of recurrent neural networks is for illustrative purposes only; other models could, of course, be used.

5.1. Markovian Modeling with Random Feature Maps

In principle, any hypothesis class can be used to learn $m^\dagger$. However, we focus on models that are easily trained on large-scale complex systems and yet have proven approximation power for functions between finite-dimensional Euclidean spaces. For the Markovian modeling case, we use random feature maps; like traditional neural networks, they possess arbitrary approximation power (Rahimi and Recht, 2008c, b), but further benefit from a quadratic minimization problem in the training phase, as do kernel or Gaussian process methods. In our case studies, we found random feature models to be sufficiently expressive, easy to optimize, and able to generalize well. Moreover, their linearity with respect to unknown parameters enables a straightforward analysis of excess risk and generalization error in Section 5.2. Details on the derivation and specific design choices for our random feature modeling approach can be found in Section 8.3, where we explain how we sample random feature functions and stack them to form a vector-valued feature map $\phi : \mathbb{R}^d \to \mathbb{R}^D$. Given this random function $\phi$, we define the hypothesis class

(5.1) $\mathcal{M} := \bigl\{ C\phi(\cdot) \;:\; C \in \mathbb{R}^{d \times D} \bigr\}.$

When studying statistical learning, we will also consider the hypothesis class

(5.2) $\mathcal{M} := \Bigl\{ \sum_{\ell=1}^p \theta_\ell f_\ell(\cdot) \;:\; \theta \in \mathbb{R}^p \Bigr\},$

where the $f_\ell : \mathbb{R}^d \to \mathbb{R}^d$ are, in the random features case, i.i.d. draws of a random function.
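One common random feature construction, given here as an assumption for illustration (the paper's specific sampling choices appear in its Section 8.3), freezes a random affine map inside a pointwise nonlinearity:

```python
import numpy as np

def make_random_features(d, D, scale=1.0, seed=0):
    """phi(x) = tanh(W x + b) with W, b sampled once and then frozen; only the
    outer coefficients C in m(x) = C phi(x) are trained, cf. class (5.1)."""
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((D, d))
    b = rng.uniform(-np.pi, np.pi, size=D)
    return lambda x: np.tanh(W @ x + b)
```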

5.1.1. Continuous-Time

In the continuous-time framing, our Markovian closure model uses hypothesis class (5.1) and thus takes the form

$\dot{x} = f_0(x) + C\phi(x).$

We rewrite eq. 4.1 for this particular case with an $\ell_2$ regularization parameter $\lambda \geq 0$:

(5.3) $\mathcal{J}(C) := \frac{1}{2T} \int_0^T \bigl\| \dot{x}(t) - f_0\bigl(x(t)\bigr) - C\phi\bigl(x(t)\bigr) \bigr\|^2\, dt + \frac{\lambda}{2} \| C \|_F^2.$

We employ the notation $a \otimes b := ab^\top$ for the outer product between vectors $a$ and $b$, and the following notation for the time average of a function of time $g$:

$\langle g \rangle_T := \frac{1}{T} \int_0^T g(t)\, dt.$

Equation 5.3 is quadratic and convex in $C$ and thus is globally minimized for the unique $C$ which makes the derivative of $\mathcal{J}$ zero. Consequently, the minimizer satisfies the following linear equation (derived in the Appendix Section 8.4):

(5.4) $C\,(Z + \lambda I) = Y.$

Here, $I$ is the identity matrix and

(5.5) $Z := \bigl\langle \phi(x) \otimes \phi(x) \bigr\rangle_T, \qquad Y := \bigl\langle m^\dagger(x) \otimes \phi(x) \bigr\rangle_T.$

Of course $m^\dagger$ is not known, but $m^\dagger\bigl(x(t)\bigr) = \dot{x}(t) - f_0\bigl(x(t)\bigr)$ can be computed from data.

To summarize, the algorithm proceeds as follows: 1) create a realization of the random feature vector $\phi$; 2) compute the integrals in eq. 5.5 to obtain $Z$ and $Y$; and 3) solve the linear matrix equation eq. 5.4 for $C$. Together this leads to our approximation $f_0(\cdot) + C\phi(\cdot)$ of $f^\dagger$.
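Steps 1)–3) admit a short implementation; the sketch below (ours) approximates the time averages in eq. 5.5 by Riemann sums over the samples and estimates $\dot{x}$ by central differences, as discussed in Section 6.2.

```python
import numpy as np

def fit_rf_continuous(X, dt, f0, phi, lam):
    """Implements steps 1)-3): Riemann-sum approximations of the time averages
    Z, Y in eq. (5.5), then the solve of eq. (5.4), C (Z + lam I) = Y.  The
    derivative xdot is estimated by central differences (cf. Section 6.2)."""
    xdot = (X[2:] - X[:-2]) / (2.0 * dt)
    Phi = np.array([phi(x) for x in X[1:-1]])           # (N, D) feature matrix
    Mdag = xdot - np.array([f0(x) for x in X[1:-1]])    # m†(x) = xdot - f0(x)
    N, D = Phi.shape
    Z = Phi.T @ Phi / N                                 # <phi ⊗ phi>_T
    Y = Mdag.T @ Phi / N                                # <m† ⊗ phi>_T
    C = np.linalg.solve(Z + lam * np.eye(D), Y.T).T     # uses symmetry of Z
    return C                                            # learned m(x) = C phi(x)
```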

5.1.2. Discrete-Time

In the discrete-time framing, our Markovian closure model takes the form

$x_{k+1} = \Psi_0(x_k) + C\phi(x_k).$

We rewrite eq. 4.2 for this particular case with an $\ell_2$ regularization parameter $\lambda \geq 0$:

(5.6) $\mathcal{J}(C) := \frac{1}{2K} \sum_{k=0}^{K-1} \bigl\| x_{k+1} - \Psi_0(x_k) - C\phi(x_k) \bigr\|^2 + \frac{\lambda}{2} \| C \|_F^2.$

Equation 5.6 is quadratic in $C$ and thus is globally minimized at the $C$ which satisfies the following linear equation:

(5.7) $C\,(Z + \lambda I) = Y.$

Here

(5.8) $Z := \frac{1}{K} \sum_{k=0}^{K-1} \phi(x_k) \otimes \phi(x_k), \qquad Y := \frac{1}{K} \sum_{k=0}^{K-1} m^\dagger(x_k) \otimes \phi(x_k).$

The derivation is analogous to that for the continuous-time setting, as detailed in Section 8.4. Of course $m^\dagger$ is not known, but $m^\dagger(x_k) = x_{k+1} - \Psi_0(x_k)$, and so $Y$ may be calculated from data. This formulation closely mirrors the fully data-driven linear regression approach in (Gottwald and Reich, 2021). By computing $Z$ and $Y$ and solving the linear equation eq. 5.7 for $C$, we obtain our approximation $\Psi_0(\cdot) + C\phi(\cdot)$.
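The discrete-time fit is, if anything, simpler, since the residuals are exact differences of the data and the nominal flow map; a sketch under the same conventions as above:

```python
import numpy as np

def fit_rf_discrete(X, Psi0, phi, lam):
    """Discrete-time analog, eqs. (5.6)-(5.8): residuals of the nominal flow
    map replace vector-field residuals; the linear algebra is unchanged."""
    Phi = np.array([phi(x) for x in X[:-1]])
    Mdag = X[1:] - np.array([Psi0(x) for x in X[:-1]])  # m†(x_k) = x_{k+1} - Psi0(x_k)
    K, D = Phi.shape
    Z = Phi.T @ Phi / K
    Y = Mdag.T @ Phi / K
    return np.linalg.solve(Z + lam * np.eye(D), Y.T).T  # solves eq. (5.7)
```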

5.2. Learning Theory for Markovian Models with Linear Hypothesis Class

In this subsection we provide estimates of the excess risk and generalization error in the context of learning the model error $m^\dagger$ in (2.2) from a trajectory over time horizon $T$. We study ergodic continuous-time models in the setting of Section 4.1 and seek to learn $m^\dagger$ from the hypothesis class $\mathcal{M}$ defined in (5.2). Here $m(x; \theta) := \sum_{\ell=1}^p \theta_\ell f_\ell(x)$ and $\theta \in \mathbb{R}^p$. For the learning theory presented in this subsection, it is not necessary to use the random features construction of the $f_\ell$; however, we note that, in that setting, universal approximation for random features (Rahimi and Recht, 2008a) shows that the loss function may be made arbitrarily small by choice of $p$ large enough and appropriate choice of parameters. We also note that the theory we present in this subsection is also readily generalized to working with hypothesis class (5.1).

We make the following ergodicity assumption about the data generation process:

Assumption A1.

Equation (2.2) possesses a compact global attractor $\mathcal{A}$ supporting invariant measure $\mu$. Furthermore the dynamical system on $\mathcal{A}$ is ergodic with respect to $\mu$ and satisfies a central limit theorem of the following form: for all Hölder continuous bounded $\varphi : \mathbb{R}^d \to \mathbb{R}$, there is $\sigma = \sigma(\varphi) \geq 0$ such that

(5.9) $\sqrt{T}\,\Bigl( \frac{1}{T} \int_0^T \varphi\bigl(x(t)\bigr)\, dt - \int_{\mathcal{A}} \varphi\, d\mu \Bigr) \;\Rightarrow\; \mathcal{N}(0, \sigma^2),$

where $\Rightarrow$ denotes convergence in distribution with respect to $\mu$. Furthermore, almost surely with respect to $\mu$,

(5.10) $\limsup_{T \to \infty} \sqrt{\frac{T}{\log\log T}}\, \Bigl| \frac{1}{T} \int_0^T \varphi\bigl(x(t)\bigr)\, dt - \int_{\mathcal{A}} \varphi\, d\mu \Bigr| < \infty.$
Remark 5.1.

In the work of Melbourne and co-workers, such an assumption is proven to hold for a class of differential equations, including the Lorenz ’63 model at, and in a neighbourhood of, the classical parameter values: in (Holland and Melbourne, 2007) the central limit theorem is established; and in (Bahsoun et al., 2020) the continuity of $\sigma$ with respect to the system parameters is proven. Whilst it is in general very difficult to prove such results for any given chaotic dynamical system, there is strong empirical evidence for such results in many chaotic dynamical systems that arise in practice. This combination of theory and empirical evidence justify studying the learning of model error under Assumption A1. Tran and Ward (2017) were the first to make use of the theory of Melbourne and coworkers to study learning of chaotic differential equations from time-series.

Given $m(\cdot; \theta)$ from hypothesis class (5.2) we define

(5.11) $\theta_T := \operatorname{arg\,min}_{\theta \in \mathbb{R}^p} I_T\bigl(m(\cdot; \theta)\bigr)$

and

(5.12) $\theta_\infty := \operatorname{arg\,min}_{\theta \in \mathbb{R}^p} I_\infty\bigl(m(\cdot; \theta)\bigr).$

(Regularization is not needed in this setting because the data is plentiful—a continuous-time trajectory—and the number of parameters is finite.) Then $\theta_T$ and $\theta_\infty$ solve the linear systems

$A_T\,\theta_T = b_T, \qquad A_\infty\,\theta_\infty = b_\infty,$

where

$(A_T)_{ij} := \frac{1}{T} \int_0^T \bigl\langle f_i\bigl(x(t)\bigr), f_j\bigl(x(t)\bigr) \bigr\rangle\, dt, \qquad (b_T)_i := \frac{1}{T} \int_0^T \bigl\langle f_i\bigl(x(t)\bigr), m^\dagger\bigl(x(t)\bigr) \bigr\rangle\, dt,$

and $A_\infty$, $b_\infty$ are defined identically with the time averages replaced by averages with respect to the invariant measure $\mu$. These facts can be derived analogously to the derivation in Section 8.4. Given $\theta_T$ and $\theta_\infty$ we also define

$m_T := m(\cdot; \theta_T), \qquad m_\infty := m(\cdot; \theta_\infty).$

Recall that it is assumed that $f^\dagger$ and $f_0$, and hence $m^\dagger$, are $C^1$. We make the following assumption regarding the vector fields defining hypothesis class $\mathcal{M}$:

Assumption A2.

The functions $\{f_\ell\}_{\ell=1}^p$ are Hölder continuous on $\mathcal{A}$. In addition, the matrix $A_\infty$ is invertible.

Theorem 5.2.

Let Assumptions A1 and A2 hold. Then the scaled excess risk $\sqrt{T}\,E_T$ (resp. scaled generalization error $\sqrt{T}\,G_T$) is bounded above by $e_T$ (resp. $g_T$), where the random variable $e_T$ (resp. $g_T$) converges in distribution to a limit $e$ (resp. $g$) w.r.t. $\mu$. Furthermore, there is a constant $C > 0$ such that, almost surely w.r.t. $\mu$,

$\limsup_{T \to \infty} \sqrt{\frac{T}{\log\log T}}\, \bigl( E_T + |G_T| \bigr) \leq C.$

The proof is provided in the appendix Section 8.1.

Remark 5.3.

The convergence in distribution shows that, with high probability with respect to initial data, the excess risk and the generalization error are bounded above by terms of size $\mathcal{O}(1/\sqrt{T})$. This can be improved to give an almost sure result, at the cost of the factor of $\sqrt{\log\log T}$. The theorem shows that (ignoring log factors and acknowledging the probabilistic nature of any such statements) trajectories of length $T = \mathcal{O}(\epsilon^{-2})$ are required to produce bounds on the excess risk and generalization error of size $\epsilon$. In turn, the bounds on excess risk and generalization error also show that empirical risk minimization (of $I_T$) approaches the theoretically analyzable concept of risk minimization (of $I_\infty$) over hypothesis class (5.2). The sum of the excess risk and the generalization error gives

$E_T + G_T = I_T(m_T) - I_\infty(m_\infty).$

We note that $I_T(m_T)$ is computable, once the approximate solution $m_T$ has been identified; thus, when combined with an estimate for $E_T + G_T$, this leads to an estimate for the risk associated with the hypothesis class used. Work by Zhang et al. (2021) demonstrates that error bounds on learned model error terms can be extended to bound error on reproduction of invariant statistics for ergodic Itô diffusions (a generalization of the ordinary differential equation setting we have presented). Moreover, E et al. (2019) provide a direction for proving similar bounds on model error learning using nonlinear function classes (e.g. two-layer neural networks).

Finally we remark on the dependence of the risk and generalization error bounds on the size of the model error. It is intuitive that the amount of data required to learn model error should decrease as the size of the model error decreases. This is demonstrated numerically in Section 6.4 (c.f. Figures 1(b) and 1(a)). Here we comment that Theorem 5.2 also exhibits this feature: examination of the proof in Appendix 8.1 shows that all upper bounds on terms appearing in the excess and generalization error are proportional to $m^\dagger$ itself or to $m_\infty$, its approximation given an infinite amount of data; note that $m_\infty = m^\dagger$ if the hypothesis class contains the truth.

5.3. Non-Markovian Modeling with Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are one of the de facto tools for modeling systems with memory. Here, we show straightforward residual implementations of RNNs for continuous- and discrete-time, with the goal of modeling non-Markovian model error. For example, in discrete-time, the form in eq. 4.6 can be specified as

(5.13a) $x_{k+1} = \Psi_0(x_k) + C r_k,$
(5.13b) $r_{k+1} = \sigma(A r_k + B x_k + c),$

with $\sigma$ an activation function applied entrywise. The parameters $(A, B, C, c)$ can be learned to describe memory-dependent model error. Note that inherent in choosing these matrices and vector is a choice of embedding dimension $p$ for the variable $r \in \mathbb{R}^p$; this will typically be larger than the dimension of $x$ itself. The goal is to choose them so that the output $\{x_k\}$ of (5.13) matches the output of (2.12), without observation of $\{y_k\}$ or knowledge of $\Psi_2^\dagger$ and $m^\dagger$.
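A sketch of eq. 5.13 as a trainable module (our own illustration; `Psi0` is assumed given as a frozen, differentiable map, and tanh stands in for the activation $\sigma$):

```python
import torch

class HybridRNN(torch.nn.Module):
    """Eq. (5.13) as a trainable module:
    x_{k+1} = Psi0(x_k) + C r_k,  r_{k+1} = tanh(A r_k + B x_k + c)."""
    def __init__(self, d, p, Psi0):
        super().__init__()
        self.Psi0 = Psi0                        # frozen mechanistic flow map
        self.A = torch.nn.Linear(p, p, bias=False)
        self.B = torch.nn.Linear(d, p)          # its bias plays the role of c
        self.C = torch.nn.Linear(p, d, bias=False)

    def forward(self, x0, r0, n_steps):
        xs, x, r = [x0], x0, r0
        for _ in range(n_steps):
            # both updates use states at step k (tuple RHS evaluated first)
            x, r = self.Psi0(x) + self.C(r), torch.tanh(self.A(r) + self.B(x))
            xs.append(x)
        return torch.stack(xs)   # compare to data and train by BPTT
```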

We note here that more sophisticated variants, such as Long Short-Term Memory RNNs (Hochreiter and Schmidhuber, 1997), are often more effective, but similar in nature. In continuous-time, we specify the form in eq. 4.3 as

(5.14a) $\dot{x} = f_0(x) + C r,$
(5.14b) $\dot{r} = \sigma(A r + B x + c).$

The goal is to choose $(A, B, C, c)$ so that the output $x(\cdot)$ of (5.14) matches the output of (2.9), without observation of $y(\cdot)$ or knowledge of $g^\dagger$ and $m^\dagger$. As in discrete-time, inherent in choosing these matrices and vector is a choice of embedding dimension $p$ for the variable $r \in \mathbb{R}^p$, which will typically be larger than the dimension of $x$ itself.

In both continuous- and discrete-time settings, the idea is to create a recurrent state $r$ of sufficiently large dimension $p$ whose evolution equation takes $x$ as input and, after a final linear transformation, approximates the missing dynamics $m^\dagger$.

A full discussion of implementation of this model is beyond the scope of this paper. We refer readers to (Goodfellow et al., 2016) for background on discrete RNN implementations and backpropagation through time (BPTT). For implementations of continuous-time RNNs, it is common to leverage the success of the automatic BPTT code written in PyTorch and TensorFlow by discretizing eq. 5.14 with an ordinary differential equation solver that is compatible with these autodifferentiation tools (e.g. torchdiffeq (Rubanova et al., 2019)). This compatibility can also be achieved by use of explicit Runge-Kutta schemes (Queiruga et al., 2020). Note that the discretization of eq. 5.14 can (and perhaps should) be much finer than the data sampling rate $\Delta t$, but that this requires reliable estimation of $x(t)$ between observation times from the discrete data.
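Combining eq. 5.14 with an autodifferentiable solver as described above might look as follows; `torchdiffeq.odeint` is the library call from Rubanova et al. (2019), while `f0_torch`, the dimensions, and the subsampling in the loss are our assumptions.

```python
import torch
from torchdiffeq import odeint   # autodifferentiable ODE solvers

class HybridODERNN(torch.nn.Module):
    """Right-hand side of eq. (5.14) on the joint state (x, r):
    xdot = f0(x) + C r,  rdot = tanh(A r + B x + c)."""
    def __init__(self, d, p, f0):
        super().__init__()
        self.f0, self.d = f0, d
        self.A = torch.nn.Linear(p, p, bias=False)
        self.B = torch.nn.Linear(d, p)              # its bias supplies c
        self.C = torch.nn.Linear(p, d, bias=False)

    def forward(self, t, state):
        x, r = state[..., :self.d], state[..., self.d:]
        return torch.cat([self.f0(x) + self.C(r),
                          torch.tanh(self.A(r) + self.B(x))], dim=-1)

# rhs = HybridODERNN(d=3, p=50, f0=f0_torch)
# t_fine = torch.linspace(0.0, 10.0, 1001)        # finer than the data rate
# traj = odeint(rhs, state0, t_fine)              # BPTT through the solver
# loss = (traj[::10, :3] - x_obs).pow(2).mean()   # compare at sample times
```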

Existing approximation theory for discrete-time RNNs (Schäfer and Zimmermann, 2007) can be used to show that eq. 5.13 can approximate discrete-time systems arbitrarily well. However, the approximation theorem for continuous-time RNNs was proved for additive-style equations (Funahashi and Nakamura, 1993); we extend the result to a more common (non-additive) form, and do so in the context of residual-based learning as in eq. 5.14.

We state the theorem after making three assumptions upon which it rests:

Assumption A3.

The functions $f_0$, $f^\dagger$ and $g^\dagger$ are all globally Lipschitz.

Note that this implies that $m^\dagger$ is also globally Lipschitz.

Assumption A4.

Let