Bayesian differential programming for robust systems identification under uncertainty

04/15/2020 · by Yibo Yang, et al. · University of Pennsylvania

This paper presents a machine learning framework for Bayesian systems identification from noisy, sparse and irregular observations of nonlinear dynamical systems. The proposed method takes advantage of recent developments in differentiable programming to propagate gradient information through ordinary differential equation solvers and perform Bayesian inference with respect to unknown model parameters using Hamiltonian Monte Carlo. This allows us to efficiently infer posterior distributions over plausible models with quantified uncertainty, while the use of sparsity-promoting priors enables the discovery of interpretable and parsimonious representations for the underlying latent dynamics. A series of numerical studies is presented to demonstrate the effectiveness of the proposed methods including nonlinear oscillators, predator-prey systems, chaotic dynamics and systems biology. Taken all together, our findings put forth a novel, flexible and robust workflow for data-driven model discovery under uncertainty.


1 Introduction

In the era of big data, dynamical systems discovery has received a lot of attention, primarily due to the significant growth of accessible data across different scientific disciplines, including systems biology [ruoff2003temperature], bio-medical imaging [kak2002principles], fluid dynamics [raissi2020hidden], climate modeling [palmer1999nonlinear], physical chemistry [feinberg1974dynamics], etc. Extracting a set of interpretable underlying features that describe the evolution of a dynamical system is crucial for developing physical insight and gaining a better understanding of the natural phenomena under study [haller2002lagrangian]. Perhaps more importantly, this can enable the reliable forecasting of future states and subsequently lead to effective intervention strategies for design and control of complex systems [tantet2018crisis, bemporad1999control, lu2019nonparametric].

Machine learning methods and data-driven modeling techniques have already proven their utility in solving high-dimensional problems in computer vision [krizhevsky2012imagenet], natural language processing [bahdanau2014neural], etc. Due to their capability of extracting features from high-dimensional and multi-fidelity noisy data [yang2019conditional, han2018solving], these methods are also gaining traction in modeling and simulating physical and biological systems. The evolution of such systems can typically be characterized by differential equations, and several techniques have been developed to construct predictive algorithms that can synergistically combine data and mechanistic prior knowledge. Such scientific machine learning approaches are currently employed to distill dynamics from time-series data [rackauckas2020universal, gholami2019anode, chen2018neural, brunton2016discovering, rudy2017data, brennan2018data, raissi2018multistep, qin2019data], infer the solution of differential equations [raissi2017inferring, raissi2019physics, zhu2019physics, yang2019adversarial, yang2020physics, wang2020understanding], infer parameters, latent variables and unknown constitutive laws [wang2019non, raissi2020hidden, raissi2018hidden, tartakovsky2018learning], as well as tackle forward and inverse problems in complex application domains including cardiovascular flow dynamics [kissas2020machine], metamaterials [chen2019physics], cardiac electrophysiology [sahli2020physics], etc.

Specific to systems identification, most recent data-driven approaches [rackauckas2020universal, brunton2016discovering, qin2019data, chen2018neural, gholami2019anode] heavily rely on the quality of the observations and are not designed to return predictions with quantified uncertainty. For instance, the sparse regression methods put forth in [brunton2016discovering, rudy2017data] can exhibit unstable behavior if the data is highly noisy, and are not able to directly digest time-series data with irregular sampling frequency or missing values. On the other hand, recent approaches leveraging differentiable programming [rackauckas2020universal, chen2018neural, gholami2019anode] can support irregularly sampled data, but are only designed to provide point estimates for the discovered dynamics with no characterization of predictive uncertainty. Such lack of robustness and missing capabilities may limit the use of existing techniques to idealized settings, and pose the need for a more flexible framework that can effectively accommodate noisy, sparse and irregularly sampled data to infer posterior distributions over plausible models, and subsequently yield robust future forecasts with quantified uncertainty.

In this work, we aim to address the aforementioned capability gaps by formulating a fully Bayesian framework for robust systems identification from imperfect time-series data. Our specific contributions can be summarized in the following points:

  • We leverage recent developments in differentiable programming [rackauckas2020universal, chen2018neural, gholami2019anode] to enable gradient back-propagation through numerical ODE solvers, and utilize this information to construct accelerated Hamiltonian Monte Carlo schemes for Bayesian inference.

  • The proposed workflow is computationally efficient, end-to-end differentiable, and can directly accommodate sparse, noisy and irregularly sampled time-series data.

  • Equipped with sparsity-promoting priors, we can recover interpretable and parsimonious representations for the latent dynamics and, unlike existing approaches that are only designed to produce point estimates, entire posterior distributions over plausible models are obtained.

  • This probabilistic formulation is key for safe-guarding against erroneous data, incomplete model parametrizations, as well as for producing reliable future forecasts with quantified uncertainty.

  • We demonstrate enhanced capabilities and robustness against state-of-the-art methods for systems identification [brunton2016discovering, rudy2017data] across a range of benchmark problems.

Taken together, our findings put forth a novel, flexible and robust workflow for data-driven model discovery under uncertainty that can potentially lead to improved algorithms for robust forecasting, control and model-based reinforcement learning of complex systems.

The rest of this paper is organized as follows. Section 2 presents the proposed method and its corresponding technical ingredients. In section 2.1, we provide details on how the proposed method can effectively propagate gradient information through ODE solvers by leveraging differentiable programming, while section 2.2 reviews Bayesian inference and Hamiltonian Monte Carlo sampling. In section 2.5, we discuss a simple but essential step for pre-processing the observed data in order to obtain robust training behavior. In section 3, the effectiveness of the proposed method is tested on a series of numerical studies, including nonlinear oscillators, predator-prey systems, chaotic dynamics, and a realistic example in systems biology. Finally, in section 4 we summarize our key findings, discuss the limitations of the proposed approach, and carve out directions for future investigation.

2 Methods

This section provides a comprehensive overview of the key ingredients that define our work, namely differentiable programming via neural ordinary differential equations (NeuralODEs) [chen2018neural] and Bayesian inference with Hamiltonian Monte Carlo sampling [neal2011mcmc]. Our presentation is focused on describing how these techniques can be interfaced to obtain a novel, efficient, and robust workflow for parsimonious model discovery from imperfect time-series data.

2.1 Differentiable programming

In their original work, Chen et al. [chen2018neural] introduced a general framework for propagating gradient information through classical numerical solvers for ordinary differential equations (ODEs) that blends classical adjoint methods [pontryagin2018mathematical] with modern developments in automatic differentiation [van2018automatic]. To illustrate the main concepts, consider a general dynamical system of the form

$\frac{d\boldsymbol{x}(t)}{dt} = \boldsymbol{f}(\boldsymbol{x}(t), t; \boldsymbol{\theta}), \quad \boldsymbol{x}(t_0) = \boldsymbol{x}_0,$   (1)

where $\boldsymbol{x}(t) \in \mathbb{R}^d$ denotes the state of the $d$-dimensional dynamical system, and $\boldsymbol{\theta}$ is a vector of unknown parameters that parametrizes the latent dynamics $\boldsymbol{f}(\cdot)$. A systems identification task can now be summarized as follows. Given some observations $\{\boldsymbol{x}(t_i)\}_{i=1}^{N}$ evaluated at time instances $t_1 < t_2 < \dots < t_N$, one would like to learn the $\boldsymbol{\theta}$ that best parametrizes the underlying dynamics. A typical approach for identifying these optimal parameters is to define a loss function $\mathcal{L}(\boldsymbol{\theta})$ that measures the discrepancy between the observed data and the model’s predictions for a given $\boldsymbol{\theta}$, i.e.,

$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \ell\big(\boldsymbol{x}(t_i), \hat{\boldsymbol{x}}(t_i; \boldsymbol{\theta})\big),$   (2)

where $\hat{\boldsymbol{x}}(t_i; \boldsymbol{\theta})$ is the predicted value under a given set of estimated model parameters, obtained by integrating the dynamical system with some ODE solver. The metric $\ell(\cdot, \cdot)$ can be any measure of the discrepancy between the true and predicted values (e.g., an $\mathcal{L}_1$ or $\mathcal{L}_2$ loss [hastie2009elements], the KL-divergence [ghahramani2015probabilistic], the Wasserstein distance [gulrajani2017improved], etc.). A common way to minimize the loss function is through gradient descent [su2014differential, bottou2010large]; however, appropriate methods need to be employed for effectively computing the gradient of the loss function with respect to the parameters, namely $d\mathcal{L}/d\boldsymbol{\theta}$. This is done by defining the adjoint of the dynamical system as $\boldsymbol{a}(t) = \partial\mathcal{L}/\partial\boldsymbol{x}(t)$. Then, the dynamical system describing the evolution of the adjoint can be derived as [pontryagin2018mathematical, chen2018neural]

$\frac{d\boldsymbol{a}(t)}{dt} = -\boldsymbol{a}(t)^{\top} \frac{\partial \boldsymbol{f}(\boldsymbol{x}(t), t; \boldsymbol{\theta})}{\partial \boldsymbol{x}}.$   (3)

Note that the adjoint can be computed by an additional call to the chosen ODE solver, and the target derivative can then be computed as

$\frac{d\mathcal{L}}{d\boldsymbol{\theta}} = -\int_{t_N}^{t_0} \boldsymbol{a}(t)^{\top} \frac{\partial \boldsymbol{f}(\boldsymbol{x}(t), t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \, dt,$   (4)

where $\partial \boldsymbol{f}/\partial \boldsymbol{\theta}$ can be evaluated via automatic differentiation [van2018automatic, chen2018neural].
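In practice, modern automatic differentiation libraries provide this capability out of the box. The snippet below is a minimal sketch (not the authors' implementation) of how a gradient can be propagated through an ODE solver using JAX, whose odeint routine implements the adjoint method of [chen2018neural]; the right-hand side f and the synthetic data are purely illustrative.

```python
import jax.numpy as jnp
from jax import grad
from jax.experimental.ode import odeint

def f(x, t, theta):
    # Hypothetical parametrized right-hand side dx/dt = f(x, t; theta).
    return jnp.tanh(theta @ x)

def loss(theta, x0, ts, x_obs):
    # Integrate the ODE forward and measure the discrepancy of equation 2.
    x_pred = odeint(f, x0, ts, theta)
    return jnp.mean((x_pred - x_obs) ** 2)

# dL/dtheta is obtained without manually differentiating the solver internals;
# reverse-mode AD applies the adjoint equations 3-4 under the hood.
grad_loss = grad(loss)

theta0 = 0.1 * jnp.eye(2)
x0 = jnp.array([1.0, 0.5])
ts = jnp.linspace(0.0, 1.0, 20)
x_obs = odeint(f, x0, ts, 0.2 * jnp.eye(2))    # synthetic "observations"
print(grad_loss(theta0, x0, ts, x_obs).shape)  # -> (2, 2)
```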

The main advantages of this approach can be summarized in the following points:

  • The data does not need to be collected on a regular time grid.

  • The time-step $\Delta t$ between an observed data pair can be relatively large. Within each $\Delta t$, a classical numerical scheme can be used to integrate equation 4, where $\Delta t$ is discretized into $n_t$ equally spaced steps, with the step size $\Delta t / n_t$ being typically chosen according to the stability properties of the underlying ODE solver.

  • As this setup only assumes dependency between individual input-output pairs $(\boldsymbol{x}(t_i), \boldsymbol{x}(t_{i+1}))$, the observed time-series data does not need to be continuous and could be selected from different time intervals.

The specific choices of $\Delta t$ and $n_t$ will be discussed for each of the different examples given in this work. In general, the choice of $n_t$ reflects a trade-off between accuracy and computational complexity: a small $n_t$ may lead to a less accurate model, while a large $n_t$ would lead to a massive computational graph that can significantly slow down model training.

Throughout this work, the following loss function is employed

$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{N-1} \big\| \boldsymbol{x}(t_{i+1}) - \hat{\boldsymbol{x}}(t_{i+1}; \boldsymbol{\theta}) \big\|_2^2,$   (5)

where $\hat{\boldsymbol{x}}(t_{i+1}; \boldsymbol{\theta})$ denotes the output of a numerical ODE solver initialized at $\boldsymbol{x}(t_i)$. Here we use the classical fourth-order Runge-Kutta method [iserles2009first], although more general choices can be employed [rackauckas2020universal]. The training data-set consists of pairs $(\boldsymbol{x}(t_i), \boldsymbol{x}(t_{i+1}))$. To simplify notation, let $\boldsymbol{x}_i := \boldsymbol{x}(t_i)$ and $\hat{\boldsymbol{x}}_{i+1} := \hat{\boldsymbol{x}}(t_{i+1}; \boldsymbol{\theta})$; then the training data-set is given by $\mathcal{D} = \{(\boldsymbol{x}_i, \boldsymbol{x}_{i+1})\}_{i=1}^{N-1}$.
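To make the workflow above concrete, the sketch below illustrates (under our own notational assumptions, not as the authors' code) how a fourth-order Runge-Kutta integrator can be stepped between consecutive observation pairs to evaluate the loss of equation 5; the number of sub-steps per pair plays the role of $n_t$.

```python
import jax.numpy as jnp

def rk4_step(f, x, t, h, theta):
    # One classical fourth-order Runge-Kutta step for dx/dt = f(x, t; theta).
    k1 = f(x, t, theta)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h, theta)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h, theta)
    k4 = f(x + h * k3, t + h, theta)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def pairwise_loss(theta, f, pairs, n_sub):
    # Sum of squared errors over (x_i, x_{i+1}) training pairs (equation 5),
    # integrating each observation gap with n_sub equally spaced RK4 sub-steps.
    total = 0.0
    for (t_i, x_i), (t_j, x_j) in pairs:
        x, h = x_i, (t_j - t_i) / n_sub
        for k in range(n_sub):
            x = rk4_step(f, x, t_i + k * h, h, theta)
        total = total + jnp.sum((x - x_j) ** 2)
    return total
```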

Notice that the aforementioned workflow is only capable of producing deterministic point estimates for the unknown parameters $\boldsymbol{\theta}$, typically corresponding to a local minimum of equation 5. In many practical cases it is desirable to obtain a distribution over plausible parameter configurations that can effectively characterize the uncertainty in the estimates due to noise and sparsity in the observed data, as well as potential misspecifications in the model parametrization. A framework for accounting for such uncertainties can be constructed by adopting a Bayesian approach via the use of effective sampling algorithms for approximating a posterior distribution over the unknown model parameters $\boldsymbol{\theta}$, as discussed in the next section.

2.2 Bayesian inference with Hamiltonian Monte Carlo

The Bayesian formalism provides a natural way to account for uncertainty, while also enabling the injection of prior information for the unknown model parameters (e.g., sparsity), as well as for modeling sparse and noisy observations in the training data-set. Perhaps more importantly, it enables a complete statistical characterization of all inferred parameters in our model. The latter is encapsulated in the posterior distribution $p(\boldsymbol{\theta}, \gamma, \beta \,|\, \mathcal{D})$, which can be factorized as

$p(\boldsymbol{\theta}, \gamma, \beta \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, \boldsymbol{\theta}, \beta)\, p(\boldsymbol{\theta} \,|\, \gamma)\, p(\gamma)\, p(\beta),$   (6)

where $p(\mathcal{D} \,|\, \boldsymbol{\theta}, \beta)$ is a likelihood function that measures the discrepancy between the observed data and the model’s predictions for a given set of parameters $\boldsymbol{\theta}$, $p(\boldsymbol{\theta} \,|\, \gamma)$ is a prior distribution parametrized by $\gamma$ that can help us encode any domain knowledge about the unknown model parameters $\boldsymbol{\theta}$, and $\beta$ contains a set of parameters that aim to characterize the noise process that may be corrupting the observations. In this work we employ a hierarchical Bayesian approach corresponding to the following likelihood and priors

$p(\mathcal{D} \,|\, \boldsymbol{\theta}, \beta) = \prod_{i=1}^{N-1} \mathcal{N}\big(\boldsymbol{x}_{i+1} \,|\, \hat{\boldsymbol{x}}_{i+1}(\boldsymbol{\theta}), \beta^{-1} I\big), \quad p(\boldsymbol{\theta} \,|\, \gamma) = \mathrm{Laplace}(\boldsymbol{\theta} \,|\, \boldsymbol{0}, \gamma^{-1}), \quad p(\gamma) = \Gamma(\gamma \,|\, a_{\gamma}, b_{\gamma}), \quad p(\beta) = \Gamma(\beta \,|\, a_{\beta}, b_{\beta}).$   (7)

The use of a Gaussian likelihood stems from assuming a simple isotropic Gaussian noise model with zero mean and precision $\beta$. The Laplace prior over the unknown model parameters $\boldsymbol{\theta}$ [williams1995bayesian] can promote sparsity in the inferred model representations and enable us to effectively neglect the influence of any irrelevant parameters. The Gamma distribution is a common choice for the prior distributions over the unknown precision parameters $\gamma$ and $\beta$. For additional motivation and alternative choices, the interested reader is referred to [geweke1993bayesian, gelman2013bayesian, winkler1967assessment, bernardo1979reference, berger1990robust]. Finally, the logarithm of the precision variables $\gamma$ and $\beta$ is used during sampling to ensure that their estimated values remain positive during model training.

The posterior distribution defined in equation 7 is not analytically tractable in general, due to our modeling assumptions on the likelihood and priors, as well as due to the presence of non-linearity in the latent dynamics of equation 1. Typically, sampling from this unnormalized distribution is difficult and computationally expensive, especially when the dimension of $\boldsymbol{\theta}$ is large. Hamiltonian Monte Carlo (HMC) [neal2011mcmc] is a powerful tool for handling Bayesian inference tasks in high dimensions by utilizing gradient information to effectively generate approximate posterior samples. To illustrate the key ideas behind HMC sampling, let us denote by $\boldsymbol{z} = \{\boldsymbol{\theta}, \log\gamma, \log\beta\}$ the vector including all the unknown parameters that need to be inferred from data. The starting point for building an HMC sampler is to define a Hamiltonian function

$H(\boldsymbol{z}, \boldsymbol{r}) = U(\boldsymbol{z}) + K(\boldsymbol{r}),$   (8)

where $U(\boldsymbol{z})$ is the potential energy of the original system, usually taken as the negative logarithm of the unnormalized distribution in equation 7, and $K(\boldsymbol{r}) = \frac{1}{2}\boldsymbol{r}^{\top}\boldsymbol{r}$ is the kinetic energy of the system introduced by the auxiliary velocity variables $\boldsymbol{r}$. The evolution of $\boldsymbol{z}$ and $\boldsymbol{r}$ can be expressed by taking gradients of the Hamiltonian as

$\frac{d\boldsymbol{z}}{dt} = \frac{\partial H}{\partial \boldsymbol{r}} = \boldsymbol{r}, \qquad \frac{d\boldsymbol{r}}{dt} = -\frac{\partial H}{\partial \boldsymbol{z}} = -\nabla_{\boldsymbol{z}} U(\boldsymbol{z}).$   (9)

A Markov chain can be simulated by integrating this dynamical system using an energy-preserving leapfrog numerical scheme [neal2011mcmc], with $\epsilon$ referring to the step size of the HMC and $L$ indicating the number of leapfrog steps taken at each HMC iteration; both choices are problem dependent. The update rule for the parameters $\boldsymbol{z}$ and $\boldsymbol{r}$ is then given by

$\boldsymbol{r}_{i+1/2} = \boldsymbol{r}_i - \frac{\epsilon}{2} \nabla_{\boldsymbol{z}} U(\boldsymbol{z}_i), \qquad \boldsymbol{z}_{i+1} = \boldsymbol{z}_i + \epsilon\, \boldsymbol{r}_{i+1/2}, \qquad \boldsymbol{r}_{i+1} = \boldsymbol{r}_{i+1/2} - \frac{\epsilon}{2} \nabla_{\boldsymbol{z}} U(\boldsymbol{z}_{i+1}).$   (10)

Note here that computing the gradient $\nabla_{\boldsymbol{z}} U(\boldsymbol{z})$ is difficult and expensive, as the likelihood function is computed from the predictions of an ODE solver. In this case, the differentiable programming framework discussed in section 2.1 is used to address this issue by effectively enabling gradient back-propagation through the ODE solver. Note that all parameters in $\boldsymbol{z}$ can be either updated simultaneously, or separately for $\boldsymbol{\theta}$ and $(\gamma, \beta)$ using a Metropolis-within-Gibbs scheme [gilks1995adaptive, millar2000non].
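For illustration, the following is a minimal sketch of a single HMC transition implementing the leapfrog update of equation 10; it is written in plain JAX for clarity only (production runs would typically rely on an established sampler, e.g., the NUTS implementation in NumPyro), and the target log-density, step size and number of leapfrog steps are placeholders.

```python
import jax.numpy as jnp
from jax import grad, random

def hmc_step(key, z, log_prob, eps, L):
    # One Hamiltonian Monte Carlo update with L leapfrog steps of size eps.
    # U(z) = -log_prob(z); its gradient drives the simulated dynamics (eqs. 9-10).
    grad_U = grad(lambda q: -log_prob(q))
    key, k_momentum, k_accept = random.split(key, 3)
    r0 = random.normal(k_momentum, z.shape)      # resample auxiliary momenta
    z_new, r = z, r0 - 0.5 * eps * grad_U(z)     # initial half step for momentum
    for _ in range(L):
        z_new = z_new + eps * r                  # full step for position
        r = r - eps * grad_U(z_new)              # full step for momentum
    r = r + 0.5 * eps * grad_U(z_new)            # undo the extra half momentum step
    # Metropolis correction based on the change in total energy H = U + K.
    H_old = -log_prob(z) + 0.5 * jnp.sum(r0 ** 2)
    H_new = -log_prob(z_new) + 0.5 * jnp.sum(r ** 2)
    accept = random.uniform(k_accept) < jnp.exp(H_old - H_new)
    return (z_new if accept else z), key
```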

2.3 Learning dynamics with Bayesian differential programming

Here we distinguish between three different problem settings that cover a broad range of practical applications. The first class consists of problems in which the model form of the underlying latent dynamics is completely unknown. In this case, one can parametrize the unknown dynamical system using black-box function approximators such as deep neural networks [chen2018neural, gholami2019anode, raissi2018multistep], or aim to distill a more parsimonious and interpretable representation by constructing a comprehensive dictionary over all possible interactions and trying to infer a predictive, yet minimal model form [brunton2016discovering, rudy2017data, qin2019data]. The second class of problems contains cases where a model form for the underlying dynamics is prescribed by domain knowledge, but a number of unknown parameters needs to be calibrated in order to accurately explain the observed data [tartakovsky2018learning, yazdani2019systems]. Finally, a third class of problems arises as a hybrid of the aforementioned settings, in which some parts of the model form may be known from domain knowledge, but there exist additional functional terms that need to be inferred [rackauckas2020universal]. As we will see in the following, the proposed workflow can seamlessly accommodate all of the aforementioned cases in a unified fashion, while remaining robust with respect to incomplete model parametrizations, as well as imperfections in the observed data. To this end, a general framework can be constructed by parametrizing the unknown dynamics as

$\frac{d\boldsymbol{x}}{dt} = \boldsymbol{\Lambda}\, \boldsymbol{\Phi}(\boldsymbol{x}) + f_{NN}(\boldsymbol{x}; \boldsymbol{w}),$   (11)

where $\boldsymbol{\Lambda} \in \mathbb{R}^{d \times K}$ represents a matrix of unknown coefficients, with $K$ being the length of a dictionary $\boldsymbol{\Phi}(\boldsymbol{x})$ that may or may not be constructed using domain knowledge. Specifically, $\boldsymbol{\Phi}(\boldsymbol{x})$ represents the possible terms that may appear in the right hand side of the ordinary differential equation, which could encapsulate a known model form or, more generally, a prescribed dictionary of features (e.g., polynomials, Fourier modes, etc., and combinations thereof) [brunton2016discovering, rudy2017data, champion2019data]. On the other hand, $f_{NN}(\boldsymbol{x}; \boldsymbol{w})$ denotes a black-box function approximator (e.g., a neural network) parametrized by $\boldsymbol{w}$ that aims to account for any missing interactions that are not explicitly captured by $\boldsymbol{\Phi}(\boldsymbol{x})$ [rackauckas2020universal].
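As a concrete (and purely illustrative) example of this parametrization, the sketch below assembles a quadratic polynomial dictionary for a two-dimensional state and adds a small neural network closure; the feature ordering and network sizes are our own assumptions.

```python
import jax.numpy as jnp

def dictionary(x):
    # Hypothetical feature dictionary Phi(x) for a 2-dimensional state:
    # constant, linear and quadratic monomials.
    x1, x2 = x
    return jnp.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

def rhs(x, t, params):
    # dx/dt = Lambda @ Phi(x) + f_NN(x; w): interpretable dictionary terms plus
    # a small black-box closure (equation 11). Lambda has shape (2, len(Phi)).
    Lam, (W1, b1, W2, b2) = params
    closure = W2 @ jnp.tanh(W1 @ x + b1) + b2
    return Lam @ dictionary(x) + closure
```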

Here we must note that domain knowledge can play a crucial role in enhancing the efficiency of the resulting inference scheme as it can effectively reduce the size of the dictionary, and, consequently, the number of data required to train the model [zhu2019physics, raissi2019physics]. Such knowledge is also critical to constrain the space of admissible solutions such that key physical principles are faithfully captured by the inferred model (e.g., convergence to equilibrium limit cycles in chemical systems [schnakenberg1979simple], conservation of mass and momentum in fluid dynamics [raissi2018hidden], etc).

Although the use of dictionary learning (with or without specific domain knowledge) offers a flexible paradigm for recovering interpretable dynamic representations, it may not always be sufficient to explain the observed data, as important terms may be missing from the model parametrization. To address this shortcoming, one can try to capture these missing interactions via the use of closure terms that often lack physical intuition, and hence can be represented by a black-box function approximator with parameters $\boldsymbol{w}$, such as a deep neural network [rackauckas2020universal].

Under this setup, our goal is to employ the algorithmic framework outlined in sections 2.1 and 2.2 to perform probabilistic inference over plausible sets of model parameters that yield interpretable, parsimonious, and predictive representations, as well as over the precision parameters $\gamma$ and $\beta$ of the Bayesian hierarchical model in equation 7.

2.4 Generating forecasts with quantified uncertainty

The goal of the HMC algorithm described in section 2.2 is to produce a faithful set of samples that concentrate in regions of high probability of the posterior distribution $p(\boldsymbol{\theta}, \gamma, \beta \,|\, \mathcal{D})$. Approximating this distribution is central to our workflow, as it enables the generation of future forecasts with quantified uncertainty via computing the predictive posterior distribution

$p(\boldsymbol{x}^*(t) \,|\, \mathcal{D}) = \int p(\boldsymbol{x}^*(t) \,|\, \boldsymbol{\theta}, \beta)\, p(\boldsymbol{\theta}, \gamma, \beta \,|\, \mathcal{D})\, d\boldsymbol{\theta}\, d\gamma\, d\beta.$   (12)

This predictive distribution provides a complete statistical characterization for the forecasted states by encapsulating epistemic uncertainty in the inferred dynamics, as well as accounting for the fact that the model was trained on a finite set of noisy observations. This allows us to generate plausible realizations of $\boldsymbol{x}^*(t)$ by sampling from the predictive posterior distribution as

$\boldsymbol{x}^*_s(t) = \mathrm{ODESolve}(\boldsymbol{x}_0, t; \boldsymbol{\theta}_s) + \boldsymbol{\epsilon}_s, \quad \boldsymbol{\epsilon}_s \sim \mathcal{N}(\boldsymbol{0}, \beta_s^{-1} I),$   (13)

where $\boldsymbol{\theta}_s$ and $\beta_s$ are approximate samples from $p(\boldsymbol{\theta}, \gamma, \beta \,|\, \mathcal{D})$ computed during model training via HMC sampling, $\mathrm{ODESolve}(\cdot)$ denotes any numerical integrator that takes some initial condition $\boldsymbol{x}_0$ and predicts the system’s state at any time $t$, and $\boldsymbol{\epsilon}_s$ accounts for the noise corrupting the observations used during model training. Moreover, the maximum a-posteriori (MAP) estimate of the model parameters is given by

$\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} \; p(\boldsymbol{\theta}, \gamma, \beta \,|\, \mathcal{D}),$   (14)

and can be used to obtain a point-estimate prediction of the predicted states as

$\boldsymbol{x}_{MAP}(t) = \mathrm{ODESolve}(\boldsymbol{x}_0, t; \boldsymbol{\theta}_{MAP}).$   (15)

Finally, it is straightforward to utilize the posterior samples of $\{\boldsymbol{\theta}_s, \beta_s\}$ to approximate the first- and second-order statistics of the predicted states for any given initial condition as

$\boldsymbol{\mu}_{\boldsymbol{x}^*}(t) \approx \frac{1}{N_s} \sum_{s=1}^{N_s} \boldsymbol{x}^*_s(t),$   (16)
$\boldsymbol{\sigma}^2_{\boldsymbol{x}^*}(t) \approx \frac{1}{N_s} \sum_{s=1}^{N_s} \big(\boldsymbol{x}^*_s(t) - \boldsymbol{\mu}_{\boldsymbol{x}^*}(t)\big)^2,$   (17)

where $N_s$ denotes the number of samples drawn from the Hamiltonian Markov chain used to simulate the posterior, i.e., $\{\boldsymbol{\theta}_s, \beta_s\}_{s=1}^{N_s}$. Note that higher-order moments are also readily computable in a similar manner.
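As a small illustration (under the same notational assumptions as above, not the authors' code), the predictive mean and variance of equations 16-17 can be estimated directly from the HMC samples by re-integrating the inferred dynamics:

```python
import jax.numpy as jnp
from jax.experimental.ode import odeint

def predictive_moments(samples, f, x0, ts):
    # Monte Carlo estimate of the predictive mean and variance (eqs. 16-17)
    # from N_s approximate posterior samples {(theta_s, beta_s)} drawn by HMC.
    trajs = jnp.stack([odeint(f, x0, ts, theta) for theta, _ in samples])
    noise_var = jnp.mean(jnp.array([1.0 / beta for _, beta in samples]))
    mean = trajs.mean(axis=0)
    var = trajs.var(axis=0) + noise_var   # epistemic + observation noise
    return mean, var
```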

2.5 Model initialization and data pre-processing

To promote robustness and stability in the training of the proposed machine learning pipeline, users should be cognizant of several important aspects. First, although the proposed Bayesian approach can naturally safe-guard against over-fitting, it is important that a reasonable amount of training data is provided – relative to the complexity of the system – in order to mitigate any effects of prior misspecification. Second, the training data should be appropriately normalized in order to prevent gradient pathologies during back-propagation [glorot2010understanding]. The specific utility of this step will be demonstrated in the numerical examples presented in this work, using a standard normalization of the form

$\tilde{\boldsymbol{x}} = \boldsymbol{x} / \boldsymbol{\sigma}_{\boldsymbol{x}},$   (18)

where $\boldsymbol{\sigma}_{\boldsymbol{x}}$ is the dimension-wise standard deviation of the training data and the division is an element-wise operation. Notice that this modification directly implies that the assumed parametrization of the underlying dynamical system also needs to be normalized accordingly (see section 3 for a more detailed discussion). A third important remark here is that the noise precision $\beta$ obtained from the model aims to reflect the noise level in the observed data. However, it may not be the true noise level, because the initial condition of the ODE can also be noisy. This point will be further discussed in section 3.

Another important point relates to the initialization of the Hamiltonian Monte Carlo sampler. To this end, in order to mitigate poor mixing and convergence to local minima, a preconditioning step is considered to find a reasonable initial guess for the unknown variables that parametrize the underlying latent dynamics. This step is typically carried out by minimizing the reconstruction loss of the training data augmented with an $\mathcal{L}_1$ regularization,

$\mathcal{L}_{pre}(\boldsymbol{\theta}) = \sum_{i=1}^{N-1} \big\| \boldsymbol{x}_{i+1} - \hat{\boldsymbol{x}}_{i+1}(\boldsymbol{\theta}) \big\|_2^2 + \lambda \|\boldsymbol{\theta}\|_1,$   (19)

using a small number of stochastic gradient descent iterations. Notice that this essentially aims at obtaining a rough point estimate for $\boldsymbol{\theta}$, where the use of $\mathcal{L}_1$ regularization stems from employing a sparsity-promoting Laplace prior. This preconditioning step is closely related to the SINDy algorithm of Brunton et al. [brunton2016discovering], but without the limitations of requiring training data with small noise amplitude sampled on a regular time grid with a small time step. Moreover, the numerical experiments carried out indicate that tuning the hyper-parameter $\lambda$ has almost no effect on the obtained results for all the problems considered in this work, as this simply serves as an initialization step for the HMC sampler; hence, a single fixed value of $\lambda$ is used in all examples considered. The minimization of equation 19 is carried out using stochastic Adam updates [kingma2014adam] with a fixed learning rate.

Note that this preconditioning is not precisely equivalent to the MAP estimation of the posterior distribution over all model parameters, since the initialization of the precision parameters $\gamma$ and $\beta$ follows a different treatment. Specifically, for all the problems considered in section 3, the parameters $a$ and $b$ of the prior Gamma distributions are chosen to take the same fixed values. Moreover, the precision $\beta$ of the Gaussian noise distribution is initialized as follows. If the preconditioning precision exceeds a prescribed threshold, which means the training data appears to be nearly noise-free, the initial guess for $\beta$ is set to a fixed value to avoid numerical stagnancy of the HMC sampler. Otherwise, the preconditioning precision is used as the initial guess for $\beta$. This empirical initialization strategy has a positive effect in accelerating the convergence of the HMC sampler for all the examples considered in section 3. Finally, in all of our numerical examples, the Hamiltonian Monte Carlo step-size $\epsilon$ (see equation 10) is taken to be small. We would like to point out that this is a conservative choice; its value can be increased as long as we do not observe a large rejection rate during training. The number of Monte Carlo realizations and the number of samples $N_s$ used for computing the response statistics (see equation 16) are kept fixed, unless otherwise noted.
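A minimal sketch of this preconditioning step is given below; it assumes a reconstruction loss like the one sketched in section 2.1 and uses the Adam implementation from the optax library (an assumed dependency), with illustrative values for the regularization weight, learning rate and iteration count.

```python
import jax.numpy as jnp
from jax import grad
import optax  # assumed dependency providing Adam updates

def precondition(loss_fn, theta0, lam=1e-4, lr=1e-3, n_iters=500):
    # Rough MAP-style initialization for the HMC chain (equation 19): data
    # misfit plus an l1 penalty mirroring the sparsity-promoting Laplace prior.
    objective = lambda th: loss_fn(th) + lam * jnp.sum(jnp.abs(th))
    opt = optax.adam(lr)
    opt_state = opt.init(theta0)
    theta = theta0
    for _ in range(n_iters):
        g = grad(objective)(theta)
        updates, opt_state = opt.update(g, opt_state)
        theta = optax.apply_updates(theta, updates)
    return theta
```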

1 Inputs: Time-series training data pairs $\mathcal{D} = \{(\boldsymbol{x}_i, \boldsymbol{x}_{i+1})\}_{i=1}^{N-1}$.
2 Outputs: Posterior samples for the unknown model parameters, and forecasts generated from the predictive posterior distribution (equation 12).
3 Initialize all trainable model parameters with small values.
4 Minimize the objective function in equation 19 via stochastic Adam updates to construct a sensible initial guess for the Hamiltonian Monte Carlo sampler.
5 Given the pre-conditioned initial guess for the model parameters, the initial value for the data precision, and the step size of the Hamiltonian Markov chain, perform HMC updates to obtain a sequence of approximate posterior samples $\{\boldsymbol{\theta}_s, \gamma_s, \beta_s\}$, $s = 1, \dots, N_s$.
6 Use the converged posterior samples to compute the MAP estimate of the parameters (see equation 14), predict the MAP trajectory (see equation 15), and compute the response statistics using the predictive distribution (see equations 16-17).
Algorithm 1 Bayesian differential programming for robust systems identification under uncertainty.

3 Results

In this section we present a comprehensive collection of numerical studies that aim to illustrate the key contributions of this work, and place them in the context of the existing SINDy framework of Brunton et al. [brunton2016discovering], which is currently considered the state-of-the-art method for dictionary learning of dynamical systems. Specifically, we expand on five benchmark problems that cover all possible cases discussed in section 2.3 in terms of parametrizing the latent dynamics using a dictionary, domain knowledge, black-box approximations, or a combination thereof. The algorithmic settings used across all cases follow the discussion provided in section 2.5, unless otherwise noted. All code and data presented in this section will be made publicly available at https://github.com/PredictiveIntelligenceLab/BayesianDifferentiableProgramming.

3.1 Dictionary learning for a two-dimensional nonlinear oscillator

Let us start with a pedagogical example on dictionary learning for inferring the dynamics of a two-dimensional damped oscillator from scattered time-series data [raissi2018multistep]. The exact system dynamics are given by

$\frac{dx_1}{dt} = -0.1\, x_1^3 + 2\, x_2^3, \qquad \frac{dx_2}{dt} = -2\, x_1^3 - 0.1\, x_2^3.$   (20)

Our goal here is to recover this dynamical system directly from data using a dictionary parametrization containing polynomial terms with up to 3rd-order interactions, taking the form

$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \lambda_{1,1} & \lambda_{1,2} & \dots & \lambda_{1,10} \\ \lambda_{2,1} & \lambda_{2,2} & \dots & \lambda_{2,10} \end{bmatrix} \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & x_1^3 & x_1^2 x_2 & x_1 x_2^2 & x_2^3 \end{bmatrix}^{\top},$   (21)

where the $\lambda_{i,j}$'s are unknown scalar coefficients that will be estimated using the proposed Bayesian differential programming method. The model’s active non-zero parameters are highlighted with red color for clarity. Our goal is to infer a posterior distribution for the parameters $\lambda_{i,j}$, while it is obvious that the coefficient matrix and dictionary will increase in size with the order of the polynomial features used to parametrize the system. In all presented experiments, we assume a fixed initial condition at $t = 0$, and generate a set of training data by simulating the exact dynamics of equation 20 over a fixed time interval; the temporal resolution $\Delta t$ of the observations and the noise level are varied in the experiments below.
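For reference, a cubic dictionary of this kind can be assembled programmatically; the sketch below (with our own feature ordering) generates all monomials up to 3rd order for a two-dimensional state, yielding the 10 candidate features of equation 21.

```python
import itertools
import jax.numpy as jnp

def poly_dictionary_2d(x, max_degree=3):
    # All monomials x1^i * x2^j with i + j <= max_degree (10 features for
    # max_degree = 3, including the constant term).
    x1, x2 = x
    feats = [x1 ** i * x2 ** j
             for i, j in itertools.product(range(max_degree + 1), repeat=2)
             if i + j <= max_degree]
    return jnp.array(feats)
```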

3.1.1 Effects of sparsity in the training data

In what follows, we will investigate the performance of our algorithms with respect to the temporal resolution and the noise level in the observed data. Moreover, we will provide a comprehensive comparison with the SINDy algorithm of Brunton et al. [brunton2016discovering].

To examine the performance of our method with respect to the temporal resolution of the observed data, we distinguish between two cases: (i) the data is generated with a relatively small $\Delta t$, yielding a dense set of training samples, and (ii) the data is generated with a relatively large $\Delta t$, yielding only a handful of training samples. In order to establish a comparison against the SINDy algorithm, we will also assume that the training data is noise-free and sampled on a regular temporal grid, as required by the SINDy setup [brunton2016discovering].

Figure 1 presents the results obtained for the case corresponding to high-resolution training data, generated using a small time-step. Specifically, it depicts the training data together with the exact simulated trajectory of the system, as well as the forecasts generated by the proposed Bayesian framework and by the deterministic SINDy approach. We observe that both methods are able to correctly recover the underlying dynamics and produce sensible forecasts of future states, while our proposed approach also yields well-calibrated uncertainty estimates. The latter are extracted from the inferred posterior distribution, pair-density plots of which are depicted in figure 2(b) for the active identified parameters. The parameters identified by our method are summarized in table 1 and show accurate agreement with the exact model form, while SINDy's point estimates are summarized in table 2.

Figure 1: Two-dimensional damped oscillator with high-resolution training data: (a) Learned dynamics versus the true dynamics and the training data for $x_1(t)$. (b) SINDy’s prediction for $x_1(t)$ versus the true dynamics and the training data. (c) Learned dynamics versus the true dynamics and the training data for $x_2(t)$. (d) SINDy’s prediction for $x_2(t)$ versus the true dynamics and the training data.
Figure 2: Two-dimensional damped oscillator with high-resolution training data: (a) Phase plot of the training data and the true trajectory. (b) Posterior distribution of the inferred active model parameters.
           1         x1        x2         x1^2       x1*x2      x2^2       x1^3     x1^2*x2   x1*x2^2   x2^3
dx1/dt     0.00034   0.00022   -0.00045   -0.000056  -0.000018  -0.00063   -0.10    0.00056   0.00025   2.00
dx2/dt     0.00038   -0.0027   -0.0000023 -0.00016   -0.00087   -0.000025  -2.010   -0.0016   -0.00095  -0.010
Table 1: Dictionary learning for a two-dimensional nonlinear oscillator: MAP estimation of the inferred model parameters using high-resolution training data.
           1    x1   x2   x1^2  x1*x2  x2^2  x1^3   x1^2*x2  x1*x2^2  x2^3
dx1/dt     0    0    0    0     0      0     -0.10  0        0        2.0
dx2/dt     0    0    0    0     0      0     -2.0   0        0        -0.10
Table 2: Dictionary learning for a two-dimensional nonlinear oscillator: Point estimates for the dictionary coefficients obtained by the SINDy algorithm [brunton2016discovering] using high-resolution training data.

Although both methods give sufficiently accurate results in the high-resolution training data case, a discrepancy arises when considering low-resolution training data, generated using a larger time-step. As seen in figures 3 and 4, the SINDy algorithm fails when the training data is sparse, while the proposed approach remains robust, as the Bayesian formulation outlined in equation 7 enables a faithful statistical characterization of the latent system dynamics even under sparse observations. Indeed, the proposed method not only retains the capability of accurately identifying the parameters even for relatively large $\Delta t$ (see table 3), but also provides reasonable uncertainty estimates for the extrapolated long-term forecasts. In contrast, the SINDy algorithm gives inaccurate estimates for the dictionary parameters (see table 4), consequently leading to large errors in the forecasted system states. As shown in figures 3(b),(c), the estimated trajectories clearly deviate from the exact solution, since their oscillatory frequency is considerably higher than the exact one and they fail to capture the decay of the oscillation peaks. Moreover, SINDy’s results for the large $\Delta t$ case depend strongly on the choice of the sequential least-squares threshold parameter, and slightly different values of this hyper-parameter give significantly different results.

Figure 3: Two-dimensional damped oscillator with low-resolution training data: (a) Learned dynamics versus the true dynamics and the training data for $x_1(t)$. (b) SINDy’s prediction for $x_1(t)$ versus the true dynamics and the training data. (c) Learned dynamics versus the true dynamics and the training data for $x_2(t)$. (d) SINDy’s prediction for $x_2(t)$ versus the true dynamics and the training data.
Figure 4: Two-dimensional damped oscillator with low-resolution training data: (a) Phase plot of the training data and the true trajectory. (b) Posterior distribution of the inferred active model parameters.
-0.00523 0.00569 -0.0095 0.0094 0.0116 0.0057 -0.0093 -0.054
0.0048 0.0195 -0.0072 -0.0214 0.0024 -0.0075 0.0279 -0.0145
Table 3: Dictionary learning for a two-dimensional nonlinear oscillator: MAP estimation of the inferred model parameters using low-resolution training data ().
           1    x1   x2   x1^2  x1*x2  x2^2  x1^3   x1^2*x2  x1*x2^2  x2^3
dx1/dt     0    0    0    0     0      0     0      0        0        2.10
dx2/dt     0    0    0    0     0      0     -1.87  0        0        0
Table 4: Dictionary learning for a two-dimensional nonlinear oscillator: Point estimates for the dictionary coefficients obtained by the SINDy algorithm [brunton2016discovering] using low-resolution training data.

3.1.2 Effects of noise in the training data

Here we aim to investigate the sensitivity of the proposed methods with respect to the presence of noise in the training data. To this end, we generate a training data-set with equi-spaced data-pairs by simulating the exact dynamics of equation 20, and deliberately corrupt the observations with uncorrelated Gaussian noise. The dictionary parameters inferred using the proposed Bayesian differential programming approach are summarized in table 5. In figure 5, we show the predicted trajectories of our model and the predicted results from SINDy using the noisy data on a regular grid. The noisy training data and the posterior distribution of the unknown parameters are given in figure 6.

0.0042 -0.0059 -0.0045 -0.0072 -0.0078 -0.0071 -0.10 0.047 0.014 2.0
-0.0029 -0.010 -0.0073 0.0017 0.017 0.0064 -2.0 0.015 -0.049
Table 5: Dictionary learning for a two-dimensional nonlinear oscillator: MAP estimation of the inferred model parameters using noisy training data.

These results showcase good agreement with the exact active parameter values corresponding to the entries highlighted in red, and illustrate the robust performance of our framework in identifying interpretable and parsimonious system representations, even in the presence of noise in the training data. On the other hand, the SINDy algorithm yields the identified dictionary reported in table 6.

Figure 5: Two-dimensional damped oscillator with noisy training data: (a) Learned dynamics versus the true dynamics and the training data for $x_1(t)$. (b) SINDy’s prediction for $x_1(t)$ versus the true dynamics and the training data. (c) Learned dynamics versus the true dynamics and the training data for $x_2(t)$. (d) SINDy’s prediction for $x_2(t)$ versus the true dynamics and the training data.
Figure 6: Two-dimensional damped oscillator with noisy training data: (a) Phase plot of the training data and the true trajectory. (b) Posterior distribution of the inferred active model parameters.
           1    x1   x2   x1^2  x1*x2  x2^2  x1^3    x1^2*x2  x1*x2^2  x2^3
dx1/dt     0    0    0    0     0      0     -0.139  0        0        2.0
dx2/dt     0    0    0    0     0      0     -1.97   0        -0.188   0
Table 6: Dictionary learning for a two-dimensional nonlinear oscillator: Point estimates of the SINDy parameters using noisy training data.

We can see that, although the predicted trajectory of SINDy is quite close to the true trajectory, the identified parameters are quite different from the true model form. Even though SINDy employs a total variation diminishing (TVD) regularization to safe-guard against small noise corruptions in the training data [brunton2016discovering], it is still prone to providing inaccurate results in cases where the training data is imperfect. This limitation is addressed in the proposed framework by explicitly accounting for the effects of noise in the training data using the hierarchical Bayesian model in equation 6. Here, we must note that this source of uncertainty is also effectively propagated through the system’s dynamics to yield a sensible characterization of uncertainty in the predicted forecasts, as discussed in section 2.4.

3.2 Parameter inference in a predator-prey system with irregularly sampled observations

This case study is designed to illustrate the capability of the proposed framework to accommodate noisy and irregularly sampled time-series data; a common practical setting that cannot be effectively handled by SINDy and other popular data-driven systems identification methods [brunton2016discovering, rudy2017data, lusch2018deep, raissi2018multistep, qin2019data]. To this end, let us consider a classical prey-predator system described by the Lotka–Volterra equations

$\frac{dx_1}{dt} = \alpha x_1 - \beta x_1 x_2, \qquad \frac{dx_2}{dt} = \delta x_1 x_2 - \gamma x_2,$   (22)

which is known to exhibit a stable limit cycle behavior for suitable positive values of the parameters $\alpha$, $\beta$, $\gamma$ and $\delta$. Without loss of generality, here we will assume a dictionary that precisely contains the active terms of the system; however, our training data-pairs will be irregularly sampled by randomly sub-sampling an exact trajectory of system 22 starting from a fixed initial condition. Given this irregular training data, our goal is to demonstrate the performance of the proposed Bayesian framework in identifying the unknown model parameters with quantified uncertainty, as well as in producing sensible forecasts of extrapolated future states. Specifically, two different experiments are considered: (i) the model is trained on noise-free data, and (ii) the training is carried out using data perturbed by white noise.
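To make the data-generation setting concrete, the sketch below simulates a Lotka-Volterra trajectory and sub-samples it at irregular time stamps; the parameter values, initial condition and number of observations are illustrative placeholders rather than the settings used in our experiments.

```python
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, gamma, delta = 1.0, 0.1, 1.5, 0.075   # illustrative values only

def lotka_volterra(t, x):
    return [alpha * x[0] - beta * x[0] * x[1],
            delta * x[0] * x[1] - gamma * x[1]]

rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 25.0, size=64))    # irregular time stamps
sol = solve_ivp(lotka_volterra, (0.0, 25.0), [5.0, 5.0], t_eval=t_obs, rtol=1e-8)
x_obs = sol.y.T                                      # noise can be added here
# Consecutive (x_i, x_{i+1}) pairs with their (possibly unequal) time gaps.
pairs = [((t_obs[i], x_obs[i]), (t_obs[i + 1], x_obs[i + 1]))
         for i in range(len(t_obs) - 1)]
```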

The numerical results obtained for the noise-free data case are given in figure 7. The corresponding training data and the posterior distribution of the parameters can be found in figure 8. Similarly, the numerical results obtained for the noisy data case are shown in figures 9 and 10. In both cases, it is evident that the proposed Bayesian differential programming approach (i) is able to provide an accurate estimation of the unknown model parameters, (ii) yields a MAP estimate whose predicted trajectory closely matches the exact system’s dynamics, (iii) returns a posterior distribution over plausible models that captures both the epistemic uncertainty of the assumed parametrization and the uncertainty induced by training on a finite amount of noisy training data, and (iv) propagates this uncertainty through the system’s dynamics to characterize variability in the predicted future states.

Another interesting observation here is that the Hamiltonian Monte Carlo sampler is very effective in identifying the importance/sensitivity of each inferred parameter in the model. For instance, less important parameters have the highest uncertainty, as observed in the posterior density plots shown in figures 8(b) and 10(b). Specifically, notice how the posterior distributions of two of the inferred parameters have a considerably larger standard deviation than the others, implying that the evolution of this dynamical system is less sensitive with respect to these parameters.

Figure 7: Parameter inference in a predator-prey system from irregularly sampled, noise-free data: (a) Learned dynamics versus the true dynamics and the training data of $x_1(t)$. (b) Learned dynamics versus the true dynamics and the training data of $x_2(t)$.
Figure 8: Parameter inference in a predator-prey system from irregularly sampled, noise-free data: (a) Phase plot of the training data and the true trajectory. (b) Posterior distribution of the inferred model parameters.
Figure 9: Parameter inference in a predator-prey system from irregularly sampled, noisy data: (a) Learned dynamics versus the true dynamics and the training data of $x_1(t)$. (b) Learned dynamics versus the true dynamics and the training data of $x_2(t)$.
Figure 10: Parameter inference in a predator-prey system from irregularly sampled, noisy data: (a) Phase plot of the training data and the true trajectory. (b) Posterior distribution of the inferred model parameters.

3.3 Safe-guarding against model inadequacy: a damped pendulum case study

In this example we aim to demonstrate the effects of a misspecified model parametrization, and highlight how the proposed Bayesian framework can help us detect such problematic cases and safe-guard against them. Such cases may arise when domain knowledge is insufficient to guide the selection of a parsimonious model form, as well as when important interaction terms may be missing from a dictionary representation. To illustrate the main ideas, let us consider a simple damped pendulum system described by

$\frac{dx_1}{dt} = x_2, \qquad \frac{dx_2}{dt} = -\alpha x_2 - \beta \sin(x_1),$   (23)

where $\alpha$ and $\beta$ denote fixed damping and forcing coefficients, respectively. A set of sparse and irregularly sampled training data-pairs can be generated by randomly sub-sampling a simulated trajectory of the exact dynamics starting from a fixed initial condition. This imperfect data-set can then be used to recover an interpretable model representation via dictionary learning, albeit here we deliberately choose to use an incomplete dictionary containing polynomial terms only up to 1st order, i.e.,

$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \lambda_{1,1} & \lambda_{1,2} & \lambda_{1,3} \\ \lambda_{2,1} & \lambda_{2,2} & \lambda_{2,3} \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix},$   (24)

where the unknown model parameters are the $\lambda_{i,j}$'s, with the active terms being marked with red color for clarity. Moreover, a blue marker is used to highlight a mismatched term in our incomplete dictionary, hinting at its erroneous capacity to approximate the true $\sin(x_1)$ term using just $x_1$ as a feature. It is evident that such a dictionary choice cannot faithfully capture the exact physics of the problem, and is hence destined to yield inaccurate predictions. Nevertheless, here we argue that this discrepancy between the true form and the assumed incomplete parametrization of the dynamical system can be effectively detected by our Bayesian workflow via inspecting the inferred posterior distribution over all parameters, which is expected to exhibit high entropy in the presence of a misspecified model parametrization and imperfect training data.

The results of this experiment are summarized in figures 11 and 12. In particular, figure 11 shows the scattered training data, the exact simulated trajectory, the predicted trajectory corresponding to the MAP estimate of the inferred model parameters, as well as a two standard deviations band around a set of plausible forecasted states. As expected, the use of a misspecified dictionary leads to an inferred model that has difficulty in fitting the observed data and yields predictions with high variability. This model inadequacy is also clearly captured in the posterior distribution over the inferred parameters shown in figure 12. There we can see that the inferred density for the coefficient multiplying $x_1$ in the second equation exhibits very high variance, as this coefficient crucially corresponds to the misspecified term in our dictionary. This is a direct indication that the inferred model is inadequate to capture the observed reality, leading to forecasts with inevitably large error-bars. This innate ability of our framework to detect model misspecification via a rigorous quantification of predictive uncertainty can prove crucial in risk-sensitive applications, and provides an important capability that is currently missing from the recently active literature on data-driven model discovery [brunton2016discovering, rudy2017data, champion2019data, qin2019data, rackauckas2020universal, raissi2018multistep].

Figure 11: Damped pendulum with noisy data: (a) Learned dynamics versus the true dynamics and the training data of $x_1(t)$. (b) Learned dynamics versus the true dynamics and the training data of $x_2(t)$.
Figure 12: Damped pendulum with noisy data: (a) Phase plot of the training data and the true trajectory. (b) Posterior distribution of the model’s parameters.

A natural question to now ask is: can we still recover an accurate predictive model even if the true dynamics can only be partially captured by our chosen dictionary representation? To tackle this question we turn our attention to the hybrid learning setting discussed in section 2.3, in which some parts of our model can be captured by sparsely selecting interpretable terms from a dictionary, while other missing parts or closure terms can be accounted for via a black-box function approximator. To this end, let us revisit our damped pendulum case study, and endow our dictionary learning parametrization with the ability to approximate the missing term via a deep neural network as

$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \boldsymbol{\Lambda}\, \boldsymbol{\Phi}(x_1, x_2) + f_{NN}(x_1; \boldsymbol{w}),$   (25)

where $\boldsymbol{\Lambda}$ is a matrix of unknown coefficients corresponding to a dictionary $\boldsymbol{\Phi}(x_1, x_2)$ containing polynomial interactions up to 2nd order, and $f_{NN}$ is a fully-connected deep neural network with hyperbolic tangent activations and a set of unknown weight and bias parameters denoted by $\boldsymbol{w}$. Notice that the neural network admits only $x_1$ as an input to make sure that the resulting parametrized dynamical system has a unique solution. Under this setup, the proposed Bayesian framework can be employed to jointly infer the dictionary and the neural network parameters that define our model representation, namely $\{\boldsymbol{\Lambda}, \boldsymbol{w}\}$.

Figure 13 shows the scattered training data, the exact simulated trajectory, the predicted trajectory corresponding to the MAP estimate of the inferred model parameters, as well as a two standard deviations band around a set of plausible forecasted states. It is evident that the revised hybrid model formulation can now correctly identify the true underlying dynamics, leading to an accurate predicted MAP trajectory, while the predicted uncertainty effectively diminishes and concentrates around the ground truth. Moreover, the inferred coefficients of the irrelevant terms are very close to zero, while the active terms are all properly identified. This is also true for the neural network approximation of the missing closure term, as depicted in figure 14, which also includes uncertainty estimates over the neural network outputs. This simple example illustrates the great flexibility of the proposed Bayesian framework in seamlessly distilling parsimonious and interpretable models from imperfect data via physics-informed dictionary learning, as well as harnessing the power of black-box representations for approximating missing closure terms.
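The sketch below illustrates one possible way to set up such a hybrid parametrization in code; the dictionary ordering, the network sizes, and the assumption that the closure only enters the second state equation are our own illustrative choices.

```python
import jax.numpy as jnp

def pendulum_dictionary(x):
    # Hypothetical polynomial dictionary up to 2nd order in (x1, x2).
    x1, x2 = x
    return jnp.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

def hybrid_rhs(x, t, params):
    # Sparse dictionary terms plus a small MLP closure that only sees x1,
    # standing in for the missing forcing term of equation 25.
    Lam, (W1, b1, W2, b2) = params
    closure = W2 @ jnp.tanh(W1 @ x[:1] + b1) + b2   # input restricted to x1
    return Lam @ pendulum_dictionary(x) + jnp.concatenate([jnp.zeros(1), closure])
```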

Figure 13: Hybrid learning for the damped pendulum: (a) Learned dynamics versus the true dynamics and the training data. (b) Posterior distribution of the model’s parameters.
Figure 14: Hybrid learning, learned forcing term: (a) Unknown term learned by the black-box function versus the true term, where the blue shaded area denotes the range in which we have training data. (b) Learned dynamics versus the true dynamics and the training data.

3.4 Learning and forecasting chaotic dynamics with quantified uncertainty

This example aims to illustrate the performance of the proposed framework in a more challenging setting involving dictionary learning with noisy and irregularly sampled observation of a chaotic time-series, leading to a 30-dimensional inference problem. To this end, we consider the classical Lorenz system [lorenz1963deterministic]

$\frac{dx_1}{dt} = \sigma (x_2 - x_1), \qquad \frac{dx_2}{dt} = x_1 (\rho - x_3) - x_2, \qquad \frac{dx_3}{dt} = x_1 x_2 - \beta x_3,$   (26)

where the parameters $\sigma$, $\rho$ and $\beta$ are chosen to induce a chaotic response [lorenz1963deterministic]. Starting from a fixed initial condition, we simulate the exact dynamics to generate a training-data set consisting of irregularly sampled observations over a fixed time interval. The training data-pairs are also corrupted by isotropic uncorrelated Gaussian noise (see figure 15(a)). Using this observed time-series, our goal is to employ the proposed Bayesian framework to recover an interpretable and parsimonious representation for the chaotic Lorenz dynamics by sparsely selecting terms from a dictionary containing polynomial interactions up to second order, i.e.,

$\frac{dx_k}{dt} = \sum_{j=1}^{10} \lambda_{k,j}\, \phi_j(x_1, x_2, x_3), \quad k = 1, 2, 3, \qquad \boldsymbol{\Phi} = \begin{bmatrix} 1 & x_1 & x_2 & x_3 & x_1^2 & x_1 x_2 & x_2^2 & x_1 x_3 & x_2 x_3 & x_3^2 \end{bmatrix}^{\top},$   (27)

where the $\lambda_{k,j}$'s are the scalar coefficients to be inferred. The non-zero parameters contributing to the exact Lorenz dynamics are highlighted with red for clarity. In this case, the dictionary representation has 30 unknown parameters. Since the scale of this problem’s variables is large, we will normalize the observed data using a uniform strategy to consistently preserve the attractor’s geometry. Specifically, instead of using equation 18, the training data is normalized as follows:

$\tilde{\boldsymbol{x}} = \boldsymbol{x} / \sigma_{\max},$   (28)

such that $\sigma_{\max}$ is taken equal to the maximum standard deviation of the training data across the different dimensions.
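A minimal sketch of this uniform normalization (using an assumed array of training observations) is:

```python
import numpy as np

def normalize_uniform(x_train):
    # Uniform normalization (equation 28): scale all state dimensions by the
    # single largest per-dimension standard deviation, so that the relative
    # scales of x1, x2, x3 (and hence the attractor's geometry) are preserved.
    sigma_max = x_train.std(axis=0).max()
    return x_train / sigma_max, sigma_max
```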

The results of this experiment are summarized in table 7 and figures 15 and 16. Specifically, table 7 presents the computed MAP estimators for the dictionary coefficients, highlighting the ability of the proposed algorithms to correctly recover the active terms that define a parsimonious representation of the Lorenz system. In this case, due to the chaotic nature of the dynamics, any small imperfections in the inferred dynamics can lead to predicted trajectories that deviate from the ground truth. However, the predicted uncertainty estimates are expected to capture the variability induced by the Lorenz attractor. Indeed, as shown in figure 16, we observe small uncertainty estimates in the regions where the predicted MAP trajectory closely follows the ground truth solution, while the prediction of future extrapolated states exhibits high variance that faithfully captures the evolution of different trajectories in both branches of the butterfly-shaped attractor. This robust characterization of predictive uncertainty is one of the key attributes of the proposed Bayesian approach that enables the computation of the joint posterior distribution over all model parameters, as shown in figure 15(b) for the active dictionary coefficients.

Figure 15: Learning chaotic Lorenz dynamics: (a) Phase plot of the training data and the true trajectory. (b) Pair density plots of the inferred posterior distribution for the active model parameters.
Figure 16: Learning chaotic Lorenz dynamics: Ground truth solution versus the predicted MAP trajectory and its associated uncertainty estimate. Panels (a), (b) and (c) correspond to each dimension of the state variable $\boldsymbol{x}(t)$.
           1        x1       x2       x3       x1^2     x1*x2    x2^2     x1*x3    x2*x3    x3^2
dx1/dt     0.0088   -11.37   10.69    -0.0719  0.0053   -0.0024  0.0115   0.0363   0.0025   -0.0149
dx2/dt     0.0550   28.35    -1.030   -0.0214  -0.0162  0.0072   0.0044   -1.010   0.0005   -0.0041
dx3/dt     -0.0585  -0.1527  0.1411   -2.555   0.0377   0.9680   0.0165   0.0041   -0.0063  -0.0052
Table 7: Dictionary learning for the Lorenz system: MAP estimation of the inferred model parameters using noisy training data.

3.5 Bayesian calibration of a Yeast Glycolysis model

In this final example our goal is to investigate the performance of the proposed algorithms in a realistic problem in systems biology. To this end, we consider a yeast glycolysis process which can be described by a 7-dimensional dynamical system [ruoff2003temperature, yazdani2019systems] of the form

(29)

where the terms appearing on the right hand side are defined as

(30)

Here we will assume that this model form is known from existing domain knowledge, and our goal is to (i) compute the posterior distribution over the unknown model parameters from a small set of noisy and irregularly sampled observations, and (ii) test the ability of the inferred model to accurately generalize from different initial conditions that were not observed during model training.

A training data-set is generated by randomly sub-sampling irregularly distributed observations from a single simulated trajectory of the system, starting from a fixed initial condition and assuming a ground truth set of parameters obtained from the experimental data provided in [ruoff2003temperature]. Again, we will consider two cases corresponding to noise-free and noisy training data, with the observations perturbed by noise proportional to their standard deviation in the latter case.

Table 8 summarizes the inferred MAP estimators for the unknown model parameters. We observe that, in both the noise-free and the noisy cases, all inferred parameters closely agree with the ground truth values used to generate the training data, as reported in [ruoff2003temperature]. Uncertainty estimates for the inferred parameters can also be deduced from the computed posterior distribution, as presented in the box plots of figure 17, where the minimum, maximum, median, and first and third quartiles obtained from the HMC simulations are presented for each parameter. Finally, notice that all true values fall between the predicted quartiles, while the MAP estimators of all the parameters have small relative errors compared with the true parameter values, as shown in table 8.

Noise free 2.53 101.88 5.95 16.25 101.57 1.30 12.19 1.83 13.23 4.00 0.52 0.01 1.02 4.00
Noisy 2.53 100.04 5.67 16.14 100.50 1.27 12.25 1.82 13.25 3.99 0.52 0.01 1.06 4.00
Table 8: Yeast Glycolysis dynamics: MAP estimation of the inferred model parameters using noise-free and noisy training data, respectively.
Figure 17: Yeast Glycolysis dynamics: Uncertainty estimation of the inferred model parameters obtained using the proposed Bayesian differential programming method. Estimates for the minimum, maximum, median, and first and third quartiles are provided, while the true parameter values are highlighted in blue.

The inferred model also allows us to produce forecasts of future states with quantified uncertainty via the predictive posterior distribution of equation 12. These extrapolated states are shown in figures 18 and 19 for the noise-free and noisy data cases, respectively. It is evident that the uncertainty estimates obtained via sampling from the joint posterior distribution over all model parameters capture the reference trajectory of this high-dimensional dynamical system well. As expected, the uncertainty is larger when considering noisy data compared to the noise-free case. This can be explained by the higher precision obtained with the noise-free data, which is reflected in the uncertainty of the samples drawn from the predictive distribution.

Figure 18: Learning Yeast Glycolysis dynamics from noise-free data: panels (a)-(g) show the learned dynamics versus the true dynamics and the training data for each of the seven state variables.
Figure 19: Learning Yeast Glycolysis dynamics from noisy data: panels (a)-(g) show the learned dynamics versus the true dynamics and the training data for each of the seven state variables.

Finally, to illustrate the generalization capability of the inferred model with respect to initial conditions different from those used during training, we have assessed the quality of the predicted states starting from a random initial condition that is uniformly sampled within a prescribed range. Figure 20 shows the prediction obtained with the randomly picked initial condition. The close agreement with the reference solution indicates that the inferred model is capable of generalizing, both in terms of handling different initial conditions and in terms of extrapolating to reliable future forecasts with quantified uncertainty.

Figure 20: Learning Yeast Glycolysis dynamics from noisy data: Future forecasts with quantified uncertainty from a previously unseen initial condition (i.e. an initial condition that was not used during model training).

4 Conclusions

We have presented a novel machine learning framework for robust systems identification under uncertainty. The proposed framework leverages state-of-the-art differential programming techniques in combination with gradient-enhanced sampling schemes for scalable Bayesian inference of high-dimensional posterior distributions, to infer interpretable and parsimonious representations of complex dynamical systems from imperfect (e.g. sparse, noisy, irregularly sampled) data. The developed methods are general as they can seamlessly combine dictionary learning, domain knowledge and black-box approximations, all in a computationally efficient workflow with end-to-end uncertainty quantification. The effectiveness of the proposed techniques has been systematically investigated and compared to state-of-the-art approaches across a collection of prototype problems.

Although the proposed Bayesian differential programming framework provides great flexibility to infer a distribution over plausible parsimonious representations of a dynamical system, a number of technical issues need to be further investigated. The first relates to devising more effective initialization procedures for Markov Chain Monte Carlo sampling. Here we have partially addressed this via the MAP preconditioning algorithm discussed in section 2.5; however, during the preconditioning process, since the form of the dynamical system is unknown, the intermediate parameter estimates may cause the system to become stiff. Moreover, for cases where only very sparse observations are available, the model needs to be integrated with a large time-step, and stiffness of the system can lead to numerical instabilities during model training. A possible enhancement in this direction is to use more general stiffly stable ODE solvers, as discussed in [rackauckas2020universal, gholami2019anode], or more sophisticated time-step annealing strategies [wang2020understanding]. Another potential direction for future work is quantifying the uncertainty in the model's parameters under partial observations, i.e., when some variables of the system are not accessible. Such a task would involve physics-informed regularization on the unknown latent dynamics of the system; the approaches used in [yazdani2019systems] could be helpful for solving this problem. A third open question is how to adapt the proposed method to stochastic dynamical systems, where the dynamics itself may be driven by a stochastic process; the approaches mentioned in [li2020scalable] could be useful. Finally, the proposed Bayesian differential programming framework can be extended to parameter identification for partial differential equations (PDEs), which generally translate into high-dimensional dynamical systems after discretization. The learning task in this context could be carried out not only for the PDEs’ parameters, but also for the discretization scheme [rackauckas2020universal].

Acknowledgements

This work received support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DE-SC0019116), the Defense Advanced Research Projects Agency under the Physics of Artificial Intelligence program (grant HR00111890034), and the Air Force Office of Scientific Research (grant XXXX).

References