1 Introduction
In the era of big data, dynamical systems discovery has received significant attention, primarily due to the growth of accessible data across different scientific disciplines, including systems biology [ruoff2003temperature], biomedical imaging [kak2002principles], fluid dynamics [raissi2020hidden], climate modeling [palmer1999nonlinear], physical chemistry [feinberg1974dynamics], etc. Extracting a set of interpretable underlying features that describe the evolution of a dynamical system is crucial for developing physical insight and gaining a better understanding of the natural phenomena under study [haller2002lagrangian]. Perhaps more importantly, this can enable the reliable forecasting of future states and subsequently lead to effective intervention strategies for the design and control of complex systems [tantet2018crisis, bemporad1999control, lu2019nonparametric].
Machine learning methods and data-driven modeling techniques have already proven their utility in solving high-dimensional problems in computer vision and natural language processing [krizhevsky2012imagenet, bahdanau2014neural], etc. Due to their capability of extracting features from high-dimensional and multi-fidelity noisy data [yang2019conditional, han2018solving], these methods are also gaining traction in modeling and simulating physical and biological systems. The evolution of such systems can typically be characterized by differential equations, and several techniques have been developed to construct predictive algorithms that can synergistically combine data and mechanistic prior knowledge. Such scientific machine learning approaches are currently employed to distill dynamics from time-series data [rackauckas2020universal, gholami2019anode, chen2018neural, brunton2016discovering, rudy2017data, brennan2018data, raissi2018multistep, qin2019data], infer the solution of differential equations [raissi2017inferring, raissi2019physics, zhu2019physics, yang2019adversarial, yang2020physics, wang2020understanding], infer parameters, latent variables and unknown constitutive laws [wang2019non, raissi2020hidden, raissi2018hidden, tartakovsky2018learning], as well as tackle forward and inverse problems in complex application domains including cardiovascular flow dynamics [kissas2020machine], metamaterials [chen2019physics], cardiac electrophysiology [sahli2020physics], etc.

Specific to systems identification, most recent data-driven approaches [rackauckas2020universal, brunton2016discovering, qin2019data, chen2018neural, gholami2019anode] heavily rely on the quality of the observations and are not designed to return predictions with quantified uncertainty. For instance, the sparse regression methods put forth in [brunton2016discovering, rudy2017data] can exhibit unstable behavior if the data is highly noisy, and are not able to directly digest time-series data with irregular sampling frequency or missing values. On the other hand, recent approaches leveraging differentiable programming [rackauckas2020universal, chen2018neural, gholami2019anode]
can support irregularly sampled data, but are only designed to provide point estimates for the discovered dynamics with no characterization of predictive uncertainty. Such lack of robustness and missing capabilities may limit the use of existing techniques to idealized settings, and poses the need for a more flexible framework that can effectively accommodate noisy, sparse and irregularly sampled data, infer posterior distributions over plausible models, and subsequently yield robust future forecasts with quantified uncertainty.
In this work, we aim to address the aforementioned capability gaps by formulating a fully Bayesian framework for robust systems identification from imperfect time-series data. Our specific contributions can be summarized in the following points:

We leverage recent developments in differentiable programming [rackauckas2020universal, chen2018neural, gholami2019anode] to enable gradient backpropagation through numerical ODE solvers, and utilize this information to construct accelerated Hamiltonian Monte Carlo schemes for Bayesian inference.

The proposed workflow is computationally efficient, end-to-end differentiable, and can directly accommodate sparse, noisy and irregularly sampled time-series data.

Equipped with sparsity-promoting priors, we can recover interpretable and parsimonious representations of the latent dynamics and, unlike existing approaches that are only designed to produce point estimates, obtain entire posterior distributions over plausible models.

This probabilistic formulation is key for safeguarding against erroneous data, incomplete model parametrizations, as well as for producing reliable future forecasts with quantified uncertainty.

We demonstrate enhanced capabilities and robustness against state-of-the-art methods for systems identification [brunton2016discovering, rudy2017data] across a range of benchmark problems.
Taken all together, our findings put forth a novel, flexible and robust workflow for data-driven model discovery under uncertainty that can potentially lead to improved algorithms for robust forecasting, control and model-based reinforcement learning of complex systems.
The rest of this paper is organized as follows. Section 2 presents the proposed method and its corresponding technical ingredients. In section 2.1, we provide details on how the proposed method can effectively propagate gradient information through ODE solvers by leveraging differentiable programming. In section 2.2, a review of Bayesian inference and Hamiltonian Monte Carlo sampling is given. In section 2.5, we discuss a simple but essential step for preprocessing the observed data in order to obtain robust training behavior. In section 3, the effectiveness of the proposed method is tested on a series of numerical studies, including nonlinear oscillators, predator-prey systems, chaotic dynamics, and a realistic example in systems biology. Finally, in section 4, we summarize our key findings, discuss the limitations of the proposed approach, and carve out directions for future investigation.
2 Methods
This section provides a comprehensive overview of the key ingredients that define our work, namely differentiable programming via neural ordinary differential equations (Neural ODEs) [chen2018neural], and Bayesian inference with Hamiltonian Monte Carlo sampling [neal2011mcmc]. Our presentation focuses on describing how these techniques can be interfaced to obtain a novel, efficient, and robust workflow for parsimonious model discovery from imperfect time-series data.
2.1 Differentiable programming
In their original work, Chen et al. [chen2018neural] introduced a general framework for propagating gradient information through classical numerical solvers for ordinary differential equations (ODEs), blending classical adjoint methods [pontryagin2018mathematical] with modern developments in automatic differentiation [van2018automatic]. To illustrate the main concepts, consider a general dynamical system of the form
(1)   $\frac{dx(t)}{dt} = f\big(x(t), t; \theta\big)$
where $x(t)$ denotes the state of the $d$-dimensional dynamical system, and $\theta$ is a vector of unknown parameters that parametrizes the latent dynamics $f$. A systems identification task can now be summarized as follows. Given a set of observations of the state evaluated at time instances $t_1, \dots, t_N$, one would like to learn the $\theta$ that best parametrizes the underlying dynamics. A typical approach for identifying these optimal parameters is to define a loss function
that measures the discrepancy between the observed data and the model's predictions for a given $\theta$, i.e.,

(2)   $\mathcal{L}(\theta) = \sum_{i=1}^{N} d\big(x(t_i), \hat{x}(t_i; \theta)\big)$
where $\hat{x}(t_i; \theta)$ is the predicted value under a given set of estimated model parameters, obtained by integrating the dynamical system with some ODE solver, and $d(\cdot, \cdot)$ can be any metric evaluating the discrepancy between the true value and the predicted one (e.g., the $L_1$ or $L_2$ loss [hastie2009elements], the KL-divergence [ghahramani2015probabilistic], the Wasserstein distance [gulrajani2017improved], etc.). A standard way to minimize the loss function is through gradient descent [su2014differential, bottou2010large]; however, appropriate methods need to be employed for effectively computing the gradient of the loss function with respect to the parameters, namely $d\mathcal{L}/d\theta$. This is done by defining the adjoint of the dynamical system as $a(t) = \partial \mathcal{L} / \partial x(t)$. Then, the dynamical system describing the evolution of the adjoint can be derived as [pontryagin2018mathematical, chen2018neural]
(3)   $\frac{d a(t)}{dt} = -a(t)^\top \frac{\partial f\big(x(t), t; \theta\big)}{\partial x}$
Note that the adjoint $a(t)$ can be computed by an additional call to the chosen ODE solver, and the target derivative can then be computed as
(4)   $\frac{d\mathcal{L}}{d\theta} = -\int_{t_1}^{t_0} a(t)^\top \frac{\partial f\big(x(t), t; \theta\big)}{\partial \theta}\, dt$
where the Jacobians $\partial f / \partial x$ and $\partial f / \partial \theta$ can be evaluated via automatic differentiation [van2018automatic, chen2018neural].
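To make the forward/backward structure of equations 3 and 4 concrete, the following minimal sketch computes $d\mathcal{L}/d\theta$ for a scalar toy problem ($dx/dt = \theta x$ with $\mathcal{L} = x(T)$, an illustrative choice not taken from the paper): the state is solved forward, and the adjoint is solved backward while the gradient integral is accumulated.

```python
def rk4_step(f, y, dt):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(y)."""
    k1 = f(y)
    k2 = f(y + 0.5 * dt * k1)
    k3 = f(y + 0.5 * dt * k2)
    k4 = f(y + dt * k3)
    return y + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def adjoint_gradient(theta, x0=1.0, T=1.0, n=200):
    """Compute dL/dtheta for the toy problem dx/dt = theta * x with
    L = x(T): solve the state forward, then solve the adjoint
    da/dt = -a * (df/dx) backward while accumulating
    dL/dtheta = integral over [0, T] of a(t) * (df/dtheta) dt."""
    dt = T / n
    xs = [x0]
    for _ in range(n):                      # forward pass
        xs.append(rk4_step(lambda x: theta * x, xs[-1], dt))
    a, grad = 1.0, 0.0                      # a(T) = dL/dx(T) = 1
    for i in range(n, 0, -1):               # backward pass
        a_prev = rk4_step(lambda a_: -theta * a_, a, -dt)
        # trapezoidal accumulation of the gradient; here df/dtheta = x
        grad += 0.5 * dt * (a * xs[i] + a_prev * xs[i - 1])
        a = a_prev
    return grad
```

For this toy problem the result can be checked against the analytic derivative $d\mathcal{L}/d\theta = x_0 T e^{\theta T}$.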
The main advantages of this approach can be summarized in the following points:

The data does not need to be collected on a regular time grid.

The time-step between consecutive observed data pairs can be relatively large. Within each such interval, a classical numerical scheme can be used to integrate the underlying dynamics, with the interval discretized into equally spaced sub-steps whose size is typically chosen according to the stability properties of the underlying ODE solver.

As this setup only assumes dependency between individual input-output pairs $(x_i, x_{i+1})$, the observed time-series data does not need to be continuous and can be assembled from different time intervals.
The specific choices of the time-step, the number of sub-steps, and the ODE solver will be discussed for each of the different examples given in this work. In general, the number of sub-steps is chosen based on a trade-off between accuracy and computational complexity: too few sub-steps may lead to a less accurate model, while too many would lead to a massive computational graph that can significantly slow down model training.
Throughout this work, the following loss function is employed
(5)   $\mathcal{L}(\theta) = \sum_{i=1}^{N-1} \big\| x_{i+1} - \text{ODESolve}\big(x_i, t_i, t_{i+1}; \theta\big) \big\|_2^2$

where $\text{ODESolve}(\cdot)$ denotes the output of a numerical ODE solver. Throughout this work, we use the classical fourth-order Runge–Kutta method [iserles2009first], although more general choices can be employed [rackauckas2020universal]. To simplify notation, let $x_i$ be defined as $x(t_i)$, such that the training dataset consists of the consecutive pairs $\mathcal{D} = \{(x_i, x_{i+1})\}_{i=1}^{N-1}$.
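A minimal sketch of this pairwise loss is given below, with RK4 sub-stepping between consecutive, possibly irregularly spaced, observation times; the toy dynamics and data in the usage note are illustrative assumptions, not the paper's examples.

```python
import numpy as np

def rk4_solve(f, x0, t0, t1, m=20):
    """Integrate dx/dt = f(x) from t0 to t1 with m equally spaced RK4 sub-steps."""
    x, dt = np.array(x0, dtype=float), (t1 - t0) / m
    for _ in range(m):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

def pairwise_loss(f, ts, xs, m=20):
    """Sum of squared discrepancies over consecutive (x_i, x_{i+1}) pairs,
    each prediction obtained by integrating from t_i to t_{i+1}; the
    observation times ts need not be equispaced."""
    loss = 0.0
    for i in range(len(ts) - 1):
        pred = rk4_solve(f, xs[i], ts[i], ts[i + 1], m)
        loss += float(np.sum((xs[i + 1] - pred) ** 2))
    return loss
```

For instance, for the illustrative dynamics $f(x) = -x$ and exact data $x(t) = e^{-t}$ sampled at irregular times, the loss is numerically zero, while a mismatched model yields a large value.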
Notice that the aforementioned workflow is only capable of producing deterministic point estimates for the unknown parameters $\theta$, typically corresponding to a local minimum of equation 5. In many practical cases it is desirable to obtain a distribution over plausible parameter configurations that can effectively characterize the uncertainty in the estimates due to noise and sparsity in the observed data, as well as potential misspecifications in the model parametrization. A framework for accounting for such uncertainties can be constructed by adopting a Bayesian approach, via the use of effective sampling algorithms for approximating a posterior distribution over the unknown model parameters, as discussed in the next section.
2.2 Bayesian inference with Hamiltonian Monte Carlo
The Bayesian formalism provides a natural way to account for uncertainty, while also enabling the injection of prior information for the unknown model parameters (e.g., sparsity), as well as for modeling sparse and noisy observations in the training dataset. Perhaps more importantly, it enables a complete statistical characterization of all inferred parameters in our model. The latter is encapsulated in the posterior distribution, which can be factorized as
(6)   $p(\theta, \gamma, \beta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta, \beta)\, p(\theta \mid \gamma)\, p(\gamma)\, p(\beta)$
where $p(\mathcal{D} \mid \theta, \beta)$ is a likelihood function that measures the discrepancy between the observed data and the model's predictions for a given set of parameters $\theta$, $p(\theta \mid \gamma)$ is a prior distribution parametrized by a set of parameters $\gamma$ that can help us encode any domain knowledge about the unknown model parameters $\theta$, and $\beta$ contains a set of parameters that aim to characterize the noise process that may be corrupting the observations. In this work we employ a hierarchical Bayesian approach corresponding to the following likelihood and priors
(7)   $p(\mathcal{D} \mid \theta, \beta) = \prod_{i=1}^{N-1} \mathcal{N}\big(x_{i+1} \mid \hat{x}_{i+1}(\theta), \beta^{-1} I\big), \quad p(\theta \mid \gamma) = \text{Laplace}\big(0, \gamma^{-1}\big), \quad p(\beta) = \Gamma(a_\beta, b_\beta), \quad p(\gamma) = \Gamma(a_\gamma, b_\gamma)$
The use of a Gaussian likelihood stems from assuming a simple isotropic Gaussian noise model with zero mean and precision $\beta$. The Laplace prior over the unknown model parameters [williams1995bayesian] can promote sparsity in the inferred model representations, and enables us to effectively neglect the influence of any irrelevant parameters. The Gamma distribution is a common choice for the prior distributions over the unknown precision parameters $\beta$ and $\gamma$. For additional motivation and alternative choices, the interested reader is referred to [geweke1993bayesian, gelman2013bayesian, winkler1967assessment, bernardo1979reference, berger1990robust]. Finally, the logarithm of the precision variables $\beta$ and $\gamma$ is used to ensure that their estimated values remain positive during model training.

The posterior distribution defined in equation 7 is not analytically tractable in general, due to our modeling assumptions on the likelihood and priors, as well as due to the presence of nonlinearity in the latent dynamics of equation 1. Typically, sampling from this unnormalized distribution is difficult and computationally expensive, especially when the dimension of $\theta$ is large. Hamiltonian Monte Carlo (HMC) [neal2011mcmc] is a powerful tool for handling Bayesian inference tasks in high dimensions by utilizing gradient information to effectively generate approximate posterior samples. To illustrate the key ideas behind HMC sampling, let us denote by $\xi$ the vector of all the unknown parameters that need to be inferred from data. The starting point for building an HMC sampler is to define a Hamiltonian function
(8)   $H(\xi, r) = U(\xi) + K(r)$
where $U(\xi)$ is the potential energy of the system, usually taken as the negative logarithm of the unnormalized posterior distribution in equation 7, and $K(r)$ is the kinetic energy of the system, introduced via the auxiliary velocity variables $r$. The evolution of $\xi$ and $r$ can be expressed by taking gradients of the Hamiltonian as
(9)   $\frac{d\xi}{dt} = \frac{\partial H}{\partial r}, \qquad \frac{dr}{dt} = -\frac{\partial H}{\partial \xi}$
A Markov chain can be simulated by integrating this dynamical system using an energy-preserving leapfrog numerical scheme [neal2011mcmc], with $\epsilon$ referring to the step size of the HMC and $L$ indicating the number of leapfrog steps taken at each HMC iteration. The choice of $L$ is problem-dependent and is selected from a prescribed range, while the leapfrog step size $\epsilon$ is chosen empirically. The update rule for the parameters $\xi$ and $r$ is then given by

(10)   $r^{(t+\frac{1}{2})} = r^{(t)} + \frac{\epsilon}{2} \nabla_\xi \log p\big(\xi^{(t)} \mid \mathcal{D}\big), \quad \xi^{(t+1)} = \xi^{(t)} + \epsilon\, r^{(t+\frac{1}{2})}, \quad r^{(t+1)} = r^{(t+\frac{1}{2})} + \frac{\epsilon}{2} \nabla_\xi \log p\big(\xi^{(t+1)} \mid \mathcal{D}\big)$
Note here that computing the gradient $\nabla_\xi \log p(\xi \mid \mathcal{D})$ is difficult and expensive, as the likelihood function is computed through the predictions of an ODE solver. In this case, the differentiable programming framework discussed in section 2.1 is used to address this issue by effectively enabling gradient backpropagation through the ODE solver. Note that all parameters in $\xi$ can be either updated simultaneously, or separately in groups using a Metropolis-within-Gibbs scheme [gilks1995adaptive, millar2000non].
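The leapfrog/accept-reject loop described above can be sketched as follows. The generic `log_prob`/`grad_log_prob` interface and the unit mass matrix are simplifying assumptions for illustration; in the actual workflow, the gradient would be supplied by backpropagation through the ODE solver.

```python
import numpy as np

def hmc_sample(log_prob, grad_log_prob, z0, n_samples=2000, L=20, eps=0.1, seed=0):
    """Minimal HMC sketch: resample a Gaussian momentum, leapfrog-integrate
    the Hamiltonian dynamics for L steps of size eps, then accept/reject
    on the change in total energy (Metropolis correction)."""
    rng = np.random.default_rng(seed)
    z, samples = np.array(z0, dtype=float), []
    for _ in range(n_samples):
        r = rng.standard_normal(z.shape)
        z_new, r_new = z.copy(), r.copy()
        r_new += 0.5 * eps * grad_log_prob(z_new)      # half momentum step
        for _ in range(L - 1):
            z_new += eps * r_new                       # full position step
            r_new += eps * grad_log_prob(z_new)        # full momentum step
        z_new += eps * r_new                           # final position step
        r_new += 0.5 * eps * grad_log_prob(z_new)      # final half momentum step
        dH = (log_prob(z_new) - 0.5 * np.sum(r_new**2)) \
           - (log_prob(z) - 0.5 * np.sum(r**2))
        if np.log(rng.uniform()) < dH:                 # Metropolis accept
            z = z_new
        samples.append(z.copy())
    return np.array(samples)
```

For example, sampling a standard two-dimensional Gaussian with `log_prob = lambda z: -0.5 * np.sum(z**2)` and `grad_log_prob = lambda z: -z` recovers zero mean and unit variance to within Monte Carlo error.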
2.3 Learning dynamics with Bayesian differential programming
Here we distinguish between three different problem settings that cover a broad range of practical applications. The first class consists of problems in which the model form of the underlying latent dynamics is completely unknown. In this case, one can parametrize the unknown dynamical system using black-box function approximators such as deep neural networks [chen2018neural, gholami2019anode, raissi2018multistep], or aim to distill a more parsimonious and interpretable representation by constructing a comprehensive dictionary over all possible interactions and trying to infer a predictive, yet minimal model form [brunton2016discovering, rudy2017data, qin2019data]. The second class of problems contains cases where a model form for the underlying dynamics is prescribed by domain knowledge, but a number of unknown parameters needs to be calibrated in order to accurately explain the observed data [tartakovsky2018learning, yazdani2019systems]. Finally, a third class of problems arises as a hybrid of the aforementioned settings, in which some parts of the model form may be known from domain knowledge, but there exist additional functional terms that need to be inferred [rackauckas2020universal]. As we will see in the following, the proposed workflow can seamlessly accommodate all of the aforementioned cases in a unified fashion, while remaining robust with respect to incomplete model parametrizations, as well as imperfections in the observed data. To this end, a general framework can be constructed by parametrizing the unknown dynamics as

(11)   $\frac{dx(t)}{dt} = \Lambda^\top \varphi(x) + g(x; w)$

where $\Lambda$ represents a matrix of unknown coefficients, with $K$ being the length of a dictionary $\varphi(x)$ that may or may not be constructed using domain knowledge. Specifically, $\varphi(x)$ represents the possible terms that may appear in the right-hand side of the ordinary differential equation, which could encapsulate a known model form or, more generally, a prescribed dictionary of features (e.g., polynomials, Fourier modes, etc., and combinations thereof) [brunton2016discovering, rudy2017data, champion2019data]. On the other hand, $g(x; w)$ denotes a black-box function approximator (e.g., a neural network) parametrized by $w$ that aims to account for any missing interactions that are not explicitly captured by $\varphi(x)$ [rackauckas2020universal].
Here we must note that domain knowledge can play a crucial role in enhancing the efficiency of the resulting inference scheme as it can effectively reduce the size of the dictionary, and, consequently, the number of data required to train the model [zhu2019physics, raissi2019physics]. Such knowledge is also critical to constrain the space of admissible solutions such that key physical principles are faithfully captured by the inferred model (e.g., convergence to equilibrium limit cycles in chemical systems [schnakenberg1979simple], conservation of mass and momentum in fluid dynamics [raissi2018hidden], etc).
Although the use of dictionary learning (with or without specific domain knowledge) offers a flexible paradigm for recovering interpretable dynamic representations, it may not always be sufficient to explain the observed data, as important terms may be missing from the model parametrization. To address this shortcoming, one can try to capture these missing interactions via the use of closure terms that often lack physical intuition, and hence can be represented by a black-box function approximator with parameters $w$, such as a deep neural network [rackauckas2020universal].
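As an illustration of the hybrid parametrization of equation 11, the sketch below combines a second-order polynomial dictionary for a two-dimensional state with a small tanh network closure; the dictionary ordering and network sizes are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

def dictionary(x):
    """Polynomial feature dictionary up to 2nd order for a 2D state:
    phi(x) = [1, x1, x2, x1^2, x1*x2, x2^2] (an illustrative ordering)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

def closure(x, W1, b1, W2, b2):
    """Small tanh network g(x; w) intended to capture interactions that the
    dictionary misses; the weights (W1, b1, W2, b2) play the role of w."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def rhs(x, Lambda, nn_params):
    """Hybrid parametrized dynamics dx/dt = Lambda^T phi(x) + g(x; w),
    in the spirit of equation 11."""
    return Lambda.T @ dictionary(x) + closure(x, *nn_params)
```

For example, a coefficient matrix encoding $\dot{x}_1 = x_2$, $\dot{x}_2 = -x_1$ with a zero-initialized network reproduces the harmonic oscillator right-hand side exactly.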
Under this setup, our goal is to employ the algorithmic framework outlined in sections 2.1 and 2.2 to perform probabilistic inference over plausible sets of model parameters that yield interpretable, parsimonious, and predictive representations, as well as over the precision parameters $\beta$ and $\gamma$ of the Bayesian hierarchical model in equation 7.
2.4 Generating forecasts with quantified uncertainty
The goal of the HMC algorithm described in section 2.2 is to produce a faithful set of samples that concentrate in regions of high probability in the posterior distribution $p(\theta \mid \mathcal{D})$. Approximating this distribution is central to our workflow, as it enables the generation of future forecasts with quantified uncertainty via the predictive posterior distribution

(12)   $p(x^* \mid t^*, \mathcal{D}) = \int p(x^* \mid t^*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$
This predictive distribution provides a complete statistical characterization of the forecasted states by encapsulating epistemic uncertainty in the inferred dynamics, as well as accounting for the fact that the model was trained on a finite set of noisy observations. This allows us to generate plausible forecast realizations $x^*$ by sampling from the predictive posterior distribution as
(13)   $x^*_{(s)} = \text{ODESolve}\big(x_0, t^*; \theta_{(s)}\big) + \epsilon_{(s)}, \qquad \epsilon_{(s)} \sim \mathcal{N}\big(0, \beta_{(s)}^{-1} I\big)$

where $\theta_{(s)}$ and $\beta_{(s)}$ are approximate samples from $p(\theta, \beta \mid \mathcal{D})$ computed during model training via HMC sampling, $\text{ODESolve}(\cdot)$ denotes any numerical integrator that takes some initial condition and predicts the system's state at any time $t^*$, and $\epsilon_{(s)}$ accounts for the noise corrupting the observations used during model training. Moreover, the maximum a posteriori (MAP) estimate of the model parameters is given by
(14)   $\theta_{\text{MAP}} = \arg\max_{\theta}\, \log p(\theta \mid \mathcal{D})$

which can then be used to obtain a point-estimate prediction of the forecasted states as

(15)   $x^*_{\text{MAP}} = \text{ODESolve}\big(x_0, t^*; \theta_{\text{MAP}}\big)$
Finally, it is straightforward to utilize the posterior samples of $\theta$ to approximate the first- and second-order statistics of the predicted states for any given initial condition as

(16)   $\mu_{x^*}(t^*) \approx \frac{1}{N_s} \sum_{s=1}^{N_s} x^*_{(s)}(t^*)$

(17)   $\sigma^2_{x^*}(t^*) \approx \frac{1}{N_s} \sum_{s=1}^{N_s} \big[ x^*_{(s)}(t^*) - \mu_{x^*}(t^*) \big]^2$

where $N_s$ denotes the number of samples drawn from the Hamiltonian Markov chain used to simulate the posterior, i.e., $\theta_{(s)} \sim p(\theta \mid \mathcal{D})$, $s = 1, \dots, N_s$. Note that higher-order moments are also readily computable in a similar manner.
2.5 Model initialization and data preprocessing
To promote robustness and stability in the training of the proposed machine learning pipeline, users should be cognizant of several important aspects. First, although the proposed Bayesian approach can naturally safeguard against overfitting, it is important that a reasonable amount of training data is provided – relative to the complexity of the system – in order to mitigate any effects of prior misspecification. Second, the training data should be appropriately normalized in order to prevent gradient pathologies during backpropagation [glorot2010understanding]. The specific utility of this step will be demonstrated in the numerical examples presented in this work, using a standard normalization of the form
(18)   $\tilde{x} = x / \sigma_x$

where $\sigma_x$ is the dimension-wise standard deviation of the training data, and the division is an element-wise operation. Notice that this modification directly implies that the assumed parametrization of the underlying dynamical system also needs to be normalized accordingly (see section 3 for a more detailed discussion). A third important remark is that the noise precision $\beta$ obtained from the model aims to reflect the noise level in the observed data; however, it may not correspond to the true noise level, because the initial condition of the ODE can also be noisy. This point is further discussed in section 3.

Another important point relates to the initialization of the Hamiltonian Monte Carlo sampler. In order to mitigate poor mixing and convergence to local minima, a preconditioning step is employed to find a reasonable initial guess for the unknown variables that parametrize the underlying latent dynamics. This step is carried out by minimizing the reconstruction loss of the training data with an $L_1$ regularization penalty,
(19)   $\mathcal{L}_{\text{pre}}(\theta) = \mathcal{L}(\theta) + \lambda \|\theta\|_1$
using a small number of stochastic gradient descent iterations. Notice that this essentially aims at obtaining a rough point estimate for $\theta$, where the use of $L_1$ regularization stems from employing a sparsity-promoting Laplace prior. This preconditioning step is closely related to the SINDy algorithm of Brunton et al. [brunton2016discovering], however without the limitations of requiring training data with small noise amplitude, sampled on a regular time grid with a small time-step. Moreover, our numerical experiments indicate that tuning the hyperparameter $\lambda$ has almost no effect on the obtained results for all the problems considered in this work, as this step simply serves to initialize the HMC sampler; hence, a single fixed value of $\lambda$ is used in all examples. The minimization of equation 19 is carried out using stochastic Adam updates [kingma2014adam] with a fixed learning rate.

Note that this preconditioning is not precisely equivalent to MAP estimation of the posterior distribution over all model parameters, since the initialization of the precision parameters $\beta$ and $\gamma$ follows a different treatment. Specifically, for all the problems considered in section 3, the shape and rate parameters of the prior Gamma distributions are set to fixed values. Moreover, the precision of the Gaussian noise distribution is initialized as follows. If the preconditioning precision exceeds a prescribed threshold, indicating that the training data appears to be nearly noise-free, the initial guess for $\beta$ is capped at that threshold to avoid numerical stagnation of the HMC sampler; otherwise, the preconditioning precision itself is used as the initial guess for $\beta$. This empirical initialization strategy has a positive effect in accelerating the convergence of the HMC sampler for all the examples considered in section 3. Finally, in all of our numerical examples, the Hamiltonian Monte Carlo step size $\epsilon$ (see equation 10) is taken to be small; this is a conservative choice, and one can increase its value as long as excessive rejections are not observed during training.
The number of Monte Carlo realizations and the number of samples used for computing the response statistics (see equation 16) are kept fixed, unless otherwise noted.
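The normalization of equation 18, together with the induced rescaling of the assumed dynamics, can be sketched as follows (assuming $\tilde{x} = x / \sigma$, so that $d\tilde{x}/dt = f(\sigma \tilde{x}) / \sigma$):

```python
import numpy as np

def normalize(X):
    """Dimension-wise normalization of equation 18: divide each state
    dimension by its standard deviation over the training trajectory."""
    sigma = X.std(axis=0)
    return X / sigma, sigma

def rescale_rhs(f, sigma):
    """If x_tilde = x / sigma, then d(x_tilde)/dt = f(sigma * x_tilde) / sigma,
    i.e., the assumed dynamics must be normalized consistently."""
    return lambda x_tilde: f(sigma * x_tilde) / sigma
```

Note that for a linear right-hand side the rescaling is invisible (it cancels), whereas for nonlinear dictionaries the coefficients inferred in normalized space must be mapped back to physical units.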
3 Results
In this section we present a comprehensive collection of numerical studies that aim to illustrate the key contributions of this work, and place them in the context of the existing SINDy framework of Brunton et al. [brunton2016discovering], which is currently considered a state-of-the-art method for dictionary learning of dynamical systems. Specifically, we expand on five benchmark problems that cover all possible cases discussed in section 2.3 in terms of parametrizing the latent dynamics using a dictionary, domain knowledge, black-box approximations, or a combination thereof. The algorithmic settings used across all cases follow the discussion provided in section 2.5, unless otherwise noted. All code and data presented in this section will be made publicly available at https://github.com/PredictiveIntelligenceLab/BayesianDifferentiableProgramming.
3.1 Dictionary learning for a twodimensional nonlinear oscillator
Let us start with a pedagogical example on dictionary learning for inferring the dynamics of a two-dimensional damped oscillator from scattered time-series data [raissi2018multistep]. The exact system dynamics are given by

(20)   $\frac{dx_1}{dt} = -0.1\, x_1^3 + 2.0\, x_2^3, \qquad \frac{dx_2}{dt} = -2.0\, x_1^3 - 0.1\, x_2^3$
Our goal here is to recover this dynamical system directly from data using a dictionary parametrization containing polynomial terms with up to 3rd-order interactions, taking the form

(21)   $\frac{dx}{dt} = \Lambda^\top \varphi(x), \qquad \varphi(x) = \big[1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2,\; x_1^3,\; x_1^2 x_2,\; x_1 x_2^2,\; x_2^3\big]^\top$

where the entries of $\Lambda$ are unknown scalar coefficients that will be estimated using the proposed Bayesian differential programming method. The model's active non-zero parameters are highlighted with red color for clarity. Our goal is to infer a posterior distribution over these parameters; evidently, the coefficient matrix and dictionary will increase in size with the order of the polynomial features used to parametrize the system. In all presented experiments, we assume a fixed initial condition and generate a set of training data by simulating the exact dynamics of equation 20 over a prescribed time interval.
3.1.1 Effects of sparsity in the training data
In what follows, we will investigate the performance of our algorithms with respect to the temporal resolution and the noise level in the observed data. Moreover, we will provide a comprehensive comparison with the SINDy algorithm of Brunton et al. [brunton2016discovering].
To examine the performance of our method with respect to the temporal resolution of the observed data we distinguish between two cases: (i) the data is generated with a relatively small time-step, yielding a larger number of training samples, and (ii) the data is generated with a relatively large time-step, such that only a few training samples are available. In order to establish a comparison against the SINDy algorithm, we will also assume that the training data is noise-free and sampled on a regular temporal grid, as required by the SINDy setup [brunton2016discovering].
Figure 1 presents the results obtained for the case corresponding to high-resolution training data, generated using a small time-step. Specifically, it depicts the training data together with the exact simulated trajectory of the system, as well as the forecasts generated by the proposed Bayesian framework and the deterministic SINDy approach. We observe that both methods are able to correctly recover the underlying dynamics and produce sensible forecasts of future states, while our proposed approach can also yield well-calibrated uncertainty estimates. The latter are extracted from the inferred posterior distribution, pair-density plots of which are depicted in figure 2(b) for the identified active parameters. The parameters identified by the proposed method are summarized in table 1 and show accurate agreement with the exact model form, while SINDy's parameters are summarized in table 2.
Table 1 (proposed method, identified dictionary coefficients; columns follow the dictionary terms of equation 21):
$\dot{x}_1$:  0.00034  0.00022  0.00045  0.000056  0.000018  0.00063  0.10  0.00056  0.00025  2.00
$\dot{x}_2$:  0.00038  0.0027  0.0000023  0.00016  0.00087  0.000025  2.010  0.0016  0.00095  0.010
Table 2 (SINDy, identified dictionary coefficients):
$\dot{x}_1$:  0  0  0  0  0  0  0.10  0  0  2.0
$\dot{x}_2$:  0  0  0  0  0  0  2.0  0  0  0.10
Although both methods give sufficiently accurate results in the high-resolution case, a discrepancy arises when considering low-resolution training data, generated using a larger time-step. As seen in figures 3 and 4, the SINDy algorithm fails when the training data is sparse, while the proposed approach remains robust, as the Bayesian formulation outlined in equation 7 enables a faithful statistical characterization of the latent system dynamics even under sparse observations. Indeed, the proposed method not only accurately identifies the parameters even for a relatively large time-step (see table 3), but also provides reasonable uncertainty estimates for the extrapolated long-term forecasts. In contrast, the SINDy algorithm gives inaccurate estimates for the dictionary parameters (see table 4), consequently leading to large errors in the forecasted system states. As shown in figures 3(b),(c), the estimated trajectories clearly deviate from the exact solution, as their oscillation frequency is considerably higher than the exact one and the decay of the oscillation peaks is not captured. Moreover, SINDy's results for the large time-step case depend strongly on the choice of the sequential least-squares threshold parameter, and slightly different values of this hyperparameter give significantly different results.
Table 3 (proposed method, identified dictionary coefficients for the large time-step case):
$\dot{x}_1$:  0.00523  0.00569  0.0095  0.0094  0.0116  0.0057  0.0093  0.054
$\dot{x}_2$:  0.0048  0.0195  0.0072  0.0214  0.0024  0.0075  0.0279  0.0145
Table 4 (SINDy, identified dictionary coefficients for the large time-step case):
$\dot{x}_1$:  0  0  0  0  0  0  0  0  0  2.10
$\dot{x}_2$:  0  0  0  0  0  0  1.87  0  0  0
3.1.2 Effects of noise in the training data
Here we aim to investigate the sensitivity of the proposed method with respect to the presence of noise in the training data. To this end, we generate a training dataset of equispaced data pairs by simulating the exact dynamics of equation 20, and deliberately corrupt the observations with uncorrelated Gaussian noise. The dictionary parameters inferred using the proposed Bayesian differential programming approach are summarized in table 5. In figure 5, we show the predicted trajectories of our model and the predictions of SINDy trained on the noisy data sampled on a regular grid. The noisy training data and the posterior distribution of the unknown parameters are given in figure 6,
Table 5 (proposed method, identified dictionary coefficients for the noisy data case):
$\dot{x}_1$:  0.0042  0.0059  0.0045  0.0072  0.0078  0.0071  0.10  0.047  0.014  2.0
$\dot{x}_2$:  0.0029  0.010  0.0073  0.0017  0.017  0.0064  2.0  0.015  0.049
showcasing good agreement with the exact active parameter values, which correspond to the entries highlighted in red. This result illustrates the robust performance of our framework in identifying interpretable and parsimonious system representations, even in the presence of noise in the training data. On the other hand, the SINDy algorithm yields the identified dictionary reported in table 6.
Table 6 (SINDy, identified dictionary coefficients for the noisy data case):
$\dot{x}_1$:  0  0  0  0  0  0  0.139  0  0  2.0
$\dot{x}_2$:  0  0  0  0  0  0  1.97  0  0.188  0
We can see that, although the predicted trajectory of SINDy is quite close to the true trajectory, the identified parameters differ considerably from the true model form. Even though SINDy employs a total variation diminishing (TVD) regularization to safeguard against small noise corruptions in the training data [brunton2016discovering], it is still prone to producing inaccurate results in cases where the training data is imperfect. This limitation is addressed in the proposed framework by explicitly accounting for the effects of noise in the training data using the hierarchical Bayesian model of equation 6. Here, we must note that this source of uncertainty is also effectively propagated through the system's dynamics, yielding a sensible characterization of uncertainty in the predicted forecasts, as discussed in section 2.4.
3.2 Parameter inference in a predator–prey system with irregularly sampled observations
This case study is designed to illustrate the capability of the proposed framework to accommodate noisy and irregularly sampled time-series data; a common practical setting that cannot be effectively handled by SINDy and other popular data-driven systems identification methods [brunton2016discovering, rudy2017data, lusch2018deep, raissi2018multistep, qin2019data]. To this end, let us consider a classical prey–predator system described by the Lotka–Volterra equations
(22)  $\dot{x} = \alpha x - \beta x y, \qquad \dot{y} = -\gamma y + \delta x y$
which is known to exhibit stable limit cycle behavior for , and . Without loss of generality, here we assume a dictionary that precisely contains the active terms of the system; however, our training data-pairs will be irregularly sampled in the interval by randomly subsampling an exact trajectory of system 22 starting from an initial condition set to . Given this irregular training data, our goal is to demonstrate the performance of the proposed Bayesian framework in identifying the unknown model parameters with quantified uncertainty, as well as in producing sensible forecasts of extrapolated future states. Specifically, two different experiments are considered: (i) the model is trained on noise-free data, and (ii) the training is carried out using data perturbed by white noise.
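The data-generation protocol above (simulate, then randomly subsample at irregular times) can be sketched as follows. The paper's parameter values, time horizon, and initial condition are elided in the text, so the values below are illustrative assumptions only:

```python
import random

def lotka_volterra(state, alpha=1.0, beta=0.1, gamma=1.5, delta=0.075):
    # Classical prey-predator dynamics; parameter values are illustrative.
    x, y = state
    return (alpha * x - beta * x * y, -gamma * y + delta * x * y)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def irregular_dataset(x0, t_max, dt, n_obs, seed=0):
    """Simulate on a fine grid, then randomly subsample n_obs (t, state)
    pairs, yielding an irregularly sampled training set."""
    rng = random.Random(seed)
    traj, state, t = [(0.0, x0)], x0, 0.0
    while t < t_max:
        state = rk4_step(lotka_volterra, state, dt)
        t += dt
        traj.append((t, state))
    return sorted(rng.sample(traj, n_obs))
```

Noise, when present, would simply be added to the subsampled states before training.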
The numerical results obtained for the noise-free data case are given in figure 7. The corresponding training data and the posterior distribution of the parameters can be found in figure 8. Similarly, the numerical results obtained for the noisy data case are shown in figures 9 and 10. In both cases, it is evident that the proposed Bayesian differential programming approach is able to (i) provide an accurate estimation of the unknown model parameters, (ii) yield a MAP estimator whose predicted trajectory closely matches the exact system's dynamics, (iii) return a posterior distribution over plausible models that captures both the epistemic uncertainty of the assumed parametrization and the uncertainty induced by training on a finite amount of noisy data, and (iv) propagate this uncertainty through the system's dynamics to characterize the variability in the predicted future states.
Another interesting observation is that the Hamiltonian Monte Carlo sampler is very effective in identifying the importance/sensitivity of each inferred parameter in the model. For instance, the less important parameters exhibit the highest uncertainty, as observed in the posterior density plots shown in figures 8(b) and 10(b). Specifically, notice how the posterior distributions of and have a considerably larger standard deviation than those of the other parameters, implying that the evolution of this dynamical system is less sensitive with respect to these parameters.
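For readers unfamiliar with the sampler, the mechanics of Hamiltonian Monte Carlo can be conveyed by a minimal single-chain implementation with leapfrog integration on a toy one-dimensional Gaussian target. This is a textbook sketch, not the paper's actual sampler configuration:

```python
import math
import random

def hmc_sample(logp, grad_logp, x0, n_samples, eps=0.1, n_leap=20, seed=0):
    """Minimal 1D Hamiltonian Monte Carlo: leapfrog integration of the
    Hamiltonian dynamics followed by a Metropolis accept/reject step."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                      # resample momentum
        x_new, p_new = x, p + 0.5 * eps * grad_logp(x)
        for i in range(n_leap):                      # leapfrog trajectory
            x_new = x_new + eps * p_new
            if i != n_leap - 1:
                p_new = p_new + eps * grad_logp(x_new)
        p_new = p_new + 0.5 * eps * grad_logp(x_new)
        # Accept/reject based on the change in total energy
        h_old = -logp(x) + 0.5 * p * p
        h_new = -logp(x_new) + 0.5 * p_new * p_new
        if math.log(rng.random() + 1e-300) < h_old - h_new:
            x = x_new
        samples.append(x)
    return samples

# Standard normal target: the chain should equilibrate to mean ~0, variance ~1.
logp = lambda x: -0.5 * x * x
grad = lambda x: -x
draws = hmc_sample(logp, grad, x0=3.0, n_samples=2000)
```

The gradient information is what lets HMC scale to the high-dimensional posteriors considered here; in practice the gradients are obtained by automatic differentiation through the ODE solver rather than by hand.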
3.3 Safeguarding against model inadequacy: a damped pendulum case study
In this example we aim to demonstrate the effects of a misspecified model parametrization, and to highlight how the proposed Bayesian framework can help detect such problematic cases and safeguard against them. Such cases may arise when domain knowledge is insufficient to guide the selection of a parsimonious model form, or when important interaction terms are missing from a dictionary representation. To illustrate the main ideas, let us consider a simple damped pendulum system described by
(23)  $\dot{x}_1 = x_2, \qquad \dot{x}_2 = -\alpha x_2 - \beta \sin(x_1)$
where here we choose , and . A set of sparse and irregularly sampled training data-pairs can be generated by randomly subsampling a simulated trajectory of the exact dynamics in , starting from an initial condition . This imperfect dataset can then be used to recover an interpretable model representation via dictionary learning, albeit here we deliberately choose an incomplete dictionary containing polynomial terms only up to 1st order, i.e.,
(24) 
where the unknown model parameters are , with the active terms marked in red for clarity. Moreover, a blue marker is used to highlight a mismatched term in our incomplete dictionary, hinting at its erroneous attempt to approximate the true term using just as a feature. It is evident that such a dictionary choice cannot faithfully capture the exact physics of the problem, and is hence destined to yield inaccurate predictions. Nevertheless, here we argue that this discrepancy between the true form and the assumed incomplete parametrization of the dynamical system can be effectively detected by our Bayesian workflow by inspecting the inferred posterior distribution over all parameters , which is expected to exhibit high entropy in the presence of a misspecified model parametrization and imperfect training data.
The results of this experiment are summarized in figures 11 and 12. In particular, figure 11 shows the scattered training data, the exact simulated trajectory, the predicted trajectory corresponding to the MAP estimate of the inferred model parameters, as well as a two-standard-deviations band around a set of plausible forecasted states. As expected, the use of a misspecified dictionary leads to an inferred model that has difficulty fitting the observed data and yields predictions with high variability. This model inadequacy is also clearly captured in the posterior distribution over the inferred parameters shown in figure 12. There we can see that the inferred density for the parameter exhibits very high variance, as the coefficient crucially corresponds to the misspecified term in our dictionary. This is a direct indication that the inferred model is inadequate to capture the observed reality, leading to forecasts with inevitably large error-bars. This innate ability of our framework to detect model misspecification via a rigorous quantification of predictive uncertainty can prove crucial in risk-sensitive applications, and provides an important capability that is currently missing from the recently active literature on data-driven model discovery [brunton2016discovering, rudy2017data, champion2019data, qin2019data, rackauckas2020universal, raissi2018multistep].
A natural question to ask now is: can we still recover an accurate predictive model even if the true dynamics can only be partially captured by our chosen dictionary representation? To tackle this question we turn our attention to the hybrid learning setting discussed in section 2.3, in which some parts of our model can be captured by sparsely selecting interpretable terms from a dictionary, while other missing parts or closure terms are accounted for via a black-box function approximator. To this end, let us revisit our damped pendulum case study, and endow our dictionary learning parametrization with the ability to approximate the missing term via a deep neural network as
(25) 
where is a matrix of unknown coefficients corresponding to a dictionary containing polynomial interactions up to 2nd order, and is a fully-connected deep neural network with hidden layers of dimension , a hyperbolic tangent activation, and a set of unknown weight and bias parameters denoted by . Notice that the neural network admits only as an input, to ensure that the resulting parametrized dynamical system has a unique solution. Under this setup, the proposed Bayesian framework can be employed to jointly infer the dictionary and neural network parameters that define our model representation, namely .
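A minimal sketch of such a hybrid right-hand side may help fix ideas. The feature ordering, network shape, and function names below are our own assumptions, not the paper's exact parametrization:

```python
import math

def mlp(x, params):
    """Tiny fully-connected network with tanh activations on hidden layers.
    params is a list of (weights, biases) per layer, weights as list-of-rows."""
    h = x
    for i, (W, b) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]
        h = [math.tanh(v) for v in z] if i < len(params) - 1 else z
    return h

def hybrid_rhs(state, theta, nn_params):
    """Hybrid pendulum dynamics: x1' = x2, while x2' combines a sparse
    polynomial dictionary (up to 2nd order) with a black-box closure g(x1).
    The closure network takes only x1 as input, mirroring the well-posedness
    constraint discussed above."""
    x1, x2 = state
    features = [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]
    dx1 = x2
    dx2 = sum(t * f for t, f in zip(theta, features)) + mlp([x1], nn_params)[0]
    return (dx1, dx2)
```

In the Bayesian setting, both `theta` and every entry of `nn_params` would carry a prior and be sampled jointly with HMC.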
Figure 13 shows the scattered training data, the exact simulated trajectory, the predicted trajectory corresponding to the MAP estimate of the inferred model parameters, as well as a two-standard-deviations band around a set of plausible forecasted states. It is evident that the revised hybrid model formulation can now correctly identify the true underlying dynamics, leading to an accurate predicted MAP trajectory, while the predicted uncertainty effectively diminishes and concentrates around the ground truth. Moreover, the inferred coefficients of the irrelevant terms are very close to zero, while the active terms are all properly identified. This is also true for the neural network approximation of the missing closure term , as depicted in figure 14, which also includes uncertainty estimates over the neural network outputs. This simple example illustrates the great flexibility of the proposed Bayesian framework in seamlessly distilling parsimonious and interpretable models from imperfect data via physics-informed dictionary learning, as well as harnessing the power of black-box representations for approximating missing closure terms.
3.4 Learning and forecasting chaotic dynamics with quantified uncertainty
This example aims to illustrate the performance of the proposed framework in a more challenging setting involving dictionary learning from noisy and irregularly sampled observations of a chaotic time-series, leading to a 30-dimensional inference problem. To this end, we consider the classical Lorenz system [lorenz1963deterministic]
(26)  $\dot{x} = \sigma (y - x), \qquad \dot{y} = x (\rho - z) - y, \qquad \dot{z} = x y - \beta z$
where , and are chosen to induce a chaotic response [lorenz1963deterministic]. Using an initial condition of , we simulate the exact dynamics to generate a training dataset consisting of irregularly sampled observations in the time interval . The training data-pairs are also corrupted by isotropic uncorrelated Gaussian noise of the form (see figure 15(a)). Using this observed time-series, our goal is to employ the proposed Bayesian framework to recover an interpretable and parsimonious representation of the chaotic Lorenz dynamics by sparsely selecting terms from a dictionary containing polynomial interactions up to second order, i.e.,
(27) 
where the ’s are the scalar coefficients to be inferred. The non-zero parameters contributing to the exact Lorenz dynamics are highlighted in red for clarity. In this case, the dictionary has parameters. Since the scale of this problem’s variables is large, we normalize the observed data using a uniform strategy that consistently preserves the attractor’s geometry. Specifically, instead of using equation 18, the training data is normalized as follows:
(28) 
where is taken equal to the maximum standard deviation of the training data across the different dimensions.
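A minimal sketch of this uniform normalization strategy follows. Whether the per-dimension mean is also removed is our assumption; the essential ingredient is the single shared scale, which preserves the relative proportions of the attractor across dimensions:

```python
def normalize_uniform(data):
    """Normalize a multivariate time-series by one shared scale: subtract
    per-dimension means, then divide every dimension by the largest
    per-dimension standard deviation, preserving the attractor's geometry."""
    dims = list(zip(*data))
    means = [sum(d) / len(d) for d in dims]
    stds = [(sum((v - m) ** 2 for v in d) / len(d)) ** 0.5
            for d, m in zip(dims, means)]
    s = max(stds)  # one scale for all dimensions
    return [tuple((v - m) / s for v, m in zip(row, means)) for row in data]
```

Per-dimension scaling (as in equation 18) would instead stretch each axis independently and distort the butterfly-shaped geometry of the attractor.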
The results of this experiment are summarized in table 7 and figures 15 and 16. Specifically, table 7 presents the computed MAP estimators for the dictionary coefficients , highlighting the ability of the proposed algorithms to correctly recover the active terms that define a parsimonious representation of the Lorenz system. In this case, due to the chaotic nature of the dynamics, any small imperfection in the inferred dynamics can lead to predicted trajectories that deviate from the ground truth. However, the predicted uncertainty estimates are expected to capture the variability induced by the Lorenz attractor. Indeed, as shown in figure 16, we observe small uncertainty estimates in the regions where the predicted MAP trajectory closely follows the ground truth solution, while the prediction of future extrapolated states exhibits high variance that faithfully captures the evolution of different trajectories along both branches of the butterfly-shaped attractor. This robust characterization of predictive uncertainty is one of the key attributes of the proposed Bayesian approach, which enables the computation of the joint posterior distribution over all model parameters, as shown in figure 15(b) for the active dictionary coefficients.
0.0088  11.37  10.69  0.0719  0.0053  0.0024  0.0115  0.0363  0.0025  0.0149 
0.0550  28.35  1.030  0.0214  0.0162  0.0072  0.0044  1.010  0.0005  0.0041 
0.0585  0.1527  0.1411  2.555  0.0377  0.9680  0.0165  0.0041  0.0063  0.0052 
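The sensitivity to small imperfections noted above is easy to reproduce: two Lorenz trajectories started a tiny distance apart separate by several orders of magnitude within a few time units. A minimal sketch, assuming the canonical chaotic parameter values sigma = 10, rho = 28, beta = 8/3 (the paper's exact values are elided above):

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Canonical chaotic parameter values (an assumption for illustration).
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def separation(x0, x0_perturbed, n_steps, dt=0.01):
    """Euclidean distance between two trajectories after n_steps."""
    a, b = x0, x0_perturbed
    for _ in range(n_steps):
        a = rk4_step(lorenz, a, dt)
        b = rk4_step(lorenz, b, dt)
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
```

This exponential divergence is precisely why pointwise trajectory accuracy is a poor success metric for chaotic systems, and why calibrated uncertainty bands over the attractor are the more meaningful target.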
3.5 Bayesian calibration of a yeast glycolysis model
In this final example, our goal is to investigate the performance of the proposed algorithms on a realistic problem in systems biology. To this end, we consider a yeast glycolysis process that can be described by a 7-dimensional dynamical system [ruoff2003temperature, yazdani2019systems] of the form
(29)  
where the terms and on the right hand side are defined as
(30)  
Here we will assume that this model form is known from existing domain knowledge, and our goal is to (i) compute the posterior distribution over the unknown model parameters from a small set of noisy and irregularly sampled observations, and (ii) test the ability of the inferred model to generalize accurately to initial conditions that were not observed during model training.
A training dataset is generated by randomly subsampling irregularly distributed observations from a single simulated trajectory of the system, starting from an initial condition: in the time interval , and assuming a ground truth set of parameters obtained from the experimental data provided in [ruoff2003temperature]: mM/min, mM/min, mM/min, mM/min, mM/min, mM/min, mM/min, mM/min, mM/min, , mM, nM, mM and . Again, we consider two cases corresponding to noise-free and noisy training data, with the latter perturbed by noise proportional to its standard deviation.
Table 8 summarizes the inferred MAP estimators for the unknown model parameters. We observe that, in both the noise-free and noisy cases, all inferred parameters closely agree with the ground truth values used to generate the training data, as reported in [ruoff2003temperature]. Uncertainty estimates for the inferred parameters can also be deduced from the computed posterior distribution , as presented in the box plots of figure 17, where the minimum, maximum, median, and first and third quartiles obtained from the HMC simulations for each parameter are shown. Finally, notice that all true values fall within the predicted quartile ranges, while the MAP estimators of all parameters have considerably small relative errors compared with the true parameter values, as shown in table 8.
Noise free  2.53  101.88  5.95  16.25  101.57  1.30  12.19  1.83  13.23  4.00  0.52  0.01  1.02  4.00
Noisy  2.53  100.04  5.67  16.14  100.50  1.27  12.25  1.82  13.25  3.99  0.52  0.01  1.06  4.00 
The inferred model also allows us to produce forecasts of future states with quantified uncertainty via the predictive posterior distribution of equation 12. These extrapolated states are shown in figures 18 and 19 for the noise-free and noisy data cases, respectively. It is evident that the uncertainty estimates obtained by sampling from the joint posterior distribution over all model parameters capture the reference trajectory of this high-dimensional dynamical system well. As expected, the uncertainty is larger for noisy data than for the noise-free case. This can be explained by the higher precision attained with noise-free data, which is reflected in the spread of the samples drawn from the predictive distribution.
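The mechanics of such forecasting can be sketched by pushing an ensemble of posterior parameter samples through the dynamics and summarizing the resulting trajectories. The one-dimensional linear system and function name below are illustrative stand-ins, not the paper's model:

```python
import math

def forecast_band(theta_samples, x0, t_max, dt=0.001):
    """Propagate each posterior parameter sample through the surrogate
    dynamics x' = -theta * x, and summarize the ensemble of forecasts by
    its mean and standard deviation at the final time."""
    finals = []
    for theta in theta_samples:
        x = x0
        for _ in range(int(round(t_max / dt))):
            x += dt * (-theta * x)   # forward Euler, for illustration only
        finals.append(x)
    mean = sum(finals) / len(finals)
    std = (sum((v - mean) ** 2 for v in finals) / len(finals)) ** 0.5
    return mean, std
```

The two-standard-deviations bands shown in the figures are obtained in exactly this spirit: parameter uncertainty in the posterior translates into spread among the propagated trajectories.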
Finally, to illustrate the generalization capability of the inferred model with respect to initial conditions different from those used during training, we assess the quality of the predicted states starting from a random initial condition uniformly sampled within (). Figure 20 shows the prediction obtained with the randomly chosen initial condition . The close agreement with the reference solution indicates that the inferred model generalizes well, both in terms of handling different initial conditions and in extrapolating to reliable future forecasts with quantified uncertainty.
4 Conclusions
We have presented a novel machine learning framework for robust systems identification under uncertainty. The proposed framework leverages state-of-the-art differential programming techniques in combination with gradient-enhanced sampling schemes for scalable Bayesian inference of high-dimensional posterior distributions, to infer interpretable and parsimonious representations of complex dynamical systems from imperfect (e.g. sparse, noisy, irregularly sampled) data. The developed methods are general, as they can seamlessly combine dictionary learning, domain knowledge and black-box approximations, all in a computationally efficient workflow with end-to-end uncertainty quantification. The effectiveness of the proposed techniques has been systematically investigated and compared to state-of-the-art approaches across a collection of prototype problems.
Although the proposed Bayesian differential programming framework provides great flexibility to infer a distribution over plausible parsimonious representations of a dynamical system, a number of technical issues need to be further investigated. The first relates to devising more effective initialization procedures for Markov Chain Monte Carlo sampling. Here we have partially addressed this via the MAP preconditioning algorithm discussed in section 2.5; however, since the form of the dynamical system is unknown during the preconditioning process, the intermediate parameter estimates may cause the system to become stiff. Moreover, in cases where only very sparse observations are available, the model needs to be integrated with a large time-step, and stiffness of the system can lead to numerical instabilities during model training. A possible enhancement in this direction is to use more general stiffly stable ODE solvers, as discussed in [rackauckas2020universal, gholami2019anode], or more sophisticated time-step annealing strategies [wang2020understanding]. Another potential direction for future work is identifying the uncertainty in the model's parameters under partial observations, i.e. when some variables of the system are not accessible. Such a task would involve physics-informed regularization on the unknown latent dynamics of the system; approaches used in [yazdani2019systems] could be helpful here. A third open question is how to adapt the proposed method to stochastic dynamical systems, where the dynamics itself may be driven by a stochastic process; approaches discussed in [li2020scalable] could be useful. Finally, the proposed Bayesian differential programming framework can be extended to parameter identification for partial differential equations (PDEs), which generally translate into high-dimensional dynamical systems after discretization. The learning task in this context could be carried out not only for the PDEs' parameters, but also for the discretization scheme itself
[rackauckas2020universal].
Acknowledgements
This work received support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DESC0019116), the Defense Advanced Research Projects Agency under the Physics of Artificial Intelligence program (grant HR00111890034), and the Air Force Office of Scientific Research (grant XXXX).