npode
Learning unknown ODE models with Gaussian processes
view repo
In conventional ODE modelling coefficients of an equation driving the system state forward in time are estimated. However, for many complex systems it is practically impossible to determine the equations or interactions governing the underlying dynamics. In these settings, parametric ODE model cannot be formulated. Here, we overcome this issue by introducing a novel paradigm of nonparametric ODE modelling that can learn the underlying dynamics of arbitrary continuous-time systems without prior knowledge. We propose to learn non-linear, unknown differential functions from state observations using Gaussian process vector fields within the exact ODE formalism. We demonstrate the model's capabilities to infer dynamics from sparse data and to simulate the system forward into future.
READ FULL TEXT VIEW PDFLearning unknown ODE models with Gaussian processes
Dynamical systems modeling is a cornerstone of experimental sciences. In biology, as well as in physics and chemistry, modelers attempt to capture the dynamical behavior of a given system or a phenomenon in order to improve its understanding and make predictions about its future state. Systems of coupled ordinary differential equations (ODEs) are undoubtedly the most widely used models in science. Even simple ODE functions can describe complex dynamical behaviours
(Hirsch et al., 2004). Typically, the dynamics are firmly grounded in physics with only a few parameters to be estimated from data. However, equally ubiquitous are the cases where the governing dynamics are partially or completely unknown.We consider the dynamics of a system governed by multivariate ordinary differential functions:
(1) |
where is the state vector of a -dimensional dynamical system at time , and the is the first order time derivative of that drives the state forward, and where is the vector-valued derivative function. The ODE solution is determined by
(2) |
where we integrate the system state from an initial state for time forward. We assume that is completely unknown and we only observe one or several multivariate time series obtained from an additive noisy observation model at observation time points ,
(3) |
where
follows a stationary zero-mean multivariate Gaussian distribution with diagonal noise variances
. The observation time points do not need to be equally spaced. Our task is to learn the differential function given observations , with no prior knowledge of the ODE system.There is a vast literature on conventional ODEs (Butcher, 2016) where a parametric form for function is assumed to be known, and its parameters are subsequently optimised with least squares or Bayesian approach, where the expensive forward solution is required to evaluate the system responses from parameters against observations . To overcome the computationally intensive forward solution, a family of methods denoted as gradient matching (Varah, 1982; Ellner et al., 2002; Ramsay et al., 2007) have proposed to replace the forward solution by matching to empirical gradients of the data instead, which do not require the costly integration step. Recently several authors have proposed embedding a parametric differential function within a Bayesian or Gaussian process (GP) framework (Graepel, 2003; Calderhead et al., 2008; Dondelinger et al., 2013; Wang and Barber, 2014; Macdonald, 2017) (see Macdonald et al. (2015) for a review). GPs have been successfully applied to model linear differential equations as they are analytically tractable (Gao et al., 2008; Raissi et al., 2017).
However, conventional ODE modeling can only proceed if a parametric form of the driving function is known. Recently, initial work to handle unknown or non-parametric ODE models have been proposed, although with various limiting approximations. Early works include spline-based smoothing and additive functions to infer gene regulatory networks (De Hoon et al., 2002; Henderson and Michailidis, 2014). Äijö and Lähdesmäki (2009) proposed estimating the unknown nonlinear function with GPs using either finite time differences, or analytically solving the derivative function as a function of only time, (Äijö et al., 2013). In a seminal technical report of Heinonen and d’Alche Buc (2014) a full vector-valued kernel model was proposed, however using a gradient matching approximation. To our knowledge, there exists no model that can learn non-linear ODE functions over the state against the true forward solutions .
In this work we propose npODE: the first ODE model for learning arbitrary, and a priori completely unknown non-parametric, non-linear differential functions from data in a Bayesian way. We do not use gradient matching or other approximative models, but instead propose to directly optimise the exact ODE system with the fully forward simulated responses against data. We parameterise our model as an augmented Gaussian process vector field with inducing points, while we propose sensitivity equations to efficiently compute the gradients of the system. Our model can forecast continuous-time systems arbitrary amounts to future, and we demonstrate the state-of-the-art performance in human motion datasets. The MATLAB implementation is publicly available at github.com/cagatayyildiz/npode.
The differential function to be learned defines a vector field^{1}^{1}1We use vector field and differential function interchangeably. , that is, an assignment of a gradient vector to every state . We model the vector field as a vector-valued Gaussian process (GP) (Rasmussen and Williams, 2006)
(4) |
which defines a priori distribution over function values whose mean and covariances are
(5) | ||||
(6) |
and where the kernel is matrix-valued. A GP prior defines that for any collection of states , the function values
follow a matrix-valued normal distribution,
(7) |
where is a block matrix of matrix-valued kernels . The key property of Gaussian processes is that they encode functions where similar states induce similar differentials , and where the state similarity is defined by the kernel .
In standard GP regression we would obtain posterior of the vector field by conditioning the GP prior with the data (Rasmussen and Williams, 2006). In ODE models the conditional of a vector field is intractable due to the integral mapping (2) between observed states and differentials . Instead, we resort to augmenting the Gaussian process with a set of inducing points and , such that (Quiñonero-Candela and Rasmussen, 2005). We choose to interpolate the differential function between the inducing points as (See Figure 1)
(8) |
which supports the function with inducing locations , inducing vectors , and are the kernel parameters. The function above corresponds to a vector-valued kernel function (Alvarez et al., 2012), or to a multi-task Gaussian process conditional mean without the variance term (Rasmussen and Williams, 2006). This definition is then compatible with the deterministic nature of the ODE formalism. Due to universality of several kernels and kernel functions (Shawe-Taylor and Cristianini, 2004), we can represent arbitrary vector fields with appropriate inducing point and kernel choices.
The vector-valued kernel function (8) uses operator-valued kernels, which result in matrix-valued kernels for real valued states , while the kernel matrix over data points becomes (See Alvarez et al. (2012) for a review). Most straightforward operator-valued kernel is the identity decomposable kernel , where the scalar Gaussian kernel
(9) |
with differential variance and dimension-specific lengthscales is expanded into a diagonal matrix of size . We collect the kernel parameters as .
We note that more complex kernels can also be considered given prior information of the underlying system characteristics. The divergence-free matrix-valued kernel induces vector fields that have zero divergence (Wahlström et al., 2013; Solin et al., 2015). Intuitively, these vector fields do not have sinks or sources, and every state always finally returns to itself after sufficient amount of time. Similarly, curl-free kernels induce curl-free vector fields that can contain sources or sinks, that is, trajectories can accelerate or decelerate. For theoretical treatment of vector field kernels, see (Narcowich and Ward, 1994; Bhatia et al., 2013; Fuselier and Wright, 2017). Non-stationary vector fields can be modeled with input-dependent lengthscales (Heinonen et al., 2016), while spectral kernels can represent stationary (Wilson et al., 2013) or non-stationary (Remes et al., 2017) recurring patterns in the differential function.
We assume a Gaussian likelihood over the observations and the corresponding simulated responses of Equation (2),
(10) |
where are forward simulated responses using the integral equation (2) and differential equation (8), and collects the dimension-specific noise variances.
The inducing vectors have a Gaussian process prior
(11) |
The model posterior is then
(12) |
where we have for brevity omitted the dependency on the locations of the inducing points
and also the parameter hyperpriors
and since we assume them to be uniform, unless there is specific domain knowledge of the priors.The model parameters are the initial state ^{2}^{2}2In case of multiple time-series, we will use one initial state for each time-series., the inducing vectors
, the noise standard deviations
, and the kernel hyperparameters
.We apply a latent parameterisation using Cholesky decomposition , which maps the inducing vectors to whitened domain (Kuss and Rasmussen, 2005)
(13) |
The latent variables are projected on the kernel manifold to obtain the inducing vectors . This non-centered parameterisation (NCP) transforms the hierarchical posterior of Equation (12) into a reparameterised form
(14) |
where all variables to be optimised are decoupled, with the latent inducing vectors having a standard normal prior . Optimizing and is now more efficient since they have independent contributions to the vector field via . The gradients of the whitened posterior can be retrieved analytically as (Heinonen et al., 2016)
(15) |
Finally, we find a MAP estimate for the initial state , latent vector field , kernel parameters and noises by gradient ascent,
(16) |
while keeping the inducing locations fixed on a sufficiently dense grid (See Figure 1). The partial derivatives of the posterior with respect to noise parameters can be found analytically, while the derivatives with respect to are approximated with finite differences. We select the optimal lengthscales by cross-validation.
The key term to carry out the MAP gradient ascent optimization is the likelihood
that requires forward integration and computing the partial derivatives with respect to the whitened inducing vectors . Given Equation (15) we only need to compute the gradients with respect to the inducing vectors ,
(17) |
This requires computing the derivatives of the simulated system response against the vector field parameters ,
(18) |
which we denote by , and expand the notation to make the dependency of on explicit. Approximating these with finite differences is possible in principle, but is highly inefficient and has been reported to cause unstability (Raue et al., 2013). We instead turn to sensitivity equations for and that provide computationally efficient, analytical gradients (Kokotovic and Heller, 1967; Fröhlich et al., 2017).
The solution for can be derived by differentiating the full nonparametric ODE system with respect to by
(19) |
The sensitivity equation for the given system can be obtained by changing the order of differentiation on the left hand side and carrying out the differentiation on the right hand side. The resulting sensitivity equation can then be expressed in the form
(20) |
where , (See Appendix for detailed specification). For our nonparametric ODE system the sensitivity equation is fully determined by
(21) | ||||
(22) |
The sensitivity equation provides us with an additional ODE system which describes the time evolution of the derivatives with respect to the inducing vectors . The sensitivities are coupled with the actual ODE system and, thus, both systems and are concatenated as the new augmented state that is solved jointly by Equation (2) driven by the differentials and (Leis and Kramer, 1988). The initial sensitivities are computed as . In our implementation, we merge with for sensitivity analysis to obtain the partial derivatives with respect to the initial state which is estimated along with the other parameters. We use the cvodes solver from the Sundials package (Hindmarsh et al., 2005) to solve the nonparametric ODE models and the corresponding gradients numerically. The sensitivity equation based approach is superior to the finite differences approximation because we have exact formulation for the gradients of state over inducing points, which can be solved up to the numerical accuracy of the ODE solver.
As first illustration of the proposed nonparametric ODE method we consider three simulated differential systems: the Van der Pol (VDP), FitzHugh-Nagumo (FHN) and Lotka-Volterra (LV) oscillators of form
In the conventional ODE case the coefficients of these equations can be inferred using standard statistical techniques if sufficient amount of time series data is available (Girolami, 2008; Raue et al., 2013). Our main goal is to infer unknown dynamics, that is, when these equations are unavailable and we instead represent the dynamics with a nonparametric vector field of Equation (8). We use these simulated models to only illustrate our model behavior against the true dynamics.
We employ data points from one cycle of noisy observation data from VDP and FHN models, and data points from cycles from the LV model with noise variance of . We learn the npODE model with these training data using inducing locations on a fixed grid, and forecast between 4 and 8 future cycles starting from true initial state at time . Figure 2 (bottom) shows the training datasets (grey regions), initial states, true trajectories (black lines) and the forecasted trajectory likelihoods (colored regions). The model accurately learns the dynamics from less than two cycles of data and can reproduce them reliably into future.
Figure 2 (top) shows the corresponding true vector field (black arrows) and the estimated vector field (grey arrows). The vector field is a continuous function, which is plotted on a 8x8 grid for visualisation. In general the most difficult part of the system is learning the middle of the loop (as seen in the FHN model), and learning the most outermost regions (bottom left in the LV model). The model learns the underlying differential accurately close to observed points, while making only few errors in the border regions with no data.
Model | Forecasting | Imputation |
---|---|---|
npODE | ||
GPDM | ||
VGPLVM |
Next, we illustrate how the model estimates realistic, unknown dynamics from noisy observations . As in Section 4, we make no assumptions on the structure or form of the underlying system, and capture the underlying dynamics with the nonparameteric system alone. We employ no subjective priors, and assume no inputs, controls or other sources of information. The task is to infer the underlying dynamics , and interpolate or extrapolate the state trajectory outside the observed data.
We use a benchmark dataset of human motion capture data from the Carnegie Mellon University motion capture (CMU mocap) database. Our dataset contains -dimensional pose measurements from humans walking, where each pose dimension records a measurement in different parts of the body during movement (Wang et al., 2008). We apply the preprocessing of Wang et al. (2008) by downsampling the datasets by a factor of four and centering the data. This resulted in a total of 4303 datapoints spread across trajectories with on average frames per trajectory. In order to tackle the problem of dimensionality, we project the original dataset with PCA to a three dimensional latent space where the system is specified, following Damianou et al. (2011) and Wang et al. (2006). We place inducing vectors on a fixed grid, and optimize our model from 100 initial values, which we select by projecting empirical differences to the inducing vectors. We use an LBFGS optimizer in Matlab. The whole inference takes approximately few minutes per trajectory.
We evaluate the method with two types of experiments: imputing missing values and forecasting future cycles. For the forecasting the first half of the trajectory is for model training, and the second half is to be forecasted. For imputation we remove roughly of the frames from the middle of the trajectory, which are to be filled by the models. We perform model selection for lengthscales with cross-validation split of 80/20. We record the root mean square error (RMSE) over test points in the original feature space in both cases, where we reconstruct the original dimensions from the latent space trajectories.
Due to the current lack of ODE methods suitable for this nonparametric inference task, we instead compare our method to the state-of-the-art state-space models where such problems have been previously considered (Wang et al., 2008). In a state-space or dynamical model a transition function moves the system forward in discrete steps. With sufficiently high sampling rate, such models can estimate and forecast finite approximations of smooth dynamics. In Gaussian process dynamical model (Wang et al., 2006; Frigola et al., 2014; Svensson et al., 2016) a GP transition function is inferred in a latent space, which can be inferred with a standard GPLVM (Lawrence, 2004) or with a dependent GPLVM (Zhao and Sun, 2016). In dynamical systems the transition function is replaced by a GP interpolation (Damianou et al., 2011). The discrete time state-space models emphasize inference of a low-dimensional manifold as an explanation of the high-dimensional measurement trajectories.
We compare our method to the dynamical model GPDM of Wang et al. (2006) and to the dynamical system VGPLVM of Damianou et al. (2011), where we directly apply the implementations provided by the authors at inverseprobability.com/vargplvm and dgp.toronto.edu/~jmwang/gpdm. Both methods optimize their latent spaces separately, and they are thus not directly comparable.
In the forecasting task we train all models with the first half of the trajectory, while forecasting the second half starting from the first frame. The models are trained and forecasted within a low-dimensional space, and subsequently projected back into the original space via inverting the PCA or with GPLVM mean predictions. As all methods optimize their latent spaces separately, they are not directly comparable. Thus, the mean errors are computed in the original high-dimensional space. Note that the low-dimensional representation necessarily causes some reconstruction errors.
Figure 3 illustrates the models on one of the trajectories 35_12.amc. The top part (a) shows the training data in the PCA space for npODE, and optimized training data representation for GPDM and VGPLVM (black points). The colored lines (npODE) and points (GPDM, VGPLVM) indicate the future forecast. The bottom part (b) shows the first 9 reconstructed original pose dimensions reconstructed from the latent forecasted trajectories. The training data is shown in gray background, while test data is shown with circles.
The VGPLVM has most trouble forecasting future points, and reverts quickly after training data to a value close to zero, failing to predict future points. The GPDM model produces more realistic trajectories, but fails to predict any of the poses accurately. Finally, npODE can accurately predict five poses, and still retains adequate performance on remaining poses, except for pose 2.
Furthermore, Table 1 indicates that npODE is also best performing method on average over the whole dataset in the forecasting.
In the imputation task we remove approximately of the training data from the middle of the trajectory. The goal is to learn a model with the remaining data and forecast the missing values. Figure 4 highlights the performance of the three models on the trajectory 07_07.amc. The top part (a) shows the training data (black points) in the PCA space (npODE) or optimized training locations in the latent space (GPDM, VGPLVM). The middle part imputation is shown with colored points or lines. Interestingly both npODE and GPDM operate on cyclic representations, while VGPLVM is not cyclic.
The bottom panel (b) shows the first 9 reconstructed pose dimensions from the three models. The missing values are shown in circles, while training points are shown with black dots. All models can accurately reproduce the overall trends, while npODE seems to fit slightly worse than the other methods. The PCA projection causes the seemingly perfect fit of the npODE prediction (at the top) to lead to slightly warped reconstructions (at the bottom). All methods mostly fit the missing parts as well. Table 1 shows that on average the npODE and VGPLVM have approximately equal top performance on the imputing missing values task.
We proposed the framework of nonparametric ODE model that can accurately learn arbitrary, nonlinear continuos-time dynamics from purely observational data without making assumptions of the underlying system dynamics. We demonstrated that the model excels at learning dynamics that can be forecasted into the future. We consider this work as the first in a line of studies of nonparametric ODE systems, and foresee several aspects as future work. Currently we do not handle non-stationary vector fields, that is time-dependent differentials . Furthermore, an interesting future avenue is the study of various vector field kernels, such as divergence-free, curl-free or spectral kernels (Remes et al., 2017). Finally, including inputs or controls to the system would allow precise modelling in interactive settings, such as robotics.
The proposed nonparametric ODE model operates along a continuous-time trajectory, while dynamic models such as hidden Markov models or state-space models are restricted to discrete time steps. These models are unable to consider system state at arbitrary times, for instance, between two successive timepoints.
Conventional ODE models have also been considered from the stochastic perspective with stochastic differential equation (SDE) models that commonly model the system drift and diffusion processes separately leading to a distribution of trajectories . As future work we will consider stochastic extensions of our nonparametric ODE model, as well as MCMC sampling of the inducing point posterior , leading to trajectory distribution as well.
The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. This work has been supported by the Academy of Finland Center of Excellence in Systems Immunology and Physiology, the Academy of Finland grants no. 260403, 299915, 275537, 311584.
Foundations and Trends in Machine Learning
, 2012.Accelerating bayesian inference over nonlinear differential equations with gaussian processes.
NIPS, 2008.A radial basis function method for computing helmholtz–hodge decompositions.
IMA Journal of Numerical Analysis, 2017.Solving noisy linear operator equations by gaussian processes: Application to ordinary and partial differential equations.
ICML, 2003.Gaussian process latent variable models for visualisation of high dimensional data.
In Advances in neural information processing systems, pages 329–336, 2004.In the main text, the sensitivity equation is formulated using matrix notation
(23) |
Here, the time-dependent matrices are obtained by differentiating the vector valued functions with respect to vectors i.e.
(24) | ||||
(25) | ||||
(26) |
Below is the explicit form of the log posterior. Note that we introduce and for notational simplicity.
(27) | ||||
(28) | ||||
(29) | ||||
(30) |
Our goal is to compute the gradients with respect to the initial state , latent vector field , kernel parameters and noise variables . As explained in the paper, we compute the gradient of the posterior with respect to inducing vectors and project them to the white domain thanks to noncentral parameterisation. The analytical forms of the partial derivatives are as follows:
(31) | ||||
(32) | ||||
(33) |
Seemingly hard to compute terms, and , are computed using sensitivities. The lengthscale parameter is considered as a model complexity parameter and is chosen from a grid using cross-validation. We furthermore need the gradient with respect to the other kernel variable, i.e., the signal variance . Because and are the functions of kernel, computing the gradients with respect to is not trivial and we make use of finite differences:
(34) |
We use to compute the finite differences.
One problem of using gradient-based optimization techniques is that they do not ensure the positivity of the parameters being optimized. Therefore, we perform the optimization of the noise standard deviations and signal variance with respect to their logarithms:
(35) |
where .
We initialise the inducing vectors by computing the empirical gradients , and conditioning as
(36) |
where we optimize the scale against the posterior. The whitened inducing vector is obtained as . This procedure produces initial vector fields that partially match the trajectory already. We then do 100 restarts of the optimization from random perturbations .
We use LBFGS gradient optimization routine in Matlab. We initialise the inducing vector locations on a equidistant fixed grid on a box containing the observed points. We select the lengthscales using cross-validation from values . In general large lengthscales induce smoother models, while lower lengthscales cause overfitting.
Comments
There are no comments yet.