The goal of system identification is to learn the underlying physical dynamics behind observed time-series data. Gaussian process state-space models (GPSSMs) have been widely studied as nonparametric, probabilistic dynamics models; GPs are not only capable of representing nonlinear dynamics but also of estimating the uncertainty of predictions and avoiding over-fitting. Traditional GPSSMs, however, are based on a Gaussian transition model and thus often have difficulty describing multi-modal motions. To resolve this challenge, this thesis proposes a model using multiple GPs and extends the GPSSM to an information-theoretic framework by introducing a mutual information regularizer that helps the model learn an interpretable and disentangled representation of the multi-modal transition dynamics. Experimental results show that the proposed model not only successfully represents the observed system but also distinguishes the dynamics mode that governs a given observation sequence.
The state-space model (SSM) is one of the most general representation models and has been used in a wide range of fields (e.g., aerospace engineering, robotics, economics, and biomedical engineering) for time-series analysis. The key idea of the state-space model is to construct a latent state space and a dynamics model on it that represent the sequentially observed data. In traditional research and applications, linear Gaussian state-space models are commonly used to solve state estimation (i.e., inference) or system identification (i.e., model learning) problems based on Kalman filtering (KF) algorithms. Over the last few decades, a large number of studies have tried to extend SSMs to the nonlinear case, for example via the particle filter, expectation maximization (EM), the unscented KF, and the dual extended Kalman filter. However, most traditional SSM approaches assume a parametric latent model, and therefore can only be used when we have substantial prior knowledge about the dynamics.
To resolve this challenge, nonparametric nonlinear SSMs have recently been receiving attention. These approaches design the dynamics model and approximate inference structure with neural networks, and can successfully represent complex sequential data [7, 8, 9, 10]. Probabilistic dynamics models are known to be more suitable for control problems because they can fully quantify model uncertainty and enable safe and unbiased model learning. As such, probabilistic SSM approaches based on the Gaussian process (GP), so-called Gaussian process state-space models (GPSSMs), have been widely used for system identification problems [13, 14, 15]. Although GPSSMs are powerful representation models, they are still not suitable for expressing multi-modal dynamics without additional probabilistic structure. For example, if we want to learn the dynamics of aircraft motion, a single GP does not seem able to fully represent multiple behaviors (constant velocity, level turn, climb, descent, etc.) at once. In such cases, it is more desirable to assume that the latent dynamics model is composed of multiple processes. However, a GPSSM using multiple GPs has not been reported in the literature. The first key contribution of this paper is the extension of the GPSSM framework to multi-modal state-space models.
Meanwhile, GPSSMs belong to a class of unsupervised learning problems that is possibly ill-posed. The belief behind many unsupervised learning algorithms is that they will automatically be trained to represent the data as a human observer conceives it; we merely hope the learned model is understandable. Without any regularization, however, there is no guarantee that the model will learn a disentangled representation, because nonparametric models often possess very strong representation power. In a similar sense, it is far from trivial to train the GP dynamics to be distinguishable (i.e., so that each GP learns one mode of the multi-modal dynamics). To solve this issue, we propose an interpretable SSM structure, namely InfoSSM, by introducing a mutual information regularization, similar to InfoGAN, between the observation and the latent dynamics; this is the second key contribution of this paper.
Note that there has been earlier research that utilized multiple GPs, called GP experts, to improve regression performance [17, 18, 19]. Those approaches are based on the idea of dividing the input space into subsets and assigning a GP to each subset. However, such approaches cannot be used directly in the GPSSM framework, since the input space (i.e., the latent state space) is unknown in unsupervised learning tasks. This paper therefore proposes a novel approach that assigns data to multiple GPs by dividing the output space, not the input space, in an explainable way.
The rest of the paper is organized as follows. In Section 2, we begin with a brief summary of background on GPs and GPSSMs. Section 3 provides details of our algorithm; we show the mathematical formulation of InfoSSM and the inference structures used to approximate the analytically intractable marginal likelihood and mutual information terms. Finally, in Section 4, we analyze the performance of the proposed model with aircraft system identification experiments.
In the following derivations, subscript $1{:}T$ denotes a data sequence and subscript $t$ denotes data at time step $t$; for instance, $\mathbf{x}_{1:T} = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$. Scalar, vector, and matrix values are represented using lowercase italic, lowercase bold italic, and capital bold italic, respectively. $\mathcal{MN}(\mathbf{X}; \mathbf{M}, \mathbf{U}, \mathbf{V})$ represents the probability density of a matrix $\mathbf{X}$ under the matrix normal distribution with mean matrix $\mathbf{M}$ and two covariance matrices $\mathbf{U}$ and $\mathbf{V}$, i.e., $\mathrm{vec}(\mathbf{X}) \sim \mathcal{N}(\mathrm{vec}(\mathbf{M}), \mathbf{V} \otimes \mathbf{U})$, where $\mathcal{N}$, $\mathrm{vec}(\cdot)$, and $\otimes$ denote the normal distribution, the vectorization of a matrix, and the Kronecker product, respectively. $p(x \mid y; \theta)$ denotes the probability of a random variable $x$ conditioned on a random variable $y$ and a parameter $\theta$.
The GP is a flexible and powerful nonparametric Bayesian model for approximating a distribution over functions $f$. Suppose we are given a data set of input and output pairs, $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$. The problem the GP seeks to solve is to learn the function that maps the input space to the output space of the data. (In this paper, we particularly consider the matrix-variate GP, which further considers the covariance among the multiple outputs.) The GP assumes function outputs are jointly Gaussian:

$$p(\mathbf{F}) = \mathcal{MN}(\mathbf{F}; \mathbf{M}_X, \mathbf{K}_{XX}, \mathbf{V}),$$

where $\mathbf{M}_X$ is the mean function evaluated at the inputs, $\mathbf{K}_{XX}$ is the GP kernel matrix, and $\mathbf{V}$ is the covariance matrix among the multiple outputs. An affine mean function and the squared exponential (SE) automatic relevance determination (ARD) kernel are commonly used for the mean and for the covariance between $\mathbf{x}$ and $\mathbf{x}'$:

$$m(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}, \qquad k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d}\frac{(x_d - x'_d)^2}{l_d^2}\right),$$

where subscript $d$ denotes the $d$-th dimension of the variable, while $\mathbf{A}$, $\mathbf{b}$, $\sigma_f$, and $l_d$ are hyperparameters of the mean and covariance functions to learn. Given input points $\mathbf{X}$ and function outputs $\mathbf{F}$, the probability distribution of the function value $\mathbf{f}_*$ at a new input location $\mathbf{x}_*$ can be predicted as a normal distribution:

$$p(\mathbf{f}_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{F}) = \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*),$$

with mean $\boldsymbol{\mu}_* = m(\mathbf{x}_*) + \mathbf{K}_{*X}\mathbf{K}_{XX}^{-1}(\mathbf{F} - \mathbf{M}_X)$ and covariance $\boldsymbol{\Sigma}_* = \big(k_{**} - \mathbf{K}_{*X}\mathbf{K}_{XX}^{-1}\mathbf{K}_{X*}\big) \otimes \mathbf{V}$.
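The predictive equations above can be sketched numerically. The following is a minimal NumPy example assuming a zero mean function, a single output dimension, and an SE-ARD kernel; all function and variable names are illustrative, not part of the paper's implementation:

```python
import numpy as np

def se_ard_kernel(X1, X2, sigma_f=1.0, lengthscales=None):
    """SE-ARD kernel: sigma_f^2 * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    diff = X1[:, None, :] - X2[None, :, :]              # (N1, N2, D)
    sq = np.sum((diff / lengthscales) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq)

def gp_predict(X, y, X_star, noise_var=1e-4):
    """Standard GP posterior mean/covariance at new inputs X_star."""
    K = se_ard_kernel(X, X) + noise_var * np.eye(len(X))
    K_star = se_ard_kernel(X_star, X)                   # (N*, N)
    K_ss = se_ard_kernel(X_star, X_star)
    alpha = np.linalg.solve(K, y)
    mean = K_star @ alpha
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

# Fit near-noiseless samples of f(x) = sin(x); at a training input the
# posterior mean should reproduce the target and the variance should shrink.
X = np.linspace(0, 2 * np.pi, 20)[:, None]
y = np.sin(X[:, 0])
mean, cov = gp_predict(X, y, X[:1])
```

The $\mathcal{O}(N^3)$ cost of the `np.linalg.solve` call on the $N \times N$ kernel matrix is exactly the bottleneck that motivates the sparse approximation discussed next.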
However, GPs are known to suffer from extensive computational cost, $\mathcal{O}(N^3)$, due to the inversion of the covariance matrix $\mathbf{K}_{XX}$. To cope with this problem, sparse approximation is commonly used [20, 21], which introduces inducing inputs $\mathbf{Z}$ and outputs $\mathbf{U}$. The prior distribution of the sparse GP is given by

$$p(\mathbf{U}) = \mathcal{MN}(\mathbf{U}; \mathbf{M}_Z, \mathbf{K}_{ZZ}, \mathbf{V}).$$

Assuming the inducing variables can sufficiently represent the distribution of the original GP function, the true GP prediction is approximated as

$$p(\mathbf{f}_* \mid \mathbf{x}_*, \mathbf{Z}, \mathbf{U}) = \mathcal{N}\big(m(\mathbf{x}_*) + \mathbf{K}_{*Z}\mathbf{K}_{ZZ}^{-1}(\mathbf{U} - \mathbf{M}_Z),\ \big(k_{**} - \mathbf{K}_{*Z}\mathbf{K}_{ZZ}^{-1}\mathbf{K}_{Z*}\big) \otimes \mathbf{V}\big).$$

Note that the inducing outputs $\mathbf{U}$ are random variables to infer, whereas the inducing inputs $\mathbf{Z}$ are treated as parameters to learn. The resultant computational complexity is reduced to $\mathcal{O}(NM^2)$ for $M$ inducing points. By an abuse of notation, we drop $\mathbf{Z}$ and $\mathbf{U}$ from the GP prediction term in the following derivations.
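As a sketch of how prediction routes through the inducing points alone, the NumPy fragment below uses a zero mean function, a fixed RBF kernel, and a point estimate for the inducing outputs (in the paper the inducing outputs are random variables; treating them as a fixed mean here is a simplifying assumption, and all names are illustrative):

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def sparse_gp_predict(Z, U_mean, X_star, jitter=1e-6):
    """Predict through M inducing points only: O(N M^2) instead of O(N^3).

    Z: (M, D) inducing inputs (learned parameters),
    U_mean: (M,) assumed mean of the inducing outputs.
    """
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Ksz = rbf(X_star, Z)
    mean = Ksz @ np.linalg.solve(Kzz, U_mean)
    cov = rbf(X_star, X_star) - Ksz @ np.linalg.solve(Kzz, Ksz.T)
    return mean, cov

# At an inducing input the prediction collapses onto the inducing output.
Z = np.linspace(-2, 2, 5)[:, None]
U_mean = np.tanh(Z[:, 0])           # pretend these were inferred
mean, cov = sparse_gp_predict(Z, U_mean, Z[:1])
```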
The state-space model is a representation model describing a dynamic system, consisting of a latent Markovian state $\mathbf{x}_t$, control input $\mathbf{u}_t$, and observation output $\mathbf{y}_t$ at time $t$. For any given dynamic system, however, there are infinitely many state-space models that can represent the identical system. Thus we adopt one of the standardized models, the canonical state-space model with $\mathbf{x} = [\mathbf{p}^T, \mathbf{v}^T]^T$, which is given as

$$\dot{\mathbf{p}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = f(\mathbf{x}, \mathbf{u}) + \boldsymbol{\epsilon}_f, \qquad \mathbf{y} = g(\mathbf{x}) + \boldsymbol{\epsilon}_g,$$

where $f$ and $g$ are the transition and observation models, while $\boldsymbol{\epsilon}_f$ and $\boldsymbol{\epsilon}_g$ are the process and measurement noise; $\mathbf{p}$ and $\mathbf{v}$ signify the position and velocity vectors. In this paper, we focus on the problem where the transition model is completely unknown (i.e., nonparametric) whereas the observation model is fairly well known (i.e., parametric). (Without any loss of generality, it is also possible to learn both a nonparametric transition model and a nonparametric observation model in the GPSSM formulation. However, such a setting brings out the problem of severe non-identifiability between $f$ and $g$ [13, 15] and degrades the interpretability of the learned latent model.) In the discrete canonical state-space model with a predefined time interval $\Delta t$, the latent state at the next time step can be recursively approximated as $\mathbf{x}_{t+1} \approx \mathbf{x}_t + \dot{\mathbf{x}}_t \Delta t$. For notational simplicity, sequences are written with subscript ranges, e.g., $\mathbf{x}_{0:T}$ and $\mathbf{y}_{1:T}$. The key concept of the GPSSM is modeling the transition model of the system as a GP function:

$$f \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big).$$

Note that $\mathbf{x}_{t+1}$ is an affine transform of $f(\mathbf{x}_t)$, and the remaining term is Gaussian white noise caused by the process noise; thus both are Gaussian and tractable. As shown in the graphical model of the GPSSM in Fig. 1, the joint distribution of the GPSSM variables for $T$ time steps is factorized as

$$p(\mathbf{y}_{1:T}, \mathbf{x}_{0:T}, f, \mathbf{U}) = p(\mathbf{U})\, p(\mathbf{x}_0) \prod_{t=0}^{T-1} p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, f)\, p(f \mid \mathbf{x}_t, \mathbf{U}) \prod_{t=1}^{T} p(\mathbf{y}_t \mid \mathbf{x}_t).$$
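Ancestral sampling from this factorization can be sketched in a few lines. The example below is a minimal NumPy rollout of the canonical position-velocity state under the Euler discretization, with a hand-written `transition` standing in for the GP mean of the acceleration (the spring-damper dynamics and all noise scales are illustrative assumptions, not the learned model):

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.1

def transition(x):
    """Placeholder for the GP acceleration f(x); here a damped spring."""
    p, v = x[:2], x[2:]
    return -p - 0.5 * v

def rollout(x0, T, proc_std=0.01, meas_std=0.05):
    """Ancestral sampling: p_{t+1} = p_t + v_t*dt, v_{t+1} = v_t + a_t*dt,
    and observations y_t of the position corrupted by measurement noise."""
    xs, ys = [x0], []
    x = x0
    for _ in range(T):
        p, v = x[:2], x[2:]
        a = transition(x) + proc_std * rng.standard_normal(2)
        x = np.concatenate([p + v * dt, v + a * dt])   # canonical Euler update
        xs.append(x)
        ys.append(x[:2] + meas_std * rng.standard_normal(2))
    return np.array(xs), np.array(ys)

xs, ys = rollout(np.array([1.0, 0.0, 0.0, 0.0]), T=100)
```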
Previous GPSSM studies [13, 14, 15] mostly targeted the problem where the control input sequence is given. However, the control input is often unobservable, and the only data we can access are the observation outputs $\mathbf{y}_{1:T}$. For instance, when we want to learn the dynamics of a maneuvering aircraft, we receive the range, azimuth, and elevation history from the radar signal, but the corresponding control signal (i.e., thrust, control surfaces, etc.) is not given. It is very difficult to consider every possible control input sequence or to infer the unknown control input. Alternatively, inspired by the interacting multiple model (IMM) algorithm, we assume that the dynamics model can be approximated by a combination of a finite number of motion patterns (i.e., control input patterns):

$$\dot{\mathbf{v}} = f^{(k)}(\mathbf{x}) + \boldsymbol{\epsilon}^{(k)}, \qquad k = 1, \dots, K,$$

where $f^{(k)}$ and $\boldsymbol{\epsilon}^{(k)}$ are the transition model and process noise of the $k$-th motion pattern, respectively.
To express such multi-modal dynamics in the GPSSM structure, we use multiple GPs, each representing one motion pattern, rather than a single GP expressing the global dynamics at once. Mathematically, let us have $K$ GPs, $\{f^{(k)}\}_{k=1}^K$ with inducing variables $\{\mathbf{U}^{(k)}\}_{k=1}^K$, and let $c \in \{1, \dots, K\}$ denote the index of the GP assigned to an observation sequence. If the observation came from the $k$-th GP model, the transition model within the latent state trajectory is given by

$$p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, c = k) = \mathcal{N}\big(\mathbf{x}_{t+1};\ \boldsymbol{\mu}^{(k)}(\mathbf{x}_t),\ \boldsymbol{\Sigma}^{(k)}(\mathbf{x}_t)\big),$$

where $\boldsymbol{\mu}^{(k)}$ and $\boldsymbol{\Sigma}^{(k)}$ are the mean and variance functions of the $k$-th GP prediction. In this paper, we call $c$ the latent dynamics code of the data, in the sense that $c$ determines the mode of dynamics governing the latent states. The joint distribution of the InfoSSM variables is then given by

$$p(\mathbf{y}_{1:T}, \mathbf{x}_{0:T}, f^{(1:K)}, \mathbf{U}^{(1:K)}, c) = p(c)\, p(\mathbf{x}_0) \prod_{k=1}^{K} p(\mathbf{U}^{(k)}) \prod_{t=0}^{T-1} p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, f^{(c)})\, p(f^{(c)} \mid \mathbf{x}_t, \mathbf{U}^{(c)}) \prod_{t=1}^{T} p(\mathbf{y}_t \mid \mathbf{x}_t).$$

Unfortunately, however, $c$ is a hidden latent variable, which makes the problem difficult to solve; in other words, we also need to infer the dynamics code for each observation.
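The role of the code as a switch over $K$ transition functions can be sketched directly. In the NumPy fragment below, three hand-written turn-rate dynamics stand in for the $K$ GP mean functions (the `turn` construction, its rates, and the R/S/L labeling are illustrative assumptions):

```python
import numpy as np

dt = 0.1

def turn(rate):
    """Build a stand-in acceleration: rotate the velocity at a fixed rate."""
    def f(x):
        v = x[2:]
        return rate * np.array([-v[1], v[0]])
    return f

# One hypothetical dynamics mode per latent code: right, straight, left.
modes = {0: turn(-1.0), 1: turn(0.0), 2: turn(+1.0)}

def step(x, c):
    """One Euler transition step under latent dynamics code c."""
    p, v = x[:2], x[2:]
    a = modes[c](x)
    return np.concatenate([p + v * dt, v + a * dt])

x = np.array([0.0, 0.0, 1.0, 0.0])
for _ in range(10):
    x = step(x, c=1)      # the straight mode leaves the velocity unchanged
```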
The goal of InfoSSM is to optimize the GP hyperparameters and inducing variables to maximize the marginal likelihood $p(\mathbf{y}_{1:T})$, so as to learn the latent dynamics that best represents the data. Unlike a single-layer GP, however, the GPSSM is no longer a GP, as it stacks GPs hierarchically along time. This poses a challenge for estimating the marginal likelihood, which contains an intractable integration. Instead of directly optimizing the marginal likelihood, variational inference alternatively maximizes a lower bound of the objective, called the evidence lower bound (ELBO):

$$\log p(\mathbf{y}_{1:T}) \geq \mathbb{E}_{q}\left[\log \frac{p(\mathbf{y}_{1:T}, \mathbf{x}_{0:T}, f^{(1:K)}, \mathbf{U}^{(1:K)}, c)}{q(\mathbf{x}_{0:T}, f^{(1:K)}, \mathbf{U}^{(1:K)}, c)}\right].$$

This paper follows a doubly stochastic variational inference approach [23, 15], known to give superior performance to other variational approaches or EM for deep GP structures; it imposes neither independence assumptions between GP layers nor Gaussianity on the GP outputs. Following this approach, we factorize the variational distribution as

$$q(\mathbf{x}_{0:T}, f^{(1:K)}, \mathbf{U}^{(1:K)}, c) = q(\mathbf{x}_0)\, q(c) \prod_{k=1}^{K} q(\mathbf{U}^{(k)}) \prod_{t=0}^{T-1} p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, f^{(c)})\, p(f^{(c)} \mid \mathbf{x}_t, \mathbf{U}^{(c)}).$$

Remark that the gap between the log marginal likelihood and the lower bound decreases as the variational distribution gets closer to the true posterior; hence it is very important to select a proper inference structure.
Approximate inference of the posterior distribution over latent states consists of three parts: the variational distributions of the transition variables $\mathbf{U}^{(1:K)}$, the initial state $\mathbf{x}_0$, and the propagated states $\mathbf{x}_{1:T}$.

Primarily, $q(\mathbf{U}^{(1:K)})$ is factorized into $K$ terms $q(\mathbf{U}^{(k)})$, each parameterized by a matrix-variate normal distribution sharing the same multi-output covariance $\mathbf{V}$ with the GP prior:

$$q(\mathbf{U}^{(k)}) = \mathcal{MN}\big(\mathbf{U}^{(k)};\ \mathbf{M}^{(k)},\ \mathbf{S}^{(k)},\ \mathbf{V}\big),$$

where $\mathbf{M}^{(k)}$ and $\mathbf{S}^{(k)}$ are variational parameters to optimize.
Secondly, for the initial state distribution, one of the most naïve approaches is to parameterize $q(\mathbf{x}_0)$ separately per data sequence. Though this may give fairly exact inference results, it is not only computationally expensive but also impossible to generalize to unseen data. Alternatively, this paper uses amortized inference networks [26, 27] to approximate the posterior distribution. We choose a backward recurrent neural network (BRNN) designed to mimic the Kalman smoothing algorithm; the hidden state $\mathbf{h}_0$ of the BRNN at the initial time step is expected to contain the compressed information from $\mathbf{y}_{1:T}$. An additional shallow neural network outputs the mean and covariance of the initial state from $\mathbf{h}_0$:

$$q(\mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_0;\ \boldsymbol{\mu}_\phi(\mathbf{h}_0),\ \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{h}_0))\big),$$

where $\phi$ is the set of parameters of the BRNN and the neural network. To reduce computational complexity, a diagonal covariance is used. The inference structure is shown in Fig. 2.
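A minimal sketch of this amortized inference is given below, assuming a vanilla tanh recurrence and linear output heads (the paper does not specify the cell type or sizes; all dimensions and the randomly initialized weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
obs_dim, hid_dim, state_dim = 2, 8, 4

# Illustrative randomly initialized parameters (the phi of the text).
Wh = rng.normal(0, 0.1, (hid_dim, hid_dim))
Wy = rng.normal(0, 0.1, (hid_dim, obs_dim))
Wmu = rng.normal(0, 0.1, (state_dim, hid_dim))
Wsig = rng.normal(0, 0.1, (state_dim, hid_dim))

def infer_x0(ys):
    """Run an RNN backward over y_T..y_1; the final hidden state h_0
    parameterizes a diagonal Gaussian q(x_0)."""
    h = np.zeros(hid_dim)
    for y in ys[::-1]:                  # backward in time, as in smoothing
        h = np.tanh(Wh @ h + Wy @ y)
    mu = Wmu @ h
    sigma2 = np.exp(Wsig @ h)           # exp keeps the diagonal variances positive
    return mu, sigma2

ys = rng.normal(size=(20, obs_dim))
mu, sigma2 = infer_x0(ys)
```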
The probability distribution of the dynamics code $c$ can ideally be inferred from $p(c \mid \mathbf{y}_{1:T})$, although it is intractable. We expect that, by observing the shape of the observation trajectory, the model can distinguish which mode of dynamics it emerged from. Thus, we use a neural network that receives the observation trajectory as input and outputs the corresponding latent dynamics code:

$$q(c \mid \mathbf{y}_{1:T}) = \mathrm{Cat}\big(c;\ \boldsymbol{\pi}_\psi(\mathbf{y}_{1:T})\big),$$

where $\mathrm{Cat}$ is a categorical distribution with probability mass function $\pi_\psi^{(k)}(\mathbf{y}_{1:T})$ for $c = k$. A softmax activation is used for the last layer to mimic the distribution.
In the present work, we particularly focus on state-space models in Cartesian coordinates. Dynamics in Cartesian coordinates have two general characteristics: the physics are invariant to shifts and rotations in the horizontal plane. As such, we want the network to output the same dynamics code even if we input shifted and rotated trajectories. To embed these characteristics, we propose a shift- and rotation-invariant inference structure via a simple idea: we shift the initial position of the input trajectory to the origin and rotate the shifted trajectory by a random angle:

$$\tilde{\mathbf{y}}_t = \mathbf{R}(\theta)\,(\mathbf{y}_t - \mathbf{y}_1),$$

where $\mathbf{R}(\theta)$ is a random rotation matrix. This can also be interpreted as a data augmentation technique, which is widely used in classification tasks.
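The shift-and-rotate transformation above can be sketched in a few lines of NumPy; a rigid transform of this kind leaves the shape of the trajectory (and hence all pairwise distances) unchanged, which is exactly the invariance the code-inference network should respect:

```python
import numpy as np

rng = np.random.default_rng(2)

def shift_rotate(traj):
    """Shift the first point to the origin, then rotate by a random angle."""
    theta = rng.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return (traj - traj[0]) @ R.T

traj = np.cumsum(rng.normal(size=(30, 2)), axis=0)   # a random 2D trajectory
aug = shift_rotate(traj)
```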
The expectation terms can be computed by Monte Carlo sampling for the ELBO update. Meanwhile, instead of the traditional ELBO update, this paper adopts a Monte Carlo objective (MCO) update approach, which is known not only to increase the representation power of the variational distribution but also to give a tighter lower bound by using multiple samples in a more effective way [29, 30].
The model likelihood $p(\mathbf{y}_{1:T})$ can be estimated by the Monte Carlo estimator

$$\hat{p}(\mathbf{y}_{1:T}) = \frac{1}{S}\sum_{s=1}^{S} \frac{p(\mathbf{y}_{1:T}, \mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)})}, \qquad \mathbf{z}^{(s)} \sim q,$$

where $\mathbf{z}$ collects all latent variables. By Jensen's inequality, a lower bound of the marginal log likelihood is given by

$$\log p(\mathbf{y}_{1:T}) = \log \mathbb{E}\big[\hat{p}(\mathbf{y}_{1:T})\big] \geq \mathbb{E}\big[\log \hat{p}(\mathbf{y}_{1:T})\big].$$

Hence, the MCO with $S$ Monte Carlo samples is given by

$$\mathrm{MCO}_S = \mathbb{E}_{\mathbf{z}^{(1:S)} \sim q}\left[\log \frac{1}{S}\sum_{s=1}^{S} \frac{p(\mathbf{y}_{1:T}, \mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)})}\right].$$
During the computation, the reparameterization trick [26, 27, 31] is further used so that the learning signal from the MCO can be back-propagated into every inference network. As such, we can construct an end-to-end training procedure in a fully differentiable manner.

Remark that the MCO becomes equivalent to the ELBO when $S = 1$. In fact, the defined MCO is equivalent to the IWAE bound, so it gets monotonically tighter as the sample size increases. Hence, the MCO achieves a tighter lower bound than the traditional ELBO:

$$\log p(\mathbf{y}_{1:T}) \geq \mathrm{MCO}_S \geq \mathrm{MCO}_1 = \mathrm{ELBO}.$$
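The core computation of an IWAE-style bound is a log-mean-exp over log importance weights, which by Jensen's inequality is never smaller than the plain average (the single-sample ELBO estimate). A minimal NumPy sketch with synthetic log-weights:

```python
import numpy as np

def log_mean_exp(log_w):
    """Numerically stable log(mean(exp(log_w)))."""
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))

rng = np.random.default_rng(3)
# Synthetic log importance weights, standing in for log p(y, z)/q(z).
log_w = rng.normal(size=32)

mco = log_mean_exp(log_w)    # S-sample Monte Carlo objective (IWAE-style)
elbo = np.mean(log_w)        # averaging the S = 1 bound instead
```

By concavity of the logarithm, `mco >= elbo` holds for any set of weights, which mirrors the inequality chain in the text.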
Ideally, we hope each GP is trained in a meaningful way even without any supervision. For example, when we have a data set of trajectories from a maneuvering aircraft, we expect one GP to learn constant velocity, another to learn level turns, and the others to learn climb and descent motions. In the worst case, however, it is also possible that the inference network assigns every trajectory to the first GP alone; that first GP is then trained to represent every motion while the others learn useless motions. To avoid such circumstances, we introduce an additional information-theoretic objective that leads the latent dynamics code to satisfy two criteria. First, the code should contain useful information determining the shape of the trajectory. Second, trajectories generated from different GPs should be distinguishable. In other words, there should be a strong mutual dependence between the latent dynamics code and the corresponding trajectory.
Inspired by [16, 32, 33], this paper also adopts a mutual information regularization for the latent dynamics code. In the information-theoretic sense, the mutual information is a measure quantifying the mutual dependence between two random variables:

$$I(c;\ \tau) = H(c) - H(c \mid \tau),$$

where $H$ denotes entropy and $\tau$ the trajectory. Mutual information can also be interpreted as the amount of information contained in one random variable about the other. If two random variables are mutually independent, their mutual information is zero. Finally, InfoSSM is trained to maximize the mutual information along with the MCO:

$$\max\ \mathrm{MCO} + \lambda\, I\big(c;\ \tau(\mathbf{x}_0, c)\big),$$

where $\lambda$ is a tunable parameter adjusting the relative scale between the MCO and the regularization, and $\tau(\mathbf{x}_0, c)$ is a function sampling the observation trajectory for a given initial state and code by using (9) and (16). Note, however, that it is intractable to compute the mutual information term directly. Instead, we optimize a lower bound of the mutual information via the variational information maximization approach:

$$I(c;\ \tau) \geq \mathbb{E}_{c \sim p(c),\ \tau \sim \tau(\mathbf{x}_0, c)}\big[\log q(c \mid \tau)\big] + H(c).$$

See the cited works for detailed derivations. $H(c)$ is constant since the prior $p(c)$ is fixed, the same $q(c \mid \cdot)$ as in (17) is used, and the expectation term can be computed by Monte Carlo estimation. The detailed implementation is provided in Algorithm 1.
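The Monte Carlo estimate of this variational lower bound can be sketched as follows; the `q_c_given_traj` stand-in for the code-inference network is a hypothetical construction (we simply imagine the generated trajectory is informative enough that $q$ concentrates on the true code):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 3
p_c = np.full(K, 1.0 / K)            # fixed uniform prior over codes

def q_c_given_traj(c):
    """Stand-in for q(c | tau): a distribution concentrated on the code c
    that actually generated the (imagined) trajectory."""
    probs = np.full(K, 0.05)
    probs[c] = 0.9
    return probs

# Monte Carlo estimate of the bound E[log q(c|tau)] + H(c).
S = 1000
cs = rng.choice(K, size=S, p=p_c)
log_q = np.array([np.log(q_c_given_traj(c)[c]) for c in cs])
H_c = -np.sum(p_c * np.log(p_c))
mi_lower_bound = np.mean(log_q) + H_c
```

With this idealized $q$, the estimate approaches $\log 0.9 + \log 3 \approx 0.99$ nats, close to the maximum possible $H(c) = \log 3$; an uninformative $q$ (e.g., when one GP absorbs every trajectory) would drive the bound toward zero, which is exactly what the regularizer penalizes.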
We implement InfoSSM with TensorFlow, and the adaptive moment estimation (Adam) algorithm is used as the gradient-descent optimizer.
We evaluate the proposed algorithm with a Dubins vehicle experiment. The Dubins path is a commonly used approximate dynamics model in planning and control problems for wheeled vehicles and aircraft [37, 38, 39]. The dynamics of the 2D Dubins vehicle are given by

$$\dot{x} = V\cos\theta, \qquad \dot{y} = V\sin\theta, \qquad \dot{\theta} = u,$$

where $V$ is the speed of the vehicle, which is assumed constant. Depending on the control input $u$, the Dubins vehicle shows three motion primitives: right (R), straight (S), and left (L), corresponding to negative, zero, and positive turn rates, respectively. A sample trajectory of the Dubins vehicle is shown in Fig. 3. Fifty sub-trajectories are used for the training procedure.
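The three motion primitives follow directly from an Euler integration of the equations above; a minimal NumPy simulation (speed, time step, and turn rates are illustrative choices, not the values used in the experiments):

```python
import numpy as np

def dubins_step(state, u, V=1.0, dt=0.1):
    """One Euler step of x' = V cos(th), y' = V sin(th), th' = u."""
    x, y, th = state
    return np.array([x + V * np.cos(th) * dt,
                     y + V * np.sin(th) * dt,
                     th + u * dt])

def simulate(u, steps=50):
    s = np.array([0.0, 0.0, 0.0])
    traj = [s]
    for _ in range(steps):
        s = dubins_step(s, u)
        traj.append(s)
    return np.array(traj)

straight = simulate(0.0)     # S primitive: heading never changes
left = simulate(+0.5)        # L primitive: heading increases over time
```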
We construct our latent state space as $\mathbf{x} = [\mathbf{p}^T, \mathbf{v}^T]^T$, where $\mathbf{p} = [x, y]^T$ and $\mathbf{v} = [\dot{x}, \dot{y}]^T$, following the canonical state-space model.
As baselines, we compare InfoSSM with PRSSM, which uses a single GP, and with InfoSSM without the mutual information regularization ($\lambda = 0$), which we call unInfoSSM. We empirically found the chosen value of $\lambda$ to be one of the good settings. We evaluate the models on three aspects: interpretability, model accuracy, and long-term prediction performance. Every model is constructed with the same network and GP hyperparameter settings. For InfoSSM and unInfoSSM, three GPs are used ($K = 3$).
Most importantly, we analyze the effect of the mutual information regularization on model interpretability by comparing InfoSSM and unInfoSSM. To visualize the effect, several trajectories generated from different GPs (i.e., latent dynamics codes) at random initial states are plotted in Fig. 4. As shown in the results, we cannot see any interpretable meaning in the unInfoSSM trajectories. We empirically found that the model without mutual information regularization tends to use only the first GP; it treats the R and L motions as noise around the S motion, which is very inefficient. InfoSSM, in contrast, successfully learns a disentangled representation, so that each code distinguishes the L, S, and R motion patterns. Furthermore, as shown in Table 1, InfoSSM demonstrates the highest mutual information and is thus the most distinguishable.
To compare model accuracy (i.e., how well the model represents the observations), we analyze two factors: the lower bound of the log marginal likelihood and the reconstruction performance. First, Table 1 shows that InfoSSM achieves the highest lower bound. Note that unInfoSSM shows even worse performance than PRSSM, which uses only a single GP. This is because unInfoSSM fails to use multiple GPs efficiently, while the KL-divergence term in (23) grows proportionally with the number of GPs. Second, Table 2 and Fig. 5 illustrate the reconstruction performance. As expected, InfoSSM shows the smallest root mean square error (RMSE) and the largest log-likelihood.
| Case | RMSE | RMSE | RMSE | Log-likelihood | Log-likelihood | Log-likelihood |
| --- | --- | --- | --- | --- | --- | --- |
| Case 1. Right | 0.6767 | 2.8916 | 1.0623 | 50.0152 | -32.6587 | 31.6831 |
| Case 2. Straight | 0.4091 | 1.6435 | 2.0696 | 51.3238 | 30.6983 | 11.4390 |
| Case 3. Left | 0.7263 | 1.4932 | 1.5680 | 35.8378 | -10.6636 | -27.3243 |
Finally, we evaluate the long-term prediction performance of InfoSSM to see whether the learned dynamics matches the true dynamics well. From an initial state, we propagate the vehicle for 100 time steps. Using the first 20 steps of the trajectory, the model infers the code and latent state. The latent state is then propagated with the learned dynamics of the corresponding code and compared with the true trajectory. As shown in Fig. 6, InfoSSM gives highly accurate long-term predictions for each dynamics mode. Note, however, that the baseline models fail to predict the future state, and their uncertainty grows rapidly over time.
In this paper, we presented InfoSSM, an information-theoretic extension of the GPSSM. To describe multi-modal dynamics, we modeled the latent dynamics using multiple GPs assigned by a latent dynamics code. The inference of the latent state and dynamics code is performed via structured neural networks. Unlike previous GPSSM approaches, InfoSSM can learn a disentangled representation of the dynamics via mutual information regularization, without any human supervision. The proposed model was evaluated in a Dubins vehicle experiment and showed that InfoSSM effectively represents multi-modal latent dynamics. In the future, we will extend the experiments to more practical examples such as intent learning and aircraft navigation modeling. Another interesting topic is to develop an efficient planning algorithm for the multi-skill agent.
Proceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472.
International Conference on Artificial Intelligence and Statistics, 2009, pp. 567–574.
Y. Li, J. Song, and S. Ermon, "InfoGAIL: Interpretable imitation learning from visual demonstrations," in Advances in Neural Information Processing Systems, 2017, pp. 3815–3825.