InfoSSM: Interpretable Unsupervised Learning of Nonparametric State-Space Model for Multi-modal Dynamics

by   Young-Jin Park, et al.

The goal of system identification is to learn the underlying physical dynamics behind observed time-series data. Gaussian process state-space models (GPSSMs) have been widely studied for nonparametric, probabilistic dynamics modeling; GPs not only represent nonlinear dynamics but also estimate the uncertainty of their predictions and avoid over-fitting. Traditional GPSSMs, however, are based on a Gaussian transition model and thus often have difficulty describing multi-modal motions. To resolve this challenge, this paper proposes a model using multiple GPs and extends the GPSSM to an information-theoretic framework by introducing a mutual information regularizer that helps the model learn an interpretable and disentangled representation of the multi-modal transition dynamics. Experimental results show that the proposed model not only successfully represents the observed system but also distinguishes the dynamics mode that governs a given observation sequence.




1 Introduction

The state-space model (SSM) is one of the most general representation models and has been used in a wide range of fields (e.g., aerospace engineering, robotics, economics, and biomedical engineering) for time-series analysis [1]. The key idea of the state-space model is to construct a latent state space and its dynamics model representing the sequentially observed data. In traditional research and applications, linear Gaussian state-space models are commonly used to solve state estimation (i.e., inference) or system identification (i.e., model learning) problems based on Kalman filtering (KF) algorithms [2]. Over the last few decades, a large number of studies have extended SSMs to the nonlinear setting, e.g., the particle filter [3], expectation maximization (EM) [4], the unscented KF [5], and the dual extended Kalman filter [6]. However, most traditional SSM approaches assume a parametric latent model, so they can only be used when we have substantial prior information about the dynamics.

To resolve this challenge, nonparametric nonlinear SSMs have recently been gaining attention. These approaches design the dynamics model and approximate inference structures with neural networks and can successfully represent complex sequential data [7, 8, 9, 10]. Meanwhile, probabilistic dynamics models are known to be more suitable for control problems because they fully quantify model uncertainty and enable safe and unbiased model learning [11]. As such, probabilistic SSM approaches based on the Gaussian process (GP) [12], so-called Gaussian process state-space models (GPSSMs), have been widely used for system identification problems [13, 14, 15]. Although GPSSMs are powerful representation models, they are still not suitable for expressing multi-modal dynamics without additional probabilistic structure. For example, when learning the dynamics of aircraft motion, a single GP cannot fully represent multiple behaviors (constant velocity, level turn, climb, descent, etc.) at once. In such cases, it is more desirable to assume the latent dynamics model is composed of multiple processes. However, a GPSSM using multiple GPs has not been reported in the literature. The first key contribution of this paper is the extension of the GPSSM framework to multi-modal state-space models.

Meanwhile, GPSSMs belong to a class of unsupervised learning problems that are possibly ill-posed. The belief behind many unsupervised learning algorithms is that they will automatically be trained to represent the data the way a human observer conceives of it; we merely hope the learned model is understandable. Without any regularization, however, there is no guarantee that the model will learn a disentangled representation, because nonparametric models often possess very strong representation power. In a similar sense, it is not trivial to train each GP to be distinguishable (i.e., to have each GP learn one mode of the multi-modal dynamics). To solve this issue, we propose an interpretable SSM structure, namely InfoSSM, by introducing a mutual information regularizer, similar to InfoGAN [16], between the observations and the latent dynamics; this is the second key contribution of this paper.

Note that earlier research has utilized multiple GPs, called GP experts, to improve regression performance [17, 18, 19]. Those approaches divide the input space into subsets and assign a GP to each subset. However, they cannot be used directly in the GPSSM framework, since the input space (i.e., the latent state space) is unknown in unsupervised learning tasks. This paper therefore proposes a novel approach that assigns data to multiple GPs by dividing the output space, not the input space, in an explainable way.

The rest of the paper is organized as follows. In Section 2, we begin with a brief summary of the background on GPs and GPSSMs. Section 3 provides the details of our algorithm; we show the mathematical formulation of InfoSSM and the inference structures used to approximate the analytically intractable marginal likelihood and mutual information terms. Finally, in Section 4, we analyze the performance of the proposed model in aircraft system identification experiments.

2 Backgrounds

2.1 Nomenclature

In the following derivations, one subscript indexes the data set and the other indexes the time step. Scalar, vector, and matrix values are represented using lowercase italic, lowercase bold italic, and capital bold italic, respectively. MN(X; M, U, V) represents the probability density of a matrix X under the matrix normal distribution with mean matrix M and the two covariance matrices U and V; it is equivalent to N(vec(X); vec(M), V ⊗ U), where N, vec(·), and ⊗ denote the normal distribution, the vectorization of a matrix, and the Kronecker product, respectively. p(x | y; θ) denotes the probability of a random variable x conditioned on a random variable y and a parameter θ.

2.2 Gaussian Process

GP is a flexible and powerful nonparametric Bayesian model that approximates a distribution over functions [12]. Suppose we are given a data set of input and output pairs, D = {X, Y}. The problem a GP seeks to solve is to learn the function f that maps the input space to the output space of the data¹. The GP assumes the function outputs are jointly Gaussian:

p(F | X) = MN(F; m_X, K_XX, V),

where m_X is the mean function evaluated at the inputs, K_XX is the GP kernel matrix, and V is the covariance matrix among the multiple outputs. An affine mean function and the squared exponential (SE) automatic relevance determination (ARD) kernel are commonly used for the mean and for the covariance between inputs x and x':

m(x) = A x + b,    k(x, x') = s² exp( −(1/2) Σ_d (x_d − x'_d)² / l_d² ),

where the subscript d denotes the d-th dimension of the variable, while A, b, s, and l_d are the hyperparameters of the mean and covariance functions to learn. Given input points X and function outputs F, the probability distribution of the function value at a new input location x* can be predicted as a normal distribution:

p(f* | x*, X, F) = N(f*; μ*, Σ*),

with mean μ* = m(x*) + K_{*X} K_XX⁻¹ (F − m_X) and covariance Σ* = k(x*, x*) − K_{*X} K_XX⁻¹ K_{X*}.

¹In this paper, we particularly consider the matrix-variate GP, which further considers the covariance among multiple outputs.
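As a concrete illustration, the prediction equations above can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: it uses a zero mean function rather than the affine mean, a single output dimension, and illustrative hyperparameter values (`signal_var`, `lengthscales`, `noise_var`).

```python
import numpy as np

def se_ard_kernel(X1, X2, signal_var=1.0, lengthscales=None):
    """SE-ARD kernel: k(x, x') = s^2 exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2)."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return signal_var * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_predict(X, Y, X_star, noise_var=1e-2):
    """Exact GP posterior mean and covariance at test inputs X_star (zero mean function)."""
    K = se_ard_kernel(X, X) + noise_var * np.eye(len(X))
    K_s = se_ard_kernel(X, X_star)
    K_ss = se_ard_kernel(X_star, X_star)
    L = np.linalg.cholesky(K)                              # O(N^3) factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))    # (K + noise I)^{-1} Y
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha                                   # posterior mean
    cov = K_ss - v.T @ v                                   # posterior covariance
    return mean, cov
```

The Cholesky-based solve is the standard way to realize the K⁻¹ terms stably; the O(N³) cost of this factorization is exactly the expense the sparse approximation in the next subsection avoids.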

However, GPs are known to suffer from extensive computational cost, O(N³), due to the inversion of the covariance matrix K_XX. To cope with this problem, a sparse approximation is commonly used [20, 21], which introduces M inducing inputs Z and outputs U. The prior distribution of the sparse GP is given by

p(F, U | X, Z) = p(F | U, X, Z) p(U | Z).

Assuming the inducing variables sufficiently represent the distribution of the original GP function, the true GP prediction is approximated as

p(f* | x*, Z, U) = N(f*; μ*, Σ*),    μ* = m(x*) + K_{*Z} K_ZZ⁻¹ (U − m_Z),    Σ* = k(x*, x*) − K_{*Z} K_ZZ⁻¹ K_{Z*}.

Note that the inducing outputs U are random variables to infer, whereas the inducing inputs Z are treated as parameters to learn. The resulting computational complexity is reduced to O(NM²). By an abuse of notation, we drop Z and U from the GP prediction term in the following derivations.

2.3 Gaussian Process State-Space Model

The state-space model is a representation that describes a dynamic system, consisting of a latent Markovian state x_t, control input u_t, and observation output y_t at time t. However, for any given dynamic system, there are infinitely many state-space models that can represent the identical system. Thus we adopt one of the standardized forms, the canonical state-space model, given as

x_{t+1} = f(x_t, u_t) + w_t,    y_t = g(x_t) + v_t,

where f and g are the transition and observation models, while w_t and v_t are the process and measurement noise. The latent state is composed of the position and velocity vectors. In this paper, we focus on the problem where the transition model is completely unknown (i.e., nonparametric) whereas the observation model is fairly well known (i.e., parametric)². In the discrete canonical state-space model with a predefined time interval Δt, the latent state at the next time step can be recursively approximated from the current state and its derivative. The key concept of the GPSSM is to model the transition of the system as a GP function:

x_{t+1} = x_t + Δt f(x_t) + w_t,    f ~ GP.

Note that x_{t+1} is an affine transform of f(x_t), and w_t is Gaussian white noise caused by the process noise; thus both are Gaussian and tractable. As shown in the graphical model of the GPSSM in Fig. 1, the joint distribution of the GPSSM variables over T time steps factorizes according to the Markov structure.

²Without loss of generality, it is also possible to learn both a nonparametric transition model and a nonparametric observation model in the GPSSM formulation. However, such a setting brings out severe non-identifiability between the two models [13, 15] and degrades the interpretability of the learned latent model.

Figure 1: The graphical model of InfoSSM. InfoSSM reduces to the GPSSM when the dynamics code is left out.

3 InfoSSM

3.1 Multiple State-Space Modeling

Previous GPSSM studies [13, 14, 15] mostly targeted problems where the control input sequence is given. However, the control input is often unobservable, and the only data we can access is the observation outputs. For instance, when we want to learn the dynamics of a maneuvering aircraft, we receive range, azimuth, and elevation histories from radar signals, but the corresponding control signals (i.e., thrust, control surfaces, etc.) are not given. It is very difficult to consider every possible control input sequence or to infer the unknown control input. Alternatively, inspired by the interacting multiple model (IMM) algorithm [22], we assume that the dynamics can be approximated by a combination of a finite number of motion patterns (i.e., control input patterns), each with its own transition model and process noise.

To express such multi-modal dynamics in the GPSSM structure, we use multiple GPs, each representing one motion pattern, rather than a single GP expressing the global dynamics at once. Mathematically, let there be K GPs, f^(1), ..., f^(K), and let c ∈ {1, ..., K} denote the dynamics code of an observation sequence. If the observation came from the c-th GP model, the transition model within the latent state trajectory is given by

x_{t+1} = x_t + Δt f^(c)(x_t) + w_t,

where the mean and variance of the transition follow the c-th GP prediction. In this paper, we call c the latent dynamics code of the data, in the sense that c determines the mode of dynamics governing the latent states. The joint distribution of the InfoSSM variables then additionally conditions the transitions on the code. Unfortunately, c is a hidden latent variable, which makes the problem difficult to solve. In other words, we also need to infer the dynamics code for each observation.
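The code-indexed transition described above can be sketched as follows; `gp_means` is a hypothetical stand-in for the predictive means of the K trained GPs, and a simple Euler step replaces the full Gaussian propagation.

```python
import numpy as np

def propagate(x0, code, gp_means, dt=0.1, steps=50):
    """Roll a latent state forward under the dynamics mode selected by the code.

    gp_means: list of K callables, each mapping a state to its time derivative
    (illustrative stand-ins for the predictive means of K trained GPs)."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    f = gp_means[code]               # the dynamics mode is picked by the latent code
    for _ in range(steps):
        x = x + dt * f(x)            # Euler step through the chosen transition model
        traj.append(x.copy())
    return np.stack(traj)
```

The design point is that the code selects *which* transition function governs the whole trajectory, rather than mixing all K functions at every step.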

3.2 Inference Structure of InfoSSM

The goal of InfoSSM is to optimize the GP hyperparameters and inducing variables so as to maximize the marginal likelihood of the observations and thereby learn the latent dynamics that best represents the data. Unlike a single-layer GP, however, the GPSSM is no longer a GP, as it stacks GPs hierarchically along time. This makes estimating the marginal likelihood challenging, since it contains an intractable integration. Instead of directly optimizing the marginal likelihood, variational inference alternatively maximizes a lower bound on the objective, called the evidence lower bound (ELBO).

This paper follows a doubly stochastic variational inference approach [23, 15], known to give superior performance to other variational approaches [24] or EM [25] for deep GP structures; it imposes neither independence assumptions between GP layers nor Gaussianity on the GP outputs. Following [15], we factorize the variational distribution over the inducing outputs, latent states, and dynamics codes.

Remark that the gap between the log marginal likelihood and the lower bound decreases as the variational distribution gets closer to the true posterior; hence it is very important to select a proper inference structure.

3.2.1 Approximate Inference for Latent states

Approximate inference for the posterior distribution of the latent states consists of three parts: the variational distributions of the transition (inducing) variables, the initial state, and the propagated states.

First, the variational distribution over the inducing outputs is factorized across the K GPs, each parameterized by a matrix-variate normal distribution that shares the same multi-output covariance with the GP prior; its mean matrix and row covariance are the variational parameters to optimize.

Second, for the initial state distribution, the most naïve approach is to parameterize the posterior separately for each data sequence. Though this may give fairly exact inference results, it is not only computationally expensive but also impossible to generalize to unseen data. Alternatively, this paper uses amortized inference networks [26, 27] to approximate the posterior distribution. We choose a backward recurrent neural network (BRNN), designed to mimic the Kalman smoothing algorithm [8]; the hidden state of the BRNN at the initial time step is expected to contain the compressed information from the whole observation sequence. An additional shallow neural network then outputs the mean and covariance of the initial state from this hidden state; the parameters to learn are those of the BRNN and the shallow network. To reduce the computational complexity, a diagonal covariance is used. The inference structure is shown in Fig. 2.

Figure 2: The inference structure for the initial latent state.

Finally, the propagated states can be recursively derived from the GP prediction: each successive state is Gaussian, with its mean and variance obtained from the GP predictive mean and variance at the current state. The derivation follows directly from the linear transformation of a Gaussian distribution.

3.2.2 Approximate Inference for Latent Dynamics Code

The probability distribution of the dynamics code could ideally be inferred from the full posterior, but this is intractable. Instead, we expect that by observing the shape of the observation trajectory, the model can distinguish which mode of dynamics it emerged from. Thus, we use a neural network that receives the observation trajectory as input and outputs the corresponding latent dynamics code as a categorical distribution over the K codes; a softmax activation is used for the last layer to produce the probability mass function.
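A minimal sketch of such a recognition network is shown below; the single linear layer and the weight names `W`, `b` are illustrative stand-ins for the deeper network used in the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def code_posterior(trajectory, W, b):
    """Categorical q(c | y_{1:T}): flatten the (T, d_y) trajectory, apply a linear
    layer, and take a softmax to get a probability mass over the K dynamics codes."""
    h = trajectory.reshape(-1)          # flatten observations into one feature vector
    logits = W @ h + b
    return softmax(logits)
```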

3.2.3 Shift and Rotation Invariant Inference Structures

In the present work, we focus in particular on state-space models in Cartesian coordinates. Dynamics in Cartesian coordinates have two general characteristics: the physics are invariant to shift and to rotation in the horizontal plane. As such, we want the networks to output the same dynamics code even when given shifted and rotated trajectories. To embed these characteristics, we propose a shift- and rotation-invariant inference structure via a simple idea: we shift the initial position of the input trajectory to the origin and rotate the shifted trajectory by a random angle, using a random rotation matrix. This can also be interpreted as a data augmentation technique, widely used in classification tasks.
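The shift-and-rotate preprocessing can be sketched directly; the function below assumes a 2D (x, y) trajectory and draws the rotation angle uniformly.

```python
import numpy as np

def shift_rotate(trajectory, rng=None):
    """Shift the trajectory so it starts at the origin, then rotate the
    horizontal (x, y) plane by a uniformly random angle."""
    if rng is None:
        rng = np.random.default_rng()
    traj = trajectory - trajectory[0]          # shift the initial position to the origin
    theta = rng.uniform(0.0, 2.0 * np.pi)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return traj @ R.T                          # apply the random rotation
```

Because rotation is an isometry, the transformed trajectory keeps its shape (all pairwise distances), which is exactly why the recognition network should assign it the same code.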

3.3 Monte Carlo Objectives

Based on (11)–(13), the ELBO is derived following [15]. The expectation terms can be computed by a Monte Carlo sampling approach for the ELBO update. Meanwhile, instead of the traditional ELBO update, this paper adopts the Monte Carlo objective (MCO) update approach [28], which is known not only to increase the representation power of the variational distribution but also to give a tighter lower bound by using multiple samples in a more effective way [29, 30].

3.3.1 Derivations of MCOs

The model likelihood can be estimated by a Monte Carlo estimator [28] that averages importance weights, where each weight is the ratio of the joint model distribution to the variational distribution evaluated at a sample. By using (11) and (13) and applying Jensen's inequality, a lower bound on the marginal log-likelihood is obtained. Hence, the MCO with L Monte Carlo samples is the logarithm of the averaged importance weights. During the computation, the reparameterization trick [26, 27, 31] is further used so that the learning signal from the MCO can be back-propagated into every inference network. As such, we can construct the end-to-end training procedure in a fully differentiable manner.

Remark that the MCO becomes equivalent to the ELBO when L = 1. In fact, the defined MCO is equivalent to the IWAE bound, and thus it becomes monotonically tighter as the sample size increases [29]. Hence, the MCO achieves a tighter lower bound than the traditional ELBO.
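A numerically stable sketch of the MCO estimator, computed from L importance log-weights with the log-sum-exp trick:

```python
import numpy as np

def mco(log_weights):
    """Monte Carlo objective (IWAE bound) from L importance log-weights:
    log( (1/L) * sum_i w_i ), computed stably via log-sum-exp."""
    log_w = np.asarray(log_weights, dtype=float)
    m = log_w.max()                            # subtract the max before exponentiating
    return m + np.log(np.mean(np.exp(log_w - m)))
```

With L = 1 the estimator reduces to a single log-weight, i.e., an ELBO sample; by Jensen's inequality, log of the mean weight is never below the mean of the log-weights, which is the tightening effect described above.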


3.4 Mutual Information Regularization for Latent Dynamics Code

Ideally, we hope each GP is trained in a meaningful way even without any supervision. For example, given a data set of trajectories from a maneuvering aircraft, we expect one GP to learn constant velocity, another to learn level turns, and the others to learn climb and descent motions. In the worst case, however, the inference network may assign every trajectory to the first GP, so that the first GP is trained to represent every motion while the others learn useless motions. To avoid this circumstance, we introduce an additional information-theoretic objective that leads the latent dynamics code to satisfy two criteria. First, the code should contain useful information determining the shape of the trajectory. Second, trajectories generated from different GPs should be distinguishable. In short, there should be a strong mutual dependence between the latent dynamics code and the corresponding trajectory.

Inspired by [16, 32, 33], this paper likewise adopts mutual information regularization for the latent dynamics code. In the information-theoretic sense, mutual information is a measure quantifying the mutual dependence between two random variables. It can also be interpreted as the amount of information contained in one random variable about the other; if two random variables are mutually independent, their mutual information is zero. Finally, InfoSSM is trained to maximize the mutual information between the code and the generated trajectory along with the MCO, where a tunable parameter adjusts the relative scale between the MCO and the regularization, and the trajectory is sampled for a given initial state and code by using (9) and (16). Note, however, that it is intractable to directly compute the mutual information term. Instead, we optimize a lower bound of the mutual information via the variational information maximization approach [34], which replaces the intractable posterior over the code with the variational distribution from the inference network.


See [16] for the detailed derivation. The entropy of the code is constant since its prior is fixed, the same inference network as in (17) is used, and the expectation term can be computed by Monte Carlo estimation. A detailed implementation is provided in Algorithm 1. We implement InfoSSM with TensorFlow [35], and the adaptive moment estimation (Adam) algorithm [36] is used as the gradient-descent optimizer.
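The variational lower bound above can be estimated as the code prior's entropy plus the average log-probability the recognition network assigns to the sampled codes on the generated trajectories; a sketch (the argument names are illustrative):

```python
import numpy as np

def mi_lower_bound(codes, q_probs, prior_probs):
    """Variational lower bound on I(c; trajectory): H(c) + E[log Q(c | trajectory)].

    codes:       sampled code indices, shape (num_samples,)
    q_probs:     recognition-network probabilities on the generated
                 trajectories, shape (num_samples, K)
    prior_probs: fixed code prior p(c), shape (K,)"""
    codes = np.asarray(codes)
    q = np.asarray(q_probs)
    prior = np.asarray(prior_probs)
    entropy = -np.sum(prior * np.log(prior))               # H(c), constant for a fixed prior
    log_q = np.log(q[np.arange(len(codes)), codes])        # log Q(c_i | trajectory_i)
    return entropy + np.mean(log_q)
```

With a uniform prior over K codes, the bound ranges from 0 (code and trajectory independent) up to log K (codes perfectly recoverable from trajectories), which is the behavior the regularizer rewards.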

1:  Initialize the model parameters.
2:  for each training epoch do
3:     Reset the accumulated objective.
4:     for each mini-batch do
5:        Select observation data from the training set.
6:        Sample initial states and codes using (15) and (17).
7:        Propagate states using (16) recursively.
8:        Compute the MCO using (23).
9:        Accumulate the MCO.
10:     end for
11:     Sample codes from the prior.
12:     Generate trajectories from each sampled code.
13:     Compute the mutual information lower bound using (27).
14:     Combine the accumulated MCO and the mutual information term into the total objective.
15:     Update InfoSSM using Adam.
16:  end for


Algorithm 1 InfoSSM Implementation

4 Results

We evaluate the proposed algorithm on a Dubins' vehicle experiment. The Dubins path is a commonly used approximate dynamics model in planning and control problems for wheeled vehicles and aircraft [37, 38, 39]. The dynamics of the 2D Dubins' vehicle are given by

ẋ = v cos θ,    ẏ = v sin θ,    θ̇ = u,

where v is the speed of the vehicle, which is assumed to be constant. Depending on the control input u, the Dubins' vehicle shows three motion primitives: right (R), straight (S), and left (L), corresponding to negative, zero, and positive turn rates, respectively. A sample trajectory of the Dubins' vehicle is shown in Fig. 3. Fifty sub-trajectories are used for the training procedure.
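A minimal simulator for this experiment might look as follows; the speed, time step, and turn-rate values are illustrative, not the paper's exact settings.

```python
import numpy as np

def dubins_step(state, u, v=1.0, dt=0.1):
    """One Euler step of the 2D Dubins vehicle:
    x' = v cos(theta), y' = v sin(theta), theta' = u (constant speed v)."""
    x, y, theta = state
    return np.array([x + dt * v * np.cos(theta),
                     y + dt * v * np.sin(theta),
                     theta + dt * u])

def simulate(state0, u, steps=100, v=1.0, dt=0.1):
    """Roll out one motion primitive: u < 0 right turn, u = 0 straight, u > 0 left turn."""
    traj = [np.asarray(state0, dtype=float)]
    for _ in range(steps):
        traj.append(dubins_step(traj[-1], u, v=v, dt=dt))
    return np.stack(traj)
```

Concatenating rollouts with different u values produces multi-modal trajectories of the kind used for training here.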

Figure 3: The trajectory of the Dubins' vehicle. Aircraft markers are drawn every 100 steps.

We construct our latent state space following the canonical form of Section 2.3, with the vehicle's position and velocity as the latent state.

As baselines, we compare InfoSSM with PRSSM [15] and with InfoSSM without the mutual information regularization, which we call unInfoSSM. We empirically found the chosen regularization weight to be a good setting. We evaluate the models on three aspects: interpretability, model accuracy, and long-term prediction performance. Every model is constructed with the same hyperparameter settings. For InfoSSM and unInfoSSM, three GPs are used (K = 3).

4.1 Interpretability

Most importantly, we analyze the effect of the mutual information regularization on model interpretability by comparing the results of InfoSSM and unInfoSSM. To visualize the effect, several trajectories generated from different GPs (i.e., different latent dynamics codes) at random initial states are plotted in Fig. 4. As the results show, we cannot find any interpretable meaning in unInfoSSM. We empirically found that the model without mutual information regularization tends to use only the first GP; it treats the R and L motions as noise around the S motion, which is a very inefficient representation. InfoSSM, in contrast, successfully learns a disentangled representation, so that each code distinguishes the L, S, and R motion patterns. Furthermore, as shown in Table 1, InfoSSM attains the highest mutual information and is thus the most distinguishable.

(a) unInfoSSM
(b) InfoSSM
Figure 4: Sampled trajectories from different codes. The initial state is marked with a red dot.

4.2 Model Accuracy

To compare the model accuracy (i.e., how well the model represents the observations), we analyze two factors: the lower bound of the log marginal likelihood and the reconstruction performance. First, Table 1 shows that InfoSSM achieves the highest lower bound. Note that unInfoSSM performs even worse than PRSSM, which uses only a single GP. This is because unInfoSSM fails to use the multiple GPs efficiently, while the KL-divergence term in (23) grows in proportion to the number of GPs. Second, Table 2 and Fig. 5 illustrate the reconstruction performance. As expected, InfoSSM shows the smallest root mean square error (RMSE) and the largest log-likelihood.

                    InfoSSM      unInfoSSM    PRSSM
MCO                 3727.2       1249.2       1525.5
Mutual information  -0.000002    -287.4       -
Table 1: The MCO and mutual information of the models.
                  RMSE (InfoSSM / unInfoSSM / PRSSM)    log-likelihood (InfoSSM / unInfoSSM / PRSSM)
Case 1. Right     0.6767 / 2.8916 / 1.0623              50.0152 / -32.6587 / 31.6831
Case 2. Straight  0.4091 / 1.6435 / 2.0696              51.3238 / 30.6983 / 11.4390
Case 3. Left      0.7263 / 1.4932 / 1.5680              35.8378 / -10.6636 / -27.3243
Table 2: The RMSE and log-likelihood of the test cases in Fig. 5.
(a) InfoSSM
(b) unInfoSSM
Figure 5: Reconstruction results for three different cases; from the left, Cases 1, 2, and 3. The RMSE and log-likelihood of each result are shown in Table 2.
(a) InfoSSM
(b) unInfoSSM
Figure 6: The long-term prediction results.

4.3 Long-term Prediction

Finally, we evaluate the long-term prediction performance of InfoSSM to see whether the learned dynamics match the true dynamics. From the initial state, we propagate the vehicle for 100 time steps under each motion primitive. Using the first 20-step trajectory, the model infers the code and the latent state. The latent state is then propagated with the learned dynamics of the corresponding code and compared with the true trajectory. As shown in Fig. 6, InfoSSM gives highly accurate long-term predictions for each dynamics mode. Note, however, that the baseline models fail to predict the future state, and their uncertainty increases rapidly over time.

5 Conclusion

In this paper, we presented InfoSSM, an information-theoretic extension of the GPSSM. To describe multi-modal dynamics, we modeled the latent dynamics using multiple GPs selected by a latent dynamics code. The inference of the latent state and dynamics code is performed via structured neural networks. Unlike previous GPSSM approaches, InfoSSM can learn a disentangled representation of the dynamics via mutual information regularization without any human supervision. The proposed model was evaluated on a Dubins' vehicle experiment and shown to effectively represent multi-modal latent dynamics. In future work, we will extend the experiments to more practical examples such as intent learning and aircraft navigation modeling. Another interesting topic is to develop an efficient planning algorithm for a multi-skill agent.


  • [1] J. D. Hamilton, “State-space models,” Handbook of econometrics, vol. 4, pp. 3039–3080, 1994.
  • [2] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
  • [3] N. J. Gordon, D. J. Salmond, and A. F. Smith, “Novel approach to nonlinear/non-gaussian bayesian state estimation,” in IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2.   IET, 1993, pp. 107–113.
  • [4] T. Briegel and V. Tresp, “Fisher scoring and a mixture of modes approach for approximate inference and learning in nonlinear state space models,” in Advances in Neural Information Processing Systems, 1999, pp. 403–409.
  • [5] E. A. Wan and R. Van Der Merwe, “The unscented kalman filter for nonlinear estimation,” in Adaptive Systems for Signal Processing, Communications, and Control Symposium (AS-SPCC).   IEEE, 2000, pp. 153–158.
  • [6] T. A. Wenzel, K. Burnham, M. Blundell, and R. Williams, “Dual extended kalman filter for vehicle state and parameter estimation,” Vehicle System Dynamics, vol. 44, no. 2, pp. 153–171, 2006.
  • [7] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt, “Deep variational bayes filters: Unsupervised learning of state space models from raw data,” arXiv preprint arXiv:1605.06432, 2016.
  • [8] R. G. Krishnan, U. Shalit, and D. Sontag, “Structured inference networks for nonlinear state space models.” in AAAI, 2017, pp. 2101–2109.
  • [9] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther, “A disentangled recognition and nonlinear dynamics model for unsupervised learning,” in Advances in Neural Information Processing Systems, 2017, pp. 3604–3613.
  • [10] J.-S. Ha, Y.-J. Park, H.-J. Chae, S.-S. Park, and H.-L. Choi, “Adaptive path-integral approach for representation learning and planning,” 2018.
  • [11] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
  • [12] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning.   MIT press Cambridge, 2006, vol. 1.
  • [13] R. Frigola, Y. Chen, and C. E. Rasmussen, “Variational gaussian process state-space models,” in Advances in Neural Information Processing Systems, 2014, pp. 3680–3688.
  • [14] W. Sternberg and M. P. Deisenroth, “Identification of gaussian process state-space models,” 2017.
  • [15] A. Doerr, C. Daniel, M. Schiegg, D. Nguyen-Tuong, S. Schaal, M. Toussaint, and S. Trimpe, “Probabilistic recurrent state-space models,” arXiv preprint arXiv:1801.10395, 2018.
  • [16] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
  • [17] C. E. Rasmussen and Z. Ghahramani, “Infinite mixtures of gaussian process experts,” in Advances in neural information processing systems, 2002, pp. 881–888.
  • [18] T. Nguyen and E. Bonilla, “Fast allocation of gaussian process experts,” in International Conference on Machine Learning, 2014, pp. 145–153.
  • [19] C. Yuan and C. Neubauer, “Variational mixture of gaussian process experts,” in Advances in Neural Information Processing Systems, 2009, pp. 1897–1904.
  • [20] E. Snelson and Z. Ghahramani, “Sparse gaussian processes using pseudo-inputs,” in Advances in neural information processing systems, 2006, pp. 1257–1264.
  • [21] M. K. Titsias, “Variational learning of inducing variables in sparse gaussian processes,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 567–574.
  • [22] Y. Bar-Shalom, K. Chang, and H. A. Blom, “Tracking a maneuvering target using input estimation versus the interacting multiple model algorithm,” IEEE Transactions on Aerospace and Electronic Systems, vol. 25, no. 2, pp. 296–300, 1989.
  • [23] H. Salimbeni and M. Deisenroth, “Doubly stochastic variational inference for deep gaussian processes,” in Advances in Neural Information Processing Systems, 2017, pp. 4591–4602.
  • [24] A. Damianou and N. Lawrence, “Deep gaussian processes,” in Artificial Intelligence and Statistics, 2013, pp. 207–215.
  • [25] T. Bui, D. Hernández-Lobato, J. Hernandez-Lobato, Y. Li, and R. Turner, “Deep gaussian processes for regression using approximate expectation propagation,” in International Conference on Machine Learning, 2016, pp. 1472–1481.
  • [26] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [27] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in International Conference on Machine Learning (ICML), 2015, pp. 1530–1538.
  • [28] A. Mnih and D. J. Rezende, “Variational inference for monte carlo objectives,” arXiv preprint arXiv:1602.06725, 2016.
  • [29] Y. Burda, R. Grosse, and R. Salakhutdinov, “Importance weighted autoencoders,” arXiv preprint arXiv:1509.00519, 2015.
  • [30] C. Cremer, Q. Morris, and D. Duvenaud, “Reinterpreting importance-weighted autoencoders,” arXiv preprint arXiv:1704.02916, 2017.
  • [31] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
  • [32] Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” in Advances in Neural Information Processing Systems, 2017, pp. 3815–3825.
  • [33] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” arXiv preprint arXiv:1802.06070, 2018.
  • [34] D. Barber and F. Agakov, “The im algorithm: a variational approach to information maximization,” in Proceedings of the 16th International Conference on Neural Information Processing Systems.   MIT Press, 2003, pp. 201–208.
  • [35] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
  • [36] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [37] S. M. LaValle, Planning algorithms.   Cambridge university press, 2006.
  • [38] A. Tsourdos, B. White, and M. Shanmugavel, Cooperative path planning of unmanned aerial vehicles.   John Wiley & Sons, 2010, vol. 32.
  • [39] R. W. Beard and T. W. McLain, Small unmanned aircraft: Theory and practice.   Princeton university press, 2012.