1 Introduction
Sequential modelling of high-dimensional data is a challenging problem that is encountered in many domains such as visual model-based reinforcement learning [12] and image-based dynamics identification for control [40]. Recently, there has been broad interest in studying the problem through the lens of generative latent variable models. These methods embed high-dimensional observations into a lower-dimensional representation space, or latent space, often by means of a variational autoencoder (VAE) [21, 33], where the (latent) dynamics can be identified in a self-supervised manner. Importantly, latent dynamics models form the backbone of many recent methods for image-based control, reinforcement learning [13, 30], and motion planning with complex robotic systems [16, 27]. When dealing with images as a single modality, this approach has been shown to be particularly effective in various existing works [43, 39, 40, 18, 7]. However, in many robotic systems, images are accompanied by data from multiple additional sensing modalities with varying characteristics; that is, the data are multimodal. These multimodal data may contain complementary information, patterns, and useful statistical correlations across modalities. Prior research has demonstrated the advantages of capturing multimodal information in the contexts of classification and regression of multimedia data [28, 1], learning from demonstration [15], localization [44], and control policy learning [35]. By contrast, the use of multimodal data has not yet been studied in the context of learned latent dynamics models.
In this paper, we present a novel probabilistic framework for learning latent dynamics from multimodal time series data. Inspired by the multimodal variational autoencoder (MVAE) architecture [41], we employ a product of experts [14] to encode all data modalities into a shared probabilistic latent representation while jointly learning the dynamics in a self-supervised manner with a recurrent neural network (RNN) [3].
For validation and demonstration purposes, we use simulated and real-world datasets collected from a robotic manipulator pushing task involving three heterogeneous data modalities: images from a camera, force and torque readings from a force-torque sensor, and proprioceptive readings from the manipulator encoders. Figure 1 provides a visual summary of our approach.
Our main contributions are as follows:

a formulation of a variational inference model for self-supervised training of latent dynamics models specifically with multimodal time series data;

experimental results demonstrating that our self-supervised method achieves results comparable to an existing supervised method [25], which requires ground-truth labels, when used to capture task-relevant state information and dynamics; and

an open-source implementation of our method and experiments in PyTorch [29] (https://github.com/utiasSTARS/visualhapticdynamics).
2 Related Work
In this section, we survey papers related to the modelling of high-dimensional sequential data with latent variable models and learning with multimodal data in machine learning and robotics. We pay particular attention to vision as it is one of the most commonly encountered modalities across many domains.
2.0.1 Latent Dynamics
Early deep dynamical models [39] used the bottleneck of a standard autoencoder as the compressed state from which the dynamics were learned in a tractable manner. Probabilistic extensions using the VAE were studied later [40, 2]. A significant amount of work has been carried out with sequential image data in the context of learned probabilistic state space models [24, 18] and differentiable filtering [11]. These approaches attempt to combine the structure and interpretability of probabilistic graphical models with the flexibility and representational capacity of neural networks. Probabilistic graphical models have been used as a way to impose structure for fast and exact inference [17, 7] and to filter out novel or out-of-distribution images using a notion of uncertainty that comes from generative models [26]. The closely related topic of image-based transition models, or world models, has been studied in the reinforcement learning literature [12, 10]. Our work is an extension to these latent dynamics models that makes them amenable to multimodal data.
2.0.2 Multimodal Machine Learning
Machine learning has been applied to the problem of learning representations and patterns of multimodal data for various downstream tasks. A good summary of the existing literature focused on applications involving multimedia data (e.g., video, text, and audio) is provided in [1]. Probabilistic methods have also been applied to model the joint and conditional distributions of non-sequential multimodal data. Examples of these include the restricted Boltzmann machine (RBM) [28, 36] and the VAE [41, 37, 38]. Our work is most similar to the latter of these two approaches. We build upon the MVAE [41], but, critically, we apply and extend the framework to the sequential setting so that it can capture the latent dynamics of multimodal data.
2.0.3 Multimodal Learning for Robotics
Our work is most similar in nature to [8], where the authors use audio data to augment a deterministic state-based forward (or dynamics) model. However, in [8], the audio data are taken from a previous random interaction and do not provide causally related information to the forward model; the audio data are simply used to augment the representation. In contrast, we directly model the dynamics of the observed multimodal data. We also do not assume a relaxed deterministic state-based setting and instead learn a probabilistic representation from raw multimodal data directly (as opposed to dealing with difficult-to-acquire state labels).
Similar to the approach presented in this paper, other groups have investigated the use of learned differentiable filters with multimodal measurement models of vision, proprioceptive, and haptic data, relying on ground-truth annotations [25]. However, in many cases ground-truth labels are expensive or impossible to acquire, which may hinder the scalability of such methods. We leverage recent work in variational inference and devise a self-supervised generative approach to bypass this limitation. We do so by maximizing a proper lower bound of the marginal likelihood of the data itself.
Other works have also leveraged the MVAE architecture for learned localization with multimodal data [44]. Closer to our work, the authors of [32] demonstrate a technique to learn a notion of ‘intuitive physics’ in a self-supervised manner by applying the MVAE architecture as a generative model of multimodal sensor measurements. Specifically, future sensor measurements resulting from interaction with objects are decoded based on an encoding of the current sensor measurements. In our work, as opposed to directly decoding future transitions, we learn a dynamics model based on the compressed latent space. We can therefore predict while remaining in a low-dimensional space, without having to decode, which saves a significant amount of computation and memory.
3 Methodology
We begin by presenting a baseline sequential latent variable model in Section 3.1 and then demonstrate how to extend this framework to multimodal data in Section 3.2.
3.1 Sequential Latent Variable Models
We consider a sequence of observations of a single modality with respective control inputs . We then introduce latent variables
to create a joint distribution
, where are the learnable parameters of our distributions that are parametrized by neural networks. We factorize the joint distribution of the generative process as , where:(1) 
and
(2) 
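As a sketch of the factorization described above, one standard way to write such a sequential generative model is shown below. The notation is assumed for illustration and is not taken from the original: $x_{1:T}$ for the observations, $u_{1:T}$ for the control inputs, $z_{1:T}$ for the latent variables, and $\theta$ for the generative parameters. The time indexing of the control inputs is likewise an assumption.

```latex
\[
\begin{aligned}
p_\theta(x_{1:T}, z_{1:T} \mid u_{1:T})
  &= p_\theta(x_{1:T} \mid z_{1:T})\, p_\theta(z_{1:T} \mid u_{1:T}), \\
p_\theta(z_{1:T} \mid u_{1:T})
  &= p(z_1) \prod_{t=2}^{T} p_\theta(z_t \mid z_{t-1}, u_{t-1}), \\
p_\theta(x_{1:T} \mid z_{1:T})
  &= \prod_{t=1}^{T} p_\theta(x_t \mid z_t).
\end{aligned}
\]
```

Under this reading, the second line collects the initial distribution and the latent dynamics, and the third line is the per-step observation model.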
We model the latent dynamics with the distribution and the observation or measurement model with . The distribution is an arbitrary initial distribution with high uncertainty. The goal of learning is to maximize the marginal likelihood of the data or the evidence, given for a single sequence as
(3) 
with respect to the parameters . Unfortunately, in the general case with this model, the posterior distribution used for inference, , is intractable. A common solution from recent work in variational inference [21, 33] is to introduce a recognition or inference model with parameters to approximate the intractable posterior. This leads to the following lower bound on the marginal log-likelihood, known as the evidence lower bound (ELBO):
(4) 
Maximizing this lower bound can be shown to be equivalent to minimizing the KL divergence between the true posterior and the recognition model . The resulting optimization objective, as denoted by Eq. 4, is based on an expectation with respect to the distribution , which itself is based on the parameters . As is typically done, we restrict to be a Gaussian variational approximation. This enables us to use stochastic gradient descent (i.e., using Monte Carlo estimates of the gradient) via the reparameterization trick [21] to optimize the lower bound. The specific choice for the factorization of varies depending on the application (i.e., prediction, smoothing, or filtering). Given our intended application of prediction, we choose to only use causal (i.e., current and past) information for inference,
(5)
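The two Gaussian ingredients of such a lower bound, a reparameterized sample and a closed-form KL term, can be sketched as follows. This is a minimal NumPy illustration under the assumption of diagonal Gaussians parameterized by a mean and a log-variance; the function names are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def reparameterize(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so in an automatic-differentiation
    framework gradients can flow through mean and log_var.
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_diag_gaussians(mean_q, log_var_q, mean_p, log_var_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    var_q = np.exp(log_var_q)
    var_p = np.exp(log_var_p)
    kl = 0.5 * (log_var_p - log_var_q
                + (var_q + (mean_q - mean_p) ** 2) / var_p
                - 1.0)
    return kl.sum(axis=-1)


# Example with a 16-dimensional latent state (illustrative values only).
rng = np.random.default_rng(0)
z = reparameterize(np.zeros(16), np.zeros(16), rng)
kl = kl_diag_gaussians(np.zeros(16), np.zeros(16), np.ones(16), np.zeros(16))
```

A single-sample Monte Carlo estimate of the bound would combine a reconstruction log-likelihood evaluated at `z` with the negative of such a KL term.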
3.2 Multimodal Sequential Latent Variable Models
We now extend and generalize the sequential latent variable model defined above to the multimodal case. We consider N sequences of separate observations or modalities, , where we assume that each sequence is of equal length : . As done in the previous section, we include the respective control inputs and again introduce a set of latent variables as a joint lower-dimensional latent space containing some underlying dynamics of interest. The final joint distribution factorizes as . We choose to define the generative process as follows:
(6) 
and with remaining the same as shown in Eq. 1. The marginal likelihood is then
(7) 
and the respective ELBO, for a single sequence, is then:
(8) 
In order to decide on the factorization of , we draw inspiration from the MVAE architecture [41], and base our inference network on the structure of the true multimodal posterior :
(9) 
Based on the last term in Eq. 9, the final factorization of the joint posterior is then a quotient between a product of the individual modality-specific posteriors and the prior. Accordingly, we choose our inference model to be . We also use the same representation reformulation trick from the MVAE [41] and set in order to produce a simpler and numerically more stable product of experts,
(10) 
where would be equivalent to the standard single modality posterior shown in Eq. 5. Finally, we can further factorize the posterior in Eq. 10 into a more intuitive form for our sequential setting:
(11) 
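For orientation, the product-of-experts argument of the MVAE [41] that this derivation follows can be sketched as below. The notation is assumed for illustration and is not taken from the original: $x^{1:N}$ denotes the $N$ modalities, $\tilde{q}_\phi$ a modality-specific encoder, and time indices and control inputs are dropped in the first three lines for brevity. The modalities are assumed to be conditionally independent given $z$.

```latex
\[
\begin{aligned}
p(z \mid x^{1:N})
  &= \frac{p(z)}{p(x^{1:N})} \prod_{n=1}^{N} p(x^{n} \mid z)
   = \frac{\prod_{n=1}^{N} p(x^{n})}{p(x^{1:N})}\,
     \frac{\prod_{n=1}^{N} p(z \mid x^{n})}{\prod_{n=1}^{N-1} p(z)}
   \;\propto\; \frac{\prod_{n=1}^{N} p(z \mid x^{n})}{\prod_{n=1}^{N-1} p(z)}, \\
q(z \mid x^{1:N})
  &\propto \frac{\prod_{n=1}^{N} q(z \mid x^{n})}{\prod_{n=1}^{N-1} p(z)},
  \qquad q(z \mid x^{n}) := \tilde{q}_\phi(z \mid x^{n})\, p(z), \\
\Rightarrow\quad q(z \mid x^{1:N})
  &\propto p(z) \prod_{n=1}^{N} \tilde{q}_\phi(z \mid x^{n}), \\
q(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T})
  &\propto \prod_{t=1}^{T} \Big[ p_\theta(z_t \mid z_{t-1}, u_{t-1})
     \prod_{n=1}^{N} \tilde{q}_\phi(z_t \mid x^{n}_t) \Big].
\end{aligned}
\]
```

In the last line, which is one plausible sequential form under the stated assumptions, the prior expert at each step is the transition distribution, with the convention that the first factor at $t = 1$ is the initial distribution $p(z_1)$.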
Our factorization reveals that, at every time step of Eq. 11, each data modality is first separately encoded into a Gaussian distribution by its own inference model. A product is then taken of the modality-specific distributions and the prior, which is also the transition distribution (i.e., the dynamics) of our latent space. Interestingly, we recover a similar form to the commonly used recurrent state space model [12], where the transition distribution is included in the posterior. The product of distributions at each time step is not generally solvable in closed form. However, by assuming Gaussian distributions for the prior dynamics and each modality-specific inference distribution, we end up with a final product of Gaussians for which a closed-form analytical solution does exist. Conveniently, a product of Gaussians is also itself a Gaussian [6].
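The closed-form product of Gaussians mentioned above can be illustrated with a short NumPy sketch. It assumes diagonal covariances (as used for the latent space in Section 4.2) and treats the prior as one more expert; the function name and the example values are hypothetical.

```python
import numpy as np


def product_of_gaussians(means, variances):
    """Fuse diagonal Gaussian experts into a single Gaussian.

    means, variances: array-likes of shape (num_experts, latent_dim).
    The fused precision is the sum of the expert precisions, and the fused
    mean is the precision-weighted average of the expert means.
    """
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    fused_var = 1.0 / precisions.sum(axis=0)
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var


# Illustrative 1-D example: a prior, a confident expert, and a very
# uncertain expert whose mean is far from the others.
fused_mean, fused_var = product_of_gaussians(
    means=[[0.0], [1.0], [3.0]],
    variances=[[1.0], [0.5], [100.0]],
)
```

Because each expert is weighted by its precision, the uncertain third expert barely moves the fused mean, and the fused variance is smaller than that of any individual expert.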
4 Experiments
In this section, we present implementation details and an empirical evaluation of our method. For our experiments, we chose to study a common multimodal task: planar pushing with a robotic manipulator using vision, haptic, and proprioceptive data from a camera, a force-torque sensor, and joint encoders, respectively.
Planar pushing involves complex contact dynamics that are difficult to model with vision alone, while the multimodal sensor data produced are highly heterogeneous in dimension and quality.
We compared five different models to study the effects of multimodality on this type of task. The models were, in order: 1) vision-only (denoted by V), 2) visual-proprioceptive (denoted by VP), 3) visual-haptic (denoted by VH), 4) visual-haptic-proprioceptive (denoted by VHP), and 5) a commonly used baseline where the latent embeddings of all data modalities are simply concatenated (denoted by VHPC).
4.1 Datasets
We use both a synthetic and a real-world dataset, as shown in Figure 2. We provide more details on the data and the collection procedure below.
4.1.1 Simulated Manipulator Pushing
We used PyBullet [4] to generate data from a simulated manipulator pushing task. We collected grayscale images of size pixels, , with pixel intensities rescaled to be in the range of zero to one. The proprioceptive data consisted of the Cartesian position and velocity of the end-effector, while the haptic data included force and torque measurements along three axes. We combined the haptic and proprioceptive data into a single second modality when both were used, , and otherwise used one of the two, . We argue that this is a reasonable decision since both proprioceptive and haptic data have similar characteristics, as opposed to, for example, image data. The control inputs were the end-effector velocity commands along the planar x and y directions, .
A total of 4,800 trajectories were collected. The image data were recorded at a frequency of approximately 4 Hz and the force-torque and proprioceptive data at a frequency of 120 Hz. We concatenated sequences of measurements in order to keep a consistent number of time steps between modalities. We used 4,320 trajectories for training and held out 480, or 10%, for evaluation. Each trajectory was of length (i.e., images and (concatenated) force-torque and proprioception measurements in total). The number of measurements per discrete time step was based on the respective frequency of each data source. The object being pushed was a single square plate. In order to collect training data, we used a policy with actions drawn from a fixed Gaussian distribution. We used the same initial object position for each trajectory.
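The alignment of the two sampling rates can be sketched as follows: with images at roughly 4 Hz and force-torque or proprioceptive samples at 120 Hz, about 120 / 4 = 30 high-rate samples fall within each image time step. The helper below is a hypothetical illustration in NumPy, not the authors' preprocessing code; grouping the samples into fixed windows of shape (steps, channels) and dropping an incomplete tail are assumptions.

```python
import numpy as np


def group_by_frame(fast_signal, steps_per_frame):
    """Group (T * k, d) high-rate samples into (T, k, d) per-frame windows.

    Any incomplete window at the end of the sequence is dropped.
    """
    fast_signal = np.asarray(fast_signal)
    num_frames = fast_signal.shape[0] // steps_per_frame
    trimmed = fast_signal[: num_frames * steps_per_frame]
    return trimmed.reshape(num_frames, steps_per_frame, fast_signal.shape[1])


# 120 Hz samples against 4 Hz images gives 30 samples per image frame.
steps_per_frame = 120 // 4
force_torque = np.zeros((305, 6))  # hypothetical: 305 samples, 6 channels
windows = group_by_frame(force_torque, steps_per_frame)
```

Each window can then be paired with the image from the same time step, so that all modalities share one discrete time index.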
4.1.2 MIT Pushing
The MIT pushing dataset [42] consists of high-fidelity real robot pushes carried out on various material surfaces and with various object shapes. Our dataset was a subset of trajectories with three different ellipse-shaped plates and four different surface materials (Delrin, plywood, ABS, and polyurethane). We followed the experimental protocol of the authors in [25].
We used the code provided by previous work [22] to preprocess the data and to artificially render the respective images of the trajectories (since no image data were collected as part of the original dataset). In Figure 3, we show four examples of rendered images based on the real-world pushing data (i.e., object pose, end-effector pose, and force-torque data). We note that the rendered images are not completely realistic (e.g., the images are not occluded and only the manipulator’s tip is rendered). However, for our purposes, the dataset provided an adequate starting point for preliminary experiments and comparisons.
We downsampled the images to pixels and transformed them to grayscale, , with pixel intensities rescaled to be in the range of zero to one. The proprioceptive data consisted of the x and y Cartesian coordinates of the end-effector. The haptic data consisted of force measurements along the x and y (in-plane) axes and the torque measurements about the z axis. We combined the haptic and proprioceptive data to form a single second modality, , if both were used, and otherwise used one or the other, or . The control inputs were position commands along the x and y axes, . The images were recorded at a frequency of Hz and the force and proprioceptive data were recorded at a frequency of Hz. We used a subset of 2,332 trajectories out of the total dataset (2,099 trajectories for training and 233, or approximately 10%, held out for evaluation). Each trajectory was of length (i.e., images and (concatenated) force-torque and proprioception measurements in total). The data were collected using several pre-planned pushes with varying velocities, accelerations, and contact points and angles.
4.2 Network Architecture and Training
We parameterized our models with neural networks. For the image encoder, we used a fully convolutional neural network based on the architecture of [10]. The respective decoder was a matching deconvolutional network. For the proprioceptive and force-torque data, we used a simple 1D convolutional architecture for the encoder and a 1D deconvolutional network for the decoder.
Our transition function, or dynamics model, was parameterized as a single-layer GRU network [3] with 256 units that produced linear transition matrices (i.e., , where and are outputs of the network), as done in previous work [18, 7, 26]. During training, we sampled minibatches of 32 trajectories. We applied weight normalization [34] to all of the network layers except for the GRU. We used ReLU activation functions for all of the networks. Our latent space for all experiments consisted of 16-dimensional Gaussians with diagonal covariance matrices. We applied the Adam optimizer [20] with a learning rate of 0.0003 and a gradient clipping norm of 0.5 for the GRU network.
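To make the idea of linear transition matrices concrete, the sketch below propagates a Gaussian latent state through one locally linear step. It assumes a transition of the form z' = A z + B u + w with Gaussian noise w of covariance Q; this form, the additive noise term, and the names `A`, `B`, and `Q` are assumptions for illustration. In the setup described above such matrices would be produced by the GRU at each step, whereas here they are fixed hypothetical arrays.

```python
import numpy as np


def predict_step(mean, cov, A, B, u, Q):
    """Propagate N(mean, cov) through z' = A z + B u + w, with w ~ N(0, Q)."""
    next_mean = A @ mean + B @ u
    next_cov = A @ cov @ A.T + Q
    return next_mean, next_cov


# Hypothetical 2-D latent state and 1-D control input.
A = 0.5 * np.eye(2)            # stand-in for a network-produced matrix
B = np.array([[1.0], [1.0]])   # stand-in for a network-produced matrix
Q = 0.1 * np.eye(2)            # assumed process noise covariance
mean, cov = np.array([2.0, 4.0]), np.eye(2)
next_mean, next_cov = predict_step(mean, cov, A, B, np.array([1.0]), Q)
```

Applying such a step repeatedly with known controls yields multi-step predictions entirely in the latent space, without decoding intermediate observations.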
4.3 Image Prediction Experiments
Figure 4: Prediction quality over multiple prediction horizons for both datasets. We compare each model to the baseline vision-only (V) model. We plot the mean values of the SE, SSIM, and PSNR with one standard deviation shaded.
In order to compare the models, we first evaluated the quality of the generated image sequences relative to the known ground-truth images as a proxy for prediction quality. We began by encoding an initial amount of data into the latent space and then predicted future latent states with our learned dynamics models and known control inputs. Given these predicted states, we then decoded and generated future predicted images. We computed the squared error (SE) in terms of pixels, structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR) between the ground-truth images and predicted images. Accurate predictions translate into a lower SE, higher SSIM, and higher PSNR.
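Two of these image metrics are simple enough to sketch directly. The NumPy functions below assume images with intensities in the range of zero to one and compute the SE as a sum over pixels; both the summation convention and the function names are assumptions for illustration. SSIM is more involved and is usually taken from an image-processing library (for example, `structural_similarity` in scikit-image) rather than re-implemented.

```python
import numpy as np


def squared_error(truth, pred):
    """Sum of squared pixel differences for one image pair."""
    diff = np.asarray(truth, dtype=float) - np.asarray(pred, dtype=float)
    return float(np.sum(diff ** 2))


def psnr(truth, pred, max_value=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, max_value]."""
    diff = np.asarray(truth, dtype=float) - np.asarray(pred, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Illustrative 4x4 "images": every predicted pixel is off by 0.1.
truth = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
se, score = squared_error(truth, pred), psnr(truth, pred)
```

A lower SE and a higher PSNR both indicate predictions that are closer to the ground-truth image.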
For the synthetic data, we first conditioned on four frames and predicted the next 11 frames in the sequence. Figure 4(a) is a visualization of the mean score, with one standard deviation shaded, based on all of the held-out test data. For clear comparisons, we overlay each multimodal model (i.e., VHP, VP, and VH) on top of the baseline vision-only model (i.e., V).
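Table 1: Image prediction scores on the synthetic dataset, averaged over the entire prediction horizon.
| Model | RMSE (↓) | SSIM (↑) | PSNR (↑) |
|---|---|---|---|
| V | 3.243 | 0.955 ± 0.04 | 28.639 ± 5.84 |
| VP | 1.758 | 0.982 ± 0.02 | 33.935 ± 5.93 |
| VH | 2.956 | 0.961 ± 0.04 | 29.684 ± 5.70 |
| VHPC | 2.911 | 0.962 ± 0.04 | 29.700 ± 5.39 |
| VHP | 1.749 | 0.983 ± 0.02 | 34.202 ± 6.19 |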
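Table 2: Image prediction scores on the MIT pushing dataset, averaged over the entire prediction horizon.
| Model | RMSE (↓) | SSIM (↑) | PSNR (↑) |
|---|---|---|---|
| V | 1.917 | 0.907 ± 0.02 | 30.806 ± 1.66 |
| VP | 1.850 | 0.923 ± 0.03 | 31.395 ± 2.28 |
| VH | 1.839 | 0.922 ± 0.03 | 31.298 ± 2.03 |
| VHPC | 1.846 | 0.923 ± 0.03 | 31.333 ± 2.12 |
| VHP | 1.734 | 0.927 ± 0.02 | 31.741 ± 1.84 |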
Additionally, we compile the average score over the entire predicted horizon and summarize the results in Table 1. We also include the baseline results of simply concatenating all modalities, labelled as VHPC.
Overall, the VHP model performed the best for all three metrics when averaged over all prediction lengths, but it was closely followed by the VP model. All multimodal models outperformed the baseline vision-only model, although the improvement in the case of the VH model was marginal. We hypothesize that this result may be due to the quality of the simulated force-torque sensor data and the simulated contact dynamics. In this case, proprioception was the most effective second modality, as demonstrated by the performance of the VP model. However, using vision, haptic, and proprioceptive data together with the VHP model led to the best results.
Notably, simply concatenating each modality (i.e., VHPC) yielded poorer performance when compared to our product of experts approach (i.e., VHP). The product of experts appears to have an appealing ‘filtering’ inductive bias for prediction problems. The product of experts formulation enables decisions to be made about when to use each modality at each time step based on the respective uncertainties. Further, each expert can selectively focus on a few dimensions without having to cover the full dimensionality of the state. Finally, a product of experts produces a sharper final distribution than the individual expert models [14]. This idea is well known in the context of digit image generation: one low-resolution model can generate the approximate overall shape of the digit while other local models can refine segments of the stroke with the correct fine structure [14].
Figure 5 is a visualization of a selected rollout or prediction that demonstrates a typical failure mode of the vision-only model. As shown by the later frames generated by the vision-only model in the red box outline, the overall motion of the object is almost correct, but the finer-scale changes (e.g., the exact amount of rotation) are not well captured. Our VHP model, however, was able to capture these smaller changes.
We ran similar image prediction experiments for the MIT pushing dataset. In this case, we first conditioned on five frames and predicted the next 37 frames in the sequence. As shown by Figure 4(b), both the VP and VH models performed slightly better than the full VHP model during the early time steps according to all three metrics (SE, SSIM, and PSNR). However, over a longer prediction horizon the VHP model was clearly superior. For the real-world data, unlike in the synthetic case, the VH model significantly outperformed the baseline V model. Table 2 lists the average scores over the entire prediction horizon. Consistent with the previous synthetic experiment, simply concatenating the modalities (VHPC) led to poorer performance when compared to the product of experts model (VHP). Each multimodal model performed better than the baseline vision-only model. We visualize another failure mode of the vision-only model in Figure 6. In this case, we observe that the vision-only model failed to capture the more subtle dynamics of the object, which ended up angled and tilted clockwise; the model defaulted to outputting a blurry average at the approximate location of the object. On the other hand, our VHP model was able to produce a relatively crisp and accurate prediction of the object pose.
4.4 Regression Experiments
We further evaluated the predictive capability of the models with a downstream regression experiment that measured how well the model was able to capture and predict the underlying position of the object being pushed. As is commonly done in the self-supervised representation learning literature [9, 23], we trained separate regressors on top of the frozen representations and dynamics models to do this.
We first encoded an initial latent state (i.e., a filtered state) from a set of observations . Then, using our learned dynamics and known future controls , we predicted future latent states (i.e., predicted states). Using the predicted states as inputs, we regressed the ground-truth positions of the object, while keeping the weights of all of our previously learned networks frozen. Poor regression results would indicate lower correlation between the predicted latent states and the object’s ground-truth position. This in turn would imply that the learned latent representation did not encode all of the necessary information to represent the state of the object, or that the learned dynamics of the object were inaccurate, or both. We kept the same prediction horizons from the image prediction experiments in Section 4.3.
For our results with synthetic data, we found ordinary least squares to be adequate for regression. Panel (a) of the corresponding figure shows that the mean and variance of the absolute translation errors from the regression are lowest for the full VHP model. We (again) observe that the inclusion of haptic data alone (VH) as an extra modality in the model did not lead to large improvements. On the other hand, using proprioception (VP) did lead to significant improvements. The inclusion of both proprioceptive and haptic data (VHP) outperformed using proprioceptive data alone (VP).
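The ordinary least squares readout used for the synthetic data can be sketched as follows. This is a minimal NumPy illustration that fits a linear map, with a bias term, from frozen latent states to object positions; the function names, the bias term, and the random stand-in data are assumptions, not the authors' evaluation code.

```python
import numpy as np


def fit_linear_readout(latents, targets):
    """Ordinary least squares from latent states to targets, with a bias."""
    design = np.hstack([latents, np.ones((latents.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def predict_readout(latents, weights):
    """Apply a fitted readout to new latent states."""
    design = np.hstack([latents, np.ones((latents.shape[0], 1))])
    return design @ weights


# Stand-in data: 200 predicted 16-D latent states and 2-D object positions.
rng = np.random.default_rng(0)
latents = rng.standard_normal((200, 16))
true_weights = rng.standard_normal((17, 2))
positions = np.hstack([latents, np.ones((200, 1))]) @ true_weights

weights = fit_linear_readout(latents, positions)
errors = np.linalg.norm(predict_readout(latents, weights) - positions, axis=1)
```

Only the readout weights are fitted; the encoder and dynamics model that produce the latent states are left untouched, so the translation errors reflect what the frozen representation already contains.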
For the MIT pushing dataset, we trained a separate simple neural network with a single hidden layer of size 50 to regress the position of the object. The real-world results generally match our synthetic results, as shown in panel (b) of the same figure. Including any sort of multimodal data reduced the mean and variance of the absolute translation errors when compared to the results from the vision-only model.
As an additional benchmark with the MIT dataset, we compared the accuracy of our regressed values of the position of the object to those computed by a related approach on differentiable filtering [25]. Using the same dataset, the authors trained a variety of differentiable filters to directly regress the positions of the objects from the multimodal sensor data in a supervised manner using ground-truth labels. Because the multimodal differentiable filters were trained with ground-truth annotations of the objects’ locations, while our model is completely self-supervised, their results provide a reasonable estimate of an upper bound on the expected performance. Table 3 demonstrates that our regression results, from both the predicted and filtered states, are comparable.
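Table 3: Object position regression error on the MIT pushing dataset for the supervised differentiable filters of [25] and for our self-supervised model.
| Model | RMSE [cm] (↓) |
|---|---|
| EKF (Sup. w/ GT, filt.) | 1.33 |
| PF (Sup. w/ GT, filt.) | 1.14 |
| LSTM (Sup. w/ GT, filt.) | 2.32 |
| VHP (Self-sup., filt.) (Ours) | 1.80 |
| VHP (Self-sup., pred.) (Ours) | 1.96 |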
5 Conclusions & Future Work
We have presented a self-supervised sequential latent variable model for multimodal time series data. Our probabilistic formulation extends existing latent dynamics models, a key backbone of many methods in vision-based control and model-based reinforcement learning, to multimodal sensor data. We provided a case study of a manipulator pushing task where visual, haptic, and proprioceptive data streams were available. For latent dynamics in particular, we demonstrated that a principled probabilistic formulation for fusing sequential multimodal data performed significantly better than the common baseline of directly concatenating each modality in the latent space. Additionally, our learned self-supervised approach was shown to be competitive with an existing multimodal differentiable filtering method that relies on supervised ground-truth labels. As interesting avenues for future work, we plan to further investigate the properties of the learned uncertainties used for multimodal fusion and to use our model for control policy learning.
References
 [1] (2019) Multimodal machine learning: a survey and taxonomy. PAMI 41 (2), pp. 423–443.
 [2] (2018) Robust locally-linear controllable embedding. In AISTATS, Vol. 84, pp. 1751–1759.
 [3] (2014) On the properties of neural machine translation: encoder–decoder approaches. In SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111.
 [4] (2016–2019) PyBullet, a python module for physics simulation for games, robotics and machine learning. Note: http://pybullet.org
 [5] (2015) A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys (CSUR) 47 (3), pp. 1–36.
 [6] (2014) Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In NIPS’14 Modern Nonparametrics 3: Automating the Learning Pipeline Workshop.
 [7] (2017) A disentangled recognition and nonlinear dynamics model for unsupervised learning. In NIPS, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 3601–3610.
 [8] (2020) Swoosh! Rattle! Thump! – Actions that sound. In RSS.
 [9] (2020) Bootstrap your own latent - a new approach to self-supervised learning. In NIPS, Vol. 33, pp. 21271–21284.
 [10] (2018) Recurrent world models facilitate policy evolution. In NIPS, pp. 2450–2462.
 [11] (2016) Backprop KF: learning discriminative deterministic state estimators. In NIPS, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29, pp. 4376–4384.
 [12] (2019) Learning latent dynamics for planning from pixels. In ICML, pp. 2555–2565.
 [13] (2021) Mastering Atari with discrete world models. In ICLR.
 [14] (2002) Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800. https://doi.org/10.1162/089976602760128018
 [15] (2021) Learning from demonstration with weakly supervised disentanglement. In ICLR.
 [16] (2019) Robot motion planning in learned latent spaces. RA-L 4 (3), pp. 2407–2414.
 [17] (2016) Composing graphical models with neural networks for structured representations and fast inference. In NIPS, Vol. 29, pp. 2946–2954.
 [18] (2017-04) Deep variational Bayes filters: unsupervised learning of state space models from raw data. In ICLR.
 [19] (2014) Learning image embeddings using convolutional neural networks for improved multimodal semantics. In Proceedings of the 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 36–45.
 [20] (2015) Adam: A method for stochastic optimization. In ICLR, Y. Bengio and Y. LeCun (Eds.).
 [21] (2014) Auto-encoding variational Bayes. In ICLR, Y. Bengio and Y. LeCun (Eds.).
 [22] (2020-09) Combining learned and analytical models for predicting action effects from sensory data. IJRR.
 [23] (2019) Do better ImageNet models transfer better? In CVPR, pp. 2661–2671.
 [24] (2017) Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109.
 [25] (2020-10) Multimodal sensor fusion with differentiable filters. In IROS.
 [26] (2020-10) Heteroscedastic uncertainty for robust generative latent dynamics. RA-L 5 (4), pp. 6654–6661.
 [27] (2020) Latent space roadmap for visual action planning of deformable and rigid object manipulation. In IROS.
 [28] (2011) Multimodal deep learning. In ICML.
 [29] (2019) PyTorch: An imperative style, high-performance deep learning library. In NIPS, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035.
 [30] (2021) Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control (L4DC), pp. 1154–1168.
 [31] (2013) Grounding action descriptions in videos. Trans. Assoc. Computational Linguistics 1, pp. 25–36.
 [32] (2021) Learning intuitive physics with multimodal generative models. In AAAI.
 [33] (2014) Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286.
 [34] (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In NIPS.
 [35] (2015) Multimodal deep autoencoders for control of a mobile robot. In Proc. Australasian Conf. Robotics and Automation, H. Li and J. Kim (Eds.), pp. 1–10.
 [36] (2014-01) Multimodal learning with deep Boltzmann machines. JMLR 15 (1), pp. 2949–2980.
 [37] (2016) Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891.
 [38] (2018) Generative models of visually grounded imagination. In ICLR.
 [39] (2015) From pixels to torques: policy learning with deep dynamical models. In ICML’15 Deep Learning Workshop.
 [40] (2015) Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, pp. 2746–2754.
 [41] (2018) Multimodal generative models for scalable weakly-supervised learning. In NIPS, Vol. 31.
 [42] (2016) More than a million ways to be pushed. A high-fidelity experimental dataset of planar pushing. In IROS, pp. 30–37.
 [43] (2019) SOLAR: deep structured representations for model-based reinforcement learning. In ICML, pp. 7444–7453.
 [44] (2021) VMLoc: variational fusion for learning-based multimodal camera localization. In AAAI, Vol. 35, pp. 6165–6173.