
Learning Sequential Latent Variable Models from Multimodal Time Series Data

by Oliver Limoyo, et al.

Sequential modelling of high-dimensional data is an important problem that appears in many domains including model-based reinforcement learning and dynamics identification for control. Latent variable models applied to sequential data (i.e., latent dynamics models) have been shown to be a particularly effective probabilistic approach to solve this problem, especially when dealing with images. However, in many application areas (e.g., robotics), information from multiple sensing modalities is available – existing latent dynamics methods have not yet been extended to effectively make use of such multimodal sequential data. Multimodal sensor streams can be correlated in a useful manner and often contain complementary information across modalities. In this work, we present a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data and the respective dynamics. Using synthetic and real-world datasets from a multimodal robotic planar pushing task, we demonstrate that our approach leads to significant improvements in prediction and representation quality. Furthermore, we compare to the common learning baseline of concatenating each modality in the latent space and show that our principled probabilistic formulation performs better. Finally, despite being fully self-supervised, we demonstrate that our method is nearly as effective as an existing supervised approach that relies on ground truth labels.



1 Introduction

Sequential modelling of high-dimensional data is a challenging problem that is encountered in many domains such as visual model-based reinforcement learning [12] and image-based dynamics identification for control [40]. Recently, there has been broad interest in studying the problem through the lens of generative latent variable models. These methods embed high-dimensional observations into a lower-dimensional representation space, or latent space, often by means of a variational autoencoder (VAE) [21, 33], where the (latent) dynamics can be identified in a self-supervised manner. Importantly, latent dynamics models form the backbone of many recent methods for image-based control, reinforcement learning [13, 30], and motion planning with complex robotic systems [16, 27]. When dealing with images as a single modality, this approach has been shown to be particularly effective in various existing works [43, 39, 40, 18, 7]. However, in many robotic systems, images are accompanied by data from multiple additional sensing modalities with varying characteristics, that is, as multimodal data. These multimodal data may contain complementary information, patterns, and useful statistical correlations across modalities. Prior research has demonstrated the advantages of capturing multimodal information in the contexts of classification and regression of multimedia data [28, 1], learning from demonstration [15], localization [44], and control policy learning [35]. Conversely, the use of multimodal data has not yet been studied in the context of learned latent dynamics models.

In this paper, we present a novel probabilistic framework for learning latent dynamics from multimodal time series data. Inspired by the multimodal variational autoencoder (MVAE) architecture [41], we employ a product of experts [14] to encode all data modalities into a shared probabilistic latent representation while jointly learning the dynamics in a self-supervised manner with a recurrent neural network (RNN).

For validation and demonstration purposes, we use simulated and real-world datasets collected from a robotic manipulator pushing task involving three heterogeneous data modalities: images from a camera, force and torque readings from a force-torque sensor, and proprioceptive readings from the manipulator encoders. Figure 1 provides a visual summary of our approach.

Figure 1: We learn a sequential latent variable model for multimodal time series data (e.g., image, proprioceptive, and haptic data). Each modality is separately encoded into a probabilistic latent distribution, in this case a Gaussian distribution parametrized by $\mu$ and $\sigma$, and then combined into a joint representation using a product of experts. Simultaneously, we learn the respective (latent) dynamics based on the control inputs $u$.

Our main contributions are as follows:

  1. a formulation of a variational inference model for self-supervised training of latent dynamics models specifically with multimodal time series data;

  2. experimental results demonstrating that our method of incorporating multimodal data in the latent dynamics framework has superior representative and predictive capability when compared to the baseline of simply concatenating each modality [19, 5, 31, 8];

  3. experimental results demonstrating that our self-supervised method achieves results comparable to an existing supervised method [25], which requires ground-truth labels, when used to capture task-relevant state information and dynamics; and

  4. an open-source implementation of our method and experiments in PyTorch.


2 Related Work

In this section, we survey papers related to the modelling of high-dimensional sequential data with latent variable models and learning with multimodal data in machine learning and robotics. We pay particular attention to vision as it is one of the most commonly-encountered modalities across many domains.

2.0.1 Latent Dynamics

Early deep dynamical models [39] used the bottleneck of a standard autoencoder as the compressed state from which the dynamics were learned in a tractable manner. Probabilistic extensions using the VAE were studied later [40, 2]. A significant amount of work has been carried out with sequential image data in the context of learned probabilistic state space models [24, 18] and differentiable filtering [11]. These approaches attempt to combine the structure and interpretability of probabilistic graphical models with the flexibility and representational capacity of neural networks. Probabilistic graphical models have been used as a way to impose structure for fast and exact inference [17, 7] and to filter out novel or out-of-distribution images using a notion of uncertainty that comes from generative models [26]. The closely-related topic of image-based transition models, or world models, has been studied in the reinforcement learning literature [12, 10]. Our work is an extension to these latent dynamics models that makes them amenable to multimodal data.

2.0.2 Multimodal Machine Learning

Machine learning has been applied to the problem of learning representations and patterns of multimodal data for various downstream tasks. A good summary of the existing literature focused on applications involving multimedia data (e.g., video, text, and audio) is provided in [1]. Probabilistic methods have also been applied to model the joint and conditional distributions of non-sequential multimodal data. Examples of these include the Restricted Boltzmann Machine (RBM) [28, 36] and the VAE [41, 37, 38]. Our work is most similar to the latter of these two approaches. We build upon the MVAE [41], but, critically, we apply and extend the framework to the sequential setting so that it can capture the latent dynamics of multimodal data.

2.0.3 Multimodal Learning for Robotics

Our work is most similar in nature to [8], where the authors use audio data to augment a deterministic state-based forward (or dynamics) model. However, in [8], the audio data are taken from a previous random interaction and do not provide causally-related information to the forward model; the audio data are simply used to augment the representation. In contrast, we directly model the dynamics of the observed multimodal data. We also do not assume a relaxed deterministic state-based setting and instead learn a probabilistic representation from raw multimodal data directly (as opposed to dealing with difficult-to-acquire state labels).

Similar to the approach presented in this paper, other groups have investigated the use of learned differentiable filters with multimodal measurement models of vision, proprioceptive, and haptic data, relying on ground-truth annotations [25]. However, in many cases ground-truth labels are expensive or impossible to acquire, which may hinder the scalability of such methods. We leverage recent work in variational inference and devise a self-supervised generative approach to bypass this limitation. We do so by maximizing a proper lower bound of the marginal likelihood of the data itself.

Other works have also leveraged the MVAE architecture for learned localization with multimodal data [44]. Closer to our work, the authors of [32] demonstrate a technique to learn a notion of ‘intuitive physics’ in a self-supervised manner by applying the MVAE architecture as a generative model of multimodal sensor measurements. Specifically, future sensor measurements resulting from interaction with objects are decoded based on an encoding of the current sensor measurements. In our work, as opposed to directly decoding future transitions, we learn a dynamics model based on the compressed latent space; we therefore have the choice to predict while remaining in a low-dimensional space and without having to decode, which saves a significant amount of computation and memory.

3 Methodology

We begin by presenting a baseline sequential latent variable model in Section 3.1 and then demonstrate how to extend this framework to multimodal data in Section 3.2.

3.1 Sequential Latent Variable Models

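We consider a sequence of observations of a single modality, $x_{1:T} = (x_1, \dots, x_T)$, with respective control inputs $u_{1:T} = (u_1, \dots, u_T)$. We then introduce latent variables $z_{1:T} = (z_1, \dots, z_T)$ to create a joint distribution $p_\theta(x_{1:T}, z_{1:T} \mid u_{1:T})$, where $\theta$ are the learnable parameters of our distributions, which are parametrized by neural networks. We factorize the joint distribution of the generative process as $p_\theta(x_{1:T}, z_{1:T} \mid u_{1:T}) = p_\theta(x_{1:T} \mid z_{1:T})\, p_\theta(z_{1:T} \mid u_{1:T})$, where:

$$p_\theta(z_{1:T} \mid u_{1:T}) = p(z_1) \prod_{t=2}^{T} p_\theta(z_t \mid z_{t-1}, u_{t-1}), \tag{1}$$

$$p_\theta(x_{1:T} \mid z_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid z_t). \tag{2}$$

We model the latent dynamics with the distribution $p_\theta(z_t \mid z_{t-1}, u_{t-1})$ and the observation or measurement model with $p_\theta(x_t \mid z_t)$. The distribution $p(z_1)$ is an arbitrary initial distribution with high uncertainty. The goal of learning is to maximize the marginal likelihood of the data or the evidence, given for a single sequence as

$$p_\theta(x_{1:T} \mid u_{1:T}) = \int p_\theta(x_{1:T} \mid z_{1:T})\, p_\theta(z_{1:T} \mid u_{1:T})\, dz_{1:T}, \tag{3}$$

with respect to the parameters $\theta$. Unfortunately, in the general case with this model, the posterior distribution used for inference, $p_\theta(z_{1:T} \mid x_{1:T}, u_{1:T})$, is intractable. A common solution from recent work in variational inference [21, 33] is to introduce a recognition or inference model $q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})$ with parameters $\phi$ to approximate the intractable posterior. This leads to the following lower bound on the marginal log-likelihood or the evidence lower bound (ELBO):

$$\log p_\theta(x_{1:T} \mid u_{1:T}) \geq \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})}\left[\log p_\theta(x_{1:T} \mid z_{1:T})\right] - \mathrm{KL}\left(q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T}) \,\|\, p_\theta(z_{1:T} \mid u_{1:T})\right). \tag{4}$$

Maximizing this lower bound can be shown to be equivalent to minimizing the KL-divergence between the true posterior $p_\theta(z_{1:T} \mid x_{1:T}, u_{1:T})$ and the recognition model $q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})$. The resulting optimization objective, as denoted by Eq. 4, is based on an expectation with respect to the distribution $q_\phi$, which itself is based on the parameters $\phi$. As is typically done, we restrict $q_\phi$ to be a Gaussian variational approximation. This enables us to use stochastic gradient descent (i.e., using Monte Carlo estimates of the gradient) via the reparameterization trick [21] to optimize the lower bound. The specific choice for the factorization of $q_\phi$ varies depending on the application (i.e., prediction, smoothing, or filtering). Given our intended application of prediction, we choose to only use causal (i.e., current and past) information for inference,

$$q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T}) = q_\phi(z_1 \mid x_1) \prod_{t=2}^{T} q_\phi(z_t \mid z_{t-1}, x_t, u_{t-1}). \tag{5}$$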

3.2 Multimodal Sequential Latent Variable Models

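We now extend and generalize the sequential latent variable model defined above to the multimodal case. We consider $N$ sequences of separate observations or modalities, $x^{1:N}_{1:T} = \{x^{1}_{1:T}, \dots, x^{N}_{1:T}\}$, where we assume that each sequence is of equivalent length $T$: $x^{n}_{1:T} = (x^{n}_1, \dots, x^{n}_T)$. As done in the previous section, we include the respective control inputs $u_{1:T}$ and again introduce a set of latent variables $z_{1:T}$ as a joint lower-dimensional latent space containing some underlying dynamics of interest. The final joint distribution factorizes as $p_\theta(x^{1:N}_{1:T}, z_{1:T} \mid u_{1:T}) = p_\theta(x^{1:N}_{1:T} \mid z_{1:T})\, p_\theta(z_{1:T} \mid u_{1:T})$. We choose to define the generative process as follows:

$$p_\theta(x^{1:N}_{1:T} \mid z_{1:T}) = \prod_{t=1}^{T} \prod_{n=1}^{N} p_\theta(x^{n}_t \mid z_t), \tag{6}$$

and with $p_\theta(z_{1:T} \mid u_{1:T})$ remaining the same as shown in Eq. 1. The marginal likelihood is then

$$p_\theta(x^{1:N}_{1:T} \mid u_{1:T}) = \int p_\theta(x^{1:N}_{1:T} \mid z_{1:T})\, p_\theta(z_{1:T} \mid u_{1:T})\, dz_{1:T}, \tag{7}$$

and the respective ELBO, for a single sequence, is then:

$$\log p_\theta(x^{1:N}_{1:T} \mid u_{1:T}) \geq \mathbb{E}_{q_\phi(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T})}\left[\sum_{t=1}^{T} \sum_{n=1}^{N} \log p_\theta(x^{n}_t \mid z_t)\right] - \mathrm{KL}\left(q_\phi(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T}) \,\|\, p_\theta(z_{1:T} \mid u_{1:T})\right). \tag{8}$$

In order to decide on the factorization of $q_\phi(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T})$, we draw inspiration from the MVAE architecture [41], and base our inference network on the structure of the true multimodal posterior $p_\theta(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T})$:

$$\begin{aligned} p_\theta(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T}) &= \frac{p_\theta(z_{1:T} \mid u_{1:T})}{p_\theta(x^{1:N}_{1:T} \mid u_{1:T})} \prod_{n=1}^{N} p_\theta(x^{n}_{1:T} \mid z_{1:T}) \\ &= \frac{p_\theta(z_{1:T} \mid u_{1:T})}{p_\theta(x^{1:N}_{1:T} \mid u_{1:T})} \prod_{n=1}^{N} \frac{p_\theta(z_{1:T} \mid x^{n}_{1:T}, u_{1:T})\, p_\theta(x^{n}_{1:T} \mid u_{1:T})}{p_\theta(z_{1:T} \mid u_{1:T})} \\ &\propto \frac{\prod_{n=1}^{N} p_\theta(z_{1:T} \mid x^{n}_{1:T}, u_{1:T})}{p_\theta(z_{1:T} \mid u_{1:T})^{N-1}}. \end{aligned} \tag{9}$$

Based on the last term in Eq. 9, the final factorization of the joint posterior is then a quotient between a product of the individual modality-specific posteriors and the prior. Accordingly, we choose our inference model to be $q_\phi(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T}) \propto \prod_{n=1}^{N} q_\phi(z_{1:T} \mid x^{n}_{1:T}, u_{1:T}) \,/\, p_\theta(z_{1:T} \mid u_{1:T})^{N-1}$. We also use the same representation reformulation trick from the MVAE [41] and set $q_\phi(z_{1:T} \mid x^{n}_{1:T}, u_{1:T}) = \tilde{q}_\phi(z_{1:T} \mid x^{n}_{1:T})\, p_\theta(z_{1:T} \mid u_{1:T})$ in order to produce a simpler and numerically more stable product of experts,

$$q_\phi(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T}) \propto p_\theta(z_{1:T} \mid u_{1:T}) \prod_{n=1}^{N} \tilde{q}_\phi(z_{1:T} \mid x^{n}_{1:T}), \tag{10}$$

where $q_\phi(z_{1:T} \mid x^{n}_{1:T}, u_{1:T})$ would be equivalent to the standard single modality posterior shown in Eq. 5. Finally, we can further factorize the posterior in Eq. 10 into a more intuitive form for our sequential setting:

$$q_\phi(z_{1:T} \mid x^{1:N}_{1:T}, u_{1:T}) \propto \prod_{t=1}^{T} \left[ p_\theta(z_t \mid z_{t-1}, u_{t-1}) \prod_{n=1}^{N} \tilde{q}_\phi(z_t \mid x^{n}_t) \right], \tag{11}$$

where the transition factor at $t = 1$ is the initial distribution $p(z_1)$ from Eq. 1.

Our factorization reveals that, at every time step of Eq. 11, each data modality is first separately encoded into a Gaussian distribution by its own inference model. A product is then taken of the modality-specific distributions and the prior, which is also the transition distribution (i.e., the dynamics) of our latent space. Interestingly, we recover a similar form to the commonly-used recurrent state space model [12], where the transition distribution is included in the posterior. The product of distributions at each time step is not generally solvable in closed form. However, by assuming Gaussian distributions for the prior dynamics and each modality-specific inference distribution, we end up with a final product of Gaussians for which a closed-form analytical solution does exist. Conveniently, a product of Gaussians is also itself a Gaussian [6].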
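The closed-form product of Gaussians can be sketched as follows. For diagonal Gaussians, the fused precision (inverse variance) is the sum of the experts' precisions, and the fused mean is the precision-weighted average of the experts' means. The function name and the list-based representation are illustrative assumptions; in the setting above, the first "expert" would be the transition prior and the remaining ones the modality-specific encoders.

```python
def product_of_gaussians(means, variances):
    """Fuse diagonal Gaussian experts N(means[i], diag(variances[i])).

    Returns the mean and variance of the (renormalized) product density.
    """
    dim = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dim):
        # Precisions add; means are averaged with precision weights.
        precision = sum(1.0 / var[d] for var in variances)
        weighted_sum = sum(mu[d] / var[d] for mu, var in zip(means, variances))
        fused_var.append(1.0 / precision)
        fused_mean.append(weighted_sum / precision)
    return fused_mean, fused_var


# Hypothetical 1-D example: a broad prior, a confident expert, and a vague expert.
prior = ([0.0], [1.0])
confident_expert = ([1.0], [0.5])
vague_expert = ([3.0], [2.0])
mean, var = product_of_gaussians(
    [prior[0], confident_expert[0], vague_expert[0]],
    [prior[1], confident_expert[1], vague_expert[1]],
)
```

In this example the fused mean is 1.0 and the fused variance is 1/3.5, which is smaller than the variance of any single expert: a low-variance expert dominates the fused estimate, while a high-variance expert contributes little.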

4 Experiments

In this section, we present implementation details and an empirical evaluation of our method. For our experiments, we chose to study a common multimodal task: planar pushing with a robotic manipulator using vision, haptic, and proprioceptive data from a camera, a force-torque sensor, and joint encoders, respectively.

(a) PyBullet pushing simulation.
(b) Real-world pushing task.
Figure 2: Environments from which the datasets were collected. Real-world setup image from existing MIT pushing dataset [42].

Planar pushing involves complex contact dynamics that are difficult to model with vision alone, while the multimodal sensor data produced are highly heterogeneous in dimension and quality.

We compared five different models to study the effects of multimodality on this type of task. The models were, in order: 1) vision-only (denoted by V), 2) visual-proprioceptive (denoted by VP), 3) visual-haptic (denoted by VH), 4) visual-haptic-proprioceptive (denoted by VHP), and 5) a commonly-used baseline where the latent embedding of each data modality is simply concatenated (denoted by VHP-C).

4.1 Datasets

We use both a synthetic and a real-world dataset as shown in Figure 2. We provide more details on the data and the collection procedure below.

4.1.1 Simulated Manipulator Pushing

We used PyBullet [4] to generate data from a simulated manipulator pushing task. We collected grayscale images of size pixels, with pixel intensities rescaled to be in the range of zero to one. The proprioceptive data consisted of the Cartesian position and velocity of the end-effector, while the haptic data included force and torque measurements along three axes. We combined the haptic and proprioceptive data into a single second modality when both were used, and otherwise used one of the two. We argue that this is a reasonable decision since both proprioceptive and haptic data have similar characteristics, as opposed to, for example, image data. The control inputs were the end-effector velocity commands along the planar $x$ and $y$ directions.

A total of 4,800 trajectories were collected. The image data was recorded at a frequency of approximately 4 Hz and force-torque and proprioceptive data at a frequency of 120 Hz. We concatenated sequences of measurements in order to keep a consistent number of time steps between modalities. We used 4,320 trajectories for training and held out 480, or 10%, for evaluation. Each trajectory was of length (i.e., images and (concatenated) force-torque and proprioception measurements in total). The number of measurements per discrete time step was based on the respective frequency of each data source. The object being pushed was a single square plate. In order to collect training data, we used a policy with actions drawn from a fixed Gaussian distribution. We used the same initial object position for each trajectory.
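The rate-alignment step described above can be sketched as a simple windowing operation: consecutive high-rate measurements are flattened into one vector per image time step. This is an illustrative sketch under the assumption of nominal rates (120 Hz and 4 Hz, taken from the text); since the image rate is only approximate, the actual window size used by the authors may differ. The function name `chunk_per_step` is hypothetical.

```python
def chunk_per_step(samples, high_rate_hz, low_rate_hz):
    """Group high-rate samples into one flattened vector per low-rate time step.

    samples: list of per-measurement feature lists, ordered in time.
    Trailing samples that do not fill a complete window are dropped.
    """
    per_step = int(round(high_rate_hz / low_rate_hz))
    n_steps = len(samples) // per_step
    return [
        [value
         for sample in samples[t * per_step:(t + 1) * per_step]
         for value in sample]
        for t in range(n_steps)
    ]


# Hypothetical stream: 60 scalar force readings covering two image frames.
force_stream = [[float(i)] for i in range(60)]
steps = chunk_per_step(force_stream, high_rate_hz=120.0, low_rate_hz=4.0)
```

Each entry of `steps` then pairs with exactly one image, so all modalities share the same number of time steps.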

(a) Ellipse-1 on Delrin.

(b) Ellipse-2 on plywood.

(c) Ellipse-3 on ABS.

(d) Ellipse-3 on polyurethane.
Figure 3: Downsampled images generated from the real-world MIT pushing dataset using existing tools [22]. Realistic lighting and material textures are also rendered.

4.1.2 MIT Pushing

The MIT pushing dataset [42] consists of high-fidelity real robot pushes carried out on various material surfaces and with various object shapes. Our dataset was a subset of trajectories with three different ellipse-shaped plates and four different surface materials (Delrin, plywood, ABS, and polyurethane). We followed the experimental protocol of the authors in [25].

We used the code provided by previous work [22] to preprocess the data and to artificially render the respective images of the trajectories (since no image data was collected as part of the original dataset). In Figure 3, we show four examples of rendered images based on the real-world pushing data (i.e., object pose, end-effector pose, and force-torque data). We note that the rendered images are not completely realistic (e.g., the images are not occluded and only the manipulator’s tip is rendered). However, for our purposes, the dataset provided an adequate starting point for preliminary experiments and comparisons.

We downsampled the images to pixels and transformed them to grayscale, with pixel intensities rescaled to be in the range of zero to one. The proprioceptive data consisted of the $x$ and $y$ Cartesian coordinates of the end-effector. The haptic data consisted of force measurements along the $x$ and $y$ (in-plane) axes and the torque measurements about the $z$ axis. We combined the haptic and proprioceptive data to form a single second modality if both were used, and otherwise used one or the other. The control inputs were position commands along the $x$ and $y$ axes. The images were recorded at a frequency of Hz and the force and proprioceptive data were recorded at a frequency of Hz. We used a subset of 2,332 trajectories out of the total dataset (2,099 trajectories for training and 233, or approximately 10%, held out for evaluation). Each trajectory was of length (i.e., images and (concatenated) force-torque and proprioception measurements in total). The data were collected using several pre-planned pushes with varying velocities, accelerations, and contact points and angles.

4.2 Network Architecture and Training

We parametrized our models with neural networks. For the image encoder, we used a fully convolutional neural network based on the architecture of [10]. The respective decoder was a matching deconvolutional network. For the proprioceptive and force-torque data, we used a simple 1D convolutional architecture for the encoder and a 1D deconvolutional network for the decoder.

Our transition function, or dynamics model, was parameterized as a single-layer GRU network [3] with 256 units that produced linear transition matrices (i.e., $z_t = A_t z_{t-1} + B_t u_{t-1}$, where $A_t$ and $B_t$ are outputs of the network), as done in previous work [18, 7, 26]. During training, we sampled mini-batches of 32 trajectories. We applied weight normalization [34] to all of the network layers except for the GRU. We used ReLU activation functions for all of the networks. Our latent space for all experiments consisted of 16-dimensional Gaussians with diagonal covariance matrices. We applied the Adam optimizer [20] with a learning rate of 0.0003 and a gradient clipping norm of 0.5 for the GRU network.
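To illustrate what a linear latent transition looks like at prediction time, the sketch below rolls a latent state forward under known controls, assuming the transition takes the locally linear form $z_t = A z_{t-1} + B u_{t-1}$. In the model described above the transition matrices are outputs of the GRU at every step; here they are fixed, hand-picked matrices purely for illustration, and only the mean of the latent state is propagated. All names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def linear_transition(z_prev, u_prev, A, B):
    """One step of z_t = A z_{t-1} + B u_{t-1}."""
    return [az + bu for az, bu in zip(matvec(A, z_prev), matvec(B, u_prev))]


def rollout(z_init, controls, A, B):
    """Predict latent states for a sequence of known controls.

    Returns the list [z_init, z_1, ..., z_K] for K controls. A learned model
    would supply a different (A, B) pair at each step.
    """
    states = [list(z_init)]
    for u in controls:
        states.append(linear_transition(states[-1], u, A, B))
    return states


# Hypothetical 2-D latent space with 2-D planar velocity commands.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5, 0.0], [0.0, 0.5]]
predicted = rollout([0.0, 0.0], [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]], A, B)
```

Because the rollout never leaves the latent space, decoding back to images is only needed when a prediction has to be visualized or scored.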

4.3 Image Prediction Experiments

(a) Results from the synthetic dataset.
(b) Results from the MIT dataset.
Figure 4: Graph of prediction quality over multiple prediction horizons from both datasets. We compare each model to the baseline vision-only (V) model. We plot the mean values of the SE, SSIM and PSNR with one standard deviation shaded.

In order to compare the models, we first evaluated the quality of the generated image sequences relative to the known ground truth images as a proxy for prediction quality. We began by encoding an initial amount of data into the latent space and then predicted future latent states with our learned dynamics models and known control inputs. Given these predicted states, we then decoded and generated future predicted images. We computed the squared error (SE) in terms of pixels, structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR) between the ground truth images and predicted images. Accurate predictions translate into a lower SE, higher SSIM, and higher PSNR.
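Two of the three metrics are simple enough to sketch directly. The snippet below computes the pixel-wise squared error and the PSNR for images stored as flat lists of intensities in the range of zero to one. It is an illustrative sketch rather than the authors' evaluation code; SSIM is omitted because it requires local windowed statistics and is usually taken from an image-processing library.

```python
import math


def squared_error(pred, target):
    """Sum of squared pixel differences between two equally sized images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value is the largest pixel intensity."""
    mse = squared_error(pred, target) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel images: every prediction is off by 0.1.
ground_truth = [0.1, 0.1, 0.1, 0.1]
prediction = [0.0, 0.0, 0.0, 0.0]
se = squared_error(prediction, ground_truth)
score = psnr(prediction, ground_truth)
```

A uniform error of 0.1 on a unit intensity scale corresponds to a mean squared error of 0.01 and hence a PSNR of about 20 dB.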

For the synthetic data, we first conditioned on four frames and predicted the next 11 frames in the sequence. Figure 4(a) is a visualization of the mean score, with one standard deviation shaded, based on all of the held out test data. For clear comparisons, we overlay each multimodal model (i.e., VHP, VP, and VH) on top of the baseline vision-only model (i.e., V).

Model | RMSE (↓) | SSIM (↑) | PSNR (↑)
V | 3.243 | 0.955 ± 0.04 | 28.639 ± 5.84
VP | 1.758 | 0.982 ± 0.02 | 33.935 ± 5.93
VH | 2.956 | 0.961 ± 0.04 | 29.684 ± 5.70
VHP-C | 2.911 | 0.962 ± 0.04 | 29.700 ± 5.39
VHP | 1.749 | 0.983 ± 0.02 | 34.202 ± 6.19
Table 1: Quantitative prediction quality for the synthetic dataset. We compared the predicted images’ quality with their respective ground truth for five models trained with various subsets of modalities. We show one standard deviation for the SSIM and PSNR values and calculate the average score across the entire predicted horizon.

Model | RMSE (↓) | SSIM (↑) | PSNR (↑)
V | 1.917 | 0.907 ± 0.02 | 30.806 ± 1.66
VP | 1.850 | 0.923 ± 0.03 | 31.395 ± 2.28
VH | 1.839 | 0.922 ± 0.03 | 31.298 ± 2.03
VHP-C | 1.846 | 0.923 ± 0.03 | 31.333 ± 2.12
VHP | 1.734 | 0.927 ± 0.02 | 31.741 ± 1.84
Table 2: Quantitative prediction quality with the MIT dataset as described in Table 1.

Additionally, we compile the average score over the entire predicted horizon and summarize the results in Table 1. We also include the baseline results of simply concatenating all modalities, labelled as VHP-C.

Overall, the VHP model performed the best for all three metrics when averaged over all prediction lengths, but it was closely followed by the VP model. All multimodal models outperformed the baseline vision-only model, although the improvement in the case of the VH model was marginal. We hypothesize that this result may be due to the quality of the simulated force-torque sensor data and the simulated contact dynamics. In this case, proprioception was the most effective second modality, as demonstrated by the performance of the VP model. However, using vision, haptic, and proprioceptive data together with the VHP model led to the best results.

Notably, simply concatenating each modality (i.e., VHP-C) yielded poorer performance when compared to our product of experts approach (i.e., VHP). The product of experts appears to have an appealing ‘filtering’ inductive bias for prediction problems. The product of experts formulation enables decisions to be made about when to use each modality at each time step based on the respective uncertainties. Further, each expert can selectively focus on a few dimensions without having to cover the full dimensionality of the state. Finally, a product of experts produces a sharper final distribution than the individual expert models [14]. This idea is well known in the context of digit image generation: one low-resolution model can generate the approximate overall shape of the digit while other local models can refine segments of the stroke with the correct fine structure [14].

Figure 5: A prediction example that demonstrates a failure mode of the vision-only (V) model trained on the synthetic dataset; the model fails to fully predict the object’s true motion (GT) and instead predicts less counterclockwise rotation in the later frames (denoted by red box). In contrast, the multimodal model (VHP) correctly predicts the object’s motion.
Figure 6: A prediction example that demonstrates a different failure mode of the vision-only (V) model trained on the MIT dataset; the model is unable to produce ‘crisp’ predictions of the object form and instead outputs a blurry average at all time steps. The multimodal model (VHP) is able to predict the slight angle of the ellipse (clockwise), matching the ground truth (GT).

Figure 5 is a visualization of a selected roll-out or prediction that demonstrates a typical failure mode of the vision-only model. As shown by the later frames generated by the vision-only model in the red box outline, the overall motion of the object is almost correct, but the finer-scale changes (e.g., the exact amount of rotation) are not well captured. Our VHP model, however, was able to capture these smaller changes.

We ran similar image prediction experiments for the MIT pushing dataset. In this case, we first conditioned on five frames and predicted the next 37 frames in the sequence. As shown by Figure 4(b), both the VP and VH models performed slightly better than the full VHP model during the early time steps according to all three metrics (SE, SSIM and PSNR). However, over a longer prediction horizon the VHP model was clearly superior. For the real-world data, unlike in the synthetic case, the VH model significantly outperformed the baseline V model. Table 2 lists the average scores over the entire prediction horizon. Consistent with the previous synthetic experiment, simply concatenating the modalities (VHP-C) led to poorer performance when compared to the product of experts model (VHP). Each multimodal model performed better than the baseline vision-only model. We visualize another failure mode of the vision-only model in Figure 6. In this case, we observe that the vision-only model failed to capture the more subtle dynamics of the object, which ended up angled and tilted clockwise; the model defaulted to outputting a blurry average at the approximate location of the object. On the other hand, our VHP model was able to produce a relatively crisp and accurate prediction of the object pose.

4.4 Regression Experiments

We further evaluated the predictive capability of the models with a downstream regression experiment that measured how well the model was able to capture and predict the underlying position of the object being pushed. As is commonly done in the self-supervised representation learning literature [9, 23], we trained separate regressors on top of the frozen representations and dynamics models to do this.

(a) Results from the synthetic dataset.
(b) Results from the MIT dataset.
Figure 7: Mean translation errors for regressing the object’s $x$ and $y$ coordinates from the predicted latent states. The ellipses represent one standard deviation.

We first encoded an initial latent state (i.e., a filtered state) from a set of observations. Then, using our learned dynamics and known future controls, we predicted future latent states (i.e., predicted states). Using the predicted states as inputs, we regressed the ground truth positions of the object, while keeping the weights of all of our previously-learned networks frozen. Poor regression results would indicate lower correlation between the predicted latent states and the object’s ground truth position. This in turn would imply that the learned latent representation did not encode all of the necessary information to represent the state of the object, or that the learned dynamics of the object were inaccurate, or both. We kept the same prediction horizons from the image prediction experiments in Section 4.3.

For our results with synthetic data, we found ordinary least squares to be adequate for regression. Figure 7(a) shows that the mean and variance of the absolute translation errors from the regression are lowest for the full VHP model. We (again) observe that the inclusion of haptic data alone (VH) as an extra modality in the model did not lead to large improvements. On the other hand, using proprioception (VP) did lead to significant improvements. The inclusion of both proprioceptive and haptic data (VHP) outperformed using proprioceptive data alone (VP).
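A regression probe of this kind can be sketched with ordinary least squares on frozen latent vectors. The snippet below fits a linear map with a bias term by solving the normal equations and would be applied once per coordinate of the object position. It is a dependency-free illustration with hypothetical names and toy data, not the authors' evaluation code; a numerical library would normally be used for the linear solve.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        residual = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / M[i][i]
    return x


def fit_ols(latents, targets):
    """Least-squares weights for targets ~ w . [z, 1] via the normal equations."""
    X = [list(z) + [1.0] for z in latents]  # append a bias feature
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, targets)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict(weights, z):
    """Apply the fitted linear map to one latent vector."""
    return sum(w * v for w, v in zip(weights, list(z) + [1.0]))


# Toy data: a 2-D latent state that linearly encodes one object coordinate.
latents = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_x = [0.5, 2.5, -0.5, 1.5]  # equals 2*z0 - z1 + 0.5
weights = fit_ols(latents, object_x)
```

Because the encoder and dynamics stay frozen, any error of such a probe reflects what the predicted latent states do or do not encode about the object position.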

For the MIT pushing dataset, we trained a separate simple neural network with a single hidden layer of size 50 to regress the position of the object. The real-world results generally match our synthetic results, as shown by Figure 7(b). Including any sort of multimodal data reduced the mean and variance of the absolute translation errors when compared to the results from the vision-only model.

As an additional benchmark with the MIT dataset, we compared the accuracy of our regressed values of the position of the object to those computed by a related approach on differentiable filtering [25]. Using the same dataset, the authors trained a variety of differentiable filters to directly regress the positions of the objects from the multimodal sensor data in a supervised manner using ground truth labels. Because the multimodal differentiable filters were trained with ground truth annotations of the objects’ locations, while our model is completely self-supervised, the results provide a reasonable estimate of an upper bound on the expected performance. Table 3 demonstrates that our regression results, from both the predicted and filtered states, are comparable.

Model | RMSE [cm] (↓)
EKF (Sup. w/ GT, filt.) | 1.33
PF (Sup. w/ GT, filt.) | 1.14
LSTM (Sup. w/ GT, filt.) | 2.32
VHP (Self-sup., filt.) (Ours) | 1.80
VHP (Self-sup., pred.) (Ours) | 1.96
Table 3: Root-mean-square error (RMSE) of the objects’ regressed positions with the MIT pushing dataset. We compared both the filtering (filt.) and prediction (pred.) results from our self-supervised multimodal model (VHP) to various types of differentiable filters from [25].

5 Conclusions & Future Work

We have presented a self-supervised sequential latent variable model for multimodal time series data. Our probabilistic formulation extends existing latent dynamics models, a key backbone of many methods in vision-based control and model-based reinforcement learning, to multimodal sensor data. We provided a case study of a manipulator pushing task where visual, haptic, and proprioceptive data streams were available. For latent dynamics in particular, we demonstrated that a principled probabilistic formulation for fusing sequential multimodal data performed significantly better than the common baseline of directly concatenating each modality in the latent space. Additionally, our learned self-supervised approach was shown to be competitive with an existing multimodal differentiable filtering method that relies on supervised ground truth labels. As interesting avenues for future work, we plan to further investigate the properties of the learned uncertainties used for multimodal fusion and to use our model for control policy learning.


References

  • [1] T. Baltrušaitis, C. Ahuja, and L. Morency (2019) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 423–443. Cited by: §1, §2.0.2.
  • [2] E. Banijamali, R. Shu, M. Ghavamzadeh, H. H. Bui, and A. Ghodsi (2018) Robust locally-linear controllable embedding. In International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 84, pp. 1751–1759. Cited by: §2.0.1.
  • [3] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In SSST-8 Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. Cited by: §1, §4.2.
  • [4] E. Coumans and Y. Bai (2016–2019) PyBullet, a Python module for physics simulation for games, robotics and machine learning. Cited by: §4.1.1.
  • [5] S. K. D’mello and J. Kory (2015) A review and meta-analysis of multimodal affect detection systems. ACM computing surveys (CSUR) 47 (3), pp. 1–36. Cited by: item 2.
  • [6] Y. Cao and D. J. Fleet (2014) Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In NIPS’14 Modern Nonparametrics 3: Automating the Learning Pipeline Workshop. Cited by: §3.2.
  • [7] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther (2017) A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 3601–3610. Cited by: §1, §2.0.1, §4.2.
  • [8] D. Gandhi, A. Gupta, and L. Pinto (2020) Swoosh! Rattle! Thump!–Actions that sound. In Robotics: Science and Systems (RSS). Cited by: item 2, §2.0.3.
  • [9] J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko (2020) Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 21271–21284. Cited by: §4.4.
  • [10] D. Ha and J. Schmidhuber (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2450–2462. Cited by: §2.0.1, §4.2.
  • [11] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel (2016) Backprop KF: learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29, pp. 4376–4384. Cited by: §2.0.1.
  • [12] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pp. 2555–2565. Cited by: §1, §2.0.1, §3.2.
  • [13] D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba (2021) Mastering Atari with discrete world models. In International Conference on Learning Representations (ICLR). Cited by: §1.
  • [14] G. E. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800. Cited by: §1, §4.3.
  • [15] Y. Hristov and S. Ramamoorthy (2021) Learning from demonstration with weakly supervised disentanglement. In International Conference on Learning Representations (ICLR). Cited by: §1.
  • [16] B. Ichter and M. Pavone (2019) Robot motion planning in learned latent spaces. IEEE Robotics and Automation Letters 4 (3), pp. 2407–2414. Cited by: §1.
  • [17] M. J. Johnson, D. Duvenaud, A. B. Wiltschko, R. P. Adams, and S. R. Datta (2016) Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, Vol. 29, pp. 2946–2954. Cited by: §2.0.1.
  • [18] M. Karl, M. Sölch, J. Bayer, and P. van der Smagt (2017) Deep variational Bayes filters: unsupervised learning of state space models from raw data. In International Conference on Learning Representations (ICLR). Cited by: §1, §2.0.1, §4.2.
  • [19] D. Kiela and L. Bottou (2014) Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 36–45. Cited by: item 2.
  • [20] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), Y. Bengio and Y. LeCun (Eds.). Cited by: §4.2.
  • [21] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), Y. Bengio and Y. LeCun (Eds.). Cited by: §1, §3.1.
  • [22] A. Kloss, S. Schaal, and J. Bohg (2020) Combining learned and analytical models for predicting action effects from sensory data. The International Journal of Robotics Research. ISSN 1741-3176. Cited by: Figure 3, §4.1.2.
  • [23] S. Kornblith, J. Shlens, and Q. V. Le (2019) Do better ImageNet models transfer better? In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2661–2671. Cited by: §4.4.
  • [24] R. G. Krishnan, U. Shalit, and D. Sontag (2017) Structured inference networks for nonlinear state space models. In AAAI Conference on Artificial Intelligence, pp. 2101–2109. Cited by: §2.0.1.
  • [25] M. Lee, B. Yi, R. Martin-Martin, S. Savarese, and J. Bohg (2020) Multimodal sensor fusion with differentiable filters. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Cited by: item 3, §2.0.3, §4.1.2, §4.4, Table 3.
  • [26] O. Limoyo, B. Chan, F. Maric, B. Wagstaff, R. Mahmood, and J. Kelly (2020) Heteroscedastic uncertainty for robust generative latent dynamics. IEEE Robotics and Automation Letters 5 (4), pp. 6654–6661. Cited by: §2.0.1, §4.2.
  • [27] M. Lippi, P. Poklukar, M. C. Welle, A. Varava, H. Yin, A. Marino, and D. Kragic (2020) Latent space roadmap for visual action planning of deformable and rigid object manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Cited by: §1.
  • [28] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In International Conference on Machine Learning (ICML). Cited by: §1, §2.0.2.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: item 4.
  • [30] R. Rafailov, T. Yu, A. Rajeswaran, and C. Finn (2021) Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control (L4DC), pp. 1154–1168. Cited by: §1.
  • [31] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal (2013) Grounding action descriptions in videos. Trans. Assoc. Computational Linguistics 1, pp. 25–36. Cited by: item 2.
  • [32] S. Rezaei-Shoshtari, F. R. Hogan, M. Jenkin, D. Meger, and G. Dudek (2021) Learning intuitive physics with multimodal generative models. In AAAI Conference on Artificial Intelligence. Cited by: §2.0.3.
  • [33] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), pp. 1278–1286. Cited by: §1, §3.1.
  • [34] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems. Cited by: §4.2.
  • [35] J. Sergeant, N. Suenderhauf, M. Milford, and B. Upcroft (2015) Multimodal deep autoencoders for control of a mobile robot. In Proc. Australasian Conf. Robotics and Automation, H. Li and J. Kim (Eds.), pp. 1–10. Cited by: §1.
  • [36] N. Srivastava and R. Salakhutdinov (2014) Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15 (1), pp. 2949–2980. ISSN 1532-4435. Cited by: §2.0.2.
  • [37] M. Suzuki, K. Nakayama, and Y. Matsuo (2016) Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891. Cited by: §2.0.2.
  • [38] R. Vedantam, I. Fischer, J. Huang, and K. Murphy (2018) Generative models of visually grounded imagination. In International Conference on Learning Representations (ICLR). Cited by: §2.0.2.
  • [39] N. Wahlström, T. B. Schön, and M. P. Deisenroth (2015) From pixels to torques: policy learning with deep dynamical models. In ICML’15 Deep Learning Workshop. Cited by: §1, §2.0.1.
  • [40] M. Watter, J. T. Springenberg, J. Boedecker, and M. Riedmiller (2015) Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754. Cited by: §1, §2.0.1.
  • [41] M. Wu and N. Goodman (2018) Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1, §2.0.2, §3.2.
  • [42] K. Yu, M. Bauzá, N. Fazeli, and A. Rodriguez (2016) More than a million ways to be pushed. A high-fidelity experimental dataset of planar pushing. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 30–37. Cited by: Figure 2, §4.1.2.
  • [43] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine (2019) SOLAR: deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning (ICML), pp. 7444–7453. Cited by: §1.
  • [44] K. Zhou, C. Chen, B. Wang, M. R. U. Saputra, N. Trigoni, and A. Markham (2021) VMLoc: variational fusion for learning-based multimodal camera localization. In AAAI Conference on Artificial Intelligence, Vol. 35, pp. 6165–6173. Cited by: §1, §2.0.3.