1 Introduction
Virtually all sensory data that humans and autonomous systems receive can be thought of as multimodal time series—multiple sensors each provide streams of information about the surrounding environment, and intelligent systems have to integrate these information streams to form a coherent yet dynamic model of the world. These time series are often asynchronously or irregularly sampled, with many timepoints having missing or incomplete data. Classical time series algorithms, such as the Kalman filter, are robust to such incompleteness: they are able to infer the state of the world, but only in linear regimes
kalman1960new . On the other hand, human intelligence is robust and complex: we infer complex quantities with nonlinear dynamics, even from incomplete temporal observations of multiple modalities—for example, intended motion from both eye gaze and arm movements dragan2013legibility , or desires and emotions from actions, facial expressions, speech, and language baker2017rational ; ong2015affective .There has been a proliferation of attempts to bridge this gap and learn nonlinear dynamics by integrating deep learning with traditional probabilistic approaches such as Hidden Markov Models (HMMs) and latent dynamical systems. Most approaches do this by adding random latent variables to each time step in an RNN
chung2015recurrent ; bayer2014learning ; fabius2014variational ; fraccaro2016sequential . Other authors begin with latent sequence models as a basis, then develop deep parameterizations and inference techniques for these models krishnan2017structured ; archer2015black ; johnson2016composing ; karl2016deep ; doerr2018probabilistic ; lin2018variational ; chen2018neural . Most relevant to our work are the Deep Markov Models (DMMs) proposed by Krishnan, Uri, and Sontag krishnan2017structured , a generalization of HMMs and Gaussian State Space Models where the transition and emission distributions are parameterized by deep networks.Unfortunately, none of the approaches thus far are designed to handle inference with both missing data and multiple modalities. Instead, most approaches rely upon RNN inference networks krishnan2017structured ; karl2016deep ; che2018multivariate , which can only handle missing data using adhoc approaches such as zeromasking lipton2016directly , updateskipping krishnan2017structured , temporal gating mechanisms neil2016phased ; che2018multivariate , or providing time stamps as inputs chen2018neural , none of which have intuitive probabilistic interpretations. For example, Che et al. extend DMMs to handle time series sampled at different rates by learning temporal gates, but do not handle arbitrary missingness che2018multivariate , while Chen et al.
learn neural ordinary differential equations that allow modelling of irregularly sampled data, but still rely on blackbox RNNs with timestamp concatenation for inference
chen2018neural . Of the approaches that do not rely on RNNs, Fraccaro et al. handle missing data by assuming linear latent dynamics fraccaro2017disentangled , while Johnson et al. johnson2016composing and Lin et al. lin2018variational use hybrid message passing inference that is theoretically capable of marginalizing out missing data. However, these methods are unable to learn nonlinear transition dynamics, nor do they handle multiple modalities.We address these limitations—the inability to handle missing data, over multiple modalities, and with nonlinear dynamics—by introducing a multimodal generalization of Deep Markov Models, as well as a factorized inference method for such models that handles missing time points and modalities by design. Our method allows for both filtering and smoothing given incomplete time series, while also performing uncertaintyaware multimodal fusion a la
the Multimodal Variational Autoencoder (MVAE)
wu2018multimodal . Because our method handles incompleteness over both time and modalities, it is capable of interpolation, forward extrapolation, backward extrapolation, conditional generation of one modality from another, including label prediction. We demonstrate these capabilities on both a synthetic dataset of noisy bidirectional spirals, as well as a real world dataset of labelled human actions. Our experiments show that our method learns and performs excellently on each of these tasks, while outperforming stateoftheart inference methods that rely upon RNNs.2 Methods
2.1 Multimodal Deep Markov Models
We introduce Multimodal Deep Markov Models (MDMMs) as a generalization of Krishnan et al.’s Deep Markov Models (DMMs) krishnan2017structured . In a MDMM (Figure 1(a)), we model multiple sequences of observations, each of which is conditionally independent of the other sequences given the latent state variables. Each observation sequence corresponds to a particular data or sensor modality (e.g. video, audio, labels), and may be missing when other modalities are present. An MDMM can thus be seen as a sequential version of the MVAE wu2018multimodal .
Formally, let and respectively denote the latent state and observation for modality at time . An MDMM with modalities is then defined by the transition and emission distributions:
(1)  
(2) 
Here, , and
are functions (e.g., neural networks) which output the distribution parameters, and are themselves parameterized by
, and respectively. We denote the model parameters by . We also use to denote the timeseries of from to , and to denote the corresponding observations from modalities to . We omit the modality superscripts when all modalities are present (i.e., ).As with VAEs kingma2014auto and MVAEs wu2018multimodal , we want to jointly learn the parameters of the generative model and a variational posterior which approximates the true (intractable) posterior . To do so, we maximize a lower bound on the log marginal likelihood , also known as the evidence lower bound (ELBO):
(3) 
In practice, we can maximize the ELBO with respect to and
via gradient ascent with stochastic backpropagation
kingma2014auto ; rezende2014stochastic . Doing so requires sampling from the variational posterior . The computation of this posterior is what differentiates our method from many recent approaches to variational inference in time series models. In the following sections, we derive a variational posterior that factorizes over timesteps and modalities, allowing us to tractably infer the latent states even when data is missing.2.2 Factorized Distributions for Filtering, Smoothing, and Sequencing
In latent sequence models such as MDMMs, we often want to perform several kinds inferences over the latent states. The most common and important of such latent state inferences are:
Filtering  Inferring the current given all observations thus far.  

Smoothing  Inferring some given all observations in the sequence .  
Sequencing  Inferring the complete latent sequence from . 
Most architectures that combine deep learning with Markov models focus upon filtering fabius2014variational ; chung2015recurrent ; hafner2018learning ; buesing2018learning , while Krishnan et al. optimize their DMM for sequencing krishnan2017structured . One of our contributions is to demonstrate that we can learn the filtering, smoothing, and sequencing distributions within the same framework, because they all share similar factorizations (see Figures 1(b) and 1(c) for the shared inference structure of filtering and smoothing). A further consequence of these factorizations is that we can naturally handle inferences given missing modalities or timesteps.
To demonstrate this similarity, we first factorize the sequencing distribution over time:
(4) 
This factorization means that each latent state depends only on the previous latent state , as well as all current and future observations , and is implied by the graphical structure of the MDMM (Figure 1(a)). We term the conditional smoothing posterior, because it is the posterior that corresponds to the conditional prior on the latent space, and because it combines information from the past and the future (hence ‘smoothing’).
Given one or more modalities, we can show that the conditional smoothing posterior , the backward filtering distribution , and the smoothing distribution all factorize almost identically:
(5)  
(6)  
(7) 
Equations 5–7 show that each distribution can be decomposed into (i) its dependence on future observations, , (ii) its dependence on each modality in the present, , and, excluding filtering, (iii) its dependence on the past or . Their shared structure is due to the conditional independence of given from all prior observations or latent states. Here we show only the derivation for Equation 7, because the others follow by either dropping (Equation 5), or replacing with (Equation 6):
The factorizations in Equations 5–7 lead to several useful insights. First, they show that any missing modalities at time can simply be left out of the product over modalities, leaving us with distributions that correctly conditioned on only the modalities that are present. Second, they suggest that we can compute all three distributions if we can approximate the dependence on the future, , learn approximate posteriors for each modality , and know the model dynamics , .
2.3 Multimodal Temporal Fusion via ProductofGaussians
However, there are a few obstacles to performing tractable computation of Equations 5–7
. One obstacle is that it is not tractable to compute the product of generic probability distributions. To address this, we adopt the approach used for the MVAE
wu2018multimodal , making the assumption that each term in Equations 5–7 is Gaussian. If each distribution is Gaussian, then their products or quotients are also Gaussian and can be computed in closed form. Since this result is wellknown, we state it in the supplement (see wu2018multimodal for a proof).This ProductofGaussians approach has the added benefit that the output distribution is dominated by the input Gaussian terms with lower variance (higher precision), thereby fusing information in a way that gives more weight to highercertainty inputs
cao2014generalized ; ong2015affective . This automatically balances the information provided by each modality , depending on whether is high or low certainty, as well as the information provided from the past and future through and , thereby performing multimodal temporal fusion in a manner that is uncertaintyaware.2.4 Approximate Backward Filtering with Missing Data
Another obstacle to computing Equations 5–7 is the dependence on future observations, , which does not admit further factorization, and hence does not readily handle missing data among those future observations. Other approaches to approximating this dependence on the future rely on RNNs as recognition models krishnan2017structured ; che2018multivariate , but these are not designed to work with missing data.
To address this obstacle in a more principled manner, our insight was that is the expectation of under the backwards filtering distribution, :
(8) 
For tractable approximation of this expectation, we use an approach similar to assumed density filtering huber2011semi . We assume both and to be multivariate Gaussian with diagonal covariance, and sample the parameters , of under . After drawing samples, we approximate the parameters of
via empirical momentmatching:
(9) 
Approximating by led us to three important insights. First, by substituting the expectation from Equation 8 into Equation 5, the backward filtering distribution becomes:
(10) 
In other words, by sampling under the filtering distribution for time , , we can compute the filtering distribution for time , . We can thus recursively compute backwards in time, starting from time .
Second, once we can perform filtering backwards in time, we can use this to approximate in the smoothing distribution (Equation 6) and the conditional smoothing posterior (Equation 7). Backward filtering hence allows approximation of both the smoothing and sequencing distributions.
Third, this approach removes the explicit dependence on all future observations , allowing us to handle missing data. Suppose the data points are missing, where and are the timestep and modality of the th missing point respectively. Rather than directly compute the dependence on incomplete set of future observations, , we can instead sample under the filtering distribution conditioned on incomplete observations, , and then compute given the sampled , thereby approximating .
2.5 Bidirectional Factorized Variational Inference
We now introduce factorized variational approximations of Equations 5–7. We replace the true posteriors with approximations , where is parameterized by a (timeinvariant) neural network for each modality . Like wu2018multimodal , we directly learn the Gaussian quotients to avoid the constraint required for ensuring a quotient of Gaussians is welldefined. We also parameterize the transition dynamics and using neural networks for the quotient distributions. This gives the following approximations:
(11)  
(12)  
(13) 
Here, is shorthand for the expectation under the approximate backward filtering distribution , while is the expectation under the forward smoothing distribution .
To calculate the backward filtering distribution , we introduce a variational backward algorithm (Algorithm 1) to recursively compute Equation 11 for all timesteps in a single pass. Note that simply by reversing time in Algorithm 1, this gives us a variational forward algorithm that computes the forward filtering distribution .
Unlike filtering, smoothing and sequencing require information from both the past () and the future (). This motivates a variational backwardforward algorithm (Algorithm 2) for smoothing and sequencing. Algorithm 2 first uses Algorithm 1 as a backward pass, then performs a forward pass to propagate information from past to future.^{1}^{1}1Algorithm 2 also requires knowing for each . While this can be computed through sampling in the forward pass, we choose to avoid the instability by instead assuming is constant with time, i.e., the MDMM is stationary when no observations are given. During training, we add and to the loss to ensure that the transition dynamics obey this assumption.
While Algorithm 1 approximates the filtering distribution , by setting the number of particles , it effectively computes the (backward) conditional filtering posterior and (backward) conditional prior for a randomly sampled latent sequence . Similarly, while Algorithm 2 approximates smoothing by default, when , it effectively computes the (forward) conditional smoothing posterior and (forward) conditional prior for a random latent sequence . These quantities are useful not only because they allow us to perform sequencing, but also because we can use them to compute the ELBO for both backward filtering and forward smoothing:
(14)  
(15) 
is the filtering ELBO because it corresponds to a ’backward filtering’ variational posterior , where each is only inferred using the current observation and the future latent state . is the smoothing ELBO because it corresponds to the correct factorization of the posterior in Equation 4, where each term combines information from both past and future. Because we require accurate backward filtering in order to perform forward smoothing, we learn the MDMM parameters by jointly maximizing the filtering and smoothing ELBOs. We call this paradigm bidirectional factorized variational inference (BFVI), due to its use of factorized variational posteriors for both backward filtering and forward smoothing.
3 Experiments
We compare BFVI against stateoftheart RNNbased inference methods on two multimodal time series datasets over a range of inference tasks. FMask and FSkip use forward RNNs (one per modality), using zeromasking and update skipping respectively to handle missing data. They are thus multimodal variants of the STL network in krishnan2017structured , and similar to the variational RNN chung2015recurrent and recurrent SSM hafner2018learning . BMask and BSkip use backward RNNs, with masking and skipping respectively, and correspond to the Deep Kalman Smoother in krishnan2017structured . The underlying MDMM architecture is constant across inference methods. Architectural and training details can be found in the supplement.
3.1 Datasets

[leftmargin=0pt]
 Noisy Spirals.

We synthesized a dataset of noisy 2D spirals (600 train / 400 test) similar to Chen et al. chen2018neural , treating the and coordinates as two separate modalities. Spiral trajectories vary in direction (clockwise or counterclockwise), size, and aspect ratio, and Gaussian noise is added to the observations. We used
latent dimensions, and twolayer perceptrons for encoding
and decoding . For evaluation, we compute the mean squared error (MSE) per time step between the predicted trajectories and ground truth spirals.  Weizmann Human Actions.

This is a video dataset of 9 people each performing 10 actions gorelick2007actions . We converted it to a trimodal time series dataset by treating actions and people’s identities as perframe labels, similar to He et al. he2018probabilistic . Each RGB frame was cropped to the central 128128 window and resized to 6464. We selected one person’s videos as the test set, and the other 80 videos as the training set, allowing us to test action label prediction on an unseen person. We used latent dimensions, and convolutional / deconvolutional neural networks for encoding and decoding. For evaluation, we compute the Structural Similarity (SSIM) between the input video frames and the reconstructed outputs.
3.2 Tasks
We evaluated all methods on the following suite of temporal inference tasks for both datasets:
Reconstruction  Reconstruction given complete observations 
Drop Half  Reconstruction after half of the inputs are randomly deleted. 
Forward Extrapolation  Predicting the last 25% of a sequence when the rest is given. 
Backward Extrapolation  Inferring the first 25% of a sequence when the rest is given. 
When evaluating the above tasks on the Weizmann dataset, we provided only video frames as input, to test whether the methods were capable of unimodal inference after multimodal training. We also tested crossmodal inference using the following conditional generation / label prediction tasks:
Cond. Gen. (Spirals)  Given  and initial 25% of coordinates, generate rest of spiral. 
Cond. Gen. (Video)  Given action and identity labels & first 25% of frames, generate rest of video. 
Label Pred. (Video)  Infer the action labels for each frame, given only the video frames. 
3.3 Results
Method  Recon.  Drop Half  Fwd. Extra.  Bwd. Extra.  Cond. Gen.  Label Pred. 

Spirals Dataset: MSE (SD)  
BFVI  0.02 (0.01)  0.04 (0.01)  0.12 (0.10)  0.07 (0.03)  0.26 (0.26)  – 
FMask  0.02 (0.01)  0.06 (0.02)  0.10 (0.08)  0.18 (0.07)  1.37 (1.39)  – 
FSkip  0.04 (0.01)  0.10 (0.05)  0.13 (0.11)  0.19 (0.06)  1.51 (1.54)  – 
BMask  0.02 (0.01)  0.04 (0.01)  0.18 (0.14)  0.04 (0.01)  1.25 (1.23)  – 
BSkip  0.05 (0.01)  0.19 (0.05)  0.32 (0.22)  0.37 (0.15)  1.64 (1.51)  – 
Weizmann Video Dataset: SSIM or Accuracy* (SD)  
BFVI  .83 (.04)  .82 (.04)  .83 (.04)  .81 (.05)  .81 (.03)  .66 (.34)* 
FMask  .78 (.04)  .70 (.17)  .78 (.04)  .66 (.18)  .78 (.04)  .19 (.21)* 
FSkip  .71 (.16)  .70 (.17)  .71 (.16)  .67 (.18)  .71 (.16)  .24 (.39)* 
BMask  .67 (.20)  .76 (.11)  .67 (.19)  .53 (.24)  .68 (.20)  .40 (.43)* 
BSkip  .79 (.04)  .78 (.03)  .79 (.04)  .78 (.03)  .79 (.04)  .18 (.30)* 
SSIM or label accuracy (higher is better) per timestep with respect to original videos. Means and Standard Deviations (SD) are across the test set.
We present the results of the inference methods in Table 1, and show sample reconstruction results for spirals in Figures 2 and videos in Figure 3. On the spirals dataset, BFVI achieves high performance on all tasks, whereas the RNNbased methods only perform well on a few. In particular, all methods besides BFVI do poorly on the conditional generation task, which can be understood from the rightmost column of Figure 2. BFVI generates a spiral that matches the provided coordinates, while the nextbest method, BMask, completes the trajectory with a plausible spiral, but ignores the observations entirely in the process.
On the more complex Weizmann video dataset, BFVI outperforms all other methods on every task, demonstrating both the power and flexibility of our approach. The RNNbased methods performed especially poorly on label prediction, and this was the case even on the training set (not shown in Table 1). We suspect this occurred because they lack a principled approach to multimodal fusion, and hence fail to learn a latent space which captures the mutual information between action labels and images. In contrast, BFVI learns to both predict one modality from another, and to propagate information across time, as can be seen from the reconstruction and predictions in Figure 3.
4 Conclusion
In this paper, we introduced bidirectional factorized variational inference (BFVI) as a novel inference method for Multimodal Deep Markov Models. This method handles incomplete data via a factorized variational posterior, allowing us to easily marginalize over missing observations. Our method is thus capable of a large range of multimodal temporal inference tasks, which we demonstrate on both a synthetic dataset and a video dataset of human motions. The ability to handle missing data also suggests applications in weakly supervised learning of labelled time series
wu2018multimodal , as well as modelling of multirate time series che2018multivariate . Given the abundance of multimodal time series data where missing data is the norm rather than the exception, our work holds great promise for many future applications.References
 [1] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
 [2] Anca D Dragan, Kenton CT Lee, and Siddhartha S Srinivasa. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE international conference on Humanrobot interaction, pages 301–308. IEEE Press, 2013.
 [3] Chris L Baker, Julian JaraEttinger, Rebecca Saxe, and Joshua B Tenenbaum. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4):0064, 2017.
 [4] Desmond C Ong, Jamil Zaki, and Noah D Goodman. Affective cognition: Exploring lay theories of emotion. Cognition, 143:141–162, 2015.
 [5] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
 [6] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. In NIPS 2014 Workshop on Advances in Variational Inference, 2014.
 [7] Otto Fabius and Joost R van Amersfoort. Variational recurrent autoencoders. arXiv preprint arXiv:1412.6581, 2014.
 [8] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.

[9]
Rahul G Krishnan, Uri Shalit, and David Sontag.
Structured inference networks for nonlinear state space models.
In
ThirtyFirst AAAI Conference on Artificial Intelligence
, 2017.  [10] Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
 [11] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.
 [12] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.

[13]
Andreas Doerr, Christian Daniel, Martin Schiegg, NguyenTuong Duy, Stefan
Schaal, Marc Toussaint, and Trimpe Sebastian.
Probabilistic recurrent statespace models.
In Jennifer Dy and Andreas Krause, editors,
Proceedings of the 35th International Conference on Machine Learning
, volume 80 of Proceedings of Machine Learning Research, pages 1280–1289, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.  [14] Wu Lin, Mohammad Emtiyaz Khan, and Nicolas Hubacher. Variational message passing with structured inference networks. In International Conference on Learning Representations, 2018.
 [15] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
 [16] Zhengping Che, Sanjay Purushotham, Guangyu Li, Bo Jiang, and Yan Liu. Hierarchical deep generative models for multirate multivariate time series. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 784–793, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 [17] Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with rnns: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, pages 253–270, 2016.
 [18] Daniel Neil, Michael Pfeiffer, and ShihChii Liu. Phased lstm: Accelerating recurrent network training for long or eventbased sequences. In Advances in neural information processing systems, pages 3882–3890, 2016.

[19]
Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther.
A disentangled recognition and nonlinear dynamics model for unsupervised learning.
In Advances in Neural Information Processing Systems, pages 3601–3610, 2017.  [20] Mike Wu and Noah Goodman. Multimodal generative models for scalable weaklysupervised learning. In Advances in Neural Information Processing Systems, pages 5575–5585, 2018.
 [21] Diederik P Kingma and Max Welling. Autoencoding variational bayes. In International Conference on Learning Representations, 2014.
 [22] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
 [23] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
 [24] Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
 [25] Yanshuai Cao and David J Fleet. Generalized product of experts for automatic and principled fusion of gaussian process predictions. In Modern Nonparametrics 3: Automating the Learning Pipeline Workshop, NeurIPS 2014., 2014.
 [26] Marco F Huber, Frederik Beutler, and Uwe D Hanebeck. Semianalytic gaussian assumed density filter. In Proceedings of the 2011 American Control Conference, pages 3006–3011. IEEE, 2011.
 [27] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as spacetime shapes. IEEE transactions on pattern analysis and machine intelligence, 29(12):2247–2253, 2007.

[28]
Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal.
Probabilistic video generation using holistic attribute control.
In
Proceedings of the European Conference on Computer Vision (ECCV)
, pages 452–467, 2018.