1 Introduction
Replicating the human ability to process and relate information coming from different sources, and to learn from it, is a long-standing goal in machine learning Baltrušaitis et al. (2018). Multiple information sources offer the potential of learning better and more generalizable representations, but pose challenges at the same time: models have to be aware of complex intra- and inter-modal relationships, and be robust to missing modalities Ngiam et al. (2011); Zadeh et al. (2017). However, the extensive labelling of multiple data types is expensive and hinders possible applications of fully-supervised approaches Fang et al. (2015); Karpathy and Fei-Fei (2015). Simultaneous observations of multiple modalities moreover provide self-supervision in the form of shared information which connects the different modalities. Self-supervised, generative models are a promising approach to capture this joint distribution and flexibly support missing modalities, with no additional labelling cost attached. Based on the shortcomings of previous work (see section 2.1), we formulate the following wishlist for multimodal, generative models:
Scalability. The model should be able to efficiently handle any number of modalities. Translation approaches Huang et al. (2018); Zhu et al. (2017) have had great success in combining two modalities and translating from one to the other. However, the training of these models is computationally expensive for more than two modalities due to the exponentially growing number of possible paths between subsets of modalities.
Missing data.
A multimodal method should be robust to missing data and handle any combination of available and missing data types. For discriminative tasks, the loss in performance should be minimized. For generation, the estimation of missing data types should be conditioned on and coherent with available data, while providing diversity over modality-specific attributes in the generated samples.
Information gain. Multimodal models should benefit from multiple modalities for discriminative as well as for generative tasks.
In this work, we introduce a novel probabilistic, generative and self-supervised multimodal model. The proposed model is able to integrate information from different modalities, reduce uncertainty and ambiguity in redundant sources, and handle missing modalities, while making no assumptions about the nature of the data, especially about the inter-modality relations.
We base our approach directly in the Variational Bayesian Inference framework and propose the new multimodal Jensen-Shannon divergence (mmJSD) objective. We introduce the idea of a dynamic prior for multimodal data, which enables the use of the Jensen-Shannon divergence for multiple distributions Aslam and Pavlu (2007); Lin (1991) and interlinks the unimodal probabilistic representations of the observation types. Additionally, we are, to the best of our knowledge, the first to empirically show the advantage of modality-specific subspaces for multiple data types in a self-supervised and scalable setting. For the experiments, we concentrate on Variational Autoencoders Kingma and Welling (2013). In this setting, our multimodal extension to variational inference implements a scalable method, capable of handling missing observations, generating coherent samples and learning meaningful representations. We empirically show this on two different datasets. In the context of scalable generative models, we are the first to perform experiments on datasets with more than two modalities, showing the ability of the proposed method to perform well in a setting with multiple modalities.

2 Theoretical Background & Related Work
We consider a dataset {X^(i)}_{i=1}^N of N i.i.d. sets, where every X^(i) = {x_1^(i), …, x_M^(i)} is a set of M modalities. We assume that the data is generated by some random process involving a joint hidden random variable z, where inter-modality dependencies are unknown. In general, the same assumptions are valid as in the unimodal setting Kingma and Welling (2013). The marginal log-likelihood can be decomposed into a sum over the marginal log-likelihoods of the individual sets X^(i), which can be written as:

log p_θ(X^(i)) = KL(q_φ(z|X^(i)) || p_θ(z|X^(i))) + L(θ, φ; X^(i))   (1)
with  L(θ, φ; X^(i)) := E_{q_φ(z|X^(i))}[log p_θ(X^(i)|z)] − KL(q_φ(z|X^(i)) || p_θ(z))   (2)

L(θ, φ; X^(i)) is called the evidence lower bound (ELBO) on the marginal log-likelihood of the set X^(i). The ELBO forms a computationally tractable objective to approximate the joint data distribution, which can be efficiently optimized because, from the non-negativity of the KL-divergence, it follows that L(θ, φ; X^(i)) ≤ log p_θ(X^(i)). Particular to the multimodal case is what happens to the ELBO formulation if one or more data types are missing: we are then only able to approximate the true posterior p_θ(z|X^(i)) by the variational function q_φ(z|X̃_K^(i)). Here, X̃_K^(i) denotes a subset of X^(i) with K available modalities, where K ≤ M. However, we would still like to be able to approximate the true multimodal posterior distribution of all data types. For simplicity, we always use X̃_K^(i) to symbolize missing data for the set X^(i), although there is no information about which or how many modalities are missing. Additionally, different modalities might be missing for different sets X^(i). In this case, the ELBO formulation changes accordingly:

L̃_K(θ, φ; X̃_K^(i)) := E_{q_φ(z|X̃_K^(i))}[log p_θ(X^(i)|z)] − KL(q_φ(z|X̃_K^(i)) || p_θ(z))   (3)

L̃_K defines the ELBO if only X̃_K^(i) is available, but we are interested in the true posterior distribution p_θ(z|X^(i)). To improve readability, we will omit the superscript (i) in the remaining part of this work.
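When both the posterior approximation and the prior are Gaussian, the KL term of the ELBO in equation (2) is available in closed form, and only the reconstruction term needs sampling. A minimal NumPy sketch (the function names and shapes are our own illustration, not from the paper):

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def elbo(log_px_given_z, mu, logvar):
    """Single-sample ELBO estimate as in equation (2): reconstruction
    log-likelihood minus the closed-form KL regularizer."""
    return log_px_given_z - kl_to_std_normal(mu, logvar)
```

The same two terms reappear in every multimodal objective below; only the regularizer changes.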
2.1 Related Work
In this work, we focus on methods with the aim of modelling a joint latent distribution, instead of transferring between modalities Huang et al. (2018); Tian and Engel (2019) due to the scalability constraint described in section 1.
Joint and Conditional Generation. Suzuki et al. (2016) implemented a multimodal VAE and introduced the idea that the distribution of the unimodal approximation should be close to the multimodal approximation function. Vedantam et al. (2017) introduced the triple ELBO as an additional improvement. Both define labels as a second modality and are not scalable in the number of modalities.
Modality-specific Latent Subspaces. Hsu and Glass (2018) and Tsai et al. (2018) both proposed models with modality-specific latent distributions and an additional shared distribution. The former relies on supervision by labels to extract modality-independent factors, while the latter is non-scalable.
Scalability. More recently, Kurle et al. (2018) and Wu and Goodman (2018) proposed scalable multimodal generative models, for which they achieve scalability by using a Product of Experts Hinton (1999) as the joint approximation distribution. The Product of Experts (PoE) allows them to handle missing modalities without requiring separate inference networks for every combination of missing and available data. A PoE is computationally attractive as, for Gaussian distributed experts, it remains Gaussian distributed, which allows the calculation of the KL-divergence in closed form. However, they report problems in optimizing the unimodal variational approximation distributions due to the multiplicative nature of the PoE. To overcome this limitation, Wu and Goodman (2018) introduced a combination of ELBOs, which results in the final objective not being an ELBO anymore. Shi et al. (2019) use a Mixture-of-Experts (MoE) as the joint approximation function. The additive nature of the MoE facilitates the optimization of the individual experts, but is computationally less efficient as there exists no closed-form solution to calculate the KL-divergence. Shi et al. (2019) need to rely on importance sampling (IS) to achieve the desired performance. IS-based VAEs Burda et al. (2015) tend to achieve tight ELBOs at the price of reduced computational efficiency. Additionally, their model requires M² passes through the decoder networks, which increases the computational cost further.

3 The multimodal JS-Divergence model
We propose a new multimodal objective (mmJSD) utilizing the Jensen-Shannon divergence. Compared to previous work, this formulation does not need any additional training objectives Wu and Goodman (2018), supervision Tsai et al. (2018) or importance sampling Shi et al. (2019), while being scalable, in contrast to Hsu and Glass (2018).
Definition 1.
We define a new objective for learning multimodal, generative models which utilizes the Jensen-Shannon divergence:

L(θ, φ; X) := E_{q_φ(z|X)}[log p_θ(X|z)] − JS_π^{M+1}({q_φ(z|x_j)}_{j=1}^M, p_θ(z))   (4)

where JS_π^{M+1} denotes the Jensen-Shannon divergence for M+1 distributions with distribution weights π = [π_1, …, π_{M+1}] and Σ_{j=1}^{M+1} π_j = 1.

For any K ∈ ℕ, the Jensen-Shannon (JS) divergence for K distributions {p_j}_{j=1}^K is defined as follows:

JS_π^K({p_j}_{j=1}^K) := Σ_{j=1}^K π_j KL(p_j || f_M({p_ν}_{ν=1}^K))   (5)

where the function f_M defines a mixture distribution of its arguments. The JS-divergence for K distributions is the extension of the standard JS-divergence for two distributions to an arbitrary number of distributions. It is a weighted sum of KL-divergences between the individual probability distributions p_j and their mixture distribution f_M({p_ν}_{ν=1}^K). The π_j denote the distribution weights, with Σ_j π_j = 1. In the remaining part of this section, we derive the new objective directly from the standard ELBO formulation and prove that it is a lower bound on the marginal log-likelihood log p_θ(X).

3.1 Joint Distribution
A MoE is an arithmetic mean function whose additive nature facilitates the optimization of the individual experts compared to a PoE (see section 2.1). As there exists no closed-form solution for the calculation of the KL-divergence of a mixture distribution, we need to rely on an upper bound to the true divergence using Jensen's inequality Hershey and Olsen (2007) for an efficient calculation (for details, please see section B.1). Hence, we are able to approximate the multimodal ELBO defined in equation (2) by a sum of KL-terms:

L(θ, φ; X) ≥ E_{q_φ(z|X)}[log p_θ(X|z)] − Σ_{j=1}^M π_j KL(q_φ(z|x_j) || p_θ(z))   (6)

The sum of KL-divergences can be calculated in closed form if the prior distribution and the unimodal posterior approximations are both Gaussian distributed. In this case, this lower bound to the ELBO allows the optimization of the ELBO objective in a computationally efficient way.
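Each term of the sum in equation (6) is a KL-divergence between two Gaussians, which has a closed form for diagonal covariances. A small sketch of the weighted sum (helper names are ours; `unimodal` holds per-modality (mean, variance) pairs):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL between diagonal Gaussians N(mu_q, var_q) and N(mu_p, var_p)."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def weighted_kl_sum(unimodal, prior, weights):
    """sum_j pi_j KL(q_j || prior): the regularizer of equation (6)."""
    mu_p, var_p = prior
    return sum(w * kl_diag_gauss(mu, var, mu_p, var_p)
               for (mu, var), w in zip(unimodal, weights))
```

A missing modality simply drops out of the sum, which is what makes this regularizer convenient for incomplete data.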
3.2 Dynamic Prior
In the regularization term in equation (6), although efficiently optimizable, the unimodal approximations are only individually compared to the prior, and no joint objective is involved. We propose to incorporate the unimodal posterior approximations into the prior through a function f.
Definition 2 (Multimodal Dynamic Prior).
The dynamic prior is defined as a function f of the unimodal approximation functions {q_φ(z|x_j)}_{j=1}^M and a pre-defined distribution p_θ(z):

p_f(z|X) := f({q_φ(z|x_j)}_{j=1}^M, p_θ(z))   (7)
The dynamic prior is not a prior distribution in the conventional sense, as it does not reflect prior knowledge of the data; rather, it incorporates the prior knowledge that all modalities share common factors. We therefore call it a prior due to its role in the ELBO formulation and optimization. As a function of all the unimodal posterior approximations, the dynamic prior extracts the shared information and relates the unimodal approximations to it. With this formulation, the objective is simultaneously optimized for similarity between the function p_f and the unimodal posterior approximations. For random sampling, the pre-defined prior p_θ(z) is used.
3.3 Jensen-Shannon Divergence
Utilizing the dynamic prior p_f(z|X), the sum of KL-divergences in equation (6) can be written as a JS-divergence (see equation (5)) if the function f defines a mixture distribution. To remain a valid ELBO, the function f needs to define a well-defined prior.
Lemma 1.
If the function f of the dynamic prior defines a mixture distribution of the unimodal approximation distributions q_φ(z|x_j) and the pre-defined distribution p_θ(z), the resulting dynamic prior p_f(z|X) is well-defined.
Proof.
The proof can be found in section B.2. ∎
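With a mixture as dynamic prior, the regularizer of equation (4) becomes the JS-divergence of equation (5), which has no closed form because the mixture density appears inside the KL terms. As an illustration, it can be estimated by Monte Carlo; the sketch below assumes one-dimensional Gaussian unimodal posteriors and uses a log-sum-exp for numerical stability (function names are ours):

```python
import numpy as np

def gauss_logpdf(x, mu, var):
    """Log-density of a 1-D Gaussian N(mu, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def js_divergence_mc(mus, vars_, weights, n=20_000, seed=0):
    """Monte Carlo estimate of the K-distribution JS divergence of
    equation (5): sum_j pi_j KL(p_j || mixture), for 1-D Gaussians p_j."""
    rng = np.random.default_rng(seed)
    mus = np.asarray(mus, dtype=float)
    vars_ = np.asarray(vars_, dtype=float)
    w = np.asarray(weights, dtype=float)
    js = 0.0
    for j in range(len(mus)):
        x = rng.normal(mus[j], np.sqrt(vars_[j]), size=n)  # x ~ p_j
        log_pj = gauss_logpdf(x, mus[j], vars_[j])
        # log mixture density via log-sum-exp
        log_comp = np.stack([np.log(w[k]) + gauss_logpdf(x, mus[k], vars_[k])
                             for k in range(len(mus))])
        m = log_comp.max(axis=0)
        log_mix = m + np.log(np.exp(log_comp - m).sum(axis=0))
        js += w[j] * np.mean(log_pj - log_mix)
    return js
```

For identical components the estimate is zero, and for well-separated equal-weight components it approaches log K, the upper bound of the JS-divergence.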
With Lemma 1, the new multimodal objective utilizing the Jensen-Shannon divergence (Definition 1) can now be derived directly from the ELBO in equation (2).
Lemma 2.
The mmJSD objective of Definition 1, with the mixture distribution of Lemma 1 as its dynamic prior, is a lower bound on the marginal log-likelihood log p_θ(X).
Proof.
The statement follows by combining the upper bound of section B.1 with Lemma 1; for details, please see appendix B. ∎
3.4 Generalized Jensen-Shannon Divergence
Nielsen (2019) defines the JS-divergence for the general case of abstract means. This makes it possible to calculate the JS-divergence not only with the arithmetic mean, as in the standard formulation, but with any mean function. Abstract means are a suitable class of functions for aggregating information from different distributions while being able to handle missing data Nielsen (2019).
Definition 3.
The dynamic prior p_g(z|X) is defined as the geometric mean of the unimodal posterior approximations q_φ(z|x_j) and the pre-defined distribution p_θ(z):

p_g(z|X) ∝ (p_θ(z) · Π_{j=1}^M q_φ(z|x_j))^{1/(M+1)}

For Gaussian distributed arguments, the geometric mean is again Gaussian distributed and equivalent to a weighted PoE Hinton (1999). The proof that p_g is a well-defined prior can be found in section B.3. Utilizing Definition 3, the JS-divergence in equation (4) can be calculated in closed form. This allows the optimization of the proposed multimodal objective in a computationally efficient way, while also tackling the limitations of previous work outlined in section 2.1. For all experiments, we use a dynamic prior of this form, as given in Definition 3.
3.5 Modality-specific Latent Subspaces
We define our latent representations as a combination of modality-specific subspaces and a shared, modality-independent subspace: z = {s_1, …, s_M, c}. Every modality x_j is modelled to have its own independent, modality-specific part s_j. Additionally, we assume a joint content c for all x_j ∈ X, which captures the information that is shared across modalities. The s_j and c are considered conditionally independent given X. Different to previous work Bouchacourt et al. (2018); Tsai et al. (2018), we empirically show that meaningful representations can be learned in a self-supervised setting, using only the supervision which is given naturally for multimodal problems. Building on what we derived in sections 2 and 3, and on the assumptions outlined above, we model the modality-specific divergence terms similarly to the unimodal setting, as there is no inter-modality relationship associated with them. Applying these assumptions to equation (4), it follows (for details, please see section B.4):
L(θ, φ; X) = E_{q_φ(z|X)}[log p_θ(X|z)] − Σ_{j=1}^M KL(q_φ(s_j|x_j) || p_θ(s_j)) − JS_π^{M+1}({q_φ(c|x_j)}_{j=1}^M, p_θ(c))   (10)

The objective in equation (4) is thus split into two different divergence terms: the JS-divergence is used only for the multimodal latent factors c, while the modality-specific factors s_j are handled by a sum of KL-divergences. Following the common line in VAE research, the variational approximation functions q_φ(s_j|x_j) and q_φ(c|X), as well as the generative models p_θ(x_j|s_j, c), are parameterized by neural networks.
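Given the per-modality reconstruction log-likelihoods, the modality-specific posterior parameters, and a value for the JS term over the content factors, the objective in equation (10) is just a sum of these parts. A hypothetical assembly sketch (assuming standard-normal priors on the modality-specific factors; names ours):

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def mmjsd_subspace_loss(recon_log_liks, style_params, js_content):
    """Assemble the negative of the objective in equation (10):
    reconstruction terms minus modality-specific KLs minus the JS term
    for the shared content factors (all assumed precomputed)."""
    recon = sum(recon_log_liks)
    style_kl = sum(kl_to_std_normal(mu, logvar) for mu, logvar in style_params)
    return -(recon - style_kl - js_content)  # minimize the negative ELBO
```

The split means the modality-specific terms can be computed independently per modality, and only the content term couples the encoders.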
4 Experiments & Results
We carry out experiments on two different datasets. For the first experiment, we use a matching-digits dataset consisting of MNIST LeCun and Cortes (2010) and SVHN Netzer et al. (2011) images with an additional text modality. This experiment provides empirical evidence on a method's generalizability to more than two modalities. The second experiment is carried out on the challenging CelebA faces dataset Liu et al. (2015) with additional text describing the attributes of the shown face. The CelebA dataset is highly imbalanced regarding the distribution of attributes, which poses additional challenges for generative models.
4.1 Evaluation
We evaluate the quality of models with respect to the multimodal wishlist introduced in section 1. To assess the discriminative capabilities of a model, we evaluate the latent representations with respect to the input data's semantic information. We employ a linear classifier on the unimodal and multimodal posterior approximations. To assess the generative performance, we evaluate generated samples according to their quality and coherence. Generation should be coherent across all modalities with respect to shared information: conditionally generated samples should be coherent with the input data, and randomly generated samples with each other. For every data type, we use a classifier which was trained on the original training set Ravuri and Vinyals (2019) to evaluate the coherence of generated samples. To assess the quality of generated samples, we use the precision and recall metric for generative models Sajjadi et al. (2018), where precision measures the quality and recall the diversity of the generated samples. In addition, we evaluate all models regarding their test set log-likelihoods. We compare the proposed method to two state-of-the-art models: the MVAE model Wu and Goodman (2018) and the MMVAE model Shi et al. (2019), described in section 2.1. We use the same encoder and decoder networks and the same number of parameters for all methods. Implementation details for all experiments, together with a comparison of runtimes, can be found in section C.2.
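The linear-classifier evaluation of latent representations can be sketched with a nearest-class-mean probe, a simple linear read-out that we substitute here for illustration (the actual experiments may use a different linear classifier; helper names are ours):

```python
import numpy as np

def fit_linear_probe(z_train, y_train):
    """Nearest-class-mean probe: store one mean latent vector per class."""
    classes = np.unique(y_train)
    means = np.stack([z_train[y_train == c].mean(axis=0) for c in classes])
    return classes, means

def probe_accuracy(probe, z_test, y_test):
    """Classify each latent code by its closest class mean (a linear rule)."""
    classes, means = probe
    d = ((z_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return float((classes[d.argmin(axis=1)] == y_test).mean())
```

High probe accuracy indicates that the shared semantic information (e.g. the digit class) is linearly decodable from the posterior approximation.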
4.2 MNISTSVHNText
Table 1: Classification accuracy of the learned latent representations, given different subsets of modalities as input (M: MNIST, S: SVHN, T: text).

| Model | M | S | T | M,S | M,T | S,T | Joint |
|---|---|---|---|---|---|---|---|
| MVAE | 0.85 | 0.20 | 0.58 | 0.80 | 0.92 | 0.46 | 0.90 |
| MMVAE | 0.96 | 0.81 | 0.99 | 0.89 | 0.97 | 0.90 | 0.93 |
| mmJSD | 0.97 | 0.82 | 0.99 | 0.93 | 0.99 | 0.92 | 0.98 |
| MVAE (MS) | 0.86 | 0.28 | 0.78 | 0.82 | 0.94 | 0.64 | 0.92 |
| MMVAE (MS) | 0.96 | 0.81 | 0.99 | 0.89 | 0.98 | 0.91 | 0.92 |
| mmJSD (MS) | 0.98 | 0.85 | 0.99 | 0.94 | 0.98 | 0.94 | 0.99 |
Previous works on scalable, multimodal methods performed no evaluation on more than two modalities¹. We use the MNIST-SVHN dataset Shi et al. (2019) as basis. To this dataset, we add an additional, text-based modality. The texts consist of strings which name the digit in English, where the start index of the word is chosen at random to introduce more diversity into the data. To evaluate the effect of the dynamic prior as well as of modality-specific latent subspaces, we first compare models with a single shared latent space. In a second comparison, we add modality-specific subspaces to all models (for these experiments, we add an (MS) suffix to the model names). This allows us to assess and evaluate the contribution of the dynamic prior as well as of the modality-specific subspaces. Different subspace sizes are compared in section C.2.

¹ Wu and Goodman (2018) designed a multimodal experiment for the CelebA dataset where every attribute is considered a modality.
Table 2: Coherence of generated samples. "A→B" denotes modality B generated conditioned on modality subset A; "Random" denotes unconditional joint generation.

| Model | Random | S→M | T→M | S,T→M | M→S | T→S | M,T→S | M→T | S→T | M,S→T |
|---|---|---|---|---|---|---|---|---|---|---|
| MVAE | 0.72 | 0.17 | 0.14 | 0.22 | 0.37 | 0.30 | 0.86 | 0.20 | 0.12 | 0.22 |
| MMVAE | 0.54 | 0.82 | 0.99 | 0.91 | 0.32 | 0.30 | 0.31 | 0.96 | 0.83 | 0.90 |
| mmJSD | 0.60 | 0.82 | 0.99 | 0.95 | 0.37 | 0.36 | 0.48 | 0.97 | 0.83 | 0.92 |
| MVAE (MS) | 0.74 | 0.16 | 0.17 | 0.25 | 0.35 | 0.37 | 0.85 | 0.24 | 0.14 | 0.26 |
| MMVAE (MS) | 0.67 | 0.77 | 0.97 | 0.86 | 0.88 | 0.93 | 0.90 | 0.82 | 0.70 | 0.76 |
| mmJSD (MS) | 0.66 | 0.80 | 0.97 | 0.93 | 0.89 | 0.93 | 0.92 | 0.92 | 0.79 | 0.86 |
Table 3: Quality of generated MNIST (M) and SVHN (S) samples. "A→B" denotes modality B generated conditioned on subset A; "R" denotes random generation.

| Model | S→M | T→M | S,T→M | R→M | M→S | T→S | M,T→S | R→S |
|---|---|---|---|---|---|---|---|---|
| MVAE | 0.62 | 0.62 | 0.58 | 0.62 | 0.33 | 0.34 | 0.22 | 0.33 |
| MMVAE | 0.22 | 0.09 | 0.18 | 0.35 | 0.005 | 0.006 | 0.006 | 0.27 |
| mmJSD | 0.19 | 0.09 | 0.16 | 0.15 | 0.05 | 0.01 | 0.06 | 0.09 |
| MVAE (MS) | 0.60 | 0.59 | 0.50 | 0.60 | 0.30 | 0.33 | 0.17 | 0.29 |
| MMVAE (MS) | 0.62 | 0.63 | 0.63 | 0.52 | 0.21 | 0.20 | 0.20 | 0.19 |
| mmJSD (MS) | 0.62 | 0.64 | 0.64 | 0.30 | 0.21 | 0.22 | 0.22 | 0.17 |
Table 1 and Table 2 demonstrate that the proposed mmJSD objective generalizes better to three modalities than previous work. The difficulty of the MVAE objective in optimizing the unimodal posterior approximations is reflected in the coherence numbers for missing data types and in the latent representation classification. Although MMVAE is able to produce good results if only a single data type is given, the model cannot leverage the additional information of multiple available observations: given multiple modalities, the corresponding performance numbers are the arithmetic mean of their unimodal pendants. The mmJSD model achieves state-of-the-art performance in optimizing the unimodal posteriors, and outperforms previous work in leveraging multiple modalities, thanks to the dynamic prior. The introduction of modality-specific subspaces increases the coherence of the difficult SVHN modality for MMVAE and mmJSD. More importantly, modality-specific latent spaces improve the quality of the generated samples for all modalities (see Table 3). Figure 1 shows qualitative results. Table 4 provides evidence that the high coherence of generated samples of the mmJSD model is not traded off against test set log-likelihoods. It also shows that MVAE is able to learn the statistics of a dataset well, but not to preserve the content in case of missing modalities.
Table 4: Test set log-likelihoods, given different subsets of modalities.

| Model | M | S | T | M,S | M,T | S,T | Joint |
|---|---|---|---|---|---|---|---|
| MVAE | 1864 | 2002 | 1936 | 2040 | 1881 | 1970 | 1908 |
| MMVAE | 1916 | 2213 | 1911 | 2250 | 2062 | 2231 | 2080 |
| mmJSD | 1961 | 2175 | 1955 | 2249 | 2000 | 2121 | 2004 |
| MVAE (MS) | 1870 | 1999 | 1937 | 2033 | 1886 | 1971 | 1909 |
| MMVAE (MS) | 1893 | 1982 | 1934 | 1995 | 1905 | 1958 | 1915 |
| mmJSD (MS) | 1900 | 1994 | 1944 | 2006 | 1907 | 1968 | 1918 |
4.3 Bimodal CelebA
Every CelebA image is labelled according to 40 attributes. We extend the dataset with an additional text modality describing the face in the image using the labelled attributes. Examples of the created strings can be seen in Figure 2. Any negative attribute is completely absent from the string. This is different from, and more difficult to learn than, negated attributes, as there is no fixed position for a certain attribute in a string, which introduces additional variability into the data. Figure 2 shows qualitative results for images which are generated conditioned on text samples. Every row of images is based on the text next to it. As the labelled attributes do not capture all possible variation of a face, we generate 10 images with randomly sampled image-specific information to capture the distribution of information which is not encoded in the shared latent space. The imbalance of some attributes affects the generative process: rare and subtle attributes like eyeglasses are difficult to learn, while frequent attributes like gender and smiling are learnt well.
Table 5 demonstrates the superior performance of the proposed mmJSD objective compared to previous work on the challenging bimodal CelebA dataset. The classification results for the individual attributes can be found in section C.3.
Table 5: Bimodal CelebA. Classification accuracy of the latent representations (I: image, T: text) and coherence of conditionally generated samples (I→T: text generated from an image, T→I: image generated from text).

| Model | I | T | Joint | I→T | T→I |
|---|---|---|---|---|---|
| MVAE (MS) | 0.42 | 0.45 | 0.44 | 0.32 | 0.30 |
| MMVAE (MS) | 0.43 | 0.45 | 0.42 | 0.30 | 0.36 |
| mmJSD (MS) | 0.48 | 0.59 | 0.57 | 0.32 | 0.42 |
5 Conclusion
In this work, we propose a novel generative model for learning from multimodal data. Our contributions are fourfold: (i) we formulate a new multimodal objective using a dynamic prior; (ii) we propose to use the JS-divergence for multiple distributions as a divergence measure for multimodal data, which enables the direct optimization of the unimodal as well as the joint latent approximation functions; (iii) we prove that the proposed mmJSD objective constitutes an ELBO for multiple data types; (iv) with the introduction of modality-specific latent spaces, we empirically show the improvement in the quality of generated samples. Additionally, we demonstrate that the proposed method does not need any additional training objectives while reaching state-of-the-art or superior performance compared to recently proposed, scalable, multimodal generative models. In future work, we would like to further investigate which functions serve well as prior functions, and we will apply our proposed model in the medical domain.
6 Broader Impact
Learning from multiple data types offers many potential applications and opportunities, as multiple data types naturally co-occur. We intend to apply our model in the medical domain in future work, and we will focus here on the impact our model might have in the medical application area. Models that are capable of dealing with large-scale multimodal data are extremely important in the field of computational medicine and clinical data analysis. The recent developments in medical information technology have resulted in an overwhelming amount of multimodal data available for every single patient. A patient visit at a hospital may result in tens of thousands of measurements and structured information, including clinical factors, diagnostic imaging, lab tests, genomic and proteomic tests, and hospitals may see thousands of patients each year. The ultimate aim is to use all this vast information for a medical treatment tailored to the needs of an individual patient. To turn the vision of precision medicine into reality, there is an urgent need for the integration of the multimodal patient data currently available for improved disease diagnosis, prognosis and therapy outcome prediction. Instead of learning on one data set exclusively, as for example just on images or just on genetics, the aim is to improve learning and enhance personalized treatment by using as much information as possible for every patient. First steps in this direction have been successful, but so far a major hurdle has been the huge amount of heterogeneous data with many missing data points which is collected for every patient.
With this work, we lay the theoretical foundation for the analysis of large-scale multimodal data. We focus on a self-supervised approach, as collecting labels for large datasets of multiple data types is expensive and quickly becomes infeasible with a growing number of modalities. Self-supervised approaches have the potential to overcome the need for excessive labelling and the bias coming from these labels. In this work, we extensively tested the model in controlled environments. In future work, we will apply our proposed model to medical multimodal data with the goal of gaining insights and making predictions about disease phenotypes, disease progression and response to treatment.
References
 Aslam and Pavlu (2007) J. A. Aslam and V. Pavlu. Query hardness estimation using Jensen-Shannon divergence among multiple scoring functions. In European Conference on Information Retrieval, pages 198–209. Springer, 2007.
 Baltrušaitis et al. (2018) T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2018.
 Bouchacourt et al. (2018) D. Bouchacourt, R. Tomioka, and S. Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Burda et al. (2015) Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Fang et al. (2015) H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, and others. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Hershey and Olsen (2007) J. R. Hershey and P. A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), volume 4, pages IV-317. IEEE, 2007.
 Hinton (1999) G. E. Hinton. Products of experts. 1999.
 Hsu and Glass (2018) W.-N. Hsu and J. Glass. Disentangling by partitioning: A representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264, 2018.
 Huang et al. (2018) X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
 Karpathy and Fei-Fei (2015) A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kurle et al. (2018) R. Kurle, S. Günnemann, and P. van der Smagt. Multi-source neural variational inference. arXiv preprint arXiv:1811.04451, 2018.
 LeCun and Cortes (2010) Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Lin (1991) J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
 Liu et al. (2015) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In The IEEE International Conference on Computer Vision (ICCV), 2015.
 Netzer et al. (2011) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
 Ngiam et al. (2011) J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
 Nielsen (2019) F. Nielsen. On the Jensen-Shannon symmetrization of distances relying on abstract means. Entropy, 2019. doi: 10.3390/e21050485.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and others. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
 Ravuri and Vinyals (2019) S. Ravuri and O. Vinyals. Classification accuracy score for conditional generative models. In Advances in Neural Information Processing Systems, pages 12247–12258, 2019.
 Sajjadi et al. (2018) M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pages 5228–5237, 2018.
 Shi et al. (2019) Y. Shi, N. Siddharth, B. Paige, and P. Torr. Variational mixture-of-experts autoencoders for multi-modal deep generative models. In Advances in Neural Information Processing Systems, pages 15692–15703, 2019.
 Suzuki et al. (2016) M. Suzuki, K. Nakayama, and Y. Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.
 Tian and Engel (2019) Y. Tian and J. Engel. Latent translation: Crossing modalities by bridging generative models. arXiv preprint arXiv:1902.08261, 2019.
 Tomczak and Welling (2017) J. M. Tomczak and M. Welling. VAE with a VampPrior. arXiv preprint arXiv:1705.07120, 2017.
 Tsai et al. (2018) Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov. Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176, 2018.
 Vedantam et al. (2017) R. Vedantam, I. Fischer, J. Huang, and K. Murphy. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762, 2017.
 Wu and Goodman (2018) M. Wu and N. Goodman. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, 2018.
 Zadeh et al. (2017) A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250, 2017.
 Zhu et al. (2017) J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.
Appendix A Theoretical Background
The ELBO can be derived by reformulating the KL-divergence between the joint posterior approximation function q_φ(z|X) and the true posterior distribution p_θ(z|X):

KL(q_φ(z|X) || p_θ(z|X)) = E_{q_φ(z|X)}[log q_φ(z|X) − log p_θ(X|z) − log p_θ(z)] + log p_θ(X)   (11)

It follows:

log p_θ(X) = KL(q_φ(z|X) || p_θ(z|X)) + E_{q_φ(z|X)}[log p_θ(X|z)] − KL(q_φ(z|X) || p_θ(z))   (12)

From the non-negativity of the KL-divergence, it directly follows:

log p_θ(X) ≥ E_{q_φ(z|X)}[log p_θ(X|z)] − KL(q_φ(z|X) || p_θ(z)) =: L(θ, φ; X)   (13)

In the absence of one or multiple data types, we would still like to be able to approximate the true multimodal posterior distribution p_θ(z|X). However, we are only able to approximate the posterior by a variational function q_φ(z|X̃_K) with K ≤ M available modalities. In addition, for different samples, different modalities might be missing. The derivation of the ELBO formulation changes accordingly:

log p_θ(X) = KL(q_φ(z|X̃_K) || p_θ(z|X)) + E_{q_φ(z|X̃_K)}[log p_θ(X|z)] − KL(q_φ(z|X̃_K) || p_θ(z))   (14)

From where it again follows:

log p_θ(X) ≥ E_{q_φ(z|X̃_K)}[log p_θ(X|z)] − KL(q_φ(z|X̃_K) || p_θ(z)) =: L̃_K(θ, φ; X̃_K)   (15)
Appendix B Multimodal Jensen-Shannon Divergence Objective
In this section, we provide the proofs of the Lemmas introduced in the main paper. Due to space restrictions, these proofs had to be moved to the appendix.
B.1 Upper bound to the KL-divergence of a mixture distribution
Lemma 3 (Joint Approximation Function).
Under the assumption that q_φ(z|X) is a mixture model of the unimodal variational posterior approximations q_φ(z|x_j), the KL-divergence of the multimodal variational posterior approximation is a lower bound for the weighted sum of the KL-divergences of the unimodal variational approximation functions q_φ(z|x_j):

KL(q_φ(z|X) || p_θ(z)) ≤ Σ_{j=1}^M π_j KL(q_φ(z|x_j) || p_θ(z))   (16)
Proof.
Lemma 3 follows directly from the convexity of the KL-divergence in its first argument, an instance of Jensen's inequality. ∎
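Lemma 3 can be checked numerically for one-dimensional Gaussians: the left-hand side, the KL of the mixture, is estimated by Monte Carlo, while the right-hand side is a weighted sum of closed-form Gaussian KLs. A sketch under these assumptions (names ours):

```python
import numpy as np

def gauss_logpdf(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL between two 1-D Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def kl_mixture_vs_gauss_mc(mus, vars_, weights, mu_p, var_p, n=50_000, seed=0):
    """Monte Carlo estimate of KL( sum_j pi_j N(mu_j, var_j) || N(mu_p, var_p) )."""
    rng = np.random.default_rng(seed)
    mus = np.asarray(mus, dtype=float)
    vars_ = np.asarray(vars_, dtype=float)
    idx = rng.choice(len(mus), size=n, p=weights)
    x = rng.normal(mus[idx], np.sqrt(vars_[idx]))  # x ~ mixture
    log_comp = np.stack([np.log(weights[k]) + gauss_logpdf(x, mus[k], vars_[k])
                         for k in range(len(mus))])
    m = log_comp.max(axis=0)
    log_mix = m + np.log(np.exp(log_comp - m).sum(axis=0))
    return np.mean(log_mix - gauss_logpdf(x, mu_p, var_p))
```

On a two-component example, the estimate stays strictly below the weighted sum of the individual KLs, as the bound in equation (16) states.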
B.2 MoE-Prior
Definition 4 (MoE-Prior).
The prior is defined as follows:
(17) p_f(z|\mathbb{X}) = \sum_{m=1}^{M} \pi_m q_{\phi_m}(z|x_m) + \pi_{M+1} p_\theta(z)
where q_{\phi_m}(z|x_m) are again the unimodal approximation functions and p_\theta(z) is a predefined, parameterizable distribution. The mixture weights \pi_m sum to one, i.e. \sum_{m=1}^{M+1} \pi_m = 1.
We prove that the MoE-prior is a well-defined prior (see Lemma 1):
Proof.
To be a well-defined prior, p_f(z|\mathbb{X}) must satisfy the following condition:
(18) \int p_f(z|\mathbb{X}) \, dz = 1
Therefore,
(19) \int p_f(z|\mathbb{X}) \, dz = \sum_{m=1}^{M} \pi_m \int q_{\phi_m}(z|x_m) \, dz + \pi_{M+1} \int p_\theta(z) \, dz = \sum_{m=1}^{M+1} \pi_m = 1
The unimodal approximation functions as well as the predefined distribution are well-defined probability distributions. Hence, \int q_{\phi_m}(z|x_m) dz = 1 for all m and \int p_\theta(z) dz = 1. The last equality in Equation (19) follows from the assumption that the mixture weights sum to one. Therefore, Equation (17) defines a well-defined prior. ∎
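A discrete analogue of Equation (19) makes the argument concrete (toy numbers, for illustration only): any convex combination of normalized distributions is itself normalized.

```python
# Two unimodal approximations, a predefined distribution, and mixture weights
q1 = [0.7, 0.2, 0.1]
q2 = [0.1, 0.6, 0.3]
p = [1 / 3, 1 / 3, 1 / 3]   # predefined distribution
pi = [0.4, 0.4, 0.2]        # mixture weights, sum to 1

# Discrete MoE-prior: a convex combination of normalized distributions
prior = [pi[0] * a + pi[1] * b + pi[2] * c for a, b, c in zip(q1, q2, p)]
print(sum(prior))  # ~1.0: the mixture is normalized
```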
B.3 PoE-Prior
Lemma 4.
Under the assumption that all q_{\phi_m}(z|x_m) are Gaussian distributed by \mathcal{N}(\mu_m, \Sigma_m), the product p_f(z|\mathbb{X}) \propto p_\theta(z) \prod_{m=1}^{M} q_{\phi_m}(z|x_m) is Gaussian distributed:
(20) p_f(z|\mathbb{X}) = \mathcal{N}(\mu_f, \Sigma_f)
where \mu_f and \Sigma_f are defined as follows:
(21) \Sigma_f = \left(\sum_{m=0}^{M} \Sigma_m^{-1}\right)^{-1}, \quad \mu_f = \Sigma_f \left(\sum_{m=0}^{M} \Sigma_m^{-1} \mu_m\right)
with the predefined distribution p_\theta(z) = \mathcal{N}(\mu_0, \Sigma_0) treated as an additional expert, which makes p_f(z|\mathbb{X}) a well-defined prior.
Proof.
As p_f(z|\mathbb{X}) is Gaussian distributed, it is a well-defined probability distribution; hence, it follows immediately that p_f(z|\mathbb{X}) is a well-defined dynamic prior. ∎
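Equation (21) is the usual precision-weighted combination of Gaussian experts; a minimal 1-D sketch (illustrative parameters, not from the paper):

```python
# Product of 1-D Gaussian experts: the combined precision is the sum of
# the expert precisions, and the mean is the precision-weighted average.
def poe(mus, variances):
    precisions = [1.0 / v for v in variances]
    var_f = 1.0 / sum(precisions)
    mu_f = var_f * sum(m * lam for m, lam in zip(mus, precisions))
    return mu_f, var_f

# The normalized product of N(0, 1) and N(2, 1) is N(1, 0.5)
print(poe([0.0, 2.0], [1.0, 1.0]))  # (1.0, 0.5)
```

Note how the product distribution is sharper than any single expert, which is why the text weights PoE posteriors by their inverse variance.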
B.4 Factorization of Representations
We mostly base our derivation of factorized representations on the paper by Bouchacourt et al. [2018]; Tsai et al. [2018] and Hsu and Glass [2018] used a similar idea. A set of modalities can be seen as a group and, analogously, every modality as a member of a group. We model every modality x_m to have its own modality-specific latent code s_m:
(22) \mathbb{S} = \{s_1, \ldots, s_M\}
From Equation (22), we see that \mathbb{S} is the collection of all modality-specific latent variables for the set \mathbb{X}. Contrary to this, the modality-invariant latent code c is shared between all modalities of the set \mathbb{X}. Also like Bouchacourt et al. [2018], we model the variational approximation function to be conditionally independent given \mathbb{X}, i.e.:
(23) q_\phi(\mathbb{S}|\mathbb{X}) = \prod_{m=1}^{M} q_{\phi_m}(s_m|x_m)
From the assumptions it is clear that the joint posterior approximation factorizes:
(24) q_\phi(\mathbb{S}, c|\mathbb{X}) = q_\phi(c|\mathbb{X}) \prod_{m=1}^{M} q_{\phi_m}(s_m|x_m)
From Equation (24) and the fact that the multimodal relationships are only modelled by the latent factor c, it is reasonable to only apply the mmJSD objective to c. It follows:
(25) \mathcal{L}(\theta, \phi; \mathbb{X}) = \mathbb{E}_{q_\phi(\mathbb{S}, c|\mathbb{X})}\left[\log p_\theta(\mathbb{X}|\mathbb{S}, c)\right] - \sum_{m=1}^{M} \mathrm{KL}(q_{\phi_m}(s_m|x_m) \,\|\, p_\theta(s_m)) - \mathrm{KL}(q_\phi(c|\mathbb{X}) \,\|\, p_\theta(c))
In Equation (25), we can rewrite the KL-divergence which includes c using the multimodal dynamic prior and the JS-divergence for multiple distributions:
(26) \mathrm{JS}_{\pi}^{M+1}\left(\{q_{\phi_m}(c|x_m)\}_{m=1}^{M}, p_\theta(c)\right) = \sum_{m=1}^{M} \pi_m \mathrm{KL}(q_{\phi_m}(c|x_m) \,\|\, p_f(c|\mathbb{X})) + \pi_{M+1} \mathrm{KL}(p_\theta(c) \,\|\, p_f(c|\mathbb{X}))
The expectation over q_\phi(\mathbb{S}, c|\mathbb{X}) can be rewritten as a concatenation of expectations over q_\phi(c|\mathbb{X}) and q_{\phi_m}(s_m|x_m):
(27) \mathbb{E}_{q_\phi(\mathbb{S}, c|\mathbb{X})}\left[\log p_\theta(\mathbb{X}|\mathbb{S}, c)\right] = \sum_{m=1}^{M} \mathbb{E}_{q_\phi(c|\mathbb{X})}\left[\mathbb{E}_{q_{\phi_m}(s_m|x_m)}\left[\log p_\theta(x_m|s_m, c)\right]\right]
Combining Equations (26) and (27), the final form of the factorized objective \tilde{\mathcal{L}} follows directly:
(28) \tilde{\mathcal{L}}(\theta, \phi; \mathbb{X}) = \sum_{m=1}^{M} \mathbb{E}_{q_\phi(c|\mathbb{X})}\left[\mathbb{E}_{q_{\phi_m}(s_m|x_m)}\left[\log p_\theta(x_m|s_m, c)\right]\right] - \sum_{m=1}^{M} \mathrm{KL}(q_{\phi_m}(s_m|x_m) \,\|\, p_\theta(s_m)) - \mathrm{JS}_{\pi}^{M+1}\left(\{q_{\phi_m}(c|x_m)\}_{m=1}^{M}, p_\theta(c)\right)
B.5 JS-divergence as inter-modality divergence
Utilizing the JS-divergence as regularization term, as proposed in this work, has multiple effects on the training procedure. The first is the introduction of the dynamic prior, as described in the main paper. A second effect is the minimization of the inter-modality divergence, i.e. the difference between the posterior approximations of the individual modalities. For coherent generation, the posterior approximations of all modalities should be similar, such that, if only a single modality is given, the decoders of the missing data types are able to generate coherent samples. Using the JS-divergence as regularization term keeps the unimodal posterior approximations close to their mixture distribution. Minimizing the divergence between the unimodal distributions and their mixture can be seen as an efficient approximation to minimizing the pairwise unimodal divergences, i.e. the inter-modality divergences. Wu and Goodman [2018] report problems in optimizing the unimodal posterior approximations: these lead to diverging posterior approximations, which in turn result in poor coherence for missing-data generation, as the decoders of the missing modalities cannot handle diverging posterior approximations.
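The generalized JS-divergence used as inter-modality regularizer can be sketched for discrete distributions (toy numbers; here the mixture of the posteriors plays the role of the dynamic prior):

```python
import math

def kl(q, p):
    # KL divergence between two discrete distributions
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def js(dists, weights):
    # Generalized JS-divergence: weighted KLs to the mixture distribution
    mixture = [sum(w * d[i] for w, d in zip(weights, dists))
               for i in range(len(dists[0]))]
    return sum(w * kl(d, mixture) for w, d in zip(weights, dists))

# Three "unimodal posteriors" over 3 states, equally weighted
qs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [1 / 3, 1 / 3, 1 / 3]]
w = [1 / 3, 1 / 3, 1 / 3]
print(js(qs, w) >= 0.0)   # non-negative
print(js([qs[0]] * 3, w))  # ~0.0 when all posteriors agree
```

The regularizer vanishes exactly when all unimodal posteriors coincide, which is the behaviour the text describes: pulling the unimodal approximations towards their mixture and hence towards each other.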
Appendix C Experiments
In this section we describe the architecture and implementation details of the different experiments. Additionally, we show more results and ablation studies.
C.1 Evaluation
First, we describe the architectures and models used to evaluate classification accuracies.
C.1.1 Latent Representations
To evaluate the learned latent representations, we use a simple logistic regression classifier without any regularization, as predefined by scikit-learn Pedregosa et al. [2011]. Every linear classifier is trained on a single batch of latent representations. For simplicity, we always take the last batch of the training set to train the classifier. The trained linear classifier is then used to evaluate the latent representations of all samples in the test set.
C.1.2 Generated Samples
To evaluate generated samples regarding their content coherence, we classify them according to the attributes of the dataset. In case of missing data, the estimated data types must coincide with the available ones according to the attributes present in the available data types. In case of random generation, the generated samples of all modalities must coincide with each other. To evaluate the coherence of generated samples, classifiers are trained for every modality on the original, unimodal training set. If the detected attributes of all involved modalities are the same, the generated samples are called coherent. The architectures of all used classifiers can be seen in Tables 6 to 8.
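The coherence evaluation described above can be sketched as follows (hypothetical classifier outputs for illustration): a generated tuple counts as coherent when the per-modality classifiers all detect the same attribute.

```python
# Coherence of generated samples: the fraction of samples for which the
# per-modality classifiers agree on the predicted label.
def coherence(predictions):
    # predictions: list of per-sample tuples, one predicted label per modality
    coherent = sum(1 for labels in predictions if len(set(labels)) == 1)
    return coherent / len(predictions)

# Hypothetical predictions for 4 samples: (MNIST, SVHN, text) labels
preds = [(3, 3, 3), (7, 7, 1), (0, 0, 0), (9, 8, 9)]
print(coherence(preds))  # 0.5
```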
MNIST
Layer | Type | #F. In | #F. Out | Spec.
1 | conv | 1 | 32 | (4, 2, 1, 1)
2 | conv | 32 | 64 | (4, 2, 1, 1)
3 | conv | 64 | 128 | (4, 2, 1, 1)
4 | linear | 128 | 10 | -
SVHN
Layer | Type | #F. In | #F. Out | Spec.
1 | conv | 1 | 32 | (4, 2, 1, 1)
2 | conv | 32 | 64 | (4, 2, 1, 1)
3 | conv | 64 | 64 | (4, 2, 1, 1)
4 | conv | 64 | 128 | (4, 2, 0, 1)
5 | linear | 128 | 10 | -
Table 6: Layers of the MNIST and SVHN classifiers. For MNIST and SVHN, every convolutional layer is followed by a ReLU activation function. For SVHN, every convolutional layer is additionally followed by a dropout layer (dropout probability = 0.5); then, batch normalization is applied, followed by a ReLU activation function. The output activation is a sigmoid function for both classifiers. Specifications (Spec.) name kernel size, stride, padding and dilation.
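The spatial sizes produced by such a convolution stack follow from the standard output-size formula. The sketch below traces a 32x32 SVHN input through the (kernel, stride, padding, dilation) specifications from the table down to a 1x1 feature map before the linear layer.

```python
def conv_out(size, kernel, stride, padding, dilation):
    # Standard output-size formula for a convolution layer
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# SVHN classifier specs (kernel, stride, padding, dilation) from the table
specs = [(4, 2, 1, 1), (4, 2, 1, 1), (4, 2, 1, 1), (4, 2, 0, 1)]
size = 32
sizes = []
for k, s, p, d in specs:
    size = conv_out(size, k, s, p, d)
    sizes.append(size)
print(sizes)  # [16, 8, 4, 1]: the last conv collapses to 1x1
```

The 1x1 spatial output with 128 channels explains why a plain 128-to-10 linear layer can follow the last convolution.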
Layer | Type | #F. In | #F. Out | Spec.
1 | conv | 71 | 128 | (1, 1, 1, 1)
2 | residual | 128 | 192 | (4, 2, 1, 1)
3 | residual | 192 | 256 | (4, 2, 1, 1)
4 | residual | 256 | 256 | (4, 2, 1, 1)
5 | residual | 256 | 128 | (4, 2, 0, 1)
6 | linear | 128 | 10 | -
Table 7: Layers of the text classifier.
Image
Layer | Type | #F. In | #F. Out | Spec.
1 | conv | 3 | 128 | (3, 2, 1, 1)
2 | res | 128 | 256 | (4, 2, 1, 1)
3 | res | 256 | 384 | (4, 2, 1, 1)
4 | res | 384 | 512 | (4, 2, 1, 1)
5 | res | 512 | 640 | (4, 2, 0, 1)
6 | linear | 640 | 40 | -
Text
Layer | Type | #F. In | #F. Out | Spec.
1 | conv | 71 | 128 | (3, 2, 1, 1)
2 | res | 128 | 256 | (4, 2, 1, 1)
3 | res | 256 | 384 | (4, 2, 1, 1)
4 | res | 384 | 512 | (4, 2, 1, 1)
5 | res | 512 | 640 | (4, 2, 1, 1)
6 | residual | 640 | 768 | (4, 2, 1, 1)
7 | residual | 768 | 896 | (4, 2, 0, 1)
8 | linear | 896 | 40 | -
Table 8: Layers of the image and text classifiers. The image classifier uses residual layers followed by a linear layer which maps to 40 output neurons representing the 40 attributes. The text classifier also uses residual layers, but with 1-d convolutions. The output activation is a sigmoid function for both classifiers. Specifications (Spec.) name kernel size, stride, padding and dilation.
C.2 MNIST-SVHN-Text
C.2.1 Text Modality
To obtain an additional modality, we generate text from the labels. As a single word is quite easy to learn, we create strings of length 8 in which every character is a blank space except for the digit-word. The starting position of the word is chosen randomly to increase the difficulty of the learning task. Some example strings can be seen in Table 9.
Table 9: Example strings of the text modality (surrounding blank spaces not shown).
six
eight
three
five
nine
zero
four
three
seven
five
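The construction described above can be sketched in a few lines (a minimal sketch of the procedure; the helper name and word list are illustrative):

```python
import random

# Fixed-length string that is blank except for the digit-word,
# placed at a random starting position.
WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

def make_text(label, length=8):
    word = WORDS[label]
    start = random.randint(0, length - len(word))
    return " " * start + word + " " * (length - start - len(word))

random.seed(0)
s = make_text(7)
print(repr(s), len(s))  # e.g. '  seven ', always length 8
```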
Encoder
Layer | Type | # Features In | # Features Out
1 | linear | 784 | 400
2a | linear | 400 | 20
2b | linear | 400 | 20
Decoder
Layer | Type | # Features In | # Features Out
1 | linear | 20 | 400
2 | linear | 400 | 784
Table 10: Layers of the MNIST encoder and decoder.
Encoder
Layer | Type | #F. In | #F. Out | Spec.
1 | conv | 3 | 32 | (4, 2, 1, 1)
2 | conv | 32 | 64 | (4, 2, 1, 1)
3 | conv | 64 | 64 | (4, 2, 1, 1)
4 | conv | 64 | 128 | (4, 2, 0, 1)
5a | linear | 128 | 20 | -
5b | linear | 128 | 20 | -
Decoder
Layer | Type | #F. In | #F. Out | Spec.
1 | linear | 20 | 128 | -
2 | conv | 128 | 64 | (4, 2, 0, 1)
3 | conv | 64 | 64 | (4, 2, 1, 1)
4 | conv | 64 | 32 | (4, 2, 1, 1)
5 | conv | 32 | 3 | (4, 2, 1, 1)
Table 11: Layers of the SVHN encoder and decoder.
Encoder
Layer | Type | #F. In | #F. Out | Spec.
1 | conv | 71 | 128 | (1, 1, 0, 1)
2 | conv | 128 | 128 | (4, 2, 1, 1)
3 | conv | 128 | 128 | (4, 2, 0, 1)
4a | linear | 128 | 20 | -
4b | linear | 128 | 20 | -
Decoder
Layer | Type | #F. In | #F. Out | Spec.
1 | linear | 20 | 128 | -
2 | conv | 128 | 128 | (4, 1, 0, 1)
3 | conv | 128 | 128 | (4, 2, 1, 1)
4 | conv | 128 | 71 | (1, 1, 0, 1)
Table 12: Layers of the text encoder and decoder.
C.2.2 Implementation Details
For MNIST and SVHN, we use the network architectures also utilized by Shi et al. [2019] (see Table 10 and Table 11). The network architecture used for the text modality is described in Table 12. For all encoders, the last layers, named a and b, map to the mean and variance of the posterior distribution. In case of modality-specific subspaces, there are four such last layers: two mapping to the mean and variance of the modality-specific code and two mapping to the mean and variance of the shared code.
To enable a joint latent space, all modalities are mapped to a 20-dimensional latent space (as in Shi et al. [2019]). For a latent space with modality-specific and independent subspaces, this restriction is no longer needed; only the modality-invariant subspaces of all data types must have the same number of latent dimensions. Nevertheless, we create modality-specific subspaces of the same size for all modalities. For the results reported in the main text, we set this size to 4. To have an equal number of parameters as in the experiment with only a shared latent space, we set the shared latent space to 16 dimensions. This allows for a fair comparison between the two variants regarding the capacity of the latent space. See section C.2.5 and Figure 5 for a detailed comparison regarding the size of the modality-specific subspaces. Modality-specific subspaces are one possibility to account for the difficulty of every data type.
The image modalities are modelled with a Laplace likelihood and the text modality is modelled with a categorical likelihood. The likelihood scaling is done according to the data size of every modality: the weight of the largest data type, i.e. SVHN, is set to 1.0, and the weights for MNIST and text are given by the ratio of the SVHN data size to the respective modality's data size. This scaling scheme stays the same for all experiments. The unimodal posteriors are weighted equally to form the joint distribution; this holds for MMVAE and mmJSD. For MVAE, the posteriors are weighted according to the inverse of their variance. For mmJSD, all modalities and the predefined distribution are weighted equally. We keep this for all experiments reported in the main paper; see section C.2.6 and Figure 6 for a more detailed analysis of distribution weights. For all experiments, we set β to 5.0. For all experiments with modality-specific subspaces, the β for the modality-specific subspaces is set equal to the number of modalities, i.e. 3. Additionally, the β for the text modality is set to 5.0; for the other two modalities, it is set to 1.0. The evaluation of different β values shows the stability of the model with respect to this hyperparameter (see Figure 3).
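The likelihood-scaling scheme can be sketched as follows. The per-sample data sizes are inferred from the encoder tables (MNIST 28x28, SVHN 32x32x3, text 8 one-hot characters over a 71-symbol alphabet) and are therefore assumptions: the largest modality gets weight 1.0 and the smaller modalities are up-weighted by the ratio of the data sizes.

```python
# Assumed per-sample data dimensionalities for each modality
dims = {"svhn": 32 * 32 * 3, "mnist": 28 * 28, "text": 8 * 71}

# Largest modality gets weight 1.0; others are scaled by the size ratio
largest = max(dims.values())
weights = {m: largest / d for m, d in dims.items()}
print(weights["svhn"])             # 1.0
print(round(weights["mnist"], 2))  # 3.92
```

The up-weighting keeps the reconstruction terms of small modalities from being dominated by the pixel-rich SVHN likelihood.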
All unimodal posterior approximations are assumed to be Gaussian distributed, q_{\phi_m}(z|x_m) = \mathcal{N}(\mu_m, \sigma_m^2 I), as is the predefined distribution, which is defined as p_\theta(z) = \mathcal{N}(0, I).
For training, we use a batch size of 256 and a starting learning rate of 0.001 together with an ADAM optimizer Kingma and Ba [2014]. We pair every MNIST image with 20 SVHN images, which increases the dataset size by a factor of 20. We train our models for 50 epochs in case of a shared latent space only, and for 100 epochs in case of modality-specific subspaces. This is the same for all methods.
C.2.3 Qualitative Results
Figure 4 shows qualitative results for the random generation of MNIST and SVHN samples.
C.2.4 Comparison to Shi et al.
The results reported by Shi et al. [2019] for the MMVAE model rely heavily on importance sampling (IS), as can be seen by comparing them to the numbers of a model without IS reported in their appendix. The IS-based objective Burda et al. [2015] is a different objective and difficult to compare to models without it. Hence, to have a fair comparison, we compared all models without an IS-based objective in the main paper. The focus of the paper is on the different joint posterior approximation functions and the corresponding ELBO, which should reflect the problems of a multimodal model.
For completeness, we compare the proposed model to the IS-based MMVAE model here in the appendix. Table 13 shows the training times for the different models; (I=30) names the model with 30 importance samples. Although the MMVAE (I=30) only needs 30 training epochs to converge, these 30 epochs take approximately 3 times as long as for the models without importance sampling. The additional passes through the decoders also add to the training time of the MMVAE (I=30) model. The MMVAE model and mmJSD need approximately the same time until training is finished. MVAE takes longer, as its training objective is a combination of ELBOs instead of a single objective.
Model | #epochs | runtime
MVAE | 50 | 3h 01min
MMVAE | 50 | 2h 01min
MMVAE (I=30) | 30 | 15h 15min
mmJSD | 50 | 2h 16min
MVAE (MS) | 100 | 6h 15min
MMVAE (MS) | 100 | 4h 10min
mmJSD (MS) | 100 | 4h 36min
Table 13: Comparison of training times.
Model | M | S | T | M,S | M,T | S,T | Joint
MMVAE | 0.96 | 0.81 | 0.99 | 0.89 | 0.97 | 0.90 | 0.93
MMVAE (I=30) | 0.92 | 0.67 | 0.99 | 0.80 | 0.96 | 0.83 | 0.86
mmJSD | 0.97 | 0.82 | 0.99 | 0.93 | 0.99 | 0.92 | 0.98
MMVAE (MS) | 0.96 | 0.81 | 0.99 | 0.89 | 0.98 | 0.91 | 0.92
mmJSD (MS) | 0.98 | 0.85 | 0.99 | 0.94 | 0.98 | 0.94 | 0.99
Table 14: Classification accuracy of the learned latent representations for all subsets of modalities (M: MNIST, S: SVHN, T: text).
Generated | - | M | M | M | S | S | S | T | T | T
Model | Random | S | T | S,T | M | T | M,T | M | S | M,S
MMVAE (I=30) | 0.60 | 0.71 | 0.99 | 0.85 | 0.76 | 0.68 | 0.72 | 0.95 | 0.73 | 0.84
MMVAE | 0.54 | 0.82 | 0.99 | 0.91 | 0.32 | 0.30 | 0.31 | 0.96 | 0.83 | 0.90
mmJSD | 0.60 | 0.82 | 0.99 | 0.95 | 0.37 | 0.36 | 0.48 | 0.97 | 0.83 | 0.92
MMVAE (MS) | 0.67 | 0.77 | 0.97 | 0.86 | 0.88 | 0.93 | 0.90 | 0.82 | 0.70 | 0.76
mmJSD (MS) | 0.66 | 0.80 | 0.97 | 0.93 | 0.89 | 0.93 | 0.92 | 0.92 | 0.79 | 0.86
Table 15: Generation coherence. The first header row names the generated modality; the second names the modalities given for conditional generation.
Model | Test set log-likelihood
MVAE | 1864
MMVAE (I=30) | 1891
MMVAE | 1916
mmJSD | 1961
MVAE (MS) | 1870
MMVAE (MS) | 1893
mmJSD (MS) | 1900
Table 16: Test set log-likelihoods.
Tables 14, 15 and 16 show that the models without importance samples achieve state-of-the-art performance compared to the MMVAE model using importance samples. Using modality-specific subspaces seems to have a similar effect on the test set log-likelihood as using importance samples, with a much lower impact on computational efficiency, as can be seen in the comparison of training times in Table 13.
C.2.5 Modality-Specific Subspaces
The introduction of modality-specific subspaces adds an additional degree of freedom. In Figure 5, we show a comparison of different modality-specific subspace sizes. The size is the same for all modalities, and the total number of latent dimensions is kept constant, i.e. the number of dimensions in the modality-specific subspaces is subtracted from the shared latent space: if the modality-specific latent spaces have size 2, the shared latent space has size 18. This ensures that the capacity of the latent space stays constant. Figure 5 shows that the introduction of modality-specific subspaces has only a minor effect on the quality of the learned representations, despite the lower number of dimensions in the shared space. Generation coherence suffers with an increasing number of modality-specific dimensions, but the quality of the samples improves. We conjecture that coherence decreases due to information which is shared between modalities but encoded in the modality-specific spaces. In future work, we are interested in finding better schemes to identify shared and modality-specific information.