1 Introduction
Humans are endowed with a remarkable cognitive framework that allows them to create a rich representation of their external reality. This framework contains the tools to learn novel representations of the environment and to recognize previously learned representations, which are stored in memory [1, 2]. The information provided by the environment is of a multimodal nature, captured and processed by the different sensory input channels (senses) humans possess. Yet, this information is often incomplete, be it because some modality is not provided by the environment or due to sensory malfunction. To overcome such situations, the human cognitive framework also allows for cross-modality inference, a process in which an available input modality can induce perceptual experiences of the missing modalities [3, 4, 5]. Figure 1 illustrates how cross-modality inference is essential for humans to act upon their environment in scenarios of incomplete perceptual observations.

Figure 1: An example of the importance of multimodal representation learning for human tasks: in the absence of light, humans can navigate their environment by employing perceptual information from other modalities (such as sound) to generate the absent visual perceptual experience. Following human cognitive models, in this work we contribute the MHVAE, a novel multimodal hierarchical variational autoencoder able to perform cross-modality inference.
Artificial agents, on the other hand, struggle to obtain rich representations of their environment. For example, in spite of being endowed with multiple sensors, robots often disregard the multimodal nature of environmental information and learn internal representations from a single perceptual modality, often vision [6, 7]. However, such disregard leads to the agent’s inability to understand and act upon its environment when that modality-specific information is unavailable or in the (frequent) case of sensory malfunction. If we aim at having artificial agents—such as service robots or autonomous vehicles—acting reliably in their environments, they must be provided with mechanisms to overcome potential perceptual issues. Rich joint-modality representations can play a fundamental role in robust policy transfer across different input modalities of artificial agents [8].
Inspired by the human cognitive framework, we contribute a novel model capable of learning rich multimodal representations and performing cross-modality inference. Multimodal generative models have shown great promise in doing so by learning a single joint distribution of multiple modalities [9, 10, 11, 12]. This single representation space has to encode information to account for the complete generation process of all modalities, often of different complexities. As such, for each input modality, the representation capability of this single joint-representation space necessarily pales in comparison with that of an individual modality-specific space.

Indeed, according to the Convergence-Divergence Zone (CDZ) cognitive model [2], humans process perceptual information not in a single representation space but in a hierarchical structure: sensory data is processed at the lower levels of the model, generating modality-specific representations, and divergent information from these representations is merged at the higher levels, generating multimodal representations [1, 13]. The architecture of the CDZ model is presented in Figure 2.

Inspired by the CDZ architecture, we propose the MHVAE, a novel generative model that learns multimodal representations in an unsupervised way. The MHVAE is a multimodal hierarchical Variational Autoencoder (VAE) that learns modality-specific distributions, for an arbitrary number of modalities, and a joint-modality distribution, allowing for cross-modality inference. Moreover, we formally derive the model's evidence lower bound (ELBO) and, based on modality-specific representation dropout, we propose a novel methodology to approximate the joint-modality posterior. This approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during training, with minimal computational cost.
We evaluate the potential of the MHVAE as a multimodal generative model on standard multimodal datasets. We show that the MHVAE outperforms other state-of-the-art multimodal generative models on modality-specific reconstruction and cross-modality inference.
In summary, the main contributions of this paper are:
- We propose a novel multimodal hierarchical VAE, inspired by the CDZ-based human neural architecture [2]. The model learns modality-specific distributions and a joint distribution of all modalities, allowing for cross-modality inference in the presence of incomplete perceptual information. We formally derive the model's evidence lower bound.
- We propose a new methodology for approximating the joint-modality posterior, based on modality-specific representation dropout. This approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during the model's training, with minimal computational cost.
- We evaluate the model on standard multimodal datasets and show that the MHVAE performs on par with other state-of-the-art multimodal generative models on modality-specific reconstruction and cross-modality inference.
2 A Deep Hierarchical Generative Model for Multimodal Representation Learning
Deep generative models have shown great promise in learning generalized representations of data. For single-modality data, the VAE is widely used. It learns a joint distribution $p_\theta(\mathbf{x}, \mathbf{z})$ of data $\mathbf{x}$, generated by a latent variable $\mathbf{z}$. This latent variable is often of lower dimensionality than the modality itself and acts as the representation vector in which data is encoded. The joint distribution takes the form

$$p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z}),$$

where the prior distribution $p(\mathbf{z})$ is often a unit Gaussian, $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$. The generative distribution $p_\theta(\mathbf{x} \mid \mathbf{z})$, parameterized by $\theta$, is usually composed of a simple likelihood term (e.g., Bernoulli or Gaussian).

The training procedure of the VAE involves the maximization of the evidence likelihood $p_\theta(\mathbf{x})$, obtained by marginalizing over the latent variable:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}.$$

However, the above likelihood is intractable. As such, we resort to an inference network $q_\phi(\mathbf{z} \mid \mathbf{x})$, parameterized by $\phi$, for its estimation:

$$p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\frac{p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\right].$$

Applying the logarithm and Jensen's inequality, we obtain a lower bound on the log-likelihood of the evidence (ELBO), i.e., $\log p_\theta(\mathbf{x}) \geq \mathcal{L}(\mathbf{x}; \theta, \phi)$, where

$$\mathcal{L}(\mathbf{x}; \theta, \phi) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right),$$

where the Kullback-Leibler divergence term, $D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$, promotes a balance between the latent variable's capacity and the encoding process of data. During training, this balance can be adjusted through the introduction of a hyper-parameter $\beta$ weighting the divergence term; we recover the original VAE formulation when taking $\beta = 1$. The optimization of the ELBO is done using gradient-based methods, applying a re-parametrization technique [14].
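To ground the formulation above, the sketch below computes a Monte-Carlo estimate of the $\beta$-weighted ELBO for a single-modality VAE in PyTorch. It is a minimal illustration rather than the authors' implementation; it assumes a Bernoulli likelihood, and the `encoder`/`decoder` modules (returning Gaussian posterior parameters and Bernoulli logits, respectively) are hypothetical.

```python
import torch
import torch.nn.functional as F

def vae_elbo(x, encoder, decoder, beta=1.0):
    """Monte-Carlo estimate of the beta-weighted ELBO for one mini-batch."""
    mu, logvar = encoder(x)                                  # q(z|x) = N(mu, diag(exp(logvar)))
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # re-parametrization trick
    x_logits = decoder(z)                                    # p(x|z) Bernoulli logits
    rec = -F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").flatten(1).sum(-1)    # log p(x|z)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # KL(q(z|x) || N(0, I))
    return (rec - beta * kl).mean()                          # maximize this quantity
```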
2.1 MHVAE
We now introduce the MHVAE, which extends the single-modality nature of VAEs to the multimodal hierarchical setting. In the multimodal setting, we consider a set of modalities $\mathbf{x}_{1:M} = \{\mathbf{x}_1, \dots, \mathbf{x}_M\}$, generated according to some environment-dependent process $p_\theta(\mathbf{x}_{1:M})$, parameterized by $\theta$. We model the generation process of information in a hierarchical fashion: each modality $\mathbf{x}_i$ is generated by a corresponding modality-specific latent variable in the set $\mathbf{z}_{1:M} = \{\mathbf{z}_1, \dots, \mathbf{z}_M\}$, conditionally independent given a core latent variable $\mathbf{z}_c$. The main goal of the MHVAE is to simultaneously learn single-modality latent spaces, for reconstructing modality-specific data, and a joint distribution of modalities, encoded in a core latent distribution, allowing cross-modality inference. The architecture of the proposed model is presented in Figure 2.
2.1.1 Evidence Lower-bound of the MHVAE
In order to train the model, we aim at maximizing the likelihood of the generative process, $p_\theta(\mathbf{x}_{1:M})$, by marginalizing over the modality-specific and core latent variables,

$$p_\theta(\mathbf{x}_{1:M}) = \iint p_\theta(\mathbf{x}_{1:M}, \mathbf{z}_{1:M}, \mathbf{z}_c)\, d\mathbf{z}_{1:M}\, d\mathbf{z}_c. \quad (1)$$
Given its hierarchical nature and the conditional independence of the modality-specific latent variables given the core latent variable, we can decompose the joint-modality probability as

$$p_\theta(\mathbf{x}_{1:M}, \mathbf{z}_{1:M}, \mathbf{z}_c) = p(\mathbf{z}_c) \prod_{i=1}^{M} p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)\, p_\theta(\mathbf{z}_i \mid \mathbf{z}_c). \quad (2)$$
However, since the marginal likelihood of each modality is intractable, we estimate its posterior by resorting to an inference model $q_\phi$, parameterized by $\phi$. We consider an inference model, as shown in Figure 2, in which modality information is encoded simultaneously into the modality-specific latent spaces and into the core latent space, yielding

$$q_\phi(\mathbf{z}_{1:M}, \mathbf{z}_c \mid \mathbf{x}_{1:M}) = q_\phi(\mathbf{z}_c \mid \mathbf{x}_{1:M}) \prod_{i=1}^{M} q_\phi(\mathbf{z}_i \mid \mathbf{x}_i). \quad (3)$$
Introducing the inference model in the decomposed joint probability and rewriting the likelihood of the evidence as an expectation over the latent variables, we obtain

$$p_\theta(\mathbf{x}_{1:M}) = \mathbb{E}_{q_\phi(\mathbf{z}_{1:M}, \mathbf{z}_c \mid \mathbf{x}_{1:M})}\!\left[\frac{p_\theta(\mathbf{x}_{1:M}, \mathbf{z}_{1:M}, \mathbf{z}_c)}{q_\phi(\mathbf{z}_{1:M}, \mathbf{z}_c \mid \mathbf{x}_{1:M})}\right]. \quad (4)$$
Taking the logarithm and applying Jensen's inequality [15], we estimate a lower bound on the log-likelihood of the evidence as

$$\log p_\theta(\mathbf{x}_{1:M}) \geq \mathbb{E}_{q_\phi(\mathbf{z}_{1:M}, \mathbf{z}_c \mid \mathbf{x}_{1:M})}\!\left[\log \frac{p_\theta(\mathbf{x}_{1:M}, \mathbf{z}_{1:M}, \mathbf{z}_c)}{q_\phi(\mathbf{z}_{1:M}, \mathbf{z}_c \mid \mathbf{x}_{1:M})}\right]. \quad (5)$$
The lower bound can be seen as containing three distinct groups of terms. The first group, similar to the original VAE formulation, corresponds to the reconstruction loss of input $\mathbf{x}_i$, generated by the modality-specific latent variable $\mathbf{z}_i$. For the $i$-th modality, this is given by

$$\mathcal{L}_{\mathrm{rec}}^{(i)} = \mathbb{E}_{q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)}\!\left[\log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)\right]. \quad (6)$$
The second component parallels the encoding capacity constraint on the latent variable in the VAE formulation, now considering the multimodal core latent variable $\mathbf{z}_c$. This constraint penalizes encoding distributions that deviate from the prior and is given by

$$\mathcal{L}_{\mathrm{prior}} = -D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}_c \mid \mathbf{x}_{1:M}) \,\|\, p(\mathbf{z}_c)\right). \quad (7)$$
Finally, the third term associates the distributions generated by the single-modality encoders, $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$, with the distributions generated from the multimodal core latent space, $p_\theta(\mathbf{z}_i \mid \mathbf{z}_c)$:

$$\mathcal{L}_{\mathrm{div}}^{(i)} = -\mathbb{E}_{q_\phi(\mathbf{z}_c \mid \mathbf{x}_{1:M})}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}_i \mid \mathbf{x}_i) \,\|\, p_\theta(\mathbf{z}_i \mid \mathbf{z}_c)\right)\right]. \quad (8)$$
Taking into consideration the previous components, we can write the evidence lower bound of the MHVAE as

$$\mathcal{L}(\mathbf{x}_{1:M}; \theta, \phi) = \sum_{i=1}^{M} \alpha_i\, \mathcal{L}_{\mathrm{rec}}^{(i)} + \beta\, \mathcal{L}_{\mathrm{prior}} + \sum_{i=1}^{M} \delta_i\, \mathcal{L}_{\mathrm{div}}^{(i)}, \quad (9)$$

where we introduce weight factors $\alpha_i$ for each modality-specific reconstruction loss and $\delta_i$ for each divergence term, in addition to a core capacity weight $\beta$.
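The sketch below illustrates how the ELBO of Equation (9) could be computed, assuming diagonal-Gaussian posteriors and conditional priors and Bernoulli likelihoods. The module interfaces (`mod_encoders` returning posterior parameters and the hidden representation $\mathbf{h}_i$, `cond_priors` mapping $\mathbf{z}_c$ to the parameters of $p_\theta(\mathbf{z}_i \mid \mathbf{z}_c)$) are hypothetical and not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ) for diagonal Gaussians.
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
                  - 1.0).sum(-1)

def mhvae_elbo(xs, mod_encoders, mod_decoders, core_encoder, cond_priors,
               alphas, deltas, beta=1.0):
    # Modality-specific posteriors q(z_i | x_i) and hidden representations h_i.
    mus, logvars, hs = [], [], []
    for x, enc in zip(xs, mod_encoders):
        mu_i, logvar_i, h_i = enc(x)
        mus.append(mu_i); logvars.append(logvar_i); hs.append(h_i)

    # Core posterior q(z_c | x_{1:M}) from the concatenated hidden representations.
    mu_c, logvar_c = core_encoder(torch.cat(hs, dim=-1))
    z_c = mu_c + torch.randn_like(mu_c) * (0.5 * logvar_c).exp()

    # Core capacity term: -beta * KL( q(z_c | x_{1:M}) || N(0, I) ).
    elbo = -beta * gaussian_kl(mu_c, logvar_c,
                               torch.zeros_like(mu_c), torch.zeros_like(logvar_c))
    for i, (x, dec, prior) in enumerate(zip(xs, mod_decoders, cond_priors)):
        # Reconstruction term: alpha_i * E_{q(z_i|x_i)}[ log p(x_i | z_i) ].
        z_i = mus[i] + torch.randn_like(mus[i]) * (0.5 * logvars[i]).exp()
        rec = -F.binary_cross_entropy_with_logits(
            dec(z_i), x, reduction="none").flatten(1).sum(-1)
        # Divergence term: -delta_i * KL( q(z_i | x_i) || p(z_i | z_c) ).
        mu_p, logvar_p = prior(z_c)
        elbo = elbo + alphas[i] * rec \
                    - deltas[i] * gaussian_kl(mus[i], logvars[i], mu_p, logvar_p)
    return elbo.mean()
```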
2.1.2 Modality Representation Dropout
We now turn to the methodology used to approximate the joint-modality posterior distribution. In the case of the MHVAE, we wish to encode information from the modality-specific data $\mathbf{x}_{1:M}$ into the multimodal core latent variable $\mathbf{z}_c$.
One approach to do so, the product-of-experts (POE), approximates the joint posterior with a product of Gaussian experts, including a prior expert [12]. However, this solution is computationally intensive (as it requires artificial sub-sampling of the observations during training) and suffers from overconfident expert predictions, resulting in sub-par cross-modality inference performance [9].
We propose a novel methodology to approximate the joint-modality posterior based on the dropout of modality-specific representations, as shown in Figure 3. We introduce a modality-data dropout mask $\mathbf{d} = [d_1, \dots, d_M]$, with dimensionality $M$, such that

$$q_\phi(\mathbf{z}_c \mid \mathbf{x}_{1:M}) = q_\phi\!\left(\mathbf{z}_c \mid \mathbf{d} \odot \mathbf{h}_{1:M}\right), \quad (10)$$

where $\mathbf{h}_{1:M} = \{\mathbf{h}_1, \dots, \mathbf{h}_M\}$ corresponds to the list of hidden-layer representations computed by the modality-data encoders, as seen in Figure 2. We effectively zero-out the selected components by considering that

$$d_i \odot \mathbf{h}_i = \begin{cases} \mathbf{h}_i, & \text{if } d_i = 1,\\ \mathbf{0}, & \text{if } d_i = 0. \end{cases} \quad (11)$$
During training, for each datapoint, we sample each mask component $d_i$ from a Bernoulli distribution,

$$d_i \sim \mathrm{Bernoulli}(1 - p_i), \quad (12)$$
where the hyper-parameters $p_i$ control the dropout probability of each modality representation. Moreover, we condition the mask sampling procedure so that at least one modality representation is always non-zero. For each sample, we concatenate the resulting masked representations to be used as input to the multimodal core encoder. Accounting for modality representation dropout, the modified ELBO of the MHVAE becomes

$$\mathcal{L}_d(\mathbf{x}_{1:M}; \theta, \phi) = \sum_{i=1}^{M} \alpha_i\, \mathcal{L}_{\mathrm{rec}}^{(i)} - \beta\, D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}_c \mid \mathbf{d} \odot \mathbf{h}_{1:M}) \,\|\, p(\mathbf{z}_c)\right) - \sum_{i=1}^{M} \delta_i\, \mathbb{E}_{q_\phi(\mathbf{z}_c \mid \mathbf{d} \odot \mathbf{h}_{1:M})}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}_i \mid \mathbf{x}_i) \,\|\, p_\theta(\mathbf{z}_i \mid \mathbf{z}_c)\right)\right]. \quad (14)$$
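A minimal sketch of the mask-sampling step follows, assuming per-modality hidden representations of shape (batch, features) and dropout probabilities strictly below one; the function names and shapes are illustrative, not the authors' implementation. The resampling loop enforces the constraint that at least one modality representation remains non-zero.

```python
import torch

def sample_modality_dropout_mask(batch_size, num_modalities, drop_probs):
    # drop_probs: per-modality dropout probabilities p_i (each strictly below 1).
    keep_probs = 1.0 - torch.as_tensor(drop_probs, dtype=torch.float32)
    keep_probs = keep_probs.expand(batch_size, num_modalities)
    mask = torch.bernoulli(keep_probs)
    # Resample rows where every modality was dropped, so that at least one
    # modality representation is always non-zero.
    while (mask.sum(dim=-1) == 0).any():
        dead = mask.sum(dim=-1) == 0
        mask[dead] = torch.bernoulli(keep_probs[dead])
    return mask

def apply_mask(hs, mask):
    # hs: list of per-modality hidden representations, each of shape (batch, features).
    return [h * mask[:, i:i + 1] for i, h in enumerate(hs)]
```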
3 Evaluation
In this section, we evaluate the MHVAE’s performance as a multimodal generative model on standard multimodal datasets. Our model outperforms other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.
3.1 Multimodal Datasets
As in previous literature, we transform single-modality datasets into bimodal datasets by considering the label associated with each image as a modality in its own right. We compare the MHVAE to existing multimodal generative models: JMVAE-kl [10] and MVAE [12]. For the JMVAE-kl model, we consider its divergence weight hyper-parameter. For the MVAE model, trained using the publicly available official implementation (https://github.com/mhw32/multimodal-vae-public), we employ the authors' suggested training hyper-parameters.
We evaluate our model on standard datasets from the literature: MNIST [16], FashionMNIST [17], and CelebA [18]. We report state-of-the-art performance on the first two datasets regarding generative modelling and cross-modality capabilities.
We train the MHVAE with no hyper-parameter tuning, i.e., $\alpha_i = \beta = \delta_i = 1$. Moreover, we fix the dropout hyper-parameters $p_i$ to the same value for all modalities. For the MHVAE model, as shown in Figure 2, we consider two different types of networks: the modality networks, responsible for encoding the input data $\mathbf{x}_i$ into the modality-specific latent space $\mathbf{z}_i$ and the associated hidden representation $\mathbf{h}_i$, as well as for the inverse generative process; and the core network, responsible for encoding the multimodal core latent variable $\mathbf{z}_c$ from the representations $\mathbf{h}_{1:M}$, and from which we generate the modality-specific latent spaces $\mathbf{z}_{1:M}$. For fairness, on each dataset, we keep the network architectures consistent across models: the generative and inference networks of the baseline models share their architecture with the modality-specific networks of the MHVAE.
Moreover, we also consider a warm-up period on the regularization terms of the ELBO [19]: we linearly increase the weight of the prior regularization term on the modality-specific latent variables over an initial number of training epochs, and we likewise linearly increase the weight of the Gaussian prior regularization on the core latent space over its own warm-up period. For the baselines, we consider a single warm-up period on the prior regularization of the latent space.
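A simple linear warm-up schedule of the kind described above can be implemented as follows; the epoch counts shown are placeholders for illustration, since the values used in the paper are not recoverable from the extracted text.

```python
def warmup_weight(epoch, warmup_epochs):
    # Linearly anneal a regularization weight from 0 to 1 over `warmup_epochs`.
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / float(warmup_epochs))

# Hypothetical usage (warm-up lengths are assumed, not the paper's values):
MODALITY_WARMUP_EPOCHS, CORE_WARMUP_EPOCHS = 20, 50
# delta_t = delta_i * warmup_weight(epoch, MODALITY_WARMUP_EPOCHS)
# beta_t  = beta    * warmup_weight(epoch, CORE_WARMUP_EPOCHS)
```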
We evaluate the reconstruction capabilities and cross-modality inference performance of the models. To do so, we estimate the image marginal log-likelihood $\log p(\mathbf{x})$, the joint log-likelihood $\log p(\mathbf{x}, \mathbf{y})$, and the conditional log-likelihood $\log p(\mathbf{x} \mid \mathbf{y})$ of the observations through importance sampling, where $\mathbf{y}$ denotes the label (or attribute) modality; for CelebA, we consider 500 importance samples. The evaluation metrics are derived in the appendix.
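As an illustration of the importance-sampling scheme, the sketch below estimates $\log p(\mathbf{x})$ for a single-latent VAE; the estimator used for the MHVAE (derived in the appendix) follows the same principle, with importance weights defined over both the modality-specific and core latent variables. The `encoder`/`decoder` interfaces and the Bernoulli likelihood are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def log_marginal_estimate(x, encoder, decoder, k=500):
    # log p(x) ~= logsumexp_k[ log p(x|z_k) + log p(z_k) - log q(z_k|x) ] - log k
    mu, logvar = encoder(x)
    std = (0.5 * logvar).exp()
    posterior = torch.distributions.Normal(mu, std)
    prior = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(std))
    log_ws = []
    for _ in range(k):
        z = posterior.rsample()
        log_p_x_z = -F.binary_cross_entropy_with_logits(
            decoder(z), x, reduction="none").flatten(1).sum(-1)
        log_w = log_p_x_z + prior.log_prob(z).sum(-1) - posterior.log_prob(z).sum(-1)
        log_ws.append(log_w)
    return torch.logsumexp(torch.stack(log_ws, dim=0), dim=0) - math.log(k)
```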
3.1.1 MNIST
Table 1: Test log-likelihood estimates on MNIST.

| Metric | Input | JMVAE | MVAE | MHVAE |
|---|---|---|---|---|
| $\log p(\mathbf{x})$ | I | -90.189 | - | -89.050 |
| $\log p(\mathbf{x}, \mathbf{y})$ | I | -90.241 | - | -89.183 |
| $\log p(\mathbf{x}, \mathbf{y})$ | L | -125.381 | - | -121.401 |
| $\log p(\mathbf{x}, \mathbf{y})$ | I, L | -90.335 | - | -89.143 |
| $\log p(\mathbf{x} \mid \mathbf{y})$ | L | -123.070 | - | -118.856 |
For the MNIST dataset, we train all models on images $\mathbf{x}$ (of size 28×28) and the corresponding digit labels $\mathbf{y} \in \{0, \dots, 9\}$. We consider a dataset division of 85% for training, part of which is held out for validation purposes, and the remaining 15% for evaluation.
We compose the image modality network of the MHVAE of three linear layers with 512 hidden units, leaky rectifiers as activation functions, and batch normalization between hidden layers, and we consider a 16-dimensional image-specific latent space. The label modality network is similarly composed of three linear layers with 128 hidden units, with a 16-dimensional label-specific latent space. The core network is composed of three linear layers with 64 hidden units, with a 10-dimensional core latent space. For the baselines, we consider a single 26-dimensional latent space.
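One plausible reading of the image-modality encoder described above is sketched below in PyTorch; the exact placement of the batch-normalization layers and of the posterior heads is an assumption, not the authors' implementation.

```python
import torch.nn as nn

class ImageModalityEncoder(nn.Module):
    """MNIST image encoder sketch: three linear layers with 512 hidden units,
    leaky-ReLU activations and batch normalization, outputting the parameters of
    the 16-dimensional Gaussian posterior q(z_x | x) and the hidden representation h."""
    def __init__(self, in_dim=784, hidden=512, latent=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
        )
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)

    def forward(self, x):
        h = self.net(x.flatten(1))      # flatten 28x28 images to 784-dim vectors
        return self.mu(h), self.logvar(h), h
```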
We consider $p_\theta(\mathbf{x} \mid \mathbf{z}_x)$ as a Bernoulli-distributed likelihood over the image modality and $p_\theta(\mathbf{y} \mid \mathbf{z}_y)$ as a multinomial likelihood over the label modality. Moreover, we apply the warm-up schedules described above to the MHVAE and the baselines.
We train all models for 500 epochs, using the same learning rate and batch size across models. The estimates of the test log-likelihoods for all models are presented in Table 1. The MHVAE outperforms the other state-of-the-art multimodal models on both single-modality and joint-modality metrics, despite its separate representation spaces being of lower dimensionality than the joint representation space employed by the JMVAE and the MVAE. Moreover, the MHVAE provides better cross-modality inference than the other models, as observed by the significantly higher conditional log-likelihood $\log p(\mathbf{x} \mid \mathbf{y})$, estimated using only the lower-dimensional label modality.
In Figure 4, we present images generated by the MHVAE, both sampled from the prior $p(\mathbf{z}_c)$ and conditioned on a given label $\mathbf{y}$, i.e., generated from the core posterior $q_\phi(\mathbf{z}_c \mid \mathbf{y})$ through the image-specific generative networks. The quality of the sampled images indicates a suitable performance of the generative networks of the MHVAE.
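For illustration, label-to-image cross-modality inference with a trained MHVAE could proceed as sketched below: the label is encoded, the missing image representation is zeroed-out (mirroring representation dropout during training), and the image is generated through the core and image-specific networks. All module names and the `image_repr_dim` argument are hypothetical.

```python
import torch

@torch.no_grad()
def infer_image_from_label(y, label_encoder, core_encoder, image_prior,
                           image_decoder, image_repr_dim):
    # Encode only the label; zero-out the (missing) image representation.
    _, _, h_y = label_encoder(y)
    h_x = torch.zeros(y.shape[0], image_repr_dim, device=h_y.device)
    mu_c, logvar_c = core_encoder(torch.cat([h_x, h_y], dim=-1))
    z_c = mu_c + torch.randn_like(mu_c) * (0.5 * logvar_c).exp()
    mu_x, logvar_x = image_prior(z_c)          # p(z_x | z_c)
    z_x = mu_x + torch.randn_like(mu_x) * (0.5 * logvar_x).exp()
    return torch.sigmoid(image_decoder(z_x))   # Bernoulli means as pixel intensities
```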
3.1.2 FashionMNIST
Table 2: Test log-likelihood estimates on FashionMNIST.

| Metric | Input | JMVAE | MVAE | MHVAE |
|---|---|---|---|---|
| $\log p(\mathbf{x})$ | I | -232.427 | -236.613 | -231.753 |
| $\log p(\mathbf{x}, \mathbf{y})$ | I | -232.739 | -242.628 | -232.276 |
| $\log p(\mathbf{x}, \mathbf{y})$ | L | -244.378 | -557.582 | -243.932 |
| $\log p(\mathbf{x}, \mathbf{y})$ | I, L | -232.573 | -241.534 | -232.248 |
| $\log p(\mathbf{x} \mid \mathbf{y})$ | L | -242.060 | -552.679 | -241.662 |
For FashionMNIST, we train the generative models on greyscale images $\mathbf{x}$ (of size 28×28) and their class labels $\mathbf{y}$, with the same proportional division of the dataset as in the previous case. For the MHVAE, we implement a miniature DCGAN [20] architecture as the image-modality encoder, with Swish [21] as the activation function due to its performance in deep convolutional models. The network is composed of two convolutional layers with 32 and 64 channels, followed by a linear layer of 128 hidden units. For the core and label-modality inference and generative networks, we maintain the same architectures as before. We consider modality-specific and core latent spaces with the same dimensionality as before and employ the same training hyper-parameters as in the previous evaluation.
We train all models for 500 epochs, employing the Adam optimization algorithm [22] with the same learning rate and batch size across models. The estimates of the test log-likelihoods for all models are presented in Table 2. Once again, the MHVAE outperforms the other state-of-the-art multimodal models on both single-modality and joint-modality metrics, as well as on label-to-image cross-modality inference.
In Figure 4, we present the images generated by the MHVAE, sampled from the prior and conditioned on a given label, which provide further evidence of the suitable performance of the model's generative networks.
3.1.3 CelebA
For CelebA, we train the MHVAE on re-scaled colored images $\mathbf{x}$ and a subset of 18 visually distinctive attributes $\mathbf{y}$ [23]. We compose the image modality network of the MHVAE as a miniature DCGAN [20], with four convolutional layers followed by a linear layer of 512 hidden units, and consider a 48-dimensional image-specific latent space. The label modality network is composed of three linear layers with 512 hidden units, with a 48-dimensional label-specific latent space. The core network is composed of three linear layers with 256 hidden units, with a 16-dimensional core latent space. The baselines consider a single 64-dimensional latent space.
We train all models for 50 epochs, employing the same learning rate and batch size across models. For the MHVAE, we again consider separate warm-up periods for the modality-specific and core regularization terms, and for the baseline models a single warm-up period. The estimates of the test log-likelihoods, computed using 500 importance samples, are presented in Table 3. In this scenario, the MHVAE performs on par with the other state-of-the-art multimodal models on all metrics, albeit with slightly lower performance in comparison with the previous evaluations. In Figure 4, we present images generated by the MHVAE, sampled from the prior and conditioned on a given set of attributes.
Table 3: Test log-likelihood estimates on CelebA.

| Metric | Input | JMVAE | MVAE | MHVAE |
|---|---|---|---|---|
| $\log p(\mathbf{x})$ | I | -6260.35 | -6256.65 | -6271.35 |
| $\log p(\mathbf{x}, \mathbf{y})$ | I | -6264.59 | -6270.86 | -6278.19 |
| $\log p(\mathbf{x}, \mathbf{y})$ | A | -7204.36 | -7316.12 | -7303.64 |
| $\log p(\mathbf{x}, \mathbf{y})$ | I, A | -6262.67 | -6266.14 | -6276.57 |
| $\log p(\mathbf{x} \mid \mathbf{y})$ | A | -7191.11 | -7309.10 | -7296.22 |
3.2 Discussion
We have evaluated the MHVAE against state-of-the-art baselines on standard multimodal datasets, comparing our model with the JMVAE and the MVAE, two widely used models for multimodal representation learning.

The results, on increasingly complex datasets, attest to the importance of considering hierarchical representation spaces to model multimodal data distributions. Even when considering lower-dimensional spaces to learn the modality distributions, in comparison with the single multimodal space of the baselines, the MHVAE achieves state-of-the-art results on the MNIST and FashionMNIST datasets, with minimal hyper-parameter tuning.

On the CelebA dataset, the MHVAE performs on par with the other baseline models, which raises the question of the importance of the dimensionality of the representation spaces in complex scenarios. Indeed, for a fair comparison with the baselines, we limited the MHVAE to lower-dimensional representation spaces, which, on a complex dataset such as CelebA, results in a lower log-likelihood of the modalities. However, the MHVAE is still capable of outperforming the MVAE in regards to joint-modality and cross-modality inference estimated from the label modality. In future work, we intend to address the question of the balance between representational capacity in the core and in the modality-specific distributions.
4 Related Work
Deep generative models have shown great promise in learning generalized latent representations of data. The VAE model [14] estimates a deep generative model through variational inference, encoding single-modality data in a latent space regularized by a prior distribution. The regularization distribution is often a unit Gaussian, or a more complex posterior distribution [24, 25]. Due to the intractability of the marginal likelihood of the data, the model resorts to an inference network for the computation of its evidence lower bound. This lower bound can be estimated, for example, through importance sampling techniques [26].

Hierarchical generative models have also been proposed in the literature to learn complex relationships between latent variables [19, 27, 28, 29]. However, these models consider representations created from a single modality and, as such, are able neither to provide a framework for cross-modality inference nor to represent multimodal data. On the other hand, VAE models have also been extended to learn joint distributions of several modalities by forcing the estimated single-modality representations to be similar, thus allowing cross-modality inference [10, 30, 11]. However, the necessity of introducing specific divergence terms in the model's evidence lower bound for each combination of modalities hinders their application in scenarios with a large number of modalities. Another approach is the POE inference network, which reduces the number of encoding networks required for multimodal encoding [12], albeit with an increased computational training cost. In order to provide cross-modality inference capabilities, the existing models encode information from all modalities into a single, common latent space, thus relinquishing the generative capabilities that single-modality latent representation spaces possess. In this work, we present a novel multimodal generative model capable of learning hierarchical representation spaces.
5 Conclusions
In this work, by taking inspiration from the human cognitive framework, we presented the MHVAE, a novel multimodal hierarchical generative model. The MHVAE is able to learn separate modality-specific representations and a joint-modality representation, allowing for improved representation learning in comparison with the single-representation choice of other multimodal generative models. We have shown that, on standard multimodal datasets, the MHVAE is able to outperform other state-of-the-art multimodal generative models regarding modality-specific reconstruction and cross-modality inference.
We also proposed a novel methodology to approximate the joint-modality posterior, based on modality-specific representation dropout. With minimal computational cost, this approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during training. We aim at exploring scenarios with a larger number of modalities in the future.
Moreover, we aim to employ the MHVAE as a perceptual representation model for artificial agents and to explore its application in deep multimodal reinforcement learning scenarios in which the agent must perform cross-modality inference to complete its task. Further inspired by human cognition and perceptual learning, we also intend to explore reinforcement learning mechanisms for the construction of the multimodal representations themselves.
References
- [1] Kaspar Meyer and Antonio Damasio. Convergence and divergence in a neural architecture for recognition and memory. Trends in neurosciences, 32(7):376–382, 2009.
- [2] Antonio R Damasio. Time-locked multiregional retroactivation: A systems-level proposal for the neural substrates of recall and recognition. Cognition, 33(1-2):25–62, 1989.
- [3] Peter Walker, J Gavin Bremner, Uschi Mason, Jo Spring, Karen Mattock, Alan Slater, and Scott P Johnson. Preverbal infants’ sensitivity to synaesthetic cross-modality correspondences. Psychological Science, 21(1):21–25, 2010.
- [4] Daphne Maurer, Thanujeni Pathman, and Catherine J Mondloch. The shape of boubas: Sound–shape correspondences in toddlers and adults. Developmental science, 9(3):316–322, 2006.
- [5] Charles Spence. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4):971–995, 2011.
- [6] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.
- [7] Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, and Abhinav Gupta. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pages 3–18. Springer, 2016.
- [8] Rui Silva, Miguel Vasco, Francisco S. Melo, Ana Paiva, and Manuela Veloso. Playing games in the dark: An approach for cross-modality transfer in reinforcement learning. arXiv preprint arXiv:1911.12851, 2019.
- [9] Yuge Shi, N Siddharth, Brooks Paige, and Philip Torr. Variational mixture-of-experts autoencoders for multi-modal deep generative models. In Advances in Neural Information Processing Systems, pages 15692–15703, 2019.
- [10] Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.
- [11] Timo Korthals, Daniel Rudolph, Jürgen Leitner, Marc Hesse, and Ulrich Rückert. Multi-modal generative models for learning epistemic active sensing. In 2019 IEEE International Conference on Robotics and Automation, 2019.
- [12] Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pages 5575–5585, 2018.
- [13] Stephane Lallee and Peter Ford Dominey. Multi-modal convergence maps: from body schema and self-representation to mental imagery. Adaptive Behavior, 21(4):274–285, 2013.
- [14] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [15] Johan Ludwig William Valdemar Jensen et al. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, 30:175–193, 1906.
- [16] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- [18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August 15, 2018.
- [19] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
- [20] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- [21] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
- [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [23] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.
- [24] Jianlin Su and Guang Wu. f-vaes: Improve vaes with conditional flows. arXiv preprint arXiv:1809.05861, 2018.
- [25] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- [26] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- [27] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from deep generative models. In Proceedings of the 34th International Conference on Machine Learning, pages 4091–4099. JMLR.org, 2017.
- [28] Philip Bachman. An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems, pages 4826–4834, 2016.
- [29] Wei-Ning Hsu, Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, et al. Hierarchical generative modeling for controllable speech synthesis. arXiv preprint arXiv:1810.07217, 2018.
- [30] Hang Yin, Francisco S. Melo, Aude Billard, and Ana Paiva. Associate latent encodings in learning from demonstrations. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.