MHVAE: a Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning

06/04/2020
by   Miguel Vasco, et al.
Inesc-ID

Humans are able to create rich representations of their external reality. Their internal representations allow for cross-modality inference, where available perceptions can induce the perceptual experience of missing input modalities. In this paper, we contribute the Multimodal Hierarchical Variational Auto-encoder (MHVAE), a hierarchical multimodal generative model for representation learning. Inspired by human cognitive models, the MHVAE is able to learn modality-specific distributions, of an arbitrary number of modalities, and a joint-modality distribution, responsible for cross-modality inference. We formally derive the model's evidence lower bound and propose a novel methodology to approximate the joint-modality posterior based on modality-specific representation dropout. We evaluate the MHVAE on standard multimodal datasets. Our model performs on par with other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.


1 Introduction

Humans are provided with a remarkable cognitive framework which allows them to create a rich representation of their external reality. This framework contains the tools to learn novel representations of their environment and to recognize previously learned representations, which are stored in memory [1, 2]. The information provided by the environment is of a multimodal nature, captured and processed by the different sensory input channels (senses) humans possess. Yet, information is often incomplete, be it due to some modality not being provided by the environment or due to human sensory malfunction. To overcome such events, the human cognitive framework also allows for cross-modality inference, a process in which an available input modality can induce perceptual experiences of the missing modalities [3, 4, 5]. Figure 1 illustrates how cross-modality inference is essential in humans to act upon their environment in scenarios of incomplete perceptual observations.

Figure 1: An example of the importance of multimodal representation learning for human tasks: in the absence of light, humans can navigate their environment by employing perceptual information from other modalities (such as sound) to generate the absent visual perceptual experience. Following human cognitive models, in this work we contribute the MHVAE, a novel multimodal hierarchical variational autoencoder able to perform cross-modality inference.

Artificial agents, on the other hand, struggle to obtain rich representations of their environment. For example, in spite of being endowed with multiple sensors, robots often disregard the multimodal nature of environmental information and learn internal representations from a single perceptual modality, often vision [6, 7]. However, such disregard leads to the agent’s inability to understand and act upon its environment when that modality-specific information is unavailable or in the (frequent) case of sensory malfunction. If we aim at having artificial agents—such as service robots or autonomous vehicles—acting reliably in their environments, they must be provided with mechanisms to overcome potential perceptual issues. Rich joint-modality representations can play a fundamental role in robust policy transfer across different input modalities of artificial agents [8].

Inspired by the human cognitive framework, we contribute a novel model capable of learning rich multimodal representations and performing cross-modality inference. Multimodal generative models have shown great promise in doing so by learning a single joint distribution of multiple modalities [9, 10, 11, 12]. This single representation space has to encode information to account for the complete generation process of all modalities, often of different complexities. As such, for each input modality, the representational capacity of this single joint space must pale in comparison with that of an individual modality-specific space. Indeed, according to the Convergence-Divergence Zone (CDZ) cognitive model [2], humans process perceptual information not in a single representation space but in a hierarchical structure: sensory data is processed at the lower levels of the model, generating modality-specific representations, and divergent information from these representations is merged at higher levels of the model, generating multimodal representations [1, 13]. The architecture of the CDZ model is presented in Figure 2.

Inspired by the CDZ architecture, we propose the MHVAE, a novel generative model that learns multimodal representations in an unsupervised way. The MHVAE is a multimodal hierarchical Variational Autoencoder (VAE) that learns modality-specific distributions, for an arbitrary number of modalities, and a joint-modality distribution, allowing for cross-modality inference. Moreover, we formally derive the model’s evidence lower bound (ELBO) and, based on modality-specific representation dropout, we propose a novel methodology to approximate the joint-modality posterior. This approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during the model’s training, with minimal computational cost.

We evaluate the potential of the MHVAE as a multimodal generative model on standard multimodal datasets. We show that the MHVAE outperforms other state-of-the-art multimodal generative models on modality-specific reconstruction and cross-modality inference.

In summary, the main contributions of this paper are:

  • We propose a novel multimodal hierarchical VAE, inspired by the CDZ-based human neural architecture [2]. The model learns modality-specific distributions and a joint distribution of all modalities, allowing for cross-modality inference in the presence of incomplete perceptual information. We formally derive the model’s evidence lower bound.

  • We propose a new methodology for approximating the joint-modality posterior, based on modality-specific representation dropout. This approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during the model’s training, with minimal computational cost.

  • We evaluate the model on standard multimodal datasets and show that the MHVAE performs on par with other state-of-the-art multimodal generative models on modality-specific reconstruction and cross-modality inference.

Figure 2: Schematic of the CDZ and MHVAE models, where grey rectangles indicate observed variables: (a) the inference and generative models of the CDZ model; (b) the MHVAE model, in which a high-level core latent variable $z_c$ generates the modality-specific latent distributions $p_\theta(z_i \mid z_c)$, responsible for sampling the modality data $x_i$. This generative process is represented by the orange segments. The blue segments represent the inference model of the MHVAE, where modality observations $x_i$ are encoded both in the modality-specific latent distributions $q_\phi(z_i \mid x_i)$ and in the core latent distribution $q_\phi(z_c \mid x_{1:M})$, considering the hidden modality-specific representations $h_i$.

2 A Deep Hierarchical Generative Model for Multimodal Representation Learning

Deep generative models have shown great promise in learning generalized representations of data. For single-modality data, the VAE is widely used. It learns a joint distribution of the data $x$, generated by a latent variable $z$. This latent variable is often of lower dimensionality than the modality itself and acts as the representation vector in which the data is encoded.

The joint distribution takes the form

$p_\theta(x, z) = p_\theta(x \mid z)\, p(z),$

where $p(z)$ (the prior distribution) is often a unit Gaussian ($p(z) = \mathcal{N}(0, I)$). The generative distribution $p_\theta(x \mid z)$, parameterized by $\theta$, is usually composed of a simple likelihood term (e.g., Bernoulli or Gaussian).

The training procedure of the VAE involves the maximization of the evidence likelihood $p_\theta(x)$, by marginalizing over the latent variable:

$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz.$

However, the above likelihood is intractable. As such, we resort to an inference network $q_\phi(z \mid x)$ for its estimation:

$p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right].$

Applying the logarithm and Jensen’s inequality, we obtain a lower bound on the log-likelihood of the evidence (ELBO), i.e., $\log p_\theta(x) \geq \mathcal{L}(x)$, where

$\mathcal{L}(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right] - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right),$

where the Kullback-Leibler divergence term, $D_{\mathrm{KL}}( q_\phi(z \mid x) \,\|\, p(z) )$, promotes a balance between the latent variable’s capacity and the encoding process of the data. During training, this balance can be adjusted through the introduction of a hyper-parameter $\beta$,

$\mathcal{L}_\beta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right] - \beta\, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right),$

where we recover the original VAE formulation when taking $\beta = 1$. The optimization of the ELBO is done using gradient-based methods, applying a re-parametrization technique [14].
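For concreteness, the $\beta$-weighted bound above can be written in a few lines of PyTorch. The sketch below is ours (not the paper's implementation) and assumes a Bernoulli decoder producing logits and a diagonal-Gaussian posterior:

```python
import torch
import torch.nn.functional as F

def beta_vae_elbo(x, recon_logits, mu, logvar, beta=1.0):
    """Single-modality beta-VAE ELBO (to be maximized), per batch mean.

    Assumes a Bernoulli likelihood p(x|z) parameterized by `recon_logits`
    and a diagonal-Gaussian posterior q(z|x) = N(mu, exp(logvar)).
    """
    # Reconstruction term E_q[log p(x|z)], one-sample Monte Carlo estimate.
    log_px_z = -F.binary_cross_entropy_with_logits(
        recon_logits, x, reduction="none").sum(dim=-1)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
    return (log_px_z - beta * kl).mean()
```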

2.1 MHVAE

We now introduce the MHVAE model, which extends the single-modality nature of the VAE to the multimodal hierarchical setting. In the multimodal setting, we consider a set of modalities $x_{1:M} = \{x_1, \dots, x_M\}$, generated according to some environment-dependent process $p_\theta(x_{1:M})$, parameterized by $\theta$. We model the generation process of information in a hierarchical fashion: each modality $x_i$ is generated by a corresponding modality-specific latent variable $z_i$ in the set $z_{1:M} = \{z_1, \dots, z_M\}$, conditionally independent of the others given a core latent variable $z_c$. The main goal of the MHVAE is to simultaneously learn single-modality latent spaces, for reconstructing modality-specific data, and a joint distribution of modalities, encoded in a core latent distribution, allowing cross-modality inference. The architecture of the proposed model is presented in Figure 2.
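To make this generative story concrete, the following sketch performs ancestral sampling through the hierarchy; the network interfaces (`modality_priors`, `modality_decoders`) and the diagonal-Gaussian form of $p_\theta(z_i \mid z_c)$ are our assumptions, not details from the paper:

```python
import torch

def sample_mhvae(core_dim, modality_priors, modality_decoders, n=1):
    """Ancestral sampling through the MHVAE hierarchy (illustrative only).

    modality_priors[i]:   network mapping z_c -> (mu_i, logvar_i) of p(z_i | z_c)
    modality_decoders[i]: network mapping z_i -> parameters of p(x_i | z_i)
    """
    z_c = torch.randn(n, core_dim)                     # z_c ~ p(z_c) = N(0, I)
    samples = []
    for prior_net, decoder in zip(modality_priors, modality_decoders):
        mu, logvar = prior_net(z_c)                    # p(z_i | z_c)
        z_i = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        samples.append(decoder(z_i))                   # parameters of p(x_i | z_i)
    return samples
```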

2.1.1 Evidence Lower-bound of the MHVAE

In order to train the model, we aim at maximizing the likelihood of the generative process, $p_\theta(x_{1:M})$, by marginalizing over the modality-specific and core latent variables,

$p_\theta(x_{1:M}) = \int\!\!\int p_\theta(x_{1:M}, z_{1:M}, z_c)\, dz_{1:M}\, dz_c. \quad (1)$

Given its hierarchical nature and the conditional independence of each modality-specific latent variable with regard to the core latent variable, we can decompose the joint-modality probability as

$p_\theta(x_{1:M}, z_{1:M}, z_c) = p(z_c) \prod_{i=1}^{M} p_\theta(x_i \mid z_i)\, p_\theta(z_i \mid z_c). \quad (2)$

However, since the marginal likelihood of each modality is intractable, we estimate its posterior by resorting to an inference model $q_\phi$, parameterized by $\phi$. We consider an inference model, as shown in Figure 2, in which modality information is encoded simultaneously into the modality-specific latent spaces and into the core latent space, yielding

$q_\phi(z_{1:M}, z_c \mid x_{1:M}) = q_\phi(z_c \mid x_{1:M}) \prod_{i=1}^{M} q_\phi(z_i \mid x_i). \quad (3)$

Introducing the inference model in the decomposed joint probability and rewriting the likelihood of the evidence as an expectation over the latent variables, we obtain

$p_\theta(x_{1:M}) = \mathbb{E}_{q_\phi(z_{1:M}, z_c \mid x_{1:M})}\!\left[ \frac{p(z_c) \prod_{i=1}^{M} p_\theta(x_i \mid z_i)\, p_\theta(z_i \mid z_c)}{q_\phi(z_c \mid x_{1:M}) \prod_{i=1}^{M} q_\phi(z_i \mid x_i)} \right]. \quad (4)$

Taking the logarithm and applying Jensen’s inequality [15], we estimate a lower bound on the log-likelihood of the evidence as

$\log p_\theta(x_{1:M}) \geq \mathbb{E}_{q_\phi(z_{1:M}, z_c \mid x_{1:M})}\!\left[ \log \frac{p(z_c) \prod_{i=1}^{M} p_\theta(x_i \mid z_i)\, p_\theta(z_i \mid z_c)}{q_\phi(z_c \mid x_{1:M}) \prod_{i=1}^{M} q_\phi(z_i \mid x_i)} \right]. \quad (5)$

This lower bound can be seen as containing three distinct groups of terms. The first group, similar to the original VAE formulation, corresponds to the reconstruction loss of the inputs $x_{1:M}$, generated by the modality-specific latent variables $z_{1:M}$. For the $i$-th modality, this is given by

$\mathcal{L}_{\mathrm{rec}}^{(i)} = \mathbb{E}_{q_\phi(z_i \mid x_i)}\!\left[ \log p_\theta(x_i \mid z_i) \right]. \quad (6)$

The second component parallels the encoding-capacity constraint on the latent variable in the VAE formulation, now considering the multimodal core latent variable $z_c$. This constraint penalizes encoding distributions that deviate from the prior and is given by

$\mathcal{L}_{\mathrm{core}} = - D_{\mathrm{KL}}\!\left( q_\phi(z_c \mid x_{1:M}) \,\|\, p(z_c) \right). \quad (7)$

Finally, the third term associates the distributions generated by the single-modality encoders, $q_\phi(z_i \mid x_i)$, with the distributions generated from the multimodal core latent space, $p_\theta(z_i \mid z_c)$:

$\mathcal{L}_{\mathrm{mod}}^{(i)} = - \mathbb{E}_{q_\phi(z_c \mid x_{1:M})}\!\left[ D_{\mathrm{KL}}\!\left( q_\phi(z_i \mid x_i) \,\|\, p_\theta(z_i \mid z_c) \right) \right]. \quad (8)$

Taking into consideration the previous components, we can write the evidence lower bound of the MHVAE model as

$\mathcal{L}_{\mathrm{MHVAE}}(x_{1:M}) = \sum_{i=1}^{M} \alpha_i\, \mathcal{L}_{\mathrm{rec}}^{(i)} - \beta\, D_{\mathrm{KL}}\!\left( q_\phi(z_c \mid x_{1:M}) \,\|\, p(z_c) \right) + \sum_{i=1}^{M} \gamma_i\, \mathcal{L}_{\mathrm{mod}}^{(i)}, \quad (9)$

where we introduce weight factors $\alpha_i$ for each modality-specific reconstruction loss and $\gamma_i$ for each divergence term, in addition to a core capacity weight $\beta$.
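As an illustration, a possible computation of the weighted bound in Eq. (9) for diagonal-Gaussian distributions is sketched below; the function names and tensor layout are ours, and the weights follow the $\alpha_i$, $\beta$, $\gamma_i$ introduced above:

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), diagonal case."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
                  - 1).sum(dim=-1)

def mhvae_elbo(log_px_z, q_mods, p_mods, q_core, alphas, gammas, beta=1.0):
    """log_px_z[i]: per-sample reconstruction term E_q[log p(x_i | z_i)].
    q_mods[i] = (mu, logvar) of q(z_i | x_i); p_mods[i] = (mu, logvar) of
    p(z_i | z_c), computed from a sample z_c; q_core = (mu, logvar) of q(z_c | .)."""
    mu_c, logvar_c = q_core
    # Core capacity term: KL(q(z_c | x_1:M) || N(0, I)).
    kl_core = gaussian_kl(mu_c, logvar_c,
                          torch.zeros_like(mu_c), torch.zeros_like(logvar_c))
    elbo = -beta * kl_core
    for a, g, rec, (mq, lq), (mp, lp) in zip(alphas, gammas, log_px_z, q_mods, p_mods):
        elbo = elbo + a * rec - g * gaussian_kl(mq, lq, mp, lp)
    return elbo.mean()
```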

2.1.2 Modality Representation Dropout

Figure 3: Diagram of the proposed Modality Representation Dropout (MRD) procedure for training the MHVAE: (a) after encoding the modality observations $x_{1:M}$, we compute the modality-specific hidden representations $h_{1:M}$ and sample the dropout mask $d$, in order to zero out the selected modality representations; (b) after the procedure, we concatenate the hidden representations and encode the multimodal core latent variable $z_c$.

We now turn to the methodology used to approximate the joint-modality posterior distribution. In the case of the MHVAE, we wish to encode information from the modality-specific data $x_{1:M}$ into the multimodal core latent variable $z_c$.

One approach to do so, the product-of-experts (POE), approximates the joint posterior with a product of Gaussian experts, including a prior expert [12]. However, this solution is computationally intensive (as it requires artificial sub-sampling of the observations during training) and suffers from overconfident expert predictions, resulting in sub-par cross-modality inference performance [9].

We propose a novel methodology to approximate the joint-modality posterior based on the dropout of modality-specific representations, as shown in Figure 3. We introduce a modality-data dropout mask $d = (d_1, \dots, d_M)$, with dimensionality $M$, such that

$\tilde{h}_{1:M} = d \odot h_{1:M}, \quad (10)$

where $h_{1:M}$ corresponds to the list of hidden-layer representations computed by the modality-data encoders, as seen in Figure 2. We effectively zero out the selected components by considering that

$\tilde{h}_i = d_i\, h_i = \begin{cases} h_i, & \text{if } d_i = 1 \\ \mathbf{0}, & \text{if } d_i = 0. \end{cases} \quad (11)$

During training, for each datapoint, we sample the mask components $d_i$ from a Bernoulli distribution,

$d_i \sim \mathrm{Bernoulli}(1 - p_i), \quad (12)$

where the hyper-parameters $p_i$ control the dropout probability of each modality representation. Moreover, we condition the mask sampling procedure so that at least a single modality representation is always non-zero. As such, for each sample we concatenate the resulting masked representations $\tilde{h}_{1:M}$ to be used as input to the multimodal core encoder. Accounting for modality representation dropout, the modified ELBO of the MHVAE becomes

$\mathcal{L}_{\mathrm{MHVAE}}^{\mathrm{MRD}}(x_{1:M}) = \sum_{i=1}^{M} \alpha_i\, \mathcal{L}_{\mathrm{rec}}^{(i)} - \beta\, D_{\mathrm{KL}}\!\left( q_\phi(z_c \mid \tilde{h}_{1:M}) \,\|\, p(z_c) \right) - \sum_{i=1}^{M} \gamma_i\, \mathbb{E}_{q_\phi(z_c \mid \tilde{h}_{1:M})}\!\left[ D_{\mathrm{KL}}\!\left( q_\phi(z_i \mid x_i) \,\|\, p_\theta(z_i \mid z_c) \right) \right], \quad (14)$

where the core posterior is now conditioned on the masked representations.
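A minimal sketch of this masking step is given below, assuming independent Bernoulli keep probabilities per modality and re-sampling of any all-zero mask as one way to enforce the at-least-one-modality condition (the actual conditioning scheme used by the authors may differ):

```python
import torch

def modality_representation_dropout(hidden, keep_probs):
    """Zero out whole modality representations before the core encoder.

    hidden:     list of M tensors h_i with shape (batch, d_i)
    keep_probs: list of M probabilities of keeping each modality
    """
    batch = hidden[0].shape[0]
    probs = torch.tensor(keep_probs).expand(batch, len(hidden))
    mask = torch.bernoulli(probs)                    # d ~ Bernoulli(keep_probs)
    # Re-sample rows where every modality was dropped, so that at least one
    # representation per datapoint remains non-zero.
    while (dead := mask.sum(dim=1) == 0).any():
        mask[dead] = torch.bernoulli(probs[dead])
    dropped = [h * mask[:, i:i + 1] for i, h in enumerate(hidden)]
    return torch.cat(dropped, dim=1)                 # input to the core encoder
```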

3 Evaluation

In this section, we evaluate the MHVAE’s performance as a multimodal generative model on standard multimodal datasets. Our model outperforms other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.

3.1 Multimodal Datasets

As in previous literature, we transform single-modality datasets into bimodal datasets by considering the label associated with each image as a modality in its own right. We also compare the MHVAE to existing multimodal generative models: JMVAE-kl [10] and MVAE [12]. For the JMVAE-kl model we consider . For the MVAE model, trained using the publicly available official implementation (https://github.com/mhw32/multimodal-vae-public), we employ the authors' suggested training hyper-parameters.

We evaluate our model on standard datasets from the literature: MNIST [16], FashionMNIST [17], and CelebA [18]. We report state-of-the-art performance on the first two datasets regarding generative modelling and cross-modality capabilities.

We train the MHVAE with no hyper-parameter tuning, i.e., $\alpha_i = \beta = \gamma_i = 1$. Moreover, we fix the dropout hyper-parameters $p_i$ to a common value for all modalities. For the MHVAE model, as shown in Figure 2, we consider two different types of networks: the modality networks, responsible for encoding the input data into the modality-specific latent spaces $z_i$ and the associated hidden representations $h_i$, and for the inverse generative process; and the core network, responsible for encoding the multimodal core latent variable $z_c$ from the representations $h_{1:M}$, from which we generate the modality-specific latent distributions $p_\theta(z_i \mid z_c)$. For fairness, on each dataset, we keep the network architectures consistent across models: the generative and inference networks of the baseline models share their architecture with the modality-specific networks of the MHVAE.

Moreover, we also consider a warm-up period on the regularization terms of the ELBO [19]: we linearly increase the weight of the prior regularization term on the modality-specific latent variables over an initial number of epochs, and we likewise linearly increase the weight of the Gaussian prior term on the core latent space. For the baselines, we consider a single warm-up period on the prior regularization of the latent space.
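The annealing itself can be implemented as a simple linear ramp on the regularization weights; the sketch below is a generic example, with placeholder epoch counts since the exact warm-up lengths are not reproduced here:

```python
def linear_warmup(epoch, warmup_epochs):
    """Linearly ramp a regularization weight from 0 to 1 over `warmup_epochs`."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)

# Hypothetical usage inside a training loop, scaling the two KL terms of Eq. (9):
# beta_t  = beta  * linear_warmup(epoch, core_warmup_epochs)
# gamma_t = gamma * linear_warmup(epoch, modality_warmup_epochs)
```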

We evaluate the reconstruction capabilities and cross-modality inference performance of the models. To do so, we estimate the image marginal log-likelihood, $\log p(x_I)$, the joint log-likelihood, $\log p(x_I, x_L)$, and the conditional log-likelihood, $\log p(x_I \mid x_L)$, of the observations through importance sampling. For MNIST and FashionMNIST, we consider 5,000 importance samples, and for CelebA we consider 500 samples. The evaluation metrics are derived in the appendix.
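As a reference, an importance-sampling estimate of a marginal log-likelihood can be computed in a numerically stable way with a log-sum-exp over K posterior samples; the helper below assumes the per-sample log-densities have already been evaluated:

```python
import math
import torch

def iw_log_likelihood(log_px_z, log_pz, log_qz_x):
    """Importance-sampling estimate of log p(x).

    All inputs have shape (K, batch): log densities evaluated at K latent
    samples z_k ~ q(z | x).  Returns a (batch,) tensor of estimates,
    log p(x) ~= log (1/K) sum_k p(x|z_k) p(z_k) / q(z_k|x).
    """
    k = log_px_z.shape[0]
    log_w = log_px_z + log_pz - log_qz_x          # log importance weights
    return torch.logsumexp(log_w, dim=0) - math.log(k)
```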

3.1.1 MNIST

Metric              Input   JMVAE      MVAE   MHVAE
log p(x_I)          I       -90.189    -      -89.050
log p(x_I, x_L)     I       -90.241    -      -89.183
log p(x_I, x_L)     L       -125.381   -      -121.401
log p(x_I, x_L)     I, L    -90.335    -      -89.143
log p(x_I | x_L)    L       -123.070   -      -118.856
Table 1: Log-likelihood values of the proposed evaluation metrics on the MNIST dataset for the MHVAE and other multimodal generative models. We estimate the latent variables considering as input the image (I), label (L) or joint (I, L) modalities, resorting to 5,000 importance samples. Due to numerical instabilities, we were unable to train or evaluate the MVAE baseline model.

For the MNIST dataset, we train all models on the 28×28 images and their 10-class labels. We consider a dataset split of 85% for training, part of which is held out for validation purposes, and the remaining 15% for evaluation.

The image modality network of the MHVAE model is composed of three linear layers with 512 hidden units each, using leaky rectified linear units as the activation function and batch normalization between hidden layers. Furthermore, we consider a 16-dimensional image-specific latent space. The label modality network is similarly composed of three linear layers with 128 hidden units, considering a 16-dimensional label-specific latent space. The core network is composed of three linear layers with 64 hidden units, considering a 10-dimensional core latent space. For the baselines, we consider a single 26-dimensional latent space.
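An illustrative PyTorch version of such an image modality encoder is sketched below; the layer sizes follow the description above, while the two-headed Gaussian output and the returned hidden representation are our assumptions about the interface:

```python
import torch.nn as nn

class MNISTImageEncoder(nn.Module):
    """Image modality encoder: 784 -> 512 -> 512 -> (mu, logvar) of a 16-d latent."""

    def __init__(self, latent_dim=16, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.LeakyReLU(),
        )
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.net(x.flatten(start_dim=1))   # hidden representation h_I
        return h, self.mu(h), self.logvar(h)   # h is also fed to the core encoder
```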

Figure 4: Image samples generated by the MHVAE model trained on the standard multimodal datasets: (a), (b), (c) show images conditionally generated by sampling the core latent variable given only the label modality, $z_c \sim q_\phi(z_c \mid x_L)$, with (a) a given digit label, (b) {Sneaker}, (c) {Male, Smiling}; and (d), (e), (f) show images generated by sampling from the prior $p(z_c)$.

We consider a Bernoulli likelihood for the image modality, $p_\theta(x_I \mid z_I)$, and a multinomial likelihood for the label modality, $p_\theta(x_L \mid z_L)$. Moreover, for the MHVAE model we consider separate warm-up periods for the modality-specific and core regularization terms, and a single warm-up period for the baselines.

We train all models for 500 epochs, considering a fixed learning rate and batch size. The estimates of the test log-likelihoods for all models are presented in Table 1. We report that the MHVAE outperforms the other state-of-the-art multimodal models on both single-modality and joint-modality metrics, despite the fact that its separate representation spaces are of lower dimensionality than the joint representation space employed by the JMVAE and the MVAE. Moreover, the MHVAE model is able to provide better cross-modality inference than the other models, as observed by the considerably better conditional log-likelihood $\log p(x_I \mid x_L)$ obtained when employing only the lower-dimensional label modality to estimate this quantity.

In Figure 4, we present images generated by the MHVAE, sampled from the prior $p(z_c)$ and conditioned on a given label, i.e., with the core latent variable estimated using $q_\phi(z_c \mid x_L)$. The quality of the sampled images indicates a suitable performance of the generative networks of the MHVAE model.

3.1.2 FashionMNIST

Metric              Input   JMVAE      MVAE       MHVAE
log p(x_I)          I       -232.427   -236.613   -231.753
log p(x_I, x_L)     I       -232.739   -242.628   -232.276
log p(x_I, x_L)     L       -244.378   -557.582   -243.932
log p(x_I, x_L)     I, L    -232.573   -241.534   -232.248
log p(x_I | x_L)    L       -242.060   -552.679   -241.662
Table 2: Log-likelihood values of the proposed evaluation metrics on the FashionMNIST dataset for the MHVAE and other multimodal generative models. We estimate the latent variables considering as input the image (I), label (L) or joint (I, L) modalities, resorting to 5,000 importance samples.

For FashionMNIST, we train the generative models on the 28×28 greyscale images and their 10-class labels, with the same proportional division of the dataset as in the previous case. For the MHVAE model, we implement a miniature DCGAN [20] architecture as the image-modality encoder, with Swish [21] as the activation function due to its performance in deep convolutional models. The network is composed of two convolutional layers with 32 and 64 channels, followed by a linear layer of 128 hidden units. For the core and label-modality inference and generator networks, we maintain the same architectures as before. We consider modality-specific and core latent spaces with the same dimensionality as before and employ the same training hyper-parameters as in the previous evaluation case.
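A sketch of such a miniature convolutional encoder is given below; the channel and unit counts follow the description above, while kernel sizes, strides and the use of nn.SiLU (PyTorch's Swish) are our assumptions:

```python
import torch.nn as nn

class FashionMNISTImageEncoder(nn.Module):
    """Mini DCGAN-style encoder: 1x28x28 -> 32 -> 64 conv channels -> 128-unit linear."""

    def __init__(self, latent_dim=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1), nn.SiLU(),   # 28 -> 14
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),  # 14 -> 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.SiLU(),
        )
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.features(x)                   # hidden representation h_I
        return h, self.mu(h), self.logvar(h)
```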

We train all models for 500 epochs, employing the Adam optimization algorithm [22] with a fixed learning rate and batch size. The estimates of the test log-likelihoods for all models are presented in Table 2. Once again, we report that the MHVAE outperforms the other state-of-the-art multimodal models on both single-modality and joint-modality metrics, as well as on label-to-image cross-modality inference.

In Figure 4, we present images generated by the MHVAE, sampled from the prior and conditioned on a given label, which provide evidence that the generative networks of the model perform suitably.

3.1.3 CelebA

For CelebA, we train the MHVAE on re-scaled colour images and a subset of 18 visually distinctive attributes [23]. We compose the image modality network of the MHVAE model as a miniature DCGAN [20]. This network is composed of four convolutional layers, followed by a linear layer of 512 hidden units. Furthermore, we consider a 48-dimensional image-specific latent space. The attribute modality network is composed of three linear layers with 512 hidden units, considering a 48-dimensional attribute-specific latent space. The core network is composed of three linear layers with 256 hidden units, considering a 16-dimensional core latent space. The baselines consider a single 64-dimensional latent space.

We train all models for 50 epochs, employing a fixed learning rate and batch size. For the MHVAE model, we again consider warm-up periods on both regularization terms, and for the baseline models a single warm-up period. The estimates of the test log-likelihoods, computed using 500 importance samples, are presented in Table 3. In this scenario, the MHVAE performs on par with the other state-of-the-art multimodal models on all metrics, albeit with slightly lower performance in comparison with the previous evaluations. In Figure 4, we present images generated by the MHVAE, sampled from the prior and conditioned on a given set of attributes.

Metric              Input   JMVAE      MVAE       MHVAE
log p(x_I)          I       -6260.35   -6256.65   -6271.35
log p(x_I, x_A)     I       -6264.59   -6270.86   -6278.19
log p(x_I, x_A)     A       -7204.36   -7316.12   -7303.64
log p(x_I, x_A)     I, A    -6262.67   -6266.14   -6276.57
log p(x_I | x_A)    A       -7191.11   -7309.10   -7296.22
Table 3: Log-likelihood values of the proposed evaluation metrics on the CelebA dataset for the MHVAE and other multimodal generative models. We estimate the latent variables considering as input the image (I), attributes (A) or joint (I, A) modalities, resorting to 500 importance samples.

3.2 Discussion

We have evaluated the MHVAE against state-of-the-art baselines on standard multimodal datasets, comparing our model with the JMVAE and the MVAE, two widely used models for multimodal representation learning.

The results, on increasingly complex datasets, attest to the importance of considering hierarchical representation spaces to model multimodal data distributions. Even when considering lower-dimensional spaces to learn the modality distributions, in comparison with the single multimodal space of the baselines, the MHVAE is able to achieve state-of-the-art results on the MNIST and FashionMNIST datasets with minimal hyper-parameter tuning.

On the CelebA dataset, the MHVAE performs on par with the other baseline models, which raises the question of how important the dimensionality of the representation spaces is in complex scenarios. Indeed, for a fair comparison with the baselines, we limited the MHVAE model to lower-dimensional representation spaces, which, on a complex dataset such as CelebA, results in a lower log-likelihood of the modalities. However, the MHVAE model is still capable of outperforming the MVAE model with regard to joint-modality and cross-modality inference estimated from the label modality. Regarding future work, we intend to address the question of the balance between representative capacity in the core and in the modality-specific distributions.

4 Related Work

Deep generative models have shown great promise in learning generalized latent representations of data. The VAE model [14] estimates a deep generative model through variational inference methods, encoding single-modality data in a single latent space regularized by a prior distribution. The regularization distribution is often a unit Gaussian, or a more complex posterior distribution [24, 25]. Due to the intractability of the marginal likelihood of the data, the model resorts to an inference network in the computation of the model’s evidence lower bound. This lower bound can be estimated, for example, through importance sampling techniques [26].

Hierarchical generative models have also been proposed in the literature to learn complex relationships between latent variables [19, 27, 28, 29]. However, these models consider representations created from a single modality and, as such, are able neither to provide a framework for cross-modality inference nor to represent multimodal data. On the other hand, VAE models have also been extended to learn joint distributions of several modalities by forcing the estimated single-modality representations to be similar, thus allowing cross-modality inference [10, 30, 11]. However, the necessity of introducing specific divergence terms in the model’s evidence lower bound for each combination of modalities hinders their application in scenarios with a large number of modalities. Another approach introduced the POE inference network, which reduces the number of encoding networks required for multimodal encoding [12], albeit with an increased computational training cost. In order to provide cross-modality inference capabilities, the existing models encode information from all modalities into a single, common latent variable space; they thus relinquish the generative capabilities that single-modality latent representation spaces possess. In this work, we present a novel multimodal generative model capable of learning hierarchical representation spaces.

5 Conclusions

In this work, by taking inspiration from the human cognitive framework, we presented the MHVAE, a novel multimodal hierarchical generative model. The MHVAE is able to learn separate modality-specific representations and a joint-modality representation, allowing for improved representation learning in comparison with the single-representation choice of other multimodal generative models. We have shown that, on standard multimodal datasets, the MHVAE is able to outperform other state-of-the-art multimodal generative models regarding modality-specific reconstruction and cross-modality inference.

We also proposed a novel methodology to approximate the joint-modality posterior, based on modality-specific representation dropout. With minimal computational cost, this approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during the model’s training. We aim to explore scenarios with a larger number of modalities in the future.

Moreover, we aim to employ the MHVAE as a perceptual representation model for artificial agents and to explore its application in deep multimodal reinforcement learning scenarios in which the agent must perform cross-modality inference to complete its task. Further inspired by human cognition and perceptual learning, we also intend to explore reinforcement learning mechanisms for the construction of the multimodal representations themselves.

References