
Benchmarking Multimodal Variational Autoencoders: GeBiD Dataset and Toolkit

Multimodal Variational Autoencoders (VAEs) have been a subject of intense research in the past years as they can integrate multiple modalities into a joint representation and can thus serve as a promising tool for both data classification and generation. Several approaches toward multimodal VAE learning have been proposed so far; their comparison and evaluation have, however, been rather inconsistent. One reason is that the models differ at the implementation level; another is that the datasets commonly used in these cases were not initially designed for the evaluation of multimodal generative models. This paper addresses both issues. First, we propose a toolkit for systematic multimodal VAE training and comparison. Second, we present a synthetic bimodal dataset designed for a comprehensive evaluation of the joint generation and cross-generation capabilities. We demonstrate the utility of the dataset by comparing state-of-the-art models.





1 Introduction

Variational Autoencoders (VAEs) kingma2013auto have become a multipurpose tool applied to various machine perception tasks and robotic applications over the past years pu2016variational, xu2017variational, nair2018visual. For example, some of the recent implementations address areas such as visual question answering chen2019multi, visual question generation uppal2021c3vqg or emotion recognition yang2019attribute. Recently, VAEs were also extended for the integration of multiple modalities, enabling mapping different kinds of inputs into a joint latent space. It is then possible to reconstruct one modality from another or to generate semantically matching pairs, provided the model successfully learns the semantic overlap among them.

Several different methods for joint multimodal learning have been proposed so far and new models are still being developed wu2018multimodal, shi2019variational, sutter2021generalized. Two of the widely recognized and compared approaches are the MVAE model wu2018multimodal, which uses a Product of Experts (PoE) to approximate the joint posterior, and MMVAE shi2019variational, which uses a Mixture of Experts (MoE). sutter2021generalized also recently proposed a combination of these two architectures referred to as Mixture-of-Products-of-Experts (MoPoE-VAE), which approximates the joint posterior on all subsets of the modalities using a generalized multimodal ELBO.

The versatility of these models (i.e. the possibility to classify, reconstruct and jointly generate data using a single model) naturally raises a dispute on how to assess their generative qualities. Wu and Goodman used test set log-likelihoods to report their results, whereas Shi et al. proposed four criteria that measure the coherence, synergy and latent factorization of the models using various qualitative and quantitative (usually dataset-dependent) metrics. Evaluation for these criteria was originally performed on several multimodal benchmark datasets such as MNIST deng2012mnist, CelebA liu2015faceattributes, the MNIST and SVHN combination shi2019variational or the Caltech-UCSD Birds (CUB) dataset wah2011caltech. All of the mentioned datasets are bimodal and comprise images paired either with labels (MNIST, CelebA), other images (MNIST-SVHN, PolyMNIST) or text (CelebA, CUB). Since none of these datasets was designed specifically for the evaluation of multimodal integration and semantic coherence of the generated samples, their usage is in certain aspects limited. More specifically, datasets comprising real-world images/captions (such as CUB or CelebA) are highly noisy, often biased and do not enable automated evaluation, which makes these datasets unsuitable for detailed analysis of the model’s performance. On the other hand, toy datasets like MNIST and SVHN do not offer different levels of complexity, provide only a few categories (10 digit labels) and cannot be used for generalization experiments.

Due to the above-mentioned limitations of the currently used benchmark datasets and also due to a high number of various implementations, objective functions and hyperparameters that are used for the newly developed multimodal VAEs, the conclusions on which models outperform the others substantially differ among the authors shi2019variational, kutuzova2021multimodal, daunhawer2021limitations. Recently, Daunhawer et al. daunhawer2021limitations published a comparative study of multimodal VAEs where they conclude that new benchmarks and a more systematic approach to their evaluation are needed. We agree with this statement and propose a unification of the various implementations into a single toolkit which enables training, evaluation and comparison of the state-of-the-art models and also faster prototyping of new methods. The toolkit is written in Python and enables the user to quickly train and test their model on arbitrary data with an automatic hyperparameter grid search. Moreover, we address the lack of suitable benchmark datasets for multimodal VAE evaluation by proposing a custom synthetic dataset called GeBiD (Geometric shapes Bimodal Dataset). This dataset comprises images of geometric shapes and their textual descriptions. It is designed for fast data generation, easy evaluation and also for the gradual increase of complexity to better estimate the progress of the tested models. It is described in greater detail in Section 3. For a brief comparison of the benchmark dataset qualities, see also Table 1.

In conclusion, the contributions of this paper are as follows:

  1. We propose a public toolkit which enables systematic development, training and evaluation of the state-of-the-art multimodal VAEs.

  2. We provide a synthetic image-text dataset called GeBiD designed specifically for the evaluation of the generative capabilities of multimodal VAEs on 5 levels of complexity.

The toolkit and code for the generation of the dataset (as well as a download link for a ready-to-use version of the dataset) are available on GitHub.

2 Related Work

In this section, we first briefly describe the state-of-the-art multimodal variational autoencoders and how they are evaluated, then we focus on datasets that have been used to demonstrate the models’ capabilities.

2.1 Multimodal VAEs and Evaluation

Multimodal VAEs are an extension of the standard Variational Autoencoder (as proposed by Kingma et al. kingma2013auto) that enables joint integration and reconstruction of two or more modalities. During the past years, a number of approaches toward multimodal integration have been presented suzuki2016joint, wu2018multimodal, shi2019variational, vasco2020mhvae, sutter2021generalized, joy2021learning. For example, the model proposed by Suzuki et al. suzuki2016joint learns the joint multimodal probability through a joint inference network and instantiates an additional inference network for each subset of modalities. A more scalable solution is the MVAE model wu2018multimodal, where the joint posterior distribution is approximated using the product of experts (PoE), exploiting the fact that a product of individual Gaussians is itself a Gaussian. In contrast, the MMVAE approach shi2019variational uses a mixture of experts (MoE) to estimate the joint variational posterior based on individual unimodal posteriors. The MoPoE architecture from Sutter et al. sutter2021generalized combines the benefits of PoE and MoE approaches by computing the joint posterior for all subsets of modalities. Another recent extension is the DMVAE model, in which the authors enable the encoders to separate the shared and modality-specific features in the latent space for a disentangled multimodal VAE lee2021private.
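The three fusion strategies described above can be illustrated with a minimal sketch for one-dimensional Gaussian experts. This is not the toolkit's implementation: the function names are hypothetical, and the optional standard-normal prior expert and the uniform mixture weights are assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def product_of_experts(mus, variances, include_prior=True):
    """Combine 1-D Gaussian experts N(mu_i, var_i) into a single Gaussian.

    The product of Gaussian densities is (up to normalisation) a Gaussian whose
    precision is the sum of the expert precisions and whose mean is the
    precision-weighted average of the expert means.
    """
    mus, variances = list(mus), list(variances)
    if include_prior:  # assumption: a standard-normal prior acts as one more expert
        mus.append(0.0)
        variances.append(1.0)
    precisions = [1.0 / v for v in variances]
    joint_var = 1.0 / sum(precisions)
    joint_mu = joint_var * sum(m * p for m, p in zip(mus, precisions))
    return joint_mu, joint_var


def mixture_of_experts_sample(mus, variances, rng):
    """Draw one sample from a uniform mixture of the unimodal Gaussians."""
    k = rng.randrange(len(mus))  # pick one expert with equal probability
    return rng.gauss(mus[k], math.sqrt(variances[k]))


def modality_subsets(names):
    """All non-empty subsets of modalities, as a MoPoE-style model would enumerate."""
    return [c for r in range(1, len(names) + 1) for c in combinations(names, r)]


mu, var = product_of_experts([0.0, 2.0], [1.0, 1.0], include_prior=False)
print(mu, var)  # 1.0 0.5
print(modality_subsets(["image", "text"]))
print(mixture_of_experts_sample([0.0, 2.0], [1.0, 1.0], random.Random(0)))
```

Note that the PoE result is sharper (variance 0.5) than either expert, while an MoE sample always comes from exactly one unimodal posterior.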

The evaluation of the above-mentioned models has also evolved over time. Wu and Goodman wu2018multimodal measured the test marginal, joint and conditional log-likelihoods together with the variance of log importance weights. Shi et al. shi2019variational proposed four criteria for evaluation of the generative capabilities of multimodal VAEs: coherent joint generation, coherent cross-generation, latent factorisation and synergy. All criteria are evaluated both qualitatively (through empirical observation of the generated samples) and quantitatively: by adopting pre-trained classifiers for evaluation of the generated content, by training a classifier on the latent vectors to test whether the classes are distinguishable in the latent space, or by calculating the correlation between the jointly and cross-generated samples using the Canonical Correlation Analysis (CCA).

Besides certain dataset-dependent alternatives, the most recent papers use a combination of the above-mentioned metrics for multimodal VAE comparison sutter2021generalized, joy2021learning, daunhawer2021limitations, kutuzova2021multimodal. Despite that, the conclusions on which model performs the best according to these criteria substantially differ. According to a thorough comparative study from daunhawer2021limitations, none of the current multimodal VAEs sufficiently fulfils all of the four criteria specified by shi2019variational. Furthermore, the optima of certain training hyperparameters might be different for each model (as was shown e.g. with the regularisation parameter $\beta$ daunhawer2021limitations), which naturally raises the need for automated and systematic comparison of these models over a large number of hyperparameters, datasets and training schemes. sutter2021generalized released public code which allows comparison of the MoE, PoE, MoPoE and DMVAE approaches. However, the experiments are dataset-dependent, and an extension to other data types would thus require writing a substantial amount of new code.

In this paper, we propose a publicly available toolkit for systematic training, evaluation and comparison of the state-of-the-art multimodal VAEs. Special attention is paid to hyperparameter grid search and automatic visualizations of the learning during training. To our knowledge, it is the only model- and dataset-agnostic tool available in this area that would allow fast implementation of new approaches and their testing on arbitrary types of data.

2.2 Multimodal Datasets

At this time, there are several benchmark datasets commonly used for multimodal VAE evaluation. The majority are bimodal, where one or both modalities are images. In some of the datasets, the bimodality is achieved by splitting the data into images as one modality and the class labels as the second - this simplifies the evaluation during inference, yet such a dataset does not enable e.g. testing the models for generalization capabilities and the overall number of classes is usually very low. Examples of such datasets are MNIST deng2012mnist, FashionMNIST xiao2017fashion or MultiMNIST sabour2017dynamic. Another class comprises image-image datasets such as MNIST and SVHN netzer2011reading (as used in shi2019variational), where the content is semantically identical and the only difference is the form (i.e. handwritten digits vs. house numbers). This can be seen as integrating two identical modalities with different amounts of noise - while such an experiment might serve as a basic proof of concept, it does not evaluate whether the model can integrate multiple modalities with completely different data representations.

Figure 1: Examples of our proposed GeBiD dataset. The dataset contains RGB images (left columns) and their textual descriptions (right columns). We provide 5 levels of difficulty (left to right). Level 1 only varies the shape attribute, Level 2 varies shape and size, Level 3 varies also the colour attribute, Level 4 varies also the background color and Level 5 varies also the shape location. See a more detailed description of the dataset in Section 3.

An example of a bimodal dataset with an image-text combination is the CelebA dataset containing real-world images of celebrities and textual descriptions of up to 40 binary attributes (such as male, smiling, beard etc.) liu2015faceattributes. While such a dataset is far more complex, the qualitative evaluation of the generated samples is difficult and ambiguous due to the subjectiveness of certain attributes (attractive, young, wearing lipstick) combined with the usually blurry output images generated by the models. Another recently used image-text dataset is the Caltech-UCSD Birds (CUB) dataset wah2011caltech comprising real-world bird images accompanied with manually annotated text descriptions (e.g. this bird is all black and has a long pointy beak). However, these images are too complex to be generated by the state-of-the-art models (as proven by Daunhawer et al. daunhawer2021limitations, we also show our own results in the Appendix) and the authors thus only use their features and perform the nearest-neighbour lookup to match them with an actual image shi2019variational. This also makes it impossible to test the models for generalization capabilities (i.e. making an image of a previously unseen bird).

Besides the above-mentioned datasets that have already been used for multimodal VAE comparison, there are also other multimodal (mainly image-text) datasets available such as Laion-400m schuhmann2021laion, Microsoft COCO lin2014microsoft or Conceptual Captions sharma2018conceptual. Similar to CUB, these datasets include real-world images with human-made annotations, which can be used to estimate how the model would perform on real-world data. However, they cannot be used for deeper analysis and comparison of the models as they are very noisy (including typos, wrong grammar, many synonyms etc.), subjective, biased or they require common sense (e.g. “piercing eyes”, “clouds in the sky” or “powdered with sugar”). This makes the automation of the semantic evaluation of the generated outputs difficult. A more suitable example would be the CLEVR dataset johnson2017clevr, comprising synthetic images and natural language questions. However, these questions require complex logical reasoning which is currently out of bounds for the state-of-the-art multimodal VAEs.

In conclusion, the currently used benchmark datasets for multimodal VAE evaluation are not optimally suited for benchmarking due to oversimplification of the multimodal scenario (such as the image-label combination or image-image translation where both images show digits), or, in the opposite case, due to overly complex modalities that are challenging to reconstruct and difficult to evaluate. In this paper, we address the lack of suitable benchmark datasets for multimodal VAE evaluation by proposing a custom synthetic dataset called GeBiD (Geometric shapes Bimodal Dataset). This dataset comprises images of geometric shapes and their textual descriptions. It is designed for fast data generation, easy evaluation and also for the gradual increase of complexity to better estimate the progress of the tested models. It is described in greater detail in Section 3. For a brief comparison of the benchmark dataset qualities, see also Table 1.

3 GeBiD Dataset

We propose a synthetic image-text dataset for a clear and systematic evaluation of the multimodal VAE models. The main highlights of the dataset are its scaled complexity, quick adjustability towards noise or the number of classes and its rigid structure which enables automated qualitative and quantitative evaluation. A ready-to-use version of the dataset can be downloaded via the link provided in our repository 1; however, we recommend that users generate the dataset themselves, as this is much faster (around 1 minute on a CPU for 10 000 samples) than downloading it at an average connection speed.

3.1 Dataset structure

The dataset comprises images of geometric shapes (64x64x3 pixels) with a defined set of 1 - 5 attributes (designed for a gradual increase of complexity) and their textual descriptions. The full variability covers 6 shape primitives (spiral, line, circle, square, semicircle, pie-slice), 2 sizes (large, small), 12 colours, 4 locations (top/bottom + left/right) and 2 backgrounds (black/white), creating 1152 possible unique combinations (see Fig. 1 for examples and Table 7 in the Appendix for the statistics and Section A1 for further information). Although the default presented version of the dataset only adds noise to the exact shape positions within the given image quadrant, it is also possible to introduce noise in the shades of the colours, shape rotations, shape sizes or background colours. The user can also limit the spectrum of attributes for certain shapes to see whether the models can generalize during testing (e.g., generate a blue square although it only saw other blue shapes and different colour squares but never this exact combination). The default version of the dataset has 9000 training and 1000 validation samples.
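The count of 1152 unique combinations follows directly from the attribute sizes (6 x 2 x 12 x 4 x 2). The short sketch below checks this arithmetic and shows how one attribute combination could be held out for a generalization test. The colour names are placeholders (the twelve actual colours are not listed here), and the dictionary layout is an assumption rather than the generator's real data structure.

```python
from itertools import product

# Attribute values as described in the text; colour names are placeholders.
ATTRIBUTES = {
    "shape": ["spiral", "line", "circle", "square", "semicircle", "pie-slice"],
    "size": ["large", "small"],
    "colour": [f"colour_{i}" for i in range(12)],
    "location": ["top left", "top right", "bottom left", "bottom right"],
    "background": ["black", "white"],
}

# Every unique combination of the five attributes.
combos = list(product(*ATTRIBUTES.values()))
print(len(combos))  # 1152

# Hold out one shape/colour pairing to test generalization to an unseen combination.
held_out = [c for c in combos if c[0] == "square" and c[2] == "colour_0"]
train_pool = [c for c in combos if c not in held_out]
print(len(held_out), len(train_pool))  # 16 1136
```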

The textual descriptions comprise 1 - 9 words (based on the selected number of attributes) which have a rigid order within the sentence (i.e. size, colour, shape, position and background colour). To make this modality challenging to learn, we do not represent the text at the word level, but rather at the character level. The text is thus represented as vectors $[x_1, \dots, x_N]$, where $N$ is the number of characters in the sentence (the longest sentence has 52 characters) and $x_1, \dots, x_N$ are the one-hot encodings of length 27 (full English alphabet + space). The total sequence length is thus different for each textual description - this is automatically handled by the toolkit using zero-padding and the creation of masks (see Section 5). A PCA visualization of GeBiD Level 5 for both image and text modalities is shown in the Appendix in Fig. 3.
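The character-level representation with zero-padding and masks can be sketched as follows. This is an illustrative stand-alone version using plain Python lists, not the toolkit's code; the ordering of the alphabet and the names `encode` and `decode` are assumptions.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters + space = 27 symbols
MAX_LEN = 52  # longest description reported for the dataset


def encode(sentence, max_len=MAX_LEN):
    """Return (one-hot rows, mask) for a lowercase sentence, zero-padded to max_len."""
    if len(sentence) > max_len:
        raise ValueError("sentence longer than max_len")
    rows = []
    for ch in sentence:
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(ch)] = 1  # raises ValueError for unsupported characters
        rows.append(row)
    pad = max_len - len(sentence)
    rows.extend([[0] * len(ALPHABET) for _ in range(pad)])  # all-zero padding rows
    mask = [1] * len(sentence) + [0] * pad  # 1 = real character, 0 = padding
    return rows, mask


def decode(rows, mask):
    """Invert encode(), ignoring padded positions."""
    return "".join(ALPHABET[r.index(1)] for r, m in zip(rows, mask) if m)


rows, mask = encode("small red circle")
print(len(rows), len(rows[0]), sum(mask))  # 52 27 16
print(decode(rows, mask))  # small red circle
```

The mask lets a loss function or an attention layer ignore the padded positions, so descriptions of different lengths can share one batch.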

3.2 Scalable complexity

To enable finding the minimal functioning scenario for each model, the GeBiD dataset can be generated in multiple levels of difficulty - by the difficulty we mean the number of semantic domains the model has to distinguish (rather than the size/dimensionality of the modalities). Altogether, there are 5 difficulty levels (see Fig. 1):

  • Level 1 - the model only learns the shape names (e.g. circle)

  • Level 2 - the model learns the shape and its size (e.g. small circle)

  • Level 3 - the model has to learn the shape, its size and colour (small red circle)

  • Level 4 - the model also has to learn the background colour (small red circle on black)

  • Level 5 - the model learns also the location (small red circle at the bottom right on black)

Since the above-mentioned levels are defined for the text modality, the user can also choose whether to keep the image complexity at the same level (e.g., for Level 1, only the shape varies with a fixed size, colour, position and background), or whether to use the full spectrum of attributes. The complexity of the bimodal dataset can be thus highly balanced (Level 5 for images and text) or highly imbalanced (Level 5 for images and Level 1 for text). We provide a detailed description of how to configure and generate the dataset on our Github repository 1.
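As a rough illustration of how the five levels map onto the text modality, the sketch below builds a description from a fixed attribute dictionary. The word order and the connecting words ("at the", "on") are taken from the examples in the list above; the function and dictionary names are hypothetical and do not mirror the actual generation script.

```python
# Attributes included in the text description at each difficulty level,
# listed in the rigid sentence order (size, colour, shape, location, background).
LEVEL_ATTRIBUTES = {
    1: ["shape"],
    2: ["size", "shape"],
    3: ["size", "colour", "shape"],
    4: ["size", "colour", "shape", "background"],
    5: ["size", "colour", "shape", "location", "background"],
}


def describe(sample, level):
    """Build the textual description of a sample for the given difficulty level."""
    parts = []
    for name in LEVEL_ATTRIBUTES[level]:
        value = sample[name]
        if name == "location":
            parts.append(f"at the {value}")
        elif name == "background":
            parts.append(f"on {value}")
        else:
            parts.append(value)
    return " ".join(parts)


sample = {
    "size": "small",
    "colour": "red",
    "shape": "circle",
    "location": "bottom right",
    "background": "black",
}
for level in range(1, 6):
    print(level, describe(sample, level))
```

An imbalanced setup would render the image from the full `sample` while calling `describe(sample, 1)` for the text.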

3.3 Automated evaluation

The rigid structure and synthetic origin of the GeBiD dataset enable automatic evaluation of the coherence of the generated samples. The generated text can be tokenized and each word can be looked up in the dataset vocabulary based on its position in the sentence (for example, the first word always has to refer to the size in case of Level 5 of the dataset). The visual attributes of the generated images can be estimated using histograms (to see if the shape and background colours are correct) and computer vision libraries such as OpenCV (findContours(), matchTemplate(), making masks using thresholding etc.) for template matching of the corresponding shapes. Based on these metrics, it is possible to calculate the joint- and cross-generation accuracies for the given data pair - the image and text samples are considered coherent only if they correspond in all evaluated attributes. The fraction of semantically coherent (correct) generated pairs can thus serve as a percentage accuracy of the model.

We report the joint- and cross-generation accuracies on three levels: Strict, Feats and Letters (only applicable for text outputs). The Strict metric measures the percentage of completely correct samples, i.e. there is zero error tolerance (all letters and all features in the image must be correct). The Feats metric measures the ratio of correct features per sample (words for the text modality and visual features for the image modality), i.e. an accuracy ratio of 1.75(5)/5 on Level 5 means that on average 1.75 ± 0.05 features out of 5 are recognized correctly for each sample. Letters measures the average percentage of correct letters in the text outputs.
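For text outputs, the three metrics could be computed along the following lines. This is a simplified sketch and not the toolkit's evaluation code: it assumes that features are compared position by position, that the location ("at the ...") and background ("on ...") phrases each count as one feature, and that the letter accuracy is normalised by the longer of the two strings.

```python
def split_features(text):
    """Split a description into feature phrases, keeping 'at the ...' and 'on ...' together."""
    words = text.split()
    feats, i = [], 0
    while i < len(words):
        if words[i] == "at":    # location phrase: "at the <vertical> <horizontal>"
            feats.append(" ".join(words[i:i + 4]))
            i += 4
        elif words[i] == "on":  # background phrase: "on <colour>"
            feats.append(" ".join(words[i:i + 2]))
            i += 2
        else:                   # size, colour or shape: a single word
            feats.append(words[i])
            i += 1
    return feats


def text_metrics(generated, targets):
    """Return Strict [%], Feats [mean correct features] and Letters [%] over a batch."""
    strict_hits, feats, letters = 0, [], []
    for gen, tgt in zip(generated, targets):
        strict_hits += int(gen == tgt)
        feats.append(sum(g == t for g, t in zip(split_features(gen), split_features(tgt))))
        matches = sum(g == t for g, t in zip(gen, tgt))
        letters.append(100.0 * matches / max(len(gen), len(tgt), 1))
    n = len(targets)
    return {
        "strict": 100.0 * strict_hits / n,
        "feats": sum(feats) / n,
        "letters": sum(letters) / n,
    }


target = "small red circle at the bottom right on black"
outputs = [target, "large red square at the bottom left on black"]
print(text_metrics(outputs, [target, target]))
```

In this example the first output is fully correct and the second matches only the colour and background, so Strict is 50 % and Feats is 3.5 out of 5.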

Dataset Two Complex Modalities Different Data Domains Unambiguous Qualitative Evaluation Multiple Difficulty Levels
MNIST deng2012mnist N/A
FashionMNIST xiao2017fashion N/A
MNIST-SVHN shi2019variational
CelebA liu2015faceattributes
CUB wah2011caltech
GeBiD (ours)
Table 1: Comparison of the currently used bimodal benchmark datasets for multimodal VAE evaluation. We compare whether both modalities are complex (e.g. the second modality is not a label), whether the modalities are from different domains (e.g. not two images), whether their empirical evaluation is unambiguous and whether they are scalable for various difficulty levels.

4 Benchmarking Toolkit

Our proposed toolkit 1 was developed to facilitate and unify the evaluation and comparison of multimodal VAEs. Due to its modular structure, the tool enables adding new models, datasets, encoder and decoder networks or objectives without the need to modify any of the remaining functionalities. It is also possible to train unimodal VAEs to see whether the multimodal integration distorts the quality of the generated samples.

The toolkit is written in Python, and the models are defined and trained using the PyTorch library NEURIPS2019_9015. For clarity, we directly state in the code which of the functions (e.g. various objectives) were taken from previous implementations so that the user can easily look up the corresponding mathematical expressions.

Currently, the toolkit incorporates the MVAE wu2018multimodal, MMVAE shi2019variational, MoPoE-VAE sutter2021generalized and DMVAE lee2021private models. The datasets supported by default are MNIST, MNIST-SVHN, CUB and our GeBiD dataset. We also provide instructions on how to easily train the models on any new kind of data. As for the encoder and decoder neural networks, we offer fine-tuned convolutional networks for image data (for all of the supported datasets) and several Transformer networks aimed at sequential data such as text, image sequences or actions (these can be robotic, human or any other kind of actions).

Although we are continuously extending the toolkit with new models and functionalities, the main emphasis is placed on providing a tool that any scientist can easily adjust for their own experiments. The long-term effort behind the project is to make the findings on multimodal VAEs reproducible and replicable. We thus also provide tutorials on how to add new models or datasets to the toolkit in our GitHub documentation 2.

In the following subsections, we highlight some additional features that our toolkit offers to the user.

4.1 Experiment configuration

The training setup and hyperparameters for each experiment can be defined using a YAML config file. Here the user can define any number and combination of modalities (unimodal training is also possible), modality-specific encoders and decoders, desired multimodal integration method, reconstruction loss term, objective (this includes various subsampling strategies due to their reported impact on the model convergence wu2018multimodal, daunhawer2021limitations) and several training hyperparameters (batch size, latent vector size, learning rate, seed, optimizer and more). For an example of the training config file, please refer to the toolkit repository 1.

4.2 Evaluation methods

We provide both dataset-independent and dataset-dependent evaluation metrics. The dataset-independent evaluation methods are the estimation of the test log-likelihood, plots with the KL divergence and reconstruction losses for each modality, and visualizations of the latent space using Principal Component Analysis (PCA), t-SNE and latent space traversals (reconstructions of latent vectors randomly sampled across each dimension).

The dataset-dependent methods focus on evaluation of the 4 criteria specified by Shi et al. shi2019variational: coherent joint generation, coherent cross-generation, latent factorisation and synergy. For qualitative evaluation, joint- and cross-generated samples are visualised during and after training. For quantitative analysis of the MNIST, MNIST-SVHN and CUB datasets, we adopt the previously used methods by Shi et al. shi2019variational and Daunhawer et al. daunhawer2021limitations (see Subsection 2.1 for details on these metrics). For our GeBiD dataset, we offer an automated qualitative analysis of the joint and cross-modal coherence - the simplicity and rigid structure of the dataset enable binary (correct/incorrect) evaluation for each generated image (without the need for pre-trained classifiers) and also for each letter/word in the generated text. For more details, see Section 3.3.

4.3 Dataset generation

The proposed GeBiD dataset and its variations can be generated using a script provided in our toolkit - it allows the user to choose the complexity of the data (we provide 5 "difficulty" levels based on the number of included shape attributes) and whether the complexity is balanced or imbalanced for the two modalities (e.g. images with all 5 attributes can be described using only one word). The dataset generation requires only the standard computer vision libraries and 10 000 samples can be generated within a minute on a CPU machine. You can read more about the dataset in Section 3.

5 Benchmark Study

In this section, we demonstrate the utility of the proposed toolkit and dataset on a set of experiments comparing two selected state-of-the-art multimodal VAEs.

5.1 Experiments

We perform a set of experiments to compare the two widely discussed multimodal VAE approaches - MVAE wu2018multimodal, with the Product-of-Experts multimodal integration, and MMVAE shi2019variational, presenting the Mixture-of-Experts strategy. Firstly, to verify that the implementations are correct, we replicated the MNIST-SVHN experiments presented in shi2019variational. We used the same encoder and decoder networks and training hyperparameters as in the original implementation. We trained MVAE with the ELBO objective and MMVAE with the IWAE objective as proposed. The results were consistent with those reported in the paper (see Fig. 2 in the Appendix for visualizations).

Next, we train both models on the GeBiD dataset consecutively on all 5 levels of complexity and perform a hyperparameter grid search over the dimensionality of the latent space, batch size and reconstruction loss term. We show the qualitative and quantitative results and compare them with the findings from the recent studies daunhawer2021limitations, kutuzova2021multimodal.

| Level | Model | Dims | Txt→Img Strict [%] | Txt→Img Feats [ratio] | Img→Txt Strict [%] | Img→Txt Feats [ratio] | Img→Txt Letters [%] | Joint Strict [%] | Joint Feats [ratio] |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MMVAE | 64 | 18 (6) | 0.18(6)/1 | 0 (1) | 0.0(0)/1 | 16 (5) | 21 (7) | 0.21(7)/1 |
| 1 | MVAE | 64 | 94 (1) | 0.94(1)/1 | 16 (1) | 0.16(1)/1 | 25 (1) | 7 (1) | 0.07(1)/1 |
| 2 | MMVAE | 128 | 6 (2) | 0.56(12)/2 | 6 (2) | 0.64(4)/2 | 43 (1) | 3 (1) | 0.40(6)/2 |
| 2 | MVAE | 32 | 70 (0) | 1.68(0)/2 | 7 (0) | 0.54(4)/2 | 41 (3) | 6 (1) | 0.98(0)/2 |
| 3 | MMVAE | 128 | 0 (0) | 0.66(3)/3 | 0 (0) | 0.36(12)/3 | 26 (3) | 0 (0) | 0.36(6)/3 |
| 3 | MVAE | 128 | 49 (2) | 2.16(3)/3 | 0 (1) | 0.60(3)/3 | 27 (0) | 0 (0) | 0.93(3)/3 |
| 4 | MMVAE | 128 | 0 (0) | 0.92(32)/4 | 0 (0) | 0.32(28)/4 | 20 (1) | 0 (0) | 0.28(40)/4 |
| 4 | MVAE | 128 | 1 (1) | 1.52(20)/4 | 0 (0) | 0.96(24)/4 | 26 (2) | 0 (0) | 0(0)/4 |
| 5 | MMVAE | 64 | 0 (0) | 0.80(10)/5 | 0 (0) | 0.75(10)/5 | 18 (1) | 0 (0) | 0.45(10)/5 |
| 5 | MVAE | 32 | 1 (0) | 1.75(5)/5 | 0 (0) | 0.40(0)/5 | 20 (1) | 0 (0) | 0.10(0)/5 |
Table 2: Comparison of accuracies for the MMVAE and MVAE models trained on our GeBiD dataset for all five levels. Strict refers to the percentage of completely correct samples (sample pairs in joint generation), Feats shows the ratio of correct features (i.e., 1.52(20)/4 for Level 4 for the MVAE model means that on average 1.52 ± 0.20 features out of 4 are recognized correctly for each sample) and Letters shows the mean percentage of correctly reconstructed letters (computed sample-wise). Standard deviations over 5 seeds are in brackets. For each model, we chose the latent space dimensionality which performs the best (Dims).


Model comparison on GeBiD

We use the unified implementations of MVAE and MMVAE as presented in our toolkit 1, i.e. the encoder and decoder architectures are fixed for both models as well as the training/testing procedure; the only difference is thus in the multimodal mixing approach (PoE vs. MoE). For MVAE, we use the sub-sampling paradigm combining the ELBO terms for full and partial observations, as was recommended in the original proposal wu2018multimodal. For the image encoder and decoder (GeBiD comprises 64x64x3 RGB images), we use a 4-layer convolutional network, while for the text inputs (one-hot encodings of length 27, with a maximum description length of 52 characters), we use a Transformer network adapted from the ACTOR model originally used for motion reconstruction petrovich2021action. We train with a fixed learning rate using the Adam optimizer kingma2014adam for 600 epochs and report results averaged over 5 seeds. Details on the used computational resources and training times can be found in the Appendix. In these experiments, we varied the batch size (16, 32, 64) and latent dimensionality (16-D, 32-D, 64-D and 128-D latent space) to find the optimal performance for both models.
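The size of this search space can be made explicit with a small sketch that expands the varied settings into individual runs. The key names and seed values are hypothetical and are not the toolkit's actual configuration schema; only the value ranges come from the text.

```python
from itertools import product

# Values varied in the experiments; key names and seed values are illustrative only.
GRID = {
    "model": ["mvae", "mmvae"],
    "batch_size": [16, 32, 64],
    "latent_dim": [16, 32, 64, 128],
    "seed": [0, 1, 2, 3, 4],
}
# Settings kept fixed across all runs.
FIXED = {"epochs": 600, "optimizer": "adam"}


def expand_grid(grid, fixed):
    """Return one config dict per combination of the grid values."""
    keys = list(grid)
    return [dict(fixed, **dict(zip(keys, values))) for values in product(*grid.values())]


runs = expand_grid(GRID, FIXED)
print(len(runs))  # 120
print(runs[0])
```

With 2 models, 3 batch sizes, 4 latent sizes and 5 seeds this gives 120 runs per dataset level, before the reconstruction loss term is varied as well.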

Comparison of reconstruction loss terms

To show how certain implementation details might influence the overall performance of the compared models, we varied the specific implementation of the reconstruction term in the ELBO objective as they differ in the original codes for MVAE and MMVAE. Shi et al. shi2019variational used the standard negative log-likelihood (NLL) as the reconstruction loss, which is another term for multiclass cross-entropy:

Figure 2: Qualitative results of the MVAE wu2018multimodal and MMVAE shi2019variational models trained on Level 1, 3 and 5 of our GeBiD dataset. Output Img shows the output images conditioned either on the image (reconstruction) or the text description (cross-generation). Correspondingly, Output Txt shows the output text conditioned either on the image (cross-generation) or the text (reconstruction). The cross-generation results for the MVAE model show that it succeeded in learning the joint posterior for the two modalities, while MMVAE completely failed at this task. On the other hand, the mixture-of-experts approach in MMVAE enables better reconstruction quality of the image-image and text-text pairs when compared to MVAE.

while Wu and Goodman wu2018multimodal train the image reconstruction using binary cross-entropy loss (BCE):


where the prediction is given for each of the M RGB values of the image and the ground truth is normalized between 0 and 1.
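As a rough illustration of the two reconstruction terms described above, the sketch below computes a multiclass cross-entropy for a single categorical variable and a binary cross-entropy over values in the range 0 to 1. The function names, the clipping constant `eps` and the choice to sum (rather than average) over the values are assumptions made for this example and are not taken from either original implementation.

```python
import math

def categorical_nll(probs, target_index):
    """Multiclass cross-entropy (NLL) for one categorical variable."""
    return -math.log(probs[target_index])

def binary_cross_entropy(predictions, targets, eps=1e-7):
    """BCE summed over all values; summing vs. averaging is an assumption."""
    total = 0.0
    for p, y in zip(predictions, targets):
        # Clip predictions to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total

# Toy predicted class probabilities, with the true class at index 0.
class_probs = [0.7, 0.2, 0.1]
nll = categorical_nll(class_probs, 0)

# Toy predicted and ground-truth values normalized between 0 and 1.
pixels_pred = [0.9, 0.1, 0.5]
pixels_true = [1.0, 0.0, 0.5]
bce = binary_cross_entropy(pixels_pred, pixels_true)
```

Note that BCE accepts soft targets between 0 and 1, which is why it can be applied directly to normalized RGB values.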

We report and discuss the results in the following section.

5.1.1 Results

The detailed results for all experiments can be found in the Appendix. Here we highlight the findings that best demonstrate the utility of our toolkit and dataset.

Model comparison on GeBiD

We first compare the MVAE and MMVAE models on the GeBiD dataset across all 5 levels of difficulty. For an overview of the attributes included in each level, please see Section 3. In Fig. 2, we show the qualitative results for levels 1, 3 and 5 of the dataset for models trained with batch size 32 and a 128-D latent space, which were found to be the optimal values for this dataset in the hyperparameter grid search (see the Appendix). We show both the reconstructed and cross-generated (conditioned on the other modality) samples for images and text. Moreover, we report the cross- (Txt→Img and Img→Txt) and joint-coherency accuracies for all levels in Table 2. The Strict metrics show the percentage of completely correct samples, while Feats and Letters show the average proportion of correct words (or visual features for images) or letters per sample.

The MVAE model clearly outperforms the MMVAE model in cross-generation capabilities: even at the most difficult level, the text retrieved from images (and vice versa) is semantically coherent. On the other hand, MMVAE produces better quality outputs in the uni-modal reconstructions (i.e. image-to-image and text-to-text). This finding is in contrast with the conclusions reported by Daunhawer et al. daunhawer2021limitations, who state that the sub-sampling paradigm reduces the generative quality of the models. However, they evaluate the models on PolyMNIST sutter2021generalized, which comprises digit images in various styles. In line with our results, the comparative study by Kutuzova et al. kutuzova2021multimodal showed that the PoE approach outperforms the MoE approach on datasets with a multiplicative ("AND") combination of modalities, which is also the case for our GeBiD dataset.

Figure 3: Influence of two different reconstruction loss terms on the reconstruction quality of the images in the GeBiD dataset. The upper line (input) shows the original, recon. shows the reconstructions. BCE stands for binary cross-entropy, NLL stands for negative log-likelihood reconstruction loss. We show the results for the MVAE and MMVAE models.

In summary, more research is needed to conclude whether the MVAE and MMVAE models are indeed suitable for different types of modalities. However, we demonstrate that the GeBiD dataset is challenging for the state-of-the-art models as they reach relatively low accuracy even for Level 3 (see Appendix for more details). We also did not introduce any additional noise in our experiments such as colour and size alterations or object rotations which would additionally increase the complexity.

Comparison of reconstruction terms

In this experiment, we compared how the different implementations of the reconstruction loss terms influence the resulting reconstructions. We find that the negative log-likelihood loss leads to significantly impaired reconstruction quality compared to binary cross-entropy (see Fig. 3 and the table of cross-coherency accuracies in the Appendix). This is a surprising finding, as the negative log-likelihood is also used in the original implementation from Shi et al. shi2019variational. It is unclear whether this difference is caused, for example, by numerical instability of the loss function. Nevertheless, this result confirms the importance of a unified implementation of the models at the code level, as many factors might distort the final results.

6 Conclusions and Potential Impact

In this work, we present a tool and a dataset for a systematic evaluation and comparison of multimodal variational autoencoders. The tool enables the user to easily configure the experimental setup by specifying the dataset, encoder and decoder architectures, multimodal integration strategy and the desired training hyperparameters all in one config. The framework can be easily extended for new models, loss functions or the encoder and decoder architectures without the need to restructure the whole environment.

Furthermore, the proposed synthetic bimodal dataset offers an automated evaluation of the cross-generation and joint-generation capabilities of the multimodal VAE models on 5 different scales of complexity. We also offer several automatic visualization modules that further inform on the latent factorisation of the trained models.

We are aware that the current version of our tool does not cover all the recent multimodal VAE models. Although we plan on extending the toolkit with these models and with new encoder/decoder architectures for different types of data, the main purpose is to provide a tool that anybody can use to implement their own ideas in a replicable and reproducible fashion. As for our proposed dataset, we wish to add the option to generate at least one additional modality for a more challenging (trimodal) evaluation. This could be, for example, sound signals of synthesised speech describing the image content. We have also considered potential misuse of the proposed dataset and toolkit. Highly advanced multimodal VAEs could indeed be used for the generation of fake and harmful real-world data. However, both our dataset and the networks provided in the public toolkit are, in their current form, too weak for the creation of such content.


This work was supported by the INAFYM project (CZ.02.1.01/0.0/0.0/16_019/0000766), by CTU Student Grant Agency project SGS21/184/OHK3/3T/37 and by the Czech Science Foundation (GA ČR) under Project no. 21-31000S.


Appendix A Appendix

This appendix provides supplementary material for the GeBiD dataset and the toolkit intended for multimodal VAE training, evaluation and comparison. In Section A.1, we provide additional details on the GeBiD dataset. In Section A.2, we describe the technical information and detailed results for the experiments presented in the paper. Furthermore, we provide an overview of the dataset in terms of documentation, intended use, hosting and licensing (Sections B and C).

A.1 GeBiD dataset statistics

The presented version of the benchmark dataset comprises 5 different levels of difficulty, where each level varies in the number of included features (see Table 9 for their overview). The default size of the dataset is 9000 training samples and 1000 validation samples, although the user can easily generate a larger set of the data. The scripts for calculation of the cross- and joint- coherency accuracies use separate batches of testing data provided in the toolkit.

In all levels, the source of noise in the images is the random position of the shapes. In Levels 1-4, the shapes are located around the centre with a variance of 6 pixels along both the x and y axes. In Level 5, the shapes are shifted into random quadrants, where their position also varies with the same variance as in the previous levels. Also in Level 5, some of the shapes can be partially located outside the image, but never by more than 50 % of the shape surface. In Levels 2-5, where we vary the shape size, the proportion of the small object’s pixels to the overall image is on average 0.12 and the proportion of the large object’s surface is on average 0.5.

You can also see a PCA visualization of the GeBiD Level 5 dataset in Fig. 6.

A.2 Benchmark study results

Here we provide the specific training configuration and hyperparameters used for the experiments on the GeBiD dataset as listed in the paper. We also report the detailed results for hyperparameter grid search in terms of the cross- and joint-generation accuracies.

A.2.1 Training configuration

All our experiments were trained on GeForce GTX 1080 and NVIDIA Tesla V100 GPU cards for 600 epochs; the mean computation time per experiment was 6 hours 22 minutes. We used the Adam optimizer with a fixed learning rate, and all experiments were repeated for 5 seeds (we report standard deviations for the results in the tables). In the hyperparameter grid search, we first varied the batch size (24, 32, 64) for all 5 dataset levels and both models while keeping a fixed latent dimensionality (64-D) and the binary cross-entropy reconstruction loss (5 seeds, altogether 150 experiments). Based on this experiment, we chose batch size 32, as it yielded the highest joint- and cross-coherencies. Then we trained both models with varying latent dimensionality (16, 32, 64 and 128 dimensions) and reconstruction loss terms (binary cross-entropy or negative log-likelihood) for all 5 dataset levels in all combinations (400 experiments altogether). We used the default training dataset size and validation split as reported in the statistics in Table 9. In Tables 3, 4, 5, 6 and 7, we show results for both the MVAE and MMVAE models and the four latent dimensionalities with a fixed batch size (32) and with the binary cross-entropy loss.
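The experiment counts stated above follow directly from the sizes of the two grids. The short sketch below enumerates both stages to check the arithmetic; the variable names are illustrative only and do not correspond to the toolkit's configuration keys.

```python
from itertools import product

levels = [1, 2, 3, 4, 5]
models = ["MVAE", "MMVAE"]
seeds = range(5)

# Stage 1: batch size search with fixed 64-D latent space and BCE loss.
batch_sizes = [24, 32, 64]
stage_one = list(product(levels, models, batch_sizes, seeds))

# Stage 2: latent dimensionality and reconstruction loss with batch size 32.
latent_dims = [16, 32, 64, 128]
losses = ["BCE", "NLL"]
stage_two = list(product(levels, models, latent_dims, losses, seeds))

print(len(stage_one), len(stage_two))
```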

A.2.2 Used architecture

For both evaluated models, we used the standard ELBO loss function with the β parameter fixed to 1. For the MVAE [5] model, we used the sub-sampling approach where the model is trained on all subsets of modalities (i.e. images only, text only and images+text). For the image encoder and decoder, we used 4 fully connected layers with ReLU activations. In the case of the text, we used a Transformer network with 8 layers, 2 attention heads, 1024 hidden features and a dropout of 0.1.
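The sub-sampling scheme and the weighted ELBO described above can be sketched as follows. This is a simplified illustration, assuming that the per-subset losses are simply summed; the helper names `modality_subsets` and `beta_elbo_loss` are hypothetical, and the reconstruction and KL values are placeholders rather than outputs of a real model.

```python
from itertools import combinations

def modality_subsets(modalities):
    """Enumerate all non-empty subsets of the given modalities."""
    subsets = []
    for size in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets

def beta_elbo_loss(reconstruction_loss, kl_divergence, beta=1.0):
    """Negative ELBO with a weight beta on the KL term (beta = 1 here)."""
    return reconstruction_loss + beta * kl_divergence

subsets = modality_subsets(["image", "text"])

# Placeholder (reconstruction loss, KL) values per subset, for illustration.
toy_terms = {("image",): (2.0, 0.5), ("text",): (1.0, 0.25),
             ("image", "text"): (2.5, 0.75)}
total_loss = sum(beta_elbo_loss(*toy_terms[s]) for s in subsets)
```

With two modalities this yields three training passes per batch: images only, text only, and both together.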

A.2.3 Evaluation

After training, we used the script for automated evaluation (provided in our toolkit) to compute the cross- and joint-coherency of the models. For cross-coherency, we generated a 300-sample test dataset using the dataset generator and used first the images, and then text descriptions as input to the model to reconstruct the missing modality. For joint-coherency, we generated 10 traversal samples over each dimension of the latent space (i.e. 640 samples for a 64-D latent space) and fed these latent vectors into the models to reconstruct both text and images.
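The sample count for joint-coherency follows from taking 10 traversal steps along each latent dimension. The sketch below shows one way such traversals could be generated; the traversal range of -3 to 3 and the zero base vector are assumptions made for illustration, and `latent_traversals` is a hypothetical helper rather than the toolkit's function.

```python
def latent_traversals(latent_dim, steps=10, low=-3.0, high=3.0):
    """Vary one latent dimension at a time while keeping the others at zero."""
    values = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    samples = []
    for d in range(latent_dim):
        for v in values:
            z = [0.0] * latent_dim
            z[d] = v
            samples.append(z)
    return samples

samples = latent_traversals(64)
print(len(samples))
```

Each resulting vector would then be decoded into both an image and a text description before checking whether the pair matches.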

For both the cross- and joint-coherencies, we report the following metrics: Strict, Feats and Letters, to provide more information on what the models are capable of. In the first (Strict) metric, we considered a text sample accurate only if all letters in the description were 100 % accurate, i.e. we did not tolerate any noise. For the image outputs, we considered an image correct only if all the attributes for the given difficulty level could be detected using our methods (i.e. template matching for shape detection, the mean over the shape mask for colour detection, the shape centroids for position, and the proportion of the shape mask to the whole image for size detection). For joint-coherency, we considered the generated pair correct only when both the image and the text fulfilled these criteria and were semantically matching.

For the feature-level metrics, we calculated the percentage of correctly reconstructed/generated features (e.g. whole words or image attributes such as shape) and reported the mean percentage of correct features per sample. For the image-text cross-generation accuracy, we also calculated the average percentage of correct letters per output sample.
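For text outputs, the three metrics can be illustrated with the following sketch. It assumes that words and letters are compared position by position and that one word corresponds to one feature; these choices and the function names are assumptions for this example and may differ from the evaluation script provided in the toolkit.

```python
def text_metrics(predicted, target):
    """Return (strict, feats, letters) as fractions for one description."""
    strict = float(predicted == target)
    pred_words, target_words = predicted.split(), target.split()
    # Feature level: fraction of words that match at the same position.
    correct_words = sum(p == t for p, t in zip(pred_words, target_words))
    feats = correct_words / len(target_words)
    # Letter level: fraction of characters that match at the same position.
    width = max(len(predicted), len(target))
    correct_letters = sum(p == t for p, t in zip(predicted, target))
    letters = correct_letters / width
    return strict, feats, letters

def mean_metrics(pairs):
    """Average each metric over all samples and convert to percentages."""
    rows = [text_metrics(p, t) for p, t in pairs]
    return tuple(100.0 * sum(col) / len(rows) for col in zip(*rows))

# Toy (generated, ground truth) description pairs, for illustration only.
pairs = [
    ("small red square", "small red square"),
    ("small red circle", "small red square"),
]
strict, feats, letters = mean_metrics(pairs)
```

The second pair shows why the metrics diverge: one wrong word makes the sample fail the Strict criterion while most of its features and letters are still correct.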

In the following section, we report the mean accuracies for both cross- and joint-coherency - these numbers describe the proportion of the correct outputs to all outputs.

Model (Dim) Txt→Img Strict % Txt→Img Feats % Img→Txt Strict % Img→Txt Feats % Img→Txt Letters % Joint Strict % Joint Feats %
MMVAE (16-D) 8 (3) N/A 0 (0) N/A 16 (4) 11 (4) N/A
MVAE (16-D) 95 (2) N/A 0 (0) N/A 16 (1) 2 (4) N/A
MMVAE (32-D) 9 (6) N/A 0 (0) N/A 13 (4) 17 (9) N/A
MVAE (32-D) 95 (1) N/A 15 (1) N/A 20 (0) 12 (4) N/A
MMVAE (64-D) 18 (6) N/A 0 (1) N/A 16 (5) 21 (7) N/A
MVAE (64-D) 94 (1) N/A 16 (1) N/A 25 (1) 7 (1) N/A
MMVAE (128-D) 9 (5) N/A 12 (8) N/A 23 (4) 0 (0) N/A
MVAE (128-D) 94 (3) N/A 16 (1) N/A 26 (0) 0 (0) N/A
Table 3: Level 1 comparison of accuracies for the MMVAE and MVAE models trained on our GeBiD dataset. Strict refers to percentage of completely correct samples (sample pairs in joint generation), Feats shows the average percentage of correct features (Level 1 has only 1 feature and Feats and Strict are thus the same) and Letters shows the mean percentage of correctly reconstructed letters. Standard deviations over 5 seeds are in brackets. For each model, we report the latent space dimensionality (Dim).
Model (Dim) Txt→Img Strict % Txt→Img Feats % Img→Txt Strict % Img→Txt Feats % Img→Txt Letters % Joint Strict % Joint Feats %
MMVAE (16-D) 6 (3) 33 (1) 2 (3) 24 (8) 37 (5) 1 (1) 22 (4)
MVAE (16-D) 69 (1) 84 (0) 4 (3) 27 (2) 40 (2) 7 (5) 45 (6)
MMVAE (32-D) 8 (3) 35 (4) 0 (0) 16 (9) 33 (7) 4 (3) 25 (4)
MVAE (32-D) 70 (0) 84 (0) 7 (0) 27 (2) 41 (3) 6 (1) 49 (0)
MMVAE (64-D) 6 (3) 31 (5) 5 (2) 25 (6) 40 (3) 4 (2) 23 (2)
MVAE (64-D) 69 (2) 84 (1) 4 (4) 24 (7) 41 (5) 3 (1) 41 (7)
MMVAE (128-D) 6 (2) 28 (6) 6 (2) 32 (2) 43 (1) 3 (1) 20 (3)
MVAE (128-D) 67 (2) 83 (2) 5 (2) 27 (3) 41 (3) 2 (1) 46 (3)
Table 4: Level 2 comparison of accuracies for the MMVAE and MVAE models trained on our GeBiD dataset. Strict refers to the percentage of completely correct samples (sample pairs in joint generation), Feats shows the average percentage of correct features (Level 2 has 2 features) and Letters shows the mean percentage of correctly reconstructed letters. Standard deviations over 5 seeds are in brackets. For each model, we report the latent space dimensionality (Dim).
Model (Dim) Txt→Img Strict % Txt→Img Feats % Img→Txt Strict % Img→Txt Feats % Img→Txt Letters % Joint Strict % Joint Feats %
MMVAE (16-D) 0 (0) 21 (1) 0 (0) 11 (7) 25 (5) 0 (0) 14 (4)
MVAE (16-D) 52 (2) 74 (2) 0 (0) 16 (2) 28 (1) 0 (1) 15 (5)
MMVAE (32-D) 0 (0) 20 (2) 0 (0) 10 (5) 24 (6) 0 (0) 13 (8)
MVAE (32-D) 48 (1) 72 (1) 0 (0) 19 (2) 28 (1) 1 (1) 26 (7)
MMVAE (64-D) 0 (0) 19 (2) 0 (0) 14 (2) 27 (2) 0 (0) 14 (1)
MVAE (64-D) 44 (2) 70 (1) 1 (1) 20 (1) 27 (1) 0 (0) 31 (1)
MMVAE (128-D) 0 (0) 22 (1) 0 (0) 12 (4) 26 (3) 0 (0) 12 (2)
MVAE (128-D) 49 (2) 72 (1) 0 (1) 20 (1) 27 (0) 0 (0) 31 (1)
Table 5: Level 3 comparison of accuracies for the MMVAE and MVAE models trained on our GeBiD dataset. Strict refers to the percentage of completely correct samples (sample pairs in joint generation), Feats shows the average percentage of correct features (Level 3 has 3 features) and Letters shows the mean percentage of correctly reconstructed letters. Standard deviations over 5 seeds are in brackets. For each model, we report the latent space dimensionality (Dim).
Model (Dim) Txt→Img Strict % Txt→Img Feats % Img→Txt Strict % Img→Txt Feats % Img→Txt Letters % Joint Strict % Joint Feats %
MMVAE (16-D) 0 (0) 17 (2) 0 (0) 0 (0) 14 (4) 0 (0) 0 (0)
MVAE (16-D) 0 (0) 27 (13) 0 (0) 9 (4) 26 (1) 0 (0) 0 (0)
MMVAE (32-D) 0 (0) 21 (6) 0 (0) 7 (5) 18 (2) 0 (0) 3 (3)
MVAE (32-D) 1 (2) 40 (6) 0 (0) 18 (4) 27 (1) 0 (0) 1 (1)
MMVAE (64-D) 0 (0) 19 (6) 0 (0) 8 (3) 20 (1) 0 (0) 3 (1)
MVAE (64-D) 1 (1) 30 (17) 0 (1) 23 (2) 27 (1) 0 (0) 2 (1)
MMVAE (128-D) 0 (0) 23 (8) 0 (0) 8 (7) 20 (1) 0 (0) 7 (10)
MVAE (128-D) 1 (1) 38 (5) 0 (0) 24 (1) 26 (2) 0 (0) 0 (0)
Table 6: Level 4 comparison of accuracies for the MMVAE and MVAE models trained on our GeBiD dataset. Strict refers to the percentage of completely correct samples (sample pairs in joint generation), Feats shows the average percentage of correct features (Level 4 has only 4 features) and Letters shows the mean percentage of correctly reconstructed letters. Standard deviations over 5 seeds are in brackets. For each model, we report the latent space dimensionality (Dim).
Model (Dim) Txt→Img Strict % Txt→Img Feats % Img→Txt Strict % Img→Txt Feats % Img→Txt Letters % Joint Strict % Joint Feats %
MMVAE (16-D) 0 (0) 22 (3) 0 (0) 7 (3) 18 (3) 0 (0) 3 (1)
MVAE (16-D) 1 (0) 34 (1) 0 (0) 6 (2) 18 (1) 0 (0) 1 (1)
MMVAE (32-D) 0 (0) 19 (4) 0 (0) 9 (1) 17 (1) 0 (0) 5 (3)
MVAE (32-D) 1 (0) 35 (1) 0 (0) 8 (0) 20 (1) 0 (0) 2 (0)
MMVAE (64-D) 0 (0) 16 (2) 0 (0) 15 (2) 18 (1) 0 (0) 9 (2)
MVAE (64-D) 1 (1) 33 (1) 0 (0) 8 (1) 20 (1) 0 (0) 7 (5)
MMVAE (128-D) 0 (0) 20 (3) 0 (0) 22 (2) 18 (1) 0 (0) 11 (1)
MVAE (128-D) 1 (1) 33 (2) 0 (0) 9 (1) 21 (1) 0 (0) 2 (1)
Table 7: Level 5 comparison of accuracies for the MMVAE and MVAE models trained on our GeBiD dataset. Strict refers to the percentage of completely correct samples (sample pairs in joint generation), Feats shows the average percentage of correct features (Level 5 has 5 features) and Letters shows the mean percentage of correctly reconstructed letters. Standard deviations over 5 seeds are in brackets. For each model, we report the latent space dimensionality (Dim).
Figure 4: Results for the MMVAE model trained on the CUB dataset after 700 training epochs. We show the original captions and images as well as MMVAE image reconstructions and images cross-generated from the text. Unlike the original paper, we trained the model on raw images (instead of pre-processed features). These results were chosen after a hyperparameter search, with 64-D latent space and batch size 64.
Figure 5: Results for the MVAE and MMVAE models trained on the MNIST-SVHN dataset using our toolkit. For MMVAE, we used the IWAE objective as proposed by the authors; MVAE was trained with ELBO. We used the encoder and decoder networks from the original implementations. The top figures are traversals for each modality; below we show cross-generated samples. The bottom figures are T-SNE visualizations of the latent space. Please note that for MVAE we show samples from the single joint posterior, while for MMVAE we show samples for both modality-specific distributions.
Figure 6: PCA visualizations for the GeBiD dataset Level 5. We show a separate figure for each of the 5 features for both the text (left) and image (right) modalities.
Figure 7: T-SNE visualizations for the MVAE model’s (128-D) joint latent space trained on GeBiD level 4. We show the latent space for each of the 4 features (shape, colour, background and size) individually.
Figure 8: T-SNE visualizations for the MMVAE model’s (128-D) unimodal latent spaces trained on GeBiD level 4. We show the latent space for each of the 4 features (shape, colour, background and size) individually.

A.2.4 Results

In Tables 3, 4, 5, 6 and 7, we show the comparison of the MVAE and MMVAE models on the 5 difficulty levels of the GeBiD dataset. Here we varied the latent dimensionality (16-D to 128-D) with a fixed batch size of 32 and the binary cross-entropy reconstruction loss, which showed better results than the negative log-likelihood. The values are the mean cross-generation and joint-generation accuracies over 5 seeds, with the standard deviations listed in brackets. According to the Strict metrics (with zero noise tolerance, see Sec. A.2.3), both models failed in both tasks at Levels 4 and 5 (MMVAE failed already at Level 3 in both tasks). The Feats and Letters accuracies decrease significantly across levels as the complexity increases, and the MVAE model consistently outperforms MMVAE. You can see the T-SNE visualizations for both models trained on Level 4 in Figs. 7 and 8.

To compare the effects of the reconstruction loss terms on the resulting reconstruction quality, we calculated joint log-likelihoods on the test data given both modalities. The results are shown in Table 8 - both models were trained on level 5 of the GeBiD dataset with batch size 32 and 64-D latent space. The BCE reconstruction term provided higher log-likelihoods in both models compared to NLL. Reconstruction samples are shown in the paper in Fig. 3.

A.3 Used assets

In our toolkit, we adopt parts of the original (publicly available) model implementations from Shi et al. [7], Wu and Goodman [6], Sutter et al. [7] and Lee and Pavlovic [19]. We cite these works wherever applied in our code. We also confirm that we used all the existing assets in accordance with their license.

BCE 3.9 (0.2) 10.5 (0.5)
NLL 0.7 (0.0) 1.6 (0.0)
Table 8: Test joint log-likelihoods according to the used reconstruction loss terms: binary cross-entropy (BCE) or negative log-likelihood (NLL). We compare the MVAE and MMVAE models and show standard deviations calculated over five seeds (in brackets).
Level Train Samples Validation Samples Shapes Sizes Colours Backgrounds Positions
1 9000 1000 6 1 1 1 1
2 9000 1000 6 2 1 1 1
3 9000 1000 6 2 12 1 1
4 9000 1000 6 2 12 2 1
5 9000 1000 6 2 12 2 4
Table 9: Statistics of the GeBiD benchmark dataset. We show the number of train/validation samples and the number of various shapes, colours, backgrounds or object poses used in each difficulty level. The text captions only describe features that vary (e.g. in level 1, the text descriptions only include the shape name).

Appendix B Dataset documentation and intended uses

Dataset documentation.

The dataset documentation, code for its generation, as well as a link to its download, are all available at

Intended uses

The synthetic GeBiD dataset was created for a systematic evaluation and comparison of multimodal variational autoencoders. The users can choose among 5 different difficulty levels to find the minimum functional scenario for their model. Anyone can configure and generate the dataset according to their needs, i.e. with an arbitrary number of training/testing samples, with any subset of the provided attributes, or in a balanced or imbalanced scenario (in terms of complexity of the two modalities). We also provide automated scripts that enable computing the cross- and joint- accuracies after training. In general, we hope that scientists developing new multimodal VAEs will be able to use this dataset to demonstrate the models’ capabilities.

Appendix C Hosting, licensing, and maintenance plan

Dataset hosting

Although the users are recommended to generate the dataset on their own (as it is faster), we host a downloadable version of our GeBiD dataset on servers that are managed by our dedicated IT department. Both the dataset and the source code are publicly available via


The authors will provide important bug fixes for the public toolkit (part of which is also code for the GeBiD dataset) presented on GitHub. They will also keep extending the toolkit for new functionalities and models.


The provided dataset and supplementary code are copyrighted by us and published under the CC BY-NC-SA 4.0 license.


The contact e-mail address of the person managing the dataset and the toolkit: