Improving Bi-directional Generation between Different Modalities with Variational Autoencoders

01/26/2018 ∙ by Masahiro Suzuki, et al. ∙ The University of Tokyo

We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. A major approach to achieve this objective is to train a model that integrates all the information of different modalities into a joint representation and then to generate one modality from the corresponding other modality via this joint representation. We simply applied this approach to variational autoencoders (VAEs), which we call a joint multimodal variational autoencoder (JMVAE). However, we found that when this model attempts to generate a large dimensional modality missing at the input, the joint representation collapses and this modality cannot be generated successfully. Furthermore, we confirmed that this difficulty cannot be resolved even using a known solution. Therefore, in this study, we propose two models to prevent this difficulty: JMVAE-kl and JMVAE-h. Results of our experiments demonstrate that these methods can prevent the difficulty above and that they generate modalities bi-directionally with equal or higher likelihood than conventional VAE methods, which generate in only one direction. Moreover, we confirm that these methods can obtain the joint representation appropriately, so that they can generate various variations of modality by moving over the joint representation or changing the value of another modality.







Information in our world is represented through various modalities. Although images are represented by pixel information, these can also be described with text or tag information. People often exchange such information bi-directionally. For instance, we can not only imagine what a “young female with a smile who does not wear glasses” looks like, but we can also add this caption to a corresponding photograph. Our objective is to design a model that can exchange different modalities bi-directionally like people. We call this ability bi-directional generation.

Figure 1: Example of bi-directional generation between different modalities via a joint representation with JMVAE-kl, one of our proposed models.

Each modality typically has a different kind of dimension and structure, e.g., images (real-valued and dense) and text (discrete and sparse). Therefore, the relations among modalities might have high nonlinearity. To discover such relations, deep neural network architectures have been used widely [Ngiam et al., 2011, Srivastava and Salakhutdinov, 2012].

A major approach to realizing bi-directional generation between modalities is to train a network architecture that shares the top hidden layers of modality-specific networks [Ngiam et al., 2011]. The salient advantages of this approach are that the model with multiple modalities can be trained end-to-end, and that the trained model can extract a joint representation, which is a more compact representation integrating all modalities. The model can easily generate one modality from another modality via this representation if it can obtain the joint representation properly. Another simple approach might be to create networks for each direction and to train them independently. Several models have been proposed to generate modalities in one direction [Kingma et al., 2014, Sohn et al., 2015, Pandey and Dukkipati, 2016]. However, when generating modalities bi-directionally, the number of required networks is expected to increase exponentially as the number of modalities increases. Moreover, the hidden layers of each network would not be synchronized during training. Therefore, this simple approach is not an efficient way to realize our objective.

To generate different modalities, it is important to model the joint representation as probabilistic latent variables. This is because, as described above, different modalities have different structures and dimensions, so their relation should not be deterministic. The best-known probabilistic approach is to use deep Boltzmann machines (DBMs) [Srivastava and Salakhutdinov, 2012, Sohn et al., 2014]. However, DBMs are computationally difficult to train, especially on high-dimensional data, because training relies on MCMC.

Variational autoencoders (VAEs) [Kingma and Welling, 2013, Rezende et al., 2014], a class of deep generative models, have the advantage that they can handle higher-dimensional datasets than DBMs because they can be trained with back-propagation. Therefore, we extend VAEs to a model that is able to generate modalities bi-directionally. This extension is extremely simple: as with previous neural network and DBM approaches, the latent variables of the generative models corresponding to each modality are shared. We call this model a joint multimodal variational autoencoder (JMVAE). However, results show that if the input of the high-dimensional modality we want to generate is missing, the latent variable of JMVAE, i.e., the joint representation, collapses and this modality cannot be generated successfully. Although a method for addressing this difficulty has been proposed [Rezende et al., 2014], we demonstrate that this method cannot resolve the difficulty when the missing modality has higher dimensions than the other one.

Therefore, we propose two new models to address the difficulty presented above: JMVAE-kl and JMVAE-h. JMVAE-kl prepares, apart from the encoder of JMVAE, a new single-input encoder for each modality and reduces the divergence between these encoders and the JMVAE encoder. By contrast, JMVAE-h makes its latent variable a stochastic hierarchical structure to prevent its collapse. Figure 1 shows that JMVAE-kl can generate modalities bi-directionally with different dimensions such as images and attributes.

The main contributions of this paper are described below.

  • We present a simple extension of VAEs to generate modalities bi-directionally, which we call JMVAE. However, JMVAE cannot generate a high-dimensional modality well if the input of this modality is missing, and a known remedy cannot resolve this issue.

  • We propose two models, JMVAE-kl and JMVAE-h, which prevent the latent variable from collapsing when a high-dimensional modality is missing. We confirm experimentally that these models resolve this issue.

  • We demonstrate that these methods can generate modalities similarly or more properly than conventional VAEs that generate in only one direction.

  • We demonstrate that they can appropriately obtain the joint representation containing different modality information, which shows that they can generate various variations of modality by moving over these latent variables or changing the value of another modality.

Related work

A common approach to dealing with multiple modalities in deep neural networks is to share the top hidden layers of modality-specific networks. Ngiam et al. [2011] propose this approach with deep autoencoders and show that it can extract better representations than single-modality settings can. Srivastava and Salakhutdinov [2012] adapt this idea to deep Boltzmann machines (DBMs) [Salakhutdinov and Hinton, 2009], which are generative models with undirected connections based on maximum joint likelihood learning of all modalities. Therefore, this model can generate modalities bi-directionally. Sohn et al. [2014] improve this model to exchange multiple modalities effectively, based on minimizing the variation of information.

Recently, VAEs [Kingma and Welling, 2013, Rezende et al., 2014] have been used to train such high-dimensional modalities. Several studies [Kingma et al., 2014, Sohn et al., 2015] have examined the use of conditional VAEs (CVAEs), which maximize a conditional log-likelihood by variational methods. In fact, many studies use CVAEs to train various combinations of modalities such as handwritten digits and labels [Kingma et al., 2014, Sohn et al., 2015], object images and degrees of rotation [Kulkarni et al., 2015], facial images and attributes [Larsen et al., 2015, Yan et al., 2015], and natural images and captions [Mansimov et al., 2015]. The main features of CVAEs are that the relation between modalities is one-way and that the latent variable does not include information of the conditioned modality, which is an unsuitable characteristic for our objective. (According to [Louizos et al., 2015], this independence might not be satisfied strictly because the encoder in CVAEs still has dependence.) Pandey and Dukkipati [2016] propose the use of a conditional multimodal autoencoder (CMMA), which also maximizes the conditional log-likelihood but which connects the latent variable directly to the conditional variable, i.e., these variables are not independent. However, CMMA still considers that these modalities are generated in a single fixed direction.

Another well-known class of deep generative models is generative adversarial nets (GANs) [Goodfellow et al., 2014]. Even with GANs, multiple modalities are often handled by generation in one direction, as in conditional GANs [Mirza and Osindero, 2014].

Recently, GANs that can generate images bi-directionally from images have been proposed [Liu and Tuzel, 2016, Liu et al., 2017, Zhu et al., 2017]. These models train perfect pixel-by-pixel correspondence between modalities of the same dimension and intentionally ignore probabilistic factors. However, no complete correspondence exists between modalities of different kinds that we examine specifically in this study. Our methods can train the probabilistic joint representation integrating all modality information, so they can obtain a probabilistic relation between modalities.

Figure 2: Inference (or encoder, left figures) and generative (or decoder, right figures) distributions for (a) JMVAE (b) JMVAE-h, and (c) JMVAE-kl. These figures reflect how and are modeled on each approach. Circles represent stochastic variables. Diamonds represent deterministic variables.

VAEs with multiple modalities

This section first briefly presents the formulation of VAEs. Subsequently, a simple extension of VAEs to multiple modalities is introduced.

Variational autoencoders

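Given observation variables $x$ and corresponding latent variables $z$, their generating processes are definable as $z \sim p(z) = \mathcal{N}(0, I)$ and $x \sim p_\theta(x|z)$, where $\theta$ is a model parameter of $p$. The objective of VAEs is maximization of the marginal distribution $p(x) = \int p_\theta(x|z)\, p(z)\, dz$. Because this distribution is intractable, we instead train the model to maximize the following lower bound.

$$\log p(x) \geq -D_{KL}(q_\phi(z|x) \,\|\, p(z)) + E_{q_\phi(z|x)}[\log p_\theta(x|z)] \equiv \mathcal{L}_{VAE}(x), \quad (1)$$

where $q_\phi(z|x)$ is an approximate distribution of the posterior $p(z|x)$ and $\phi$ is a model parameter of $q$. We designate $q_\phi(z|x)$ as the encoder and $p_\theta(x|z)$ as the decoder. Moreover, in Eq. 1, the first term represents a regularization. The second one represents a negative reconstruction error.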

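To optimize the lower bound with respect to parameters $\theta, \phi$, we estimate the gradients of Eq. 1 using stochastic gradient variational Bayes (SGVB). If we consider $q_\phi(z|x)$ as a Gaussian distribution $\mathcal{N}(z; \mu, \mathrm{diag}(\sigma^2))$, where $\phi = \{\mu, \sigma^2\}$, then we can reparameterize $z \sim q_\phi(z|x)$ to $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Therefore, we can estimate the gradients of the negative reconstruction term in Eq. 1 with respect to $\theta$ and $\phi$ as $\nabla_{\theta,\phi} E_{q_\phi(z|x)}[\log p_\theta(x|z)] = E_{\mathcal{N}(\epsilon; 0, I)}[\nabla_{\theta,\phi} \log p_\theta(x|\mu + \sigma \odot \epsilon)]$. The gradients of the regularization term are solvable analytically. Therefore, we can optimize Eq. 1 using standard stochastic optimization methods.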
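The reparameterized estimate of Eq. 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the encoder outputs (`mu`, `sigma`) are given directly, the decoder is Bernoulli, and `toy_decoder` is a hypothetical stand-in for a neural network.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), the regularization term."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x | z) for binary x under independent Bernoulli outputs."""
    return sum(xi * math.log(p + eps) + (1 - xi) * math.log(1.0 - p + eps)
               for xi, p in zip(x, probs))

def elbo_estimate(x, mu, sigma, decoder, rng, n_samples=1):
    """Monte Carlo estimate of the lower bound: -KL + E_q[log p(x | z)]."""
    recon = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, sigma, rng)
        recon += bernoulli_log_likelihood(x, decoder(z))
    return -kl_to_standard_normal(mu, sigma) + recon / n_samples

def toy_decoder(z):
    """Hypothetical decoder: sigmoid of a fixed linear map (assumption)."""
    weights = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
    return [1.0 / (1.0 + math.exp(-sum(w * zi for w, zi in zip(row, z))))
            for row in weights]

rng = random.Random(0)
value = elbo_estimate([1, 0, 1], mu=[0.3, -0.2], sigma=[0.8, 1.1],
                      decoder=toy_decoder, rng=rng, n_samples=100)
```

Because the noise is drawn independently of `mu` and `sigma`, an automatic differentiation framework could propagate gradients through `reparameterize`, which is the point of the reparameterization.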

Joint Multimodal Variational Autoencoders

Next, we examine the dataset $(X, W) = \{(x_i, w_i)\}_{i=1}^{N}$, where two modalities $x$ and $w$ have dimensions and structures of different kinds (in our experiments, these depend on the dataset; see the settings section). We assume that these generative models are conditionally independent given the same latent variable $z$, i.e., the joint representation. This means that the latent variables of the generative models corresponding to each modality are shared. Therefore, their generative process becomes $z \sim p(z)$ and $x, w \sim p(x, w|z) = p_{\theta_x}(x|z)\, p_{\theta_w}(w|z)$, where $\theta_x$ and $\theta_w$ respectively represent the model parameters of each independent $p$.

Considering an approximate posterior distribution as $q_\phi(z|x, w)$, we can estimate a lower bound of the log-likelihood $\log p(x, w)$, as shown below.

$$\mathcal{L}_{JM}(x, w) = -D_{KL}(q_\phi(z|x,w) \,\|\, p(z)) + E_{q_\phi(z|x,w)}[\log p_{\theta_x}(x|z)] + E_{q_\phi(z|x,w)}[\log p_{\theta_w}(w|z)] \quad (2)$$

Eq. 2 has two negative reconstruction terms, one for each modality. As with VAEs, we designate $q_\phi(z|x,w)$ as the encoder and both $p_{\theta_x}(x|z)$ and $p_{\theta_w}(w|z)$ as decoders. This model is a simple extension of VAEs to multiple modalities, and we call it a joint multimodal variational autoencoder (JMVAE). Because each modality has a different feature representation, we set different networks for each decoder.
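The structure of Eq. 2, one regularization term plus one reconstruction term per modality, can be illustrated as follows. This is only a sketch: the encoder and the two decoders (`toy_encoder`, `toy_decoder_x`, `toy_decoder_w`) are hypothetical stand-ins for neural networks, with a Bernoulli decoder for one modality and a categorical decoder for the other.

```python
import math
import random

def gaussian_kl_to_prior(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def log_bernoulli(x, probs, eps=1e-7):
    return sum(xi * math.log(p + eps) + (1 - xi) * math.log(1.0 - p + eps)
               for xi, p in zip(x, probs))

def log_categorical(w_onehot, probs, eps=1e-7):
    return sum(wi * math.log(p + eps) for wi, p in zip(w_onehot, probs))

def jmvae_bound(x, w, encoder, decoder_x, decoder_w, rng, n_samples=10):
    """Joint lower bound: -KL + E_q[log p(x|z)] + E_q[log p(w|z)]."""
    mu, sigma = encoder(x, w)          # the encoder sees both modalities
    recon_x = recon_w = 0.0
    for _ in range(n_samples):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        recon_x += log_bernoulli(x, decoder_x(z))    # shared z, decoder for x
        recon_w += log_categorical(w, decoder_w(z))  # shared z, decoder for w
    return -gaussian_kl_to_prior(mu, sigma) + (recon_x + recon_w) / n_samples

# Hypothetical toy networks (assumptions for illustration only).
def toy_encoder(x, w):
    return [sum(x) / len(x) - 0.5, float(w.index(1)) - 0.5], [1.0, 1.0]

def toy_decoder_x(z):
    return [1.0 / (1.0 + math.exp(-(z[0] + 0.1 * i))) for i in range(3)]

def toy_decoder_w(z):
    logits = [z[1], -z[1]]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return [e / sum(exps) for e in exps]

bound = jmvae_bound([1, 0, 1], [0, 1], toy_encoder, toy_decoder_x,
                    toy_decoder_w, random.Random(0))
```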

Complement of a missing modality

After training JMVAE, we can extract a joint latent representation by sampling from the encoder at testing time. Our objective is to exchange modalities bi-directionally, e.g., images to text and vice versa. In this setting, modalities that we want to sample are expected to be missing, so that inputs of such modalities are set to zero or random noise (Figure 2(a)). In discriminative multimodal settings, this is a common means of estimating modalities from other modalities [Ngiam et al., 2011], but it is difficult to handle this missing modality properly.

In VAEs, Rezende et al. [2014] propose a sophisticated approach of iterative sampling by a Markov chain with a transition kernel to complement the missing value of the input. In the case of JMVAE, the transition kernel when $x$ is missing is $\mathcal{T}(\hat{x}'|\hat{x}, w) = \int p_{\theta_x}(\hat{x}'|z)\, q_\phi(z|\hat{x}, w)\, dz$.

We can now estimate a missing modality by first setting the initial value of $\hat{x}$ to random noise. Then we conduct iterative sampling following the above kernel. As the number of iterations increases, we can estimate complemented values better. In this paper, we call this the iterative sampling method.
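The iterative sampling method amounts to alternating between the encoder and the decoder of the missing modality. The sketch below assumes the missing modality is filled with an initial guess, and uses hypothetical one-dimensional `toy_encode` and `toy_decode_x` functions in place of trained networks.

```python
import random

def iterative_sampling(w, encode, decode_x, init_x, n_iters, rng):
    """Markov chain: z ~ q(z | x_hat, w), then x_hat ~ p(x | z), repeated."""
    x_hat = init_x
    for _ in range(n_iters):
        z = encode(x_hat, w, rng)   # infer the latent from the current guess and w
        x_hat = decode_x(z, rng)    # resample the missing modality from z
    return x_hat

# Hypothetical 1-D stand-ins for trained networks (assumptions).
def toy_encode(x_hat, w, rng, noise=0.1):
    return 0.5 * x_hat + w + noise * rng.gauss(0.0, 1.0)

def toy_decode_x(z, rng, noise=0.1):
    return z + noise * rng.gauss(0.0, 1.0)

rng = random.Random(0)
init = rng.gauss(0.0, 1.0)  # random-noise initial value for the missing input
sample = iterative_sampling(1.0, toy_encode, toy_decode_x, init, 20, rng)
```

In this toy chain the samples settle around 2w regardless of the initial value. With real networks there is no such guarantee, which is the failure case this section goes on to describe.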

However, if missing modalities are high-dimensional and complicated such as natural images compared to other modalities, then the inferred latent variable might collapse and generated samples might become incomplete. In the experiment, we will show that this difficulty cannot be prevented even using the iterative sampling method.

To resolve this issue, we propose two new models: JMVAE-h and JMVAE-kl.


Recently, some studies have extended the latent variables of VAEs to a stochastic hierarchical structure to improve the expressiveness and likelihood of models [Burda et al., 2015, Sønderby et al., 2016, Gulrajani et al., 2016]. Such a hierarchical structure of latent variables becomes robust against a missing input, so it might contribute to preventing them from collapse.

We can extend JMVAE to a hierarchical structure of latent variables. Let the latent variable be a stochastic hierarchy of $L$ layers $z_1, \ldots, z_L$ (note that the stochastic layers differ from the deterministic layers of neural networks). The joint distribution of JMVAE then becomes

$$p(x, w, z_1, \ldots, z_L) = p(x|z_1)\, p(w|z_1)\, p(z_1|z_2) \cdots p(z_{L-1}|z_L)\, p(z_L),$$

where all conditional distributions are Gaussian and are parameterized by deep neural networks.

Various means are available to decompose the approximate distribution $q(z_1, \ldots, z_L|x, w)$. We follow [Gulrajani et al., 2016], which is $q(z_1, \ldots, z_L|x, w) = q(z_1|x, w) \cdots q(z_L|x, w)$, where all conditional distributions are Gaussian. In this study, as with [Gulrajani et al., 2016], the structure from the input to the final stochastic layer consists of deterministic mappings (each mapping is parameterized by a deep neural network), and the probabilistic output of each stochastic layer is obtained from a deterministic output (see Figure 2(b)).

Therefore, the lower bound of JMVAE with a stochastic hierarchical structure becomes

$$\mathcal{L}_{JMh}(x, w) = E_{q(z_1, \ldots, z_L|x, w)}\left[\log \frac{p(x, w, z_1, \ldots, z_L)}{q(z_1, \ldots, z_L|x, w)}\right]. \quad (3)$$

We call this model JMVAE-h, and demonstrate through experiments that it can prevent the missing modality difficulty. In our experiments, we set . Figure 2(b) shows the flow of generating $x$ from $w$ with JMVAE-h.


In the stochastic hierarchical approach described above, the iterative sampling method is still important for generating missing modalities. However, it takes time to generate high-dimensional samples. Therefore, we propose JMVAE-kl as a model to generate appropriate missing samples without using the iterative sampling method.

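Assume that we have encoders with a single input, $q_{\phi_x}(z|x)$ and $q_{\phi_w}(z|w)$, where $\phi_x$ and $\phi_w$ are their parameters. We would like to train them by reducing the divergence between these encoders and the joint encoder $q_\phi(z|x, w)$ (see Figure 2(c)). Therefore, the objective function of JMVAE-kl becomes

$$\mathcal{L}_{JMkl}(x, w) = \mathcal{L}_{JM}(x, w) - \left[ D_{KL}(q_\phi(z|x,w) \,\|\, q_{\phi_x}(z|x)) + D_{KL}(q_\phi(z|x,w) \,\|\, q_{\phi_w}(z|w)) \right], \quad (4)$$

where $\mathcal{L}_{JM}$ denotes the lower bound of JMVAE in Eq. 2.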


Actually, JMVAE-h and JMVAE-kl have both benefits and shortcomings. JMVAE-kl must prepare encoders for each modality apart from the original encoder, but JMVAE-h requires no preparation of any additional encoder. Conversely, although JMVAE-h must apply the iterative sampling method to generate a missing modality, JMVAE-kl does not need to do so. Therefore, JMVAE-h is effective when there are three or more modalities and the dimension of the missing modality is not so large; JMVAE-kl is effective when there are modalities of only two kinds and the missing modality is high-dimensional.
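When all encoders are diagonal Gaussians, the divergence term that JMVAE-kl reduces has a closed form. The sketch below is illustrative only. It assumes that the divergence is taken from the joint encoder to each single-input encoder and that the two terms are unweighted; the function names and the example numbers are hypothetical.

```python
import math

def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ) in closed form."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += (math.log(sq / sp)
                  + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq)
                  - 0.5)
    return total

def jmvae_kl_objective(joint_bound, joint, enc_x, enc_w):
    """Joint lower bound minus the divergences to the single-input encoders.

    joint, enc_x, enc_w are (mu, sigma) pairs produced by q(z|x,w), q(z|x),
    and q(z|w) for the same training pair (x, w).
    """
    mu, sigma = joint
    penalty = (kl_diag_gaussians(mu, sigma, *enc_x)
               + kl_diag_gaussians(mu, sigma, *enc_w))
    return joint_bound - penalty

# Hypothetical encoder outputs for one data point.
objective = jmvae_kl_objective(
    joint_bound=-10.0,
    joint=([0.0], [1.0]),
    enc_x=([0.0], [1.0]),   # matches the joint encoder: zero penalty
    enc_w=([1.0], [1.0]),   # shifted mean: penalty of 0.5
)
```

Because the single-input encoders are trained directly, a missing modality can be handled at test time by sampling from the corresponding single-input encoder, with no Markov chain.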

Figure 3: Visualizations of 2-D latent representation. The number of iterative sampling on JMVAE and JMVAE-h is 10.

MNIST
Model       Label → Image    Image → Label
JMVAE       -977.2           -0.2361
JMVAE-kl    -422.4           -0.2628
JMVAE-h     -552.2           -3.589
CVAE        -448.8           -5.293
CMMA        -451.1           -0.2971

CelebA
Model       Attributes → Image    Image → Attributes
JMVAE       -48763                -43.97
JMVAE-kl    -6852                 -44.13
JMVAE-h     -7355                 -47.61
CVAE        -6825                 -44.28
CMMA        -6920                 -44.57

Table 1: Evaluation of the conditional log-likelihood. Models are trained and tested on MNIST (upper) and CelebA (lower).

Figure 4: Images ($x$) generation from labels ($w$) on the MNIST dataset. Each column corresponds to each element of the space of $w$, i.e., labels from 0 to 9. Going to the bottom of the row, the number of iterative sampling for generating $x$ increases: (a) JMVAE, (b) JMVAE-h, and (c) JMVAE-kl.

Figure 5: Log-likelihood values for JMVAE, JMVAE-h, and JMVAE-kl model with different numbers of iterative sampling on the MNIST dataset.


In this section, we confirm the following three points through experimentation: (1) The missing modality difficulty does occur in JMVAE, and our proposed methods can prevent this issue. (2) Our proposed models can generate modalities bi-directionally with a likelihood equivalent to (or higher than) that of models that generate a modality in only one direction. (3) Our models can appropriately obtain the joint representation, which integrates information of different modalities.


In this paper, we used two datasets: MNIST and CelebA [Liu et al., 2015].

Originally, MNIST was not a dataset for multimodal settings. For this work, we used this dataset as a toy problem for various tests to verify our model. We regard 784-dimensional handwritten images and the corresponding 10 digit labels as two modalities. We used 50,000 images as a training set and the remaining 10,000 as a test set.

CelebA consists of 202,599 color facial images and 40 corresponding binary attributes such as male, eyeglasses, and mustache. For this work, we regard them as two modalities. This dataset is challenging because the two modalities have dimensions and structures of completely different kinds. Beforehand, we cropped the images to squares, resized them to 64 × 64, and normalized them. From the dataset, we chose 191,899 images that are identifiable as facial images using OpenCV and used them for our experiment. We used 90% of these images as a training set and the remaining 10% as a test set.


For MNIST, we considered images as $x$ and corresponding labels as $w$. We set $p(x|z)$ as Bernoulli and $p(w|z)$ as a categorical distribution. We used warm-up [Bowman et al., 2015, Sønderby et al., 2016], which first forces training only of the term of the negative reconstruction error and which then gradually increases the effect of the regularization term to prevent local minima during early training. We increased this term linearly during the first epochs as with Sønderby et al. [2016]. Then we set and trained for epochs on MNIST. Moreover, as described by Burda et al. [2015] and Sønderby et al. [2016], we resampled the binarized training values randomly from MNIST images for each epoch to prevent over-fitting.
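The two training devices mentioned above, linear warm-up of the regularization term and per-epoch resampling of binarized pixels, can be sketched as follows. The warm-up length is not recoverable from the text, so `warmup_epochs` is a hypothetical parameter, and the function names are illustrative.

```python
import random

def warmup_weight(epoch, warmup_epochs):
    """Weight on the regularization term: rises linearly from 0 to 1."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)

def warmed_up_bound(recon_term, kl_term, epoch, warmup_epochs):
    """Lower bound with the KL term scaled down during early training."""
    return recon_term - warmup_weight(epoch, warmup_epochs) * kl_term

def dynamic_binarize(pixels, rng):
    """Resample a binary image: each pixel is 1 with probability equal to its intensity."""
    return [1 if rng.random() < p else 0 for p in pixels]

rng = random.Random(0)
intensities = [0.0, 0.25, 0.9, 1.0]   # hypothetical pixel intensities in [0, 1]
for epoch in range(3):
    binary_image = dynamic_binarize(intensities, rng)  # fresh sample every epoch
    bound = warmed_up_bound(recon_term=-100.0, kl_term=20.0,
                            epoch=epoch, warmup_epochs=10)
```

At epoch 0 the bound contains only the reconstruction term. Once `epoch` reaches `warmup_epochs`, the weight is 1 and the usual lower bound is recovered.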

For CelebA, we considered facial images as $x$ and corresponding attributes as $w$. We set a Gaussian distribution for the decoder of both modalities, where the variance of the Gaussian was fixed to 1. We set and trained for epochs.

We used the Adam optimization algorithm [Kingma and Ba, 2014] with a learning rate of on MNIST and on CelebA. The models were implemented using Theano [Team et al., 2016], Lasagne [Dieleman et al., 2015], and Tars. See appendix for details of network architectures.

Evaluation method

For this experiment, we estimated the test conditional log-likelihood $\log p(x|w)$ (or $\log p(w|x)$) to evaluate the model performance. This estimate indicates how well a model can generate one modality from the other, corresponding modality. Therefore, higher is better. We can estimate $\log p(x|w)$ as


where . In Eq. 5, we apply the sample approximation to the lower bound of the log-likelihood rather than to the log-likelihood directly. This is because the sample approximation of the log-likelihood is biased. (This lower bound is an unbiased estimator. Moreover, Burda et al. [2015] shows that the sample approximation of the log-likelihood approaches the true log-likelihood as the number of samples increases.) We set for all experiments.
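The distinction between the two sample approximations can be made concrete. Given per-sample log-weights, averaging the logs estimates the lower bound, whereas taking the log of the averaged weights estimates the log-likelihood itself in the style of Burda et al. [2015]. The sketch below uses hypothetical log-weights and does not reproduce the exact estimator of Eq. 5.

```python
import math

def mean_of_logs(log_weights):
    """Sample approximation of the lower bound: average of the log-weights."""
    return sum(log_weights) / len(log_weights)

def log_mean_exp(log_weights):
    """Sample approximation of the log-likelihood: log of the averaged weights.

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(log_weights)
    return m + math.log(sum(math.exp(v - m) for v in log_weights)
                        / len(log_weights))

# Hypothetical log-weights, e.g. log p(x | z_i) for sampled z_i.
log_weights = [-420.0, -431.5, -418.2, -440.9]
lower = mean_of_logs(log_weights)
upper = log_mean_exp(log_weights)
```

By Jensen's inequality, `mean_of_logs` never exceeds `log_mean_exp` on the same samples, which is why the first quantity is a bound on the second.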

To estimate the above, we must approximate $q(z|w)$ (or $q(z|x)$) and draw samples from it. How this is done depends on how missing modalities are complemented. In the cases of JMVAE and JMVAE-h, we first set the input of the modality that we want to generate as missing, and then apply iterative sampling multiple times to complement the missing modality. This means that the estimated log-likelihood of these models depends on the number of iterative sampling steps. Conversely, JMVAE-kl can estimate the approximate distribution $q(z|w)$ (or $q(z|x)$) directly at the training stage. For the approximation of the conditional log-likelihood in JMVAE-h, see the appendix.

MNIST results

Confirmation of the missing modality difficulty

Figure 4(a) presents results of generating $x$ from $w$. The top row, i.e., the images generated by applying iterative sampling only once, is blurred and not properly generated. As the number of iterative sampling steps increases, the generated images become somewhat clearer, but it is readily apparent that these images do not correspond to their labels. This result demonstrates that a missing modality cannot be generated even using the iterative sampling method. Figure 4(b) presents the result obtained in the case of JMVAE-h. Unlike the results presented in Figure 4(a), images corresponding to their labels are generated appropriately as the number of sampling steps increases. Figure 4(c) is the result for JMVAE-kl. JMVAE-kl can generate digit images conditioned on labels appropriately without iterative sampling.

Figure 5 presents the conditional likelihood of each model as the number of iterative sampling steps changes. As the number of sampling steps increases, the log-likelihood increases for both JMVAE and JMVAE-h. However, in the case of normal JMVAE, its likelihood is much lower than that of JMVAE-kl no matter how much the number of sampling steps is increased. For JMVAE-h, the likelihood increases with fewer sampling steps than for JMVAE, and the final likelihood becomes higher than that of normal JMVAE. On the other hand, JMVAE-kl can obtain a high likelihood without iterative sampling.

Evaluation of conditional log-likelihood

This section describes evaluation of our models with conditional log-likelihood to evaluate the bi-directional generation of modalities quantitatively. Additionally, we compare the log-likelihood evaluation with conventional conditional VAEs, which are CVAE [Kingma et al., 2014, Sohn et al., 2015] and CMMA [Pandey and Dukkipati, 2016]. Note that these models cannot generate modalities bi-directionally, so it is necessary to train them separately in each direction.

The upper part of Table 1 presents the test conditional log-likelihoods of JMVAE, our two improved models, and conventional conditional VAEs. The number of iterative sampling steps for JMVAE and JMVAE-h was set to 10. First, when generating an image from labels, JMVAE-kl and JMVAE-h improve their likelihood compared to JMVAE, as expected. In addition, JMVAE-kl attains a higher likelihood than the existing VAEs that model generation in only one direction. Next, in the case of generating labels from images, the differences in likelihood among the models are small compared with those for generation in the opposite direction. This means that the missing modality difficulty does not occur when we generate a low-dimensional modality from a high-dimensional one. Note that the conditional likelihood of JMVAE-h might be underestimated relative to the values of other models. This is because its approximation of the conditional likelihood gives a looser lower bound than that of other models (see appendix for details). In light of this fact, we can see that both JMVAE-kl and JMVAE-h can generate in both directions with the same or higher likelihood as conventional conditional VAEs.

Visualization of the joint representation

Figure 3 presents a visualization of the joint representation in each model. For the case of JMVAE-h, we sampled at the top stochastic layer. Here, we sampled from the trained encoders with the dimensions of the latent variable set to 2.

Examining the left panel for each model in Figure 3, samples from all models are distributed in the latent space and separated by label, which indicates that a joint representation including both modalities is obtained.

Next, we examine the right panels in Figure 3. From the JMVAE results, we can see that all the samples are distributed in a considerably small area irrespective of their labels. This result demonstrates that, if the image information is missing, the latent representation actually collapses. By contrast, the result of JMVAE-kl shows that it can obtain a latent representation that is almost unchanged from the one sampled using all modalities. Because samples are not gathered in a small area as with JMVAE, JMVAE-kl prevents the latent representation from collapsing. Finally, in the result of JMVAE-h, samples are distributed separately for each label, just as in the left panel. From this, it can be said that the collapse of the latent representation is also prevented in JMVAE-h.

Figure 6: Facial images ($x$) generation from attributes ($w$) on the CelebA dataset. (a) is an example from the test set of CelebA. (b)-(d) are facial images generated from the attributes of the example (a). Going to the bottom of the row, the number of iterative sampling for generating $x$ increases: (b) JMVAE, (c) JMVAE-h, and (d) JMVAE-kl.
Figure 7: (a) Generation of average faces and corresponding random faces. We first set all values of attributes randomly and designate them as Base. Then, we choose an attribute that we want to set (e.g., Male, Bald, Smiling) and change this value in Base to (or if we want to set ”Not”). Each column corresponds to the same attribute according to the legend. Average faces are generated from , where is a mean of . Moreover, we can obtain various images conditioned on the same values of attributes such as , where , . Each row in random faces has the same . (b) PCA visualizations of latent representation. Colors show which attribute is conditioned on for each sample.
Figure 8: Portraits of the Mona Lisa (upper) and Mozart, generated attributes and reconstructed images conditioned on varied attributes according to the legend. We cropped and resized them in the same way as CelebA. The procedure was the following: generate the corresponding attributes from an unlabeled image . Generate an average face from the attributes . Select attributes that we want to vary and change the values of these attributes. Generate the changed average face from the changed attributes. Obtain a changed reconstruction image by .

CelebA results

Confirmation of the missing modality difficulty

Figure 6 presents the results obtained from generating facial images from attributes. From Figure 6(b), we can see that an ordinary JMVAE cannot generate appropriate facial images at all from the attributes. In addition, unlike MNIST, the facial image largely collapses as the number of iterative sampling steps increases. This suggests that the greater the difference in the amount of information between modalities, the more severe the missing modality difficulty may become. On the other hand, JMVAE-h improves the facial image quality and generates clearer images as the number of iterative sampling steps increases (Figure 6(c)). However, these generated images do not correspond closely to the attributes. Specifically, no generated image shows a subject wearing eyeglasses, probably because, as the stochastic hierarchy deepens, the gap separating the modes corresponding to the attributes becomes smaller, which makes mixing in the latent space easier. Finally, for the case of JMVAE-kl, we can observe that facial images corresponding to attributes are generated by one sampling (Figure 6(d)).


Evaluation of the conditional log-likelihood

The lower part of Table 1 shows the test conditional log-likelihoods of all models. The number of iterative sampling steps for JMVAE and JMVAE-h was set to 40. First, in the case of generating facial images from attributes, we can see that the likelihoods of JMVAE-kl and JMVAE-h are improved considerably compared to that of JMVAE. From this, we can see that even if the dimensions of the modalities differ greatly, our improved models contribute to preventing the missing modality difficulty. Furthermore, as with the result of MNIST, we can see that our improved models can generate in both directions with the same or higher likelihood as conventional VAEs.

Generating faces from attributes and the joint representation on CelebA

Next, we confirm that our model can obtain the joint representation appropriately. Based on the results presented above, we used JMVAE-kl in the remaining experiments. Moreover, we combined JMVAE-kl with GANs to generate clearer images. We considered the network of $p(x|z)$ as the generator in GANs. Then we optimized the GAN's loss together with the lower bound of JMVAE-kl, which is the same approach as the VAE-GAN model [Larsen et al., 2015]. See appendix for the network structure of the discriminator of this GAN.

Figure 7(a) portrays generated faces conditioned on various attributes. Results show that we can generate an average face for each attribute and various random faces conditioned on certain attributes. Figure 7(b) shows that these samples are gathered for each attribute. In addition, in the group for each attribute, the average face is positioned almost at the center of each group, and random facial images are arranged around the average face. Furthermore, it can be confirmed that these arrangements are almost the same in both Base and Not Male. These results indicate that manifold learning of the joint representation works well.

Bi-directional generation between faces and attributes on CelebA

Finally, we demonstrate that our model can perform bi-directional generation between faces and attributes. Figure 9 shows that JMVAE-kl can generate both attributes and changed images conditioned on various attributes from images that had no attribute information. This is possible because, as the results above show, JMVAE-kl can properly obtain the joint representation integrating different modalities.


In this paper, we introduced an extension of VAEs to generate modalities bi-directionally, which we call JMVAE. However, results show that it cannot generate a high-dimensional modality properly if the input of this modality is missing, and that the known remedy, the iterative sampling method, cannot resolve this issue. We proposed two models, JMVAE-kl and JMVAE-h, to prevent this missing modality difficulty. Results demonstrated that these proposed models prevent the latent variable from collapsing and that they can generate modalities bi-directionally with equal or higher quality compared to models which can only generate them in one direction. Furthermore, because these methods appropriately obtain the joint representation, we found that a variety of samples can be generated corresponding to a given modality or a changed modality.


Appendix A The network architectures

Parameterization of distributions with deep neural networks

The Gaussian distribution can be parameterized with deep neural networks as follows.

where and are linear single layer neural networks and

means a deep neural network with arbitrary number of layers. Moreover, applying softplus function for each element of a vector is denoted as


The Bernoulli distribution is parameterized as


is sigmoid function.

In the case of categorical distribution, we can parameterize it as

where means softmax function.
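The three output nonlinearities named in this appendix can be written in numerically stable form as below. This is a sketch under assumptions: the text does not survive well enough to confirm which output each function is applied to, so the `gaussian_head` helper, which uses softplus to keep the variance positive, and its weights are hypothetical.

```python
import math

def softplus(x):
    """log(1 + exp(x)), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """1 / (1 + exp(-x)), e.g. for Bernoulli means; stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(logits):
    """Normalized exponentials, e.g. for categorical probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, hidden):
    """Single linear layer without bias (hypothetical weights)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def gaussian_head(hidden, w_mu, w_var):
    """Mean from a linear layer; variance from a linear layer followed by softplus."""
    mu = linear(w_mu, hidden)
    var = [softplus(v) for v in linear(w_var, hidden)]
    return mu, var

mu, var = gaussian_head([0.5, -1.0], w_mu=[[1.0, 0.0]], w_var=[[0.0, 2.0]])
```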


For the notation of model structures, we denote a linear fully-connected layer with $k$ units and ReLU as DkR, and DkR without ReLU as Dk. In addition, the process of applying J after I is denoted as I-J, and the process of concatenating the last layers of the two networks I, J into one layer is denoted as (I,J).

Therefore, the network structures of encoders and decoders on MNIST are as follows.

  • (Gaussian)

    • and : D64

    • : (D512R-D512R, D512R-D512R)

  • , , and (Gaussian)

    • and : D64

    • : D512R-D512R

  • (Bernoulli)

    • : D784

    • : D512R-D512R

  • (Categorical)

    • : D10

    • : D512R-D512R

  • (Gaussian)

    • and : D64

    • : D512R-D512R-D64-D512R-D512R


We denote a convolutional layer (filter size : , number of channels : $k$, and stride : 2) with batch normalization and ReLU as CkBR, and CkBR without ReLU as CkB. In addition, we denote a deconvolutional layer (filter size : , number of channels : $k$, and crop : 2) with batch normalization and ReLU as DCkBR, DCkBR without ReLU as DCkB, and DCkB without batch normalization as DCk. Further, a linear fully-connected layer with $k$ units, ReLU, and batch normalization is denoted as DkBR and a flatten layer is denoted as F.
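The layer notation used in this appendix is regular enough to be parsed mechanically. The sketch below turns a sequential specification such as D512R-D512R-D64 into a list of layer descriptions. The function name and the dictionary fields are hypothetical, and the concatenation form (I,J) is deliberately not handled.

```python
import re

# D = fully-connected, C = convolution, DC = deconvolution;
# an optional B marks batch normalization and an optional R marks ReLU.
_TOKEN = re.compile(r"^(DC|D|C)(\d+)(B?)(R?)$")
_KINDS = {"D": "dense", "C": "conv", "DC": "deconv"}

def parse_layers(spec):
    """Parse a sequential spec like 'C64R-C128BR-F-D1024R' into layer dicts."""
    layers = []
    for token in spec.split("-"):
        token = token.strip()
        if token == "F":
            layers.append({"kind": "flatten"})
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognized layer token: {token!r}")
        kind, width, batch_norm, relu = match.groups()
        layers.append({
            "kind": _KINDS[kind],
            "width": int(width),          # units for dense, channels otherwise
            "batch_norm": batch_norm == "B",
            "relu": relu == "R",
        })
    return layers

encoder_layers = parse_layers("C64R-C128BR-C256BR-C256BR-F-D1024R")
decoder_layers = parse_layers("D4096R-DC256BR-DC128BR-DC64BR")
```

A parsed list of this kind could then be mapped onto the layer classes of whichever framework is in use.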

Using the above notations, the model structures on CelebA are as follows.

  • (Gaussian)

    • and : D128

    • : (C64R-C128BR-C256BR-C256BR-F, D512R-D512BR)-D1024R

  • (Gaussian)

    • and : D128

    • : C64R-C128BR-C256BR-C256BR-F-D1024R

  • (Gaussian)

    • and : D128

    • : D512R-D512BR-D1024R

  • (Gaussian)

    • and : D128

    • : D512R-D512R

  • (Gaussian with fixed variance)

    • : DC3

    • : D4096R-DC256BR-DC128BR-DC64BR

  • (Gaussian with fixed variance)

    • : D10

    • : D512R-D4096R

  • (Gaussian)

    • and : D128

    • : (C64R-C128BR-C256BR-C256BR-F, D512R-D512BR)-D1024R-D64-D512R-D512R

Moreover, we set C64R-C128BR-C256BR-C256BR-F-D1024R-D1S (where DkS means Dk with sigmoid function) as the network structure of the discriminator used in the CelebA experiment.

Appendix B Conditional log-likelihood of JMVAE-h

We can estimate conditional log-likelihood of JMVAE-h as follows.


where and

where .We set for all experiments.

We find that the approximation of Eq. 6 gives a looser lower bound than that of JMVAE. Therefore, the conditional likelihood of JMVAE-h might be underestimated relative to that of JMVAE.