Introduction
Information in our world is represented through various modalities. Images, for example, are represented by pixel information, but they can also be described with text or tag information. People often exchange such information bidirectionally. For instance, we can not only imagine what a "young female with a smile who does not wear glasses" looks like, but we can also add this caption to a corresponding photograph. Our objective is to design a model that can exchange different modalities bidirectionally as people do. We call this ability bidirectional generation.
Each modality typically has a different kind of dimension and structure, e.g., images (real-valued and dense) and text (discrete and sparse). Therefore, the relations among modalities might be highly nonlinear. To discover such relations, deep neural network architectures have been used widely [Ngiam et al., 2011, Srivastava and Salakhutdinov, 2012].
A major approach to realizing bidirectional generation between modalities is to train a network architecture that shares the top hidden layers of modality-specific networks [Ngiam et al., 2011]. The salient advantages of this approach are that the model with multiple modalities can be trained end-to-end, and that the trained model can extract a joint representation, which is a more compact representation integrating all modalities. The model can easily generate one modality from another via this representation if it can obtain the joint representation properly. Another simple approach might be to create networks for each direction and to train them independently. Several models have been proposed to generate modalities in one direction [Kingma et al., 2014, Sohn et al., 2015, Pandey and Dukkipati, 2016]. However, when generating modalities bidirectionally, the number of required networks is expected to increase exponentially as the number of modalities increases. Moreover, the hidden layers of each network would not be synchronized during training. Therefore, this simple approach is not an efficient way to realize our objective.
To generate different modalities, it is important to model the joint representation with probabilistic latent variables. This is because, as described above, different modalities have different structures and dimensions, so their relation should not be deterministic. The best-known approach in this probabilistic manner is to use deep Boltzmann machines (DBMs) [Srivastava and Salakhutdinov, 2012, Sohn et al., 2014]. However, DBMs are computationally difficult to train, especially on high-dimensional data, because of MCMC-based training.
Variational autoencoders (VAEs) [Kingma and Welling, 2013, Rezende et al., 2014], a class of deep generative models, have the advantage that they can handle higher-dimensional datasets than DBMs because backpropagation can be used to train them. Therefore, we extend VAEs to a model that can generate modalities bidirectionally. This extension is extremely simple: as with the previous neural network and DBM approaches, the latent variables of the generative models corresponding to each modality are shared. We call this model a joint multimodal variational autoencoder (JMVAE). However, results show that if the input of the high-dimensional modality we want to generate is missing, the latent variable of JMVAE, i.e., the joint representation, collapses and this modality cannot be generated successfully. Although a method for addressing this difficulty has been proposed [Rezende et al., 2014], we demonstrate that it cannot resolve the difficulty when the missing modality has higher dimensions than the other one.
Therefore, we propose two new models to address the difficulty presented above: JMVAE-kl and JMVAE-h. JMVAE-kl prepares a new single-input encoder for each modality, apart from the encoder of JMVAE, and reduces the divergence between them. By contrast, JMVAE-h gives its latent variable a stochastic hierarchical structure to prevent its collapse. Figure 1 shows that JMVAE-kl can generate modalities with different dimensions, such as images and attributes, bidirectionally.
The main contributions of this paper are described below.

We present a simple extension of VAEs to generate modalities bidirectionally, which we call JMVAE. However, JMVAE cannot generate a high-dimensional modality well if the input of this modality is missing, and a known solution method cannot resolve this issue.

We propose two models, JMVAE-kl and JMVAE-h, which prevent the latent variable from collapsing when a high-dimensional modality is missing. We confirm experimentally that these models resolve this issue.

We demonstrate that these methods can generate modalities as well as, or better than, conventional VAEs that generate in only one direction.

We demonstrate that they can appropriately obtain a joint representation containing the information of different modalities, and that they can generate various variations of a modality by traversing the latent space or by changing the value of another modality.
Related work
A common approach to dealing with multiple modalities in deep neural networks is to share the top hidden layers of modality-specific networks. Ngiam et al. [2011] propose this approach with deep autoencoders and reveal that it can extract better representations than single-modality settings can. Srivastava and Salakhutdinov [2012] adapt this idea to deep Boltzmann machines (DBMs) [Salakhutdinov and Hinton, 2009], which are generative models with undirected connections based on maximum joint-likelihood learning of all modalities. Therefore, this model can generate modalities bidirectionally. Sohn et al. [2014] improve this model to exchange multiple modalities effectively, based on minimizing the variation of information.
Recently, VAEs [Kingma and Welling, 2013, Rezende et al., 2014] have been used to train such high-dimensional modalities. Several studies [Kingma et al., 2014, Sohn et al., 2015] have examined the use of conditional VAEs (CVAEs), which maximize a conditional log-likelihood by variational methods. Many studies are based on CVAEs to train various multiple modalities such as handwritten digits and labels [Kingma et al., 2014, Sohn et al., 2015], object images and degrees of rotation [Kulkarni et al., 2015], facial images and attributes [Larsen et al., 2015, Yan et al., 2015], and natural images and captions [Mansimov et al., 2015]. The main features of CVAEs are that the relation between modalities is one-way and that the latent variable does not include information about the conditioned modality (according to Louizos et al. [2015], this independence might not hold strictly because the encoder in CVAEs still has a dependence), which is an unsuitable characteristic for our objective. Pandey and Dukkipati [2016] propose the conditional multimodal autoencoder (CMMA), which also maximizes the conditional log-likelihood but connects the latent variable directly to the conditional variable, i.e., these variables are not independent. However, CMMA still considers these modalities to be generated in a single fixed direction.
Another well-known class of deep generative models is generative adversarial nets (GANs) [Goodfellow et al., 2014]. Even with GANs, multiple modalities are often handled by generation in one direction, as in conditional GANs [Mirza and Osindero, 2014].
Recently, GANs that can generate images from images bidirectionally have been proposed [Liu and Tuzel, 2016, Liu et al., 2017, Zhu et al., 2017]. These models train a complete pixel-by-pixel correspondence between modalities of the same dimension and intentionally ignore probabilistic factors. However, no complete correspondence exists between the modalities of different kinds that we examine specifically in this study. Our methods can train a probabilistic joint representation integrating all modality information, so they can obtain a probabilistic relation between modalities.
VAEs with multiple modalities
This section first briefly presents the formulation of VAEs. Subsequently, a simple extension of VAEs to multiple modalities is introduced.
Variational autoencoders
Given observation variables $x$ and corresponding latent variables $z$, their generating processes are definable as $z \sim p(z)$ and $x \sim p_\theta(x|z)$, where $\theta$ is the model parameter of $p$. The objective of VAEs is maximization of the marginal distribution $p(x) = \int p_\theta(x|z) p(z) dz$. Because this distribution is intractable, we instead train the model to maximize the following lower bound of the marginal log-likelihood:

$\mathcal{L}_{VAE}(x) = -D_{KL}(q_\phi(z|x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)], \quad (1)$

where $q_\phi(z|x)$ is an approximate distribution of the posterior $p(z|x)$ and $\phi$ is the model parameter of $q$. We designate $q_\phi(z|x)$ as the encoder and $p_\theta(x|z)$ as the decoder. In Eq. 1, the first term is a regularization term; the second is the negative reconstruction error.
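The bound in Eq. 1 can be estimated per example from a single reparameterized sample. The following minimal numpy sketch assumes a diagonal-Gaussian encoder and a Bernoulli decoder (the distributions used for MNIST later in the paper); the function name is illustrative, not from any library.

```python
import numpy as np

def vae_lower_bound(x, mu, log_var, recon_prob):
    """Single-sample estimate of Eq. 1 for one example.

    mu, log_var : mean and log-variance of the Gaussian encoder q(z|x)
    recon_prob  : decoder probabilities p(x=1|z) for a Bernoulli decoder,
                  computed from a reparameterized sample of z
    """
    # Analytic KL(q(z|x) || N(0, I)) -- the regularization term.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # Bernoulli log-likelihood -- the negative reconstruction error.
    eps = 1e-7
    log_px = np.sum(x * np.log(recon_prob + eps)
                    + (1.0 - x) * np.log(1.0 - recon_prob + eps))
    return -kl + log_px
```

When $q_\phi(z|x)$ equals the prior $\mathcal{N}(0, I)$, the KL term vanishes and the bound reduces to the reconstruction term.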
To optimize the lower bound with respect to the parameters $\theta$ and $\phi$, we estimate the gradients of Eq. 1 using stochastic gradient variational Bayes (SGVB). If we consider $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \mathrm{diag}(\sigma^2(x)))$, then we can reparameterize $z \sim q_\phi(z|x)$ to $z = \mu(x) + \sigma(x) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Therefore, we can estimate the gradients of the negative reconstruction term in Eq. 1 with respect to $\theta$ and $\phi$ as $\nabla_{\theta,\phi}\,\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \approx \frac{1}{L}\sum_{l=1}^{L} \nabla_{\theta,\phi} \log p_\theta(x \,|\, \mu(x) + \sigma(x) \odot \epsilon^{(l)})$. The gradients of the regularization term can be computed analytically. Therefore, we can optimize Eq. 1 using standard stochastic optimization methods.
Joint Multimodal Variational Autoencoders
Next, we examine a dataset of two modalities $x$ and $w$, which have dimensions and structures of different kinds (in our experiments, these depend on the dataset; see the Settings section). We assume that these generative models are conditionally independent given the same latent variable $z$, i.e., the joint representation. This means that the latent variable of the generative models corresponding to each modality is shared. Therefore, their generative process becomes $x \sim p_{\theta_x}(x|z)$ and $w \sim p_{\theta_w}(w|z)$, where $\theta_x$ and $\theta_w$ respectively represent the model parameters of each independent $p$.
Considering an approximate posterior distribution $q_\phi(z|x,w)$, we can estimate a lower bound of the log-likelihood $\log p(x,w)$, as shown below:

$\mathcal{L}_{JM}(x,w) = -D_{KL}(q_\phi(z|x,w) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z|x,w)}[\log p_{\theta_x}(x|z)] + \mathbb{E}_{q_\phi(z|x,w)}[\log p_{\theta_w}(w|z)]. \quad (2)$
Eq. 2 has two negative reconstruction terms, one for each modality. As with VAEs, we designate $q_\phi(z|x,w)$ as the encoder and both $p_{\theta_x}(x|z)$ and $p_{\theta_w}(w|z)$ as decoders. This model is a simple extension of VAEs to multiple modalities, and we call it a joint multimodal variational autoencoder (JMVAE). Because each modality has a different feature representation, we set different networks for each decoder.
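Eq. 2 can be sketched the same way: one shared latent sample feeds two decoders, and their log-likelihoods are summed against a single shared KL term. The illustrative numpy snippet below assumes a Bernoulli image decoder and a categorical label decoder, matching the MNIST settings described later; none of the names come from the paper's code.

```python
import numpy as np

def jmvae_lower_bound(x, w_onehot, mu, log_var, px, pw):
    """Single-sample estimate of Eq. 2 for one example pair (x, w).

    mu, log_var : parameters of the joint encoder q(z|x, w)
    px : Bernoulli parameters of p(x|z) for the image modality
    pw : categorical parameters of p(w|z) for the label modality
    Both decoders condition on the SAME sample of z, i.e., the joint
    representation.
    """
    eps = 1e-7
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    log_px = np.sum(x * np.log(px + eps) + (1.0 - x) * np.log(1.0 - px + eps))
    log_pw = np.sum(w_onehot * np.log(pw + eps))
    return -kl + log_px + log_pw
```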
Complement of a missing modality
After training JMVAE, we can extract a joint latent representation by sampling from the encoder at testing time. Our objective is to exchange modalities bidirectionally, e.g., images to text and vice versa. In this setting, modalities that we want to sample are expected to be missing, so that inputs of such modalities are set to zero or random noise (Figure 2(a)). In discriminative multimodal settings, this is a common means of estimating modalities from other modalities [Ngiam et al., 2011], but it is difficult to handle this missing modality properly.
For VAEs, Rezende et al. [2014] propose a sophisticated approach of iterative sampling by a Markov chain with a transition kernel to complement missing values of the input. In the case of JMVAE, the transition kernel when $w$ is missing is $z^{(t)} \sim q_\phi(z|x, w^{(t-1)}),\; w^{(t)} \sim p_{\theta_w}(w|z^{(t)})$. We can now estimate a missing modality by first setting the initial value $w^{(0)}$ to random noise and then conducting iterative sampling following the above kernel. As the number of iterations increases, we can estimate the complemented values better. As described in this paper, we call this the iterative sampling method.
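The iterative sampling loop can be sketched as follows. The encoder and decoder here are hypothetical linear-Gaussian stand-ins for the trained $q_\phi(z|x,w)$ and $p_{\theta_w}(w|z)$; only the loop structure reflects the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained networks.
def encoder_mean(x, w):      # mean of q(z|x, w)
    return 0.5 * (x + w)

def decoder_w_mean(z):       # mean of p(w|z)
    return z

def iterative_sampling(x, dim_w, n_iter=50, noise=0.1):
    """Complement a missing modality w via the Markov chain
    z_t ~ q(z|x, w_{t-1}), then w_t ~ p(w|z_t)."""
    w = rng.normal(size=dim_w)                   # initialize w with noise
    for _ in range(n_iter):
        z = encoder_mean(x, w) + noise * rng.normal(size=dim_w)
        w = decoder_w_mean(z) + noise * rng.normal(size=dim_w)
    return w
```

With these stubs the chain contracts toward $w \approx x$, mimicking how more iterations yield better complemented values.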
However, if a missing modality is high-dimensional and complicated compared to the other modalities, such as natural images, then the inferred latent variable might collapse and the generated samples might become incomplete. In the experiments, we show that this difficulty cannot be prevented even using the iterative sampling method.
To resolve this issue, we propose two new models: JMVAE-h and JMVAE-kl.
JMVAE-h
Recently, some studies have extended the latent variables of VAEs to a stochastic hierarchical structure to improve the expressiveness and likelihood of models [Burda et al., 2015, Sønderby et al., 2016, Gulrajani et al., 2016]. Such a hierarchical structure of latent variables is robust against a missing input, so it might contribute to preventing the latent variable from collapsing.
We can extend JMVAE to a hierarchical structure of latent variables. Letting the latent variable be a stochastic hierarchy of $L$ layers $z_1, \ldots, z_L$ (note that the stochastic layers differ from the deterministic layers of neural networks), the joint distribution of JMVAE becomes $p(x, w, z_1, \ldots, z_L) = p_{\theta_x}(x|z_1)\, p_{\theta_w}(w|z_1)\, p(z_L) \prod_{l=1}^{L-1} p(z_l|z_{l+1})$, where all conditional distributions are Gaussian and are parameterized by deep neural networks.
Various means are available to decompose the approximate distribution $q(z_1, \ldots, z_L|x,w)$. We follow Gulrajani et al. [2016], which uses $q_\phi(z_1, \ldots, z_L|x,w) = q_\phi(z_1|x,w) \prod_{l=1}^{L-1} q_\phi(z_{l+1}|z_l)$, where all conditional distributions are Gaussian. In this study, as with Gulrajani et al. [2016], the structure from the input to the final stochastic layer consists of deterministic mappings (each parameterized as a deep neural network), and the probabilistic output of each stochastic layer is obtained from a deterministic output (see Figure 2(b)).
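Sampling from this factorized approximate posterior is plain ancestral sampling down the hierarchy, as in this toy numpy sketch (the mean functions are illustrative stand-ins for the deep networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hierarchy(x, w, first_fn, layer_fns, noise=1.0):
    """Ancestral sampling through the stochastic hierarchy:
    z_1 ~ q(z_1|x, w), then z_{l+1} ~ q(z_{l+1}|z_l).

    first_fn  : maps (x, w) to the mean of q(z_1|x, w)
    layer_fns : layer_fns[l] maps z_{l+1} to the mean of the next layer
    """
    mean = first_fn(x, w)
    zs = [mean + noise * rng.normal(size=mean.shape)]
    for fn in layer_fns:                       # one function per later layer
        mean = fn(zs[-1])
        zs.append(mean + noise * rng.normal(size=mean.shape))
    return zs                                  # [z_1, ..., z_L]

# Two stochastic layers with toy linear mean functions, noise disabled.
z1, z2 = sample_hierarchy(np.ones(2), np.ones(2),
                          first_fn=lambda x, w: x + w,
                          layer_fns=[lambda z: 0.5 * z],
                          noise=0.0)
```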
Therefore, the lower bound of JMVAE with a stochastic hierarchical structure becomes

$\mathcal{L}_{JMh}(x,w) = \mathbb{E}_{q_\phi(z_1, \ldots, z_L|x,w)}\left[\log \frac{p(x, w, z_1, \ldots, z_L)}{q_\phi(z_1, \ldots, z_L|x,w)}\right]. \quad (3)$

We call this model JMVAE-h and demonstrate through experiments that it can prevent the missing-modality issue. In our experiments, we set $L$ to a fixed small value. Figure 2(b) shows the flow of generating one modality from the other with JMVAE-h.
JMVAE-kl
In the stochastic hierarchical approach described above, the iterative sampling method is still required for generating missing modalities. However, it takes time to generate high-dimensional samples this way. Therefore, we propose JMVAE-kl as a model that generates appropriate missing samples without the iterative sampling method.
Assume that we have encoders with a single input, $q_{\phi_x}(z|x)$ and $q_{\phi_w}(z|w)$, where $\phi_x$ and $\phi_w$ are parameters. We would like to train them by reducing the divergence between these single-input encoders and the joint encoder $q_\phi(z|x,w)$ (see Figure 2(c)). Therefore, the objective function of JMVAE-kl becomes

$\mathcal{L}_{JMkl}(x,w) = \mathcal{L}_{JM}(x,w) - \left[ D_{KL}(q_\phi(z|x,w) \,\|\, q_{\phi_x}(z|x)) + D_{KL}(q_\phi(z|x,w) \,\|\, q_{\phi_w}(z|w)) \right]. \quad (4)$
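These KL terms have a closed form when both encoders are diagonal Gaussians, the usual VAE setting. A sketch (this follows the standard Gaussian-KL formula; it is not code from the paper):

```python
import numpy as np

def kl_diag_gaussians(mu0, log_var0, mu1, log_var1):
    """KL( N(mu0, diag(exp(log_var0))) || N(mu1, diag(exp(log_var1))) ).

    In JMVAE-kl, mu0/log_var0 would come from the joint encoder q(z|x, w)
    and mu1/log_var1 from a single-input encoder such as q(z|w).
    """
    return 0.5 * np.sum(
        log_var1 - log_var0
        + (np.exp(log_var0) + (mu0 - mu1) ** 2) / np.exp(log_var1)
        - 1.0
    )
```

The divergence is zero only when the two encoders agree, which is exactly what pulls the single-input encoders toward the joint one.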
Actually, JMVAE-h and JMVAE-kl have both benefits and shortcomings. JMVAE-kl must prepare an encoder for each modality apart from the original encoder, whereas JMVAE-h requires no additional encoder. Conversely, although JMVAE-h must apply the iterative sampling method to generate a missing modality, JMVAE-kl does not need to do so. Therefore, JMVAE-h is effective when there are three or more modalities and the dimension of the missing modality is not so large; JMVAE-kl is effective when there are only two modalities and the missing modality is high-dimensional.
Table 1: Test conditional log-likelihoods on MNIST (upper) and CelebA (lower). Higher is better.

Model     | Label → Image | Image → Label
JMVAE     | -977.2        | -0.2361
JMVAE-kl  | -422.4        | -0.2628
JMVAE-h   | -552.2        | -3.589
CVAE      | -448.8        | -5.293
CMMA      | -451.1        | -0.2971

Model     | Attributes → Image | Image → Attributes
JMVAE     | -48763             | -43.97
JMVAE-kl  | -6852              | -44.13
JMVAE-h   | -7355              | -47.61
CVAE      | -6825              | -44.28
CMMA      | -6920              | -44.57
Experiments
In this section, we confirm the following three points through experiments: (1) the missing-modality difficulty certainly occurs in JMVAE, and our proposed methods can prevent it; (2) our proposed models can generate modalities bidirectionally with likelihood equivalent to (or higher than) models that generate a modality in only one direction; and (3) our models can appropriately obtain the joint representation, which integrates information from different modalities.
Datasets
As described herein, we used two datasets: MNIST and CelebA [Liu et al., 2015].
Originally, MNIST was not a dataset for multimodal settings. For this work, we used it as a toy problem for various tests to verify our model. We regard the 784-dimensional handwritten images and the corresponding 10-class digit labels as two modalities. We used 50,000 examples as the training set and the remaining 10,000 as the test set.
CelebA consists of 202,599 color facial images and 40 corresponding binary attributes such as male, eyeglasses, and mustache. For this work, we regard them as two modalities. This dataset is challenging because the modalities have dimensions and structures of completely different kinds. Beforehand, we cropped the images to squares, resized them to 64 × 64, and normalized them. From the dataset, we chose 191,899 images that are identifiable as facial images using OpenCV and used them for our experiment. We used 90% of the dataset as the training set and the remaining 10% as the test set.
Settings
For MNIST, we considered images as $x$ and corresponding labels as $w$. We set $p_{\theta_x}(x|z)$ as a Bernoulli distribution and $p_{\theta_w}(w|z)$ as a categorical distribution. We used warm-up [Bowman et al., 2015, Sønderby et al., 2016], which first forces training of only the negative reconstruction error term and then gradually increases the effect of the regularization term to prevent local minima during early training. We increased this term linearly during the first epochs of training, as with Sønderby et al. [2016], and then trained on MNIST. Moreover, as described by Burda et al. [2015] and Sønderby et al. [2016], we resampled the binarized training values randomly from the MNIST images for each epoch to prevent overfitting.
For CelebA, we considered facial images as $x$ and corresponding attributes as $w$. We set a Gaussian distribution for the decoders of both modalities, where the variance of the Gaussian was fixed to 1. We used the Adam optimization algorithm [Kingma and Ba, 2014], with separate learning rates for MNIST and CelebA. The models were implemented using Theano [Team et al., 2016], Lasagne [Dieleman et al., 2015], and Tars (https://github.com/masasu/Tars). See the appendix for details of the network architectures.
Evaluation method
For this experiment, we estimated the test conditional log-likelihood $\log p(x|w)$ (or $\log p(w|x)$) to evaluate model performance. This estimate indicates how well a model can generate one modality from the corresponding other modality; therefore, higher is better. We can estimate $\log p(x|w)$ as

$\log p(x|w) \approx \log \frac{1}{S} \sum_{s=1}^{S} p_\theta(x|z^{(s)}), \quad (5)$

where $z^{(s)} \sim q(z|w)$. In Eq. 5, we apply the sample approximation to the lower bound of the log-likelihood rather than to the log-likelihood directly, because the direct sample approximation of the log-likelihood is biased, whereas this lower bound has an unbiased estimator. Moreover, Burda et al. [2015] shows that the sample approximation of the log-likelihood approaches the true log-likelihood as the number of samples increases. We used the same number of samples $S$ for all experiments.
To estimate the above, we should approximate $q(z|w)$ (or $q(z|x)$) and draw samples from it. How to do so depends on how missing modalities are complemented. In the cases of JMVAE and JMVAE-h, we first set the input of the modality that we want to generate as missing and then apply iterative sampling multiple times to complement the missing modality. This means that the estimated log-likelihood of these models depends on the number of iterative sampling steps. Conversely, JMVAE-kl can estimate the approximate distribution $q_{\phi_w}(z|w)$ (or $q_{\phi_x}(z|x)$) directly at the training stage. For the approximation of the conditional log-likelihood in JMVAE-h, see the appendix.
MNIST results
Confirmation of the missing modality difficulty
Figure 4(a) presents results of generating images from labels with JMVAE. The top row, i.e., images generated by applying iterative sampling only once, is blurred; the images are not generated properly. As the number of iterative sampling steps increases, the generated images become somewhat clearer, but it is readily apparent that they do not correspond to their labels. This result demonstrates that a missing modality cannot be generated even using the iterative sampling method. Figure 4(b) presents the result obtained in the case of JMVAE-h. Unlike the results in Figure 4(a), it is apparent that images corresponding to their labels are generated appropriately as the number of sampling steps increases. Figure 4(c) is the result for JMVAE-kl, which can generate digit images conditioned on numbers appropriately without iterative sampling.
Figure 5 presents the conditional likelihood of each model as the number of iterative sampling steps changes. As the number of steps increases, the log-likelihood increases for both JMVAE and JMVAE-h. However, in the case of plain JMVAE, the likelihood remains much lower than that of JMVAE-kl no matter how much the number of steps is increased. For JMVAE-h, the likelihood increases with fewer sampling steps than JMVAE, and the final likelihood is higher than that of plain JMVAE. On the other hand, JMVAE-kl can obtain high likelihood without iterative sampling.
Evaluation of conditional log-likelihood
This section describes the evaluation of our models with the conditional log-likelihood to quantitatively assess the bidirectional generation of modalities. Additionally, we compare against conventional conditional VAEs: CVAE [Kingma et al., 2014, Sohn et al., 2015] and CMMA [Pandey and Dukkipati, 2016]. Note that these models cannot generate modalities bidirectionally, so it is necessary to train them separately for each direction.
The upper half of Table 1 presents the test conditional log-likelihoods of JMVAE, our two improved models, and the conventional conditional VAEs. The number of iterative sampling steps for JMVAE and JMVAE-h was set to 10. First, when generating an image from labels, JMVAE-kl and JMVAE-h improve the likelihood compared to JMVAE, as expected. In addition, these models have higher likelihood than existing VAEs that model generation in only one direction. Next, in the case of generating labels from images, we find little difference in likelihood across models compared with generation in the opposite direction. This means that the missing-modality difficulty does not occur when we generate a low-dimensional modality from a high-dimensional one. Note that the conditional-likelihood value of JMVAE-h might be underestimated compared to those of the other models, because its conditional-likelihood approximation is a looser lower bound than those of the other models (see the appendix for details). In light of this fact, we can see that both JMVAE-kl and JMVAE-h can generate in both directions with likelihood equal to or higher than that of conventional conditional VAEs.
Visualization of the joint representation
Figure 3 presents a visualization of the joint representation in each model. For the case of JMVAE-h, we sampled at the top stochastic layer. Here, we sampled from the trained encoders with the dimension of the latent variable set to 2.
Examining the left panel for each model in Figure 3, the samples from all models are distributed in the latent space and clustered by label, which indicates that a joint representation including both modalities is obtained.
Next, we examine the right panels in Figure 3. From the JMVAE results, we can see that all the samples are distributed in a considerably small area irrespective of their labels. This result demonstrates that, if the image information is missing, the latent representation actually collapses. By contrast, the result of JMVAE-kl shows that it can obtain a latent representation that is almost unchanged from sampling with all modalities. Because the samples are not gathered in a small area as with JMVAE, JMVAE-kl prevents the latent representation from collapsing. Finally, regarding the result of JMVAE-h, the samples are distributed quite separately for each label, just as in the left panel. From this, it can be said that the collapse of the latent representation is also prevented in JMVAE-h.
CelebA results
Confirmation of the missing modality difficulty
Figure 6 presents the results obtained from generating facial images from attributes. From Figure 6(b), we can see that an ordinary JMVAE cannot generate appropriate facial images at all from attributes. In addition, unlike MNIST, the facial image largely collapses as the number of iterative sampling steps is increased. This suggests that the greater the difference in the amount of information between modalities, the more severe the missing-modality difficulty may become. On the other hand, it is confirmed that JMVAE-h improves the facial image quality and that it can generate cleaner images when the number of iterative sampling steps is increased (Figure 6(c)). However, it can also be confirmed that these generated images do not correspond closely to the attributes. Specifically, no generated image shows a subject wearing eyeglasses, probably because, as the stochastic hierarchy deepens, the gap separating the modes corresponding to the attributes becomes smaller, which makes mixing in the latent space easier. Finally, for the case of JMVAE-kl, we can observe that facial images corresponding to the attributes are generated with one sampling step (Figure 6(d)).
Evaluation of the conditional log-likelihood
The lower half of Table 1 shows the test conditional log-likelihoods of all models. The number of iterative sampling steps for JMVAE and JMVAE-h was set to 40. First, in the case of generating facial images from attributes, we can see that the likelihoods of JMVAE-kl and JMVAE-h are improved considerably compared to that of JMVAE. From this, we can see that even when the dimensions of the modalities differ greatly, our improved models contribute to preventing the missing-modality difficulty. Furthermore, as with the MNIST result, our improved models can generate in both directions with likelihood equal to or higher than that of conventional VAEs.
Generating faces from attributes and the joint representation on CelebA
Next, we confirm that our model can obtain the joint representation appropriately. Based on the results presented above, we used JMVAE-kl in the remaining experiments. Moreover, we combined JMVAE-kl with GANs to generate clearer images: we considered the decoder network $p_{\theta_x}(x|z)$ as the generator of the GAN and optimized the GAN loss together with the lower bound of JMVAE-kl, which is the same approach as the VAE-GAN model [Larsen et al., 2015]. See the appendix for the network structure of the discriminator of this GAN.
Figure 7(a) portrays generated faces conditioned on various attributes. Results show that we can generate an average face for each attribute as well as various random faces conditioned on a certain attribute. Figure 7(b) shows that these samples are gathered by attribute. In addition, within the group for each attribute, the average face is positioned almost at the center, and random facial images are arranged around it. Furthermore, these arrangements are almost the same for both Base and Not Male. These results indicate that manifold learning of the joint representation works well.
Bidirectional generation between faces and attributes on CelebA
Finally, we demonstrate that our model can perform bidirectional generation between faces and attributes. Figure 9 shows that JMVAE-kl can generate both attributes and images changed to match various attributes, starting from images that had no attribute information. This is possible because, as shown above, JMVAE-kl can properly obtain the joint representation integrating different modalities.
Conclusion
As described herein, we introduced an extension of VAEs to generate modalities bidirectionally, which we call JMVAE. However, results show that it cannot generate a high-dimensional modality properly if the input of this modality is missing, and that the known solution, the iterative sampling method, cannot resolve this issue. We proposed two models, JMVAE-kl and JMVAE-h, to prevent the above missing-modality difficulty. Results demonstrated that these proposed models prevent the latent variable from collapsing and that they can generate modalities bidirectionally with quality equal to or higher than that of models which can generate in only one direction. Furthermore, because these methods appropriately obtain the joint representation, we found that various sample variations can be generated corresponding to a certain modality or a changed modality.
References

Ngiam et al. [2011] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
Srivastava and Salakhutdinov [2012] Nitish Srivastava and Ruslan R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.
Kingma et al. [2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 Sohn et al. [2015] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
 Pandey and Dukkipati [2016] Gaurav Pandey and Ambedkar Dukkipati. Variational methods for conditional multimodal learning: Generating human faces from attributes. arXiv preprint arXiv:1603.01801, 2016.
 Sohn et al. [2014] Kihyuk Sohn, Wenling Shang, and Honglak Lee. Improved multimodal deep learning with variation of information. In Advances in Neural Information Processing Systems, pages 2141–2149, 2014.
Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
Salakhutdinov and Hinton [2009] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines. In AISTATS, volume 1, page 3, 2009.
 Kulkarni et al. [2015] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
 Larsen et al. [2015] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
 Yan et al. [2015] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. arXiv preprint arXiv:1512.00570, 2015.
 Mansimov et al. [2015] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015.
 Louizos et al. [2015] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair auto encoder. arXiv preprint arXiv:1511.00830, 2015.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
Liu and Tuzel [2016] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
Liu et al. [2017] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
 Burda et al. [2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Sønderby et al. [2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. arXiv preprint arXiv:1602.02282, 2016.
Gulrajani et al. [2016] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.

 Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
 Bowman et al. [2015] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Team et al. [2016] The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
 Dieleman et al. [2015] Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Daniel Maturana, Martin Thoma, Eric Battenberg, Jack Kelly, Jeffrey De Fauw, Michael Heilman, Diogo Moitinho de Almeida, Brian McFee, Hendrik Weideman, Gábor Takács, Peter de Rivaz, Jon Crall, Gregory Sanders, Kashif Rasul, Cong Liu, Geoffrey French, and Jonas Degrave. Lasagne: First release., August 2015. URL http://dx.doi.org/10.5281/zenodo.27878.
Appendix A The network architectures
Parameterization of distributions with deep neural networks
The Gaussian distribution can be parameterized with deep neural networks as follows:
\[
\mathcal{N}\big(x;\, \mu(z), \operatorname{diag}(\sigma^2(z))\big), \qquad
\mu(z) = f_{\mu}\big(f(z)\big), \qquad
\sigma^2(z) = \operatorname{Softplus}\big(f_{\sigma}(f(z))\big),
\]
where $f_{\mu}$ and $f_{\sigma}$ are linear single-layer neural networks and $f$ means a deep neural network with an arbitrary number of layers. Moreover, applying the softplus function to each element of a vector is denoted as $\operatorname{Softplus}(\cdot)$. In the case of the categorical distribution, we can parameterize it as
\[
\operatorname{Cat}\big(y;\, \pi(z)\big), \qquad \pi(z) = \operatorname{Softmax}\big(f_{\pi}(f(z))\big),
\]
where $\operatorname{Softmax}(\cdot)$ means the softmax function.
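As a concrete illustration, this parameterization can be sketched in plain NumPy. The layer sizes (784 → 512 → 512 → 64) follow the MNIST architectures later in this appendix, but the helper names (`make_linear`, `gaussian_params`) and the weight initialization are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(a):
    # numerically stable elementwise softplus: log(1 + exp(a))
    return np.logaddexp(0.0, a)

def relu(a):
    return np.maximum(a, 0.0)

def make_linear(n_in, n_out):
    """A linear single-layer network: h -> W h + b (toy initialization)."""
    W = rng.normal(scale=0.01, size=(n_out, n_in))
    b = np.zeros(n_out)
    return lambda h: W @ h + b

# Deep body network f with an arbitrary number of layers (two here).
l1 = make_linear(784, 512)
l2 = make_linear(512, 512)
body = lambda x: relu(l2(relu(l1(x))))

# Linear single-layer heads producing the Gaussian parameters.
f_mu = make_linear(512, 64)
f_sigma = make_linear(512, 64)

def gaussian_params(x):
    h = body(x)
    # the softplus keeps every element of the variance strictly positive
    return f_mu(h), softplus(f_sigma(h))

x = rng.normal(size=784)
mu, var = gaussian_params(x)
```

A categorical head would replace the two Gaussian heads with a single linear layer followed by a softmax over its outputs.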
MNIST
For the notation of model structures, we denote a linear fully-connected layer with k units and ReLU as DkR, and DkR without ReLU as Dk. In addition, the process of applying J after I is denoted as IJ, and the process of concatenating the last layers of the two networks I and J into one layer is denoted as (I, J). Using this notation, the network structures of the encoders and decoders on MNIST are as follows.

(Gaussian)
μ and σ²: D64
DNN: (D512RD512R, D512RD512R)

(Gaussian; three distributions with this same structure)
μ and σ²: D64
DNN: D512RD512R

(Bernoulli)
μ: D784
DNN: D512RD512R

(Categorical)
π: D10
DNN: D512RD512R

(Gaussian)
μ and σ²: D64
DNN: D512RD512RD64D512RD512R
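To make the DkR notation concrete, the following hypothetical helper maps a token sequence such as D512R, D512R, D64 (the single-modality encoder body above) to a NumPy function. The builder and its names are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(a):
    return np.maximum(a, 0.0)

def layer(n_in, n_out, use_relu):
    """Dk = linear layer with n_out units; DkR = the same followed by ReLU."""
    W = rng.normal(scale=0.01, size=(n_out, n_in))
    b = np.zeros(n_out)
    if use_relu:
        return lambda h: relu(W @ h + b)
    return lambda h: W @ h + b

def build(tokens, n_in):
    """Build a network from tokens like ('D512R', 'D512R', 'D64').

    'IJ' (applying J after I) becomes plain function composition.
    """
    fns = []
    for tok in tokens:
        use_relu = tok.endswith('R')
        k = int(tok.strip('DR'))  # the unit count k inside the token
        fns.append(layer(n_in, k, use_relu))
        n_in = k
    def net(h):
        for f in fns:
            h = f(h)
        return h
    return net

# D512R-D512R body with a D64 output head, as in the MNIST encoders.
enc = build(('D512R', 'D512R', 'D64'), n_in=784)
z = enc(rng.normal(size=784))
```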

CelebA
We denote a convolutional layer (number of channels: k, stride: 2) with batch normalization and ReLU as CkBR, and CkBR without ReLU as CkB. In addition, we denote a deconvolutional layer (number of channels: k, crop: 2) with batch normalization and ReLU as DCkBR, DCkBR without ReLU as DCkB, and DCkB without batch normalization as DCk. Further, a linear fully-connected layer with k units, ReLU, and batch normalization is denoted as DkBR, and a flatten layer is denoted as F. Using the above notations, the model structures on CelebA are as follows.

(Gaussian)
μ and σ²: D128
DNN: (C64RC128BRC256BRC256BRF, D512RD512BR)D1024R

(Gaussian)
μ and σ²: D128
DNN: C64RC128BRC256BRC256BRFD1024R

(Gaussian)
μ and σ²: D128
DNN: D512RD512BRD1024R

(Gaussian)
μ and σ²: D128
DNN: D512RD512R

(Gaussian with fixed variance)
μ: DC3
DNN: D4096RDC256BRDC128BRDC64BR

(Gaussian with fixed variance)
μ: D10
DNN: D512RD4096R

(Gaussian)
μ and σ²: D128
DNN: (C64RC128BRC256BRC256BRF, D512RD512BR)D1024RD64D512RD512R

Moreover, we set C64RC128BRC256BRC256BRFD1024RD1S (where DkS means Dk with a sigmoid function) as the network structure of the discriminator used in the CelebA experiment.
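The (I, J) concatenation used in the joint CelebA encoder can be sketched as follows. All branch sizes and helper names here are illustrative assumptions; the real image branch is convolutional, which this sketch replaces with a dense layer on pre-flattened features.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(a):
    return np.maximum(a, 0.0)

def dense_relu(n_in, n_out):
    """DkR-style layer: linear with n_out units followed by ReLU."""
    W = rng.normal(scale=0.01, size=(n_out, n_in))
    b = np.zeros(n_out)
    return lambda h: relu(W @ h + b)

# Branch I stands in for the convolutional image path after flattening (F);
# branch J for the attribute path. Sizes are illustrative only.
branch_i = dense_relu(4096, 256)   # flattened image features
branch_j = dense_relu(10, 512)     # attribute vector w

# The trailing D1024R is applied to the concatenation of both branches.
shared = dense_relu(256 + 512, 1024)

def joint(x_feat, w):
    # '(I, J)' concatenates the last layers of the two branches into one layer
    h = np.concatenate([branch_i(x_feat), branch_j(w)])
    return shared(h)

out = joint(rng.normal(size=4096), rng.normal(size=10))
```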
Appendix B Conditional log-likelihood of JMVAE-h
We can estimate the conditional log-likelihood of JMVAE-h as follows:
\[
\log p(x|w) \approx \log \frac{1}{L}\sum_{l=1}^{L} \frac{p\big(x, w, z^{(l)}\big)}{q\big(z^{(l)}|x, w\big)} - \log \frac{1}{L}\sum_{l=1}^{L} \frac{p\big(w, \tilde{z}^{(l)}\big)}{q\big(\tilde{z}^{(l)}|w\big)}, \tag{6}
\]
where $z^{(l)} \sim q(z|x,w)$ and $\tilde{z}^{(l)} \sim q(z|w)$, and $L$ is the number of importance samples. We use the same $L$ for all experiments.
We find that the approximation of Eq. 6 gives a lower value (i.e., a looser lower bound) than the corresponding estimate for the JMVAE. Therefore, the conditional log-likelihood of JMVAE-h might be underestimated compared to that of the JMVAE.
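Assuming Eq. 6 is the usual two-term importance-sampling estimate of $\log p(x|w) = \log p(x,w) - \log p(w)$, it can be sketched numerically as follows; the weight arrays are placeholders for the log-density ratios a trained model would supply.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log(mean(exp(a))) over importance samples."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def conditional_ll(log_w_joint, log_w_marginal):
    """Estimate log p(x|w) = log p(x,w) - log p(w) from importance weights.

    log_w_joint[l]    = log p(x, w, z_l) - log q(z_l | x, w), z_l ~ q(z|x,w)
    log_w_marginal[l] = log p(w, z_l)    - log q(z_l | w),    z_l ~ q(z|w)
    In a real model these come from the decoder and encoder densities;
    here they are placeholder arrays.
    """
    return log_mean_exp(log_w_joint) - log_mean_exp(log_w_marginal)

# Sanity check: with constant weights the estimate is exact,
# log p(x,w) - log p(w) = (-3) - (-5) = 2.
est = conditional_ll(np.full(100, -3.0), np.full(100, -5.0))
```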