Information in our world is represented through various modalities. Images, for example, are represented by pixel information, but they can also be described with text or tag information. People often exchange such information bi-directionally. For instance, we can not only imagine what a “young female with a smile who does not wear glasses” looks like, but can also add this caption to a corresponding photograph. Our objective is to design a model that, like people, can exchange different modalities bi-directionally. We call this ability bi-directional generation.
Each modality typically has different dimensionality and structure, e.g., images (real-valued and dense) and text (discrete and sparse). Therefore, the relations among modalities can be highly nonlinear. To discover such relations, deep neural network architectures have been used widely [Ngiam et al., 2011, Srivastava and Salakhutdinov, 2012].
A major approach to realizing bi-directional generation between modalities is to train a network architecture that shares the top hidden layers of modality-specific networks [Ngiam et al., 2011]. The salient advantages of this approach are that the model can be trained end-to-end over multiple modalities, and that the trained model can extract a joint representation, a compact representation integrating all modalities. If the joint representation is obtained properly, the model can easily generate one modality from another via this representation. Another simple approach might be to create a network for each direction and train them independently; several models have been proposed that generate modalities in one direction [Kingma et al., 2014, Sohn et al., 2015, Pandey and Dukkipati, 2016]. However, when generating modalities bi-directionally, the number of required networks is expected to increase exponentially as the number of modalities increases. Moreover, the hidden layers of the networks would not be synchronized during training. Therefore, this simple approach is not an efficient way to achieve our objective.
To generate different modalities, it is important to model the joint representation as probabilistic latent variables. As described above, different modalities have different structures and dimensions, so their relation should not be deterministic. The best-known approach in this probabilistic manner uses deep Boltzmann machines (DBMs) [Srivastava and Salakhutdinov, 2012, Sohn et al., 2014]. However, DBMs are computationally difficult to train, especially on high-dimensional data, because training relies on MCMC.
Variational autoencoders (VAEs) [Kingma and Welling, 2013, Rezende et al., 2014], a family of deep generative models, have the advantage that they can handle higher-dimensional datasets than DBMs because back-propagation can be used to train them. Therefore, we extend VAEs to a model that is able to generate modalities bi-directionally. This extension is extremely simple: as in previous neural network and DBM approaches, the latent variables of the generative models corresponding to each modality are shared. We call this model a joint multimodal variational autoencoder (JMVAE). However, results show that if the input of the high-dimensional modality we want to generate is missing, the latent variable of JMVAE, i.e., the joint representation, collapses, and this modality cannot be generated successfully. Although a method for addressing this difficulty has been proposed [Rezende et al., 2014], we demonstrate that it cannot resolve the difficulty when the missing modality has higher dimensionality than the other.
Therefore, we propose two new models to address the difficulty presented above: JMVAE-kl and JMVAE-h. JMVAE-kl prepares a new single-input encoder for each modality apart from the encoder of JMVAE and reduces the divergence between them. By contrast, JMVAE-h gives its latent variable a stochastic hierarchical structure to prevent collapse. Figure 1 shows that JMVAE-kl can generate modalities with different dimensions, such as images and attributes, bi-directionally.
The main contributions of this paper are described below.
We present a simple extension of VAEs to generate modalities bi-directionally, which we call JMVAE. However, JMVAE cannot generate a high-dimensional modality well if the input of this modality is missing, and a known remedy cannot resolve this issue.
We propose two models, JMVAE-kl and JMVAE-h, which prevent the latent variable from collapsing when a high-dimensional modality is missing. We confirm experimentally that these models resolve this issue.
We demonstrate that these models can generate modalities as well as, or better than, conventional VAEs that generate in only one direction.
We demonstrate that they appropriately obtain a joint representation containing the information of the different modalities, and that they can generate diverse variations of one modality by traversing the latent variables or by changing the value of the other modality.
A common approach to dealing with multiple modalities in deep neural networks is to share the top hidden layers of modality-specific networks. Ngiam et al. [2011] proposed this approach with deep autoencoders and revealed that it can extract better representations than single-modality settings. Srivastava and Salakhutdinov [2012] adapted this idea to deep Boltzmann machines (DBMs) [Salakhutdinov and Hinton, 2009], which are generative models with undirected connections trained by maximizing the joint likelihood of all modalities; this model can therefore generate modalities bi-directionally. Sohn et al. [2014] improved this model to exchange multiple modalities effectively, based on minimizing the variation of information.
Recently, VAEs [Kingma and Welling, 2013, Rezende et al., 2014] have been used to train on such high-dimensional modalities. Several studies [Kingma et al., 2014, Sohn et al., 2015] have examined conditional VAEs (CVAEs), which maximize a conditional log-likelihood by variational methods. Many studies have used CVAEs to train on various pairs of modalities, such as handwritten digits and labels [Kingma et al., 2014, Sohn et al., 2015], object images and degrees of rotation [Kulkarni et al., 2015], facial images and attributes [Larsen et al., 2015, Yan et al., 2015], and natural images and captions [Mansimov et al., 2015]. The main features of CVAEs are that the relation between modalities is one-way and that the latent variable does not include information of the conditioning modality (according to Louizos et al. [2015], this independence might not hold strictly because the encoder in CVAEs retains a dependence), which is unsuitable for our objective. Pandey and Dukkipati [2016] proposed the conditional multimodal autoencoder (CMMA), which also maximizes the conditional log-likelihood but connects the latent variable directly to the conditioning variable, i.e., these variables are not independent. However, CMMA still generates modalities in a single fixed direction.
Another well-known family of deep generative models is generative adversarial nets (GANs) [Goodfellow et al., 2014]. When handling multiple modalities, GANs are also often modeled with generation in one direction, as in conditional GANs [Mirza and Osindero, 2014].
Recently, GANs that can generate images bi-directionally from images have been proposed [Liu and Tuzel, 2016, Liu et al., 2017, Zhu et al., 2017]. These models learn an exact pixel-by-pixel correspondence between modalities of the same dimension and intentionally ignore probabilistic factors. However, no such complete correspondence exists between the different kinds of modalities that we examine in this study. Our methods learn a probabilistic joint representation integrating all modality information, so they can capture a probabilistic relation between modalities.
VAEs with multiple modalities
This section first briefly presents the formulation of VAEs. Subsequently, a simple extension of VAEs to multiple modalities is introduced.
Given observation variables $\mathbf{x}$ and corresponding latent variables $\mathbf{z}$, their generating processes are definable as $\mathbf{z} \sim p(\mathbf{z})$ and $\mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z})$, where $\theta$ is a model parameter of $p$. The objective of VAEs is maximization of the marginal distribution $p(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}$. Because this distribution is intractable, we instead train the model to maximize the following lower bound:

$\mathcal{L}_{VAE}(\mathbf{x}) = -D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})], \quad (1)$

where $q_\phi(\mathbf{z}|\mathbf{x})$ is an approximate distribution of the posterior $p(\mathbf{z}|\mathbf{x})$ and $\phi$ is its model parameter. We designate $q_\phi(\mathbf{z}|\mathbf{x})$ as the encoder and $p_\theta(\mathbf{x}|\mathbf{z})$ as the decoder. In Eq. 1, the first term represents a regularization and the second represents a negative reconstruction error.
To optimize the lower bound with respect to the parameters $\theta$ and $\phi$, we estimate the gradients of Eq. 1 using stochastic gradient variational Bayes (SGVB). If we consider $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))$, where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ are outputs of the encoder, then we can reparameterize $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ to $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Therefore, we can estimate the gradients of the negative reconstruction term in Eq. 1 with respect to $\theta$ and $\phi$ as $\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})}[\nabla_{\theta,\phi} \log p_\theta(\mathbf{x}|\boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon})]$. The gradients of the regularization term are solvable analytically. Therefore, we can optimize Eq. 1 using standard stochastic optimization methods.
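The reparameterization above can be illustrated numerically. The following sketch (scalar case, plain Python rather than an autodiff framework; names are ours) checks that $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ reproduces samples from $\mathcal{N}(\mu, \sigma^2)$:

```python
import math
import random

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, 1); the noise is external,
    # so z is a deterministic (differentiable) function of mu and log_var.
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

rng = random.Random(0)
# Draw many samples for N(mu=2, sigma^2=0.25) and check the moments.
samples = [reparameterize(2.0, math.log(0.25), rng) for _ in range(100000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```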
Joint Multimodal Variational Autoencoders
Next, we examine the dataset $(\mathbf{X}, \mathbf{W}) = \{(\mathbf{x}_i, \mathbf{w}_i)\}_{i=1}^{N}$, where the two modalities $\mathbf{x}$ and $\mathbf{w}$ have dimensions and structures of different kinds (in our experiments, these depend on the dataset; see the settings section). We assume that their generative models are conditionally independent given the same latent variable $\mathbf{z}$, i.e., the joint representation; that is, the latent variables of the generative models corresponding to each modality are shared. Therefore, the generative process becomes $\mathbf{z} \sim p(\mathbf{z})$, $\mathbf{x} \sim p_{\theta_x}(\mathbf{x}|\mathbf{z})$, and $\mathbf{w} \sim p_{\theta_w}(\mathbf{w}|\mathbf{z})$, where $\theta_x$ and $\theta_w$ respectively represent the model parameters of each independent decoder.
Considering an approximate posterior distribution $q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})$, we can estimate a lower bound of the log-likelihood $\log p(\mathbf{x}, \mathbf{w})$ as shown below:

$\mathcal{L}_{JM}(\mathbf{x},\mathbf{w}) = -D_{KL}(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, p(\mathbf{z})) + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})}[\log p_{\theta_x}(\mathbf{x}|\mathbf{z})] + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})}[\log p_{\theta_w}(\mathbf{w}|\mathbf{z})]. \quad (2)$
Eq. 2 has two negative reconstruction terms, one corresponding to each modality. As with VAEs, we designate $q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})$ as the encoder and both $p_{\theta_x}(\mathbf{x}|\mathbf{z})$ and $p_{\theta_w}(\mathbf{w}|\mathbf{z})$ as decoders. This model is a simple extension of VAEs to multiple modalities, and we call it a joint multimodal variational autoencoder (JMVAE). Because each modality has a different feature representation, we use a different network for each decoder.
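As a concrete toy illustration of Eq. 2, the sketch below evaluates the bound for a one-dimensional latent variable with a standard normal prior and Bernoulli decoders for both modalities; for brevity the expectation is taken with a single sample fixed at the encoder mean. All names are ours, and the real model uses deep networks rather than scalar inputs:

```python
import math

def kl_std_normal(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ) per dimension.
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)

def bernoulli_log_prob(x, p):
    # log p(x) for x in {0, 1} under Bernoulli(p).
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def jmvae_elbo(mu, log_var, x, w, px_given_z, pw_given_z):
    # -KL(q(z|x,w) || p(z)) + log p(x|z) + log p(w|z):
    # one reconstruction term per modality, as in Eq. 2.
    return (-kl_std_normal(mu, log_var)
            + bernoulli_log_prob(x, px_given_z)
            + bernoulli_log_prob(w, pw_given_z))
```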
Complement of a missing modality
After training JMVAE, we can extract a joint latent representation at test time by sampling from the encoder. Our objective is to exchange modalities bi-directionally, e.g., images to text and vice versa. In this setting, the modality that we want to sample is missing, so its input is set to zero or to random noise (Figure 2(a)). In discriminative multimodal settings, this is a common means of estimating one modality from another [Ngiam et al., 2011], but it is difficult to handle the missing modality properly.
For VAEs, Rezende et al. [2014] proposed a sophisticated approach of iterative sampling by a Markov chain with a transition kernel to complement missing input values. In the case of JMVAE, the transition kernel when $\mathbf{w}$ is missing is $T(\mathbf{w}'|\mathbf{w}; \mathbf{x}) = \int p_{\theta_w}(\mathbf{w}'|\mathbf{z}) \, q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w}) \, d\mathbf{z}$.
We can now estimate a missing modality by first setting the initial value of $\mathbf{w}$ to random noise, such as a draw from $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and then sampling iteratively according to the above kernel. As the number of iterations increases, the complemented values improve. In this paper, we call this the iterative sampling method.
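The iterative sampling method can be sketched as follows. Here `encode` and `decode_w` are toy linear-Gaussian stand-ins of our own (the actual method uses the trained encoder $q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})$ and decoder $p_{\theta_w}(\mathbf{w}|\mathbf{z})$):

```python
import random

def iterative_sampling(x, encode, decode_w, steps, rng):
    """Complement a missing modality w given x: initialize w with noise,
    then alternate z ~ q(z | x, w) and w ~ p(w | z) for `steps` iterations."""
    w = rng.gauss(0.0, 1.0)           # initial value of w from N(0, 1)
    for _ in range(steps):
        z = encode(x, w, rng)         # z ~ q_phi(z | x, w)
        w = decode_w(z, rng)          # w ~ p_theta_w(w | z)
    return w

# Toy stand-ins: under this linear-Gaussian chain, w drifts toward x.
def encode(x, w, rng):
    return 0.5 * (x + w) + rng.gauss(0.0, 0.1)

def decode_w(z, rng):
    return z + rng.gauss(0.0, 0.1)

w_hat = iterative_sampling(2.0, encode, decode_w, steps=300, rng=random.Random(1))
```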
However, if the missing modality is high-dimensional and complicated, such as natural images, compared with the other modality, then the inferred latent variable might collapse and the generated samples might be incomplete. In the experiments, we show that this difficulty cannot be prevented even with the iterative sampling method.
To resolve this issue, we propose two new models: JMVAE-h and JMVAE-kl.
Recently, some studies have extended the latent variables of VAEs to a stochastic hierarchical structure to improve the expressiveness and likelihood of the models [Burda et al., 2015, Sønderby et al., 2016, Gulrajani et al., 2016]. Such a hierarchical structure of latent variables is more robust against a missing input, so it might help prevent the latent variables from collapsing.
We can extend JMVAE to a hierarchical structure of latent variables. Let the latent variable be a stochastic hierarchy of $L$ layers $\mathbf{z}_1, \dots, \mathbf{z}_L$ (note that stochastic layers differ from the deterministic layers of neural networks). The joint distribution of JMVAE then becomes $p(\mathbf{x}, \mathbf{w}, \mathbf{z}_1, \dots, \mathbf{z}_L) = p_{\theta_x}(\mathbf{x}|\mathbf{z}_1) \, p_{\theta_w}(\mathbf{w}|\mathbf{z}_1) \, p(\mathbf{z}_L) \prod_{l=1}^{L-1} p_\theta(\mathbf{z}_l|\mathbf{z}_{l+1})$, where all conditional distributions are Gaussian and are parameterized by deep neural networks.
Various means are available to decompose the approximate distribution $q(\mathbf{z}_1, \dots, \mathbf{z}_L|\mathbf{x},\mathbf{w})$. We follow Gulrajani et al. [2016], i.e., $q_\phi(\mathbf{z}_1, \dots, \mathbf{z}_L|\mathbf{x},\mathbf{w}) = q_\phi(\mathbf{z}_1|\mathbf{x},\mathbf{w}) \prod_{l=2}^{L} q_\phi(\mathbf{z}_l|\mathbf{z}_{l-1})$, where all conditional distributions are Gaussian. In this study, as in Gulrajani et al. [2016], the structure from the input to the final stochastic layer consists of deterministic mappings (each parameterized by a deep neural network), and the probabilistic output of each stochastic layer is obtained from a deterministic output (see Figure 2(b)).
Therefore, the lower bound of JMVAE with a stochastic hierarchical structure becomes

$\mathcal{L}_{JMh}(\mathbf{x},\mathbf{w}) = \mathbb{E}_{q_\phi(\mathbf{z}_1, \dots, \mathbf{z}_L|\mathbf{x},\mathbf{w})}\left[\log \frac{p_{\theta_x}(\mathbf{x}|\mathbf{z}_1) \, p_{\theta_w}(\mathbf{w}|\mathbf{z}_1) \, p(\mathbf{z}_L) \prod_{l=1}^{L-1} p_\theta(\mathbf{z}_l|\mathbf{z}_{l+1})}{q_\phi(\mathbf{z}_1|\mathbf{x},\mathbf{w}) \prod_{l=2}^{L} q_\phi(\mathbf{z}_l|\mathbf{z}_{l-1})}\right]. \quad (3)$
We call this model JMVAE-h and demonstrate through experiments that it can prevent the missing-modality issue. In our experiments, we fix the number of stochastic layers $L$. Figure 2(b) shows the flow of generating $\mathbf{x}$ from $\mathbf{w}$ with JMVAE-h.
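Inference through the stochastic hierarchy proceeds by ancestral sampling: $\mathbf{z}_1 \sim q(\mathbf{z}_1|\mathbf{x},\mathbf{w})$, then $\mathbf{z}_l \sim q(\mathbf{z}_l|\mathbf{z}_{l-1})$ for the remaining layers. A minimal sketch with toy scalar Gaussian conditionals standing in for the deep networks (names and coefficients are ours):

```python
import random

def hierarchical_encode(x, w, n_layers, rng):
    # z1 ~ q(z1 | x, w): the only layer that sees both modalities.
    z = 0.5 * (x + w) + rng.gauss(0.0, 0.1)
    zs = [z]
    # z_l ~ q(z_l | z_{l-1}) for l = 2..L: each layer conditions
    # only on the previous stochastic layer.
    for _ in range(n_layers - 1):
        z = 0.9 * z + rng.gauss(0.0, 0.1)
        zs.append(z)
    return zs

zs = hierarchical_encode(1.0, 1.0, n_layers=3, rng=random.Random(0))
```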
In the stochastic hierarchical approach described above, the iterative sampling method remains necessary for generating missing modalities; however, it takes time to generate high-dimensional samples. Therefore, we propose JMVAE-kl, a model that generates appropriate missing samples without the iterative sampling method.
Assume that we have encoders with a single input, $q_{\phi_x}(\mathbf{z}|\mathbf{x})$ and $q_{\phi_w}(\mathbf{z}|\mathbf{w})$, where $\phi_x$ and $\phi_w$ are parameters. We would like to train them by reducing the divergence between these encoders and the joint encoder $q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})$ (see Figure 2(c)). Therefore, the objective function of JMVAE-kl becomes

$\mathcal{L}_{JMkl}(\mathbf{x},\mathbf{w}) = \mathcal{L}_{JM}(\mathbf{x},\mathbf{w}) - \left[ D_{KL}(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, q_{\phi_x}(\mathbf{z}|\mathbf{x})) + D_{KL}(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, q_{\phi_w}(\mathbf{z}|\mathbf{w})) \right]. \quad (4)$
Actually, JMVAE-h and JMVAE-kl each have benefits and shortcomings. JMVAE-kl must prepare an encoder for each modality apart from the original encoder, whereas JMVAE-h requires no additional encoder. Conversely, JMVAE-h must apply the iterative sampling method to generate a missing modality, whereas JMVAE-kl does not. Therefore, JMVAE-h is effective when there are three or more modalities and the dimension of the missing modality is not large; JMVAE-kl is effective when there are only two modalities and the missing modality is high-dimensional.
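With diagonal Gaussian encoders, the divergences that JMVAE-kl minimizes have a closed form. A minimal sketch (our naming; the weighting factor `alpha` is an assumption of this illustration, not a quantity defined in the text):

```python
import math

def kl_gauss(mu_q, log_var_q, mu_p, log_var_p):
    # Closed-form KL( N(mu_q, s_q^2) || N(mu_p, s_p^2) ) per dimension.
    return 0.5 * (log_var_p - log_var_q
                  + (math.exp(log_var_q) + (mu_q - mu_p) ** 2)
                  / math.exp(log_var_p)
                  - 1.0)

def jmvae_kl_penalty(q_joint, q_x, q_w, alpha=1.0):
    # Pull the single-input encoders q(z|x) and q(z|w) toward the
    # joint encoder q(z|x,w); each argument is a (mu, log_var) pair.
    mu, lv = q_joint
    return alpha * (kl_gauss(mu, lv, *q_x) + kl_gauss(mu, lv, *q_w))
```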
Table 1: Test conditional log-likelihoods. Upper: MNIST (Label → Image and Image → Label). Lower: CelebA (Attributes → Image and Image → Attributes).
In this section, we confirm the following three points through experiments: (1) the missing-modality difficulty certainly occurs in JMVAE, and our proposed methods can prevent it; (2) our proposed models can generate modalities bi-directionally with likelihood equivalent to (or higher than) that of models that generate a modality in only one direction; (3) our models can appropriately obtain the joint representation, which integrates information from the different modalities.
As described herein, we used two datasets: MNIST and CelebA [Liu et al., 2015].
MNIST is not originally a multimodal dataset. For this work, we used it as a toy problem for various tests to verify our model. We regard the 784-dimensional handwritten digit images and the corresponding 10 digit labels as two modalities. We used 50,000 examples as a training set and the remaining 10,000 as a test set.
CelebA consists of 202,599 color facial images and 40 corresponding binary attributes such as male, eyeglasses, and mustache. For this work, we regard them as two modalities. This dataset is challenging because the modalities have dimensions and structures of completely different kinds. Beforehand, we cropped the images to squares, resized them to 64 × 64, and normalized them. From the dataset, we chose the 191,899 images that are identifiable as facial images using OpenCV and used them for our experiment. We used 90% of the dataset as a training set and the remaining 10% as a test set.
For MNIST, we considered images as $\mathbf{x}$ and the corresponding labels as $\mathbf{w}$. We set $p_{\theta_x}(\mathbf{x}|\mathbf{z})$ as a Bernoulli distribution and $p_{\theta_w}(\mathbf{w}|\mathbf{z})$ as a categorical distribution. We used warm-up [Bowman et al., 2015, Sønderby et al., 2016], which first trains only the negative reconstruction error term and then gradually increases the effect of the regularization term to prevent local minima early in training. We increased this term linearly during the first epochs, as in Sønderby et al. [2016], and then trained on MNIST for a fixed number of epochs. Moreover, as described by Burda et al. [2015] and Sønderby et al. [2016], we randomly resampled the binarized training values from the MNIST images at each epoch to prevent over-fitting.
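The linear warm-up schedule amounts to multiplying the regularization term by a coefficient that ramps from 0 to 1 over the first epochs; a minimal sketch (names are ours):

```python
def warmup_coefficient(epoch, n_warmup):
    # Weight on the KL regularization term: 0 at epoch 0, growing
    # linearly to 1 at epoch n_warmup, then constant at 1.
    return min(1.0, epoch / n_warmup)

# Example schedule over the first epochs with a 10-epoch warm-up.
schedule = [warmup_coefficient(e, 10) for e in range(12)]
```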
For CelebA, we considered facial images as $\mathbf{x}$ and the corresponding attributes as $\mathbf{w}$. We set a Gaussian distribution for the decoder of both modalities, with the variance of the Gaussian fixed to 1. We applied the same warm-up procedure and trained for a fixed number of epochs.
For this experiment, we estimated the test conditional log-likelihood $\log p(\mathbf{x}|\mathbf{w})$ (or $\log p(\mathbf{w}|\mathbf{x})$) to evaluate model performance. This estimate indicates how well a model can generate one modality from the corresponding other modality; therefore, higher is better. We can estimate it as

$\log p(\mathbf{x}|\mathbf{w}) \gtrsim \frac{1}{L} \sum_{l=1}^{L} \log p_{\theta_x}(\mathbf{x}|\mathbf{z}^{(l)}), \quad (5)$

where $\mathbf{z}^{(l)} \sim q(\mathbf{z}|\mathbf{w})$. In Eq. 5, we apply the sample approximation to the lower bound of the log-likelihood rather than to the log-likelihood directly, because the sample approximation of the log-likelihood is biased, whereas this lower-bound estimator is unbiased. Moreover, Burda et al. [2015] show that the sample approximation of the log-likelihood approaches the true log-likelihood as the number of samples increases. We set the number of samples $L$ identically for all experiments.
To estimate the above, we should approximate $q(\mathbf{z}|\mathbf{w})$ (or $q(\mathbf{z}|\mathbf{x})$) and draw samples from it. The procedure depends on how missing modalities are complemented. In the cases of JMVAE and JMVAE-h, we first set the input of the modality that we want to generate as missing, and then apply iterative sampling multiple times to complement it; the log-likelihood estimate for these models therefore depends on the number of iterative sampling steps. Conversely, JMVAE-kl can estimate the approximate distribution $q_{\phi_w}(\mathbf{z}|\mathbf{w})$ (or $q_{\phi_x}(\mathbf{z}|\mathbf{x})$) directly at the training stage. For the approximation of the conditional log-likelihood in JMVAE-h, see the appendix.
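Sample approximations of such (lower bounds on) log-likelihoods are usually computed with a numerically stable log-mean-exp over the per-sample log-weights; a minimal sketch:

```python
import math

def log_mean_exp(log_weights):
    # log( (1/L) * sum_i exp(log_weights[i]) ), computed stably by
    # factoring out the maximum before exponentiating.
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights)
                        / len(log_weights))
```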
Confirmation of the missing modality difficulty
Figure 4(a) presents results of generating $\mathbf{x}$ from $\mathbf{w}$ with JMVAE. The top row, i.e., images generated with only one iteration of sampling, is blurred and not generated properly. As the number of sampling iterations increases, the generated images become somewhat clearer, but it is readily apparent that they do not correspond to their labels. This result demonstrates that the missing modality cannot be generated even with the iterative sampling method. Figure 4(b) presents the result for JMVAE-h. Unlike Figure 4(a), images corresponding to their labels are generated appropriately as the number of sampling iterations increases. Figure 4(c) shows the result for JMVAE-kl, which can appropriately generate digit images conditioned on labels without iterative sampling.
Figure 5 presents the conditional log-likelihood of each model as the number of sampling iterations changes. As the number of iterations increases, the log-likelihood increases for both JMVAE and JMVAE-h. However, for the plain JMVAE, the likelihood remains much lower than that of JMVAE-kl no matter how many iterations are used. For JMVAE-h, the likelihood increases with fewer iterations than JMVAE requires, and the final likelihood is higher than that of the plain JMVAE. By contrast, JMVAE-kl obtains a high likelihood without iterative sampling.
Evaluation of conditional log-likelihood
This section evaluates our models with the conditional log-likelihood to quantify the bi-directional generation of modalities. Additionally, we compare against conventional conditional VAEs, namely CVAE [Kingma et al., 2014, Sohn et al., 2015] and CMMA [Pandey and Dukkipati, 2016]. Note that these models cannot generate modalities bi-directionally, so they must be trained separately for each direction.
The upper part of Table 1 presents the test conditional log-likelihoods of JMVAE, our two improved models, and the conventional conditional VAEs. The number of sampling iterations for JMVAE and JMVAE-h was set to 10. First, when generating an image from a label, JMVAE-kl and JMVAE-h improve the likelihood compared with JMVAE, as expected. In addition, these models attain higher likelihood than the existing VAEs that model generation in only one direction. Next, when generating labels from images, there is little difference in likelihood among the models compared with generation in the opposite direction. This means that the missing-modality difficulty does not occur when we generate a low-dimensional modality from a high-dimensional one. Note that the conditional-likelihood value of JMVAE-h might be underestimated relative to the other models because its approximation is a looser lower bound (see the appendix for details). In light of this fact, both JMVAE-kl and JMVAE-h can generate in both directions with likelihood equal to or higher than that of conventional conditional VAEs.
Visualization of the joint representation
Figure 3 presents a visualization of the joint representation in each model. For JMVAE-h, we sampled at the top stochastic layer. Here, we sampled from the trained encoders with the dimensionality of the latent variable set to 2.
Specifically examining the left panel of each model's results in Figure 3, the samples from all models are distributed in the latent space separately for each label, which indicates that a joint representation including both modalities is obtained.
Next, we examine the right panels in Figure 3. For JMVAE, all samples are distributed in a considerably small area irrespective of their labels. This result demonstrates that, when the image information is missing, the latent representation indeed collapses. By contrast, the result for JMVAE-kl shows a latent representation that is almost unchanged from sampling with all modalities; because the samples do not gather in a small area as with JMVAE, JMVAE-kl prevents the latent representation from collapsing. Finally, for JMVAE-h, the samples are distributed separately for each label, just as in the left panel. Accordingly, the collapse of the latent representation is prevented in JMVAE-h as well.
Confirmation of the missing modality difficulty
Figure 6 presents the results of generating facial images from attributes. From Figure 6(b), we can see that the plain JMVAE cannot generate appropriate facial images from attributes at all. In addition, unlike on MNIST, the facial images largely collapse as the number of sampling iterations increases. This suggests that the greater the difference in the amount of information between modalities, the more severe the missing-modality difficulty becomes. By contrast, JMVAE-h improves the facial image quality and generates better images as the number of sampling iterations increases (Figure 6(c)). However, these generated images do not correspond closely to the attributes; specifically, no generated image shows a subject wearing eyeglasses. This is probably because, as the stochastic hierarchy deepens, the gaps separating the modes corresponding to the attributes become smaller, which makes mixing in the latent space easier. Finally, for JMVAE-kl, facial images corresponding to the attributes are generated with a single sampling step (Figure 6(d)).
Evaluation of the conditional log-likelihood
The lower part of Table 1 shows the test conditional log-likelihoods of all models. The number of sampling iterations for JMVAE and JMVAE-h was set to 40. First, when generating facial images from attributes, the likelihoods of JMVAE-kl and JMVAE-h are improved considerably compared with that of JMVAE. This shows that, even when the dimensions of the modalities differ greatly, our improved models prevent the missing-modality difficulty. Furthermore, as with the MNIST results, our improved models can generate in both directions with likelihood equal to or higher than that of conventional VAEs.
Generating faces from attributes and the joint representation on CelebA
Next, we confirm that our model can obtain the joint representation appropriately. Based on the results presented above, we used JMVAE-kl in the remaining experiments. Moreover, we combined JMVAE-kl with a GAN to generate clearer images: we treated the decoder network $p_{\theta_x}(\mathbf{x}|\mathbf{z})$ as the generator of the GAN and optimized the GAN loss together with the lower bound of JMVAE-kl, the same approach as the VAE-GAN model [Larsen et al., 2015]. See the appendix for the network structure of the discriminator of this GAN.
Figure 7(a) portrays faces generated conditioned on various attributes. The results show that we can generate an average face for each attribute as well as various random faces conditioned on certain attributes. Figure 7(b) shows that these samples gather by attribute; within the group for each attribute, the average face is positioned almost at the center, with the random facial images arranged around it. Furthermore, these arrangements are almost the same for both Base and Not Male. These results indicate that manifold learning of the joint representation works well.
Bi-directional generation between faces and attributes on CelebA
Finally, we demonstrate that our model can perform bi-directional generation between faces and attributes. Figure 9 shows that JMVAE-kl can generate both attributes and images modified according to various attributes from images that had no attribute information. This is possible because, as shown above, JMVAE-kl can properly obtain a joint representation integrating the different modalities.
In this paper, we introduced an extension of VAEs that generates modalities bi-directionally, which we call JMVAE. However, results show that JMVAE cannot generate a high-dimensional modality properly when the input of this modality is missing, and that the known remedy, the iterative sampling method, cannot resolve this issue. We proposed two models, JMVAE-kl and JMVAE-h, to prevent this missing-modality difficulty. Results demonstrated that these proposed models prevent the latent variable from collapsing and can generate modalities bi-directionally with quality equal to or higher than that of models that generate in only one direction. Furthermore, because these methods appropriately obtain the joint representation, diverse samples can be generated corresponding to a given modality or to a modified modality.
- Ngiam et al.  Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
- Srivastava and Salakhutdinov  Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012.
- Kingma et al.  Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
- Sohn et al.  Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
- Pandey and Dukkipati  Gaurav Pandey and Ambedkar Dukkipati. Variational methods for conditional multimodal learning: Generating human faces from attributes. arXiv preprint arXiv:1603.01801, 2016.
- Sohn et al.  Kihyuk Sohn, Wenling Shang, and Honglak Lee. Improved multimodal deep learning with variation of information. In Advances in Neural Information Processing Systems, pages 2141–2149, 2014.
- Kingma and Welling  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Rezende et al.  Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- Salakhutdinov and Hinton  Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In AISTATS, volume 1, page 3, 2009.
- Kulkarni et al.  Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
- Larsen et al.  Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
- Yan et al.  Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. arXiv preprint arXiv:1512.00570, 2015.
- Mansimov et al.  Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015.
- Louizos et al.  Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair auto encoder. arXiv preprint arXiv:1511.00830, 2015.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- Mirza and Osindero  Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Liu and Tuzel  Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
- Liu et al.  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
- Zhu et al.  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
- Burda et al.  Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- Sønderby et al.  Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. arXiv preprint arXiv:1602.02282, 2016.
- Gulrajani et al.  Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
- Liu et al.  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
- Bowman et al.  Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
- Kingma and Ba  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Team et al.  The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
- Dieleman et al.  Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Daniel Maturana, Martin Thoma, Eric Battenberg, Jack Kelly, Jeffrey De Fauw, Michael Heilman, Diogo Moitinho de Almeida, Brian McFee, Hendrik Weideman, Gábor Takács, Peter de Rivaz, Jon Crall, Gregory Sanders, Kashif Rasul, Cong Liu, Geoffrey French, and Jonas Degrave. Lasagne: First release., August 2015. URL http://dx.doi.org/10.5281/zenodo.27878.
Appendix A The network architectures
Parameterization of distributions with deep neural networks
The Gaussian distribution can be parameterized with deep neural networks as $\boldsymbol{\mu} = f_{\mu}(g(\cdot))$ and $\boldsymbol{\sigma}^2 = \mathrm{softplus}(f_{\sigma}(g(\cdot)))$, where $f_{\mu}$ and $f_{\sigma}$ are linear single-layer neural networks and $g$ denotes a deep neural network with an arbitrary number of layers. Here, the softplus function is applied to each element of the vector.
In the case of the categorical distribution, we can parameterize it as $\boldsymbol{\pi} = \mathrm{softmax}(f_{\pi}(g(\cdot)))$, where softmax denotes the softmax function.
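The parameterization above can be sketched for scalar hidden units as follows (weights and function names are illustrative, not the paper's):

```python
import math

def softplus(a):
    # softplus(a) = log(1 + exp(a)); strictly positive, so it is a
    # valid output activation for a variance parameter.
    return math.log1p(math.exp(a))

def gaussian_params(h, w_mu, b_mu, w_sigma, b_sigma):
    # Linear layer for the mean; linear layer + softplus for the variance.
    mu = w_mu * h + b_mu
    var = softplus(w_sigma * h + b_sigma)
    return mu, var

mu, var = gaussian_params(1.0, 2.0, 0.0, 0.0, 0.0)
```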
For the notation of model structures, we denote a linear fully-connected layer with $k$ units and ReLU as DkR, and DkR without ReLU as Dk. In addition, the process of applying J after I is denoted as I-J, and the process of concatenating the last layers of the two networks I and J into one layer is denoted as (I, J).
Therefore, the network structures of encoders and decoders on MNIST are as follows.
and : D64
: (D512R-D512R, D512R-D512R)
, , and (Gaussian)
and : D64
and : D64
We denote a convolutional layer (with the given filter size and $k$ channels) with batch normalization and ReLU as CkBR, and CkBR without ReLU as CkB. In addition, we denote a deconvolutional layer (with the given filter size, $k$ channels, and crop 2) with batch normalization and ReLU as DCkBR, DCkBR without ReLU as DCkB, and DCkB without batch normalization as DCk. Further, a linear fully-connected layer with $k$ units, ReLU, and batch normalization is denoted as DkBR, and a flatten layer is denoted as F.
Using the above notations, the model structures on CelebA are as follows.
and : D128
: (C64R-C128BR-C256BR-C256BR-F, D512R-D512BR)-D1024R
and : D128
and : D128
and : D128
(Gaussian with fixed variance)
(Gaussian with fixed variance)
and : D128
: (C64R-C128BR-C256BR-C256BR-F, D512R-D512BR)-D1024R-D64-D512R-D512R
Moreover, we set C64R-C128BR-C256BR-C256BR-F-D1024R-D1S (where DkS denotes Dk with a sigmoid function) as the network structure of the discriminator used in the CelebA experiment.
Appendix B Conditional log-likelihood of JMVAE-h
We can estimate the conditional log-likelihood of JMVAE-h analogously to Eq. 5, drawing the hierarchical latent variables from the approximate posterior by ancestral sampling (Eq. 6). We set the number of samples identically for all experiments.
We find that the approximation of Eq. 6 gives a looser lower bound than that of JMVAE. Therefore, the conditional-likelihood value of JMVAE-h might be underestimated compared with those of the other models.