We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies have handled multiple modalities with deep generative models such as variational autoencoders (VAEs). However, these models typically assume that the modalities have a conditioned relation, i.e., they can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. We therefore propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on a joint representation. In other words, it models a joint distribution of modalities. Furthermore, to generate missing modalities properly from the remaining ones, we develop an additional method, JMVAE-kl, which is trained by reducing the divergence between the JMVAE's encoder and prepared networks of the respective modalities. Our experiments show that the proposed method obtains an appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that the JMVAE can generate multiple modalities bi-directionally.
In our world, information is represented through various modalities. While images are represented by pixel information, they can also be described with text or tag information. People often exchange such information bi-directionally. For instance, we can not only imagine what “a young female with a smile who does not wear glasses” looks like, but also add this caption to a corresponding photograph. To do so, it is important to extract a joint representation that captures high-level concepts among all modalities. Then we can generate modalities bi-directionally through the joint representation. However, each modality typically has a different kind of dimension and structure, e.g., images (real-valued and dense) and texts (discrete and sparse). Therefore, the relations between each modality and the joint representation might be highly nonlinear. To discover such relations, deep neural network architectures have been used widely for multimodal learning (Ngiam et al., 2011; Srivastava & Salakhutdinov, 2012). The common approach to learning joint representations with these models is to share the top hidden layers of modality-specific networks. Among them, generative approaches using deep Boltzmann machines (DBMs) (Srivastava & Salakhutdinov, 2012; Sohn et al., 2014) offer the important advantage that they can generate modalities bi-directionally.
Recently, variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) have been proposed to estimate flexible deep generative models by variational inference methods. These models use back-propagation during training, so they can be trained on larger-scale and higher-dimensional datasets than DBMs, which rely on MCMC training. Some studies have addressed handling such large-scale and high-dimensional modalities with VAEs, but they are restricted to modeling conditional distributions (Kingma et al., 2014; Sohn et al., 2015; Pandey & Dukkipati, 2016). Therefore, they can only generate modalities in one direction. For example, we cannot obtain generated images from texts if we train the likelihood of texts given images. To generate modalities bi-directionally, all modalities should be treated equally under the learned joint representation, as in the multimodal learning models that predate VAEs.
In this paper, we develop a novel multimodal learning model with VAEs, which we call a joint multimodal variational autoencoder (JMVAE). The most significant feature of our model is that all modalities, $\mathbf{x}$ and $\mathbf{w}$ (e.g., images and texts), are conditioned independently on a latent variable $\mathbf{z}$ corresponding to the joint representation, i.e., the JMVAE models a joint distribution of all modalities, $p(\mathbf{x}, \mathbf{w})$. Therefore, we can extract a high-level representation that contains all information of the modalities. Moreover, since it models a joint distribution, we can draw samples from both $p(\mathbf{x}|\mathbf{w})$ and $p(\mathbf{w}|\mathbf{x})$. Because the modalities that we want to generate are usually missing at that time, the inferred latent variable becomes incomplete, and generated samples might collapse at test time when the missing modalities are high-dimensional and complicated. To prevent this issue, we propose a method, which we call JMVAE-kl, of preparing new encoders for each modality, $q(\mathbf{z}|\mathbf{x})$ and $q(\mathbf{z}|\mathbf{w})$, and reducing the divergence between them and the multimodal encoder $q(\mathbf{z}|\mathbf{x}, \mathbf{w})$. This contributes to more effective bi-directional generation of modalities, e.g., from face images to texts (attributes) and vice versa (see Figure 1).
The main contributions of this paper are as follows:
We introduce a joint multimodal variational autoencoder (JMVAE), which is the first study to train a joint distribution of modalities with VAEs.
We propose an additional method (JMVAE-kl), which prevents generated samples from collapsing when some modalities are missing. We experimentally confirm that this method solves this issue.
We show qualitatively and quantitatively that the JMVAE can extract an appropriate joint distribution and that it can generate and reconstruct modalities as well as or better than conventional VAEs.
We demonstrate that the JMVAE can generate multiple modalities bi-directionally even if these modalities have completely different kinds of dimensions and structures, e.g., high-dimensional color face images and low-dimensional binary attributes.
The common approach to multimodal learning with deep neural networks is to share the top hidden layers of modality-specific networks. Ngiam et al. (2011) proposed this approach with deep autoencoders (AEs) and found that it can extract better representations than single-modality settings. Srivastava & Salakhutdinov (2012) adopted the same idea but used deep Boltzmann machines (DBMs) (Salakhutdinov & Hinton, 2009). DBMs are generative models with undirected connections, trained by maximizing the joint likelihood of all modalities. Therefore, this model can generate modalities bi-directionally. Sohn et al. (2014) improved this model to exchange multiple modalities effectively by minimizing the variation of information; our JMVAE-kl can be regarded as minimizing the same quantity by variational inference methods (see Section 3.3 and Appendix A). However, it is computationally difficult to train DBMs on high-dimensional data such as natural images because of MCMC training.
Recently, VAEs (Kingma & Welling, 2013; Rezende et al., 2014) have been used to train on such high-dimensional modalities. Kingma et al. (2014) and Sohn et al. (2015) proposed conditional VAEs (CVAEs), which maximize a conditional log-likelihood by variational methods. Many studies build on CVAEs to train on various pairs of modalities, such as handwritten digits and labels (Kingma et al., 2014; Sohn et al., 2015), object images and degrees of rotation (Kulkarni et al., 2015), face images and attributes (Larsen et al., 2015; Yan et al., 2015), and natural images and captions (Mansimov et al., 2015). The main features of CVAEs are that the relation between modalities is one-way and that a latent variable does not contain the information of a conditioned modality, which makes them unsuitable for our objective. (According to Louizos et al. (2015), this independence might not be satisfied strictly because the encoder in CVAEs still has the dependence.)
Pandey & Dukkipati (2016) proposed a conditional multimodal autoencoder (CMMA), which also maximizes the conditional log-likelihood. The difference from CVAEs is that a latent variable is connected directly from a conditional variable, i.e., these variables are not independent. Moreover, this model forces the latent representation from an input to be close to the joint representation from multiple inputs, which is similar to JMVAE-kl. However, the CMMA still assumes that modalities are generated in a fixed direction. This is the main difference from our model.
This section first introduces the algorithm of VAEs briefly and then proposes a novel multimodal learning model with VAEs, which we call the joint multimodal variational autoencoder (JMVAE).
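Given observation variables $\mathbf{x}$ and corresponding latent variables $\mathbf{z}$, their generating processes are definable as $\mathbf{z} \sim p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z})$, where $\theta$ is the model parameter of $p$. The objective of VAEs is maximization of the marginal distribution $p(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$. Because this distribution is intractable, we instead train the model to maximize the following lower bound of the marginal log-likelihood $\log p(\mathbf{x})$:

$$\mathcal{L}(\mathbf{x}) = -D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) + E_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] \tag{1}$$

where $q_\phi(\mathbf{z}|\mathbf{x})$ is an approximate distribution of the posterior $p(\mathbf{z}|\mathbf{x})$ and $\phi$ is the model parameter of $q$. We designate $q_\phi(\mathbf{z}|\mathbf{x})$ as the encoder and $p_\theta(\mathbf{x}|\mathbf{z})$ as the decoder. In Equation 1, the first term represents a regularization and the second represents a negative reconstruction error.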
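To optimize the lower bound with respect to the parameters $\theta$ and $\phi$, we estimate gradients of Equation 1 using stochastic gradient variational Bayes (SGVB). If we consider $q_\phi(\mathbf{z}|\mathbf{x})$ as a Gaussian distribution $\mathcal{N}(\mathbf{z}; \boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))$, where $\phi = \{\boldsymbol{\mu}, \boldsymbol{\sigma}^2\}$, then we can reparameterize $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ to $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Therefore, we can estimate the gradients of the negative reconstruction term in Equation 1 with respect to $\theta$ and $\phi$ as $\nabla_{\theta,\phi} E_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] = E_{\mathcal{N}(\boldsymbol{\epsilon}; \mathbf{0}, \mathbf{I})}[\nabla_{\theta,\phi} \log p_\theta(\mathbf{x}|\boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon})]$. Because the gradients of the regularization term can be computed analytically, we can optimize Equation 1 with standard stochastic optimization methods.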
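The two ingredients of SGVB described above, the reparameterized sample and the closed-form regularization term for a diagonal Gaussian encoder against a standard normal prior, can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation; the function names and the choice to parameterize the variance as `log_var` are assumptions made for this example.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# Toy encoder outputs for one data point with a 2-dimensional latent variable.
rng = np.random.default_rng(0)
mu = np.array([[0.5, -1.0]])
log_var = np.log(np.array([[1.0, 0.25]]))

z = reparameterize(mu, log_var, rng)      # differentiable in mu and log_var
kl = kl_to_standard_normal(mu, log_var)   # regularization term of Equation 1
print(z.shape, kl)
```

Because the noise `eps` is drawn independently of the encoder outputs, an automatic differentiation framework can propagate gradients through `z` to the encoder parameters, which is what makes the estimator usable with back-propagation.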
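Next, we consider a dataset $(\mathbf{X}, \mathbf{W}) = \{(\mathbf{x}_1, \mathbf{w}_1), \ldots, (\mathbf{x}_N, \mathbf{w}_N)\}$, where two modalities $\mathbf{x}$ and $\mathbf{w}$ have different kinds of dimensions and structures (in our experiments, these depend on the dataset; see Section 4.2). Our objective is to generate two modalities bi-directionally. For that reason, we assume that they are conditioned independently on the same latent concept $\mathbf{z}$: the joint representation. Therefore, we assume their generating processes as $\mathbf{z} \sim p(\mathbf{z})$ and $\mathbf{x}, \mathbf{w} \sim p(\mathbf{x}, \mathbf{w}|\mathbf{z}) = p_{\theta_x}(\mathbf{x}|\mathbf{z})\, p_{\theta_w}(\mathbf{w}|\mathbf{z})$, where $\theta_x$ and $\theta_w$ represent the model parameters of each independent $p$. Figure 2(a) shows a graphical model that represents the generative processes. One can see that this models the joint distribution of all modalities, $p(\mathbf{x}, \mathbf{w})$. Therefore, we designate this model as a joint multimodal variational autoencoder (JMVAE).
Considering an approximate posterior distribution as $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})$, we can estimate a lower bound of the log-likelihood $\log p(\mathbf{x}, \mathbf{w})$ as follows:

$$\mathcal{L}_{JM}(\mathbf{x}, \mathbf{w}) = E_{q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{w}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})}\right] \tag{2}$$

$$= -D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w}) \,\|\, p(\mathbf{z})\right) + E_{q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})}\left[\log p_{\theta_x}(\mathbf{x}|\mathbf{z})\right] + E_{q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})}\left[\log p_{\theta_w}(\mathbf{w}|\mathbf{z})\right] \tag{3}$$

Equation 3 has two negative reconstruction terms, one for each modality. As with VAEs, we designate $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})$ as the encoder and both $p_{\theta_x}(\mathbf{x}|\mathbf{z})$ and $p_{\theta_w}(\mathbf{w}|\mathbf{z})$ as decoders.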
We can apply the SGVB to Equation 3 just as to Equation 1, so that we can parameterize the encoder and decoders as deterministic deep neural networks and optimize them with respect to their parameters, $\phi$, $\theta_x$, and $\theta_w$. Because each modality has a different feature representation, we should set different networks for each decoder, $p_{\theta_x}(\mathbf{x}|\mathbf{z})$ and $p_{\theta_w}(\mathbf{w}|\mathbf{z})$. The type of distribution and the corresponding network architecture depend on the representation of each modality, e.g., a Gaussian when the representation of the modality is continuous and a Bernoulli when it is binary.
Unlike original VAEs and CVAEs, the JMVAE models the joint distribution of all modalities. In this model, modalities are conditioned independently on a joint latent variable. Therefore, we can extract a better representation that includes all information of the modalities. Moreover, we can estimate both marginal and conditional distributions in both directions, so that we can not only reconstruct images from themselves but also draw texts from corresponding images and vice versa. Additionally, we can extend JMVAEs to handle more than two modalities in the same learning framework.
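As a rough illustration of the JMVAE lower bound with one reconstruction term per modality, the following NumPy sketch computes a single-sample Monte Carlo estimate for a binary modality `x` (Bernoulli decoder) and a one-hot modality `w` (categorical decoder). The linear toy encoder and decoders, their sizes, and all function names are assumptions chosen only to keep the example self-contained; they do not reflect the architectures used in the experiments.

```python
import numpy as np

def bernoulli_log_lik(x, logits):
    """log p(x|z) for binary x, computed stably from decoder logits."""
    return np.sum(x * logits - np.logaddexp(0.0, logits), axis=-1)

def categorical_log_lik(w_onehot, logits):
    """log p(w|z) for one-hot w, using a log-softmax over decoder logits."""
    log_probs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    return np.sum(w_onehot * log_probs, axis=-1)

def jmvae_lower_bound(x, w, mu, log_var, dec_x, dec_w, rng):
    """Single-sample estimate: -KL(q(z|x,w) || p(z)) + log p(x|z) + log p(w|z)."""
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
    return -kl + bernoulli_log_lik(x, dec_x(z)) + categorical_log_lik(w, dec_w(z))

rng = np.random.default_rng(0)
x = (rng.random((4, 6)) > 0.5).astype(float)   # 4 binary "images" of size 6
w = np.eye(3)[rng.integers(0, 3, size=4)]      # 4 one-hot labels, 3 classes

# Hypothetical linear joint encoder: concat(x, w) -> (mu, log_var), 2-d latent.
W_enc = 0.1 * rng.standard_normal((9, 4))
h = np.concatenate([x, w], axis=-1) @ W_enc
mu, log_var = h[:, :2], h[:, 2:]

# Hypothetical linear decoders, one per modality.
W_x = rng.standard_normal((2, 6))
W_w = rng.standard_normal((2, 3))
bound = jmvae_lower_bound(x, w, mu, log_var,
                          lambda z: z @ W_x, lambda z: z @ W_w, rng)
print(bound)
```

Both decoders read the same latent sample `z`, which is how the sketch mirrors the assumption that the modalities are conditionally independent given the joint representation.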
In the JMVAE, we can extract joint latent features by sampling from the encoder $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})$ at test time. Our objective is to exchange modalities bi-directionally, e.g., images to texts and vice versa. In this setting, the modalities that we want to sample are missing, so the inputs of such modalities are set to zero (the left panel of Figure 2(b)). The same is true of reconstructing a modality only from itself. This is a natural way in discriminative multimodal settings to estimate samples from unimodal information (Ngiam et al., 2011). However, if the missing modalities are high-dimensional and complicated, such as natural images, then the inferred latent variable becomes incomplete and generated samples might collapse.
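The zero-filling of a missing modality can be pictured with the short sketch below, which builds the input to a joint encoder from whichever modalities are available. The helper name `zero_fill` and the use of `None` to mark a missing modality are conventions invented for this example, not part of the original method description.

```python
import numpy as np

def zero_fill(x, w, x_dim, w_dim):
    """Build the joint-encoder input, replacing a missing modality (None) by zeros."""
    if x is None and w is None:
        raise ValueError("at least one modality must be observed")
    n = len(x) if x is not None else len(w)
    x_in = np.zeros((n, x_dim)) if x is None else np.asarray(x, dtype=float)
    w_in = np.zeros((n, w_dim)) if w is None else np.asarray(w, dtype=float)
    return np.concatenate([x_in, w_in], axis=-1)

# Only the label modality w is observed; the image modality x is missing.
w_only = np.eye(3)[[0, 2]]
joint_input = zero_fill(None, w_only, x_dim=6, w_dim=3)
print(joint_input.shape)
```

The encoder then sees an all-zero block where the image would be, which is exactly the situation in which the text above warns that the inferred latent variable can become unreliable.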
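We propose a method to solve this issue, which we designate as JMVAE-kl. We refer to the former approach as JMVAE-zero to distinguish it. Suppose that we have encoders with a single input, $q_{\phi_x}(\mathbf{z}|\mathbf{x})$ and $q_{\phi_w}(\mathbf{z}|\mathbf{w})$, where $\phi_x$ and $\phi_w$ are parameters. We would like to train them by bringing them close to the encoder $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})$ (the right panel of Figure 2(b)). Therefore, the objective function of JMVAE-kl becomes

$$\mathcal{L}_{JM_{kl}(\alpha)}(\mathbf{x}, \mathbf{w}) = \mathcal{L}_{JM}(\mathbf{x}, \mathbf{w}) - \alpha \left[ D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w}) \,\|\, q_{\phi_x}(\mathbf{z}|\mathbf{x})\right) + D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w}) \,\|\, q_{\phi_w}(\mathbf{z}|\mathbf{w})\right) \right] \tag{4}$$

where $\mathcal{L}_{JM}(\mathbf{x}, \mathbf{w})$ is the JMVAE lower bound of Equation 3 and $\alpha$ is a factor that regulates the KL divergence terms.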
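When all three encoders output diagonal Gaussians, the two divergence terms of JMVAE-kl have a closed form. The sketch below shows that computation under this Gaussian assumption; the function names and the `(mu, log_var)` tuple convention are illustrative choices, and the value passed as `l_jm` stands in for a JMVAE lower bound computed elsewhere.

```python
import numpy as np

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))), summed over dims."""
    return 0.5 * np.sum(
        log_var_p - log_var_q
        + (np.exp(log_var_q) + (mu_q - mu_p) ** 2) / np.exp(log_var_p)
        - 1.0,
        axis=-1,
    )

def jmvae_kl_objective(l_jm, joint, enc_x, enc_w, alpha):
    """JMVAE bound minus alpha times the two KL terms to the single-input encoders.

    joint, enc_x and enc_w are (mu, log_var) tuples of diagonal Gaussians.
    """
    mu, log_var = joint
    penalty = (kl_diag_gaussians(mu, log_var, *enc_x)
               + kl_diag_gaussians(mu, log_var, *enc_w))
    return l_jm - alpha * penalty

joint = (np.array([[0.0, 0.0]]), np.array([[0.0, 0.0]]))
enc_x = (np.array([[1.0, 0.0]]), np.array([[0.0, 0.0]]))   # mean shifted by 1
enc_w = (np.array([[0.0, 0.0]]), np.array([[0.0, 0.0]]))   # identical to joint
objective = jmvae_kl_objective(np.array([-100.0]), joint, enc_x, enc_w, alpha=0.1)
print(objective)
```

A larger `alpha` pulls the single-input encoders harder toward the joint encoder at the cost of the reconstruction and prior terms inside `l_jm`.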
From another viewpoint, maximizing Equation 4 can be regarded as minimizing the variation of information (VI) by variational inference methods (proven and derived in Appendix A). The VI, a measure of the distance between two variables, is written as $-E_{p_D(\mathbf{x}, \mathbf{w})}[\log p(\mathbf{x}|\mathbf{w}) + \log p(\mathbf{w}|\mathbf{x})]$, where $p_D$ is the data distribution. It is apparent that the VI is the sum of two negative conditional log-likelihoods. Therefore, minimizing the VI contributes to appropriate bi-directional exchange of modalities. Sohn et al. (2014) also train their model to minimize the VI for the same objective as ours. However, they use DBMs with MCMC training.
This section presents evaluation of the qualitative and quantitative performance and confirms the JMVAE functionality in practice.
As described herein, we used two datasets: MNIST and CelebA (Liu et al., 2015).
MNIST is not a multimodal dataset. In this work, we used it as a toy problem for multimodal learning, treating handwritten digit images and the corresponding digit labels as two different modalities. We used 50,000 examples as the training set and the remaining 10,000 as the test set.
CelebA consists of 202,599 color face images and 40 corresponding binary attributes such as male, eyeglasses, and mustache. In this work, we regard them as two modalities. This dataset is challenging because the two modalities have completely different kinds of dimensions and structures. Beforehand, we cropped the images to squares, resized them to 64 × 64, and normalized them. From the dataset, we chose the 191,899 images in which a face could be identified by OpenCV and used them for our experiment. We used 90% of these images as the training set and the remaining 10% as the test set.
For MNIST, we considered images as $\mathbf{x}$ and corresponding labels as $\mathbf{w}$. For the encoder, we prepared two networks, each with two dense layers of 512 hidden units using leaky rectifiers, shared the top of each network, and mapped them into 64 hidden units. For the decoder, we prepared two networks, each with three dense layers of 512 units, and set $p(\mathbf{x}|\mathbf{z})$ as Bernoulli and $p(\mathbf{w}|\mathbf{z})$ as a categorical distribution whose output layer is a softmax. We used warm-up (Bowman et al., 2015; Sønderby et al., 2016), which first forces training only on the negative reconstruction error term and then gradually increases the effect of the regularization term to prevent local minima during early training. We increased this term linearly during the first epochs as with Sønderby et al. (2016). We set and trained for epochs on MNIST. Moreover, as in Burda et al. (2015) and Sønderby et al. (2016), we resampled the binarized training values randomly from MNIST for each epoch to prevent over-fitting.
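The linear warm-up described above can be sketched as a weight on the regularization term that grows from 0 to 1 over an initial number of epochs. The parameter name `n_warmup` and the values used in the usage lines are placeholders for illustration; they are not the settings used in the experiments.

```python
def warmup_weight(epoch, n_warmup):
    """Linear weight on the KL term: 0 at epoch 0, 1 from epoch n_warmup onward."""
    if n_warmup <= 0:
        return 1.0
    return min(1.0, epoch / n_warmup)

def warmed_up_bound(recon_log_lik, kl, epoch, n_warmup):
    """Lower bound with the regularization term scaled by the warm-up weight."""
    return recon_log_lik - warmup_weight(epoch, n_warmup) * kl

# Placeholder numbers: halfway through a 10-epoch warm-up the KL term counts half.
print(warmup_weight(5, 10))
print(warmed_up_bound(-100.0, 20.0, 5, 10))
```

At epoch 0 the objective reduces to the reconstruction term alone, and after `n_warmup` epochs it coincides with the unmodified lower bound.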
For CelebA, we considered face images as $\mathbf{x}$ and corresponding attributes as $\mathbf{w}$. For the encoder, we prepared two networks (four convolutional layers and a flatten layer for $\mathbf{x}$, and two dense layers for $\mathbf{w}$) with ReLU, shared the top of each network, and mapped them into 128 units. For the decoder, we prepared two networks, with a dense layer and four deconvolutional layers for $\mathbf{x}$ and three dense layers for $\mathbf{w}$, and set a Gaussian distribution for the decoder of both modalities, where the variance of the Gaussian was fixed to 1 for the decoder of $\mathbf{w}$. In the CelebA settings, we combined the JMVAE with generative adversarial networks (GANs) (Goodfellow et al., 2014) to generate clearer images. We considered the network of $p(\mathbf{x}|\mathbf{z})$ as the generator in the GAN and optimized the GAN loss together with the lower bound of the JMVAE, in the same way as a VAE-GAN model (Larsen et al., 2015). We refer to this model as JMVAE-GAN. We set and trained for epochs on CelebA.
For this experiment, we estimated the test log-likelihood to evaluate the performance of the model. This estimate roughly corresponds to the negative reconstruction error, so higher is better. From this performance, we can judge not only whether the JMVAE can generate samples properly but also whether it can obtain a joint representation properly. If the log-likelihood of a modality is low, the representation for this modality might be hurt by the other modalities. By contrast, if it is the same as or higher than that of a model trained on a single modality, then the other modalities contribute to obtaining an appropriate representation.
We estimate the test marginal log-likelihood and the test conditional log-likelihood of the JMVAE. We compare the test marginal log-likelihood against VAEs (Kingma & Welling, 2013; Rezende et al., 2014) and the test conditional log-likelihood against CVAEs (Kingma et al., 2014; Sohn et al., 2015) and CMMAs (Pandey & Dukkipati, 2016). On CelebA, we combine all competing models with a GAN and refer to them as VAE-GAN, CVAE-GAN, and CMMA-GAN. For fairness, the architectures and parameters of these competing models were set to be as close as possible to those of the JMVAE.
We calculate the importance weighted estimator (Burda et al., 2015) from lower bounds at test time because we would like to estimate the true test log-likelihood from lower bounds. To estimate the test marginal log-likelihood $\log p(\mathbf{x})$ of the JMVAE, we use two possible lower bounds: sampling from the joint encoder $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})$ or from the single-input encoder $q_{\phi_x}(\mathbf{z}|\mathbf{x})$. We refer to the former as the multiple lower bound and the latter as the single lower bound. When we estimate the test conditional log-likelihood $\log p(\mathbf{x}|\mathbf{w})$, we also use two lower bounds, each of which is estimated by sampling from $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})$ (multiple) or $q_{\phi_w}(\mathbf{z}|\mathbf{w})$ (single) (see Appendix B for more details). To estimate the single lower bound, we should approximate the single encoder ($q_{\phi_x}(\mathbf{z}|\mathbf{x})$ or $q_{\phi_w}(\mathbf{z}|\mathbf{w})$) by JMVAE-zero or JMVAE-kl. When the log-likelihood with the single lower bound is the same as or larger than that with the multiple lower bound, the approximation of the single encoder is good. Note that original VAEs use a single lower bound and that CVAEs and CMMAs use a multiple lower bound.
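To make the importance weighted estimator concrete, the sketch below applies it to a one-dimensional toy model in which the true marginal is known: prior $z \sim \mathcal{N}(0, 1)$, likelihood $x|z \sim \mathcal{N}(z, 1)$, hence $x \sim \mathcal{N}(0, 2)$. The toy model, the function names, and the sample counts are assumptions made for this illustration; the experiments in the text use neural encoders and decoders instead.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def iw_log_likelihood(x, q_mean, q_var, k, rng):
    """k-sample importance weighted estimate of log p(x) with proposal N(q_mean, q_var)."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(k)
    # log importance weights: log p(z) + log p(x|z) - log q(z|x)
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, q_mean, q_var)
    m = np.max(log_w)                       # log-mean-exp, computed stably
    return m + np.log(np.mean(np.exp(log_w - m)))

rng = np.random.default_rng(0)
x = 1.0
exact = log_normal(x, 0.0, 2.0)

# The exact posterior N(x/2, 1/2) as proposal makes every weight equal to p(x).
est_exact_q = iw_log_likelihood(x, 0.5, 0.5, k=50, rng=rng)
# A cruder proposal (the prior) still approaches log p(x) as k grows.
est_prior_q = iw_log_likelihood(x, 0.0, 1.0, k=5000, rng=rng)
print(exact, est_exact_q, est_prior_q)
```

The two proposals play roles loosely analogous to a well-fitted and a poorly fitted encoder: the better the proposal matches the posterior, the fewer samples are needed for the estimate to be close to the true log-likelihood.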
Our first experiment evaluated the test marginal log-likelihood and compared it with that of the VAE on the MNIST dataset. We trained the model with both JMVAE-zero and JMVAE-kl and compared the two. As described in Section 4.3.1, we have two possible ways of estimating the marginal log-likelihood of the JMVAE, i.e., the multiple and single lower bounds. The left of Table 1 shows the test marginal log-likelihoods of the VAE and JMVAE. It is apparent that the log-likelihood of JMVAE-zero is the same as or slightly better than that of the VAE. For JMVAE-kl, the log-likelihood improves as the factor $\alpha$ becomes smaller. In particular, JMVAE-kl with and single lower bound achieves the highest log-likelihood in Table 1. If $\alpha$ is 1, however, the test log-likelihood of JMVAE-kl becomes much lower. This is because the influence of the regularization term becomes stronger as $\alpha$ grows.
Next, we evaluated the test conditional log-likelihood and compared it with that of the CVAE and CMMA conditioned on $\mathbf{w}$. As in the case of the marginal log-likelihood, we can estimate the JMVAE's conditional log-likelihood by both the single and multiple lower bounds. The single bound can be estimated using JMVAE-zero or JMVAE-kl. The right of Table 1 shows the test conditional log-likelihoods of the JMVAE, CVAE, and CMMA. It is apparent that the CVAE achieves the highest log-likelihood. Even so, with the multiple bound, the log-likelihoods of both JMVAE-zero and JMVAE-kl (except ) outperform that of the CMMA.
It should be noted that the log-likelihood with JMVAE-zero and the single bound is significantly lower. As described in Section 3.3, this is because a modality is missing as input. By contrast, the log-likelihood with JMVAE-kl is improved significantly over that with JMVAE-zero. This shows that JMVAE-kl solves the issue of missing modalities (this result can also be seen in the generated images; see Appendix E). Moreover, we find that this log-likelihood improves as the factor $\alpha$ becomes larger, which is opposite to the other results. Therefore, there is a trade-off between whether each modality can be reconstructed properly and whether multiple modalities can be exchanged properly, and this trade-off can be regulated by $\alpha$.
In this section, we used the CelebA dataset to evaluate the JMVAE. Table 2 presents the evaluations of the marginal and conditional log-likelihoods. From this table, it is apparent that both the marginal and conditional log-likelihoods of the JMVAE are larger than those of the other competing methods. Moreover, comparison with Table 1 shows that the improvement on CelebA is greater than that on MNIST, which suggests that a joint representation with multiple modalities improves the quality of reconstruction and generation when an input modality is high-dimensional and complicated.
In this section, we first confirm that the JMVAE can obtain a joint representation that includes the information of the modalities. Figure 3 shows the visualization of the latent representation with the VAE, CVAE, and JMVAE on MNIST. It is apparent that the JMVAE obtains a more discriminable latent representation by adding digit label information. Figure 3(b) shows that, in spite of using multimodal information as the JMVAE does, points in the CVAE are distributed irrespective of labels because CVAEs force the latent representation to be independent of label information, i.e., obtaining a joint representation is not an objective of CVAEs.
Next, we confirm that JMVAE-GAN on CelebA can generate images from attributes. Figure 4(a) portrays generated faces conditioned on various attributes. We find that we can generate an average face for each attribute as well as various random faces conditioned on certain attributes. Figure 4(b) shows that samples are gathered for each attribute and that the locations of each variation are the same irrespective of attributes. From these results, we find that manifold learning of the joint representation with images and attributes works well.
Finally, we demonstrate that JMVAE-GAN can generate bi-directionally between faces and attributes. Figure 6 shows that JMVAE-GAN can generate both attributes and changed images conditioned on various attributes from images that had no attribute information. This way of generating an image by varying attributes is similar to that of the CMMA (Pandey & Dukkipati, 2016). However, the CMMA cannot generate attributes from an image because it only generates images from attributes in one direction.
In this paper, we introduced a novel multimodal learning model with VAEs, the joint multimodal variational autoencoder (JMVAE). In this model, modalities are conditioned independently on a joint representation, i.e., it models a joint distribution of all modalities. We further proposed a method (JMVAE-kl) of reducing the divergence between the JMVAE's encoder and a prepared encoder of each modality to prevent generated samples from collapsing when modalities are missing. We confirmed that the JMVAE can obtain appropriate joint representations and high log-likelihoods on the MNIST and CelebA datasets. Moreover, we demonstrated that the JMVAE can generate multiple modalities bi-directionally on the CelebA dataset.
In future work, we would like to evaluate the multimodal learning performance of JMVAEs using various multimodal datasets, such as those containing three or more modalities.
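Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011.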
The variation of information (VI) can be expressed as $-E_{p_D(\mathbf{x}, \mathbf{w})}[\log p(\mathbf{x}|\mathbf{w}) + \log p(\mathbf{w}|\mathbf{x})]$, where $p_D$ is the data distribution. In this equation, we specifically examine the sum of the two negative log-likelihoods and do not consider the expectation in this derivation. We can calculate the lower bounds of these log-likelihoods as follows:
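One way such bounds could be obtained is sketched below. It is a reconstruction under stated assumptions rather than the original derivation: it uses the conditional independence of $\mathbf{x}$ and $\mathbf{w}$ given $\mathbf{z}$, applies Jensen's inequality with the joint encoder $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{w})$, and then assumes that the single-input encoders $q_{\phi_x}(\mathbf{z}|\mathbf{x})$ and $q_{\phi_w}(\mathbf{z}|\mathbf{w})$ stand in for the true posteriors $p(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z}|\mathbf{w})$. Here $\mathcal{L}_{JM}$ denotes the JMVAE lower bound and $\mathcal{L}_{JM_{kl}(1)}$ the JMVAE-kl objective with factor 1.

```latex
\begin{align}
\log p(\mathbf{x}|\mathbf{w})
  &= \log \int p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z}|\mathbf{w})\, d\mathbf{z} \nonumber \\
  &\geq E_{q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})}\!\left[\log p_{\theta_x}(\mathbf{x}|\mathbf{z})\right]
     - D_{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, p(\mathbf{z}|\mathbf{w})\right), \\
\log p(\mathbf{w}|\mathbf{x})
  &\geq E_{q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w})}\!\left[\log p_{\theta_w}(\mathbf{w}|\mathbf{z})\right]
     - D_{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, p(\mathbf{z}|\mathbf{x})\right).
\end{align}
% Summing both bounds and substituting the single-input encoders for the posteriors:
\begin{align}
\log p(\mathbf{x}|\mathbf{w}) + \log p(\mathbf{w}|\mathbf{x})
  &\gtrsim \mathcal{L}_{JM}(\mathbf{x},\mathbf{w})
     + D_{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, p(\mathbf{z})\right) \nonumber \\
  &\quad - D_{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, q_{\phi_x}(\mathbf{z}|\mathbf{x})\right)
     - D_{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, q_{\phi_w}(\mathbf{z}|\mathbf{w})\right) \nonumber \\
  &\geq \mathcal{L}_{JM_{kl}(1)}(\mathbf{x},\mathbf{w}).
\end{align}
```

The last step only uses the non-negativity of $D_{KL}(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{w}) \,\|\, p(\mathbf{z}))$. Under these assumptions, raising the JMVAE-kl objective raises a bound on the sum of the two conditional log-likelihoods, which is the sense in which it lowers the VI.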
Two lower bounds used to estimate test marginal log-likelihood of the JMVAE are as follows:
We can also estimate the test conditional log-likelihood $\log p(\mathbf{x}|\mathbf{w})$ from these two lower bounds as
where and . In this paper, we set on MNIST and on CelebA.
We can obtain a tighter bound on the log-likelihood by $k$-fold importance weighted sampling. For example, we obtain an importance weighted bound on $\log p(\mathbf{x})$ from Equation 10 as follows:
Strictly speaking, these two lower bounds are not equal. However, if the number of importance samples is extremely large, the difference between these two lower bounds converges to 0.
Denote the multiple and single $k$-fold importance weighted lower bounds by $\mathcal{L}^k_{\mathrm{multiple}}$ and $\mathcal{L}^k_{\mathrm{single}}$. From the theorem of the importance weighted bound, both $\mathcal{L}^k_{\mathrm{multiple}}$ and $\mathcal{L}^k_{\mathrm{single}}$ converge to $\log p(\mathbf{x})$ as $k \to \infty$.
Figure 6 presents a comparison of the original image and reconstructed image by the JMVAE on both MNIST and CelebA datasets. It is apparent that the JMVAE can reconstruct the original image properly with either a multiple or single encoder.
Table 3 shows the joint log-likelihood of the JMVAE on the MNIST dataset for both JMVAE-zero and JMVAE-kl. It is apparent that the test log-likelihoods of the two approaches are almost identical (strictly, that of JMVAE-zero is slightly lower). The test log-likelihood of JMVAE-kl becomes much lower if the factor $\alpha$ is large.
Figure 7 presents generated samples of $\mathbf{x}$ conditioned on the single input $\mathbf{w}$. It is apparent that JMVAE-kl generates the conditioned digit images properly, whereas JMVAE-zero cannot. These results also confirm qualitatively that JMVAE-kl models $p(\mathbf{x}|\mathbf{w})$ more properly than JMVAE-zero.