1 Introduction
The problem of learning from several modalities simultaneously has garnered the attention of several deep learning researchers over the past few years [1][2]. This is primarily because of the wide availability of such data, and the numerous real-world applications where multimodal data is used. For instance, speech may be accompanied by text, and the resultant data can be used for training speech-to-text or text-to-speech engines. Even within the same medium, several modalities may exist simultaneously, for instance, the plan and elevation of a 3D object, or multiple translations of a text.
The task of learning from several modalities simultaneously is complicated by the fact that the correlations within a modality are often much stronger than the correlations across modalities. Hence, many multimodal learning approaches such as [1][3] try to capture the cross-modal correlations at an abstract latent feature level rather than at the visible feature level. The assumption is that the latent features are comparatively less correlated than the visible features, and hence, the latent features from different modalities can be concatenated and a single distribution can be learnt for the concatenated latent features [3].
An alternative approach to capturing the joint distribution is to model the conditional distributions across modalities, as done in [2], whereby the authors make the simplifying assumption that the joint log-likelihood is maximized when the conditional log-likelihood of each modality given the other modality is maximized. While the assumption is untrue in general, the idea of learning conditional distributions to capture the joint distribution has several advantages. In particular, the conditional distributions are often less complex to model, since conditioning on one modality reduces the possibilities for the other modality. Moreover, if the underlying task is to generate one modality given the other, then learning conditional distributions directly addresses this task.
Hence, we address the problem of multimodal learning by capturing the conditional distributions. In particular, we use a variational approximation to the joint log-likelihood for training. In this paper, we restrict ourselves to directed graphical models, whereby a latent representation is sampled from one modality (referred to as the conditioning modality) and the other modality (referred to as the generated modality) is then sampled from the latent representation. Hence, the model is referred to as the conditional multimodal autoencoder (CMMA).
2 Problem Formulation and Proposed Solution
A formal description of the problem is as follows. We are given a sequence of $N$ datapoints $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$. For a fixed datapoint $(x, y)$, let $x$ be the modality that we wish to generate and $y$ be the modality that we wish to condition on. We assume that $x$ is generated by first sampling a real-valued latent representation $z$ from the distribution $p(z \mid y)$, and then sampling $x$ from the distribution $p(x \mid z)$. The graphical representation of the model is given in Figure 2. Furthermore, we assume that the conditional distribution of the latent representation $z$ given $y$ and the distribution of $x$ given $z$ are parametric.
Given the above description of the model, our aim is to find the parameters so as to maximize the joint log-likelihood of $x$ and $y$ for the given sequence of datapoints. The computation of the log-likelihood comprises a term for $\log p(x \mid y)$ (referred to as the conditional log-likelihood) and a term for $\log p(y)$. The computation of $p(x \mid y)$ requires the marginalization of the latent variable $z$ from the joint distribution $p(x, z \mid y)$.
(1)  $p(x \mid y) = \int p(x, z \mid y)\, dz$
(2)  $\phantom{p(x \mid y)} = \int p(x \mid z)\, p(z \mid y)\, dz$
For most choices of $p(z \mid y)$ and $p(x \mid z)$, the evaluation of the conditional log-likelihood is intractable. Hence, we resort to the maximization of a variational lower bound to the conditional log-likelihood. This is achieved by approximating the posterior distribution of $z$ given $x$ and $y$, that is $p(z \mid x, y)$, by a tractable distribution $q(z \mid x, y)$. This is explained in more detail in the following section.
2.1 The variational bound
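For a given collection of $N$ datapoints $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is the generated modality and $y^{(i)}$ is the conditioning modality, the log-likelihood can be written as
(3)  $\log p(x^{(1)}, \ldots, x^{(N)}, y^{(1)}, \ldots, y^{(N)}) = \sum_{i=1}^{N} \left[ \log p(x^{(i)} \mid y^{(i)}) + \log p(y^{(i)}) \right]$
Table 1: Parametric forms of the likelihood, prior and posterior distributions (columns: Distribution, Parametric form, Representation).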
Let the posterior distribution be approximated by a distribution whose graphical representation is shown in Figure 2. In particular, be an approximation to the posterior distribution of the latent variables given and be an approximation to the posterior distribution of given . For an individual datapoint, the conditional loglikelihood can be rewritten as
(4)  
(5) 
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ refers to the KL-divergence between the approximate posterior and the true posterior of the latent representation, and is always non-negative. Note that the choice of the decomposition of the posterior forces the distribution of the latent representation given the conditioning modality alone to be 'close' to the true posterior, thereby encouraging the model to learn features from the conditioning modality alone that are representative of the generated modality as well.
The term in equation (5) is referred to as the variational lower bound for the conditional log-likelihood of the datapoint. It can further be rewritten as
(6) 
From the last equation, we observe that the variational lower bound can be written as the sum of two terms. The first term is the negative of the reconstruction error of the generated modality $x$, when reconstructed from the joint encoding of $x$ and the conditioning modality $y$. The second term ensures that the encoding of $y$ is 'close' to the corresponding encoding of $(x, y)$, where closeness is defined in terms of the KL-divergence between the corresponding distributions.
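The steps behind (4) to (6) can be sketched in standard variational notation. The symbols below are assumptions rather than the paper's own typesetting: $x$ is the generated modality, $y$ the conditioning modality, $z$ the latent representation, $p(z \mid y)$ the prior, $p(x \mid z)$ the likelihood and $q(z \mid x, y)$ the approximate posterior. The last line uses the factorization $p(x, z \mid y) = p(x \mid z)\, p(z \mid y)$ implied by the model.

```latex
\begin{align*}
\log p(x \mid y)
  &= \mathbb{E}_{q(z \mid x, y)}\!\left[\log \frac{p(x, z \mid y)}{q(z \mid x, y)}\right]
   + \mathrm{KL}\big(q(z \mid x, y)\,\|\,p(z \mid x, y)\big) \\
  &\ge \mathbb{E}_{q(z \mid x, y)}\!\left[\log \frac{p(x, z \mid y)}{q(z \mid x, y)}\right] \\
  &= \mathbb{E}_{q(z \mid x, y)}\big[\log p(x \mid z)\big]
   - \mathrm{KL}\big(q(z \mid x, y)\,\|\,p(z \mid y)\big)
\end{align*}
```

The inequality holds because the dropped KL term is non-negative; the two terms in the final line are the negative reconstruction error and the divergence between the joint encoding and the encoding of the conditioning modality.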
Adding $\log p(y)$, the log-likelihood of the conditioning modality $y$, to the above bound, we obtain the lower bound to the joint log-likelihood. It has been shown in [4] that for learning a distribution from samples, it is sufficient to train the transition operator of a Markov chain whose stationary distribution is the distribution that we wish to model. Using this idea, we replace by . Note that while the two terms will be quite different, the gradients with respect to the parameters for the two terms are expected to be 'close'.
2.2 The reparametrization
In order to simplify the computation of the variational lower bound, we assume that, conditioned on the conditioning modality $y$, the latent representation $z$ is normally distributed with a mean and a diagonal covariance matrix that are both functions of $y$. Moreover, conditioned on $z$, the generated modality $x$ is normally distributed with a mean and a diagonal covariance matrix that are both functions of $z$. In the rest of the paper, we assume these functions to be multi-layer perceptrons. Furthermore, we approximate the posterior distribution of $z$ given $x$ and $y$ by a normal distribution with a mean and a diagonal covariance matrix that are functions of $x$ and $y$, and are again given by multi-layer perceptrons. For reference, the parametric forms of the likelihood, prior and posterior distributions, and their representations demonstrating the explicit dependence on these functions, are given in Table 1.
The above assumptions simplify the calculation of the KL-divergence and of $\log p(x \mid z)$. After ignoring the constant terms, the divergence term of the variational lower bound can be written as a sum over the components of the latent representation:
(7) 
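As an illustration of the divergence term, the following is a minimal NumPy sketch of the closed-form KL-divergence between two normal distributions with diagonal covariance. The function name and the mean/log-variance parametrization are assumptions made for this example, and the constant term is kept here so that the value is exactly zero for identical distributions.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given by means and log-variances.

    q plays the role of the approximate posterior over the latent
    representation and p the role of the prior (hypothetical naming).
    """
    mu_q, logvar_q, mu_p, logvar_p = map(
        lambda a: np.asarray(a, dtype=float), (mu_q, logvar_q, mu_p, logvar_p)
    )
    var_ratio = np.exp(logvar_q - logvar_p)               # sigma_q^2 / sigma_p^2
    sq_mean_diff = (mu_q - mu_p) ** 2 / np.exp(logvar_p)  # scaled squared mean gap
    # Sum over the components of the latent representation.
    return 0.5 * np.sum(logvar_p - logvar_q + var_ratio + sq_mean_diff - 1.0)
```

Because both covariances are diagonal, the divergence decomposes into one term per latent component, which is what makes its gradient exact and cheap to compute.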
The negative reconstruction error term in the variational lower bound in (6) can be obtained by generating samples from the posterior distribution of the latent representation $z$ given $x$ and $y$, and then averaging over the negative reconstruction error. For a fixed $z$, the term can be written as
(8) 
The choice of the posterior allows us to sample the latent representation $z$ as follows:
(9)  
(10) 
where $\odot$ denotes element-wise multiplication. Hence, the negative reconstruction error can alternatively be rewritten as
(11) 
where is as defined in (8)
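The sampling step and the resulting Monte Carlo estimate of the negative reconstruction error can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the posterior and the likelihood are parametrized by a mean and a log-variance, and `decoder` is a hypothetical callable that maps a latent sample to the mean and log-variance of the generated modality.

```python
import numpy as np

def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparametrization)."""
    mu = np.asarray(mu, dtype=float)
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * np.asarray(logvar, dtype=float)) * eps

def gaussian_log_density(x, mean, logvar):
    """Log-density of x under a normal distribution with diagonal covariance."""
    x, mean, logvar = (np.asarray(a, dtype=float) for a in (x, mean, logvar))
    return -0.5 * np.sum(np.log(2 * np.pi) + logvar + (x - mean) ** 2 / np.exp(logvar))

def neg_reconstruction_error(x, mu_q, logvar_q, decoder, rng, n_samples=1):
    """Average log p(x | z) over latent samples drawn from the posterior."""
    total = 0.0
    for _ in range(n_samples):
        z = sample_latent(mu_q, logvar_q, rng)
        mean_x, logvar_x = decoder(z)  # hypothetical decoder interface
        total += gaussian_log_density(x, mean_x, logvar_x)
    return total / n_samples
```

Since the noise is drawn independently of the parameters, the sample is a deterministic function of the posterior mean and log-variance, which is what allows gradients to flow through the sampling step.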
In order to train the model using first-order methods, we need to compute the derivative of the variational lower bound with respect to the parameters of the model. Note that the KL-divergence term in (7) depends only on the networks that parametrize the prior $p(z \mid y)$ and the approximate posterior $q(z \mid x, y)$. Its derivatives with respect to the parameters of these networks can be computed via the chain rule.
From (11), the derivative of the negative reconstruction error with respect to the model parameters is given by
(12) 
The term inside the expectation can again be evaluated using chain rule.
2.3 Implementation details
We use minibatch training to learn the parameters of the model, whereby the gradient of the variational lower bound with respect to the model parameters is computed for every minibatch and the corresponding parameters are updated. While the gradient of the KL-divergence can be computed exactly from (7), the gradient of the negative reconstruction error in (11) requires one to sample standard normal random vectors, compute the gradient for each sampled vector, and then take the mean. In practice, when the minibatch size is large enough, it is sufficient to sample one standard normal random vector per training example, and then compute the gradient of the negative reconstruction error with respect to the parameters for this vector. This has also been observed for the case of the variational autoencoder in [5].
A pictorial representation of the implemented model is given in Figure 3. Firstly, the generated modality $x$ and the conditioning modality $y$ are fed to the encoder network to generate the mean and log-variance of the distribution $q(z \mid x, y)$ over the latent representation $z$. Moreover, $y$ is fed to the prior network to generate the mean and log-variance of the distribution $p(z \mid y)$. The KL-divergence between $q(z \mid x, y)$ and $p(z \mid y)$ is computed using (7), and its gradient is backpropagated to update the parameters of these two networks. Furthermore, the mean and log-variance of $q(z \mid x, y)$ are used to sample $z$, which is then forwarded to the decoder network to compute the mean and log-variance of the distribution $p(x \mid z)$. Finally, the negative reconstruction error is computed using equation (8) for the specific $z$, and its gradient is backpropagated to update the parameters of the encoder and decoder networks.
3 Related Works
Over the past few years, several deep generative models have been proposed. They include deep Boltzmann machines (DBM) [6], generative adversarial networks (GAN) [7], variational autoencoders (VAE) [5] and generative stochastic networks (GSN) [4].
DBMs learn a Markov random field with multiple latent layers, and have been effective in modelling MNIST and NORB data. However, the training of DBMs involves a mean-field approximation step for every instance in the training data, and hence, they are computationally expensive. Moreover, there are no tractable extensions of deep Boltzmann machines for handling spatial equivariance.
All the other models mentioned above can be trained using backpropagation or its stochastic variant, and hence can incorporate the recent advances in training deep neural networks such as faster libraries and better optimization methods. In particular, GAN learns a distribution on data by forcing the generator to generate samples that are 'indistinguishable' from training data. This is achieved by learning a discriminator whose task is to distinguish between the generated samples and samples in the training data. The generator is then trained to fool the discriminator. Though this approach is intuitive, it requires a careful selection of hyperparameters. Moreover, given the data, one cannot sample the latent variables from which it was generated, since the posterior is never learnt by the model.
In a VAE, the posterior distribution of the latent variables conditioned on the data is approximated by a normal distribution, whose mean and variance are the output of a neural network (distributions other than normal can also be used). This allows approximate estimation of the variational log-likelihood, which can be optimized using stochastic backpropagation [8].
Both GAN and VAE are directed probabilistic models with an edge from the latent layer to the data. Conditional extensions of both these models for incorporating attributes/labels have also been proposed [9][10][11]. The graphical representation of a conditional GAN or conditional VAE is shown in Figure 2. As can be observed, both these models assume the latent layer to be independent of the attributes/labels. This is in stark contrast with our model CMMA, which assumes that the latent layer is sampled conditioned on the attributes.
It is also informative to compare the variational lower bound of the conditional log-likelihood for a CVAE with (6). The lower bound for a CVAE is given by
(13) 
Note that while the lower bound in the proposed model CMMA contains a KL-divergence term that explicitly forces the latent representation from the conditioning modality $y$ alone to be 'close' to the latent representation from both the generated modality $x$ and $y$, there is no such term in the lower bound of CVAE. This proves to be a disadvantage for CVAE, as is reflected in the experiments section.
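For comparison, a conditional VAE bound of the kind discussed above can be written in standard notation as follows. This is a sketch under assumptions, not a transcription of (13): $x$ denotes the data, $y$ the attributes, $z$ the latent layer, the decoder is taken to be $p(x \mid y, z)$ and the prior $p(z)$ is taken to be independent of $y$, in line with the independence assumption described for conditional VAEs.

```latex
\begin{equation*}
\log p(x \mid y) \;\ge\;
\mathbb{E}_{q(z \mid x, y)}\big[\log p(x \mid y, z)\big]
- \mathrm{KL}\big(q(z \mid x, y)\,\|\,p(z)\big)
\end{equation*}
```

In this form the divergence pulls the posterior towards a prior that is shared by all attribute settings, so nothing ties the encoding of the attributes alone to the joint encoding.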
4 Experiments
We consider the task of learning a conditional distribution for the faces given the attributes. For this task, we use the cropped Labelled Faces in the Wild (LFW) dataset [12] (available at http://conradsanderson.id.au/lfwcrop/), in which most people have only one image. The images contain three channels (red, green and blue). A subset of the faces have attributes associated with them, obtained partially using Amazon Mechanical Turk and partially using attribute classifiers [13]. The attributes include 'Male', 'Asian', 'No eyewear', 'Eyeglasses', 'Moustache', 'Mouth open', 'Big nose', 'Pointy nose', 'Smiling', 'Frowning', 'Big lips' etc. The data also contains attributes for hair, necklace, earrings etc., though these attributes are not visible in the cropped images. We split the faces and their corresponding attributes into three consecutive parts: the first part is used for training the model, the next for validation, and the remaining faces are kept for testing.
The LFW dataset is very challenging since most people in the dataset have only one image. Moreover, any possible combination of the attributes occurs at most once in the dataset. This forces the model to learn a mapping from attributes to faces that is shared across all possible combinations of attributes. In contrast, the face dataset used in [14] consists of several subsets of faces where only one attribute changes while others remain unchanged. Hence, one can tune the mapping from attributes to faces, one attribute at a time. This, however, isn't possible for LFW.
In order to emphasize this factor, we show the prior ($p(z \mid y)$) and the posterior distribution ($q(z \mid x, y)$) for a 2-dimensional latent representation of a few randomly selected individuals in Figure 4. Note that despite conditioning on the attributes, the prior distributions have high uncertainty, and the prior distributions for several attribute combinations overlap considerably, particularly in lower dimensions. As the dimensions increase, however, this overlap decreases. A VAE, on the other hand, assumes a common prior for all the individuals. Hence, one can think of conditioning in CMMA as tilting the prior of VAE in the direction of the conditioning modality.
Moreover, the posterior always has much lower variance than the prior. Other than the fact that access to decreases the uncertainty by a huge amount, the reduced variance is also an artifact of variational methods in general. In particular, for the 2dimensional latent representations, we observed an average standard deviation of for CMMA , and for VAE in the posterior distribution of latent representations after iterations, which did not reduce further.
4.1 CMMA architecture
The MLP of the CMMA used in this paper (refer Figure 3) encodes the attributes, and is a neural network with hidden units, a soft thresholding unit of the form and two parallel output layers, each comprising of units. The MLPs and are convolution and deconvolution neural networks respectively. The corresponding architectures are given in Figure 5.
4.2 Models used for comparison
We compare the quantitative and qualitative performance of CMMA against conditional Generative Adversarial Networks [11][10] (CGAN) and conditional Variational Autoencoders [5] (CVAE). We have tried to ensure that the architecture of the models used for comparison is as close as possible to the architecture of the CMMA used in our experiments. Hence, the generator and discriminator of CGAN and the encoder and decoder of CVAE closely mimic the corresponding MLPs of CMMA described in the previous section.
4.3 Training
We coded all the models in Torch [15] and trained each of them on a Tesla K40 GPU. For each model, the training time was approximately one day. The Adagrad optimization algorithm was used [16]. The proposed model CMMA was found to be relatively stable to the selection of the initial learning rate and the variance of the randomly initialized weights in various layers. For CGAN, we selected the learning rate of the generator and discriminator and the variance of the weights by verifying the conditional log-likelihood on the validation set. Only the results from the best hyperparameters have been reported. We found the CGAN model to be quite unstable to the selection of hyperparameters.
4.4 Quantitative results
For the first set of experiments, we compare the conditional log-likelihood of the faces given the attributes on the test set for the three models: CMMA, CGAN and CVAE. A direct evaluation of the conditional log-likelihood is infeasible, and for the size of the latent layer used in our experiments (500), MCMC estimates of the conditional log-likelihood are unreliable.
For the proposed model CMMA, a variational lower bound to the log-likelihood of the test data can be computed as the difference between the negative reconstruction error and the KL-divergence (see (5)). The same can also be done for the CVAE model using (13).
Since we cannot obtain the variational lower bound for the other models, we also use a Parzen-window based log-likelihood estimation method for comparing the models. In particular, for a fixed test instance, we condition on the attributes to generate samples from the models. A Gaussian Parzen window is fit to the generated samples, and the log-probability of the face in the test instance is computed for the obtained Gaussian Parzen window. The bandwidth parameter of the Parzen window estimator is obtained via cross-validation on the validation set. The corresponding log-likelihood estimates for the models are given in Table 2.

Table 2: Log-likelihood estimates for the models.

| Model | Conditional Log-likelihood | Variational Lower Bound |
| CMMA | 9,487 | 17,973 |
| CVAE | 8,714 | 14,356 |
| CGAN | 8,320 | - |

In both cases, the proposed model CMMA was able to achieve a better conditional log-likelihood than the other models.
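The Parzen-window estimate described above can be sketched as follows. This is a minimal NumPy illustration, not the evaluation code used for Table 2: the function names are hypothetical, the kernel is an isotropic Gaussian with bandwidth `sigma`, and, for brevity, the bandwidth search reuses a single set of generated samples, whereas in the conditional setting the samples would be regenerated for the attributes of each instance.

```python
import numpy as np

def parzen_log_likelihood(x, samples, sigma):
    """Log-density of x under a Gaussian Parzen window fit to `samples`."""
    x = np.asarray(x, dtype=float).ravel()
    samples = np.asarray(samples, dtype=float)
    samples = samples.reshape(len(samples), -1)
    n, d = samples.shape
    exponents = -0.5 * np.sum((samples - x) ** 2, axis=1) / sigma ** 2
    # Log-sum-exp for numerical stability in high dimensions.
    m = np.max(exponents)
    log_sum = m + np.log(np.sum(np.exp(exponents - m)))
    return log_sum - np.log(n) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)

def select_sigma(valid_points, samples, candidates):
    """Pick the bandwidth with the highest mean validation log-likelihood."""
    scores = [
        np.mean([parzen_log_likelihood(v, samples, s) for v in valid_points])
        for s in candidates
    ]
    return candidates[int(np.argmax(scores))]
```

The estimate depends strongly on the bandwidth and on the number of generated samples, which is why the bandwidth is chosen on held-out data rather than fixed in advance.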
4.5 Qualitative results
While the quantitative results do convey a sense of superiority of the proposed model over the other models used in comparison, it is more convincing to look at the actual samples generated by these models. Hence, we compare the three models CGAN, CVAE and CMMA for the task of generating faces from attributes. We also compare the two models CVAE and CMMA for modifying an existing face by changing the attributes. CGAN cannot be used for modifying faces because of the unidirectional nature of the model, that is, it is not possible to sample the latent layer from an image in a generative adversarial network.
4.5.1 Generating faces from attributes
In our first set of experiments, we generate samples from the attributes using the already trained models. In a CGAN, the images are generated by feeding noise and attributes to the generator. Similarly, in a CVAE, noise and attributes are fed to the decoder MLP (see (13)) to sample the images. In order to generate images from attributes in a CMMA, we prune the MLP that encodes the face together with the attributes from the CMMA model (refer Figure 3), and connect the MLP that encodes the attributes alone in its stead, as shown in Figure 7.
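Generation from attributes in a CMMA can be sketched as below. This is an illustrative NumPy sketch under assumptions: `attr_encoder` and `decoder` are hypothetical callables standing in for the trained attribute MLP and decoder MLP, each returning a mean and a log-variance, and `noise_scale` is a hypothetical knob that scales the sampling noise.

```python
import numpy as np

def generate_from_attributes(attrs, attr_encoder, decoder, rng, noise_scale=1.0):
    """Sample a latent code from the attribute-conditioned prior and decode it."""
    mu, logvar = attr_encoder(attrs)  # prior over the latent layer given attributes
    mu = np.asarray(mu, dtype=float)
    eps = rng.standard_normal(np.shape(mu))
    z = mu + noise_scale * np.exp(0.5 * np.asarray(logvar, dtype=float)) * eps
    mean_x, _ = decoder(z)  # use the decoder mean as the generated image
    return mean_x
```

With `noise_scale=0.0` the output is a deterministic function of the attributes, which makes it easier to see the effect of changing a single attribute.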
We set/reset the 'Male' and 'Asian' attributes to generate four possible combinations. The faces are then generated by varying the other attributes one at a time. In order to remove any bias from the selection of images, we use the same value for the variance parameter of the noise level in CMMA, CVAE and CGAN. The corresponding faces for our model CMMA and the other models (CVAE [9] and CGAN [10]) are shown in Figure 6. We have also presented the results from the implementation of CGAN (https://github.com/hans/adversarial) by the author of [10], since the images sampled from the CGAN trained by us were quite noisy.
The columns of images for each model correspond to the attributes i) no change, ii) mouth open, iii) spectacles, iv) bushy eyebrows, v) big nose, vi) pointy nose and vii) thick lips. As is evident from the first image in Figure 6, CMMA can incorporate any change in attribute such as 'open mouth' or 'spectacles' in the corresponding face for each of the rows. However, this does not seem to be the case for the other models. We hypothesize that this is because our model explicitly minimizes the KL-divergence between the latent representation of attributes and the joint representation of face and attributes.
4.5.2 Varying the attributes in existing faces
In our next set of experiments, we select a face from the training data, and vary the attributes to generate a modified face. For a CMMA, this can be achieved as follows (also refer Figure 3):
1. Let attr_orig be the original attributes of the face and attr_new be the new attributes that we wish the face to possess.
2. Pass the selected face and attr_new through the MLP that encodes the face together with the attributes.
3. Pass attr_orig and attr_new through the MLP that encodes the attributes alone, and compute the difference.
4. Add the difference to the output of the MLP from step 2.
5. Pass the resultant sum through the decoder.
As in the previous case, we use the same setting for the variance parameter of the noise level.
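The steps above can be sketched as a short function. This is an illustration under assumptions: `encoder`, `attr_encoder` and `decoder` are hypothetical callables standing in for the trained networks, each is taken to return a mean encoding only (no sampling noise), and the difference is taken as the encoding of attr_new minus the encoding of attr_orig, a direction the text does not state explicitly.

```python
import numpy as np

def modify_face(face, attr_orig, attr_new, encoder, attr_encoder, decoder):
    """Shift the joint encoding of a face by the change in attribute encoding."""
    # Step 2: joint encoding of the face with the new attributes.
    z_face = np.asarray(encoder(face, attr_new), dtype=float)
    # Step 3: difference between the attribute-only encodings (assumed new - orig).
    shift = np.asarray(attr_encoder(attr_new), dtype=float) - np.asarray(
        attr_encoder(attr_orig), dtype=float
    )
    # Steps 4 and 5: add the difference and decode the result.
    return decoder(z_face + shift)
```

Working with the difference of attribute encodings, rather than the encoding of attr_new directly, keeps the face-specific part of the joint encoding intact while moving it in the direction of the requested change.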
Note that we cannot use CGAN for this set of experiments, since, given a face, it is not possible to sample the latent layer in a CGAN. Hence, we only present the results corresponding to our model CMMA and CVAE. The corresponding transformed faces are given in Figure 8. As can be observed, for most of the attributes, our model CMMA is able to transform images successfully by removing moustaches, adding spectacles, making the nose bigger or pointy, etc.
4.5.3 Modifying faces with missing attributes
Next, we select a few faces from the web and evaluate the performance of the model for modifying these faces. In order to modify these faces, one needs to sample the attributes conditioned on the faces. The algorithm for modifying the faces mentioned in the previous section is then applied. The corresponding results are given in Figure 10.
5 Concluding Remarks
In this paper, we proposed a model for conditional modality generation that forces the latent representation of one modality to be 'close' to the joint representation for multiple modalities. We explored the applicability of the model for generating and modifying images using attributes. Quantitative and qualitative results suggest that our model is more suitable for this task than CGAN [10] and CVAE [9]. The proposed model is general and can be used for other tasks whereby some modalities need to be conditioned on whereas others need to be generated, for instance, translation of text or transliteration of speech. We wish to explore the applicability of the model to such problems in the future.
References

[1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). (2011) 689–696
[2] Sohn, K., Shang, W., Lee, H.: Improved multimodal deep learning with variation of information. In: Advances in Neural Information Processing Systems. (2014) 2141–2149
[3] Srivastava, N., Salakhutdinov, R.R.: Multimodal learning with deep Boltzmann machines. In: Advances in Neural Information Processing Systems. (2012) 2222–2230
[4] Bengio, Y., Thibodeau-Laufer, E., Alain, G., Yosinski, J.: Deep generative stochastic networks trainable by backprop. In: Proceedings of the 31st International Conference on Machine Learning. (2014) 226–234
[5] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. International Conference on Learning Representations (2014)
[6] Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: International Conference on Artificial Intelligence and Statistics. (2009) 448–455
[7] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. (2014) 2672–2680
[8] Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In Jebara, T., Xing, E.P., eds.: Proceedings of the 31st International Conference on Machine Learning (ICML-14), JMLR Workshop and Conference Proceedings (2014) 1278–1286
[9] Kingma, D.P., Mohamed, S., Rezende, D.J., Welling, M.: Semi-supervised learning with deep generative models. In: Advances in Neural Information Processing Systems. (2014) 3581–3589
[10] Gauthier, J.: Conditional generative adversarial nets for convolutional face generation
[11] Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
[12] Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst (October 2007)
[13] Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: International Conference on Computer Vision, IEEE (2009) 365–372
[14] Kulkarni, T.D., Whitney, W., Kohli, P., Tenenbaum, J.B.: Deep convolutional inverse graphics network. arXiv preprint arXiv:1503.03167 (2015)
[15] Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: A Matlab-like environment for machine learning. In: BigLearn, NIPS Workshop. Number EPFL-CONF-192376 (2011)
[16] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12 (2011) 2121–2159