1 Introduction
Humans are capable of using information from multiple sources, or modalities, to learn concepts that relate them together. Previous research (Shams & Seitz, 2008) has shown that perceiving and relating multiple modalities of input data is a key component of efficient learning. Multimodal data here refers to information from multiple sources; we use such data for cognition in our everyday life. In artificial learning systems, however, relating such modality-independent concepts remains largely an unsolved problem.
The primary reason for this is the difficulty of modeling such relationships in machines: current models do not generalize well to concepts involving objects unseen in the training set. This failure to generalize to novel concepts can be attributed to two main reasons: (1) humans learn from multimodal data continuously from birth, which results in far richer models and mappings between modalities than machine learning systems, which typically have access only to a small finite subset of sample data; (2) machine learning algorithms are not yet effective at learning meaningful concepts in a generative manner from training examples. Multimodal learning thus remains an area of active research.
In this paper, we specifically target crossmodal concept learning for the case of text and speech to images, as depicted in figure 1. Some recent work has achieved success in this direction, i.e., jointly learning generative models capable of generating one modality from another. For instance, Ngiam et al. (2011) proposed a deep learning framework using restricted Boltzmann machines (Hinton, 2002; Hinton et al., 2006) to learn efficient features of audio and video modalities, and further illustrated that multimodal learning results in better performance than the unimodal case. For learning to generate images from the text modality, the recent work of Mansimov et al. (2015) shows that using attention-based models for generating images from text captions results in higher-quality samples; furthermore, it was claimed that this leads to better generalization to previously unseen captions. In other work, Reed et al. (2016) proposed deep convolutional generative adversarial networks that combine natural language and image embeddings to produce compelling synthetically generated images. Recently, there has also been work in the field of cross-domain feature learning for images. Coupled generative models (Liu & Tuzel, 2016) generate pairs of images in two different domains by sharing the weights of the higher-level feature-extracting layers. Similarly, other methods like DiscoGAN (Kim et al., 2017) and conditional VAEs (Kingma et al., 2014) can learn to transfer style between images. However, prior work does not illustrate how different data modalities (such as image, speech, and natural language) can share similar features that can be used for conditional generation across modalities. We rise to this challenge by jointly learning the distribution of multiple modalities of data via learned generative models of low-dimensional embeddings of high-dimensional natural data. Our approach first projects the high-dimensional data to a low-dimensional manifold or latent space, and then separately learns a generative model for each such embedding space. To tie the two together, we add a constraint to the learning objective that makes the latent representations of the two generative models as close as possible. At inference time, the latent representation of one generative model can proxy for the other, yielding a combined conditional generative model of multimodal data. Using text, speech, and image modalities, we show that our proposed method can successfully learn to generate images from modified MNIST datasets given text captions and speech snippets not seen during training.
2 Problem Statement
In this section, we formulate the problem of mapping between multiple modalities of data to learn semantic concepts. Consider two random variables X and Y representing instances of two input data modalities. Given paired data {(x_i, y_i)}_{i=1}^N, we would ideally like to create a parametric joint distribution model p_θ(X, Y) and find the optimal parameters as

θ* = argmax_θ Σ_{i=1}^N log p_θ(x_i, y_i),    (1)

where N is the number of samples in the dataset. Learning the joint probability distribution enables us to perform conditional inference in order to map between the two data modalities X and Y, i.e., p(Y | X) and p(X | Y).
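As a toy illustration of equation 1, the following sketch fits a categorical joint distribution over two discrete stand-in "modalities" by maximum likelihood (frequency counting) and then performs the conditional inference p(Y | X) used for crossmodal mapping. The function names and the discrete setting are ours, a didactic stand-in for the neural models used in the paper.

```python
import numpy as np

def fit_joint(x, y, nx, ny):
    """MLE of a categorical joint distribution p(X, Y) from paired samples."""
    counts = np.zeros((nx, ny))
    for xi, yi in zip(x, y):
        counts[xi, yi] += 1
    return counts / counts.sum()

def conditional(p_xy, xi):
    """p(Y | X = xi) obtained from the learned joint."""
    row = p_xy[xi]
    return row / row.sum()

# Paired observations of two modalities (e.g. an image label and a caption index).
x = np.array([0, 0, 1, 1, 1, 2])
y = np.array([0, 0, 1, 1, 2, 2])
p_xy = fit_joint(x, y, 3, 3)
p_y_given_x1 = conditional(p_xy, 1)   # crossmodal inference for x = 1
```

For neural models the counting step is replaced by gradient-based maximization of the same log-likelihood, but the conditional-inference step retains this structure.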
3 Proposed Method
3.1 Learning the constrained embedding mapping
The original high-dimensional data points are first mapped to a semantically meaningful low-dimensional manifold by deterministic or stochastic functions f_X : R^{D_X} → R^{d_X} and f_Y : R^{D_Y} → R^{d_Y}, where D_X, D_Y are the dimensions of the original high-dimensional spaces and d_X, d_Y are the dimensions of the low-dimensional embeddings. Let the embeddings be e_X = f_X(x) and e_Y = f_Y(y), respectively. We also assume that the original high-dimensional data can be recovered from these embeddings by generative models g_X and g_Y, such that x ≈ g_X(e_X) and y ≈ g_Y(e_Y). We explicitly learn the mapping between the embeddings in a deterministic manner, while ensuring that the mapping can reconstruct or decode the original embeddings from the common latent space. We take inspiration from Gupta et al. (2017), where the authors proposed a similar mapping to transfer learned skills between robots, and extend it to learn meaningful mappings among embedding spaces.
Instead of modeling the original high-dimensional joint distribution p(X, Y), we model the joint distribution of the embedding space by learning a common latent variable z for both embeddings, such that p(e_X, e_Y) = ∫ p(e_X | z) p(e_Y | z) p(z) dz. However, with a shared latent representation, both modalities of the data would be required to predict the latent space during inference, as p(z | e_X, e_Y). To resolve this problem, we employ a proxy-variable trick: we separately learn generative models of the low-dimensional embeddings e_X (with latent space z_X) and e_Y (with latent space z_Y) using an autoencoder architecture parameterized by network parameters θ_X and θ_Y, respectively, and introduce an additional constraint that minimizes the distance between the latent representations of the two autoencoders. Within this framework, reducing the original high-dimensional data to low-dimensional embedding subspaces serves two important purposes: (1) since the embedding spaces are semantically meaningful, the intuition is that mapping between such subspaces leads to better generalization for novel data points than using the original high-dimensional data; (2) learning a mapping on the embedding space allows us to exploit the generative capabilities of the separate models, which can be trained from the plethora of unlabeled data. Furthermore, since we use a simple autoencoder model, the reduction in data dimension leads to faster convergence with shallower networks.
The autoencoders for image and text (or speech) embedding generation are parameterized by θ_X and θ_Y, respectively. The intermediate latent representations are z_X = E_X(e_X) and z_Y = E_Y(e_Y), and the reconstructed embeddings are computed by their inverse functions as ê_X = D_X(z_X) and ê_Y = D_Y(z_Y). The reconstruction losses L_X = ||e_X − ê_X||² and L_Y = ||e_Y − ê_Y||² are minimized in order to ensure that the respective embeddings can be generated back. Additionally, the two latent representations are constrained to be equal by minimizing the distance ||z_X − z_Y||² between them; this facilitates multimodal mapping during the inference step, as explained in section 3.2. Combining the three loss functions, the optimal parameters for the two autoencoders are learned by minimizing the objective

θ_X*, θ_Y* = argmin_{θ_X, θ_Y} L_X + L_Y + λ ||z_X − z_Y||².    (2)
In figure 1, we provide an outline of our training process. Once the networks are trained, we can rearrange the individual components to create a generative mapping model, as discussed in the following subsection.
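The combined objective of equation 2 can be sketched as follows, with linear encoder/decoder pairs standing in for the trained networks; the names (enc_x, dec_x, lam) and the dimensions are our own illustrative choices, showing only the three-term loss structure: two reconstruction terms plus the latent-equality constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, d_z = 8, 6, 4                  # embedding and shared-latent dimensions

# Linear stand-ins for the encoder/decoder networks of the two autoencoders.
enc_x = rng.normal(size=(d_z, d_x)); dec_x = rng.normal(size=(d_x, d_z))
enc_y = rng.normal(size=(d_z, d_y)); dec_y = rng.normal(size=(d_y, d_z))

def combined_loss(e_x, e_y, lam=1.0):
    z_x, z_y = enc_x @ e_x, enc_y @ e_y
    rec_x = np.sum((e_x - dec_x @ z_x) ** 2)   # L_x: reconstruct image embedding
    rec_y = np.sum((e_y - dec_y @ z_y) ** 2)   # L_y: reconstruct text embedding
    tie   = np.sum((z_x - z_y) ** 2)           # ||z_x - z_y||^2: latent constraint
    return rec_x + rec_y + lam * tie

loss = combined_loss(rng.normal(size=d_x), rng.normal(size=d_y))
```

In training, this scalar would be minimized jointly over both sets of network parameters by gradient descent.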
Table 1: Mean PSNR (dB) of images generated from text embeddings and speech features.

                       |        Text                    |        Speech
                       | Two digit | Colored two digit  | Two digit | Colored two digit
Direct (conv)          |   14.73   |       19.12        |   15.26   |       19.24
Proposed (conv-vae)    |   15.82   |       19.49        |   16.32   |       19.66
Proposed (mlp-vae)     |   15.73   |       18.89        |   16.08   |       19.51
Proposed (conv-ae)     |   14.97   |       18.90        |   15.26   |       19.67
Proposed (mlp-ae)      |   14.89   |       18.39        |   15.22   |       19.05
3.2 Mapping between multimodal data
Since the networks were trained to respect the equality constraint between the latent spaces, we can consider them to represent the same space during inference. However, even though z_X and z_Y are trained to be equal, they are decoupled by design, which allows us to use them for generating specific modalities of data.
From the learned mapping, the original high-dimensional data can be recovered by the generator functions x = g_X(D_X(z)) and y = g_Y(D_Y(z)). The mapping from language to image (image generation) can be performed as x = g_X(D_X(E_Y(f_Y(y)))), and from image to language (image captioning) as y = g_Y(D_Y(E_X(f_X(x)))).
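A minimal sketch of this inference chain, with placeholder linear maps standing in for the trained networks; the dimensions loosely mirror those reported in section 4 (13-d word embeddings, 100-d image embeddings, 28×56 flattened output images), but the weights here are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W_fy = rng.normal(size=(13, 20))    # stand-in weights for the text embedder
W_ey = rng.normal(size=(100, 13))   # text-side encoder to the shared latent
W_dx = rng.normal(size=(100, 100))  # image-side decoder from the shared latent
W_gx = rng.normal(size=(1568, 100)) # image generator (28x56 image, flattened)

f_y = lambda y: W_fy @ y            # caption -> text embedding
E_y = lambda e: W_ey @ e            # text embedding -> shared latent z
D_x = lambda z: W_dx @ z            # shared latent z -> image embedding
g_x = lambda e: W_gx @ e            # image embedding -> image pixels

caption = rng.normal(size=20)       # stand-in for a raw caption representation
image = g_x(D_x(E_y(f_y(caption)))) # language -> image generation chain
```

The image-captioning direction simply swaps the roles of the two sides: y = g_y(D_y(E_x(f_x(x)))).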
4 Experimental results
Using the proposed conditional generative model, we generate images from corresponding text and speech representations. As an example, we use the text "seven five", as well as the corresponding speech representation, and independently learn to generate images of "75" from each of the two modalities.
For generating the embedding space from images, we evaluate our method using multilayer-perceptron (MLP) based variational autoencoders (mlp-vae) (Kingma & Welling, 2013), convolutional variational autoencoders (conv-vae), MLP-based autoencoders (mlp-ae), and convolutional autoencoders (conv-ae). For generating the word embeddings we use word2vec (Mikolov et al., 2013), trained on the Wikipedia word corpus. For the speech signals, we use Mel-frequency cepstral coefficient (MFCC) features as speech embeddings. For generating text and speech back from the embeddings, we use nearest neighbor to retrieve the closest data point to the query. For the image embeddings, we use a normalization function as the forward function, E_X(e) = (e − μ)/σ, and the corresponding unnormalization function as its inverse, D_X(z) = zσ + μ, where μ and σ are the mean and standard deviation of the image embeddings in the training set. For the word and speech embeddings, we represent the nonlinear mapping functions as neural networks; specifically, we use simple fully connected networks containing one hidden layer in the encoder and decoder units. The L2 loss is used for all components of the objective function, and the Adam optimizer with default parameters is used to minimize the objective in equation 2.

Since a generated image will have attributes that differ from the images in the test dataset, it is difficult to compare it directly with an arbitrary instance of the test set. We therefore first find the test-set image I_t that is closest (by Euclidean distance) to the generated image I_g, and compute the peak signal-to-noise ratio (PSNR) of the generated image with respect to it:

PSNR = 20 log_10(MAX_I) − 10 log_10(MSE(I_g, I_t)),    (3)

where MAX_I is the maximum possible pixel value.
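A sketch of this evaluation procedure, assuming pixel values normalized to [0, 1] (so MAX_I = 1); the function name and toy data below are ours.

```python
import numpy as np

def psnr_to_nearest(generated, test_set):
    """PSNR of a generated image against its nearest test image (Eq. 3)."""
    # test_set: (N, H*W) array of flattened test images
    dists = np.linalg.norm(test_set - generated, axis=1)
    nearest = test_set[np.argmin(dists)]         # closest by Euclidean distance
    mse = np.mean((generated - nearest) ** 2)
    if mse == 0:
        return np.inf                            # identical images
    return 20 * np.log10(1.0) - 10 * np.log10(mse)

# Toy data: three flat "images"; the generated one is closest to the 0.5 image.
test_set = np.stack([np.zeros(16), np.ones(16), np.full(16, 0.5)])
gen = np.full(16, 0.45)
score = psnr_to_nearest(gen, test_set)           # about 26.02 dB
```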
As a baseline method, we use a direct mapping from the word (or speech) embedding space to image pixel space using convolutional neural networks similar to the discriminator network of the DCGAN architecture (Radford et al., 2015). This method of directly regressing the image from word embeddings does not involve generation from latent variables, and thus it learns a mean image representation for each word embedding, as demonstrated later in this section. Experimental evaluations were performed to generate MNIST images from text and speech embeddings. In all four cases, we use a 100-dimensional embedding for the images and word embeddings of dimension 13 for each word. Speech embeddings of dimension 13 were used for each digit's audio (sampling rate 24 kHz; each digit has a duration of 0.5 s).
Double-digit numbers were created by concatenating MNIST digit images horizontally. While training the image autoencoder, we randomly remove some two-digit combinations and train on the remaining images; thus the autoencoder has never been explicitly trained to generate these images. For the word and speech embeddings, we simply concatenate the respective embeddings of each digit. While learning the mapping between image and word or speech embeddings, we hide the mapping for the unseen images mentioned above and learn the mapping only on the remaining image-word combinations. During testing, we give the word embeddings of those unseen two-digit numbers and generate the corresponding images.
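The dataset construction described above can be sketched as follows; random arrays stand in for actual MNIST digits, and the held-out combinations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
# One 28x28 exemplar per digit; in the real experiment these are MNIST images.
digits = {d: rng.random((28, 28)) for d in range(10)}

def make_double(a, b):
    """Concatenate two digit images side by side into a 28x56 image."""
    return np.concatenate([digits[a], digits[b]], axis=1)

# Combinations never seen at training time (illustrative choices).
held_out = {(7, 5), (3, 9)}
train_pairs = [(a, b) for a in range(10) for b in range(10)
               if (a, b) not in held_out]

img = make_double(7, 5)   # such a combination is generated only at test time
```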
A similar experiment with the addition of a color attribute (to increase the complexity of the mapping) was also performed; the colors are red, green, and blue only. Figures 2(c,d) and 4(c,d) (in the appendix) show colored MNIST images.
4.1 Qualitative Evaluation
Figure 2 shows the generated double-digit images from text combinations that were unseen during training. Mapping using the convolutional VAE (figure 2(a)) as the image autoencoder produces clear images from the word embeddings; the MLP VAE (not shown) produces comparatively blurry images. For direct regression (figure 2(b)), the network learns to generate mean images of digits from the training dataset, which can be seen for digits like "one" and "five". It is to be noted that while direct regression learns to associate word embeddings with the blurry mean images it encountered during training, the proposed method finds mappings in the semantically meaningful image embedding space and subsequently generates clearer, higher-quality images. We see similar results for the colored double-MNIST dataset in figures 2(c) and 2(d), where the convolutional VAE generalizes better to novel color and digit combinations than the baseline direct-regression method. Similar results are seen for image generation from speech in figure 4, as shown in the supplementary materials.
4.2 Quantitative Evaluation
To quantitatively evaluate the performance of our algorithm, we take the mean PSNR value, computed by equation 3, over all double-digit and all colored-double-digit combinations. Table 1 shows the PSNR values for both experiments using the different image encoder-decoder methods and direct regression from text embeddings and speech features. The convolutional VAE produces the highest PSNR in both experiments from text embeddings and on double MNIST from speech features, whereas the convolutional autoencoder produces the best PSNR for generating colored double-digit MNIST images from speech. The direct method suffers from low PSNR because the blurry nature of its digits causes high divergence from the closest image in the test set.
5 Conclusion
We propose a multimodal mapping model that can generate images even from unseen captions. The core of the proposed algorithm is to explicitly learn separate generative models for low-dimensional embeddings of multimodal data; enforcing the key equality constraint between the latent representations then enables two-way generation of information. We demonstrate the validity of the proposed model through experiments on generating two-digit and colored two-digit MNIST images from word embeddings and speech features not seen during training. In the future, we hope to extend our method to generating images from complex captions using higher-dimensional natural images.
References
 Gupta et al. (2017) Gupta, Abhishek, Devin, Coline, Liu, Yuxuan, Abbeel, Pieter, and Levine, Sergey. Learning invariant feature spaces to transfer skills with reinforcement learning. CoRR, abs/1703.02949, 2017. URL http://arxiv.org/abs/1703.02949.

Hinton (2002) Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.

Hinton et al. (2006) Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
 Kim et al. (2017) Kim, Taeksoo, Cha, Moonsu, Kim, Hyunsoo, Lee, Jungkwon, and Kim, Jiwon. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2014) Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
 Liu & Tuzel (2016) Liu, Ming-Yu and Tuzel, Oncel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 469–477, 2016.
 Mansimov et al. (2015) Mansimov, Elman, Parisotto, Emilio, Ba, Jimmy, and Salakhutdinov, Ruslan. Generating images from captions with attention. CoRR, abs/1511.02793, 2015.
 Mikolov et al. (2013) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 Ngiam et al. (2011) Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, and Ng, Andrew Y. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML11), pp. 689–696, 2011.
 Radford et al. (2015) Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Reed et al. (2016) Reed, Scott, Akata, Zeynep, Yan, Xinchen, Logeswaran, Lajanugen, Schiele, Bernt, and Lee, Honglak. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pp. 1060–1069. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045503.
 Shams & Seitz (2008) Shams, Ladan and Seitz, Aaron R. Benefits of multisensory learning. Trends in cognitive sciences, 12(11):411–417, 2008.
Appendix A Network description
In this section we provide details of the network architectures to ensure reproducibility. For all networks, the Adam optimizer is used with the default values of the Keras neural network library. For the image autoencoders, binary cross-entropy was used as the reconstruction loss. For mapping embeddings to the latent space, mean squared error was used as the loss function.
a.1 Image autoencoder
For both double MNIST and colored double MNIST, we use the same network configuration.

Convolutional Variational Autoencoder: For the encoder, we used 2D convolution layers with max-pooling, followed by a dense layer of size 256 and two dense layers of dimension 100 for the mean and standard deviation of the image embedding space. For the decoder, we used a dense layer, followed by a 2D convolution layer with upsampling and a final 2D convolution layer with a single-channel output to produce the image back. ReLU nonlinearities were used for all layers except the last convolution layer of the decoder, which used a sigmoid.

MLP Variational Autoencoder: For the encoder, we used a hidden dense layer of size 256 and two dense layers of dimension 100 for the mean and standard deviation of the image embedding space. For the decoder, we used a hidden dense layer of size 256 and a dense layer of dimension 1568 to produce the image back. ReLU nonlinearities were used for all layers except the last dense layer of the decoder, which used a sigmoid.

Convolutional Autoencoder: For the encoder, we used a 2D convolution layer with max-pooling, followed by another 2D convolution layer with max-pooling and a dense layer of dimension 100 for the image embedding space. For the decoder, we used a dense layer, followed by two 2D convolution layers with upsampling and, finally, a 2D convolution layer with a single-channel output to produce the image back. ReLU nonlinearities were used for all layers except the last convolution layer of the decoder, which used a sigmoid.

MLP Autoencoder: For the encoder, we used a hidden dense layer of size 256 and a dense layer of dimension 100 for the image embedding space. For the decoder, we used a hidden dense layer of size 256 and a dense layer of dimension 1568 to produce the image back. ReLU nonlinearities were used for all layers except the last dense layer of the decoder, which used a sigmoid.
a.2 Mapping module network architecture
For the image embedding, we use a normalization function as the forward function, E_X(e) = (e − μ)/σ, and the corresponding unnormalization function as the inverse, D_X(z) = zσ + μ, where μ and σ are the mean and standard deviation of the image embeddings in the training set. In other words, we force the shared latent space to be the normalized image embedding space in our implementation. For the word embeddings, we represent the nonlinear mapping functions as neural networks: the encoder from the word embedding to the common latent space contains a hidden dense layer of size 256 and a dense layer of dimension 100 for the common latent space.
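A minimal sketch of this normalization mapping, with illustrative per-dimension statistics; by construction, inverse(forward(e)) recovers e exactly.

```python
import numpy as np

# Training-set statistics of the image embedding (illustrative values).
mu = np.array([0.2, 1.0, -0.5])
sigma = np.array([1.5, 0.5, 2.0])

def forward(e):
    """Image embedding -> shared latent (normalization)."""
    return (e - mu) / sigma

def inverse(z):
    """Shared latent -> image embedding (unnormalization)."""
    return z * sigma + mu

e = np.array([1.0, 2.0, 3.0])
round_trip = inverse(forward(e))   # recovers e
```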
Appendix B Additional results
We present additional results for image generation from text embeddings in figure 3, using the various types of autoencoders as image generators. Qualitative inspection reveals that the convolutional VAE produces the clearest and best-looking results, in accordance with the PSNR values in Table 1. For image generation from speech embeddings (MFCC features), please refer to figure 4; the proposed method produces clearer results than the direct-regression baseline, again in accordance with the PSNR values in Table 1.
Appendix C Experimental details
Experimental evaluations were performed on the MNIST dataset for four cases covering both speech and text data. It is to be noted that, for both speech and text, the embeddings are fixed for each class; thus each class in the double-digit MNIST case (100 classes in total) and the colored double-digit MNIST case (900 classes in total) has a single speech signal and text embedding.

Double MNIST digits: In this experiment, we attempt to learn the concept of double digits by testing our algorithm's ability to generate novel images of double-digit combinations from text embeddings. Double-digit numbers (100 classes in total) are created by concatenating MNIST digit images horizontally, with 1000 images generated for each two-digit combination. During training of the image autoencoder, we randomly remove a set of sixteen two-digit combinations and train on the remaining 84,000 images; thus the image autoencoder has explicitly never been trained to create these images. For the text and speech embeddings, we simply concatenate the embeddings of each digit. While learning the mapping between image and text or speech embeddings, we hide the mapping for those 16,000 images and learn the mapping only on the 84,000 image-(text, speech) combinations. During testing, we give the text and speech embeddings of those sixteen two-digit numbers and generate the corresponding images.

Colored double MNIST digits: This experiment is similar to the above, with the addition of a color attribute to increase the complexity of the mapping; the colors are red, green, and blue only. For each two-digit combination, 4000 images are generated by randomly juxtaposing images of digits with random colors. During training of the image autoencoder, we randomly remove a set of sixteen colored two-digit combinations and train on the remaining 336,000 images. For the text and speech embeddings, we simply concatenate the embeddings of each word of the digit (for example, red 5 blue 1 is concat(word2vec("red", "five", "blue", "one"))). While learning the mapping between image and text or speech embeddings, we hide the mapping for those 64,000 images and learn the mapping only with the other 336,000 images. During testing, we give the text and speech embeddings of those sixteen two-digit numbers with the 9 color combinations (3 × 3) and generate the corresponding images.