I Introduction
Multimedia data with different modalities, including image, video, text, audio and so on, is now mixed together and represents comprehensive knowledge to perceive the real world. Research in cognitive science indicates that, in the human brain, cognition of the outside world arises through the fusion of multiple sensory organs [1]. However, there exists a heterogeneity gap, which makes it quite difficult for artificial intelligence to simulate the above human cognitive process, because multimedia data of various modalities on the Internet is huge in quantity but inconsistent in distribution and representation.
Naturally, crossmodal correlation exists among the heterogeneous data of different modalities to describe specific kinds of statistical dependencies [2]. For example, an image and the textual descriptions coexisting in one web page may be intrinsically correlated in content and share a certain level of semantic consistency. Therefore, it is necessary to automatically exploit and understand such latent correlation across the data of different modalities, and further construct metrics on them to measure how semantically relevant they are. To address these issues, an intuitive idea is to model the joint distribution over the data of different modalities to learn a common representation, which forms a shared space into which heterogeneous data is mapped. The similarities between them can then be computed directly with common distance metrics. Figure 1 illustrates this framework. In this way, the heterogeneity gap among the data of different modalities can be reduced, so that heterogeneous data can be correlated more easily to realize various practical applications, such as crossmodal retrieval [3], where data of different modalities can be retrieved at the same time by a query of any modality.
Following the above idea, some methods [4, 5, 6] have been proposed to deal with the heterogeneity gap and learn the common representation by modeling the crossmodal correlation, so that the similarities between different modalities can be measured directly. These existing methods can be divided into two major categories according to their models. The first follows the traditional framework, which attempts to learn mapping matrices for the data of different modalities by optimizing statistical values, so as to project them into one common space. Canonical Correlation Analysis (CCA) [7] is one of the representative works, and it has many extensions such as [8, 9]. The second kind of methods [10, 11, 12] utilize the strong learning ability of deep neural networks (DNNs) to construct multilayer networks, and most of them aim to minimize the correlation learning error across different modalities for common representation learning.
Recently, generative adversarial networks (GANs) [13] have been proposed to estimate a generative model by an adversarial training process. The basic model of GANs consists of two components, namely a generative model G and a discriminative model D. The generative model aims to capture the data distribution, while the discriminative model attempts to discriminate whether the input sample comes from real data or is generated by G. Furthermore, an adversarial training strategy is adopted to train these two models simultaneously, which makes them compete with each other for mutual promotion to learn a better representation of the input data. Inspired by the recent progress of GANs, researchers have applied GANs to computer vision tasks such as image synthesis [14], video prediction [15] and object detection [16].
Due to the strong ability of GANs in modeling data distribution and learning discriminative representation, it is natural to utilize GANs for modeling the joint distribution over the heterogeneous data of different modalities, with the aim of learning the common representation and boosting crossmodal correlation learning. However, most of the existing GANs-based works only focus on the unidirectional generative problem of generating new data for some specific applications, such as image synthesis that generates an image from a noise input [14], side information [17] or a text description [18]. Their main purpose is to generate new data of a single modality, which cannot effectively establish correlation on multimodal data with heterogeneous distribution. Different from the existing works, we aim to utilize GANs for establishing correlation on the existing large-scale heterogeneous data of different modalities by common representation generation, which is a completely different goal: modeling the joint distribution over the multimodal input.
For addressing the above issues, we propose Crossmodal Generative Adversarial Networks (CMGANs), which aims to learn discriminative common representation with multipathway GANs to bridge the gap between different modalities. The main contributions can be summarized as follows.

Crossmodal GANs architecture is proposed to deal with the heterogeneity gap between different modalities, which can effectively model the joint distribution over the heterogeneous data simultaneously. The generative model learns to fit the joint distribution by modeling intermodality correlation and intramodality reconstruction information, while the discriminative model learns to judge the relevance both within the same modality and between different modalities. The generative and discriminative models compete with each other in a minimax game for better crossmodal correlation learning.

Crossmodal convolutional autoencoders with weight-sharing constraint are proposed to form the two parallel generative models. Specifically, the encoder layers contain a convolutional neural network to learn high-level semantic information for each modality, and also exploit the crossmodal correlation through the weight-sharing constraint for learning the common representation. The decoder layers aim to model the reconstruction information, which can preserve semantic consistency within each modality.

Crossmodal adversarial mechanism is proposed to perform a novel adversarial training strategy in the crossmodal scenario, which utilizes two kinds of discriminative models to simultaneously conduct intermodality and intramodality discrimination. Specifically, the intermodality discrimination aims to discriminate which modality the generated common representation comes from, while the intramodality discrimination aims to discriminate the generated reconstruction representation from the original input. The two mutually boost each other and force the generative models to learn more discriminative common representation through the adversarial training process.
To the best of our knowledge, our proposed CMGANs approach is the first to utilize GANs to perform crossmodal common representation learning. With the learned common representation, heterogeneous data can be correlated by a common distance metric. We conduct extensive experiments on the crossmodal retrieval paradigm to evaluate the performance of crossmodal correlation, which aims to retrieve the relevant results across different modalities by a distance metric on the learned common representation, as shown in Figure 2. Comprehensive experimental results show the effectiveness of our proposed approach, which achieves the best retrieval accuracy compared with 10 state-of-the-art crossmodal retrieval methods on 3 datasets: the widely used Wikipedia and Pascal Sentence datasets and our constructed large-scale XMediaNet dataset.
The rest of this paper is organized as follows. We first briefly introduce the related works on crossmodal correlation learning methods as well as existing GANs-based methods in Section II. Section III presents our proposed CMGANs approach. Section IV introduces the experiments of crossmodal retrieval conducted on 3 crossmodal datasets with the result analyses. Finally, Section V concludes this paper.
II Related Works
In this section, the related works are briefly reviewed from the following two aspects: crossmodal correlation learning methods, and representative works based on GANs.
II-A Crossmodal Correlation Learning Methods
For bridging the heterogeneity gap between different modalities, there are some methods proposed to conduct crossmodal correlation learning, which mainly aim to learn the common representation and correlate the heterogeneous data by distance metric. We briefly introduce the representative methods of crossmodal correlation learning with the following two categories, namely traditional methods and deep learning based methods.
Traditional methods mainly learn linear projections to maximize the correlation between the pairwise data of different modalities, projecting the features of different modalities into one common space to generate the common representation. One class of methods attempts to optimize statistical values to perform statistical correlation analysis. A representative method is to adopt canonical correlation analysis (CCA) [19] to construct a lower-dimensional common space, which has many extensions, such as adopting a kernel function [9], integrating semantic category labels [7], taking high-level semantics as the third view to perform multi-view CCA [8], and considering the semantic information in the form of multi-label annotations [20]. Besides, crossmodal factor analysis (CFA) [21], which is similar to CCA, minimizes the Frobenius norm between the pairwise data to learn the projections for the common space. Another class of methods integrates graph regularization into crossmodal correlation learning, namely constructing graphs for correlating the data of different modalities in the learned common space. For example, Zhai et al. [22] propose the joint graph regularized heterogeneous metric learning (JGRHML) method, which adopts both metric learning and graph regularization, and they further integrate semi-supervised information to propose joint representation learning (JRL) [23]. Wang et al. [24] also adopt graph regularization to simultaneously preserve the intermodality and intramodality correlation.
As for the deep learning based methods, with its strong power of nonlinear correlation modeling, the deep neural network has made great progress in numerous single-modal problems such as object detection [25] and image classification [26]. It has also been utilized to model the crossmodal correlation. Ngiam et al. [27] propose the bimodal autoencoder, which is an extension of the restricted Boltzmann machine (RBM), to model the crossmodal correlation at the shared layer, followed by some similar network structures such as [28][29]. The multimodal deep belief network [30] is proposed to model the distribution over the data of different modalities and learn the crossmodal correlation by a joint RBM. Feng et al. [11] propose the correspondence autoencoder (CorrAE), which jointly models the crossmodal correlation and reconstruction information. Deep canonical correlation analysis (DCCA) [31, 32] attempts to combine deep networks with CCA. The above methods mainly contain two subnetworks linked at the joint layer for correlating the data of different modalities. Furthermore, crossmodal multiple deep networks (CMDN) are proposed [10] to construct a hierarchical network structure for both intermodality and intramodality modeling. The crossmodal correlation learning (CCL) [33] method is further proposed to integrate fine-grained information as well as a multi-task learning strategy for better performance.
II-B Generative Adversarial Networks
Since generative adversarial networks (GANs) were proposed by Goodfellow et al. [13] in 2014, a series of GANs-based methods have arisen for a wide variety of problems. The original GANs consist of a generative model $G$ and a discriminative model $D$: $G$ aims to capture the distribution over real data, while the adversarial model $D$ aims to discriminate between real data and generated fake data. Specifically, $D$ and $G$ play the minimax game on $V(D, G)$ as follows.
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
where $x$ denotes the real data and $z$ is the noise input, and this minimax game has a global optimum when $p_g = p_{data}$, with $p_g$ being the distribution of the data generated by $G$. Furthermore, Mirza et al. [17] propose conditional GANs (cGAN) to condition the generated data on side information instead of uncontrollable noise input.
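The game described above can be checked numerically on a toy example. The following minimal sketch assumes the standard value function of [13], written over a finite support as V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))], where p_g is the distribution induced by the generator. The distributions and the function names (`value`, `optimal_discriminator`) are illustrative assumptions and are not part of the original work.

```python
import math

def value(p_data, p_g, d):
    """V(D, G) on a finite support: sum of p_data*log(D) + p_g*log(1 - D)."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))

def optimal_discriminator(p_data, p_g):
    """Pointwise maximizer of V for a fixed generator: p_data / (p_data + p_g)."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

# Toy distributions over three outcomes (assumed values, for illustration only).
p_data = [0.5, 0.3, 0.2]
p_g_bad = [0.2, 0.3, 0.5]    # generator that does not match the data
p_g_good = list(p_data)      # generator that matches the data exactly

v_bad = value(p_data, p_g_bad, optimal_discriminator(p_data, p_g_bad))
v_good = value(p_data, p_g_good, optimal_discriminator(p_data, p_g_good))
```

When the generated distribution matches the data distribution, the best discriminator can only output 1/2 everywhere, and the value against the best discriminator reaches its minimum of -log 4; any mismatch leaves the generator with a larger value.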
Most of the existing GANs-based works focus on the generative problem of generating new data, and they are mainly developed for some specific applications. One of the most popular applications is image synthesis to generate natural images. Radford et al. [14] propose deep convolutional generative adversarial networks (DCGANs) to generate images from a noise input by using deconvolutions. Denton et al. [34] utilize conditional GANs and propose a Laplacian pyramid framework with a cascade of convolutional networks to generate images in a coarse-to-fine fashion. Besides image generation, Ledig et al. [35] propose the super-resolution generative adversarial network (SRGAN), which designs a perceptual loss function consisting of an adversarial loss and a content loss. Wang et al. [36] propose the style and structure generative adversarial network (SGAN) to conduct style transfer from normal maps to realistic images. Li et al. [16] propose the perceptual GAN, which performs small object detection by transferring the poor representation of a small object to a super-resolved one to improve the detection performance.
The methods mentioned above cannot handle multimedia data with heterogeneous distribution. Recently, some works have been proposed to explore multimedia applications. Odena et al. [37] attempt to generate images conditioned on class labels, which forms the auxiliary classifier GAN (ACGAN). Reed et al. [18] utilize GANs to translate visual concepts from sentences to images. They further propose the generative adversarial what-where network (GAWWN) [38] to generate images given a description of what content to draw in which location. Furthermore, Zhang et al. [39] adopt StackGAN to synthesize photo-realistic images from text descriptions, which can generate higher resolution images than the prior work [18].
However, the aforementioned works still have limited flexibility, because they only address the generative problem from one modality to another, unidirectionally, through a one-pathway network structure. They cannot model the joint distribution over the multimodal input to correlate the large-scale heterogeneous data. Inspired by GANs’ strong ability in modeling data distribution and learning discriminative representation, we utilize GANs for modeling the joint distribution over the data of different modalities to learn the common representation, which aims to further construct correlation on the large-scale heterogeneous data across various modalities.
III Our CMGANs Approach
The overall framework of our proposed CMGANs approach is shown in Figure 3. For the generative model, crossmodal convolutional autoencoders are adopted to generate the common representation by exploiting the crossmodal correlation with weightsharing constraint, and also generate the reconstruction representation aiming to preserve the semantic consistency within each modality. For the discriminative model, two kinds of discriminative models are designed with intermodality and intramodality discrimination, which can make discrimination on both the generated common representation as well as the generated reconstruction representation for mutually boosting. The above two models are trained together with crossmodal adversarial mechanism for learning more discriminative common representation to correlate heterogeneous data of different modalities.
III-A Notation
The formal definition is first introduced. We aim to conduct correlation learning on the multimodal dataset, which consists of two modalities, namely as image and as text. The multimodal dataset is represented as , where denotes the training data and testing data is . Specifically, , where and . and are the th instance of image and text, and totally instances of each modality are in the training set. Furthermore, there are semantic category labels and for each image and text instance respectively. As for the testing set denoted as , there are instances for each modality including and .
Our goal is to learn the common representation for each image or text instance so as to calculate crossmodal similarities between different modalities, which can correlate the heterogeneous data. For further evaluating the effectiveness of the learned common representation, crossmodal retrieval is conducted based on the common representation, which aims to retrieve the relevant text from by giving an image query from , and vice versa to retrieve image by a query of text. In the following subsections, first our proposed network architecture is introduced, then followed by the objective functions of the proposed CMGANs, and finally the training procedure of our model is presented.
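The retrieval step built on the common representation can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the metric (the text only requires a common distance metric), and the toy vectors as well as the names `cosine` and `retrieve` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two common-representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)

# Toy common representations: one image query and three candidate texts.
image_query = [0.9, 0.1, 0.0]
text_gallery = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0], [0.1, 0.0, 1.0]]
ranking = retrieve(image_query, text_gallery)
```

Because both modalities live in the same space, the same function serves text queries against an image gallery without modification.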
III-B Crossmodal GANs architecture
As shown in Figure 3, we introduce the detailed network architectures of the generative model and discriminative model in our proposed CMGANs approach respectively as follows.
III-B1 Generative model
We design crossmodal convolutional autoencoders to form two parallel generative models for each modality respectively, denoted as for image and for text, which can be divided into encoder layers and as well as decoder layers and . The encoder layers contain convolutional neural network to learn highlevel semantic information for each modality, followed by several fullyconnected layers which is linked at the last one with weightsharing and semantic constraints to exploit crossmodal correlation for the common representation learning. While the decoder layers aim to reconstruct the highlevel semantic representation obtained from the convolutional neural network ahead, which can preserve semantic consistency within each modality.
For image data, each input image is first resized into , and then fed into the convolutional neural network to exploit the high-level semantic information. The encoder layers have the following two subnetworks. The convolutional layers have the same configuration as the 19-layer VGGNet [40], which is pretrained on ImageNet and fine-tuned on the training image data. We generate a 4,096-dimensional feature vector from the fc7 layer as the original high-level semantic representation for image, denoted as . Then, several additional fully-connected layers conduct the common representation learning, where the learned common representation for image is denoted as . The decoder layers have a relatively simple structure, which contains several fully-connected layers to generate the reconstruction representation from , in order to reconstruct to preserve semantic consistency of image.
For text data, assuming that input text instance consists of words, each word is represented as a dimensional feature vector, which is extracted by the Word2Vec model [41] pretrained on billions of words in Google News. Thus, the input text instance can be represented as an matrix. The encoder layers also have the following two subnetworks. The convolutional layers on the input matrix have the same configuration as [42] to generate the original high-level semantic representation for text, denoted as . Similarly, several additional fully-connected layers follow to learn the text common representation, denoted as . The decoder layers, which are also made up of fully-connected layers, aim to preserve semantic consistency of text by reconstructing with the generated reconstruction representation .
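The construction of the text input matrix can be illustrated with a short sketch. It assumes a dictionary of word vectors standing in for the pretrained Word2Vec model, and it maps out-of-vocabulary words to a zero vector, which is an assumption made here for illustration and is not stated in the text. The name `text_to_matrix` and the toy 3-dimensional vectors are hypothetical.

```python
def text_to_matrix(words, embeddings, dim):
    """Stack per-word vectors into a matrix with one row per word."""
    zero = [0.0] * dim  # assumed fallback for words without an embedding
    return [list(embeddings.get(word, zero)) for word in words]

# Toy 3-dimensional embeddings (real vectors would come from Word2Vec).
toy_embeddings = {
    "a": [0.1, 0.0, 0.2],
    "dog": [0.7, 0.3, 0.1],
    "runs": [0.2, 0.9, 0.4],
}
matrix = text_to_matrix(["a", "dog", "runs", "quickly"], toy_embeddings, dim=3)
```

The resulting matrix has as many rows as the text has words and as many columns as the embedding dimension, which is the shape the convolutional layers of the text pathway take as input.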
For the weightsharing and semantic constraints, we aim to correlate the generated common representation of each modality. Specifically, the weights of last few layers of the image and text encoders are shared, which are responsible for generating the common representation of image and text, with the intuition that common representation for a pair of corresponding image and text should be as similar as possible. Furthermore, the weightsharing layers are followed by a softmax layer for further exploiting the semantic consistency between different modalities. Thus, the crossmodal correlation can be fully modeled to generate more discriminative common representation.
III-B2 Discriminative model
We adopt two kinds of discriminative models to simultaneously conduct intramodality and intermodality discrimination. Specifically, the intramodality discrimination aims to discriminate the generated reconstruction representation from the original input, while the intermodality discrimination aims to discriminate whether the generated common representation comes from the image or the text modality.
For the intramodality discriminative model, it consists of two subnetworks for image and text, denoted as and respectively. Each of them is made up of several fullyconnected layers, which takes original highlevel semantic representation as the real data and reconstruction representation as the generated data to make discrimination. For the intermodality discriminative model denoted as , a twopathway network is also adopted, where is for image pathway and is for text pathway. Both of them aim to discriminate which modality the common representation is from, such as on image pathway to discriminate the image common representation as the real data, while its corresponding text common representation and mismatched image common representation with different category as the fake data. So as for the text pathway to discriminate the text common representation with others.
III-C Objective Functions
The objective of the proposed generative model and discriminative model in CMGANs is defined as follows.

Generative model: As shown in the middle of Figure 3, and for image and text respectively, each of which generates three kinds of representations, namely original representation and common representation from or , also the reconstruction representation from or , such as , and for image instance . The goal of generative model is to fit the joint distribution by modeling both intermodality correlation and intramodality reconstruction information.

Discriminative model: As shown in the right of Figure 3, and for intramodality discrimination, while and for intermodality discrimination. For the intramodality discrimination, aims to distinguish the real image data with the generated reconstruction data , and is similar for text. For the intermodality discrimination, tries to discriminate the image common representation as the real data with both text common representation and the common representation of mismatched image as fake data. Each of them is concatenated with their corresponding original image representation and of the mismatched one for better discrimination. is similar to discriminate the text common representation with the fake data of and .
With the above definitions, the generative model and discriminative model can beat each other with a minimax game, and our CMGANs can be trained by jointly solving the learning problem of two parallel GANs.
(2) 
The generative model aims to learn more similar common representations for instances of different modalities with the same category, as well as closer reconstruction representations within each modality, in order to fool the discriminative model, while the discriminative model tries to distinguish each of them to conduct intramodality discrimination with and intermodality discrimination with . The objective functions are given as follows.
(3)  
(4) 
With the above objective functions, the generative model and discriminative model can be trained iteratively to learn more discriminative common representation for different modalities.
III-D Crossmodal Adversarial Training Procedure
With the defined objective functions in equations (3) and (4), the generative model and discriminative model are trained iteratively in an adversarial way. The parameters of generative model are fixed during the discriminative model training stage and vice versa. It should be noted that we keep the parameters of convolutional layers fixed during the training phase, for the fact that our main focus is the crossmodal correlation learning. In the following paragraphs, the optimization algorithms of these two models are presented respectively.
III-D1 Optimizing discriminative model
For the intramodality discrimination, as shown in Figure 3, taking image pathway as an example, we first generate the original highlevel representation and the reconstruction representation from the generative model. Then, the intramodality discrimination for image aims to maximize the loglikelihood for correctly distinguishing as the real data and as the generated reconstruction data, by ascending its stochastic gradient as follows:
(5) 
where is the number of instance in one batch. Similarly, the intramodality discriminative model for text can be updated with the following equation:
(6) 
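The intramodality update described above can be sketched in a few lines. The sketch assumes the standard GAN log-likelihood form, namely the batch average of log D(h) + log(1 - D(r)) for an original representation h and a reconstruction representation r, and a discriminator made of one fully-connected layer followed by a sigmoid. All names (`d_score`, `intra_objective`, `ascent_step`), the learning rate and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_score(w, b, x):
    """One fully-connected layer followed by a sigmoid: a single score."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def intra_objective(w, b, originals, reconstructions):
    """Batch average of log D(h) + log(1 - D(r)); the discriminator ascends it."""
    n = len(originals)
    return sum(math.log(d_score(w, b, h)) + math.log(1.0 - d_score(w, b, r))
               for h, r in zip(originals, reconstructions)) / n

def ascent_step(w, b, originals, reconstructions, lr=0.1):
    """One stochastic gradient ascent step on intra_objective."""
    n = len(originals)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for h, r in zip(originals, reconstructions):
        coef_h = 1.0 - d_score(w, b, h)   # derivative of log(sigmoid(a))
        coef_r = -d_score(w, b, r)        # derivative of log(1 - sigmoid(a))
        for j in range(len(w)):
            grad_w[j] += (coef_h * h[j] + coef_r * r[j]) / n
        grad_b += (coef_h + coef_r) / n
    return [wj + lr * gj for wj, gj in zip(w, grad_w)], b + lr * grad_b

# Toy batch: original representations versus their reconstructions.
h_batch = [[1.0, 0.0], [0.8, 0.2]]
r_batch = [[0.0, 1.0], [0.1, 0.9]]
w, b = [0.0, 0.0], 0.0
before = intra_objective(w, b, h_batch, r_batch)
w, b = ascent_step(w, b, h_batch, r_batch)
after = intra_objective(w, b, h_batch, r_batch)
```

The same routine would serve the text pathway by passing text representations instead, with the generator parameters held fixed while the discriminator is updated.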
Next, for the intermodality discrimination, there is also a two-pathway network for each modality. As for the image pathway, intermodality discrimination is conducted to maximize the log-likelihood of correctly discriminating the common representation of different modalities, specifically as the real data while the text common representation and mismatching image instance with different categories as the fake data, which are also concatenated with their corresponding original representation or of mismatched one for better discrimination. The stochastic gradient is calculated as follows:
(7)  
where means to concatenate the two representations, and so as to the following equations. Similarly, for the text pathway, the stochastic gradient can be calculated with following equation:
(8)  
III-D2 Optimizing generative model
There are two generative models for image and text. The image generative model aims to minimize the objective function to fit the true relevance distribution. It is trained by descending its stochastic gradient with the following equation, while the discriminative model is fixed at this time.
(9) 
where also means the concatenation of the two representation. For the text generative model, it is updated similarly by descending the stochastic gradient as follows:
(10) 
In summary, the overall training procedure is presented in Algorithm 1. It should be noted that the training procedure between the generative and discriminative models needs to be carefully balanced, for the fact that there are multiple different kinds of discriminative models to provide gradient information for intermodality and intramodality discrimination. Thus the generative model is trained for steps in each iteration in training stage to learn more discriminative representation.
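The alternation between the two models can be sketched as a simple schedule. The sketch assumes that each iteration performs one discriminator update followed by several generator updates; the order within an iteration and the number of generator steps `k` are assumptions here, and `update_discriminators` and `update_generators` are hypothetical callables standing in for the actual gradient steps.

```python
def train(num_iterations, k, update_discriminators, update_generators):
    """Alternate discriminator and generator updates.

    Each callable is assumed to update only its own model's parameters,
    leaving the other model's parameters untouched.
    """
    log = []
    for it in range(num_iterations):
        update_discriminators()      # generator parameters held fixed
        log.append(("D", it))
        for _ in range(k):           # k generator steps per iteration
            update_generators()      # discriminator parameters held fixed
            log.append(("G", it))
    return log

# Count the updates with stand-in callables to show the resulting schedule.
counts = {"D": 0, "G": 0}
schedule = train(
    num_iterations=3,
    k=2,
    update_discriminators=lambda: counts.__setitem__("D", counts["D"] + 1),
    update_generators=lambda: counts.__setitem__("G", counts["G"] + 1),
)
```

Giving the generator more steps per iteration is one way to keep it from being overwhelmed when several discriminators supply gradients at once.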
III-E Implementation Details
Our proposed CMGANs approach is implemented with Torch (http://torch.ch/), which is widely used as a scientific computing framework. The implementation details of the generative model and discriminative model are introduced respectively in the following paragraphs.
III-E1 Generative model
The generative model is in the form of crossmodal convolutional autoencoders with two pathways for image and text. The convolutional layers in the encoder have the same configuration as the 19-layer VGGNet [40] for the image pathway and the word CNN [42] for the text pathway, as mentioned above. Then two fully-connected layers are adopted in each pathway, and each layer is followed by a batch normalization layer and a ReLU activation function layer. The numbers of hidden units for the two layers are both 1,024. Through the encoder layers, we can get the common representation for image and text. Weights of the second fully-connected layer are shared between the text and image pathways to learn the correlation of different modalities. The decoder is made up of two fully-connected layers on each pathway, except that there is no subsequent layer after the second fully-connected layer. The dimension of the first layer is 1,024 and that of the second layer is the same as the original representation obtained by the CNN. In addition, the common representations are fed into a softmax layer for the semantic constraint.
III-E2 Intramodality Discriminative model
The discriminative model for intramodality is a network with one fully-connected layer for each modality, which projects the input feature vector into a single-value predicted score, followed by a sigmoid layer. To distinguish the original representation of an instance from the reconstructed representation, we label the original ones with 1 and the reconstructed ones with 0 during the discriminative model training phase.
III-E3 Intermodality Discriminative model
The discriminative model for intermodality is a two-pathway network, and both pathways take the concatenation of the common representation and the original representation as input. Each pathway consists of two fully-connected layers. The first layer has 512 hidden units, followed by a batch normalization layer and a ReLU activation function layer. The second layer generates the single-value predicted score from the output of the first layer and feeds it into a sigmoid layer, which is similar to the intramodality discriminative model. For the image pathway, the image common representation is labeled with 1, while its corresponding text representation and the mismatched image common representation are labeled with 0, and vice versa for the text pathway.
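The inputs and labels of the image pathway can be assembled as in the sketch below. It follows one reading of the description: the image common representation and its corresponding text common representation are each concatenated with the original image representation, while the mismatched image common representation is concatenated with the original representation of that mismatched image. This pairing, the function names and the toy vectors are assumptions for illustration.

```python
def concat(a, b):
    """Concatenate two representation vectors."""
    return list(a) + list(b)

def image_pathway_batch(s_img, h_img, s_txt, s_img_mis, h_img_mis):
    """Build (input, label) pairs for the image pathway.

    s_* are common representations and h_* are original representations;
    *_mis belongs to a mismatched image of a different category.
    """
    return [
        (concat(s_img, h_img), 1),          # image common representation: real
        (concat(s_txt, h_img), 0),          # corresponding text: fake
        (concat(s_img_mis, h_img_mis), 0),  # mismatched image: fake
    ]

batch = image_pathway_batch(
    s_img=[0.2, 0.8], h_img=[1.0, 0.0, 0.5],
    s_txt=[0.3, 0.7],
    s_img_mis=[0.9, 0.1], h_img_mis=[0.0, 1.0, 0.5],
)
labels = [label for _, label in batch]
```

The text pathway would be built symmetrically, with the text common representation labeled 1 and the image-side and mismatched-text inputs labeled 0.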
IV Experiments
In this section, we introduce the configurations of the experiments and show the results and analyses. Specifically, we conduct the experiments on three datasets, including two widely used datasets, Wikipedia and Pascal Sentence, and one large-scale XMediaNet dataset constructed by ourselves. We compare our approach with 10 state-of-the-art methods and 5 baseline approaches to comprehensively verify the effectiveness of our approach and the contribution of each component.
IV-A Datasets
The datasets used in the experiments are briefly introduced first, which are the XMediaNet, Pascal Sentence and Wikipedia datasets. The first is a large-scale crossmodal dataset constructed by ourselves. The others are widely used in crossmodal tasks.

XMediaNet dataset is a new large-scale dataset constructed by ourselves. It contains 5 media types, including text, image, audio, video and 3D model. Up to 200 independent categories are selected from WordNet (http://wordnet.princeton.edu/), including 47 animal species and 153 artifact species. In this paper, we use the image and text data in the XMediaNet dataset to conduct the experiments. There are 40,000 instances each for images and texts. The images are all crawled from Flickr (http://www.flickr.com), while the texts are the paragraphs of the corresponding introductions on the Wikipedia website. In the experiments, the XMediaNet dataset is divided into three parts: the training set has 32,000 pairs, while the validation set and the test set both have 4,000 pairs. Some examples of this dataset are shown in Figure 4.

Pascal Sentence dataset [43] is generated from the 2008 PASCAL development kit, consisting of 1,000 images with 20 categories. Each image is described by 5 sentences, which are treated as a document. Following [10, 11], we divide this dataset into 3 subsets in the same way as the Wikipedia dataset, namely 800 pairs for training, 100 pairs for validation and 100 pairs for testing.
| Method | Bimodal: Image→Text | Bimodal: Text→Image | Bimodal: Average | All-modal: Image→All | All-modal: Text→All | All-modal: Average |
| --- | --- | --- | --- | --- | --- | --- |
| Our CMGANs Approach | 0.567 | 0.551 | 0.559 | 0.581 | 0.590 | 0.586 |
| CCL [33] | 0.537 | 0.528 | 0.533 | 0.552 | 0.578 | 0.565 |
| CMDN [10] | 0.485 | 0.516 | 0.501 | 0.504 | 0.563 | 0.534 |
| DeepSM [44] | 0.399 | 0.342 | 0.371 | 0.351 | 0.338 | 0.345 |
| LGCFL [45] | 0.441 | 0.509 | 0.475 | 0.314 | 0.544 | 0.429 |
| JRL [23] | 0.488 | 0.405 | 0.447 | 0.508 | 0.505 | 0.507 |
| DCCA [32] | 0.425 | 0.433 | 0.429 | 0.433 | 0.475 | 0.454 |
| CorrAE [11] | 0.469 | 0.507 | 0.488 | 0.342 | 0.314 | 0.328 |
| KCCA [9] | 0.252 | 0.270 | 0.261 | 0.299 | 0.186 | 0.243 |
| CFA [21] | 0.252 | 0.400 | 0.326 | 0.318 | 0.207 | 0.263 |
| CCA [19] | 0.212 | 0.217 | 0.215 | 0.254 | 0.252 | 0.253 |

Table II: MAP scores of bimodal retrieval and all-modal retrieval on the Pascal Sentence dataset.
Method  Image→Text  Text→Image  Average (bimodal)  Image→All  Text→All  Average (all-modal)
Our CMGANs Approach  0.603  0.604  0.604  0.584  0.698  0.641 
CCL [33]  0.576  0.561  0.569  0.575  0.632  0.604 
CMDN [10]  0.544  0.526  0.535  0.496  0.627  0.562 
DeepSM [44]  0.560  0.539  0.550  0.555  0.653  0.604 
LGCFL [45]  0.539  0.503  0.521  0.385  0.420  0.403 
JRL [23]  0.563  0.505  0.534  0.561  0.631  0.596 
DCCA [32]  0.568  0.509  0.539  0.556  0.653  0.605 
CorrAE [11]  0.532  0.521  0.527  0.489  0.534  0.512 
KCCA [9]  0.488  0.446  0.467  0.346  0.429  0.388 
CFA [21]  0.476  0.470  0.473  0.470  0.497  0.484 
CCA [19]  0.203  0.208  0.206  0.238  0.301  0.270 

Table III: MAP scores of bimodal retrieval and all-modal retrieval on the Wikipedia dataset.
Method  Image→Text  Text→Image  Average (bimodal)  Image→All  Text→All  Average (all-modal)
Our CMGANs Approach  0.521  0.466  0.494  0.434  0.661  0.548 
CCL [33]  0.505  0.457  0.481  0.422  0.652  0.537 
CMDN [10]  0.487  0.427  0.457  0.407  0.611  0.509 
DeepSM [44]  0.478  0.422  0.450  0.391  0.597  0.494 
LGCFL [45]  0.466  0.431  0.449  0.392  0.598  0.495 
JRL [23]  0.479  0.428  0.454  0.404  0.595  0.500 
DCCA [32]  0.445  0.399  0.422  0.371  0.560  0.466 
CorrAE [11]  0.442  0.429  0.436  0.397  0.608  0.494 
KCCA [9]  0.438  0.389  0.414  0.354  0.518  0.436 
CFA [21]  0.319  0.316  0.318  0.279  0.341  0.310 
CCA [19]  0.298  0.273  0.286  0.268  0.370  0.319 

IV-B Evaluation Metric
With the learned common representation, the heterogeneous data can be correlated through a similarity metric. To comprehensively evaluate the performance of crossmodal correlation, we perform crossmodal retrieval on the 3 datasets with two kinds of retrieval tasks, namely bimodal retrieval and all-modal retrieval, which are defined as follows.
IV-B1 Bimodal retrieval
Retrieval is performed between different modalities, with the following two subtasks.

Image retrieves text (image→text): With images as queries, text instances in the testing set are retrieved by the computed crossmodal similarity.

Text retrieves image (text→image): With texts as queries, image instances in the testing set are retrieved by the computed crossmodal similarity.
IV-B2 All-modal retrieval
Retrieval is performed among all modalities, with the following two subtasks.

Image retrieves all modalities (image→all): With images as queries, both text and image instances in the testing set are retrieved by the computed crossmodal similarity.

Text retrieves all modalities (text→all): With texts as queries, both text and image instances in the testing set are retrieved by the computed crossmodal similarity.
It should be noted that, for fair comparison, all the compared methods adopt the same CNN features for both image and text, extracted from the CNN architectures used in our approach. Specifically, we extract the CNN feature for image from the fc7 layer of the 19-layer VGGNet [40], and the CNN feature for text from Word CNN with the same configuration as [42]. Besides, we use the source codes released by their authors to evaluate the compared methods fairly with the following steps: (1) Learning the common representation on the training data to obtain the projections or deep models. (2) Converting the testing data into the common representation with the learned projections or deep models. (3) Computing crossmodal similarity with cosine distance to perform crossmodal retrieval.
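Step (3) of this protocol, together with the two retrieval tasks, can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function names, the toy 2-D vectors and the instance ids are hypothetical, and ranking by descending cosine similarity is used as the equivalent of ranking by ascending cosine distance.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, gallery):
    """Rank gallery items, given as (id, vector) pairs, by descending similarity."""
    ranked = sorted(gallery,
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [item_id for item_id, _ in ranked]


# Toy common representations (hypothetical values, 2-D for readability).
images = [("img_cat", [0.9, 0.1]), ("img_car", [0.1, 0.9])]
texts = [("txt_cat", [0.8, 0.2]), ("txt_car", [0.2, 0.8])]
query = [1.0, 0.0]  # common representation of a hypothetical image query

print(retrieve(query, texts))           # bimodal retrieval: image→text
print(retrieve(query, images + texts))  # all-modal retrieval: image→all
```

The only difference between the two tasks in this sketch is the gallery: bimodal retrieval ranks instances of the other modality, while all-modal retrieval ranks text and image instances together.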
For the evaluation metric, we calculate the mean average precision (MAP) score over all returned results on all the 3 datasets. First, the Average Precision (AP) is calculated for each query as follows:
$$AP = \frac{1}{R} \sum_{k=1}^{n} \frac{R_k}{k} \times rel_k \qquad (11)$$
where $n$ denotes the total number of instances in the testing set, which contains $R$ relevant instances. The top $k$ returned results contain $R_k$ relevant instances. If the $k$-th returned result is relevant, $rel_k$ is set to 1; otherwise, $rel_k$ is set to 0. Then, the MAP score is the mean of the AP values over all queries. It jointly considers the ranking information and the precision, and is widely used in crossmodal retrieval.
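The AP and MAP computation described above can be sketched in a few lines. AP averages the precision measured at each rank where a relevant result appears, and MAP averages AP over all queries. The function names are hypothetical, and the sketch assumes that each relevance list covers all returned results of a query, so that the number of 1 entries equals the number of relevant instances in the testing set.

```python
def average_precision(relevance):
    """AP for one query; `relevance` holds 0/1 flags in ranked order."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # no relevant instance for this query
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1                     # relevant results in the top `rank`
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_relevant


def mean_average_precision(relevance_lists):
    """MAP: mean of the AP values over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


# Two toy queries: relevant results at ranks 1 and 3, and at rank 2.
print(average_precision([1, 0, 1, 0]))
print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
```

For the first toy query, AP is (1/1 + 2/3) / 2 = 5/6. For the second it is 1/2, so the MAP over both queries is 2/3.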
IV-C Compared Methods
To verify the effectiveness of our proposed CMGANs approach, we compare with 10 state-of-the-art methods in the experiments, including 5 traditional crossmodal retrieval methods, namely CCA [19], CFA [21], KCCA [9], JRL [23] and LGCFL [45], as well as 5 deep learning based methods, namely CorrAE [11], DCCA [32], CMDN [10], DeepSM [44] and CCL [33]. We briefly introduce these compared methods in the following paragraphs.

CCA [19] learns projection matrices to map the features of different modalities into one common space by maximizing the correlation between them.

CFA [21] minimizes the Frobenius norm and projects the data of different modalities into one common space.

KCCA [9] adopts a kernel function to extend CCA for common space learning. In the experiments, the Gaussian kernel is used as the kernel function.

JRL [23] adopts semi-supervised regularization as well as sparse regularization to learn the common space with semantic information.

LGCFL [45] uses a local group based priori to exploit popular block based features and jointly learns basis matrices for different modalities.

CorrAE [11] jointly models the correlation learning error and the reconstruction learning error with two subnetworks linked at the code layer. It has two extensions, and for fair comparison the best results among these models are reported in the experiments.

DCCA [32] adopts an objective function similar to that of CCA on top of two separate subnetworks to maximize the correlation between them.

CMDN [10] jointly models the intramodality and intermodality correlation in both separate representation and common representation learning stages with multiple deep networks.

DeepSM [44] performs deep semantic matching to exploit the strong representation learning ability of convolutional neural network for image.

CCL [33] fully explores both intramodality and intermodality correlation simultaneously with multi-grained and multi-task learning.
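For reference, the correlation-maximization objective of CCA [19] and the Gaussian kernel used with KCCA [9] can be written in their standard textbook form as follows. The projection vectors $w_x$ and $w_y$, the covariance matrices $\Sigma_{xx}$, $\Sigma_{yy}$ and $\Sigma_{xy}$, and the bandwidth $\sigma$ are notation introduced here for illustration and are not taken from this paper.

```latex
\max_{w_x, w_y} \;
\frac{w_x^{\top} \Sigma_{xy} w_y}
     {\sqrt{w_x^{\top} \Sigma_{xx} w_x}\,\sqrt{w_y^{\top} \Sigma_{yy} w_y}},
\qquad
k(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right)
```

KCCA optimizes the same kind of correlation after an implicit nonlinear mapping defined by the kernel, which is what allows it to model nonlinear correlation.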
IV-D Comparisons with 10 State-of-the-art Methods
In this subsection, we compare the crossmodal retrieval accuracy to evaluate the effectiveness of the common representation learned by our proposed approach and by the state-of-the-art compared methods. The experimental results are shown in Tables I, II and III, which report the MAP scores of both bimodal retrieval and all-modal retrieval on the 3 datasets. We can observe that our proposed CMGANs approach achieves the best retrieval accuracy among all the compared methods. On our constructed large-scale XMediaNet dataset, as shown in Table I, the average MAP score of bimodal retrieval is improved from 0.533 to 0.559, and that of all-modal retrieval from 0.565 to 0.586. Among the compared methods, most deep learning based methods perform better than the traditional methods, with CCL achieving the best accuracy. Some traditional methods, such as LGCFL and JRL, which are the two best compared traditional methods, also benefit from the CNN features and come close to the accuracy of the deep learning based methods.
Besides, similar trends can be observed on the Pascal Sentence and Wikipedia datasets for both bimodal retrieval and all-modal retrieval, as shown in Tables II and III. Our proposed approach achieves the highest MAP score in every column of both tables. For intuitive comparison, some bimodal retrieval results on our constructed large-scale XMediaNet dataset are shown in Figure 5.
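The margin over the best compared method on XMediaNet can be checked directly from the average MAP columns of Table I. The short sketch below copies those averages and reports which compared method comes closest and by how much. The dictionary layout and the helper name `margin_over_best_compared` are illustrative choices, not part of the original evaluation.

```python
# Average MAP scores on XMediaNet, copied from Table I.
bimodal_avg = {
    "CMGANs": 0.559, "CCL": 0.533, "CMDN": 0.501, "DeepSM": 0.371,
    "LGCFL": 0.475, "JRL": 0.447, "DCCA": 0.429, "CorrAE": 0.488,
    "KCCA": 0.261, "CFA": 0.326, "CCA": 0.215,
}
allmodal_avg = {
    "CMGANs": 0.586, "CCL": 0.565, "CMDN": 0.534, "DeepSM": 0.345,
    "LGCFL": 0.429, "JRL": 0.507, "DCCA": 0.454, "CorrAE": 0.328,
    "KCCA": 0.243, "CFA": 0.263, "CCA": 0.253,
}


def margin_over_best_compared(scores, ours="CMGANs"):
    """Return the best compared method and the absolute MAP margin over it."""
    compared = {name: s for name, s in scores.items() if name != ours}
    best = max(compared, key=compared.get)
    return best, round(scores[ours] - compared[best], 3)


print(margin_over_best_compared(bimodal_avg))
print(margin_over_best_compared(allmodal_avg))
```

In both tasks the closest compared method is CCL, with absolute margins of 0.026 and 0.021 in average MAP, respectively.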
IV-E Experimental Analysis
This subsection presents an in-depth experimental analysis of our proposed approach and the compared state-of-the-art methods. We also give some failure analysis of our proposed approach for further discussion.
First, among the compared deep learning based methods, DCCA, CorrAE and DeepSM all have similar network structures that consist of two subnetworks. CorrAE jointly models the crossmodal correlation learning error as well as the reconstruction error. Although DCCA only maximizes the correlation on top of the two subnetworks, it utilizes the strong representation learning ability of convolutional neural network to reach roughly the same accuracy as CorrAE. DeepSM further integrates semantic category information to achieve better accuracy. Besides, both CMDN and CCL contain multiple deep networks to consider intramodality and intermodality correlation in a multilevel framework, which makes them outperform the other methods. CCL further exploits the fine-grained information and adopts a multi-task learning strategy, which gives it the best accuracy among the compared methods. Then, for the traditional methods, although their performance benefits from the deep features, most of them are still limited by the traditional framework and obtain poor accuracy, such as CCA and CFA. KCCA, as an extension of CCA, achieves better accuracy because its kernel function models the nonlinear correlation. Besides, JRL and LGCFL have the best retrieval accuracy among the traditional methods and even outperform some deep learning based methods, because JRL adopts semi-supervised and sparse regularization, while LGCFL uses a local group based priori to take advantage of popular block based features.
Compared with the above state-of-the-art methods, our proposed CMGANs approach clearly keeps its advantage, as shown in Tables I, II and III, for the following 3 reasons: (1) The crossmodal GANs architecture fully models the joint distribution over the data of different modalities with the crossmodal adversarial training process. (2) The crossmodal convolutional autoencoders with weight-sharing and semantic constraints, used as the generative model, fit the joint distribution by exploiting both intermodality and intramodality correlation. (3) The intermodality and intramodality discrimination in the discriminative model strengthens the generative model.
Then, for the failure analysis, Figures 5 and 6 show the retrieval results on XMediaNet dataset and the MAP score of each category on Wikipedia and Pascal Sentence datasets. From Figure 5, we can observe that the failure cases are mostly caused by the small variance between image instances or the confusion in text instances among different categories, which leads to wrong retrieval results. It should be noted, however, that our proposed approach effectively reduces the number of failure cases compared with CCL, the best compared deep learning based method, and LGCFL, the best compared traditional method. Besides, as shown in Figure 6, the retrieval accuracy differs greatly across categories. Some categories with high-level semantics, such as “art” and “history” in Wikipedia dataset, or with relatively small objects, such as “bottle” and “potted plant” in Pascal Sentence dataset, may lead to confusion when performing crossmodal retrieval. However, our proposed approach still achieves the best retrieval accuracy on most categories compared with CCL and LGCFL, which indicates the effectiveness of our approach.

Table IV: MAP scores of bimodal retrieval for the baseline experiments on the generative model.
Dataset  Method  MAP scores

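Image→Text  Text→Image  Average
XMediaNet  Our CMGANs Approach  0.567  0.551  0.559
XMediaNet  CMGANs without ws&sc  0.530  0.524  0.527
XMediaNet  CMGANs with ws  0.548  0.544  0.546
XMediaNet  CMGANs with sc  0.536  0.529  0.533
Pascal Sentence  Our CMGANs Approach  0.603  0.604  0.604
Pascal Sentence  CMGANs without ws&sc  0.562  0.557  0.560
Pascal Sentence  CMGANs with ws  0.585  0.586  0.585
Pascal Sentence  CMGANs with sc  0.566  0.570  0.568
Wikipedia  Our CMGANs Approach  0.521  0.466  0.494
Wikipedia  CMGANs without ws&sc  0.489  0.436  0.463
Wikipedia  CMGANs with ws  0.502  0.439  0.470
Wikipedia  CMGANs with sc  0.494  0.438  0.466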

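Table V: MAP scores of bimodal retrieval for the baseline experiment on the discriminative model.
Dataset  Method  MAP scores
Image→Text  Text→Image  Average
XMediaNet  Our CMGANs Approach  0.567  0.551  0.559
XMediaNet  CMGANs only inter  0.529  0.524  0.527
Pascal Sentence  Our CMGANs Approach  0.603  0.604  0.604
Pascal Sentence  CMGANs only inter  0.576  0.577  0.577
Wikipedia  Our CMGANs Approach  0.521  0.466  0.494
Wikipedia  CMGANs only inter  0.506  0.442  0.474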

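Table VI: MAP scores of bimodal retrieval for the baseline experiment on adversarial training.
Dataset  Method  MAP scores
Image→Text  Text→Image  Average
XMediaNet  Our CMGANs Approach  0.567  0.551  0.559
XMediaNet  CMGANsCAE  0.491  0.511  0.501
Pascal Sentence  Our CMGANs Approach  0.603  0.604  0.604
Pascal Sentence  CMGANsCAE  0.563  0.545  0.554
Wikipedia  Our CMGANs Approach  0.521  0.466  0.494
Wikipedia  CMGANsCAE  0.460  0.436  0.448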

IV-F Baseline Comparisons
To verify the effectiveness of each part of our proposed CMGANs approach, three kinds of baseline experiments are conducted. Tables IV, V and VI compare our proposed approach with the baseline approaches, and the detailed analysis is given in the following paragraphs.
IV-F1 Performance of generative model
We have constructed the crossmodal convolutional autoencoders with both weight-sharing and semantic constraints in the generative model, as mentioned in Section III.B. To demonstrate the separate contribution of each of them, we conduct 3 sets of baseline experiments, where “ws” denotes the weight-sharing constraint and “sc” denotes the semantic constraints. Thus, “CMGANs without ws&sc” means that neither of these two constraints is adopted, while “CMGANs with ws” and “CMGANs with sc” mean that only the corresponding one is adopted.
As shown in Table IV, these two components of the generative model make similar contributions to the final crossmodal retrieval accuracy: the weight-sharing constraint effectively handles the crossmodal correlation, and the semantic constraints preserve the semantic consistency between different modalities. Adopted together, they mutually boost the common representation learning.
IV-F2 Performance of discriminative model
There are two kinds of discriminative models, which simultaneously conduct the intermodality discrimination and the intramodality discrimination. It should be noted that the intermodality discrimination is indispensable for crossmodal correlation learning. Therefore, we only conduct a baseline experiment on the effectiveness of the intramodality discrimination, denoted as “CMGANs only inter”, in which only the intermodality discrimination is kept.
As shown in Table V, CMGANs improves the average MAP score of bimodal retrieval over this baseline on all 3 datasets. This indicates that the intramodality discrimination plays a role complementary to the intermodality discrimination: it preserves the semantic consistency within each modality by discriminating the generated reconstruction representation from the original representation.
IV-F3 Performance of adversarial training
We aim to verify the effectiveness of the adversarial training process. In our proposed approach, the generative model can be trained alone, without the discriminative model, by adopting the reconstruction learning error on top of the two decoders for each modality together with the weight-sharing and semantic constraints. This baseline approach is denoted as “CMGANsCAE”.
From the results in Table VI, we can observe that CMGANs obtains a higher average MAP score of bimodal retrieval than CMGANsCAE on all 3 datasets. This demonstrates that the adversarial training process effectively boosts the crossmodal correlation learning and improves the performance of crossmodal retrieval.
The above baseline results have verified the separate contribution of each component of our proposed CMGANs approach in the following 3 aspects: (1) Weight-sharing and semantic constraints can exploit the crossmodal correlation and semantic information between different modalities. (2) Intramodality discrimination can model semantic information within each modality to make a complementary contribution to intermodality discrimination. (3) Crossmodal adversarial training can fully capture the crossmodal joint distribution to learn a more discriminative common representation.
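The size of these contributions can be read off the average bimodal MAP scores in Tables V and VI. The sketch below computes, per dataset, how much average MAP is lost when intramodality discrimination is removed (“CMGANs only inter”) and when adversarial training is removed (“CMGANsCAE”). The dataset labels are an assumption: they are matched to the baseline rows through the full-model scores reported in Tables I, II and III, and the helper name `gain_over` is hypothetical.

```python
# Average bimodal MAP scores copied from Tables V and VI.
# Dataset labels are matched via the full-model scores (an assumption).
full_model = {"XMediaNet": 0.559, "Pascal Sentence": 0.604, "Wikipedia": 0.494}
only_inter = {"XMediaNet": 0.527, "Pascal Sentence": 0.577, "Wikipedia": 0.474}
cae_only = {"XMediaNet": 0.501, "Pascal Sentence": 0.554, "Wikipedia": 0.448}


def gain_over(baseline):
    """Absolute gain in average MAP of the full model over a baseline."""
    return {name: round(full_model[name] - baseline[name], 3)
            for name in full_model}


intra_gain = gain_over(only_inter)      # effect of intramodality discrimination
adversarial_gain = gain_over(cae_only)  # effect of adversarial training

for name in full_model:
    print(name, intra_gain[name], adversarial_gain[name])
```

Under these assumptions, removing adversarial training costs more average MAP on every dataset (0.046 to 0.058) than removing intramodality discrimination alone (0.020 to 0.032).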
V Conclusion
In this paper, we have proposed Crossmodal Generative Adversarial Networks (CMGANs) to handle the heterogeneity gap and learn a common representation for different modalities. First, a crossmodal GANs architecture is proposed to fit the joint distribution over the data of different modalities with a minimax game. Second, crossmodal convolutional autoencoders are proposed with both weight-sharing and semantic constraints to model the crossmodal semantic correlation between different modalities. Third, a crossmodal adversarial mechanism is designed with two kinds of discriminative models that simultaneously conduct intermodality and intramodality discrimination, so that the generative and discriminative models mutually boost each other to learn a more discriminative common representation. We conduct crossmodal retrieval to verify the effectiveness of the learned common representation, and our proposed approach outperforms 10 state-of-the-art methods on the widely used Wikipedia and Pascal Sentence datasets as well as our constructed large-scale XMediaNet dataset in the experiments.
For future work, we will attempt to model the joint distribution over the data of more modalities, such as video and audio. Besides, we will attempt to make the best of large-scale unlabeled data to perform unsupervised training, moving toward practical applications.
References
 [1] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, no. 5588, pp. 746–748, 1976.
 [2] Y. Peng, W. Zhu, Y. Zhao, C. Xu, Q. Huang, H. Lu, Q. Zheng, T. Huang, and W. Gao, “Crossmedia analysis and reasoning: advances and directions,” Frontiers of Information Technology & Electronic Engineering, vol. 18, no. 1, pp. 44–57, 2017.
 [3] Y. Peng, X. Huang, and Y. Zhao, “An overview of crossmedia retrieval: Concepts, methodologies, benchmarks and challenges,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2017.
 [4] Y. Yang, Y. Zhuang, F. Wu, and Y. Pan, “Harmonizing hierarchical manifolds for multimedia document semantics understanding and crossmedia retrieval,” IEEE Transactions on Multimedia (TMM), vol. 10, no. 3, pp. 437–446, 2008.
 [5] Y. Zhuang, Y. Yang, and F. Wu, “Mining semantic correlation of heterogeneous multimedia data for crossmedia retrieval,” IEEE Transactions on Multimedia (TMM), vol. 10, no. 2, pp. 221–229, 2008.
 [6] L. Zhang, B. Ma, G. Li, Q. Huang, and Q. Tian, “Crossmodal retrieval using multiordered discriminative structured subspace learning,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 6, pp. 1220–1233, 2017.
 [7] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to crossmodal multimedia retrieval,” in ACM International Conference on Multimedia (ACMMM), 2010, pp. 251–260.
 [8] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multiview embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision (IJCV), vol. 106, no. 2, pp. 210–233, 2014.
 [9] D. R. Hardoon, S. Szedmák, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
 [10] Y. Peng, X. Huang, and J. Qi, “Crossmedia shared representation by hierarchical learning with multiple deep networks,” in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 3846–3853.
 [11] F. Feng, X. Wang, and R. Li, “Crossmodal retrieval with correspondence autoencoder,” in ACM International Conference on Multimedia (ACMMM), 2014, pp. 7–16.
 [12] L. Pang, S. Zhu, and C. Ngo, “Deep multimodal learning for affective analysis and retrieval,” IEEE Transactions on Multimedia (TMM), vol. 17, no. 11, pp. 2008–2020, 2015.
 [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
 [14] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.

 [15] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 64–72.
 [16] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Perceptual generative adversarial networks for small object detection,” arXiv preprint arXiv:1706.05274, 2017.
 [17] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.

 [18] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning (ICML), 2016, pp. 1060–1069.
 [19] H. Hotelling, “Relations between two sets of variates,” Biometrika, pp. 321–377, 1936.
 [20] V. Ranjan, N. Rasiwasia, and C. V. Jawahar, “Multilabel crossmodal retrieval,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4094–4102.
 [21] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, “Multimedia content processing through crossmodal association,” in ACM International Conference on Multimedia (ACMMM), 2003, pp. 604–611.
 [22] X. Zhai, Y. Peng, and J. Xiao, “Heterogeneous metric learning with joint graph regularization for crossmedia retrieval,” in AAAI Conference on Artificial Intelligence (AAAI), 2013, pp. 1198–1204.
 [23] X. Zhai, Y. Peng, and J. Xiao, “Learning crossmedia joint representation with sparse and semisupervised regularization,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 24, pp. 965–978, 2014.

 [24] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for crossmodal retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 38, no. 10, pp. 2010–2023, 2016.
 [25] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012.
 [27] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning (ICML), 2011, pp. 689–696.
 [28] J. Kim, J. Nam, and I. Gurevych, “Learning semantics with deep belief network for crosslanguage information retrieval,” in International Committee on Computational Linguistic (ICCL), 2012, pp. 579–588.
 [29] D. Wang, P. Cui, M. Ou, and W. Zhu, “Deep multimodal hashing with orthogonal regularization,” in International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 2291–2297.
 [30] N. Srivastava and R. Salakhutdinov, “Learning representations for multimodal data with deep belief nets,” in International Conference on Machine Learning (ICML) Workshop, 2012.
 [31] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning (ICML), 2013, pp. 1247–1255.

 [32] F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3441–3450.
 [33] Y. Peng, J. Qi, X. Huang, and Y. Yuan, “CCL: Crossmodal correlation learning with multi-grained fusion by hierarchical network,” IEEE Transactions on Multimedia (TMM), 2017.
 [34] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1486–1494.
 [35] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802, 2016.
 [36] X. Wang and A. Gupta, “Generative image modeling using style and structure adversarial networks,” in European Conference on Computer Vision (ECCV), 2016, pp. 318–335.
 [37] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” arXiv preprint arXiv:1610.09585, 2016.
 [38] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 217–225.
 [39] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas, “Stackgan: Text to photorealistic image synthesis with stacked generative adversarial networks,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1–8.
 [40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2014.

 [41] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 3111–3119.
 [42] Y. Kim, “Convolutional neural networks for sentence classification,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746–1751.
 [43] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using amazon’s mechanical turk,” in NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, 2010, pp. 139–147.
 [44] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Crossmodal retrieval with CNN visual features: A new baseline,” IEEE Transactions on Cybernetics (TCYB), vol. 47, no. 2, pp. 449–460, 2017.
 [45] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for crossmodal multimedia retrieval,” IEEE Transactions on Multimedia (TMM), vol. 17, no. 3, pp. 370–381, 2015.