1 Introduction
Over the last decade, different types of media data, such as texts, images and videos, have been growing rapidly on the Internet. It is common for different types of data to be used to describe the same events or topics. For example, a web page usually contains not only a textual description but also images or videos illustrating the common content. Such different types of data are referred to as multi-modal data, which exhibit heterogeneous properties. Multi-modal data have found many applications (as shown in Figure 1). As multi-modal data grow, it becomes difficult for users to search for information of interest effectively and efficiently. To date, various techniques have been developed for indexing and searching multimedia data. However, these search techniques are mostly single-modality-based and can be divided into keyword-based retrieval and content-based retrieval. They only perform similarity search within the same media type, such as text retrieval, image retrieval, audio retrieval, and video retrieval. Hence, a pressing requirement for advancing information retrieval is to develop a new retrieval model that supports similarity search over multi-modal data.
Nowadays, mobile devices and emerging social websites (e.g., Facebook, Flickr, YouTube, and Twitter) are changing the ways people interact with the world and search for information of interest. It would be convenient if users could submit any media content at hand as a query. Suppose we are visiting the Great Wall: by taking a photo, we may expect to use it to retrieve relevant textual material as a visual guide. Cross-modal retrieval, as a natural way of searching, therefore becomes increasingly important. Cross-modal retrieval aims to take one type of data as the query and retrieve relevant data of another type. For example, as shown in Figure 2, text is used as the query to retrieve images. Furthermore, when users search by submitting a query of any media type, they can obtain results across various modalities, which is more comprehensive given that different modalities of data provide complementary information to each other.
More recently, cross-modal retrieval has attracted considerable research attention. Its central challenge is how to measure the content similarity between different modalities of data, which is referred to as the heterogeneity gap. Hence, compared with traditional retrieval methods, cross-modal retrieval requires modeling cross-modal relationships, so that users can retrieve what they want by submitting what they have. The main research effort is now to design effective ways to make cross-modal retrieval more accurate and more scalable.
This paper aims to conduct a comprehensive survey of cross-modal retrieval. Although Liu et al. [1] gave an overview of cross-modal retrieval in 2010, their survey does not include the many important works proposed in recent years. Xu et al. [2] summarize several methods for modeling multi-modal data, but they focus on multi-view learning. Since many technical challenges remain, various ideas and techniques have been proposed in recent years to solve the cross-modal problem. This paper focuses on summarizing these latest works in cross-modal retrieval, and its major concerns are very different from those of previous related surveys. Another topic in modeling multi-modal data is image/video description [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], which is not discussed here because it goes beyond the scope of cross-modal retrieval research.
The major contributions of this paper are briefly summarized as follows.

This paper provides a survey of recent progress in cross-modal retrieval. It contains many new references not found in previous surveys, which helps beginners become familiar with cross-modal retrieval quickly.

This paper gives a taxonomy of cross-modal retrieval approaches. The differences between the various kinds of methods are elaborated, which helps readers better understand the techniques used in cross-modal retrieval.

This paper evaluates several representative algorithms on commonly used datasets. Several meaningful findings are obtained, which are useful for understanding cross-modal retrieval algorithms and are expected to benefit both practical applications and future research.

This paper summarizes the challenges and opportunities in the cross-modal retrieval field and points out some open directions for future work.
The rest of this paper is organized as follows: we first give an overview of the different kinds of methods for cross-modal retrieval in Section 2. Then, we describe the different kinds of cross-modal retrieval algorithms in detail in Sections 3 and 4. We introduce several multi-modal datasets in Section 5. Experimental results are reported in Section 6. A discussion and future trends are given in Section 7. Finally, Section 8 concludes this paper.
2 Overview
In the cross-modal retrieval procedure, users can search various modalities of data, including texts, images and videos, starting with any modality of data as a query. Figure 3 presents the general framework of cross-modal retrieval, in which feature extraction for multi-modal data is the first step, representing the various modalities of data. Based on these representations, cross-modal correlation modeling is performed to learn common representations for the various modalities. Finally, the common representations enable cross-modal retrieval through suitable schemes for search result ranking and summarization.
TABLE I: Typical algorithms of cross-modal retrieval in each category.

Real-valued representation learning
  Unsupervised methods
    Subspace learning methods: CCA [13], PLS [14], BLM [15, 16], CFA [17], MCU [18], CoCA [19], [20]
    Topic model: Corr-LDA [21], Tr-mm LDA [22], MDRF [23]
    Deep learning methods: [24], [25], DCCA [26], End-to-end DCCA [27], Corr-AE [28], [29]
  Pairwise based methods
    Shallow methods: Multi-NPP [30], MVML-GL [31], JGRHML [32]
    Deep learning methods: RGDBN [33], MSDS [34]
  Rank based methods
    Shallow methods: SSI [35], PAMIR [36], LSCMR [37], [38], [39], RCCA [40]
    Deep learning methods: DeViSE [41], DT-RNNs [42], [43], C²MLR [44]
  Supervised methods
    Subspace learning methods: CDFE [45], ISCA [46], PFAR [47], JRL [48], SliM [49], cluster-CCA [50], CCA-3V [51], LCFS [52], JFSSL [53]
    Topic model: SupDocNADE [54], NPBUS [55], M³R [56]
    Deep learning methods: RE-DNN [57], deep-SM [58], MDNN [59]
Binary representation learning
  Unsupervised methods
    Linear modeling: [60]–[65]
    Nonlinear modeling: MSAE [66], DMHOR [67]
  Pairwise based methods
    Linear modeling: [68]–[73]
    Nonlinear modeling: MLBE [74], PLMH [75], MM-NN [76], CHN [77]
  Supervised methods
    Linear modeling: SMH [78], DCDH [79], SCM [80]
    Nonlinear modeling: SePH [81], CAH [82], DCMH [83]
Since cross-modal retrieval is an important problem in real applications, various approaches have been proposed to address it. They can be roughly divided into two categories: 1) real-valued representation learning and 2) binary representation learning, also called cross-modal hashing. In real-valued representation learning, the learned common representations for the various modalities of data are real-valued. To speed up cross-modal retrieval, binary representation learning methods instead transform the different modalities of data into a common Hamming space, in which cross-modal similarity search is fast. Since the representations are encoded as binary codes, retrieval accuracy generally decreases slightly due to the loss of information.
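The speed advantage of Hamming-space search can be made concrete with a small sketch (a toy illustration, not any cited method; the codes, packing and query below are our own assumptions): similarity search over binary codes reduces to a bitwise XOR followed by a bit count.

```python
import numpy as np

def hamming_distances(query, codes):
    """Hamming distance between one packed 8-bit code and a code database."""
    xor = np.bitwise_xor(codes, query)             # differing bits per byte
    return np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per row

# 4 database items, 16-bit codes packed into two uint8 bytes each (toy data)
codes = np.array([[0b10101010, 0b11110000],
                  [0b10101010, 0b11110001],
                  [0b00000000, 0b00000000],
                  [0b11111111, 0b11111111]], dtype=np.uint8)
query = np.array([0b10101010, 0b11110000], dtype=np.uint8)

dists = hamming_distances(query, codes)
ranking = np.argsort(dists)  # nearest binary codes first
```

Real cross-modal hashing methods differ in how the codes are learned, not in how they are compared: this XOR-and-popcount step is what makes retrieval in the Hamming space fast.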
According to the information utilized when learning the common representations, cross-modal retrieval methods can be further divided into four groups: 1) unsupervised methods, 2) pairwise based methods, 3) rank based methods, and 4) supervised methods. Generally speaking, the more information a method utilizes, the better the performance it obtains.
1) Unsupervised methods utilize only co-occurrence information to learn common representations across multi-modal data. Co-occurrence information means that if different modalities of data coexist in a multi-modal document, then they have the same semantics. For example, a web page usually contains both textual descriptions and images illustrating the same event or topic.
2) Pairwise based methods utilize similar (or dissimilar) pairs to learn common representations. These methods generally learn a meaningful metric distance between different modalities of data.
3) Rank based methods utilize ranking lists to learn common representations, treating cross-modal retrieval as a learning-to-rank problem.
4) Supervised methods exploit label information to learn common representations. These methods enforce the learned representations of different-class samples to be far apart while those of same-class samples lie as close as possible, and accordingly obtain more discriminative representations. However, obtaining label information is sometimes expensive because it requires massive manual annotation.
Typical algorithms for cross-modal retrieval in each of these categories are summarized in Table I.
3 Real-valued Representation Learning
If different modalities of data are related to the same event or topic, they are expected to share a common representation space in which relevant data are close to each other. Real-valued representation learning methods aim to learn such a real-valued common space, in which different modalities of data can be directly compared. According to the information utilized to learn the common representation, these methods can again be divided into four groups: 1) unsupervised methods, 2) pairwise based methods, 3) rank based methods, and 4) supervised methods. We introduce each group in the following, and describe some methods in detail for better understanding.
3.1 Unsupervised methods
Unsupervised methods utilize only co-occurrence information to learn common representations across multi-modal data. Co-occurrence information means that if different modalities of data coexist in a multi-modal document, then they have similar semantics. For example, a textual description often appears alongside images or videos on a web page to illustrate the same event or topic. Unsupervised methods can be further categorized into subspace learning methods, topic models and deep learning methods.
3.1.1 Subspace learning methods
The main difficulty of cross-modal retrieval is how to measure the content similarity between different modalities of data. Subspace learning methods are among the most popular approaches. They aim to learn a common subspace shared by the different modalities, in which the similarity between different modalities of data can be measured (as shown in Figure 4). Unsupervised subspace learning methods use pairwise co-occurrence information to learn a common latent subspace across multi-modal data, enforcing pairwise closeness between the different modalities in that subspace.
Canonical Correlation Analysis (CCA) is one of the most popular unsupervised subspace learning methods for establishing inter-modal relationships between data from different modalities. It has been widely used for cross-media retrieval [13, 84, 85], cross-lingual retrieval [86] and some vision problems [87]. CCA aims to learn two directions w_x and w_y for the two modalities of data, along which the data are maximally correlated, i.e.,

(1)   \max_{w_x, w_y} \frac{w_x^\top \Sigma_{xy} w_y}{\sqrt{w_x^\top \Sigma_{xx} w_x}\,\sqrt{w_y^\top \Sigma_{yy} w_y}}

where \Sigma_{xx} and \Sigma_{yy} represent the empirical covariance matrices for the two modalities of data, respectively, while \Sigma_{xy} represents the cross-covariance matrix between them. Rasiwasia et al. [13] propose a two-stage method for cross-modal multimedia retrieval. In the first stage, CCA is used to learn a common subspace by maximizing the correlation between the two modalities. Then, a semantic space is learned to measure the similarity of the different modal features.
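As a rough illustration of Eq. (1), the classical CCA solution can be obtained by whitening the two covariances and taking an SVD of the whitened cross-covariance. The sketch below is our own minimal implementation (the small regularizer and the synthetic two-"modality" data are assumptions, not the code of [13]); it recovers a shared latent factor planted in both views:

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Top canonical directions and canonical correlation for paired,
    centered samples X: (n, dx), Y: (n, dy)."""
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])   # empirical covariances
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n                              # cross-covariance

    def inv_sqrt(S):                               # symmetric S^{-1/2}
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)        # whitened cross-covariance
    U, s, Vt = np.linalg.svd(T)
    w_x = inv_sqrt(Sxx) @ U[:, 0]
    w_y = inv_sqrt(Syy) @ Vt[0, :]
    return w_x, w_y, s[0]

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 1))                  # shared latent factor
X = np.hstack([z, rng.standard_normal((500, 2))])  # toy "image" features
Y = np.hstack([z, rng.standard_normal((500, 3))])  # toy "text" features
X -= X.mean(0); Y -= Y.mean(0)
w_x, w_y, corr = cca(X, Y)                         # corr should be near 1
```

Because the first coordinate of both views carries the same factor z, the top canonical correlation approaches 1, while the noise dimensions are ignored.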
Besides CCA, Partial Least Squares (PLS) [14] and the Bilinear Model (BLM) [15, 16] have also been used for cross-modal retrieval. Sharma and Jacobs [88] use PLS to linearly map images of different modalities into a common linear subspace in which they are highly correlated. Chen et al. [89] apply PLS to cross-modal document retrieval. They use PLS to map image features into the text space, and then learn a semantic space for measuring the similarity between the two modalities. In [16], Tenenbaum and Freeman propose a bilinear model (BLM) to derive a common space for cross-modal face recognition; BLM is also used for text-image retrieval in [15]. Li et al. [17] introduce a cross-modal factor analysis (CFA) approach to evaluate the association between two modalities. The CFA method adopts the criterion of minimizing the Frobenius norm between pairwise data in the transformed domain. Mahadevan et al. [18] propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different modalities. Shi et al. [19] propose the principle of collective component analysis (CoCA) to handle dimensionality reduction in a heterogeneous feature space. Zhu et al. [20] propose a greedy dictionary construction method for the cross-modal retrieval problem, in which compactness and modality-adaptivity are preserved by including reconstruction error terms and a Maximum Mean Discrepancy (MMD) measurement for both modalities in the objective function. Wang et al. [90] propose to learn sparse projection matrices that map the image-text pairs in Wikipedia into a latent space for cross-modal retrieval.
3.1.2 Topic models
Another type of unsupervised method is the topic model. Topic models have been widely applied to a specific cross-modal problem, i.e., image annotation [21, 22]. To capture the correlation between images and annotations, Latent Dirichlet Allocation (LDA) [91] has been extended to learn the joint distribution of multi-modal data, as in Correspondence LDA (Corr-LDA) [21] and Topic-regression Multi-modal LDA (Tr-mm LDA) [22]. Corr-LDA uses topics as shared latent variables, which represent the underlying causes of cross-correlations in the multi-modal data. Tr-mm LDA learns two separate sets of hidden topics and a regression module, which captures more general forms of association and allows one set of topics to be linearly predicted from the other. Jia et al. [23] propose a new probabilistic model, the Multi-modal Document Random Field (MDRF), to learn a set of topics shared across the modalities. The model defines a Markov random field at the document level, which allows modeling more flexible document similarities.
3.1.3 Deep learning methods
As mentioned above, it is common for different types of data to describe the same events or topics on the web. For example, user-generated content usually involves data from different modalities, such as images, texts and videos. This makes it very challenging for traditional methods to obtain a joint representation for multi-modal data. Inspired by recent progress in deep learning, Ngiam et al. [24] apply deep networks to learn features over multiple modalities, focusing on representations for speech audio coupled with videos of the lips. Subsequently, a deep Restricted Boltzmann Machine [25] succeeded in learning joint representations for multi-modal data. It first uses separate modality-specific latent models to learn low-level representations for each modality, and then fuses them into a joint representation at the higher levels of the deep architecture (as shown in Figure 5). Inspired by representation learning with deep networks [24, 25], Andrew et al. [26] present Deep Canonical Correlation Analysis (DCCA), a deep learning method that learns complex nonlinear projections for different modalities of data such that the resulting representations are highly linearly correlated. Furthermore, Yan and Mikolajczyk [27] propose an end-to-end learning scheme based on deep canonical correlation analysis (End-to-end DCCA), a nontrivial extension of [26] (as shown in Figure 6). The objective function is
(2)   (W_1^{*}, W_2^{*}) = \arg\max_{W_1, W_2} \operatorname{corr}\left( f_1(X_1; W_1),\; f_2(X_2; W_2) \right)

where f_1 and f_2 are the two branches of the network with parameters W_1 and W_2, and H_1, H_2 denote their outputs. Define T = \hat{\Sigma}_{11}^{-1/2} \hat{\Sigma}_{12} \hat{\Sigma}_{22}^{-1/2}, where \hat{\Sigma}_{11} and \hat{\Sigma}_{22} are the (regularized) covariance matrices of H_1 and H_2 and \hat{\Sigma}_{12} is their cross-covariance; then the objective function can be rewritten as

(3)   \operatorname{corr}(H_1, H_2) = \|T\|_{tr} = \operatorname{tr}\left( (T^\top T)^{1/2} \right)

The gradients with respect to H_1 and H_2 are computed and propagated down along the two branches of the network. The high dimensionality of the features presents a great challenge in terms of memory and speed when used in the DCCA framework. To address this problem, Yan and Mikolajczyk propose and discuss the details of a GPU implementation based on the CULA libraries, whose efficiency is several orders of magnitude higher than that of a CPU implementation.
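The DCCA criterion can be illustrated numerically: given the outputs of the two branches, the total correlation is the trace norm (sum of singular values) of the whitened cross-covariance T. The sketch below is a toy illustration of the loss value only, with no network or gradients; the regularization constant and the synthetic data are our assumptions:

```python
import numpy as np

def dcca_objective(H1, H2, r=1e-4):
    """Sum of canonical correlations between branch outputs H1, H2."""
    n = H1.shape[0]
    H1 = H1 - H1.mean(0); H2 = H2 - H2.mean(0)
    S11 = H1.T @ H1 / (n - 1) + r * np.eye(H1.shape[1])
    S22 = H2.T @ H2 / (n - 1) + r * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()  # trace norm ||T||_tr

rng = np.random.default_rng(1)
H = rng.standard_normal((200, 3))
high = dcca_objective(H, H + 0.01 * rng.standard_normal((200, 3)))  # correlated
low = dcca_objective(H, rng.standard_normal((200, 3)))              # independent
```

Nearly identical branch outputs yield an objective close to the embedding dimension (here 3), while independent outputs score near zero; training maximizes this quantity.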
Feng et al. [28] propose a novel model based on correspondence autoencoders (Corr-AE) for cross-modal retrieval. The model is constructed by correlating the hidden representations of two unimodal autoencoders. A novel objective, which minimizes a linear combination of the representation learning error of each modality and the correlation learning error between the two hidden representations, is used to train the model as a whole. Minimizing the correlation learning error forces the model to learn hidden representations that capture only the information common to the two modalities, while minimizing the representation learning errors keeps the hidden representations good enough to reconstruct the input of each modality.
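A minimal sketch of this combined objective (single linear encoder/decoder layers, random toy data and the trade-off weight alpha are our simplifying assumptions, not the architecture of [28]):

```python
import numpy as np

def corr_ae_loss(x, y, We_x, Wd_x, We_y, Wd_y, alpha=0.5):
    """Weighted sum of both reconstruction errors and the distance
    between the two hidden codes (the 'correlation learning error')."""
    hx, hy = x @ We_x, y @ We_y                    # hidden codes
    rec = np.sum((x - hx @ Wd_x) ** 2) + np.sum((y - hy @ Wd_y) ** 2)
    corr = np.sum((hx - hy) ** 2)                  # cross-modal code distance
    return (1 - alpha) * rec + alpha * corr

rng = np.random.default_rng(2)
x = rng.standard_normal((10, 4)); y = rng.standard_normal((10, 6))
We_x = rng.standard_normal((4, 3)); Wd_x = rng.standard_normal((3, 4))
We_y = rng.standard_normal((6, 3)); Wd_y = rng.standard_normal((3, 6))
loss = corr_ae_loss(x, y, We_x, Wd_x, We_y, Wd_y)
```

Setting alpha near 1 emphasizes matching the two hidden codes; alpha near 0 emphasizes faithful reconstruction, reflecting the trade-off described above.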
Xu et al. [29] propose a unified framework that jointly models videos and their corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model and a joint embedding model. In the language model, they propose a dependency-tree structure model that embeds sentences into a continuous vector space, preserving visually grounded meanings and word order. In the visual model, they leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, they minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update the two models jointly. Based on these three parts, the model can accomplish three tasks: 1) natural language generation, 2) video-to-language retrieval and 3) language-to-video retrieval.
3.2 Pairwise based methods
Compared with unsupervised methods, pairwise based methods exploit additional information in the form of similar (or dissimilar) pairs to learn a meaningful metric distance between different modalities of data, which can be regarded as heterogeneous metric learning (as shown in Figure 7).
3.2.1 Shallow methods
Wu et al. [92] study the metric learning problem of finding a similarity function over two different spaces. Mignon and Jurie [93] propose a metric learning approach for cross-modal matching which considers both positive and negative constraints. Quadrianto and Lampert [30] propose a new metric learning scheme (Multi-View Neighborhood Preserving Projection, Multi-NPP) to project different modalities into a shared feature space, in which the Euclidean distance provides a meaningful intra-modality and inter-modality similarity. To learn the projections f and g for the features x and y of the two modalities, the loss function is defined as

(4)   L(f, g) = \sum_{(i,j) \in S} L_{sim}(x_i, y_j) + \sum_{(i,j) \in D} L_{dis}(x_i, y_j)

with

(5)   L_{sim}(x_i, y_j) = \|f(x_i) - g(y_j)\|^2

(6)   L_{dis}(x_i, y_j) = \max\left(0,\; \lambda - \|f(x_i) - g(y_j)\|^2\right)

where S and D denote the sets of similar and dissimilar cross-modal pairs and \lambda is an appropriately chosen margin constant. The above loss function consists of a similarity term that enforces similar objects to lie at proximal locations in the latent space and a dissimilarity term that pushes dissimilar objects away from each other.
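The two terms can be sketched as follows (toy vectors and margin value are our assumptions; the projections f and g are taken as already applied):

```python
import numpy as np

def pair_loss(fx, gy, similar, margin=1.0):
    """fx, gy: projected features of an (x, y) pair in the shared space."""
    d2 = np.sum((fx - gy) ** 2)
    if similar:
        return d2                        # similarity term: pull together
    return max(0.0, margin - d2)         # dissimilarity term: push apart

a = np.array([0.1, 0.2]); b = np.array([0.1, 0.25])  # nearby pair
c = np.array([2.0, 2.0])                             # distant point
sim_loss = pair_loss(a, b, similar=True)             # small: pair is close
dis_loss_far = pair_loss(a, c, similar=False)        # zero: beyond margin
dis_loss_near = pair_loss(a, b, similar=False)       # large: inside margin
```

A dissimilar pair already separated by more than the margin contributes nothing, so the optimization concentrates on violating pairs.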
Zhai et al. [31] propose a method called Multiview Metric Learning with Global consistency and Local smoothness (MVML-GL). This framework consists of two main steps. In the first step, it seeks a globally consistent shared latent feature space. In the second step, explicit mapping functions between the input spaces and the shared latent space are learned via regularized local linear regression. Zhai et al. [32] propose a joint graph regularized heterogeneous metric learning (JGRHML) algorithm to learn a heterogeneous metric for cross-modal retrieval. Based on the heterogeneous metric, they further learn a high-level semantic metric through label propagation.
3.2.2 Deep learning methods
To predict the links between social media items, Yuan et al. [33] design a Relational Generative Deep Belief Nets (RGDBN) model that learns latent features for social media by exploiting the relationships between items in the network. In the RGDBN model, the link between items is generated from the interactions of their latent features. By integrating the Indian buffet process into modified Deep Belief Nets, they learn the latent features that best embed both the media content and the observed media relationships. The model can analyze links between heterogeneous as well as homogeneous data, and can thus also be used for cross-modal retrieval.
Wang et al. [34] propose a model based on modality-specific feature learning, named the Modality-Specific Deep Structure (MSDS). Considering the characteristics of the different modalities, the model uses two types of convolutional neural networks to map the raw data into latent space representations for images and texts, respectively. In particular, the convolutional network used for texts involves word embedding learning, which has proved effective for extracting meaningful textual features for text classification. In the latent space, the mapped features of images and texts form relevant and irrelevant image-text pairs, which are used in a one-vs-more learning scheme.
3.3 Rank based methods
Rank based methods utilize ranking lists to learn common representations, treating cross-modal retrieval as a learning-to-rank problem.
3.3.1 Shallow methods
Bai et al. [35] present Supervised Semantic Indexing (SSI) for cross-lingual retrieval. Grangier et al. [36] propose a discriminative kernel-based method (the Passive-Aggressive Model for Image Retrieval, PAMIR), which solves the cross-modal ranking problem by adapting the Passive-Aggressive algorithm. Weston et al. [39] introduce a scalable model for image annotation that learns a joint representation of images and annotations: it optimizes precision at the top of the ranked list of annotations for a given image and learns a low-dimensional joint embedding space for both images and annotations.
Lu et al. [37] propose a cross-modal ranking algorithm called Latent Semantic Cross-Modal Ranking (LSCMR). They utilize the structural SVM to learn a metric such that the ranking of data induced by their distance from a query can be optimized against various ranking measures. However, LSCMR does not make full use of bidirectional ranking examples (bidirectional ranking means that both text-query-image and image-query-text ranking examples are utilized in training). Accordingly, Wu et al. [38] propose to optimize a bidirectional listwise ranking loss with a latent space embedding.
Recently, Yao et al. [40] propose a novel Ranking Canonical Correlation Analysis (RCCA) for learning query and image similarities. RCCA adjusts the subspace learnt by CCA so as to further preserve the preference relations observed in the click data. The objective function of RCCA is

(7)   \min_{W_q, W_v} \; \ell_r(W_q, W_v) + \frac{\lambda}{2}\left( \|W_q - W_q^{0}\|_F^2 + \|W_v - W_v^{0}\|_F^2 \right)

where W_q^{0} and W_v^{0} are the initial transformation matrices learnt by CCA, and \ell_r is the margin ranking loss:

(8)   \ell_r(W_q, W_v) = \sum_{(q,\, v^{+},\, v^{-})} \max\left( 0,\; 1 - f(q, v^{+}) + f(q, v^{-}) \right)

where v^{+} denotes a clicked image for query q, v^{-} a non-clicked one, and f(q, v) is a query-image similarity function that measures the similarity of image v given query q in the latent space:

(9)   f(q, v) = (q W_q)(v W_v)^\top
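A toy illustration of the margin ranking loss in Eq. (8) with the bilinear similarity of Eq. (9); identity matrices stand in for the learned transformations, and the clicked/non-clicked images are simulated:

```python
import numpy as np

def similarity(q, v, Wq, Wv):
    """Bilinear query-image similarity in the latent space."""
    return float((q @ Wq) @ (v @ Wv))

def margin_rank_loss(q, v_pos, v_neg, Wq, Wv, margin=1.0):
    """Hinge penalty when the clicked image does not beat the
    non-clicked one by at least the margin."""
    return max(0.0, margin - similarity(q, v_pos, Wq, Wv)
                          + similarity(q, v_neg, Wq, Wv))

Wq = np.eye(2); Wv = np.eye(2)        # stand-ins for learned matrices
q = np.array([1.0, 0.0])
v_pos = np.array([3.0, 0.0])          # aligned with the query: high score
v_neg = np.array([0.0, 3.0])          # orthogonal to it: low score
loss = margin_rank_loss(q, v_pos, v_neg, Wq, Wv)   # correctly ordered: 0
```

Swapping the clicked and non-clicked images produces a positive loss, which is what drives the adjustment of the CCA subspace toward the click preferences.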
3.3.2 Deep learning methods
Inspired by the progress of deep learning, Frome et al. [41] present a deep visual-semantic embedding model (DeViSE), whose objective is to leverage semantic knowledge learned in the text domain and transfer it to a model trained for visual object recognition.
Socher et al. [42] introduce Dependency Tree Recursive Neural Networks (DT-RNNs), which use dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences. The image features are extracted from a deep neural network. To learn joint image-sentence representations, the ranking cost function is

(10)   J = \sum_{(i,j) \in P} \left( \sum_{c \in S \setminus S(i)} \max\left(0,\, \Delta - v_i^\top y_j + v_i^\top y_c\right) + \sum_{c \in I \setminus I(j)} \max\left(0,\, \Delta - v_i^\top y_j + v_c^\top y_j\right) \right)

where v_i is the mapped image vector, y_j is the composed sentence vector and \Delta is the margin. S is the set of all sentence indices and S(i) is the set of sentence indices corresponding to image i. Similarly, I is the set of all image indices and I(j) is the set of image indices corresponding to sentence j. P is the set of all correct image-sentence training pairs (i, j). With both images and sentences in the same multi-modal space, image-sentence retrieval is performed easily.
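The bidirectional cost of Eq. (10) can be sketched directly (toy two-dimensional embeddings and the margin value are our assumptions):

```python
import numpy as np

def ranking_cost(V, Y, pairs, margin=0.1):
    """V: image vectors, Y: sentence vectors, pairs: correct (i, j) pairs.
    Each correct pair must out-score wrong sentences AND wrong images."""
    correct = set(pairs)
    cost = 0.0
    for i, j in pairs:
        s = V[i] @ Y[j]
        for c in range(len(Y)):                 # rank against other sentences
            if (i, c) not in correct:
                cost += max(0.0, margin - s + V[i] @ Y[c])
        for c in range(len(V)):                 # rank against other images
            if (c, j) not in correct:
                cost += max(0.0, margin - s + V[c] @ Y[j])
    return cost

V = np.array([[1.0, 0.0], [0.0, 1.0]])          # image embeddings
Y = np.array([[1.0, 0.0], [0.0, 1.0]])          # sentence embeddings
well_aligned = ranking_cost(V, Y, [(0, 0), (1, 1)])  # matching pairs: 0 cost
misaligned = ranking_cost(V, Y, [(0, 1), (1, 0)])    # swapped pairs: penalized
```

When every image sits next to its own sentence, all hinge terms vanish; pairing each image with the wrong sentence incurs a penalty from both directions of the ranking.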
Karpathy et al. [43] introduce a model for the bidirectional retrieval of images and sentences, which formulates a structured max-margin objective for a deep neural network that learns to embed both visual and language data into a common multi-modal space. Unlike previous models that directly map whole images or sentences into a common embedding space, this model works at a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space.
Jiang et al. [44] exploit existing image-text databases to optimize a ranking function for cross-modal retrieval, called deep compositional cross-modal learning to rank (C²MLR). C²MLR treats learning a multi-modal embedding as the optimization of a pairwise ranking problem while enhancing both local and global alignment. In particular, the local alignment (i.e., the alignment of visual objects and textual words) and the global alignment (i.e., the image-level and sentence-level alignment) are used collaboratively to learn the multi-modal common embedding space in a max-margin learning-to-rank manner.
Hua et al. [94] develop a novel deep convolutional architecture for cross-modal retrieval, named Cross-Modal Correlation learning with a Deep Convolutional Architecture (CMCDCA). It consists of visual feature representation learning and cross-modal correlation learning under a large margin principle.
3.4 Supervised methods
To obtain a more discriminative common representation, supervised methods exploit label information, which provides a much better separation between classes in the common representation space.
3.4.1 Subspace learning methods
Figure 8 shows the difference between unsupervised and supervised subspace learning methods. Supervised subspace learning methods enforce samples from different classes to be mapped far apart while samples from the same class lie as close as possible. To obtain a more discriminative subspace, several works extend CCA into supervised subspace learning methods.
Sharma et al. [15] present a supervised extension of CCA, called Generalized Multiview Analysis (GMA). The optimal projection directions w_1, w_2 are obtained from a coupled quadratic program of the form

(11)   \max_{w_1, w_2} \; w_1^\top A_1 w_1 + \mu\, w_2^\top A_2 w_2 + 2\lambda\, w_1^\top Z_1 Z_2^\top w_2 \quad \text{s.t.} \;\; w_1^\top B_1 w_1 + \gamma\, w_2^\top B_2 w_2 = 1

where A_i and B_i encode the within-view objective of the underlying single-view method for view i, Z_i stacks the paired training samples of view i, and \mu, \lambda, \gamma are trade-off parameters; the solution is given by a generalized eigenvalue problem.
It extends Linear Discriminant Analysis (LDA) and Marginal Fisher Analysis (MFA) to their multiview counterparts, i.e., Generalized Multiview LDA (GMLDA) and Generalized Multiview MFA (GMMFA), and applies them to the cross-media retrieval problem. GMLDA and GMMFA take the semantic category into account and have obtained promising results. Rasiwasia et al. [50] present cluster canonical correlation analysis (cluster-CCA) for joint dimensionality reduction of two modalities of data. Given samples X_c and Y_c of class c (c = 1, ..., C) in the two modalities, the cluster-CCA problem is formulated as

(12)   \max_{w_x, w_y} \frac{w_x^\top \Sigma_{xy} w_y}{\sqrt{w_x^\top \Sigma_{xx} w_x}\,\sqrt{w_y^\top \Sigma_{yy} w_y}}

where the covariance matrices \Sigma_{xy}, \Sigma_{xx} and \Sigma_{yy} are defined over all same-class pairs as

(13)   \Sigma_{xy} = \frac{1}{M} \sum_{c=1}^{C} \sum_{i=1}^{|X_c|} \sum_{j=1}^{|Y_c|} x_i^c \,(y_j^c)^\top

(14)   \Sigma_{xx} = \frac{1}{M} \sum_{c=1}^{C} |Y_c| \sum_{i=1}^{|X_c|} x_i^c \,(x_i^c)^\top

(15)   \Sigma_{yy} = \frac{1}{M} \sum_{c=1}^{C} |X_c| \sum_{j=1}^{|Y_c|} y_j^c \,(y_j^c)^\top

where M = \sum_{c=1}^{C} |X_c||Y_c| is the total number of pairwise correspondences. Cluster-CCA is able to learn discriminant low-dimensional representations that maximize the correlation between the two modalities while segregating the different classes in the learned space. Gong et al. [51] propose a novel three-view CCA (CCA-3V) framework, which explicitly incorporates the dependence of visual features and text on the underlying semantics. To better understand CCA-3V, its objective function is given below:
(16)   \min_{W_1, W_2, W_3} \sum_i \left( \|W_1^\top x_i - W_2^\top y_i\|^2 + \|W_1^\top x_i - W_3^\top z_i\|^2 + \|W_2^\top y_i - W_3^\top z_i\|^2 \right) \quad \text{s.t.} \;\; W_k^\top \Sigma_{kk} W_k = I

where x_i, y_i, and z_i represent embedding vectors from the visual view, text view and semantic class view, respectively, and W_1, W_2, and W_3 are the learned projections for each view. Furthermore, a distance function specially adapted to CCA is adopted, which improves retrieval accuracy in the embedded space. Ranjan et al. [95] introduce multi-label Canonical Correlation Analysis (ml-CCA), an extension of CCA that learns shared subspaces by considering high-level semantic information in the form of multi-label annotations. They also present Fast ml-CCA, a computationally efficient version of ml-CCA that can handle large-scale datasets. Jing et al. [46] propose a novel multi-view feature learning approach based on intra-view and inter-view supervised correlation analysis (ISCA), which explores the useful correlation information of samples within each view and between all views.
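The cluster-CCA covariance estimates of Eqs. (13)-(15) can be sketched as follows (a toy two-class layout; all variable names are ours). Since every same-class cross-modal pair counts as a correspondence, the pair sums factorize per class:

```python
import numpy as np

def cluster_cov(Xc, Yc):
    """Xc, Yc: lists of per-class sample matrices (n_c x d).
    Returns cross- and within-modality covariances over same-class pairs."""
    M = sum(x.shape[0] * y.shape[0] for x, y in zip(Xc, Yc))  # no. of pairs
    dx, dy = Xc[0].shape[1], Yc[0].shape[1]
    Sxy = np.zeros((dx, dy)); Sxx = np.zeros((dx, dx)); Syy = np.zeros((dy, dy))
    for X, Y in zip(Xc, Yc):
        Sxy += X.sum(0)[:, None] @ Y.sum(0)[None, :]  # sum over all pairs
        Sxx += Y.shape[0] * (X.T @ X)   # each x repeats once per y in class
        Syy += X.shape[0] * (Y.T @ Y)   # each y repeats once per x in class
    return Sxy / M, Sxx / M, Syy / M, M

rng = np.random.default_rng(3)
Xc = [rng.standard_normal((4, 2)), rng.standard_normal((3, 2))]  # modality x
Yc = [rng.standard_normal((5, 2)), rng.standard_normal((2, 2))]  # modality y
Sxy, Sxx, Syy, M = cluster_cov(Xc, Yc)   # here M = 4*5 + 3*2 = 26
```

These matrices then simply replace their single-pair counterparts in the standard CCA objective of Eq. (12).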
Besides supervised CCA-based methods, Lin and Tang [45] propose a common discriminant feature extraction (CDFE) method to learn a common feature subspace in which the difference between the between-class scatter and the within-class scatter is maximized. Mao et al. [47] introduce a method for cross-media retrieval named parallel field alignment retrieval (PFAR), which integrates a manifold alignment framework from the perspective of vector fields. Zhai et al. [48] propose a novel feature learning algorithm for cross-modal data, named Joint Representation Learning (JRL), which explores the correlation and semantic information in a unified optimization framework.
Wang et al. [52] propose a novel regularization framework for the cross-modal matching problem, called LCFS (Learning Coupled Feature Spaces). It unifies coupled linear regressions, the \ell_{21}-norm and the trace norm into a generic minimization formulation so that subspace learning and coupled feature selection can be performed simultaneously. Furthermore, they extend this framework to the case of more than two modalities in [53], where the extended version is called JFSSL (Joint Feature Selection and Subspace Learning). The main extensions are as follows: 1) they propose a multi-modal graph to better model the similarity relationships among different modalities of data, which is demonstrated to outperform the low-rank constraint in terms of both computational cost and retrieval performance; 2) accordingly, a new iterative algorithm is proposed to solve the modified objective function, together with a proof of its convergence. Inspired by the idea of (semi-)coupled dictionary learning, Zhuang et al. [49] bring coupled dictionary learning into supervised sparse coding for cross-modal retrieval, in an approach called Supervised coupled dictionary learning with group structures for multi-modal retrieval (SliM). It utilizes class information to jointly learn discriminative multi-modal dictionaries as well as mapping functions between the different modalities. The objective function is formulated as follows:
(17)   \min_{\{D^{(m)}, A^{(m)}\}, \{W^{pq}\}} \sum_{m} \|X^{(m)} - D^{(m)} A^{(m)}\|_F^2 + \lambda \sum_{m} \sum_{c} \|A_c^{(m)}\|_{1,2} + \beta \sum_{p \neq q} \|A^{(p)} - W^{pq} A^{(q)}\|_F^2

where D^{(m)} and A^{(m)} are the dictionary and sparse coefficient matrix of the m-th modality, and A_c^{(m)} is the coefficient matrix associated with the intra-modality data belonging to the c-th class. As shown above, data in the q-th modality space can be mapped into the p-th modality space by the learned W^{pq} according to A^{(p)} \approx W^{pq} A^{(q)}; therefore, the computation of cross-modal similarity is achieved.
3.4.2 Topic models
Based on the Document Neural Autoregressive Distribution Estimator (DocNADE), Zhen et al. [54] propose a supervised extension of DocNADE, which learns a joint representation from image visual words, annotation words and class label information. Liao et al. [55] present a nonparametric Bayesian upstream supervised (NPBUS) multi-modal topic model for analyzing multi-modal data. The NPBUS model allows flexible learning of the correlation structures of topics within individual modalities and between different modalities. The model becomes more discriminative by incorporating upstream supervised information shared by the multi-modal data. Furthermore, it is capable of automatically determining the number of latent topics in each modality.
Wang et al. [56] propose a supervised multi-modal mutual topic reinforce modeling (M³R) approach for cross-media retrieval, which seeks to build a joint cross-modal probabilistic graphical model for discovering mutually consistent semantic topics via appropriate interactions between model factors (e.g., categories, latent topics and the observed multi-modal data).
3.4.3 Deep learning methods
Wang et al. [57] propose a regularized deep neural network (REDNN) for semantic mapping across modalities. They design and implement a 5-layer neural network that maps visual and textual features into a common semantic space, in which the similarity between different modalities can be measured.
Li et al. [96]
propose a deep learning method to address the crossmedia retrieval problem with multiple labels. The proposed method is supervised, and the correlation between the two modalities is built according to their shared ground-truth probability vectors. Both networks have two hidden layers and one output layer, and the squared loss is employed as the loss function.
Wei et al. [58] propose a deep semantic matching method to address the crossmodal retrieval problem for samples which are annotated with one or multiple labels.
Castrejon et al. [97] present two approaches to regularize crossmodal convolutional networks so that the intermediate representations are aligned across modalities. The focus of this work is to learn crossmodal representations when the modalities are significantly different (e.g., text and natural images) and have category supervision.
Wang et al. [59] propose a supervised Multimodal Deep Neural Network (MDNN) method, which consists of a deep convolutional neural network (DCNN) model and a neural language model (NLM) to learn mapping functions for the image modality and the text modality respectively. It exploits the label information, and thus can learn robust mapping functions against noisy input data.
4 Binary Representation Learning
Most existing realvalued crossmodal retrieval techniques rely on brute-force linear search, which is time-consuming for large-scale data. A practical way to speed up similarity search is binary representation learning, commonly referred to as hashing. Existing hashing methods can be categorized into unimodal hashing, multiview hashing and crossmodal hashing. Representative hashing methods on unimodal data include spectral hashing (SH) [98], self-taught hashing [99], iterative quantization hashing (ITQ) [100], and so on. These approaches focus on learning hash functions for data objects with homogeneous features. However, in real-world applications, we often extract multiple types of features with different properties from data objects. Accordingly, multiview hashing methods (e.g., CHMIS [101] and MFH [102]) leverage the information contained in different features to learn more accurate hash codes.
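The speedup that hashing buys comes from the fact that the Hamming distance between packed binary codes reduces to XOR plus popcount. A minimal sketch (random codes stand in for learned hash codes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_db, n_bits = 10000, 64

# Random binary codes standing in for learned hash codes.
db_codes = rng.integers(0, 2, size=(n_db, n_bits), dtype=np.uint8)
query = rng.integers(0, 2, size=n_bits, dtype=np.uint8)

# Pack each 64-bit code into bytes, so Hamming distance becomes XOR + popcount.
db_packed = np.packbits(db_codes, axis=1)   # shape (n_db, 8)
q_packed = np.packbits(query)               # shape (8,)

# Hamming distances to the whole database in one vectorised pass.
xor = np.bitwise_xor(db_packed, q_packed)
dists = np.unpackbits(xor, axis=1).sum(axis=1)

nearest = np.argsort(dists)[:10]            # 10 nearest neighbours in Hamming space
```

Compared with cosine similarity over dense float vectors, this costs a few bit operations per database item and a fraction of the memory.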
Crossmodal hashing methods aim to discover correlations among different modalities of data to enable crossmodal similarity search. They project different modalities of data into a common Hamming space for fast retrieval (as shown in Figure 9). Similarly, crossmodal hashing methods can be categorized into: 1) unsupervised methods, 2) pairwise based methods, and 3) supervised methods. To the best of our knowledge, there is little literature on rank based crossmodal hashing. According to whether the learnt hash functions are linear or nonlinear, crossmodal hashing methods can be further divided into two categories: linear modeling and nonlinear modeling. Linear modeling methods learn linear functions to obtain hash codes, whereas nonlinear modeling methods learn the hash codes in a nonlinear manner.
4.1 Unsupervised methods
4.1.1 Linear modeling
In [60], Kumar et al. propose cross-view hashing (CVH), which extends spectral hashing [98] from the traditional unimodal setting to the multimodal scenario. The hash functions map similar objects to similar codes across views, and thus enable cross-view similarity search. The objective minimizes the similarity-weighted distance between the hash codes of pairs of objects, where the distance between two objects is their Hamming distance summed over all views. Actually, CCA can be viewed as a special case of CVH obtained by setting the inter-object similarity matrix to the identity.
Rastegari et al. [62] propose a predictable dual-view hashing (PDH) algorithm for two modalities. They formulate an objective function to maintain the predictability of the binary codes, and optimize it with an iterative method based on block coordinate descent.
Ding et al. [64] propose a novel hashing method referred to as Collective Matrix Factorization Hashing (CMFH). CMFH assumes that all modalities of an instance generate identical hash codes. It learns unified hash codes by collective matrix factorization with a latent factor model: the feature matrices of the different modalities are jointly factorized into modality-specific projections and a shared latent semantic representation, from which the unified hash codes of each instance are derived.
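The shared-latent-factor idea behind CMFH can be sketched with plain alternating least squares on two synthetic modality matrices; this is a simplification that omits the regularization terms and out-of-sample projection functions of the full model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, k = 200, 40, 30, 16
X1 = rng.standard_normal((d1, n))           # modality 1 (e.g. image features)
X2 = rng.standard_normal((d2, n))           # modality 2 (e.g. text features)

lam, eps = 1.0, 1e-6
V = rng.standard_normal((k, n))             # shared latent representation

# Alternating least squares: each update exactly minimizes
# ||X1 - U1 V||^2 + lam * ||X2 - U2 V||^2 in one block of variables.
for _ in range(30):
    U1 = X1 @ V.T @ np.linalg.inv(V @ V.T + eps * np.eye(k))
    U2 = X2 @ V.T @ np.linalg.inv(V @ V.T + eps * np.eye(k))
    A = U1.T @ U1 + lam * U2.T @ U2 + eps * np.eye(k)
    V = np.linalg.solve(A, U1.T @ X1 + lam * U2.T @ X2)

# Unified binary codes: threshold the shared factors at their per-bit median.
codes = (V > np.median(V, axis=1, keepdims=True)).astype(np.uint8)
```

Because both modalities are reconstructed from the same V, every instance receives a single code regardless of which modality it is observed in.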
Zhou et al. [65] propose Latent Semantic Sparse Hashing (LSSH) to perform crossmodal similarity search. In particular, LSSH uses sparse coding to capture the salient structures of images and matrix factorization to learn the latent concepts from text. The learned latent semantic features of the two modalities are then mapped into a joint abstraction space, and the overall objective function combines these three components.
Song et al. [61] propose an intermedia hashing (IMH) model to transform multimodal data into a common Hamming space. This method explores intermedia consistency and intramedia consistency to derive effective hash codes, based on which hash functions are learnt to efficiently map new data points into the Hamming space. The objective function for learning the hash codes of the two modalities consists of three terms: the first models intramodal consistency, the second models intermodal consistency, and the third learns the hash functions that generate hash codes for new data.
Zhu et al. [63] propose linear crossmodal hashing (LCMH), which enables scalable indexing for multimedia search with training time linear in the size of the training data. The key idea is to partition the training data of each modality into clusters, and then represent each training data point by its distances to the cluster centroids, preserving the intra-similarity within each modality. To preserve the inter-similarity among data points across different modalities, the derived data representations are then transformed into a common binary subspace.
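The centroid-distance representation that gives LCMH its linear training cost is easy to sketch (k-means here stands in for whichever clustering step is used, and the data and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))         # one modality's training features

k = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Re-represent every point by its distances to the k centroids:
# an n x k matrix computed in O(nk) time instead of O(n^2) pairwise distances.
D = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
```

Each row of D is a compact proxy for the point's position relative to the data's structure; intra-modal neighbours get similar rows, which the subsequent binarization step can preserve.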
4.1.2 Nonlinear modeling
The hash functions learned by most existing crossmodal hashing methods are linear. To capture more complex structure of the multimodal data, nonlinear hash function learning is studied recently. Based on stacked autoencoders, Wang et al. [66] propose an effective nonlinear mapping mechanism for multimodal retrieval, called Multimodal Stacked AutoEncoders (MSAE). Mapping functions are learned by optimizing a new objective function, which captures both intramodal and intermodal semantic relationships of data from heterogeneous sources effectively. The stacked structure of MSAE enables the method to learn nonlinear projections rather than linear projections.
Wang et al. [67] propose a Deep Multimodal Hashing with Orthogonal Regularization (DMHOR) to learn accurate and compact multimodal representations. The method can better capture the intramodality and intermodality correlations to learn accurate representations. Meanwhile, in order to make the representations compact and reduce redundant information lying in the codes, an orthogonal regularizer is imposed on the learned weighting matrices.
4.2 Pairwise based methods
4.2.1 Linear modeling
To the best of our knowledge, Bronstein et al. [68] propose the first crossmodal hashing method, called crossmodal similarity sensitive hashing (CMSSH). CMSSH learns hash functions for the bimodal case in a standard boosting manner. Specifically, given two modalities of data sets, CMSSH learns two groups of hash functions to ensure that if two data points (with different modalities) are relevant, their corresponding hash codes are similar and otherwise dissimilar. However, CMSSH only preserves the intermodality correlation but ignores the intramodality similarity.
Zhen et al. [69] propose a multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted co-regularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs. The objective function for learning the projections of the two modalities combines an intramodality loss term for each modality with an intermodality loss term based on the smoothly clipped inverted squared deviation function in Eq. 6 of [30]. The learning of multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized.
Hu et al. [70] propose a multiview hashing algorithm for crossmodal retrieval, called Iterative MultiView Hashing (IMVH). IMVH learns discriminative hashing functions that map multiview data into a shared Hamming space. It not only preserves the within-view similarity but also incorporates the between-view correlations into the encoding scheme, mapping similar points close together and pushing dissimilar ones apart.
Crossmodal hashing methods usually assume that the hashed data reside in a common Hamming space. However, this may be inappropriate, especially when the modalities are quite different. To address this problem, Ou et al. [72] propose Relation-aware Heterogeneous Hashing (RaHH), which provides a general framework for generating hash codes of data entities from multiple heterogeneous domains. Unlike existing crossmodal hashing methods that map heterogeneous data into a common Hamming space, RaHH constructs a Hamming space for each type of data entity and simultaneously learns optimal mappings between these spaces. The RaHH framework encodes both homogeneous and heterogeneous relationships between the data entities to learn the hash codes: a homogeneous loss penalizes code differences between related entities of the same type, and a heterogeneous logistic loss requires the codes of two entities from different domains to be close when the strength of their relationship is large. Based on a similar idea, Wei et al. [73] present a Heterogeneous Translated Hashing (HTH) method. HTH simultaneously learns hash functions that embed heterogeneous media into different Hamming spaces and translators that align these spaces.
Wu et al. [71] present a crossmodal hashing approach, called quantized correlation hashing (QCH), which considers the quantization loss over domains and the relation between domains. Unlike previous approaches that separate the optimization of the quantizer and the maximization of domain correlation, this approach simultaneously optimizes both processes. The underlying relation between the domains that describes the same objects is established via maximizing the correlation among the hash codes across domains.
4.2.2 Nonlinear modeling
Zhen et al. [74] propose a probabilistic latent factor model, called multimodal latent binary embedding (MLBE), to learn hash functions for crossmodal retrieval. MLBE employs a generative model to encode the intra-similarity and inter-similarity of data objects across multiple modalities. Based on maximum a posteriori estimation, the binary latent factors are efficiently obtained and then taken as hash codes. However, the optimization easily gets trapped in local minima during learning, especially when the code length is large.
Zhai et al. [75] present a new parametric local multimodal hashing (PLMH) method for crossview similarity search. PLMH learns a set of hash functions to locally adapt to the data structure of each modality. Different local hash functions are learned at different locations of the input spaces, therefore, the overall transformations of all points in each modality are locally linear but globally nonlinear.
To learn nonlinear hash functions, Masci et al. [76] introduce a learning framework for multimodal similarity-preserving hashing based on a coupled siamese neural network architecture. It utilizes both similar and dissimilar pairs for intra- and intermodality similarity learning. For two modalities, the objective function couples the two siamese networks through a crossmodal loss and complements it with a unimodal loss for each modality; both losses are defined over the sets of similar and dissimilar pairs, pulling similar pairs together and pushing dissimilar pairs apart. The full multimodal version, making use of inter- and intramodal training data, is called MMNN in the original paper. Unlike most existing crossmodality similarity learning approaches, the hashing functions are not limited to linear projections: by increasing the number of layers in the network, mappings of arbitrary complexity can be trained.
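Such coupled objectives typically take a standard margin-based contrastive form; the snippet below implements a generic version of this kind of loss (a simplified formulation, not the exact objective of [76]):

```python
import numpy as np

def contrastive_loss(f_x, g_y, same, margin=1.0):
    """Margin-based loss over embedded pairs: pull matching pairs together,
    push mismatched pairs at least `margin` apart."""
    d = np.linalg.norm(f_x - g_y, axis=1)          # distances between pair embeddings
    pos = same * d ** 2                            # penalize distant positives
    neg = (1 - same) * np.maximum(0.0, margin - d) ** 2  # penalize close negatives
    return np.mean(pos + neg)

rng = np.random.default_rng(0)
f_x = rng.standard_normal((8, 16))                 # image-branch embeddings
g_y = rng.standard_normal((8, 16))                 # text-branch embeddings
same = rng.integers(0, 2, size=8)                  # 1 = matching pair, 0 = mismatched
loss = contrastive_loss(f_x, g_y, same)
```

In the coupled-siamese setting this loss appears twice within each modality (on the outputs of one branch) and once across modalities (between the two branches).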
Cao et al. [77] propose Correlation Hashing Network (CHN), a new hybrid architecture for crossmodal hashing. They jointly learn good image and text representations tailored to hash coding and formally control the quantization error.
4.3 Supervised methods
4.3.1 Linear modeling
Zhang et al. [80] propose a multimodal hashing method, called semantic correlation maximization (SCM), which integrates semantic labels into the hashing learning procedure. This method uses the label vectors to construct a semantic similarity matrix, and learns hash codes whose inner products reconstruct that matrix. The objective function can be reformulated so as to learn orthogonal projections. Furthermore, a sequential learning method (SCMSeq) is proposed to learn the hash functions bit by bit without imposing the orthogonality constraints.
Based on the dictionary learning framework, Wu et al. [78] develop an approach to obtain the sparse codesets for data objects across different modalities via joint multimodal dictionary learning, which is called sparse multimodal hashing (abbreviated as SMH). In SMH, both intramodality similarity and intermodality similarity are firstly modeled by a hypergraph, and then multimodal dictionaries are jointly learned by Hypergraph Laplacian sparse coding. Based on the learned dictionaries, the sparse codeset of each data object is acquired and conducted for multimodal approximate nearest neighbor retrieval using a sensitive Jaccard metric. Similarly, Yu et al. [79] propose a discriminative coupled dictionary hashing (DCDH) method to capture the underlying semantic information of the multimodal data. In DCDH, the coupled dictionary for each modality is learned with the aid of class information. As a result, the coupled dictionaries not only preserve the intrasimilarity and intercorrelation among multimodal data, but also contain dictionary atoms that are semantically discriminative (i.e., data from the same category are reconstructed by the similar dictionary atoms). To perform fast crossmedia retrieval, hash functions are learned to map data from the dictionary space to a lowdimensional Hamming space.
4.3.2 Nonlinear modeling
To capture more complex data structure, Lin et al. [81] propose a two-step supervised hashing method termed SePH (Semantics-Preserving Hashing) for cross-view retrieval. For training, SePH first transforms the given semantic affinities of the training data into a probability distribution and approximates it with the to-be-learnt hash codes in Hamming space by minimizing the KL-divergence. Then, in each view, SePH utilizes kernel logistic regression with a sampling strategy to learn the nonlinear projections from features to hash codes. For any unseen instance, the predicted hash codes and their corresponding output probabilities from the observed views are combined by a probabilistic approach to determine its unified hash code.
Cao et al. [82] propose a novel supervised crossmodal hashing method, Correlation Autoencoder Hashing (CAH), to learn discriminative and compact binary codes based on deep autoencoders. Specifically, CAH jointly maximizes the feature correlation revealed by bimodal data and the semantic correlation conveyed in similarity labels, while embeds them into hash codes by nonlinear deep autoencoders.
Jiang et al. [83] propose a novel crossmodal hashing method, called deep crossmodal hashing (DCMH), by integrating feature learning and hashcode learning into the same framework. DCMH is an endtoend learning framework with deep neural networks, one for each modality, to perform feature learning from scratch.
5 Multimodal datasets
Dataset  Modality  Number of Samples  Image Features  Text Features  Number of Categories 

Wiki  image/text  2,866  SIFT+BOW  LDA  10 
INRIAWebsearch  image/text  71,478  —  —  353 
Flickr30K  image/sentences  31,783  —  —  — 
NUSWIDE  image/tags  186,577  6 types  tag occurrence feature  81 
PascalVOC  image/tags  9,963  3 types  tag occurrence feature  20 
With the popularity of multimodal data, crossmodal retrieval becomes an urgent and challenging problem. To evaluate the performance of crossmodal retrieval algorithms, researchers collect multimodal data to build up multimodal datasets. Here, we introduce five commonly used datasets, i.e., Wikipedia, INRIAWebsearch, Flickr30K, Pascal VOC, and NUSWIDE datasets.
The Wiki imagetext dataset (http://www.svcl.ucsd.edu/projects/crossmodal/) [13]: it is generated from Wikipedia's "featured articles" and consists of 2,866 imagetext pairs. In each pair, the text is an article describing people, places or events, and the image is closely related to the content of the article (as shown in Figure 10). Each pair is labeled with one of 10 semantic classes. The 10-dimensional text representation is derived from a latent Dirichlet allocation model [91], and the images are represented by 128-dimensional SIFT descriptor histograms [103].
The INRIAWebsearch dataset (http://lear.inrialpes.fr/pubs/2010/KAVJ10/) [104]: it contains 71,478 imagetext pairs that can be categorized into 353 different concepts, including famous landmarks, actors, films, logos, etc. Each concept comes with a number of images retrieved via Internet search, and each image is marked as either relevant or irrelevant to its query concept. The text modality consists of the text surrounding images on web pages. This dataset is very challenging because it contains a large number of classes.
The Flickr30K dataset (http://shannon.cs.illinois.edu/DenotationGraph/) [105]: it is an extension of Flickr8K [106]; it contains 31,783 images collected from different Flickr groups and focuses on events involving people and animals. Each image is associated with five sentences independently written by native English speakers recruited via Mechanical Turk.
The NUSWIDE dataset (http://lms.comp.nus.edu.sg/research/NUSWIDE.htm) [107]: this dataset contains 186,577 labeled images. Each image is associated with user tags, and together they can be taken as an imagetext pair. To guarantee that each class has abundant training samples, researchers generally select those pairs that belong to one of the K (K = 10 or 21) largest classes, with each pair exclusively belonging to one of the selected classes. Six types of lowlevel features are extracted from these images: a 64-D color histogram, a 144-D color correlogram, a 73-D edge direction histogram, a 128-D wavelet texture, 225-D blockwise color moments extracted over 5×5 fixed grid partitions, and a 500-D bag of words based on SIFT descriptions. The textual tags are represented with 1000-dimensional tag occurrence feature vectors.
The Pascal VOC dataset (http://www.cs.utexas.edu/~grauman/research/datasets.html) [108]: it consists of 5,011/4,952 (training/testing) imagetag pairs, which can be categorized into 20 different classes. Some examples are shown in Figure 11. Since some images are multilabeled, researchers usually select images with only one object, as done in [15], resulting in 2,808 training and 2,841 testing pairs. The image features include histograms of bag-of-visual-words, GIST and color features [108], and the text features are 399-dimensional tag occurrence features.
These datasets generally contain two modalities of data, i.e., image and text. Among them, only the Wiki dataset is designed specifically for crossmodal retrieval, but the numbers of samples and categories it contains are small. Another commonly used dataset is NUSWIDE, especially in crossmodal hashing, due to its relatively large size. However, the text modality in NUSWIDE consists of user tags, which are simple and limited as textual descriptions. The Flickr30K dataset is usually used for imagesentence retrieval and does not have category information. It is therefore desirable to design a more general, largescale multimodal dataset for future research, containing more modalities of data and more categories.
6 Experiments
In this section, we test the performance of different kinds of crossmodal retrieval methods. Firstly, we introduce evaluation metrics, then choose representative crossmodal retrieval methods for evaluation. For the compared methods, we cite the reported results in the relevant literature.
6.1 Evaluation metric
To evaluate the performance of the crossmodal retrieval methods, two crossmodal retrieval tasks are conducted: (1) image query vs. text database, and (2) text query vs. image database. More specifically, in the testing phase, we take one modality of data of the testing set as the query set to retrieve the other modality of data. The cosine distance is adopted to measure the similarity of features. Given an image (or text) query, the goal of each crossmodal task is to find the nearest neighbors from the text (or image) database.
The mean average precision (MAP) [13] is used to evaluate the overall performance of the tested algorithms. To compute MAP, we first evaluate the average precision (AP) of a set of $n$ retrieved documents by $\mathrm{AP} = \frac{1}{R}\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)$, where $R$ is the number of relevant documents in the retrieved set, $P(k)$ denotes the precision of the top $k$ retrieved documents, and $\mathrm{rel}(k) = 1$ if the $k$-th retrieved document is relevant (where 'relevant' means belonging to the class of the query) and $\mathrm{rel}(k) = 0$ otherwise. The MAP is then computed by averaging the AP values over all queries in the query set. The larger the MAP, the better the performance.
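For concreteness, the AP and MAP computation described above can be written in a few lines (the binary relevance lists in the example are illustrative):

```python
import numpy as np

def average_precision(relevance):
    """AP for one ranked list: relevance[k] = 1 if the (k+1)-th retrieved
    item shares the query's class, else 0."""
    relevance = np.asarray(relevance, dtype=float)
    R = relevance.sum()
    if R == 0:
        return 0.0
    k = np.arange(1, len(relevance) + 1)
    precision_at_k = np.cumsum(relevance) / k      # P(k) for every cut-off
    return float((precision_at_k * relevance).sum() / R)

def mean_average_precision(relevance_lists):
    """MAP: mean of the per-query AP values."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Example: two queries with the binary relevance of their ranked results.
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]])
```

For the first list, AP = (1 + 2/3) / 2 = 5/6; for the second, AP = (1/2 + 2/3) / 2 = 7/12, giving MAP = 17/24.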
For imagesentence retrieval, a different set of metrics is commonly used; these are described in Section 6.2.2.
6.2 Comparison of realvalued representation learning methods
For imagetext retrieval, the Wiki imagetext dataset and the NUSWIDE dataset are commonly used to evaluate performance. For the realvalued representation learning methods, we choose three popular unsupervised methods (i.e., PLS [14], BLM [15, 16] and CCA [13]), two rank based methods (i.e., LSCMR [37], and BiLSCMR [38]) and eight popular supervised methods (i.e., CDFE [45], GMLDA [15], GMMFA [15], CCA3V [51], SliM [49], MR [56], LCFS [52] and JFSSL [53]).
For imagesentence retrieval, Flickr30K is the commonly used evaluation dataset, and deep learning methods are generally used for modeling the images and sentences. Among the deep learning methods, we choose four recently proposed ones: DeViSE [41], SDTRNN [42], Deep Fragment [43] and Endtoend DCCA [27].
6.2.1 Results on Wiki and NUSWIDE
Tables III and V show the MAP scores achieved by PLS [14], BLM [15, 16], CCA [13], LSCMR [37], BiLSCMR [38], CDFE [45], GMMFA [15], GMLDA [15], CCA3V [51], SliM [49], MR [56], LCFS [52] and JFSSL [53]. (For the Wiki and NUSWIDE datasets, we use the features provided by the authors. For PLS, BLM, CCA, CDFE, GMMFA, GMLDA, CCA3V, LCFS and JFSSL, we adopt the codes provided by the authors and use the settings in [53]. For LSCMR, BiLSCMR, SliM, and MR, we cite the publicly reported results.) For the experiments on the NUSWIDE dataset, Principal Component Analysis (PCA) is first performed on the original features to remove redundancy for PLS, BLM, CCA, CDFE, GMMFA, GMLDA, and CCA3V.
We observe that the supervised learning methods (CDFE, GMMFA, GMLDA, CCA3V, SliM, MR, LCFS and JFSSL) perform better than the unsupervised learning methods (PLS, BLM and CCA). The reason is that PLS, BLM and CCA only care about pairwise closeness in the common subspace, while the supervised methods utilize class information to obtain much better separation between classes in the common representation space. Hence, it is helpful for crossmodal retrieval to learn a discriminative common representation space. The rank based methods (LSCMR and BiLSCMR) achieve comparable results on the Wiki dataset, but perform worse on the NUSWIDE dataset. One reason is that they only use a small set of data to generate rank lists on the NUSWIDE dataset, so the amount of training data is insufficient. Another reason is that the features they use (the bag-of-words (BoW) representation with the TFIDF weighting scheme for text, and the bag-of-visual-words (BoVW) feature for images) are not powerful enough. LCFS and JFSSL perform the best; one reason is that they perform feature selection on the different feature spaces simultaneously.

Results on the Wiki dataset with different types of features: To evaluate the effect of different types of features, we test the crossmodal retrieval performance with different types of image and text features on the Wiki dataset. Here we cite the results in [53]. Besides the features provided by the Wiki dataset itself, 4096-dimensional CNN (convolutional neural network) features for images are extracted by Caffe [109], and 5000-dimensional feature vectors for texts are extracted using the bag-of-words representation with the TFIDF weighting scheme. Table IV shows the MAP scores of GMLDA, CCA3V, LCFS and JFSSL with different types of features on the Wiki dataset. PCA is performed on the CNN and TFIDF features for GMLDA and CCA3V. It can be seen that all methods achieve better results when using the CNN features. This is because CNN features are more powerful for image representation, which has been demonstrated in many fields. Overall, better representation of the data leads to better performance.

Methods  Image query  Text query  Average 

PLS  0.2402  0.1633  0.2032 
BLM  0.2562  0.2023  0.2293 
CCA  0.2549  0.1846  0.2198 
LSCMR  0.2021  0.2229  0.2125 
BiLSCMR  0.2123  0.2528  0.2326 
CDFE  0.2655  0.2059  0.2357 
GMMFA  0.2750  0.2139  0.2445 
GMLDA  0.2751  0.2098  0.2425 
CCA3V  0.2752  0.2242  0.2497 
SliM  0.2548  0.2021  0.2285 
MR  0.2298  0.2677  0.2488 
LCFS  0.2798  0.2141  0.2470 
JFSSL  0.3063  0.2275  0.2669 
Query  Methods  Features (Image/Text)  
SIFT/LDA  CNN/LDA  SIFT/TFIDF  CNN/TFIDF  
Image  GMLDA  0.2751  0.4084  0.2782  0.4455  
CCA3V  0.2752  0.4049  0.2862  0.4370  
LCFS  0.2798  0.4132  0.2978  0.4553  
JFSSL  0.3063  0.4279  0.3080  0.4670  
Text  GMLDA  0.2098  0.3693  0.1925  0.3661  
CCA3V  0.2242  0.3651  0.2238  0.3832  
LCFS  0.2141  0.3845  0.2134  0.3978  
JFSSL  0.2275  0.3957  0.2257  0.4102 
Results on the Wiki dataset in the three-modality case: To evaluate the performance of crossmodal retrieval methods in a three-modality case, we test several methods on the Wiki dataset, citing the results in [53]. To the best of our knowledge, no datasets with three or more modalities are publicly available in the recent literature. The Wiki dataset contains two modalities of data, text and image, and we adopt the settings in [53]: to simulate a three-modality setting, 4096-dimensional CNN (convolutional neural network) features of the images, extracted by Caffe [109], are treated as a third, virtual modality. Here the 128-dimensional SIFT histogram, the 10-dimensional LDA feature and the 4096-dimensional CNN feature are used as Modality A, Modality B and Modality C, respectively. Table VI shows the MAP comparison on the Wiki dataset in the three-modality case. We can see that JFSSL outperforms the other methods in all three crossmodal retrieval tasks. This is mainly because JFSSL is designed for the multi-modality case and can therefore model the correlations between different modalities more accurately in the three-modality setting, whereas the other methods are designed only for the two-modality case and are not well suited to it.
From the above experiments, we draw the following conclusions:

Supervised learning methods generally achieve better results than unsupervised learning methods. The reason is that unsupervised methods only care about pairwise closeness in the common subspace, but supervised learning methods utilize class information to obtain much better separation between classes in the common representation space. Hence, it is helpful for crossmodal retrieval to learn a discriminative common representation.

For crossmodal retrieval, better features generally lead to better performance. So it is beneficial for crossmodal retrieval to learn powerful representations for various modalities of data.

Algorithms designed for the two-modality case cannot be directly extended to settings with more than two modalities. In such settings, they generally perform worse than algorithms designed to handle more than two modalities.
Methods  Image query  Text query  Average 

PCA+PLS  0.2752  0.2661  0.2706 
PCA+BLM  0.2976  0.2809  0.2892 
PCA+CCA  0.2872  0.2840  0.2856 
LSCMR  0.1424  0.2491  0.1958 
BiLSCMR  0.1453  0.2380  0.1917 
PCA+CDFE  0.2595  0.2869  0.2732 
PCA+GMMFA  0.2983  0.2939  0.2961 
PCA+GMLDA  0.3243  0.3076  0.3159 
PCA+CCA3V  0.3513  0.3260  0.3386 
SliM  0.3154  0.2924  0.3039 
MR  0.2445  0.3044  0.2742 
LCFS  0.3830  0.3460  0.3645 
JFSSL  0.4035  0.3747  0.3891 
Query  Modality A  Modality B  Modality C 

PLS  0.1629  0.1653  0.2412 
BLM  0.1673  0.2167  0.2607 
CCA  0.1733  0.1722  0.2434 
CDFE  0.1882  0.1836  0.2548 
GMMFA  0.2005  0.1961  0.2551 
GMLDA  0.1841  0.1700  0.2525 
CCA3V  0.2301  0.1720  0.2665 
LCFS  0.2292  0.3065  0.3072 
JFSSL  0.2636  0.3203  0.3354 
6.2.2 Results on Flickr30K
For the Flickr30K dataset, we adopt the evaluation metrics in [43] for a fair comparison. More specifically, for imagesentence retrieval, we report the median rank (Med r) of the closest ground-truth result in the ranked list, as well as R@K (with K = 1, 5, 10), which computes the fraction of queries for which the correct result is found among the top K retrieved items. In contrast to R@K, a lower median rank indicates better performance.
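These metrics are simple to compute given the rank of the ground-truth result for each query (the example ranks are illustrative):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth item appears in the top k.
    `ranks` holds the 1-based rank of the correct result for each query."""
    ranks = np.asarray(ranks)
    return float(np.mean(ranks <= k))

def median_rank(ranks):
    """Median of the 1-based ranks of the correct results."""
    return float(np.median(ranks))

# Example: ranks of the correct result for five queries.
ranks = [1, 3, 7, 12, 2]
r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
med = median_rank(ranks)
```

For these five queries, R@1 = 0.2, R@5 = 0.6, R@10 = 0.8 and Med r = 3.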
Table VII shows the publicly reported results achieved by DeViSE [41], SDTRNN [42], Deep Fragment [43] and Endtoend DCCA [27] on the Flickr30K dataset. Endtoend DCCA and Deep Fragment perform better than the other methods. The reason is that Deep Fragment breaks an image into objects and a sentence into dependency-tree relations, and maximizes the explicit alignment between the image fragments and text fragments. For Endtoend DCCA, the TFIDF-based text features and CNN-based visual features capture global properties of the two modalities, and the alignment of the fragments in image and text is implicitly considered through the CCA correlation objective.
Methods  Sentence retrieval  Image retrieval  
R@1  R@5  R@10  Med r  R@1  R@5  R@10  Med r  
DeViSE  4.5  18.1  29.2  26  6.7  21.9  32.7  25  
SDTRNN  9.6  29.8  41.1  16  8.9  29.8  41.1  16  
Deep Fragment  14.2  37.7  51.3  10  10.2  30.8  44.2  14  
Endtoend DCCA  16.7  39.3  52.9  8  12.6  31.0  43.0  15 
Tasks  Methods  Code length (bits)  
16  32  64  128  
Image query vs. Text database  
CVH  0.1257  0.1212  0.1215  0.1171  
IMH  0.1573  0.1575  0.1568  0.1651  
LSSH  0.2141  0.2216  0.2218  0.2211  
CMFH  0.2132  0.2259  0.2362  0.2419  
CMSSH  0.1877  0.1771  0.1646  0.1552  
SCMSeq  0.2210  0.2337  0.2442  0.2596  
SePH  0.2787  0.2956  0.3064  0.3134  
MMNN  –  0.5750  –  –  
Text query vs. Image database  
CVH  0.1185  0.1034  0.1024  0.0990  
IMH  0.1463  0.1311  0.1290  0.1301  
LSSH  0.5031  0.5224  0.5293  0.5346  
CMFH  0.4884  0.5132  0.5269  0.5375  
CMSSH  0.1630  0.1617  0.1539  0.1517  
SCMSeq  0.2134  0.2366  0.2479  0.2573  
SePH  0.6318  0.6577  0.6646  0.6709  
MMNN  –  0.2740  –  – 
6.3 Comparison of binary representation learning methods
For cross-modal hashing methods, the Wiki image-text dataset and the NUS-WIDE dataset are the most commonly used datasets. We follow the settings in [81] and use mean average precision (MAP) as the evaluation metric.
Tables VIII and IX show the publicly reported MAP scores achieved by unsupervised cross-modal hashing methods (CVH [60], IMH [61], LSSH [65], and CMFH [64]), pairwise-based methods (CMSSH [68] and MM-NN [76]), and supervised methods (SCM-Seq [80] and SePH [81]) on the Wiki and NUS-WIDE datasets, respectively. MM-NN and SePH are nonlinear modeling methods; the others are linear. The code length is set to 16, 32, 64, and 128 bits. For MM-NN, we show the results reported in the original paper; MAP scores at some code lengths are missing.
Table IX: MAP comparison on the NUS-WIDE dataset for varying code lengths (bits). MM-NN scores are shown as reported in the original paper.

Task            Method    16      32      64      128
Image → Text    CVH       0.3687  0.4182  0.4602  0.4466
                IMH       0.4187  0.3975  0.3778  0.3668
                LSSH      0.3900  0.3924  0.3962  0.3966
                CMFH      0.4267  0.4229  0.4207  0.4182
                CMSSH     0.4063  0.3927  0.3939  0.3739
                SCM-Seq   0.4842  0.4941  0.4947  0.4965
                SePH      0.5421  0.5499  0.5537  0.5601
                MM-NN     57.44   –       56.33   –
Text → Image    CVH       0.3646  0.4024  0.4339  0.4255
                IMH       0.4053  0.3892  0.3758  0.3627
                LSSH      0.4286  0.4248  0.4248  0.4175
                CMFH      0.4627  0.4556  0.4518  0.4478
                CMSSH     0.3874  0.3849  0.3704  0.3699
                SCM-Seq   0.4536  0.4620  0.4630  0.4644
                SePH      0.6302  0.6425  0.6506  0.6580
                MM-NN     56.91   –       55.83   –
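The MAP scores reported above are averages of per-query average precision (AP). A minimal sketch of the computation (function names are ours; exact definitions vary slightly across papers, e.g., some truncate AP at the top-R retrieved items):

```python
from statistics import mean

def average_precision(relevant, ranked_list):
    """AP for one query: average of the precision values at each relevant hit,
    normalized by the total number of relevant items."""
    hits, precisions = 0, []
    for i, item in enumerate(ranked_list, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP: mean AP over all queries; each query is (relevant_set, ranked_list)."""
    return mean(average_precision(rel, ranked) for rel, ranked in queries)

# Toy example: two queries over a five-item database
q1 = ({"a", "c"}, ["a", "b", "c", "d", "e"])  # AP = (1/1 + 2/3) / 2 = 5/6
q2 = ({"d"}, ["b", "d", "a", "c", "e"])       # AP = (1/2) / 1 = 0.5
print(mean_average_precision([q1, q2]))       # (5/6 + 1/2) / 2 = 2/3
```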
From the experimental results, we can draw the following observations.
1) IMH performs better than CVH. The reason is that CVH exploits only the inter-modality similarity, whereas IMH exploits both inter-modality and intra-modality similarity. Thus, modeling intra-modal similarity is useful in cross-modal hashing algorithms.
2) CMSSH performs better than CVH. The reason is that CVH uses only pairwise similarity information, whereas CMSSH exploits both similar and dissimilar pairs. Thus, using more supervisory information helps improve performance.
3) Although trained without supervised information, CMFH and LSSH can exploit the latent semantic affinities of the training data well and yield state-of-the-art performance for cross-modal retrieval. Thus, modeling the common representation space in an appropriate manner (as CMFH and LSSH do) is very important.
4) SCM-Seq integrates semantic labels into the hash-learning procedure by maximizing semantic correlations, and it outperforms several state-of-the-art methods. This shows that supervised information is beneficial for learning binary codes for the various modalities of data.
5) SePH performs better than the other methods, owing to its ability to better preserve semantic affinities in Hamming space, as well as to the effectiveness of kernel logistic regression in modeling the nonlinear projections from features to hash codes. MM-NN learns a nonlinear projection using a Siamese neural-network architecture and generally performs better than the linear models. Thus, learning nonlinear projections is more appropriate for the complex structure of multimodal data.
6) As the length of the hash codes increases, the performance of the supervised hashing methods (SCM-Seq and SePH) keeps increasing, which reflects their ability to utilize longer hash codes to better preserve semantic affinities. Meanwhile, the performance of some baselines such as CVH, IMH, and CMSSH decreases, as also observed in previous work [64, 80, 63]. Thus, it is more difficult to learn longer binary codes without supervised information.
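The retrieval mechanism these observations refer to can be made concrete with a small sketch. Here random linear projections (a purely illustrative stand-in: real methods such as CVH or SePH learn the projection matrices so that semantically related cross-modal pairs receive nearby codes) binarize each modality's features, and retrieval ranks database items by Hamming distance:

```python
import random

def hash_code(features, projection):
    """Binarize a real-valued feature vector: the sign of each linear
    projection row yields one bit of the code."""
    return tuple(1 if sum(w * x for w, x in zip(row, features)) > 0 else 0
                 for row in projection)

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return sum(x != y for x, y in zip(a, b))

random.seed(0)
bits, dim_img, dim_txt = 16, 8, 5
# Illustrative random projections; real methods learn W_img and W_txt jointly
W_img = [[random.gauss(0, 1) for _ in range(dim_img)] for _ in range(bits)]
W_txt = [[random.gauss(0, 1) for _ in range(dim_txt)] for _ in range(bits)]

img_db = [[random.gauss(0, 1) for _ in range(dim_img)] for _ in range(100)]
img_codes = [hash_code(v, W_img) for v in img_db]

# Cross-modal query: hash a text feature, rank images by Hamming distance
query = hash_code([random.gauss(0, 1) for _ in range(dim_txt)], W_txt)
ranked = sorted(range(len(img_codes)), key=lambda i: hamming(query, img_codes[i]))
print(ranked[:5])  # indices of the five nearest images
```

Hamming distances over short binary codes can be computed with bitwise operations, which is what makes hashing-based cross-modal retrieval efficient at scale.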
7 Discussion and future trends
Although some promising results have been achieved in cross-modal retrieval, a gap remains between state-of-the-art methods and user expectations, which indicates that the cross-modal retrieval problem still requires further investigation. In the following, we discuss future research opportunities for cross-modal retrieval.
1. Collection of multimodal large-scale datasets
Researchers have been designing increasingly sophisticated algorithms to retrieve or summarize multimodal data. However, there is a lack of good resources for training, testing, evaluating, and comparing the performance of different algorithms. Currently available datasets for cross-modal retrieval research are either too small, such as the Wikipedia dataset, which contains only 2,866 documents, or too specific, such as NUS-WIDE, whose textual modality consists only of user tags. Hence, a large-scale multimodal dataset containing more than two modalities of data with ground truth would be tremendously helpful for researchers.
2. Multimodal learning with limited and noisy annotations
Emerging applications on social networks have produced huge amounts of user-created multimodal content; typical examples include Flickr, YouTube, Facebook, MySpace, WeiBo, and WeiXin. It is well known that multimodal data on the web is loosely organized, and that its annotations are limited and noisy. Labeling large-scale multimodal data is clearly difficult. However, annotations provide semantic information for multimodal data, so how to utilize limited and noisy annotations to learn semantic correlations among multimodal data in this scenario needs to be addressed in the future.
3. Scalability to large-scale data
Driven by the wide availability of massive storage devices, mobile devices, and fast networks, more and more multimedia resources are generated and propagated on the web. With the rapid growth of multimodal data, we need to develop effective and efficient algorithms that scale on distributed platforms. Further research is also needed on effectively and efficiently organizing the relevant modalities of data together.
4. Deep learning on multimodal data
Recently, deep learning algorithms have achieved much progress in image classification [110, 111, 112], video recognition/classification [113, 114], text analysis [8, 4], and so on. Deep learning algorithms show good properties in representation learning, and powerful representations help reduce the heterogeneity gap and the semantic gap between different modalities of data. Hence, combining appropriate deep learning algorithms to model the different types of data for cross-modal retrieval (such as CNNs for images and recurrent neural networks (RNNs) for text) is a future trend.
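As a toy illustration of the common-space idea behind such architectures (with tiny hand-picked matrices standing in for learned CNN/RNN encoders; all values here are illustrative), each modality is projected into a shared space and candidates are ranked by cosine similarity:

```python
import math

def project(x, W):
    """Map a modality-specific feature vector into the common space.
    A linear map stands in for a learned deep encoder branch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors in the common space."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical 2-D common space; W_img and W_txt would be learned in practice
W_img = [[0.5, 0.1, 0.0], [0.0, 0.2, 0.7]]   # image branch: 3-D -> 2-D
W_txt = [[0.3, 0.4], [0.6, 0.1]]             # text branch: 2-D -> 2-D

image_feat = [1.0, 0.5, 0.2]
text_feats = [[1.0, 0.0], [0.0, 1.0]]        # two candidate texts

emb_img = project(image_feat, W_img)
sims = [cosine(emb_img, project(t, W_txt)) for t in text_feats]
print(sims)  # similarity of the image to each candidate text
```

Because both modalities land in the same space, a query of either type can be matched against a database of the other type with an ordinary nearest-neighbor search.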
5. Finer-level cross-modal semantic correlation modeling
Most published works embed the different modalities into a common embedding space; for example, they map images and texts into a common space where different modalities of data can be compared. However, this is too coarse, because different image fragments correspond to different text fragments, and considering such finer correspondences could capture image-text semantic relations more accurately. Thus, how to obtain the fragments of different modalities and find their correspondences is very important, and new models should be designed to capture such complex relations.
8 Conclusions
Cross-modal retrieval provides an effective and powerful approach to multimodal data retrieval, and it is more convenient than traditional single-modality-based techniques. This paper gives an overview of cross-modal retrieval, summarizes a number of representative methods, and classifies them into two main groups: 1) real-valued representation learning and 2) binary representation learning. We then introduce several commonly used multimodal datasets and empirically evaluate the performance of some representative methods on them. We also discuss future trends in the cross-modal retrieval field. Although significant work has been carried out in this field, cross-modal retrieval has not been well addressed to date, and much work remains to be done. We hope this paper helps readers understand the state of the art in cross-modal retrieval and motivates further meaningful work.
References
 [1] J. Liu, C. Xu, and H. Lu, “Crossmedia retrieval: stateoftheart and open issues,” International Journal of Multimedia Intelligence and Security, vol. 1, no. 1, pp. 33–52, 2010.
 [2] C. Xu, D. Tao, and C. Xu, “A survey on multiview learning,” arXiv preprint arXiv:1304.5634, 2013.
 [3] A. Karpathy and F. Li, “Deep visualsemantic alignments for generating image descriptions,” in Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
 [4] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
 [5] X. Chen and C. L. Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in Computer Vision and Pattern Recognition, 2015, pp. 2422–2431.
 [6] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, “Guiding the long-short term memory model for image caption generation,” in International Conference on Computer Vision, 2015, pp. 2407–2415.
 [7] Y. Ushiku, M. Yamaguchi, Y. Mukuta, and T. Harada, “Common subspace for model and similarity: Phrase learning for caption generation from images,” in International Conference on Computer Vision, 2015, pp. 2668–2676.
 [8] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” in International Conference on Learning Representations, 2015.
 [9] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv:1411.2539, 2014.
 [10] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1494–1504, 2015.
 [11] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence–video to text,” arXiv:1505.00487.
 [12] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, T. Darrell, and K. Saenko, “Longterm recurrent convolutional networks for visual recognition and description,” in Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
 [13] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to crossmodal multimedia retrieval,” in International conference on Multimedia. ACM, 2010, pp. 251–260.
 [14] R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” in Subspace, latent structure and feature selection. Springer, 2006, pp. 34–51.
 [15] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2160–2167.
 [16] J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, no. 6, pp. 1247–1283, 2000.
 [17] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, “Multimedia content processing through crossmodal association,” in International Conference on Multimedia. ACM, 2003, pp. 604–611.
 [18] V. Mahadevan, C. W. Wong, J. C. Pereira, T. Liu, N. Vasconcelos, and L. K. Saul, “Maximum covariance unfolding: Manifold learning for bimodal data,” in Advances in Neural Information Processing Systems, 2011, pp. 918–926.
 [19] X. Shi and P. Yu, “Dimensionality reduction on heterogeneous feature space,” in International Conference on Data Mining, 2012, pp. 635–644.
 [20] F. Zhu, L. Shao, and M. Yu, “Crossmodality submodular dictionary learning for information retrieval,” in International Conference on Information and Knowledge Management. ACM, 2014, pp. 1479–1488.
 [21] D. M. Blei and M. I. Jordan, “Modeling annotated data,” in Conference on Research and Development in Informaion Retrieval. ACM, 2003, pp. 127–134.
 [22] D. Putthividhy, H. T. Attias, and S. S. Nagarajan, “Topic regression multimodal latent dirichlet allocation for image annotation,” in Computer Vision and Pattern Recognition. IEEE, 2010, pp. 3408–3415.
 [23] Y. Jia, M. Salzmann, and T. Darrell, “Learning crossmodality similarity for multinomial data,” in International Conference on Computer Vision. IEEE, 2011, pp. 2407–2414.
 [24] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning, 2011, pp. 689–696.
 [25] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.
 [26] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013, pp. 1247–1255.
 [27] F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in Computer Vision and Pattern Recognition, 2015, pp. 3441–3450.
 [28] F. Feng, X. Wang, and R. Li, “Crossmodal retrieval with correspondence autoencoder,” in International Conference on Multimedia. ACM, 2014, pp. 7–16.
 [29] R. Xu, C. Xiong, W. Chen, and J. J. Corso, “Jointly modeling deep video and compositional text to bridge vision and language in a unified framework,” in AAAI Conference on Artificial Intelligence, 2015, pp. 2346–2352.
 [30] N. Quadrianto and C. H. Lampert, “Learning multiview neighborhood preserving projections,” in International Conference on Machine Learning, 2011, pp. 425–432.
 [31] D. Zhai, H. Chang, S. Shan, X. Chen, and W. Gao, “Multiview metric learning with global consistency and local smoothness,” ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 3, 2012.
 [32] X. Zhai, Y. Peng, and J. Xiao, “Heterogeneous metric learning with joint graph regularization for crossmedia retrieval,” in AAAI Conference on Artificial Intelligence, 2013, pp. 1198–1204.
 [33] Z. Yuan, J. Sang, Y. Liu, and C. Xu, “Latent feature learning in social media network,” in International Conference on Multimedia. ACM, 2013, pp. 253–262.
 [34] J. Wang, Y. He, C. Kang, S. Xiang, and C. Pan, “Imagetext crossmodal retrieval via modalityspecific feature learning,” in International Conference on Multimedia Retrieval, 2015, pp. 347–354.
 [35] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger, “Learning to rank with (a lot of) word features,” Information Retrieval, vol. 13, no. 3, pp. 291–314, 2010.
 [36] D. Grangier and S. Bengio, “A discriminative kernelbased approach to rank images from text queries,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1371–1384, 2008.
 [37] X. Lu, F. Wu, S. Tang, Z. Zhang, X. He, and Y. Zhuang, “A low rank structural large margin method for crossmodal ranking,” in Conference on Research and Development in Information Retrieval. ACM, 2013, pp. 433–442.
 [38] F. Wu, X. Lu, Z. Zhang, S. Yan, Y. Rui, and Y. Zhuang, “Crossmedia semantic representation via bidirectional learning to rank,” in International Conference on Multimedia. ACM, 2013, pp. 877–886.
 [39] J. Weston, S. Bengio, and N. Usunier, “Wsabie: Scaling up to large vocabulary image annotation,” in International Joint Conference on Artificial Intelligence, vol. 11, 2011, pp. 2764–2770.
 [40] T. Yao, T. Mei, and C.W. Ngo, “Learning query and image similarities with ranking canonical correlation analysis,” in International Conference on Computer Vision, 2015, pp. 28–36.
 [41] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., “Devise: A deep visualsemantic embedding model,” in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
 [42] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 207–218, 2014.
 [43] A. Karpathy, A. Joulin, and F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in Advances in Neural Information Processing Systems, 2014, pp. 1889–1897.
 [44] X. Jiang, F. Wu, X. Li, Z. Zhao, W. Lu, S. Tang, and Y. Zhuang, “Deep compositional crossmodal learning to rank via localglobal alignment,” in International Conference on Multimedia. ACM, 2015, pp. 69–78.
 [45] D. Lin and X. Tang, “Intermodality face recognition,” in European Conference on Computer Vision. Springer, 2006, pp. 13–26.
 [46] X.Y. Jing, R.M. Hu, Y.P. Zhu, S.S. Wu, C. Liang, and J.Y. Yang, “Intraview and interview supervised correlation analysis for multiview feature learning,” in AAAI Conference on Artificial Intelligence, 2014, pp. 1882–1889.
 [47] X. Mao, B. Lin, D. Cai, X. He, and J. Pei, “Parallel field alignment for cross media retrieval,” in International Conference on Multimedia. ACM, 2013, pp. 897–906.
 [48] X. Zhai, Y. Peng, and J. Xiao, “Learning crossmedia joint representation with sparse and semisupervised regularization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 965–978, 2014.
 [49] Y. T. Zhuang, Y. F. Wang, F. Wu, Y. Zhang, and W. M. Lu, “Supervised coupled dictionary learning with group structures for multimodal retrieval,” in AAAI Conference on Artificial Intelligence, 2013.
 [50] N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, “Cluster canonical correlation analysis,” in International Conference on Artificial Intelligence and Statistics, 2014, pp. 823–831.
 [51] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multiview embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
 [52] K. Wang, R. He, W. Wang, L. Wang, and T. Tan, “Learning coupled feature spaces for crossmodal matching,” in International Conference on Computer Vision. IEEE, 2013, pp. 2088–2095.
 [53] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for crossmodal retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, preprint.
 [54] Y. Zheng, Y.J. Zhang, and H. Larochelle, “Topic modeling of multimodal data: an autoregressive approach,” in Computer Vision and Pattern Recognition. IEEE, 2014, pp. 1370–1377.
 [55] R. Liao, J. Zhu, and Z. Qin, “Nonparametric bayesian upstream supervised multimodal topic models,” in International Conference on Web Search and Data Mining. ACM, 2014, pp. 493–502.
 [56] Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, “Multimodal mutual topic reinforce modeling for crossmedia retrieval,” in International Conference on Multimedia. ACM, 2014, pp. 307–316.
 [57] C. Wang, H. Yang, and C. Meinel, “Deep semantic mapping for crossmodal retrieval,” in International Conference on Tools with Artificial Intelligence, 2015, pp. 234–241.
 [58] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Cross-modal retrieval with CNN visual features: A new baseline,” IEEE Transactions on Cybernetics, preprint.
 [59] W. Wang, X. Yang, B. C. Ooi, D. Zhang, and Y. Zhuang, “Effective deep learningbased multimodal retrieval,” International Journal on Very Large Data Bases, vol. 25, no. 1, pp. 79–101, 2016.
 [60] L. Sun, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in International Conference on Machine learning. ACM, 2008, pp. 1024–1031.
 [61] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, “Intermedia hashing for largescale retrieval from heterogeneous data sources,” in International Conference on Management of Data. ACM, 2013, pp. 785–796.
 [62] M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. Davis, “Predictable dualview hashing,” in International Conference on Machine Learning, 2013, pp. 1328–1336.
 [63] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao, “Linear crossmodal hashing for efficient multimedia search,” in International Conference on Multimedia. ACM, 2013, pp. 143–152.
 [64] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in Computer Vision and Pattern Recognition. IEEE, 2014, pp. 2083–2090.
 [65] J. Zhou, G. Ding, and Y. Guo, “Latent semantic sparse hashing for crossmodal similarity search,” in Conference on Research & Development in Information Retrieval. ACM, 2014, pp. 415–424.
 [66] W. Wang, B. C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, “Effective multimodal retrieval based on stacked autoencoders,” in International Conference on Very Large Data Bases, 2014, pp. 649–660.
 [67] D. Wang, P. Cui, M. Ou, and W. Zhu, “Learning compact hash codes for multimodal representations using orthogonal deep structure,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1404–1416, 2015.
 [68] R. He, W.S. Zheng, and B.G. Hu, “Maximum correntropy criterion for robust face recognition,” Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1561–1576, 2011.
 [69] Y. Zhen and D.Y. Yeung, “Coregularized hashing for multimodal data,” in Advances in Neural Information Processing Systems, 2012, pp. 1376–1384.
 [70] Y. Hu, Z. Jin, H. Ren, D. Cai, and X. He, “Iterative multiview hashing for cross media indexing,” in International Conference on Multimedia. ACM, 2014, pp. 527–536.
 [71] B. Wu, Q. Yang, W. Zheng, Y. Wang, and J. Wang, “Quantized correlation hashing for fast crossmodal search,” in International Joint Conference on Artificial Intelligence, 2015, pp. 3946–3952.
 [72] M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang, “Comparing apples to oranges: a scalable solution with heterogeneous hashing,” in International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 230–238.
 [73] Y. Wei, Y. Song, Y. Zhen, B. Liu, and Q. Yang, “Scalable heterogeneous translated hashing,” in International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 791–800.
 [74] Y. Zhen and D.Y. Yeung, “A probabilistic model for multimodal hash function learning,” in International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 940–948.
 [75] D. Zhai, H. Chang, Y. Zhen, X. Liu, X. Chen, and W. Gao, “Parametric local multimodal hashing for crossview similarity search,” in International Joint Conference on Artificial Intelligence. AAAI Press, 2013, pp. 2754–2760.
 [76] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, “Multimodal similaritypreserving hashing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 824–830, 2014.
 [77] Y. Cao, M. Long, and J. Wang, “Correlation hashing network for efficient crossmodal retrieval,” CoRR, vol. abs/1602.06697, 2016. [Online]. Available: http://arxiv.org/abs/1602.06697
 [78] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multimodal hashing,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 427–439, 2014.
 [79] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast crossmedia retrieval,” in Conference on Research & Development in Information Retrieval. ACM, 2014, pp. 395–404.
 [80] D. Zhang and W.J. Li, “Largescale supervised multimodal hashing with semantic correlation maximization,” in AAAI Conference on Artificial Intelligence, 2014, pp. 2177–2183.
 [81] Z. Lin, G. Ding, M. Hu, and J. Wang, “Semanticspreserving hashing for crossview retrieval,” in Computer Vision and Pattern Recognition, 2015, pp. 3864–3872.
 [82] Y. Cao, M. Long, J. Wang, and H. Zhu, “Correlation autoencoder hashing for supervised crossmodal search,” in International Conference on Multimedia Retrieval, 2016.
 [83] Q. Jiang and W. Li, “Deep crossmodal hashing,” CoRR, vol. abs/1602.02255, 2016. [Online]. Available: http://arxiv.org/abs/1602.02255
 [84] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multiview embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
 [85] J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in crossmodal multimedia retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 521–535, 2014.
 [86] R. Udupa and M. Khapra, “Improving the multilingual user experience of wikipedia using crosslanguage name search,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 492–500.
 [87] A. Li, S. Shan, X. Chen, and W. Gao, “Face recognition based on noncorresponding region matching,” in International Conference on Computer Vision. IEEE, 2011, pp. 1060–1067.
 [88] A. Sharma and D. W. Jacobs, “Bypassing synthesis: Pls for face recognition with pose, lowresolution and sketch,” in Computer Vision and Pattern Recognition. IEEE, 2011, pp. 593–600.
 [89] Y. Chen, L. Wang, W. Wang, and Z. Zhang, “Continuum regression for crossmodal multimedia retrieval,” in International Conference on Image Processing. IEEE, 2012, pp. 1949–1952.
 [90] X. Wang, Y. Liu, D. Wang, and F. Wu, “Crossmedia topic mining on wikipedia,” in International Conference on Multimedia. ACM, 2013, pp. 689–692.
 [91] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
 [92] W. Wu, J. Xu, and H. Li, “Learning similarity function between objects in heterogeneous spaces,” Microsoft Research Technique Report, 2010.
 [93] A. Mignon and F. Jurie, “CMML: a new metric learning approach for cross modal matching,” in Asian Conference on Computer Vision, 2012.
 [94] Y. Hua, H. Tian, A. Cai, and P. Shi, “Crossmodal correlation learning with deep convolutional architecture,” in Visual Communications and Image Processing, 2015, pp. 1–4.
 [95] V. Ranjan, N. Rasiwasia, and C. V. Jawahar, “Multilabel cross-modal retrieval,” in International Conference on Computer Vision, 2015, pp. 4094–4102.
 [96] Z. Li, W. Lu, E. Bao, and W. Xing, “Learning a semantic space by deep network for crossmedia retrieval,” in International Conference on Distributed Multimedia Systems, 2015, pp. 199–203.
 [97] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, “Learning aligned crossmodal representations from weakly aligned data,” in Computer Vision and Pattern Recognition, 2016.
 [98] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in Neural Information Processing Systems, 2009, pp. 1753–1760.
 [99] D. Zhang, J. Wang, D. Cai, and J. Lu, “Selftaught hashing for fast similarity search,” in Conference on Research and Development in Information Retrieval. ACM, 2010, pp. 18–25.
 [100] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in Computer Vision and Pattern Recognition. IEEE, 2011, pp. 817–824.
 [101] D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in Conference on Research and Development in Information Retrieval. ACM, 2011, pp. 225–234.
 [102] J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong, “Multiple feature hashing for realtime large scale nearduplicate video retrieval,” in International Conference on Multimedia. ACM, 2011, pp. 423–432.
 [103] D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [104] J. Krapac, M. Allan, J. Verbeek, and F. Jurie, “Improving web image search results using queryrelative classifiers,” in Computer Vision and Pattern Recognition, 2010, pp. 1094–1101.
 [105] Y. Peter, L. Alice, H. Micah, and H. Julia, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” in Transactions of the Association for Computational Linguistics, 2014, pp. 67–78.
 [106] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013.
 [107] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUSWIDE: A realworld web image database from national university of singapore,” in International Conference on Image and Video Retrieval. ACM, 2009, p. 48.
 [108] S. J. Hwang and K. Grauman, “Reading between the lines: Object localization using implicit cues from image tags,” Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1145–1158, 2012.
 [109] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
 [110] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1106–1114.
 [111] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale visual recognition,” in International Conference on Learning Representations, 2015.
 [112] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [113] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li, “Largescale video classification with convolutional neural networks,” in Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
 [114] K. Simonyan and A. Zisserman, “Twostream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.