Over the last decade, different types of media data such as texts, images and videos have been growing rapidly on the Internet. It is common for different types of data to be used to describe the same events or topics. For example, a web page usually contains not only a textual description but also images or videos illustrating the common content. Such different types of data are referred to as multi-modal data, which exhibit heterogeneous properties. There have been many applications for multi-modal data (as shown in Figure 1).
As multimodal data grow, it becomes difficult for users to search for information of interest effectively and efficiently. Till now, various techniques have been developed for indexing and searching multimedia data. However, these search techniques are mostly single-modality-based, and can be divided into keyword-based retrieval and content-based retrieval. They only perform similarity search within the same media type, such as text retrieval, image retrieval, audio retrieval, and video retrieval. Hence, a pressing requirement for advancing information retrieval is to develop a new retrieval model that can support similarity search across multimodal data.
Nowadays, mobile devices and emerging social websites (e.g., Facebook, Flickr, YouTube, and Twitter) are changing the ways people interact with the world and search for information of interest. It is convenient if users can submit any media content at hand as the query. Suppose we are visiting the Great Wall: by taking a photo, we may expect to use the photo to retrieve relevant textual materials that serve as a visual guide. Therefore, cross-modal retrieval, as a natural way of searching, becomes increasingly important. Cross-modal retrieval aims to take one type of data as the query to retrieve relevant data of another type. For example, as shown in Figure 2, text is used as the query to retrieve images. Furthermore, when users search for information by submitting a query of any media type, they can obtain search results across various modalities, which is more comprehensive given that different modalities of data can provide complementary information to each other.
More recently, cross-modal retrieval has attracted considerable research attention. The challenge of cross-modal retrieval is how to measure the content similarity between different modalities of data, which is referred to as the heterogeneity gap. Hence, compared with traditional retrieval methods, cross-modal retrieval requires cross-modal relationship modeling, so that users can retrieve what they want by submitting what they have. At present, the main research effort is to design effective ways to make cross-modal retrieval more accurate and more scalable.
This paper aims to conduct a comprehensive survey of cross-modal retrieval. Although Liu et al. gave an overview of cross-modal retrieval in 2010, it does not include many important works proposed in recent years. Xu et al. summarize several methods for modeling multimodal data, but they focus on multi-view learning. Since many technical challenges remain in cross-modal retrieval, various ideas and techniques have been proposed to solve the cross-modal problem in recent years. This paper focuses on summarizing these latest works in cross-modal retrieval, whose major concerns are very different from those of previous related surveys. Another topic in modeling multimodal data is image/video description [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], which is not discussed here because it goes beyond the scope of cross-modal retrieval research.
The major contributions of this paper are briefly summarized as follows.
1) This paper provides a survey of recent progress in cross-modal retrieval. It contains many new references not found in previous surveys, which helps beginners become familiar with cross-modal retrieval quickly.
2) This paper gives a taxonomy of cross-modal retrieval approaches. The differences between the kinds of methods are elaborated, which helps readers better understand the various techniques used in cross-modal retrieval.
3) This paper evaluates several representative algorithms on commonly used datasets. Some meaningful findings are obtained, which are useful for understanding cross-modal retrieval algorithms and are expected to benefit both practical applications and future research.
4) This paper summarizes challenges and opportunities in the field of cross-modal retrieval, and points out some open directions for future work.
The rest of this paper is organized as follows: we first give an overview of the different kinds of methods for cross-modal retrieval in Section 2. Then, we describe the different kinds of cross-modal retrieval algorithms in detail in Sections 3 and 4. We introduce several multimodal datasets in Section 5. Experimental results are reported in Section 6. The discussion and future trends are given in Section 7. Finally, Section 8 concludes this paper.
2 Overview
In the cross-modal retrieval procedure, users can search various modalities of data including texts, images and videos, starting with any modality of data as a query. Figure 3 presents the general framework of cross-modal retrieval, in which feature extraction for multimodal data is the first step and yields a representation for each modality. Based on these representations, cross-modal correlation modeling is performed to learn common representations for the various modalities of data. Finally, the common representations enable cross-modal retrieval through suitable solutions for search result ranking and summarization.
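As a minimal sketch of the last step of this framework, assume that features from both modalities have already been mapped into a common space by some learned model (not shown here). Ranking then reduces to sorting gallery items by their similarity to the query. The function name `rank_by_cosine`, the choice of cosine similarity, and the toy vectors are illustrative assumptions rather than part of any specific method surveyed here.

```python
import numpy as np

def rank_by_cosine(query_vec, gallery):
    """Sort gallery rows from most to least similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # one similarity score per gallery item
    order = np.argsort(-scores)         # descending similarity
    return order, scores[order]

# Hypothetical common-space vectors: one text query and three images.
text_query = np.array([1.0, 0.0, 0.0])
image_gallery = np.array([[0.0, 1.0, 0.0],
                          [0.9, 0.1, 0.0],
                          [0.5, 0.5, 0.0]])
order, scores = rank_by_cosine(text_query, image_gallery)
print(order)  # indices of images, best match first
```

Any other similarity or distance defined on the common space could be substituted for the cosine score without changing the structure of the retrieval step.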
TABLE I: Typical algorithms for cross-modal retrieval, organized by category.
| Representation | Information utilized | Model type | Typical algorithms |
|---|---|---|---|
| Real-valued representation learning | Unsupervised methods | Subspace learning methods | |
| | | Topic model | Corr-LDA, Tr-mm LDA, MDRF |
| | | Deep learning methods | |
| | Pairwise based methods | Shallow methods | Multi-NPP, MVML-GL, JGRHML |
| | | Deep learning methods | RGDBN, MSDS |
| | Rank based methods | Shallow methods | |
| | | Deep learning methods | |
| | Supervised methods | Subspace learning methods | |
| | | Topic model | SupDocNADE, NPBUS, MR |
| | | Deep learning methods | RE-DNN, deep-SM, MDNN |
| Binary representation learning | Unsupervised methods | Linear modeling | |
| | | Nonlinear modeling | MSAE, DMHOR |
| | Pairwise based methods | Linear modeling | |
| | | Nonlinear modeling | MLBE, PLMH, MM-NN, CHN |
| | Supervised methods | Linear modeling | SMH, DCDH, SCM |
| | | Nonlinear modeling | SePH, CAH, DCMH |
Since cross-modal retrieval is an important problem in real applications, various approaches have been proposed to deal with it. They can be roughly divided into two categories: 1) real-valued representation learning and 2) binary representation learning, which is also called cross-modal hashing. For real-valued representation learning, the learned common representations for various modalities of data are real-valued. To speed up cross-modal retrieval, binary representation learning methods aim to transform different modalities of data into a common Hamming space, in which cross-modal similarity search is fast. Since the representation is encoded as binary codes, the retrieval accuracy generally decreases slightly due to the loss of information.
According to the utilized information when learning the common representations, the cross-modal retrieval methods can be further divided into four groups: 1) unsupervised methods, 2) pairwise based methods, 3) rank based methods, and 4) supervised methods. Generally speaking, the more information one method utilizes, the better performance it obtains.
1) For unsupervised methods, only co-occurrence information is utilized to learn common representations across multi-modal data. Co-occurrence information means that if different modalities of data co-exist in a multimodal document, then they share the same semantics. For example, a web page usually contains both textual descriptions and images illustrating the same event or topic.
2) For the pairwise based methods, similar pairs (or dissimilar pairs) are utilized to learn common representations. These methods generally learn a meaningful metric distance between different modalities of data.
3) For the rank based methods, rank lists are often utilized to learn common representations. Ranking based methods study the cross-modal retrieval as a problem of learning to rank.
4) Supervised methods exploit label information to learn common representations. These methods enforce the learned representations of different-class samples to be far apart while those of the same-class samples lie as close as possible. Accordingly, they obtain more discriminative representations. But getting label information is sometimes expensive due to massive manual annotation.
Typical algorithms of the cross-modal retrieval in terms of different categories are summarized in Table I.
3 Real-valued Representation Learning
If different modalities of data are related to the same event or topic, they are expected to share a common representation space in which relevant data are close to each other. Real-valued representation learning methods aim to learn a real-valued common representation space, in which the similarity between different modalities of data can be measured directly. According to the information utilized to learn the common representation, these methods can be further divided into four groups: 1) unsupervised methods, 2) pairwise based methods, 3) rank based methods, and 4) supervised methods. We introduce each group in turn below, and describe some of the methods in detail for better understanding.
3.1 Unsupervised methods
The unsupervised methods only utilize co-occurrence information to learn common representations across multi-modal data. Co-occurrence information means that if different modalities of data co-exist in a multimodal document, then they have similar semantics. For example, a textual description often appears in a webpage along with images or videos to illustrate the same event or topic. The unsupervised methods are further categorized into subspace learning methods, topic models and deep learning methods.
3.1.1 Subspace learning methods
The main difficulty of cross-modal retrieval is how to measure the content similarity between different modalities of data. Subspace learning methods are one type of the most popular methods. They aim to learn a common subspace shared by different modalities of data, in which the similarity between different modalities of data can be measured (as shown in Figure 4). Unsupervised subspace learning methods use pairwise information to learn a common latent subspace across multi-modal data. They enforce pair-wise closeness between different modalities of data in the common subspace.
Canonical Correlation Analysis (CCA) is one of the most popular unsupervised subspace learning methods for establishing inter-modal relationships between data from different modalities. It has been widely used for cross-media retrieval [13, 84, 85], cross-lingual retrieval and some vision problems. CCA aims to learn two directions $w_x$ and $w_y$ for two modalities of data, along which the data are maximally correlated, i.e.,
$$\max_{w_x, w_y} \frac{w_x^T \Sigma_{xy} w_y}{\sqrt{w_x^T \Sigma_{xx} w_x}\,\sqrt{w_y^T \Sigma_{yy} w_y}}$$
where $\Sigma_{xx}$ and $\Sigma_{yy}$ represent the empirical covariance matrices for the two modalities of data respectively, while $\Sigma_{xy}$ represents the cross-covariance matrix between them. Rasiwasia et al. propose a two-stage method for cross-modal multimedia retrieval. In the first stage, CCA is used to learn a common subspace by maximizing the correlation between the two modalities. Then, a semantic space is learned to measure the similarity of different modal features.
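To make the CCA criterion concrete, the following is a minimal NumPy sketch that computes the first pair of canonical directions by whitening each modality and taking an SVD of the whitened cross-covariance. The function names, the small ridge term `reg`, and the synthetic data are assumptions made for illustration; this is one standard way to solve CCA, not the implementation used in any cited work.

```python
import numpy as np

def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals ** -0.5) @ vecs.T

def cca_first_pair(X, Y, reg=1e-6):
    """Return the first canonical directions (w_x, w_y) and their correlation."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    S_xx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])  # ridge for stability (assumption)
    S_yy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    S_xy = Xc.T @ Yc / (n - 1)
    S_xx_is, S_yy_is = inv_sqrt(S_xx), inv_sqrt(S_yy)
    U, s, Vt = np.linalg.svd(S_xx_is @ S_xy @ S_yy_is)
    return S_xx_is @ U[:, 0], S_yy_is @ Vt[0], s[0]

# Synthetic paired data: both modalities share one latent factor z.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([rng.normal(size=(500, 1)), -z + 0.1 * rng.normal(size=(500, 1))])
w_x, w_y, rho = cca_first_pair(X, Y)
proj_corr = np.corrcoef(X @ w_x, Y @ w_y)[0, 1]
print(rho, proj_corr)
```

Taking further columns of `U` and rows of `Vt` would give additional canonical directions, which together span the common subspace used for retrieval.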
Besides CCA, Partial Least Squares (PLS) and Bilinear Model (BLM) [15, 16] are also used for cross-modal retrieval. Sharma and Jacobs use PLS to linearly map images with different modalities to a common linear subspace in which they are highly correlated. Chen et al. apply PLS to cross-modal document retrieval. They use PLS to map the image features into the text space, and then learn a semantic space for measuring the similarity between the two modalities. Tenenbaum and Freeman propose a bilinear model (BLM) to derive a common space for cross-modal face recognition. BLM has also been used for text-image retrieval.
Li et al.  introduce a cross-modal factor analysis (CFA) approach to evaluate the association between two modalities. The CFA method adopts a criterion of minimizing the Frobenius norm between pairwise data in the transformed domain. Mahadevan et al.  propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different modalities. Shi et al.  propose a principle of collective component analysis (CoCA), to handle dimensionality reduction on a heterogeneous feature space. Zhu et al.  propose a greedy dictionary construction method for the cross-modal retrieval problem. The compactness and modality-adaptivity are preserved by including reconstruction error terms and a Maximum Mean Discrepancy (MMD) measurement for both modalities in the objective function. Wang et al.  propose to learn the sparse projection matrices that map the image-text pairs in Wikipedia into a latent space for cross-modal retrieval.
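The CFA criterion mentioned above can be sketched as follows. Assuming paired, zero-mean data matrices `X` and `Y` (one sample per row) and orthonormal transforms `A` and `B`, minimizing the Frobenius norm of `X @ A - Y @ B` is solved by the singular vectors of `X.T @ Y`; this optimality argument is exact when `A` and `B` are square, which is the case used here. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def cfa_transforms(X, Y):
    """Orthonormal transforms from the SVD of X^T Y (rows are paired samples)."""
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U, Vt.T

def cfa_cost(X, Y, A, B):
    """Squared Frobenius distance between the transformed modalities."""
    return np.linalg.norm(X @ A - Y @ B, "fro") ** 2

# Synthetic pair of modalities related by a hidden rotation R plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ R + 0.01 * rng.normal(size=(50, 4))

A, B = cfa_transforms(X, Y)
cost_cfa = cfa_cost(X, Y, A, B)
cost_identity = cfa_cost(X, Y, np.eye(4), np.eye(4))  # baseline: no transform
print(cost_cfa, cost_identity)
```

Keeping only the leading columns of `A` and `B` gives a lower-dimensional common space, at the price of the exactness noted above.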
3.1.2 Topic models
Another unsupervised method is the topic model. Topic models have been widely applied to a specific cross-modal problem, i.e., image annotation [21, 22]. To capture the correlation between images and annotations, Latent Dirichlet Allocation (LDA) has been extended to learn the joint distribution of multi-modal data, such as Correspondence LDA (Corr-LDA) and Topic-regression Multi-modal LDA (Tr-mm LDA). Corr-LDA uses topics as the shared latent variables, which represent the underlying causes of cross-correlations in the multi-modal data. Tr-mm LDA learns two separate sets of hidden topics and a regression module which captures more general forms of association and allows one set of topics to be linearly predicted from the other.
Jia et al.  propose a new probabilistic model (Multi-modal Document Random Field, MDRF) to learn a set of shared topics across the modalities. The model defines a Markov random field on the document level which allows modeling more flexible document similarities.
3.1.3 Deep learning methods
As mentioned above, it is common for different types of data to be used to describe the same events or topics on the web. For example, user-generated content usually involves data from different modalities, such as images, texts and videos. This makes it very challenging for traditional methods to obtain a joint representation for multimodal data. Inspired by recent progress in deep learning, Ngiam et al. apply deep networks to learn features over multiple modalities, focusing on learning representations for speech audio coupled with videos of the lips. Then, a deep Restricted Boltzmann Machine succeeds in learning the joint representations for multimodal data. It first uses separate modality-friendly latent models to learn low-level representations for each modality, and then fuses them into a joint representation along the deep architecture at the higher level (as shown in Figure 5).
Inspired by representation learning using deep networks [24, 25], Andrew et al. present Deep Canonical Correlation Analysis (DCCA), a deep learning method that learns complex nonlinear projections for different modalities of data such that the resulting representations are highly linearly correlated. Furthermore, Yan and Mikolajczyk propose an end-to-end learning scheme based on deep canonical correlation analysis (End-to-end DCCA), which is a non-trivial extension of DCCA (as shown in Figure 6). Let $X$ and $Y$ denote the outputs of the two network branches, with covariance matrices $\Sigma_{xx}$ and $\Sigma_{yy}$ and cross-covariance matrix $\Sigma_{xy}$. The objective function is
$$\max_{W_x, W_y} \operatorname{tr}\left(W_x^T \Sigma_{xy} W_y\right) \quad \text{s.t.} \quad W_x^T \Sigma_{xx} W_x = W_y^T \Sigma_{yy} W_y = I.$$
Define $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$, then the objective function is rewritten as
$$\operatorname{corr}(X, Y) = \operatorname{tr}\left((T^T T)^{1/2}\right).$$
The gradients with respect to $X$ and $Y$ are computed and propagated down along the two branches of the network. The high dimensionality of features presents a great challenge in terms of memory and speed when used in the DCCA framework. To address this problem, Yan and Mikolajczyk propose and discuss details of a GPU implementation with CULA libraries. The efficiency of the implementation is several orders of magnitude higher than that of a CPU implementation.
Feng et al. propose a novel model involving correspondence autoencoder (Corr-AE) for cross-modal retrieval. The model is constructed by correlating hidden representations of two uni-modal autoencoders. A novel objective, which minimizes a linear combination of representation learning errors for each modality and correlation learning errors between hidden representations of two modalities, is utilized to train the model as a whole. Minimization of correlation learning errors forces the model to learn hidden representations with only common information in different modalities, while minimization of representation learning errors makes hidden representations good enough to reconstruct the input of each modality.
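The combined objective described above can be sketched as a weighted sum of two reconstruction errors and one correlation error between the hidden codes. The sketch below only evaluates the loss for fixed weights. The single-layer encoders, linear decoders without biases, the parameter names, and the `(1 - alpha)` / `alpha` weighting are simplifying assumptions made for illustration, not the exact architecture of the cited model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corr_ae_loss(x_img, x_txt, p, alpha=0.2):
    """Return (total, reconstruction error, correlation error) for paired inputs."""
    h_img = sigmoid(x_img @ p["W_img"])          # hidden code of the image autoencoder
    h_txt = sigmoid(x_txt @ p["W_txt"])          # hidden code of the text autoencoder
    rec_img = h_img @ p["V_img"]                 # linear decoders (assumption)
    rec_txt = h_txt @ p["V_txt"]
    recon = np.sum((x_img - rec_img) ** 2) + np.sum((x_txt - rec_txt) ** 2)
    corr = np.sum((h_img - h_txt) ** 2)          # distance between the two hidden codes
    total = (1.0 - alpha) * recon + alpha * corr
    return total, recon, corr

# Hypothetical sizes: 8-d image features, 5-d text features, 3-d hidden codes.
rng = np.random.default_rng(0)
params = {
    "W_img": rng.normal(scale=0.1, size=(8, 3)),
    "W_txt": rng.normal(scale=0.1, size=(5, 3)),
    "V_img": rng.normal(scale=0.1, size=(3, 8)),
    "V_txt": rng.normal(scale=0.1, size=(3, 5)),
}
x_img = rng.normal(size=(4, 8))
x_txt = rng.normal(size=(4, 5))
total, recon, corr = corr_ae_loss(x_img, x_txt, params, alpha=0.2)
print(total, recon, corr)
```

Setting `alpha` close to 1 emphasizes agreement between the two hidden codes, while setting it close to 0 recovers two independent autoencoders.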
Xu et al. propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model and a joint embedding model. In their language model, they propose a dependency-tree structure model that embeds sentences into a continuous vector space, which preserves visually grounded meanings and word order. In the visual model, they leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, they minimize the distance of the outputs of the deep video model and compositional language model in the joint space, and update these two models jointly. Based on these three parts, this model is able to accomplish three tasks: 1) natural language generation, 2) video-language retrieval and 3) language-video retrieval.
3.2 Pairwise based methods
Compared with the unsupervised method, pairwise based methods utilize more similar pairs (or dissimilar pairs) to learn a meaningful metric distance between different modalities of data, which can be regarded as heterogeneous metric learning (as shown in Figure 7).
3.2.1 Shallow methods
Wu et al. study the metric learning problem of finding a similarity function over two different spaces. Mignon and Jurie propose a metric learning approach for cross-modal matching, which considers both positive and negative constraints. Quadrianto and Lampert propose a new metric learning scheme (Multi-View Neighborhood Preserving Projection, Multi-NPP) to project different modalities into a shared feature space, in which the Euclidean distance provides a meaningful intra-modality and inter-modality similarity. The loss function used to learn the projections for the different feature types consists of a similarity term, which enforces similar objects to be at proximal locations in the latent space, and a dissimilarity term, which pushes dissimilar objects away from each other.
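The interplay of a similarity term and a dissimilarity term can be illustrated with a generic contrastive loss over projected pairs. This is a stand-in chosen for clarity, not the exact Multi-NPP loss: the hinge form with a `margin`, the linear projections `W_x` and `W_y`, and the toy data are all assumptions.

```python
import numpy as np

def pairwise_metric_loss(X, Y, W_x, W_y, similar, margin=1.0):
    """Contrastive loss: pull similar cross-modal pairs together, push others apart."""
    z_x = X @ W_x                                   # project modality 1
    z_y = Y @ W_y                                   # project modality 2
    dist = np.linalg.norm(z_x - z_y, axis=1)        # Euclidean distance per pair
    pull = similar * dist ** 2                      # similarity term
    push = (1 - similar) * np.maximum(0.0, margin - dist) ** 2  # dissimilarity term
    return 0.5 * np.mean(pull + push)

# Three cross-modal pairs: 2-d features in one modality, 3-d in the other.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 0.0]])
W_x = np.eye(2)
W_y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
similar = np.array([1, 0, 0])                       # 1 = similar pair, 0 = dissimilar pair
loss = pairwise_metric_loss(X, Y, W_x, W_y, similar)
print(loss)
```

In this toy case only the third pair contributes: it is labeled dissimilar but its two projections coincide, so the dissimilarity term penalizes it.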
Zhai et al. propose a new method called Multiview Metric Learning with Global consistency and Local smoothness (MVML-GL). This framework consists of two main steps. In the first step, they seek a globally consistent shared latent feature space. In the second step, the explicit mapping functions between the input spaces and the shared latent space are learned via regularized local linear regression. Zhai et al. also propose a joint graph regularized heterogeneous metric learning (JGRHML) algorithm to learn a heterogeneous metric for cross-modal retrieval. Based on the heterogeneous metric, they further learn a high-level semantic metric through label propagation.
3.2.2 Deep learning methods
To predict the links between social media, Yuan et al.  design a Relational Generative Deep Belief Nets (RGDBN) model to learn latent features for social media, which utilizes the relationships between social media in the network. In the RGDBN model, the link between items is generated from the interactions of their latent features. By integrating the Indian buffet process into the modified Deep Belief Nets, they learn the latent feature that best embeds both the media content and observed media relationships. The model is able to analyze the links between heterogeneous as well as homogeneous data, which can also be used for cross-modal retrieval.
Wang et al. propose a novel model based on modality-specific feature learning, named Modality-Specific Deep Structure (MSDS). Considering the characteristics of different modalities, the model uses two types of convolutional neural networks to map the raw data to the latent space representations for images and texts, respectively. In particular, the convolution-based network used for texts involves word embedding learning, which has been shown to be effective in extracting meaningful textual features for text classification. In the latent space, the mapped features of images and texts form relevant and irrelevant image-text pairs, which are used by the one-vs-more learning scheme.
3.3 Rank based methods
The rank based methods utilize rank lists to learn common representations. Rank based methods study the cross-modal retrieval as a problem of learning to rank.
3.3.1 Shallow methods
Bai et al.  present Supervised Semantic Indexing (SSI) for cross-lingual retrieval. Grangier et al.  propose a discriminative kernel-based method (Passive-Aggressive Model for Image Retrieval, PAMIR) to solve the problem of cross-modal ranking by adapting the Passive-Aggressive algorithm. Weston et al.  introduce a scalable model for image annotation by learning a joint representation of images and annotations. It learns to optimize precision at the top of the ranked list of annotations for a given image and learns a low-dimensional joint embedding space for both images and annotations.
Lu et al.  propose a cross-modal ranking algorithm for cross-modal retrieval, called Latent Semantic Cross-Modal Ranking (LSCMR). They utilize the structural SVM to learn a metric such that ranking of data induced by the distance from a query can be optimized against various ranking measures. However, LSCMR does not make full use of bi-directional ranking examples (bi-directional ranking means that both text-query-image and image-query-text ranking examples are utilized in the training). Accordingly, Wu et al.  propose to optimize the bi-directional listwise ranking loss with a latent space embedding.
Recently, Yao et al.  propose a novel Ranking Canonical Correlation Analysis (RCCA) for learning query and image similarities. RCCA is used to adjust the subspace learnt by CCA to further preserve the preference relations in the click data. The objective function of the RCCA is
where and are the initial transformation matrices learnt by CCA, and is the margin ranking loss as follows:
where is a query-image similarity function that is used to measure the similarity of image given query in the latent space.
3.3.2 Deep learning methods
Inspired by the progress of deep learning, Frome et al.  present a new deep visual-semantic embedding model (DeViSE), the objective of which is to leverage semantic knowledge learned in the text domain, and transfer it to a model trained for visual object recognition.
Socher et al. introduce Dependency Tree Recursive Neural Networks (DT-RNNs), which use dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. The image features are extracted from a deep neural network. To learn joint image-sentence representations, the ranking cost function is:
$$J = \sum_{(i,j) \in \mathcal{P}} \sum_{c \in \mathcal{S} \setminus \mathcal{S}(i)} \max\left(0, \Delta - v_i^T y_j + v_i^T y_c\right) + \sum_{(i,j) \in \mathcal{P}} \sum_{c \in \mathcal{I} \setminus \mathcal{I}(j)} \max\left(0, \Delta - v_i^T y_j + v_c^T y_j\right)$$
where $v_i$ is the mapped image vector, $y_j$ is the composed sentence vector, and $\Delta$ is the margin. $s_{ij}$ is the $j$-th sentence description for image $x_i$. $\mathcal{S}$ is the set of all sentence indices and $\mathcal{S}(i)$ is the set of sentence indices corresponding to image $i$. Similarly, $\mathcal{I}$ is the set of all image indices and $\mathcal{I}(j)$ is the set of image indices corresponding to sentence $j$. $\mathcal{P}$ is the set of all correct image-sentence training pairs $(i, j)$. With both images and sentences in the same multimodal space, image-sentence retrieval is performed easily.
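A bidirectional max-margin ranking cost of this kind can be sketched in a few lines. The sketch assumes the simplest setting in which image `i` is paired with exactly one sentence `i`, so that correct pairs lie on the diagonal of the score matrix, and it uses inner products as scores. The function name, the `margin` default, and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def bidirectional_ranking_loss(img_vecs, sent_vecs, margin=1.0):
    """Hinge ranking cost over both retrieval directions (pairs on the diagonal)."""
    scores = img_vecs @ sent_vecs.T                 # scores[i, j] = image i vs sentence j
    pos = np.diag(scores)                           # scores of the correct pairs
    # For each image, every wrong sentence should score lower than the right one.
    cost_sent = np.maximum(0.0, margin - pos[:, None] + scores)
    # For each sentence, every wrong image should score lower than the right one.
    cost_img = np.maximum(0.0, margin - pos[None, :] + scores)
    np.fill_diagonal(cost_sent, 0.0)                # correct pairs incur no cost
    np.fill_diagonal(cost_img, 0.0)
    return cost_sent.sum() + cost_img.sum()

# Well-separated embeddings give zero cost; swapped pairs are penalized.
img = np.eye(2)
good = bidirectional_ranking_loss(3.0 * img, np.eye(2))
bad = bidirectional_ranking_loss(img, np.array([[0.0, 1.0], [1.0, 0.0]]))
print(good, bad)
```

Dropping either `cost_sent` or `cost_img` yields a one-directional ranking loss that only uses image-query-text or text-query-image examples.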
Karpathy et al.  introduce a model for the bidirectional retrieval of images and sentences, which formulates a structured, max-margin objective for a deep neural network that learns to embed both visual and language data into a common, multimodal space. Unlike previous models that directly map images or sentences into a common embedding space, this model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space.
Jiang et al.  exploit the existing image-text databases to optimize a ranking function for cross-modal retrieval, called deep compositional cross-modal learning to rank (CMLR). CMLR considers learning a multi-modal embedding from the perspective of optimizing a pairwise ranking problem while enhancing both local alignment and global alignment. In particular, the local alignment (i.e., the alignment of visual objects and textual words) and the global alignment (i.e., the image-level and sentence-level alignment) are collaboratively utilized to learn the multi-modal common embedding space in a max-margin learning to rank manner.
Hua et al.  develop a novel deep convolutional architecture for cross-modal retrieval, named Cross-Modal Correlation learning with Deep Convolutional Architecture (CMCDCA). It consists of visual feature representation learning and cross-modal correlation learning with a large margin principle.
3.4 Supervised methods
To obtain a more discriminative common representation, supervised methods exploit label information, which provides a much better separation between classes in the common representation space.
3.4.1 Subspace learning methods
Figure 8 shows the difference between unsupervised subspace learning methods and supervised subspace learning methods. Supervised subspace learning methods enforce different-class samples to be mapped far apart while same-class samples lie as close as possible. To obtain a more discriminative subspace, several works extend CCA to supervised subspace learning.
Sharma et al.  present a supervised extension of CCA, called Generalized Multiview Analysis (GMA). The optimal projection directions , are obtained as
It extends Linear Discriminant Analysis (LDA) and Marginal Fisher Analysis (MFA) to their multiview counterparts, i.e., Generalized Multiview LDA (GMLDA) and Generalized Multiview MFA (GMMFA), and applies them to the cross-media retrieval problem. GMLDA and GMMFA take the semantic category into account and have obtained promising results. Rasiwasia et al. present cluster canonical correlation analysis (cluster-CCA) for joint dimensionality reduction of two modalities of data. The cluster-CCA problem is formulated as
where the covariance matrices , and are defined as
where is the total number of pairwise correspondences. Cluster-CCA is able to learn discriminant low dimensional representations that maximize the correlation between the two modalities of data while segregating the different classes on the learned space. Gong et al.  propose a novel three-view CCA (CCA-3V) framework, which explicitly incorporates the dependence of visual features and text on the underlying semantics. To better understand the CCA-3V, the objective function is given as below:
where , , and represent embedding vectors from visual view, text view and semantic class view, respectively. , , and are the learned projections for each view. Furthermore, a distance function specially adapted to CCA that improves the accuracy of retrieval in the embedded space is adopted. Ranjan et al.  introduce multi-label Canonical Correlation Analysis (ml-CCA), an extension of CCA, for learning shared subspaces by considering the high level semantic information in the form of multi-label annotations. They also present Fast ml-CCA, a computationally efficient version of ml-CCA, which is able to handle large scale datasets. Jing et al.  propose a novel multi-view feature learning approach based on intra-view and inter-view supervised correlation analysis (ISCA). It explores the useful correlation information of samples within each view and between all views.
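For cluster-CCA, the key difference from standard CCA is that covariances are accumulated over all cross-modal pairs that share a class label, rather than over one-to-one pairs. The sketch below computes such matrices under the assumptions that the data are zero-mean, that each matrix is normalized by the total number of within-class pairs, and that within-modality terms are weighted by the size of the matching class in the other modality. The function and variable names are hypothetical.

```python
import numpy as np

def cluster_cca_covariances(X, x_labels, Y, y_labels):
    """Covariances over all same-class (x, y) pairs; returns S_xx, S_yy, S_xy, pair count."""
    d_x, d_y = X.shape[1], Y.shape[1]
    S_xx = np.zeros((d_x, d_x))
    S_yy = np.zeros((d_y, d_y))
    S_xy = np.zeros((d_x, d_y))
    n_pairs = 0
    for c in np.unique(x_labels):
        X_c = X[x_labels == c]
        Y_c = Y[y_labels == c]
        n_x, n_y = len(X_c), len(Y_c)
        n_pairs += n_x * n_y
        # Sum of x y^T over all pairs in class c equals (sum of x)(sum of y)^T.
        S_xy += np.outer(X_c.sum(axis=0), Y_c.sum(axis=0))
        S_xx += n_y * (X_c.T @ X_c)     # each x appears in n_y pairs
        S_yy += n_x * (Y_c.T @ Y_c)     # each y appears in n_x pairs
    return S_xx / n_pairs, S_yy / n_pairs, S_xy / n_pairs, n_pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
x_labels = np.array([0, 0, 0, 1, 1, 1])
Y = rng.normal(size=(5, 2))
y_labels = np.array([0, 0, 1, 1, 1])
S_xx, S_yy, S_xy, n_pairs = cluster_cca_covariances(X, x_labels, Y, y_labels)
print(n_pairs, S_xy.shape)
```

The resulting matrices can be plugged into any standard CCA solver in place of the usual paired covariances.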
Besides supervised CCA-based methods, Lin and Tang  propose a common discriminant feature extraction (CDFE) method to learn a common feature subspace where the difference of within scatter matrix and between scatter matrix is maximized. Mao et al.  introduce a method for cross media retrieval, named parallel field alignment retrieval (PFAR), which integrates a manifold alignment framework from the perspective of vector fields. Zhai et al.  propose a novel feature learning algorithm for cross-modal data, named Joint Representation Learning (JRL). It can explore the correlation and semantic information in a unified optimization framework.
Wang et al. propose a novel regularization framework for the cross-modal matching problem, called LCFS (Learning Coupled Feature Spaces). It unifies coupled linear regressions, the $\ell_{21}$-norm and the trace norm into a generic minimization formulation so that subspace learning and coupled feature selection can be performed simultaneously. Furthermore, they extend this framework to the case of more than two modalities; the extended version is called JFSSL (Joint Feature Selection and Subspace Learning). The main extensions are summarized as follows: 1) they propose a multimodal graph to better model the similarity relationships among different modalities of data, which is demonstrated to outperform the low rank constraint in terms of both computational cost and retrieval performance; 2) accordingly, a new iterative algorithm is proposed to solve the modified objective function, and a proof of its convergence is given.
Inspired by the idea of (semi-)coupled dictionary learning, Zhuang et al.  bring coupled dictionary learning into supervised sparse coding for cross-modal retrieval, which is called Supervised coupled dictionary learning with group structures for multi-modal retrieval (SliM). It can utilize the class information to jointly learn discriminative multi-modal dictionaries as well as mapping functions between different modalities. The objective function is formulated as follows:
where is the coefficient matrix associated to those intra-modality data belonging to the -th class. As shown above, data in the -th modality space can be mapped into the -th modality space by the learned according to , therefore, the computation of cross-modal similarity is achieved.
3.4.2 Topic models
Based on Document Neural Autoregressive Distribution Estimator (DocNADE), Zhen et al. propose a supervised extension of DocNADE, to learn a joint representation from image visual words, annotation words and class label information.
Liao et al. present a nonparametric Bayesian upstream supervised (NPBUS) multi-modal topic model for analyzing multi-modal data. The NPBUS model allows flexible learning of correlation structures of topics within individual modalities and between different modalities. The model becomes more discriminative by incorporating upstream supervised information shared by multi-modal data. Furthermore, it can automatically determine the number of latent topics in each modality.
Wang et al.  propose a supervised multi-modal mutual topic reinforce modeling (MR) approach for cross-media retrieval, which seeks to build a joint cross-modal probabilistic graphical model for discovering the mutually consistent semantic topics via appropriate interactions between model factors (e.g., categories, latent topics and observed multi-modal data).
3.4.3 Deep learning methods
Wang et al.  propose a regularized deep neural network (RE-DNN) for semantic mapping across modalities. They design and implement a 5-layer neural network for mapping visual and textual features into a common semantic space such that the similarity between different modalities can be measured.
Li et al. propose a deep learning method to address the cross-media retrieval problem with multiple labels. The proposed method is supervised, and the correlation between two modalities is built according to their shared ground truth probability vectors. Each of the two networks has two hidden layers and one output layer, and the squared loss is employed as the loss function.
Wei et al.  propose a deep semantic matching method to address the cross-modal retrieval problem for samples which are annotated with one or multiple labels.
Castrejon et al.  present two approaches to regularize cross-modal convolutional networks so that the intermediate representations are aligned across modalities. The focus of this work is to learn cross-modal representations when the modalities are significantly different (e.g., text and natural images) and have category supervision.
Wang et al.  propose a supervised Multi-modal Deep Neural Network (MDNN) method, which consists of a deep convolutional neural network (DCNN) model and a neural language model (NLM) to learn mapping functions for the image modality and the text modality respectively. It exploits the label information, and thus can learn robust mapping functions against noisy input data.
4 Binary Representation Learning
Most existing real-valued cross-modal retrieval techniques rely on brute-force linear search, which is time-consuming for large-scale data. A practical way to speed up similarity search is binary representation learning, which is referred to as hashing. Existing hashing methods can be categorized into uni-modal hashing, multi-view hashing and cross-modal hashing. Representative hashing methods for single-modal data include spectral hashing (SH), self-taught hashing, iterative quantization hashing (ITQ), and so on. These approaches focus on learning hash functions for data objects with homogeneous features. However, in real-world applications, we often extract multiple types of features with different properties from data objects. Accordingly, multi-view hashing methods (e.g., CHMIS and MFH) leverage the information contained in different features to learn more accurate hash codes.
Cross-modal hashing methods aim to discover correlations among different modalities of data to enable cross-modal similarity search. They project different modalities of data into a common Hamming space for fast retrieval (as shown in Figure 9). As with real-valued methods, cross-modal hashing methods can be categorized into: 1) unsupervised methods, 2) pairwise based methods, and 3) supervised methods. To the best of our knowledge, there is little literature on rank based cross-modal hashing. Depending on whether the learnt hash function is linear or nonlinear, cross-modal hashing methods can be further divided into two categories: linear modeling and nonlinear modeling. Linear modeling methods learn linear functions to obtain hash codes, whereas nonlinear modeling methods learn the hash codes in a nonlinear manner.
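To make the idea of a common Hamming space concrete, the following is a minimal sketch, not any particular method from the literature. It assumes one linear hash function per modality of the form sign(Wx); the projection matrices `W_img` and `W_txt` are hand-picked, hypothetical values (a real method would learn them), and the function names are illustrative.

```python
from typing import List, Sequence


def linear_hash(x: Sequence[float], W: Sequence[Sequence[float]]) -> int:
    """Pack the bits [w_k . x >= 0] for each projection row w_k into an integer."""
    code = 0
    for w in W:
        bit = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
        code = (code << 1) | bit
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-packed codes."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code: int, db_codes: Sequence[int]) -> List[int]:
    """Database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))


# Hypothetical projections (assumed, not learned): both map to 3-bit codes.
W_img = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]                  # 2-D image features
W_txt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, -1.0, 0.0]]   # 3-D text features

image_query = [0.9, -0.2]
text_db = [[-1.0, 0.5, 0.3], [0.8, -0.1, 0.0], [0.7, 0.9, -0.2]]

query_code = linear_hash(image_query, W_img)
text_codes = [linear_hash(t, W_txt) for t in text_db]
ranking = rank_by_hamming(query_code, text_codes)
```

Because the codes are compared with an XOR and a bit count, the search cost per database item is independent of the original feature dimensionality, which is the practical motivation for hashing given above.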
4.1 Unsupervised methods
4.1.1 Linear modeling
Kumar et al. propose cross-view hashing (CVH), which extends spectral hashing from the traditional uni-modal setting to the multi-modal scenario. The hash functions map similar objects to similar codes across views, and thus enable cross-view similarity search. The objective function is:
where is the total number of views, and is the Hamming distance between objects and summed over all views:
Actually, CCA can be viewed as a special case of CVH by setting .
Rastegari et al. propose a predictable dual-view hashing (PDH) algorithm for two modalities. They formulate an objective function that maintains the predictability of the binary codes and optimize it with an iterative method based on block coordinate descent.
Ding et al. propose a hashing method referred to as Collective Matrix Factorization Hashing (CMFH). CMFH assumes that all modalities of an instance generate identical hash codes. It learns unified hash codes by collective matrix factorization with a latent factor model from the different modalities of one instance. The objective function is:
where and represent two modalities of data, and represents the latent semantic representations. and are the learned projections.
Zhou et al.  propose a novel Latent Semantic Sparse Hashing (LSSH) to perform cross-modal similarity search. In particular, LSSH uses Sparse Coding to capture the salient structures of images:
and uses Matrix Factorization to learn the latent concepts from text:
Then the learned latent semantic features are mapped to a joint abstraction space.
The overall objective function is
Song et al.  propose a novel inter-media hashing (IMH) model to transform multimodal data into a common Hamming space. This method explores the inter-media consistency and intra-media consistency to derive effective hash codes, based on which hash functions are learnt to efficiently map new data points into the Hamming space. To learn the hash codes and for two modalities respectively, the objective function is defined as follows:
The first term models intra-modal consistency, the second term models inter-modal consistency, and the third term learns the hash functions to generate hash codes for new data.
Zhu et al.  propose a novel hashing method, named linear cross-modal hashing (LCMH), to enable scalable indexing for multimedia search. This method achieves a linear time complexity to the training data size in the training phase. The key idea is to partition the training data of each modality into clusters, and then represent each training data point with its distances to centroids of the clusters for preserving the intra-similarity in each modality. To preserve the inter-similarity among data points across different modalities, they transform the derived data representations into a common binary subspace.
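The first step of the key idea above, representing each point by its distances to cluster centroids, can be sketched as follows. This is only an illustration under stated assumptions: the centroids are taken as given (for example from k-means, which is not shown), the function name `distance_features` is hypothetical, and the later mapping into a common binary subspace is omitted.

```python
import math
from typing import List, Sequence


def distance_features(x: Sequence[float],
                      centroids: Sequence[Sequence[float]]) -> List[float]:
    """Represent a point by its Euclidean distance to each cluster centroid."""
    return [math.dist(x, c) for c in centroids]


# Assumed centroids of one modality (illustrative values only).
centroids = [[0.0, 0.0], [3.0, 4.0]]
training_points = [[0.0, 0.0], [3.0, 0.0]]

# Each point becomes a k-dimensional vector, where k is the number of clusters.
features = [distance_features(p, centroids) for p in training_points]
```

Each training point is compared with k centroids only, so this step costs a number of distance evaluations proportional to the number of training points, in line with the linear training complexity stated above.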
4.1.2 Nonlinear modeling
The hash functions learned by most existing cross-modal hashing methods are linear. To capture more complex structure in multimodal data, nonlinear hash function learning has been studied recently. Based on stacked auto-encoders, Wang et al. propose an effective nonlinear mapping mechanism for multi-modal retrieval, called Multi-modal Stacked Auto-Encoders (MSAE). Mapping functions are learned by optimizing a new objective function, which effectively captures both intra-modal and inter-modal semantic relationships of data from heterogeneous sources. The stacked structure of MSAE enables the method to learn nonlinear projections rather than linear projections.
Wang et al.  propose a Deep Multimodal Hashing with Orthogonal Regularization (DMHOR) to learn accurate and compact multimodal representations. The method can better capture the intra-modality and inter-modality correlations to learn accurate representations. Meanwhile, in order to make the representations compact and reduce redundant information lying in the codes, an orthogonal regularizer is imposed on the learned weighting matrices.
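As an illustration of what an orthogonal regularizer on a weighting matrix can look like, the sketch below computes the squared Frobenius norm of W^T W - I. This particular form is an assumption chosen because it is a common way to penalize non-orthogonal columns; the exact regularizer used in DMHOR may differ, and the function name is hypothetical.

```python
from typing import Sequence


def orthogonality_penalty(W: Sequence[Sequence[float]]) -> float:
    """Squared Frobenius norm of (W^T W - I), with W given as a list of rows."""
    n_cols = len(W[0])
    penalty = 0.0
    for a in range(n_cols):
        for b in range(n_cols):
            # Entry (a, b) of the Gram matrix W^T W.
            gram = sum(row[a] * row[b] for row in W)
            target = 1.0 if a == b else 0.0
            penalty += (gram - target) ** 2
    return penalty


orthonormal = [[1.0, 0.0], [0.0, 1.0]]   # columns are orthonormal
redundant = [[1.0, 1.0], [0.0, 0.0]]     # both columns are identical

penalty_orthonormal = orthogonality_penalty(orthonormal)
penalty_redundant = orthogonality_penalty(redundant)
```

The penalty is zero when the columns are orthonormal and grows when columns overlap, which is how such a term discourages redundant bits in the learned codes.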
4.2 Pairwise based methods
4.2.1 Linear modeling
To the best of our knowledge, Bronstein et al.  propose the first cross-modal hashing method, called cross-modal similarity sensitive hashing (CMSSH). CMSSH learns hash functions for the bimodal case in a standard boosting manner. Specifically, given two modalities of data sets, CMSSH learns two groups of hash functions to ensure that if two data points (with different modalities) are relevant, their corresponding hash codes are similar and otherwise dissimilar. However, CMSSH only preserves the inter-modality correlation but ignores the intra-modality similarity.
Zhen et al.  propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted co-regularization framework. The hash functions for each bit of hash codes are learned by solving DC (difference of convex functions) programs. To learn projections and from multimodal data, the objective function is defined as:
where and are intra-modality loss terms for modalities and , defined as follows:
where is equal to if and otherwise. The inter-modality loss term is defined as follows:
where and is the smoothly clipped inverted squared deviation function in Eq.6 . The learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized.
Hu et al. propose a multi-view hashing algorithm for cross-modal retrieval, called Iterative Multi-View Hashing (IMVH). IMVH aims to learn discriminative hash functions that map multi-view data into a shared Hamming space. It not only preserves the within-view similarity, but also incorporates the between-view correlations into the encoding scheme, mapping similar points close together and pushing dissimilar ones apart.
The cross-modal hashing methods usually assume that the hashed data reside in a common Hamming space. However, this may be inappropriate, especially when the modalities are quite different. To address this problem, Ou et al.  propose a novel Relation-aware Heterogeneous Hashing (RaHH), which provides a general framework for generating hash codes of data entities from multiple heterogeneous domains. Unlike some existing cross-modal hashing methods that map heterogeneous data into a common Hamming space, the RaHH approach constructs a Hamming space for each type of data entities, and learns optimal mappings between them simultaneously. The RaHH framework encodes both homogeneous and heterogeneous relationships between the data entities to learn hash codes . Specifically, the homogeneous loss is
where indicates homogeneous relationship. The heterogeneous loss is
where indicates that modality has relationship with modality , and the logistic loss is
where indicates heterogeneous relationship. To minimize the loss, and need to be close for a large . Based on the similar idea, Wei et al.  present a Heterogeneous Translated Hashing (HTH) method. HTH simultaneously learns hash functions to embed heterogeneous media into different Hamming spaces and translators to align these spaces.
Wu et al.  present a cross-modal hashing approach, called quantized correlation hashing (QCH), which considers the quantization loss over domains and the relation between domains. Unlike previous approaches that separate the optimization of the quantizer and the maximization of domain correlation, this approach simultaneously optimizes both processes. The underlying relation between the domains that describes the same objects is established via maximizing the correlation among the hash codes across domains.
4.2.2 Nonlinear modeling
Zhen et al. propose a probabilistic latent factor model, called multimodal latent binary embedding (MLBE), to learn hash functions for cross-modal retrieval. MLBE employs a generative model to encode the intra-similarity and inter-similarity of data objects across multiple modalities. Based on maximum a posteriori estimation, the binary latent factors are efficiently obtained and then taken as hash codes in MLBE. However, the optimization tends to get trapped in local minima during learning, especially when the code length is large.
Zhai et al. present a new parametric local multimodal hashing (PLMH) method for cross-view similarity search. PLMH learns a set of hash functions to locally adapt to the data structure of each modality. Different local hash functions are learned at different locations of the input spaces; therefore, the overall transformations of all points in each modality are locally linear but globally nonlinear.
To learn nonlinear hash functions, Masci et al.  introduce a novel learning framework for multimodal similarity-preserving hashing based on the coupled siamese neural network architecture. It utilizes similar pairs and dissimilar pairs for both intra- and inter-modality similarity learning. For modalities and , the objective function is defined as:
where two siamese networks are coupled by a cross-modal loss:
and the unimodal loss is
where and denote the similar pairs set and dissimilar pairs set, respectively. is similar to . The full multimodal version making use of inter- and intra-modal training data is called MM-NN in the original paper. Unlike most existing cross-modality similarity learning approaches, the hashing functions are not limited to linear projections. By increasing the number of layers in the network, mappings of the arbitrary complexity can be trained.
Cao et al.  propose Correlation Hashing Network (CHN), a new hybrid architecture for cross-modal hashing. They jointly learn good image and text representations tailored to hash coding and formally control the quantization error.
4.3 Supervised methods
4.3.1 Linear modeling
Zhang et al.  propose a multimodal hashing method, called semantic correlation maximization (SCM), which integrates semantic labels into the hashing learning procedure. This method uses label vectors to get semantic similarity matrix , and tries to reconstruct it through the learned hash codes. Finally, the objective function is:
To learn orthogonal projection, the objective function is reformulated as:
Furthermore, a sequential learning method (SCM-Seq) is proposed to learn the hash functions bit by bit without imposing the orthogonality constraints.
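The idea of building a semantic similarity matrix from label vectors and reconstructing it with hash codes can be sketched as follows. The specific choices here are assumptions for illustration: similarity is taken as the cosine of two label vectors rescaled to [-1, 1], and the reconstruction compares it with the inner product of ±1 codes divided by the code length. The function names and toy data are hypothetical, and no optimization is performed.

```python
import math
from typing import List, Sequence


def label_similarity(labels: Sequence[Sequence[float]]) -> List[List[float]]:
    """S[i][j] = 2 * cos(l_i, l_j) - 1 (assumes every label vector is nonzero)."""
    norms = [math.sqrt(sum(v * v for v in l)) for l in labels]
    n = len(labels)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(labels[i], labels[j]))
            S[i][j] = 2.0 * dot / (norms[i] * norms[j]) - 1.0
    return S


def reconstruction_error(img_codes: Sequence[Sequence[int]],
                         txt_codes: Sequence[Sequence[int]],
                         S: Sequence[Sequence[float]]) -> float:
    """Sum over (i, j) of (h_img_i . h_txt_j / code_length - S[i][j])^2."""
    c = len(img_codes[0])
    err = 0.0
    for i, hx in enumerate(img_codes):
        for j, hy in enumerate(txt_codes):
            inner = sum(a * b for a, b in zip(hx, hy))
            err += (inner / c - S[i][j]) ** 2
    return err


# Three samples with one-hot labels: the first two share a class.
labels = [[1, 0], [1, 0], [0, 1]]
S = label_similarity(labels)

consistent = [[1, 1], [1, 1], [-1, -1]]       # codes agree with the labels
inconsistent = [[1, 1], [-1, -1], [-1, -1]]   # second text code is flipped

error_good = reconstruction_error(consistent, consistent, S)
error_bad = reconstruction_error(consistent, inconsistent, S)
```

Codes that agree with the label structure reproduce the similarity matrix exactly in this toy case, while flipping one code raises the error, which is the signal a supervised hashing objective of this kind would minimize.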
Based on the dictionary learning framework, Wu et al. develop an approach to obtain sparse codesets for data objects across different modalities via joint multi-modal dictionary learning, which is called sparse multi-modal hashing (abbreviated as SMH). In SMH, both intra-modality similarity and inter-modality similarity are first modeled by a hypergraph, and then multi-modal dictionaries are jointly learned by Hypergraph Laplacian sparse coding. Based on the learned dictionaries, the sparse codeset of each data object is obtained and used for multi-modal approximate nearest neighbor retrieval with a sensitive Jaccard metric. Similarly, Yu et al. propose a discriminative coupled dictionary hashing (DCDH) method to capture the underlying semantic information of the multi-modal data. In DCDH, the coupled dictionary for each modality is learned with the aid of class information. As a result, the coupled dictionaries not only preserve the intra-similarity and inter-correlation among multi-modal data, but also contain dictionary atoms that are semantically discriminative (i.e., data from the same category are reconstructed by similar dictionary atoms). To perform fast cross-media retrieval, hash functions are learned to map data from the dictionary space to a low-dimensional Hamming space.
4.3.2 Nonlinear modeling
To capture more complex data structure, Lin et al. propose a two-step supervised hashing method termed SePH (Semantics-Preserving Hashing) for cross-view retrieval. For training, SePH first transforms the given semantic affinities of the training data into a probability distribution and approximates it with to-be-learnt hash codes in Hamming space by minimizing the KL-divergence. Then, in each view, SePH utilizes kernel logistic regression with a sampling strategy to learn the nonlinear projections from features to hash codes. For any unseen instance, the predicted hash codes and their corresponding output probabilities from the observed views are utilized to determine its unified hash code, using a novel probabilistic approach.
Cao et al. propose a novel supervised cross-modal hashing method, Correlation Autoencoder Hashing (CAH), to learn discriminative and compact binary codes based on deep autoencoders. Specifically, CAH jointly maximizes the feature correlation revealed by bimodal data and the semantic correlation conveyed in similarity labels, while embedding them into hash codes by nonlinear deep autoencoders.
Jiang et al.  propose a novel cross-modal hashing method, called deep cross-modal hashing (DCMH), by integrating feature learning and hash-code learning into the same framework. DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch.
5 Multimodal datasets
|Dataset||Modality||Number of Samples||Image Features||Text Features||Number of Categories|
|NUS-WIDE||image/tags||186,577||6 types||tag occurrence feature||81|
|Pascal-VOC||image/tags||9,963||3 types||tag occurrence feature||20|
With the popularity of multimodal data, cross-modal retrieval becomes an urgent and challenging problem. To evaluate the performance of cross-modal retrieval algorithms, researchers collect multimodal data to build up multimodal datasets. Here, we introduce five commonly used datasets, i.e., Wikipedia, INRIA-Websearch, Flickr30K, Pascal VOC, and NUS-WIDE datasets.
The Wiki image-text dataset (http://www.svcl.ucsd.edu/projects/crossmodal/): it is generated from Wikipedia’s “featured articles” and consists of 2866 image-text pairs. In each pair, the text is an article describing people, places or events, and the image is closely related to the content of the article (as shown in Figure 10). Besides, each pair is labeled with one of 10 semantic classes. The 10-dimensional text representation is derived from a latent Dirichlet allocation model. The images are represented by 128-dimensional SIFT descriptor histograms.
The INRIA-Websearch dataset (http://lear.inrialpes.fr/pubs/2010/KAVJ10/): it contains 71,478 image-text pairs, which can be categorized into 353 different concepts that include famous landmarks, actors, films, logos, etc. Each concept comes with a number of images retrieved via Internet search, and each image is marked as either relevant or irrelevant to its query concept. The text modality consists of the text surrounding the images on web pages. This dataset is very challenging because it contains a large number of classes.
The Flickr30K dataset (http://shannon.cs.illinois.edu/DenotationGraph/): it is an extension of Flickr8K and contains 31,783 images collected from different Flickr groups, focusing on events involving people and animals. Each image is associated with five sentences independently written by native English speakers on Mechanical Turk.
The NUS-WIDE dataset: this dataset contains 186,577 labeled images. Each image is associated with user tags, which together can be taken as an image-text pair. To guarantee that each class has abundant training samples, researchers generally select those pairs that belong to one of the K (K=10, or 21) largest classes, with each pair exclusively belonging to one of the selected classes. Six types of low-level features are extracted from these images, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5×5 fixed grid partitions, and 500-D bag of words based on SIFT descriptions. The textual tags are represented with 1000-dimensional tag occurrence feature vectors.
The Pascal VOC dataset (http://www.cs.utexas.edu/~grauman/research/datasets.html): it consists of 5011/4952 (training/testing) image-tag pairs, which can be categorized into 20 different classes. Some examples are shown in Figure 11. Since some images are multi-labeled, researchers usually select images containing only one object, following prior work, resulting in 2808 training and 2841 testing samples. The image features include histograms of bag-of-visual-words, GIST and color, and the text features are 399-dimensional tag occurrence features.
These datasets generally contain two modalities of data, i.e., image and text. Among them, only the Wiki dataset is designed for cross-modal retrieval, but the numbers of samples and categories it contains are small. Another commonly used dataset is the NUS-WIDE dataset, especially in cross-modal hashing, due to its relatively large size. However, the text modality in the NUS-WIDE dataset consists of user tags, which are simple and limited as text descriptions. The Flickr30K dataset is usually used for image-sentence retrieval and does not have category information. So it is desirable to design a more general, large-scale multimodal dataset for future research, which contains more modalities of data and more categories.
6 Experiments
In this section, we compare the performance of different kinds of cross-modal retrieval methods. We first introduce the evaluation metrics and then choose representative cross-modal retrieval methods for evaluation. For the compared methods, we cite the results reported in the relevant literature.
6.1 Evaluation metric
To evaluate the performance of the cross-modal retrieval methods, two cross-modal retrieval tasks are conducted: (1) Image query vs. Text database, (2) Text query vs. Image database. More specifically, in testing phase, we take one modality of data of the testing set as the query set to retrieve another modality of data. The cosine distance is adopted to measure the similarity of features. Given an image (or text) query, the goal of each cross-modal task is to find the nearest neighbors from the text (or image) database.
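The retrieval step described above can be sketched as a nearest-neighbor search in the common space. The sketch ranks database items by decreasing cosine similarity, which gives the same order as increasing cosine distance (one minus the similarity). The feature values and function names are illustrative assumptions, and the features are assumed to be already projected into a common space by some cross-modal method.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float],
             database: Sequence[Sequence[float]],
             top_k: int) -> List[int]:
    """Indices of the top_k database items, most similar first."""
    order = sorted(range(len(database)),
                   key=lambda i: cosine_similarity(query, database[i]),
                   reverse=True)
    return order[:top_k]


# Toy common-space features (illustrative values only).
image_query = [1.0, 0.0]
text_database = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
nearest = retrieve(image_query, text_database, top_k=2)
```

Swapping the roles of the query and the database gives the second task (text query vs. image database) without any change to the code.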
The mean average precision (MAP) is used to evaluate the overall performance of the tested algorithms. To compute MAP, we first evaluate the average precision (AP) of a set of $R$ retrieved documents by $AP = \frac{1}{T}\sum_{r=1}^{R} P(r)\,\delta(r)$, where $T$ is the number of relevant documents in the retrieved set, $P(r)$ denotes the precision of the top $r$ retrieved documents, and $\delta(r) = 1$ if the $r$th retrieved document is relevant (where 'relevant' means belonging to the class of the query) and $\delta(r) = 0$ otherwise. The MAP is then computed by averaging the AP values over all queries in the query set. The larger the MAP, the better the performance.
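The AP and MAP computation described above can be sketched as follows. The input format is an assumption made for illustration: each query is represented by a list of 0/1 relevance flags in ranked order, and the function names are hypothetical.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """AP of one ranked list, where relevance[r-1] is 1 if the r-th item is relevant."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank   # precision of the top `rank` items
    return total / num_relevant


def mean_average_precision(all_relevance: Sequence[Sequence[int]]) -> float:
    """Average of the per-query AP values."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)


# Two toy queries: relevant items at ranks 1 and 3, and at rank 2.
ap_first = average_precision([1, 0, 1, 0])    # (1/1 + 2/3) / 2
ap_second = average_precision([0, 1])         # (1/2) / 1
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

Only ranks that hold a relevant item contribute to the sum, so pushing relevant items toward the top of the list raises AP even when the set of retrieved items is unchanged.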
For image-sentence retrieval, a different metric is commonly used, which is described in Section 6.2.2.
6.2 Comparison of real-valued representation learning methods
For image-text retrieval, the Wiki image-text dataset and the NUS-WIDE dataset are commonly used to evaluate performance. For the real-valued representation learning methods, we choose three popular unsupervised methods (i.e., PLS , BLM [15, 16] and CCA ), two rank based methods (i.e., LSCMR , and Bi-LSCMR ) and eight popular supervised methods (i.e., CDFE , GMLDA , GMMFA , CCA-3V , SliM , MR , LCFS  and JFSSL ).
For image-sentence retrieval, Flickr30K is the commonly used dataset for evaluation. Deep learning methods are generally used for modeling the images and sentences. For the deep learning methods, we choose four recently proposed methods: DeViSE, SDT-RNN, Deep Fragment and End-to-end DCCA.
6.2.1 Results on Wiki and NUS-WIDE
Tables III and V show the MAP scores achieved by PLS, BLM [15, 16], CCA, LSCMR, Bi-LSCMR, CDFE, GMMFA, GMLDA, CCA-3V, SliM, MR, LCFS and JFSSL. For the Wiki and NUS-WIDE datasets, we use the features provided by the authors. For PLS, BLM, CCA, CDFE, GMMFA, GMLDA, CCA-3V, LCFS and JFSSL, we adopt the codes provided by the authors and follow the settings of prior work. For LSCMR, Bi-LSCMR, SliM, and MR, we cite the publicly reported results. For the experiments on the NUS-WIDE dataset, Principal Component Analysis (PCA) is first performed on the original features to remove redundancy for methods PLS, BLM, CCA, CDFE, GMMFA, GMLDA, and CCA-3V.
We observe that the supervised learning methods (CDFE, GMMFA, GMLDA, CCA-3V, SliM, MR, LCFS and JFSSL) perform better than the unsupervised learning methods (PLS, BLM and CCA). The reason is that PLS, BLM and CCA only care about pair-wise closeness in the common subspace, whereas CDFE, GMMFA, GMLDA, CCA-3V, SliM, MR, LCFS and JFSSL utilize class information to obtain much better separation between classes in the common representation space. Hence, it is helpful for cross-modal retrieval to learn a discriminative common representation space. The rank based methods (LSCMR and Bi-LSCMR) achieve comparable results on the Wiki dataset, but perform worse on the NUS-WIDE dataset. One reason is that they use only a small set of data to generate rank lists on the NUS-WIDE dataset, so the amount of training data is insufficient. Another reason is that the features they use (the bag of words (BoW) representation with the TF-IDF weighting scheme for text, and the bag-of-visual words (BoVW) feature for images) are not powerful enough. LCFS and JFSSL perform the best, and one reason is that they perform feature selection on different feature spaces simultaneously.
Results on the Wiki with different types of features: To evaluate the effect of different types of features, we test the cross-modal retrieval performance with different types of features for images and texts on the Wiki dataset. Here we cite previously reported results. Besides the features provided by the Wiki dataset itself, 4096-dimensional CNN (Convolutional Neural Networks) features for images are extracted by Caffe, and 5000-dimensional feature vectors for texts are extracted by using the bag of words representation with the TF-IDF weighting scheme. Table IV shows the MAP scores of GMLDA, CCA-3V, LCFS and JFSSL with different types of features on the Wiki dataset. PCA is performed on the CNN and TF-IDF features for GMLDA and CCA-3V. It can be seen that all methods achieve better results when using the CNN features. This is because CNN features are more powerful for image representation, which has been demonstrated in many fields. Overall, better representation of data leads to better performance.
Results on the Wiki in three-modality case: To evaluate the performance of cross-modal retrieval methods in a three-modality case, we test several methods on the Wiki dataset, citing previously reported results and adopting their settings. To the best of our knowledge, no dataset with three or more modalities is publicly available in the recent literature. The Wiki dataset contains two modalities of data: text and image. To simulate a three-modality setting, 4096-dimensional CNN (Convolutional Neural Networks) features of images are extracted by Caffe as another virtual modality. Here the 128-dim SIFT histogram, the 10-dim LDA feature and the 4096-dimensional CNN features are used as Modality A, Modality B and Modality C, respectively. Table VI shows the MAP comparison on the Wiki dataset in the three-modality case. We can see that JFSSL outperforms the other methods in three cross-modal retrieval tasks. This is mainly because JFSSL is designed for the general multi-modality case and can therefore model the correlations between different modalities more accurately in the three-modality setting. The other methods are designed only for the two-modality case and are not well suited to the three-modality case.
From the above experiments, we draw the following conclusions:
Supervised learning methods generally achieve better results than unsupervised learning methods. The reason is that unsupervised methods only care about pair-wise closeness in the common subspace, but supervised learning methods utilize class information to obtain much better separation between classes in the common representation space. Hence, it is helpful for cross-modal retrieval to learn a discriminative common representation.
For cross-modal retrieval, better features generally lead to better performance. So it is beneficial for cross-modal retrieval to learn powerful representations for various modalities of data.
Algorithms designed for the two-modality case cannot be directly extended to more than two modalities. In that setting, they generally perform worse than algorithms designed for more than two modalities.
6.2.2 Results on Flickr30K
For the Flickr30K dataset, we adopt the evaluation metrics used in prior work for a fair comparison. More specifically, for image-sentence retrieval, we report the median rank (Med r) of the closest ground truth result in the list, as well as R@K (with K = 1, 5, 10), which computes the fraction of times the correct result is found among the top K items. In contrast to R@K, a lower median rank indicates better performance.
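Both metrics can be computed from a single list that stores, for each query, the 1-based rank of the closest ground truth result. The sketch below assumes this input format; the rank values and function names are illustrative, not taken from any reported experiment.

```python
import statistics
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose closest ground truth result is in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    """Median position of the closest ground truth result (lower is better)."""
    return statistics.median(ranks)


# Hypothetical ranks of the closest ground truth result for five queries.
ranks = [1, 3, 7, 12, 2]

r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
r_at_10 = recall_at_k(ranks, 10)
med_r = median_rank(ranks)
```

R@K can only stay the same or increase as K grows, so the values at K = 1, 5 and 10 together describe how quickly correct results accumulate near the top of the list.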
Table VII shows the publicly reported results achieved by DeViSE , SDT-RNN , Deep Fragment  and End-to-end DCCA  on the Flickr30K dataset. End-to-end DCCA and Deep Fragment perform better than other methods. The reason is that Deep Fragment breaks an image into objects and a sentence into dependency tree relations, and maximises the explicit alignment between the image fragments and text fragments. For End-to-end DCCA, the used TF-IDF based text features and CNN based visual features capture global properties of the two modalities respectively. The alignment of the fragments in image and text is implicitly considered by the CCA correlation objective.
6.3 Comparison of binary representation learning methods
For cross-modal hashing methods, the Wiki image-text dataset and the NUS-WIDE dataset are the most commonly used datasets. We follow the settings of prior work and use the Mean Average Precision (MAP) as the evaluation metric.
Tables VIII and IX show the publicly reported MAP scores achieved by unsupervised cross-modal hashing methods (CVH, IMH, LSSH, and CMFH), pairwise based methods (CMSSH and MM-NN), and supervised methods (SCM-Seq and SePH) on the Wiki and NUS-WIDE datasets, respectively. MM-NN and SePH are nonlinear modeling methods, and the others are linear modeling methods. For MM-NN, we show the results reported in the original paper; MAP scores at some code lengths are missing. The code length is set to 16, 32, 64, and 128 bits, respectively.
From the experimental results, we can draw the following observations.
1) IMH performs better than CVH. The reason is that CVH only exploits the inter-modality similarity, but IMH exploits both inter-modality and intra-modality similarity. So it is useful to model the intra-modal similarity in the cross-modal hashing algorithms.
2) CMSSH performs better than CVH. The reason is that CVH only uses the pairwise information, but CMSSH uses both similar and dissimilar pairs. So using more information is helpful for improving performance.
3) Even without supervised information, experiments show that CMFH and LSSH can exploit the latent semantic affinities of the training data well and yield state-of-the-art performance for cross-modal retrieval. So modeling the common representation space in an appropriate manner (like CMFH and LSSH) is very important.
4) SCM-Seq integrates semantic labels into the hashing learning procedure via maximizing semantic correlations, which outperforms several state-of-the-art methods. It can be seen that supervised information is beneficial for learning binary codes for various modalities of data.
5) SePH performs better than other methods, due to its capability to better preserve semantic affinities in Hamming space, as well as the effectiveness of kernel logistic regression to model the non-linear projections from features to hash codes. MM-NN learns non-linear projection by using the siamese neural network architecture. It generally performs better than the linear models. So learning non-linear projection is more appropriate for complex structure of multimodal data.
6) As the length of hash codes increases, the performance of supervised hashing methods (SCM-Seq and SePH) keeps increasing, which reflects the capability of utilizing longer hash codes to better preserve semantic affinities. Meanwhile, performance of some baselines like CVH, IMH, CMSSH decreases, which is also observed in previous work [64, 80, 63]. So it is more difficult to learn longer binary codes without supervised information.
7 Discussion and future trends
Although some promising results have been achieved in the field of cross-modal retrieval, there is still a gap between state-of-the-art methods and user expectation, which indicates that we still need to investigate the cross-modal retrieval problem. In the following, we discuss the future research opportunities for cross-modal retrieval.
1. Collection of multimodal large-scale datasets
Researchers have been designing increasingly sophisticated algorithms to retrieve or summarize multimodal data. However, there is a lack of good data sources for training, testing, evaluating, and comparing the performance of different algorithms. Currently available datasets for cross-modal retrieval research are either too small, such as the Wikipedia dataset, which contains only 2866 documents, or too specific, such as NUS-WIDE, whose text modality consists only of user tags. Hence, it would be tremendously helpful for researchers to have a large-scale multimodal dataset that contains more than two modalities of data and comes with ground truth.
2. Multimodal learning with limited and noisy annotations
Emerging applications on social networks have produced a huge amount of user-created multimodal content. Typical examples include Flickr, YouTube, Facebook, MySpace, WeiBo, and WeiXin. It is well known that multimodal data on the web is loosely organized, and that the annotations of these data are limited and noisy. It is also difficult to label large-scale multimodal data. However, the annotations provide semantic information for the multimodal data, so how to utilize limited and noisy annotations to learn semantic correlations among multimodal data in this scenario needs to be addressed in the future.
3. Scalability on large-scale data
Driven by the wide availability of massive storage devices, mobile devices, and fast networks, more and more multimedia resources are generated and propagated on the web. With the rapid growth of multimodal data, we need to develop effective and efficient algorithms that scale to distributed platforms. We also need to conduct further research on how to organize the relevant data of each modality together effectively and efficiently.
4. Deep learning on multimodal data
Deep learning algorithms show good properties in representation learning. Powerful representations help to reduce the heterogeneity gap and the semantic gap between different modalities of data. Hence, combining appropriate deep learning algorithms to model different types of data for cross-modal retrieval (such as a CNN for modeling images and an RNN (Recurrent Neural Network) for modeling text) is a future trend.
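As a rough illustration of the two-branch idea described above, the sketch below gives each modality its own non-linear encoder, normalizes the outputs into a shared space, and ranks images for each text query by cosine similarity. Everything here is an assumption made for illustration: the encoders are single random tanh layers standing in for a trained CNN and RNN, and the feature dimensions, names, and random inputs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, out_dim):
    """Build a stand-in encoder: one random tanh layer plus L2 normalization."""
    W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def encode(x):
        z = np.tanh(x @ W)  # non-linear projection into the common space
        return z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-length rows

    return encode

# Modality-specific branches mapping into the same 32-dimensional space.
encode_image = make_encoder(in_dim=128, out_dim=32)  # placeholder for a CNN
encode_text = make_encoder(in_dim=50, out_dim=32)    # placeholder for an RNN

images = rng.standard_normal((5, 128))  # 5 hypothetical image feature vectors
texts = rng.standard_normal((3, 50))    # 3 hypothetical text feature vectors

img_emb = encode_image(images)
txt_emb = encode_text(texts)

# Rows are unit length, so the dot product is the cosine similarity.
sims = txt_emb @ img_emb.T
# For each text query, image indices from most to least similar.
ranking = np.argsort(-sims, axis=1)
```

In a real system the two encoders would be trained jointly so that matching image-text pairs end up close in the common space; with random weights the ranking here is meaningless and only shows the data flow.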
5. Finer-level cross-modal semantic correlation modeling
Most published works embed different modalities into a common embedding space. For example, they map images and texts into a common space, where different modalities of data can be compared. However, this is too coarse, because different image fragments correspond to different text fragments, and considering such finer correspondences could capture the image-text semantic relations more accurately. How to obtain the fragments of different modalities and how to find their correspondences are therefore very important questions. Accordingly, new models should be designed for modeling such complex relations.
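To make the contrast with a single global embedding concrete, the following sketch scores an image-sentence pair from fragment embeddings: each text fragment is aligned with its most similar image fragment, and the best similarities are summed. This max-then-sum rule is only one simple choice, assumed here for illustration and not the method of any specific cited work. The function name `fragment_score` and the toy 2-dimensional fragment vectors are hypothetical.

```python
import numpy as np

def fragment_score(img_frags, txt_frags):
    """Score a pair by aligning each text fragment to its best image fragment."""
    # sims[i, j] = similarity of text fragment i and image fragment j.
    sims = txt_frags @ img_frags.T
    alignment = sims.argmax(axis=1)        # best image fragment per text fragment
    score = float(sims.max(axis=1).sum())  # sum of the best similarities
    return score, alignment

# Toy image with two region embeddings (made-up values).
img_frags = np.array([[1.0, 0.0],
                      [0.0, 1.0]])

# Sentence A: each phrase matches a different region of the image.
txt_a = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
# Sentence B: only the first phrase matches a region.
txt_b = np.array([[1.0, 0.0],
                  [-1.0, 0.0]])

score_a, align_a = fragment_score(img_frags, txt_a)
score_b, align_b = fragment_score(img_frags, txt_b)
print(score_a, align_a.tolist())  # 2.0 [1, 0]
print(score_b, align_b.tolist())  # 1.0 [0, 1]
```

The alignment array also exposes which image fragment each text fragment was matched to, which is the finer correspondence that a single global vector per modality cannot provide.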
Cross-modal retrieval provides an effective and powerful approach to multimodal data retrieval, and it is more convenient than traditional single-modality-based techniques. This paper gives an overview of cross-modal retrieval, summarizes a number of representative methods, and classifies them into two main groups: 1) real-valued representation learning, and 2) binary representation learning. We then introduce several commonly used multimodal datasets and empirically evaluate the performance of some representative methods on them. We also discuss future trends in the field of cross-modal retrieval. Although significant work has been carried out in this field, cross-modal retrieval has not been well addressed to date, and much work remains to be done. We expect this paper to help readers understand the state of the art in cross-modal retrieval and to motivate more meaningful work.
-  J. Liu, C. Xu, and H. Lu, “Cross-media retrieval: state-of-the-art and open issues,” International Journal of Multimedia Intelligence and Security, vol. 1, no. 1, pp. 33–52, 2010.
-  C. Xu, D. Tao, and C. Xu, “A survey on multi-view learning,” arXiv preprint arXiv:1304.5634, 2013.
-  A. Karpathy and F. Li, “Deep visual-semantic alignments for generating image descriptions,” in Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
-  X. Chen and C. L. Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in Computer Vision and Pattern Recognition, 2015, pp. 2422–2431.
-  X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, “Guiding the long-short term memory model for image caption generation,” in International Conference on Computer Vision, 2015, pp. 2407–2415.
-  Y. Ushiku, M. Yamaguchi, Y. Mukuta, and T. Harada, “Common subspace for model and similarity: Phrase learning for caption generation from images,” in International Conference on Computer Vision, 2015, pp. 2668–2676.
-  J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn),” 2015.
-  R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv:1411.2539.
-  S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1494–1504, 2015.
-  S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence–video to text,” arXiv:1505.00487.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, T. Darrell, and K. Saenko, “Long-term recurrent convolutional networks for visual recognition and description,” in Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
-  N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in International conference on Multimedia. ACM, 2010, pp. 251–260.
-  R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” in Subspace, latent structure and feature selection. Springer, 2006, pp. 34–51.
-  A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2160–2167.
-  J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, no. 6, pp. 1247–1283, 2000.
-  D. Li, N. Dimitrova, M. Li, and I. K. Sethi, “Multimedia content processing through cross-modal association,” in International Conference on Multimedia. ACM, 2003, pp. 604–611.
-  V. Mahadevan, C. W. Wong, J. C. Pereira, T. Liu, N. Vasconcelos, and L. K. Saul, “Maximum covariance unfolding: Manifold learning for bimodal data,” in Advances in Neural Information Processing Systems, 2011, pp. 918–926.
-  X. Shi and P. Yu, “Dimensionality reduction on heterogeneous feature space,” in International Conference on Data Mining, 2012, pp. 635–644.
-  F. Zhu, L. Shao, and M. Yu, “Cross-modality submodular dictionary learning for information retrieval,” in International Conference on Information and Knowledge Management. ACM, 2014, pp. 1479–1488.
-  D. M. Blei and M. I. Jordan, “Modeling annotated data,” in Conference on Research and Development in Information Retrieval. ACM, 2003, pp. 127–134.
-  D. Putthividhy, H. T. Attias, and S. S. Nagarajan, “Topic regression multi-modal latent dirichlet allocation for image annotation,” in Computer Vision and Pattern Recognition. IEEE, 2010, pp. 3408–3415.
-  Y. Jia, M. Salzmann, and T. Darrell, “Learning cross-modality similarity for multinomial data,” in International Conference on Computer Vision. IEEE, 2011, pp. 2407–2414.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning, 2011, pp. 689–696.
-  N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.
-  G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013, pp. 1247–1255.
-  F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in Computer Vision and Pattern Recognition, 2015, pp. 3441–3450.
-  F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondence autoencoder,” in International Conference on Multimedia. ACM, 2014, pp. 7–16.
-  R. Xu, C. Xiong, W. Chen, and J. J. Corso, “Jointly modeling deep video and compositional text to bridge vision and language in a unified framework,” in AAAI Conference on Artificial Intelligence, 2015, pp. 2346–2352.
-  N. Quadrianto and C. H. Lampert, “Learning multi-view neighborhood preserving projections,” in International Conference on Machine Learning, 2011, pp. 425–432.
-  D. Zhai, H. Chang, S. Shan, X. Chen, and W. Gao, “Multiview metric learning with global consistency and local smoothness,” ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 3, 2012.
-  X. Zhai, Y. Peng, and J. Xiao, “Heterogeneous metric learning with joint graph regularization for cross-media retrieval,” in AAAI Conference on Artificial Intelligence, 2013, pp. 1198–1204.
-  Z. Yuan, J. Sang, Y. Liu, and C. Xu, “Latent feature learning in social media network,” in International Conference on Multimedia. ACM, 2013, pp. 253–262.
-  J. Wang, Y. He, C. Kang, S. Xiang, and C. Pan, “Image-text cross-modal retrieval via modality-specific feature learning,” in International Conference on Multimedia Retrieval, 2015, pp. 347–354.
-  B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger, “Learning to rank with (a lot of) word features,” Information Retrieval, vol. 13, no. 3, pp. 291–314, 2010.
-  D. Grangier and S. Bengio, “A discriminative kernel-based approach to rank images from text queries,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1371–1384, 2008.
-  X. Lu, F. Wu, S. Tang, Z. Zhang, X. He, and Y. Zhuang, “A low rank structural large margin method for cross-modal ranking,” in Conference on Research and Development in Information Retrieval. ACM, 2013, pp. 433–442.
-  F. Wu, X. Lu, Z. Zhang, S. Yan, Y. Rui, and Y. Zhuang, “Cross-media semantic representation via bi-directional learning to rank,” in International Conference on Multimedia. ACM, 2013, pp. 877–886.
-  J. Weston, S. Bengio, and N. Usunier, “Wsabie: Scaling up to large vocabulary image annotation,” in International Joint Conference on Artificial Intelligence, vol. 11, 2011, pp. 2764–2770.
-  T. Yao, T. Mei, and C.-W. Ngo, “Learning query and image similarities with ranking canonical correlation analysis,” in International Conference on Computer Vision, 2015, pp. 28–36.
-  A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., “Devise: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
-  R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 207–218, 2014.
-  A. Karpathy, A. Joulin, and F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in Advances in Neural Information Processing Systems, 2014, pp. 1889–1897.
-  X. Jiang, F. Wu, X. Li, Z. Zhao, W. Lu, S. Tang, and Y. Zhuang, “Deep compositional cross-modal learning to rank via local-global alignment,” in International Conference on Multimedia. ACM, 2015, pp. 69–78.
-  D. Lin and X. Tang, “Inter-modality face recognition,” in European Conference on Computer Vision. Springer, 2006, pp. 13–26.
-  X.-Y. Jing, R.-M. Hu, Y.-P. Zhu, S.-S. Wu, C. Liang, and J.-Y. Yang, “Intra-view and inter-view supervised correlation analysis for multi-view feature learning,” in AAAI Conference on Artificial Intelligence, 2014, pp. 1882–1889.
-  X. Mao, B. Lin, D. Cai, X. He, and J. Pei, “Parallel field alignment for cross media retrieval,” in International Conference on Multimedia. ACM, 2013, pp. 897–906.
-  X. Zhai, Y. Peng, and J. Xiao, “Learning cross-media joint representation with sparse and semisupervised regularization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 965–978, 2014.
-  Y. T. Zhuang, Y. F. Wang, F. Wu, Y. Zhang, and W. M. Lu, “Supervised coupled dictionary learning with group structures for multi-modal retrieval,” in AAAI Conference on Artificial Intelligence, 2013.
-  N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, “Cluster canonical correlation analysis,” in International Conference on Artificial Intelligence and Statistics, 2014, pp. 823–831.
-  Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
-  K. Wang, R. He, W. Wang, L. Wang, and T. Tan, “Learning coupled feature spaces for cross-modal matching,” in International Conference on Computer Vision. IEEE, 2013, pp. 2088–2095.
-  K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for cross-modal retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, preprint.
-  Y. Zheng, Y.-J. Zhang, and H. Larochelle, “Topic modeling of multimodal data: an autoregressive approach,” in Computer Vision and Pattern Recognition. IEEE, 2014, pp. 1370–1377.
-  R. Liao, J. Zhu, and Z. Qin, “Nonparametric bayesian upstream supervised multi-modal topic models,” in International Conference on Web Search and Data Mining. ACM, 2014, pp. 493–502.
-  Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, “Multi-modal mutual topic reinforce modeling for cross-media retrieval,” in International Conference on Multimedia. ACM, 2014, pp. 307–316.
-  C. Wang, H. Yang, and C. Meinel, “Deep semantic mapping for cross-modal retrieval,” in International Conference on Tools with Artificial Intelligence, 2015, pp. 234–241.
-  Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Cross-modal retrieval with cnn visual features: A new baseline,” IEEE Transactions on Cybernetics, preprint.
-  W. Wang, X. Yang, B. C. Ooi, D. Zhang, and Y. Zhuang, “Effective deep learning-based multi-modal retrieval,” International Journal on Very Large Data Bases, vol. 25, no. 1, pp. 79–101, 2016.
-  L. Sun, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in International Conference on Machine learning. ACM, 2008, pp. 1024–1031.
-  J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, “Inter-media hashing for large-scale retrieval from heterogeneous data sources,” in International Conference on Management of Data. ACM, 2013, pp. 785–796.
-  M. Rastegari, J. Choi, S. Fakhraei, H. Daumé III, and L. Davis, “Predictable dual-view hashing,” in International Conference on Machine Learning, 2013, pp. 1328–1336.
-  X. Zhu, Z. Huang, H. T. Shen, and X. Zhao, “Linear cross-modal hashing for efficient multimedia search,” in International Conference on Multimedia. ACM, 2013, pp. 143–152.
-  G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in Computer Vision and Pattern Recognition. IEEE, 2014, pp. 2083–2090.
-  J. Zhou, G. Ding, and Y. Guo, “Latent semantic sparse hashing for cross-modal similarity search,” in Conference on Research & Development in Information Retrieval. ACM, 2014, pp. 415–424.
-  W. Wang, B. C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, “Effective multi-modal retrieval based on stacked auto-encoders,” in International Conference on Very Large Data Bases, 2014, pp. 649–660.
-  D. Wang, P. Cui, M. Ou, and W. Zhu, “Learning compact hash codes for multimodal representations using orthogonal deep structure,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1404–1416, 2015.
-  R. He, W.-S. Zheng, and B.-G. Hu, “Maximum correntropy criterion for robust face recognition,” Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1561–1576, 2011.
-  Y. Zhen and D.-Y. Yeung, “Co-regularized hashing for multimodal data,” in Advances in Neural Information Processing Systems, 2012, pp. 1376–1384.
-  Y. Hu, Z. Jin, H. Ren, D. Cai, and X. He, “Iterative multi-view hashing for cross media indexing,” in International Conference on Multimedia. ACM, 2014, pp. 527–536.
-  B. Wu, Q. Yang, W. Zheng, Y. Wang, and J. Wang, “Quantized correlation hashing for fast cross-modal search,” in International Joint Conference on Artificial Intelligence, 2015, pp. 3946–3952.
-  M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang, “Comparing apples to oranges: a scalable solution with heterogeneous hashing,” in International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 230–238.
-  Y. Wei, Y. Song, Y. Zhen, B. Liu, and Q. Yang, “Scalable heterogeneous translated hashing,” in International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 791–800.
-  Y. Zhen and D.-Y. Yeung, “A probabilistic model for multimodal hash function learning,” in International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 940–948.
-  D. Zhai, H. Chang, Y. Zhen, X. Liu, X. Chen, and W. Gao, “Parametric local multimodal hashing for cross-view similarity search,” in International Joint Conference on Artificial Intelligence. AAAI Press, 2013, pp. 2754–2760.
-  J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, “Multimodal similarity-preserving hashing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 824–830, 2014.
-  Y. Cao, M. Long, and J. Wang, “Correlation hashing network for efficient cross-modal retrieval,” CoRR, vol. abs/1602.06697, 2016. [Online]. Available: http://arxiv.org/abs/1602.06697
-  F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multi-modal hashing,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 427–439, 2014.
-  Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast cross-media retrieval,” in Conference on Research & Development in Information Retrieval. ACM, 2014, pp. 395–404.
-  D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization,” in AAAI Conference on Artificial Intelligence, 2014, pp. 2177–2183.
-  Z. Lin, G. Ding, M. Hu, and J. Wang, “Semantics-preserving hashing for cross-view retrieval,” in Computer Vision and Pattern Recognition, 2015, pp. 3864–3872.
-  Y. Cao, M. Long, J. Wang, and H. Zhu, “Correlation autoencoder hashing for supervised cross-modal search,” in International Conference on Multimedia Retrieval, 2016.
-  Q. Jiang and W. Li, “Deep cross-modal hashing,” CoRR, vol. abs/1602.02255, 2016. [Online]. Available: http://arxiv.org/abs/1602.02255
-  Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
-  J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in cross-modal multimedia retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 521–535, 2014.
-  R. Udupa and M. Khapra, “Improving the multilingual user experience of wikipedia using cross-language name search,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 492–500.
-  A. Li, S. Shan, X. Chen, and W. Gao, “Face recognition based on non-corresponding region matching,” in International Conference on Computer Vision. IEEE, 2011, pp. 1060–1067.
-  A. Sharma and D. W. Jacobs, “Bypassing synthesis: Pls for face recognition with pose, low-resolution and sketch,” in Computer Vision and Pattern Recognition. IEEE, 2011, pp. 593–600.
-  Y. Chen, L. Wang, W. Wang, and Z. Zhang, “Continuum regression for cross-modal multimedia retrieval,” in International Conference on Image Processing. IEEE, 2012, pp. 1949–1952.
-  X. Wang, Y. Liu, D. Wang, and F. Wu, “Cross-media topic mining on wikipedia,” in International Conference on Multimedia. ACM, 2013, pp. 689–692.
-  D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
-  W. Wu, J. Xu, and H. Li, “Learning similarity function between objects in heterogeneous spaces,” Microsoft Research Technique Report, 2010.
-  A. Mignon and F. Jurie, “CMML: a new metric learning approach for cross modal matching,” in Asian Conference on Computer Vision, 2012, pp. 14–pages.
-  Y. Hua, H. Tian, A. Cai, and P. Shi, “Cross-modal correlation learning with deep convolutional architecture,” in Visual Communications and Image Processing, 2015, pp. 1–4.
-  V. Ranjan, N. Rasiwasia, and C. V. Jawahar, “Multi-label cross-modal retrieval,” in International Conference on Computer Vision, 2015, pp. 4094–4102.
-  Z. Li, W. Lu, E. Bao, and W. Xing, “Learning a semantic space by deep network for cross-media retrieval,” in International Conference on Distributed Multimedia Systems, 2015, pp. 199–203.
-  L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, “Learning aligned cross-modal representations from weakly aligned data,” in Computer Vision and Pattern Recognition, 2016.
-  Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in Neural Information Processing Systems, 2009, pp. 1753–1760.
-  D. Zhang, J. Wang, D. Cai, and J. Lu, “Self-taught hashing for fast similarity search,” in Conference on Research and Development in Information Retrieval. ACM, 2010, pp. 18–25.
-  Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in Computer Vision and Pattern Recognition. IEEE, 2011, pp. 817–824.
-  D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in Conference on Research and Development in Information Retrieval. ACM, 2011, pp. 225–234.
-  J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong, “Multiple feature hashing for real-time large scale near-duplicate video retrieval,” in International Conference on Multimedia. ACM, 2011, pp. 423–432.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
-  J. Krapac, M. Allan, J. Verbeek, and F. Jurie, “Improving web image search results using query-relative classifiers,” in Computer Vision and Pattern Recognition, 2010, pp. 1094–1101.
-  P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
-  M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013.
-  T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: A real-world web image database from national university of singapore,” in International Conference on Image and Video Retrieval. ACM, 2009, p. 48.
-  S. J. Hwang and K. Grauman, “Reading between the lines: Object localization using implicit cues from image tags,” Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1145–1158, 2012.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems, 2012, pp. 1106–1114.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale visual recognition,” in International Conference on Learning Representations, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition, 2015, pp. 1–9.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li, “Large-scale video classification with convolutional neural networks,” in Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.