A New Evaluation Protocol and Benchmarking Results for Extendable Cross-media Retrieval

by   Ruoyu Liu, et al.

This paper proposes a new evaluation protocol for cross-media retrieval which better fits real-world applications. Both image-text and text-image retrieval modes are considered. Traditionally, class labels in the training and testing sets are identical. That is, it is usually assumed that the query falls into some pre-defined classes. However, in practice, the content of a query image/text may vary extensively, and the retrieval system does not necessarily know in advance the class label of a query. Considering the inconsistency between real-world applications and laboratory assumptions, we think that the existing protocol that works under identical train/test classes can be modified and improved. This work is dedicated to addressing this problem by considering the protocol under an extendable scenario, i.e., the training and testing classes do not overlap. We provide extensive benchmarking results obtained by the existing protocol and the proposed new protocol on several commonly used datasets. We demonstrate a noticeable performance drop when the testing classes are unseen during training. Additionally, a trivial solution, i.e., directly using the predicted class label for cross-media retrieval, is tested. We show that the trivial solution is very competitive in traditional non-extendable retrieval, but becomes less so under the new settings. The train/test split, evaluation code, and benchmarking results are publicly available on our website.



I Introduction and Related Work

This paper focuses on the cross-media retrieval between images and texts. In this task, given a query text/image, we aim to retrieve the relevant images/texts from the gallery (database). Since the two modalities are located in different feature spaces, the challenge of cross-media retrieval consists in the similarity measurement between the heterogeneous data. An effective solution learns a unified representation for different modalities so that the common distances can be employed for similarity measurement.

Related Work. The research of multimedia retrieval has two diverse branches: single-media retrieval and cross-media retrieval. The former branch, such as image retrieval [1, 2, 3] and video retrieval [4], uses the homogeneous queries to perform image-to-image or video-to-video retrieval. However, the query and gallery in cross-media retrieval are heterogeneous and their similarities cannot be directly measured.

The concept of cross-media retrieval was first defined by Wu et al. [5], who also proposed the earliest cross-media model: the multimedia document (MMD). The media objects of different modalities that carry the same semantics (like the image, text and audio in the same web page) are collected together as an MMD. Then, the distance between two MMDs is calculated from the distances of the media objects in each modality. After [5], Yang et al. proposed a series of methods to tackle cross-media retrieval by using MMD [6, 7, 8, 9]. The shortcoming of MMD is that it is not very flexible, because it handles the set of component media objects as a whole.

The main body of cross-media methods is based on learning a common representation. The milestone work is [10], proposed by Rasiwasia et al., which employs canonical correlation analysis (CCA) [11] and multi-class logistic regression to learn the descriptors for the heterogeneous data. Inspired by this work, many approaches have been proposed to learn common representations, which can be classified into two groups: real-valued and binary representations.

The real-valued representations map the heterogeneous data into a common continuous feature space. Shallow methods learn two linear functions [10, 12, 13] or simple nonlinear functions [14, 15] to maximize the correlations between the pairwise data or further improve feature discrimination by using the category labels. With the introduction of deep learning techniques, deep networks have also been employed in cross-media retrieval to learn more complex projections. The feasible networks include fully connected networks [16, 17], convolutional neural networks [18, 19, 20], recursive neural networks [21], recurrent neural networks [22], auto-encoders [23, 24] and adversarial networks [25]. These deep methods have shown superior retrieval accuracy compared to the shallow methods.

The binary representations, on the other hand, map the heterogeneous data into a discrete space, where each entry of a feature takes one of two values. These methods are also called cross-media hashing, and they focus on large-scale retrieval. In this scenario, these methods use the Hamming distance to accelerate the search process. Most of the binary methods are shallow models which relax the problem into a real-valued case [26, 27, 28, 29, 30, 31, 32] or optimize to learn the hash codes directly [33, 34, 35]. Some recent works have also employed deep learning to learn better hash codes [36, 37, 38].

There are some methods using deep networks to tackle cross-media retrieval from other perspectives. For example, [39, 40, 41, 42] train the networks to learn the similarities between the heterogeneous data directly, which can be viewed as extensions of metric learning. Several image caption methods also show their capability of performing cross-media retrieval [43, 44], since they use a part of their networks to embed images and texts into a common feature space.

Currently, a major line of cross-media research relies on the class labels in training [14, 19, 32]. For example, Wei et al. [19] fine-tune a convolutional neural network (CNN) on the text and image domains using the class supervision and extract the softmax layer for retrieval. Gong et al. [14] introduce supervision by treating the class labels as the third domain. On the performance evaluation side, classic datasets, e.g., Wikipedia [10] and NUS-WIDE [45], define their ground-truths based on the category labels. They aim to search for texts/images belonging to the same class as the query image/text. Usually, the heterogeneous data is treated as a true match if it has at least one category label in common with the query. This task is called class(-level) retrieval in this paper, and this type of ground-truth is called the class ground-truth.
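The class ground-truth rule above (a gallery item is a true match if it shares at least one category label with the query) can be sketched as follows. This is only a minimal illustration: the function names and the toy labels are hypothetical and do not come from any of the cited datasets.

```python
def is_true_match(query_labels, item_labels):
    """Class ground-truth: a match needs at least one shared category label."""
    return len(set(query_labels) & set(item_labels)) > 0


def relevance_flags(query_labels, gallery_labels):
    """Return one True/False flag per gallery item, in gallery order."""
    return [is_true_match(query_labels, labels) for labels in gallery_labels]


# Hypothetical multi-label example (labels invented for illustration).
query = {"sky", "water"}
gallery = [{"water", "boat"}, {"person"}, {"sky"}, set()]
flags = relevance_flags(query, gallery)
```

For single-label datasets the same rule reduces to an equality test between the two labels.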

To evaluate the performance of cross-media retrieval, many datasets have been built or employed. Classical datasets share the following characteristics: they generally have two media types, i.e., images and texts; their data are annotated with several category labels; and they perform class retrieval using the class ground-truths. Wikipedia [10] is the first such dataset, and other datasets include NUS-WIDE [45], MIRFlickr25K [46], Pascal VOC (2007) [47], Web Queries [48] and Pascal Sentence [49]. Most of the recent works use the datasets employed in the field of image captioning [40, 41, 42, 43, 44, 50]. Such (image-sentence) datasets include Flickr8K [51], Flickr30K [52], MSCOCO [53] and SBU [54]. These datasets define the true matches of a query as the heterogeneous data describing it. In spite of the popularity of these new datasets, it is still meaningful to re-evaluate some popular methods on the traditional datasets. One important reason is that their texts cover more types (article, tag, surrounding words and sentence). A recent specialized cross-media dataset, XMedia [55], consists of five media types in total and still uses the class ground-truths.

In this paper, we propose a new evaluation protocol to re-evaluate the existing cross-media methods on the extendability to unseen classes. This protocol is designed for extendable cross-media retrieval, and it can reflect the more realistic performance compared to the existing protocol.

II The Evaluation Protocol

In the cross-media retrieval community, the current mainstream methods employ the same set of classes in both the training and testing steps, e.g., the 10 classes in Wikipedia, or the 10 most frequent classes in NUS-WIDE. This train/test protocol assumes that a query always belongs to one of the pre-defined classes. Yet, this assumption does not always hold in practice, because the query text/image may exhibit various content and it is challenging for the training process to take into account all the variety in query types. Some examples are illustrated in Fig. 1. Moreover, in the field of learning to hash, Sablayrolles et al. [56] suggest that there exists a trivial solution to this problem: a classifier is trained for each class, class predictions are made for the query and the gallery items, and the retrieval process is equivalent to finding the relevant images with the same class prediction as the query; so performing an explicit hashing-based retrieval process does not seem necessary. It is therefore not well-grounded for an evaluation protocol to assume that the train/test data have the same set of classes.

Fig. 1: Some examples of (a) image and (b) text queries exhibiting various content. These images and texts are hard to be taken into consideration when training cross-media models.

Our main point is that cross-media retrieval can in effect be viewed as an extendable problem, in which the query class is “unseen” during training. On the one hand, this setting closely matches reality. On the other hand, the extendable assumption has also been used by default in several other retrieval problems, such as generic instance retrieval [3, 57], person re-identification [58, 59] and vehicle re-identification [60, 61]. A model is usually learned on the training set and tested on the unseen query and gallery. Under this scenario, the effectiveness of previous learning methods should be re-evaluated; it is possible that a method that works well on the non-extendable problem exhibits low generalization ability under the extendable setting. More insights need to be gained.

Another problem associated with the existing protocol is that the same data is used for training and gallery. This is potentially problematic because in practice, the gallery may be very large, and it is infeasible to label all the gallery data. As a consequence, for practical evaluation, our second point is that it would be best to separate the training set and the testing set (composed of the gallery and query).

Considering the above two points, i.e., the extendable nature and the separation of train/test splits, we propose a new evaluation protocol on the currently available datasets.

Dataset Splitting. Motivated by [56], we propose a new train/test splitting for cross-media retrieval. It separates the training and testing data so that the training and testing sets each has 50% of the categories, i.e., there is no class overlap between them. Models are learned on the training classes only and are directly tested on the testing set (gallery+query). The new splitting is in accordance with practical usage. Under this circumstance, the trivial solution does not produce competitive performance (to be shown in Section III).

Specifically, each dataset is separated into two parts: a training set consisting of the data from half of the categories, and a testing set consisting of the other half categories. Each set is further separated into two subsets: a database subset and a query subset. Using the four subsets, we evaluate cross-media retrieval on two tasks as illustrated in Fig. 2:

(1) Non-extendable (non-XTD) retrieval: In this task, we use the database subset of the training set to train the methods. Then, each sample in the query subset of the training set is used as a query to search for its relevant heterogeneous data in the database subset of the training set. The train/test classes are identical, so this task evaluates the performance of traditional non-XTD cross-media retrieval.

(2) Extendable (XTD) retrieval: In this task, we still use the database subset of the training set for training. But different from non-XTD retrieval, we use the samples of the query subset of the testing set as the queries to search for their relevant heterogeneous data in the database subset of the testing set. There is no class overlap between the training and testing data, so this task evaluates the extendability to new datasets.

To balance the influences of the different class splits, we shuffle the categories and use folds to define such class splits. The performances are averaged over the folds to get the final metric scores. In our experiments, we set .
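The splitting procedure described above can be sketched in a few lines. This is a hedged illustration rather than the released split code: it assumes single-label samples, and the names `split_classes`, `build_subsets` and the `query_fraction` parameter are hypothetical (the actual database/query proportions differ per dataset).

```python
import random


def split_classes(classes, seed):
    """Shuffle the categories and give half to training, half to testing."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def build_subsets(samples, train_classes, query_fraction, seed):
    """samples: list of (sample_id, class_label) pairs.

    Returns the four subsets: database/query for the training classes
    and database/query for the testing classes.
    """
    rng = random.Random(seed)
    subsets = {"train_db": [], "train_query": [], "test_db": [], "test_query": []}
    for sample_id, label in samples:
        side = "train" if label in train_classes else "test"
        role = "query" if rng.random() < query_fraction else "db"
        subsets[side + "_" + role].append(sample_id)
    return subsets


# Toy data: 200 samples spread over 10 classes (invented for illustration).
samples = [(i, i % 10) for i in range(200)]
train_classes, test_classes = split_classes(range(10), seed=0)
subsets = build_subsets(samples, train_classes, query_fraction=0.25, seed=0)

# Non-XTD: train on train_db, query train_query against train_db.
# XTD:     train on train_db, query test_query against test_db.
```

Repeating the procedure with different seeds would give one class split per fold, over which the metric scores can be averaged.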

Evaluation metrics. Two evaluation metrics are employed: the CMC curve and MAP.

We still use mean average precision (MAP) as a metric of performance. MAP is the mean value of the average precision (AP) scores over all queries, which can be formulated as follows:

$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q), \tag{1}$$

where $Q$ denotes the set of queries.

AP computes the average value of the precision as the recall varies, which is the area under the precision-recall curve. In practice, the integral is replaced with a finite sum over all the positions in the ranked sequence of the retrieved documents. Given a query $q$, we define an indicator $\delta_q(k) = 1$ if the $k$-th retrieved document is positive, and $\delta_q(k) = 0$ otherwise. The precision at rank $k$ is given by $P_q(k) = \frac{1}{k} \sum_{j=1}^{k} \delta_q(j)$. Denote $N_q$ as the total number of positive documents in the database; then the average precision at $R$ is:

$$\mathrm{AP}(q, R) = \frac{1}{N_q} \sum_{k=1}^{R} P_q(k)\, \delta_q(k). \tag{2}$$

Generally, we set $R$ to the volume of the database, so that the second parameter in Eq. 2 can be omitted and AP is written as $\mathrm{AP}(q)$ in Eq. 1.
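As a rough sketch of the finite-sum computation described above, AP can be obtained from the ranked list of positive/negative flags of one query, and MAP by averaging over queries. The function names and the toy ranked lists are assumptions made for illustration; the cut-off is taken as the full length of each ranked list.

```python
def average_precision(relevance):
    """relevance: 0/1 flags of the retrieved documents, in ranked order.

    Sums the precision at every rank that holds a positive document and
    divides by the number of positives in the list.
    """
    num_positives = sum(relevance)
    if num_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, flag in enumerate(relevance, start=1):
        if flag:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / num_positives


def mean_average_precision(ranked_lists):
    """Mean of the AP scores over all queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Two toy queries: AP = (1/1 + 2/3) / 2 and AP = (1/2) / 1.
ap_first = average_precision([1, 0, 1, 0])
ap_second = average_precision([0, 1])
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

Returning 0 for a query without any positive document is a convention chosen here; such queries could equally be skipped.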

MAP is a common retrieval metric, which reflects the overall performance of the methods. However, it gives little insight into the details of the retrieval results. To overcome this shortcoming, we use an additional metric: the cumulative matching characteristics (CMC) curve.

The CMC curve is a common evaluation metric used in person re-identification [59], which represents the probability that a positive result can be found within the top $k$ ranks of the returned list. No matter how many ground-truth matches there are in the database, only the first match is counted in the calculation. Compared to MAP, the CMC curve is a fine-grained metric, which shows the variation of accuracy with the rank. The CMC curve is a good complementary metric to MAP.
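A minimal sketch of the CMC computation follows, assuming each query comes with its ranked list of positive/negative flags. The function name `cmc_curve` and the toy lists are hypothetical; only the first match of every query contributes, as stated above.

```python
def cmc_curve(ranked_lists, max_rank):
    """Return a list whose (k-1)-th entry is the fraction of queries
    whose first positive result appears within the top k ranks."""
    first_match_counts = [0] * max_rank
    for relevance in ranked_lists:
        for rank, flag in enumerate(relevance[:max_rank], start=1):
            if flag:
                first_match_counts[rank - 1] += 1
                break  # only the first match is counted
    curve = []
    running = 0
    for count in first_match_counts:
        running += count
        curve.append(running / len(ranked_lists))
    return curve


# Four toy queries whose first matches are at ranks 1, 3, 2 and nowhere.
toy_lists = [[1, 0, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
curve = cmc_curve(toy_lists, max_rank=3)
```

The first entry of the curve is the rank-1 accuracy discussed in the experiments.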

Fig. 2: Train/test splitting of two retrieval tasks. (a) Non-extendable retrieval: the training and testing data (query and database) are from the same classes, the database subset is used for training. (b) Extendable retrieval: it uses the same training data in traditional retrieval, and uses the data of the other classes as the testing data (query and database). Classes are best viewed in color.

| Dataset | Media Types | Capacity | # Categories |
|---|---|---|---|
| Wikipedia | image/article | 2,866 | 10 |
| Pascal Sentence | image/sentence | 1,000 | 20 |
| NUS-WIDE | image/tags | 67,994 | 10 |

TABLE I: Summary of the Benchmark Datasets

III Experimental Results

In this section, we evaluate some methods on three benchmark datasets under the existing protocol and the new protocol. These results can serve as baselines for future work.

III-A Datasets

In our experiments, we employ three datasets: Wikipedia, Pascal Sentence and NUS-WIDE. We summarize the three datasets in Table I and provide dataset details as follows:

Wikipedia [10] contains 2,866 image-article pairs, and each pair is labeled with one of 10 categories. It has a training set of 2,173 pairs and a testing set of the remaining 693 pairs. In our experiments, we separate the dataset into the four subsets partially based on its original separation. That is, we build the database subsets from the original training set, and build the query subsets from the original testing set.

Pascal Sentence [49] is a subset of Pascal VOC [47], which contains 1,000 images of 20 categories (50 images for each category). Each image is described by 5 sentences. In our experiments, we treat the 5 sentences of an image together as its corresponding text. Besides, we use a portion of the data of each category to construct the database subsets and use the remaining data to construct the query subsets.

NUS-WIDE [45] is a dataset that contains 269,648 images with their associated tags. Each image belongs to at least one of the 81 categories. In our experiments, we retain the images belonging to the 10 most frequent categories. Finally, we obtain a set of 67,994 images. We use a portion of this set to construct the database subsets and the remaining data to construct the query subsets.

Fig. 3: Evaluation results of real-valued representations on Wikipedia. CMC curves are shown. MAP is shown before the name of each method. (a) and (c) represent the non-extendable (non-XTD) retrieval results, while (b) and (d) are the extendable (XTD) retrieval results. (a) and (b) denote image-to-text retrieval, while (c) and (d) denote text-to-image retrieval.

Fig. 4: Evaluation results of real-valued representations on Pascal Sentence. CMC curves are shown. MAP is shown before the name of each method. Retrieval modes in (a)(b)(c)(d) are the same as in Fig. 3.

Fig. 5: Evaluation results of real-valued representations on NUS-WIDE. The MAP scores are shown before the names of the methods. Retrieval modes in (a)(b)(c)(d) are the same as in Fig. 3 and Fig. 4.

III-B Image and Text Features

For images, we extract the output of a 4096-D layer of CaffeNet [62], which is a simplified version of AlexNet [63] and is pre-trained on the 1,000-class ImageNet [64] dataset, as the image descriptors. For texts, we use the word2vec model pre-trained on the Google News dataset [65] to represent text documents. We cluster all the word vectors into a dictionary and then encode each text document into a word-frequency vector (1024-D), i.e., the Bag-of-Words (BoW) feature.
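The BoW encoding step can be illustrated as follows: each word vector of a document is assigned to its nearest dictionary entry and the assignments are counted. This is a sketch under assumptions, since the text does not name the clustering algorithm or distance; here the dictionary (`codebook`) is taken as given, squared Euclidean distance is used, and the 2-D toy vectors stand in for real word2vec vectors.

```python
def nearest_codeword(vector, codebook):
    """Index of the dictionary entry closest to the word vector
    (squared Euclidean distance, an assumption for this sketch)."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def encode_document(word_vectors, codebook):
    """Word-frequency vector: one counter per dictionary entry."""
    histogram = [0] * len(codebook)
    for vector in word_vectors:
        histogram[nearest_codeword(vector, codebook)] += 1
    return histogram


# Toy dictionary with 3 entries and a document of 4 word vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
document = [(0.1, 0.0), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
bow = encode_document(document, codebook)
```

With a 1024-entry dictionary the same routine would yield the 1024-D feature mentioned above.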

III-C Baselines

We compare the following real-valued and binary representations. The real-valued representations include:

  • CM: correlation matching [10] employs canonical correlation analysis (CCA) [11] to learn the uniform descriptors.

  • SM: semantic matching [10] learns two linear classifiers that map data into the semantic concept probabilities.

  • SCM: semantic correlation matching [10] is the combination of CM and SM.

  • PLS: partial least squares [66] is a method that learns a common subspace in which corresponding data are highly correlated.

  • BLM: bilinear model [67] is another common subspace learning method.

  • GMMFA: generalized multiview analysis (GMA) [12] + marginal Fisher analysis (MFA) [68] is a supervised extension of CCA and is used to extend MFA.

  • GMLDA: GMA [12] + linear discriminant analysis (LDA) [69] is the same as GMMFA but is used to extend LDA.

  • CCA3V: three-view canonical correlation analysis [12] incorporates a third view capturing high-level semantics, which comes from supervised ground-truth labels or from unsupervised clustering of the tags.

  • LCFS: learning coupled feature spaces [13] learns two projection matrices to map multimodal data into a common feature space and imposes $\ell_{2,1}$-norm penalties to select relevant and discriminative features.

  • deep-SM: deep semantic matching [19] uses deep networks to replace the linear classifiers in SM.

Fig. 6: Sample retrieval results of (a) image query 658 (image-to-text retrieval) and (b) text query 9 (text-to-image retrieval) on Wikipedia. The query is on the left. The first two rows correspond to the non-extendable retrieval of deep-SM. The third and fourth rows correspond to the extendable retrieval of deep-SM. The fifth and sixth rows correspond to the non-extendable retrieval of deep-TS. The last two rows correspond to the extendable retrieval of deep-TS. The performance drop of deep-TS is much larger from non-extendable retrieval to extendable retrieval and it loses competitive accuracy under the extendable setting.
Fig. 7: Evaluation results of binary representations on Wikipedia. CMC curves are shown. MAP is shown before the name of each method. (a)(c)(e)(g)(i)(k) represent the non-extendable (non-XTD) retrieval results, while (b)(d)(f)(h)(j)(l) are the extendable (XTD) retrieval results. (a)(b)(e)(f)(i)(j) denote image-to-text retrieval, while (c)(d)(g)(h)(k)(l) denote text-to-image retrieval. (a)(b)(c)(d) are the results of 8-bit hash codes, (e)(f)(g)(h) are the results of 16-bit hash codes and (i)(j)(k)(l) are the results of 32-bit hash codes.

Fig. 8: Evaluation results of binary representations on Pascal Sentence. CMC curves are shown. MAP is shown before the name of each method. Retrieval modes in (a) to (l) are the same as in Fig. 7.

Fig. 9: Evaluation results of binary representations on NUS-WIDE. CMC curves are shown. MAP is shown before the name of each method. Retrieval modes in (a) to (l) are the same as in Fig. 7 and Fig. 8.

The binary methods include:

  • CVH: cross-view hashing [26] is the extension of spectral hashing [70] to the cross-media case.

  • SCMseq: the sequential learning of semantic correlation maximization hashing (SCM) [29] aims to make the distances of the hash codes equal to the similarities of the label vectors and uses a sequential learning algorithm to learn the hash codes.

  • SCMorth: the orthogonal projection learning of SCM [29] is the same as SCMseq, but uses an orthogonal projection learning algorithm to learn the hash codes.

  • CMFH: collective matrix factorization hashing [28] is based on CCA and learns hash codes with latent factor model from different modalities.

  • LSSH: latent semantic sparse hashing [30] is based on CCA and captures high-level semantic information, i.e., sparse coding and matrix factorization for images and texts respectively.

  • SEPHkm: semantics-preserving hashing (SEPH) [27] + k-means transforms the semantic affinities of the training data into a probability distribution and approximates it with to-be-learned hash codes via minimizing the Kullback-Leibler divergence.

  • IMH: inter-media hashing [27] is another extension method of spectral hashing [70].

  • MM-NN: multimodal similarity-preserving hashing [42] is based on a coupled Siamese neural network.

For the real-valued representations, the CNN and BoW features described in Section III-B are projected onto a subspace whose dimensionality is set according to the number of training classes. For the binary representations, the CNN and BoW features are binarized to 8, 16 and 32 bits for evaluation.
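Once both modalities are binarized, retrieval reduces to ranking the gallery codes by Hamming distance to the query code. The sketch below is an illustration only: hash codes are stored as Python integers, the 8-bit values are invented, and ties keep the gallery order, which is a choice made here rather than something the protocol specifies.

```python
def hamming_distance(code_a, code_b):
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code, gallery_codes):
    """Gallery indices sorted by increasing Hamming distance to the query.

    sorted() is stable, so equally distant items keep their gallery order.
    """
    return sorted(
        range(len(gallery_codes)),
        key=lambda i: hamming_distance(query_code, gallery_codes[i]),
    )


# Hypothetical 8-bit codes: one query (e.g. from an image) and three
# gallery items (e.g. from texts).
query_code = 0b10110010
gallery_codes = [0b01001101, 0b10110010, 0b00110011]
distances = [hamming_distance(query_code, g) for g in gallery_codes]
ranking = rank_by_hamming(query_code, gallery_codes)
```

The XOR followed by a bit count is what makes binary codes attractive for large galleries compared with real-valued distances.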

III-D Trivial Solution

We compare the trivial solution [56] under the non-extendable and extendable retrieval settings. In a nutshell, the trivial solution learns a classifier on the training set. Then, the classifier assigns a class label to the gallery and query images/texts. The system thus returns those images/texts of the same class as the query. Note that a number of images/texts are predicted into the same class, i.e., these images/texts form ties, so we randomly assign them a rank within the tie.

  • TS: trivial solution [56] uses the multi-class logistic regression to learn classifiers and uses the classification results as the retrieval results.

  • deep-TS: deep trivial solution [56] works in a similar manner with TS, but uses deep networks to learn the classifiers. The training process is the same with deep-SM [19].

We compare TS and deep-TS with both the real-valued and binary representation methods to evaluate the performance of these cross-media methods. Note that for NUS-WIDE and Wikipedia, where 10 classes are used for training and testing, TS and deep-TS consume less than 4 bits under the new protocol (5 classes for training); for Pascal Sentence with 20 classes, TS and deep-TS cost less than 5 bits under the new protocol (10 classes for training).
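The ranking step of the trivial solution and its storage cost can be sketched as below. This is one possible reading, not the authors' implementation: predicted labels are taken as given (the classifier itself is omitted), items outside the query's predicted class are also shuffled, and the bit cost is taken as the number of bits needed to index one of the training classes. All names are hypothetical.

```python
import math
import random


def trivial_ranking(query_pred, gallery_preds, seed=0):
    """Return gallery indices: items predicted into the query's class
    first, with random ranks inside each tie."""
    rng = random.Random(seed)
    same = [i for i, p in enumerate(gallery_preds) if p == query_pred]
    other = [i for i, p in enumerate(gallery_preds) if p != query_pred]
    rng.shuffle(same)   # random rank within the tie of matching items
    rng.shuffle(other)  # remaining items are tied as well
    return same + other


def bits_for_classes(num_classes):
    """Bits needed to store one predicted class label."""
    return math.ceil(math.log2(num_classes))


# Toy predictions for five gallery items and a query predicted as class 2.
ranking = trivial_ranking(2, [0, 2, 1, 2, 0])
bits_five = bits_for_classes(5)   # 5 training classes
bits_ten = bits_for_classes(10)   # 10 training classes
```

Under this reading, 5 training classes need 3 bits and 10 training classes need 4 bits, in line with the bounds of less than 4 and less than 5 bits quoted above.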

III-E Evaluation on Real-valued Methods

We first present CMC curves and MAP scores obtained by the real-valued baseline methods detailed in Section III-C. The results of the trivial solutions (TS and deep-TS) are also drawn. Note that in the train/test splitting (Fig. 2), we ensure that the non-extendable retrieval and extendable retrieval have the same number of training classes and roughly the same number of training samples. So their numbers are directly comparable. The results on Wikipedia, Pascal Sentence, and NUS-WIDE are shown in Fig. 3, Fig. 4 and Fig. 5, respectively. From these results, we arrive at three major conclusions.

(1) The CMC and MAP scores of non-extendable retrieval are higher than those of extendable retrieval. For example, MAP of semantic matching (SM) [10] decreases from 60.7% (Fig. 3(a)) to 29.4% (Fig. 3(b)), and its rank-1 accuracy on the CMC curve drops from 56.5% (Fig. 3(a)) to 28.8% (Fig. 3(b)). A similarly considerable performance drop can be seen of other methods such as correlation matching (CM) [10], bilinear model (BLM) [67], etc.

This observation is expected because the distributions of the testing data are more different from the training data in extendable retrieval. In comparison, for the non-extendable retrieval, the training and testing data come from very similar distributions because they share the same set of classes. We therefore speculate that testing on data of another domain is a challenging and meaningful task since the benchmarking methods behave poorly compared to their performance of non-extendable retrieval. Regarding this point, we think that transfer learning should be an effective strategy in the future.

(2) While the trivial solution yields competitive accuracy under the non-extendable retrieval settings, its advantage is significantly reduced under the extendable retrieval settings. For example, MAP rank of deep trivial solution (deep-TS) [56] drops from 3 (59.8% in Fig. 3(a)) to 8 (23.6% in Fig. 3(b)). Similarly, trivial solution (TS) [56] drops from 4 (54.5% in Fig. 3(a)) to 7 (25.3% in Fig. 3(b)). The possible reason is that TS and deep-TS are based on classification and they fit the class distribution tightly. In extendable retrieval it is quite possible that the training and testing data come from very dissimilar class distributions. Then, the learned classifiers have a big chance to misclassify the testing data. We therefore think that there is no trivial solution in extendable retrieval. In Fig. 6 we present some sample query results of deep semantic matching (deep-SM) [19] and deep trivial solution [56] (deep-TS) in both non-extendable retrieval and extendable retrieval. The performance drop of deep-TS is much more obvious.

(3) The ranking of the methods by their CMC curves is not consistent with their ranking by MAP, but the rank-1 accuracy is somewhat positively related to MAP. For example, correlation matching (CM) [10] shows the second best performance w.r.t. its CMC curve, but it has the lowest MAP (21.6% in Fig. 3(a)). The rank-1 accuracy of CM is the third lowest, which is roughly consistent with its MAP. A similar situation is observed for other methods such as GMLDA [12] and bilinear model (BLM) [67]. The difference is caused by the definitions of the two evaluation metrics. MAP is a global metric that averages the precision over all matches, so its score reflects the distribution of the matching documents in the returned list. A high MAP can be obtained even if most matches rank in the middle of the returned list. In contrast, the CMC curve only counts the first match of each query and reflects the probability of finding the (first) match at each rank. We think the two metrics reflect different aspects of performance and are complementary to each other.

III-F Evaluation on Binary Representations

We evaluate the binary methods with three code lengths: 8, 16 and 32. The experimental results on the three datasets are illustrated in Figures 7, 8 and 9, respectively. The numbers in the brackets denote the code lengths.

The findings obtained from the real-valued representations still hold for the binary cases. Here we add some additional observations specifically observed for cross-media hashing.

(1) As the code length increases, both the CMC curves and MAP improve for most methods, and vice versa. The variations of the CMC curves coincide with the changes of the MAP scores between different code lengths. For example, MAP of multimodal similarity-preserving hashing (MM-NN) [42] improves from the 8-bit codes (Fig. 7(a)) to the 16-bit codes (Fig. 7(e)), and the CMC curve also improves in accuracy. Exceptions exist. For example, MAP of SCMorth [29] drops from the 8-bit codes (Fig. 7(c)) to the 16-bit codes (Fig. 7(g)), but the CMC curve is improved. A similar situation is observed for other methods such as inter-media hashing (IMH) [27] and cross-view hashing (CVH) [26]. The reason is that these methods are based on CCA [11] and produce a limited number of effective bits for the global accuracy (MAP). We think that a longer code improves the top-rank accuracy in general.

(2) The CMC curve may serve as a good discriminator between methods when their MAP scores are similar. The CMC curve can reveal differences in performance even when the MAP scores are nearly equal. For example, MAP of CVH is similar in Figs. 7(c) (21.6%), 7(g) (21.7%) and 7(k) (21.9%), but the CMC curves are more discriminative. The reason is that, compared to MAP, the CMC curve reflects more details of the search results. In this regard, we think that the CMC curve is a good supplementary metric to MAP.

(3) In most cases, the difference in performance (CMC and MAP) between the various methods is much smaller under the new protocol. For example, in Fig. 7(b), the difference in CMC and MAP between the methods is much smaller than in Fig. 7(a). A similar situation is observed for both the real-valued and binary representation methods. This observation is expected because the discriminative knowledge learned from the training data cannot be directly transferred to the testing data under the new protocol. The drops in performance reduce the difference between the methods. We think the next challenge of cross-media retrieval is to solve the knowledge transfer problem.

IV Conclusion

This paper introduces a new evaluation protocol and extensive benchmarking results for extendable cross-media retrieval. The new protocol involves 1) a complete separation of the training and testing sets and 2) a complete separation of the training and testing classes. This protocol thus reflects the extendable settings that have been largely ignored in the cross-media retrieval community. Through the benchmarking results, we demonstrate a significant performance drop from the non-extendable retrieval settings to the extendable retrieval settings. Moreover, we find that the trivial classification-based solution works well under non-extendable retrieval but is less effective under extendable retrieval. These observations indicate that the two evaluation protocols (the existing protocol and the new protocol) are entirely different, and our analysis suggests that practical usage favors the new protocol.
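The two separations required by the new protocol can be illustrated with a short sketch. This is only an assumed way of building such a split, not the released train/test split: the function name `extendable_split`, the random class assignment, and the toy samples are all hypothetical.

```python
import random


def extendable_split(samples, num_train_classes, seed=0):
    """Split (sample_id, class_label) pairs so that train and test classes are disjoint.

    Hypothetical helper: classes are assigned to the training side at random.
    """
    classes = sorted({label for _, label in samples})
    if not 0 < num_train_classes < len(classes):
        raise ValueError("need at least one class on each side of the split")
    rng = random.Random(seed)
    rng.shuffle(classes)
    train_classes = set(classes[:num_train_classes])
    # Every sample follows its class, so the sample sets are disjoint as well.
    train = [s for s in samples if s[1] in train_classes]
    test = [s for s in samples if s[1] not in train_classes]
    return train, test


# Toy data (assumption): six image-text pairs from three classes.
samples = [
    ("pair0", "art"), ("pair1", "art"),
    ("pair2", "music"), ("pair3", "music"),
    ("pair4", "sport"), ("pair5", "sport"),
]
train, test = extendable_split(samples, num_train_classes=2)

train_labels = {label for _, label in train}
test_labels = {label for _, label in test}
print("train classes:", sorted(train_labels))
print("test classes:", sorted(test_labels))
```

Under the existing protocol, by contrast, the samples would be split while both sides share the same class labels.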

For future work, we point out two critical research directions beyond the common feature/subspace learning techniques:

First, transfer learning within each dataset should be investigated. A model trained on one part of a dataset is expected to be effective on another part with a different class distribution, and both non-extendable and extendable retrieval should be evaluated.

Second, generically applicable models that work across different cross-media datasets are to be investigated. A generic model is expected to be trained on a large-scale training dataset and to remain effective on all other testing datasets.


  • [1] L. Zheng, S. Wang, Z. Liu, and Q. Tian, “Packing and padding: Coupled multi-index for accurate image retrieval,” in CVPR, 2014, pp. 1939–1946.
  • [2] L. Zheng, S. Wang, J. Wang, and Q. Tian, “Accurate image search with multi-scale contextual evidences,” IJCV, vol. 120, no. 1, pp. 1–13, 2016.
  • [3] L. Zheng, Y. Yang, and Q. Tian, “Sift meets cnn: a decade survey of instance retrieval,” arXiv preprint arXiv:1608.01807, 2016.
  • [4] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, “Effective multiple feature hashing for large-scale near-duplicate video retrieval,” TMM, vol. 15, no. 8, pp. 1997–2008, 2013.
  • [5] F. Wu, Y. Yang, Y. Zhuang, and Y. Pan, “Understanding multimedia document semantics for cross-media retrieval,” in PCM.   Springer Berlin Heidelberg, 2005, pp. 993–1004.
  • [6] Y.-T. Zhuang, Y. Yang, and F. Wu, “Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval,” TMM, vol. 10, no. 2, pp. 221–229, 2008.
  • [7] Y. Yang, Y.-T. Zhuang, F. Wu, and Y.-H. Pan, “Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval,” TMM, vol. 10, no. 3, pp. 437–446, 2008.
  • [8] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang, “Ranking with local regression and global alignment for cross media retrieval,” in ACMMM.   ACM, 2009, pp. 175–184.
  • [9] Y. Yang, F. Wu, D. Xu, Y. Zhuang, and L.-T. Chia, “Cross-media retrieval using query dependent search methods,” Pattern Recognition, vol. 43, no. 8, pp. 2927–2936, 2010.
  • [10] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in ACMMM.   ACM, 2010, pp. 251–260.
  • [11] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
  • [12] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in CVPR.   IEEE, 2012, pp. 2160–2167.
  • [13] K. Wang, R. He, W. Wang, L. Wang, and T. Tan, “Learning coupled feature spaces for cross-modal matching,” in ICCV, 2013, pp. 2088–2095.
  • [14] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” IJCV, vol. 106, no. 2, pp. 210–233, 2014.
  • [15] N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, “Cluster canonical correlation analysis.” in AISTATS, 2014, pp. 823–831.
  • [16] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep canonical correlation analysis.” in ICML, 2013, pp. 1247–1255.
  • [17] L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in CVPR, 2016, pp. 5005–5013.
  • [18] F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in CVPR, 2015, pp. 3441–3450.
  • [19] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Cross-modal retrieval with cnn visual features: A new baseline,” IEEE Trans. on Cybernetics, 2016, Preprint.
  • [20] Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan, “Cross-modal retrieval via deep and bidirectional representation learning,” TMM, vol. 18, no. 7, pp. 1363–1377, 2016.
  • [21] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” TACL, vol. 2, pp. 207–218, 2014.
  • [22] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv preprint arXiv:1411.2539, 2014.
  • [23] F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondence autoencoder,” in ACMMM.   ACM, 2014, pp. 7–16.
  • [24] V. Vukotić, C. Raymond, and G. Gravier, “Bidirectional joint representation learning with symmetrical deep neural networks for multimodal and crossmodal applications,” in ICMR.   ACM, 2016, pp. 343–346.
  • [25] G. Park and W. Im, “Image-text multi-modal representation learning by adversarial backpropagation,” arXiv preprint arXiv:1612.08354, 2016.
  • [26] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in IJCAI, vol. 22, no. 1, 2011, pp. 1360–1365.
  • [27] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, “Inter-media hashing for large-scale retrieval from heterogeneous data sources,” in SIGMOD.   ACM, 2013, pp. 785–796.
  • [28] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in CVPR, 2014, pp. 2075–2082.
  • [29] D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization.” in AAAI, vol. 1, no. 2, 2014, p. 7.
  • [30] J. Zhou, G. Ding, and Y. Guo, “Latent semantic sparse hashing for cross-modal similarity search,” in SIGIR.   ACM, 2014, pp. 415–424.
  • [31] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast cross-media retrieval,” in SIGIR.   ACM, 2014, pp. 395–404.
  • [32] X. Xu, F. Shen, Y. Yang, and H. T. Shen, “Discriminant cross-modal hashing,” in ICMR.   ACM, 2016, pp. 305–308.
  • [33] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” in CVPR.   IEEE, 2010, pp. 3594–3601.
  • [34] Y. Zhen and D.-Y. Yeung, “A probabilistic model for multimodal hash function learning,” in SIGKDD.   ACM, 2012, pp. 940–948.
  • [35] Z. Lin, G. Ding, M. Hu, and J. Wang, “Semantics-preserving hashing for cross-view retrieval,” in CVPR, 2015, pp. 3864–3872.
  • [36] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, “Multimodal similarity-preserving hashing,” TPAMI, vol. 36, no. 4, pp. 824–830, 2014.
  • [37] Q.-Y. Jiang and W.-J. Li, “Deep cross-modal hashing,” arXiv preprint arXiv:1602.02255, 2016.
  • [38] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu, “Deep visual-semantic hashing for cross-modal retrieval,” in SIGKDD.   ACM, 2016, pp. 1445–1454.
  • [39] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., “Devise: A deep visual-semantic embedding model,” in NIPS, 2013, pp. 2121–2129.
  • [40] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Explain images with multimodal recurrent neural networks,” arXiv preprint arXiv:1410.1090, 2014.
  • [41] A. Karpathy, A. Joulin, and F. F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in NIPS, 2014, pp. 1889–1897.
  • [42] L. Ma, Z. Lu, L. Shang, and H. Li, “Multimodal convolutional neural networks for matching image and sentence,” in ICCV, 2015, pp. 2623–2631.
  • [43] X. Chen and C. Lawrence Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in CVPR, 2015, pp. 2422–2431.
  • [44] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in CVPR, 2015, pp. 3128–3137.
  • [45] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nus-wide: a real-world web image database from national university of singapore,” in CIVR.   ACM, 2009, pp. 1–9.
  • [46] M. J. Huiskes and M. S. Lew, “The mir flickr retrieval evaluation,” in MIR.   ACM, 2008, pp. 39–43.
  • [47] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010.
  • [48] J. Krapac, M. Allan, J. Verbeek, and F. Jurie, “Improving web image search results using query-relative classifiers.”
  • [49] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using amazon’s mechanical turk,” in NAACL Workshop.   Association for Computational Linguistics, 2010, pp. 139–147.
  • [50] Y. Yan, F. Nie, W. Li, C. Gao, Y. Yang, and D. Xu, “Image classification by cross-media active learning with privileged information,” TMM, vol. 18, no. 12, pp. 2494–2502, 2016.
  • [51] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” JAIR, vol. 47, pp. 853–899, 2013.
  • [52] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” TACL, vol. 2, pp. 67–78, 2014.
  • [53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV.   Springer, 2014, pp. 740–755.
  • [54] V. Ordonez, G. Kulkarni, and T. L. Berg, “Im2text: Describing images using 1 million captioned photographs,” in NIPS, 2011, pp. 1143–1151.
  • [55] X. Zhai, Y. Peng, and J. Xiao, “Learning cross-media joint representation with sparse and semisupervised regularization,” TCSVT, vol. 24, no. 6, pp. 965–978, 2014.
  • [56] A. Sablayrolles, M. Douze, H. Jégou, and N. Usunier, “How should we evaluate supervised hashing?” arXiv preprint arXiv:1609.06753, 2016.
  • [57] F. Radenović, G. Tolias, and O. Chum, “Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples,” in ECCV.   Springer, 2016, pp. 3–20.
  • [58] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015, pp. 1116–1124.
  • [59] L. Zheng, Y. Yang, and A. G. Hauptmann, “Person re-identification: Past, present and future,” arXiv preprint arXiv:1610.02984, 2016.
  • [60] X. Liu, W. Liu, T. Mei, and H. Ma, “A deep learning-based approach to progressive vehicle re-identification for urban surveillance,” in ECCV.   Springer, 2016, pp. 869–884.
  • [61] X. Liu, W. Liu, H. Ma, and H. Fu, “Large-scale vehicle re-identification in urban surveillance videos,” in ICME, 2016, pp. 1–6.
  • [62] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” 2014.
  • [63] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
  • [64] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR.   IEEE, 2009, pp. 248–255.
  • [65] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
  • [66] A. Sharma and D. W. Jacobs, “Bypassing synthesis: Pls for face recognition with pose, low-resolution and sketch,” in CVPR.   IEEE, 2011, pp. 593–600.
  • [67] M. Turk and A. Pentland, “Eigenfaces for recognition,” JOCN, vol. 3, no. 1, pp. 71–86, 1991.
  • [68] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” TPAMI, vol. 29, no. 1, 2007.
  • [69] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” TPAMI, vol. 19, no. 7, pp. 711–720, 1997.
  • [70] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, 2009, pp. 1753–1760.