Creating Something from Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing

04/01/2020 ∙ by Hengtong Hu, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ Hefei University of Technology

In recent years, cross-modal hashing (CMH) has attracted increasing attention, mainly because of its ability to map contents from different modalities, especially vision and language, into the same space, which makes cross-modal data retrieval efficient. There are two main frameworks for CMH, differing in whether semantic supervision is required. Compared to the unsupervised methods, the supervised methods often enjoy more accurate results, but require much heavier labor in data annotation. In this paper, we propose a novel approach that enables guiding a supervised method using outputs produced by an unsupervised method. Specifically, we make use of teacher-student optimization for propagating knowledge. Experiments are performed on two popular CMH benchmarks, i.e., the MIRFlickr and NUS-WIDE datasets. Our approach outperforms all existing unsupervised methods by a large margin.




1 Introduction

Recently, with the rapid increase of multimedia data, cross-modal retrieval [37, 46, 18, 47, 7, 1, 10, 25, 22] has attracted more and more attention in both academia and industry. The goal is to retrieve instances from one modality using a query instance from another modality, e.g., finding an image with a few textual tags. One of the most popular pipelines for this purpose, named cross-modal hashing (CMH) [1, 19, 23, 47, 7], involves mapping contents in different modalities into a common Hamming space. By compressing each instance into a fixed-length binary code, the storage cost is dramatically reduced, and the retrieval time complexity is constant since the indexing structure is built upon hash codes.

State-of-the-art CMH methods can be roughly categorized into two groups, namely, supervised and unsupervised methods. Both of them learn to shrink the gap between the distributions of two sets of training data (e.g., using adversarial-learning-based approaches [20, 21, 13]), but they differ in whether instance-level annotation is provided during the training stage. From this perspective, the supervised CMH methods [1, 25, 22, 34, 44], receiving additional supervision, often produce more accurate results, while the unsupervised counterparts, despite achieving lower performance, are relatively easier to deploy to real-world scenarios.

DCMH [17]
SSAH [20]
UCH [21]
UGACH [45]
Table 1: The difference between our approach and some recent cross-modal hashing methods. Here, ‘WL’ indicates training without using labels, ‘ER’ indicates that the method utilizes extensive relevance information rather than only the pairwise information, and ‘KD’ indicates utilizing knowledge distillation in the training process.

This paper combines the benefits of both groups of methods by a simple yet effective idea, known as creating something from nothing. The core idea is straightforward: the supervised methods do not really require each instance to be labeled; they only use the labels to estimate the similarity between each pair of cross-modal data. Such information, in the absence of supervision, can also be obtained by calculating the distance between feature vectors, with the features provided by a trained unsupervised CMH method. Our approach, unsupervised knowledge distillation (UKD), contains an unsupervised CMH module followed by a supervised one, both of which can be freely replaced by newer and more powerful models in the future.

Our research paves the way towards an interesting direction: using an unsupervised method to guide a supervised method, for which CMH is a good testbed. We perform experiments on two popular cross-modal retrieval datasets, i.e., MIRFlickr and NUS-WIDE, and demonstrate state-of-the-art performance, outperforming existing unsupervised CMH methods by a significant margin. Moreover, we delve deep into the benefits of supervision, and point out a few directions for future research.

The remainder of this paper is organized as follows. Section 2 briefly reviews the preliminaries of cross-modal retrieval and hashing, and Section 3 describes the unsupervised knowledge distillation approach. Experimental results are shown in Section 4 and conclusions are drawn in Section 5.

2 Related Work

2.1 Cross-Modal Retrieval and Hashing

Cross-modal retrieval aims to search semantically similar instances in one modality using a query from another modality [37, 39]. Throughout this paper, we consider the retrieval task between vision and language, i.e., involving images and texts. To map them into the same space, two models need to be trained, one for each modality. The goal is to make image-text pairs with relevant semantics close in the feature space. To train and evaluate the mapping functions, a dataset with image-text pairs is required. The dataset is further split into a training set and a query set, i.e., the testing stage is performed on the query set.

In the past decade, many efforts were made on this topic [18, 46, 37]. However, most of them suffered from high computation costs on real-world, high-dimensional data. To scale these models up to real-world scenarios, researchers often compressed their outputs into binary vectors of a fixed length [1, 19, 10, 24], i.e., hash codes. In this situation, the task is often referred to as cross-modal hashing.

2.2 Supervised Cross-Modal Hashing Methods

The fundamental challenges of cross-modal hashing lie in learning reliable mapping functions to bridge the modality gap. Supervised methods [25, 47, 38, 20, 39, 7] achieved this goal by exploiting semantic labels to capture rich correlation information among data from different modalities. Traditional supervised learning methods were mostly based on handcrafted features, and aimed to capture the semantic relevance in the common space. SePH [22] proposed a semantics-preserving hashing method which approximates the distribution of semantic labels with hash codes in the Hamming space by minimizing the KL-divergence. Wang et al. [34] proposed to leverage list-wise supervision in a principled framework of learning the hashing function.

With the rapid development of deep learning, researchers started to build supervised methods upon more powerful and discriminative features. DCMH [17] proposed a deep cross-modal hashing method that integrates feature learning and binary quantization into one framework. SSAH [20] improved this work by proposing a self-supervised approach, which incorporated adversarial learning into cross-modal hashing. Zhang et al. [47] also investigated a similar idea by proposing an adversarial hashing network with an attention mechanism to enhance the measurement of content-level similarities. These supervised methods achieved superior performance, arguably by acquiring correlation information from the semantic labels of both images and texts. However, acquiring a large amount of such labels is often expensive and thus intractable, which makes the supervised approaches infeasible in real-world applications.

2.3 Unsupervised Cross-Modal Hashing Methods

Figure 1: The proposed UKD framework, which involves training a teacher model in an unsupervised manner, constructing the similarity matrix S by distilling knowledge from the teacher model, and using it to supervise the student model. Each dot represents an intermediate feature. Please zoom in to see the details of this figure.

Compared with the supervised counterparts, unsupervised cross-modal hashing methods [9, 45, 13, 36] only rely on the correlation information from the paired data, making them easier to deploy to other scenarios. These methods usually learn hash codes by preserving inter- and intra-modal correlations. For example, Song et al. [32] proposed inter-media hashing to establish a common Hamming space by maintaining inter-media and intra-media consistency. Recently, several works introduced deep learning to improve unsupervised cross-modal hashing. UGACH [45] utilized a generative adversarial network to exploit the underlying manifold structure of cross-modal data. As an improvement, UCH [21] coupled generative adversarial networks to build two cycled networks in a unified framework, learning common representations and hash mappings simultaneously.

Despite the superiority in reducing the burden of data annotation, the accuracy of unsupervised cross-modal hashing methods is often unsatisfactory, in particular, much lower than that of the supervised counterparts. The main reason lies in the lack of knowledge of pairwise similarity for the training data pairs. On the other hand, we notice that the output of an unsupervised model contains, though somewhat inaccurate, such semantic information. This motivates us to guide a supervised model with the output of an unsupervised model, which is yet another type of research that distills knowledge to assist model training.

3 Our Approach

In this work, we focus on the idea named creating something from nothing, i.e., a supervised cross-modal hashing method can be guided by the output of an unsupervised method, which reveals the similarity between training data pairs. Figure 1 shows the framework of the proposed UKD. In what follows, we first explain the motivation of our approach, and then introduce the proposed pipeline, unsupervised knowledge distillation, from two aspects: how to distill similarity from an unsupervised model, and how to utilize it efficiently to optimize a supervised model.

3.1 Supervised and Unsupervised Baselines

Throughout this paper, we consider the case that the training set contains paired data, i.e., D = {(x_i^v, x_i^t)}_{i=1}^{N}, where N is the number of image-text pairs. Here, x_i^v is an image and x_i^t is a text, where the superscripts v and t denote ‘images’ and ‘texts’, and the dimensionality of the two raw feature spaces can be different, as in our experiments. The models that map them into the same space are denoted as f^v(x^v; θ^v) and f^t(x^t; θ^t), respectively, where K is the dimensionality of the common feature space, and θ^v and θ^t are model parameters. The compressed hash codes for images and texts are denoted by b^v and b^t, respectively, i.e., both b^v and b^t fall within {−1, +1}^K.

The key to cross-modal hashing lies in recognizing which pairs of image-text data are semantically relevant while others are not, so that the model can learn to pull the features of relevant pairs closer in the common space. A straightforward idea is to define all paired image and text instances to be relevant and all others irrelevant. However, this strategy produces a very small positive set and a much larger negative set, which often causes data imbalance during the training stage. A generalized yet more effective solution is to define a similarity matrix S, so that S_{ij} = 1 defines a positive pair and vice versa. The original sampling strategy is equivalent to setting S to the identity matrix.
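As an illustration, the two sampling strategies above can be sketched in a few lines; the helper name and the interface for supplying extra relevant pairs are our own, not from the paper:

```python
import numpy as np

def build_similarity_matrix(n, extra_relevant=None):
    """Build a binary similarity matrix for n image-text pairs.

    The straightforward strategy marks only the i-th image / i-th text
    pair as relevant (the identity matrix); `extra_relevant` optionally
    lists additional (i, j) index pairs, e.g., mined by a teacher model.
    """
    S = np.eye(n, dtype=np.int8)       # paired instances are always relevant
    for i, j in (extra_relevant or []):
        S[i, j] = 1                    # generalized, denser supervision
    return S

S = build_similarity_matrix(4, extra_relevant=[(0, 2), (3, 1)])
```

With no extra pairs, this reduces to the naive strategy with a single positive per instance, which is exactly what causes the imbalance discussed above.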

Given S, the objective of training involves minimizing the total distance over relevant pairs with respect to θ^v and θ^t, i.e.,

min_{θ^v, θ^t} Σ_{i,j} S_{ij} · d(f^v(x_i^v; θ^v), f^t(x_j^t; θ^t)),

where d(·, ·) is a distance measure in the common space. Therefore, the definition of S forms the major challenge of the learning task. According to whether extra labels of images and texts, besides the paired information, are used, existing methods can be categorized into either supervised or unsupervised learning. In the supervised setting, instance-level annotations (e.g., classification tags) are used to measure whether two instances are relevant, while in the unsupervised setting, no additional labels are available and thus the raw features are the only source of judgment. Obviously, the former provides a more accurate estimation of S than the latter and, consequently, stronger models for cross-modal hashing. However, collecting additional annotations, even at the instance level, can be a large burden, especially when the dataset is very large. Hence, we focus on improving the performance of unsupervised learning methods, which are easier to deploy to real-world scenarios.

Figure 2: Knowledge distilled from an unsupervised model (best viewed in color). Compared to the retrieval results in the original feature space, our approach produces more accurate information about the tag of an image and, more importantly, a better estimation of the relevance of image-text pairs.

3.2 Unsupervised Knowledge Distillation

Function new image text P@ P@
Table 2: Comparison among different functions to measure the similarity between image-text pairs. All the results are computed using features extracted from a UGACH [45] model trained on the MIRFlickr dataset. Here we consider the following properties: ‘new’ means that the new feature space, learned by the teacher model, is used; ‘image’ and ‘text’ mean that the corresponding features are used; and ‘indiv’ means that image and text features are used individually. The P@ columns indicate the precision among the top-ranked retrieved pairs.

Our idea originates from the fact that, as shown above, the difference between supervised and unsupervised cross-modal hashing algorithms is not big, yet supervised methods often report much higher accuracy than the unsupervised counterparts. Moreover, the supervised algorithms do not require real supervision, namely, manually labeled image/text tags; they only need to know, or estimate, the similarity between any pair of data, i.e., the elements of S. Beyond the unsupervised baseline that estimates S using raw image/text features (extracted from a pre-trained deep network or computed using bag-of-words statistics), we seek the possibility that a cross-modal retrieval model, after being trained in an unsupervised manner, can produce a more accurate estimation of S. We illustrate an example in Figure 2. Later, we will show in experiments, with the help of oracle annotations, that the updated estimation of S is indeed more accurate in terms of finding relevant pairs.

Note that the estimated S can be used to train either supervised or unsupervised models, with the formulations detailed above. When S is used for unsupervised learning, the only effect is to provide a better sampling strategy, so as to increase the proportion of true-positive image-text pairs in the chosen training set. This alleviates the risk that the model learns to pull together the features of actually irrelevant pairs. When it is used for supervised learning, we are actually creating something from nothing, i.e., guiding a supervised model with the output of an unsupervised model.

The proposed framework, unsupervised knowledge distillation (UKD), works as follows. After the teacher model has been trained, we obtain the embedding functions f^v(·; θ^v) and f^t(·; θ^t) for image and text features, respectively. It remains to determine each element of S. Without loss of generality, we assume that the feature vectors extracted from either modality, i.e., f_i^v or f_i^t, have a unit L2-norm. This is to ease the following calculations.
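The unit-norm assumption is easy to enforce with row-wise L2 normalization, and it bounds every pairwise Euclidean distance in [0, 2], since d(u, v) = sqrt(2 − 2⟨u, v⟩) for unit vectors. A minimal sketch (the function name is ours):

```python
import numpy as np

def l2_normalize(F, eps=1e-12):
    """Scale each row of a feature matrix to unit L2 norm."""
    return F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)

F = l2_normalize(np.random.RandomState(0).randn(5, 8))
# pairwise Euclidean distances; bounded by 2 for unit-norm rows
D = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
```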

First, we point out that S_{ii} = 1 for all i. When i ≠ j, the similarity function takes four vectors, f_i^v, f_i^t, f_j^v and f_j^t, into consideration. The design of this function can have various forms. For example, it can consider both image and text features by combining d(f_i^v, f_j^v) and d(f_i^t, f_j^t), where d(·, ·) is the Euclidean distance between two vectors, which lies in the range [0, 2] for two normalized vectors. Also, it is possible to consider only single-modal information, e.g., d(f_i^v, f_j^v), in which only image features are used for measuring similarity.

Here, we take several definitions of the similarity function into consideration, and compare their performance in finding true-positive pairs. Results are shown in Table 2. We can observe several important properties that are useful for similarity measurement. First, the features trained for cross-modal hashing are indeed better than those without fine-tuning; second, measuring similarity in the image feature space is more accurate than in the text feature space; third, directly combining image and text similarity into one measure does not improve accuracy beyond using image similarity alone, though we expected text features to provide auxiliary information. Motivated by these results, we use image features and text features to retrieve two lists of relevant pairs and then merge them into one. This strategy reports the highest precision among the top-ranked instances, surpassing that of using image or text features alone. We fix this setting throughout the remainder of this paper.
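A toy sketch of this merged strategy (the names and the neighborhood size k are illustrative assumptions): a pair (i, j) is marked relevant if j is among i's nearest neighbors in either the image or the text feature space, and the two ranked lists are merged by union:

```python
import numpy as np

def distill_similarity(F_img, F_txt, k=1):
    """Mark (i, j) relevant if j is a top-k neighbor of i in the image
    feature space OR the text feature space (union of the two lists)."""
    n = F_img.shape[0]
    S = np.eye(n, dtype=np.int8)            # original pairs stay relevant
    for F in (F_img, F_txt):                # one ranked list per modality
        D = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)         # exclude self-matches
        nn = np.argsort(D, axis=1)[:, :k]   # indices of top-k neighbors
        for i in range(n):
            S[i, nn[i]] = 1                 # merge: union of the lists
    return S

# toy features: instances {0, 1} and {2, 3} form two tight clusters
F = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
S = distill_similarity(F, F, k=1)
```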

3.3 Models and Implementation Details

We first describe the supervised and unsupervised methods we have used. We take DCMH [17] as an example of supervised learning. Here we utilize the framework of DCMH but modify its architecture for higher accuracy. This model contains two deep neural networks, designed for the image modality and the text modality, respectively. The image modality network consists of nineteen layers: the first eighteen layers are the same as those in the VGG19 network [31], and the last layer maps features into the Hamming space. For the text modality, a multi-scale fusion model from SSAH [20], which consists of multiple average pooling layers and a convolutional layer, is used to extract the text features. Then, a hash layer follows to map the text features into the Hamming space.

On the other hand, we investigate UGACH [45], a representative unsupervised learning method, as the teacher model. It consists of a generative module and a discriminative module. The discriminator receives the data selected by the generator as negative instances, and takes the data sampled using S as positive instances. A triplet loss is then used during optimization to obtain better discriminative ability for the discriminator. Both the generative and discriminative modules have a two-pathway architecture, each of which has two fully-connected layers. The dimension of the representation layer is fixed in our experiments, and the dimension of the hash layer is the same as the hash code length.
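The triplet objective mentioned above can be sketched generically; the margin value and the use of plain Euclidean distance here are illustrative assumptions rather than UGACH's exact formulation:

```python
from math import dist

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: push the anchor at least `margin`
    closer to the positive (a pair sampled as relevant) than to the
    negative (a hard pair selected by the generator)."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

zero_loss = triplet_loss((0, 0), (0, 1), (5, 0))   # constraint satisfied
pos_loss = triplet_loss((0, 0), (3, 0), (1, 0))    # constraint violated
```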

For the supervised model, we take the raw pixels as inputs. In pre-processing, we resize all images to a fixed resolution and randomly crop patches from them. We select relevant instances for the student model by using the teacher model with the highest precision (the longest hash code in all experiments). The number of relevant instances is set differently for the supervised and unsupervised student models. We train our approach in a batch-based manner with a fixed batch size, using an SGD optimizer with weight decay. For the compared methods, we apply the same implementations as provided in the original works.

3.4 Relationship to Previous Work

Our method is related to knowledge distillation [29, 43, 33], which was proposed to extract knowledge from a teacher model to assist training a student model. Hinton et al. [15] suggested that there is some ‘dark knowledge’ that can be propagated during this process. Recently, many efforts were made to study what the dark knowledge is [41, 40], and/or how to efficiently take advantage of such knowledge [11, 42, 35, 2]. In particular, DarkRank [5] distilled knowledge for deep metric learning by matching two probability distributions over rankings, while our approach utilizes knowledge by selecting relevant instances. On the other hand, both [42] and [27] transferred knowledge to improve the student models by designing a distillation loss, while our approach enables guiding a supervised method by an unsupervised method, in which no extra loss is used.

We also notice the connection between our approach and self-learning algorithms for semi-supervised learning, e.g., in medical image analysis [48]. The shared idea is to start with a small part of labeled data (in our case, labeled image-text pairs) and try to explore the unlabeled part (in our case, other image-text pairs with unknown relevance), but the methods to gain additional supervision are different. Also, the idea of ‘training a stronger model the second time’ is related to the coarse-to-fine learning approaches [14, 49], which often adopted iteration for larger improvements.

Our approach shares the same idea with some prior work that guided a supervised model with the output of an unsupervised model. DeepCluster [3] groups features with a standard clustering algorithm and uses the subsequent assignments as supervision to update the weights of the network. Gomez et al. [12] performed self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. Differently, our approach makes use of teacher-student optimization to combine the supervised and unsupervised models. Experimental results show the effectiveness of knowledge distillation.

4 Experiments

4.1 Datasets, Evaluation, and Baselines

We evaluate our approach on two benchmark datasets: MIRFlickr and NUS-WIDE. MIRFlickr-25K [16] consists of 25,000 images downloaded from Flickr. Each image is associated with several text tags and annotated with at least one of the pre-defined categories. Following UGACH [45], we split the image-text pairs used in our experiments into a query set and a retrieval set. We represent each image by a feature vector extracted from a pre-trained 19-layer VGGNet [31], and each text by a bag-of-words feature vector.

NUS-WIDE [6] is much larger than MIRFlickr, containing images and associated text tags collected from Flickr. It defines a larger set of categories, but there are considerable overlaps among them. Still, following UGACH [45], only the largest categories and the corresponding image-text pairs are used in the experiments. We preserve part of the data as the query set and use the rest as the retrieval set. Each image is represented by a feature vector extracted from the same VGGNet, and each text by a bag-of-words vector.

Following the convention, we adopt the mean Average Precision (mAP) criterion to evaluate the retrieval performance of all methods. The mAP score is computed as the mean value of the average precision scores for all queries.
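Concretely, the average precision of a single query is the mean of the precision@k values taken at every rank k where a relevant instance appears, and mAP averages this over all queries. A minimal, dependency-free sketch (the function names are ours):

```python
def average_precision(relevance):
    """AP for one ranked list of 0/1 relevance flags."""
    total = sum(relevance)
    if total == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank      # precision at this relevant rank
    return ap / total

def mean_average_precision(ranked_lists):
    """mAP: the mean of per-query average precision scores."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)
```

For example, a ranked list [relevant, irrelevant, relevant] yields AP = (1/1 + 2/3) / 2 = 5/6.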

Task Method MIRFlickr-25K NUS-WIDE
16 32 64 128 16 32 64 128
CMSSH [1] 0.611 0.602 0.599 0.591 0.512 0.470 0.479 0.466
SCM [44] 0.636 0.640 0.641 0.643 0.517 0.514 0.518 0.518
DCMH [17] 0.677 0.703 0.725 - 0.590 0.603 0.609 -
SSAH [20] 0.797 0.809 0.810 - 0.636 0.636 0.637 -
CVH [19] 0.602 0.587 0.578 0.572 0.458 0.432 0.410 0.392
PDH [28] 0.623 0.624 0.621 0.626 0.475 0.484 0.480 0.490
CMFH [8] 0.659 0.660 0.663 0.653 0.517 0.550 0.547 0.520
CCQ [26] 0.637 0.639 0.639 0.638 0.504 0.505 0.506 0.505
UGACH [45] 0.676 0.693 0.702 0.706 0.597 0.615 0.627 0.638
UKD-US 0.695 0.703 0.705 0.707 0.606 0.621 0.634 0.643
UKD-SS 0.714 0.718 0.725 0.720 0.614 0.637 0.638 0.645
CMSSH [1] 0.612 0.604 0.592 0.585 0.519 0.498 0.456 0.488
SCM [44] 0.661 0.664 0.668 0.670 0.518 0.510 0.517 0.518
DCMH [17] 0.705 0.707 0.724 - 0.620 0.634 0.643 -
SSAH [20] 0.782 0.797 0.799 - 0.653 0.676 0.683 -
CVH [19] 0.607 0.591 0.581 0.574 0.474 0.445 0.419 0.398
PDH [28] 0.627 0.628 0.628 0.629 0.489 0.512 0.507 0.517
CMFH [8] 0.611 0.606 0.575 0.563 0.439 0.416 0.377 0.349
CCQ [26] 0.628 0.628 0.622 0.618 0.499 0.496 0.492 0.488
UGACH [45] 0.676 0.692 0.703 0.707 0.602 0.610 0.628 0.637
UKD-US 0.704 0.707 0.715 0.714 0.621 0.625 0.640 0.647
UKD-SS 0.715 0.716 0.721 0.719 0.630 0.656 0.657 0.663
Table 3: The mAP scores of our approach and state-of-the-art competitors, on two datasets and four different code lengths. In each half, the four rows above the horizontal line contain supervised learning algorithms, while the rows below it contain unsupervised ones.

We compare our approach against previous methods. Four of them used additional supervision (CMSSH [1], SCM [44], DCMH [17], and SSAH [20]), while the others (CVH [19], PDH [28], CMFH [8], CCQ [26], and UGACH [45]) did not. Following our direct baseline, UGACH, we use a 19-layer VGGNet [31] pre-trained on the ImageNet dataset to extract deep features and, for a fair comparison, use them to replace the features used in the other baselines, including those originally using handcrafted features.

4.2 Unsupervised Student vs. Supervised Student

In Table 3, we list the accuracy, in terms of mAP, of our approach as well as other methods on the two benchmark datasets, MIRFlickr and NUS-WIDE. We use ‘I→T’ to denote the task in which images are taken as queries to retrieve instances in the text database, and ‘T→I’ the task in the opposite direction. Our approach is denoted by ‘UKD-US’ and ‘UKD-SS’, with ‘US’ and ‘SS’ indicating ‘unsupervised-student’ and ‘supervised-student’, respectively.

We observe interesting results. Regarding the I→T task, UKD-SS outperforms UKD-US significantly on the MIRFlickr dataset, but the advantage on the NUS-WIDE dataset becomes much smaller. This is explained by noting that the impact brought by supervision differs between these two datasets. We consider SSAH [20] and UGACH [45], the supervised and unsupervised models we used as the students. SSAH typically outperforms UGACH by a large margin on MIRFlickr, but the gap shrinks quickly on NUS-WIDE. This is partially due to the larger variance of the images in NUS-WIDE, which makes it difficult for the labeled tags to provide accurate and valuable supervision. From this perspective, the reduced advantage of UKD-SS over UKD-US is reasonable, considering that SSAH is the upper bound of UKD-SS.

On the other hand, by introducing extra supervision (in particular, by checking the distance between the features extracted from an unsupervised model), considerable noise (e.g., inaccurate similarity measurements) is also introduced to the supervised student model. Hence, there is a tradeoff between the quality and impact of these self-annotated pairs. Most often, the latter can be measured by the advantage of the supervised student model over the unsupervised one, if both can be obtained on a small reference dataset.

4.3 Comparison to the State-of-the-Arts

From Table 3, one can observe that our approach, UKD, significantly outperforms all existing unsupervised cross-modal hashing methods on both datasets, and under every length of hash code. In particular, compared to our baseline (UGACH, which is also the strongest model that has reported results with VGGNet-19 features), UKD enjoys consistent gains (averaged over the two retrieval directions) under all code lengths on the MIRFlickr dataset, and corresponding gains on the NUS-WIDE dataset. Given such a high baseline, these improvements clearly demonstrate the effectiveness of distilling knowledge from the teacher model, although it is trained in an unsupervised manner. Moreover, the accuracy gain is more significant in the low-bit scenarios, arguably because richer information is provided by the teacher model, which uses longer codes. On the other hand, the amount of supervision saturates with the increasing number of compressed bits. We also tried to use full-precision models to serve as the teacher, but achieved only marginal gains.

4.4 Does Iteration Help?

Motivated by the consistent improvement from the teacher to the student, a question is straightforward: is it possible to further improve the performance if we continue distilling knowledge from the student, so as to guide a ‘new student’? We investigate this option, and results are summarized in Table 4. We find that, compared to the significant gain brought by the first round of knowledge distillation, the gain of the second round is mostly marginal.

We attribute this to the limited improvement of our student model in intra-modal learning – recall that we have used intra-modal similarities to choose relevant pairs. Unlike the cross-modal retrieval performance, the intra-modal retrieval accuracy is hardly improved from the teacher to the student. That is to say, the new batch of image-text pairs for either supervised or unsupervised learning does not have a clear advantage over the previous batch, and so the quality of training data mostly remains unchanged.

Task Method MIRFlickr-25K
16 32 64 128
GEN-0 0.676 0.693 0.702 0.706
GEN-1 0.695 0.703 0.705 0.707
GEN-2 0.698 0.705 0.708 0.712
GEN-0 0.676 0.692 0.703 0.707
GEN-1 0.704 0.707 0.715 0.714
GEN-2 0.705 0.712 0.716 0.719
Table 4: Results of training in generations for unsupervised student model on MIRFlickr-25K. ‘GEN-0’ and ‘GEN-1’ are identical to the UGACH and UKD-US models reported in Table 3, respectively.
Method Task MIRFlickr-25K
16 32 64 128
UKD-SS 0.711 0.704 0.711 0.720
0.692 0.702 0.705 0.706
Table 5: Results of using a 16-bit teacher to guide the supervised student model on MIRFlickr-25K.
Task Method MIRFlickr-25K
16 32 64
UGACH [45] 0.603 0.607 0.616
UCH [21] 0.654 0.669 0.679
UKD-US 0.667 0.674 0.677
UKD-SS 0.678 0.680 0.679
UGACH [45] 0.590 0.632 0.642
UCH [21] 0.661 0.667 0.668
UKD-US 0.676 0.683 0.680
UKD-SS 0.688 0.687 0.694
Table 6: Accuracy (mAP) comparison on MIRFlickr-25K, with UGACH and UCH as the baselines. To observe how a stronger teacher model teaches a weaker student model, we only report 16-bit, 32-bit and 64-bit results.

4.5 Diagnostic Experiments

Knowledge Distillation with a Weaker Teacher

In order to show that UKD can work under a relatively weaker teacher signal, we use a 16-bit model of UGACH [45] as the teacher. As shown in Table 5, we still achieve consistent accuracy gains beyond the baseline. However, the gain is reduced compared to using a longer-code teacher, since the benefit of UKD is mostly determined by the quality of the similarity matrix S, and a weaker teacher often leads to a weaker S, e.g., the precision of the top-ranked list of pairs is reduced.

Transferring to Other Features

To verify that our approach generalizes to other features, we apply it to UCH [21], a recently published unsupervised cross-modal hashing method, using features extracted from a pre-trained CNN-F model [4] (the same as in the original paper). Table 6 shows the comparison between UCH and our approach in terms of mAP values on MIRFlickr. Note that our baseline is still UGACH, with the features replaced, since the authors of UCH did not release their code. One can see that both UKD-US and UKD-SS outperform UGACH (and also UCH), and UKD-SS works better than UKD-US, i.e., the same phenomena we observed previously.

Figure 3: The mAP value with respect to the number of relevant pairs selected (tested on the MIRFlickr dataset; the teacher and student code lengths are fixed).
Figure 4: The top-ranked precision curves with respect to the number of relevant pairs selected (tested on the MIRFlickr dataset; the teacher code length is fixed).

Sensitivity to the Number of Selected Pairs

Next, we analyze how the performance of cross-modal hashing is related to the number of relevant pairs selected during the training process. In Figure 3, one can observe a trend of accuracy gain as the number of selected pairs increases, but when the number grows relatively large, the accuracy tends to saturate and even drops a little. This is related to the total number of relevant pairs in the dataset and, of course, the ability of the model to choose relevant pairs.

We also compare our approach with the baseline in terms of the precision of the top-ranked selected instance pairs. From Figure 4, we can see that UKD enjoys a significant advantage over UGACH, our direct baseline. Nevertheless, we see a rapid drop in precision as the number of selected pairs grows, implying that non-top-ranked pairs can introduce noise to the model. Again, this is a tradeoff between quantity and quality.

Qualitative Studies

Finally, we qualitatively compare the results of our approach and the baseline. Figure 5 shows two typical examples. The text query (dog) is relatively simple, but the original paired training set does not contain a sufficient amount of labeled data for the algorithm to learn the vision-language correspondence. This is compensated by the enlarged set found by the unsupervised teacher model. In comparison, the image query contains complicated semantics that are even more difficult to learn, but our model, by making use of image-level similarity, mines extra training data from other sources (see the examples in Figure 5, which are also related to these tags). Consequently, the prediction of our approach is much better.

Figure 5: Qualitative comparison (top: a text query with top- retrieved instances; bottom: an image query with top- retrieved instances) between our approach and UGACH (-bit hashing), our direct baseline. Red frames and words indicate relevant images or words in the retrieved results. Note that the image query is much more difficult, as it contains semantically complicated concepts which even require aesthetic perception to understand.

5 Conclusions

In this paper, we propose a novel approach to improve cross-modal hashing by guiding a supervised method with the outputs produced by an unsupervised method. We make use of teacher-student optimization for propagating knowledge: the supervised student model achieves superior performance by exploiting the extensive relevance information mined from the outputs of the unsupervised teacher model. We evaluate our approach on two benchmarks, MIRFlickr and NUS-WIDE, and the experimental results show that our method outperforms the state-of-the-art methods.

Acknowledgements This work was supported in part by the National Natural Science Foundation of China under grants 61722204 and 61932009, and in part by the National Key Research and Development Program of China under grants 2019YFA0706200 and 2018AAA0102002.


  • [1] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios (2010) Data fusion through cross-modality metric learning using similarity-sensitive hashing. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3594–3601. Cited by: §1, §1, §2.1, §4.1, Table 3.
  • [2] Q. Cai, Y. Pan, C. Ngo, X. Tian, L. Duan, and T. Yao (2019) Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11457–11466. Cited by: §3.4.
  • [3] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §3.4.
  • [4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531. Cited by: §4.5.
  • [5] Y. Chen, N. Wang, and Z. Zhang (2018) DarkRank: accelerating deep metric learning via cross sample similarities transfer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.4.
  • [6] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM international conference on image and video retrieval, pp. 48. Cited by: §4.1.
  • [7] C. Deng, Z. Chen, X. Liu, X. Gao, and D. Tao (2018) Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing 27 (8), pp. 3893–3903. Cited by: §1, §2.2.
  • [8] G. Ding, Y. Guo, J. Zhou, and Y. Gao (2016) Large-scale cross-modality search via collective matrix factorization hashing. IEEE Transactions on Image Processing 25 (11), pp. 5427–5440. Cited by: §4.1, Table 3.
  • [9] G. Ding, Y. Guo, and J. Zhou (2014) Collective matrix factorization hashing for multimodal data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2075–2082. Cited by: §2.3.
  • [10] F. Feng, X. Wang, and R. Li (2014) Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 7–16. Cited by: §1, §2.1.
  • [11] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. arXiv preprint arXiv:1805.04770. Cited by: §3.4.
  • [12] L. Gomez, Y. Patel, M. Rusiñol, D. Karatzas, and C. Jawahar (2017) Self-supervised learning of visual features through embedding images into text topic spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4230–4239. Cited by: §3.4.
  • [13] J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7181–7189. Cited by: §1, §2.3.
  • [14] J. Gu, J. Cai, G. Wang, and T. Chen (2018) Stack-captioning: coarse-to-fine learning for image captioning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.4.
  • [15] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.4.
  • [16] M. J. Huiskes and M. S. Lew (2008) The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM international conference on Multimedia information retrieval, pp. 39–43. Cited by: §4.1.
  • [17] Q. Jiang and W. Li (2017) Deep cross-modal hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3232–3240. Cited by: Table 1, §2.2, §3.3, §4.1, Table 3.
  • [18] R. Kiros, R. Salakhutdinov, and R. S. Zemel (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539. Cited by: §1, §2.1.
  • [19] S. Kumar and R. Udupa (2011) Learning hash functions for cross-view similarity search. In Twenty-Second International Joint Conference on Artificial Intelligence, Cited by: §1, §2.1, §4.1, Table 3.
  • [20] C. Li, C. Deng, N. Li, W. Liu, X. Gao, and D. Tao (2018) Self-supervised adversarial hashing networks for cross-modal retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4242–4251. Cited by: Table 1, §1, §2.2, §2.2, §3.3, §4.1, §4.2, Table 3.
  • [21] C. Li, C. Deng, L. Wang, D. Xie, and X. Liu (2019) Coupled CycleGAN: unsupervised hashing network for cross-modal retrieval. arXiv preprint arXiv:1903.02149. Cited by: Table 1, §1, §2.3, §4.5, Table 6.
  • [22] Z. Lin, G. Ding, M. Hu, and J. Wang (2015) Semantics-preserving hashing for cross-view retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3864–3872. Cited by: §1, §1, §2.2.
  • [23] H. Liu, R. Ji, Y. Wu, F. Huang, and B. Zhang (2017) Cross-modality binary code learning via fusion similarity hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7380–7388. Cited by: §1.
  • [24] W. Liu, C. Mu, S. Kumar, and S. Chang (2014) Discrete graph hashing. In Advances in neural information processing systems, pp. 3419–3427. Cited by: §2.1.
  • [25] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang (2012) Supervised hashing with kernels. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2074–2081. Cited by: §1, §1, §2.2.
  • [26] M. Long, Y. Cao, J. Wang, and P. S. Yu (2016) Composite correlation quantization for efficient multimodal retrieval. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 579–588. Cited by: §4.1, Table 3.
  • [27] W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976. Cited by: §3.4.
  • [28] M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. Davis (2013) Predictable dual-view hashing. In International Conference on Machine Learning, pp. 1328–1336. Cited by: §4.1, Table 3.
  • [29] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §3.4.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.1.
  • [31] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3, §4.1, §4.1.
  • [32] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 785–796. Cited by: §2.3.
  • [33] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §3.4.
  • [34] J. Wang, W. Liu, A. X. Sun, and Y. Jiang (2013) Learning hash codes with listwise supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3032–3039. Cited by: §1, §2.2.
  • [35] X. Wang, J. Hu, J. Lai, J. Zhang, and W. Zheng (2019) Progressive teacher-student learning for early action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3556–3565. Cited by: §3.4.
  • [36] L. Wu, Y. Wang, and L. Shao (2018) Cycle-consistent deep generative hashing for cross-modal retrieval. IEEE Transactions on Image Processing 28 (4), pp. 1602–1612. Cited by: §2.3.
  • [37] Y. Wu, S. Wang, and Q. Huang (2017) Online asymmetric similarity learning for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4269–4278. Cited by: §1, §2.1, §2.1.
  • [38] Y. Wu, S. Wang, and Q. Huang (2018) Learning semantic structure-preserved embeddings for cross-modal retrieval. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 825–833. Cited by: §2.2.
  • [39] X. Xu, L. He, H. Lu, L. Gao, and Y. Ji (2019) Deep adversarial metric learning for cross-modal retrieval. World Wide Web 22 (2), pp. 657–672. Cited by: §2.1, §2.2.
  • [40] C. Yang, L. Xie, S. Qiao, and A. L. Yuille (2019) Training deep neural networks in generations: a more tolerant teacher educates better students. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5628–5635. Cited by: §3.4.
  • [41] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141. Cited by: §3.4.
  • [42] L. Yu, V. O. Yazici, X. Liu, J. v. d. Weijer, Y. Cheng, and A. Ramisa (2019) Learning metrics from teachers: compact networks for image embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2907–2916. Cited by: §3.4.
  • [43] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §3.4.
  • [44] D. Zhang and W. Li (2014) Large-scale supervised multimodal hashing with semantic correlation maximization. In Twenty-Eighth AAAI Conference on Artificial Intelligence, Cited by: §1, §4.1, Table 3.
  • [45] J. Zhang, Y. Peng, and M. Yuan (2018) Unsupervised generative adversarial cross-modal hashing. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Table 1, §2.3, §3.3, Table 2, §4.1, §4.1, §4.1, §4.2, §4.5, Table 3, Table 6.
  • [46] T. Zhang and J. Wang (2016) Collaborative quantization for cross-modal similarity search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2036–2045. Cited by: §1, §2.1.
  • [47] X. Zhang, H. Lai, and J. Feng (2018) Attention-aware deep adversarial hashing for cross-modal retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 591–606. Cited by: §1, §2.2, §2.2.
  • [48] Y. Zhou, Y. Wang, P. Tang, S. Bai, W. Shen, E. Fishman, and A. Yuille (2019) Semi-supervised 3d abdominal multi-organ segmentation via deep multi-planar co-training. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 121–140. Cited by: §3.4.
  • [49] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille (2017) A fixed-point model for pancreas segmentation in abdominal ct scans. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 693–701. Cited by: §3.4.