Transductive Zero-Shot Hashing for Multi-Label Image Retrieval

11/17/2019 ∙ by Qin Zou, et al. ∙ University of South Carolina, Wuhan University, Sun Yat-sen University

Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Given semantic annotations such as class labels and pairwise similarities of the training data, hashing methods can learn and generate effective and compact binary codes. Because newly introduced images may carry semantic labels that were undefined at training time (we call these unseen images), zero-shot hashing techniques have been studied. However, existing zero-shot hashing methods focus on the retrieval of single-label images and cannot handle multi-label images. In this paper, for the first time, a novel transductive zero-shot hashing method is proposed for multi-label unseen image retrieval. To predict the labels of the unseen/target data, a visual-semantic bridge is built via instance-concept coherence ranking on the seen/source data. Then, a pairwise similarity loss and a focal quantization loss are constructed for training a hashing model using both the seen/source and unseen/target data. Extensive evaluations on three popular multi-label datasets demonstrate that the proposed hashing method achieves significantly better results than the competing methods.




I Introduction

Hashing methods can transform high-dimensional data into compact binary codes while preserving the similarity between them. With high computing efficiency and low storage cost, hashing methods have been widely used for large-scale image retrieval. A number of hashing methods have been proposed in the past decade [20, 29, 38, 37].

Existing hashing methods can be roughly divided into two categories: supervised [21, 53, 54] and unsupervised [28, 5, 14, 47, 8]. The supervised hashing methods incorporate human-annotated information, e.g., semantic labels and pairwise similarities, into the learning process to find an optimal hash function, while the unsupervised methods often learn hash functions by exploiting the intrinsic manifold structure of the unlabeled data. Generally, supervised methods can obtain much higher performance than the unsupervised ones.

Fig. 1: An illustration of transductive zero-shot hashing. In the learning procedure, both the source and target data are used for training the hashing model. The categories of the target data are unknown to the learning system.

In recent years, inspired by the remarkable success of deep neural networks in a broad range of computer-vision applications such as image classification, face recognition [41] and semantic segmentation [30], many supervised hashing methods have turned to deep neural networks for hash-code learning [48, 55, 56, 26, 3]. These deep hashing methods have greatly advanced the retrieval performance on several popular benchmark datasets.

However, with the rapid emergence of new products and new activities, images may contain concepts (or semantic labels) that were previously undefined. For example, various commercial robots with different shapes and appearances are released to the market every year, and new sports with novel playing scenes are invented from time to time around the world. The images containing these new products or playing scenes are ‘unseen’ compared to the ‘seen’ images holding pre-defined labels, and would have to be annotated with new labels for training supervised learners. Consequently, supervised hashing methods face tremendous challenges due to the lack of reliable ground-truth labels for the unseen images.

Zero-shot learning (ZSL) [23] is a technique that can potentially solve or alleviate this problem. Zero-shot learning bridges the semantic gap between ‘seen’ and ‘unseen’ categories by transferring supervised knowledge from other modalities or domains, e.g., class-attribute descriptors and word vectors. For instance, word embeddings learned from a large-scale text corpus such as Wikipedia place similar words close together in the embedding space, and thus capture the distributional similarity in the textual domain. Such knowledge transfer can be used to capture the relationship between seen and unseen concepts, and can help handle unseen images in supervised learning.

For image retrieval in the presence of unseen images, some zero-shot hashing (ZSH) methods [50, 13, 33, 22, 52] have also been proposed. Nevertheless, these methods focus on single-label image retrieval, in which a one-to-one visual-semantic representation pair suffices for training a hashing model. In more complicated scenarios, however, an image often contains multiple object classes, and hence more complex semantics and relationships. How to represent the complex visual-semantic relationships of multi-label images in a unified framework is a difficult problem. To the best of our knowledge, there is no existing work on multi-label zero-shot hashing.

Another important but easily ignored problem is that, since the underlying distributions of the source data and the target data are different, learning a hash function from a naive knowledge transfer on the source domain without making it adaptive to the target domain may lead to severe domain-shift problems.

Considering the problems discussed above, we propose a novel transductive zero-shot hashing method (T-MLZSH) for multi-label image retrieval. Both the labeled source and the unlabeled target data are used in the training phase. The labeled source data are used to learn the relationship between visual images and semantic embeddings, and the unlabeled data of target classes are used to alleviate the domain-shift problem. More specifically, we first study a visual-semantic bridge via instance-concept coherence ranking on the source data. In instance-concept coherence ranking, a relatedness score for an image instance and a semantic concept is calculated for each image in the source data, where the score of an instance with a relevant label is larger than that of the same instance with an irrelevant label. Based on the coherence ranking, the visual-semantic bridge is built. Then, we can generate predicted labels for target data, and use these predicted labels as supervised information to guide the learning of hashing models. Moreover, we propose a focal quantization loss for fast and efficient hashing learning.

The contributions of this work are three-fold:

  • A transductive zero-shot hashing method (T-MLZSH) is proposed to solve the domain-shift problem in multi-label image retrieval. To the best of our knowledge, it is the first work studying the zero-shot hashing for multi-label image retrieval.

  • An instance-concept coherence ranking algorithm is proposed for visual-semantic mapping, which can be used to predict the labels for unseen target data and hence improve the performance of zero-shot deep hashing.

  • The proposed method obtains very promising results on three popular multi-label datasets, which establishes a benchmark for zero-shot multi-label image retrieval and paves the way for new research in this field.

The rest of this paper is organized as follows. Section II briefly reviews the related work. Section III describes the proposed transductive multi-label zero-shot hashing method. Section IV demonstrates the effectiveness of the proposed method by experiments. Finally, Section V concludes the paper.

II Related Work

Hashing-based image retrieval. Hashing methods for image retrieval can be roughly divided into two categories: unsupervised and supervised. The unsupervised hashing methods generate hash codes without any semantic labels. They use clustering techniques or projection strategies to transfer visual information to a feature space, and generate an optimized hash function to preserve the similarity in Hamming space [14, 15, 8, 42, 46]. Some classical algorithms, such as SH [47], formulate hash encoding as a spectral graph-partitioning problem and learn a nonlinear mapping to preserve semantic similarity. Some other methods, e.g., SPQ [32] and multi-k-means [7], decompose the feature space into a Cartesian product of low-dimensional subspaces and encode high-dimensional feature vectors into binary codes by clustering-based subspace quantization. ECE [27] treats hashing as an optimization problem and combines genetic programming with boosting-based weight updating. DSTH [57] advocates discrete hash codes and resorts to semantic augmentation from auxiliary contextual modalities.

The supervised hashing methods use annotation information to learn compact hash codes, and usually perform better than the unsupervised methods. Among the supervised methods, CNN-based hashing methods have attracted more and more attention due to the powerful representation ability of deep neural networks [12]. According to the form of their input, the existing deep hashing methods can be divided into two kinds. The first takes image triplets as the input of the network and generates hash codes by minimizing a triplet ranking loss [21, 53]. These methods consume considerable computing resources and time to train the hashing model, since there is an enormous number of triplet combinations. The second takes mini-batches of images as input and uses the pairwise similarity between images as supervised information to learn the hashing network. Typical works of this form include DHN [56], DSH [26] and HashNet [3].

Zero-shot hashing. To handle images with unseen categories, some deep learning-based methods formulate hashing as an unsupervised problem [51, 39]. However, without reliable supervised information, it is difficult to achieve satisfactory performance. Some other methods [50, 13, 34], from a different perspective, consider it a zero-shot hashing (ZSH) problem. The goal of ZSH [49, 43] is to transfer the model trained on the seen data to unseen data via other available knowledge, such as word-vector representations. Since the underlying data distributions of the seen and unseen categories are different, hashing functions learned from the seen categories without any adaptation to the unseen categories may cause a domain-shift problem. To narrow the domain gap between seen and unseen data, ZSH-DA [33] first learns a zero-shot hashing model on seen data, and then learns the final hashing model with a domain-adaptation algorithm. In [22], a transductive zero-shot hashing network (TZSH) was proposed, which uses coarse-to-fine similarity mining to find the most representative target examples of each unseen label, and adds these representative examples and their predicted labels to the supervised hashing learning process.

Fig. 2: An overview of the proposed T-MLZSH network. At first, the model studies a visual-semantic bridge via instance-concept coherence ranking on source data. It calculates the relatedness scores between visual features and semantic word vectors, and optimizes based on assumptions that related instance-concept pairs should have larger relatedness scores. Given the learned instance-concept coherence relationship, the most relevant concepts for each image instance of unlabeled target data can be selected to guide the similarity-preserving learning. The whole network can be trained in an end-to-end form.

Multi-label zero-shot hashing. In real scenarios, an image often contains multiple labels, which brings more complex semantics and increases the difficulty of transferring knowledge from seen classes to unseen classes. Visual-semantic mapping is a widely used idea in the single-label case. Following this idea, a multi-label ZSL method was proposed in [10], which infers the meaning of multiple labels for one instance by summing the word-vector representations of its individual labels. Similarly, a weakly supervised deep hashing method was proposed in [11], which uses standard mean aggregation and weighted mean aggregation to obtain a collective representation of the tag words. However, such a collective representation is not the most appropriate expression, and inevitably brings information loss.

An alternative that uses direct visual-semantic mapping [36] tries to find the image area corresponding to each semantic label and extracts an object-level visual representation for visual-semantic mapping. This requires segmenting the image into meaningful subregions, each matched with one of the semantic labels. Such conditions require additional pixel-level annotations and make the model more difficult to train.

Co-occurrence information can also be used for semantic association, e.g., COSTA [31]. For zero-shot learning, COSTA constructs a linear projection matrix between the seen and unseen labels by statistical learning on annotated datasets. As the models of the seen labels are trained independently, this method does not take the relationships and coherence among seen labels into consideration, which may undermine the effect of multi-label zero-shot learning. In recent years, graph convolutional networks (GCNs) [6, 18] have attracted the attention of researchers due to their effectiveness in learning the characteristic and structural information of graph-structured data. Taking the semantic embeddings of labels as nodes, a knowledge graph can be built and the label correlation can be modeled with stacked graph convolution operations [45, 40].

In this work, we build a visual-semantic bridge via instance-concept coherence ranking, which considers the ranking relationship between the relevant labels and irrelevant labels, and learns one-to-many visual-semantic representation. Moreover, we conduct transductive learning by using the visual images from both seen labels and unseen labels, which reduces the domain gap between source and target data.

III Transductive Multi-Label Zero-Shot Hashing

III-A Problem Definition

Suppose $D^s=\{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ is a labeled source dataset including $N_s$ images, where $x_i^s$ is an image and $y_i^s$ is the corresponding label annotated with one or more classes, and $D^t=\{x_j^t\}_{j=1}^{N_t}$ is an unlabeled target dataset, which includes $N_t$ images of the unseen target classes whose labels are unknown. In the zero-shot setting, the target and source classes are two mutually exclusive label sets, i.e., $Y^s \cap Y^t=\emptyset$. For hash-code learning, we construct the similarity matrix $S=\{s_{ij}\}$, $i,j=1,\dots,N$, $N=N_s+N_t$, where $s_{ij}=1$ denotes that the images $x_i$ and $x_j$ are similar and $s_{ij}=0$ denotes that they are dissimilar. The goal of T-MLZSH is to learn a mapping $H: x \mapsto h \in \{-1,1\}^K$ on the labeled source dataset and the unlabeled target dataset to encode an input image $x$ into a $K$-bit binary code $h$, with the pairwise similarity preserved.

Figure 2 gives a flowchart of the proposed method. The input images first go through a deep network with stacked convolutional and fully-connected layers and are encoded as a high-dimensional feature representation. Then, the outputs of the last fully-connected layer are fed into a hashing layer for compact binary encoding. To transfer knowledge from seen categories to unseen categories and to construct the bridge between the visual and semantic modalities, we add a fully-connected layer after the hashing layer, which maps features from the Hamming space to the common embedding space.

III-B Instance-Concept Coherence Ranking

Since there is no label information for the target images, we first predict labels for these images by transferring knowledge from the semantic representations to the visual features, before learning a supervised hashing function. Let $v_i$ be the visual embedding of the $i$-th image instance and $e_j$ be the semantic embedding of the $j$-th semantic concept; then we can calculate the relatedness score between $v_i$ and the $j$-th semantic concept in the embedding space:

$$r_{ij} = \langle v_i, e_j \rangle, \tag{1}$$

where $\langle \cdot,\cdot \rangle$ is the inner product operation. The semantic embeddings can be obtained from existing word-vector models, and the visual embeddings are variables to be learned. During training, we obtain a score list over the source labels, $\{r_{ij}\}_{j=1}^{C_s}$, where $C_s$ is the number of seen labels. The goal of our embedding model is to learn a mapping function such that the score of an instance with a relevant label is higher than its score with an irrelevant one, as illustrated in Fig. 3. Inspired by [44], we adopt a RankNet loss function to learn the ranking relationships for the $i$-th instance:



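$$L_r = \sum_{i}\Big[\sum_{p\in P_i}\sum_{q\in N_i}\log\big(1+\exp(r_{iq}-r_{ip})\big) + \lambda\sum_{j}\log\big(1+\exp(-\delta_{ij}\,r_{ij})\big)\Big], \tag{2}$$

where $r_{ij}$ is the relatedness score of Eq. (1), and $P_i$ and $N_i$ denote the sets of labels relevant and irrelevant to the $i$-th instance. $\delta_{ij}$ is an indicator function, where $\delta_{ij}=1$ indicates that the $i$-th instance is related to the $j$-th label and $\delta_{ij}=-1$ indicates that it is unrelated to the $j$-th label. $\lambda$ plays a regularization role. In the bracket of Eq. (2), the first term penalizes the situation in which labels irrelevant to the $i$-th instance are ranked higher than the relevant ones. The second term enlarges the relatedness scores of the relevant pairs and reduces those of the irrelevant pairs.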

Based on the model trained above, the pairwise relatedness scores between the visual embeddings of the target images and the semantic embeddings of the target classes can be calculated. We rank the scores $\{r_{ij}\}_{j=1}^{C_t}$ ($C_t$ is the number of unseen labels) in descending order, and select the classes with the top-$k$ highest scores as the predicted target labels.
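The scoring, ranking loss and top-k label prediction described in this subsection can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ±1 encoding of the relevance indicator, the exact form of the two loss terms and the default weight `lam` are all assumptions chosen to match the verbal description.

```python
import numpy as np

def relatedness_scores(visual, semantic):
    """Inner-product relatedness between every image and every concept.

    visual:   (n_images, d) visual embeddings
    semantic: (n_labels, d) word-vector embeddings of the concepts
    returns:  (n_images, n_labels) score matrix
    """
    return visual @ semantic.T

def coherence_ranking_loss(scores, relevant, lam=1.0):
    """RankNet-style multi-label ranking loss (assumed form).

    scores:   (n_images, n_labels) relatedness scores
    relevant: (n_images, n_labels) boolean mask of relevant labels
    lam:      weight of the second term (hypothetical default)
    """
    total = 0.0
    for r, rel in zip(scores, relevant):
        pos, neg = r[rel], r[~rel]
        # First term: penalize irrelevant labels scored above relevant ones.
        pairwise = np.logaddexp(0.0, neg[None, :] - pos[:, None]).sum()
        # Second term: push relevant scores up and irrelevant scores down.
        delta = np.where(rel, 1.0, -1.0)
        pointwise = np.logaddexp(0.0, -delta * r).sum()
        total += pairwise + lam * pointwise
    return total

def predict_top_k(scores, k):
    """Indices of the k highest-scoring concepts per image, best first."""
    return np.argsort(-scores, axis=1)[:, :k]
```

For target images, `relatedness_scores` would be evaluated against the word vectors of the unseen classes and `predict_top_k` would supply the pseudo-labels used in the hashing stage.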

III-C Hash Code Learning

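For efficient nearest neighbor search, the semantic similarity of the original images should be preserved in the Hamming space. Generally, the similarity relationships can be defined with image labels. For a multi-label dataset, two images are considered similar if they share at least one label, and dissimilar otherwise. Let $H=\{h_i\}_{i=1}^{N}$ be the set of hash codes for all images and $S=\{s_{ij}\}$ be the pairwise similarity matrix; then the conditional probability of $s_{ij}$ can be defined as

$$p(s_{ij}\mid h_i,h_j)=\begin{cases}\sigma(\langle h_i,h_j\rangle), & s_{ij}=1,\\ 1-\sigma(\langle h_i,h_j\rangle), & s_{ij}=0,\end{cases} \tag{3}$$

where $\langle h_i,h_j\rangle$ is the inner product of hash codes $h_i$ and $h_j$, and $\sigma(\cdot)$ is the sigmoid function, which scales the inner product value to within [0, 1].

We adopt the negative log-likelihood as the cost function to measure the pairwise similarity loss, as formulated by Eq. (4),

$$L_s=-\sum_{s_{ij}\in S}\log p(s_{ij}\mid h_i,h_j)=-\sum_{s_{ij}\in S}\Big[s_{ij}\log\sigma(\langle h_i,h_j\rangle)+(1-s_{ij})\log\big(1-\sigma(\langle h_i,h_j\rangle)\big)\Big]. \tag{4}$$

As $\sigma(x)=1/(1+e^{-x})$, Eq. (4) can be rewritten as

$$L_s=\sum_{s_{ij}\in S}\Big[\log\big(1+\exp(\langle h_i,h_j\rangle)\big)-s_{ij}\langle h_i,h_j\rangle\Big]. \tag{5}$$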
Fig. 3: An illustration of the embedding model. It learns a mapping function under which the score with a relevant label is higher than that with an irrelevant one.

It is very challenging to directly optimize this discrete optimization problem, as the binary constraint $h_i\in\{-1,1\}^K$ requires thresholding the network outputs, which may result in the vanishing-gradient problem in backpropagation. We adopt the continuous relaxation strategy [26, 56] to solve this problem. The output of the deep hashing layer, $z_i$, is fed to a tanh function, $u_i=\tanh(z_i)$, which is used as a substitute for the binary code $h_i$. Thus, $\langle h_i,h_j\rangle$ is redefined as $\langle u_i,u_j\rangle$.
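As a minimal sketch of the relaxed pairwise similarity loss, the NumPy code below builds the similarity matrix from multi-hot labels (two images are similar if they share at least one label) and evaluates the negative log-likelihood with tanh outputs in place of binary codes. The function names and the choice to sum over all ordered pairs, including the diagonal, are assumptions made for illustration.

```python
import numpy as np

def similarity_matrix(labels):
    """s_ij = 1 if images i and j share at least one label, else 0.

    labels: (n_images, n_classes) multi-hot array
    """
    shared = labels @ labels.T
    return (shared > 0).astype(np.float64)

def pairwise_similarity_loss(z, sim):
    """Negative log-likelihood of the similarity matrix under relaxed codes.

    z:   (n_images, n_bits) raw outputs of the hashing layer
    sim: (n_images, n_images) binary similarity matrix
    """
    u = np.tanh(z)      # continuous stand-in for the binary codes
    inner = u @ u.T     # inner products <u_i, u_j> for all pairs
    # log(1 + exp(x)) - s * x, with the log term computed stably
    return np.sum(np.logaddexp(0.0, inner) - sim * inner)
```

Using `np.logaddexp(0, x)` avoids overflow in `exp` when the inner products grow with the code length.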

For more efficient and faster hash learning, we design a focal quantization loss to mitigate the divergence between the discrete binary codes and the continuous outputs of the hashing network, inspired by [25]. Since the accumulated gradients of a large number of easy samples are not helpful for training, the focal loss reduces the weights of easy examples to speed up training. First, we convert the binary-code quantization problem into a binary classification problem. We use a sigmoid activation to map the outputs $z$ of the hash layer into a probability distribution $p=\sigma(z)$. Notice that tanh and sigmoid are both monotonically increasing functions that follow the same trend, i.e., when $\tanh(z)$ asymptotically approaches $-1$, $\sigma(z)$ approaches 0, and when $\tanh(z)$ approaches 1, $\sigma(z)$ also approaches 1. Thus the probability of the binary classification can effectively reflect the compactness of the hash codes.

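The focal quantization loss is defined as

$$L_q=-\sum_{i}\sum_{k=1}^{K}\Big[y_{ik}\,(1-p_{ik})^{\gamma}\log p_{ik}+(1-y_{ik})\,p_{ik}^{\gamma}\log(1-p_{ik})\Big], \tag{6}$$

where $p_{ik}=\sigma(z_{ik})$ is the probability for the $k$-th output of the hash layer for the $i$-th image, $\gamma$ is the focusing parameter of the focal loss [25], and $y_{ik}$ is a label indicator that indicates which class (0 or 1) the output of the hash layer should be classified as. We adopt a weighted sigmoid function to achieve this effect, i.e., $y_{ik}=\sigma(\beta z_{ik})$, where $\beta$ is a parameter far greater than 1.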
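The idea of treating quantization as a focal-weighted binary classification can be sketched as follows. The code is an illustration only: the focal form follows the standard focal loss of [25], while the function name, the default values of `beta` and `gamma`, and the averaging over bits are assumptions rather than settings taken from the paper.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def focal_quantization_loss(z, beta=50.0, gamma=2.0, eps=1e-12):
    """Focal-style quantization loss on hash-layer outputs (assumed form).

    z:     (n_images, n_bits) raw outputs of the hashing layer
    beta:  steepness of the weighted sigmoid (hypothetical value, >> 1)
    gamma: focusing parameter of the focal loss (hypothetical value)
    """
    p = sigmoid(z)          # probability that a bit belongs to class 1
    y = sigmoid(beta * z)   # weighted sigmoid: near-binary target per bit
    loss = -(y * (1.0 - p) ** gamma * np.log(p + eps)
             + (1.0 - y) * p ** gamma * np.log(1.0 - p + eps))
    return loss.mean()
```

Outputs that are already close to saturation contribute almost nothing, so the gradient is dominated by bits that are still ambiguous. In a full training loop this term would be added to the pairwise similarity loss with a weighting that is not specified here.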

By integrating the pairwise similarity loss and focal quantization loss, the overall hashing loss can be defined as


IV Experiments

IV-A Datasets

To verify the performance of the proposed method, we compare it with several baselines on three widely used multi-label image datasets.

NUS-WIDE [4] is a dataset containing 269,648 public web images. Each image is annotated with one or more class labels from a total of 81 classes. There exists a widely used subset of images associated with the 21 most common labels and each label associated with at least 5,000 images, resulting in a total of 195,834 images.

VOC2012 [9] is a widely used dataset for object detection and segmentation, which contains 17,125 images, and each image is associated with at least one of the 20 semantic labels.

COCO [24] is a dataset for object detection, stuff segmentation and semantic scene labeling, which contains 82,783 training images and 40,504 testing images. Each image is associated with one or more of the 90 categories and has 5 captions.

IV-B Implementation Details

To construct a zero-shot scenario, we need to further split the datasets. Considering that there are complex semantic relationships among these multi-label datasets, we use one image dataset as the source dataset and another one as the target dataset. For example, we can set NUS-WIDE as the source data and VOC2012 as the target data, and vice versa. Before training, the data have to be preprocessed. Without loss of generality, we set up two experiments: one between NUS-WIDE and VOC2012, and the other between NUS-WIDE and COCO.

IV-B1 Experiment between NUS-WIDE and VOC2012

In NUS-WIDE, we remove the concepts shared by the two datasets and the related images, because NUS-WIDE contains many more images than VOC2012. In VOC2012, we remove several ambiguous concepts and the related images. These data-cleaning operations result in a subset of NUS-WIDE containing 106,389 images and 18 labels, and a subset of VOC2012 containing 16,750 images and 17 labels. For NUS-WIDE, we randomly select 10,000 images to form the training set and 2,000 images to form the test query set, and use the others as the retrieval database. For VOC2012, we randomly select 4,000 images as the training set and 1,000 images as the test query set, and use the others as the retrieval database.

IV-B2 Experiment between NUS-WIDE and COCO

Because the images in COCO that contain the common classes account for a large proportion of the dataset, we remove the common concepts and related images from NUS-WIDE and keep COCO unchanged. After this data cleaning, a subset of NUS-WIDE containing 100,303 images and 17 labels and a subset of COCO containing 123,274 images and 80 labels are prepared for the following experiments. The two datasets are divided into a training set, a test query set and a retrieval database in the proportion 14:1:5. During training, we randomly select 10,000 images from the training set.

We implement the proposed method (T-MLZSH) using the TensorFlow toolkit [1]. In this paper, we use AlexNet [19] as the basic CNN. We use the pre-trained model to initialize the network weights, and focus on training the hashing layer and the embedding layer. The Adam method is adopted for stochastic optimization with a mini-batch size of 128, and all input images are resized to 227×227.

We compare our method (T-MLZSH) with several state-of-the-art hashing methods, including KSH [29], SDH [37], IMH [38], DHN [56], ZSH-DA [33], ZSH [50] and TZSH [22]. Among these comparison methods, KSH and SDH are two typical supervised methods, IMH is one of the most representative unsupervised hashing methods, DHN is a deep learning-based supervised method, ZSH-DA and ZSH are two zero-shot hashing methods, and TZSH is a transductive zero-shot hashing method.

For deep learning-based methods, we use the raw images as input. For the non-CNN approaches, we use the outputs of the fc7 layer of AlexNet as visual features. The semantic representations are obtained from the language model [35], where each category is embedded into a 300-d word vector. For the zero-shot hashing methods that need a one-to-one semantic representation for each image, we follow the existing work [10] and use the average of the semantic representations of the multiple labels as the collective semantic representation of a multi-label image.
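The mean aggregation used for the single-label baselines amounts to averaging the word vectors of an image's labels. A minimal sketch follows; the function name, the random stand-in vectors and the toy label assignment are hypothetical.

```python
import numpy as np

def collective_embedding(word_vectors, label_ids):
    """Mean of the word vectors of all labels attached to one image."""
    return word_vectors[label_ids].mean(axis=0)

# Toy vocabulary of 4 concepts with 300-d vectors
# (random stand-ins for real word vectors).
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(4, 300))
image_labels = [0, 2]   # hypothetical image tagged with concepts 0 and 2
embedding = collective_embedding(word_vectors, image_labels)
```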

Fig. 4: Performance of using different numbers of predicted labels for the target data with hash codes of 12, 24, 36 and 48 bits, respectively. (a) NUS-WIDE → VOC2012. (b) VOC2012 → NUS-WIDE.


Method         NUS-WIDE → VOC2012                  VOC2012 → NUS-WIDE
               12-bit  24-bit  36-bit  48-bit      12-bit  24-bit  36-bit  48-bit
KSH [29]       0.4033  0.4079  0.4153  0.4181      0.5476  0.5510  0.5602  0.5670
SDH [37]       0.4097  0.4010  0.4087  0.4095      0.5327  0.5345  0.5368  0.5395
IMH [38]       0.4083  0.4346  0.4320  0.4297      0.5616  0.5723  0.5718  0.5710
DHN [56]       0.4171  0.4282  0.4362  0.4395      0.5664  0.5739  0.5726  0.5688
ZSH-DA [33]    0.3592  0.3618  0.3770  0.3596      0.5132  0.5166  0.5212  0.5191
ZSH [50]       0.3968  0.4055  0.4111  0.4296      0.5340  0.5566  0.5589  0.5500
TZSH [22]      0.4413  0.4683  0.4644  0.4753      0.5736  0.5805  0.5919  0.5896
T-MLZSH        0.4808  0.4884  0.4894  0.5037      0.6106  0.6131  0.6149  0.6200

TABLE I: Results of MAP for different numbers of bits between NUS-WIDE and VOC2012.

IV-C Metrics

We evaluate the image retrieval quality with four widely used metrics: Average Cumulative Gains (ACG) [16], Normalized Discounted Cumulative Gains (NDCG) [17], Mean Average Precision (MAP) [2] and Weighted Mean Average Precision (WAP) [55].

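MAP is the mean of the average precision over all queries, which can be calculated by

$$\mathrm{MAP}=\frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{N_{Rel}(n)}\sum_{i=1}^{n}\frac{N_{Rel}(i)}{i}\,\mathbb{I}(q,i),$$

where $\mathbb{I}(q,i)$ is an indicator function: $\mathbb{I}(q,i)=1$ if the query image $x_q$ and the $i$-th retrieved image $x_i$ share class labels, and $\mathbb{I}(q,i)=0$ otherwise. $Q$ is the number of queries and $N_{Rel}(i)$ indicates the number of relevant images w.r.t. the query image within the top $i$ images.

ACG represents the average number of shared labels between the query image and the top $n$ retrieved images. For a given query image $x_q$, the ACG score of the top $n$ retrieved images is calculated by

$$\mathrm{ACG}@n=\frac{1}{n}\sum_{i=1}^{n}C(q,i),$$

where $n$ denotes the number of top retrieved images and $C(q,i)$ is the number of shared class labels between $x_q$ and $x_i$.

NDCG is a popular evaluation metric in information retrieval. Given a query image $x_q$, the DCG score of the top $n$ retrieved images is defined as

$$\mathrm{DCG}@n=\sum_{i=1}^{n}\frac{2^{C(q,i)}-1}{\log(1+i)}.$$

Then, the normalized DCG (NDCG) score at position $n$ can be calculated by $\mathrm{NDCG}@n=\mathrm{DCG}@n/Z_n$, where $Z_n$ is the maximum value of $\mathrm{DCG}@n$, which constrains the value of NDCG to the range [0, 1].

WAP is similar to MAP; the only difference is that WAP averages the ACG scores at each top retrieved image rather than the precision. WAP can be calculated by

$$\mathrm{WAP}=\frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{N_{Rel}(n)}\sum_{i=1}^{n}\mathbb{I}(q,i)\,\mathrm{ACG}@i.$$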
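The four metrics can be computed for a single query from the ranked list of shared-label counts, as in the NumPy sketch below. It is an illustration under assumptions: an image counts as relevant when it shares at least one label with the query, the natural logarithm is used in DCG, and the normalizer is taken from the ideal reordering of the retrieved list rather than of the whole database. MAP and WAP are then the means of the per-query `AP` and `WAP` values.

```python
import numpy as np

def retrieval_metrics(shared_counts):
    """AP, ACG, NDCG and weighted AP for one query.

    shared_counts: number of labels each retrieved image shares with the
                   query, in ranked order (best match first).
    """
    c = np.asarray(shared_counts, dtype=np.float64)
    n = len(c)
    ranks = np.arange(1, n + 1)
    relevant = (c > 0).astype(np.float64)   # shares at least one label
    n_rel = relevant.sum()

    precision_at_i = np.cumsum(relevant) / ranks
    acg_at_i = np.cumsum(c) / ranks
    ap = (precision_at_i * relevant).sum() / n_rel if n_rel else 0.0
    wap = (acg_at_i * relevant).sum() / n_rel if n_rel else 0.0

    gains = (2.0 ** c - 1.0) / np.log(1.0 + ranks)
    ideal = (2.0 ** np.sort(c)[::-1] - 1.0) / np.log(1.0 + ranks)
    ndcg = gains.sum() / ideal.sum() if ideal.sum() else 0.0
    return {"AP": ap, "ACG": acg_at_i[-1], "NDCG": ndcg, "WAP": wap}
```

For example, a ranked list with shared-label counts `[2, 0, 1]` has precision 1 at rank 1 and 2/3 at rank 3, so its average precision is 5/6, while its ACG over the three results is 1.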

Fig. 5: Comparison of performance with different metrics on 12-, 24-, 36- and 48-bit hash codes. (a) NUS-WIDE → VOC2012. (b) VOC2012 → NUS-WIDE.

IV-D Overall Performance

In this part, we analyze the retrieval results, all of which are evaluated on the unseen target data. Figure 4 displays the results of using different numbers of predicted labels on the target data. Top-$k$ indicates that the first $k$ categories in the correlation-score ranking list are used as predicted labels, and top-0 means that the labels of the target data are set to an all-zero vector.

Fig. 6: Performance comparison on NUS-WIDE → VOC2012. The VOC2012 dataset is unseen. From top to bottom, there are ACG, NDCG and precision curves w.r.t. different numbers of top returned samples with hash codes of 12, 24, 36 and 48 bits, respectively.

Fig. 7: Performance comparison on VOC2012 → NUS-WIDE. The NUS-WIDE dataset is unseen. From top to bottom, there are ACG, NDCG and precision curves w.r.t. different numbers of top returned samples with hash codes of 12, 24, 36 and 48 bits, respectively.

IV-D1 Results on NUS-WIDE and VOC2012

From Fig. 4 (a) and (b) we can see that, when VOC2012 is set as the target data, using the top-1 predicted labels achieves the best performance. A possible reason is that the average number of objects per image in VOC2012 is relatively small, so using more predicted labels for supervised hashing learning inevitably introduces misleading information and degrades performance. When NUS-WIDE is set as the target data, the best results are obtained by using the top-3 predicted labels. In the following experiments, we use the top-1 and top-3 predicted labels as supervised information for the proposed method by default for VOC2012 and NUS-WIDE, respectively.

The MAP results of the proposed method and other state-of-the-art methods are shown in Table I. It can be seen that the proposed method outperforms the comparison methods significantly on both target datasets. The transductive zero-shot hashing methods, i.e., TZSH and the proposed T-MLZSH, achieve higher MAP values than other methods, as expected. Compared to TZSH, T-MLZSH achieves increments of about 3.1% and 2.8% in average MAP for different bits on NUS-WIDE and VOC2012, respectively. The possible reason is that TZSH adopts a strategy only utilizing partially-selected target data for hashing learning, which limits its performance. It is interesting that both the deep supervised hashing method DHN and unsupervised method IMH outperform traditional supervised hashing methods on these zero-shot hashing problems. The two zero-shot hashing methods ZSH and ZSH-DA achieve relatively low performance on the two multi-label datasets, which indicates that the complex semantics of multi-label images are too hard to be modelled by learning a one-to-one semantic representation.
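The reported average-MAP increments over TZSH can be checked directly from the Table I values. The short script below assumes that the first four value columns of the table correspond to VOC2012 as the unseen target and the last four to NUS-WIDE as the unseen target; the resulting gains match the 2.8% and 3.1% quoted above under that reading.

```python
import numpy as np

# MAP values copied from Table I (12-, 24-, 36- and 48-bit codes).
tzsh = {"VOC2012": [0.4413, 0.4683, 0.4644, 0.4753],
        "NUS-WIDE": [0.5736, 0.5805, 0.5919, 0.5896]}
t_mlzsh = {"VOC2012": [0.4808, 0.4884, 0.4894, 0.5037],
           "NUS-WIDE": [0.6106, 0.6131, 0.6149, 0.6200]}

# Difference of the bit-averaged MAP, expressed in percentage points.
gains = {target: 100 * (np.mean(t_mlzsh[target]) - np.mean(tzsh[target]))
         for target in tzsh}
for target, gain in gains.items():
    print(f"unseen {target}: +{gain:.1f} points in average MAP")
```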

More results on the other three metrics, i.e., ACG, NDCG and WAP, are presented in Figure 5. Figure 6 and Figure 7 show more detailed comparisons of the ACG, NDCG and precision curves for different numbers of top returned images with 12, 24, 36 and 48 bits, on the unseen VOC2012 dataset using the seen NUS-WIDE dataset and vice versa, respectively. By definition, these metrics give a fairer evaluation on multi-label images, as the numbers of shared labels between images are considered. As we can see, T-MLZSH consistently outperforms all competitors across the different metrics and hash-code lengths.

IV-D2 Results on NUS-WIDE and COCO

Figure 4 (c) and (d) display the results of using different numbers of predicted labels on NUS-WIDE and COCO. We can see that using the top-3 predicted labels achieves the best performance when either of these two datasets is set as the target data. One possible reason is that the average numbers of objects per image in NUS-WIDE and COCO are 2.48 and 2.97, respectively, both of which are close to 3. In the following experiments, we use the top-3 predicted labels as supervised information for the proposed method by default for COCO and NUS-WIDE.


Method         NUS-WIDE → COCO                     COCO → NUS-WIDE
               12-bit  24-bit  36-bit  48-bit      12-bit  24-bit  36-bit  48-bit
KSH [29]       0.3948  0.4069  0.4113  0.4143      0.5948  0.6167  0.6191  0.6224
SDH [37]       0.3782  0.3917  0.3971  0.4050      0.5681  0.5954  0.6102  0.6051
IMH [38]       0.3905  0.4021  0.4114  0.4188      0.5983  0.5961  0.6098  0.6132
DHN [56]       0.4250  0.4325  0.4529  0.4487      0.6177  0.6421  0.6466  0.6559
ZSH-DA [33]    0.3597  0.3592  0.3772  0.3744      0.5256  0.5220  0.5230  0.5247
ZSH [50]       0.3832  0.4091  0.4109  0.4286      0.5708  0.5727  0.5753  0.5782
TZSH [22]      0.4436  0.4585  0.4660  0.4800      0.5933  0.6368  0.6070  0.6336
T-MLZSH        0.4724  0.4941  0.5090  0.5124      0.6374  0.6425  0.6510  0.6693

TABLE II: Results of MAP for different numbers of bits between NUS-WIDE and COCO.

The MAP results of the proposed method and the other methods on NUS-WIDE and COCO are shown in Table II. Although the datasets differ from the former experiment, the overall experimental results are consistent. The proposed T-MLZSH again obtains the highest MAP values among the compared methods. When the unseen dataset is COCO or NUS-WIDE, T-MLZSH achieves increments over TZSH of about 3.4% or 3.2% in average MAP over the different code lengths, respectively. Unexpectedly, however, the deep supervised hashing method DHN performs better than TZSH when the unseen dataset is NUS-WIDE. A possible reason is that COCO is divided into more categories, so DHN obtains more detailed supervised information when COCO is the training dataset. Even compared to DHN, T-MLZSH still gains about 0.95% in average MAP. In general, the results are similar to those between NUS-WIDE and VOC2012. Because of the complex semantics of multi-label images, the two zero-shot hashing methods ZSH and ZSH-DA do not achieve high performance.

Figure 8 shows the results of ACG, NDCG and WAP. Figure 11 and Figure 12 show more detailed comparisons of the ACG, NDCG and precision curves for different numbers of top returned images with 12, 24, 36 and 48 bits, with COCO and NUS-WIDE as the unseen dataset, respectively. Although the datasets are changed, T-MLZSH consistently outperforms all other competing methods across the different metrics and hash code lengths.
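For readers unfamiliar with these ranking metrics, the following sketch computes ACG, NDCG and WAP for one ranked list. It assumes the definitions commonly used in multi-label hashing, where the relevance of a returned image is the number of labels it shares with the query; the exact definitions used in this paper are given in Section IV-C and may differ in detail. All function names and the toy labels are hypothetical.

```python
import math


def relevance(query_labels, item_labels):
    """Relevance of an item: number of labels shared with the query (assumed)."""
    return len(set(query_labels) & set(item_labels))


def acg_at_n(rels, n):
    """Average cumulative gain: mean relevance of the top-n items."""
    top = rels[:n]
    return sum(top) / len(top) if top else 0.0


def dcg_at_n(rels, n):
    """Discounted cumulative gain with gain 2^r - 1 and a log2 position discount."""
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:n], start=1))


def ndcg_at_n(rels, n):
    """DCG normalised by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_n(sorted(rels, reverse=True), n)
    return dcg_at_n(rels, n) / ideal if ideal > 0 else 0.0


def wap_at_n(rels, n):
    """Weighted average precision: mean of ACG@i over positions i holding a relevant item."""
    hits = [acg_at_n(rels, i) for i, r in enumerate(rels[:n], start=1) if r > 0]
    return sum(hits) / len(hits) if hits else 0.0


# Toy example: a query with three labels and four ranked database items.
query = {"building", "sky", "water"}
ranked_items = [{"building", "sky"}, {"forest"}, {"building", "sky", "water"}, {"water"}]
rels = [relevance(query, item) for item in ranked_items]  # [2, 0, 3, 1]

print(acg_at_n(rels, 4), ndcg_at_n(rels, 4), wap_at_n(rels, 4))
```

Because NDCG is a ratio of two discounted sums, the base of the logarithm cancels out; a ranking that already lists items in descending relevance obtains an NDCG of 1.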

Fig. 8: Comparison of performance with different metrics on 12-, 24-, 36- and 48-bit hash codes.

IV-E Parameter Analysis

IV-E1 Influence of the categories of datasets

Among the above two groups of experiments, some used NUS-WIDE as the unseen dataset, and the results are shown in Table I. From the 3rd and 5th column groups of the table, we notice that using COCO as the seen dataset achieves better MAP results, with improvements of about 2.6%, 2.9%, 3.6% and 4.9% in average MAP for the different numbers of hash bits, respectively. These two groups of experiments share the same target domain and differ only in the source domain. We conjecture that the difference in MAP stems from the difference in categories: the COCO dataset is more finely divided, so more semantic information can be used, which makes the network stronger.

(a) NUS-WIDE → VOC2012
(b) VOC2012 → NUS-WIDE
Fig. 9: The influence of the quantization loss.
Fig. 10: Influence of the size of the dataset.

IV-E2 Necessity of quantization loss

We also examine the effectiveness of the proposed quantization loss. We compare the proposed method with two variants: one adopts the widely used absolute error loss, which directly measures the Euclidean distance between the continuous outputs and the discrete codes, and the other uses no quantization loss. The results are presented in Figure 9. Without a quantization loss, the performance degrades clearly: the difference in MAP is about 0.5%, which illustrates the importance of a quantization loss in deep hashing. Applying either quantization loss improves the results, and the proposed focal quantization loss leads to a significantly higher performance than the other variants.
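To make the role of a quantization loss concrete, the sketch below implements a generic absolute-error quantization penalty of the kind used as the baseline variant. It is an illustrative assumption, not the implementation used in the experiments, and it does not reproduce the focal quantization loss, which is defined in Section III-C. The function names and sample values are hypothetical.

```python
def sign(x):
    """Binarise one continuous output to +1 or -1 (zero is mapped to +1)."""
    return 1.0 if x >= 0 else -1.0


def absolute_quantization_loss(outputs):
    """Mean absolute gap between continuous outputs and their binary codes.

    For b = sign(h), the per-bit gap |h - b| equals ||h| - 1|, so the loss is
    zero only when every output already lies exactly on +1 or -1.
    """
    return sum(abs(h - sign(h)) for h in outputs) / len(outputs)


# Hypothetical network outputs for one image with a 4-bit code.
near_binary = [0.98, -0.95, 0.99, -1.02]   # binarisation loses little information
far_from_binary = [0.10, -0.20, 0.05, 0.30]  # binarisation changes the values a lot

print(absolute_quantization_loss(near_binary))
print(absolute_quantization_loss(far_from_binary))
```

Outputs close to ±1 incur a small penalty, whereas outputs near zero incur a large one; minimising such a term pushes the continuous outputs toward the binary codes that are used at retrieval time.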

Fig. 11: Performance comparison on NUS-WIDE → COCO. The COCO dataset is unseen. From top to bottom, there are ACG, NDCG and precision curves w.r.t. different top returned samples with hash codes of 12, 24, 36 and 48 bits, respectively.

Fig. 12: Performance comparison on COCO → NUS-WIDE. The NUS-WIDE dataset is unseen. From top to bottom, there are ACG, NDCG and precision curves w.r.t. different top returned samples with hash codes of 12, 24, 36 and 48 bits, respectively.

IV-E3 Influence of the size of datasets

Moreover, we explore the influence of the sizes of the source and target datasets by training the model with different numbers of ‘seen’ and ‘unseen’ images. Three settings are considered, in which the numbers of images from the source and target datasets are 10,000 - 10,000, 10,000 - 4,000, and 4,000 - 10,000. The experiments are conducted on NUS-WIDE → COCO and COCO → NUS-WIDE, and Figure 10(a) and (b) show the results. For the same number of bits, the results are very close, and only slight differences can be observed among the three settings. This indicates that the proposed method performs stably even when the numbers of training samples from the two domains vary.

Figure 13 shows the top 20 retrieved results of the proposed method and four comparison methods, ranked by ascending Hamming distance. The query image carries three semantic labels, i.e., building, sky and water, and its main content is a building. Mismatched images, judged by human perception, are marked with red boxes. The retrieval results of T-MLZSH are visually more plausible and focus on the main object of the query image, whereas the compared methods may return mismatched results such as forests, or rank images related to less important parts of the query image higher.
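Ascending Hamming ranking, as used for the retrieval results above, can be sketched in a few lines. The example assumes that each hash code is stored as a Python integer whose bits are the code bits; the function names and the 6-bit toy codes are hypothetical.

```python
def hamming_distance(code_a, code_b):
    """Number of bit positions in which two integer-encoded codes differ."""
    return bin(code_a ^ code_b).count("1")


def hamming_ranking(query_code, database_codes, top_k=20):
    """Indices of the top_k database codes in ascending Hamming distance.

    Ties are broken by database index so the ranking is deterministic.
    """
    order = sorted(
        range(len(database_codes)),
        key=lambda i: (hamming_distance(query_code, database_codes[i]), i),
    )
    return order[:top_k]


# Toy 6-bit codes (hypothetical values).
query = 0b101101
database = [0b101101, 0b001101, 0b010010, 0b101111, 0b111111]

print(hamming_ranking(query, database, top_k=3))
```

The XOR of two codes has a 1 exactly where they differ, so counting its set bits gives the Hamming distance; this is what makes retrieval with compact binary codes cheap.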

V Conclusion

In this paper, a novel transductive zero-shot hashing method was proposed for multi-label image retrieval. It used instance-concept coherence to build a bridge connecting the seen and unseen labels. Based on these connections, the categories with the highest relatedness scores were selected as the predicted labels for the target data. Hashing learning was then performed on both the source data and the target data in a supervised manner. Experimental results on three widely used multi-label datasets showed that the proposed method outperformed the state-of-the-art methods by a significant margin. The ablation studies verified the effectiveness of the proposed focal quantization loss.

Fig. 13: Top 20 retrieved images of the methods in comparison using the Hamming ranking on the 48-bit hash codes.


  • [1] M. Abadi, P. Barham, J. Chen, et al. (2016) Tensorflow: a system for large-scale machine learning. In OSDI, pp. 265–283. Cited by: §IV-B2.
  • [2] R. Baeza-Yates, B. Ribeiro-Neto, et al. (1999) Modern information retrieval. Vol. 463, ACM press New York. Cited by: §IV-C.
  • [3] Z. Cao, M. Long, J. Wang, and P. S. Yu (2017) HashNet: deep learning to hash by continuation. In ICCV, pp. 5609–5618. Cited by: §I, §II.
  • [4] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In ICIVR, pp. 48. Cited by: §IV-A.
  • [5] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni (2004) Locality-sensitive hashing scheme based on p-stable distributions. In Annual Symposium on Computational Geometry, pp. 253–262. Cited by: §I.
  • [6] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852. Cited by: §II.
  • [7] S. Ercoli, M. Bertini, and A. D. Bimbo (2017) Compact hash codes for efficient visual descriptors retrieval in large scale databases. IEEE Transactions on Multimedia 19 (11), pp. 2521–2532. Cited by: §II.
  • [8] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou (2015) Deep hashing for compact binary codes learning. In CVPR, pp. 2475–2483. Cited by: §I, §II.
  • [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §IV-A.
  • [10] Y. Fu, Y. Yang, T. Hospedales, T. Xiang, and S. Gong (2015) Transductive multi-label zero-shot learning. arXiv preprint arXiv:1503.07790. Cited by: §II, §IV-B2.
  • [11] V. Gattupalli, Y. Zhuo, and B. Li (2019) Weakly supervised deep image hashing through tag embeddings. In CVPR, Cited by: §II.
  • [12] J. Gui, T. Liu, Z. Sun, D. Tao, and T. Tan (2016) Supervised discrete hashing with relaxation. IEEE Transactions on Neural Networks and Learning Systems 29 (3), pp. 608–617. Cited by: §II.
  • [13] Y. Guo, G. Ding, J. Han, and Y. Gao (2017) SitNet: discrete similarity transfer network for zero-shot hashing. In IJCAI, pp. 1767–1773. Cited by: §I, §II.
  • [14] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon (2015) Spherical hashing: binary code embedding with hyperspheres. In IEEE Trans. Pattern Anal. Mach. Intell., Vol. 37, pp. 2304–2316. Cited by: §I, §II.
  • [15] G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang (2014) Locally linear hashing for extracting non-linear manifolds. In CVPR, pp. 2115–2122. Cited by: §II.
  • [16] K. Jarvelin and J. Kekalainen (2000) IR evaluation methods for retrieving highly relevant documents. In ACM SIGIR, pp. 41–48. Cited by: §IV-C.
  • [17] K. Jarvelin and J. Kekalainen (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans. on Inform. Systems 20 (4), pp. 422–446. Cited by: §IV-C.
  • [18] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. ICLR. Cited by: §II.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §I, §IV-B2.
  • [20] B. Kulis and K. Grauman (2011) Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (6), pp. 1092–1104. Cited by: §I.
  • [21] H. Lai, Y. Pan, Y. Liu, and S. Yan (2015) Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pp. 3270–3278. Cited by: §I, §II.
  • [22] H. Lai (2018) Transductive zero-shot hashing via coarse-to-fine similarity mining. In ICMR, pp. 196–203. Cited by: §I, §II, §IV-B2, TABLE I, TABLE II.
  • [23] C. H. Lampert, H. Nickisch, and S. Harmeling (2009) Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pp. 951–958. Cited by: §I.
  • [24] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, pp. 740–755. Cited by: §IV-A.
  • [25] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2018) Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §III-C.
  • [26] H. Liu, R. Wang, S. Shan, and X. Chen (2016) Deep supervised hashing for fast image retrieval. In CVPR, pp. 2064–2072. Cited by: §I, §II, §III-C.
  • [27] L. Liu and L. Shao (2015) Sequential compact code learning for unsupervised image hashing. IEEE Transactions on Neural Networks and Learning Systems 27 (12), pp. 2526–2536. Cited by: §II.
  • [28] W. Liu, J. Wang, S. Kumar, and S.-F. Chang (2011) Hashing with graphs. In ICML, pp. 1–8. Cited by: §I.
  • [29] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang (2012) Supervised hashing with kernels. In CVPR, pp. 2074–2081. Cited by: §I, §IV-B2, TABLE I, TABLE II.
  • [30] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §I.
  • [31] T. Mensink, E. Gavves, and C. G. Snoek (2014) Costa: co-occurrence statistics for zero-shot classification. In CVPR, pp. 2441–2448. Cited by: §II.
  • [32] Q. Ning, J. Zhu, Z. Zhong, S. C. H. Hoi, and C. Chen (2017) Scalable image retrieval by sparse product quantization. IEEE Transactions on Multimedia 19 (3), pp. 586–597. Cited by: §II.
  • [33] S. Pachori, A. Deshpande, and S. Raman (2018) Hashing in the zero shot framework with domain adaptation. Neurocomp. 275, pp. 2137–2149. Cited by: §I, §II, §IV-B2, TABLE I, TABLE II.
  • [34] S. J. Pan and Q. Yang (2010) Survey on transfer learning. IEEE Trans. Know. Data Eng. 22 (10), pp. 1345–1359. Cited by: §II.
  • [35] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §I, §IV-B2.
  • [36] Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille (2015) Multi-instance visual-semantic embedding. arXiv preprint arXiv:1512.06963. Cited by: §II.
  • [37] F. Shen, C. Shen, W. Liu, and H. Tao Shen (2015) Supervised discrete hashing. In CVPR, pp. 37–45. Cited by: §I, §IV-B2, TABLE I, TABLE II.
  • [38] F. Shen, C. Shen, Q. Shi, A. Van Den Hengel, and Z. Tang (2013) Inductive hashing on manifolds. In CVPR, pp. 1562–1569. Cited by: §I, §IV-B2, TABLE I, TABLE II.
  • [39] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen (2018) Unsupervised deep hashing with similarity-adaptive and discrete optimization. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §II.
  • [40] Y. Shen, L. Liu, F. Shen, and L. Shao (2018) Zero-shot sketch-image hashing. In CVPR, pp. 3598–3607. Cited by: §II.
  • [41] Y. Sun, Y. Chen, X. Wang, and X. Tang (2014) Deep learning face representation by joint identification-verification. In NIPS, pp. 1988–1996. Cited by: §I.
  • [42] A. Torralba, R. Fergus, and Y. Weiss (2008) Small codes and large image databases for recognition. In CVPR, pp. 1–8. Cited by: §II.
  • [43] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In CVPR, Cited by: §II.
  • [44] Q. Wang and K. Chen (2017) Multi-label zero-shot human action recognition via joint latent embedding. arXiv preprint arXiv:1709.05107. Cited by: §III-B.
  • [45] X. Wang, Y. Ye, and A. Gupta (2018) Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, pp. 6857–6866. Cited by: §II.
  • [46] Y. Weiss, R. Fergus, and A. Torralba (2012) Multidimensional spectral hashing. In ECCV, pp. 340–353. Cited by: §II.
  • [47] Y. Weiss, A. Torralba, and R. Fergus (2009) Spectral hashing. In NIPS, pp. 1753–1760. Cited by: §I, §II.
  • [48] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan (2014) Supervised hashing for image retrieval via image representation learning. In AAAI, Vol. 1, pp. 2156–2162. Cited by: §I.
  • [49] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong (2015) Transductive multi-view zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 37 (11), pp. 2332–2345. Cited by: §II.
  • [50] Y. Yang, Y. Luo, W. Chen, F. Shen, J. Shao, and H. T. Shen (2016) Zero-shot hashing via transferring supervised knowledge. In ACM MM, pp. 1286–1295. Cited by: §I, §II, §IV-B2, TABLE I, TABLE II.
  • [51] H. Zhang, L. Liu, Y. Long, and L. Shao (2018) Unsupervised deep hashing with pseudo labels for scalable image retrieval. IEEE Trans. Image Process. 27 (4), pp. 1626–1638. Cited by: §II.
  • [52] H. Zhang, Y. Long, and L. Shao (2019) Zero-shot hashing with orthogonal projection for image retrieval. Pattern Recognition Letters 117, pp. 201–209. Cited by: §I.
  • [53] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang (2015) Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. Image Process. 24 (12), pp. 4766–4779. Cited by: §I, §II.
  • [54] Z. Zhang, Q. Zou, Y. Lin, L. Chen, and S. Wang (2019) Improved deep hashing with soft pairwise similarity for multi-label image retrieval. IEEE Trans. Multimedia, pp. 1–13. Cited by: §I.
  • [55] F. Zhao, Y. Huang, L. Wang, and T. Tan (2015) Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, pp. 1556–1564. Cited by: §I, §IV-C.
  • [56] H. Zhu, M. Long, J. Wang, and Y. Cao (2016) Deep hashing network for efficient similarity retrieval. In AAAI, pp. 2415–2421. Cited by: §I, §II, §III-C, §IV-B2, TABLE I, TABLE II.
  • [57] L. Zhu, Z. Huang, Z. Li, L. Xie, and H. T. Shen (2018) Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval. IEEE Transactions on Neural Networks and Learning Systems 29 (11), pp. 5264–5276. Cited by: §II.