I Introduction
Large-scale image retrieval, which aims to find images containing objects similar to those in a query image, has attracted increasing interest due to the ever-growing amount of image data available on the Web. Similarity-preserving hashing is a popular nearest neighbor search technique for image retrieval on datasets with millions or even billions of images.
A representative stream of similarity-preserving hashing is learning-to-hash, i.e., learning to compress data points (e.g., images) into binary representations such that semantically similar data points have nearby binary codes. The existing learning-to-hash methods can be divided into two main categories: unsupervised methods and supervised methods. Unsupervised methods (e.g., [1, 2, 3]) learn a set of hash functions from unlabeled data without any side information. Supervised methods (e.g., [4, 5, 6, 7]) try to learn compact hash codes by leveraging supervised information on data points (e.g., similarities on pairs of images). Among various supervised learning-to-hash methods for image retrieval, an emerging stream is deep-networks-based hashing, which learns bitwise codes as well as image representations via carefully designed deep neural networks. Several deep-networks-based hashing methods have been proposed (e.g., [4, 8, 9]).

Multi-label images, each of which may contain objects of multiple categories, are widely involved in many image retrieval systems. However, in most existing hashing methods for images, the semantic similarities are defined at the image level, and each image is represented by one piece of hash code. This setting may be suboptimal for multi-label image retrieval.
In this paper, we consider instance-aware retrieval for multi-label image data, which includes semantic hashing [10] and category-aware hashing. Specifically, given a multi-label query image, a natural demand is to organize the retrieved results in groups, each group corresponding to one category. For example, as shown in Figure 1, given a query image containing a bicycle and a sofa, one would like to organize the retrieved results in two groups: each image in the first (second) group contains a bicycle (sofa) similar to the one in the query image. In order to achieve instance-aware retrieval, we propose a new image representation organized in groups, obtained by incorporating automatically generated candidate object proposals and label probability calculation into the learning process. Figure 1 shows an example of the generation of the instance-aware image representation.
More specifically, we propose a deep neural network that simultaneously learns binary hash codes and a representation tailored for multi-label image data. As shown in Figure 2, the proposed architecture has four building blocks: 1) a set of automatically generated candidate object proposals in the form of bounding boxes, as inputs to the deep neural network; 2) stacked convolutional layers that capture the features of the input proposals, followed by a Spatial Pyramid Pooling (SPP) layer [11] that maps each proposal to a fixed-dimensional intermediate representation; 3) a label probability calculation module that maps the intermediate representations to the image labels, which leads to a probability matrix whose rows represent the label probabilities of the corresponding proposals over the classes; 4) a hash coding module, where first an instance-aware representation is computed, with the probability matrix of the third module as input, and then either category-aware hash codes or semantic hash codes are generated based on this representation.
The proposed deep architecture can be used to generate hash codes for category-aware hashing, where an image is represented by multiple pieces of hash codes, each of which corresponds to a category. In addition, we show that the proposed image representation can improve the quality of semantic hashing, in which an image is represented by one piece of hash code.
Our contributions in this paper can be summarized as follows. First, we propose a deep architecture that can generate hash codes for instance-aware retrieval. To the best of our knowledge, we are the first to conduct instance-aware retrieval via learning-based hashing. Second, we propose to incorporate automatically generated candidate object proposals and label probability calculation into the proposed deep architecture. We empirically show that the proposed method achieves substantial performance gains over several state-of-the-art hashing methods.
II Related Work
Due to its encouraging search speed, hashing has become a popular method for nearest neighbor search in large-scale image retrieval.
Hashing methods can be divided into data-independent hashing and data-dependent hashing. The early efforts mainly focused on data-independent hashing. For example, the notable Locality-Sensitive Hashing (LSH) [12] method constructs hash functions by random projections or random permutations that are independent of the data points. The main limitation of data-independent methods is that they usually require long hash codes to obtain good performance. However, long hash codes lead to inefficient search due to the required large storage space and the low recall rates.
Learning-based hashing (or learning-to-hash) pursues a compact binary representation from the training data. Based on whether side information is used, learning-to-hash methods can be divided into two categories: unsupervised methods and supervised methods.
Unsupervised methods try to learn a set of similarity-preserving hash functions only from the unlabeled data. Representative methods in this category include Kernelized LSH (KLSH) [2], Semantic hashing [13], Spectral hashing [14], Anchor Graph Hashing [3], and Iterative Quantization (ITQ) [1]. Kernelized LSH (KLSH) [2] generalizes LSH to accommodate arbitrary kernel functions, making it possible to learn hash functions which preserve data points' similarity in a kernel space. Semantic hashing [13] generates hash functions by a deep autoencoder built by stacking multiple restricted Boltzmann machines (RBMs). Graph-based hashing methods, such as Spectral hashing [14] and Anchor Graph Hashing [3], learn nonlinear mappings as hash functions which try to preserve the similarities within the data neighborhood graph. In order to reduce quantization errors, Iterative Quantization (ITQ) [1] seeks to learn an orthogonal rotation matrix which is applied to the data matrix after principal component analysis projections.
Supervised methods aim to learn better bitwise representations by incorporating supervised information. Notable methods in this category include Binary Reconstruction Embedding (BRE) [6], Minimal Loss Hashing (MLH) [15], Supervised Hashing with Kernels (KSH) [5], Column Generation Hash (CGHash) [16], and Semi-Supervised Hashing (SSH) [17]. Binary Reconstruction Embedding (BRE) [6] learns hash functions by explicitly minimizing the reconstruction errors between the original distances of data points and the Hamming distances of the corresponding binary codes. Minimal Loss Hashing (MLH) [15] learns similarity-preserving hash codes by minimizing a hinge-like loss function which is formulated as structured prediction with latent variables. Supervised Hashing with Kernels (KSH) [5] is a kernel-based supervised method which learns to hash the data points to compact binary codes whose Hamming distances are minimized on similar pairs and maximized on dissimilar pairs. Column Generation Hash (CGHash) [16] is a column-generation-based method to learn hash functions with proximity comparison information. Semi-Supervised Hashing (SSH) [17] learns hash functions by minimizing similarity errors on the labeled data while simultaneously maximizing the entropy of the learnt hash codes over the unlabeled data. In most image retrieval applications, the number of labeled positive samples is small, which results in bias towards the negative samples and overfitting. Tao et al. [18] proposed an asymmetric bagging and random subspace SVM (ABRS-SVM) to handle these problems.

In supervised hashing methods for image retrieval, an emerging stream is the deep-networks-based methods [19, 4, 8, 9] which learn image representations as well as binary hash codes. Xia et al. [4]
proposed Convolutional-Neural-Networks-based Hashing (CNNH), which is a two-stage method. In its first stage, approximate hash codes are learned from the supervised information. Then, in the second stage, hash functions are learned based on those approximate hash codes via deep convolutional networks. Lai et al. [8] proposed a one-stage hashing method that generates bitwise hash codes via a carefully designed deep architecture. Zhao et al. [9] proposed a ranking-based hashing method for learning hash functions that preserve multi-level semantic similarity between images, via deep convolutional networks. Lin et al. [20] proposed to learn the hash codes and image representations in a point-wise manner, which is suitable for large-scale datasets. Wang et al. [21] proposed the Deep Multimodal Hashing with Orthogonal Regularization (DMHOR) method for multimodal data. All of these methods generate one piece of hash code for each image, which may be inappropriate for multi-label image retrieval. Different from the existing methods, the proposed method can generate multiple pieces of hash codes for an image, each piece corresponding to an instance/category.

III The Proposed Method
Our method consists of four modules. The first module generates region proposals for an input image. The second module captures the features of the generated region proposals; it contains a deep convolution sub-network followed by a Spatial Pyramid Pooling layer [11]. The third module is a label probability calculation module, which outputs a probability matrix whose rows represent the probability scores of the corresponding proposals belonging to each class. The fourth module is a hash coding module that first generates the instance-aware representation, and then converts this representation to hash codes for either category-aware hashing or semantic hashing. In the following, we present the details of these modules, respectively.
III-A Region Proposal Generation Module
Many methods for generating category-independent region proposals have been proposed, e.g., Constrained Parametric Min-Cuts (CPMC) [22], Selective Search [23], Multiscale Combinatorial Grouping (MCG) [24], BInarized Normed Gradients (BING) [25] and Geodesic Object Proposals (GOP) [26]. In this paper, we use GOP [26] to automatically generate region proposals for an input image. Note that other methods for region proposal generation can also be used in our framework.

GOP is a method that can generate both segmentation masks and bounding box proposals. We use the code provided by the authors (http://www.philkr.net/home/gop) to generate the bounding boxes for region proposals.
III-B Deep Convolution Sub-Network Module
GoogLeNet [27] is a recently proposed deep architecture that has shown its success in object categorization and object detection. The core of GoogLeNet is the Inception-style convolution module, which allows increasing the depth and width of the network while keeping reasonable computational costs. Here we adopt the architecture of GoogLeNet as our basic framework to compute the features of the input proposals. Since GoogLeNet is a very deep network with many layers, we use the pretrained GoogLeNet model (http://dl.caffe.berkeleyvision.org/bvlc_googlenet.caffemodel) to initialize its weights, which can be regarded as regularization [28] and helps its generalization.
However, since the number of generated region proposals for an input image may be large (e.g., more than 1000), it is computationally expensive to directly use GoogLeNet to extract features from these proposals. This is unaffordable for hashing-based retrieval, since the retrieval system may need a long time to respond to a query.
To address this issue, we use the "Spatial Pyramid Pooling" (SPP) scheme [11]. The advantage of using SPP is that we can compute the feature map of the entire input image only once. Then, with this feature map, we pool the features in each generated region proposal to obtain a fixed-length representation. Using SPP, we avoid repeatedly computing features for the input region proposals via a deep convolutional network. Specifically, as shown in Figure 2, we add an SPP layer after the last convolutional layer of GoogLeNet. We assume that each input image has a set of automatically generated region proposals. For each input region proposal, we encode its top-left and bottom-right coordinates as a 4-dimensional vector. The elements in this vector are scaled to [0, 1] by dividing by the width/height of the image, making them invariant to the absolute image size. With this 4-dimensional vector as input, the SPP layer generates a fixed-length feature vector for the corresponding proposal. Through the SPP layer, each proposal is thus mapped to a fixed-dimensional intermediate feature vector. Hence, for each input image, the output of this module is a matrix whose rows are the proposals' intermediate feature vectors.
After this module, the network is divided into two branches: one for the label probability calculation module, and the other for the hash coding module.
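To make the pooling step concrete, the sketch below (a simplified illustration with made-up shapes; `spp_pool` is a hypothetical helper, not the paper's actual implementation) pools an arbitrary proposal window of a conv feature map into a fixed-length vector, so the map needs to be computed only once per image:

```python
import numpy as np

def spp_pool(feature_map, box, levels=(1, 2)):
    """Max-pool the sub-window of a conv feature map given by `box`
    into a fixed-length vector, using an n x n grid of bins per level.

    feature_map: (C, H, W) array computed once for the whole image.
    box: (x0, y0, x1, y1) proposal coordinates on the feature map.
    """
    window = feature_map[:, box[1]:box[3], box[0]:box[2]]
    _, h, w = window.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                ys = slice(h * i // n, max(h * (i + 1) // n, h * i // n + 1))
                xs = slice(w * j // n, max(w * (j + 1) // n, w * j // n + 1))
                pooled.append(window[:, ys, xs].max(axis=(1, 2)))
    return np.concatenate(pooled)  # length C * sum(n*n for n in levels)

fmap = np.random.rand(32, 14, 14)     # feature map, computed once per image
vec = spp_pool(fmap, (2, 3, 9, 10))   # one proposal window
assert vec.shape == (32 * (1 + 4),)   # fixed length, regardless of box size
```

Whatever the window size, the output length depends only on the number of channels and pyramid bins, which is what lets all proposals share one forward pass.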
III-C Label Probability Calculation Module
In this subsection, we show how to learn the label probability for each region proposal. Suppose there are c class labels. For each proposal, we aim to generate a c-dimensional probability vector whose j-th entry indicates the probability that the proposal contains an object of the j-th category.
However, we do not have ground-truth labels for individual proposals, so this probability distribution cannot be learned directly. Fortunately, in multi-label image annotation there is a label vector for the whole image, whose j-th entry is 1 if the j-th label is relevant to the image and 0 otherwise. Hence, we can first fuse the proposals into one image-level prediction and then use the whole image's label for learning, as shown in Figure 3.

More specifically, given the matrix whose i-th row is the intermediate feature vector of the i-th proposal, in this module we first use a fully-connected layer to compress each proposal's feature into a c-dimensional score vector.
After that, we use cross-hypothesis max-pooling [29] to fuse these score vectors into one c-dimensional vector. Specifically, let Z be the N-by-c matrix whose i-th row z_i is the score vector of the i-th of the N proposals. The cross-hypothesis max-pooling can be formulated as

v_j = \max_{i=1,\dots,N} Z_{ij}, \quad j = 1, 2, \dots, c, \qquad (1)

where v_j is the pooled value that corresponds to the j-th category.
Using the pooled vector v = (v_1, \dots, v_c), we calculate a probability distribution by a softmax:

p_j = \frac{\exp(v_j)}{\sum_{k=1}^{c} \exp(v_k)}, \quad j = 1, 2, \dots, c, \qquad (2)

where p_j can be regarded as the probability score that the input image contains an object of the j-th category. With such cross-hypothesis max-pooling, if some proposal contains an object of the j-th category, then its score for that category should be large, so the pooled value v_j (and hence p_j) will have a high response. Hence, it can guide the learning of the per-proposal scores.
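The pooling-then-softmax step above can be sketched as follows (a minimal numpy illustration with toy numbers; the function name and scores are ours, not the paper's):

```python
import numpy as np

def image_label_probs(Z):
    """Cross-hypothesis max-pooling followed by a softmax.

    Z: (N, c) matrix of per-proposal class scores (N proposals, c classes).
    Returns the image-level probability vector of length c.
    """
    v = Z.max(axis=0)           # max over proposals, one value per class
    e = np.exp(v - v.max())     # numerically stable softmax
    return e / e.sum()

Z = np.array([[4.0, -1.0, 0.0],   # proposal 1: strong class-0 evidence
              [0.5,  3.0, 0.0]])  # proposal 2: strong class-1 evidence
p = image_label_probs(Z)
assert abs(p.sum() - 1.0) < 1e-9
assert p[0] > p[2] and p[1] > p[2]   # pooled classes dominate
```

A single confident proposal is enough to raise the image-level probability of its class, which matches the intuition described above.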
In this module, we define a loss function based on the cross entropy between the probability scores and the ground-truth labels:

L = -\frac{1}{|Y|} \sum_{j \in Y} \log p_j, \qquad (3)

where Y denotes the set of categories to which the input image belongs, and |Y| is the number of elements in Y. This loss function is also referred to as SoftmaxLoss [30], which is a widely used loss function in convolutional neural networks. The (sub)gradients with respect to the pooled scores v_k are

\frac{\partial L}{\partial v_k} = p_k - \frac{\mathbb{1}[k \in Y]}{|Y|}, \quad k = 1, 2, \dots, c, \qquad (4)

where \mathbb{1}[\cdot] is the indicator function. They can be easily integrated into back propagation in neural networks.
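As a sanity check of the gradient form above, the following sketch (our own toy implementation, not the paper's code) compares the analytic (sub)gradient with a numerical finite difference:

```python
import numpy as np

def loss_and_grad(v, Y):
    """Softmax cross-entropy over the pooled scores v, averaged over
    the image's ground-truth label set Y; returns (loss, dL/dv)."""
    e = np.exp(v - v.max())
    p = e / e.sum()
    loss = -np.mean([np.log(p[j]) for j in Y])
    t = np.zeros_like(v)
    t[list(Y)] = 1.0 / len(Y)
    return loss, p - t          # gradient: p_k - [k in Y] / |Y|

v = np.array([1.0, -0.5, 2.0, 0.3])
Y = {0, 2}
loss, grad = loss_and_grad(v, Y)

# numerical check of the analytic gradient
eps = 1e-6
for k in range(len(v)):
    vp = v.copy(); vp[k] += eps
    num = (loss_and_grad(vp, Y)[0] - loss) / eps
    assert abs(num - grad[k]) < 1e-4
```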
After that, similarly to the image-level softmax, we can define a probability matrix P for the region proposals by applying a softmax to each row of the score matrix Z:

P_{ij} = \frac{\exp(Z_{ij})}{\sum_{k=1}^{c} \exp(Z_{ik})}, \qquad (5)

where P_{ij} is the j-th element in the i-th row of P. P_{ij} can be viewed as the probability that the i-th proposal contains an object of the j-th category.
III-D Hash Coding Module
In this subsection, we show how to convert the image representation into (a) one piece of hash code for semantic hashing, or (b) multiple pieces of hash codes, each piece corresponding to a category, for category-aware hashing.
With the matrix of intermediate features as input, in this module we first use a fully-connected layer to compress the i-th proposal's feature into a q-dimensional vector u_i. We denote U as the N-by-q matrix whose i-th row is u_i.
III-D1 Cross-Proposal Fusion
In order to convert the proposals' features into the instance-aware representation of the input image, we propose a cross-proposal fusion strategy that uses the probability matrix from the label probability calculation module.
Specifically, let U be the N-by-q matrix whose i-th row u_i is the feature vector of the i-th proposal, and let P be the N-by-c matrix whose entry P_{ij} is the probability of the i-th region proposal belonging to the j-th category, with p_i denoting the i-th row of P. We fuse U and P into a long vector h with c·q elements. This vector is organized in c groups, each group being a q-dimensional feature corresponding to one category.

The cross-proposal fusion can be formulated as

h = \sum_{i=1}^{N} p_i \otimes u_i, \qquad (6)

where \otimes is the Kronecker product. For the c-dimensional vector p_i and the q-dimensional vector u_i, the Kronecker product p_i \otimes u_i is a (c·q)-dimensional vector whose j-th block of length q is P_{ij} u_i.

Let h = [h_1^T, h_2^T, \dots, h_c^T]^T, where each h_j is a q-dimensional vector. It is easy to verify that h_j = \sum_{i=1}^{N} P_{ij} u_i.

Since u_i represents the features of the i-th proposal, h_j can be regarded as a weighted average of the proposals' features. If the i-th proposal has a relatively higher/lower score P_{ij} (meaning that the i-th proposal likely/unlikely belongs to the j-th category), the feature vector u_i (associated with the i-th proposal) has more/less contribution to the weighted average h_j.
Figure 4 shows an illustrative example of cross-proposal fusion. Suppose there are only two categories (bicycle and sofa) in all of the images, and several region proposals are generated for the input image in the first module. Suppose the label probability calculation module assigns the first proposal a high probability score for bicycle and a low score for sofa, indicating that this proposal is very likely to contain a bicycle but unlikely to contain a sofa. In the hash coding module, each proposal is represented by a feature vector. Finally, for the input image, we conduct cross-proposal fusion to obtain an instance-aware representation organized in two groups, one representing "bicycle" and the other representing "sofa".
The cross-proposal fusion is a crucial step for the instance-aware image representation. If the image contains an object of some category, the corresponding group of the instance-aware representation will have a high response. It also gives us a simple way to incorporate the multi-label information into the hashing procedure.
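The fusion step can be sketched numerically as follows (toy shapes and probability values of our choosing; it only illustrates that the Kronecker-product sum groups the probability-weighted proposal features by category):

```python
import numpy as np

N, c, q = 3, 2, 4                 # proposals, categories, bits per category
U = np.random.randn(N, q)         # per-proposal features (rows u_i)
P = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])        # per-proposal label probabilities (rows p_i)

# fusion: h = sum_i p_i (Kronecker) u_i, a (c*q)-dimensional vector
h = sum(np.kron(P[i], U[i]) for i in range(N))

# group j equals the probability-weighted combination of proposal features
h_groups = h.reshape(c, q)
for j in range(c):
    assert np.allclose(h_groups[j], (P[:, j:j+1] * U).sum(axis=0))
```

Reshaping `h` into c rows recovers the per-category groups, confirming the identity h_j = sum_i P_{ij} u_i stated above.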
Discussions. A concern arising here is that some input proposals may be inaccurate or may not contain any objects, which would make the features generated from these proposals noisy and harm the final performance. We argue that, to some extent, the operations in the proposed cross-hypothesis max-pooling and cross-proposal fusion can reduce the negative effects of possibly noisy input proposals. Firstly, in the label probability calculation module before the cross-hypothesis max-pooling, each input proposal is assigned a set of probability scores (one score per label). Higher scores are assigned to the proposals that contain objects with more confidence, and lower scores are assigned to noisy proposals. Hence, noisy proposals are more likely to be suppressed by the cross-hypothesis max-pooling. Secondly, similar to [31], in the cross-proposal fusion, the proposals' features are weighted by their probability scores, so the features of noisy proposals contribute less to the final representation. In summary, the reweighting operations in the cross-hypothesis max-pooling and the cross-proposal fusion can reduce the negative effects of inaccurate input proposals.
With the vector generated by the cross-proposal fusion, we can generate either the category-aware hash representation, which consists of multiple pieces of hash codes (one per category), or the semantic hash representation, which consists of one piece of hash code. Next we present these two cases separately.
III-D2 Category-aware Hash Representation
Since the fused vector h is organized in c groups h_1, h_2, \dots, h_c, each q-dimensional group h_j can be converted into a q-bit binary code b_j by elementwise thresholding: a bit of b_j is 1 if the corresponding element of h_j exceeds the threshold, and 0 otherwise. For an image, the category-aware hash representation is the collection of the c per-category codes b_1, b_2, \dots, b_c.
To learn this representation, we define c triplet loss functions [32, 8], one per category. To obtain triplet samples for a category, we randomly select images I and I^+ that both belong to the category, and a negative image I^- from the images that do not contain the category. We then design a triplet loss that tries to preserve relative similarities of the form "image I is more similar to image I^+ than to I^-". For the j-th category (j = 1, 2, \dots, c), suppose we have three images I, I^+ and I^-, where both I and I^+ belong to the j-th category but I^- does not. Denoting by h_j(I) the j-th group of the fused vector of image I, the triplet loss associated with the j-th category is defined by

\ell_j(I, I^+, I^-) = \max\big(0,\; 1 + \|h_j(I) - h_j(I^+)\|_2^2 - \|h_j(I) - h_j(I^-)\|_2^2\big). \qquad (7)

This loss function is convex. When the term inside the max is positive, its subgradients with respect to h_j(I), h_j(I^+) and h_j(I^-) can be easily obtained as

\frac{\partial \ell_j}{\partial h_j(I)} = 2\big(h_j(I^-) - h_j(I^+)\big), \quad \frac{\partial \ell_j}{\partial h_j(I^+)} = 2\big(h_j(I^+) - h_j(I)\big), \quad \frac{\partial \ell_j}{\partial h_j(I^-)} = 2\big(h_j(I) - h_j(I^-)\big). \qquad (8)

Otherwise the subgradients are all zeros.
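A minimal sketch of this hinge-style triplet loss and its subgradients (our own toy vectors; the margin of 1 follows the reconstruction above):

```python
import numpy as np

def triplet_loss(h, hp, hn, margin=1.0):
    """Hinge triplet loss on squared Euclidean distances, plus its
    subgradients w.r.t. (h, hp, hn); all zeros outside the hinge."""
    slack = margin + np.sum((h - hp) ** 2) - np.sum((h - hn) ** 2)
    if slack <= 0:
        z = np.zeros_like(h)
        return 0.0, (z, z, z)
    return slack, (2 * (hn - hp), 2 * (hp - h), 2 * (h - hn))

h  = np.array([0.0, 0.0])
hp = np.array([0.2, 0.0])    # near the anchor
hn = np.array([2.0, 0.0])    # far from the anchor
loss, _ = triplet_loss(h, hp, hn)
assert loss == 0.0           # relative similarity already satisfied

loss, grads = triplet_loss(h, hn, hp)   # violated ordering
assert loss > 0.0
```

When the positive is already much closer than the negative (by more than the margin), both the loss and the subgradients vanish, so such triplets contribute nothing to training.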
III-D3 Semantic Hash Representation
For semantic hashing, we assume the target hash code length is m bits. We first use a fully-connected layer to convert the (c·q)-dimensional fused vector h into an m-dimensional vector g. Then g can be converted into an m-bit binary code by elementwise thresholding, as in the category-aware case.
Next we present the triplet loss defined on g. Since the original triplet loss [32, 8] is designed for single-label data, here we propose a weighted triplet loss for multi-label data. Specifically, we define the similarity function s(I_1, I_2) as the number of shared labels between images I_1 and I_2. Then, for images I, I^+ and I^- with s(I, I^+) > s(I, I^-), the weighted triplet loss is defined by

\ell_w(I, I^+, I^-) = \big(s(I, I^+) - s(I, I^-)\big)\, \ell\big(g(I), g(I^+), g(I^-)\big), \qquad (9)

where \ell is the triplet loss defined in (7), and g(I) is the m-dimensional vector of image I.
In many existing supervised hashing methods, the side information is in the form of pairwise labels indicating the similarities/dissimilarities of image pairs. In these hashing methods, a straightforward way is to define pairwise loss functions which preserve the pairwise similarities of images. Some recent papers (e.g., [32, 8]) learn hash functions by using triplet loss functions, which seek to preserve relative similarities of the form "image I is more similar to image I^+ than to image I^-". Such triplet-based relative similarities can be more easily obtained than pairwise similarities (e.g., from users' click-through data in image retrieval applications).
IV Category-Aware Retrieval
Suppose that we have a set of images for retrieval. Independently of the other categories, for the j-th category (j = 1, 2, \dots, c), we can generate the binary codes of all images and conduct retrieval based on these codes. Hence, for a query image, the retrieved results can be organized in c groups, where the j-th group is a list of images, each of which is likely to contain a similar object of the j-th category.
An issue which needs to be considered here is that the number of object categories in an image is usually less than c, so it is inappropriate to organize the retrieved results in all c groups for every query image. The label probability calculation module (see Section III-C) outputs a probability vector whose j-th entry p_j represents the predicted probability that the input image contains objects of the j-th category. For the c groups of retrieved results, we can therefore remove the j-th group if and only if p_j is less than some threshold, which we set empirically in our experiments.
For the images in the database, each image is first encoded into c pieces of binary codes. Then we collect those hash codes whose probability score (i.e., p_j for the j-th piece of hash code) is no less than the threshold. We organize the collected hash codes in c groups, the j-th group containing the hash codes that are the j-th piece of code of some image. Finally, we build one hash table to store the hash codes of each group. In retrieval, for a test query image, we first convert it into c pieces of codes, and then remove those codes whose probability score is less than the threshold. For each of the remaining hash codes, we conduct a search in the corresponding hash table and obtain a list of retrieved images.
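The grouped-retrieval procedure above can be sketched as follows (a simplified in-memory simulation with random codes and a made-up threshold; a real system would use proper hash-table lookups rather than linear Hamming ranking):

```python
import numpy as np

# hypothetical database: per image, c pieces of codes + c probability scores
c, q, threshold = 3, 8, 0.3
rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(100, c, q))   # 100 images
db_probs = rng.random(size=(100, c))

# one table per category, keeping only codes with a confident score
tables = [{} for _ in range(c)]
for img_id in range(100):
    for j in range(c):
        if db_probs[img_id, j] >= threshold:
            tables[j].setdefault(tuple(db_codes[img_id, j]), []).append(img_id)

def query(codes, probs):
    """Return one ranked group of images per confident query category,
    ordered by Hamming distance to the query's code for that category."""
    groups = {}
    for j in range(c):
        if probs[j] < threshold:
            continue                    # drop groups the query likely lacks
        cands = [i for ids in tables[j].values() for i in ids]
        dists = [int(np.sum(db_codes[i, j] != codes[j])) for i in cands]
        groups[j] = [i for _, i in sorted(zip(dists, cands))]
    return groups

g = query(db_codes[0], db_probs[0])
for j in g:   # a database image queried against itself ranks at distance 0
    assert int(np.sum(db_codes[g[j][0], j] != db_codes[0, j])) == 0
```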
Figure 5 shows two examples of results from our experiments. For the first example, when retrieving with a query image containing a cat and a dog, the proposed method returns two lists of retrieved images. Each image in the first/second list is likely to have a cat/dog similar to that in the query image, and the approximate location of this cat/dog is also indicated (in the associated grey image). The grey images are saliency maps obtained from the automatically generated region proposals (see Section III-A) and the predicted probability scores (see Section III-C).
V Experiments
In this section, we evaluate the performance of the proposed method for both semantic hashing and category-aware hashing, and compare it with several state-of-the-art hashing methods.
V-A Datasets and Evaluation Metrics
We evaluate the proposed method on three public multi-label image datasets: VOC 2007 [33], VOC 2012 [33] and MIRFLICKR25K [34].
VOC 2007 consists of 9,963 multi-label images collected from Flickr (http://www.flickr.com/). There are 20 object classes in this dataset. On average, each image is annotated with 1.5 labels.

VOC 2012 consists of 22,531 multi-label images in 20 classes. Since the ground-truth labels of the test images are not available, in our experiments we only use the 11,540 images from its training and validation sets.

MIRFLICKR25K consists of 25,000 multi-label images downloaded from Flickr. There are 38 classes in this dataset. Each image has 4.7 labels on average.
In each dataset, we randomly select 2,000 images as the test query set, and the remaining images are used as training samples. Note that we only use the 11,540 images in the VOC 2012 dataset that have ground-truth labels. The numbers of training and testing samples are shown in Table I:
VOC2007  VOC2012  MIRFLICKR25K  

Test  2,000  2,000  2,000 
Train  7,963  9,540  23,000 
To evaluate the performance, we use four evaluation metrics: Normalized Discounted Cumulative Gain (NDCG) [35], Mean Average Precision (MAP) [36], Weighted MAP [9] and Average Cumulative Gain (ACG) [37].

NDCG is a popular evaluation metric in information retrieval. Given a query image, the DCG score at position p is defined as

DCG@p = \sum_{i=1}^{p} \frac{2^{r_i} - 1}{\log_2(1 + i)}, \qquad (10)

where r_i is the similarity between the image at the i-th position and the query image, defined as the number of labels shared by the query image and the i-th retrieved image. Then, the NDCG score at position p can be calculated by NDCG@p = DCG@p / Z_p, where Z_p is the maximum possible value of DCG@p, making the value of NDCG fall in the range [0, 1].
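A small sketch of the NDCG computation (our own implementation of the standard formula, with a toy relevance list):

```python
import numpy as np

def dcg(rels, p):
    """DCG@p with graded relevance r_i = number of shared labels."""
    rels = np.asarray(rels[:p], dtype=float)
    discounts = np.log2(np.arange(2, len(rels) + 2))   # log2(1 + i)
    return float(((2 ** rels - 1) / discounts).sum())

def ndcg(rels, p):
    """Normalize by the ideal (descending-sorted) ordering."""
    ideal = dcg(sorted(rels, reverse=True), p)
    return dcg(rels, p) / ideal if ideal > 0 else 0.0

rels = [2, 0, 1, 2]            # shared-label counts of retrieved images
assert 0.0 <= ndcg(rels, 4) <= 1.0
assert ndcg([2, 2, 1, 0], 4) == 1.0    # already ideally ordered
```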
ACG represents the average similarity between the query image and the top p retrieved images, which can be calculated by

ACG@p = \frac{1}{p} \sum_{i=1}^{p} r_i. \qquad (11)
MAP is a standard evaluation metric for information retrieval. It is the mean of the average precision over a set of queries. The average precision of a query can be calculated by

AP = \frac{1}{N_{rel}} \sum_{k=1}^{n} \mathbb{1}[r_k > 0] \cdot Precision@k, \qquad (12)

where \mathbb{1}[\cdot] is an indicator function: if the image at position k is relevant (i.e., it shares at least one label with the query image), \mathbb{1}[r_k > 0] is 1; otherwise it is 0. N_{rel} is the total number of relevant images w.r.t. the query image, and Precision@k = N_{rel}@k / k, where N_{rel}@k is the number of relevant images within the top k results.
The weighted MAP is defined analogously, with the precision term replaced by the ACG score:

AP_w = \frac{1}{N_{rel}} \sum_{k=1}^{n} \mathbb{1}[r_k > 0] \cdot ACG@k. \qquad (13)
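The ACG and AP computations above can be sketched as follows (our own implementations of the standard formulas, with a toy ranked list):

```python
import numpy as np

def acg(rels, p):
    """ACG@p: average similarity (shared labels) over the top p results."""
    return float(np.mean(rels[:p]))

def average_precision(rels, n_rel):
    """Standard AP: precision@k averaged over the relevant ranks."""
    hits, ap = 0, 0.0
    for k, r in enumerate(rels, start=1):
        if r > 0:                 # position k is relevant
            hits += 1
            ap += hits / k        # precision@k
    return ap / n_rel

rels = [2, 0, 1]                  # shared-label counts down the ranked list
assert acg(rels, 3) == 1.0
assert abs(average_precision(rels, n_rel=2) - (1/1 + 2/3) / 2) < 1e-12
```

The weighted MAP variant simply substitutes `acg(rels, k)` for the `hits / k` precision term.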
Methods  VOC 2007  MIRFLICKR25K  VOC 2012  

16 bits  32 bits  48 bits  64 bits  16 bits  32 bits  48 bits  64 bits  16 bits  32 bits  48 bits  64 bits  
NDCG@1000  
Ours  0.7963  0.8696  0.8865  0.8929  0.4725  0.5245  0.5400  0.5552  0.7939  0.8385  0.8566  0.8573 
OneStage  0.7808  0.8296  0.8446  0.8530  0.4413  0.5096  0.5392  0.5550  0.7540  0.8012  0.8170  0.8224 
ITQCCA  0.7704  0.8007  0.8139  0.8146  0.4498  0.4719  0.4866  0.4921  0.7471  0.7759  0.7815  0.7891 
ITQ  0.6848  0.6768  0.6783  0.6766  0.3734  0.3934  0.3966  0.3982  0.6433  0.6371  0.6338  0.6337 
SH  0.5404  0.5013  0.4796  0.4697  0.3096  0.3046  0.2998  0.2959  0.5157  0.4718  0.4409  0.4238 
ACG@1000  
Ours  0.7065  0.7590  0.7674  0.7731  2.6751  2.8353  2.8951  2.9282  0.7523  0.7841  0.7972  0.7963 
OneStage  0.7007  0.7323  0.7407  0.7483  2.6083  2.8302  2.9141  2.9642  0.7141  0.7522  0.7654  0.7705 
ITQCCA  0.6418  0.6658  0.6780  0.6779  2.4785  2.5314  2.5964  2.6149  0.6733  0.6971  0.7031  0.7107 
ITQ  0.5823  0.5695  0.5676  0.5661  2.1964  2.2568  2.2650  2.2747  0.5794  0.5715  0.5669  0.5656 
SH  0.4570  0.4218  0.4044  0.3982  1.8394  1.7668  1.7215  1.6894  0.4660  0.4204  0.3947  0.3785 
MAP  
Ours  0.7997  0.8618  0.8784  0.8830  0.7994  0.8317  0.8366  0.8361  0.7942  0.8437  0.8617  0.8642 
OneStage  0.7488  0.7995  0.8171  0.8259  0.7727  0.8059  0.8136  0.8179  0.7343  0.7870  0.8055  0.8109 
ITQCCA  0.6913  0.7264  0.7404  0.7396  0.7015  0.7053  0.7174  0.7254  0.6952  0.7254  0.7362  0.7427 
ITQ  0.5845  0.5747  0.5769  0.5741  0.6804  0.6822  0.6796  0.6795  0.5715  0.5984  0.5554  0.5549 
SH  0.4432  0.4071  0.3875  0.3799  0.6174  0.6057  0.5994  0.5952  0.4378  0.4184  0.3641  0.3485 
Weighted MAP  
Ours  0.8566  0.9255  0.9449  0.9505  2.0877  2.1926  2.2271  2.2294  0.8429  0.9005  0.9205  0.9229 
OneStage  0.8007  0.8595  0.8794  0.8903  2.0411  2.1584  2.1958  2.2187  0.7798  0.8414  0.8631  0.8698 
ITQCCA  0.7325  0.7725  0.7879  0.7866  1.7359  1.7518  1.7982  1.8185  0.7312  0.7666  0.7779  0.7854 
ITQ  0.6214  0.6129  0.6163  0.6132  1.6269  1.6403  1.6344  1.6369  0.6051  0.5715  0.5911  0.5906 
SH  0.4723  0.4342  0.4137  0.4051  1.4077  1.3598  1.3324  1.3150  0.4637  0.4205  0.3865  0.3697 
V-B Experimental Setting
We implement the proposed method based on the open-source Caffe [38] framework. The networks are trained using stochastic gradient descent. In training, the weights of the layers are initialized by the pretrained GoogLeNet model. The base learning rate is set to 0.0001, and after every 30 epochs on the training data the learning rate is reduced to one tenth of its current value. In all of our experiments, we first use GOP to obtain the bounding boxes of region proposals (no more than 1000 proposals per image). With these bounding boxes, we use non-maximum suppression to obtain a smaller number of boxes, and then select the top boxes with the highest confidence. We use multi-level spatial pyramid pooling. The number of feature maps in the last convolution layer is 32; hence the dimension of each intermediate feature vector is 960. The dimension of each proposal's feature vector in the hash coding module is set to the desired number of hash bits per category.

During training, we use a random sampling strategy to generate triplets (i.e., triplets of images (I, I^+, I^-) such that I is more similar to I^+ than to I^-). Specifically, the proposed network is trained by stochastic gradient descent, where the number of iterations is 15,000 and the mini-batch size is 32. We denote SharedLabels(I_1, I_2) as the number of labels shared by images I_1 and I_2 (for example, if I_1 and I_2 share the labels "dog" and "cat", then SharedLabels(I_1, I_2) = 2). The procedure of generating triplets in each iteration is shown in the following:
Input: a batch of 32 training images.
Output: a set of triplets T.
T ← ∅.
For every ordered triplet (I, I^+, I^-) of distinct images in the batch:
  If SharedLabels(I, I^+) > SharedLabels(I, I^-), add (I, I^+, I^-) to T.
End For
Output T.
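The triplet-generation procedure above can be sketched as follows (a minimal implementation over label sets; the helper names are ours):

```python
from itertools import permutations

def shared_labels(a, b):
    """Number of labels two images have in common."""
    return len(set(a) & set(b))

def make_triplets(batch):
    """Enumerate ordered triplets (I, I+, I-) of distinct batch images
    with SharedLabels(I, I+) > SharedLabels(I, I-)."""
    T = []
    for i, j, k in permutations(range(len(batch)), 3):
        if shared_labels(batch[i], batch[j]) > shared_labels(batch[i], batch[k]):
            T.append((i, j, k))
    return T

# toy batch of per-image label sets
batch = [{"dog", "cat"}, {"dog", "cat"}, {"car"}]
T = make_triplets(batch)
assert (0, 1, 2) in T          # images 0 and 1 share 2 labels; 0 and 2 share none
assert (0, 2, 1) not in T
```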
The source code of the proposed method is made publicly available at http://ss.sysu.edu.cn/~py/tiphashing.rar.
V-C Results on Semantic Hashing
The first set of experiments is to evaluate the performance of the proposed method in semantic hashing.
We use SH [14], ITQ [1], ITQ-CCA [1] and One-Stage Hashing [8] as the baselines in our experiments. SH and ITQ are unsupervised methods, while ITQ-CCA and One-Stage Hashing are supervised methods. One-Stage Hashing is a recently proposed deep-networks-based hashing method and the most closely related competitor to the proposed method. For SH, ITQ and ITQ-CCA, we use the pretrained GoogLeNet model (https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet) to extract features for the images; the feature vector of each image has 1024 dimensions. For a fair comparison, in our implementation of One-Stage Hashing, we use the architecture of GoogLeNet as its shared sub-network, instead of the NIN architecture used in [8]; we also use the same weighted triplet loss in (9) as the proposed method. This variant of One-Stage Hashing also uses the open-source Caffe for training.
Table II shows the comparison results w.r.t. NDCG@1000, ACG@1000, MAP and Weighted MAP. Figure 6 and Figure 7 show the NDCG and ACG with varying numbers of top retrieved images. As can be seen, the proposed method shows superior performance gains over the baselines. On VOC 2007 and VOC 2012, the NDCG@1000 values of the proposed method indicate relative increases over the second best baseline. The ACG@1000 value of the proposed method is 0.7731, compared to 0.7483 of One-Stage Hashing. On MIRFLICKR-25K, the MAP values also indicate a relative increase. It can be observed from these results that incorporating automatically generated region proposals and label probability calculation in the process of hash learning helps improve the performance of semantic hashing.
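For reference, ACG@p and NDCG@p with shared-label counts as graded relevance can be computed as in the sketch below (one common formulation following [35], using a 2^r − 1 gain and a logarithmic discount; the exact gain and discount used in the experiments may differ):

```python
import math

def acg_at_p(relevances, p):
    """Average Cumulative Gain over the top-p retrieved items, where
    relevances[i] is e.g. the number of labels shared with the query."""
    return sum(relevances[:p]) / p

def dcg_at_p(relevances, p):
    # Discounted cumulative gain: gain 2^r - 1, discount log2(rank + 1).
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances[:p]))

def ndcg_at_p(relevances, p):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_p(sorted(relevances, reverse=True), p)
    return dcg_at_p(relevances, p) / ideal if ideal > 0 else 0.0

ranked = [2, 0, 1, 1]        # shared-label counts of a ranked result list
print(acg_at_p(ranked, 4))   # 1.0
print(ndcg_at_p(ranked, 4))  # < 1.0, since a relevant item is ranked low
```

An ideally ordered list (relevances sorted in decreasing order) yields NDCG@p = 1.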
V-D Results on Category-aware Hashing
We also evaluate the performance of the proposed method for category-aware hashing. Since little effort has been devoted to category-aware hashing on multi-label images, to demonstrate the advantages of the proposed method, we implement a deep-networks-based baseline that also outputs one piece of hash code per category. As shown in Figure 9, this baseline adopts GoogLeNet as the basic framework. After the last (1024-dimensional) fully connected layer of GoogLeNet, another fully connected layer is added, and then this layer is separated into slices, one per category, each with dimension equal to the desired number of hash bits. For the slice corresponding to the i-th category, a triplet loss is defined which regards the images belonging to the i-th category as positive examples, and other images as negative ones. To train this baseline, we also use the pre-trained GoogLeNet model to initialize its weights.
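The per-category code layout shared by the baseline and the proposed method can be illustrated with a small NumPy sketch (the 20-category, 48-bit configuration here is only an example, chosen to match the 20 VOC classes; it is not necessarily the setting used in the experiments):

```python
import numpy as np

def slice_and_binarize(embedding, num_categories, bits_per_category):
    """Split a (num_categories * bits_per_category)-dim embedding into one
    q-bit binary code per category, thresholding each unit at zero."""
    assert embedding.shape[0] == num_categories * bits_per_category
    slices = embedding.reshape(num_categories, bits_per_category)
    return (slices > 0).astype(np.uint8)  # shape: (num_categories, q)

rng = np.random.default_rng(0)
emb = rng.standard_normal(20 * 48)        # stand-in for a network output
codes = slice_and_binarize(emb, 20, 48)
print(codes.shape)                        # (20, 48)
```

Each row of `codes` serves as the hash code for one category, so a retrieval system can index and query them independently.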
The baseline is a simpler category-aware retrieval system that does not use the region proposal module or the label probability module. The experimental results can thus tell us whether these two modules contribute to the accuracy improvement.
For a test query image, we first convert it into one piece of hash code per category, and then use the hash codes of the categories that the test image contains to conduct search in the corresponding hash tables and obtain lists of retrieved images.
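As a concrete illustration, the category-wise lookup described above might be sketched as follows (the image identifiers, 4-bit codes, and `tables` structure are all hypothetical):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary code strings."""
    return sum(x != y for x, y in zip(a, b))

def search(query_codes, query_labels, tables):
    """query_codes: dict category -> binary code string for the query.
    tables: dict category -> {image_id: code}.  Returns, for each category
    the query contains, database images ranked by Hamming distance."""
    results = {}
    for cat in query_labels:
        entries = tables.get(cat, {})
        results[cat] = sorted(entries,
                              key=lambda img: hamming(entries[img],
                                                      query_codes[cat]))
    return results

tables = {"bicycle": {"imgA": "0110", "imgB": "1110"},
          "sofa":    {"imgC": "0001"}}
query = {"bicycle": "0111", "sofa": "0000"}
print(search(query, ["bicycle", "sofa"], tables))
```

In practice the per-category tables would be bucketed by code for sub-linear lookup rather than ranked exhaustively as here.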
The MAP results (for each category) are shown in Figure 8. We can observe that the proposed method consistently outperforms the baseline. For example, on VOC 2007, the averaged MAP (over 20 classes) of the proposed method is 0.5831, compared to 0.3190 of the baseline. On VOC 2012 and MIRFLICKR-25K, the averaged MAP of the proposed method also shows substantial relative increases over the baseline. Figure 5 shows two examples of results from our experiments.
VI Conclusions and Future Work
In this paper, we proposed a deep-networks-based hashing method for multi-label image retrieval, which incorporates automatically generated region proposals and label probability calculation in the hash learning process. In the proposed deep architecture, an input image is converted into an instance-aware representation organized in groups, each group corresponding to a category. Based on this representation, we can easily generate binary hash codes for either semantic hashing or category-aware hashing. Empirical evaluations on both category-aware hashing and semantic hashing show that the proposed method substantially outperforms state-of-the-art methods.
In future work, we plan to study unsupervised instance-aware image retrieval, in which the virtual classes can be obtained by clustering.
References
[1] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 817–824.
[2] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 2130–2137.
[3] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in Proceedings of the International Conference on Machine Learning, 2011, pp. 1–8.
[4] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2014, pp. 2156–2162.
[5] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
[6] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in Proceedings of the Advances in Neural Information Processing Systems, 2009, pp. 1042–1050.
[7] G. Lin, C. Shen, D. Suter, and A. van den Hengel, “A general two-step approach to learning-based hashing,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2552–2559.
[8] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3270–3278.
[9] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
[10] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 346–361.
[12] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the International Conference on Very Large Data Bases, 1999, pp. 518–529.
[13] R. Salakhutdinov and G. Hinton, “Learning a nonlinear embedding by preserving class neighbourhood structure,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007, pp. 412–419.
[14] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proceedings of the Advances in Neural Information Processing Systems, 2008, pp. 1753–1760.
[15] M. Norouzi and D. M. Blei, “Minimal loss hashing for compact binary codes,” in Proceedings of the International Conference on Machine Learning, 2011, pp. 353–360.
[16] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. Dick, “Learning hash functions using column generation,” in Proceedings of the International Conference on Machine Learning, 2013, pp. 142–150.
[17] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for scalable image retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3424–3431.
[18] D. Tao, X. Tang, X. Li, and X. Wu, “Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 7, pp. 1088–1099, 2006.
[19] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[20] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen, “Deep learning of binary hash codes for fast image retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 27–35.
[21] D. Wang, P. Cui, M. Ou, and W. Zhu, “Deep multimodal hashing with orthogonal regularization,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015, pp. 2291–2297.
[22] J. Carreira and C. Sminchisescu, “CPMC: Automatic object segmentation using constrained parametric min-cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1312–1328, 2012.
[23] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[24] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335.
[25] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “BING: Binarized normed gradients for objectness estimation at 300fps,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
[26] P. Krähenbühl and V. Koltun, “Geodesic object proposals,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 725–739.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv preprint arXiv:1409.4842, 2014.
[28] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, “The difficulty of training deep architectures and the effect of unsupervised pre-training,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009, pp. 153–160.
[29] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “CNN: Single-label to multi-label,” arXiv preprint arXiv:1406.5726, 2014.
[30] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep convolutional ranking for multilabel image annotation,” arXiv preprint arXiv:1312.4894, 2013.
[31] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” arXiv preprint arXiv:1411.7718, 2014.
[32] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1–9.
[33] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[34] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
[35] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.
[36] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern Information Retrieval. ACM Press New York, 1999, vol. 463.
[37] K. Järvelin and J. Kekäläinen, “IR evaluation methods for retrieving highly relevant documents,” in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 41–48.
[38] Y. Jia, “Caffe: An open source convolutional architecture for fast feature embedding,” http://caffe.berkeleyvision.org, 2013.