Instance-Aware Hashing for Multi-Label Image Retrieval

03/10/2016 · Hanjiang Lai et al. · National University of Singapore, Sun Yat-sen University

Similarity-preserving hashing is a commonly used method for nearest neighbour search in large-scale image retrieval. For image retrieval, deep-networks-based hashing methods are appealing since they can simultaneously learn effective image representations and compact hash codes. This paper focuses on deep-networks-based hashing for multi-label images, each of which may contain objects of multiple categories. In most existing hashing methods, each image is represented by one piece of hash code, which is referred to as semantic hashing. This setting may be suboptimal for multi-label image retrieval. To solve this problem, we propose a deep architecture that learns instance-aware image representations for multi-label image data, which are organized in multiple groups, with each group containing the features for one category. The instance-aware representations not only bring advantages to semantic hashing, but also can be used in category-aware hashing, in which an image is represented by multiple pieces of hash codes and each piece of code corresponds to a category. Extensive evaluations conducted on several benchmark datasets demonstrate that, for both semantic hashing and category-aware hashing, the proposed method shows substantial improvement over the state-of-the-art supervised and unsupervised hashing methods.


I Introduction

Large-scale image retrieval, which aims to find images containing objects similar to those in a query image, has attracted increasing interest due to the ever-growing amount of image data available on the Web. Similarity-preserving hashing is a popular nearest neighbor search technique for image retrieval on datasets with millions or even billions of images.

Fig. 1: Illustration of instance-aware image retrieval. (1) Given a query image, e.g., containing a bicycle and a sofa, the proposed method (2) generates region proposals, (3) computes the label probability scores for each proposal, (4) encodes each proposal to an intermediate feature vector, and then computes the weighted average of these vectors (with the label probability scores being the weights) to generate the instance-aware representation organized in multiple groups, each corresponding to an object. After that, this representation is converted to (5a) one piece of hash code for semantic hashing or (5b) multiple pieces of hash codes, each piece corresponding to a category, for category-aware hashing.

A representative stream of similarity-preserving hashing is learning-to-hash, i.e., learning to compress data points (e.g., images) into binary representations such that semantically similar data points have nearby binary codes. The existing learning-to-hash methods can be divided into two main categories: unsupervised methods and supervised methods. Unsupervised methods (e.g., [1, 2, 3]) learn a set of hash functions from unlabeled data without any side information. Supervised methods (e.g., [4, 5, 6, 7]) try to learn compact hash codes by leveraging supervised information on data points (e.g., similarities on pairs of images). Among various supervised learning-to-hash methods for image retrieval, an emerging stream is deep-networks-based hashing that learns bitwise codes as well as image representations via carefully designed deep neural networks. Several deep-networks-based hashing methods have been proposed (e.g., [4, 8, 9]).

Fig. 2: Overview of the proposed deep architecture for hashing on multi-label images. The proposed architecture takes an image (with labels from c classes) and its automatically generated region proposals as inputs. The image first goes through the deep convolution sub-network, and then intermediate feature vectors are generated for the region proposals via the spatial pyramid pooling scheme. With the intermediate features, our network is divided into two branches: one for calculating the label probabilities of the region proposals (see Fig. 3), and the other for generating the proposals' features. Cross-proposal fusion is performed to merge the proposals' features (weighted by the label probabilities) into an intermediate multiple-slice representation (i.e., the vector u) in which each slice corresponds to one category (see Fig. 4). After that, this intermediate representation is converted to multiple pieces of hash codes (for category-aware hashing) or one piece of hash code (for semantic hashing).

Multi-label images, each of which may contain objects of multiple categories, are widely involved in many image retrieval systems. However, in most existing hashing methods for images, the semantic similarities are defined at image level, and each image is represented by one piece of hash code. This setting may be suboptimal for multi-label image retrieval.

In this paper, we consider instance-aware retrieval for multi-label image data, which includes semantic hashing [10] and category-aware hashing. Specifically, given a multi-label query image, a natural demand is to organize the retrieved results in groups, each group corresponding to one category. For example, as shown in Figure 1, given a query image containing a bicycle and a sofa, one would like to organize the retrieved results in two groups: each image in the first (second) group contains a bicycle (sofa) similar to the one in the query image. In order to achieve instance-aware retrieval, we propose a new image representation organized in groups, by incorporating automatically generated candidate object proposals and label probability calculation into the learning process. Figure 1 shows an example for the generation of the instance-aware image representation.

More specifically, we propose a deep neural network that simultaneously learns binary hash codes and the representation tailored for multi-label image data. As shown in Figure 2, the proposed architecture has four building blocks: 1) a set of n automatically generated candidate object proposals in the form of bounding boxes, as inputs to the deep neural network; 2) stacked convolutional layers to capture the features of the input proposals, followed by a Spatial Pyramid Pooling (SPP) layer [11] to map each proposal to a d-dimensional intermediate representation; 3) a label probability calculation module that maps the intermediate representations to the image labels (in c classes), which leads to an n-by-c probability matrix whose i-th row represents the probabilities of the i-th proposal belonging to each class; 4) a hash coding module, which first computes an instance-aware representation, using the probability matrix from the third module as input, and then generates either category-aware hash codes or semantic hash codes based on this representation.

The proposed deep architecture can be used to generate hash codes for category-aware hashing, where an image is represented by multiple pieces of hash codes, each of which corresponds to a category. In addition, we show that the proposed image representation can improve the quality of semantic hashing in which an image is represented by one piece of hash code.

Our contributions in this paper can be summarized as follows. First, we propose a deep architecture that can generate hash codes for instance-aware retrieval. To the best of our knowledge, we are the first to conduct instance-aware retrieval via learning-based hashing. Second, we propose to incorporate automatically generated candidate object proposals and label probability calculation in the proposed deep architecture. We empirically show that the proposed method achieves substantial performance gains over several state-of-the-art hashing methods.

II Related Work

Due to its encouraging search speed, hashing has become a popular method for nearest neighbor search in large-scale image retrieval.

Hashing methods can be divided into data independent hashing and data dependent hashing. The early efforts mainly focus on data independent hashing. For example, the notable Locality-Sensitive Hashing (LSH) [12] method constructs hash functions by random projections or random permutations that are independent of the data points. The main limitation of data independent methods is that they usually require long hash codes to obtain good performance. However, long hash codes lead to inefficient search due to the required large storage space and the low recall rates.
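As a minimal illustration of this data-independent construction, the following Python/NumPy sketch hashes feature vectors with random hyperplane projections, the classic LSH scheme for cosine similarity; the dimensions and bit length are arbitrary illustrative choices, not values from any particular system.

import numpy as np

def lsh_hash(X, n_bits=32, dim=1024, seed=0):
    """Hash rows of X (n x dim) into n_bits-bit codes via random hyperplanes."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, n_bits))   # data-independent random projections
    return (X @ W > 0).astype(np.uint8)      # one bit per projection

# Example: hash two 1024-dimensional feature vectors into 32-bit codes.
X = np.random.default_rng(1).standard_normal((2, 1024))
codes = lsh_hash(X)
print(codes.shape)  # (2, 32)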

Learning-based hashing (or Learning-to-hash) pursues a compact binary representation from the training data. Based on whether side information is used or not, learning-to-hash methods can be divided into two categories: unsupervised methods and supervised methods.

Unsupervised methods try to learn a set of similarity-preserving hash functions only from the unlabeled data. Representative methods in this category include Kernelized LSH (KLSH) [2], Semantic hashing [13], Spectral hashing [14], Anchor Graph Hashing [3], and Iterative Quantization (ITQ) [1]. Kernelized LSH (KLSH) [2] generalizes LSH to accommodate arbitrary kernel functions, making it possible to learn hash functions which preserve data points’ similarity in a kernel space. Semantic hashing [13] generates hash functions by a deep auto-encoder via stacking multiple restricted Boltzmann machines (RBMs). Graph-based hashing methods, such as Spectral hashing [14] and Anchor Graph Hashing [3], learn non-linear mappings as hash functions which try to preserve the similarities within the data neighborhood graph. In order to reduce the quantization errors, Iterative Quantization (ITQ) [1] seeks to learn an orthogonal rotation matrix which is applied to the data matrix after principal component analysis projections.

Supervised methods aim to learn better bitwise representations by incorporating supervised information. Notable methods in this category include Binary Reconstruction Embedding (BRE) [6], Minimal Loss Hashing (MLH) [15], Supervised Hashing with Kernels (KSH) [5], Column Generation Hash (CGHash) [16], and Semi-Supervised Hashing (SSH) [17]. Binary Reconstruction Embedding (BRE) [6] learns hash functions by explicitly minimizing the reconstruction errors between the original distances of data points and the Hamming distances of the corresponding binary codes. Minimal Loss Hashing (MLH) [15] learns similarity-preserving hash codes by minimizing a hinge-like loss function which is formulated as structured prediction with latent variables. Supervised Hashing with Kernels (KSH) [5] is a kernel-based supervised method which learns to hash the data points to compact binary codes whose Hamming distances are minimized on similar pairs and maximized on dissimilar pairs. Column Generation Hash (CGHash) [16] is a column generation based method to learn hash functions with proximity comparison information. Semi-Supervised Hashing (SSH) [17] learns hash functions via minimizing similarity errors on the labeled data while simultaneously maximizing the entropy of the learnt hash codes over the unlabeled data. In most image retrieval applications, the number of labeled positive samples is small, which results in bias towards the negative samples and over-fitting. Tao et al. [18] proposed an asymmetric bagging and random subspace SVM (ABRS-SVM) to handle these problems.

In supervised hashing methods for image retrieval, an emerging stream is the deep-networks-based methods [19, 4, 8, 9] which learn image representations as well as binary hash codes. Xia et al. [4] proposed Convolutional-Neural-Networks-based Hashing (CNNH), which is a two-stage method. In its first stage, approximate hash codes are learned from the supervised information. Then, in the second stage, hash functions are learned based on those approximate hash codes via deep convolutional networks. Lai et al. [8] proposed a one-stage hashing method that generates bitwise hash codes via a carefully designed deep architecture. Zhao et al. [9] proposed a ranking based hashing method for learning hash functions that preserve multi-level semantic similarity between images, via deep convolutional networks. Lin et al. [20] proposed to learn the hash codes and image representations in a point-wise manner, which is suitable for large-scale datasets. Wang et al. [21] proposed the Deep Multimodal Hashing with Orthogonal Regularization (DMHOR) method for multimodal data. All of these methods generate one piece of hash code for each image, which may be inappropriate for multi-label image retrieval. Different from the existing methods, the proposed method can generate multiple pieces of hash codes for an image, each piece corresponding to an instance/category.

III The Proposed Method

Our method consists of four modules. The first module generates region proposals for an input image. The second module captures the features of the generated region proposals; it contains a deep convolution sub-network followed by a Spatial Pyramid Pooling layer [11]. The third module is a label probability calculation module, which outputs a probability matrix whose i-th row represents the probability scores of the i-th proposal belonging to each class. The fourth module is a hash coding module that first generates the instance-aware representation, and then converts this representation to hash codes for either category-aware hashing or semantic hashing. In the following, we present the details of these modules, respectively.

III-A Region Proposal Generation Module

Many methods for generating category-independent region proposals have been proposed, e.g., Constrained Parametric Min-Cuts (CPMC) [22], Selective Search [23], Multi-scale Combinatorial Grouping (MCG) [24], Binarized Normed Gradients (BING) [25] and Geodesic Object Proposals (GOP) [26]. In this paper, we use GOP [26] to automatically generate region proposals for an input image. Note that other methods for region proposal generation can also be used in our framework.

GOP is a method that can generate both segmentation masks and bounding box proposals. We use the code provided by the authors (http://www.philkr.net/home/gop) to generate the bounding boxes for region proposals.

III-B Deep Convolution Sub-Network Module

GoogLeNet [27] is a recently proposed deep architecture that has shown its success in object categorization and object detection. The core of GoogLeNet is the Inception-style convolution module, which allows increasing the depth and width of the network while keeping reasonable computational costs. Here we adopt the architecture of GoogLeNet as our basic framework to compute the features of the input proposals. Since GoogLeNet is a very deep network with many layers, we use the pre-trained GoogLeNet model (http://dl.caffe.berkeleyvision.org/bvlc_googlenet.caffemodel) to initialize its weights, which can be regarded as regularization [28] and helps its generalization.

However, since the number of generated region proposals for an input image may be large (e.g., more than 1000), it is computationally expensive if one directly uses GoogLeNet to extract features from these proposals. This is unaffordable for hashing-based retrieval since the retrieval system may need a long time to respond to a query.

To address this issue, we use the “Spatial Pyramid Pooling” (SPP) scheme [11]. The advantage of using SPP is that we can compute the feature map of the entire input image only once. Then, with this feature map, we pool the features within each generated region proposal to obtain a fixed-length representation. Using SPP, we avoid repeatedly computing features for the input region proposals via a deep convolutional network. Specifically, as shown in Figure 2, we add an SPP layer after the last convolutional layer of GoogLeNet. We assume that each input image has n automatically generated region proposals. For each input region proposal, we encode its top-left and bottom-right coordinates into a 4-dimensional vector. The elements in this vector are scaled to [0, 1] by dividing by the width/height of the image, making them invariant to the absolute image size. With this 4-dimensional vector as the input, the SPP layer generates a fixed-length feature vector for the corresponding proposal. Through the SPP layer, each proposal is mapped to a d-dimensional intermediate feature vector. Hence, for each input image, the output of this module is an n-by-d matrix X.
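To make this step concrete, the following NumPy sketch max-pools one region of a precomputed feature map over a small spatial pyramid; the feature-map size, pyramid levels, and box used below are illustrative assumptions rather than the exact settings of the experiments (see Section V-B).

import numpy as np

def spp_region_features(feature_map, box, levels=(1, 2, 3)):
    """Max-pool one region of a conv feature map into a fixed-length vector.

    feature_map: (C, H, W) activations of the last conv layer (computed once per image).
    box: (x1, y1, x2, y2) in normalized [0, 1] coordinates (invariant to image size).
    levels: pyramid grid sizes; the output length is C * sum(l*l for l in levels).
    """
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = box
    # map the normalized box to feature-map coordinates (at least one cell wide/high)
    c0, c1 = int(x1 * W), max(int(np.ceil(x2 * W)), int(x1 * W) + 1)
    r0, r1 = int(y1 * H), max(int(np.ceil(y2 * H)), int(y1 * H) + 1)
    region = feature_map[:, r0:r1, c0:c1]
    pooled = []
    for l in levels:
        rows = np.array_split(np.arange(region.shape[1]), l)
        cols = np.array_split(np.arange(region.shape[2]), l)
        for rs in rows:
            for cs in cols:
                cell = region[:, rs[:, None], cs[None, :]]
                pooled.append(cell.max(axis=(1, 2)) if cell.size else np.zeros(C))
    return np.concatenate(pooled)   # length C * sum(l*l)

# Example: 32 feature maps, a 3-level pyramid -> 32 * (1 + 4 + 9) = 448-dim vector.
fm = np.random.rand(32, 14, 14)
vec = spp_region_features(fm, (0.1, 0.2, 0.6, 0.8))
print(vec.shape)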

After this module, the network is divided into two branches: one for the label probability calculation module, and the other for the hash coding module.

Fig. 3: Illustration of the proposed label probability calculation module. For an image (in c classes) with n region proposals, our network generates a c-dimensional probability vector for each proposal. After that, cross-hypothesis max-pooling is used to fuse the n probability vectors into one vector.

III-C Label Probability Calculation Module

In this subsection, we show how to learn the label probability for each region proposal. Suppose there are c class labels, and a c-dimensional probability vector is generated for each proposal, whose j-th element indicates the probability that the proposal contains an object of the j-th category.

However, we do not have ground truth labels for each proposal, so this probability distribution cannot be learned directly. Fortunately, in multi-label image annotation there is a label for the whole image, e.g., (I, y), where I represents an image and y = (y_1, y_2, ..., y_c) is its ground truth label vector, with y_j equal to 1 if the j-th label is relevant to the image I and 0 otherwise. Hence, we can first fuse the proposals' predictions into one image-level prediction and then use the whole image's label to guide the learning, as shown in Figure 3.

More specifically, given the matrix X whose i-th row x_i is the d-dimensional intermediate feature of the i-th proposal, in this module we first use a fully connected layer to compress each x_i to a c-dimensional vector z_i. After that, we use cross-hypothesis max-pooling [29] to fuse z_1, z_2, ..., z_n into one c-dimensional vector. Specifically, let Z be the n-by-c matrix whose i-th row is z_i. The cross-hypothesis max-pooling can be formulated as

$t_j = \max_{i = 1, \dots, n} Z_{ij}, \quad j = 1, 2, \dots, c, \qquad (1)$

where t_j is the pooled value that corresponds to the j-th category.

Using t = (t_1, t_2, ..., t_c), we calculate a probability distribution via the softmax function:

$\hat{p}_j = \frac{\exp(t_j)}{\sum_{k=1}^{c} \exp(t_k)}, \quad j = 1, 2, \dots, c, \qquad (2)$

where p̂_j can be regarded as the probability score that the input image contains an object of the j-th category. With such cross-hypothesis max-pooling, if the i-th proposal contains an object of the j-th category, then Z_{ij} should have a large value and the pooled value t_j (and hence p̂_j) will have a high response. Hence, the image-level supervision can guide the learning of the per-proposal scores.
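A minimal NumPy sketch of the pooling in (1) and the softmax in (2), assuming Z is the n-by-c matrix of per-proposal scores produced by the fully connected layer:

import numpy as np

def cross_hypothesis_maxpool(Z):
    """Z: (n, c) per-proposal scores -> t: (c,) pooled scores (Eq. 1)."""
    return Z.max(axis=0)

def softmax(t):
    """Image-level probability distribution over the c categories (Eq. 2)."""
    e = np.exp(t - t.max())          # shift for numerical stability
    return e / e.sum()

# Example with n = 4 proposals and c = 3 categories.
Z = np.random.default_rng(0).standard_normal((4, 3))
p_hat = softmax(cross_hypothesis_maxpool(Z))
print(p_hat, p_hat.sum())            # probabilities summing to 1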

In this module, we define a loss function based on the cross entropy between the probability scores and the ground truth labels:

$L = -\frac{1}{|Y|} \sum_{j \in Y} \log \hat{p}_j, \qquad (3)$

where Y denotes the set of categories to which the input image belongs, and |Y| denotes the number of elements in Y. This loss function is also referred to as Softmax-Loss [30], which is widely used in convolutional neural networks. The (sub-)gradients with respect to the pooled values t_j are

$\frac{\partial L}{\partial t_j} = \hat{p}_j - \frac{1}{|Y|}\, \mathbb{1}[j \in Y], \quad j = 1, 2, \dots, c, \qquad (4)$

where 1[j ∈ Y] equals 1 if j ∈ Y and 0 otherwise. These (sub-)gradients can be easily integrated into back-propagation in neural networks.
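Continuing the sketch above, the loss in (3) and its gradient in (4) can be computed as follows (Y is the set of ground-truth categories of the image):

import numpy as np

def multilabel_softmax_loss(t, Y):
    """Cross-entropy between the softmax of t (Eq. 2) and the ground-truth label set Y.

    t: (c,) pooled scores; Y: iterable of ground-truth category indices.
    Returns the loss (Eq. 3) and its gradient w.r.t. t (Eq. 4).
    """
    e = np.exp(t - t.max())
    p_hat = e / e.sum()
    Y = list(Y)
    loss = -np.mean(np.log(p_hat[Y]))
    grad = p_hat.copy()
    grad[Y] -= 1.0 / len(Y)          # subtract the uniform target mass on relevant labels
    return loss, grad

loss, grad = multilabel_softmax_loss(np.array([2.0, -1.0, 0.5]), Y=[0, 2])
print(loss, grad)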

After that, similarly to (2), we can define an n-by-c probability matrix P for the region proposals as

$P_{ij} = \frac{\exp(Z_{ij})}{\sum_{k=1}^{c} \exp(Z_{ik})}, \qquad (5)$

where Z_{ij} represents the j-th element in the i-th row of Z. P_{ij} can be viewed as the probability that the i-th proposal contains an object of the j-th category.

III-D Hash Coding Module

In this subsection, we will show how to convert the image representation into (a) one piece of hash code for semantic hashing or (b) multiple pieces of hash codes, each piece corresponding to a category, for category-aware hashing.

With the matrix X as the input, in this module we first use a fully connected layer to compress each x_i to a q-dimensional vector v_i, where v_i corresponds to the i-th proposal. We denote V as the n-by-q matrix whose i-th row is v_i.

III-D1 Cross-Proposal Fusion

In order to convert V into the instance-aware representation of the input image, we propose a cross-proposal fusion strategy that uses the probability matrix P from the label probability calculation module.

Specifically, with the n-by-q feature matrix V and the n-by-c matrix P, where P_{ij} represents the probability of the i-th region proposal belonging to the j-th category, we fuse V and P into a long vector u with cq elements. This vector is organized in c groups, each group being a q-dimensional feature vector corresponding to one category.

Let p_i and v_i represent the i-th rows of P and V, respectively. The cross-proposal fusion can be formulated as

$\mathbf{u} = \sum_{i=1}^{n} \mathbf{p}_i \otimes \mathbf{v}_i, \qquad (6)$

where ⊗ is the Kronecker product. For the c-dimensional vector p_i and the q-dimensional vector v_i, the Kronecker product p_i ⊗ v_i is the cq-dimensional vector [P_{i1} v_i; P_{i2} v_i; ...; P_{ic} v_i].

Let u = [u^(1); u^(2); ...; u^(c)], where each u^(j) is a q-dimensional vector. It is easy to verify that u^(j) = Σ_{i=1}^{n} P_{ij} v_i.

Since v_i represents the features of the i-th proposal, u^(j) can be regarded as a weighted average of the proposals' features. If the i-th proposal has a relatively higher/lower score P_{ij} (meaning that the i-th proposal likely/unlikely belongs to the j-th category), the feature vector v_i has more/less contribution to the weighted average u^(j).
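Because of the identity above, the fusion in (6) reduces to a single matrix product. A short NumPy sketch, assuming P (n-by-c) and V (n-by-q) have already been computed:

import numpy as np

def cross_proposal_fusion(P, V):
    """Fuse proposal features V (n, q) with probabilities P (n, c) into u (Eq. 6).

    Returns U of shape (c, q); flattening U row-wise gives the cq-dimensional
    vector u, whose j-th group u^(j) = sum_i P[i, j] * V[i] (the weighted average).
    """
    return P.T @ V                   # (c, n) x (n, q) -> (c, q)

# Example: n = 5 proposals, c = 2 categories, q = 4 features per category.
rng = np.random.default_rng(0)
P = rng.random((5, 2))
V = rng.random((5, 4))
U = cross_proposal_fusion(P, V)
# Equivalent Kronecker-product form: sum_i kron(P[i], V[i]) reshaped to (c, q).
U_kron = sum(np.kron(P[i], V[i]) for i in range(5)).reshape(2, 4)
print(np.allclose(U, U_kron))        # True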

Fig. 4: Illustration of the cross-proposal fusion. Each region proposal is first encoded into a feature vector, and these feature vectors are then fused into an intermediate feature representation, using the probability scores (learned by the label probability calculation module) as the weights.

Figure 4 shows an illustrative example of cross-proposal fusion. Suppose there are only two categories (bicycle and sofa, c = 2) in all of the images, and n region proposals are generated for the input image in the first module. Then, suppose the label probability calculation module generates an n-by-2 probability matrix P. For the 1st proposal, a large P_{11} indicates that it is very likely to contain a bicycle, while a small P_{12} indicates that it is unlikely to contain a sofa. In the hash coding module, the i-th proposal is represented by a q-dimensional feature vector v_i. Finally, for the input image, we conduct cross-proposal fusion to obtain an instance-aware representation u = [u^(1); u^(2)], where the representation of “bicycle” is u^(1) = Σ_i P_{i1} v_i and the representation of “sofa” is u^(2) = Σ_i P_{i2} v_i.

The cross-proposal fusion is a crucial step for the instance-aware image representation. If the image contains an object, then the instance-aware representation will have a high response for the object. It also gives us a simple way to combine the multi-label information into the hashing procedure.

Discussions. A concern arising here is that some input proposals may be inaccurate or may not contain any objects at all, which makes the features generated from these proposals noisy and may harm the final performance. We argue that, to some extent, the operations in the proposed cross-hypothesis max-pooling and cross-proposal fusion can reduce the negative effects of possibly noisy input proposals. Firstly, in the label probability calculation module before the cross-hypothesis max-pooling, each input proposal is assigned a set of probability scores (one score per label). Higher scores are assigned to the proposals that are more likely to contain objects, and lower scores are assigned to the noisy proposals; hence, the noisy proposals are more likely to be suppressed by the cross-hypothesis max-pooling. Secondly, similar to [31], in the cross-proposal fusion the proposals' features are weighted by their probability scores, so the noisy proposals' features contribute less to the final feature representation. In summary, the re-weighting operations in the cross-hypothesis max-pooling and the cross-proposal fusion can reduce the negative effects of inaccurate input proposals.

With the vector u generated by the cross-proposal fusion, we can generate either the category-aware hash representation, which consists of c pieces of hash codes, or the semantic hash representation, which consists of one piece of hash code. Next we present these two cases separately.

Fig. 5: Two examples of the results of category-aware hashing. Each retrieved image has a corresponding grey saliency map, in which lighter regions indicate a higher likelihood of an object existing at that position. Given a query image, the proposed method returns multiple lists of images, each list corresponding to a category. Each image in a returned list is likely to contain an object similar to one in the query image, where the approximate location of this object is shown in the corresponding grey image.

III-D2 Category-aware Hash Representation

Since u is organized in c groups u^(1), u^(2), ..., u^(c), each u^(j) can be converted into a q-bit binary code b^(j) by element-wise thresholding (e.g., setting the k-th bit to 1 if the k-th element of u^(j) exceeds 0.5, and to 0 otherwise). For an image I, the category-aware hash representation of I is [b^(1), b^(2), ..., b^(c)].

To learn this representation, we define c triplet loss functions [32, 8], one for each category. To obtain triplet samples for a category, we randomly select an image I and an image I+ that both belong to that category, and the negative image I− is randomly selected from those which do not contain the category. Then we design a triplet loss that tries to preserve the relative similarities in the form “image I is more similar to image I+ than to I−”. For the j-th category (j = 1, 2, ..., c), suppose we have three images I, I+ and I−, where both I and I+ belong to the j-th category but I− does not. Then the triplet loss associated to the j-th category is defined by

$\ell_j(I, I^+, I^-) = \max\!\left(0,\; 1 + \|u^{(j)}(I) - u^{(j)}(I^+)\|_2^2 - \|u^{(j)}(I) - u^{(j)}(I^-)\|_2^2\right), \qquad (7)$

where u^(j)(I) represents the vector u^(j) computed for the image I.

This loss function is convex. Its sub-gradients with respect to u^(j)(I), u^(j)(I+) and u^(j)(I−) can be easily obtained by

$\frac{\partial \ell_j}{\partial u^{(j)}(I)} = 2\!\left(u^{(j)}(I^-) - u^{(j)}(I^+)\right), \quad \frac{\partial \ell_j}{\partial u^{(j)}(I^+)} = 2\!\left(u^{(j)}(I^+) - u^{(j)}(I)\right), \quad \frac{\partial \ell_j}{\partial u^{(j)}(I^-)} = 2\!\left(u^{(j)}(I) - u^{(j)}(I^-)\right), \qquad (8)$

when the hinge term in (7) is active (i.e., the loss is positive). Otherwise the sub-gradients are all zeros.
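A small NumPy sketch of the per-category triplet loss and its sub-gradients, under the margin-1, squared-L2 form of (7) and (8); ua, up and un stand for u^(j)(I), u^(j)(I+) and u^(j)(I−):

import numpy as np

def triplet_loss_and_grads(ua, up, un, margin=1.0):
    """Hinge triplet loss on squared L2 distances (Eq. 7) and its sub-gradients (Eq. 8)."""
    d_pos = np.sum((ua - up) ** 2)
    d_neg = np.sum((ua - un) ** 2)
    loss = max(0.0, margin + d_pos - d_neg)
    if loss > 0.0:                       # hinge is active
        g_a = 2.0 * (un - up)
        g_p = 2.0 * (up - ua)
        g_n = 2.0 * (ua - un)
    else:                                # otherwise all sub-gradients are zero
        g_a = g_p = g_n = np.zeros_like(ua)
    return loss, (g_a, g_p, g_n)

rng = np.random.default_rng(0)
loss, grads = triplet_loss_and_grads(rng.random(16), rng.random(16), rng.random(16))
print(loss)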

III-D3 Semantic Hash Representation

For semantic hashing, we assume the target length of a hash code is r bits. We first use a fully connected layer to convert the cq-dimensional u to an r-dimensional vector g. Then g can be converted to an r-bit binary code by element-wise thresholding, in the same way as for the category-aware codes.

Next we present the triplet loss defined on g. Since the original triplet loss [32, 8] is designed for single-label data, here we propose a weighted triplet loss for multi-label data. Specifically, we define the similarity function s(I_1, I_2) as the number of shared labels between the images I_1 and I_2. Then, for the images I, I+ and I−, the weighted triplet loss is defined by

(9)

where the triplet loss term has the same form as in (7) but is computed on the r-dimensional vectors, and g(I) is the r-dimensional vector for the image I.
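The following sketch shows one plausible instantiation of such a label-weighted triplet loss; the specific weight used here, the label-overlap margin s(I, I+) − s(I, I−), is an illustrative assumption and not necessarily the exact weighting in (9).

import numpy as np

def shared_labels(y1, y2):
    """Number of labels shared by two multi-label images (binary label vectors)."""
    return int(np.sum((y1 > 0) & (y2 > 0)))

def weighted_triplet_loss(ga, gp, gn, ya, yp, yn, margin=1.0):
    """Triplet loss of the form (7) on the r-dimensional codes, scaled by a label-overlap weight.

    NOTE: the weight s(I, I+) - s(I, I-) is an assumption for illustration only.
    """
    base = max(0.0, margin + np.sum((ga - gp) ** 2) - np.sum((ga - gn) ** 2))
    weight = max(shared_labels(ya, yp) - shared_labels(ya, yn), 0)
    return weight * base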

In many existing supervised hashing methods, the side information is in the form of pairwise labels indicating the similarities/dissimilarities of image pairs. In these hashing methods, a straightforward way is to define pairwise loss functions which preserve the pairwise similarities of images. Some recent papers (e.g., [32, 8]) learn hash functions by using triplet loss functions, which seek to preserve the relative similarities in the form “image I is more similar to image I+ than to image I−”. Such triplet-based relative similarities can be more easily obtained than pairwise similarities (e.g., from users' click-through data in image retrieval applications).

IV Category-Aware Retrieval

Suppose that we have a database of N images for retrieval. Independently of the other categories, for the j-th category (j = 1, 2, ..., c), we can generate the binary codes b^(j) for all database images and then conduct retrieval based on these codes. Hence, for a query image, the retrieved results can be organized in c groups, where the j-th group is a list of images, each of which is likely to contain a similar object of the j-th category.

An issue that needs to be considered here is that the number of object categories in an image may be less than c, so it is inappropriate to organize the retrieved results in c groups for all of the query images. The label probability calculation module (see Section III-C) outputs a probability vector p̂, where p̂_j represents the predicted probability that the input image contains objects of the j-th category. Among the c groups of retrieved results, we can remove the j-th group if and only if p̂_j is less than some threshold, which we set empirically in our experiments.

For the images in the database for retrieval, each image is first encoded into c pieces of q-bit binary codes. Then we collect those hash codes whose probability score (i.e., p̂_j for the j-th piece of hash code) is no less than the threshold. We organize the collected hash codes in c groups; the j-th group contains the hash codes that are the j-th pieces of code of some images. Finally, we build a hash table to store the hash codes in each group, respectively. In retrieval, for a test query image, we first convert it into c pieces of q-bit codes, and then remove those codes with a probability score less than the threshold. For each of the remaining hash codes, we conduct search in the corresponding hash table and obtain a list of retrieved images.
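The following NumPy sketch illustrates this retrieval pipeline, with a linear Hamming-distance scan standing in for the per-category hash tables; the threshold value and sizes are illustrative placeholders.

import numpy as np

def build_category_tables(codes, probs, tau=0.5):
    """Group database hash codes per category, keeping only confident pieces.

    codes: (N, c, q) binary codes, one q-bit piece per category per image.
    probs: (N, c) predicted probability scores p_hat for each image.
    tau:   confidence threshold (set empirically in the paper; 0.5 is a placeholder).
    Returns, for each category, the kept codes and the indices of the source images.
    """
    N, c, q = codes.shape
    tables = []
    for j in range(c):
        keep = np.where(probs[:, j] >= tau)[0]
        tables.append((codes[keep, j, :], keep))
    return tables

def query_category(table, query_code, topk=5):
    """Rank one category's table by Hamming distance to the query's code piece."""
    db_codes, image_ids = table
    if len(image_ids) == 0:
        return np.array([], dtype=int)
    dists = np.count_nonzero(db_codes != query_code, axis=1)   # Hamming distances
    return image_ids[np.argsort(dists)[:topk]]

# Example: N = 100 images, c = 3 categories, q = 16 bits per piece.
rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(100, 3, 16))
probs = rng.random((100, 3))
tables = build_category_tables(codes, probs)
print(query_category(tables[0], rng.integers(0, 2, size=16)))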

Figure 5 shows two examples of results from our experiments. For the first example, when retrieving with a query image containing a cat and a dog, the proposed method returns two lists of retrieved images. Each image in the first/second list is likely to contain a cat/dog similar to that in the query image, where the approximate location of this cat/dog is also indicated (in the associated grey image). The grey images are saliency maps that are obtained from the automatically generated region proposals (see Section III-A) and the predicted probability scores (see Section III-C).

V Experiments

In this section, we evaluate the performance of the proposed method for either semantic hashing or category-aware hashing, and compare it with several state-of-the-art hashing methods.

V-A Datasets and Evaluation Metrics

We evaluate the proposed method on three public datasets of multi-label images: VOC 2007 [33], VOC 2012 [33] and MIRFLICKR-25K [34].

  • VOC 2007 consists of 9,963 multi-label images collected from Flickr (http://www.flickr.com/). There are 20 object classes in this dataset. On average, each image is annotated with 1.5 labels.

  • VOC 2012 consists of 22,531 multi-label images in 20 classes. Since the ground truth labels of the test images are not available, in our experiments, we only use 11,540 images from its training and validation set.

  • MIRFLICKR-25K consists of 25,000 multi-label images downloaded from Flickr. There are 38 classes in this dataset. Each image has 4.7 labels on average.

In each dataset, we randomly select 2,000 images as the test query set, and the remaining images are used as training samples. Note that we only use the 11,540 images in the VOC 2012 dataset that have ground truth labels. The numbers of training and testing samples are shown in Table I:

VOC2007 VOC2012 MIRFLICKR-25K
Test 2,000 2,000 2,000
Train 7,963 9,540 23,000
TABLE I: The number of training samples and testing samples.

To evaluate the performance, we use four evaluation metrics: Normalized Discounted Cumulative Gains (NDCG) [35], Mean Average Precision (MAP) [36], Weighted MAP [9] and Average Cumulative Gains (ACG) [37].

NDCG is a popular evaluation metric in information retrieval. Given a query image, the DCG score at position p is defined as

$\mathrm{DCG}@p = \sum_{i=1}^{p} \frac{2^{r_i} - 1}{\log_2(1 + i)}, \qquad (10)$

where r_i is the similarity between the image at position i and the query image, defined as the number of labels shared by the query image and the i-th retrieved image. Then, the NDCG score at position p can be calculated by NDCG@p = DCG@p / Z_p, where Z_p is the maximum possible value of DCG@p (obtained by the ideal ranking), making the value of NDCG fall in the range [0, 1].

ACG at position p is the average of the similarities between the query image and each of the top-p retrieved images, which can be calculated by

$\mathrm{ACG}@p = \frac{1}{p} \sum_{i=1}^{p} r_i. \qquad (11)$

MAP is a standard evaluation metric for information retrieval. It is the mean, over a set of queries, of the average precision (AP); for a single query, AP can be calculated by

$\mathrm{AP} = \frac{1}{N^{+}} \sum_{p=1}^{N} \mathrm{P}(p)\, \mathbb{1}[\mathrm{rel}(p)], \qquad (12)$

where 1[rel(p)] is an indicator function: it is 1 if the image at position p is relevant (i.e., it shares at least one label with the query image) and 0 otherwise; N is the total number of images in the ranked list; N^+ is the total number of relevant images w.r.t. the query image; and P(p) = N^+_p / p, where N^+_p is the number of relevant images within the top p results.

The Weighted MAP [9] is defined analogously to MAP, with the precision P(p) in (12) replaced by ACG@p:
$\mathrm{MAP}_w = \frac{1}{N^{+}} \sum_{p=1}^{N} \mathrm{ACG}@p\; \mathbb{1}[\mathrm{rel}(p)]. \qquad (13)$
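For reference, a compact NumPy sketch of NDCG, ACG and AP for a single query, where r_i is the shared-label count of the i-th retrieved image; the example values are illustrative only.

import numpy as np

def dcg_at_p(r, p):
    """r: similarities (shared-label counts) of the ranked results; DCG@p as in Eq. (10)."""
    r = np.asarray(r, dtype=float)[:p]
    return np.sum((2.0 ** r - 1.0) / np.log2(np.arange(2, len(r) + 2)))

def ndcg_at_p(r, p):
    """Normalize by the ideal (descending-sorted) DCG so that NDCG lies in [0, 1]."""
    ideal = dcg_at_p(sorted(r, reverse=True), p)
    return dcg_at_p(r, p) / ideal if ideal > 0 else 0.0

def acg_at_p(r, p):
    """Average of the similarities of the top-p results (Eq. 11)."""
    return float(np.mean(np.asarray(r, dtype=float)[:p]))

def average_precision(r):
    """AP of one query; a result is relevant if it shares at least one label (Eq. 12)."""
    rel = (np.asarray(r) > 0).astype(float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float(np.sum(precisions * rel) / rel.sum())

# Example: similarities (shared-label counts) of the top-5 retrieved images.
r = [2, 0, 1, 3, 0]
print(ndcg_at_p(r, 5), acg_at_p(r, 5), average_precision(r))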
Methods VOC 2007 MIRFLICKR25K VOC 2012
16 bits 32 bits 48 bits 64 bits 16 bits 32 bits 48 bits 64 bits 16 bits 32 bits 48 bits 64 bits
NDCG@1000
Ours 0.7963 0.8696 0.8865 0.8929 0.4725 0.5245 0.5400 0.5552 0.7939 0.8385 0.8566 0.8573
One-Stage 0.7808 0.8296 0.8446 0.8530 0.4413 0.5096 0.5392 0.5550 0.7540 0.8012 0.8170 0.8224
ITQ-CCA 0.7704 0.8007 0.8139 0.8146 0.4498 0.4719 0.4866 0.4921 0.7471 0.7759 0.7815 0.7891
ITQ 0.6848 0.6768 0.6783 0.6766 0.3734 0.3934 0.3966 0.3982 0.6433 0.6371 0.6338 0.6337
SH 0.5404 0.5013 0.4796 0.4697 0.3096 0.3046 0.2998 0.2959 0.5157 0.4718 0.4409 0.4238
ACG@1000
Ours 0.7065 0.7590 0.7674 0.7731 2.6751 2.8353 2.8951 2.9282 0.7523 0.7841 0.7972 0.7963
One-Stage 0.7007 0.7323 0.7407 0.7483 2.6083 2.8302 2.9141 2.9642 0.7141 0.7522 0.7654 0.7705
ITQ-CCA 0.6418 0.6658 0.6780 0.6779 2.4785 2.5314 2.5964 2.6149 0.6733 0.6971 0.7031 0.7107
ITQ 0.5823 0.5695 0.5676 0.5661 2.1964 2.2568 2.2650 2.2747 0.5794 0.5715 0.5669 0.5656
SH 0.4570 0.4218 0.4044 0.3982 1.8394 1.7668 1.7215 1.6894 0.4660 0.4204 0.3947 0.3785
MAP
Ours 0.7997 0.8618 0.8784 0.8830 0.7994 0.8317 0.8366 0.8361 0.7942 0.8437 0.8617 0.8642
One-Stage 0.7488 0.7995 0.8171 0.8259 0.7727 0.8059 0.8136 0.8179 0.7343 0.7870 0.8055 0.8109
ITQ-CCA 0.6913 0.7264 0.7404 0.7396 0.7015 0.7053 0.7174 0.7254 0.6952 0.7254 0.7362 0.7427
ITQ 0.5845 0.5747 0.5769 0.5741 0.6804 0.6822 0.6796 0.6795 0.5715 0.5984 0.5554 0.5549
SH 0.4432 0.4071 0.3875 0.3799 0.6174 0.6057 0.5994 0.5952 0.4378 0.4184 0.3641 0.3485
Weighted MAP
Ours 0.8566 0.9255 0.9449 0.9505 2.0877 2.1926 2.2271 2.2294 0.8429 0.9005 0.9205 0.9229
One-Stage 0.8007 0.8595 0.8794 0.8903 2.0411 2.1584 2.1958 2.2187 0.7798 0.8414 0.8631 0.8698
ITQ-CCA 0.7325 0.7725 0.7879 0.7866 1.7359 1.7518 1.7982 1.8185 0.7312 0.7666 0.7779 0.7854
ITQ 0.6214 0.6129 0.6163 0.6132 1.6269 1.6403 1.6344 1.6369 0.6051 0.5715 0.5911 0.5906
SH 0.4723 0.4342 0.4137 0.4051 1.4077 1.3598 1.3324 1.3150 0.4637 0.4205 0.3865 0.3697
TABLE II: Comparison results of Hamming ranking w.r.t. different numbers of bits on three datasets.

Fig. 6: NDCG curves with 32 bits w.r.t. different numbers of top returned samples on (a) VOC 2007, (b) MIRFLICKR-25K, and (c) VOC 2012.

Fig. 7: ACG curves with 32 bits w.r.t. different numbers of top returned samples on (a) VOC 2007, (b) MIRFLICKR-25K, and (c) VOC 2012.

V-B Experimental Setting

We implement the proposed method based on the open-source Caffe [38] framework. The networks are trained using stochastic gradient descent. In training, the weights of the layers are initialized by the pre-trained GoogLeNet model. The base learning rate is set to 0.0001 and is divided by 10 after every 30 epochs on the training data. In all of our experiments, we first use GOP to obtain the bounding boxes of region proposals (no more than 1,000 proposals per image). With these bounding boxes, we use non-maximum suppression to obtain a smaller number of boxes, and then select a fixed number of top boxes with the highest confidence. We use multi-level spatial pyramid pooling with 30 pooling bins in total. The number of feature maps in the last convolution layer is 32; hence the dimension of each intermediate feature vector is d = 960 (i.e., 32 × 30). The dimension q of each proposal's feature vector in the hash coding module is set to the desired number of hash bits for each category.

Fig. 8: Per-category MAP curves on three datasets. The x-axis represents the categories.

During training, we use a random sampling strategy to generate triplets (i.e., triplets of images (I, I+, I−) such that I is more similar to I+ than to I−). Specifically, the proposed network is trained by stochastic gradient descent, where the number of iterations is 15,000 and the mini-batch size is 32. We denote SharedLabels(I_1, I_2) as the number of shared labels between the image I_1 and the image I_2; for example, if I_1 is labeled with {bicycle, person} and I_2 is labeled with {person, sofa}, then SharedLabels(I_1, I_2) = 1 because they share the label “person”. The procedure for generating triplets in each iteration is shown in the following:

Input: a batch of 32 training images.
Output: a set of triplets T.
T ← ∅.
For every triplet (I, I+, I−) of distinct images in the batch
    If SharedLabels(I, I+) > SharedLabels(I, I−)
        T ← T ∪ {(I, I+, I−)}
    End If
End For
Output T.
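A minimal Python version of this sampling procedure, assuming each image's labels are given as a set (the helper names are hypothetical):

from itertools import permutations

def shared_labels(y1, y2):
    """Number of class labels shared by two images (label sets)."""
    return len(set(y1) & set(y2))

def generate_triplets(batch_labels):
    """Enumerate triplets (q, pos, neg) within a mini-batch, following the listing above.

    batch_labels: list of label sets, one per image in the batch (e.g. 32 images).
    A triplet is kept when the query shares strictly more labels with `pos` than with `neg`.
    """
    triplets = []
    for q, pos, neg in permutations(range(len(batch_labels)), 3):
        if shared_labels(batch_labels[q], batch_labels[pos]) > \
           shared_labels(batch_labels[q], batch_labels[neg]):
            triplets.append((q, pos, neg))
    return triplets

# Example with four images.
labels = [{"cat", "dog"}, {"dog"}, {"person"}, {"cat", "sofa"}]
print(generate_triplets(labels)[:5])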

The source code of the proposed method is made publicly available at http://ss.sysu.edu.cn/~py/tip-hashing.rar.

Fig. 9: The architecture of the baseline for category-aware hashing.

V-C Results on Semantic Hashing

The first set of experiments is to evaluate the performance of the proposed method in semantic hashing.

We use SH [14], ITQ [1], ITQ-CCA [1] and One-Stage Hashing [8] as the baselines in our experiments. SH and ITQ are unsupervised methods, while ITQ-CCA and One-Stage Hashing are supervised methods. One-Stage Hashing is a recently proposed deep-networks-based hashing method that is the most closely related competitor to the proposed method. For SH, ITQ and ITQ-CCA, we use the pre-trained GoogLeNet model (https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet) to extract features for the images; the feature vector for each image has 1,024 dimensions. For a fair comparison, in our implementation of One-Stage Hashing, we use the architecture of GoogLeNet as its shared sub-network, instead of the NIN architecture used in [8]; we also use the same weighted triplet loss in (9) as the proposed method. This variant of One-Stage Hashing is also trained with the open-source Caffe framework.

Table II shows the comparison results w.r.t. NDCG@1000, ACG@1000, MAP and Weighted MAP. Figure 6 and Figure 7 show the NDCG and ACG curves with 32 bits w.r.t. varying numbers of top returned samples. As can be seen, the proposed method shows superior performance gains over the baselines. On VOC 2007 and VOC 2012, the NDCG@1000 values of the proposed method at 64 bits (0.8929 and 0.8573) represent relative increases of about 4.7% and 4.2% over the second best baseline (One-Stage Hashing). The ACG@1000 value of the proposed method on VOC 2007 is 0.7731 with 64 bits, compared to 0.7483 for One-Stage Hashing. On MIRFLICKR-25K, the MAP values of the proposed method represent relative increases of roughly 2% to 3.5% over One-Stage Hashing across code lengths. It can be observed from these results that incorporating automatically generated region proposals and label probability calculation in the process of hash learning can help improve the performance of semantic hashing.

V-D Results on Category-aware Hashing

We also evaluate the performance of the proposed method for category-aware hashing. Since little effort has been devoted to category-aware hashing on multi-label images, to demonstrate the advantages of the proposed method, we implement a deep-networks-based baseline that also outputs c pieces of q-bit hash codes, each piece corresponding to a category. As shown in Figure 9, this baseline adopts GoogLeNet as its basic framework. After the last (1024-dimensional) fully connected layer of GoogLeNet, a fully connected layer with cq nodes is added, and this layer is then separated into c slices (each of q dimensions). For the j-th (j = 1, 2, ..., c) slice, a triplet loss is defined which regards the images belonging to the j-th category as positive examples and the other images as negative ones. To train this baseline, we also use the pre-trained GoogLeNet model to initialize its weights.

The baseline is a simpler category-aware retrieval system, which uses neither the region proposal module nor the label probability calculation module. The experimental results can thus tell us whether these two modules contribute to the accuracy improvement.

For a test query image, we first convert it into c pieces of q-bit codes, and then use the hash codes of the categories that the test image contains to conduct search in the corresponding hash tables and obtain lists of retrieved images.

The MAP results (for each category) are shown in Figure 8. We can observe that the proposed method consistently outperforms the baseline. For example, on VOC 2007, the averaged MAP (over 20 classes) of the proposed method is 0.5831, compared to 0.3190 for the baseline. On VOC 2012 and MIRFLICKR-25K, the proposed method also achieves substantial relative increases in averaged MAP over the baseline. Figure 5 shows two examples of results from our experiments.

VI Conclusions and Future Work

In this paper, we proposed a deep-networks-based hashing method for multi-label image retrieval, by incorporating automatically generated region proposals and label probability calculation into the hash learning process. In the proposed deep architecture, an input image is converted to an instance-aware representation organized in groups, each group corresponding to a category. Based on this representation, we can easily generate binary hash codes for either semantic hashing or category-aware hashing. Empirical evaluations on both category-aware hashing and semantic hashing show that the proposed method substantially outperforms the state-of-the-art methods.

In future work, we plan to study unsupervised instance-aware image retrieval, in which the virtual classes can be obtained by clustering.

References

  • [1] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 817–824.
  • [2] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 2130–2137.
  • [3] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in Proceedings of the International Conference on Machine Learning, 2011, pp. 1–8.
  • [4] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2014, pp. 2156–2162.
  • [5] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
  • [6] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in Proceedings of the Advances in Neural Information Processing Systems, 2009, pp. 1042–1050.
  • [7] G. Lin, C. Shen, D. Suter, and A. van den Hengel, “A general two-step approach to learning-based hashing,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2552–2559.
  • [8] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3270–3278.
  • [9] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
  • [10] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 346–361.
  • [12] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the International Conference on Very Large Data Bases, 1999, pp. 518–529.
  • [13] R. Salakhutdinov and G. Hinton, “Learning a nonlinear embedding by preserving class neighbourhood structure,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007, pp. 412–419.
  • [14] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proceedings of the Advances in Neural Information Processing Systems, 2008, pp. 1753–1760.
  • [15] M. Norouzi and D. M. Blei, “Minimal loss hashing for compact binary codes,” in Proceedings of the International Conference on Machine Learning, 2011, pp. 353–360.
  • [16] X. Li, G. Lin, C. Shen, A. v. d. Hengel, and A. Dick, “Learning hash functions using column generation,” in Proceedings of the International Conference on Machine Learning, 2013, pp. 142–150.
  • [17] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for scalable image retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3424–3431.
  • [18] D. Tao, X. Tang, X. Li, and X. Wu, “Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 7, pp. 1088–1099, 2006.
  • [19] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [20] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen, “Deep learning of binary hash codes for fast image retrieval,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 27–35.
  • [21] D. Wang, P. Cui, M. Ou, and W. Zhu, “Deep multimodal hashing with orthogonal regularization,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015, pp. 2291–2297.
  • [22] J. Carreira and C. Sminchisescu, “Cpmc: Automatic object segmentation using constrained parametric min-cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1312–1328, 2012.
  • [23] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
  • [24] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335.
  • [25] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “Bing: Binarized normed gradients for objectness estimation at 300fps,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
  • [26] P. Krähenbühl and V. Koltun, “Geodesic object proposals,” in Proceedings of European Conference on Computer Vision, 2014, pp. 725–739.
  • [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv preprint arXiv:1409.4842, 2014.
  • [28] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, “The difficulty of training deep architectures and the effect of unsupervised pre-training,” in Proceedings of International Conference on Artificial Intelligence and Statistics, 2009, pp. 153–160.
  • [29] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “Cnn: Single-label to multi-label,” arXiv preprint arXiv:1406.5726, 2014.
  • [30] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep convolutional ranking for multilabel image annotation,” arXiv preprint arXiv:1312.4894, 2013.
  • [31] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” arXiv preprint arXiv:1411.7718, 2014.
  • [32] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1–9.
  • [33] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [34] M. J. Huiskes and M. S. Lew, “The mir flickr retrieval evaluation,” in Proceedings of the 1st ACM international conference on Multimedia information retrieval, 2008, pp. 39–43.
  • [35] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of ir techniques,” ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.
  • [36] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern information retrieval.   ACM press New York, 1999, vol. 463.
  • [37] K. Järvelin and J. Kekäläinen, “Ir evaluation methods for retrieving highly relevant documents,” in Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 2000, pp. 41–48.
  • [38] Y. Jia, “Caffe: An open source convolutional architecture for fast feature embedding,” http://caffe.berkeleyvision.org, 2013.