Deep Hashing Baselines
Recent years have witnessed wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on hand-crafted features, which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, and they have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with the supervised information given as triplet labels. For another common application scenario with pairwise labels, no existing method performs simultaneous feature learning and hash-code learning. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Experiments on real datasets show that DPSH outperforms other methods and achieves state-of-the-art performance in image retrieval applications.
With the explosive growth of data in real applications like image retrieval, approximate nearest neighbor (ANN) search [Andoni and Indyk 2006] has become a hot research topic in recent years. Among existing ANN techniques, hashing has become one of the most popular and effective due to its fast query speed and low memory cost [Kulis and Grauman 2009, Gong and Lazebnik 2011, Kong and Li 2012, Liu et al. 2012, Rastegari et al. 2013, He et al. 2013, Lin et al. 2014, Shen et al. 2015, Kang et al. 2016].
Existing hashing methods can be divided into data-independent methods and data-dependent methods [Gong and Lazebnik 2011, Kong and Li 2012]. In data-independent methods, the hash function is typically generated randomly, independent of any training data. Representative data-independent methods include locality-sensitive hashing (LSH) [Andoni and Indyk 2006] and its variants. Data-dependent methods, also called learning to hash (L2H) methods [Kong and Li 2012], try to learn the hash function from training data. Compared with data-independent methods, L2H methods can achieve comparable or better accuracy with shorter hash codes. Hence, L2H methods have become more popular than data-independent methods in real applications.
The L2H methods can be further divided into two categories [Kong and Li 2012, Kang et al. 2016]: unsupervised methods and supervised methods. Unsupervised methods only utilize the feature (attribute) information of data points without using any supervised (label) information during the training procedure. Representative unsupervised methods include iterative quantization (ITQ) [Gong and Lazebnik 2011], isotropic hashing (IsoHash) [Kong and Li 2012], discrete graph hashing (DGH) [Liu et al. 2014], and scalable graph hashing (SGH) [Jiang and Li 2015]. Supervised methods try to utilize supervised (label) information to learn the hash codes. The supervised information can be given in three different forms: point-wise labels, pairwise labels and ranking labels. Representative point-wise label based methods include CCA-ITQ [Gong and Lazebnik 2011], supervised discrete hashing (SDH) [Shen et al. 2015] and the deep hashing method in [Lin et al. 2015]. Representative pairwise label based methods include sequential projection learning for hashing (SPLH) [Wang et al. 2010], minimal loss hashing (MLH) [Norouzi and Fleet 2011], supervised hashing with kernels (KSH) [Liu et al. 2012], two-step hashing (TSH) [Lin et al. 2013], fast supervised hashing (FastH) [Lin et al. 2014], latent factor hashing (LFH) [Zhang et al. 2014], convolutional neural network hashing (CNNH) [Xia et al. 2014], and column sampling based discrete supervised hashing (COSDISH) [Kang et al. 2016]. Representative ranking label based methods include ranking-based supervised hashing (RSH) [Wang et al. 2013b], column generation hashing (CGHash) [Li et al. 2013], order preserving hashing (OPH) [Wang et al. 2013a], ranking preserving hashing (RPH) [Wang et al. 2015], and some deep hashing methods [Zhao et al. 2015a, Lai et al. 2015, Zhang et al. 2015].
Although many hashing methods have been proposed, as shown above, most existing hashing methods, including some deep hashing methods [Salakhutdinov and Hinton 2009, Masci et al. 2014, Liong et al. 2015], are based on hand-crafted features. In these methods, the hand-crafted feature construction procedure is independent of the hash-code and hash function learning procedure, so the resulting features might not be optimally compatible with the hashing procedure. Hence, these hand-crafted feature based hashing methods might not achieve satisfactory performance in practice. To overcome this shortcoming, some feature learning based deep hashing methods [Zhao et al. 2015a, Lai et al. 2015, Zhang et al. 2015] have recently been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, and they have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with the supervised information given as triplet labels, which are a special case of ranking labels.
For another common application scenario with pairwise labels, few feature learning based deep hashing methods have appeared. To the best of our knowledge, CNNH [Xia et al. 2014] is the only one which adopts a deep neural network, specifically a convolutional neural network (CNN) [LeCun et al. 1989], to perform feature learning for supervised hashing with pairwise labels. CNNH is a two-stage method. In the first stage, the hash codes are learned from the pairwise labels. In the second stage, the hash function and feature representation are learned from image pixels based on the hash codes from the first stage. In CNNH, the feature representation learned in the second stage cannot give feedback for learning better hash codes in the first stage. Hence, CNNH cannot perform simultaneous feature learning and hash-code learning, which limits its performance. This has been verified by the authors of CNNH themselves in another paper [Lai et al. 2015].
In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), for applications with pairwise labels. The main contributions of DPSH are outlined as follows:
DPSH is an end-to-end learning framework which contains three key components. The first component is a deep neural network to learn image representation from pixels. The second component is a hash function to map the learned image representation to hash codes. And the third component is a loss function to measure the quality of hash codes guided by the pairwise labels. All the three components are seamlessly integrated into the same deep architecture to map the images from pixels to the pairwise labels in an end-to-end way. Hence, different components can give feedback to each other in DPSH, which results in learning better codes than other methods without end-to-end architecture.
To the best of our knowledge, DPSH is the first method which can perform simultaneous feature learning and hash-code learning for applications with pairwise labels.
Experiments on real datasets show that DPSH can outperform other methods to achieve the state-of-the-art performance in image retrieval applications.
We use boldface lowercase letters like $\mathbf{z}$ to denote vectors. Boldface uppercase letters like $\mathbf{Z}$ are used to denote matrices. The transpose of $\mathbf{Z}$ is denoted as $\mathbf{Z}^T$. $\|\cdot\|_2$ is used to denote the Euclidean norm of a vector. $\mathrm{sgn}(\cdot)$ denotes the element-wise sign function which returns 1 if the element is positive and returns -1 otherwise.
Suppose we have $n$ points (images) $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the feature vector of point $i$. $\mathbf{x}_i$ can be the hand-crafted features or the raw pixels in image retrieval applications. The specific meaning of $\mathbf{x}_i$ can be easily determined from the context. Besides the feature vectors, the training set of supervised hashing with pairwise labels also contains a set of pairwise labels $\mathcal{S} = \{s_{ij}\}$ with $s_{ij} \in \{0, 1\}$, where $s_{ij} = 1$ means that $\mathbf{x}_i$ and $\mathbf{x}_j$ are similar, and $s_{ij} = 0$ means that $\mathbf{x}_i$ and $\mathbf{x}_j$ are dissimilar. Here, the pairwise labels typically refer to semantic labels provided with manual effort.
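The goal of supervised hashing with pairwise labels is to learn a binary code $\mathbf{b}_i \in \{-1, 1\}^c$ for each point $\mathbf{x}_i$, where $c$ is the code length. The binary codes $\mathcal{B} = \{\mathbf{b}_i\}_{i=1}^{n}$ should preserve the similarity in $\mathcal{S}$. More specifically, if $s_{ij} = 1$, the binary codes $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a low Hamming distance. Otherwise, if $s_{ij} = 0$, the binary codes $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a high Hamming distance. In general, we can write the binary code as $\mathbf{b}_i = h(\mathbf{x}_i) = [h_1(\mathbf{x}_i), h_2(\mathbf{x}_i), \ldots, h_c(\mathbf{x}_i)]^T$, where $h(\mathbf{x}_i)$ is the hash function to learn.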
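For codes with entries in {-1, +1}, the Hamming distance and the inner product are tied by the identity d_H = (c - <b_i, b_j>) / 2, which is why later objectives can work with inner products instead of counting bit differences. The following is a minimal sketch of that relation; the helper names and the example codes are illustrative assumptions, not taken from the paper.

```python
def hamming_distance(b_i, b_j):
    """Number of positions where two {-1, +1} codes differ."""
    return sum(1 for a, b in zip(b_i, b_j) if a != b)


def inner_product(b_i, b_j):
    """Plain dot product of two codes of equal length."""
    return sum(a * b for a, b in zip(b_i, b_j))


# Hypothetical 8-bit codes used only for illustration.
b1 = [1, -1, 1, 1, -1, 1, -1, -1]
b2 = [1, -1, 1, -1, -1, 1, -1, 1]    # differs from b1 in two positions
b3 = [-b for b in b1]                # the complement of b1

c = len(b1)
for other in (b2, b3):
    d = hamming_distance(b1, other)
    # Identity for {-1, +1} codes: d_H = (c - <b_i, b_j>) / 2
    assert 2 * d == c - inner_product(b1, other)
```

A large inner product therefore means a small Hamming distance, and vice versa.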
Most existing pairwise label based supervised hashing methods, including SPLH [Wang et al. 2010], MLH [Norouzi and Fleet 2011], KSH [Liu et al. 2012], TSH [Lin et al. 2013], FastH [Lin et al. 2014], and LFH [Zhang et al. 2014], adopt hand-crafted features for hash function learning. As stated in Section 1, these methods might not achieve satisfactory performance because the hand-crafted features might not be optimally compatible with the hash function learning procedure. CNNH [Xia et al. 2014] adopts a CNN to perform feature learning from raw pixels. However, CNNH is a two-stage method which cannot perform simultaneous feature learning and hash-code learning in an end-to-end way.
In this section, we introduce our model, called deep pairwise-supervised hashing (DPSH), which can perform simultaneous feature learning and hash-code learning in an end-to-end framework.
Figure 1 shows the end-to-end deep learning architecture for our DPSH model, which contains the feature learning part and the objective function part.
Our DPSH model contains a CNN model from [Chatfield et al. 2014] as a component. More specifically, the feature learning part has seven layers which are the same as those of CNN-F in [Chatfield et al. 2014]. Other CNN architectures, such as AlexNet [Krizhevsky et al. 2012], can also be substituted for the CNN-F network in our DPSH model. However, studying different networks is not the focus of this paper. Hence, we just use CNN-F to illustrate the effectiveness of our DPSH model, and leave the study of other candidate networks for future work. Please note that there are two CNNs (top CNN and bottom CNN) in Figure 1. These two CNNs have the same structure and share the same weights. That is to say, both the input and the loss function are based on pairs of images.
The detailed configuration of the feature learning part of our DPSH model is shown in Table 1. More specifically, it contains 5 convolutional layers (conv 1-5) and 2 fully-connected layers (full 6-7). Each convolutional layer is described in several aspects: “filter” specifies the number of convolution filters and their receptive field size, denoted as “num x size x size”; “stride” indicates the convolution stride, which is the interval at which to apply the filters to the input; “pad” indicates the number of pixels to add to each side of the input; “LRN” indicates whether Local Response Normalization (LRN) [Krizhevsky et al. 2012] is applied; “pool” indicates the downsampling factor. “4096” in the fully-connected layer indicates the dimensionality of the output. The activation function for all layers is the Rectified Linear Unit (ReLU) [Krizhevsky et al. 2012].
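Table 1: Configuration of the feature learning part of DPSH.
| Layer | Configuration |
|---|---|
| conv1 | filter 64x11x11, stride 4x4, pad 0, LRN, pool 2x2 |
| conv2 | filter 256x5x5, stride 1x1, pad 2, LRN, pool 2x2 |
| conv3 | filter 256x3x3, stride 1x1, pad 1 |
| conv4 | filter 256x3x3, stride 1x1, pad 1 |
| conv5 | filter 256x3x3, stride 1x1, pad 1, pool 2x2 |
| full6 | 4096 |
| full7 | 4096 |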
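Given the binary codes $\mathcal{B} = \{\mathbf{b}_i\}_{i=1}^{n}$ for all the points, we can define the likelihood of the pairwise labels $\mathcal{S} = \{s_{ij}\}$ as that of LFH [Zhang et al. 2014]:
$$p(s_{ij} \mid \mathcal{B}) = \begin{cases} \sigma(\Omega_{ij}), & s_{ij} = 1 \\ 1 - \sigma(\Omega_{ij}), & s_{ij} = 0 \end{cases}$$
where $\Omega_{ij} = \frac{1}{2}\mathbf{b}_i^T \mathbf{b}_j$, and $\sigma(\Omega_{ij}) = \frac{1}{1 + e^{-\Omega_{ij}}}$. Please note that $\mathbf{b}_i \in \{-1, 1\}^c$.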
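By taking the negative log-likelihood of the observed pairwise labels in $\mathcal{S}$, we can get the following optimization problem:
$$\min_{\mathcal{B}} J_1 = -\log p(\mathcal{S} \mid \mathcal{B}) = -\sum_{s_{ij} \in \mathcal{S}} \log p(s_{ij} \mid \mathcal{B}) = -\sum_{s_{ij} \in \mathcal{S}} \left( s_{ij}\Omega_{ij} - \log(1 + e^{\Omega_{ij}}) \right)$$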
It is easy to find that the above optimization problem can make the Hamming distance between two similar points as small as possible, and simultaneously make the Hamming distance between two dissimilar points as large as possible. This exactly matches the goal of supervised hashing with pairwise labels.
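To make this behaviour concrete, the sketch below evaluates the per-pair negative log-likelihood, log(1 + e^t) - s_ij * t with t equal to half the inner product of the two codes, for a similar and a dissimilar pair. It is an illustrative sketch only: the function names, the numerically stable softplus form, and the 4-bit example codes are assumptions rather than part of the paper.

```python
import math


def half_inner_product(b_i, b_j):
    """Half the inner product of two {-1, +1} codes."""
    return 0.5 * sum(a * b for a, b in zip(b_i, b_j))


def pair_nll(s_ij, t):
    """Negative log-likelihood of one pairwise label: log(1 + e^t) - s_ij * t."""
    softplus = max(t, 0.0) + math.log1p(math.exp(-abs(t)))  # stable log(1 + e^t)
    return softplus - s_ij * t


# Hypothetical 4-bit codes: identical codes (Hamming distance 0)
# and fully flipped codes (Hamming distance 4).
code = [1, -1, 1, 1]
flipped = [-1, 1, -1, -1]

loss_similar_close = pair_nll(1, half_inner_product(code, code))
loss_similar_far = pair_nll(1, half_inner_product(code, flipped))
loss_dissimilar_close = pair_nll(0, half_inner_product(code, code))
loss_dissimilar_far = pair_nll(0, half_inner_product(code, flipped))
```

For a similar pair the loss is lower when the codes agree, and for a dissimilar pair it is lower when the codes disagree, matching the stated goal.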
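The negative log-likelihood minimization above is a discrete optimization problem, which is hard to solve. LFH [Zhang et al. 2014] solves it by directly relaxing $\{\mathbf{b}_i\}$ from discrete to continuous, which might not achieve satisfactory performance [Kang et al. 2016]. Instead, we keep the binary codes discrete, introduce a real-valued vector $\mathbf{u}_i \in \mathbb{R}^{c}$ for each point, and optimize the following regularized problem:
$$\min_{\mathcal{B}, \mathcal{U}} J = -\sum_{s_{ij} \in \mathcal{S}} \left( s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) + \eta \sum_{i=1}^{n} \|\mathbf{b}_i - \mathbf{u}_i\|_2^2$$
where $\Theta_{ij} = \frac{1}{2}\mathbf{u}_i^T \mathbf{u}_j$, $\mathcal{U} = \{\mathbf{u}_i\}_{i=1}^{n}$, and $\eta$ is the hyper-parameter weighting the regularization term.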
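To integrate the above feature learning part and objective function part into an end-to-end framework, we set
$$\mathbf{u}_i = \mathbf{W}^T \phi(\mathbf{x}_i; \theta) + \mathbf{v}$$
where $\theta$ denotes all the parameters of the seven layers in the feature learning part, $\phi(\mathbf{x}_i; \theta)$ denotes the output of the full7 layer associated with point $\mathbf{x}_i$, $\mathbf{W} \in \mathbb{R}^{4096 \times c}$ denotes a weight matrix, and $\mathbf{v} \in \mathbb{R}^{c \times 1}$ is a bias vector. It means that we connect the feature learning part and the objective function part into the same framework by a fully-connected layer, with the weight matrix $\mathbf{W}$ and bias vector $\mathbf{v}$. After connecting the two parts, the problem for learning becomes:
$$\min_{\mathcal{B}, \mathbf{W}, \mathbf{v}, \theta} J = -\sum_{s_{ij} \in \mathcal{S}} \left( s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) + \eta \sum_{i=1}^{n} \|\mathbf{b}_i - (\mathbf{W}^T \phi(\mathbf{x}_i; \theta) + \mathbf{v})\|_2^2$$
with $\Theta_{ij} = \frac{1}{2}\mathbf{u}_i^T \mathbf{u}_j$ and $\mathbf{u}_i = \mathbf{W}^T \phi(\mathbf{x}_i; \theta) + \mathbf{v}$.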
As a result, we get an end-to-end deep hashing model, called DPSH, to perform simultaneous feature learning and hash-code learning in the same framework.
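As a rough illustration of how the pieces fit together, the sketch below evaluates an objective of this form on toy data: a pairwise negative log-likelihood on the real-valued outputs u_i of a final linear layer, plus a penalty eta * ||b_i - u_i||^2 tying each u_i to its binary code. It assumes precomputed feature vectors stand in for the full7 output, and all names (`linear_layer`, `dpsh_objective`, the toy `W`, `v`, `features`) are hypothetical.

```python
import math


def linear_layer(phi, W, v):
    """u = W^T phi + v, with W stored as d rows of length c."""
    return [sum(W[k][m] * phi[k] for k in range(len(phi))) + v[m]
            for m in range(len(v))]


def sgn(u):
    """Element-wise sign: +1 for positive entries, -1 otherwise."""
    return [1 if x > 0 else -1 for x in u]


def softplus(t):
    """Numerically stable log(1 + e^t)."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def dpsh_objective(U, B, S, eta):
    """Pairwise negative log-likelihood on U plus eta * sum_i ||b_i - u_i||^2.

    S maps an index pair (i, j) to its label s_ij in {0, 1}.
    """
    nll = 0.0
    for (i, j), s in S.items():
        t = 0.5 * sum(a * b for a, b in zip(U[i], U[j]))
        nll += softplus(t) - s * t
    reg = sum((b - u) ** 2 for b_i, u_i in zip(B, U) for b, u in zip(b_i, u_i))
    return nll + eta * reg


# Toy stand-ins: 3 points, 3-dimensional "features", 2-bit codes.
features = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0], [-1.0, -2.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
v = [0.0, 0.1]
S = {(0, 1): 0, (0, 2): 0, (1, 2): 1}

U = [linear_layer(phi, W, v) for phi in features]
B = [sgn(u) for u in U]
J = dpsh_objective(U, B, S, eta=1.0)
```

In the real model the features would come from the CNN and be updated jointly; here they are fixed so that only the objective itself is shown.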
In the DPSH model, the parameters for learning contain $\mathbf{W}$, $\mathbf{v}$, $\theta$ and $\mathcal{B}$. We adopt a minibatch-based strategy for learning. More specifically, in each iteration we sample a minibatch of points from the whole training set, and then perform learning based on these sampled points.
We design an alternating method for learning. That is to say, we optimize one parameter with other parameters fixed.
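The binary codes $\mathbf{b}_i$ can be directly optimized as follows:
$$\mathbf{b}_i = \mathrm{sgn}(\mathbf{u}_i) = \mathrm{sgn}(\mathbf{W}^T \phi(\mathbf{x}_i; \theta) + \mathbf{v})$$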
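For the other parameters $\mathbf{W}$, $\mathbf{v}$ and $\theta$, we use back-propagation (BP) for learning. In particular, we can compute the derivatives of the loss function with respect to $\mathbf{u}_i$ as follows:
$$\frac{\partial J}{\partial \mathbf{u}_i} = \frac{1}{2} \sum_{j: s_{ij} \in \mathcal{S}} (a_{ij} - s_{ij})\mathbf{u}_j + \frac{1}{2} \sum_{j: s_{ji} \in \mathcal{S}} (a_{ji} - s_{ji})\mathbf{u}_j + 2\eta(\mathbf{u}_i - \mathbf{b}_i)$$
where $a_{ij} = \sigma(\frac{1}{2}\mathbf{u}_i^T \mathbf{u}_j)$.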
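Then, we can update the parameters $\mathbf{W}$, $\mathbf{v}$ and $\theta$ by utilizing back-propagation:
$$\frac{\partial J}{\partial \mathbf{W}} = \phi(\mathbf{x}_i; \theta) \left( \frac{\partial J}{\partial \mathbf{u}_i} \right)^T, \qquad \frac{\partial J}{\partial \mathbf{v}} = \frac{\partial J}{\partial \mathbf{u}_i}, \qquad \frac{\partial J}{\partial \phi(\mathbf{x}_i; \theta)} = \mathbf{W} \frac{\partial J}{\partial \mathbf{u}_i}$$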
The whole learning algorithm of DPSH is briefly summarized in Algorithm 1.
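The gradient step of the alternating scheme can be sanity-checked numerically. The sketch below assumes the loss is the pairwise negative log-likelihood on real-valued outputs u_i plus eta * ||b_i - u_i||^2, with the codes b_i held fixed, and computes the analytic gradient with respect to each u_i; a finite-difference comparison then confirms it. Function names and the toy values are hypothetical, and the convolutional layers are left out entirely.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def objective(U, B, S, eta):
    """Pairwise negative log-likelihood plus eta * sum_i ||b_i - u_i||^2."""
    total = 0.0
    for (i, j), s in S.items():
        t = 0.5 * sum(a * b for a, b in zip(U[i], U[j]))
        total += math.log(1.0 + math.exp(t)) - s * t
    total += eta * sum((b - u) ** 2
                       for b_i, u_i in zip(B, U) for b, u in zip(b_i, u_i))
    return total


def grad_U(U, B, S, eta):
    """Analytic dJ/du_i for every point, with the codes B held fixed."""
    n, c = len(U), len(U[0])
    # Regularizer contribution: 2 * eta * (u_i - b_i)
    G = [[2.0 * eta * (U[i][m] - B[i][m]) for m in range(c)] for i in range(n)]
    for (i, j), s in S.items():
        t = 0.5 * sum(a * b for a, b in zip(U[i], U[j]))
        coeff = 0.5 * (sigmoid(t) - s)      # (a_ij - s_ij) / 2
        for m in range(c):
            G[i][m] += coeff * U[j][m]      # pair (i, j) seen from point i
            G[j][m] += coeff * U[i][m]      # the same pair seen from point j
    return G


def grad_linear(phi, g_u):
    """Back-propagate one point through u = W^T phi + v.

    Returns (dJ/dW as d rows of length c, dJ/dv) for that point.
    """
    return [[p * g for g in g_u] for p in phi], list(g_u)


# Toy values: 3 points with 2-dimensional outputs and fixed codes.
U = [[2.0, -0.9], [0.1, 1.6], [-1.0, -1.9]]
B = [[1, -1], [1, 1], [-1, -1]]
S = {(0, 1): 1, (0, 2): 0, (1, 2): 0}
eta = 0.5

G = grad_U(U, B, S, eta)
grad_W0, grad_v0 = grad_linear([1.0, 0.0, 2.0], G[0])
```

A training loop would alternate between refreshing each b_i as the sign of u_i and taking a gradient step on the remaining parameters for the sampled minibatch.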
After the learning procedure is completed, we only have the hash codes for the points in the training data. We still need to perform out-of-sample extension to predict the hash codes for points that do not appear in the training set.
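The deep hashing framework of DPSH can be naturally applied for out-of-sample extension. For any point $\mathbf{x}_q \notin \mathcal{X}$, we can predict its hash code $\mathbf{b}_q$ just by forward propagation:
$$\mathbf{b}_q = h(\mathbf{x}_q) = \mathrm{sgn}(\mathbf{W}^T \phi(\mathbf{x}_q; \theta) + \mathbf{v})$$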
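A minimal sketch of how a predicted code might be used at query time follows: the query's feature vector is passed through the final linear layer, the sign is taken, and database items are ranked by Hamming distance to the resulting code. The feature vector here is a stand-in for the CNN output, and the names `encode`, `hamming_rank`, and the toy weights are assumptions made for illustration.

```python
def encode(phi, W, v):
    """Hash code sgn(W^T phi + v) for one feature vector phi."""
    u = [sum(W[k][m] * phi[k] for k in range(len(phi))) + v[m]
         for m in range(len(v))]
    return [1 if x > 0 else -1 for x in u]


def hamming_rank(query_code, database_codes):
    """Database indices sorted by increasing Hamming distance to the query."""
    dists = [sum(1 for a, b in zip(query_code, code) if a != b)
             for code in database_codes]
    # sorted() is stable, so ties keep their original database order.
    return sorted(range(len(database_codes)), key=lambda idx: dists[idx])


# Toy layer parameters and a toy query feature vector.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
v = [0.0, 0.1]
query_code = encode([1.0, 0.0, 2.0], W, v)

database_codes = [[-1, 1], [1, -1], [1, 1]]
ranking = hamming_rank(query_code, database_codes)
```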
All our experiments for DPSH are run with MatConvNet [Vedaldi and Lenc 2015] on an NVIDIA K80 GPU server. Our model can be trained at a speed of about 290 images per second with a single K80 GPU.
We compare our model with several baselines on two widely used benchmark datasets: CIFAR-10 and NUS-WIDE.
The CIFAR-10 [Krizhevsky 2009] dataset consists of 60,000 32 × 32 color images which are categorized into 10 classes (6000 images per class). It is a single-label dataset in which each image belongs to one of the ten classes.
The NUS-WIDE dataset [Chua et al. 2009, Zhao et al. 2015b] has nearly 270,000 images collected from the web. It is a multi-label dataset in which each image is annotated with one or multiple class labels from 81 classes. Following [Lai et al. 2015], we only use the images associated with the 21 most frequent classes. Each of these classes contains at least 5000 images.
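We compare our method with several state-of-the-art hashing methods. These methods can be categorized into five classes:
Unsupervised hashing methods with hand-crafted features, including SH and ITQ.
Supervised hashing methods with hand-crafted features, including SPLH, KSH, FastH, LFH and SDH.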
The above unsupervised methods and supervised methods with deep features extracted by the CNN-F of the feature learning part in our DPSH.
Deep hashing methods with pairwise labels, including CNNH [Xia et al. 2014].
Deep hashing methods with triplet labels, including network in network hashing (NINH) [Lai et al. 2015], deep semantic ranking based hashing (DSRH) [Zhao et al. 2015a], deep similarity comparison hashing (DSCH) [Zhang et al. 2015] and deep regularized similarity comparison hashing (DRSCH) [Zhang et al. 2015].
For hashing methods which use hand-crafted features, we represent each image in CIFAR-10 by a 512-dimensional GIST vector. And we represent each image in NUS-WIDE by a 1134-dimensional low level feature vector, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments and 500-D SIFT features.
For deep hashing methods, we first resize all images to 224 × 224 pixels and then directly use the raw image pixels as input. We adopt the CNN-F network pre-trained on the ImageNet dataset [Russakovsky et al. 2014] to initialize the first seven layers of our DPSH framework. A similar initialization strategy has also been adopted by other deep hashing methods [Zhao et al. 2015a].
As in most existing hashing work, the mean average precision (MAP) is used to measure the accuracy of our proposed method and the other baselines. The hyper-parameter $\eta$ in DPSH is chosen using a validation set and is set separately for CIFAR-10 and for NUS-WIDE unless otherwise stated.
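For readers unfamiliar with the metric, the sketch below computes average precision for one ranked result list and averages it over queries, with an optional cutoff on the number of returned neighbors. Normalizing by the number of relevant items found within the cutoff is one common convention and is an assumption here, as are the function names; the exact evaluation script behind the reported numbers is not described in the text.

```python
def average_precision(relevant_flags, top_k=None):
    """AP of one ranked list; relevant_flags[r] is truthy if rank r is relevant."""
    flags = relevant_flags if top_k is None else relevant_flags[:top_k]
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(flags, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank    # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(flags_per_query, top_k=None):
    """Mean of the per-query average precisions."""
    aps = [average_precision(flags, top_k) for flags in flags_per_query]
    return sum(aps) / len(aps)


# Two hypothetical queries with their ranked relevance flags.
ranked = [[1, 0, 1, 0], [0, 1]]
map_full = mean_average_precision(ranked)
map_top2 = mean_average_precision(ranked, top_k=2)
```

In a retrieval experiment the flags would come from ranking the database by Hamming distance to each query code and marking items that are similar to the query.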
Following [Xia et al. 2014, Lai et al. 2015], we randomly select 1000 images (100 images per class) as the query set in CIFAR-10. For the unsupervised methods, we use the remaining images as the training set. For the supervised methods, we randomly select 5000 images (500 images per class) from the remaining images as the training set. The pairwise label set is constructed based on the image class labels. That is to say, two images are considered similar if they share the same class label.
In NUS-WIDE, we randomly sample 2100 query images from the 21 most frequent labels (100 images per class), following the strategy in [Xia et al. 2014, Lai et al. 2015]. For supervised methods, we randomly select 500 images per class from the remaining images as the training set. The pairwise label set is constructed based on the image class labels. That is to say, two images are considered similar if they share at least one common label. For NUS-WIDE, we calculate the MAP values within the top 5000 returned neighbors.
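Both label-construction rules reduce to the same test, namely whether two images' label sets intersect; in the single-label case each set has exactly one element. The following sketch builds the full pairwise label matrix that way. The function name and the example labels are illustrative assumptions.

```python
def pairwise_labels(label_sets):
    """s_ij = 1 if images i and j share at least one class label, else 0."""
    n = len(label_sets)
    return [[1 if label_sets[i] & label_sets[j] else 0 for j in range(n)]
            for i in range(n)]


# Single-label case (CIFAR-10 style): one class per image.
single = [{"cat"}, {"dog"}, {"cat"}]
S_single = pairwise_labels(single)

# Multi-label case (NUS-WIDE style): similar if any label is shared.
multi = [{"sky", "water"}, {"water"}, {"person"}]
S_multi = pairwise_labels(multi)
```

Materializing the full n-by-n matrix is fine for a few thousand training images; for larger sets one would typically compute labels per minibatch instead.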
The MAP results are reported in Table 2, where DPSH, DPSH0, NINH and CNNH are deep methods, and all the other methods are non-deep methods with hand-crafted features. The results of NINH, CNNH, KSH and ITQ are from [Xia et al. 2014, Lai et al. 2015]. Please note that the above experimental setting and evaluation metric are exactly the same as those in [Xia et al. 2014, Lai et al. 2015]. Hence, the comparison is reasonable. We find that our method DPSH dramatically outperforms the other baselines, including unsupervised methods, supervised methods with hand-crafted features, and deep hashing methods with feature learning.
Footnote 1: The accuracy of LFH in Table 2 is much lower than that in [Zhang et al. 2014, Kang et al. 2016] because fewer points are used for training in this paper. Please note that LFH is an efficient method which can be used for training large-scale supervised hashing problems, but training efficiency is not the focus of this paper.
Both DPSH and CNNH are deep hashing methods with pairwise labels. By comparing DPSH to CNNH, we can find that the model (DPSH) with simultaneous feature learning and hash-code learning can outperform the other model (CNNH) without simultaneous feature learning and hash-code learning.
NINH is a triplet label based method. Although NINH can perform simultaneous feature learning and hash-code learning, it is still outperformed by DPSH. More comparison with triplet label based methods will be provided in Section 4.4.
To further verify the importance of simultaneous feature learning and hash-code learning, we design a variant of DPSH, called DPSH0, which does not update the parameters of the first seven layers (CNN-F layers) during learning. Hence, DPSH0 uses CNN-F only for feature extraction and then learns the hash functions based on the extracted features. The hash function learning procedure gives no feedback to the feature extraction procedure. By comparing DPSH to DPSH0, we find that DPSH dramatically outperforms DPSH0. This means that integrating feature learning and hash-code learning into the same framework in an end-to-end way yields a better solution than learning without an end-to-end architecture.
| Method | CIFAR-10 (MAP) | | | | NUS-WIDE (MAP) | | | |
|---|---|---|---|---|---|---|---|---|
| FastH + CNN | 0.553 | 0.607 | 0.619 | 0.636 | 0.779 | 0.807 | 0.816 | 0.825 |
| SDH + CNN | 0.478 | 0.557 | 0.584 | 0.592 | 0.780 | 0.804 | 0.815 | 0.824 |
| KSH + CNN | 0.488 | 0.539 | 0.548 | 0.563 | 0.768 | 0.786 | 0.790 | 0.799 |
| LFH + CNN | 0.208 | 0.242 | 0.266 | 0.339 | 0.695 | 0.734 | 0.739 | 0.759 |
| SPLH + CNN | 0.299 | 0.330 | 0.335 | 0.330 | 0.753 | 0.775 | 0.783 | 0.786 |
| ITQ + CNN | 0.237 | 0.246 | 0.255 | 0.261 | 0.719 | 0.739 | 0.747 | 0.756 |
| SH + CNN | 0.183 | 0.164 | 0.161 | 0.161 | 0.621 | 0.616 | 0.615 | 0.612 |
To further verify the effectiveness of simultaneous feature learning and hash-code learning, we compare DPSH to other non-deep methods with deep features extracted by the CNN-F pre-trained on ImageNet. The results are reported in Table 3, where “FastH + CNN” denotes the FastH method with deep features, and the other methods are named in the same way. We find that DPSH outperforms all the other non-deep baselines with deep features.
Most existing deep supervised hashing methods are based on ranking labels, especially triplet labels. Although the learning procedure of these methods is based on ranking labels, the learned model can also be used in evaluation scenarios with pairwise labels. In fact, most triplet label based methods adopt pairwise labels as ground truth for evaluation [Lai et al. 2015, Zhang et al. 2015]. In Section 4.2, we have shown that DPSH can outperform NINH. In this subsection, we perform further comparison with other deep hashing methods with ranking labels (triplet labels). These methods include DSRH [Zhao et al. 2015a], DSCH [Zhang et al. 2015] and DRSCH [Zhang et al. 2015].
The experimental setting in DSCH and DRSCH [Zhang et al. 2015] is different from that in Section 4.2. To perform a fair comparison, we adopt the same setting as that in [Zhang et al. 2015] for evaluation. More specifically, in the CIFAR-10 dataset, we randomly sample 10,000 query images (1000 images per class) and use the rest as the training set. In the NUS-WIDE dataset, we randomly sample 2100 query images from the 21 most frequent semantic labels (100 images per class), and use the rest as training samples. For NUS-WIDE, the MAP values within the top 50,000 returned neighbors are used for evaluation.
The experimental results are shown in Table 4. Please note that the results of DPSH in Table 4 are different from those in Table 2, because the experimental settings are different. The results of DSRH, DSCH and DRSCH are taken directly from [Zhang et al. 2015]. From Table 4, we find that DPSH with pairwise labels can also dramatically outperform the baselines with triplet labels. Please note that DSRH, DSCH and DRSCH can also perform simultaneous feature learning and hash-code learning in an end-to-end framework.
Figure 2 shows the effect of the hyper-parameter $\eta$. We find that DPSH is not sensitive to $\eta$ over a large range, achieving good performance on both datasets across that range.
In this paper, we have proposed a novel deep hashing method, called DPSH, for settings with pairwise labels. To the best of our knowledge, DPSH is the first method which can perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Because different components in DPSH can give feedback to each other, DPSH can learn better codes than other methods without an end-to-end architecture. Experiments on real datasets show that DPSH outperforms other methods and achieves state-of-the-art performance in image retrieval applications.
This work is supported by the NSFC (61472182), the Fundamental Research Funds for the Central Universities (20620140510), and the Tencent Fund (2014320001013613).