DPH
Code release for "Deep Priority Hashing" (ACMMM 2018)
Deep hashing enables image retrieval by end-to-end learning of deep representations and hash codes from training data with pairwise similarity information. Subject to the distribution skewness underlying the similarity information, most existing deep hashing methods may underperform on imbalanced data due to mis-specified loss functions. This paper presents Deep Priority Hashing (DPH), an end-to-end architecture that generates compact and balanced hash codes in a Bayesian learning framework. The main idea is to reshape the standard cross-entropy loss for similarity-preserving learning such that it down-weighs the loss associated with highly confident pairs. This idea leads to a novel priority cross-entropy loss, which prioritizes training on uncertain pairs over confident pairs. We further propose a priority quantization loss, which prioritizes hard-to-quantize examples for the generation of nearly lossless hash codes. Extensive experiments demonstrate that DPH can generate high-quality hash codes and yield state-of-the-art image retrieval results on three datasets: ImageNet, NUS-WIDE, and MS-COCO.
Multimedia data is ubiquitous in search engines and online communities, and its efficient retrieval is important for enhancing user experience. The major challenges in multimedia retrieval reside in the large scale and high dimensionality of multimedia data. To enable accurate retrieval under efficient computation, approximate nearest neighbor (ANN) search has attracted increasing attention. Parallel to traditional indexing methods (Lew et al., 2006) for candidate pruning, another advantageous solution is hashing (Wang et al., 2018) for data compression, which transforms high-dimensional media data into compact binary codes such that similar binary codes are generated for similar data items. In this paper, we focus on learning to hash methods (Wang et al., 2018), which build data-dependent hash encoding schemes for efficient image retrieval. These methods can capture the underlying data distributions to achieve better performance than traditional data-independent hashing methods, e.g. Locality-Sensitive Hashing (LSH) (Gionis et al., 1999).
A fruitful line of learning to hash methods has been designed to enable efficient ANN search, where the efficiency comes from compact binary codes that are orders of magnitude smaller than the original high-dimensional feature descriptors. Ranking these binary codes in response to a query entails only a few computations of the Hamming distance between the query and each database item. Hash lookup further reduces search to constant time by early pruning of irrelevant candidates falling outside a small Hamming ball. The literature can be divided into supervised and unsupervised paradigms (Kulis and Darrell, 2009; Gong and Lazebnik, 2011; Norouzi and Blei, 2011; Fleet et al., 2012; Liu et al., 2012; Wang et al., 2012; Liu et al., 2013, 2014; Zhang et al., 2014). Recently, deep learning to hash methods (Xia et al., 2014; Lai et al., 2015; Shen et al., 2015; Erin Liong et al., 2015; Zhu et al., 2016; Li et al., 2016; Liu et al., 2016; Cao et al., 2017) have shown that deep neural networks can be used as nonlinear hash functions to enable end-to-end learning of deep representations and hash codes, yielding state-of-the-art results. In particular, it proves crucial to jointly learn similarity-preserving representations and control the quantization error of converting continuous representations to binary codes (Zhu et al., 2016; Li et al., 2016; Liu et al., 2016; Cao et al., 2017).

Most existing methods are tailored to image retrieval scenarios with balanced or nearly balanced data. In other words, they weigh each data pair equally, no matter whether it is a similar or a dissimilar pair, and can thus maximize retrieval performance in terms of average per-instance accuracy. However, due to the well-known long-tail law, multimedia data with skewed distributions are prevalent in many online image search systems. The data skewness may stem from the imbalanced numbers of similar and dissimilar pairs associated with each query, from the diversity of popular and rare classes, or from the variations between easy and difficult pairs of images. Such skewness severely affects retrieval performance, especially when one needs to trade off precision (weighing dissimilar pairs more in order to discard irrelevant results) against recall (weighing similar pairs more in order to include potentially relevant results). Therefore, how to address these data skewness problems simultaneously remains an open problem.
This work presents Deep Priority Hashing (DPH), a novel deep hashing model that generates compact binary codes to enable effective and efficient image retrieval under data skewness. DPH is formalized as a Bayesian learning framework that provides two novel loss functions, motivated by the success of the focal loss in object detection (Lin et al., 2017). One is a priority cross-entropy loss for similarity-preserving learning, which prioritizes difficult image pairs over easy image pairs to learn prioritized deep representations. The other is a priority quantization loss, which prioritizes hard-to-quantize examples for generating nearly lossless hash codes. Both loss functions are well-specified for similarity retrieval of highly skewed image data. The proposed DPH model is an end-to-end architecture that can be trained by standard back-propagation. Extensive experiments demonstrate that DPH can generate high-quality hash codes and yield state-of-the-art image retrieval performance on three benchmark datasets: ImageNet, NUS-WIDE, and MS-COCO.
Learning to hash has become an important research direction in multimedia retrieval, trading efficacy off against efficiency. Wang et al. (2018) provide a comprehensive literature survey covering most of the important methods and latest advances.
Existing hashing methods can be divided into unsupervised and supervised hashing. Unsupervised hashing methods learn hash functions that encode data points into binary codes by training solely on unlabeled data. Typical learning criteria include reconstruction error minimization (Salakhutdinov and Hinton, 2007; Gong and Lazebnik, 2011; Jegou et al., 2011) and graph structure preservation (Weiss et al., 2009; Liu et al., 2011). While unsupervised methods are more general and can be trained without semantic labels or relevance information, they are subject to the semantic gap dilemma (Smeulders et al., 2000): the high-level semantic description of an object differs from its low-level feature descriptors. Supervised methods can incorporate semantic labels or relevance information to mitigate the semantic gap and improve hashing quality. Typical supervised methods include Binary Reconstruction Embedding (BRE) (Kulis and Darrell, 2009), Minimal Loss Hashing (MLH) (Norouzi and Blei, 2011), Hamming Distance Metric Learning (Norouzi et al., 2012), and Supervised Hashing with Kernels (KSH) (Liu et al., 2012), which generate hash codes by minimizing the Hamming distances of similar pairs and maximizing the Hamming distances of dissimilar pairs.
As deep convolutional networks (Krizhevsky et al., 2012; He et al., 2016) yield sharp performance on many computer vision tasks, deep learning to hash has recently attracted attention. CNNH (Xia et al., 2014) adopts a two-stage strategy in which the first stage learns hash codes and the second stage learns a deep network to map input images to those hash codes. DNNH (Lai et al., 2015) improves the two-stage CNNH with a simultaneous feature learning and hash coding pipeline, so that representations and hash codes can be optimized jointly. DHN (Zhu et al., 2016) further improves DNNH with a cross-entropy loss and a quantization loss that preserve the pairwise similarity and control the quantization error simultaneously, obtaining state-of-the-art performance on several benchmarks. DPSH (Li et al., 2016) and DSH (Liu et al., 2016) follow a framework similar to DHN and yield similar retrieval performance. HashNet (Cao et al., 2017) improves DHN by balancing the positive and negative pairs in the training data to trade off precision against recall, and by a continuation technique that yields exactly binary codes with the lowest quantization error, achieving state-of-the-art performance on several benchmark datasets.

However, existing deep hashing methods do not consider the data skewness problem. They let all training pairs contribute equally to the loss function, so easy pairs overwhelm the loss and the difficult ones cannot be trained sufficiently. To address these problems, we propose a novel Deep Priority Hashing (DPH) model. We design a priority cross-entropy loss that concentrates on difficult pairs more than on easy pairs, and a priority quantization loss that concentrates on hard-to-quantize examples to generate less lossy hash codes. This work is among the earliest endeavors on deep hashing with prioritization towards different data skewness scenarios.
The Focal Loss was introduced by Lin et al. (2017) and yields state-of-the-art performance for object detection. It is designed to address the extreme imbalance between examples of different classes (e.g. foreground and background classes) during training. The Focal Loss is closely connected to the cross-entropy loss for binary classification. The Cross-Entropy (CE) loss is defined as

(1)  $\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$

In the above, $y \in \{\pm 1\}$ specifies the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$. For notational convenience, Lin et al. (2017) defined

(2)  $p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$

and rewrote $\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$.
One notable property of the cross-entropy loss is that even easily classified examples ($p_t \gg 0.5$) incur a loss of non-trivial magnitude. When summed over a large number of easy examples, these small loss values can accumulate and overwhelm the rare class with difficult examples. To address this problem, Lin et al. (2017) reshape the cross-entropy loss into the Focal Loss, which down-weighs easy examples and focuses on difficult examples. The Focal Loss (FL) adds a modulating factor $(1 - p_t)^{\gamma}$ to the cross-entropy loss, with a tunable focusing parameter $\gamma \geq 0$:

(3)  $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$
The Focal Loss has two nice properties. 1) When an example is misclassified and $p_t$ is small, the modulating factor is near $1$ and the loss is unaffected; as $p_t \to 1$, the factor goes to $0$ and the loss for well-classified examples is down-weighed. 2) The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighed. When $\gamma = 0$, FL is equivalent to CE, and as $\gamma$ increases the effect of the modulating factor likewise increases.
Intuitively, the modulating factor reduces the loss contribution from easy examples and enlarges the loss gap between easy and difficult examples, which in turn strengthens the correction of misclassified examples.
In practice, an $\alpha$-balanced variant of the focal loss is preferred:

(4)  $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
This variant of the focal loss can further address the class imbalance problem between large classes and rare classes, which is a typical option for practical applications.
A limitation of the focal loss is that the modulating factor is strictly determined by the classification uncertainty $p_t$. In many real scenarios, we may need other measures to quantify the difficulty of image pairs. This flexibility in choosing different modulating factors for prioritization is the motivation of our work.
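As a concrete reference point, the $\alpha$-balanced focal loss of Equations (3)–(4) can be sketched in a few lines of NumPy (function and variable names are ours, not from the paper):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: labels in {0, 1}.
    The modulating factor (1 - p_t)^gamma down-weighs well-classified examples.
    """
    p = np.clip(p, 1e-7, 1.0 - 1e-7)                 # numerical safety for log
    p_t = np.where(y == 1, p, 1.0 - p)               # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With $\gamma = 0$ and $\alpha_t = 1$ this reduces to the plain cross-entropy; increasing $\gamma$ widens the loss gap between easy and hard examples.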
In similarity retrieval, we are given a training set of $N$ points $\{x_i\}_{i=1}^{N}$, each represented by a $D$-dimensional feature vector $x_i \in \mathbb{R}^{D}$. Some pairs of points $x_i$ and $x_j$ are provided with similarity labels $s_{ij}$, where $s_{ij} = 1$ if $x_i$ and $x_j$ are similar and $s_{ij} = 0$ if they are dissimilar. Deep hashing learns a nonlinear hash function $f: \mathbb{R}^{D} \to \{-1, 1\}^{K}$ from the input space to the Hamming space with a deep network. It encodes each point $x_i$ into a $K$-bit hash code $h_i = f(x_i)$ such that the similarity in the training pairs is preserved in the Hamming space. The similarity information can be collected from semantic labels or from relevance feedback in online search systems.

This paper presents Deep Priority Hashing (DPH), an end-to-end architecture for efficient image retrieval, as shown in Figure 1. The proposed deep architecture accepts pairwise input images and processes them through a pipeline of deep representation learning and binary hash coding: 1) a convolutional network (CNN) for learning the deep representation of each image $x_i$; 2) a fully-connected hash layer (fch) for transforming the deep representation into a $K$-bit hash code $h_i$; 3) a priority cross-entropy loss that prioritizes difficult pairs over easy pairs for similarity-preserving learning; and 4) a priority quantization loss that prioritizes hard-to-quantize images for controlling the binarization error caused by continuous relaxation during optimization.
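The fch hash layer described in this pipeline can be sketched as follows (a minimal NumPy illustration; the weight and shape names are ours, not the paper's):

```python
import numpy as np

def hash_layer(features, W, b):
    """fch layer: map fc7 features to K-dim continuous codes z = tanh(xW + b).

    tanh squashes each code coordinate into (-1, 1), so the final sign
    thresholding h = sgn(z) incurs only a small quantization error.
    """
    return np.tanh(features @ W + b)
```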
Figure 1 illustrates the architecture of the proposed Deep Priority Hashing. We extend AlexNet (Krizhevsky et al., 2012), a deep convolutional neural network (CNN) with five convolutional layers conv1–conv5 and three fully-connected layers fc6–fc8. We replace the classifier layer fc8 with a new hash layer fch of $K$ hidden units, which transforms the representation of the fc7 layer into a $K$-dimensional continuous code $z_i \in \mathbb{R}^{K}$. We can obtain the hash code $h_i$ through sign thresholding $h_i = \mathrm{sgn}(z_i)$. However, during training we adopt the hyperbolic tangent (tanh) function to squash the continuous code within $[-1, 1]$ instead of using the sign function. To further guarantee the quality of the hash codes for efficient image retrieval, we preserve the similarity of training pairs with a priority cross-entropy loss and control the quantization error with a priority quantization loss. Both loss functions can be derived in the Maximum a Posteriori (MAP) estimation framework. Though we follow most prior work in using AlexNet (Figure 1), the backbone can easily be replaced with any classification network, since we only replace the last classifier layer while all other layers are inherited from the backbone.

This paper enables deep hashing from skewed data with both easy and difficult examples through a Bayesian learning framework that jointly preserves the similarity information of pairwise images and controls the quantization error of continuous relaxation. For a pair of hash codes $h_i$ and $h_j$, the Hamming distance $\mathrm{dist}_H(h_i, h_j)$ and the inner product $\langle h_i, h_j \rangle$ satisfy $\mathrm{dist}_H(h_i, h_j) = \frac{1}{2}\left(K - \langle h_i, h_j \rangle\right)$, indicating that Hamming distance and inner product can be used interchangeably for binary codes. Thus we adopt the inner product to quantify pairwise similarity. Given the training image pairs with pairwise similarity labels $S = \{s_{ij}\}$, the logarithm Weighted Maximum a Posteriori (WMAP) estimation of the hash codes $H = [h_1, \ldots, h_N]$ for the $N$ training images can be defined as
(5)  $\log P(H \mid S) \propto \log P(S \mid H)\, P(H) = \sum_{s_{ij} \in S} w_{ij} \log P(s_{ij} \mid h_i, h_j) + \sum_{i=1}^{N} w_i \log P(h_i)$

where $P(S \mid H)$ is the weighted likelihood function for the pairwise data and $w_{ij}$ is the weight of each training pair $(x_i, x_j)$; this extends the weighted maximum likelihood on pointwise data (Dmochowski et al., 2010). Another difference from (Dmochowski et al., 2010) is the weighted prior $P(H)$, where $w_i$ is the weight of image $x_i$. We propose the above Weighted Maximum a Posteriori (WMAP) estimation over pairwise data under different skewness scenarios. It is a general framework for instantiating specific learning to hash models by choosing well-specified probability functions and weighting schemes.
We first derive the priority cross-entropy loss for similarity-preserving learning. For each image pair $(x_i, x_j)$ with label $s_{ij}$, $P(s_{ij} \mid h_i, h_j)$ is the conditional probability of the similarity label $s_{ij}$ given the corresponding pair of hash codes $h_i$ and $h_j$, defined by the logistic function:

(6)  $P(s_{ij} \mid h_i, h_j) = \sigma\left(\langle h_i, h_j \rangle\right)^{s_{ij}} \left(1 - \sigma\left(\langle h_i, h_j \rangle\right)\right)^{1 - s_{ij}}$

where $\sigma(x) = \frac{1}{1 + e^{-\beta x}}$ is an adaptive variant of the sigmoid function with parameter $\beta$ to control its bandwidth. Since the sigmoid function with larger $\beta$ has a larger saturation zone, we usually require $\beta < 1$ so that back-propagation receives more gradient signal.

Motivated by the Focal Loss (FL) (Lin et al., 2017), we use a weight that combines a scaling factor and a modulating factor to simultaneously model the class diversity (including the imbalance between similar and dissimilar pairs) and the variation between easy and difficult examples. Unlike the focal loss, however, the modulating factor uses a different measure $q_{ij}$ instead of the classification uncertainty to quantify the difficulty of each image pair. This new design relaxes the restriction in the focal loss that the "difficulty" of each image pair and the classification uncertainty must be consistent, allowing more flexible choices for the modulating factor. Furthermore, in deep hashing we do not have images with pointwise class labels, and thus need to model image pairs with pairwise labels $s_{ij}$. We propose a novel priority weighting scheme as follows,
(7)  $w_{ij} = \alpha_{ij}\, (q_{ij})^{\gamma}$

where $w_{ij}$ is the weight of each training pair $(x_i, x_j)$. It consists of a scaling part $\alpha_{ij}$ that weighs against the class imbalance problem, and a modulating part $(q_{ij})^{\gamma}$ that weighs between easy and hard examples. Specifically, the measure of difficulty for the modulating part, i.e. $q_{ij}$, is defined as
(8)  $q_{ij} = \begin{cases} \frac{1}{2}\left(1 - \cos(z_i, z_j)\right) & \text{if } s_{ij} = 1 \\ \frac{1}{2}\left(1 + \cos(z_i, z_j)\right) & \text{if } s_{ij} = 0 \end{cases}$

where $q_{ij}$ indicates how difficult it is to classify an image pair as similar when $s_{ij} = 1$, or as dissimilar when $s_{ij} = 0$. With these difficulty quantities $q_{ij}$, we can put different weights on easy and hard examples so as to prioritize the more difficult image pairs. An important motivation for using the cosine similarity as the measure of "difficulty" is that it is magnitude-invariant, consistent with the fact that hash codes have the same magnitude for different images. Therefore, $q_{ij}$ can potentially eliminate the variations in code magnitudes caused by skewed data distributions.
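The magnitude-invariant difficulty measure can be sketched as follows (a NumPy illustration; symbol and function names are ours):

```python
import numpy as np

def pair_difficulty(z_i, z_j, s_ij):
    """Magnitude-invariant difficulty q_ij of an image pair, in [0, 1].

    A similar pair (s_ij = 1) with low cosine similarity between its
    continuous codes is hard; a dissimilar pair (s_ij = 0) with high
    cosine similarity is hard.
    """
    cos = np.dot(z_i, z_j) / (np.linalg.norm(z_i) * np.linalg.norm(z_j))
    return (1.0 - cos) / 2.0 if s_ij == 1 else (1.0 + cos) / 2.0
```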
In the original focal loss, the scaling part tackles the imbalance between different classes in a pointwise scenario, which is not applicable to the pairwise scenario. In HashNet (Cao et al., 2017), the numbers of similar and dissimilar pairs are used to compute the weight against data imbalance. However, such a strategy cannot quantify the imbalance between the underlying classes, since class information is unavailable in pairwise scenarios. Quantifying the scaling part for the pairwise scenario thus remains a non-trivial problem unsolved by previous work. In this paper, we define the scaling part from the degrees of each vertex in the similarity graph as

(9)  $\alpha_{ij} = \begin{cases} \frac{|S_i|}{|S_i^1|} \cdot \frac{|S_j|}{|S_j^1|} & \text{if } s_{ij} = 1 \\ \frac{|S_i|}{|S_i^0|} \cdot \frac{|S_j|}{|S_j^0|} & \text{if } s_{ij} = 0 \end{cases}$

where $\alpha_{ij}$ is the pairwise weight characterizing the class imbalance, influenced by both images of the pair. $S_i$ is the set of pairs containing $x_i$; it is divided into two subsets: $S_i^1$, the subset of similar pairs that contain $x_i$, and $S_i^0$, the subset of dissimilar pairs that contain $x_i$. $S_j$, $S_j^1$ and $S_j^0$ are defined analogously. Intuitively, for each image $x_i$, $|S_i^1|$ and $|S_i^0|$ respectively count its similar and dissimilar images, which naturally captures the data skewness caused by class imbalance.
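One way to realize this degree-based scaling is the inverse-frequency scheme below; the exact normalization used in the paper may differ, so treat this as an illustrative sketch (names are ours):

```python
def scaling_weight(i, j, s_ij, n_sim, n_dis):
    """Scaling part alpha_ij built from per-vertex degrees in the similarity graph.

    n_sim[i] = |S_i^1| (similar pairs containing x_i),
    n_dis[i] = |S_i^0| (dissimilar pairs containing x_i).
    Rare relations receive larger weights, countering class imbalance.
    """
    deg_i, deg_j = n_sim[i] + n_dis[i], n_sim[j] + n_dis[j]
    if s_ij == 1:
        return (deg_i / n_sim[i]) * (deg_j / n_sim[j])
    return (deg_i / n_dis[i]) * (deg_j / n_dis[j])
```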
By plugging the priority weight of Equation (7) into the WMAP estimation of Equation (5) and relaxing the binary codes $h_i$ to the continuous codes $z_i$, we obtain a novel (pairwise) Priority Cross-Entropy Loss:

(10)  $L = \sum_{s_{ij} \in S} w_{ij} \left( \log\left(1 + e^{\beta \langle z_i, z_j \rangle}\right) - \beta\, s_{ij} \langle z_i, z_j \rangle \right)$

The priority cross-entropy loss is a natural extension of the focal loss to the pairwise classification scenario: it inherits all the nice properties of the focal loss while broadening the choice of modulating factor to more suitable measures of "difficulty". More specifically, the priority cross-entropy loss down-weighs confident pairs and prioritizes difficult pairs with low confidence. With the proposed scaling strategy of Equation (9), it also assigns larger weights to rare classes and smaller weights to popular classes, which, for the first time, addresses the class imbalance issue in the pairwise scenario for image retrieval.
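Putting the pieces together for a single pair, the priority cross-entropy term can be sketched as below (a NumPy illustration under our reading of Equations (6)–(10); names and defaults are ours):

```python
import numpy as np

def priority_ce_pair(z_i, z_j, s_ij, alpha_ij, gamma=2.0, beta=0.5):
    """Priority cross-entropy term w_ij * (-log P(s_ij | z_i, z_j)) for one pair.

    Uses the adaptive sigmoid sigma(x) = 1 / (1 + exp(-beta * x)) on the
    inner product of the continuous codes, and the priority weight
    w_ij = alpha_ij * q_ij^gamma with the cosine-based difficulty q_ij.
    """
    cos = np.dot(z_i, z_j) / (np.linalg.norm(z_i) * np.linalg.norm(z_j))
    q_ij = (1.0 - cos) / 2.0 if s_ij == 1 else (1.0 + cos) / 2.0
    w_ij = alpha_ij * q_ij ** gamma
    ip = np.dot(z_i, z_j)
    # -log P(s_ij | z_i, z_j) = log(1 + exp(beta*ip)) - beta*s_ij*ip, computed stably
    nll = np.logaddexp(0.0, beta * ip) - beta * s_ij * ip
    return w_ij * nll
```

The full loss sums this term over all labeled pairs, with $\alpha_{ij}$ computed from the similarity-graph degrees.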
To control the quantization error of the binarization operation, we further derive the priority quantization loss from the MAP framework to yield nearly lossless hash codes. Since discrete optimization of Equation (5) with binary constraints is very challenging, continuous relaxation of the binary constraints is widely adopted by existing hashing methods (Wang et al., 2018) for ease of optimization. However, continuous relaxation gives rise to two important technical issues: 1) uncontrolled quantization error caused by binarizing continuous codes into binary codes, and 2) large approximation error from adopting the inner product between continuous codes as the surrogate of the Hamming distance between binary codes. To control the quantization error and close the gap between the Hamming distance and its surrogate, we propose an (unnormalized) bimodal Laplacian prior for the continuous codes $z_i$, defined as:

(11)  $P(z_i) = \frac{1}{2\epsilon} \exp\left(-\frac{\left\| |z_i| - \mathbf{1} \right\|_1}{\epsilon}\right)$

where $\epsilon$ is the diversity parameter and $\mathbf{1} \in \mathbb{R}^{K}$ denotes the vector of all ones. One can verify that this prior puts the largest density on the discrete values $\{-1, 1\}^{K}$, enforcing that the learned embeddings $z_i$ be assigned to $\{-1, 1\}^{K}$ with the largest probability.
Analogously to the priority cross-entropy loss, we define the priority weight for the priority quantization loss as

(12)  $w_i = \alpha_i\, (1 - q_i)^{\gamma}$

which consists of a scaling part $\alpha_i$ and a modulating part $(1 - q_i)^{\gamma}$. Note that we need to control the quantization error of all images equally, regardless of whether they come from rare classes; thus we set the scaling to a constant, $\alpha_i = 1$. The modulating part controls the quantization error under the variations between easy and hard examples, with $q_i$ defined as
(13)  $q_i = \cos\left(|z_i|, \mathbf{1}\right) = \frac{\langle |z_i|, \mathbf{1} \rangle}{\left\| z_i \right\| \cdot \left\| \mathbf{1} \right\|}$

which indicates how likely a continuous code $z_i$ is to be perfectly quantized into a binary code $h_i$. With the quantities $q_i$, we can put different weights on easy-to-quantize and hard-to-quantize examples and prioritize the hard-to-quantize ones.
By plugging the priority weight of Equation (12) into the WMAP of Equation (5), we obtain a novel Priority Quantization Loss:

(14)  $Q = \sum_{i=1}^{N} w_i \left\| |z_i| - \mathbf{1} \right\|_1$

The priority quantization loss helps generate nearly lossless hash codes by reducing the quantization error of hard-to-quantize examples more than that of easy-to-quantize examples.
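For a single code, the priority quantization term can be sketched as follows (a NumPy illustration; the cosine form of $q_i$ follows our reading of Equation (13), and the names are ours):

```python
import numpy as np

def priority_quantization(z, gamma=2.0):
    """Priority quantization term w_i * || |z| - 1 ||_1 for one continuous code.

    q = cos(|z|, 1) measures how close |z| is to the all-ones vector, i.e.
    how close z is to an exactly binary code in {-1, +1}^K; the modulating
    factor (1 - q)^gamma prioritizes hard-to-quantize codes.
    """
    ones = np.ones_like(z)
    q = np.dot(np.abs(z), ones) / (np.linalg.norm(z) * np.linalg.norm(ones))
    w = (1.0 - q) ** gamma
    return w * np.sum(np.abs(np.abs(z) - 1.0))   # weighted L1 distance to {-1, +1}^K
```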
This paper establishes deep learning to hash for skewed data with pairwise similarity information, built on two key components: the Priority Cross-Entropy Loss for similarity-preserving learning and the Priority Quantization Loss for generating nearly lossless hash codes. The overall optimization problem integrates the Priority Cross-Entropy Loss of Equation (10) and the Priority Quantization Loss of Equation (14):

(15)  $\min_{\Theta}\; L + \lambda Q$

where $\Theta$ denotes the set of network parameters, efficiently optimized by standard back-propagation with automatic differentiation. Note that the diversity parameter $\epsilon$ in Equation (11) gives rise to the trade-off hyper-parameter $\lambda$ between the two losses.
Based on the WMAP estimation in Equation (15), we can learn compact hash codes by jointly preserving the pairwise similarity and controlling the quantization error. Finally, we obtain $K$-bit binary codes by simple sign thresholding $h = \mathrm{sgn}(z)$, where $\mathrm{sgn}(\cdot)$ is the element-wise sign function: $\mathrm{sgn}(z_k) = 1$ if $z_k > 0$ and $-1$ otherwise. It is worth noting that, since we have minimized the quantization error in Equation (15) during training, this final binarization step incurs only a very small loss of retrieval quality.
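The final binarization step, together with the Hamming/inner-product identity $\mathrm{dist}_H = \frac{1}{2}(K - \langle h_i, h_j \rangle)$ used throughout, can be sketched as:

```python
import numpy as np

def binarize(z):
    """Sign thresholding h = sgn(z), producing codes in {-1, +1}."""
    return np.where(z > 0, 1, -1)

def hamming_distance(h_i, h_j):
    """Hamming distance via the inner-product identity for {-1, +1} codes."""
    K = len(h_i)
    return (K - int(np.dot(h_i, h_j))) // 2
```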
We conduct extensive experiments to evaluate DPH with several stateoftheart hashing methods on three benchmark datasets. Codes and datasets are available at https://github.com/thuml/DPH.
ImageNet is a benchmark dataset of the Large Scale Visual Recognition Challenge (ILSVRC 2015) (Russakovsky et al., 2015). It contains over 1.2M images in the training set and 50K images in the validation set, where each image is single-labeled with one of 1,000 categories. We use the sample of 100 categories organized by HashNet (Cao et al., 2017). We use the same database set and query set but re-sample the training set to make it skewed. Our training set contains 10,000 images in three groups: the first with big classes (1,300 images/class), the second with middle classes (400 images/class), and the third with small classes (50 images/class). The classes of each group are randomly sampled.
NUS-WIDE (http://lms.comp.nus.edu.sg/research/NUSWIDE.htm) (Chua et al., 2009) is an image dataset containing 269,648 images from Flickr.com. Each image is annotated with some of 81 ground-truth concepts (categories). We follow an evaluation protocol slightly different from HashNet (Cao et al., 2017). We randomly sample 50 images per class as query images and use the remaining images as the database, from which we further sample 10,000 images at random as training images.
MS-COCO (http://mscoco.org) (Lin et al., 2014) is an image recognition, segmentation, and captioning dataset. The current release contains 82,783 training images and 40,504 validation images, where each image is labeled with some of 80 categories. We obtain a set of 122,218 images by combining the training and validation images. As with NUS-WIDE, we randomly sample 50 images per class as queries, use the remaining images as the database, and randomly sample 10,000 images from the database as training images.
Following the standard evaluation protocol of previous work (Xia et al., 2014; Lai et al., 2015; Zhu et al., 2016; Cao et al., 2017), the similarity information for hash function learning and for ground-truth evaluation is constructed from image labels: if two images $x_i$ and $x_j$ share at least one label, they are similar and $s_{ij} = 1$; otherwise, they are dissimilar and $s_{ij} = 0$. Note that, although we use image labels to construct the similarity information, our DPH model can learn hash codes whenever only the similarity information is available. These datasets exhibit data skewness and can thus be used to evaluate different hashing methods in the data skewness scenario.
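This label-based similarity construction amounts to checking for at least one shared label per pair:

```python
import numpy as np

def pairwise_similarity(labels):
    """s_ij = 1 iff images i and j share at least one label.

    labels: (N, C) binary multi-hot label matrix.
    Returns the (N, N) similarity matrix S.
    """
    shared = labels @ labels.T        # number of labels shared by each pair
    return (shared > 0).astype(int)
```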
We compare the retrieval performance of DPH against eleven classical or state-of-the-art hashing methods: the unsupervised methods LSH (Gionis et al., 1999), SH (Weiss et al., 2009), and ITQ (Gong and Lazebnik, 2011); the supervised shallow methods KSH (Liu et al., 2012) and SDH (Shen et al., 2015); and the supervised deep methods CNNH (Xia et al., 2014), DNNH (Lai et al., 2015), DPSH (Li et al., 2016), DSH (Liu et al., 2016), DHN (Zhu et al., 2016), and HashNet (Cao et al., 2017).


Table 1. MAP results of DPH and comparison methods on the three benchmark datasets.
Method  ImageNet  NUS-WIDE  MS-COCO  
16 bits  32 bits  48 bits  64 bits  16 bits  32 bits  48 bits  64 bits  16 bits  32 bits  48 bits  64 bits  
DPH  0.3252  0.4803  0.5279  0.5492  0.6689  0.7014  0.7218  0.7312  0.7100  0.7435  0.7544  0.7614 
HashNet (Cao et al., 2017)  0.2959  0.4211  0.4836  0.5056  0.6294  0.6604  0.6785  0.6883  0.6529  0.7025  0.7231  0.7335 
DHN (Zhu et al., 2016)  0.2734  0.3821  0.4352  0.4921  0.5940  0.6186  0.6277  0.6343  0.6206  0.6445  0.6641  0.6685 
DSH (Liu et al., 2016)  0.2665  0.3513  0.3963  0.4213  0.5880  0.6102  0.6155  0.6242  0.6242  0.6369  0.6481  0.6511 
DPSH (Li et al., 2016)  0.1625  0.3037  0.4074  0.4841  0.4041  0.4826  0.5067  0.5379  0.5660  0.6269  0.6587  0.6906 
DNNH (Lai et al., 2015)  0.1189  0.2783  0.3328  0.3456  0.3845  0.4632  0.4901  0.5158  0.5461  0.5923  0.6147  0.6272 
CNNH (Xia et al., 2014)  0.0890  0.2459  0.2903  0.3156  0.3653  0.4432  0.4672  0.4801  0.5659  0.5624  0.5519  0.5687 
SDH (Shen et al., 2015)  0.1023  0.1543  0.1785  0.2043  0.3021  0.4056  0.4329  0.4675  0.4933  0.5323  0.5493  0.5577 
KSH (Liu et al., 2012)  0.0823  0.1121  0.1341  0.1456  0.2007  0.2704  0.3002  0.3257  0.4946  0.5193  0.5224  0.5316 
ITQ (Gong and Lazebnik, 2011)  0.2546  0.3313  0.3564  0.4023  0.3309  0.4179  0.4461  0.4774  0.5289  0.5824  0.6227  0.6320 
SH (Weiss et al., 2009)  0.0917  0.1234  0.1524  0.1623  0.1649  0.1904  0.2115  0.2605  0.4429  0.4910  0.4782  0.5073 
LSH (Gionis et al., 1999)  0.0412  0.0623  0.0889  0.1021  0.0557  0.0938  0.1392  0.1724  0.3932  0.4659  0.5301  0.5171 

Figure 2: Experimental results of DPH and comparison methods on the ImageNet dataset under three evaluation metrics.
We evaluate retrieval quality with four standard metrics: Mean Average Precision (MAP), Precision-Recall curves (PR), Precision curves within Hamming distance 2 (P@H=2), and Precision curves with respect to different numbers of top returned samples (P@N). For fair comparison, all methods use identical training and test sets. We adopt MAP@1000 for ImageNet and MAP@5000 for the other datasets, as in (Cao et al., 2017). It is worth noting that, in order to evaluate retrieval performance equally on large and small classes, we let each class appear nearly equally in the query set of each dataset, which is a reasonable setting under data skewness.
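MAP@k averages the following per-query average precision over all queries (a standard sketch, not the paper's exact evaluation code):

```python
import numpy as np

def average_precision_at_k(relevance, k):
    """AP@k from a binary relevance list ordered by increasing Hamming distance."""
    rel = np.asarray(relevance[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    # precision at each rank, credited only at the positions of relevant hits
    precision_at = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at * rel).sum() / rel.sum())
```

MAP@k is then the mean of this value over the query set.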
For shallow hashing methods, we use DeCAF features (Donahue et al., 2014) as input. For deep hashing methods, we use raw images. We use AlexNet (Krizhevsky et al., 2012) for all deep hashing methods and implement DPH in Caffe (Jia et al., 2014). We fine-tune the layers conv1–fc7 copied from AlexNet pre-trained on ImageNet 2012 and train the hash layer fch via back-propagation. As the fch layer is trained from scratch, we set its learning rate to 10 times that of the lower layers. We use mini-batch stochastic gradient descent (SGD) with momentum 0.9 and the learning-rate annealing strategy implemented in Caffe, and cross-validate the learning rate over a geometric grid with a multiplicative step-size. We fix the mini-batch size and the weight-decay parameter, and select the hyper-parameters of all methods through three-fold cross-validation.


Table 2 (left). MAP results on the balanced ImageNet dataset.
Method  ImageNet  
16 bits  32 bits  48 bits  64 bits  
DPH  0.5168  0.6409  0.6723  0.6967 
HashNet (Cao et al., 2017)  0.5059  0.6306  0.6633  0.6835 



Table 2 (right). MAP results with the VGG-16 backbone.
Method  ImageNet  NUS-WIDE  MS-COCO  
16 bits  64 bits  16 bits  64 bits  16 bits  64 bits  
DPH  0.3455  0.6395  0.7107  0.7439  0.7252  0.7958 
HashNet (Cao et al., 2017)  0.3252  0.5891  0.6853  0.7163  0.6835  0.7581 

Table 1 shows the MAP results. DPH substantially outperforms all comparison methods. Compared to ITQ, the best shallow hashing method using deep features, we achieve absolute boosts of 13.5%, 28.8%, and 15.1% in average MAP across code lengths on ImageNet, NUS-WIDE, and MS-COCO, respectively. Compared to HashNet, the state-of-the-art deep hashing method, we achieve absolute boosts of 4.4%, 4.2%, and 3.9% in average MAP on the three datasets, respectively.

HashNet uses weighted maximum likelihood to tackle data imbalance, but it only weighs positive pairs against negative pairs, which cannot capture the more fine-grained class imbalance. Class imbalance substantially deteriorates the average performance over all classes, since previous hashing methods unavoidably focus on large classes and underperform on small ones. Thus, with each class appearing nearly equally in the query set, the results on each dataset are worse than those reported in the original papers. In contrast, DPH puts larger weights on samples of small classes to focus more on difficult examples (usually samples of small classes), so it performs relatively well on all classes and yields the best overall performance.
The performance in terms of Precision within Hamming radius 2 (P@H=2) is important for efficient retrieval with hash codes, since such hash lookup requires only $O(1)$ time per query. As shown in Figures 2(a), 3(a) and 4(a), DPH achieves the highest P@H=2 results on all three datasets. In particular, P@H=2 of DPH with 32 bits is better than that of HashNet at any code length, which validates that DPH can learn more compact hash codes than HashNet. With longer codes the Hamming space becomes sparse and few data points fall within the Hamming ball of radius 2 (Fleet et al., 2012), which is why most hashing methods achieve their best accuracy at relatively short code lengths.
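Lookup within Hamming radius 2 probes only $1 + K + \binom{K}{2}$ hash-table buckets per query, independent of database size; a sketch of the bucket enumeration:

```python
from itertools import combinations

def hamming_ball_keys(code, radius=2):
    """All {-1, +1} codes within Hamming distance `radius` of `code`.

    Probing a hash table at these keys retrieves every database item
    inside the Hamming ball in time independent of the database size.
    """
    code = tuple(code)
    keys = [code]
    for r in range(1, radius + 1):
        for idx in combinations(range(len(code)), r):
            flipped = list(code)
            for i in idx:
                flipped[i] = -flipped[i]   # flip the bits at positions idx
            keys.append(tuple(flipped))
    return keys
```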
Figures 2(b), 3(b) and 4(b) and Figures 2(c), 3(c) and 4(c) show the retrieval performance on the three datasets in terms of Precision-Recall (PR) curves and Precision curves with respect to different numbers of top returned samples (P@N), respectively. DPH outperforms the comparison methods by large margins. In particular, DPH achieves much higher precision at lower recall levels or when the number of top results is small. This is desirable for precision-first retrieval, which is widely implemented in practical systems.
To enable direct comparison with published papers, we also test our method on the Balanced ImageNet dataset as in HashNet (Cao et al., 2017). As shown in the left part of Table 2, our DPH outperforms HashNet even on the balanced ImageNet dataset, which shows that our method performs well even on a balanced dataset.
As the network structure can influence the performance of general-purpose computer vision models, we also test our model on the three datasets with VGG16 Net (Simonyan and Zisserman, 2015). As shown in the right part of Table 2, our DPH still outperforms HashNet, which demonstrates that our method is robust to the choice of base network architecture.


Method  ImageNet (16 / 32 / 48 / 64 bits)  NUS-WIDE (16 / 32 / 48 / 64 bits)  MS-COCO (16 / 32 / 48 / 64 bits)
DPH  0.3252 / 0.4803 / 0.5279 / 0.5592  0.6689 / 0.7014 / 0.7218 / 0.7312  0.7100 / 0.7435 / 0.7544 / 0.7614
DPH-F  0.3115 / 0.4683 / 0.5139 / 0.5487  0.6451 / 0.6798 / 0.6918 / 0.7012  0.7100 / 0.7335 / 0.7434 / 0.7555
DPH-W  0.1919 / 0.3645 / 0.4473 / 0.4767  0.6210 / 0.6535 / 0.6652 / 0.6703  0.6826 / 0.7180 / 0.7314 / 0.7356
DPH-Q  0.3102 / 0.4656 / 0.5162 / 0.5431  0.6132 / 0.6557 / 0.6723 / 0.6898  0.6978 / 0.7329 / 0.7475 / 0.7555

We dive deeper into the efficacy of the DPH model by studying three variants: 1) DPH-F, which replaces the modulating factor with that of the focal loss; 2) DPH-W, the variant without the priority weight; and 3) DPH-Q, the variant without the priority quantization loss. We compare these variants in Table 3.
As expected, DPH substantially outperforms DPH-W by large margins in average MAP across code lengths on ImageNet, NUS-WIDE, and MS-COCO. The classical pairwise cross-entropy loss (without priority weighting) has been widely adopted in previous work (Xia et al., 2014; Zhu et al., 2016). However, this classical loss accounts neither for class imbalance nor for the variation between easy and hard examples, so it may suffer a performance drop when the training data is highly skewed (e.g. NUS-WIDE) or contains large variations of easy and hard examples. In contrast, our DPH model uses the proposed priority cross-entropy loss, which is a principled solution to these data skewness problems.
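To make the idea concrete, here is a minimal numpy sketch of a focal-style priority cross-entropy: a modulating factor (1 - q)^gamma down-weights pairs the model already predicts confidently, and per-class weights can counter imbalance. The exact modulating factor and priority weights in DPH differ (their symbols are defined earlier in the paper); this only illustrates the reshaping of the pairwise cross-entropy:

```python
import numpy as np

def priority_cross_entropy(inner_prod, sim, gamma=2.0, alpha_pos=1.0, alpha_neg=1.0):
    """Pairwise cross-entropy reshaped to prioritize uncertain pairs.

    inner_prod: inner products of code pairs; sim: 1 for similar pairs, 0 otherwise.
    gamma, alpha_pos, alpha_neg are illustrative hyperparameters.
    """
    p = 1.0 / (1.0 + np.exp(-inner_prod))   # predicted probability of similarity
    q = np.where(sim == 1, p, 1.0 - p)      # confidence in the true relation
    ce = -np.log(np.clip(q, 1e-12, None))   # standard pairwise cross-entropy
    alpha = np.where(sim == 1, alpha_pos, alpha_neg)  # class-imbalance weight
    # (1 - q)^gamma down-weights pairs that are already predicted confidently
    return float(np.mean(alpha * (1.0 - q) ** gamma * ce))
```

A confidently predicted pair (large inner product for a similar pair) contributes far less to this loss than an uncertain one, so training effort concentrates on hard pairs.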
Furthermore, we notice that DPH outperforms DPH-F, which indicates that directly reusing the classification-oriented uncertainty measure of the focal loss is suboptimal for pairwise similarity learning. Our priority weight relaxes the restriction on the choice of the difficulty measure in the modulating factor.
DPH outperforms DPH-Q by 1.44%, 4.81%, and 0.89% in average MAP across code lengths on ImageNet, NUS-WIDE, and MS-COCO, respectively. These results validate that the proposed priority quantization loss (14) can control the quantization error caused by continuous relaxation and generate less lossy binary codes.
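As an illustration of the quantization side, the following sketch penalizes continuous codes by their distance from {-1, +1} and re-weights each example by its own error, so hard-to-quantize examples dominate. This is an assumed, simplified form, not the exact loss (14) from the paper:

```python
import numpy as np

def priority_quantization_loss(h, gamma=2.0):
    """Quantization penalty that prioritizes hard-to-quantize codes.

    h: continuous codes (batch x bits) before binarization via sign(h).
    """
    # per-example error: mean distance of code entries from {-1, +1}
    err = np.mean(np.abs(np.abs(h) - 1.0), axis=1)
    # modulating factor err**gamma lets hard-to-quantize examples dominate
    return float(np.mean(err ** gamma * err))
```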
We investigate sensitivity to the trade-off hyperparameter between the priority cross-entropy loss and the priority quantization loss on ImageNet, NUS-WIDE, and MS-COCO. From Figure 5, we observe that the MAP results fluctuate only slightly as the hyperparameter grows across the tested range. This demonstrates that DPH is not sensitive to the scale of this trade-off hyperparameter and performs stably over a wide range of values.
We visualize the t-SNE embeddings (Donahue et al., 2014) of hash codes generated by HashNet and DPH on ImageNet in Figure 6 (we sample 10 categories for ease of visualization). The hash codes generated by DPH show clear discriminative structure, with different categories well separated, while those generated by HashNet do not.
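Such a visualization can be reproduced with off-the-shelf t-SNE, e.g. via scikit-learn; the codes below are synthetic stand-ins for learned hash codes:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# synthetic stand-ins for 32-bit hash codes of two categories
codes = np.vstack([np.sign(rng.normal(+1.0, 0.5, (25, 32))),
                   np.sign(rng.normal(-1.0, 0.5, (25, 32)))])
labels = np.array([0] * 25 + [1] * 25)

embedding = TSNE(n_components=2, perplexity=15, init="pca",
                 random_state=0).fit_transform(codes)
# embedding has shape (50, 2); scatter-plot it colored by `labels`
# to inspect whether categories form well-separated clusters
```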
Figure 7 illustrates the top 10 returned images of DPH and the best deep hashing baseline, HashNet (Cao et al., 2017), for three query images on ImageNet, NUS-WIDE, and MS-COCO, respectively. DPH yields much more relevant and user-desired retrieval results than the state-of-the-art method.
This paper studies deep learning-to-hash approaches to establish efficient and effective image retrieval under a variety of data skewness scenarios. The proposed Deep Priority Hashing (DPH) approach generates compact and balanced hash codes by jointly optimizing a novel priority cross-entropy loss and a priority quantization loss in a single Bayesian learning framework. The overall model can be trained end-to-end with well-specified loss functions. Extensive experiments demonstrate that DPH yields state-of-the-art image retrieval performance under skewness on three benchmark datasets: ImageNet, NUS-WIDE, and MS-COCO.
In the future, we plan to extend the probabilistic framework to support image retrieval with relative similarity information, i.e. supervision that only indicates whether an image is more similar to one image than to another.
This work is supported by National Key R&D Program of China (2016YFB1000701), and NSFC grants (61772299, 61672313, 71690231).