HashNet
Code release for "HashNet: Deep Learning to Hash by Continuation" (ICCV 2017)
Learning to hash has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval, due to its computation efficiency and retrieval quality. Deep learning to hash, which improves retrieval quality by end-to-end representation learning and hash encoding, has received increasing attention recently. Subject to the ill-posed gradient difficulty in the optimization with sign activations, existing deep learning to hash methods need to first learn continuous representations and then generate binary hash codes in a separated binarization step, which suffers from substantial loss of retrieval quality. This work presents HashNet, a novel deep architecture for deep learning to hash by the continuation method with convergence guarantees, which learns exactly binary hash codes from imbalanced similarity data. The key idea is to attack the ill-posed gradient problem in optimizing deep networks with non-smooth binary activations by the continuation method, in which we begin by learning an easier network with a smoothed activation function and let it evolve during training, until it eventually goes back to being the original, difficult-to-optimize deep network with the sign activation function. Comprehensive empirical evidence shows that HashNet can generate exactly binary hash codes and yield state-of-the-art multimedia retrieval performance on standard benchmarks.
In the big data era, large-scale and high-dimensional media data has been pervasive in search engines and social networks. To guarantee retrieval quality and computation efficiency, approximate nearest neighbors (ANN) search has attracted increasing attention. Parallel to the traditional indexing methods [21], another advantageous solution is hashing methods [38], which transform high-dimensional media data into compact binary codes and generate similar binary codes for similar data items. In this paper, we focus on learning to hash methods [38] that build data-dependent hash encoding schemes for efficient image retrieval, which have shown better performance than data-independent hashing methods, e.g. Locality-Sensitive Hashing (LSH) [10].
Many learning to hash methods have been proposed to enable efficient ANN search by Hamming ranking of compact binary hash codes [19, 12, 30, 9, 25, 37, 27, 11, 41, 42]. Recently, deep learning to hash methods [40, 20, 34, 8, 44, 22, 24] have shown that end-to-end learning of feature representation and hash coding can be more effective using deep neural networks [18, 2], which can naturally encode any nonlinear hash functions. These deep learning to hash methods have shown state-of-the-art performance on many benchmarks. In particular, it proves crucial to jointly learn similarity-preserving representations and control the quantization error of binarizing continuous representations to binary codes [44, 22, 43, 24]. However, a key disadvantage of these deep learning to hash methods is that they need to first learn continuous deep representations, which are binarized into hash codes in a separated post-step of sign thresholding. By continuous relaxation, i.e. solving the discrete optimization of hash codes with continuous optimization, all these methods essentially solve an optimization problem that deviates significantly from the hashing objective, as they cannot learn exactly binary hash codes in their optimization procedure. Hence, existing deep hashing methods may fail to generate compact binary hash codes for efficient similarity retrieval.
There are two key challenges to enabling deep learning to hash truly end-to-end. First, to convert deep representations, which are continuous in nature, to exactly binary hash codes, we need to adopt the sign function as the activation function when generating binary hash codes using similarity-preserving learning in deep neural networks. However, the gradient of the sign function is zero for all nonzero inputs, which makes standard back-propagation infeasible. This is known as the ill-posed gradient problem, which is the key difficulty in training deep neural networks via back-propagation [14]. Second, the similarity information is usually very sparse in real retrieval systems, i.e., the number of similar pairs is much smaller than the number of dissimilar pairs. This results in the data imbalance problem, making similarity-preserving learning ineffective. Optimizing deep networks with sign activation remains an open problem and a key challenge for deep learning to hash.
This work presents HashNet, a new architecture for deep learning to hash by continuation with convergence guarantees, which addresses the ill-posed gradient and data imbalance problems in an end-to-end framework of deep feature learning and binary hash encoding. Specifically, we attack the ill-posed gradient problem in the non-convex optimization of deep networks with non-smooth sign activation by continuation methods [1], which address a complex optimization problem by smoothing the original function, turning it into a different problem that is easier to optimize. Gradually reducing the amount of smoothing during training yields a sequence of optimization problems that converges to the original optimization problem. A novel weighted pairwise cross-entropy loss function is designed for similarity-preserving learning from imbalanced similarity relationships. Comprehensive experiments show that HashNet can generate exactly binary hash codes and yield state-of-the-art retrieval performance on standard datasets.
Existing learning to hash methods can be organized into two categories: unsupervised hashing and supervised hashing. We refer readers to [38] for a comprehensive survey.
Unsupervised hashing methods learn hash functions that encode data points to binary codes by training from unlabeled data. Typical learning criteria include reconstruction error minimization [33, 12, 16] and graph learning [39, 26]. While unsupervised methods are more general and can be trained without semantic labels or relevance information, they are subject to the semantic gap dilemma [35]: the high-level semantic description of an object differs from its low-level feature descriptors. Supervised methods can incorporate semantic labels or relevance information to mitigate the semantic gap and improve the hashing quality significantly. Typical supervised methods include Binary Reconstruction Embedding (BRE) [19], Minimal Loss Hashing (MLH) [30] and Hamming Distance Metric Learning [31]. Supervised Hashing with Kernels (KSH) [25] generates hash codes by minimizing the Hamming distances across similar pairs and maximizing the Hamming distances across dissimilar pairs.
As deep convolutional neural networks (CNNs) [18, 13] yield breakthrough performance on many computer vision tasks, deep learning to hash has attracted attention recently. CNNH [40] adopts a two-stage strategy in which the first stage learns hash codes and the second stage learns a deep network to map input images to the hash codes. DNNH [20] improved the two-stage CNNH with a simultaneous feature learning and hash coding pipeline such that representations and hash codes can be optimized in a joint learning process. DHN [44] further improves DNNH by a cross-entropy loss and a quantization loss which preserve the pairwise similarity and control the quantization error simultaneously. DHN obtains state-of-the-art performance on several benchmarks.
However, existing deep learning to hash methods only learn continuous codes $g$ and need a binarization post-step to generate binary codes $h$. By continuous relaxation, these methods essentially solve an optimization problem $L(g)$ that deviates significantly from the hashing objective $L(h)$, because they cannot keep the codes exactly binary after convergence. Denote by $Q(g, h)$ the quantization error function incurred by binarizing continuous codes $g$ into binary codes $h$. Prior methods control the quantization error in two ways: (a) $\min L(g) + Q(g, h)$ through continuous optimization [44, 22]; (b) $\min L(h) + Q(g, h)$ through discrete optimization on $L(h)$ but continuous optimization on $Q(g, h)$ (the continuous optimization is used for out-of-sample extension as discrete optimization cannot be extended to the test data) [24]. However, since $Q(g, h)$ cannot be minimized to zero, there is a large gap between continuous codes and binary codes. To directly optimize $\min L(h)$, we must adopt sign as the activation function within deep networks, which enables generation of exactly binary codes but introduces the ill-posed gradient problem. This work is the first effort to learn sign-activated deep networks by the continuation method, which can directly optimize $L(h)$ for deep learning to hash.
In similarity retrieval systems, we are given a training set of $N$ points $\{x_i\}_{i=1}^{N}$, each represented by a $D$-dimensional feature vector $x_i \in \mathbb{R}^D$. Some pairs of points $x_i$ and $x_j$ are provided with similarity labels $s_{ij}$, where $s_{ij} = 1$ if $x_i$ and $x_j$ are similar while $s_{ij} = 0$ if $x_i$ and $x_j$ are dissimilar. The goal of deep learning to hash is to learn a nonlinear hash function $f: x \mapsto h \in \{-1, 1\}^K$ from input space $\mathbb{R}^D$ to Hamming space $\{-1, 1\}^K$ using deep neural networks, which encodes each point $x$ into a compact $K$-bit binary hash code $h = f(x)$ such that the similarity information between the given pairs $S$ can be preserved in the compact hash codes. In supervised hashing, the similarity set $S = \{s_{ij}\}$ can be constructed from semantic labels of data points or relevance feedback from click-through data in real retrieval systems.
To address the data imbalance and ill-posed gradient problems in an end-to-end learning framework, this paper presents HashNet, a novel architecture for deep learning to hash by continuation, shown in Figure 1. The architecture accepts pairwise input images $\{(x_i, x_j, s_{ij})\}$ and processes them through an end-to-end pipeline of deep representation learning and binary hash coding: (1) a convolutional network (CNN) for learning the deep representation of each image $x_i$, (2) a fully-connected hash layer (fch) for transforming the deep representation into a $K$-dimensional representation $z_i \in \mathbb{R}^K$, (3) a sign activation function $h = \mathrm{sgn}(z)$ for binarizing the $K$-dimensional representation $z_i$ into a $K$-bit binary hash code $h_i \in \{-1, 1\}^K$, and (4) a novel weighted cross-entropy loss for similarity-preserving learning from imbalanced data. We attack the ill-posed gradient problem of the non-smooth activation function $h = \mathrm{sgn}(z)$ by continuation, which starts with a smoothed activation function $y = \tanh(\beta x)$ and becomes more non-smooth by increasing $\beta$ as the training proceeds, until it eventually goes back to the original, difficult-to-optimize sign activation function.
To perform deep learning to hash from imbalanced data, we jointly preserve similarity information of pairwise images and generate binary hash codes by weighted maximum likelihood [6]. For a pair of binary hash codes $h_i$ and $h_j$, there exists a nice relationship between their Hamming distance $\mathrm{dist}_H(\cdot, \cdot)$ and inner product $\langle \cdot, \cdot \rangle$: $\mathrm{dist}_H(h_i, h_j) = \frac{1}{2}\left(K - \langle h_i, h_j \rangle\right)$. Hence, the Hamming distance and inner product can be used interchangeably for binary hash codes, and we adopt the inner product to quantify pairwise similarity. Given the set of pairwise similarity labels $S = \{s_{ij}\}$, the Weighted Maximum Likelihood (WML) estimation of the hash codes $H = [h_1, \ldots, h_N]$ for all $N$ training points is
$$\log P(S \mid H) = \sum_{s_{ij} \in S} w_{ij} \log P(s_{ij} \mid h_i, h_j), \tag{1}$$
where $P(S \mid H)$ is the weighted likelihood function, and $w_{ij}$ is the weight for each training pair $(x_i, x_j, s_{ij})$, which is used to tackle the data imbalance problem by weighting the training pairs according to the importance of misclassifying that pair [6]. Since each similarity label in $S$ can only be $s_{ij} = 1$ (similar) or $s_{ij} = 0$ (dissimilar), to account for the data imbalance between similar and dissimilar pairs, we set
$$w_{ij} = c_{ij} \cdot \begin{cases} |S| / |S_1|, & s_{ij} = 1 \\ |S| / |S_0|, & s_{ij} = 0 \end{cases} \tag{2}$$
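where $S_1 = \{s_{ij} \in S : s_{ij} = 1\}$ is the set of similar pairs and $S_0 = \{s_{ij} \in S : s_{ij} = 0\}$ is the set of dissimilar pairs; $c_{ij}$ is the continuous similarity, i.e. $c_{ij} = \frac{|y_i \cap y_j|}{|y_i \cup y_j|}$ if labels $y_i$ and $y_j$ of $x_i$ and $x_j$ are given, and $c_{ij} = 1$ if only $s_{ij}$ is given. For each pair, $P(s_{ij} \mid h_i, h_j)$ is the conditional probability of similarity label $s_{ij}$ given a pair of hash codes $h_i$ and $h_j$, which can be naturally defined as the pairwise logistic function,
$$P(s_{ij} \mid h_i, h_j) = \begin{cases} \sigma(\langle h_i, h_j \rangle), & s_{ij} = 1 \\ 1 - \sigma(\langle h_i, h_j \rangle), & s_{ij} = 0 \end{cases} = \sigma(\langle h_i, h_j \rangle)^{s_{ij}} \left(1 - \sigma(\langle h_i, h_j \rangle)\right)^{1 - s_{ij}}, \tag{3}$$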
where $\sigma(x) = 1 / (1 + e^{-\alpha x})$ is the adaptive sigmoid function with hyper-parameter $\alpha$ to control its bandwidth. Note that the sigmoid function with larger $\alpha$ will have a larger saturation zone where its gradient is zero. To perform more effective back-propagation, we usually require $\alpha < 1$, which is more effective than the typical setting of $\alpha = 1$. Similar to logistic regression, we can see in pairwise logistic regression that the smaller the Hamming distance $\mathrm{dist}_H(h_i, h_j)$ is, the larger the inner product $\langle h_i, h_j \rangle$ as well as the conditional probability $P(1 \mid h_i, h_j)$ will be, implying that pair $h_i$ and $h_j$ should be classified as similar; otherwise, the larger the conditional probability $P(0 \mid h_i, h_j)$ will be, implying that pair $h_i$ and $h_j$ should be classified as dissimilar. Hence, Equation (3) is a reasonable extension of the logistic regression classifier to the pairwise classification scenario, which is optimal for binary similarity labels $s_{ij} \in \{0, 1\}$.
By taking Equation (3) into the WML estimation in Equation (1), we obtain the optimization problem of HashNet,
$$\min_{\Theta} \sum_{s_{ij} \in S} w_{ij} \left( \log\left(1 + \exp\left(\alpha \langle h_i, h_j \rangle\right)\right) - \alpha s_{ij} \langle h_i, h_j \rangle \right), \tag{4}$$
where $\Theta$ denotes the set of all parameters in deep networks. Note that HashNet directly uses the sign activation function $h_i = \mathrm{sgn}(z_i)$, which converts the $K$-dimensional representation to exactly binary hash codes, as shown in Figure 1. By optimizing the WML estimation in Equation (4), we can enable deep learning to hash from imbalanced data under a statistically optimal framework. It is noteworthy that our work is the first attempt that extends the WML estimation from the pointwise scenario to the pairwise scenario. HashNet can jointly preserve similarity information of pairwise images and generate exactly binary hash codes. Different from HashNet, previous deep hashing methods need to first learn continuous embeddings, which are binarized in a separated step using the sign function. This results in substantial quantization errors and significant losses of retrieval quality.
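The weighted pairwise loss of Equations (1) to (4) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`pair_weights`, `wml_loss`), the toy codes, and the default `alpha=0.5` are hypothetical, the continuous similarity is fixed to $c_{ij} = 1$, and both similar and dissimilar pairs are assumed to be present.

```python
import math

def hamming(hi, hj):
    """Hamming distance between two codes with entries in {-1, +1}."""
    return sum(1 for a, b in zip(hi, hj) if a != b)

def inner(hi, hj):
    """Inner product <h_i, h_j>."""
    return sum(a * b for a, b in zip(hi, hj))

def pair_weights(pairs):
    """Weights in the spirit of Equation (2), assuming c_ij = 1.

    pairs: list of (i, j, s_ij) with s_ij in {0, 1}; both classes must occur.
    """
    n = len(pairs)
    n_sim = sum(s for _, _, s in pairs)
    n_dis = n - n_sim
    return [n / n_sim if s == 1 else n / n_dis for _, _, s in pairs]

def wml_loss(codes, pairs, alpha=0.5):
    """Weighted pairwise cross-entropy of Equation (4); alpha is an assumed value."""
    total = 0.0
    for (i, j, s), w in zip(pairs, pair_weights(pairs)):
        t = alpha * inner(codes[i], codes[j])
        total += w * (math.log1p(math.exp(t)) - s * t)
    return total

# Toy 4-bit codes: points 0 and 1 are similar, point 2 is dissimilar to both.
codes = [[1, 1, -1, 1], [1, 1, -1, -1], [-1, -1, 1, -1]]
pairs = [(0, 1, 1), (0, 2, 0), (1, 2, 0)]
flipped = [(i, j, 1 - s) for i, j, s in pairs]  # deliberately wrong labels
```

With these toy codes the rare similar pair receives the larger weight, and the loss is lower for the correct labels than for the flipped ones, which is the behaviour the weighting is meant to produce.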
HashNet learns exactly binary hash codes by converting the $K$-dimensional representation $z$ of the hash layer fch, which is continuous in nature, to a binary hash code $h$ taking values of either $+1$ or $-1$. This binarization process can only be performed by taking the sign function $h = \mathrm{sgn}(z)$ as the activation function on top of the hash layer fch in HashNet,
$$h = \mathrm{sgn}(z) = \begin{cases} +1, & \text{if } z \geq 0 \\ -1, & \text{otherwise} \end{cases} \tag{5}$$
Unfortunately, as the sign function is non-smooth and non-convex, its gradient is zero for all nonzero inputs and is ill-defined at zero, which makes standard back-propagation infeasible for training deep networks. This is known as the vanishing gradient problem, which has been a key difficulty in training deep neural networks via back-propagation [14].
Many optimization methods have been proposed to circumvent the vanishing gradient problem and enable effective network training with back-propagation, including unsupervised pre-training [14, 3], dropout [36][15], and deep residual learning [13]. In particular, the Rectified Linear Unit (ReLU) [29] activation function makes deep networks much easier to train and enables end-to-end learning algorithms. However, the sign activation function is so ill-defined that all the above optimization methods will fail. A very recent work, BinaryNet [5], focuses on training deep networks with activations constrained to $+1$ or $-1$. However, the training algorithm may be hard to converge as the feed-forward pass uses the sign activation ($\mathrm{sgn}$) but the back-propagation pass uses a hard tanh ($\mathrm{Htanh}$) activation. Optimizing deep networks with sign activation remains an open problem and a key challenge for deep learning to hash.
This paper attacks the problem of non-convex optimization of deep networks with non-smooth sign activation by starting with a smoothed objective function which becomes more non-smooth as the training proceeds. It is inspired by recent studies in continuation methods [1], which address a complex optimization problem by smoothing the original function, turning it into a different problem that is easier to optimize. Gradually reducing the amount of smoothing during training yields a sequence of optimization problems that converges to the original optimization problem. Motivated by the continuation methods, we notice there exists a key relationship between the sign function and the scaled tanh function in the sense of a mathematical limit,
$$\lim_{\beta \to \infty} \tanh(\beta z) = \mathrm{sgn}(z), \tag{6}$$
where $\beta > 0$ is a scaling parameter. As $\beta$ increases, the scaled tanh function $\tanh(\beta z)$ becomes more non-smooth and more saturated, so that deep networks using $\tanh(\beta z)$ as the activation function become more difficult to optimize, as in Figure 1 (right). Fortunately, as $\beta \to \infty$, the optimization problem converges to the original deep learning to hash problem in (4) with the $\mathrm{sgn}(z)$ activation function.
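The limit in Equation (6) is easy to check numerically. The sketch below measures the largest gap between the scaled tanh and the sign function on a handful of sample inputs; the sample points and the three scaling values are arbitrary choices for illustration. The input $z = 0$ is left out on purpose, since $\tanh(0) = 0$ for every scaling value while Equation (5) assigns $\mathrm{sgn}(0) = +1$.

```python
import math

def sgn(z):
    """Sign function with the convention sgn(0) = +1, as in Equation (5)."""
    return 1.0 if z >= 0 else -1.0

def max_gap(beta, zs):
    """Largest absolute difference between tanh(beta * z) and sgn(z) over zs."""
    return max(abs(math.tanh(beta * z) - sgn(z)) for z in zs)

# Arbitrary nonzero sample inputs (illustrative only).
zs = [-2.0, -0.5, -0.1, 0.1, 0.5, 2.0]
gaps = [max_gap(beta, zs) for beta in (1.0, 10.0, 100.0)]
print(gaps)  # the gap shrinks as beta grows
```

The gap is dominated by the inputs closest to zero, which is why a larger scaling value is needed before codes near the decision boundary become effectively binary.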
Using the continuation methods, we design an optimization method for HashNet in Algorithm 1. As a deep network with $\tanh(z)$ as the activation function can be successfully trained, we start training HashNet with $\tanh(\beta_t z)$ as the activation function, where $\beta_0 = 1$. For each stage $t$, after HashNet converges, we increase $\beta_t$ and train (i.e. fine-tune) HashNet by setting the converged network parameters as the initialization for training HashNet in the next stage. By evolving $\tanh(\beta_t z)$ with $\beta_t \to \infty$, the network will converge to HashNet with $\mathrm{sgn}(z)$ as the activation function, which can generate exactly binary hash codes as we desire. The efficacy of continuation in Algorithm 1 can be understood as multi-stage pre-training, i.e., pre-training HashNet with the $\tanh(\beta_t z)$ activation function is used to initialize HashNet with the $\tanh(\beta_{t+1} z)$ activation function, which enables easier progressive training of HashNet as the network becomes non-smooth in later stages by $\beta_t \to \infty$. Using $\beta_t = 10$ we can already achieve fast convergence for training HashNet.
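The stage-wise procedure can be illustrated on a toy problem. The sketch below is not Algorithm 1 itself: it assumes one-bit codes, a single dissimilar pair, unit weight, and it updates the two pre-activations directly instead of network parameters. The schedule `betas`, the learning rate, and the step count are hypothetical values chosen for illustration.

```python
import math

def sgn(z):
    return 1 if z >= 0 else -1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_stage(z1, z2, s, beta, alpha=1.0, lr=0.1, steps=200):
    """Gradient descent on J = log(1 + exp(alpha*g1*g2)) - alpha*s*g1*g2,
    where g = tanh(beta * z) is the smoothed activation of the current stage."""
    for _ in range(steps):
        g1, g2 = math.tanh(beta * z1), math.tanh(beta * z2)
        common = alpha * (sigmoid(alpha * g1 * g2) - s)  # dJ/d(g1*g2)
        grad_z1 = common * g2 * beta * (1.0 - g1 * g1)   # chain rule through tanh
        grad_z2 = common * g1 * beta * (1.0 - g2 * g2)
        z1, z2 = z1 - lr * grad_z1, z2 - lr * grad_z2
    return z1, z2

def train_by_continuation(z1, z2, s, betas=(1.0, 2.0, 5.0, 10.0)):
    """Each stage starts from the parameters the previous stage converged to."""
    for beta in betas:
        z1, z2 = train_stage(z1, z2, s, beta)
    return z1, z2

# A dissimilar pair (s = 0) that starts with the same sign, i.e. the wrong codes.
z1, z2 = train_by_continuation(0.2, 0.1, s=0)
codes = (sgn(z1), sgn(z2))
binary_loss = math.log1p(math.exp(codes[0] * codes[1]))
print(codes, binary_loss)
```

The early stage with a small scaling value has useful gradients near zero, so the sign of one pre-activation can flip; later stages only push the activations further toward saturation, so the codes read off at the end agree with the nearly binary activations.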
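We show that the continuation method in Algorithm 1 decreases the HashNet loss (4) in each stage and each iteration. Let $L_{ij} = w_{ij}\left(\log\left(1 + \exp\left(\alpha \langle h_i, h_j \rangle\right)\right) - \alpha s_{ij} \langle h_i, h_j \rangle\right)$ and $L = \sum_{s_{ij} \in S} L_{ij}$, where $h_i \in \{-1, +1\}^K$ are binary hash codes. Note that when optimizing HashNet by continuation in Algorithm 1, the network activation in each stage $t$ is $g = \tanh(\beta_t z)$, which is continuous in nature and will only become binary after convergence $\beta_t \to \infty$. Denote by $J_{ij} = w_{ij}\left(\log\left(1 + \exp\left(\alpha \langle g_i, g_j \rangle\right)\right) - \alpha s_{ij} \langle g_i, g_j \rangle\right)$ and $J = \sum_{s_{ij} \in S} J_{ij}$ the true loss we optimize in Algorithm 1, where $g_i \in \mathbb{R}^K$ and $h_i = \mathrm{sgn}(g_i)$. Our results are two theorems, with proofs provided in the supplemental materials.
Theorem 1. The HashNet loss $L$ will not change across stages $t$ and $t+1$ with bandwidths switched from $\beta_t$ to $\beta_{t+1}$.
Theorem 2. Loss $L$ decreases when optimizing loss $J(g)$ by stochastic gradient descent (SGD) within each stage.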
We conduct extensive experiments to evaluate HashNet against several state-of-the-art hashing methods on three standard benchmarks. Datasets and implementations are available at http://github.com/thuml/HashNet.
The evaluation is conducted on three benchmark image retrieval datasets: ImageNet, NUS-WIDE and MS COCO.
ImageNet is a benchmark image dataset for the Large Scale Visual Recognition Challenge (ILSVRC 2015) [32]. It contains over 1.2M images in the training set and 50K images in the validation set, where each image is single-labeled by one of the 1,000 categories. We randomly select 100 categories, use all the images of these categories in the training set as the database, and use all the images in the validation set as the queries; furthermore, we randomly select 100 images per category from the database as the training points.
NUS-WIDE (http://lms.comp.nus.edu.sg/research/NUSWIDE.htm) [4] is a public Web image dataset which contains 269,648 images downloaded from Flickr.com. Each image is manually annotated by some of the 81 ground truth concepts (categories) for evaluating retrieval models. We follow similar experimental protocols as DHN [44] and randomly sample 5,000 images as queries, with the remaining images used as the database; furthermore, we randomly sample 10,000 images from the database as training points.
MS COCO (http://mscoco.org) [23] is an image recognition, segmentation, and captioning dataset. The current release contains 82,783 training images and 40,504 validation images, where each image is labeled by some of the 80 categories. After pruning images with no category information, we obtain 122,218 images by combining the training and validation images. We randomly sample 5,000 images as queries, with the remaining images used as the database; furthermore, we randomly sample 10,000 images from the database as training points.
Following the standard evaluation protocol of previous work [40, 20, 44], the similarity information for hash function learning and for ground-truth evaluation is constructed from image labels: if two images $i$ and $j$ share at least one label, they are similar and $s_{ij} = 1$; otherwise, they are dissimilar and $s_{ij} = 0$. Note that, although we use the image labels to construct the similarity information, our proposed HashNet can learn hash codes when only the similarity information is available. By constructing the training data in this way, the ratio between the number of dissimilar pairs and the number of similar pairs is roughly 100, 5, and 1 for ImageNet, NUS-WIDE, and MS COCO, respectively. These datasets exhibit the data imbalance phenomenon and can be used to evaluate different hashing methods under the data imbalance scenario.
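The label-sharing rule and the resulting imbalance ratio can be sketched as follows. The label sets are made-up toy data, the helper names are hypothetical, and counting each unordered pair once is an assumption (the paper does not state whether ordered or unordered pairs are counted; the ratio is the same either way).

```python
from itertools import combinations

def build_similarity(labels):
    """labels: list of label sets, one per image.

    Returns (i, j, s_ij) for every unordered pair, with s_ij = 1 when the two
    images share at least one label and s_ij = 0 otherwise."""
    return [(i, j, 1 if labels[i] & labels[j] else 0)
            for i, j in combinations(range(len(labels)), 2)]

def imbalance_ratio(pairs):
    """Number of dissimilar pairs divided by number of similar pairs."""
    n_sim = sum(s for _, _, s in pairs)
    return (len(pairs) - n_sim) / n_sim

# Toy multi-label annotations (illustrative only).
labels = [{"cat"}, {"cat", "sofa"}, {"dog"}, {"sofa"}, {"car"}]
pairs = build_similarity(labels)
print(imbalance_ratio(pairs))  # 8 dissimilar pairs vs. 2 similar pairs
```

With many single-label categories, as in the ImageNet subset, most pairs share no label, which is how the ratio grows to roughly 100.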
We compare retrieval performance of HashNet with ten classical or state-of-the-art hashing methods: unsupervised methods LSH [10], SH [39], ITQ [12], supervised shallow methods BRE [19], KSH [25], ITQ-CCA [12], SDH [34], and supervised deep methods CNNH [40], DNNH [20], DHN [44]. We evaluate retrieval quality based on five standard evaluation metrics: Mean Average Precision (MAP), Precision-Recall curves (PR), Precision curves within Hamming distance 2 (P@H=2), Precision curves with respect to different numbers of top returned samples (P@N), and Histogram of learned codes without binarization. For fair comparison, all methods use identical training and test sets. We adopt MAP@1000 for ImageNet as each category has 1,300 images, and adopt MAP@5000 for the other datasets [44].
For shallow hashing methods, we use DeCAF features [7] as input. For deep hashing methods, we use raw images as input. We adopt the AlexNet architecture [18] for all deep hashing methods, and implement HashNet based on the Caffe framework [17]. We fine-tune convolutional layers conv1–conv5 and fully-connected layers fc6–fc7 copied from the AlexNet model pre-trained on ImageNet 2012 and train the hash layer fch, all through back-propagation. As the fch layer is trained from scratch, we set its learning rate to be 10 times that of the lower layers. We use mini-batch stochastic gradient descent (SGD) with 0.9 momentum and the learning rate annealing strategy implemented in Caffe, and cross-validate the learning rate from
to with a multiplicative stepsize . We fix the minibatch size of images as and the weight decay parameter as .


Table 1: Mean Average Precision (MAP) for different numbers of bits on the three datasets.
Method | ImageNet 16 bits | ImageNet 32 bits | ImageNet 48 bits | ImageNet 64 bits | NUS-WIDE 16 bits | NUS-WIDE 32 bits | NUS-WIDE 48 bits | NUS-WIDE 64 bits | MS COCO 16 bits | MS COCO 32 bits | MS COCO 48 bits | MS COCO 64 bits
HashNet | 0.5059 | 0.6306 | 0.6633 | 0.6835 | 0.6623 | 0.6988 | 0.7114 | 0.7163 | 0.6873 | 0.7184 | 0.7301 | 0.7362
DHN [44] | 0.3106 | 0.4717 | 0.5419 | 0.5732 | 0.6374 | 0.6637 | 0.6692 | 0.6714 | 0.6774 | 0.7013 | 0.6948 | 0.6944
DNNH [20] | 0.2903 | 0.4605 | 0.5301 | 0.5645 | 0.5976 | 0.6158 | 0.6345 | 0.6388 | 0.5932 | 0.6034 | 0.6045 | 0.6099
CNNH [40] | 0.2812 | 0.4498 | 0.5245 | 0.5538 | 0.5696 | 0.5827 | 0.5926 | 0.5996 | 0.5642 | 0.5744 | 0.5711 | 0.5671
SDH [34] | 0.2985 | 0.4551 | 0.5549 | 0.5852 | 0.4756 | 0.5545 | 0.5786 | 0.5812 | 0.5545 | 0.5642 | 0.5723 | 0.5799
KSH [25] | 0.1599 | 0.2976 | 0.3422 | 0.3943 | 0.3561 | 0.3327 | 0.3124 | 0.3368 | 0.5212 | 0.5343 | 0.5343 | 0.5361
ITQ-CCA [12] | 0.2659 | 0.4362 | 0.5479 | 0.5764 | 0.4598 | 0.4052 | 0.3732 | 0.3467 | 0.5659 | 0.5624 | 0.5297 | 0.5019
ITQ [12] | 0.3255 | 0.4620 | 0.5170 | 0.5520 | 0.5086 | 0.5425 | 0.5580 | 0.5611 | 0.5818 | 0.6243 | 0.6460 | 0.6574
BRE [19] | 0.0628 | 0.2525 | 0.3300 | 0.3578 | 0.5027 | 0.5290 | 0.5475 | 0.5546 | 0.5920 | 0.6224 | 0.6300 | 0.6336
SH [39] | 0.2066 | 0.3280 | 0.3951 | 0.4191 | 0.4058 | 0.4209 | 0.4211 | 0.4104 | 0.4951 | 0.5071 | 0.5099 | 0.5101
LSH [10] | 0.1007 | 0.2350 | 0.3121 | 0.3596 | 0.3283 | 0.4227 | 0.4333 | 0.5009 | 0.4592 | 0.4856 | 0.5440 | 0.5849

The Mean Average Precision (MAP) results are shown in Table 1. HashNet substantially outperforms all comparison methods. Specifically, compared to the best shallow hashing method using deep features as input, ITQ/ITQ-CCA, we achieve absolute boosts of about 15.7%, 15.5%, and 9.1% in average MAP over the four code lengths on ImageNet, NUS-WIDE, and MS COCO, respectively. Compared to the state-of-the-art deep hashing method, DHN, we achieve absolute boosts of about 14.6%, 3.7%, and 2.6% in average MAP over the four code lengths on the three datasets, respectively. An interesting phenomenon is that the performance boost of HashNet over DHN differs significantly across the three datasets. Specifically, the performance boost on ImageNet is larger than that on NUS-WIDE and MS COCO by more than 10 percentage points. Recall that the ratio between the number of dissimilar pairs and the number of similar pairs is roughly 100, 5, and 1 for ImageNet, NUS-WIDE and MS COCO, respectively. This data imbalance problem substantially deteriorates the performance of hashing methods trained from pairwise data, including all the deep hashing methods. HashNet enhances deep learning to hash from imbalanced datasets by Weighted Maximum Likelihood (WML), which is a principled solution to the data imbalance problem. This explains its superior performance on imbalanced datasets.
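The average-MAP gaps discussed above follow directly from the rows of Table 1. The sketch below recomputes them as a worked example; the dictionary simply copies the HashNet, DHN, and ITQ rows, and `average_gain` is a hypothetical helper name.

```python
# MAP at 16, 32, 48, 64 bits, copied from Table 1.
MAP = {
    "HashNet": {"ImageNet": [0.5059, 0.6306, 0.6633, 0.6835],
                "NUS-WIDE": [0.6623, 0.6988, 0.7114, 0.7163],
                "MS COCO": [0.6873, 0.7184, 0.7301, 0.7362]},
    "DHN": {"ImageNet": [0.3106, 0.4717, 0.5419, 0.5732],
            "NUS-WIDE": [0.6374, 0.6637, 0.6692, 0.6714],
            "MS COCO": [0.6774, 0.7013, 0.6948, 0.6944]},
    "ITQ": {"ImageNet": [0.3255, 0.4620, 0.5170, 0.5520],
            "NUS-WIDE": [0.5086, 0.5425, 0.5580, 0.5611],
            "MS COCO": [0.5818, 0.6243, 0.6460, 0.6574]},
}

def mean(xs):
    return sum(xs) / len(xs)

def average_gain(method, baseline, dataset):
    """Difference in MAP averaged over the four code lengths."""
    return mean(MAP[method][dataset]) - mean(MAP[baseline][dataset])

for baseline in ("ITQ", "DHN"):
    for dataset in ("ImageNet", "NUS-WIDE", "MS COCO"):
        gain = average_gain("HashNet", baseline, dataset)
        print(f"HashNet vs {baseline} on {dataset}: {100 * gain:.1f} points")
```

The gap to DHN is by far the largest on ImageNet, the dataset with the most skewed ratio of dissimilar to similar pairs.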
The performance in terms of Precision within Hamming radius 2 (P@H=2) is very important for efficient retrieval with binary hash codes, since such Hamming ranking only requires $O(1)$ time for each query. As shown in Figures 2(a), 3(a) and 4(a), HashNet achieves the highest P@H=2 results on all three datasets. In particular, P@H=2 of HashNet with 32 bits is better than that of DHN with any number of bits. This validates that HashNet can learn more compact binary codes than DHN. When using longer codes, the Hamming space becomes sparse and few data points fall within the Hamming ball with radius 2 [9]. This is why most hashing methods achieve their best accuracy with moderate code lengths.
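A minimal sketch of the P@H=2 metric for a single query follows. It scans the database linearly for clarity, whereas the constant-time lookup mentioned above would probe hash-table buckets within the Hamming ball instead. Returning zero precision when the ball is empty is an assumption, and the toy codes and relevance set are made up.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def precision_within_radius(query, database, relevant, radius=2):
    """Fraction of relevant items among database codes within the Hamming ball.

    relevant: set of database indices that are similar to the query.
    An empty ball is scored as 0.0 (an assumed convention)."""
    hits = [i for i, code in enumerate(database) if hamming(query, code) <= radius]
    if not hits:
        return 0.0
    return sum(1 for i in hits if i in relevant) / len(hits)

# Toy 4-bit database (illustrative only).
query = [1, 1, 1, 1]
database = [[1, 1, 1, 1], [1, 1, 1, -1], [1, -1, -1, 1],
            [-1, -1, -1, 1], [-1, -1, -1, -1]]
relevant = {0, 1, 4}
p_at_h2 = precision_within_radius(query, database, relevant)
print(p_at_h2)  # items 0, 1, 2 fall in the ball; 0 and 1 are relevant
```

Averaging this quantity over all queries gives one point of the P@H=2 curve for a given code length.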
The retrieval performance on the three datasets in terms of Precision-Recall curves (PR) and Precision curves with respect to different numbers of top returned samples (P@N) is shown in Figures 2(b)–4(b) and Figures 2(c)–4(c), respectively. HashNet outperforms comparison methods by large margins. In particular, HashNet achieves much higher precision at lower recall levels or when the number of top results is small. This is desirable for precision-first retrieval, which is widely implemented in practical systems. As an intuitive illustration, Figure 5 shows that HashNet can yield much more relevant and user-desired retrieval results.
Recent work [28] studies two evaluation protocols for supervised hashing: (1) a supervised retrieval protocol where queries and database have identical classes and (2) a zero-shot retrieval protocol where queries and database have different classes. Some supervised hashing methods perform well in one protocol but poorly in the other. Table 2 shows the MAP results on the ImageNet dataset under the zero-shot retrieval protocol, where HashNet substantially outperforms DHN. Thus, HashNet works well under different protocols.
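The ranking-based metrics can be sketched for a single query as below. This is an illustration under assumptions: ties in Hamming distance are broken by database order, and average precision at K is normalized by the number of relevant items found in the top K, which is one common convention and may differ from the evaluation code used for the reported numbers.

```python
def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def rank_by_hamming(query, database):
    """Database indices sorted by Hamming distance to the query (stable sort)."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))

def precision_at_n(ranking, relevant, n):
    """P@N: fraction of relevant items among the top-n results."""
    top = ranking[:n]
    return sum(1 for i in top if i in relevant) / len(top)

def average_precision_at_k(ranking, relevant, k):
    """AP@K, normalized by the number of relevant items retrieved in the top k
    (an assumed convention); MAP@K is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for rank, i in enumerate(ranking[:k], start=1):
        if i in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy 4-bit database (illustrative only).
query = [1, 1, 1, 1]
database = [[1, 1, 1, 1], [1, 1, 1, -1], [1, -1, -1, 1],
            [-1, -1, -1, 1], [-1, -1, -1, -1]]
relevant = {0, 1, 4}
ranking = rank_by_hamming(query, database)
p_at_3 = precision_at_n(ranking, relevant, 3)
ap_at_5 = average_precision_at_k(ranking, relevant, 5)
print(ranking, p_at_3, ap_at_5)
```

Because each relevant hit is weighted by the precision at its rank, relevant items placed early in the ranking contribute more, which matches the precision-first emphasis of the discussion above.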


Table 2: MAP on ImageNet under the zero-shot retrieval protocol.
Method | 16 bits | 32 bits | 48 bits | 64 bits
HashNet | 0.4411 | 0.5274 | 0.5651 | 0.5756
DHN [44] | 0.2891 | 0.4421 | 0.5123 | 0.5342



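Table 3: MAP of HashNet and its variants on the three datasets.
Method | ImageNet 16 bits | ImageNet 32 bits | ImageNet 48 bits | ImageNet 64 bits | NUS-WIDE 16 bits | NUS-WIDE 32 bits | NUS-WIDE 48 bits | NUS-WIDE 64 bits | MS COCO 16 bits | MS COCO 32 bits | MS COCO 48 bits | MS COCO 64 bits
HashNet+C | 0.5059 | 0.6306 | 0.6633 | 0.6835 | 0.6646 | 0.7024 | 0.7209 | 0.7259 | 0.6876 | 0.7261 | 0.7371 | 0.7419
HashNet | 0.5059 | 0.6306 | 0.6633 | 0.6835 | 0.6623 | 0.6988 | 0.7114 | 0.7163 | 0.6873 | 0.7184 | 0.7301 | 0.7362
HashNet-W | 0.3350 | 0.4852 | 0.5668 | 0.5992 | 0.6400 | 0.6638 | 0.6788 | 0.6933 | 0.6853 | 0.7174 | 0.7297 | 0.7348
HashNet-sgn | 0.4249 | 0.5450 | 0.5828 | 0.6061 | 0.6603 | 0.6770 | 0.6921 | 0.7020 | 0.6449 | 0.6891 | 0.7056 | 0.7138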

Visualization of Hash Codes: We visualize the t-SNE [7] of hash codes generated by HashNet and DHN on ImageNet in Figure 6 (for ease of visualization, we sample 10 categories). We observe that the hash codes generated by HashNet show clear discriminative structures in that different categories are well separated, while the hash codes generated by DHN do not show such discriminative structures. This suggests that HashNet can learn more discriminative hash codes than DHN for more effective similarity retrieval.
Ablation Study: We examine the efficacy of the weighted maximum likelihood and continuation methods more closely. We investigate three variants of HashNet: (1) HashNet+C, a variant using continuous similarity $c_{ij} = \frac{|y_i \cap y_j|}{|y_i \cup y_j|}$ when image labels are given; (2) HashNet-W, a variant using maximum likelihood instead of weighted maximum likelihood, i.e. $w_{ij} = 1$; (3) HashNet-sgn, a variant using $\tanh$ instead of $\mathrm{sgn}$ as the activation function to generate continuous codes and requiring a separated binarization step to generate hash codes. We compare results of these variants in Table 3.
By weighted maximum likelihood estimation, HashNet outperforms HashNet-W by margins of about 12.4%, 2.8% and 0.1% in average MAP over the four code lengths on ImageNet, NUS-WIDE and MS COCO, respectively. The standard maximum likelihood estimation has been widely adopted in previous work [40, 44]. However, this estimation does not account for the data imbalance, and may suffer from a performance drop when training data is highly imbalanced (e.g. ImageNet). In contrast, the proposed weighted maximum likelihood estimation (1) is a principled solution to the data imbalance problem, weighting the training pairs according to the importance of misclassifying that pair. Recall that MS COCO is a balanced dataset, hence HashNet and HashNet-W yield similar MAP results. By further considering continuous similarity ($c_{ij} = \frac{|y_i \cap y_j|}{|y_i \cup y_j|}$), HashNet+C achieves even better accuracy than HashNet on the multi-label datasets NUS-WIDE and MS COCO.
By training with continuation, HashNet outperforms HashNet-sgn by margins of about 8.1%, 1.4% and 3.0% in average MAP on ImageNet, NUS-WIDE, and MS COCO, respectively. Due to the ill-posed gradient problem, existing deep hashing methods cannot learn exactly binary hash codes using $\mathrm{sgn}$ as the activation function. Instead, they need to use surrogate functions of $\mathrm{sgn}$, e.g. $\tanh$, as the activation function and learn continuous codes, which require a separated binarization step to generate hash codes. The proposed continuation method is a principled solution to deep learning to hash with $\mathrm{sgn}$ as the activation function, which learns lossless binary hash codes for accurate retrieval.
Loss Value Through Training Process: We compare the change of loss values of HashNet and DHN through the training process on ImageNet, NUS-WIDE and MS COCO. We display the loss values before (-sign) and after (+sign) binarization, i.e. $J(g)$ and $L(h)$. Figure 7 reveals three important observations: (a) Both methods converge in terms of the loss values before and after binarization, which validates the convergence analysis in Section 3.3. (b) HashNet converges with a much smaller training loss than DHN both before and after binarization, which implies that HashNet can preserve the similarity relationship in Hamming space much better than DHN. (c) The two loss curves of HashNet before and after binarization become close to each other and overlap completely at convergence. This shows that the continuation method enables HashNet to approach the true loss defined on the exactly binary codes without continuous relaxation. In contrast, there is a large gap between the two loss curves of DHN, implying that DHN and similar methods [34, 22, 24] cannot learn exactly binary codes by minimizing the quantization error of codes before and after binarization.
Histogram of Codes Without Binarization: As discussed previously, the proposed HashNet can learn exactly binary hash codes while previous deep hashing methods can only learn continuous codes and generate binary hash codes by post-step sign thresholding. To verify this key property, we plot the histograms of codes learned by HashNet and DHN on the three datasets without post-step binarization. The histograms are plotted by evenly dividing $[0, 1]$ into 100 bins and calculating the frequency of codes falling into each bin. To make the histograms more readable, we show absolute code values (x-axis) and the square root of frequency (y-axis). Histograms in Figure 8 show that DHN can only generate continuous codes spanning the whole range of $[0, 1]$. This implies that if we quantize these continuous codes into binary hash codes (taking values in $\{-1, 1\}$) in a post-step, we may suffer from large quantization error, especially for the codes near zero. On the contrary, the codes of HashNet without binarization are already exactly binary.
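The histogram construction described above can be sketched as follows. The helper name `code_histogram` and the sample values are hypothetical, and "frequency" is interpreted here as relative frequency, which is an assumption.

```python
import math

def code_histogram(values, bins=100):
    """Square root of the relative frequency of |value| over equal-width bins
    on [0, 1]; values with |v| >= 1 are placed in the last bin."""
    counts = [0] * bins
    for v in values:
        a = min(abs(v), 1.0)
        idx = min(int(a * bins), bins - 1)
        counts[idx] += 1
    return [math.sqrt(c / len(values)) for c in counts]

# Nearly binary activations pile up in the last bin ...
binary_like = code_histogram([0.999, -1.0, 1.0, -0.998])
# ... while relaxed continuous codes spread over many bins.
continuous = code_histogram([0.053, -0.507, 0.953, -0.273])
print(binary_like[-1], sum(1 for h in continuous if h > 0))
```

The square root compresses the tall bins, so sparsely populated bins in the middle of the range remain visible next to the dominant ones.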
This paper addressed deep learning to hash from imbalanced similarity data by the continuation method. The proposed HashNet can learn exactly binary hash codes by optimizing a novel weighted pairwise cross-entropy loss function in deep convolutional neural networks. HashNet can be effectively trained by the proposed multi-stage pre-training algorithm carefully crafted from the continuation method. Comprehensive empirical evidence shows that HashNet can generate exactly binary hash codes and yield state-of-the-art multimedia retrieval performance on standard benchmarks.
This work was supported by the National Key R&D Program of China (No. 2016YFB1000701), the National Natural Science Foundation of China (No. 61502265, 61325008, and 71690231), the National Sci.&Tech. Supporting Program (2015BAF32B01), and the Tsinghua TNList Projects.
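We briefly show that the continuation optimization in Algorithm 1 decreases the loss of HashNet (4) in each stage and in each iteration until converging to HashNet with the sign activation function that generates exactly binary codes.
Let $L_{ij} = w_{ij}\left(\log\left(1 + \exp\left(\alpha \langle h_i, h_j \rangle\right)\right) - \alpha s_{ij} \langle h_i, h_j \rangle\right)$ and $L = \sum_{s_{ij} \in S} L_{ij}$, where $h_i \in \{-1, +1\}^K$ are binary hash codes. Note that when optimizing HashNet by continuation in Algorithm 1, the network activation in each stage $t$ is $g = \tanh(\beta_t z)$, which is continuous in nature and will only become binary at convergence $\beta_t \to \infty$. Denote by $J_{ij} = w_{ij}\left(\log\left(1 + \exp\left(\alpha \langle g_i, g_j \rangle\right)\right) - \alpha s_{ij} \langle g_i, g_j \rangle\right)$ and $J = \sum_{s_{ij} \in S} J_{ij}$ the true loss we optimize in Algorithm 1, where $g_i \in \mathbb{R}^K$, and note that $h_i = \mathrm{sgn}(g_i)$. We will show that the HashNet loss $L(h)$ descends when minimizing $J(g)$.
Theorem 1. The HashNet loss $L$ will not change across stages $t$ and $t+1$ with bandwidths switched from $\beta_t$ to $\beta_{t+1}$.
Proof. When the algorithm switches from stage $t$ to stage $t+1$ with bandwidths changed from $\beta_t$ to $\beta_{t+1}$, only the network activation is changed from $\tanh(\beta_t z)$ to $\tanh(\beta_{t+1} z)$, but its sign $h = \mathrm{sgn}(\tanh(\beta_t z)) = \mathrm{sgn}(\tanh(\beta_{t+1} z))$, i.e. the hash code, remains the same. Thus $L$ is unchanged. ∎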
For each pair of binary codes , and their continuous counterparts , , the derivative of w.r.t. each bit is
(7) 
where . The derivative of w.r.t. can be defined similarly. Updating by SGD, the updated is
(8)  
where is the learning rate and is computed similarly.
Denote by , , then
(9) 
Since , Lemma 1 can be proved by verifying that if and if , .
.
(1) If , , then , . Thus, , . And we have .
(2) If , , then , . Thus, , . And we have .
(3) If , , then , . Thus , . So and may be either or and we have .
(4) If , , then , . Thus , . So and may be either or and we have .
. It can be proved similarly as Case 1.
∎
Theorem 2. Loss $L$ decreases when optimizing loss $J(g)$ by stochastic gradient descent (SGD) within each stage.
The gradient of loss w.r.t. hash codes is
(10) 
We observe that
(11) 
By substituting Lemma 1: if , then , and thus ; if , then , and thus . ∎
Journal of Machine Learning Research (JMLR), 11(Dec):3313–3332, 2010.
Learning binary codes for high-dimensional data using bilinear projections. In CVPR, pages 484–491. IEEE, 2013.
Rectified linear units improve restricted Boltzmann machines. In J. Fürnkranz and T. Joachims, editors, ICML, pages 807–814. Omnipress, 2010.