One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective

by Jiun Tian Hoe, et al.

A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer, and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at





Code Repositories


Official implementation of NeurIPS 2021 paper "One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective"


1 Introduction

A key building block of a real-world large-scale image retrieval system is hashing. The objective of image hashing is to represent the content of an image using a binary code for efficient storage and accurate retrieval. Recently, deep hashing methods cnnh2014xia ; simultaneous2015lai have shown great improvements over conventional hashing methods spectral09weiss ; itq2012gong ; lsh1998indyk ; klsh2009kulis ; minlosshash2011mohammad ; hammingmetric2012mohammad ; isotropic2012kong . Furthermore, deep hashing methods can be grouped by how the similarity of the learned hash codes is measured, namely pointwise ssdh2017yang ; adsh2019zhou ; jmlh2019shen ; dpn2020fan ; csq2020yuan , pairwise dpsh2016li ; simultaneous2015lai ; hashnet2017cao ; dch2018cao , triplet-wise dtsh2016wang ; dtq2018liu , or listwise semantic2015zhao . Among them, pointwise methods have a computational complexity of $O(N)$, whilst the complexity of the others is at least $O(N^2)$ for $N$ data points. This means that for large-scale problems, only the pointwise methods are tractable ssdh2017yang . They are thus the focus of most recent studies.

Figure 1: We train a simple CNN model on CIFAR10 with only the first 4 classes and 2 bits. The continuous codes are then visualized before sgn. (Left) The model is trained with CE only. Although it can separate the 4 classes in Euclidean space, the output is not bounded, indicating high quantization error and sub-optimal codes in Hamming space. (Middle) By appending a batch normalization (BN) layer after the continuous codes $\mathbf{v}$, the hash codes are now balanced. (Right) The model (proposed) is trained to maximize the cosine similarity between $\mathbf{v}$ and its corresponding binary target $\mathbf{o}$. The black arrows are the binary orthogonal targets, denoted as $\mathbf{o}$ for each class. It can be seen that the continuous codes exhibit lower intra-class variance and quantization error as compared with the CE and CE+BN models.

A deep hashing neural network naturally has multiple learning objectives. Specifically, given an image input, the network outputs a continuous code (feature vector), which is then converted into a binary hash code using a quantization layer (usually a sgn function). There are thus two main objectives. First, the final model output, i.e., the binary codes, must be discriminative, meaning that the intra-class Hamming distances are small while the inter-class ones are large. Second, a quantization error minimization objective is needed to regularize the continuous codes. However, learning is constrained by the vanishing gradient problem caused by the quantization layer. Although the problem can be avoided by deploying relaxation schemes hashnet2017cao ; simultaneous2015lai ; dpsh2016li , these schemes often produce sub-optimal hash codes due to the introduction of quantization error (see Figure 1). Hence, most recent deep hashing methods greedyhash2018su ; dpsh2016li ; dch2018cao ; dbdh2020zheng ; csq2020yuan have an explicit quantization error minimization learning objective.

Having these two main objectives/losses is still not enough. In particular, to ensure the quality of hash codes, many other losses are employed by existing methods. These include bit balance loss dbdh2020zheng ; ssdh2017yang ; jmlh2019shen , weight constraints to maximize Hamming distance adsh2019zhou , and code orthogonality sdh2015liong ; dtq2018liu . Further, losses are designed to address the vanishing gradient problem caused by the sign function used to obtain binary codes from the continuous ones greedyhash2018su ; jmlh2019shen ; bihalf2021li . As a result, the state-of-the-art hashing models typically have a large number (>4) of losses. This makes optimization difficult, which in turn hampers their effectiveness.

In this work, for the first time, a deep hashing model with a single loss is developed, which removes any need for loss weight tuning and is thus much easier to optimize. As mentioned earlier, a deep hashing model needs to be trained with at least two objectives, namely binary code discriminativeness and quantization error minimization. So how could one use one loss only? The answer lies in the fact that the two objectives are closely related and can be unified into one. More concretely, we show that both objectives can be satisfied by maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal target, which can be formulated as a cross-entropy (CE) loss.

Our model, dubbed OrthoHash, has one loss only, which maximizes the cosine similarity between the L2-normalized continuous codes and the binary orthogonal target to maximize inter-class Hamming distance and minimize quantization error simultaneously. We show that this single unifying loss has a number of additional benefits. First, we can leverage the benefit of margin cosface2018wang ; arcface2019deng to further reduce the intra-class variance. Second, since conventional CE loss only works for single-label classification, we can easily leverage Label Smoothing labelsmooth2017gabriel to modify the CE loss to tackle multi-label classification. Finally, we show that code balancing can now be enforced by introducing a batch normalization bn2015ioffe (BN) layer rather than requiring a different loss. Extensive experimental results suggest that on conventional category-level retrieval tasks using ImageNet100, NUS-WIDE and MS-COCO, our model is on par with the SOTA. More importantly, on the large-scale instance-level retrieval tasks, our method achieves the new SOTA, beating the best results obtained so far on GLDv2, Oxf and Paris by 0.6%, 9.1% and 17.1% respectively.

2 Related Work

Hashing methods. Conventional hashing methods can be categorized into many streams. Data-independent methods such as Locality-Sensitive Hashing (LSH) lsh1998indyk ; lsh1999gionis , and its kernelized version (KLSH) klsh2009kulis have contributed many of the fundamental concepts for hashing, such as the requirements of code balance, uncorrelated bits, and similarity preservation. In contrast, data-dependent methods spectral09weiss ; bre2009kulis ; itq2012gong ; isotropic2012kong ; minlosshash2011mohammad ; hammingmetric2012mohammad aim to learn hash codes that are more compact yet more dataset-specific return2014chatfield . Recently, deep learning based hashing methods dpsh2016li ; cnnh2014xia ; simultaneous2015lai have dominated hashing research due to the superior learning ability of DNNs. Various learning objectives have been developed to learn hash codes using a training dataset. The objective functions include 1) a task learning objective, which can be further categorized into pointwise ssdh2017yang ; adsh2019zhou ; jmlh2019shen ; greedyhash2018su ; dpn2020fan ; csq2020yuan , pairwise hashnet2017cao ; dpsh2016li ; simultaneous2015lai , triplet-wise dtsh2016wang ; dtq2018liu , listwise semantic2015zhao and unsupervised bihalf2021li ; angular2012gong ; 2) quantization error minimization, such as a loss that minimizes a norm of the difference between the continuous codes and the hash codes; 3) code balancing bihalf2021li ; jmlh2019shen . We refer readers to the learning-to-hash surveys lths2015wang ; lths2018wang ; decade2020dubey for a more detailed review.

Binary optimization. Hashing is an NP-hard binary optimization problem spectral09weiss , and is prone to the vanishing gradient problem due to the discrete and non-differentiable binary hash functions. Early methods solved the problem by discarding the discrete constraints, e.g., by designing a penalty loss term that pushes the features to be as binary as possible dpsh2016li ; simultaneous2015lai , or by continuous relaxation, i.e., optimizing in a continuous space using sigmoid or tanh as an approximation hashnet2017cao . Some methods also utilized coordinate descent in training twostep2013lin ; dsdh2017li . Nevertheless, these methods increase the complexity of learning because the hyper-parameters that balance the different learning objectives need to be tuned.

Bypassing vanishing gradient. Greedy Hash greedyhash2018su designed a new coding layer which uses the sign function in the forward pass to generate binary codes, with gradients backpropagated using the straight-through estimator ste2013bengio during optimization. bihalf2021li designed a parameter-free coding layer, Bi-half, to maximize the bit capacity by shifting the network output by its median (so that each bit has a 50% chance of being $+1$ or $-1$). These methods typically require modification of the computational graph, in the sense that the original graph is no longer trained end-to-end, which further complicates the original optimization objective. Ours, on the other hand, incorporates a neat one-loss design that removes all such complications.

Learning hash codes with pre-defined target. Deep Polarized Network (DPN) dpn2020fan used a random assignment scheme to generate target vectors with maximal inter-class distance, then optimized with a hinge-like polarized loss. Central Similarity Quantization (CSQ) csq2020yuan uses a Hadamard matrix as "hash centers", then optimizes with binary cross entropy. Both methods have a similar overall objective, i.e., the continuous codes are learned to be as similar as possible to the target vectors (or "hash centers"). Our model also employs a hash target, but uniquely it is used in a single cosine similarity based objective.

Cosine similarity. While most works focus on hashing images with various constraints, we reformulate the problem of deep hashing through the lens of cosine similarity. Inspired by tnt2019zhang ; angular2012gong , which utilize cosine similarity to find the closest approximate binary or ternary representation, we also interpret the quantization error in terms of cosine similarity. Moreover, deep hypersphere embedding learning methods (e.g., SphereFace sphere2017liu , CosFace cosface2018wang and ArcFace arcface2019deng ) imposed discriminative constraints on a hypersphere manifold and proposed to improve the decision boundary with a cosine or angular margin. Inspired by them, we also leverage the benefit of margin to reduce intra-class variance.

Figure 2: We first obtain continuous codes from our backbone network. They are then passed through a batch normalization (BN) layer to obtain zero-mean continuous codes. Next, we compute the scaled cosine similarity between the continuous codes and their binary orthogonal targets $\mathbf{O} \in \{-1,+1\}^{C \times K}$, where $C$ = number of classes. Finally, the scaled cosine similarity acts as a classification output and we minimize a cross entropy loss. See Section 3.2 for details.

3 OrthoHash: One Loss for All

In Section 3.1, we reformulate the problem of deep hashing through the lens of cosine similarity, i.e., interpreting both Hamming distance retrieval and quantization error in terms of cosine similarity. In Section 3.2, we propose to maximize the cosine similarity between the continuous codes and the binary orthogonal target under a single classification objective (for both single-label and multi-label classification). Finally, we describe why adding a batch normalization layer after the continuous codes achieves code balance in Section 3.2.3. Our method is illustrated in Figure 2.

Let us first formally define the deep hashing problem. Let $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N} \in \mathbb{R}^{N \times D}$ denote $D$-dimensional data, where $N$ is the number of training samples, and $\mathbf{Y} \in \{0,1\}^{N \times C}$ the one-hot training labels of $C$ classes (for multi-labels, $\mathbf{Y} \in \{0,1\}^{N \times C}$, with $y_{ni} = 1$ if the $i$-th class is assigned to the $n$-th sample and $0$ otherwise). Our objective is to learn a $K$-bit binary code $\mathbf{b} \in \{-1,+1\}^{K}$ for each training point $\mathbf{x}$, which is converted from the continuous codes $\mathbf{v} \in \mathbb{R}^{K}$ through a sgn function. $\mathbf{v}$ can be computed by a latent layer $H(\mathbf{x}) = \mathbf{W}^\top F(\mathbf{x})$, where $F$ is a deep neural network (backbone network) that computes a $d$-dimensional nonlinear feature representation, $\mathbf{W} \in \mathbb{R}^{d \times K}$ is the weight matrix of the latent layer, and $b_k = +1$ if the $k$-th bit satisfies $v_k > 0$ and $-1$ otherwise. In our work, the binary orthogonal targets are $\mathbf{O} = \{\mathbf{o}_i\}_{i=1}^{C} \in \{-1,+1\}^{C \times K}$, where $\mathbf{o}_i$ denotes the column vector that belongs to the $i$-th class. Ideally, for any two rows with $i \neq j$, $\mathbf{o}_i$ and $\mathbf{o}_j$ are orthogonal to each other. We use $x$ or $X$ to represent a scalar, $\mathbf{x}$ to represent a column vector, and $\mathbf{X}$ to represent a matrix. Both $i$ and $j$ are often used as indices.

3.1 Reformulating Deep Hashing in the Lens of Cosine Similarity

Interpreting Hamming Distance as Cosine Similarity. Typically, the Hamming distance can be computed using a logical xor operation between binary codes $\mathbf{b}_i$ and $\mathbf{b}_j$, followed by popcount. If $\mathbf{b}$ is represented in $\{-1,+1\}^{K}$, then the Hamming distance can also be computed mathematically as:

$$D(\mathbf{b}_i, \mathbf{b}_j) = \frac{K - \mathbf{b}_i^\top \mathbf{b}_j}{2} \tag{1}$$

Geometrically, the dot product can be interpreted as:

$$\mathbf{b}_i^\top \mathbf{b}_j = \|\mathbf{b}_i\| \, \|\mathbf{b}_j\| \cos\theta_{ij} \tag{2}$$

in which $\|\cdot\|$ is the Euclidean norm and $\theta_{ij}$ is the angle between $\mathbf{b}_i$ and $\mathbf{b}_j$. As both $\|\mathbf{b}_i\|$ and $\|\mathbf{b}_j\|$ are constant (i.e., $\sqrt{K}$), equation (1) can then be viewed as:

$$D(\mathbf{b}_i, \mathbf{b}_j) = \frac{K - K\cos\theta_{ij}}{2} \tag{3}$$

Since $K$ is a constant, retrieval is based only on the angle between two hash codes, i.e., similar hash codes have a similar direction, which yields a smaller angle between them and hence a lower Hamming distance.

Interpreting Quantization Error as Cosine Similarity. Typically, converting continuous codes to binary codes leads to information loss, which is also known as quantization error. Therefore, most existing hashing methods include quantization error minimization in their learning objective, such as the L1-norm, L2-norm and p-norm (e.g., in Greedy Hash greedyhash2018su ), usually in the form of:



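$$\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{q} \tag{4}$$

where $\mathcal{L}_{ce}$ is the supervised learning objective such as Cross Entropy and $\mathcal{L}_{q}$ is the quantization error between the continuous codes $\mathbf{v}$ and the hash codes $\mathbf{b}$. However, it is difficult to control the scale $\lambda$, i.e., a low $\lambda$ might not be effective, while a high $\lambda$ might lead to underfitting. As a result, careful tuning is needed, and yet the tuned $\lambda$ may vary across tasks. To overcome this cumbersome practice, let us first interpret quantization error geometrically:

$$\mathcal{L}_{q} = \|\mathbf{v} - \mathbf{b}\|^2 \tag{5}$$

in which $\mathbf{v}$ is in continuous space and $\mathbf{b}$ is in binary space. We expand equation (5) to get:

$$\|\mathbf{v} - \mathbf{b}\|^2 = \|\mathbf{v}\|^2 + \|\mathbf{b}\|^2 - 2\|\mathbf{v}\| \, \|\mathbf{b}\| \cos\theta_{vb} \tag{6}$$

According to equation (3), retrieval is based only on the similarity in the direction of two hash codes. Hence, we can ignore the magnitude of $\mathbf{v}$ by normalizing it to have the same norm as $\mathbf{b}$, i.e., $\|\mathbf{v}\| = \|\mathbf{b}\| = \sqrt{K}$, and interpret the quantization error as depending only on the angle $\theta_{vb}$ between $\mathbf{v}$ and $\mathbf{b}$ (see supplementary material for proof):

$$\|\mathbf{v} - \mathbf{b}\|^2 = 2K - 2K\cos\theta_{vb} \tag{7}$$

Since $K$ is a constant, we can conclude that maximizing the cosine similarity between $\mathbf{v}$ and $\mathbf{b}$ leads to a low quantization error, and hence to a better approximation in the hash codes.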
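The two identities in this section can be checked numerically. The following is a minimal sketch, assuming codes in $\{-1,+1\}^K$ and using only the Python standard library; the function names and the random test vectors are illustrative and not part of the original work. It compares the xor-and-popcount Hamming distance with the dot-product form, and the squared quantization error of a rescaled continuous code with the cosine form $2K - 2K\cos\theta$.

```python
import math
import random


def hamming_xor(a, b):
    """Count differing positions of two {-1, +1} codes (xor + popcount)."""
    return sum(1 for x, y in zip(a, b) if x != y)


def hamming_dot(a, b):
    """Hamming distance via the dot product: (K - a.b) / 2."""
    k = len(a)
    dot = sum(x * y for x, y in zip(a, b))
    return (k - dot) // 2  # K - a.b is always even for {-1, +1} codes


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def quantization_error(v):
    """Return (direct, cosine-form) quantization error of v against sgn(v).

    v is first rescaled to norm sqrt(K) so that it matches the norm of b.
    """
    k = len(v)
    b = [1.0 if x > 0 else -1.0 for x in v]
    scale = math.sqrt(k) / math.sqrt(sum(x * x for x in v))
    v_scaled = [x * scale for x in v]
    direct = sum((x - y) ** 2 for x, y in zip(v_scaled, b))
    return direct, 2 * k - 2 * k * cosine(v, b)


rng = random.Random(0)  # illustrative random inputs
K = 16
b1 = [rng.choice((-1, 1)) for _ in range(K)]
b2 = [rng.choice((-1, 1)) for _ in range(K)]
v = [rng.gauss(0.0, 1.0) for _ in range(K)]
err_direct, err_cosine = quantization_error(v)
```

Both pairs of quantities agree up to floating-point rounding, which is why ranking by Hamming distance and ranking by angle are interchangeable for codes of fixed length.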

3.2 Discriminative Hash Codes with Orthogonal Target

According to similarity2002charikar , the probability that two samples $\mathbf{x}_i$ and $\mathbf{x}_j$ have the same hash code under a family $\mathcal{F}$ of hash functions using the random hyperplane technique can be described as $\Pr_{h \in \mathcal{F}}[h(\mathbf{x}_i) = h(\mathbf{x}_j)] = 1 - \frac{\theta_{ij}}{\pi}$, where $h$ is a hash function and $\theta_{ij}$ is the angle between $\mathbf{x}_i$ and $\mathbf{x}_j$. Therefore, based on the same principle, if the two continuous codes $\mathbf{v}_i$ and $\mathbf{v}_j$ from the latent layer have high cosine similarity, then the hash codes $\mathbf{b}_i$ and $\mathbf{b}_j$ should also have a high chance of being identical. Besides that, as described in Section 3.1, cosine similarity can also be used to explain both the retrieval performance of the hash codes and the quantization error between the continuous codes and the hash codes. Given these two observations, we propose to maximize the cosine similarity between the continuous codes $\mathbf{v}$ and their corresponding binary orthogonal target $\mathbf{o}$, which can be achieved by maximizing the posterior probability of the ground-truth class using the softmax (cross-entropy) loss:

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N} \log \frac{\exp(\mathbf{o}_{y_n}^\top \mathbf{v}_n)}{\sum_{i=1}^{C} \exp(\mathbf{o}_i^\top \mathbf{v}_n)} \tag{8}$$

where $\mathbf{v}_n$ denotes the deep continuous codes of the $n$-th sample from the DNN, and $\mathbf{o}_{y_n}$ and $\mathbf{o}_i$ denote the binary orthogonal targets of the ground-truth class and of the $i$-th class, respectively. For simplicity, we omit the bias term from equation (8). It follows that, under the framework of deep hypersphere embedding sphere2017liu ; cosface2018wang ; arcface2019deng , we can transform the logit as $\mathbf{o}_i^\top \mathbf{v}_n = \|\mathbf{o}_i\| \, \|\mathbf{v}_n\| \cos\theta_{ni}$, where $\theta_{ni}$ is the angle between the continuous codes $\mathbf{v}_n$ and the binary orthogonal target $\mathbf{o}_i$. Next, we perform L2 normalization on $\mathbf{v}_n$ so that $\|\mathbf{v}_n\| = 1$, and $\|\mathbf{o}_i\| = \sqrt{K}$ since it is in binary form. Now our loss function can be rewritten as:

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N} \log \frac{\exp(\sqrt{K}\cos\theta_{n y_n})}{\sum_{i=1}^{C} \exp(\sqrt{K}\cos\theta_{ni})} \tag{9}$$

As such, instead of introducing quantization error minimization as a separate term in the learning objective (equation (4)), our proposed method unifies the learning objective and quantization error minimization under a single classification objective, as shown in the loss function (equation (9)). Furthermore, since the binary orthogonal targets attain maximal inter-class Hamming distance and our loss function also aims to minimize the intra-class variance, we can leverage the cosine or angular margin (the cosine margin transforms $\cos\theta_{n y_n}$ to $\cos\theta_{n y_n} - m$, and the angular margin transforms the same term to $\cos(\theta_{n y_n} + m)$), which have proven beneficial in CosFace cosface2018wang and ArcFace arcface2019deng , to further reduce intra-class variance (the margin $m$ is kept fixed in all of our experiments unless mentioned explicitly). With this, our method is able to perform end-to-end training to learn highly discriminative hash codes without sophisticated training objectives or computational graph modifications.
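To make the single objective concrete, the following is a minimal sketch of the loss for one sample, assuming the logits are the cosine similarities scaled by $\sqrt{K}$ (as follows from L2-normalizing the continuous code against a $\{-1,+1\}^K$ target) and an optional cosine margin on the ground-truth class. The function name `orthohash_style_loss`, the 4-bit targets, and the margin value 0.2 are hypothetical choices for illustration only; the official implementation may differ, for example in how the scale is set.

```python
import math


def l2_normalize(v):
    """Rescale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def orthohash_style_loss(v, targets, label, margin=0.0):
    """Cross entropy over scaled cosine logits for a single sample.

    v       : continuous code, a list of K floats
    targets : C binary targets, each a list of K values in {-1, +1}
    label   : index of the ground-truth class
    margin  : cosine margin subtracted from the ground-truth cosine
    """
    k = len(v)
    v_hat = l2_normalize(v)
    scale = math.sqrt(k)  # norm of any {-1, +1} target of length K
    # cos(theta_i) = o_i . v_hat / sqrt(K), because v_hat has unit norm
    cosines = [sum(o * x for o, x in zip(t, v_hat)) / scale for t in targets]
    logits = [
        scale * (c - margin if i == label else c)
        for i, c in enumerate(cosines)
    ]
    # numerically stable negative log-softmax of the ground-truth logit
    top = max(logits)
    log_z = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_z - logits[label]


# Four mutually orthogonal 4-bit targets (rows of a Hadamard matrix).
targets = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
aligned = orthohash_style_loss([2.0, 1.9, 2.1, 2.0], targets, label=0)
misaligned = orthohash_style_loss([2.0, -1.9, 2.1, -2.0], targets, label=0)
with_margin = orthohash_style_loss(
    [2.0, 1.9, 2.1, 2.0], targets, label=0, margin=0.2
)
```

A code that points towards its own target gets a lower loss than one that points towards another class's target, and the loss is unchanged when the code is rescaled, since only its direction enters the logits. Adding a margin raises the loss for the same code, which pushes training to shrink the angle further.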

3.2.1 Binary Orthogonal Target

Maximizing the expected inter-class Hamming distance helps to increase the recall rate during retrieval, because the aim is to retrieve more similar items (intra-class) and to avoid retrieving incorrect items (inter-class). That is, given a $K$-bit Hamming space $\{-1,+1\}^{K}$, for any two binary vectors sampled with probability $p$ for $+1$ on each bit, the expected Hamming distance is $2Kp(1-p)$, and it achieves the upper bound of $K/2$ with $p = 0.5$ dpn2020fan ; csq2020yuan (see supplementary material for details). Hence, hash codes $\mathbf{b}_i$ and $\mathbf{b}_j$ of different classes must be orthogonal so that we get $D = K/2$ in equation (3).

Orthogonal Targets Generation. A Hadamard matrix naturally contains orthogonal rows and columns, which guarantees the maximum Hamming distance of $K/2$ between any two rows csq2020yuan ; hcoh2018lin . However, it is restricted when $K$ is not 1, 2, or a multiple of 4. Hence, a simple solution is to sample the targets such that every sampled bit has probability $0.5$ of being $+1$. As a result, the expected Hamming distance between any two rows equals $K/2$, which indicates orthogonality in expectation. One limitation is that if $K$ is too small, the nearest rows in the sampled targets will be identical, which degrades performance. Hence, the solution is to increase $K$. In the supplementary material, we show that the Hamming distance between the two nearest rows gets closer to $K/2$ as $K$ increases. We also generated the targets heuristically with the objective of maximum inter-class Hamming distance; this indeed improved the performance at lower $K$, but the improvement at higher $K$ is negligible.
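The two ways of generating targets can be sketched as follows. This is a minimal illustration under stated assumptions: the Hadamard matrix is built with Sylvester's construction, which only covers orders that are powers of 2 (a narrower set than the text's "1, 2, or a multiple of 4"), and the helper names, the seed, and the choice of 10 classes with 64 bits are hypothetical.

```python
import random


def sylvester_hadamard(k):
    """Hadamard matrix of order k via Sylvester's construction.

    Only works when k is a power of 2.
    """
    if k < 1 or k & (k - 1):
        raise ValueError("Sylvester's construction needs k to be a power of 2")
    h = [[1]]
    while len(h) < k:
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def bernoulli_targets(num_classes, k, seed=0):
    """Sample every bit independently as +1 or -1 with probability 0.5."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(num_classes)]


def hamming(a, b):
    """Number of positions where two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def pairwise_distances(rows):
    """Hamming distances between all unordered pairs of rows."""
    return [
        hamming(rows[i], rows[j])
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
    ]


K = 64
hadamard_targets = sylvester_hadamard(K)[:10]  # one row per class
d_hadamard = pairwise_distances(hadamard_targets)
sampled_targets = bernoulli_targets(10, K)
d_sampled = pairwise_distances(sampled_targets)
mean_sampled = sum(d_sampled) / len(d_sampled)
```

Every pair of Hadamard rows is exactly $K/2$ apart, whereas the sampled targets are only $K/2$ apart on average, so individual pairs can be closer. That gap is what the text's remark about increasing $K$ addresses.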

3.2.2 Multi-labels Hash Codes Learning

As conventional cross-entropy loss only works for single-label classification, we leverage the concept of Label Smoothing labelsmooth2017gabriel to generate labels for multi-label classification. A standard cross entropy (CE) loss is mathematically formulated as:

$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{C} y_{ni} \log p_{ni} \tag{10}$$

in which $p_{ni}$ is the softmax probability of the $i$-th class for the $n$-th sample, and $y_{ni} = 1$ if the $i$-th class is assigned to the $n$-th sample in a single-label multiclass classification task. In labelsmooth2017gabriel , the target label becomes a soft target in which each non-target class has a small "smoothing" value to regularize overconfident samples, and we leverage this concept for multi-labels. To adapt CE to multi-label classification, we set $y_{ni} = c$ if the $i$-th class is assigned to the $n$-th sample. The constant $c$ is determined such that $\sum_{i} y_{ni} = 1$, e.g., $c = 0.5$, so that $y_{n2} = y_{n4} = 0.5$, when the 2nd and the 4th classes are the assigned classes. Our motivation is that the model should maximize the probabilities of the target classes, which optimizes the hash codes to be as similar as possible to the binary targets of the assigned classes (note that we cannot guarantee that the final hash codes are the center of the hash codes of the target classes; instead, we let the optimization algorithm find the best hash codes). In our experiments, we found empirically that replacing softmax with sigmoid for multi-labels is not effective (see supplementary material for details). A likely explanation is that softmax intrinsically suppresses the less activated class units (i.e., scaled cosine similarities) with lower probability and boosts the highly activated class units with higher probability, while sigmoid treats each class unit independently. As a result, with sigmoid, maximizing the probability of one class might not minimize the probability of the other classes. Therefore, we propose to leverage the concept of Label Smoothing to generate labels so that we can use cross entropy loss for learning.
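The multi-label target construction can be sketched as follows, assuming the probability mass is split equally over the assigned classes so that the targets sum to 1, and that the loss is the usual cross entropy against softmax probabilities. The helper names, the 5-class setup, and the logit values are hypothetical and only serve the illustration.

```python
import math


def multilabel_targets(assigned, num_classes):
    """Spread probability mass equally over the assigned classes."""
    weight = 1.0 / len(assigned)
    return [weight if c in assigned else 0.0 for c in range(num_classes)]


def soft_cross_entropy(logits, soft_targets):
    """Cross entropy between softmax(logits) and a soft target vector."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(z - top) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(soft_targets, logits))


# The 2nd and 4th classes are assigned (0-based indices 1 and 3).
y = multilabel_targets({1, 3}, num_classes=5)
# Logits that favor the assigned classes versus logits that favor others.
good = soft_cross_entropy([0.0, 3.0, 0.0, 3.0, 0.0], y)
bad = soft_cross_entropy([3.0, 0.0, 3.0, 0.0, 0.0], y)
```

Here `y` is `[0.0, 0.5, 0.0, 0.5, 0.0]`, and the loss is lower when the logits of both assigned classes are high. Because softmax normalizes across classes, raising the assigned classes automatically lowers the probability of the rest.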

3.2.3 Code Balance

Although the binary orthogonal target helps with code balancing, since every bit has a 50% chance of being $+1$ or $-1$, there is no guarantee that the model will learn to output a balanced code. Therefore, we propose to add a batch normalization (BN) layer after the continuous codes to ensure code balance. If $v_k \sim \mathcal{N}(0, 1)$, then $\Pr(v_k > 0) = 0.5$ for the $k$-th bit. Because the distribution of $\mathbf{v}$ has been normalized to have zero mean and unit variance, with $\mathbf{b} = \mathrm{sgn}(\mathbf{v})$, the hash codes follow a uniform binary distribution with a 50% chance for both $+1$ and $-1$. Empirically, we found that this improves the retrieval performance on ImageNet100 by about 17-20% as compared with a model trained with normal cross entropy loss (see Table 1). Note that the Bi-half method bihalf2021li shifts the continuous codes by their median and then converts the continuous codes to binary codes for optimization. However, it has to modify the computational graph in order to provide a proxy derivative that solves the vanishing gradient problem. In contrast, appending a BN layer does not modify the computational graph, therefore enabling straightforward end-to-end training.
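The balancing effect of normalization can be illustrated with a small simulation. This sketch assumes a BN layer without learnable affine parameters, i.e., plain per-bit standardization over the batch, and uses synthetic Gaussian codes that are deliberately shifted so that every bit is mostly positive. The helper names and the numbers are hypothetical.

```python
import random


def batch_standardize(codes, eps=1e-5):
    """Per-bit standardization across the batch (BN without affine part)."""
    n = len(codes)
    k = len(codes[0])
    out = [[0.0] * k for _ in range(n)]
    for j in range(k):
        col = [row[j] for row in codes]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        for i in range(n):
            out[i][j] = (codes[i][j] - mean) / (var + eps) ** 0.5
    return out


def positive_fraction(codes):
    """For each bit, the fraction of samples whose sign is +1."""
    n = len(codes)
    k = len(codes[0])
    return [sum(1 for row in codes if row[j] > 0) / n for j in range(k)]


rng = random.Random(0)
# Synthetic unbalanced codes: every bit is shifted by +2.
raw = [[rng.gauss(2.0, 1.0) for _ in range(8)] for _ in range(2000)]
balance_before = positive_fraction(raw)
balance_after = positive_fraction(batch_standardize(raw))
```

Before standardization almost every sample has sign $+1$ on every bit; afterwards each bit is positive for roughly half of the samples. For a skewed distribution the mean and the median differ, so zero mean gives exact balance only when the codes are roughly symmetric around their mean.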

4 Experiment

Training Setup. We select 7 different deep hashing methods for comparison (5 point-wise, 1 pair-wise and 1 triplet-wise). For a fair comparison, we use the same learning rate, the Adam optimizer adam2014kingma and the same number of epochs for all methods. For SDH-C sdh2015liong , we have modified it from a pair-wise objective to a point-wise objective, while all penalty terms are kept (i.e., quantization loss, bit variance loss and orthogonality on projection weights).

Datasets. We follow prior works hashnet2017cao ; dpn2020fan ; greedyhash2018su ; dsh2016liu ; jmlh2019shen ; cnnh2014xia ; simultaneous2015lai ; ksh2012liu and choose ImageNet100 imagenet2009deng , NUS-WIDE nuswide2009chua and MS-COCO coco2014lin for category-level retrieval experiments. For a more practical yet challenging large-scale instance-level retrieval task (i.e., with a tremendous number of classes), we evaluate on the popular GLDv2 gldv22020weyand , Oxf and Par roxfparis2018filip .

Architecture. For category-level retrieval, following the settings in hashnet2017cao ; greedyhash2018su ; jmlh2019shen ; dpn2020fan , we use pre-trained AlexNet alexnet2012krizhevsky as the network backbone initialization. The output of the last fully-connected layer with ReLU (a 4096-dimensional vector) acts as input to the latent layer; various supervised deep hashing methods are then applied to generate binary codes. The image size is 224 × 224. For instance-level retrieval, due to the expensive cost of training from scratch, we use a pre-trained model (R50-DELG-GLDv2-clean) from DELG delg2020cao to compute the 2048-dimensional global descriptors. We then train a latent layer to compute hash codes, where the inputs are the global descriptors. For GLDv2, the input images are 512 × 512. For Oxf and Par, we use 3 scales to produce multi-scale representations. These are subject to L2 normalization, and then average-pooled to obtain a single descriptor as done by delg2020cao . A GLDv2-trained latent layer is used to compute hash codes for the evaluations. Details of training setups, datasets and architecture can be found in the supplementary material.

4.1 Results on Category-level Retrieval

| Methods | Type | ImageNet100 16 | ImageNet100 32 | ImageNet100 64 | ImageNet100 128 | NUS-WIDE 16 | NUS-WIDE 32 | NUS-WIDE 64 | NUS-WIDE 128 | MS COCO 16 | MS COCO 32 | MS COCO 64 | MS COCO 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HashNet hashnet2017cao | pair-wise | 0.343 | 0.480 | 0.573 | 0.612 | 0.814 | 0.831 | 0.842 | 0.847 | 0.663 | 0.693 | 0.713 | 0.727 |
| DTSH dtsh2016wang | triplet-wise | 0.442 | 0.528 | 0.581 | 0.612 | 0.816 | 0.836 | 0.851 | 0.862 | 0.699 | 0.732 | 0.753 | 0.770 |
| SDH-C sdh2015liong | point-wise | 0.584 | 0.649 | 0.664 | 0.662 | 0.763 | 0.792 | 0.816 | 0.832 | 0.671 | 0.710 | 0.733 | 0.742 |
| GreedyHash greedyhash2018su | point-wise | 0.570 | 0.639 | 0.659 | 0.659 | 0.771 | 0.797 | 0.815 | 0.832 | 0.677 | 0.722 | 0.740 | 0.746 |
| JMLH jmlh2019shen | point-wise | 0.517 | 0.621 | 0.662 | 0.678 | 0.791 | 0.825 | 0.836 | 0.843 | 0.689 | 0.733 | 0.758 | 0.768 |
| DPN dpn2020fan | point-wise | 0.592 | 0.670 | 0.703 | 0.714 | 0.783 | 0.818 | 0.838 | 0.842 | 0.668 | 0.721 | 0.752 | 0.773 |
| CSQ csq2020yuan | point-wise | 0.586 | 0.666 | 0.693 | 0.700 | 0.797 | 0.824 | 0.835 | 0.839 | 0.693 | 0.762 | 0.781 | 0.789 |
| CE | - | 0.350 | 0.379 | 0.406 | 0.445 | 0.744 | 0.770 | 0.796 | 0.813 | 0.602 | 0.639 | 0.658 | 0.676 |
| CE+BN | - | 0.533 | 0.586 | 0.612 | 0.617 | 0.801 | 0.814 | 0.823 | 0.825 | 0.697 | 0.721 | 0.729 | 0.726 |
| CE+Bihalf bihalf2021li | - | 0.541 | 0.630 | 0.661 | 0.662 | 0.802 | 0.825 | 0.836 | 0.839 | 0.674 | 0.728 | 0.755 | 0.757 |
| OrthoCos | - | 0.583 | 0.660 | 0.702 | 0.714 | 0.795 | 0.826 | 0.842 | 0.851 | 0.690 | 0.745 | 0.772 | 0.784 |
| OrthoCos+Bihalf | - | 0.562 | 0.656 | 0.698 | 0.711 | 0.804 | 0.834 | 0.846 | 0.852 | 0.690 | 0.746 | 0.775 | 0.782 |
| OrthoCos+BN | - | 0.606 | 0.679 | 0.711 | 0.717 | 0.804 | 0.836 | 0.850 | 0.856 | 0.709 | 0.762 | 0.787 | 0.797 |
| OrthoArc+BN | - | 0.614 | 0.681 | 0.709 | 0.714 | 0.806 | 0.833 | 0.850 | 0.856 | 0.708 | 0.762 | 0.785 | 0.794 |

Table 1: Performance of different methods for 4 different code lengths (16, 32, 64 and 128 bits) on different benchmark datasets, measured as mAP@1K on ImageNet100 and mAP@5K on NUS-WIDE and MS COCO. All results are run by us. The Type column indicates whether a compared method is point-wise, pair-wise or triplet-wise.

For performance evaluation, we use mean average precision (mAP@R), which is the mean of the average precision scores of the top R retrieved items. Table 1 offers a performance comparison amongst all selected hashing methods and our methods (and variants). CE denotes a model trained with cross entropy only, where the hash codes are computed from the sign of the continuous codes. CE+BN denotes the CE model with a BN layer bn2015ioffe appended after the latent layer. CE+Bihalf denotes the CE model with a Bihalf bihalf2021li layer appended after the latent layer. OrthoCos denotes a model trained with cosine margin and binary orthogonal target. OrthoCos+Bihalf denotes a variant of OrthoCos with a Bihalf layer appended. OrthoCos+BN denotes a variant of OrthoCos with a BN layer appended. OrthoArc+BN denotes a variant of OrthoCos+BN trained with angular margin.

Overall. It can be observed that both our OrthoCos+BN and OrthoArc+BN perform better than the recent state-of-the-art methods DPN dpn2020fan and CSQ csq2020yuan . On the multi-labeled datasets (i.e., NUS-WIDE and MS COCO), DTSH dtsh2016wang (a triplet based method) performed the best on NUS-WIDE with 0.851 and 0.862 for 64-bit and 128-bit hash codes, followed by our method (e.g., OrthoCos+BN achieves 0.850 and 0.856 in the same settings), while OrthoCos+BN and OrthoArc+BN performed the best on MS-COCO, with at most 1% improvement over previous deep hashing methods.

Code Balance. Although the CE models have the worst retrieval performance, by appending a BN layer after the latent layer (CE+BN) we observe improvements of 1.2-20.7% across all settings (dataset and number of bits), with the largest gains on ImageNet100. The Bihalf bihalf2021li layer (zero-median features) has a proxy derivative to learn hash features, and hence obtains a 0.1-4.9% improvement over CE+BN in all settings except MS COCO with 16 bits. This indicates that, even without sophisticated training objectives, code balance itself is a very important factor in improving Hamming distance based retrieval. However, OrthoCos+Bihalf does not show a significant improvement over OrthoCos+BN, and is only comparable with OrthoCos. We thus conclude that our method can achieve code balance without explicitly engineering the computational graph.

Cosine and Angular Margin. In our experiments, we observed that cosine margin (OrthoCos+BN) slightly outperforms angular margin (OrthoArc+BN) by about 0.2% on average.
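The evaluation metric can be sketched as follows. Definitions of mAP@R differ in how the average precision is normalized; this sketch assumes the precision values are averaged over the relevant items found within the top R results, and that an item is relevant when it has the same single label as the query. The helper names and the toy database are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def average_precision_at_r(query_code, query_label, db_codes, db_labels, r):
    """AP over the top-r database items ranked by Hamming distance."""
    order = sorted(
        range(len(db_codes)),
        key=lambda i: hamming(query_code, db_codes[i]),
    )
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order[:r], start=1):
        if db_labels[idx] == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_r(q_codes, q_labels, db_codes, db_labels, r):
    """Mean of the per-query AP@r values."""
    aps = [
        average_precision_at_r(code, label, db_codes, db_labels, r)
        for code, label in zip(q_codes, q_labels)
    ]
    return sum(aps) / len(aps)


db_codes = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [-1, -1, -1, 1]]
db_labels = [0, 1, 0, 1]
queries, query_labels = [[1, 1, 1, 1]], [0]
score_top4 = mean_average_precision_at_r(
    queries, query_labels, db_codes, db_labels, r=4
)
score_top3 = mean_average_precision_at_r(
    queries, query_labels, db_codes, db_labels, r=3
)
```

For the single query, the relevant items appear at ranks 1 and 4, so the score over the top 4 is (1/1 + 2/4) / 2 = 0.75, while truncating to the top 3 leaves only the first hit and gives 1.0. Multi-label datasets would need a different relevance test, such as sharing at least one label.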

4.2 Results on Instance-Level Retrieval

| Methods | Type | GLDv2 128 | GLDv2 512 | GLDv2 2048 | Oxf-Hard 128 | Oxf-Hard 512 | Oxf-Hard 2048 | Paris-Hard 128 | Paris-Hard 512 | Paris-Hard 2048 |
|---|---|---|---|---|---|---|---|---|---|---|
| HashNet hashnet2017cao | pair-wise | 0.018 | 0.069 | 0.111 | 0.034 | 0.058 | 0.307 | 0.133 | 0.190 | 0.490 |
| DPN dpn2020fan | point-wise | 0.021 | 0.089 | 0.133 | 0.053 | 0.184 | 0.303 | 0.224 | 0.399 | 0.562 |
| GreedyHash greedyhash2018su | point-wise | 0.029 | 0.108 | 0.144 | 0.032 | 0.251 | 0.373 | 0.128 | 0.531 | 0.652 |
| CSQ csq2020yuan | point-wise | 0.023 | 0.086 | 0.114 | 0.093 | 0.284 | 0.398 | 0.245 | 0.541 | 0.649 |
| OrthoCos+BN | - | 0.035 | 0.111 | 0.147 | 0.184 | 0.359 | 0.447 | 0.416 | 0.608 | 0.669 |
| R50-DELG-H | - | - | - | 0.125* | - | - | 0.471 | - | - | 0.682 |
| R50-DELG-C | - | - | - | 0.138* | - | - | 0.510 | - | - | 0.715 |

Table 2: Performance of different methods for 3 different numbers of bits (128, 512 and 2048) on different instance-level benchmark datasets, measured as mAP@100 on GLDv2 and mAP@all on Oxf-Hard and Paris-Hard. All results are run by us. The Type column indicates whether a compared method is point-wise or pair-wise. * indicates using 512 × 512 image inputs, hence the performance differs from that reported by DELG delg2020cao . R50-DELG-H denotes Hamming distance retrieval using the sign of the extracted descriptors. R50-DELG-C denotes cosine distance retrieval using the extracted descriptors.

For evaluation metrics, we adapt the evaluation protocol of [2, 38]. The baseline performance on GLDv2, Oxf-Hard and Par-Hard from the pre-trained R50-DELG-GLDv2-clean is 0.138, 0.510 and 0.715 respectively. Table 2 summarizes the performance of different deep hashing methods and our method. On all 3 datasets, our method outperforms all previous deep hashing methods at every code length. This suggests that our method generalizes better to unseen instances than previous deep hashing methods. In particular, our model significantly outperforms previous deep hashing models by 0.6%, 9.1% and 17.1% respectively on the 3 datasets with 128-bit hash codes.

Orthogonal Transformation. Surprisingly, the 2048-bit hash codes on GLDv2 achieve much better performance than the pre-trained 2048-dimensional descriptors (a 1.1% improvement over R50-DELG-C). We then analyze the separability in cosine distances, i.e., the difference between the mean intra-class cosine distance and the mean inter-class cosine distance, before and after the transformation (similar to Figure 3). We observe that the separability in cosine distances increases after the orthogonal transformation, from 0.142 to 0.167. The results thus show that learning orthogonal hash codes can transform the inputs to be more discriminative.

Domain shifting with BN. As the model is trained with GLDv2, the running mean and variance in the BN layer might suffer from the domain shift problem [25] when testing directly on different datasets (e.g., Oxf and Par). We empirically found that using the running mean and variance from GLDv2 leads to a large performance drop in Hamming distance retrieval (see the supplementary material for details). One simple solution is to recompute the mean and variance from all continuous codes in the database, then update the running mean and variance with the computed values. The performance on Oxf and Par in Table 2 is obtained with the running mean and variance of the respective database.
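The recomputation of BN statistics described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function names `recompute_bn_stats` and `hash_codes`, the affine-free BN, the `eps` value and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def recompute_bn_stats(db_codes):
    """Per-bit mean and variance over all continuous codes in a database."""
    # db_codes: (N, K) pre-BN continuous codes of the target database.
    return db_codes.mean(axis=0), db_codes.var(axis=0)

def hash_codes(codes, mean, var, eps=1e-5):
    """Apply inference-mode BN (no affine terms, an assumption) and take the sign."""
    normalized = (codes - mean) / np.sqrt(var + eps)
    return np.where(normalized >= 0, 1, -1)

rng = np.random.default_rng(0)
# Hypothetical target-domain codes whose per-bit mean is shifted away from zero.
target_db = rng.normal(loc=2.0, scale=1.0, size=(1000, 8))

# Statistics carried over from the training domain (assumed zero mean, unit variance).
stale = hash_codes(target_db, np.zeros(8), np.ones(8))
# Statistics recomputed from the target database itself.
mean, var = recompute_bn_stats(target_db)
fresh = hash_codes(target_db, mean, var)

# Fraction of +1 per bit: far above 0.5 with stale statistics,
# roughly 0.5 (balanced bits) after the update.
stale_balance = (stale == 1).mean(axis=0)
fresh_balance = (fresh == 1).mean(axis=0)
```

In this toy setting, the stale statistics leave almost every bit stuck at +1, so the codes carry little information, while the recomputed statistics restore roughly balanced bits.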

4.3 Further Analysis

Histogram of Hamming Distances. Figure 3 shows the histograms of intra-class and inter-class Hamming distances. We compare our method OrthoCos+BN with the pair-wise method HashNet [4] and the point-wise, classification-based GreedyHash [40]. Although the distributions of inter-class distances are about the same for all 3 methods (close to a Hamming distance of 32), we can see that the larger the separability, i.e., the difference between the mean intra-class distance (the blue dotted line) and the mean inter-class distance (the orange dotted line), the better the performance.
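To make the separability measure concrete, the sketch below computes intra-class and inter-class Hamming distances, their separability, and a normalized histogram for a toy set of codes. It is an illustrative NumPy example under stated assumptions: the function names and the 4-bit toy data are hypothetical, and codes are assumed to take values in {-1, +1}.

```python
import numpy as np

def hamming_distances(codes):
    """Pairwise Hamming distances between {-1, +1} codes of shape (N, K)."""
    k = codes.shape[1]
    # For +/-1 codes, d_H(a, b) = (K - a.b) / 2.
    return (k - codes @ codes.T) // 2

def separability(codes, labels):
    """Mean inter-class distance minus mean intra-class distance."""
    dist = hamming_distances(codes)
    same = labels[:, None] == labels[None, :]
    # Count each unordered pair once and skip self-pairs.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    intra = dist[same & upper]
    inter = dist[~same & upper]
    return inter.mean() - intra.mean(), intra, inter

# Toy 4-bit example with two classes (hypothetical data).
codes = np.array([[ 1,  1,  1,  1],
                  [ 1,  1,  1, -1],
                  [-1, -1,  1,  1],
                  [-1, -1,  1, -1]])
labels = np.array([0, 0, 1, 1])
sep, intra, inter = separability(codes, labels)  # sep is 1.5 for this toy data

# Normalized histogram of inter-class distances: the bin frequencies sum to 1.
hist, _ = np.histogram(inter, bins=np.arange(0, codes.shape[1] + 2))
hist = hist / hist.sum()
```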

Figure 3: Histogram of intra-class and inter-class Hamming distances with 64-bit ImageNet100 codes for (a) HashNet [4], (b) GreedyHash [40] and (c) OrthoCos+BN. The arrow annotation is the separability in Hamming distances, i.e., the difference between the mean inter-class distance and the mean intra-class distance. The frequencies are normalized so that the bins sum to 1.
Figure 4: Analysis of retrieval performance with 64-bit ImageNet100 codes in terms of (a) quantization error, (b) separability and (c) orthogonality. The blue solid line denotes mean average precision (mAP@1000) and the orange dotted line denotes the respective analysis score.

Performance Improvement Analysis. We further analyze the reasons behind the performance improvements of different deep hashing methods and summarize the results in Figure 4. We identify 3 main factors that contribute to the improvement in deep hashing methods: i) quantization error; ii) the separability in Hamming distances; and iii) orthogonality of the hash centers. For quantization error, we measure the angle between the continuous codes and the hash codes. For separability, we measure the difference between the mean of inter-class distances and the mean of intra-class distances. For orthogonality, we first compute the hash center of every class (by taking the sign of the average hash code in the class), then measure how far the centers are from being mutually orthogonal (lower is better). When the quantization error decreases, the separability increases and the hash centers have better orthogonality, resulting in better performance.
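The quantization-error and orthogonality measurements can be illustrated with a short NumPy sketch. The quantization angle and the hash-center computation follow the verbal description above; the specific orthogonality score used here (mean absolute cosine between distinct centers) is an assumption chosen for illustration, since the exact formula is not recoverable from this text. All function names and data are hypothetical.

```python
import numpy as np

def quantization_angle(continuous):
    """Mean angle (radians) between each continuous code and its sign vector."""
    binary = np.where(continuous >= 0, 1.0, -1.0)
    cos = (continuous * binary).sum(axis=1) / (
        np.linalg.norm(continuous, axis=1) * np.linalg.norm(binary, axis=1))
    return np.arccos(np.clip(cos, -1.0, 1.0)).mean()

def hash_centers(codes, labels):
    """Sign of the average hash code of every class."""
    classes = np.unique(labels)
    means = np.stack([codes[labels == c].mean(axis=0) for c in classes])
    return np.where(means >= 0, 1, -1)

def orthogonality_score(centers):
    """Mean absolute cosine between distinct centers (assumed metric; 0 = orthogonal)."""
    k = centers.shape[1]
    cos = (centers @ centers.T) / k
    off_diag = ~np.eye(len(centers), dtype=bool)
    return np.abs(cos[off_diag]).mean()

# A code already at +/-1 has zero quantization angle; a skewed code does not.
perfect = np.array([[1.0, -1.0, 1.0, -1.0]])
skewed = np.array([[4.0, -0.1, 0.1, -0.1]])
angle_perfect = quantization_angle(perfect)
angle_skewed = quantization_angle(skewed)

# Two classes whose centers are exactly orthogonal (hypothetical data).
codes = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [1, -1, 1, -1], [1, -1, 1, -1]])
labels = np.array([0, 0, 1, 1])
score = orthogonality_score(hash_centers(codes, labels))
```

A skewed continuous code, dominated by one coordinate, sits far from its binary vertex in angle even though taking the sign yields a valid hash code, which is why the angle is a natural proxy for quantization error.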

5 Conclusion & Future Work

We propose to unify the training objectives of deep hashing under a single classification objective. We show that this can be achieved by maximizing the cosine similarity between the continuous codes and binary orthogonal targets under a cross-entropy loss. To that end, we first reformulated the problem of deep hashing through the lens of cosine similarity. We then demonstrated that if we perform L2-normalization on the continuous codes, end-to-end training of deep hashing is possible without any extra sophisticated constraints. Moreover, we leverage Label Smoothing to train multi-label classification with a cross-entropy loss, and batch normalization for code balancing. Extensive experiments validated the efficiency of our method on both category-level and instance-level retrieval benchmarks. As future work, we are exploring unsupervised learning of better feature representations to improve retrieval performance with hash codes.
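As a rough illustration of the single objective summarized above, the following NumPy sketch computes a cross-entropy loss over scaled cosine similarities between L2-normalized continuous codes and per-class binary orthogonal targets. It is a simplified sketch, not the authors' implementation: the Hadamard-based target construction, the `scale` value, the function names and the omission of any margin, label smoothing or BN layer are assumptions made for the example.

```python
import numpy as np

def hadamard(k):
    """Sylvester construction of a k x k Hadamard matrix (k must be a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < k:
        h = np.block([[h, h], [h, -h]])
    return h

def cosine_ce_loss(continuous, targets, labels, scale=8.0):
    """Cross entropy over scaled cosine similarities to per-class binary targets."""
    v = continuous / np.linalg.norm(continuous, axis=1, keepdims=True)  # L2-normalize
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = scale * (v @ t.T)                     # (N, C) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

targets = hadamard(8)[:4]          # 4 classes, 8-bit mutually orthogonal targets
labels = np.array([0, 1, 2, 3])
aligned = 3.0 * targets[labels]                # codes pointing at their own target
misaligned = 3.0 * targets[(labels + 1) % 4]   # codes pointing at a wrong target
loss_aligned = cosine_ce_loss(aligned, targets, labels)
loss_misaligned = cosine_ce_loss(misaligned, targets, labels)
```

Because the codes are L2-normalized, their magnitude (the factor 3.0 here) does not affect the loss; only the angle to each target matters, so lowering the loss pulls a continuous code toward its binary target in direction.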

Broader Impact

Hashing remains a key bottleneck in practical deployments of large-scale retrieval systems. Recent deep hashing frameworks have shown great promise in learning codes that are both compact and discriminative. Yet state-of-the-art frameworks are known to be difficult to train and to reproduce, largely owing to complex loss designs that dictate hyperparameter tuning and multi-stage training. In this work, we set out to change that: we attempt to unify deep hashing under a single objective, thereby simplifying training and helping reproducibility. Our key intuition lies in reformulating hashing through the lens of cosine similarity. We report competitive hashing performance on all common datasets, and significant improvements over the state of the art on the more challenging task of instance-level retrieval.

This research is partly supported by the Fundamental Research Grant Scheme (FRGS) MoHE Grant FP021-2018A, from the Ministry of Education Malaysia. We also thank Kilho Shin for helpful discussions and recommendations.


  • [1] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [2] Bingyi Cao, André Araujo, and Jack Sim. Unifying deep local and global features for image search. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 726–743, Cham, 2020. Springer International Publishing.
  • [3] Yue Cao, Mingsheng Long, Bin Liu, and Jianmin Wang. Deep cauchy hashing for hamming space retrieval. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1229–1237, 2018.
  • [4] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. Hashnet: Deep learning to hash by continuation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5609–5618, 2017.
  • [5] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388, 2002.
  • [6] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
  • [7] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. Nus-wide: A real-world web image database from national university of singapore. In Proceedings of the ACM international conference on image and video retrieval, pages 1–9, Santorini, Greece, July 8–10, 2009.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [9] Jiankang Deng, Jia Guo, Xue Niannan, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [10] Shiv Ram Dubey. A decade survey of content based image retrieval using deep learning, 2020.
  • [11] Lixin Fan, Kam Woh Ng, Ce Ju, Tianyu Zhang, and Chee Seng Chan. Deep polarized network for supervised learning of accurate binary hashing codes. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 825–831. International Joint Conferences on Artificial Intelligence Organization, July 2020. Main track.
  • [12] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Vldb, volume 99, pages 518–529, 1999.
  • [13] Yunchao Gong, Sanjiv Kumar, Vishal Verma, and Svetlana Lazebnik. Angular quantization-based binary codes for fast similarity search. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
  • [14] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE transactions on pattern analysis and machine intelligence, 35(12):2916–2929, 2012.
  • [15] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, page 604–613, New York, NY, USA, 1998. Association for Computing Machinery.
  • [16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
  • [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [18] Weihao Kong and Wu-jun Li. Isotropic hashing. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
  • [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
  • [20] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009.
  • [21] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In 2009 IEEE 12th international conference on computer vision, pages 2130–2137. IEEE, 2009.
  • [22] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3270–3278, 2015.
  • [23] Qi Li, Zhenan Sun, Ran He, and Tieniu Tan. Deep supervised discrete hashing. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [24] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, page 1711–1717. AAAI Press, 2016.
  • [25] Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117, 2018.
  • [26] Yunqiang Li and Jan van Gemert. Deep unsupervised image hashing by maximizing bit entropy. Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • [27] Guosheng Lin, Chunhua Shen, David Suter, and Anton van den Hengel. A general two-step approach to learning-based hashing. In 2013 IEEE International Conference on Computer Vision, pages 2552–2559, 2013.
  • [28] Mingbao Lin, Rongrong Ji, Hong Liu, and Yongjian Wu. Supervised online hashing via hadamard codebook learning. In Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, page 1635–1643, New York, NY, USA, 2018. Association for Computing Machinery.
  • [29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
  • [30] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. Deep hashing for compact binary codes learning. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2475–2483, 2015.
  • [31] Bin Liu, Yue Cao, Mingsheng Long, Jianmin Wang, and Jingdong Wang. Deep triplet quantization. In Proceedings of the 26th ACM international conference on Multimedia, pages 755–763, 2018.
  • [32] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2064–2072, 2016.
  • [33] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2074–2081. IEEE, 2012.
  • [34] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6738–6746, 2017.
  • [35] Mohammad Norouzi and David J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, page 353–360, Madison, WI, USA, 2011. Omnipress.
  • [36] Mohammad Norouzi, David J Fleet, and Russ R Salakhutdinov. Hamming distance metric learning. In Advances in neural information processing systems, pages 1061–1069, 2012.
  • [37] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
  • [38] Filip Radenovic, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5706–5715, 2018.
  • [39] Yuming Shen, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, and Ziyi Shen. Embarrassingly simple binary representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
  • [40] Shupeng Su, Chao Zhang, Kai Han, and Yonghong Tian. Greedy hash: Towards fast optimization for accurate hash coding in cnn. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [41] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
  • [42] Jingdong Wang, Ting Zhang, Jingkuan Song, Nicu Sebe, and Heng Tao Shen. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):769–790, 2018.
  • [43] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data - a survey, 2015.
  • [44] Xiaofang Wang, Yi Shi, and Kris M Kitani. Deep supervised hashing with triplet labels. Asian Conference on Computer Vision, 2016.
  • [45] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
  • [46] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584, 2020.
  • [47] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In Proceedings of the AAAI conference on artificial intelligence, volume 28, 2014.
  • [48] Huei-Fang Yang, Kevin Lin, and Chu-Song Chen. Supervised learning of semantics-preserving hash via deep convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 40(2):437–451, 2017.
  • [49] Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, and Jiashi Feng. Central similarity quantization for efficient image and video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3083–3092, 2020.
  • [50] Tianyu Zhang, Lei Zhu, Qian Zhao, and Kilho Shin. Neural networks weights quantization: Target none-retraining ternary (tnt), 2019.
  • [51] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1556–1564, 2015.
  • [52] Xiangtao Zheng, Yichao Zhang, and Xiaoqiang Lu. Deep balanced discrete hashing for image retrieval. Neurocomputing, 403:224–236, 2020.
  • [53] Chang Zhou, Lai-Man Po, Wilson Y. F. Yuen, Kwok Wai Cheung, Xuyuan Xu, Kin Wai Lau, Yuzhi Zhao, Mengyang Liu, and Peter H. W. Wong. Angular deep supervised hashing for image retrieval. IEEE Access, 7:127521–127532, 2019.