orthohash
Official implementation of NeurIPS 2021 paper "One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective"
A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer, and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at https://github.com/kamwoh/orthohash
A key building block of a real-world large-scale image retrieval system is hashing. The objective of image hashing is to represent the content of an image using a binary code for efficient storage and accurate retrieval. Recently, deep hashing methods
cnnh2014xia ; simultaneous2015lai have shown great improvements over conventional hashing methods spectral09weiss ; itq2012gong ; lsh1998indyk ; klsh2009kulis ; minlosshash2011mohammad ; hammingmetric2012mohammad ; isotropic2012kong . Furthermore, deep hashing methods can be grouped by how the similarity of the learned hash codes is measured, namely pointwise ssdh2017yang ; adsh2019zhou ; jmlh2019shen ; dpn2020fan ; csq2020yuan , pairwise dpsh2016li ; simultaneous2015lai ; hashnet2017cao ; dch2018cao , triplet-wise dtsh2016wang ; dtq2018liu , or listwise semantic2015zhao . Among them, pointwise methods have a computational complexity of O(N), whilst the complexity of the others is at least O(N^2) for N data points. This means that for large-scale problems, only the pointwise methods are tractable ssdh2017yang . They are thus the focus of most recent studies.
[Figure 1 (caption fragment): ... for each class. It can be seen that the continuous codes exhibit lower intra-class variance and quantization error as compared with the CE&BN models.]
A deep hashing neural network naturally has multiple learning objectives. Specifically, given an image input, the network outputs a continuous code (feature vector) which is then converted into a binary hash code using a quantization layer (usually a sign function). There are thus two main objectives. First, the final model output, i.e., the binary codes, must be discriminative, meaning the intra-class Hamming distances are small while the inter-class ones are large. Second, a quantization error minimization objective is needed to regularize the continuous codes. However, the learning is constrained by the vanishing gradient problem caused by the quantization layer. Although the problem can be avoided by deploying some relaxation schemes
hashnet2017cao ; simultaneous2015lai ; dpsh2016li , these schemes often produce sub-optimal hash codes due to the introduced quantization error (see Figure 1). Hence, most recent deep hashing methods greedyhash2018su ; dpsh2016li ; dch2018cao ; dbdh2020zheng ; csq2020yuan have an explicit quantization error minimization learning objective. Having these two main objectives/losses is still not enough. In particular, to ensure the quality of the hash codes, many other losses are employed by existing methods. These include a bit balance loss dbdh2020zheng ; ssdh2017yang ; jmlh2019shen , weight constraints to maximize Hamming distance adsh2019zhou , and code orthogonality sdh2015liong ; dtq2018liu . Further, losses are designed to address the vanishing gradient problem caused by the sign function used to obtain binary codes from the continuous ones greedyhash2018su ; jmlh2019shen ; bihalf2021li . As a result, state-of-the-art hashing models typically have a large number (>4) of losses. This means difficulties in optimization, which in turn hamper their effectiveness.

In this work, for the first time, a deep hashing model with a single loss is developed, which removes any need for loss weight tuning and is thus much easier to optimize. As mentioned earlier, a deep hashing model needs to be trained with at least two objectives, namely binary code discriminativeness and quantization error minimization. So how could one use only one loss? The answer lies in the fact that the two objectives are closely related and can be unified into one. More concretely, we show that both objectives can be satisfied by maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal targets, which can be formulated as a cross-entropy (CE) loss. Our model, dubbed OrthoHash, has only one loss, which maximizes the cosine similarity between the L2-normalized continuous codes and the binary orthogonal target so as to maximize the inter-class Hamming distance and minimize the quantization error simultaneously. We show that this single unifying loss has a number of additional benefits. First, we can leverage the benefit of a margin cosface2018wang ; arcface2019deng to further reduce the intra-class variance. Second, since the conventional CE loss only works for single-label classification, we can easily leverage Label Smoothing labelsmooth2017gabriel to modify the CE loss to tackle multi-label classification. Finally, we show that code balancing can now be enforced by introducing a batch normalization (BN) bn2015ioffe layer rather than requiring a different loss. Extensive experimental results suggest that on conventional category-level retrieval tasks using ImageNet100, NUS-WIDE and MS-COCO, our model is on par with the SOTA. More importantly, on large-scale instance-level retrieval tasks, our method achieves a new SOTA, beating the best results obtained so far on GLDv2,
ℛOxf and ℛParis by 0.6%, 9.1% and 17.1% respectively.

Hashing methods. Conventional hashing methods can be categorized into many streams. Data-independent methods such as Locality-Sensitive Hashing (LSH) lsh1998indyk ; lsh1999gionis and its kernelized version (KLSH) klsh2009kulis have contributed many of the fundamental concepts of hashing, such as the requirements of code balance, uncorrelated bits, and similarity preservation. In contrast, data-dependent methods spectral09weiss ; bre2009kulis ; itq2012gong ; isotropic2012kong ; minlosshash2011mohammad ; hammingmetric2012mohammad aim to learn hash codes that are more compact yet more dataset-specific return2014chatfield
. Recently, deep learning based hashing methods
dpsh2016li ; cnnh2014xia ; simultaneous2015lai have dominated hashing research due to the superior learning ability of DNNs. Various learning objectives have been developed to learn hash codes from a training dataset. The objective functions include 1) a task learning objective, which can be further categorized into pointwise ssdh2017yang ; adsh2019zhou ; jmlh2019shen ; greedyhash2018su ; dpn2020fan ; csq2020yuan , pairwise hashnet2017cao ; dpsh2016li ; simultaneous2015lai , triplet-wise dtsh2016wang ; dtq2018liu , listwise semantic2015zhao and unsupervised bihalf2021li ; angular2012gong ; 2) quantization error minimization, such as a loss designed to minimize an ℓp-norm distance (e.g., ℓ1 or ℓ2) between the continuous codes and the hash codes; and 3) code balancing bihalf2021li ; jmlh2019shen . We refer readers to the learning-to-hash surveys lths2015wang ; lths2018wang ; decade2020dubey for a more detailed review.

Binary optimization. Hashing is an NP-hard binary optimization problem spectral09weiss , and is prone to the vanishing gradient problem due to the discrete and non-differentiable binary hash functions. Early methods solved the problem by discarding the discrete constraints (e.g., designing a penalty loss term to generate features that are as binary as possible dpsh2016li ; simultaneous2015lai , or solving with continuous relaxation, i.e., optimizing in a continuous space using sigmoid or tanh for approximation hashnet2017cao ). Some methods also utilized a coordinate descent method during training twostep2013lin ; dsdh2017li . Nevertheless, these methods increase the complexity of learning due to the need to tune hyper-parameters that balance the different learning objectives.

Bypassing vanishing gradients. Greedy Hash greedyhash2018su designed a new coding layer which uses the sign function in the forward pass to generate binary codes, while gradients are backpropagated using the straight-through estimator
ste2013bengio during optimization. bihalf2021li designed a parameter-free coding layer, Bi-half, to maximize the bit capacity by shifting the network output by its median (so that each bit has a 50% chance of being +1 or -1). These methods typically require modification of the computational graph, in the sense that the original graph is no longer trained end-to-end, which further complicates the original optimization objective. Ours, on the other hand, incorporates a neat one-loss design that removes all such complications.

Learning hash codes with pre-defined targets. Deep Polarized Network (DPN) dpn2020fan used a random assignment scheme to generate target vectors with maximal inter-class distance, then optimized a hinge-like polarized loss. Central Similarity Quantization (CSQ) csq2020yuan uses a Hadamard matrix as "hash centers", then optimizes a binary cross entropy. Both methods have a similar overall objective, i.e., the continuous codes are learned to be as similar as possible to the target vectors (or "hash centers"). Our model also employs a hash target, but uniquely it is used within a single cosine-similarity-based objective.

Cosine similarity. While most works focus on hashing images with various constraints, we reformulate the problem of deep hashing through the lens of cosine similarity. Inspired by tnt2019zhang ; angular2012gong , which utilize cosine similarity to find the closest approximate binary or ternary representation, we also interpret the quantization error in terms of cosine similarity. Moreover, deep hypersphere embedding learning methods (e.g., SphereFace sphere2017liu , CosFace cosface2018wang and ArcFace arcface2019deng ) impose discriminative constraints on a hypersphere manifold and improve the decision boundary with a cosine or angular margin. Inspired by them, we also leverage the benefit of a margin to reduce the intra-class variance.

In Section 3.1, we reformulate the problem of deep hashing through the lens of cosine similarity, i.e., we interpret both Hamming distance retrieval and quantization error in terms of cosine similarity. In Section 3.2, we propose to maximize the cosine similarity between the continuous codes and a binary orthogonal target under a single classification objective (for both single-label and multi-label classification). Finally, we describe why adding a batch normalization layer after the continuous codes achieves code balance in Section 3.2.3. Our method is illustrated in Figure 2.

Let us first formally define the deep hashing problem. Let X = {x_i}_{i=1}^N denote the training data, where N is the number of training samples, and Y = {y_i}_{i=1}^N the one-hot training labels over N_C classes (for multi-label data, y_{i,c} = 1 if the c-th class is assigned to the i-th sample and 0 otherwise). Our objective is to learn a K-bit binary code b_i ∈ {-1,+1}^K for each training point x_i, which is converted from the continuous code v_i through a sgn function. v_i is computed by a latent layer, v_i = W^T f(x_i; Θ), where f is a deep neural network (backbone network) that computes a D-dimensional nonlinear feature representation, W is the weight matrix of the latent layer, and b_i^(k) = 1 if the k-th bit of v_i is ≥ 0 and -1 otherwise. In our work, the binary orthogonal targets are T = {t_c}_{c=1}^{N_C} with t_c ∈ {-1,+1}^K denoting the target vector belonging to the c-th class. Ideally, any two targets t_i and t_j are orthogonal to each other. We use regular letters for scalars, bold lowercase letters for column vectors, and bold uppercase letters for matrices; i and j are often used as indices.
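To make the notation above concrete, here is a minimal PyTorch sketch of the pipeline: a backbone f produces a D-dimensional feature, a latent layer W maps it to the K-dimensional continuous code v, and the binary code b is taken as the sign of v. The class and variable names (HashModel, nbit, to_binary) are illustrative and not taken from the official implementation.

```python
import torch
import torch.nn as nn

class HashModel(nn.Module):
    """Backbone f(.; Theta) followed by a latent layer W that outputs
    the K-dimensional continuous codes v (illustrative sketch)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, nbit: int):
        super().__init__()
        self.backbone = backbone                              # f(x; Theta) -> R^D
        self.latent = nn.Linear(feat_dim, nbit, bias=False)   # W in R^{D x K}

    def forward(self, x):
        feat = self.backbone(x)      # D-dimensional nonlinear feature
        v = self.latent(feat)        # continuous codes v in R^K
        return v

def to_binary(v: torch.Tensor) -> torch.Tensor:
    """b^(k) = +1 if v^(k) >= 0 else -1 (used only at inference time)."""
    return torch.where(v >= 0, torch.ones_like(v), -torch.ones_like(v))

# Toy usage with a tiny random "backbone".
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
model = HashModel(backbone, feat_dim=128, nbit=64)
v = model(torch.randn(4, 3, 32, 32))
b = to_binary(v)                     # 4 x 64 codes in {-1, +1}
```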
Interpreting Hamming Distance as Cosine Similarity. Typically, the Hamming distance can be computed using a logical xor operation between binary codes b_i and b_j, followed by popcount. If b is represented in {-1, +1}^K, then the Hamming distance can also be computed mathematically as:
δ_H(b_i, b_j) = ½ (K − b_i^T b_j)        (1)
Geometrically, the dot product b_i^T b_j can be interpreted as:
b_i^T b_j = ‖b_i‖ ‖b_j‖ cos θ        (2)
in which ‖·‖ is the Euclidean norm and θ is the angle between b_i and b_j. As both ‖b_i‖ and ‖b_j‖ are constant (i.e., √K), equation (1) can then be viewed as:
δ_H(b_i, b_j) = ½ (K − K cos θ) = (K/2)(1 − cos θ)        (3)
Since K/2 is a constant, we can see that retrieval is now based only on the angle between two hash codes, i.e., similar hash codes will have a similar direction, yielding a smaller angle between them and hence a lower Hamming distance.

Interpreting Quantization Error as Cosine Similarity. Typically, converting continuous codes into binary codes leads to information loss, which is known as quantization error. Therefore, most existing hashing methods include quantization error minimization in their learning objective, such as an L1-norm, L2-norm or ℓp-norm penalty (e.g., p = 3 in Greedy Hash greedyhash2018su ), usually in the form of:
L = L_S + λ L_Q        (4)
where L_S is the supervised learning objective (such as cross entropy) and L_Q is the quantization error between v and b. However, it is difficult to control the scale λ, i.e., a low λ might not be effective, while a high λ might lead to underfitting. As a result, careful tuning is needed, and yet the tuned λ may vary across tasks. To overcome this cumbersome practice, let us first interpret the quantization error geometrically:

L_Q = ‖v − b‖²        (5)
in which v is in continuous space and b ∈ {-1,+1}^K is in binary space. We expand equation (5) to get:
L_Q = ‖v‖² + ‖b‖² − 2 v^T b = ‖v‖² + K − 2 √K ‖v‖ cos θ        (6)
According to equation (3), retrieval is based only on the similarity in direction of two hash codes. Hence, we can ignore the magnitude of v by normalizing it to have the same norm as b, i.e., ‖v̂‖ = ‖b‖ = √K, and interpret the quantization error as depending only on the angle between v̂ and b (see the supplementary material for a proof):
L_Q(v̂, b) = K + K − 2K cos θ = 2K (1 − cos θ)        (7)
Since 2K is a constant, we can conclude that maximizing the cosine similarity between v̂ and b leads to a low quantization error, and hence a better approximation of the continuous codes by the hash codes.
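As a quick sanity check of the two interpretations above, the following NumPy snippet (ours, not part of the method) verifies equations (1), (3) and (7) numerically on random codes.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 64

# Random binary codes in {-1, +1}^K and a random continuous code v.
b_i = rng.choice([-1.0, 1.0], size=K)
b_j = rng.choice([-1.0, 1.0], size=K)
v = rng.normal(size=K)

# Equation (1): Hamming distance from the inner product.
hamming = np.sum(b_i != b_j)
assert hamming == 0.5 * (K - b_i @ b_j)

# Equation (3): the same distance from the angle between the codes.
cos_theta = (b_i @ b_j) / (np.linalg.norm(b_i) * np.linalg.norm(b_j))
assert np.isclose(hamming, 0.5 * K * (1.0 - cos_theta))

# Equation (7): quantization error of the rescaled code v_hat vs. b_i.
v_hat = np.sqrt(K) * v / np.linalg.norm(v)          # ||v_hat|| = sqrt(K)
cos_vb = (v_hat @ b_i) / (np.linalg.norm(v_hat) * np.linalg.norm(b_i))
quant_err = np.sum((v_hat - b_i) ** 2)
assert np.isclose(quant_err, 2.0 * K * (1.0 - cos_vb))
print("Hamming distance and quantization error match their cosine forms.")
```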
According to similarity2002charikar , the probability of two samples x_i and x_j having the same hash code under a family of hash functions built with the random hyperplane technique can be described as Pr[h(x_i) = h(x_j)] = 1 − θ(x_i, x_j)/π, where h is a hash function and θ(x_i, x_j) is the angle between x_i and x_j. Therefore, based on the same principle, it can be derived that if two continuous codes v_i and v_j from the latent layer have high cosine similarity, then the corresponding hash codes b_i and b_j should also have a high chance of being identical. Besides that, as described in Section 3.1, cosine similarity can also be used to characterize both the retrieval with the hash codes and the quantization error between the continuous codes and the hash codes. Given these two observations, we therefore propose to maximize the cosine similarity between the continuous codes and their corresponding binary orthogonal target, t_{y_i}, which can be achieved by maximizing the posterior probability of the ground-truth class using the softmax (cross-entropy) loss:
L_CE = −(1/N) Σ_i log [ exp(t_{y_i}^T v_i) / Σ_{c=1}^{N_C} exp(t_c^T v_i) ]        (8)
where v_i denotes the deep continuous code of the i-th sample from the DNN, and t_{y_i} and t_c denote the binary orthogonal targets of the ground-truth class and the c-th class respectively. For simplicity, we omit the bias term from equation (8). It follows that, under the framework of deep hypersphere embedding sphere2017liu ; cosface2018wang ; arcface2019deng , we can transform the logit as t_c^T v_i = ‖t_c‖ ‖v_i‖ cos θ_c, where θ_c is the angle between the continuous code v_i and the binary orthogonal target t_c. Next, we perform L2 normalization on v_i so that ‖v̂_i‖ = 1, and ‖t_c‖ = √K since it is in binary form. Now our loss function can be rewritten as:
L = −(1/N) Σ_i log [ exp(√K cos θ_{y_i}) / Σ_{c=1}^{N_C} exp(√K cos θ_c) ]        (9)
As such, instead of introducing a quantization error minimization term into the learning objective (equation (4)), our proposed method unifies the learning objective and quantization error minimization under a single classification objective, as shown in the loss function (equation (9)). Furthermore, since the binary orthogonal targets attain maximal inter-class Hamming distance and our loss function also aims to minimize the intra-class variance, we can leverage a cosine or angular margin (a cosine margin transforms cos θ_{y_i} into cos θ_{y_i} − m, while an angular margin transforms it into cos(θ_{y_i} + m)), which has been proven beneficial in CosFace cosface2018wang and ArcFace arcface2019deng , to further reduce the intra-class variance (we use the same margin setting in all of our experiments unless mentioned explicitly). With this, our method is able to perform end-to-end training to learn highly discriminative hash codes without sophisticated training objectives or computational graph modifications.
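Below is a minimal sketch of this single loss as we understand it from equations (8)-(9): L2-normalize the continuous codes, compute the cosine similarity to every class's binary orthogonal target, scale by sqrt(K), optionally apply a cosine margin to the ground-truth logit, and feed the result to a standard cross entropy. Function and argument names (orthohash_loss, margin) are ours, and the margin value in the toy usage is illustrative rather than the paper's setting.

```python
import math
import torch
import torch.nn.functional as F

def orthohash_loss(v: torch.Tensor,        # (B, K) continuous codes
                   targets: torch.Tensor,  # (C, K) binary orthogonal targets in {-1,+1}
                   labels: torch.Tensor,   # (B,) ground-truth class indices
                   margin: float = 0.0) -> torch.Tensor:
    """Single cosine-similarity loss (sketch of equations (8)-(9))."""
    K = v.shape[1]
    v_hat = F.normalize(v, dim=1)                 # ||v_hat|| = 1
    t_hat = F.normalize(targets.float(), dim=1)   # binary target / sqrt(K)
    cos_theta = v_hat @ t_hat.t()                 # (B, C) cosine similarities

    if margin > 0:
        # Cosine margin: cos(theta_y) -> cos(theta_y) - m for the true class only.
        one_hot = F.one_hot(labels, num_classes=targets.shape[0]).float()
        cos_theta = cos_theta - margin * one_hot

    logits = math.sqrt(K) * cos_theta             # scale s = ||t|| = sqrt(K)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 classes, 32-bit codes.
C, K, B = 8, 32, 16
targets = torch.randint(0, 2, (C, K)).float() * 2 - 1   # random {-1,+1} targets
v = torch.randn(B, K, requires_grad=True)
labels = torch.randint(0, C, (B,))
loss = orthohash_loss(v, targets, labels, margin=0.2)
loss.backward()
```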
The maximization of the expected inter-class Hamming distance helps to increase the recall rate during retrieval, as there is less chance of retrieving incorrect items: the aim is to retrieve more similar items (intra-class) and avoid retrieving incorrect items (inter-class). That is, given a K-bit Hamming space {-1,+1}^K, for any two binary vectors whose bits are sampled independently with probability p of being +1, the expected Hamming distance is 2Kp(1−p), which attains its upper bound of K/2 when p = 0.5 dpn2020fan ; csq2020yuan (see the supplementary material for details). Hence, hash codes b_i and b_j should be orthogonal so that we obtain δ_H = K/2 in equation (3).

Orthogonal Targets Generation. A Hadamard matrix naturally contains orthogonal rows and columns, which guarantees the maximum Hamming distance of K/2 between any two rows csq2020yuan ; hcoh2018lin . However, it is restricted to the cases where K is 1, 2, or a multiple of 4. Hence, a simple solution is to sample the targets from a Bernoulli distribution in which every sampled bit has a probability of 0.5 to be +1 (or -1). As a result, the expected Hamming distance between any two rows equals K/2, which indicates orthogonality. One limitation is that if K is small, the nearest rows in the sampled targets may be identical, which degrades performance; the solution is then to increase K. In the supplementary material, we show that the Hamming distance between the two nearest rows gets closer to K/2 as K increases. We also generated the targets heuristically with the objective of maximizing the inter-class Hamming distance; this indeed improves performance at lower K, but the improvement at higher K is negligible.
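A small sketch of the random target sampling discussed above, together with a check of the Hamming distance between the two nearest rows; the helper names are hypothetical and this is not the paper's exact generation procedure.

```python
import torch

def random_orthogonal_targets(n_classes: int, nbit: int, seed: int = 0) -> torch.Tensor:
    """Sample (n_classes, nbit) targets with each bit +1/-1 with probability 0.5,
    so the expected pairwise Hamming distance is nbit / 2 (near-orthogonal)."""
    g = torch.Generator().manual_seed(seed)
    return torch.randint(0, 2, (n_classes, nbit), generator=g).float() * 2 - 1

def min_pairwise_hamming(targets: torch.Tensor) -> int:
    """Hamming distance between the two nearest rows, via equation (1)."""
    K = targets.shape[1]
    dist = 0.5 * (K - targets @ targets.t())   # (C, C) pairwise Hamming distances
    dist.fill_diagonal_(float("inf"))          # ignore self-distances
    return int(dist.min().item())

for K in (16, 64, 256):
    t = random_orthogonal_targets(n_classes=100, nbit=K)
    print(K, min_pairwise_hamming(t))          # approaches K/2 as K grows
```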
As the conventional cross-entropy loss only works for single-label classification, we leverage the concept of Label Smoothing labelsmooth2017gabriel to generate labels for multi-label classification. A standard cross-entropy (CE) loss is mathematically formulated as:
L_CE = −(1/N) Σ_i Σ_{c=1}^{N_C} y_{i,c} log p_{i,c}        (10)
in which y_{i,c} = 1 if the c-th class is assigned to the i-th sample in a single-label multiclass classification task. In labelsmooth2017gabriel , the target label becomes a soft target such that each non-target class receives a small "smoothing" value to regularize overconfident samples, and we leverage this concept for multiple labels. To adapt CE for multi-label classification, we set y_{i,c} to a constant if the c-th class is assigned to the i-th sample. The constant is determined such that Σ_c y_{i,c} = 1, e.g., y_{i,2} = y_{i,4} = 0.5 and y_{i,c} = 0 otherwise when the 2nd and the 4th classes are the assigned classes. Our motivation is that the model should maximize the probabilities of the target classes, which optimizes the hash codes to be as similar as possible to the binary targets of the assigned classes (note that we cannot guarantee that the final hash codes are the center of the hash codes of the target classes; instead we let the optimization algorithm find the best hash codes). In our experiments, we found empirically that replacing softmax with sigmoid for multiple labels is not effective (see the supplementary material for details). A likely explanation is that softmax intrinsically suppresses the probability of class units (i.e., scaled cosine similarities) with lower activation and increases the probability of highly activated class units, while sigmoid treats each class unit individually; as a result, maximizing the probability of one class might not minimize the probabilities of the other classes. Therefore, we propose to leverage the concept of Label Smoothing to generate labels so that we can use the cross-entropy loss for learning.
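The following sketch builds the soft multi-label targets described above (each assigned class receives an equal share so that the row sums to 1) and uses them in a soft-target cross entropy; soft_multilabel_targets and soft_cross_entropy are hypothetical helper names, not the paper's code.

```python
import torch
import torch.nn.functional as F

def soft_multilabel_targets(y: torch.Tensor) -> torch.Tensor:
    """y: (B, C) multi-hot labels in {0, 1}, with at least one assigned class
    per sample. Each assigned class gets an equal share so every row sums to 1
    (e.g., two assigned classes -> 0.5 each)."""
    return y.float() / y.float().sum(dim=1, keepdim=True)

def soft_cross_entropy(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Equation (10) with soft targets: -sum_c y_c log p_c, averaged over the batch."""
    log_p = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_p).sum(dim=1).mean()

# Toy usage: the 2nd and 4th classes are assigned -> soft targets of 0.5 each.
y = torch.tensor([[0, 1, 0, 1, 0, 0]])
logits = torch.randn(1, 6)
loss = soft_cross_entropy(logits, soft_multilabel_targets(y))
```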
Although the binary orthogonal targets help with code balancing, since every bit has a 50% chance of being +1 or -1, there is no guarantee that the model will learn to output balanced codes. Therefore, we propose to add a batch normalization (BN) layer after the continuous codes to ensure code balance. If v̂ = BN(v), then P(v̂^(k) ≥ 0) = P(v̂^(k) < 0) = 0.5 for the k-th bit. Because the distribution of v̂ has been normalized to have zero mean and unit variance, with b = sgn(v̂), the hash codes will follow a uniform binary distribution with a 50% chance of being +1 or -1. Empirically, we found that this improves retrieval performance on ImageNet100 by about 17-20% compared with a model trained with the plain cross-entropy loss (see Table 1). Note that the Bi-half method bihalf2021li shifts the continuous codes by their median and then converts them to binary codes for optimization. However, it has to modify the computational graph in order to have a proxy derivative that solves the vanishing gradient problem. In contrast, appending a BN layer does not modify the computational graph, therefore enabling straightforward end-to-end training.
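The toy experiment below illustrates the claim: taking the sign of BatchNorm-normalized codes yields roughly balanced bits even when the raw continuous codes are heavily biased. It is a standalone illustration under our own assumptions, not the training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K, B = 64, 4096

# Biased continuous codes: without BN almost every bit would map to +1.
v = torch.randn(B, K) + 2.0

bn = nn.BatchNorm1d(K, affine=False)
bn.train()
v_hat = bn(v)                       # zero mean, unit variance per bit

bits_raw = torch.sign(v)
bits_bn = torch.sign(v_hat)
print("fraction of +1 without BN:", (bits_raw > 0).float().mean().item())  # ~0.98
print("fraction of +1 with BN:   ", (bits_bn > 0).float().mean().item())   # ~0.5
```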
Training Setup. We select 7 different deep hashing methods for comparison (5 point-wise, 1 pair-wise and 1 triplet-wise). For a fair comparison, we use the same learning rate, Adam optimizer adam2014kingma and number of training epochs for all methods. For SDH-C sdh2015liong , we modified its pair-wise objective into a point-wise objective, while keeping all penalty terms (i.e., quantization loss, bit variance loss and orthogonality on the projection weights).

Datasets. We follow prior works hashnet2017cao ; dpn2020fan ; greedyhash2018su ; dsh2016liu ; jmlh2019shen ; cnnh2014xia ; simultaneous2015lai ; ksh2012liu and choose ImageNet100 imagenet2009deng , NUS-WIDE nuswide2009chua and MS-COCO coco2014lin for the category-level retrieval experiments. For a more practical yet challenging large-scale instance-level retrieval task (i.e., with a tremendous number of classes), we evaluate on the popular GLDv2 gldv22020weyand , ℛOxf and ℛPar roxfparis2018filip .

Architecture. For category-level retrieval, following the settings in hashnet2017cao ; greedyhash2018su ; jmlh2019shen ; dpn2020fan , we use a pre-trained AlexNet alexnet2012krizhevsky as the backbone network initialization. The output of the last fully-connected layer with ReLU (a 4096-dimensional vector) acts as the input to the latent layer; various supervised deep hashing methods are then applied to generate binary codes. The image size is 224×224. For instance-level retrieval, due to the expensive cost of training from scratch, we use the pre-trained model (R50-DELG-GLDv2-clean, https://github.com/tensorflow/models/tree/master/research/delf) from DELG delg2020cao to compute 2048-dimensional global descriptors. We then train a latent layer to compute hash codes, with the global descriptors as inputs. For GLDv2, the input images are 512×512. For ℛOxf and ℛPar, we use 3 scales to produce multi-scale representations; these are L2-normalized and then average-pooled to obtain a single descriptor, as done by delg2020cao . A GLDv2-trained latent layer is used to compute hash codes for the evaluations. Details of the training setups, datasets and architecture can be found in the supplementary material.

Table 1: Category-level retrieval performance (mAP) for 16/32/64/128-bit codes on ImageNet100 (mAP@1K), NUS-WIDE (mAP@5K) and MS COCO (mAP@5K).

| Method | IN100 16 | IN100 32 | IN100 64 | IN100 128 | NUS 16 | NUS 32 | NUS 64 | NUS 128 | COCO 16 | COCO 32 | COCO 64 | COCO 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HashNet hashnet2017cao | 0.343 | 0.480 | 0.573 | 0.612 | 0.814 | 0.831 | 0.842 | 0.847 | 0.663 | 0.693 | 0.713 | 0.727 |
| DTSH dtsh2016wang | 0.442 | 0.528 | 0.581 | 0.612 | 0.816 | 0.836 | 0.851 | 0.862 | 0.699 | 0.732 | 0.753 | 0.770 |
| SDH-C sdh2015liong | 0.584 | 0.649 | 0.664 | 0.662 | 0.763 | 0.792 | 0.816 | 0.832 | 0.671 | 0.710 | 0.733 | 0.742 |
| GreedyHash greedyhash2018su | 0.570 | 0.639 | 0.659 | 0.659 | 0.771 | 0.797 | 0.815 | 0.832 | 0.677 | 0.722 | 0.740 | 0.746 |
| JMLH jmlh2019shen | 0.517 | 0.621 | 0.662 | 0.678 | 0.791 | 0.825 | 0.836 | 0.843 | 0.689 | 0.733 | 0.758 | 0.768 |
| DPN dpn2020fan | 0.592 | 0.670 | 0.703 | 0.714 | 0.783 | 0.818 | 0.838 | 0.842 | 0.668 | 0.721 | 0.752 | 0.773 |
| CSQ csq2020yuan | 0.586 | 0.666 | 0.693 | 0.700 | 0.797 | 0.824 | 0.835 | 0.839 | 0.693 | 0.762 | 0.781 | 0.789 |
| CE | 0.350 | 0.379 | 0.406 | 0.445 | 0.744 | 0.770 | 0.796 | 0.813 | 0.602 | 0.639 | 0.658 | 0.676 |
| CE+BN | 0.533 | 0.586 | 0.612 | 0.617 | 0.801 | 0.814 | 0.823 | 0.825 | 0.697 | 0.721 | 0.729 | 0.726 |
| CE+Bihalf bihalf2021li | 0.541 | 0.630 | 0.661 | 0.662 | 0.802 | 0.825 | 0.836 | 0.839 | 0.674 | 0.728 | 0.755 | 0.757 |
| OrthoCos | 0.583 | 0.660 | 0.702 | 0.714 | 0.795 | 0.826 | 0.842 | 0.851 | 0.690 | 0.745 | 0.772 | 0.784 |
| OrthoCos+Bihalf | 0.562 | 0.656 | 0.698 | 0.711 | 0.804 | 0.834 | 0.846 | 0.852 | 0.690 | 0.746 | 0.775 | 0.782 |
| OrthoCos+BN | 0.606 | 0.679 | 0.711 | 0.717 | 0.804 | 0.836 | 0.850 | 0.856 | 0.709 | 0.762 | 0.787 | 0.797 |
| OrthoArc+BN | 0.614 | 0.681 | 0.709 | 0.714 | 0.806 | 0.833 | 0.850 | 0.856 | 0.708 | 0.762 | 0.785 | 0.794 |
For performance evaluation, we use mean average precision (mAP@R), which is the mean of the average precision scores of the top R retrieved items. Table 1 compares all selected hashing methods and our method (+variants). CE denotes a model trained with cross entropy only; the hash codes are computed from the sign of the continuous codes. CE+BN denotes the CE model with a BN layer bn2015ioffe appended after the latent layer. CE+Bihalf denotes the CE model with a Bi-half bihalf2021li layer appended after the latent layer. OrthoCos denotes a model trained with the cosine margin and binary orthogonal targets. OrthoCos+Bihalf and OrthoCos+BN denote variants of OrthoCos with a Bi-half layer and a BN layer appended, respectively. OrthoArc+BN denotes a variant of OrthoCos+BN trained with the angular margin.

Overall. It can be observed that both OrthoCos+BN and OrthoArc+BN perform better than the recent state of the art, DPN dpn2020fan and CSQ csq2020yuan . On the multi-labeled datasets (i.e., NUS-WIDE and MS COCO), DTSH dtsh2016wang (a triplet based method) performs best on NUS-WIDE with 0.851 and 0.862 for 64- and 128-bit hash codes, followed by our method (e.g., OrthoCos+BN achieves 0.850 and 0.856 in the same settings), while OrthoCos+BN and OrthoArc+BN perform best on MS-COCO with at most a 1% improvement over previous deep hashing methods.

Code Balance. Although the CE models perform the worst in retrieval, appending a BN layer after the latent layer (CE+BN) brings a 5-20% improvement over all settings (datasets and numbers of bits). The Bi-half bihalf2021li layer (zero-median features) has a proxy derivative to learn hash features, hence gaining a further 0.1-4.9% over CE+BN. This indicates that, even without sophisticated training objectives, code balance itself is a very important factor in improving Hamming distance based retrieval. However, OrthoCos+Bihalf does not show a significant improvement over OrthoCos+BN, but is merely comparable with OrthoCos. We thus conclude that our method can achieve code balance without explicitly engineering the computational graph.

Cosine and Angular Margin. In our experiments, we observed that the cosine margin (OrthoCos+BN) slightly outperforms the angular margin (OrthoArc+BN) by about 0.2% on average.
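For reference, here is a generic sketch of how mAP@R can be computed from Hamming distances; it follows a common convention (precision averaged over correct hits within the top R) and is our own assumption rather than the exact evaluation script used for Table 1.

```python
import numpy as np

def mean_average_precision(query_codes, query_labels, db_codes, db_labels, topk):
    """mAP@topk for {-1,+1} codes: rank the database by Hamming distance to each
    query and average the precision at every correct hit within the top-k."""
    K = query_codes.shape[1]
    aps = []
    for q, y in zip(query_codes, query_labels):
        dist = 0.5 * (K - db_codes @ q)          # Hamming distances, equation (1)
        order = np.argsort(dist)[:topk]          # top-k nearest codes
        hits = (db_labels[order] == y).astype(np.float64)
        if hits.sum() == 0:
            aps.append(0.0)
            continue
        precision_at_hit = np.cumsum(hits) / (np.arange(topk) + 1)
        aps.append((precision_at_hit * hits).sum() / hits.sum())
    return float(np.mean(aps))

# Toy usage with random codes and single labels.
rng = np.random.default_rng(0)
db = rng.choice([-1.0, 1.0], size=(1000, 64)); db_y = rng.integers(0, 10, 1000)
qr = rng.choice([-1.0, 1.0], size=(20, 64));   qr_y = rng.integers(0, 10, 20)
print(mean_average_precision(qr, qr_y, db, db_y, topk=100))
```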
Table 2: Instance-level retrieval performance for 128/512/2048-bit codes on GLDv2 (mAP@100), ℛOxf-Hard (mAP@all) and ℛPar-Hard (mAP@all).

| Method | GLDv2 128 | GLDv2 512 | GLDv2 2048 | ℛOxf-Hard 128 | ℛOxf-Hard 512 | ℛOxf-Hard 2048 | ℛPar-Hard 128 | ℛPar-Hard 512 | ℛPar-Hard 2048 |
|---|---|---|---|---|---|---|---|---|---|
| HashNet hashnet2017cao | 0.018 | 0.069 | 0.111 | 0.034 | 0.058 | 0.307 | 0.133 | 0.190 | 0.490 |
| DPN dpn2020fan | 0.021 | 0.089 | 0.133 | 0.053 | 0.184 | 0.303 | 0.224 | 0.399 | 0.562 |
| GreedyHash greedyhash2018su | 0.029 | 0.108 | 0.144 | 0.032 | 0.251 | 0.373 | 0.128 | 0.531 | 0.652 |
| CSQ csq2020yuan | 0.023 | 0.086 | 0.114 | 0.093 | 0.284 | 0.398 | 0.245 | 0.541 | 0.649 |
| OrthoCos+BN | 0.035 | 0.111 | 0.147 | 0.184 | 0.359 | 0.447 | 0.416 | 0.608 | 0.669 |
| R50-DELG-H | - | - | 0.125* | - | - | 0.471 | - | - | 0.682 |
| R50-DELG-C | - | - | 0.138* | - | - | 0.510 | - | - | 0.715 |
For the evaluation metrics, we adapt the evaluation protocol of delg2020cao ; roxfparis2018filip . The baseline performances on GLDv2, ℛOxf-Hard and ℛPar-Hard from the pre-trained R50-DELG-GLDv2-clean are 0.138, 0.510 and 0.715 respectively. Table 2 summarizes the performance of different deep hashing methods and our method. For all 3 datasets, our method outperforms all previous deep hashing methods at all code lengths. This suggests that our method has a better generalization ability on unseen instances than previous deep hashing methods. In particular, our model significantly outperforms previous deep hashing models by 0.6%, 9.1% and 17.1% respectively on the 3 datasets with 128-bit hash codes.

Orthogonal Transformation. For GLDv2 with 2048-bit hash codes, surprisingly, our method achieves a much better performance than the pre-trained 2048-dimensional descriptors (a 1.1% improvement over R50-DELG-C). We then analyze the separability in cosine distances, i.e., the difference between the mean intra-class cosine distance and the mean inter-class cosine distance, before and after the transformation (similar to Figure 3). We observe that the separability in cosine distances increases after the orthogonal transformation, from 0.142 to 0.167. The results thus show that learning orthogonal hash codes can transform the inputs to be more discriminative.

Domain shifting with BN. As the model is trained on GLDv2, the running mean and variance in the BN layer might suffer from a domain shift problem abn2018li when testing directly on different datasets (e.g., ℛOxf and ℛPar). We empirically found that using the running mean and variance from GLDv2 leads to a large performance drop in Hamming distance retrieval (see the supplementary material for details). One simple solution is to recompute the mean and variance from all continuous codes in the database, then update the BN running mean and variance with the computed values. The performances on ℛOxf and ℛPar in Table 2 are obtained with the running mean and variance of the respective database.

Histogram of Hamming Distances. Figure 3 shows histograms of intra-class and inter-class Hamming distances. We compare our method, OrthoCos+BN, with the pair-wise method HashNet hashnet2017cao and the point-wise classification based GreedyHash greedyhash2018su . Although the distributions of inter-class distances are about the same for all 3 methods (close to a Hamming distance of K/2), we can see that the larger the separability, i.e., the difference between the mean intra-class distance (the blue dotted line) and the mean inter-class distance (the orange dotted line), the better the performance.
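A sketch of the BN statistics recalibration described in the "Domain shifting with BN" paragraph above: recompute the mean and variance of the continuous codes over the target database and overwrite the BN layer's running statistics before binarizing. The attribute names follow PyTorch's BatchNorm1d; the surrounding pipeline and the helper name are assumed.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def recalibrate_bn(bn: nn.BatchNorm1d, db_continuous_codes: torch.Tensor) -> None:
    """Replace the training-set running statistics with those of the target
    database (e.g., ROxf or RPar) to counter the domain shift before binarization."""
    bn.running_mean.copy_(db_continuous_codes.mean(dim=0))
    bn.running_var.copy_(db_continuous_codes.var(dim=0, unbiased=False))

# Toy usage: pretend the database codes come from a shifted distribution.
K = 64
bn = nn.BatchNorm1d(K, affine=False).eval()
db_v = torch.randn(10000, K) * 1.5 + 0.7      # shifted continuous codes
recalibrate_bn(bn, db_v)
db_codes = torch.sign(bn(db_v))               # balanced codes under the new stats
```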
Performance Improvement Analysis. We further analyze the reasons behind the performance improvements of different deep hashing methods, and summarize the results in Figure 4. We identify 3 main factors that contribute to the improvement of deep hashing methods: i) the quantization error; ii) the separability in Hamming distances; and iii) the orthogonality of the hash centers. For quantization error, we measure the angle between the continuous codes v and the hash codes b, i.e., θ = arccos(v^T b / (‖v‖ ‖b‖)). For separability, we measure the difference between the mean inter-class distance and the mean intra-class distance. For orthogonality, we first compute the hash centers for every class (by taking the sign of the average hash code in every class), then measure their deviation from orthogonality (lower is better). When the quantization error reduces, the separability increases and the hash centers have better orthogonality, resulting in better performance.
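The sketch below computes the three diagnostics in the spirit of the analysis above: the mean quantization angle between v and b, the separability of Hamming distances, and an orthogonality score for the class-wise hash centers. The exact formulas, in particular the orthogonality score, are our own assumptions since the original expressions are not reproduced here.

```python
import numpy as np

def hashing_diagnostics(v: np.ndarray, labels: np.ndarray):
    """v: (N, K) continuous codes; labels: (N,) class ids. Returns
    (mean quantization angle in degrees, separability, orthogonality score)."""
    b = np.where(v >= 0, 1.0, -1.0)
    K = v.shape[1]

    # i) quantization error as the angle between v and b.
    cos = np.sum(v * b, axis=1) / (np.linalg.norm(v, axis=1) * np.sqrt(K))
    quant_angle = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

    # ii) separability: mean inter-class minus mean intra-class Hamming distance.
    dist = 0.5 * (K - b @ b.T)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    diff = ~same
    np.fill_diagonal(diff, False)
    separability = float(dist[diff].mean() - dist[same].mean())

    # iii) orthogonality of hash centers: deviation of the normalized Gram matrix
    #      of the class centers from the identity (lower is better).
    centers = np.stack([np.where(b[labels == c].mean(axis=0) >= 0, 1.0, -1.0)
                        for c in np.unique(labels)])
    gram = centers @ centers.T / K
    ortho = float(np.abs(gram - np.eye(len(centers))).mean())
    return quant_angle, separability, ortho

# Toy usage.
rng = np.random.default_rng(0)
v = rng.normal(size=(500, 64)); y = rng.integers(0, 10, 500)
print(hashing_diagnostics(v, y))
```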
We propose to unify the training objectives of deep hashing under a single classification objective. We show this can be achieved by maximizing the cosine similarity between the continuous codes and a binary orthogonal target under a cross-entropy loss. To that end, we first reformulated the problem of deep hashing through the lens of cosine similarity. We then demonstrated that if we perform L2-normalization on the continuous codes, end-to-end training of deep hashing is possible without any extra sophisticated constraints. Moreover, we leverage the concept of Label Smoothing to train multi-label classification with the cross-entropy loss, and batch normalization for code balancing. Extensive experiments validated the effectiveness of our method on both category-level and instance-level retrieval benchmarks. As future work, we are exploring how to learn better feature representations through unsupervised learning to improve retrieval performance with hash codes.
Hashing remains a key bottleneck in practical deployments of large-scale retrieval systems. Recent deep hashing frameworks have shown great promise in learning codes that are both compact and discriminative. Yet state-of-the-art frameworks are known to be difficult to train and to reproduce, largely owing to their complex loss designs that dictate hyperparameter tuning and multi-stage training. In this work, we set out to change that: we attempt to unify deep hashing under a single objective, thereby simplifying training and helping reproducibility. Our key intuition lies in reformulating hashing through the lens of cosine similarity. We report competitive hashing performance on all common datasets, and significant improvements over the state of the art on the more challenging task of instance-level retrieval.

This research is partly supported by the Fundamental Research Grant Scheme (FRGS) MoHE Grant FP021-2018A from the Ministry of Education Malaysia. We also thank Kilho Shin for helpful discussions and recommendations.
, pages 1229–1237, 2018.Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
, pages 380–388, 2002.Arcface: Additive angular margin loss for deep face recognition.
In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20
, pages 825–831. International Joint Conferences on Artificial Intelligence Organization, 7 2020. Main track.Approximate nearest neighbors: Towards removing the curse of dimensionality.
In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, page 604–613, New York, NY, USA, 1998. Association for Computing Machinery.Proceedings of the 32nd International Conference on Machine Learning
, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.