Collaborative Learning for Extremely Low Bit Asymmetric Hashing

09/25/2018 · Yadan Luo et al., The University of Queensland

Extremely low bit (e.g., 4-bit) hashing is in high demand for retrieval and network compression, yet it can hardly guarantee manageable convergence or performance due to severe information loss and the shrinkage of the discrete solution space. In this paper, we propose a novel Collaborative Learning strategy for high-quality low-bit deep hashing. The core idea is to distill bit-specific representations for low-bit codes with a group of hashing learners, where hash codes of various lengths actively interact by sharing and accumulating knowledge. To achieve this, an asymmetric hashing framework with two variants of multi-head embedding structures is derived, termed Multi-head Asymmetric Hashing (MAH), leading to great efficiency of training and querying. Multiple views from different embedding heads provide supplementary guidance as well as regularization for extremely low bit hashing, hence making convergence faster and more stable. Extensive experiments on three benchmark datasets have been conducted to verify the superiority of the proposed MAH, and show that 8-bit hash codes generated by MAH achieve 94.4% MAP, surpassing the performance of 48-bit codes produced by state-of-the-art methods for image retrieval.




1 Introduction

To diminish the computational and storage cost and make optimal use of rapidly emerging multimedia data, hashing [1, 2] has attracted much attention from the machine learning community, with wide applications in information retrieval [3], person re-identification [4, 5], network compression [6, 7, 8], etc. The goal of hashing is to transform original data structures and semantic affinity into compact binary codes, thereby substantially accelerating computation with efficient XOR operations and saving storage.
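To make the efficiency claim concrete, the following sketch (plain Python, with hypothetical 8-bit codes) shows how the Hamming distance between two binary codes reduces to a single XOR followed by a popcount:

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary codes packed into integers:
    XOR marks the differing bits, popcount counts them."""
    return bin(a ^ b).count("1")

# Two hypothetical 8-bit hash codes
query_code = 0b10110010
db_code = 0b10011010
print(hamming_distance(query_code, db_code))  # -> 2
```

In hardware, the popcount is a single instruction on most modern CPUs, which is what makes binary-code retrieval so much cheaper than floating-point distance computation.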

There are mainly two branches of hashing, i.e., data-independent hashing and data-dependent hashing. For data-independent hashing, such as Locality Sensitive Hashing [9], no prior knowledge (e.g., supervised information) about the data is available, and hash functions are randomly generated. Nonetheless, huge storage and computational overhead may be incurred, since more than 1,000 bits are usually required to achieve acceptable performance. To address this problem, researchers have turned to data-dependent hashing, which leverages information inside the data itself. Roughly, data-dependent hashing can be divided into two categories: unsupervised hashing (e.g., Iterative Quantization (ITQ) [10]), and (semi-)supervised hashing (e.g., Supervised Hashing with Kernels (KSH), Supervised Discrete Hashing (SDH) [11], Supervised Hashing with Latent Factor Models (LFH) [12], Column Sampling based Discrete Supervised Hashing (COSDISH) [13] and Semi-Supervised Hashing (SSH) [14]). In general, supervised hashing usually achieves better performance than unsupervised methods because supervised information (e.g., semantic labels and/or pair-wise data relationships) helps to better explore intrinsic data properties, thereby generating superior hash codes and hash functions.
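As an illustration of data-independent hashing, a minimal random-projection LSH sketch (hypothetical dimensions; NumPy assumed) generates hash functions without touching any training data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 16                        # feature dim and code length (hypothetical)
W = rng.standard_normal((d, n_bits))      # random hyperplanes, drawn once, data-independent

def lsh_hash(x: np.ndarray) -> np.ndarray:
    """One bit per random hyperplane: the sign of the projection."""
    return (x @ W > 0).astype(np.uint8)

x = rng.standard_normal(d)
code = lsh_hash(x)
print(code.shape)  # (16,)
```

Because the hyperplanes carry no information about the data distribution, many such bits are needed for acceptable recall, which is exactly the overhead the paragraph above refers to.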

With the rapid development of deep learning techniques, deep hashing [15, 16, 17, 18, 19] trained in an end-to-end scheme has been proposed. From the perspective of training strategy, deep hashing can be roughly grouped into two categories: symmetric and asymmetric deep hashing. By assuming that query and database samples share the same distribution, symmetric deep hashing [16, 20, 21] leverages a single network to preserve pair-wise or triplet-wise neighbor affinity, which inevitably results in high complexity, i.e., O(n²) or even O(n³), where n denotes the number of database points. Asymmetric deep hashing treats query samples and database samples separately based on the asymmetric theory [22]. Deep asymmetric pair-wise hashing (DAPH) [23] utilizes two distinct mappings to capture variances and discrepancies between the query and database sets, while asymmetric deep supervised hashing (ADSH) [17] learns a hash function only for query points, thus reducing the training time complexity.

However, most existing deep hashing models can hardly guarantee manageable convergence or performance for extremely low bit hash codes (e.g., 4-bit), even though such codes have, in principle, the capacity to convey sufficient semantics; this is mainly due to severe information loss and the shrinkage of the discrete solution space. For example, considering the CIFAR-10 database comprising images of ten classes, the theoretical minimum number of bits required to represent the full semantics is ⌈log₂ 10⌉ = 4, yet the 4-bit hash codes generated by ADSH only achieve a 42.69% MAP score (see Table I), which is far from satisfactory. Besides, to meet performance and storage requirements, it is inevitable and tedious to adjust the default code length and re-tune network hyper-parameters multiple times in practice. The bit-scalable deep hashing method (DRSCH) [18] was proposed to learn hash codes of variable length by unequally weighting each bit and then truncating insignificant bits. Nevertheless, DRSCH is still suboptimal, as it ignores the distribution adaptation needed to fit the vertices of the Hamming hypercube and fails to suppress information loss and quantization error.
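The minimum-bit argument above is just a base-2 logarithm; for instance:

```python
import math

def min_bits(num_classes: int) -> int:
    """Theoretical minimum code length that gives each class a distinct binary code."""
    return math.ceil(math.log2(num_classes))

print(min_bits(10))  # CIFAR-10: 10 classes -> 4 bits
```

This lower bound only guarantees distinguishable codes, not preserved similarity structure, which is why 4-bit hashing remains hard in practice.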

Motivated by the aforementioned observations and analyses, in this paper we propose a novel collaborative learning strategy for extremely low bit (e.g., 4-bit) hashing by simultaneously learning a group of hash codes with various lengths (e.g., {c−, c, c+}-bit). Different from conventional knowledge distillation that adopts a student-teacher training strategy [24], collaborative learning distills the bit-specific knowledge in a unified framework and squeezes redundant bits out. To achieve this goal, Multi-head Asymmetric Hashing (MAH) is derived, which is a deep asymmetric hashing framework equipped with a multi-head embedding structure. As shown in Figure 1, two variants (i.e., flat and cascaded) of multi-head structures are explored. The flat one explicitly guides the low-bit embedding branch with multiple supplementary views, while the cascaded one adapts the distribution of low-bit hashing based on the consensus of the other hashing learners. The multi-head structure benefits extremely low bit hashing from two perspectives: 1) it enables the shared intermediate layers to aggregate the gradient flows from all heads and the penultimate layer to select bit-specific representations, adapting the feature distribution to compensate for information loss; 2) multiple views from different embedding heads on the same training sample provide regularization for extremely low bit hashing, thereby making its convergence faster and more stable. Our main contributions are summarized as follows:

  • We propose a novel collaborative learning strategy for deep hashing, aiming to distill knowledge for low-bit hash codes from a group of hashing learners and gain a performance boost. To the best of our knowledge, this is the first work to introduce model distillation to address the code compression problem of deep hashing.

  • Two variants of multi-head structures are derived to efficiently enhance the supervision on the shared intermediate layers and benefit bit-specific representation learning. Besides, a group of hash codes with various lengths is jointly learned, which may suit different platforms without extra inference cost or network re-tuning.

  • Experiments on three benchmark datasets demonstrate that the proposed MAH significantly outperforms existing deep hashing methods, especially on low-bit retrieval tasks, while reducing training time and storage cost.

The rest of the paper is organized as follows. Section 2 presents a brief review of related work, and Section 3 introduces the details of our hashing learning framework. The experimental results, comparisons and component analysis are presented in Section 4, followed by the conclusion and future work plan in Section 5.

2 Related Work

In this section, we mainly introduce three aspects relevant to our work: scalable hashing, knowledge distillation and asymmetric hashing. We also explain the differences between our work and existing works.

2.1 Scalable Hashing

Traditional hashing learns hash codes with a default code length (e.g., 64-bit), which highly restricts practical flexibility and scalability. For example, low-bit hash codes suit devices with limited computational resources well, while high-bit hash codes are usually applied on high-performance servers for higher accuracy. Therefore, it is inevitable and tedious for engineers to adjust the default code length and re-tune network hyper-parameters to meet the performance and storage requirements. To address this issue, asymmetric cyclical hashing [25] was proposed to match hash codes of different lengths for query and database images via a weighted Hamming distance. With the development of deep learning techniques, an end-to-end bit-scalable deep hashing framework [18] was proposed, which learns hash codes of variable length by unequally weighting each bit and then truncating insignificant bits. Nevertheless, DRSCH is still suboptimal, as it ignores the distribution adaptation needed to fit the vertices of the Hamming hypercube and fails to suppress information loss and quantization error.

Fig. 1: The proposed framework of MAH. Based on different assumptions, two variants of multi-head structures are exploited for collaborative learning, i.e., the flat multi-head and the cascaded multi-head. Notably, c−, c and c+ indicate different code lengths in ascending order. Our objective is to learn the shortest c−-bit hash codes to achieve optimal performance with the auxiliary embeddings of c and c+ bits.

2.2 Knowledge Distillation

To save the computational cost of inference under various settings, several knowledge distillation strategies have been explored for classification. While general knowledge distillation [24] requires two stages, that is, pre-training a large, highly regularized model first and then teaching the smaller model, two-way distillation [26] leverages an ensemble of students that learn mutually and teach each other throughout the training process. Nevertheless, because it uses the Kullback-Leibler (KL) divergence to constrain the consensus of predictions and weights across different networks, two-way distillation is ill-suited to the collaborative hashing problem, as it ignores the selection of bit-specific features and may lead to severe information loss during quantization.

2.3 Asymmetric Hashing

Asymmetric hashing can be grouped into two types: dual projection based [23, 27] and sampling based [17]. Dual projection methods aim to capture the distribution differences between database points and query points by learning two distinct hash functions, so that the original data relationships can be well preserved. Rather than learning full pairwise relationships (O(n²)) or triplet relationships (O(n³)) among n dataset points, sampling-based methods select m anchors (m ≪ n) to approximate the query set and construct an asymmetric affinity to supervise learning, which significantly reduces the training time complexity to O(mn).


3 Methodology

3.1 Problem Definition

Without loss of generality, we focus on extremely low bit hashing for the image retrieval task with pair-wise supervision. We assume that there are m query data points denoted as X^q = {x_i^q}_{i=1}^{m} and n database points denoted as X^d = {x_j^d}_{j=1}^{n}. Furthermore, pairwise supervised information between X^q and X^d is provided as S ∈ {−1, +1}^{m×n}, where S_ij = 1 if x_i^q and x_j^d are similar, otherwise S_ij = −1. The goal of conventional deep hashing is to learn a nonlinear hash function and generate hash codes B^q ∈ {−1, +1}^{m×c} for query points and B^d ∈ {−1, +1}^{n×c} for database points with minimum information loss, where c is the hash code length. Different from existing deep hashing methods, which learn fixed-length binary codes, we explore how to jointly optimize hash codes with various lengths, i.e., {c−, c, c+}, where c−, c, c+ denote the extremely low, low and anchor lengths of the binary codes, respectively. Our objective is to learn the shortest (c−-bit) hash codes that achieve the best retrieval performance among the learner group.
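A small sketch of how the pairwise supervision S can be built from (hypothetical) multi-hot label matrices, following the convention used later in the experiments that sharing at least one label means similar:

```python
import numpy as np

def pairwise_similarity(query_labels: np.ndarray, db_labels: np.ndarray) -> np.ndarray:
    """S[i, j] = 1 if query i and database point j share at least one label, else -1."""
    shared = query_labels @ db_labels.T          # counts of shared labels per pair
    return np.where(shared > 0, 1, -1).astype(np.int8)

# Toy multi-hot labels: 2 queries and 3 database points over 3 concepts
Yq = np.array([[1, 0, 0],
               [0, 1, 1]])
Yd = np.array([[1, 0, 1],
               [0, 0, 1],
               [0, 1, 0]])
print(pairwise_similarity(Yq, Yd))  # [[ 1 -1 -1], [ 1  1  1]]
```

For single-label datasets the same function applies with one-hot rows, where "shared label" reduces to "same class".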

3.2 Model Formulation

The general framework of the proposed MAH is shown in Figure 1.

3.2.1 Discrete Feature Learning

The whole end-to-end architecture for feature learning is mainly based on the deep residual neural network [28], which accepts pair-wise inputs, i.e., a database image and a query image. To transform the latent Euclidean space into the Hamming space, we replace the top softmax classifier layer of the vanilla ResNet with the multi-head embedding structure that will be elaborated later. Please note that feature learning is performed only for query points, not for database points.

3.2.2 Multi-head Embedding

In this work, we construct two variants of multi-head structures, i.e., the flat multi-head and the cascaded multi-head, to implement collaborative learning based on different assumptions. Inheriting the advantages of two-way distillation, the flat multi-head increases the posterior entropy of the low-bit branch, which helps it converge to a more robust and flatter minimum with complementary views. Differently, by leveraging the consensus of the long-bit learners, which convey more of the original data structure and semantics, the cascaded multi-head adjusts the manifold layer by layer to approximate the vertices of the target Hamming hypercube, consequently offsetting information loss with distribution adaptation. For simplicity, we construct only three heads to validate multi-head embedding, which can be further generalized to an n-head structure.
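The two topologies can be sketched as follows (NumPy stand-ins for the embedding heads; the dimensions and the exact wiring of the paper's heads are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c_minus, c_mid, c_plus = 32, 4, 8, 12   # shared feature dim and head code lengths

# Flat multi-head: every head maps the shared feature directly.
flat_W = {k: rng.standard_normal((d, k)) for k in (c_minus, c_mid, c_plus)}

def flat_forward(x):
    return [np.tanh(x @ flat_W[k]) for k in (c_minus, c_mid, c_plus)]

# Cascaded multi-head: each shorter head is fed by the next longer one,
# adapting the distribution layer by layer.
W_plus = rng.standard_normal((d, c_plus))
W_mid = rng.standard_normal((c_plus, c_mid))
W_min = rng.standard_normal((c_mid, c_minus))

def cascaded_forward(x):
    h_plus = np.tanh(x @ W_plus)
    h_mid = np.tanh(h_plus @ W_mid)
    h_min = np.tanh(h_mid @ W_min)
    return [h_min, h_mid, h_plus]

x = rng.standard_normal(d)
print([h.shape for h in cascaded_forward(x)])  # [(4,), (8,), (12,)]
```

In the flat variant all heads see the same shared feature, so the low-bit head is guided only through the shared layers; in the cascaded variant the low-bit head additionally inherits the already-adapted representation of the longer heads.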

3.2.3 Objective Formulation for MAH

To learn hash codes that maximally preserve the semantics and affinity, objectives are usually formulated to minimize the loss between the pair-wise supervision S_ij and the inner product of the query-database binary code pairs b_i^q and b_j^d:

min_{Θ, B^d} Σ_{i=1}^{m} Σ_{j=1}^{n} ( sign(F(x_i^q; Θ))ᵀ b_j^d − c·S_ij )²,    (1)

where sign(·) denotes the signum function, F(x_i^q; Θ) is the output of the penultimate layer, and Θ is the hyper-parameter of the neural network to be learned. However, there exists an ill-posed gradient problem in Equation (1) caused by the non-smooth sign(·) function, whose gradient is zero for all nonzero inputs, making standard back-propagation infeasible. Therefore we adopt the tanh(·) function to approximate it and apply a further optimization strategy, so Equation (1) can be rewritten as:

min_{Θ, B^d} Σ_{i=1}^{m} Σ_{j=1}^{n} ( tanh(F(x_i^q; Θ))ᵀ b_j^d − c·S_ij )².    (2)
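A NumPy sketch of the relaxed single-head objective of Equation (2), with toy arrays standing in for the network outputs (U_raw), database codes (B) and supervision (S):

```python
import numpy as np

def relaxed_loss(U_raw: np.ndarray, B: np.ndarray, S: np.ndarray, c: int) -> float:
    """Sum over (i, j) of (tanh(u_i)^T b_j - c * S_ij)^2: the tanh-relaxed
    asymmetric inner-product objective."""
    U = np.tanh(U_raw)                    # smooth surrogate for sign(.)
    return float(np.sum((U @ B.T - c * S) ** 2))

rng = np.random.default_rng(0)
c = 4                                              # code length (hypothetical)
U_raw = rng.standard_normal((5, c))                # query network outputs
B = np.sign(rng.standard_normal((8, c)))           # database codes in {-1, +1}
S = np.sign(rng.standard_normal((5, 8)))           # toy similarity in {-1, +1}
print(relaxed_loss(U_raw, B, S, c) >= 0.0)  # -> True
```

The target value c·S_ij is the inner product of two identical (or fully opposite) c-bit codes, so driving the relaxed inner product towards it pushes similar pairs to agree on every bit.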
In practice, we might only be given a set of database points without query points. In this case, we can randomly sample m data points from the database to construct the query set following [17]. More specifically, we set X^q = X^d_Ω and B^q = B^d_Ω, where Ω denotes the set of m sampled indices. The objective function can be rewritten as follows,

min_{Θ, B^d} Σ_{i∈Ω} Σ_{j=1}^{n} ( tanh(F(x_i^d; Θ))ᵀ b_j^d − c·S_ij )² + γ Σ_{i∈Ω} ( b_i^d − tanh(F(x_i^d; Θ)) )²,    (3)

where γ is a constant parameter. In real applications, if we are given both the query and database sets, we use the problem defined in Equation (2) for training MAH; otherwise Equation (3) is used as the objective.

In this case, MAH treats database and query data in an asymmetric way, based on the assumption that the distributions of the two sets are not exactly the same. The sampling strategy not only avoids overfitting on the training dataset but also reduces the training complexity, thus significantly improving its robustness and practicality.
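A minimal sketch of the sampling step (hypothetical sizes): the sampled index set Ω defines the query set, whose codes are simply tied to the corresponding database codes during training:

```python
import numpy as np

rng = np.random.default_rng(0)
n_db, m = 1000, 64                                  # database size and sample size (hypothetical)
omega = rng.choice(n_db, size=m, replace=False)     # sampled index set Omega

# The m sampled points serve as queries; B^q is B^d restricted to Omega.
B_d = np.sign(rng.standard_normal((n_db, 8)))       # toy 8-bit database codes
B_q = B_d[omega]
print(B_q.shape)  # (64, 8)
```

Each outer training round can re-draw Ω, so over time many database points take a turn acting as queries without ever requiring a held-out query set.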

Regarding the proposed multi-head embedding structure, we modify the core loss function to collaboratively learn binary codes B_{c−}, B_c, B_{c+} with c−, c, c+ code lengths, respectively. Notably, we aim to learn the lowest c−-bit hash codes to achieve the optimal retrieval performance:

L = L_{c−}(Θ) + λ₁ L_c(Θ_c) + λ₂ L_{c+}(Θ_{c+}),    (4)

where each per-head term takes the form of Equation (3) with the corresponding code length, Θ_c and Θ_{c+} denote the hyper-parameters of the networks including the c-bit and c+-bit embedding heads respectively, and λ₁ and λ₂ are the coefficients for balancing the multi-head loss.

3.3 Optimization

To achieve the above objectives, we need to learn the parameters of the neural networks and infer the database hash codes B^d, yet solving for the hash codes directly is NP-hard due to the discrete constraints. Inspired by discrete cyclic coordinate descent (DCC), the binary codes can be learned bit by bit iteratively.

3.3.1 Learn Θ with B Fixed

In order to update the neural network parameter Θ with B fixed, we use back-propagation (BP) for gradient calculation. Specifically, we sample a mini-batch of the query points, then update the parameter based on the sampled data. For clarity and simplicity, we write z_i = F(x_i^d; Θ) and u_i = tanh(z_i), respectively. Therefore, we can calculate the gradient as follows,

∂L/∂u_i = 2 Σ_{j=1}^{n} ( u_iᵀ b_j^d − c·S_ij ) b_j^d + 2γ ( u_i − b_i^d ).    (5)

We use the chain rule to compute ∂L/∂Θ based on ∂L/∂u_i, and the BP algorithm is used to update Θ. Then Θ_c and Θ_{c+} are updated asynchronously in a similar way to Equation (5).

3.3.2 Learn B with Θ Fixed

Firstly we target optimizing B while fixing all other variables, and rewrite Equation (3) as follows,

min_{B} ‖ U Bᵀ − c·S ‖²_F + γ ‖ B_Ω − U ‖²_F   s.t.  B ∈ {−1, +1}^{n×c},    (6)

where U = [u_i]_{i∈Ω} ∈ ℝ^{m×c}, and B_Ω denotes the binary codes for the database points indexed by Ω. We define Q = c·SᵀU + γ·Ū, where Ū ∈ ℝ^{n×c} is defined row-wise as ū_j = u_j if j ∈ Ω and ū_j = 0 otherwise, so that Equation (6) is equivalent (up to constants) to

min_{B} ‖ U Bᵀ ‖²_F − 2 tr( B Qᵀ )   s.t.  B ∈ {−1, +1}^{n×c}.    (7)

As B is discrete and the problem is non-convex, we choose to learn the binary codes by the discrete cyclic coordinate descent (DCC) method. In other words, we learn B bit by bit. Let b^k denote the k-th column of B, and B̂ denote the matrix of B excluding the k-th column. Let u^k denote the k-th column of U, and Û denote the matrix of U excluding the k-th column. Let q^k denote the k-th column of Q, and Q̂ denote the matrix of Q excluding the k-th column. To optimize b^k, we can calculate the objective function,

min_{b^k} ( b^k )ᵀ ( B̂ Ûᵀ u^k − q^k )   s.t.  b^k ∈ {−1, +1}^{n}.    (8)

Consequently, the optimal solution can be achieved,

b^k = sign( q^k − B̂ Ûᵀ u^k ).    (9)

With a similar calculation, the solutions for the columns of B_c and B_{c+} can be formed as,

b_c^k = sign( q_c^k − B̂_c Û_cᵀ u_c^k ),   b_{c+}^k = sign( q_{c+}^k − B̂_{c+} Û_{c+}ᵀ u_{c+}^k ),    (10)

where U_c, U_{c+} denote the relaxed outputs of the c-bit and c+-bit embedding heads, and Q_c, Q_{c+} are defined analogously to Q. The complete training procedure for the proposed MAH is described by Algorithm 1.

0:  X^d: data points; S: supervised similarity matrix; {c−, c, c+}: binary code lengths.
0:  B: binary codes for the database; Θ: neural network parameters;
1:  Initialize Θ, B, the batch size, the number of epochs, and the number of sampled query sets.
2:  for each sampled query set do
3:     Randomly sample the index set Ω and set X^q = X^d_Ω, S^q = S_Ω;
4:     for each epoch do
5:        for each mini-batch do
6:           Construct a mini-batch and calculate z_i and u_i for each data point in the mini-batch by forward propagation.
7:           Calculate the gradients with respect to Θ according to Eq. (5) and update them by using back propagation.
8:        end for
9:        for k = 1 to c+ do
10:           Update the k-th column of B_{c+} according to Eq. (10)
11:        end for
12:        for k = 1 to c− do
13:           Update the k-th column of B_{c−} according to Eq. (9)
14:        end for
15:        for k = 1 to c do
16:           Update the k-th column of B_c according to Eq. (10)
17:        end for
18:     end for
19:  end for
20:  return B and Θ;
Algorithm 1 Pseudocode of optimizing our MAH
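The bit-by-bit update at the heart of the optimization can be sketched in NumPy for a simplified single-head objective ‖U Bᵀ − c·S‖²_F (dropping the γ term; all sizes are toy assumptions). Each column receives its closed-form sign minimizer while the other columns are held fixed, so every sweep can only decrease the loss:

```python
import numpy as np

def dcc_update(U: np.ndarray, B: np.ndarray, S: np.ndarray, c: int, sweeps: int = 3):
    """DCC-style coordinate descent on B for ||U B^T - c*S||_F^2.
    Each bit column gets its closed-form {-1,+1} minimizer in turn."""
    for _ in range(sweeps):
        for k in range(B.shape[1]):
            u_k = U[:, k]
            R = U @ B.T - c * S                         # residual including bit k
            q = u_k @ R - (u_k @ u_k) * B[:, k]         # remove bit k's own contribution
            new_col = -np.sign(q)
            new_col[new_col == 0] = B[new_col == 0, k]  # keep current bit on ties
            B[:, k] = new_col
    return B

rng = np.random.default_rng(0)
c = 4
U = np.tanh(rng.standard_normal((16, c)))               # relaxed query outputs
S = np.sign(rng.standard_normal((16, 50)))              # toy supervision
B = np.sign(rng.standard_normal((50, c)))               # random initial codes

before = np.sum((U @ B.T - c * S) ** 2)
B = dcc_update(U, B, S, c)
after = np.sum((U @ B.T - c * S) ** 2)
print(after <= before)  # -> True
```

The per-column coefficient q here is algebraically the same quantity as q^k − B̂ Ûᵀ u^k in Equation (9), computed from the full residual by subtracting bit k's own contribution.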

3.4 Out-of-Sample Extension

After training MAH, the learned deep neural networks can be applied to generate compact binary codes for query points, including query points unseen (e.g., x_q) during training. Specifically, we can use the following equation for x_q,

b_q = sign( F(x_q; Θ) ).    (11)
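The out-of-sample rule is just a sign over the network output; a sketch with a linear map standing in for the learned network F(·; Θ) (hypothetical dimensions and weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 32, 8
W = rng.standard_normal((d, c))      # stand-in for the trained embedding head

def encode(x: np.ndarray) -> np.ndarray:
    """Out-of-sample extension: binarize the network output with sign(.)."""
    code = np.sign(x @ W)
    code[code == 0] = 1              # map exact zeros to +1 by convention
    return code

x_q = rng.standard_normal(d)         # an unseen query point
print(encode(x_q))                   # an 8-dim code in {-1, +1}
```

No re-training or database-side computation is needed at query time; only a single forward pass and a sign operation.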
3.5 Complexity Analysis

For each epoch, the time cost is analyzed as follows. The computation of the gradient with respect to Θ in Equation (5) costs O(mnc). To apply the DCC algorithm for updating B, Q in Equation (7) has to be calculated at a cost of O(mnc). As to the optimization of the sub-problems in Equations (9) and (10), the time cost is O(c²(m + n)) per sweep. In practice, m and the code lengths are much smaller than the database size n. Hence the overall computational complexity of the algorithm is linear in n.

4 Experiments

                     CIFAR-10                                  NUS-WIDE                                 MIRFlickr
Method    4 bits 6 bits 8 bits 10 bits 12 bits   4 bits 6 bits 8 bits 10 bits 12 bits   4 bits 6 bits 8 bits 10 bits 12 bits
LSH       11.28  11.71  12.64  12.13   12.09     45.65  45.89  43.76  44.38   44.11     54.26  56.73  55.16  57.16   56.49
ITQ       18.22  18.00  19.39  19.76   20.16     62.56  67.07  69.06  70.39   70.82     71.56  71.69  71.46  72.54   73.13
SDH       34.79  43.23  44.83  52.49   56.01     62.92  65.98  68.53  73.33   71.73     76.09  78.14  80.15  80.38   81.58
KSH       34.81  40.98  44.96  46.23   47.22     59.19  59.76  59.85  59.40   59.66     66.29  66.06  66.36  68.99   68.33
LFH       30.06  15.34  36.81  37.93   45.14     58.48  68.90  68.08  72.81   72.25     68.71  74.12  81.34  81.55   84.07
COSDISH   59.24  65.55  70.31  73.99   74.81     58.49  66.16  70.62  73.57   73.62     68.73  80.31  79.28  77.83   82.06
DPSH      38.59  50.09  54.12  61.23   65.34     52.63  57.68  64.43  67.14   70.15     39.96  52.19  55.73  59.83   65.86
DRSCH     40.23  49.87  56.92  60.15   67.54     48.07  49.53  52.42  53.60   55.35     42.35  51.92  58.76  63.51   67.21
ADSH      42.69  50.06  87.30  92.48   92.84     73.31  74.63  76.85  78.68   79.13     72.68  82.13  83.61  85.58   86.54
DAPH      53.48  61.88  66.48  73.28   75.69     53.26  56.70  64.76  68.30   71.67     70.25  78.30  80.12  85.43   87.12
MAH-1     47.59  81.97  93.39  93.35   95.03     70.95  75.37  76.79  78.87   79.52     82.64  85.39  86.90  87.82   89.20
MAH-2     74.60  89.50  94.29  94.89   95.37     76.47  76.85  79.47  82.15   84.65     82.75  86.09  86.76  86.81   89.39
TABLE I: The MAP (%) of the proposed MAH and baselines on three large-scale datasets. MAH-1 and MAH-2 denote the proposed method equipped with the flat multi-head and cascaded multi-head structure, respectively.

4.1 Settings

4.1.1 Datasets

We have conducted extensive image retrieval experiments on three public benchmark datasets, i.e., CIFAR-10, NUS-WIDE, and MIRFlickr.
CIFAR-10 [single-label] [29] is a labeled subset of the 80 Million Tiny Images dataset, consisting of 60,000 32×32 color images in 10 classes, with 6,000 images per class.
NUS-WIDE [multi-label] [30] is a web image dataset containing 269,648 images from Flickr, where 81 semantic concepts are provided for evaluation. We eliminate all empty images and use the remaining 195,834 images from the 21 most frequent concepts, where each concept comprises at least 5,000 images.
MIRFlickr [multi-label] [31] is a collection of 25,000 images from Flickr, where each instance is manually annotated with at least one of 38 labels.

Each dataset is randomly split into a query set with 1,000 samples and a database set with the remaining samples for evaluation. For single-label datasets, if two samples have the same class label, they are considered to be semantically similar, and dissimilar otherwise. For multi-label datasets, if two samples share at least one semantic label, they are considered to be semantically similar.

4.1.2 Evaluation Metric

The Hamming ranking is used as the search protocol to evaluate our proposed approach, and two indicators are reported.

1) Mean Average Precision (MAP): The average precision (AP) is defined as,

AP = (1/N_g) Σ_{r=1}^{N} Precision(r) · δ(r),

where N_g is the number of ground-truth neighbors of the query in the database and N is the number of samples in the database. Precision(r) denotes the precision over the top r retrieved entities, and δ(r) = 1 if the r-th retrieved entity is a ground-truth neighbor and δ(r) = 0 otherwise. For a query set of size Q, the MAP is defined as the mean of the average precision scores of all the queries in the query set,

MAP = (1/Q) Σ_{q=1}^{Q} AP(q).

2) Top5000-precision (Precision@5000): the Precision@5000 curve reflects the change of precision with respect to the number of top-ranked instances returned to the user, which is informative for retrieval.
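The AP definition above can be sketched directly (assuming the ranked list covers all ground-truth neighbors of the query, so the number of hits in the list equals N_g):

```python
import numpy as np

def average_precision(relevance) -> float:
    """AP over one ranked list: the mean of Precision@r taken at each rank r
    where a ground-truth neighbor is retrieved."""
    rel = np.asarray(relevance, dtype=bool)
    if not rel.any():
        return 0.0
    hit_ranks = np.flatnonzero(rel) + 1                   # 1-based ranks of the hits
    precisions = np.arange(1, len(hit_ranks) + 1) / hit_ranks
    return float(precisions.mean())

def mean_average_precision(all_relevance) -> float:
    """MAP: the mean AP over a set of queries."""
    return float(np.mean([average_precision(r) for r in all_relevance]))

# Hits at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1, 0]), 4))  # -> 0.8333
```

Here `relevance` is the binary ground-truth indicator of one query's ranked list; MAP simply averages this quantity over the query set.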

4.1.3 Baselines

To evaluate MAH, we select several related hashing methods as baselines for comparison: the data-independent hashing method LSH [9], the unsupervised hashing method ITQ [10], four supervised but non-deep hashing methods (KSH [32], SDH [11], LFH [12] and COSDISH [13]), and four deep supervised hashing methods, namely DPSH [16], DRSCH [18], ADSH [17] and DAPH [23]. For the non-deep hashing methods, we utilize 4,096-dim deep features extracted from a ResNet50 model pre-trained on the ImageNet dataset for a fair comparison. KSH and SDH are kernel-based methods, for which we randomly select 1,000 data points as anchors to construct the kernels, following the suggestion of the authors.

Fig. 2: Precision@5000 of MAH and other baselines on three datasets.

4.1.4 Implementation Details

Our algorithm is implemented with the PyTorch framework. Training is conducted on a server with two Tesla K40c GPUs with 12GB memory. We employ the deep residual network (ResNet50) architecture, initialize it with pre-trained weights, and fine-tune the convolutional and fully-connected layers on the corresponding training set. During training, we use stochastic gradient descent with the momentum set to 0.9 and weight decay applied. We set the batch size to 64.

Fig. 3: Ablation study on the multi-head strides (local view). The performance of c−-bit hash codes learned by MAH with the flat multi-head (Left) and the cascaded multi-head (Right) on the CIFAR-10 dataset. The loss coefficients λ₁ and λ₂ are fixed at 4 and 2, respectively.

4.2 Comparisons with the State-of-the-art Algorithms

Table I lists the MAP scores of the compared methods using various code lengths, and Figure 2 displays the Precision@5000 curves with 4, 6, 8, 10 and 12 bits. Empirically, the loss coefficients λ₁ and λ₂ of MAH are fixed at 6 and 2 respectively, and a fixed code length set {c−, c, c+} is used. From Table I and Figure 2, we can observe that,

  • The proposed MAH outperforms all baselines on the three datasets in all cases, which verifies its validity and effectiveness. Especially when embedding extremely low bit (e.g., 4-bit) hash codes, MAH achieves markedly higher performance on CIFAR-10, NUS-WIDE and MIRFlickr compared with the other deep hashing approaches (see Table I).

  • The data-independent and unsupervised hashing methods, i.e., LSH and ITQ, achieve much lower performance than all supervised methods on the single-label dataset (CIFAR-10), while they gain competitive scores on the multi-label datasets. We infer that the major reason is the pre-defined similarity measurement, which highly restrains the entropy of the ground truths. For multi-label datasets, two samples are considered semantically similar if they share at least one label; this makes the pair-wise relationship vague and fuzzy, and hence easier for unsupervised methods to predict.

  • Supervised yet non-deep methods, i.e., SDH, KSH, LFH and COSDISH, generally achieve a stable increase in MAP as the hash code length goes up. Notably, the supervised methods that adopt discrete optimization, i.e., SDH and COSDISH, perform relatively better compared with those that utilize continuous relaxation for learning.

  • DPSH, DRSCH, ADSH, DAPH and MAH are all trained in an end-to-end scheme. Compared with the classic DPSH framework, DRSCH applies a bit-wise weighting layer to truncate insignificant bits, which benefits low-bit learning and increases MAP. DAPH and ADSH are deep asymmetric hashing methods, which lose 21.12% and 31.91% MAP respectively in comparison with MAH, as their fixed-length embedding has no guarantee of converging to a global minimum.

  • The underlying principle is that the collaborative learning strategy adopted by MAH achieves a consensus of multiple views from the embedding heads on the same training sample. The consensus provides supplementary information as well as regularization to each embedding head, therefore enhancing generalization and robustness. Besides, intermediate-level representation sharing with back-propagation rescaling aggregates the gradient flows from all heads, which not only reduces the training computational complexity but also strengthens the supervision of the latent features.

Method   12 bits         24 bits         36 bits   48 bits
DPSH     65.34           67.29           70.13     71.25
DRSCH    67.54           67.89           68.32     68.57
ADSH     92.84           94.21           94.32     93.75
DAPH     75.69           82.13           83.07     84.48
MAH-2    79.86 (4-bit)   94.35 (8-bit)
TABLE II: The MAP (%) on CIFAR-10 of the proposed MAH method equipped with the cascaded multi-head structure and deep hashing baselines with various code lengths of {12, 24, 36, 48}.

4.3 Component Analysis

Regarding the impact of each component and parameter setting, we conduct ablation studies on the stride of the multi-head structures, the loss coefficients, and the hyper-parameters, respectively, on the CIFAR-10 dataset.

Fig. 4: Ablation study on the multi-head strides (macro view). The performance of c−-bit hash codes learned by MAH with the flat multi-head (Left) and the cascaded multi-head (Right) on the CIFAR-10 dataset. The loss coefficients λ₁ and λ₂ are fixed at 4 and 2, respectively.

4.3.1 Multi-Head Structures

Flat Multi-head (MAH-1)
                                 MAP                                             Precision@5000
{c−, c, c+}    4 bits 6 bits 8 bits 10 bits 12 bits 14 bits     4 bits 6 bits 8 bits 10 bits 12 bits 14 bits
{4, 6, 8}      74.60  88.88  92.49  -       -       -           76.89  87.39  91.24  -       -       -
{6, 8, 10}     -      86.22  94.29  94.89   -       -           -      91.31  93.23  93.80   -       -
{8, 10, 12}    -      -      94.60  94.30   94.33   -           -      -      93.65  93.12   93.20   -
{10, 12, 14}   -      -      -      94.93   94.83   94.86       -      -      -      93.82   93.49   93.89
Cascaded Multi-head (MAH-2)
                                 MAP                                             Precision@5000
{c−, c, c+}    4 bits 6 bits 8 bits 10 bits 12 bits 14 bits     4 bits 6 bits 8 bits 10 bits 12 bits 14 bits
{4, 6, 8}      51.30  81.06  91.38  -       -       -           63.62  83.77  89.59  -       -       -
{6, 8, 10}     -      76.12  92.90  93.69   -       -           -      82.69  91.90  92.11   -       -
{8, 10, 12}    -      -      93.42  94.38   94.50   -           -      -      92.42  93.15   93.16   -
{10, 12, 14}   -      -      -      94.77   95.37   95.34       -      -      -      93.71   94.39   94.05
TABLE III: The ablation study of the multi-head embedding branch given a fixed code length stride on the CIFAR-10 dataset.

In this subsection, we conduct an ablation study on the effect of the stride of the multi-head structure. We explore the retrieval performance with various strides of our multi-head structures and report the MAP and Precision@5000 in Figure 3 and Figure 4. Figure 3 shows a local relationship with small strides between c−, c and c+, while extra experiments in a macro view with larger strides are shown in Figure 4. The loss coefficients λ₁ and λ₂ are fixed at 4 and 2, respectively. From Figure 3, it is clearly observed that,

Fig. 5: MAP of MAH-2 with various settings of the loss coefficients λ₁ and λ₂.
Fig. 6: Precision@5000 of MAH-2 with various settings of the loss coefficients λ₁ and λ₂.
  • Given the fixed loss coefficients, the cascaded multi-head achieves its best score with a small-stride length combination (74.60% MAP, 80.93% Precision@5000). In the horizontal and vertical directions, the indicator value gradually falls before slightly going up. We attribute this phenomenon to the trade-off between the decreasing fitness of the loss coefficients and the increasing explicit knowledge from high-bit learning. More specifically, the learning of multiple hash codes is dominated by a greater portion of high-bit learning when c or c+ is enlarged, which probably weakens c−-bit learning.

  • Regarding MAH with the flat multi-head, its performance peaks at a particular code length combination and fluctuates more considerably in comparison with the cascaded multi-head.

Moreover, we investigate the impact of each embedding branch given a fixed code length stride, as shown in Table III. From Table III, it is observed that,

  • Given a target code length, the combination in which the target serves as the shortest (c−-bit) embedding branch achieves high scores in most cases. As longer hash codes usually preserve more of the original data structure and semantics, they pass positive guidance and regularization to the shorter hash codes.

As illustrated in Figure 4,

  • The cascaded distiller reaches a higher score when the strides are moderate. The major reason, we believe, is consistency with the loss coefficient λ₁: low-bit learning benefits from the normalized and balanced gradients of high-bit learning at different levels.

  • A dissimilar pattern is observed from the MAP and Precision@5000 matrices of the flat distiller. It generally performs relatively well when the stride between c− and c is small, which may indicate that a close stride between c− and c is required when applying the flat distiller in end-to-end training.

To draw a conclusion, an excessively large stride in the multi-head structures will not bring about a boost in retrieval performance, as its enlarged loss will dominate and overshadow the learning of the low-bit embedding.

4.3.2 Loss Coefficients

The impact of the loss coefficients on embedding quality is reported in Figure 5 and Figure 6. We fix the code length combination, and the ablation study is conducted on the CIFAR-10 dataset with MAH equipped with the cascaded multi-head. It is observed from Figures 5 and 6 that,

  • The MAP of 4-bit hash codes rises for appropriate settings of λ₁ and λ₂. Please note that our learned 4-bit codes outperform most of the 12-bit codes learned by other deep hashing methods, and even some 48-bit codes (see Table II).

  • Most of the high scores are achieved in the last three columns, which shows that weighting the low-bit embedding head has a positive impact on its retrieval performance.

  • Regarding the auxiliary tasks, i.e., - and -bit learning, they are positively correlated with the low-bit embedding, reaching , on MAP and , on Precision@5000.

To conclude, the setting of the loss coefficients directly influences the quality of the learned embeddings. The ratio of is suggested to be for better performance, where serves as a minor adjustment.
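As a rough sketch of how per-head coefficients combine the training signals, the multi-head objective can be viewed as a weighted sum of per-head losses. The function name and all numeric values below are illustrative assumptions, not the paper's exact settings:

```python
# Weighted sum of per-head losses; giving the low-bit head the largest
# coefficient keeps its gradient from being overshadowed by high-bit heads.
def combined_loss(head_losses, coeffs):
    assert len(head_losses) == len(coeffs)
    return sum(l * c for l, c in zip(head_losses, coeffs))

# Illustrative losses for the high-, mid-, and low-bit heads.
head_losses = [0.9, 0.7, 1.4]
coeffs = [1.0, 2.0, 4.0]  # low-bit head weighted most heavily
total = combined_loss(head_losses, coeffs)
```

Under this view, the ablation above amounts to sweeping `coeffs` and measuring the low-bit head's MAP.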

4.4 The Study of Efficiency

The proposed MAH algorithm is further studied with regard to training time and storage cost.

4.4.1 Training Efficiency

In terms of training efficiency, we explore the training time of learning 4-bit hash codes by the proposed MAH and other deep hashing baselines on CIFAR-10. The results are shown in Figure 7. From Figure 7, it is clearly observed that:

Fig. 7: The training time and MAP of deep supervised hashing on CIFAR-10 dataset.
  • MAH achieves its first convergence at around the epoch. Both its MAP and Precision@5000 are superior to the other state-of-the-art methods.

  • Due to its structural simplicity, DPSH consumes the least time per epoch on average (), while DAPH consumes , MAH , and DRSCH . Training DAPH takes longer than MAH, requiring per epoch. Note that MAH simultaneously learns hash codes of 3 different lengths, which is especially advantageous in practice.

  • DAPH holds a stable performance before suddenly climbing at approximately the epoch, achieving on MAP, followed by fluctuations with a slight decreasing trend.

4.4.2 Storage Cost

Fig. 8: The storage cost of CIFAR-10 database hash codes with various code lengths.

As shown in Table II, the 4-bit hash codes learned by MAH outperform the 48-bit binary codes learned by DPSH and DRSCH, and the 8-bit codes surpass the 48-bit codes of other deep hashing methods. Generally, the storage of database hash codes grows linearly with code length, as shown in Figure 8. Therefore, we can infer that applying the proposed MAH in practice will diminish the overall storage cost of hash codes by to without compromising performance.
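The linear relation makes the saving straightforward to quantify. The back-of-envelope sketch below packs one bit per code dimension; the database size is an assumed placeholder, not a value from the paper:

```python
# Storage of packed binary codes grows linearly with code length:
# one bit per dimension, eight bits per byte.
def storage_bytes(num_items, code_len_bits):
    return num_items * code_len_bits // 8

n = 50_000  # hypothetical database size
costs = {bits: storage_bytes(n, bits) for bits in (4, 8, 16, 32, 48)}

# Replacing 48-bit codes with 8-bit codes cuts storage by a factor of 6.
saving = costs[48] / costs[8]
```

For instance, if 8-bit MAH codes match the accuracy of 48-bit baseline codes, the database shrinks to one sixth of its size.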

5 Conclusion and Future Work

In this paper, we propose the Multi-head Asymmetric Hashing (MAH) framework, pursuing maximum semantic information with minimum binary bits. By leveraging the flat and cascaded multi-head structures, the proposed MAH distills bit-specific knowledge for low-bit codes under the guidance of other hashing learners and achieves promising performance on the low-bit retrieval task. Extensive experiments on three datasets have demonstrated the superiority of our MAH over existing deep hashing methods, increasing MAP by and saving storage by to . We plan to extend our collaborative learning strategy to the neural network quantization task in the near future, where significant compression and acceleration are expected.


This work is partially supported by ARC FT130101530 and NSFC No. 61628206.


  • [1] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 769–790, 2018.
  • [2] M. Hu, Y. Yang, F. Shen, N. Xie, and H. T. Shen, “Hashing with angular reconstructive embeddings,” IEEE Trans. Image Processing, vol. 27, no. 2, pp. 545–555, 2018.
  • [3] Y. Luo, Y. Yang, F. Shen, Z. Huang, P. Zhou, and H. T. Shen, “Robust discrete code modeling for supervised hashing,” Pattern Recognition, vol. 75, pp. 128–135, 2018.
  • [4] F. Zhu, X. Kong, L. Zheng, H. Fu, and Q. Tian, “Part-based deep hashing for large-scale person re-identification,” IEEE Trans. Image Processing, vol. 26, no. 10, pp. 4806–4817, 2017.
  • [5] J. Chen, Y. Wang, J. Qin, L. Liu, and L. Shao, “Fast person re-identification via cross-camera semantic binary transformation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 5330–5339.
  • [6] Q. Hu, P. Wang, and J. Cheng, “From hashing to cnns: Training binary weight networks via hashing,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
  • [7] R. Spring and A. Shrivastava, “Scalable and sustainable deep learning via randomized hashing,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13-17, 2017, 2017, pp. 445–454.
  • [8] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 2285–2294.
  • [9] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, 1999, pp. 518–529.
  • [10] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, 2013.
  • [11] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 37–45.
  • [12] P. Zhang, W. Zhang, W. Li, and M. Guo, “Supervised hashing with latent factor models,” in The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast, QLD, Australia, July 06-11, 2014, 2014, pp. 173–182.
  • [13] W. Kang, W. Li, and Z. Zhou, “Column sampling based discrete supervised hashing,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 2016, pp. 1230–1236.
  • [14] J. Wang, S. Kumar, and S. Chang, “Semi-supervised hashing for large-scale search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2393–2406, 2012.
  • [15] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen, “Unsupervised deep hashing with similarity-adaptive and discrete optimization,” IEEE Trans. Pattern Anal. Mach. Intell., 2018.
  • [16] W. Li, S. Wang, and W. Kang, “Feature learning based deep supervised hashing with pairwise labels,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2016, pp. 1711–1717.
  • [17] Q. Jiang and W. Li, “Asymmetric deep supervised hashing,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
  • [18] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” IEEE Trans. Image Processing, vol. 24, no. 12, pp. 4766–4779, 2015.
  • [19] Y. Guo, X. Zhao, G. Ding, and J. Han, “On trivial solution and high correlation problems in deep supervised hashing,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
  • [20] Y. Cao, M. Long, B. Liu, J. Wang, and M. KLiss, “Deep cauchy hashing for hamming space retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1229–1237.
  • [21] Z. Cao, M. Long, J. Wang, and P. S. Yu, “Hashnet: Deep learning to hash by continuation,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 5609–5618.
  • [22] B. Neyshabur, N. Srebro, R. Salakhutdinov, Y. Makarychev, and P. Yadollahpour, “The power of asymmetry in binary hashing,” in Advances in Neural Information Processing Systems 26, Lake Tahoe, Nevada, United States, December 5-8, 2013, 2013, pp. 2823–2831.
  • [23] F. Shen, X. Gao, L. Liu, Y. Yang, and H. T. Shen, “Deep asymmetric pairwise hashing,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 1522–1530.
  • [24] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015.
  • [25] Y. Lv, W. W. Y. Ng, Z. Zeng, D. S. Yeung, and P. P. K. Chan, “Asymmetric cyclical hashing for large scale image retrieval,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1225–1235, 2015.
  • [26] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 2018.
  • [27] X. Gao, F. Shen, Y. Yang, X. Xu, H. Li, and H. T. Shen, “Asymmetric sparse hashing,” in 2017 IEEE International Conference on Multimedia and Expo, ICME 2017, Hong Kong, China, July 10-14, 2017, 2017, pp. 127–132.
  • [28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778.
  • [29] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
  • [30] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from national university of singapore,” in Proceedings of the 8th ACM International Conference on Image and Video Retrieval, CIVR 2009, Santorini Island, Greece, July 8-10, 2009, 2009.
  • [31] M. J. Huiskes and M. S. Lew, “The mir flickr retrieval evaluation,” in MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA: ACM, 2008.
  • [32] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang, “Supervised hashing with kernels,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012, Providence, RI, USA, June 16-21, 2012, 2012, pp. 2074–2081.