1 Introduction
To diminish the computational and storage cost and make optimal use of rapidly emerging multimedia data, hashing [1, 2] has attracted much attention from the machine learning community, with wide applications in information retrieval [3, zeroshot], person re-identification [4, 5] and network compression [6, 7, 8], etc. The goal of hashing is to transform original data structures and semantic affinity into compact binary codes, thereby substantially accelerating computation with efficient XOR operations and saving storage.
There are mainly two branches of hashing, i.e., data-independent hashing and data-dependent hashing. For data-independent hashing, such as Locality Sensitive Hashing [9], no prior knowledge (e.g., supervised information) about the data is available, and hash functions are randomly generated. Nonetheless, huge storage and computational overhead might be incurred, since more than 1,000 bits are usually required to achieve acceptable performance. To address this problem, research has turned to data-dependent hashing, which leverages information inside the data itself. Roughly, data-dependent hashing can be divided into two categories: unsupervised hashing (e.g., Iterative Quantization (ITQ) [10]) and (semi-)supervised hashing (e.g., Supervised Hashing with Kernels (KSH), Supervised Discrete Hashing (SDH) [11], Supervised Hashing with Latent Factor Models (LFH) [12], Column Sampling based Discrete Supervised Hashing (COSDISH) [13] and Semi-Supervised Hashing (SSH) [14]). In general, supervised hashing usually achieves better performance than unsupervised hashing because supervised information (e.g., semantic labels and/or pairwise data relationships) can help to better explore intrinsic data properties, thereby generating superior hash codes and hash functions.
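The XOR-based distance computation mentioned above can be sketched in a few lines of Python; `hamming_distance` is an illustrative helper, not part of any method in this paper:

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary codes packed into ints.

    XOR sets exactly the bits where the two codes differ; counting the
    set bits (popcount) then gives the distance in word-level operations,
    which is why binary codes accelerate nearest-neighbor search.
    """
    return bin(a ^ b).count("1")

# 8-bit codes 0b10110100 and 0b10011100 differ in 2 positions
assert hamming_distance(0b10110100, 0b10011100) == 2
assert hamming_distance(5, 5) == 0
```

In practice the same idea is applied word-by-word over packed 64-bit chunks of longer codes, with a hardware popcount instruction per word.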
With the rapid development of deep learning techniques, deep hashing [15, 16, 17, 18, 19] trained in an end-to-end scheme has been proposed. From the perspective of training strategy, deep hashing can be roughly grouped into two categories: symmetric and asymmetric deep hashing. By assuming that both query and database samples share the same distribution, symmetric deep hashing [16, 20, 21] leverages a single network to preserve pairwise or triplet-wise neighbor affinity, which inevitably results in high training complexity, i.e., O(n^2) or even O(n^3), where n denotes the number of database points. Asymmetric deep hashing treats query samples and database samples separately based on the asymmetric theory [22]. Deep asymmetric pairwise hashing (DAPH) [23] utilizes two distinct mappings to capture variances and discrepancies between the query and database sets, while asymmetric deep supervised hashing (ADSH) [17] learns a hash function only for query points, thus reducing the training time complexity.
However, most existing deep hashing models can hardly guarantee manageable convergence or performance for extremely low bit hash codes (e.g., 4-bit), even though such codes have the capacity to convey sufficient semantics, mainly due to severe information loss and the shrinkage of the discrete solution space. For example, considering the CIFAR-10 database, which includes images of ten classes, the minimum number of bits theoretically required to represent the full semantics is 4 (since ⌈log2 10⌉ = 4), while the 4-bit hash codes generated by ADSH only achieve an MAP score of 42.69%, which is far from satisfactory. Besides, to meet performance and storage requirements, it is inevitable and tedious in practice to adjust the default code length and re-tune network hyperparameters multiple times. The bit-scalable deep hashing (DRSCH) [18] was proposed to learn hash codes of variable length by unequally weighting each bit and then truncating insignificant bits. Nevertheless, DRSCH is still suboptimal, as it ignores the distribution adaptation needed to fit vertices of the Hamming hypercube and fails to suppress information loss and quantization error.
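The minimum-code-length argument above (ten classes need at least ⌈log2 10⌉ bits so that every class can receive a distinct code) can be checked directly; `min_bits_for_classes` is an illustrative helper:

```python
import math

def min_bits_for_classes(num_classes: int) -> int:
    # smallest code length c such that 2**c distinct codes cover all classes
    return math.ceil(math.log2(num_classes))

assert min_bits_for_classes(10) == 4     # CIFAR-10: ten classes -> 4 bits suffice in principle
assert min_bits_for_classes(1000) == 10  # e.g. a 1000-class problem would need 10 bits
```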
Motivated by the aforementioned observations and analyses, in this paper we propose a novel collaborative learning strategy for extremely low bit (e.g., 4-bit) hashing by simultaneously learning a group of hash codes with various lengths. Different from conventional knowledge distillation, which adopts a student-teacher training strategy [24], collaborative learning distills the bit-specific knowledge in a unified framework and squeezes redundant bits out. To achieve this goal, the Multi-head Asymmetric Hashing (MAH) is derived, which is a deep asymmetric hashing framework equipped with a multi-head embedding structure. As shown in Figure 1, two variants (i.e., flat and cascaded) of multi-head structures are explored. The flat one explicitly guides the low-bit embedding branch with multiple supplementary views, while the cascaded one adapts the distribution of low-bit hashing based on the consensus of the other hashing learners. The multi-head structure benefits extremely low bit hashing from two perspectives: 1) it enables the shared intermediate layers to aggregate the gradient flows from all heads and the penultimate layer to select bit-specific representations, adapting the feature distribution to compensate for information loss; 2) multiple views from different embedding heads on the same training sample provide regularization to extremely low bit hashing, thereby making its convergence faster and more stable. Our main contributions are summarized as follows:

We propose a novel collaborative learning strategy for deep hashing, aiming to distill knowledge for low-bit hash codes from a group of hashing learners and gain a performance boost. To the best of our knowledge, this is the first work to introduce model distillation to address the code compression of deep hashing.

Two variants of multi-head structures are derived to efficiently enhance the power of supervision on the shared intermediate layers and benefit bit-specific representation learning. Besides, a group of hash codes with various lengths is jointly learned, which may suit different platforms without extra inference cost or network re-tuning.

Experiments on three benchmark datasets demonstrate that the proposed MAH significantly outperforms existing deep hashing methods, especially for the low-bit retrieval task, while substantially reducing training time and storage cost.
The rest of the paper is organized as follows. Section 2 presents a brief review of related work, and Section 3 introduces the details of our hashing learning framework. The experimental results, comparisons and component analysis are presented in Section 4, followed by the conclusion and future work in Section 5.
2 Related Work
In this section, we mainly introduce three aspects relevant to our work: scalable hashing, knowledge distillation and asymmetric hashing. We also explain the differences between our work and existing works.
2.1 Scalable Hashing
Traditional hashing learns hash codes with a default code length (e.g., 64-bit), which highly restricts practical flexibility and scalability. For example, low-bit hash codes suit devices with limited computational resources well, while high-bit hash codes are usually applied on high-performance servers for higher accuracy. Therefore, it is inevitable and tedious for engineers to adjust the default code length and re-tune network hyperparameters to meet performance and storage requirements. To address this issue, asymmetric cyclical hashing [25] measures hash codes of different lengths for query and database images with a weighted Hamming distance. With the development of deep learning techniques, an end-to-end bit-scalable deep hashing framework (DRSCH) [18] was proposed, which learns hash codes of variable length by unequally weighting each bit and then truncating insignificant bits. Nevertheless, DRSCH is still suboptimal, as it ignores the distribution adaptation needed to fit vertices of the Hamming hypercube and fails to suppress information loss and quantization error.
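The bit-weighting idea behind such bit-scalable codes can be sketched as follows. This is a hypothetical illustration of the general mechanism (weighted Hamming distance plus truncation of low-weight bits), not DRSCH's actual implementation; `weighted_hamming` and `truncate` are invented names:

```python
def weighted_hamming(a, b, w):
    """Weighted Hamming distance between codes given as lists of +/-1 bits.

    Bits with larger learned weights contribute more to the distance,
    so dropping the lowest-weight bits loses the least information.
    """
    return sum(wi for ai, bi, wi in zip(a, b, w) if ai != bi)

def truncate(code, w, keep):
    # keep only the `keep` highest-weight bit positions (original order preserved)
    order = sorted(range(len(w)), key=lambda i: -w[i])[:keep]
    return [code[i] for i in sorted(order)]

a, b = [1, -1, 1, 1], [1, 1, -1, 1]
w = [0.9, 0.1, 0.6, 0.4]
assert abs(weighted_hamming(a, b, w) - 0.7) < 1e-9   # differs at bits 1 and 2
assert truncate(a, w, 2) == [1, 1]                   # keeps bit positions 0 and 2
```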
2.2 Knowledge Distillation
To save computational cost for inference under various settings, several knowledge distillation strategies have been explored for classification. While general knowledge distillation [24] requires two stages, i.e., pre-training a large, highly regularized model first and then teaching the smaller model, two-way distillation [26] leverages an ensemble of students that learn mutually and teach each other throughout the training process. Nevertheless, by using the Kullback-Leibler (KL) divergence to constrain the consensus of predictions and weights of different networks, two-way distillation limits its deployment on the collaborative hashing problem, as it ignores the selection of bit-specific features and probably leads to severe information loss during quantization.
2.3 Asymmetric Hashing
Asymmetric hashing can be grouped into two types: dual projection based [23, 27] and sampling based [17, asy_dis_graph]. Dual projection methods aim to capture the distribution differences between database points and query points by learning two distinctive hash functions, so that original data relationships can be well preserved. Rather than learning full pairwise relationships (O(n^2)) or triplet relationships (O(n^3)) among all n points of the dataset, sampling based methods select m anchors (m << n) to approximate the query set and construct an asymmetric affinity to supervise learning, which significantly reduces the training time complexity.
3 Methodology
3.1 Problem Definition
Without loss of generality, we focus on extremely low bit hashing for the image retrieval task with pairwise supervision. We assume that there are m query data points denoted as X = {x_i}_{i=1}^{m} and n database points denoted as Y = {y_j}_{j=1}^{n}. Furthermore, pairwise supervised information between X and Y is provided as S ∈ {−1, +1}^{m×n}, where S_ij = 1 if x_i and y_j are similar, and S_ij = −1 otherwise. The goal of conventional deep hashing is to learn a nonlinear hash function h(·) and generate hash codes U ∈ {−1, +1}^{m×c} for the query points and B ∈ {−1, +1}^{n×c} for the database points with minimum information loss, where c is the hash code length. Different from existing deep hashing methods, which learn fixed-length binary codes, we explore how to jointly optimize hash codes with various lengths, i.e., {B_1, B_2, B_3} with lengths c_1 < c_2 < c_3, where c_1, c_2 and c_3 denote the extremely low, low and anchor lengths of the binary codes, respectively. Our objective is to learn the shortest (c_1-bit) hash codes to achieve the best retrieval performance among the learner group.
3.2 Model Formulation
The general framework of the proposed MAH is shown in Figure 1.
3.2.1 Discrete Feature Learning
The whole end-to-end architecture for feature learning is mainly based on the deep residual network [28], which accepts pairwise inputs, i.e., database and query images. To transform the latent Euclidean space into the Hamming space, we replace the top softmax classifier layer of the vanilla ResNet with the multi-head embedding structure that will be elaborated later. Please note that feature learning is performed only for query points, not for database points.
3.2.2 Multi-head Embedding
In this work, we construct two variants of multi-head structures, i.e., the flat multi-head and the cascaded multi-head, to implement collaborative learning based on different assumptions. Borrowing the advantages of two-way distillation, the flat multi-head increases the posterior entropy of the low-bit branch, which helps it converge to a more robust and flatter minimum with complementary views. Differently, leveraging the consensus of the longer-bit learners, which convey more of the original data structure and semantics, the cascaded multi-head adjusts the manifold layer by layer to approximate vertices of the target Hamming hypercube, consequently offsetting information loss with distribution adaptation. For simplicity, we construct only three heads to validate multi-head embedding, which can be further generalized to an n-head structure.
3.2.3 Objective Formulation for MAH
To learn hash codes that maximally preserve the semantics and affinity, the objective is usually formulated to minimize the loss between the pairwise supervision and the inner products of the query-database binary code pairs:

min_{B, Θ}  J = Σ_{i=1}^{m} Σ_{j=1}^{n} ( sgn(F(x_i; Θ))^T b_j − c S_ij )^2,   (1)
where sgn(·) denotes the signum function, F(x_i; Θ) is the output of the penultimate layer, and Θ denotes the parameters of the neural network to be learned. However, Equation (1) suffers from an ill-posed gradient problem caused by the non-smooth sgn(·) function, whose gradient is zero for all non-zero inputs, making standard back-propagation infeasible. Therefore, we adopt the tanh(·) function to approximate it and apply a further optimization strategy; thus Equation (1) can be rewritten as:

min_{B, Θ}  J = Σ_{i=1}^{m} Σ_{j=1}^{n} ( tanh(F(x_i; Θ))^T b_j − c S_ij )^2.   (2)
In practice, we might only be given a set of database points without query points. In this case, following [17], we can randomly sample m data points from the database to construct the query set. More specifically, we set X = Y_Ω, where Ω ⊂ {1, 2, …, n} denotes the indices of the m sampled database points. The objective function can then be rewritten as follows:

min_{B, Θ}  J = Σ_{i∈Ω} Σ_{j=1}^{n} ( tanh(F(y_i; Θ))^T b_j − c S_ij )^2 + γ Σ_{i∈Ω} || b_i − tanh(F(y_i; Θ)) ||^2,   (3)
where γ is a constant parameter. In real applications, if we are given both X and Y, we use the problem defined in Equation (2) for training MAH; otherwise, Equation (3) is used as the objective.
In this case, MAH treats database and query data in an asymmetric way, based on the assumption that the distributions of the two sets are not exactly the same. The sampling strategy not only avoids overfitting on the training dataset but also reduces training complexity, thus significantly improving robustness and practicality.
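The relaxed asymmetric objective with sampled queries can be sketched in pure Python. This is a minimal illustration assuming tanh outputs for the sampled queries, database codes in {−1, +1}, pairwise supervision in {−1, +1}, and a quantization weight (called gamma here); the function name, argument names, and toy values are illustrative, not the paper's implementation:

```python
def asymmetric_loss(U, B, S, omega, c, gamma):
    """Relaxed asymmetric pairwise loss of the Equation (3) style.

    U:     tanh outputs for the sampled queries (list of length-c float lists)
    B:     database codes (+/-1 lists), S: supervision in {-1, +1}
    omega: database index of each sampled query (for the quantization term)
    """
    loss = 0.0
    for i, u in enumerate(U):
        for j, b in enumerate(B):
            inner = sum(ui * bj for ui, bj in zip(u, b))
            loss += (inner - c * S[i][j]) ** 2       # affinity-preserving term
        bi = B[omega[i]]
        # quantization term keeps each relaxed output close to its own code
        loss += gamma * sum((bij - ui) ** 2 for bij, ui in zip(bi, u))
    return loss

U = [[0.5, -0.5]]          # one sampled query, 2-bit relaxed code
B = [[1, -1], [1, 1]]      # two database codes
S = [[1, -1]]              # the query is similar to database point 0 only
assert abs(asymmetric_loss(U, B, S, omega=[0], c=2, gamma=1.0) - 5.5) < 1e-9
```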
Regarding the proposed multi-head embedding structure, we modify the core loss function to collaboratively learn binary codes B_1, B_2, B_3 with code lengths c_1, c_2, c_3, respectively. Notably, we aim to learn the lowest bit hash codes to achieve the optimal retrieval performance:

min  L = J_{c_1}(B_1, Θ_1) + α J_{c_2}(B_2, Θ_2) + β J_{c_3}(B_3, Θ_3),   (4)

where Θ_2 and Θ_3 denote the parameters of the networks including the c_2-bit and c_3-bit embedding heads, respectively, J_c(·) denotes the objective of Equation (3) with code length c, and α and β are the coefficients for balancing the multi-head loss.
3.3 Optimization
To achieve the above objectives, we need to learn the parameters of the neural networks and infer the database hash codes B, yet it is NP-hard to solve for the hash codes directly due to the discrete constraints. Inspired by discrete cyclic coordinate descent (DCC), the binary codes can be learned bit by bit iteratively.
3.3.1 Learning Θ with B Fixed
To update the neural network parameters Θ with B fixed, we use back-propagation (BP) for gradient calculation. Specifically, we sample a mini-batch of the query points and then update the parameters based on the sampled data. For clarity and simplicity, we denote z_i = F(y_i; Θ) and u_i = tanh(z_i). We can then calculate the gradient as follows:

∂J/∂z_i = [ 2 Σ_{j=1}^{n} ( u_i^T b_j − c S_ij ) b_j + 2γ ( u_i − b_i ) ] ⊙ ( 1 − u_i ⊙ u_i ),   (5)

where ⊙ denotes the element-wise product.
We use the chain rule to compute ∂J/∂Θ based on ∂J/∂z_i, and the BP algorithm is used to update Θ. Then Θ_2 and Θ_3 are updated asynchronously in a similar way to Equation (5).
3.3.2 Learning B with Θ Fixed
We first optimize B (with all other variables fixed) and rewrite Equation (3) as follows:

min_B  || U B^T − c S ||_F^2 + γ || B_Ω − U ||_F^2,  s.t.  B ∈ {−1, +1}^{n×c},   (6)

where U stacks the outputs u_i for i ∈ Ω and B_Ω denotes the binary codes of the database points indexed by Ω. We define Ū ∈ R^{n×c}, whose i-th row ū_i is given as follows:

ū_i = u_i if i ∈ Ω, and ū_i = 0 otherwise,   (7)

so that || B_Ω − U ||_F^2 = −2 tr(B Ū^T) + const, where const collects the terms independent of B.
As B is discrete and the problem is non-convex, we choose to learn the binary codes by the discrete cyclic coordinate descent (DCC) method; in other words, we learn B bit by bit. Let B_{*k} denote the k-th column of B and B̂_k the matrix of B excluding the k-th column. Similarly, let U_{*k} denote the k-th column of U and Û_k the matrix of U excluding the k-th column. Further, let Q = −2c S^T U − 2γ Ū, with Q_{*k} its k-th column. To optimize B_{*k}, we can write the objective function as:

min_{B_{*k}}  tr( B_{*k} [ 2 U_{*k}^T Û_k B̂_k^T + Q_{*k}^T ] ) + const.   (8)

Consequently, the optimal solution can be obtained as:

B_{*k} = −sgn( 2 B̂_k Û_k^T U_{*k} + Q_{*k} ).   (9)
With a similar calculation, the solutions for B_2 and B_3 can be formed as:

B_{2,*k} = −sgn( 2 B̂_{2,k} Û_{2,k}^T U_{2,*k} + Q_{2,*k} ),   B_{3,*k} = −sgn( 2 B̂_{3,k} Û_{3,k}^T U_{3,*k} + Q_{3,*k} ),   (10)

where Q_2 and Q_3 are defined analogously to Q using the outputs of the c_2-bit and c_3-bit heads. The complete training procedure for the proposed MAH is described in Algorithm 1.
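The bit-by-bit DCC update can be sketched as follows. This is a simplified pure-Python illustration on the plain objective ||U B^T − c S||_F^2 with the quantization term omitted; all names and the toy data are assumptions for illustration, not the paper's implementation:

```python
import random

def sgn(x):
    # sign with sgn(0) = 1, so outputs stay in {-1, +1}
    return 1 if x >= 0 else -1

def objective(U, B, S, c):
    # ||U B^T - c S||_F^2 on small Python lists (U: m x c, B: n x c, S: m x n)
    m, n = len(U), len(B)
    return sum(
        (sum(U[i][l] * B[j][l] for l in range(c)) - c * S[i][j]) ** 2
        for i in range(m) for j in range(n)
    )

def dcc_sweep(U, B, S, c):
    """One DCC pass: each code bit (a column of B) is updated in closed
    form while all other columns are held fixed."""
    m, n = len(U), len(B)
    for k in range(c):
        for j in range(n):
            # interaction of bit k with the other, fixed bits of code j
            t = sum(
                sum(U[q][l] * U[q][k] for q in range(m)) * B[j][l]
                for l in range(c) if l != k
            )
            # supervision term c * (S^T U)_{j,k}
            s_jk = c * sum(S[q][j] * U[q][k] for q in range(m))
            B[j][k] = -sgn(t - s_jk)

random.seed(0)
m, n, c = 4, 6, 3
U = [[random.uniform(-1.0, 1.0) for _ in range(c)] for _ in range(m)]
S = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(m)]
B = [[random.choice([-1, 1]) for _ in range(c)] for _ in range(n)]

before = objective(U, B, S, c)
dcc_sweep(U, B, S, c)
# each exact column update cannot increase the loss, and B stays binary
assert objective(U, B, S, c) <= before
assert all(bit in (-1, 1) for row in B for bit in row)
```

Because each column update is the exact minimizer given the others, the objective is monotonically non-increasing over sweeps, which is what makes the discrete optimization tractable.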
3.4 Out-of-Sample Extension
After training MAH, the learned deep neural networks can be applied to generate compact binary codes for query points, including query points unseen during training. Specifically, for a query point x_q we can use the following equation:

b_q = sgn( F(x_q; Θ_1) ).   (11)
3.5 Complexity Analysis
For each epoch, the time cost is analyzed as follows. The computation of the gradient in Equation (5) costs O(mnc). To apply the DCC algorithm for updating B, Q is calculated with a cost of O(mnc). As for the optimization of the subproblems in Equations (9) and (10), the time cost is O(nc^2). In practice, m and c are much smaller than the database size n. Hence, the overall computational complexity of the algorithm is O(n).
4 Experiments
TABLE I: MAP (%) comparison on three benchmark datasets with different code lengths.
Method  CIFAR10  NUSWIDE  MIRFlickr  

4 bits  6 bits  8 bits  10 bits  12 bits  4 bits  6 bits  8 bits  10 bits  12 bits  4 bits  6 bits  8 bits  10 bits  12 bits  
LSH  11.28  11.71  12.64  12.13  12.09  45.65  45.89  43.76  44.38  44.11  54.26  56.73  55.16  57.16  56.49 
ITQ  18.22  18.00  19.39  19.76  20.16  62.56  67.07  69.06  70.39  70.82  71.56  71.69  71.46  72.54  73.13 
SDH  34.79  43.23  44.83  52.49  56.01  62.92  65.98  68.53  73.33  71.73  76.09  78.14  80.15  80.38  81.58 
KSH  34.81  40.98  44.96  46.23  47.22  59.19  59.76  59.85  59.40  59.66  66.29  66.06  66.36  68.99  68.33 
LFH  30.06  15.34  36.81  37.93  45.14  58.48  68.90  68.08  72.81  72.25  68.71  74.12  81.34  81.55  84.07 
COSDISH  59.24  65.55  70.31  73.99  74.81  58.49  66.16  70.62  73.57  73.62  68.73  80.31  79.28  77.83  82.06 
DPSH  38.59  50.09  54.12  61.23  65.34  52.63  57.68  64.43  67.14  70.15  39.96  52.19  55.73  59.83  65.86 
DRSCH  40.23  49.87  56.92  60.15  67.54  48.07  49.53  52.42  53.60  55.35  42.35  51.92  58.76  63.51  67.21 
ADSH  42.69  50.06  87.30  92.48  92.84  73.31  74.63  76.85  78.68  79.13  72.68  82.13  83.61  85.58  86.54 
DAPH  53.48  61.88  66.48  73.28  75.69  53.26  56.70  64.76  68.30  71.67  70.25  78.30  80.12  85.43  87.12 
MAH1  47.59  81.97  93.39  93.35  95.03  70.95  75.37  76.79  78.87  79.52  82.64  85.39  86.90  87.82  89.20 
MAH2  74.60  89.50  94.29  94.89  95.37  76.47  76.85  79.47  82.15  84.65  82.75  86.09  86.76  86.81  89.39 
4.1 Settings
4.1.1 Datasets
We have conducted extensive image retrieval experiments on three public benchmark datasets, i.e., CIFAR-10, NUS-WIDE, and MIRFlickr.
CIFAR-10 (single-label) [29] is a labeled subset of the 80 Million Tiny Images dataset, which consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class.
NUS-WIDE (multi-label) [30] is a web image dataset containing 269,648 images from Flickr, where 81 semantic concepts are provided for evaluation. We eliminate all empty images and use the remaining 195,834 images belonging to the 21 most frequent concepts, where each concept contains at least 5,000 images.
MIRFlickr (multi-label) [31] is a collection of 25,000 images from Flickr, where each instance is manually annotated with at least one of 38 labels.
Each dataset is randomly split into a query set with 1,000 samples and a database set with the remaining samples for evaluation. For single-label datasets, two samples are considered semantically similar if they have the same class label, and dissimilar otherwise. For multi-label datasets, two samples are considered semantically similar if they share at least one semantic label.
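The similarity rule above can be captured in a small helper; `pairwise_similarity` is an illustrative name, and the label sets are toy examples:

```python
def pairwise_similarity(labels_a, labels_b):
    """Pairwise supervision used by the datasets above: two samples are
    similar (+1) iff their label sets share at least one label, else
    dissimilar (-1). For single-label data each set is a singleton, so
    this reduces to equality of class labels."""
    return 1 if set(labels_a) & set(labels_b) else -1

assert pairwise_similarity({"cat"}, {"cat"}) == 1              # single-label: same class
assert pairwise_similarity({"cat"}, {"dog"}) == -1
assert pairwise_similarity({"sky", "water"}, {"water"}) == 1   # multi-label: one shared concept
```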
4.1.2 Evaluation Metric
The Hamming ranking is used as the search protocol to evaluate our proposed approach, and two indicators are reported.
1) Mean Average Precision (MAP): The average precision (AP) of a query is defined as:

AP = (1/N) Σ_{r=1}^{R} Precision(r) δ(r),   (12)
where N is the number of ground-truth neighbors of the query in the database and R is the number of samples in the database. Precision(r) denotes the precision of the top r retrieved entities, and δ(r) = 1 if the r-th retrieved entity is a ground-truth neighbor and δ(r) = 0 otherwise. For a query set of size Q, the MAP is defined as the mean of the average precision scores of all queries in the query set:

MAP = (1/Q) Σ_{q=1}^{Q} AP(q).   (13)
2) Top-5000 precision (Precision@5000): the Top-5000 precision curve reflects the change in precision with respect to the number of top-ranked instances returned to the user, which is expressive for retrieval.
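The two definitions can be sketched as follows; the function names are illustrative. Note that when the entire database is ranked, averaging precision over the ground-truth hits found in the list coincides with dividing by the total number N of ground-truth neighbors:

```python
def average_precision(relevance):
    """AP over a ranked list: relevance[r] is 1 if the (r+1)-th retrieved
    item is a ground-truth neighbor, else 0. Averages Precision(r) over
    the positions of the ground-truth hits."""
    hits, score = 0, 0.0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / r          # Precision at rank r
    return score / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    # MAP: the mean of the AP scores over all queries
    return sum(average_precision(rl) for rl in ranked_lists) / len(ranked_lists)

# hits at ranks 1 and 3: AP = (1/2) * (1/1 + 2/3) = 5/6
assert abs(average_precision([1, 0, 1]) - 5 / 6) < 1e-9
```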
4.1.3 Baselines
To evaluate MAH, we select several related hashing methods as baselines for comparison, including the data-independent hashing method LSH [9], the unsupervised hashing method ITQ [10], four non-deep supervised hashing methods KSH [32], SDH [11], LFH [12] and COSDISH [13], and four deep supervised hashing methods, namely DPSH [16], DRSCH [18], ADSH [17] and DAPH [23]. For the non-deep hashing methods, we utilize 4,096-dim deep features extracted from a ResNet-50 model pre-trained on the ImageNet dataset for fair comparison. KSH and SDH are kernel based methods, for which we randomly select 1,000 data points as anchors to construct the kernels, following the suggestion of the authors.
4.1.4 Implementation Details
Our algorithm is implemented with the PyTorch framework. Training is conducted on a server with two Tesla K40c GPUs with 12GB memory. We employ the deep residual network (ResNet-50) architecture, initialize it with pre-trained weights, and fine-tune the convolutional and fully-connected layers on the corresponding training set. During training, we use stochastic gradient descent with the momentum set to 0.9 together with weight decay, and set the batch size to 64.

[Figure: Ablation study on the large-scale multi-head strides. The performance of low-bit hash codes learned by MAH with the flat multi-head (left) and cascaded multi-head (right) on the CIFAR-10 dataset. The loss coefficients α and β are fixed at 4 and 2, respectively.]

4.2 Comparisons with the State-of-the-art Algorithms
Table I reports the MAP scores of the compared methods using various code lengths, and Figure 2 displays the Precision@5000 curves with 4, 6, 8, 10 and 12 bits. Empirically, the loss coefficients α and β of MAH are fixed at 6 and 2, respectively, and the code length groups follow the multi-head settings studied in Section 4.3. From Table I and Figure 2, we can observe that:

The proposed MAH outperforms all baselines on the three datasets in all cases, which verifies its validity and effectiveness. Especially when embedding extremely low bit (e.g., 4-bit) hash codes, MAH achieves at least 21.12% (CIFAR-10), 3.16% (NUS-WIDE) and 10.07% (MIRFlickr) higher MAP compared with the other deep hashing approaches.

The data-independent and unsupervised hashing methods, i.e., LSH and ITQ, achieve relatively much lower performance than all supervised methods on the single-label dataset (CIFAR-10), while they gain competitive scores on the multi-label datasets. We infer that the major reason is the predefined similarity measurement, which highly restrains the entropy of the ground truths. For multi-label datasets, two samples are considered semantically similar if they share at least one semantic label; this makes the pairwise relationships vague and fuzzy, and hence easier for unsupervised methods to predict.

The supervised yet non-deep methods, i.e., SDH, KSH, LFH and COSDISH, generally achieve a stable increase in MAP as the hash code length grows. Notably, the supervised methods that adopt discrete optimization, i.e., SDH and COSDISH, perform relatively better than those that utilize continuous relaxation for learning.

DPSH, DRSCH, ADSH, DAPH and MAH are all trained in an end-to-end scheme. Compared with the classic DPSH framework, DRSCH applies a bit-wise weight layer to truncate insignificant bits, which benefits low-bit learning and increases MAP. DAPH and ADSH are deep asymmetric hashing methods, which lose 21.12% and 31.91% MAP on 4-bit CIFAR-10 in comparison with MAH, as their fixed-length embedding has no guarantee of converging to a global minimum.

The underlying principle is that the collaborative learning strategy adopted by MAH achieves a consensus of multiple views from the embedding heads on the same training sample. The consensus provides supplementary information as well as regularization to each embedding head, thereby enhancing generalization and robustness. Besides, the intermediate-level representation sharing with back-propagation rescaling aggregates the gradient flows from all heads, which not only reduces training computational complexity but also facilitates supervision of the latent features.
TABLE II: MAP (%) comparison on CIFAR-10 with longer code lengths.
Method  CIFAR10  

12 bits  24 bits  36 bits  48 bits  
DPSH  65.34  67.29  70.13  71.25 
DRSCH  67.54  67.89  68.32  68.57 
ADSH  92.84  94.21  94.32  93.75 
DAPH  75.69  82.13  83.07  84.48 
MAH2  79.86 (4-bit)    94.35 (8-bit)   
4.3 Component Analysis
To study the impact of each component and parameter setting, we set up ablation studies on the stride of the multi-head structures, the loss coefficients, and the hyperparameters, respectively, on the CIFAR-10 dataset.
4.3.1 Multi-head Structures
Flat Multi-head (MAH1)  

MAP  Precision@5000  
4 bits  6 bits  8 bits  10 bits  12 bits  14 bits  4 bits  6 bits  8 bits  10 bits  12 bits  14 bits 
74.60  88.88  92.49        76.89  87.39  91.24       
  86.22  94.29  94.89        91.31  93.23  93.80     
    94.60  94.30  94.33        93.65  93.12  93.20   
      94.93  94.83  94.86        93.82  93.49  93.89 
Cascaded Multi-head (MAH2)  
MAP  Precision@5000  
4 bits  6 bits  8 bits  10 bits  12 bits  14 bits  4 bits  6 bits  8 bits  10 bits  12 bits  14 bits 
51.30  81.06  91.38        63.62  83.77  89.59       
  76.12  92.90  93.69        82.69  91.90  92.11     
    93.42  94.38  94.50        92.42  93.15  93.16   
      94.77  95.37  95.34        93.71  94.39  94.05 
In this subsection, we conduct an ablation study on the effect of the stride of the multi-head structure. We explore the retrieval performance with various strides of our multi-head structures and report the MAP and Precision@5000 in Figure 3 and Figure 4. Figure 3 shows a local relationship between strides, while extra experiments from a macro view are shown in Figure 4. The loss coefficients α and β are fixed at 4 and 2, respectively. From Figure 3, it is clearly observed that:

Given the fixed loss coefficients, the cascaded multi-head achieves a better score (74.60% MAP, 80.93% Precision@5000) with a small-stride length combination. In the horizontal and vertical directions, the indicator values gradually fall before slightly going up. We attribute this phenomenon to the trade-off between the decreasing fitness of the loss coefficients and the increasing explicit knowledge from high-bit learning. More specifically, the learning of multiple hash codes is dominated by a greater portion of high-bit learning when the higher-bit code lengths are enlarged, which probably weakens the low-bit learning.

Regarding the MAH with the flat multi-head, it peaks in both MAP and Precision@5000 at a particular code length combination, and its performance fluctuates more considerably than that with the cascaded multi-head.
Moreover, we investigate the impact of each embedding branch given a fixed code length stride, as shown in Table III. From Table III, it is observed that:

Given a target code length, the corresponding embedding branch achieves high scores in most cases when it serves as the lowest-bit head of the group. As longer hash codes usually preserve more of the original data structure and semantics, they pass positive guidance and regularization to the shorter hash codes.
As illustrated in Figure 4,

The cascaded distiller reaches a higher score when the code lengths grow consistently with the loss coefficients. We believe the major reason is this consistency with the loss coefficient ratio: the low-bit learning can benefit from the normalized and balanced gradients of high-bit learning at different levels.

A dissimilar pattern is observed in the MAP and Precision@5000 matrices of the flat distiller. Generally, it performs relatively well when the strides are close, which may indicate that a small stride between adjacent code lengths is required when applying the flat distiller in end-to-end training.
In conclusion, an excessively large stride of the multi-head structures does not bring a boost in retrieval performance, as its enlarged loss dominates and overshadows the learning of the low-bit embedding.
4.3.2 Loss Coefficients
The impact of the loss coefficients on embedding quality is reported in Figure 5 and Figure 6. We fix the code length group, and the ablation study is conducted on the CIFAR-10 dataset with the MAH equipped with the cascaded multi-head. It is observed from Figures 5 and 6 that:

The MAP of the 4-bit hash codes reaches its peak at a specific setting of the loss coefficients. Recall that the code lengths are fixed, and their ratio is the reverse of the ratio of the loss coefficients. Please note that our learned low-bit codes outperform most of the longer codes learned by other deep hashing methods (see Table II).

Most of the high scores are achieved in the last three columns, which shows that up-weighting the low-bit embedding head has a positive impact on its retrieval performance.

Regarding the auxiliary tasks, i.e., the higher-bit learning branches, they are positively correlated with the low-bit embedding in both MAP and Precision@5000.
To conclude, the setting of the loss coefficients directly influences the quality of the learned embeddings. The ratio of the loss coefficients is suggested to follow the reverse of the code length ratio for better performance, with minor adjustments applied on top.
4.4 The Study of Efficiency
The proposed MAH algorithm is further studied with regard to training time and storage cost.
4.4.1 Training Efficiency
In terms of training efficiency, we explore the training time for learning 4-bit hash codes by the proposed MAH and other deep hashing baselines on CIFAR-10. The results are shown in Figure 7. From Figure 7, it is clearly observed that:

MAH achieves its first convergence at an early epoch, and both its MAP and Precision@5000 are superior to the other state-of-the-art methods.

Due to the simplicity of its structure, DPSH consumes the least time per epoch on average, while training DAPH costs more time per epoch than MAH. Note that MAH simultaneously learns hash codes with three different lengths, which is especially practical in deployment.

DAPH holds a stable performance before suddenly climbing at a late epoch, followed by fluctuations with a slightly decreasing trend.
4.4.2 Storage Cost
As shown in Table II, the 4-bit hash codes learned by MAH achieve better performance than the 48-bit binary codes learned by DPSH and DRSCH, and the 8-bit codes surpass the 48-bit codes of the other deep hashing methods. Generally, the storage of the database hash codes grows linearly with the code length, as shown in Figure 8. Therefore, we can infer that, by applying the proposed MAH in practice, the overall storage cost of hash codes will diminish substantially without compromising performance.
5 Conclusion and Future Work
In this paper, we propose the Multi-head Asymmetric Hashing (MAH) framework, pursuing maximum semantic information with minimum binary bits. By leveraging the flat and cascaded multi-head structures, the proposed MAH distills bit-specific knowledge for low-bit codes with the guidance of other hashing learners and achieves promising performance in the low-bit retrieval task. Extensive experiments on three datasets have demonstrated the superiority of MAH over existing deep hashing methods, increasing MAP and substantially saving storage. In the near future, we plan to extend our collaborative learning strategy to the neural network quantization task, where significant compression and acceleration are expected.
Acknowledgments
This work is partially supported by ARC FT130101530 and NSFC No. 61628206.
References
 [1] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 769–790, 2018. [Online]. Available: https://doi.org/10.1109/TPAMI.2017.2699960
 [2] M. Hu, Y. Yang, F. Shen, N. Xie, and H. T. Shen, “Hashing with angular reconstructive embeddings,” IEEE Trans. Image Processing, vol. 27, no. 2, pp. 545–555, 2018. [Online]. Available: https://doi.org/10.1109/TIP.2017.2749147
 [3] Y. Luo, Y. Yang, F. Shen, Z. Huang, P. Zhou, and H. T. Shen, “Robust discrete code modeling for supervised hashing,” Pattern Recognition, vol. 75, pp. 128–135, 2018. [Online]. Available: https://doi.org/10.1016/j.patcog.2017.02.034
 [4] F. Zhu, X. Kong, L. Zheng, H. Fu, and Q. Tian, “Partbased deep hashing for largescale person reidentification,” IEEE Trans. Image Processing, vol. 26, no. 10, pp. 4806–4817, 2017. [Online]. Available: https://doi.org/10.1109/TIP.2017.2695101

 [5] J. Chen, Y. Wang, J. Qin, L. Liu, and L. Shao, “Fast person re-identification via cross-camera semantic binary transformation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 5330–5339. [Online]. Available: https://doi.org/10.1109/CVPR.2017.566
 [6] Q. Hu, P. Wang, and J. Cheng, “From hashing to cnns: Training binary weight networks via hashing,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018. [Online]. Available: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16466
 [7] R. Spring and A. Shrivastava, “Scalable and sustainable deep learning via randomized hashing,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13-17, 2017, 2017, pp. 445–454. [Online]. Available: http://doi.acm.org/10.1145/3097983.3098035
 [8] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 2285–2294. [Online]. Available: http://jmlr.org/proceedings/papers/v37/chenc15.html
 [9] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, 1999, pp. 518–529. [Online]. Available: http://www.vldb.org/conf/1999/P49.pdf
 [10] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, 2013. [Online]. Available: https://doi.org/10.1109/TPAMI.2012.193
 [11] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 37–45. [Online]. Available: https://doi.org/10.1109/CVPR.2015.7298598
 [12] P. Zhang, W. Zhang, W. Li, and M. Guo, “Supervised hashing with latent factor models,” in The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast, QLD, Australia, July 06-11, 2014, 2014, pp. 173–182. [Online]. Available: http://doi.acm.org/10.1145/2600428.2609600
 [13] W. Kang, W. Li, and Z. Zhou, “Column sampling based discrete supervised hashing,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 2016, pp. 1230–1236. [Online]. Available: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12353
 [14] J. Wang, S. Kumar, and S. Chang, “Semi-supervised hashing for large-scale search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2393–2406, 2012. [Online]. Available: https://doi.org/10.1109/TPAMI.2012.48
 [15] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen, “Unsupervised deep hashing with similarity-adaptive and discrete optimization,” IEEE Trans. Pattern Anal. Mach. Intell., 2018.
 [16] W. Li, S. Wang, and W. Kang, “Feature learning based deep supervised hashing with pairwise labels,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2016, pp. 1711–1717. [Online]. Available: http://www.ijcai.org/Abstract/16/245
 [17] Q. Jiang and W. Li, “Asymmetric deep supervised hashing,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018. [Online]. Available: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17296
 [18] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” IEEE Trans. Image Processing, vol. 24, no. 12, pp. 4766–4779, 2015.
 [19] Y. Guo, X. Zhao, G. Ding, and J. Han, “On trivial solution and high correlation problems in deep supervised hashing,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018. [Online]. Available: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16351
 [20] Y. Cao, M. Long, B. Liu, and J. Wang, “Deep cauchy hashing for hamming space retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1229–1237.
 [21] Z. Cao, M. Long, J. Wang, and P. S. Yu, “Hashnet: Deep learning to hash by continuation,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 5609–5618. [Online]. Available: https://doi.org/10.1109/ICCV.2017.598
 [22] B. Neyshabur, N. Srebro, R. Salakhutdinov, Y. Makarychev, and P. Yadollahpour, “The power of asymmetry in binary hashing,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 2823–2831. [Online]. Available: http://papers.nips.cc/paper/5017-the-power-of-asymmetry-in-binary-hashing
 [23] F. Shen, X. Gao, L. Liu, Y. Yang, and H. T. Shen, “Deep asymmetric pairwise hashing,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 1522–1530. [Online]. Available: http://doi.acm.org/10.1145/3123266.3123345
 [24] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531
 [25] Y. Lv, W. W. Y. Ng, Z. Zeng, D. S. Yeung, and P. P. K. Chan, “Asymmetric cyclical hashing for large scale image retrieval,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1225–1235, 2015. [Online]. Available: https://doi.org/10.1109/TMM.2015.2437712
 [26] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
 [27] X. Gao, F. Shen, Y. Yang, X. Xu, H. Li, and H. T. Shen, “Asymmetric sparse hashing,” in 2017 IEEE International Conference on Multimedia and Expo, ICME 2017, Hong Kong, China, July 10-14, 2017, 2017, pp. 127–132. [Online]. Available: https://doi.org/10.1109/ICME.2017.8019306
 [28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778. [Online]. Available: https://doi.org/10.1109/CVPR.2016.90
 [29] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
 [30] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in Proceedings of the 8th ACM International Conference on Image and Video Retrieval, CIVR 2009, Santorini Island, Greece, July 8-10, 2009, 2009. [Online]. Available: http://doi.acm.org/10.1145/1646396.1646452
 [31] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. New York, NY, USA: ACM, 2008.
 [32] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang, “Supervised hashing with kernels,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012, Providence, RI, USA, June 16-21, 2012, 2012, pp. 2074–2081.