1 Introduction
By projecting high-dimensional data to compact binary hash codes in the Hamming space via a proper hashing function [24], hashing methods offer remarkable efficiency for data storage and retrieval. Recently, “deep learning to hash” methods [12, 22, 15, 13, 16, 29] have shown that deep neural networks can naturally represent a nonlinear hashing function to generate hash codes for input data, and can be applied to image retrieval [29, 30] and video retrieval [6, 20, 15].
Most previous methods [1, 30, 18, 13] learn the deep hashing functions by utilizing pairwise data similarity, which captures data relationships from a local aspect. As shown in the upper panel of Fig. 1, pairwise similarity learning aims at obtaining a hashing function that generates hash codes with minimal distance for similar data and maximal distance for dissimilar data in the Hamming space. However, it intrinsically suffers from the following issues: 1) Low efficiency in profiling the whole distribution of training data. The commonly used doublet similarity [1, 30, 13] or triplet similarity metrics [18, 12] have a time complexity that grows at least quadratically with the number of data points, which makes it almost impractical to learn exhaustively from all possible data pairs. 2) Insufficient coverage of the data distribution. Pairwise similarity based methods utilize only partial relationships between data pairs, which may harm the discriminability of the generated hash codes. 3) Low effectiveness on imbalanced data. In real-world scenarios, the number of dissimilar pairs is much larger than that of similar pairs. Thus pairwise similarity learning based hashing methods cannot learn similarity relationships adequately to generate sufficiently good hash codes, leading to limited performance.
To solve the above issues, we propose a new global similarity metric, termed central similarity, and we learn it to obtain better hashing functions. In particular, central similarity measures the Hamming distance between hash codes and hash centers, which are defined as a set of points in the Hamming space with sufficient mutual distance. Central similarity learning aims at encouraging the generated hash codes to be close to their corresponding hash centers. In this way, the hash codes in the Hamming space concentrate around different hash centers, so they can be well discriminated, which benefits retrieval accuracy. From the bottom panel of Fig. 1, it can be intuitively seen that through learning central similarity, the hash codes of similar data pairs concentrate around their common hash centers (the black stars) and those of dissimilar pairs distribute around different hash centers. Central similarity learning has a time complexity of only $O(nm)$ for $n$ data points and $m$ centers. Even in the presence of severe data imbalance, the hashing function can still be well learned from the global relationships.
To obtain suitable hash centers for similar and dissimilar data pairs, two systematic approaches can be adopted. One is to use a Hadamard matrix to construct hash centers; the other is to generate hash centers by random sampling from Bernoulli distributions. We prove that both approaches can generate proper hash centers that are separated from each other by a sufficient Hamming distance.
With central similarity learning, we propose a novel network architecture, the Hash Center Network (HCN), based on Convolutional Neural Network (CNN), to learn a deep hashing function in the Hamming space. In particular, HCN consists of convolution layers for deep feature learning and a hash layer for generating hash codes. After identifying the hash centers, we train HCN to generate hash codes with the target of making similar pairs converge to the same hash center. HCN is generic and compatible with both 2D and 3D CNNs for learning hash codes for both images and videos.
Our contributions are threefold. 1) We introduce central similarity learning as a new hashing method to capture the global data distribution and generate high-quality hashing functions. 2) We propose the concept of a hash center to facilitate central similarity learning, and we present two methods to generate proper hash centers. To the best of our knowledge, this is the first work to utilize central similarity and hash centers for deep hashing function learning. 3) We propose a unified deep hash network architecture for both image and video hashing that establishes a new state of the art.
2 Related Work
Deep network based hashing methods such as CNNH [29], DNNH [12], DHN [30] and HashNet [1] have been proposed for image hashing. These “deep learning to hash” methods adopt 2D CNNs to learn image features and then hash layers to learn hash codes. Recent hashing methods for images focus on how to design a more efficient pairwise-similarity loss function. DNNH [12] proposes to use a triplet ranking loss for similarity learning. DHN [30] uses Maximum a Posteriori (MAP) estimation to obtain the pairwise similarity loss function. HashNet [1] adopts Weighted Maximum Likelihood (WML) estimation to alleviate the severe data imbalance by adding weights in pairwise loss functions. Different from previous works, we propose to use central similarity to model the relationships between similar and dissimilar pairs and improve the discriminability of generated hash codes.
Compared with image hashing, some video hashing methods such as DH [20], SRH [6] and DVH [15] propose to capture the temporal information in videos. For instance, [20] utilizes Disaggregation Hashing to exploit the correlations among different feature dimensions. [6] presents an LSTM-based method to capture the temporal information between video frames. Recently, [15] attempts to fuse the temporal information by using fully-connected layers and frame pooling. Different from these hashing methods, our proposed HCN is a unified architecture for both image and video hashing. By directly replacing 2D CNNs with 3D CNNs, the proposed HCN can capture the temporal information for video hashing.
Our central similarity is partially related to [28], which uses a center loss to learn more discriminative representations for face recognition (classification). The centers in [28] are derived from the feature representation of the corresponding categories, which is unstable under intra-class variations. Different from this center loss in recognition [28], our proposed hash centers help generate high-quality hash codes in the Hamming space. They are defined over the hash codes instead of feature representations.
3 Proposed Method
We consider learning a hashing function in a supervised manner from a training set of $N$ data points $\mathcal{X} = \{\{x_i\}_{i=1}^{N}, L\}$, where each $x_i \in \mathbb{R}^D$ is a datum to hash and $L$ denotes the semantic labels for the data $\mathcal{X}$. Let $f: x \mapsto h \in \{0,1\}^K$ denote the nonlinear hashing function from the input space $\mathbb{R}^D$ to the $K$-bit Hamming space $\{0,1\}^K$. Similar to other supervised “deep learning to hash” methods [1, 30], we aim for a hashing function such that the generated hash codes $h$ for the data $x$ are close if they share similar labels.
As aforementioned, most existing methods learn the hashing function by encouraging the generated hash codes to preserve the raw data pairwise similarity. However, they suffer from learning inefficiency and inability of modeling the global relations of the whole dataset, leading to inaccurate hash codes and degraded retrieval performance. Instead of learning from the local pairwise relations, we propose to utilize global relations to enhance the quality of the hashing function. Specifically, we supervise hashing function learning by encouraging the generated hash codes to concentrate around a common center, termed a hash center, if the input data pairs are similar.
We define a set of points $\mathcal{C} = \{c_1, \dots, c_m\} \subset \{0,1\}^K$ with sufficient distance in the Hamming space as hash centers, and propose to learn the hashing function supervised by the central similarity w.r.t. $\mathcal{C}$. The central similarity encourages similar data pairs to be close to a common hash center and dissimilar data pairs to concentrate around different hash centers. Through such central similarity learning, the global similarity information between data pairs can be preserved in $f$, giving high-quality hash codes.
Below, we first give a formal definition of a hash center and explain how to generate proper hash centers systematically. Then we elaborate on the details of central similarity learning. The HCN framework is described at the end.
3.1 Definition of Hash Center
The first step of our proposed method is to position a set of good hash centers to anchor the subsequent central similarity based hashing function learning. To ensure that the generated hash codes for dissimilar data are sufficiently distant from each other, each center should be farther from the other centers than from the hash codes associated with it. As such, dissimilar pairs can be better separated and similar pairs can be aggregated cohesively. Based on this intuition, we formally define a set of points in the Hamming space as valid hash centers with the following properties.
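Definition 1 (Hash Center).
We define hash centers as a set of points $\mathcal{C} = \{c_i\}_{i=1}^{m} \subset \{0,1\}^K$ in the $K$-dimensional Hamming space whose average pairwise distance satisfies
$$\frac{1}{T} \sum_{i < j}^{m} D_H(c_i, c_j) \geq \frac{K}{2}, \tag{1}$$
where $D_H$ is the Hamming distance, $m$ is the number of hash centers, and $T$ is the number of combinations of different $c_i$ and $c_j$.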
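The condition in Definition 1 can be checked numerically. The following is a minimal sketch, assuming centers are stored as rows of a 0/1 array and that the threshold is half the code length; the function names `hamming` and `is_valid_center_set` are illustrative and do not come from the paper.

```python
import itertools
import numpy as np

def hamming(a, b):
    """Number of bit positions where two binary vectors differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def is_valid_center_set(centers):
    """Check the average-pairwise-distance condition of Definition 1.

    Returns (condition_holds, average_distance). The threshold K/2 is the
    assumption stated in the lead-in, with K the code length.
    """
    centers = np.asarray(centers)
    m, K = centers.shape
    pairs = list(itertools.combinations(range(m), 2))  # T = m(m-1)/2 pairs
    total = sum(hamming(centers[i], centers[j]) for i, j in pairs)
    avg = total / len(pairs)
    return avg >= K / 2, avg

# Three well-spread 4-bit centers: distances 4, 2, 2 -> average 8/3 >= 2.
good = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
# Three crowded 4-bit points: distances 1, 1, 2 -> average 4/3 < 2.
bad = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
```

Note that the condition constrains only the average, so an individual pair may still be closer than half the code length.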
For better clarity, we illustrate some examples of the desired hash centers in the 3-d and 4-d Hamming spaces in Fig. 2. In Fig. 2(a), three hash codes share one hash center and have the same Hamming distance from it. In Fig. 2(b), we use a 4-d hypercube to represent the 4-d Hamming space. The two stars are the hash centers given in Definition 1, and the green dots are all at the same distance from their center. However, we do not strictly require all points to have the same distance from the corresponding center. Instead, we define the nearest center as the corresponding hash center for a hash code.
3.2 Generation of Hash Center
To obtain hash centers with the above properties, in this subsection we develop two systematic generation approaches based on the following observation. In the $K$-dimensional Hamming space, if a set of points are mutually orthogonal, they have an equal distance of $K/2$ to each other. Namely, they are valid hash centers satisfying Definition 1.
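The observation follows from a one-line identity. As a short worked derivation, write $K$ for the code length and assume the points are expressed in the $\pm 1$ encoding (the 0/1 encoding is obtained by mapping $-1$ to $0$, which leaves Hamming distances unchanged):

```latex
% Assumption: c_i, c_j \in \{-1,+1\}^K, and D_H counts differing positions.
\begin{align*}
\langle c_i, c_j \rangle
  &= \#\{k : c_{i,k} = c_{j,k}\} - \#\{k : c_{i,k} \neq c_{j,k}\} \\
  &= \bigl(K - D_H(c_i, c_j)\bigr) - D_H(c_i, c_j)
   = K - 2\,D_H(c_i, c_j), \\
\langle c_i, c_j \rangle = 0
  &\;\Longrightarrow\; D_H(c_i, c_j) = \tfrac{K}{2}.
\end{align*}
```

Each agreeing position contributes $+1$ to the inner product and each differing position contributes $-1$, so orthogonality forces exactly half of the bits to differ.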
Accordingly, our first approach is to generate hash centers by leveraging the following properties of a Hadamard matrix $H_K$. It is known that a Hadamard matrix built by Sylvester's construction satisfies: 1) It is a square matrix whose rows are mutually orthogonal, i.e., the inner product of any two distinct row vectors is zero. The Hamming distance between any two row vectors is therefore $K/2$, so we can choose hash centers from these row vectors. 2) Its size $K$ is a power of 2 (i.e., $K = 2^n$), which is consistent with the usual number of bits of hash codes. 3) It is a binary matrix whose entries are either $-1$ or $+1$. We can simply replace all $-1$ with $0$ to obtain hash centers in $\{0,1\}^K$.
To sample the hash centers from the Hadamard matrix, we first build a $K \times K$ Hadamard matrix by Sylvester’s construction [27] as follows:
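$$H_K = \begin{bmatrix} H_{2^{n-1}} & H_{2^{n-1}} \\ H_{2^{n-1}} & -H_{2^{n-1}} \end{bmatrix} = H_2 \otimes H_{2^{n-1}}, \tag{2}$$
where $\otimes$ represents the Kronecker product, and $K = 2^n$. The two factors within the initial Hadamard matrix $H_2$ are $1$ and $-1$. When the number of centers $m \leq K$, we directly choose each row to be a hash center. When $K < m \leq 2K$, we use a combination of two Hadamard matrices, $H_{2K} = [H_K, -H_K]^\top$, to construct hash centers. (Footnote 1: We prove in supplementary material that the rows of $H_{2K}$ can also be valid hash centers in the $K$-dimensional Hamming space.)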
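The construction above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the recursion uses `np.kron` (the Kronecker product that Sylvester's construction relies on), and choosing the first `m` rows is an arbitrary selection rule.

```python
import numpy as np

def sylvester_hadamard(K):
    """Build a K x K Hadamard matrix with +1/-1 entries; K must be a power of 2."""
    if K < 1 or K & (K - 1) != 0:
        raise ValueError("K must be a power of 2")
    H2 = np.array([[1, 1], [1, -1]])
    H = np.array([[1]])
    while H.shape[0] < K:
        H = np.kron(H2, H)  # yields [[H, H], [H, -H]]
    return H

def hadamard_centers(m, K):
    """Return m hash centers in {0,1}^K taken from the rows of H_K (and -H_K if m > K)."""
    if m > 2 * K:
        raise ValueError("at most 2K centers are available from H_K and -H_K")
    H = sylvester_hadamard(K)
    rows = H if m <= K else np.vstack([H, -H])
    return ((rows[:m] + 1) // 2).astype(np.uint8)  # map -1 -> 0 and +1 -> 1

centers = hadamard_centers(10, 16)
# Any two distinct rows of H_16 differ in exactly 8 = K/2 positions.
pairwise = (centers[:, None, :] != centers[None, :, :]).sum(axis=2)
```

When `m > K`, a row and its negation are at distance `K`, while all other pairs remain at distance `K/2`, so the average pairwise distance stays at or above `K/2`.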
Though applicable in most cases, the number of valid centers generated by the above approach is constrained by the fact that the Hadamard matrix is a square one. If $m$ is larger than $2K$ or $K$ is not a power of 2, the first approach is inapplicable. We thus propose a second generation approach based on randomly sampling the bits of each center vector. In particular, each bit of a center $c_i$ is sampled from a Bernoulli distribution $\mathrm{Bern}(0.5)$, where $P(x = 0) = 0.5$ if $x \sim \mathrm{Bern}(0.5)$. We can easily prove that the distance between these centers is $K/2$ in expectation. Namely, $\mathbb{E}[D_H(c_i, c_j)] = K/2$ if $c_i, c_j \sim \mathrm{Bern}(0.5)$. We summarize these two approaches in Alg. 1.
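The random-sampling approach can be illustrated empirically. The sketch below assumes each bit is drawn independently with probability 0.5; the function names and the fixed seed are illustrative choices, not part of the paper.

```python
import numpy as np

def bernoulli_centers(m, K, seed=0):
    """Sample m centers in {0,1}^K with independent fair-coin bits."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(m, K), dtype=np.uint8)

def mean_pairwise_hamming(centers):
    """Average Hamming distance over all unordered pairs of distinct centers."""
    diff = (centers[:, None, :] != centers[None, :, :]).sum(axis=2)
    upper = np.triu_indices(len(centers), k=1)
    return float(diff[upper].mean())

random_centers = bernoulli_centers(m=200, K=64)
avg_distance = mean_pairwise_hamming(random_centers)  # close to K/2 = 32
```

The guarantee here holds only in expectation: unlike Hadamard rows, an individual pair of sampled centers can be closer than half the code length, which becomes less likely as the number of bits grows.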
Once a set of hash centers is obtained, the next step is to associate the training data with their corresponding centers to compute the central similarity. Recall that $L$ is the semantic label for $\mathcal{X}$, and usually $L = \{l_1, \dots, l_q\}$, where $q$ is the number of categories. For single-label data, each datum belongs to one category, while each multi-label datum belongs to more than one category. We term the hash centers that are generated by Alg. 1 and associated with semantic labels semantic hash centers. We now explain how to obtain the semantic hash centers for single- and multi-label data separately.
Semantic hash centers for single-label data
For single-label data, we assign one hash center to each category. That is, we generate $q$ hash centers $\{c_1, \dots, c_q\}$ by Alg. 1 corresponding to the labels $\{l_1, \dots, l_q\}$. Thus, data pairs with the same label share a common center and are encouraged to be close to each other. Because each datum is assigned to one hash center, we obtain the semantic hash centers $\mathcal{C}' = \{c'_1, \dots, c'_N\}$, where $c'_i$ is the hash center of $x_i$.
Semantic hash centers for multi-label data
For multi-label data, HashNet [1] and DHN [30] directly treat data pairs as similar if they share at least one category. However, they ignore the transitive similarity when data pairs share more than one category. In this paper, we generate transitive centers for data pairs sharing multiple labels. First, we generate $q$ hash centers $\{c_1, \dots, c_q\}$ by Alg. 1 corresponding to the semantic labels $\{l_1, \dots, l_q\}$. Then for data belonging to two or more categories, we calculate the centroid of the corresponding centers, each of which corresponds to a single category. For example, suppose one datum $x$ has three categories $l_i$, $l_j$ and $l_k$. The centers of the three categories are $c_i$, $c_j$ and $c_k$, as shown in Fig. 3. We calculate the centroid $c$ of the three centers as the hash center of $x$. To ensure the elements are binary, we calculate each bit by voting at the same bit of the three centers and taking the value that dominates, as shown in the right panel of Fig. 3. If the number of 0s is equal to the number of 1s at some bits (i.e., the voting result is a draw), we sample from $\mathrm{Bern}(0.5)$ for these bits. Finally, for each $x_i \in \mathcal{X}$, we generate the centroid as its semantic hash center, and then obtain the semantic hash centers $\mathcal{C}' = \{c'_1, \dots, c'_N\}$, where $c'_i$ is the hash center of $x_i$.
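The bit-wise voting described above can be sketched as follows. This is a minimal illustration, assuming the per-category centers are rows of a 0/1 array and that ties are broken by a fair coin flip; `multilabel_center` and its arguments are hypothetical names.

```python
import numpy as np

def multilabel_center(centers, label_ids, rng=None):
    """Majority-vote centroid of the centers of the given categories.

    centers:   (q, K) array of 0/1 category centers.
    label_ids: indices of the categories a datum belongs to.
    Ties (equal numbers of 0s and 1s at a bit) are resolved by a fair coin flip.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    selected = np.asarray(centers, dtype=np.int64)[list(label_ids)]
    ones = selected.sum(axis=0)          # votes for bit value 1
    zeros = selected.shape[0] - ones     # votes for bit value 0
    center = (ones > zeros).astype(np.uint8)
    ties = ones == zeros
    center[ties] = rng.integers(0, 2, size=int(ties.sum()), dtype=np.uint8)
    return center

category_centers = np.array([[1, 1, 0, 0],
                             [1, 0, 1, 0],
                             [1, 0, 0, 0]])
# Three labels: per-bit votes for 1 are [3, 1, 1, 0], so the centroid is [1, 0, 0, 0].
three_label_center = multilabel_center(category_centers, [0, 1, 2])
# Two labels: bits 1 and 2 are draws and are sampled at random.
two_label_center = multilabel_center(category_centers, [0, 1])
```

With an odd number of labels no draw can occur, so the result is deterministic; with an even number, the random tie-breaking means two data with identical label sets may receive different centers unless the sampled centroid is cached per label set.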
3.3 Central Similarity Learning
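Given the generated centers $\mathcal{C} = \{c_1, \dots, c_q\}$ for training data $\mathcal{X}$ with $q$ categories, we obtain the semantic hash centers $\mathcal{C}' = \{c'_1, \dots, c'_N\}$ for single- or multi-label data, where $c'_i$ denotes the hash center of the datum $x_i$. We derive the central similarity learning objective by maximizing the logarithm posterior of the hash codes w.r.t. the semantic hash centers. Formally, the logarithm Maximum a Posteriori (MAP) estimation of hash codes $H = [h_1, \dots, h_N]$ for all the training data can be obtained by maximizing the following posterior probability:
$$\log P(H \mid \mathcal{C}') \propto \log P(\mathcal{C}' \mid H) P(H) = \sum_{i=1}^{N} \log P(c'_i \mid h_i) P(h_i), \tag{3}$$
where $P(H)$ is the prior distribution over hash codes, and $P(\mathcal{C}' \mid H)$ is the likelihood function. $P(c'_i \mid h_i)$ is the conditional probability of center $c'_i$ given hash code $h_i$. We model $P(\mathcal{C}' \mid H)$ as a Gibbs distribution:
$$P(c'_i \mid h_i) = \frac{1}{\alpha} \exp\left(-\beta D_H(c'_i, h_i)\right), \tag{4}$$
where $\alpha$ and $\beta$ are constants, and $D_H$ measures the Hamming distance between a hash code and its hash center. Since hash centers are binary vectors, we use Binary Cross Entropy (BCE) to measure the Hamming distance between the hash code and its center, $D_H(c'_i, h_i) = \mathrm{BCE}(c'_i, h_i)$. So the conditional probability satisfies
$$\log P(c'_i \mid h_i) \propto -\mathrm{BCE}(c'_i, h_i) = \frac{1}{K} \sum_{k=1}^{K} \left[ c'_{i,k} \log h_{i,k} + (1 - c'_{i,k}) \log (1 - h_{i,k}) \right]. \tag{5}$$
We can see that the larger the conditional probability $P(c'_i \mid h_i)$ is, the smaller the Hamming distance between the hash code $h_i$ and its hash center $c'_i$ will be, meaning the hash code is close to its corresponding center; otherwise the hash code is further away from its corresponding center. By substituting Eqn. (5) into the MAP estimation, we obtain the central similarity loss $L_C$, whose minimization maximizes the posterior:
$$L_C = -\frac{1}{K} \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ c'_{i,k} \log h_{i,k} + (1 - c'_{i,k}) \log (1 - h_{i,k}) \right]. \tag{6}$$
Since each hash center is binary, existing optimization cannot guarantee that the generated hash codes completely converge on hash centers [25] due to the inherent optimization difficulty. So we introduce a quantization loss $L_Q$ to refine the generated hash codes. Similar to DHN [30], we use a bimodal Laplacian prior for quantization, which is defined as
$$L_Q = \sum_{i=1}^{N} \left\| \, |2 h_i - \mathbf{1}| - \mathbf{1} \, \right\|_1, \tag{7}$$
where $\mathbf{1} \in \mathbb{R}^K$ is an all-one vector. As the absolute value in the $\ell_1$ norm is a non-smooth function, which makes it difficult to calculate its derivative, we adopt the smooth function $\log\cosh$ [8] to replace it. So $|x| \approx \log\cosh x$. Then the quantization loss becomes
$$L_Q = \sum_{i=1}^{N} \sum_{k=1}^{K} \log\cosh\left( |2 h_{i,k} - 1| - 1 \right). \tag{8}$$
Finally, we obtain the central similarity optimization problem:
$$\min_{\Theta} L_T = L_C + \lambda_1 L_Q, \tag{9}$$
where $\Theta$ is the set of all parameters for deep hashing function learning, and $\lambda_1$ is the hyperparameter that balances the central similarity estimation and the quantization processing. (Footnote 2: We provide the formulation for jointly estimating central similarity and pairwise similarity to learn deep hashing functions in supplementary material. The pairwise loss function is also given there.)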
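The two loss terms of this subsection can be sketched in NumPy. This is a hedged illustration under stated assumptions, not the authors' implementation: the relaxed hash codes are assumed to lie in (0, 1), the central similarity term is taken to be binary cross entropy averaged over bits and summed over data, the quantization term is taken to be log cosh of |2h - 1| - 1, and the weight `lam` and the clipping constant `eps` are hypothetical values.

```python
import numpy as np

def central_similarity_loss(h, c, eps=1e-7):
    """BCE between relaxed codes h in (0,1) and binary centers c.

    Averaged over the K bits and summed over the N data (assumed convention).
    """
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    bce = -(c * np.log(h) + (1 - c) * np.log(1 - h))
    return float(bce.mean(axis=1).sum())

def quantization_loss(h):
    """Smooth penalty that is zero when every entry of h is exactly 0 or 1."""
    return float(np.log(np.cosh(np.abs(2 * h - 1) - 1)).sum())

def total_loss(h, c, lam=0.1):
    """Central similarity loss plus weighted quantization loss (lam is hypothetical)."""
    return central_similarity_loss(h, c) + lam * quantization_loss(h)

centers = np.array([[1, 0, 1, 0], [0, 0, 1, 1]], dtype=float)
on_center = centers.copy()              # codes that already equal their centers
undecided = np.full_like(centers, 0.5)  # maximally ambiguous codes
```

Codes that coincide with their centers give both terms a value of (numerically) zero, while codes stuck at 0.5 incur log 2 per datum in the central term and log cosh(1) per bit in the quantization term, which is the behaviour the objective is meant to penalize.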
3.4 Architecture of HCN
Based on these definitions and designs, we propose a Hash Center Network (HCN) to learn central similarity for image and video hashing. The network architecture is shown in Fig. 4. The input of HCN is a pair of data $x_i$ and $x_j$ together with their hash centers $c'_i$ and $c'_j$. HCN takes this input and outputs compact hash codes through the following deep hashing pipeline: 1) a 2D or 3D CNN sub-network to extract the data representation for image or video data; 2) a hash layer with three fully-connected layers and activation functions to project high-dimensional data features to hash codes in the Hamming space; 3) a central similarity loss $L_C$ for central-similarity-preserving learning, where all hash centers are defined in the Hamming space, making hash codes converge on their corresponding centers; and 4) a quantization loss $L_Q$ for improving binarization.
4 Experiments
We conduct experiments on both image and video retrieval to evaluate our central similarity and HCN against several state-of-the-art methods.
4.1 Experiment Setting
Five benchmark datasets are used in our experiments and their statistics used in this paper are summarized in Table 1.
4.1.1 Settings for Image Hashing and Retrieval
We use three standard image retrieval datasets: ImageNet, NUS_WIDE and MS COCO. On ImageNet [21], we follow the settings in [1] and sample all images from 100 categories. As it is a single-label dataset, we directly generate 100 hash centers, with one for each category. MS COCO [14] is a multi-label image dataset with 80 categories. NUS_WIDE [3] is also a multi-label image dataset. Following [30, 12], we choose images from the 21 most frequent categories for evaluation. For the MS COCO and NUS_WIDE datasets, we first generate 80 and 21 hash centers for all categories respectively, and then calculate the centroid of the multiple centers as the semantic hash center for each image with multiple labels, following the approach in Sec. 3.2. The visualization of all generated hash centers is given in supplementary material.
We compare the retrieval performance of the proposed HCN with ten classical or state-of-the-art hashing methods, including the unsupervised methods LSH [4], SH [26] and ITQ [5], the supervised shallow methods ITQ-CCA [5], BRE [11] and SDH [22], and the supervised deep methods HashNet [1], DHN [30], CNNH [29] and DNNH [12]. For shallow hashing methods, we adopt the results from the latest works [1, 30] to make them directly comparable. We evaluate image retrieval performance based on four standard evaluation metrics: Mean Average Precision (MAP), Precision-Recall curves (PR), Precision curves w.r.t. different numbers of returned samples (P@N), and Precision curves within Hamming distance 2 (P@H=2). We adopt MAP@1000 for ImageNet as every category has 1,300 images, and adopt MAP@5000 for MS COCO and NUS_WIDE.
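For readers unfamiliar with the metric, MAP@k under Hamming ranking can be sketched as below. This is a simplified illustration for single-label data: relevance is assumed to mean "same label as the query", and average precision is normalized by the number of relevant items found within the top k, which is one common convention and may differ from the exact evaluation protocol used in the experiments. All function names are illustrative.

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distance from one binary code to every database code."""
    return (query_code[None, :] != db_codes).sum(axis=1)

def average_precision_at_k(query_code, query_label, db_codes, db_labels, k):
    """AP over the top-k database items ranked by Hamming distance."""
    order = np.argsort(hamming_distances(query_code, db_codes), kind="stable")[:k]
    relevant = (db_labels[order] == query_label).astype(float)
    if relevant.sum() == 0:
        return 0.0
    precision_at_rank = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
    return float((precision_at_rank * relevant).sum() / relevant.sum())

def mean_average_precision(q_codes, q_labels, db_codes, db_labels, k):
    """Mean of the per-query average precision values."""
    aps = [average_precision_at_k(q, l, db_codes, db_labels, k)
           for q, l in zip(q_codes, q_labels)]
    return float(np.mean(aps))

db_codes = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]])
db_labels = np.array([0, 0, 1, 1])
q_codes = np.array([[0, 0, 0, 0], [0, 0, 0, 0]])
q_labels = np.array([0, 1])
map_at_4 = mean_average_precision(q_codes, q_labels, db_codes, db_labels, k=4)
```

In this toy database the first query ranks both relevant items first (AP = 1), while the second finds its relevant items only at ranks 3 and 4 (AP = 5/12), so codes that cluster tightly by label are rewarded directly by the metric.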
4.1.2 Settings for Video Hashing and Retrieval
Two video retrieval datasets, UCF101 [23] and HMDB51 [10], are used and we directly use their default settings. On UCF101, we use 9.5k videos for training and retrieval, and 3.8k queries in every split. For HMDB51, we have 3.5k videos for training and retrieval, 1.5k videos for testing (queries) in each split.
In video retrieval experiments, HCN adopts a lightweight 3D CNN, Multi-Fiber 3D CNN [2], as the convolution layers to learn the representation for videos. We compare the retrieval performance of the proposed HCN with three deep supervised video hashing methods: DH [20], DLSTM [31] and SRH [6], based on the same evaluation metrics as in the image retrieval experiments.
Implementation
Due to space limit, we defer implementation details of HCN for image and video hashing to supplementary material.
Dataset  Data Type  #Train  #Test  #Retrieval  DI 

ImageNet  image  10,000  5,000  128,495  100:1 
MS COCO  image  10,000  5,000  112,217  1:1 
NUS_WIDE  image  10,000  2,040  149,685  5:1 
UCF101  video  9.5k  3.8k  9.5k  101:1 
HMDB51  video  3.5k  1.5k  3.5k  51:1 
4.2 Quantitative Results
Table 2: MAP for image retrieval on ImageNet (MAP@1000), MS COCO (MAP@5000) and NUS_WIDE (MAP@5000).
Method  ImageNet 16 bits  ImageNet 32 bits  ImageNet 64 bits  MS COCO 16 bits  MS COCO 32 bits  MS COCO 64 bits  NUS_WIDE 16 bits  NUS_WIDE 32 bits  NUS_WIDE 64 bits
LSH [4]  0.101  0.235  0.360  0.460  0.485  0.586  0.403  0.421  0.441
SH [26]  0.207  0.328  0.419  0.495  0.507  0.510  0.433  0.426  0.423
ITQ [5]  0.326  0.462  0.552  0.582  0.624  0.657  0.452  0.468  0.477
ITQ-CCA [5]  0.266  0.436  0.576  0.566  0.562  0.502  0.435  0.435  0.435
BRE [11]  0.063  0.253  0.358  0.592  0.622  0.634  0.485  0.525  0.544
SDH [22]  0.299  0.455  0.585  0.554  0.564  0.580  0.575  0.590  0.613
CNNH [29]  0.315  0.473  0.596  0.599  0.617  0.620  0.655  0.659  0.647
DNNH [12]  0.353  0.522  0.610  0.644  0.651  0.647  0.703  0.738  0.754
DHN [30]  0.367  0.522  0.627  0.719  0.731  0.745  0.712  0.759  0.771
HashNet [1]  0.622  0.701  0.739  0.745  0.773  0.788  0.757  0.775  0.790
HCN (Ours)  0.851  0.865  0.873  0.796  0.838  0.861  0.810  0.825  0.839
Table 3: MAP for video retrieval on UCF101 (MAP@100) and HMDB51 (MAP@70).
Method  UCF101 16 bits  UCF101 32 bits  UCF101 64 bits  HMDB51 16 bits  HMDB51 32 bits  HMDB51 64 bits
DH [20]  0.300  0.290  0.470  0.360  0.360  0.310
SRH [6]  0.716  0.692  0.754  0.491  0.503  0.509
DVH [15]  0.701  0.705  0.712  0.441  0.456  0.518
HCN (Ours)  0.838  0.875  0.874  0.527  0.565  0.579
Results in terms of Mean Average Precision (MAP) for image retrieval and video retrieval are shown in Tables 2 and 3. From Table 2, one can observe that our HCN achieves the best performance for the image retrieval task. Compared with the state-of-the-art deep hashing method HashNet, our HCN brings an absolute MAP increase of at least 13.4%, 5.1% and 4.9% across the different code lengths on ImageNet, MS COCO and NUS_WIDE respectively. Specifically, the smallest MAP boost on ImageNet exceeds that on the other two datasets by roughly 8%. Note that ImageNet has the most severe data imbalance among the three image retrieval datasets (Table 1). This suggests that central similarity learning can effectively relieve the data imbalance problem.
In Table 3, our HCN also achieves a significant performance boost for video retrieval. Compared with the best competing method at each code length, it achieves an absolute MAP increase of at least 12.0% on UCF101 and at least 3.6% on HMDB51. The larger increases on UCF101 are likely because it also suffers from severe data imbalance.
Fig. 5 and Fig. 6 show the retrieval performance in terms of Precision-Recall curves (PR), Precision curves w.r.t. different numbers of returned samples (P@N) and Precision curves within Hamming distance 2 (P@H=2), for one image dataset (ImageNet) and one video dataset (UCF101) respectively. From these two figures, we find that HCN also outperforms all comparison methods by large margins on ImageNet and UCF101 w.r.t. the three performance metrics.
4.3 Visualization Results
Visualization of retrieval results
We illustrate the retrieval results on ImageNet, MS COCO, UCF101 and HMDB51 in Fig. 7. It can be seen that HCN can return much more relevant results. On MS COCO, HCN uses the centroid of multiple centers as the hashing target for multilabel data, thus the returned images of HCN share more common labels with the query compared with HashNet.
Visualization of hash codes
To give an intuitive view of the hash codes generated by HCN, we visualize some examples with t-SNE [17] in Fig. 8. We sample 10k generated hash codes from ImageNet, so Fig. 8(a) and Fig. 8(b) have the same number of points. As can be seen, HCN generates more cohesive hash codes for similar pairs (images from the same category) and more dispersed hash codes for dissimilar pairs. This is desirable because the retrieval system can retrieve more relevant data and easily exclude irrelevant data by using Hamming ranking.
Visualization of hash code distance
We visualize the Hamming distance between 20 hash centers and the generated hash codes of ImageNet and UCF101 as heat maps in Fig. 9. The columns represent the 20 hash centers of the test data in ImageNet (1k sampled test images) or UCF101 (0.6k sampled test videos). The rows are the generated hash codes assigned to these 20 centers. We calculate the average Hamming distance between each hash center and the hash codes assigned to each center. The diagonal values in the heat maps are the average Hamming distances between the hash codes and their corresponding hash center. We find that the diagonal values are small, meaning the generated hash codes “cluster” around the corresponding hash centers in the Hamming space. Most off-diagonal values are very large, meaning dissimilar data pairs are spread sufficiently far apart. We also find that most off-diagonal values are around 32, which is exactly the Hamming distance between different hash centers in a 64-bit space.
[Fig. 9 panels: (a) ImageNet, (b) UCF101]
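The matrix behind such a heat map can be computed as sketched below. This is an illustrative reconstruction of the described procedure, assuming binary codes and centers stored as 0/1 arrays and an integer array giving each code's assigned center; the function name is hypothetical.

```python
import numpy as np

def center_distance_matrix(codes, assignments, centers):
    """Average Hamming distance between grouped hash codes and every center.

    Entry [r, c] is the mean distance from the codes assigned to center r
    (rows) to center c (columns). Rows with no assigned codes stay NaN.
    """
    m = len(centers)
    matrix = np.full((m, m), np.nan)
    for r in range(m):
        group = codes[assignments == r]
        if len(group) == 0:
            continue
        distances = (group[:, None, :] != centers[None, :, :]).sum(axis=2)
        matrix[r] = distances.mean(axis=0)
    return matrix

toy_centers = np.array([[0, 0, 0, 0], [1, 1, 1, 1]])
toy_codes = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1]])
toy_assignments = np.array([0, 0, 1])
heat = center_distance_matrix(toy_codes, toy_assignments, toy_centers)
```

In the toy example the diagonal entries are 0.5 and 0 while the off-diagonal entries are 3.5 and 4, mirroring the pattern of small diagonal and large off-diagonal values described in the text.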
4.4 Ablation Study
Ablation study I
We investigate the effects of the proposed central similarity, traditional pairwise similarity and the quantization process on hashing function learning, by evaluating different combinations of the central similarity loss $L_C$, the pairwise similarity loss $L_P$, and the quantization loss $L_Q$. Results are summarized in Table 4. Our HCN includes $L_C$ and $L_Q$, corresponding to the 1st row in Table 4. When we add $L_P$ to HCN (2nd row), MAP only increases for some code lengths. This shows that pairwise similarity has limited effect on further improving over central similarity learning. When we add $L_P$ while removing $L_C$ (3rd row), the MAP decreases significantly for all code lengths. When only using $L_C$ (4th row), the MAP decreases just slightly. These two results show the positive effects of central similarity learning.
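Table 4: MAP for different combinations of loss terms on ImageNet (MAP@1000) and MS COCO (MAP@5000). A check mark means the loss is used; a dash means it is not.
$L_C$  $L_P$  $L_Q$  ImageNet 16 bits  ImageNet 32 bits  ImageNet 64 bits  MS COCO 16 bits  MS COCO 32 bits  MS COCO 64 bits
✓  -  ✓  0.851  0.865  0.873  0.796  0.838  0.861
✓  ✓  ✓  0.847  0.870  0.871  0.798  0.835  0.863
-  ✓  ✓  0.551  0.629  0.655  0.631  0.725  0.746
✓  -  -  0.841  0.864  0.870  0.781  0.834  0.843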

Ablation study II
When applying Alg. 1, we can sample different rows of the Hadamard matrix to generate hash centers. To show HCN performs consistently well for different hash center choices, we evaluate its performance for five different combinations of hash centers. From the results in Table 5, we can validate the robustness of HCN to hash center choices.
Dataset  ImageNet  MS COCO  UCF101  HMDB51 

MAP 
5 Conclusion
In this paper, we propose a novel concept, the “Hash Center”, to formulate central similarity for deep hash learning. The proposed Hash Center Network (HCN) architecture learns hash codes by optimizing the Hamming distance between hash codes and their corresponding centers. We conduct extensive experiments to validate that HCN can generate high-quality hash codes and yield state-of-the-art performance for both image and video retrieval.
References
[1] (2017) HashNet: deep learning to hash by continuation. In ICCV, pp. 5609–5618.
[2] (2018) Multi-fiber networks for video recognition. arXiv preprint arXiv:1807.11195.
[3] (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 48.
[4] (1999) Similarity search in high dimensions via hashing. In VLDB, Vol. 99, pp. 518–529.
[5] (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929.
[6] (2016) Supervised recurrent hashing for large scale video retrieval. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 272–276.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[8] Natural image statistics: a probabilistic approach to early computational vision. Springer.
[9] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[10] (2013) HMDB51: a large video database for human motion recognition. In High Performance Computing in Science and Engineering ‘12, pp. 571–582.
[11] (2009) Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems, pp. 1042–1050.
[12] (2015) Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3270–3278.
[13] (2015) Feature learning based deep supervised hashing with pairwise labels. arXiv preprint arXiv:1511.03855.
[14] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[15] (2017) Deep video hashing. IEEE Transactions on Multimedia 19 (6), pp. 1209–1219.
[16] (2016) Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2064–2072.
[17] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[18] (2012) Hamming distance metric learning. In Advances in Neural Information Processing Systems, pp. 1061–1069.
[19] (2017) Automatic differentiation in PyTorch.
[20] (2017) Fast action retrieval from videos via feature disaggregation. Computer Vision and Image Understanding 156, pp. 104–116.
[21] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[22] (2015) Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 37–45.
[23] (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
[24] (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927.
[25] (2009) Why is optimization difficult? In Nature-Inspired Algorithms for Optimisation, pp. 1–50.
[26] (2009) Spectral hashing. In Advances in Neural Information Processing Systems, pp. 1753–1760.
[27] (2002) Hadamard matrix.
[28] (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515.
[29] (2014) Supervised hashing for image retrieval via image representation learning. In AAAI, Vol. 1, pp. 2.
[30] (2016) Deep hashing network for efficient similarity retrieval. In AAAI, pp. 2415–2421.
[31] (2016) DLSTM approach to video modeling with hashing for large-scale video retrieval. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pp. 3222–3227.
6 Supplementary Material
6.1 Jointly Learning with Pairwise Similarity
Given the semantic hash centers and pairwise similarity label , we can formulate central similarity and pairwise similarity based learning together to optimize the deep hashing functions. Recall the similarity label indicates the data pairs and are similar. The Maximum Likelihood (ML) estimation of hash codes for all training data with label can be obtained by maximizing the following likelihood probability:
(10) 
Since we build the hash centers based on , the is known and can be treated as constant. Equation (10) thus becomes . Then the log likelihood can be written as
(11)  
where the first term on the right-hand side represents the central similarity and the second term represents the pairwise similarity. The central similarity loss is given in Sec. 3.3. For the pairwise similarity term in Equation (11), we use the inner product of the hash codes to model the probability of the similarity labels.
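Recall that the Hamming distance $D_H(\cdot,\cdot)$ and the inner product $\langle\cdot,\cdot\rangle$ of any two $K$-bit hash codes $h_i$ and $h_j$ with entries in $\{-1, +1\}$ satisfy $D_H(h_i, h_j) = \frac{1}{2}\left(\mathbf{1}^\top\mathbf{1} - \langle h_i, h_j\rangle\right) = \frac{1}{2}\left(K - \langle h_i, h_j\rangle\right)$, where $\mathbf{1}$ is the all-one vector. We use the inner product in place of the Hamming distance and define $P(s_{ij} \mid h_i, h_j)$, the conditional probability of $s_{ij}$, as follows: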
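$$P(s_{ij} \mid h_i, h_j) = \begin{cases} \sigma(\langle h_i, h_j\rangle), & s_{ij} = 1, \\ 1 - \sigma(\langle h_i, h_j\rangle), & s_{ij} = 0, \end{cases} \tag{12}$$
or equivalently,
$$P(s_{ij} \mid h_i, h_j) = \sigma(\langle h_i, h_j\rangle)^{s_{ij}} \left(1 - \sigma(\langle h_i, h_j\rangle)\right)^{1 - s_{ij}}, \tag{13}$$
where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. This logistic-regression-like formulation ensures that the smaller the Hamming distance $D_H(h_i, h_j)$, the larger the inner product $\langle h_i, h_j\rangle$ and the larger the conditional probability $P(s_{ij} = 1 \mid h_i, h_j)$. This means that the pair $x_i$ and $x_j$ has a large probability of being classified as similar. Otherwise, the pair would be classified as dissimilar ($P(s_{ij} = 0 \mid h_i, h_j)$ is large). After taking the negative logarithm of Equation (13) and simplifying, maximizing the above likelihood is equivalent to minimizing the following pairwise similarity loss:
$$L_P = \sum_{s_{ij} \in \mathcal{S}} \left( \log\left(1 + \exp(\langle h_i, h_j\rangle)\right) - s_{ij} \langle h_i, h_j\rangle \right). \tag{14}$$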
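The pairwise term can be illustrated with a small sketch. It assumes the standard negative log-likelihood of the sigmoid model, $\log(1 + \exp(\langle h_i, h_j\rangle)) - s_{ij}\langle h_i, h_j\rangle$, summed over pairs, and uses toy codes with entries in $\{-1, +1\}$; the function names and the toy data are hypothetical and not taken from the authors' implementation.

```python
import math


def inner(a, b):
    """Inner product of two equal-length code vectors."""
    return sum(x * y for x, y in zip(a, b))


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def pairwise_loss(codes, sim):
    """Sum of log(1 + exp(<h_i, h_j>)) - s_ij * <h_i, h_j> over all pairs i < j.

    codes: list of code vectors (assumed +/-1 valued here).
    sim:   sim[i][j] is 1 for a similar pair and 0 for a dissimilar pair.
    """
    total = 0.0
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            z = inner(codes[i], codes[j])
            total += softplus(z) - sim[i][j] * z
    return total


# Toy example: codes 0 and 1 are close, code 2 is far from both.
codes = [[1, 1, -1, -1], [1, 1, -1, 1], [-1, -1, 1, 1]]
sim_consistent = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # labels agree with the codes
sim_flipped = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]     # labels contradict the codes

loss_consistent = pairwise_loss(codes, sim_consistent)
loss_flipped = pairwise_loss(codes, sim_flipped)
```

In this sketch the loss is small when similar pairs have a large inner product and dissimilar pairs a small one, and it grows when the labels contradict the geometry of the codes.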
Putting all the pieces together, we obtain the following joint optimization problem:
(15) 
where $L_Q$ is the quantization loss given in the main text. In the experiment section of the main text, we also present and discuss the performance of joint learning, which combines $L_C$ and $L_P$, in the first ablation study.
6.2 Implementation Details
Implementation details for image retrieval
We implement the HCN model in the PyTorch [19] framework and employ the ResNet [7] architecture as the 2D CNN for image feature learning. For fair comparison, the four baseline deep methods use the same feature extraction network with the same configuration. We fine-tune the four convolution layers conv1 to conv4, which are initialized from a ResNet model pretrained on ImageNet, with a learning rate of 1e-5. The test data are never used in pretraining. We train the hash layer from scratch with a learning rate 20 times that of the convolution layers. We use the Adam solver [9] with a batch size of 64 and fix the hyperparameters and .
Implementation details for video retrieval
We employ MFN [2] as the 3D CNN for video feature learning. The HCN is first pretrained on an action classification task to learn video features, and we copy the parameters of its 3D convolution layers. We then fine-tune the convolutional layers with a learning rate of 5e-4 and train the hash layer with a learning rate 5 times that of the 3D convolution layers. We use mini-batch stochastic gradient descent (SGD) with a momentum of 0.9. The batch size is 32 and the weight decay is 0.0001. Training on two TITAN X GPUs (12 GB) takes around 16 hours for UCF101 and 9 hours for HMDB51.
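The two-rate schedules described above can be written down as a small sketch. It reads the stated base rates as 1e-5 (image) and 5e-4 (video) and only computes the resulting hash-layer rates; the helper `build_param_groups` and the group names are hypothetical, and the actual model parameters are left out.

```python
def build_param_groups(backbone_lr, hash_lr_multiplier):
    """Return optimizer-style settings for the backbone and the hash layer.

    Hypothetical helper: the hash layer uses the backbone learning rate
    scaled by a constant multiplier.
    """
    return [
        {"name": "backbone", "lr": backbone_lr},
        {"name": "hash_layer", "lr": backbone_lr * hash_lr_multiplier},
    ]


# Image retrieval: convolution layers at 1e-5, hash layer at 20x that rate.
image_groups = build_param_groups(backbone_lr=1e-5, hash_lr_multiplier=20)

# Video retrieval: 3D convolution layers at 5e-4, hash layer at 5x that rate.
video_groups = build_param_groups(backbone_lr=5e-4, hash_lr_multiplier=5)

for group in image_groups + video_groups:
    print(f'{group["name"]}: lr = {group["lr"]:.1e}')
```

A list of dictionaries of this shape roughly mirrors the per-parameter options that PyTorch optimizers accept once a `params` entry is added to each dictionary.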
6.3 Visualization of Hash Centers
In this section, we visualize some of the hash centers generated by Algorithm 1. The Hadamard matrix is shown in Fig. 10. The 64-bit hash centers for the five datasets we use are constructed from $H_{64}$ and $-H_{64}$, following Algorithm 1.
For NUS_WIDE, we only sample the 21 most frequent categories for experiments. Because $21 \le 2K$ holds even for $K = 16$, all the 16-bit, 32-bit and 64-bit hash centers for NUS_WIDE are constructed from the Hadamard matrices $H_{16}$, $H_{32}$ and $H_{64}$. For the other four datasets, the 64-bit hash centers are constructed from $H_{64}$ and $-H_{64}$, but the 16-bit and 32-bit hash centers are constructed by sampling from Bernoulli distributions. We illustrate the 16-bit, 32-bit and 64-bit hash centers for ImageNet in Fig. 12 and Fig. 13. The hash centers for the other three datasets are similar. In Fig. 12 and Fig. 13, every row represents the hash center of one category in ImageNet.
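The Bernoulli fallback can be sketched as follows. The sketch assumes that every bit of every center is drawn independently with probability 0.5, which gives an average pairwise Hamming distance of about half the code length in expectation only; the function names, the seed, and the choice of 100 centers with 32 bits are illustrative assumptions.

```python
import random
from itertools import combinations


def sample_bernoulli_centers(num_centers, num_bits, p=0.5, seed=0):
    """Draw each bit of each center independently from Bernoulli(p)."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p else 0 for _ in range(num_bits)]
        for _ in range(num_centers)
    ]


def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_hamming(centers):
    """Average Hamming distance over all unordered pairs of centers."""
    pairs = list(combinations(centers, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Example: 100 categories with 32-bit codes, a case where 100 > 2 * 32.
centers = sample_bernoulli_centers(num_centers=100, num_bits=32)
average_distance = mean_pairwise_hamming(centers)
print(f"average pairwise Hamming distance: {average_distance:.2f}")
```

Because the bound is met only in expectation, a practical implementation might check the average distance and resample when it falls short; that check is not shown here.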
6.4 Proof of Hash Center Validity for $H_{2K}$
When $K < m \le 2K$ in Algorithm 1, where $m$ is the number of hash centers, we use the combination of two Hadamard matrices $H_{2K} = [H_K, -H_K]^\top$ to construct the hash centers. Here, we prove that the rows of $H_{2K}$ are also valid hash centers in the $K$-dimensional Hamming space. According to Definition 1, if the Hamming distance between any two row vectors of $H_{2K}$ is equal to or larger than $K/2$, the row vectors of $H_{2K}$ are valid hash centers.
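We consider the following three cases for the Hamming distance between any two distinct row vectors $h_i$ and $h_j$ of $H_{2K}$:

Case 1. Both row vectors $h_i$ and $h_j$ belong to the upper half or to the lower half of $H_{2K}$, i.e., $h_i, h_j \in H_K$ or $h_i, h_j \in -H_K$. Then $h_i$ and $h_j$ are still orthogonal to each other, with inner product $\langle h_i, h_j\rangle = 0$, and we get the Hamming distance $D_H(h_i, h_j) = K/2$.

Case 2. One of the two row vectors belongs to $H_K$ and the other belongs to $-H_K$. We assume $h_i \in H_K$ and $h_j \in -H_K$. If $h_j \neq -h_i$, the two row vectors are still orthogonal to each other, thus $D_H(h_i, h_j) = K/2$.

Case 3. One of the two row vectors belongs to $H_K$ and the other belongs to $-H_K$, but $h_j = -h_i$. Thus the inner product is $\langle h_i, h_j\rangle = -K$, and $D_H(h_i, h_j) = K$.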
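We summarize these three cases as follows:
$$D_H(h_i, h_j) = \begin{cases} K/2, & h_i, h_j \in H_K \ \text{or}\ h_i, h_j \in -H_K, \\ K/2, & h_i \in H_K,\ h_j \in -H_K,\ h_j \neq -h_i, \\ K, & h_i \in H_K,\ h_j \in -H_K,\ h_j = -h_i. \end{cases} \tag{16}$$
So the average Hamming distance is larger than $K/2$, and the row vectors of $H_{2K}$ are valid hash centers in the $K$-dimensional Hamming space.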
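The three cases can be checked numerically with a short sketch. It assumes that $H_K$ is built by Sylvester's construction (so $K$ is a power of two) and stacks the rows of $H_K$ and $-H_K$; the function names and the choice $K = 16$ are illustrative.

```python
from itertools import combinations


def sylvester_hadamard(k):
    """Return the k x k Hadamard matrix from Sylvester's construction.

    k must be a power of two; entries are +1 or -1.
    """
    if k < 1 or k & (k - 1):
        raise ValueError("k must be a power of two")
    h = [[1]]
    while len(h) < k:
        # H_{2n} = [[H_n, H_n], [H_n, -H_n]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def hamming(a, b):
    """Number of positions where two +/-1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


K = 16
h_k = sylvester_hadamard(K)
# Rows of H_K followed by rows of -H_K: 2K candidate centers of length K.
h_2k = h_k + [[-v for v in row] for row in h_k]

distances = [hamming(a, b) for a, b in combinations(h_2k, 2)]
average_distance = sum(distances) / len(distances)

print(sorted(set(distances)))   # distinct pairwise distances
print(average_distance > K / 2)
```

Only the $K$ pairs of the form $(h_i, -h_i)$ reach distance $K$; every other pair sits at exactly $K/2$, so the average is strictly above $K/2$.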