Central Similarity Hashing via Hadamard matrix

Hashing has been widely used for efficient large-scale multimedia data retrieval. Most existing methods learn hashing functions from data pairwise similarity to generate binary hash codes. However, in practice we find only learning from the local relationships of pairwise similarity cannot capture the global distribution of large-scale data, which would degrade the discriminability of the generated hash codes and harm the retrieval performance. To overcome this limitation, we propose a new global similarity metric, termed as central similarity, to learn better hashing functions. The target of central similarity learning is to encourage hash codes for similar data pairs to be close to a common center and those for dissimilar pairs to converge to different centers in the Hamming space, which substantially improves retrieval accuracy. In order to principally formulate the central similarity learning, we define a new concept, hash center, to be a set of points scattered in the Hamming space with a sufficient distance between each other, and propose to use Hadamard matrix to construct high-quality hash centers efficiently. Based on these definitions and designs, we devise a new hash center network (HCN) that learns hashing functions by optimizing the central similarity w.r.t. these hash centers. The central similarity learning and HCN are generic and can be applied for both image and video hashing. Extensive experiments for both image and video retrieval demonstrate HCN can generate cohesive hash codes for similar data pairs and dispersed hash codes for dissimilar pairs, and achieve noticeable boost in retrieval performance, i.e. 4%-13% in MAP over latest state-of-the-arts. The codes are in: <https://github.com/yuanli2333/Hadamard-Matrix-for-hashing>


page 8

page 11

page 12


Instance-weighted Central Similarity for Multi-label Image Retrieval

Deep hashing has been widely applied to large-scale image retrieval by e...

Mean Local Group Average Precision (mLGAP): A New Performance Metric for Hashing-based Retrieval

The research on hashing techniques for visual data is gaining increased ...

EcoFormer: Energy-Saving Attention with Linear Complexity

Transformer is a transformative framework that models sequential data an...

Binary Subspace Coding for Query-by-Image Video Retrieval

The query-by-image video retrieval (QBIVR) task has been attracting cons...

CIMON: Towards High-quality Hash Codes

Recently, hashing is widely-used in approximate nearest neighbor search ...

Conv-codes: Audio Hashing For Bird Species Classification

In this work, we propose a supervised, convex representation based audio...

Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos

Conventional fake video detection methods outputs a possibility value or...

1 Introduction

Figure 1: Comparison of traditional pairwise similarity learning and our central similarity learning. We visualize the high-dimensional Hamming space in the 2d Euclidean space. Each point represents a hash code of an image in the Hamming space, generated by a hashing function . Points with the same color denote hash codes of similar pairs. The distance between points reflects the Hamming distance.

By projecting high-dimensional data to compact binary hash codes in the Hamming space via a proper hashing function 


, hashing methods offer remarkable efficiency for data storage and retrieval. Recently, “deep learning to hash” methods 

[12, 22, 15, 13, 16, 29]

have shown that deep neural networks can naturally represent a nonlinear hashing function to generate hash codes for input data and be applied to image retrieval 

[29, 30] and video retrieval [6, 20, 15].

Most previous methods [1, 30, 18, 13] learn the deep hashing functions by utilizing pairwise data similarity that captures data relationships from a local aspect. As shown in the upper panel of Fig. 1, the pairwise similarity learning aims at obtaining a hashing function that generates hash codes with minimal distance for similar data and maximal distance for dissimilar data in the Hamming space. However, it intrinsically suffers the following issues: 1) Low-efficiency in profiling the whole distribution of training data. The commonly used doublet similarity [1, 30, 13] or triplet similarity metrics [18, 12] have a time complexity at the order of for learning hashing functions from data points, which means it is almost impractical to exhaustively learn from all possible data pairs. 2) Insufficient coverage of data distribution. Pairwise similarity based methods utilize only partial relationships between data pairs, which may harm the discriminability of the generated hash codes. 3) Low effectiveness on imbalanced data. In real world scenarios, the number of dissimilar pairs is much larger than that of similar pairs. Thus pairwise similarity learning based hashing methods cannot learn similarity relationships adequately to generate sufficiently good hash codes, leading to limited performance.

To solve the above issues, we propose a new global similarity metric, termed as central similarity, and we learn it to obtain better hashing functions. In particular, central similarity measures the Hamming distance between hash codes and hash center which is defined as a set of points in the Hamming space with a sufficient mutual distance. Central similarity learning targets at encouraging the generated hash codes to be close to the corresponding hash centers. In this way, the various hash codes in the Hamming space would concentrate around different hash centers thus can be well discriminated and benefit the retrieval accuracy. From the bottom panel of Fig. 1, it can be intuitively seen that through learning central similarity, the hash codes of similar data pairs concentrate around their common hash centers (the black stars) and those of dissimilar pairs distribute around different hash centers. The central similarity learning is with time complexity of only for data points and centers. Even in presence of severe data imbalance, the hashing function can still be well learned from the global relationships.

To obtain suitable hash centers for similar and dissimilar data pairs, two systematic approaches can be adopted. One is to use Hadamard matrix to construct hash centers; the other is to generate hash centers with random sampling from Bernoulli distributions. We prove that both approaches can generate proper hash centers that are separated from each other with sufficient Hamming distance.

With central similarity learning, we propose a novel network architecture, the Hash Center Network (HCN), based on Convolutional Neural Network (CNN), to learn a deep hashing function in the Hamming space. In particular, HCN consists of convolution layers for deep feature learning and a hash layer for generating hash codes. After identifying the hash centers, we train HCN to generate hash codes with the target of making similar pairs converge to the same hash center. HCN is generic and compatible with both 2D and 3D CNNs for learning hash codes for both images and videos.

Our contributions are three-fold. 1) We introduce a novel central similarity learning as a new hashing method to capture global data distribution and generate high-quality hashing functions. 2) We propose a novel concept of hash center to facilitate the central similarity learning. We also present two methods to generate proper hash centers. To our best knowledge, this is the first work to utilize central similarity and hash center for deep hashing function learning. 3) We propose a unified deep hash network architecture for both image and video hashing that establishes new state-of-the-art.

2 Related Work

Deep network based hashing methods such as CNNH [29], DNNH [12], DHN [30] and HashNet [1]

have been proposed for image hashing. These “deep learning to hash” methods adopt 2D CNNs to learn image features and then hash layers to learn hash codes. Recent hashing methods for images focus on how to design a more efficient pairwise-similarity loss function. DNNH 

[12] proposes to use a triplet ranking loss for similarity learning. DHN [30]

uses Maximum a Posterior (MAP) estimation to obtain the pairwise similarity loss function. HashNet 

[1] adopts Weighted Maximum Likelihood (WML) estimation to alleviate the severe data imbalance by adding weights in pairwise loss functions. Different from previous works, we propose to use central similarity to model the relationships between similar and dissimilar pairs and improve the discriminability of generated hash codes.

Compared with image hashing, some video hashing methods such as DH [20], SRH [6], DVH [15] propose to capture the temporal information in videos. For instance, [20] utilizes Disaggregation Hashing to exploit the correlations among different feature dimensions. [6] presents an LSTM-based method to capture the temporal information between video frames. Recently, [15] attempts to fuse the temporal information by using fully-connected layers and frame pooling. Different from these hashing methods, our proposed HCN is a unified architecture for both image and video hashing. Via directly replacing 2D CNNs with 3D CNNs, the proposed HCN can capture the temporal information for video hashing.

Our central similarity is partially related with [28]

which uses a center loss to learn more discriminative representation for face recognition (classification). The centers in 

[28] are derived from the feature representation of the corresponding categories, which is unstable with intra-class variations. Different from this center loss in recognition [28], our proposed hash centers help generate high-quality hash codes in the Hamming space. It is defined over the hash codes instead of feature representations.

3 Proposed Method

We consider learning a hashing function in a supervised manner from a training set of data , where each is a datum to hash and denotes the semantic label for data . Let denote the nonlinear hashing function from input space to -bit Hamming space . Similar to other supervised “deep learning to hash” methods [1, 30], we aim for such a hashing function that the generated hash codes ’s for the data ’s are close if they share similar labels.

As aforementioned, most existing methods learn the hashing function by encouraging the generated hash codes to preserve the raw data pairwise similarity. However, they suffer from learning inefficiency and inability of modeling the global relations of the whole dataset, leading to inaccurate hash codes and degraded retrieval performance. Instead of learning from the local pairwise relations, we propose to utilize global relations to enhance the quality of the hashing function. Specifically, we supervise hashing function learning by encouraging the generated hash codes to concentrate around a common center, termed a hash center, if the input data pairs are similar.

We define a set of points with sufficient distance in the Hamming space as hash centers, and propose to learn the hashing functions supervised by the central similarity w.r.t. . The central similarity would encourage similar data pairs to be close to a common hash center and dissimilar data pairs to concentrate around different hash centers. Through such central similarity learning, the global similarity information between data pairs can be preserved in , giving high-quality hash codes.

In below, we first give a formal definition of hash center and explain how to generate proper hash centers systematically. Then we elaborate on details of the central similarity learning. The HCN framework will be described in the end.

3.1 Definition of Hash Center

The first step of our proposed method is to position a set of good hash centers to anchor the following central similarity based hashing function learning. To ensure the generated hash codes for dissimilar data are sufficiently distant from each other, each center should be more distant from the other centers than to the hash codes associated with it. As such, the dissimilar pairs can be better separated and similar pairs can be aggregated cohesively. Based on this intuition, we formally define a set of points in the Hamming space as valid hash centers with following properties.

Definition 1 (Hash Center).

We define hash centers as a set of points in the -dimensional Hamming space whose average pairwise distance satisfies::


where is the Hamming distance, is the number of hash centers, and is the number of combinations of different and .

For better clarity, we illustrate some examples of the desired hash centers in the 3d and 4d Hamming space in Fig. 2. In Fig. 2(a), the hash center of the hash codes , and is , and the three hash codes have the same Hamming distance from . In Fig. 2(b), we use 4d hypercube to represent the 4d Hamming space. The two stars and are the hash centers given in Definition 1. The distance between and is , and the distance between the green dots and the center is the same (). However, we do not strictly require all points to have the same distance from the corresponding center. Instead, we define the nearest center as the corresponding hash center for one hash code.

(a) 3d Hamming space
(b) 4d Hamming space
Figure 2: Illustration of hash centers in 3d and 4d Hamming space. The coordinates of each point in the Hamming space are exactly the hash codes and given in the figures. Star points are hash centers; colored ones are hash codes assigned to the connected center.

3.2 Generation of Hash Center

To obtain the hash centers with the above properties, in this subsection, we develop two systematic generation approaches based on the following observation. In the -dimensional Hamming space, if a set of points are mutually orthogonal, they will have an equal distance of to each other. Namely they are valid hash centers satisfying Definition 1.

Accordingly, our first approach is to generate hash centers by leveraging the following nice properties of a Hadamard matrix. It is known that a Hadamard matrix satisfies: 1) It is a squared matrix whose rows are mutually orthogonal, i.e.

, the inner products of any two row vectors

. The Hamming distance between any two row vectors is . Therefore, we can choose hash centers from these row vectors. 2) Its size is a power of (i.e., ), which is consistent with the usual number of bits of hash codes. 3) It is a binary matrix whose entries are either -1 or +1. We can simply replace all -1 with 0 to obtain hash centers in .

To sample the hash centers from the Hadamard matrix, we first build a Hadamard matrix by Sylvester’s construction [27] as follows:


where represents the Hadamard product, and . The two factors within the initial Hadamard matrix are and . When the number of centers , we directly choose each row to be a hash center. When , we use a combination of two Hadamard matrices to construct hash centers111We prove that the rows of can also be valid hash centers in the -dimensional Hamming space in supplementary material..

input : the number of hash centers , the dimension of Hamming space (hash codes)
initialization: construct a Hadamard matrix and construct matrix
for iteration , to  do
        if  &  then   // is any
        end if
       else if  &  then
        end if
end for
Replace all -1 with 0 in these centers
output : hash centers:
Algorithm 1 Generation of Hash Center

Though applicable in most cases, the number of valid centers generated by the above approach is constrained by the fact that the Hadamard matrix is a squared one. If is larger than or is not the power of 2, the first approach is inapplicable. We thus propose the second generation approach based on randomly sampling the bits of each center vector. In particular, each bit of a center is sampled from a Bernoulli distribution where if . We can easily prove that the distance between these centers is in expectation. Namely, if . We summarize these two approaches in Alg. 1.

Once obtaining a set of hash centers, the next step is to associate the training data with their individual corresponding centers to compute the central similarity. Recall is the semantic label for , and usually , where is the number of categories. For single-label data, each datum belongs to one category, while each multi-label datum belongs to more than one category. We term the hash centers that are generated from Alg. 1 and associated with semantic labels as semantic hash centers. We now explain how to obtain the semantic hash centers for single- and multi-label data separately.

Semantic hash centers for single-label data

For single-label data, we assign one hash center for each category. That is, we generate hash centers by Alg. 1 corresponding to labels . Thus, data pairs with the same label share a common center and are encouraged to be close to each other. Because each datum is assigned to one hash center, we obtain the semantic hash centers , where is the hash center of .

Figure 3: The center from multi-centers for multi-label data.
Semantic hash centers for multi-label data

For multi-label data, HashNet [1] and DHN [30] directly make data pairs be similar if they share at least one category. However, they ignore the transitive similarity when data pairs share more than one category. In this paper, we generate transitive centers for data pairs sharing multiple labels. First, we generate hash centers by Alg. 1 corresponding to semantic labels . Then for data including two or more categories, we calculate the centroid of these centers, each of which corresponds to a single category. For example, suppose one datum has three categories , and . The centers of the three categories are , and , as shown in Fig 3. We calculate the centroid of the three centers as the hash center of . To ensure the elements to be binary, we calculate each bit by voting at the same bit of the three centers and taking the value that dominates, as shown in the right panel of Fig 3. If the number of 0 is equal to the number of 1 at some bits (i.e., the voting result is a draw), we sample from for these bits. Finally, for each , we generate the centroid as its semantic hash center, and then obtain semantic hash centers , where is the hash center of .

Figure 4: Architecture of proposed Hash Center Network (HCN). HCN takes as input similar and dissimilar pairs (images or videos). For image or video data, we use different types of CNNs for feature learning. After passing through a hash layer, similar pairs converge to a common center and dissimilar pairs converge to different centers by adding central similarity constraint. The convergent targets are the hash centers (Center1 and Center2) in the Hamming space.

3.3 Central Similarity Learning

Given the generated centers for training data with categories, we obtain the semantic hash centers for single- or multi-label data, where denotes the hash center of the datum . We derive the central similarity learning objective by maximizing the logarithm posterior of the hash codes w.r.t. the semantic hash centers. Formally, the logarithm Maximum a Posterior (MAP) estimation of hash codes

for all the training data can be obtained by maximizing the following likelihood probability:


where is the prior distribution over hash codes, and is the likelihood function. is the conditional probability of center given hash code . We model as a Gibbs distribution:


where and are constants, and measures the Hamming distance between a hash code and its hash center. Since hash centers are binary vectors, we use Binary Cross Entropy (BCE) to measure the Hamming distance between the hash code and its center, . So the conditional probability is


We can see that the larger the conditional probability is, the smaller the Hamming distance will be between hash code and its hash center , meaning the hash code is close to its corresponding center; otherwise the hash code is further away from its corresponding center. By substituting Eqn. (5) into MAP estimation, we obtain the optimization objective of the central similarity loss :


Since each hash center is binary, existing optimization cannot guarantee that the generated hash codes completely converge on hash centers [25] due to the inherent optimization difficulty. So we introduce a quantization loss to refine the generated hash codes and . Similar with DHN [30], we use bi-modal Laplacian prior for quantization, which is defined as


where is an all-one vector. As is a non-smooth function which makes it difficult to calculate its derivative, we adopt the smooth function  [8] to replace it. So . Then the quantization loss becomes


Finally, we obtain the central similarity optimization problem:


where is the set of all parameters for deep hashing function learning, and is the hyper-parameter to balance the central similarity estimation and quantization processing.222We provide the formulation for jointly estimating central similarity and pairwise similarity to learn deep hashing functions in supplementary material. And pairwise loss function is also given.

3.4 Architecture of HCN

Base on these definitions and designs, we propose a Hash Center Network (HCN) to learn central similarity for image and video hashing. The network architecture is shown in Fig. 4. The input of HCN is . Here and are the hash centers for and

respectively. HCN takes this input and outputs compact hash codes through the following deep hashing pipeline: 1) a 2D or 3D CNN sub-network to extract the data representation for image or video data, 2) a hash layer with three fully-connected layers and activation functions to project high dimensional data features to hash codes in the Hamming space, 3) a central similarity loss

for central similarity-preserving learning, where all hash centers are defined in the Hamming space, making hash codes converge on corresponding centers. and 4) a quantization loss

for improving binarization.

4 Experiments

We conduct experiments for both image and video retrieval to evaluate our central similarity and HCN against several state-of-the-arts.

4.1 Experiment Setting

Five benchmark datasets are used in our experiments and their statistics used in this paper are summarized in Table 1.

4.1.1 Settings for Image Hashing and Retrieval

We use three standard image retrieval datasets, ImageNet, NUS_WIDE and MS COCO. On

ImageNet [21], we follow the settings in [1] and sample all images from 100 categories. As it is a single-label dataset, we directly generate 100 hash centers, with one for each category. MS COCO [14] is a multi-label image dataset with 80 categories. NUS_WIDE [3] is also a multi-label image dataset. Following [30, 12], we choose images from the 21 most frequent categories for evaluation. For MS COCO and NUS_WIDE datasets, we first generate 80 and 21 hash centers for all categories respectively, and then calculate the centroid of the multi-centers as the semantic hash centers for each image with multiple labels, following the approach in Sec. 3.2. The visualization of all generated hash centers is given in supplementary material.

We compare retrieval performance of proposed HCN with ten classical or state-of-the-art hashing methods, including unsupervised methods LSH [4], SH [26], ITQ [5], supervised shallow methods ITQ-CCA [5], BRE [11], SDH [22] and supervised deep methods HashNet [1], DHN [30], CNNH [29], DNNH [12]. For shallow hashing methods, we adopt the result from latest works [1, 30]

to make them directly comparable. We evaluate image retrieval performance based on four standard evaluation metrics: Mean Avearage Precision (MAP), Precision-Recall curves (PR), Precision curves w.r.t. different numbers of returned samples (P@N), Precision curves within Hamming distance 2 (P@H=2). We adopt MAP@1000 for ImageNet as every category has 1,300 images, and adopt MAP@5000 for MS COCO and NUS_WIDE.

We implement the HCN model on Pytorch 

[19] framework and employ 2D convolution layers from ResNet [7] as image feature learning. For fair comparison, the four deep methods use the same feature learning network.

4.1.2 Settings for Video Hashing and Retrieval

Two video retrieval datasets, UCF101 [23] and HMDB51 [10], are used and we directly use their default settings. On UCF101, we use 9.5k videos for training and retrieval, and 3.8k queries in every split. For HMDB51, we have 3.5k videos for training and retrieval, 1.5k videos for testing (queries) in each split.

In video retrieval experiments, HCN adopts a lightweight 3D CNN, Multi-Fiber 3D CNN [2], as the convolution layers to learn the representation for videos. We compare retrieval performance of proposed HCN with three deep supervised video hashing methods: DH [20], DLSTM [31] and SRH [6] based on the same evaluation metrics with image retrieval experiments.


Due to space limit, we defer implementation details of HCN for image and video hashing to supplementary material.

Dataset Data Type #Train #Test #Retrieval DI
ImageNet image 10,000 5,000 128,495 100:1
MS COCO image 10,000 5,000 112,217 1:1
NUS_WIDE image 10,000 2,040 149,685 5:1
UCF101 video 9.5k 3.8k 9.5k 101:1
HMDB51 video 3.5k 1.5k 3.5k 51:1
Table 1: Experimental settings for each dataset. DI (Data Imbalance) denotes the ratio between the number of dissimilar pairs and similar pairs.

4.2 Quantitative Results

ImageNet (MAP@1000) MS COCO (MAP@5000) NUS-WIDE (MAP@5000)
16 bits 32 bits 64 bits 16 bits 32 bits 64 bits 16 bits 32 bits 64 bits

LSH [4]
0.101 0.235 0.360 0.460 0.485 0.586 0.403 0.421 0.441
SH [26] 0.207 0.328 0.419 0.495 0.507 0.510 0.433 0.426 0.423
ITQ [5] 0.326 0.462 0.552 0.582 0.624 0.657 0.452 0.468 0.477
ITQ-CCA [5] 0.266 0.436 0.576 0.566 0.562 0.502 0.435 0.435 0.435
BRE [11] 0.063 0.253 0.358 0.592 0.622 0.634 0.485 0.525 0.544
SDH [22] 0.299 0.455 0.585 0.554 0.564 0.580 0.575 0.590 0.613
CNNH [29] 0.315 0.473 0.596 0.599 0.617 0.620 0.655 0.659 0.647
DNNH [12] 0.353 0.522 0.610 0.644 0.651 0.647 0.703 0.738 0.754

DHN [30]
0.367 0.522 0.627 0.719 0.731 0.745 0.712 0.759 0.771

HashNet [1]
0.622 0.701 0.739 0.745 0.773 0.788 0.757 0.775 0.790

HCN (Ours)
0.851 0.865 0.873 0.796 0.838 0.861 0.810 0.825 0.839
Table 2: Comparison in MAP of Hamming Ranking for different bits on image retrieval.
UCF-101 (MAP@100) HMDB51 (MAP@70)
16 bits 32 bits 64 bits 16 bits 32 bits 64 bits
DH [20] 0.300 0.290 0.470 0.360 0.360 0.310

SRH [6]
0.716 0.692 0.754 0.491 0.503 0.509

DVH [15]
0.701 0.705 0.712 0.441 0.456 0.518
HCN (Ours) 0.838 0.875 0.874 0.527 0.565 0.579
Table 3: Comparison in MAP of Hamming Ranking for different bits on video retrieval.
(a) P-R curve @64bits
(b) P@N @64bits
(c) P@H=2
Figure 5: Experimental results of HCN and comparison methods on ImageNet w.r.t. three evaluation metrics.
(a) P-R curve @64bits
(b) P@N @64bits
(c) P@H=2
Figure 6: Experimental results of HCN and comparison methods on UCF101 w.r.t. three evaluation metrics.

Results in terms of Mean Average Precision (MAP) for image retrieval and video retrieval are shown in Table 2 and 3. From Table 2, one can observe that our HCN achieves the best performance for the image retrieval task. Compared with the state-of-the-art deep hashing method HashNet, our HCN brings an increase of at least 13.4%, 6.5%, 4.9% in MAP for different bits on ImageNet, MS COCO and NUS_WIDE respectively. Specifically, the MAP boost on ImageNet is much larger than that on the other two datasets by about 7%-9%. Note ImageNet has the most severe data imbalance among the three image retrieval datasets (Table 1). This proves that central similarity learning can efficiently relieve the data imbalance problem.

In Table 3, our HCN also achieves significant performance boost for video retrieval. It achieves an impressive MAP increase of over 12.0% and 4.8% for different bits on UCF101 and HMDB51 respectively. We achieve larger increases on UCF101 because it also suffers severe data imbalance.

Fig. 5 and Fig. 6 show retrieval performance in Precision-Recall curves (P-R curve), Precision curves w.r.t. different numbers of returned samples (P@N) and Precision curves with Hamming distance 2(P@H=2) respectively, for one image dataset (ImageNet) and one video dataset (UCF101). From these two figures, we can find HCN also outperforms all comparison methods by large margins on ImageNet and UCF101 w.r.t. the three performance metrics.

4.3 Visualization Results

Visualization of retrieval results

We illustrate the retrieval results on ImageNet, MS COCO, UCF101 and HMDB51 in Fig. 7. It can be seen that HCN can return much more relevant results. On MS COCO, HCN uses the centroid of multiple centers as the hashing target for multi-label data, thus the returned images of HCN share more common labels with the query compared with HashNet.

Figure 7: Example of top 10 retrieved images and videos for two image datasets and two video datasets. For COCO images, below each returned image the number of common labels with the query is given, which has labels of traffic light, person, truck and tree.
(a) HCN
(b) HashNet
Figure 8: The t-SNE of hash codes learned by proposed HCN and HashNet. We sample 10k hash codes from random 10 categories.
Visualization of hash codes

To have an intuitive view of generated hash codes by HCN, we visualize some examples in t-SNE [17] in Fig. 8. We sample 10k generated hash codes in ImageNet, so Fig. 8(a) and Fig. 8(b) have the same number of points. As can be seen, HCN generates more cohensive hash codes for similar pairs (images from the same category) and dispersed hash codes for dissimilar pairs. This is desirable because the retrieval system can receive more relevant data and easily exclude irrelevant data by using Hamming ranking.

Visualization of hash code distance

We visualize the Hamming distance between 20 hash centers and generated hash codes of ImageNet and UCF101 by heat maps in Fig. 9. The columns represent the 20 hash centers of test data in ImageNet (sampled 1k test images) or UCF101 (sampled 0.6k test videos). The rows are the generated hash codes assigned to these 20 centers. We calculate the average Hamming distance between hash centers and hash codes assigned to different centers. The diagonal values in the heat maps are the average Hamming distances of the hash codes with the corresponding hash center. We find the diagonal values are small, meaning the generated hash codes “cluster” to the corresponding hash centers in the Hamming space. Most off-diagonal values are very large, meaning dissimilar data pairs spread sufficiently. We also find most off-diagonal values are around 32, which is exactly the Hamming distance between different hash centers in a 64 bits space.

(a) ImageNet (b) UCF101
Figure 9: The heat map of average Hamming distance between 20 hash centers (the columns) with the hash codes (64bit, rows) generated by proposed HCN from test data in ImageNet and UCF101.

4.4 Ablation Study

Ablation study I

We investigate effects of the proposed central similarity, traditional pairwise similarity and quantization process for hashing function learning, by evaluating different combinations of central similarity loss , pairwise similarity loss , and quantization loss . Results are summarized in Table 4. Our HCN includes and , corresponding to the 1st row in Table 4. When we add to HCN (2nd row), MAP only increases for some bits. This shows pairwise similarity has limited effects on further improving over central similarity learning. We add while removing (3rd row), and find the MAP decreases significantly for various bits. When only using , the MAP just decreases slightly. These two results show the positive effects of central similarity learning.

ImageNet (MAP@1000) MS COCO (MAP@5000)
16 bits 32 bits 64 bits 16 bits 32 bits 64 bits
0.851 0.865 0.873 0.796 0.838 0.861

0.847 0.870 0.871 0.798 0.835 0.863

0.551 0.629 0.655 0.631 0.725 0.746

0.841 0.864 0.870 0.781 0.834 0.843

Table 4: MAP results of HCN and its three variants on two image datasets.
Ablation study II

When applying Alg. 1, we can sample different rows of the Hadamard matrix to generate hash centers. To show HCN performs consistently well for different hash center choices, we evaluate its performance for five different combinations of hash centers. From the results in Table 5, we can validate the robustness of HCN to hash center choices.

Dataset ImageNet MS COCO UCF101 HMDB51

Table 5: Run five times for different hash center choices. The results are mean std for MAP@64bits.

5 Conclusion

In this paper, we propose a novel concept “Hash Center” to formulate the central similarity for deep hash learning. The proposed Hash Center Network (HCN) architecture can learn hash codes by optimizing the Hamming distance between hash codes with corresponding centers. We conduct extensive experiments to validate that HCN can generate high quality hash codes and yield state-of-the-art performance for both image and video retrieval.


  • [1] Z. Cao, M. Long, J. Wang, and S. Y. Philip (2017) HashNet: deep learning to hash by continuation.. In ICCV, pp. 5609–5618. Cited by: §1, §2, §3.2, §3, §4.1.1, §4.1.1, Table 2.
  • [2] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018) Multi-fiber networks for video recognition. arXiv preprint arXiv:1807.11195. Cited by: §4.1.2, §6.2.
  • [3] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In Proceedings of the ACM international conference on image and video retrieval, pp. 48. Cited by: §4.1.1.
  • [4] A. Gionis, P. Indyk, R. Motwani, et al. (1999) Similarity search in high dimensions via hashing. In Vldb, Vol. 99, pp. 518–529. Cited by: §4.1.1, Table 2.
  • [5] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929. Cited by: §4.1.1, Table 2.
  • [6] Y. Gu, C. Ma, and J. Yang (2016) Supervised recurrent hashing for large scale video retrieval. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 272–276. Cited by: §1, §2, §4.1.2, Table 3.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 770–778. Cited by: §4.1.1, §6.2.
  • [8] Aapo. Hyvärinen, Jarmo. Hurri, and P. O. Hoyer Natural image statistics: a probabilistic approach to early computational vision. Springer. Cited by: §3.3.
  • [9] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.2.
  • [10] H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre (2013) Hmdb51: a large video database for human motion recognition. In High Performance Computing in Science and Engineering ‘12, pp. 571–582. Cited by: §4.1.2.
  • [11] B. Kulis and T. Darrell (2009) Learning to hash with binary reconstructive embeddings. In Advances in neural information processing systems, pp. 1042–1050. Cited by: §4.1.1, Table 2.
  • [12] H. Lai, Y. Pan, Y. Liu, and S. Yan (2015) Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3270–3278. Cited by: §1, §1, §2, §4.1.1, §4.1.1, Table 2.
  • [13] W. Li, S. Wang, and W. Kang (2015) Feature learning based deep supervised hashing with pairwise labels. arXiv preprint arXiv:1511.03855. Cited by: §1, §1.
  • [14] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.1.
  • [15] V. E. Liong, J. Lu, Y. Tan, and J. Zhou (2017) Deep video hashing. IEEE Transactions on Multimedia 19 (6), pp. 1209–1219. Cited by: §1, §2, Table 3.
  • [16] H. Liu, R. Wang, S. Shan, and X. Chen (2016) Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2064–2072. Cited by: §1.
  • [17] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne.

    Journal of machine learning research

    9 (Nov), pp. 2579–2605.
    Cited by: §4.3.
  • [18] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov (2012) Hamming distance metric learning. In Advances in neural information processing systems, pp. 1061–1069. Cited by: §1.
  • [19] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.1, §6.2.
  • [20] J. Qin, L. Liu, M. Yu, Y. Wang, and L. Shao (2017) Fast action retrieval from videos via feature disaggregation. Computer Vision and Image Understanding 156, pp. 104–116. Cited by: §1, §2, §4.1.2, Table 3.
  • [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.1.1.
  • [22] F. Shen, C. Shen, W. Liu, and H. Tao Shen (2015) Supervised discrete hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 37–45. Cited by: §1, §4.1.1, Table 2.
  • [23] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §4.1.2.
  • [24] J. Wang, H. T. Shen, J. Song, and J. Ji (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927. Cited by: §1.
  • [25] T. Weise, M. Zapf, R. Chiong, and A. J. Nebro (2009) Why is optimization difficult?. In Nature-Inspired Algorithms for Optimisation, pp. 1–50. Cited by: §3.3.
  • [26] Y. Weiss, A. Torralba, and R. Fergus (2009) Spectral hashing. In Advances in neural information processing systems, pp. 1753–1760. Cited by: §4.1.1, Table 2.
  • [27] E. W. Weisstein (2002) Hadamard matrix. Cited by: §3.2.
  • [28] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Cited by: §2.
  • [29] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan (2014) Supervised hashing for image retrieval via image representation learning.. In AAAI, Vol. 1, pp. 2. Cited by: §1, §2, §4.1.1, Table 2.
  • [30] H. Zhu, M. Long, J. Wang, and Y. Cao (2016) Deep hashing network for efficient similarity retrieval.. In AAAI, pp. 2415–2421. Cited by: §1, §1, §2, §3.2, §3.3, §3, §4.1.1, §4.1.1, Table 2.
  • [31] N. Zhuang, J. Ye, and K. A. Hua (2016) Dlstm approach to video modeling with hashing for large-scale video retrieval. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pp. 3222–3227. Cited by: §4.1.2.

6 Supplementary Material

6.1 Jointly Learning with Pairwise Similarity

Given the semantic hash centers and pairwise similarity label , we can formulate central similarity and pairwise similarity based learning together to optimize the deep hashing functions. Recall the similarity label indicates the data pairs and are similar. The Maximum Likelihood (ML) estimation of hash codes for all training data with label can be obtained by maximizing the following likelihood probability:


Since we build the hash centers based on , the is known and can be treated as constant. Equation (10) thus becomes . Then the log likelihood can be written as


where the first RHS term represent the central similarity and the second RHS term is the pairwise similarity. The central similarity loss has been given in Sec. 3.3. For the pairwise similarity term in Equation (11), we use the inner product of the hash codes to measure the probability of the similarity labels.

Recall the Hamming distance and inner product for any two hash codes and satisfies: , where 1 is the all-one vector. We use inner product to replace the Hamming distance and define , the conditional probability of , as follows:


or equivalently,



is the Sigmoid function. This logistic regression-alike formulation satisfies that the smaller the Hamming distance

, the larger the inner product and the larger the conditional probability . This means that the pairs and

have a large probability to be classified as similar. Otherwise, the pairs would be classified to be dissimilar (

is large). After algebraic calculations, maximizing the above likelihood can be equivalently written as minimizing the following the pairwise similarity loss is computed as:


Putting all the pieces together, we obtain the following jointly optimization


where is the quantization loss, which has been given in the main text. In the experiment section of the main text, we also present and discuss performance of jointly learning by combining both and in the first ablation study.

6.2 Implementation Details

Implementation details for image retrieval

We implement the HCN model based on Pytorch [19] framework and employ ResNet [7]

architecture as 2D CNN for image feature learning. For fair comparison, the four baseline deep methods also use the same feature extraction network with the same configurations. We fine-tune the four convolution layers

conv1 to conv4 with learning rate 1e-5, which inherits from ResNet model pre-trained on the ImageNet. We never touch the test data in pre-training. We train the hash layer from scratch with 20 times learning rate than the convolution layers. We use the Adam solver [9] with a batch size of 64 and fix the hyper-parameters and .

Implementation details for video retrieval

We employ MFN [2] as 3D CNN for video feature learning. The HCN is first pre-trained on action classification task to learn video features, and we copy the parameters of 3D convolution layers. Then we fine tune the convlutional layers with learning rate 5e-4, and train the hash layer with 5 times learning rate than the 3D convolution layers. We use mini-batch stochastic gradient decent (SGD) with 0.9 momentum. The batch size is 32 and weight decay parameters is 0.0001. We train on two TITAN X GPU (12G) and takes around 16 hours for UCF101 and 9 hours for HMDB51.

6.3 Visualization of Hash Centers

We visualize some generated hash centers from algorithm 1 in this section. The Hadamard matrix is shown as Fig. 10. The hash centers of 64-bit for the five datasets we used are constructed by and as Algorithm 1.

For NUS_WIDE, we only sample 21 most frequent categories for experiments. Because of , all the hash centers with bits of 16, 32 and 64 for NUS_WIDE is constructed by Hadamard matrix , and . For the other four datasets, the 64-bit hash centers are constructed by and , but 16-bit and 32-bit hash centers are constructed by sampling from Bernoulli distributions. We give the illustration of the hash centers of 16-bit, 32-bit and 64-bit for ImageNet in Fig. 12 and Fig. 13. The hash centers for other three datasets are similar. In Fig. 12 and Fig. 13, every row means one hash center for one category in ImageNet.

Figure 10: Hadamard matrix : .
Figure 11: Hadamard matrix : .
Figure 12: Hash centers of ImageNet@64bit: . Each row represents one hash center, and there are totally 100 hash centers. The number of column represents the dimension of the hash center.
(a) ImageNet@32bit
(b) ImageNet@16bit
Figure 13: Hash centers of ImageNet@32bit and @16bit. The size of (a) is ; the size of (b) is . Each row represents one hash center, and the number of column represents the dimension of the hash center.

6.4 Proof on Hash Center Validity from

When in Algorithm 1, we use the combination of two Hadamard matrices to construct the hash centers. Here, we prove that the rows of can also be valid hash centers in the -dimensional Hamming space. According to Definition 1, we know if the Hamming distance between any two row vectors of is equal to or lager than , the row vectors of are valid hash centers.

We consider following three cases for Hamming distance between any two row vectors and in :

  1. Both of the two row vectors and belong to the upper half or below half of , i.e., or . So and are still orthgonal with each other with an inner product of . We get the Hamming distance ;

  2. One of the two row vectors belongs to , and the other one belongs to . We assume and . If , the two row vectors are still orthgonal with each other, thus .

  3. One of the two row vectors belongs to , and the other one belongs to , but . Thus the inner product is , and .

We summarize these three situations as following:


So the average Hamming distance is larger than , and the row vectors in are valid hash centers in K-dimensional Hamming space.