By projecting high-dimensional data to compact binary hash codes in the Hamming space via a proper hashing function, hashing methods offer remarkable efficiency for data storage and retrieval. Recently, "deep learning to hash" methods [12, 22, 15, 13, 16, 29] have achieved promising results for image retrieval [29, 30] and video retrieval [6, 20, 15].
Most previous methods [1, 30, 18, 13] learn the deep hashing functions by utilizing pairwise data similarity that captures data relationships from a local aspect. As shown in the upper panel of Fig. 1, pairwise similarity learning aims at obtaining a hashing function that generates hash codes with minimal distance for similar data and maximal distance for dissimilar data in the Hamming space. However, it intrinsically suffers from the following issues: 1) Low efficiency in profiling the whole distribution of training data. The commonly used doublet similarity [1, 30, 13] or triplet similarity metrics [18, 12] have a time complexity on the order of O(n^2) or O(n^3) respectively for learning hashing functions from n data points, which means it is almost impractical to exhaustively learn from all possible data pairs. 2) Insufficient coverage of the data distribution. Pairwise similarity based methods utilize only partial relationships between data pairs, which may harm the discriminability of the generated hash codes. 3) Low effectiveness on imbalanced data. In real-world scenarios, the number of dissimilar pairs is much larger than that of similar pairs. Thus pairwise similarity learning based hashing methods cannot learn similarity relationships adequately to generate sufficiently good hash codes, leading to limited performance.
To solve the above issues, we propose a new global similarity metric, termed central similarity, and we learn it to obtain better hashing functions. In particular, central similarity measures the Hamming distance between hash codes and hash centers, which are defined as a set of points in the Hamming space with sufficient mutual distance. Central similarity learning encourages the generated hash codes to be close to their corresponding hash centers. In this way, the hash codes in the Hamming space concentrate around different hash centers, so they can be well discriminated, benefiting retrieval accuracy. From the bottom panel of Fig. 1, it can be intuitively seen that through central similarity learning, the hash codes of similar data pairs concentrate around their common hash centers (the black stars) and those of dissimilar pairs distribute around different hash centers. Central similarity learning has a time complexity of only O(nm) for n data points and m centers. Even in the presence of severe data imbalance, the hashing function can still be well learned from the global relationships.
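To make the complexity gap concrete, the snippet below counts the loss terms touched by each scheme. The numbers are purely illustrative and not taken from the paper:

```python
# Illustrative count only (not from the paper): number of loss terms touched
# by pairwise learning vs. central similarity learning.
n, m = 50_000, 100                       # hypothetical dataset size, #centers

pairwise_terms = n * (n - 1) // 2        # all distinct doublets: O(n^2)
central_terms = n * m                    # every code vs. every center: O(nm)

print(pairwise_terms, central_terms)     # 1249975000 5000000
```

Even with m = 100 centers, the central similarity objective touches orders of magnitude fewer terms than exhaustive pairwise learning.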
To obtain suitable hash centers for similar and dissimilar data pairs, two systematic approaches can be adopted. One is to use a Hadamard matrix to construct hash centers; the other is to generate hash centers by random sampling from Bernoulli distributions. We prove that both approaches generate proper hash centers that are separated from each other by a sufficient Hamming distance.
With central similarity learning, we propose a novel network architecture, the Hash Center Network (HCN), based on Convolutional Neural Network (CNN), to learn a deep hashing function in the Hamming space. In particular, HCN consists of convolution layers for deep feature learning and a hash layer for generating hash codes. After identifying the hash centers, we train HCN to generate hash codes with the target of making similar pairs converge to the same hash center. HCN is generic and compatible with both 2D and 3D CNNs for learning hash codes for both images and videos.
Our contributions are three-fold. 1) We introduce novel central similarity learning as a new hashing method to capture the global data distribution and generate high-quality hashing functions. 2) We propose the novel concept of a hash center to facilitate central similarity learning, and present two methods to generate proper hash centers. To the best of our knowledge, this is the first work to utilize central similarity and hash centers for deep hashing function learning. 3) We propose a unified deep hash network architecture for both image and video hashing that establishes new state-of-the-art performance.
2 Related Work
A variety of deep hashing methods have been proposed for image hashing. These "deep learning to hash" methods adopt 2D CNNs to learn image features, followed by hash layers that learn hash codes. Recent image hashing methods focus on designing more efficient pairwise-similarity loss functions. DNNH proposes a triplet ranking loss for similarity learning. DHN uses Maximum a Posteriori (MAP) estimation to obtain the pairwise similarity loss function. HashNet adopts Weighted Maximum Likelihood (WML) estimation to alleviate severe data imbalance by adding weights to the pairwise loss function. Different from previous works, we propose to use central similarity to model the relationships between similar and dissimilar pairs and to improve the discriminability of the generated hash codes.
Compared with image hashing, video hashing methods such as DH, SRH and DVH additionally aim to capture the temporal information in videos. For instance, one method utilizes Disaggregation Hashing to exploit the correlations among different feature dimensions. Another presents an LSTM-based method to capture the temporal information between video frames. A more recent approach fuses the temporal information by using fully-connected layers and frame pooling. Different from these hashing methods, our proposed HCN is a unified architecture for both image and video hashing. By directly replacing 2D CNNs with 3D CNNs, HCN can capture the temporal information needed for video hashing.
Our central similarity is partially related to the center loss used to learn more discriminative representations for face recognition (classification). The centers in that work are derived from the feature representations of the corresponding categories, which are unstable under intra-class variations. Different from this center loss for recognition, our proposed hash centers help generate high-quality hash codes in the Hamming space: they are defined over the hash codes instead of feature representations.
3 Proposed Method
We consider learning a hashing function in a supervised manner from a training set of N data points {(x_i, y_i)}, where each x_i is a datum to hash and y_i denotes the semantic label of x_i. Let f: X -> {0,1}^K denote the nonlinear hashing function from the input space X to the K-bit Hamming space {0,1}^K. Similar to other supervised "deep learning to hash" methods [1, 30], we aim for a hashing function f such that the generated hash codes h_i = f(x_i) are close whenever the corresponding data share similar labels.
As aforementioned, most existing methods learn the hashing function by encouraging the generated hash codes to preserve the raw pairwise data similarity. However, they suffer from learning inefficiency and an inability to model the global relations of the whole dataset, leading to inaccurate hash codes and degraded retrieval performance. Instead of learning from local pairwise relations, we propose to utilize global relations to enhance the quality of the hashing function. Specifically, we supervise hashing function learning by encouraging the generated hash codes to concentrate around a common center, termed a hash center, whenever the input data are similar.
We define a set C of points with sufficient mutual distance in the Hamming space as hash centers, and propose to learn the hashing function f supervised by the central similarity w.r.t. C. The central similarity encourages similar data pairs to be close to a common hash center and dissimilar data pairs to concentrate around different hash centers. Through such central similarity learning, the global similarity information between data pairs can be preserved in f, giving high-quality hash codes.
Below, we first give a formal definition of hash centers and explain how to generate proper hash centers systematically. Then we elaborate on the details of central similarity learning. The HCN framework is described at the end.
3.1 Definition of Hash Center
The first step of our proposed method is to position a set of good hash centers to anchor the subsequent central-similarity-based hashing function learning. To ensure the generated hash codes for dissimilar data are sufficiently distant from each other, each center should be more distant from the other centers than from the hash codes associated with it. As such, dissimilar pairs can be better separated and similar pairs can be aggregated cohesively. Based on this intuition, we formally define a set of points in the Hamming space as valid hash centers with the following properties.
Definition 1 (Hash Center).
We define hash centers as a set of points C = {c_1, ..., c_m} in the K-dimensional Hamming space {0,1}^K whose average pairwise distance satisfies:

    (1/T) * sum_{i != j} d_H(c_i, c_j) >= K/2,

where d_H denotes the Hamming distance, m is the number of hash centers, and T is the number of combinations of different c_i and c_j.
For better clarity, we illustrate some examples of the desired hash centers in the 3D and 4D Hamming spaces in Fig. 2. In Fig. 2(a), three hash codes share a common hash center, and all three have the same Hamming distance from it. In Fig. 2(b), we use a 4D hypercube to represent the 4D Hamming space. The two stars are hash centers as given in Definition 1; their mutual distance satisfies the definition, and the green dots all have the same distance to their center. However, we do not strictly require all points to have the same distance from the corresponding center. Instead, we define the nearest center as the corresponding hash center for a given hash code.
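The condition in Definition 1 is easy to verify numerically. The helper below is our own illustrative sketch (the function name and the example centers are assumptions, not from the paper):

```python
import numpy as np

def is_valid_center_set(centers: np.ndarray) -> bool:
    """Check Definition 1: average pairwise Hamming distance >= K/2.

    `centers` is an (m, K) binary {0,1} matrix."""
    m, K = centers.shape
    dists = [np.sum(centers[i] != centers[j])
             for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(dists)) >= K / 2

# Two maximally distant 3-bit centers, in the spirit of Fig. 2(a):
C = np.array([[0, 0, 0], [1, 1, 1]])
print(is_valid_center_set(C))   # True: average distance 3 >= 3/2
```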
3.2 Generation of Hash Center
To obtain hash centers with the above properties, in this subsection we develop two systematic generation approaches based on the following observation. In the K-dimensional Hamming space, if a set of points are mutually orthogonal (viewed as ±1 vectors), they have an equal distance of K/2 from each other; namely, they are valid hash centers satisfying Definition 1.
Accordingly, our first approach generates hash centers by leveraging the following nice properties of a Hadamard matrix. It is known that a Hadamard matrix H_K satisfies: 1) It is a square matrix whose rows are mutually orthogonal, i.e., the inner product of any two distinct row vectors is 0, so the Hamming distance between any two row vectors is K/2. Therefore, we can choose hash centers from these row vectors. 2) Its size is a power of 2 (i.e., K = 2^k), which is consistent with the usual number of bits of hash codes. 3) It is a binary matrix whose entries are either -1 or +1; we can simply replace all -1 entries with 0 to obtain hash centers in {0,1}^K.
To sample the hash centers from the Hadamard matrix, we first build a K x K Hadamard matrix by Sylvester's construction as follows:

    H_{2K} = [ H_K  H_K ; H_K  -H_K ] = H_2 (x) H_K,

where (x) denotes the Kronecker product and H_2 = [1 1; 1 -1] is the initial Hadamard matrix. When the number of centers m <= K, we directly choose rows of H_K as hash centers. When K < m <= 2K, we use a combination of two Hadamard matrices, [H_K; -H_K], to construct hash centers. (We prove in the supplementary material that the rows of [H_K; -H_K] are also valid hash centers in the K-dimensional Hamming space.)
Though applicable in most cases, the number of valid centers generated by the above approach is constrained by the Hadamard matrix being square. If m is larger than 2K, or K is not a power of 2, the first approach is inapplicable. We thus propose a second generation approach based on randomly sampling the bits of each center vector. In particular, each bit of a center is sampled from a Bernoulli distribution Bern(0.5), i.e., each bit is 0 or 1 with equal probability. It is easy to show that the expected distance between such centers is K/2, i.e., E[d_H(c_i, c_j)] = K/2 for i != j. We summarize the two approaches in Alg. 1.
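The two generation routes can be sketched as follows. This is our reading of the text, not the authors' code; the function names and the exact fallback policy are assumptions:

```python
import numpy as np

def hadamard(K: int) -> np.ndarray:
    """Sylvester's construction: repeatedly form [[H, H], [H, -H]]."""
    H = np.array([[1]])
    while H.shape[0] < K:
        H = np.block([[H, H], [H, -H]])
    return H

def generate_centers(m: int, K: int, seed: int = 0) -> np.ndarray:
    """Hadamard rows (and their negations) when they suffice,
    otherwise i.i.d. Bernoulli(0.5) bits."""
    if K > 0 and (K & (K - 1)) == 0 and m <= 2 * K:   # K is a power of 2
        H = hadamard(K)
        rows = np.vstack([H, -H])[:m]                  # [H; -H] covers m <= 2K
        return (rows > 0).astype(int)                  # map -1 -> 0
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(m, K))             # each bit ~ Bern(0.5)

C = generate_centers(100, 64)                          # e.g. 100 classes, 64 bits
d = [np.sum(C[i] != C[j]) for i in range(100) for j in range(i + 1, 100)]
print(C.shape, float(np.mean(d)) >= 64 / 2)            # (100, 64) True
```

The final check confirms the generated set satisfies Definition 1 for this configuration.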
Once a set of hash centers is obtained, the next step is to associate the training data with their corresponding centers to compute the central similarity. Recall that y_i is the semantic label of x_i, and usually y_i is a q-dimensional binary vector, where q is the number of categories. For single-label data, each datum belongs to one category, while each multi-label datum belongs to more than one category. We term the hash centers that are generated from Alg. 1 and associated with semantic labels the semantic hash centers. We now explain how to obtain the semantic hash centers for single-label and multi-label data separately.
Semantic hash centers for single-label data
For single-label data, we assign one hash center to each category. That is, we generate q hash centers by Alg. 1, one for each of the q labels. Thus, data pairs with the same label share a common center and are encouraged to be close to each other. Because each datum is assigned to one hash center, we obtain the semantic hash centers {c'_1, ..., c'_N}, where c'_i is the hash center of x_i.
Semantic hash centers for multi-label data
For multi-label data, HashNet and DHN directly regard data pairs as similar if they share at least one category. However, they ignore the transitive similarity when data pairs share more than one category. In this paper, we generate transitive centers for data pairs sharing multiple labels. First, we generate q hash centers by Alg. 1 corresponding to the q semantic labels. Then, for data belonging to two or more categories, we calculate the centroid of the centers corresponding to those categories. For example, suppose a datum x_i has three categories; the centers of the three categories are shown in Fig. 3. We calculate the centroid of the three centers as the hash center of x_i. To ensure the elements remain binary, we obtain each bit by voting over the same bit of the three centers and taking the value that dominates, as shown in the right panel of Fig. 3. If the numbers of 0s and 1s at some bit are equal (i.e., the voting result is a draw), we sample that bit uniformly from {0, 1}. Finally, for each x_i we take the centroid as its semantic hash center, obtaining the semantic hash centers {c'_1, ..., c'_N}, where c'_i is the hash center of x_i.
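The bitwise-majority centroid with random tie-breaking can be sketched as below. This is our own illustration; the example 4-bit centers are hypothetical:

```python
import numpy as np

def multilabel_center(label_centers: np.ndarray, seed: int = 0) -> np.ndarray:
    """Bitwise-majority centroid of several single-label centers.

    `label_centers`: (t, K) binary matrix, one row per category the datum has.
    Ties (equal counts of 0s and 1s) are broken by a fair coin, as in the text."""
    rng = np.random.default_rng(seed)
    ones = label_centers.sum(axis=0)            # count of 1s per bit
    t = label_centers.shape[0]
    center = (ones * 2 > t).astype(int)         # majority vote per bit
    ties = ones * 2 == t                        # draw: sample from {0, 1}
    center[ties] = rng.integers(0, 2, size=int(ties.sum()))
    return center

# Datum with three categories whose (hypothetical) 4-bit centers are:
cs = np.array([[1, 0, 1, 0],
               [1, 1, 0, 0],
               [1, 0, 0, 1]])
print(multilabel_center(cs))   # per-bit majorities: [1 0 0 0]
```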
3.3 Central Similarity Learning
Given the generated centers for training data with q categories, we obtain the semantic hash centers {c'_1, ..., c'_N} for single- or multi-label data, where c'_i denotes the hash center of the datum x_i. We derive the central similarity learning objective by maximizing the logarithm of the posterior of the hash codes w.r.t. the semantic hash centers. Formally, the logarithm Maximum a Posteriori (MAP) estimation of the hash codes H = {h_1, ..., h_N} for all the training data can be obtained by maximizing the following probability:

    log P(H | C') ∝ log P(C' | H) P(H) = sum_i [log P(c'_i | h_i) + log P(h_i)],
where P(h_i) is the prior distribution over hash codes and P(C' | H) is the likelihood function. P(c'_i | h_i) is the conditional probability of center c'_i given hash code h_i, which we model as a Gibbs distribution:

    P(c'_i | h_i) = (1/Z) exp(-alpha * D_H(c'_i, h_i)),

where Z and alpha are constants, and D_H measures the Hamming distance between a hash code and its hash center. Since hash centers are binary vectors, we use Binary Cross Entropy (BCE) to measure the (relaxed) Hamming distance between a hash code and its center, D_H(c'_i, h_i) = BCE(h_i, c'_i). So the conditional probability is

    P(c'_i | h_i) ∝ exp(-alpha * BCE(h_i, c'_i)).
We can see that the larger the conditional probability P(c'_i | h_i) is, the smaller the Hamming distance between hash code h_i and its hash center c'_i will be, meaning the hash code is close to its corresponding center; otherwise, the hash code is far from its corresponding center. Substituting Eqn. (5) into the MAP estimation, we obtain the optimization objective of the central similarity loss L_C:

    L_C = (1/N) sum_{i=1}^{N} BCE(h_i, c'_i)
        = -(1/N) sum_{i=1}^{N} sum_{k=1}^{K} [c'_{i,k} log h_{i,k} + (1 - c'_{i,k}) log(1 - h_{i,k})].
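Under this reconstruction, L_C is a plain BCE averaged over the batch. A minimal NumPy sketch (the function name and toy values are ours, not the paper's):

```python
import numpy as np

def central_similarity_loss(h: np.ndarray, c: np.ndarray) -> float:
    """BCE between relaxed hash codes h in (0,1) and binary centers c,
    summed over bits and averaged over the batch."""
    eps = 1e-12                                   # numerical safety
    bce = -(c * np.log(h + eps) + (1 - c) * np.log(1 - h + eps))
    return float(bce.sum(axis=1).mean())

c = np.array([[1, 0, 1, 1]])                      # one 4-bit hash center
near = np.array([[0.9, 0.1, 0.9, 0.9]])           # code close to its center
far = np.array([[0.1, 0.9, 0.1, 0.1]])            # code far from it
print(central_similarity_loss(near, c) < central_similarity_loss(far, c))  # True
```

Codes near their centers incur a small loss; codes near the opposite corner incur a large one, which is exactly the concentration behavior described above.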
Since each hash center is binary, existing optimization cannot guarantee that the generated hash codes completely converge on the hash centers, due to the inherent optimization difficulty. We therefore introduce a quantization loss L_Q to refine the generated hash codes. Similar to DHN, we use a bi-modal Laplacian prior for quantization, defined as

    L_Q = sum_{i=1}^{N} || |h_i| - 1 ||_1,

where 1 is an all-one vector. As |x| is a non-smooth function whose derivative is difficult to compute, we adopt the smooth surrogate log cosh x to replace it, so |x| ≈ log cosh x. The quantization loss then becomes

    L_Q = sum_{i=1}^{N} sum_{k=1}^{K} log cosh(|h_{i,k}| - 1).
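The smoothed quantization term can be sketched as follows. This is our illustration and assumes codes relaxed to real values around ±1:

```python
import numpy as np

def quantization_loss(h: np.ndarray) -> float:
    """Smoothed bi-modal prior: sum of log cosh(|h_k| - 1) over all bits,
    following the log-cosh surrogate described in the text."""
    return float(np.log(np.cosh(np.abs(h) - 1)).sum())

binary_like = np.array([1.0, -1.0, 1.0])     # already near {-1, +1}
fuzzy = np.array([0.2, -0.1, 0.3])           # far from binary values
print(quantization_loss(binary_like) == 0.0,
      quantization_loss(fuzzy) > 0.2)        # True True
```

The loss vanishes exactly on binary codes and grows as entries drift toward zero, pushing the relaxed codes back toward binary values.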
Finally, we obtain the central similarity optimization problem:

    min_Theta L = L_C + lambda * L_Q,

where Theta is the set of all parameters for deep hashing function learning, and lambda is the hyper-parameter balancing the central similarity estimation and the quantization processing. (We provide the formulation for jointly estimating central similarity and pairwise similarity to learn deep hashing functions, together with the pairwise loss function, in the supplementary material.)
3.4 Architecture of HCN
Based on these definitions and designs, we propose the Hash Center Network (HCN) to learn central similarity for image and video hashing. The network architecture is shown in Fig. 4. The input of HCN is the training data together with their semantic hash centers. HCN takes this input and outputs compact hash codes through the following deep hashing pipeline: 1) a 2D or 3D CNN sub-network to extract the data representation for image or video data; 2) a hash layer with three fully-connected layers and activation functions to project high-dimensional data features to hash codes in the Hamming space; 3) a central similarity loss for central similarity-preserving learning, where all hash centers are defined in the Hamming space, making hash codes converge on the corresponding centers; and 4) a quantization loss for improving binarization.
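The data flow of steps 1)-2) can be sketched with random, untrained weights. This is only a shape/flow illustration, not the authors' architecture code; the layer widths and activations are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_layer(features: np.ndarray, K: int) -> np.ndarray:
    """Toy stand-in for HCN's hash layer: three fully-connected layers
    ending in a sigmoid that maps features to relaxed K-bit codes.
    Weights are random here; in HCN they would be learned."""
    d = features.shape[1]
    W1 = rng.standard_normal((d, 512)) * 0.01
    W2 = rng.standard_normal((512, 512)) * 0.01
    W3 = rng.standard_normal((512, K)) * 0.01
    x = np.maximum(features @ W1, 0)      # FC + ReLU
    x = np.maximum(x @ W2, 0)             # FC + ReLU
    return 1 / (1 + np.exp(-(x @ W3)))    # FC + sigmoid -> relaxed codes

feats = rng.standard_normal((8, 2048))    # e.g. CNN features for 8 images
codes = hash_layer(feats, 64)
binary = (codes > 0.5).astype(int)        # quantize at retrieval time
print(codes.shape, binary.shape)          # (8, 64) (8, 64)
```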
4 Experiments
We conduct experiments on both image and video retrieval to evaluate our central similarity learning and HCN against several state-of-the-art methods.
4.1 Experiment Setting
Five benchmark datasets are used in our experiments and their statistics used in this paper are summarized in Table 1.
4.1.1 Settings for Image Hashing and Retrieval
We use three standard image retrieval datasets: ImageNet, NUS-WIDE and MS COCO. On ImageNet, we follow the settings of prior work and sample all images from 100 categories. As ImageNet is a single-label dataset, we directly generate 100 hash centers, one for each category. MS COCO is a multi-label image dataset with 80 categories. NUS-WIDE is also a multi-label image dataset; following [30, 12], we choose images from the 21 most frequent categories for evaluation. For the MS COCO and NUS-WIDE datasets, we first generate 80 and 21 hash centers for all categories respectively, and then calculate the centroid of the multiple centers as the semantic hash center for each image with multiple labels, following the approach in Sec. 3.2. A visualization of all generated hash centers is given in the supplementary material.
We compare the retrieval performance of the proposed HCN with ten classical or state-of-the-art hashing methods, including the unsupervised methods LSH, SH and ITQ, the supervised shallow methods ITQ-CCA, BRE and SDH, and the supervised deep methods HashNet, DHN, CNNH and DNNH. For the shallow hashing methods, we adopt the results reported in the latest works [1, 30] to make them directly comparable. We evaluate image retrieval performance with four standard evaluation metrics: Mean Average Precision (MAP), Precision-Recall curves (PR), Precision curves w.r.t. different numbers of returned samples (P@N), and Precision curves within Hamming distance 2 (P@H=2). We adopt MAP@1000 for ImageNet, as every category has 1,300 images, and MAP@5000 for MS COCO and NUS-WIDE.
4.1.2 Settings for Video Hashing and Retrieval
Two video retrieval datasets, UCF101 and HMDB51, are used, and we directly adopt their default settings. On UCF101, we use 9.5k videos for training and retrieval, and 3.8k videos as queries in every split. For HMDB51, we use 3.5k videos for training and retrieval and 1.5k videos for testing (queries) in each split.
In the video retrieval experiments, HCN adopts a lightweight 3D CNN, the Multi-Fiber 3D CNN, as the convolution layers to learn video representations. We compare the retrieval performance of the proposed HCN with three deep supervised video hashing methods, DH, DLSTM and SRH, using the same evaluation metrics as the image retrieval experiments.
Due to space limit, we defer implementation details of HCN for image and video hashing to supplementary material.
4.2 Quantitative Results
[Table 2: MAP on ImageNet (MAP@1000), MS COCO (MAP@5000) and NUS-WIDE (MAP@5000) at 16, 32 and 64 bits.]
[Table 3: MAP on UCF-101 (MAP@100) and HMDB51 (MAP@70) at 16, 32 and 64 bits.]
Results in terms of Mean Average Precision (MAP) for image retrieval and video retrieval are shown in Tables 2 and 3. From Table 2, one can observe that HCN achieves the best performance on the image retrieval task. Compared with the state-of-the-art deep hashing method HashNet, HCN brings an increase of at least 13.4%, 6.5% and 4.9% in MAP for different bits on ImageNet, MS COCO and NUS-WIDE respectively. Notably, the MAP boost on ImageNet is larger than that on the other two datasets by about 7%-9%. Note that ImageNet has the most severe data imbalance among the three image retrieval datasets (Table 1). This demonstrates that central similarity learning can efficiently relieve the data imbalance problem.
In Table 3, HCN also achieves a significant performance boost for video retrieval, with an impressive MAP increase of over 12.0% and 4.8% for different bits on UCF101 and HMDB51 respectively. The increase on UCF101 is larger because it also suffers from severe data imbalance.
Fig. 5 and Fig. 6 show retrieval performance in Precision-Recall curves (PR), Precision curves w.r.t. different numbers of returned samples (P@N), and Precision curves within Hamming distance 2 (P@H=2), for one image dataset (ImageNet) and one video dataset (UCF101). From these two figures, we find that HCN outperforms all comparison methods by large margins on ImageNet and UCF101 w.r.t. all three metrics.
4.3 Visualization Results
Visualization of retrieval results
We illustrate the retrieval results on ImageNet, MS COCO, UCF101 and HMDB51 in Fig. 7. It can be seen that HCN returns many more relevant results. On MS COCO, HCN uses the centroid of multiple centers as the hashing target for multi-label data, so the images returned by HCN share more common labels with the query than those returned by HashNet.
Visualization of hash codes
To gain an intuitive view of the hash codes generated by HCN, we visualize some examples with t-SNE in Fig. 8. We sample 10k generated hash codes from ImageNet, so Fig. 8(a) and Fig. 8(b) have the same number of points. As can be seen, HCN generates more cohesive hash codes for similar pairs (images from the same category) and more dispersed hash codes for dissimilar pairs. This is desirable because the retrieval system can then return more relevant data and easily exclude irrelevant data using Hamming ranking.
Visualization of hash code distance
We visualize the Hamming distances between 20 hash centers and the generated hash codes of ImageNet and UCF101 with heat maps in Fig. 9. The columns represent the 20 hash centers of the test data in ImageNet (1k sampled test images) or UCF101 (0.6k sampled test videos). The rows are the generated hash codes assigned to these 20 centers. We calculate the average Hamming distance between hash centers and the hash codes assigned to different centers. The diagonal values in the heat maps are the average Hamming distances between hash codes and their corresponding hash centers. We find the diagonal values are small, meaning the generated hash codes "cluster" around their corresponding hash centers in the Hamming space. Most off-diagonal values are large, meaning dissimilar data pairs are spread sufficiently far apart. We also find most off-diagonal values are around 32, which is exactly the Hamming distance between different hash centers in a 64-bit space.
4.4 Ablation Study
Ablation study I
We investigate the effects of the proposed central similarity, traditional pairwise similarity and the quantization process on hashing function learning, by evaluating different combinations of the central similarity loss L_C, the pairwise similarity loss L_P and the quantization loss L_Q. Results are summarized in Table 4. Our HCN uses L_C and L_Q, corresponding to the 1st row of Table 4. When we add L_P to HCN (2nd row), MAP increases only for some bits, showing that pairwise similarity has limited effect on further improving central similarity learning. When we add L_P while removing L_C (3rd row), the MAP decreases significantly for various bits; when using only L_C, the MAP decreases just slightly. These two results confirm the positive effect of central similarity learning.
[Table 4: Ablation results on ImageNet (MAP@1000) and MS COCO (MAP@5000) at 16, 32 and 64 bits.]
Ablation study II
When applying Alg. 1, we can sample different rows of the Hadamard matrix to generate hash centers. To show HCN performs consistently well for different hash center choices, we evaluate its performance for five different combinations of hash centers. From the results in Table 5, we can validate the robustness of HCN to hash center choices.
5 Conclusion
In this paper, we propose the novel concept of a "hash center" to formulate central similarity for deep hash learning. The proposed Hash Center Network (HCN) learns hash codes by optimizing the Hamming distance between hash codes and their corresponding centers. Extensive experiments validate that HCN generates high-quality hash codes and yields state-of-the-art performance for both image and video retrieval.
-  (2017) HashNet: deep learning to hash by continuation. In ICCV, pp. 5609–5618. Cited by: §1, §2, §3.2, §3, §4.1.1, §4.1.1, Table 2.
-  (2018) Multi-fiber networks for video recognition. arXiv preprint arXiv:1807.11195. Cited by: §4.1.2, §6.2.
-  (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 48. Cited by: §4.1.1.
-  (1999) Similarity search in high dimensions via hashing. In Vldb, Vol. 99, pp. 518–529. Cited by: §4.1.1, Table 2.
-  (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929. Cited by: §4.1.1, Table 2.
-  (2016) Supervised recurrent hashing for large scale video retrieval. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 272–276. Cited by: §1, §2, §4.1.2, Table 3.
-  (2016) Deep residual learning for image recognition. In , pp. 770–778. Cited by: §4.1.1, §6.2.
-  Natural image statistics: a probabilistic approach to early computational vision. Springer. Cited by: §3.3.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.2.
-  (2013) HMDB51: a large video database for human motion recognition. In High Performance Computing in Science and Engineering ‘12, pp. 571–582. Cited by: §4.1.2.
-  (2009) Learning to hash with binary reconstructive embeddings. In Advances in neural information processing systems, pp. 1042–1050. Cited by: §4.1.1, Table 2.
-  (2015) Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3270–3278. Cited by: §1, §1, §2, §4.1.1, §4.1.1, Table 2.
-  (2015) Feature learning based deep supervised hashing with pairwise labels. arXiv preprint arXiv:1511.03855. Cited by: §1, §1.
-  (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §4.1.1.
-  (2017) Deep video hashing. IEEE Transactions on Multimedia 19 (6), pp. 1209–1219. Cited by: §1, §2, Table 3.
-  (2016) Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2064–2072. Cited by: §1.
-  (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.3.
-  (2012) Hamming distance metric learning. In Advances in neural information processing systems, pp. 1061–1069. Cited by: §1.
-  (2017) Automatic differentiation in pytorch. Cited by: §4.1.1, §6.2.
-  (2017) Fast action retrieval from videos via feature disaggregation. Computer Vision and Image Understanding 156, pp. 104–116. Cited by: §1, §2, §4.1.2, Table 3.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.1.1.
-  (2015) Supervised discrete hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 37–45. Cited by: §1, §4.1.1, Table 2.
-  (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §4.1.2.
-  (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927. Cited by: §1.
-  (2009) Why is optimization difficult? In Nature-Inspired Algorithms for Optimisation, pp. 1–50. Cited by: §3.3.
-  (2009) Spectral hashing. In Advances in neural information processing systems, pp. 1753–1760. Cited by: §4.1.1, Table 2.
-  (2002) Hadamard matrix. Cited by: §3.2.
-  (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Cited by: §2.
-  (2014) Supervised hashing for image retrieval via image representation learning. In AAAI, Vol. 1, pp. 2. Cited by: §1, §2, §4.1.1, Table 2.
-  (2016) Deep hashing network for efficient similarity retrieval. In AAAI, pp. 2415–2421. Cited by: §1, §1, §2, §3.2, §3.3, §3, §4.1.1, §4.1.1, Table 2.
-  (2016) Dlstm approach to video modeling with hashing for large-scale video retrieval. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pp. 3222–3227. Cited by: §4.1.2.
6 Supplementary Material
6.1 Jointly Learning with Pairwise Similarity
Given the semantic hash centers C' and the pairwise similarity labels S = {s_ij}, we can formulate central similarity and pairwise similarity learning jointly to optimize the deep hashing function. Recall that the similarity label s_ij indicates whether the data pair x_i and x_j is similar. The Maximum Likelihood (ML) estimation of the hash codes H = {h_1, ..., h_N} for all training data with labels can be obtained by maximizing the likelihood P(C', S | H). (10)
Since we build the hash centers based on the semantic labels, the term involving the labels is known and can be treated as a constant. Equation (10) thus factorizes as P(C' | H) P(S | H), and the log likelihood can be written as

    log P(C', S | H) = sum_i log P(c'_i | h_i) + sum_{s_ij in S} log P(s_ij | h_i, h_j),  (11)

where the first RHS term represents the central similarity and the second RHS term the pairwise similarity. The central similarity loss has been given in Sec. 3.3. For the pairwise similarity term in Equation (11), we use the inner product of the hash codes to measure the probability of the similarity labels.
Recall that for any two K-bit hash codes h_i and h_j (with entries in {-1, 1}), the Hamming distance and inner product satisfy D_H(h_i, h_j) = (K - <h_i, h_j>)/2. We thus use the inner product in place of the Hamming distance and define the conditional probability of s_ij as

    P(s_ij | h_i, h_j) = sigma(<h_i, h_j>)^{s_ij} * (1 - sigma(<h_i, h_j>))^{1 - s_ij},

where sigma(x) = 1/(1 + e^{-x}) is the sigmoid function.
The larger the inner product <h_i, h_j>, the larger the conditional probability P(s_ij = 1 | h_i, h_j), meaning the pair x_i and x_j has a high probability of being classified as similar; otherwise, the pair is classified as dissimilar. After algebraic calculation, maximizing the above likelihood can be equivalently written as minimizing the following pairwise similarity loss:

    L_P = sum_{s_ij in S} ( log(1 + exp(<h_i, h_j>)) - s_ij * <h_i, h_j> ).
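A minimal sketch of this pairwise term for ±1 codes (our reconstruction, using a numerically stable log(1 + e^x)):

```python
import numpy as np

def pairwise_loss(hi: np.ndarray, hj: np.ndarray, s: int) -> float:
    """DHN-style pairwise term from an inner product on {-1,1} codes:
    log(1 + exp(ip)) - s * ip, with s = 1 for similar pairs."""
    ip = float(hi @ hj)
    return float(np.logaddexp(0.0, ip) - s * ip)   # stable log(1 + e^ip)

a = np.array([1, -1, 1, -1], dtype=float)
b = a.copy()                 # identical codes: similar pair should be cheap
c = -a                       # opposite codes: similar pair should be costly
print(pairwise_loss(a, b, s=1) < pairwise_loss(a, c, s=1))  # True
```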
Putting all the pieces together, we obtain the following joint optimization problem:

    min_Theta L = L_C + L_P + lambda * L_Q,

where L_Q is the quantization loss given in the main text. In the experiment section of the main text, we also present and discuss the performance of joint learning combining L_C and L_P in the first ablation study.
6.2 Implementation Details
Implementation details for image retrieval
We employ the ResNet architecture as the 2D CNN for image feature learning. For fair comparison, the four baseline deep methods use the same feature extraction network with the same configurations. We fine-tune the convolution layers conv1 to conv4 with learning rate 1e-5, inherited from the ResNet model pre-trained on ImageNet. We never touch the test data during pre-training. We train the hash layer from scratch with a learning rate 20 times that of the convolution layers. We use the Adam solver with a batch size of 64 and keep the hyper-parameters fixed.
Implementation details for video retrieval
We employ MFN as the 3D CNN for video feature learning. HCN is first pre-trained on an action classification task to learn video features, and we copy the parameters of the 3D convolution layers. We then fine-tune the convolutional layers with learning rate 5e-4, and train the hash layer with a learning rate 5 times that of the 3D convolution layers. We use mini-batch stochastic gradient descent (SGD) with 0.9 momentum, a batch size of 32 and a weight decay of 0.0001. Training runs on two TITAN X GPUs (12 GB) and takes around 16 hours for UCF101 and 9 hours for HMDB51.
6.3 Visualization of Hash Centers
We visualize some hash centers generated by Alg. 1 in this section. The Hadamard matrix is shown in Fig. 10. The 64-bit hash centers for the five datasets used in this paper are constructed by Alg. 1.
For NUS-WIDE, we only sample the 21 most frequent categories for experiments. Since 21 centers can be drawn from the rows of the Hadamard matrices (and their negations) at every code length, all the 16-, 32- and 64-bit hash centers for NUS-WIDE are constructed from Hadamard matrices. For the other four datasets, the 64-bit hash centers are constructed from the Hadamard matrix and its negation, while the 16-bit and 32-bit hash centers are constructed by sampling from Bernoulli distributions. We illustrate the 16-, 32- and 64-bit hash centers for ImageNet in Fig. 12 and Fig. 13; the hash centers for the other three datasets are similar. In Fig. 12 and Fig. 13, every row represents the hash center for one category in ImageNet.
6.4 Proof of Hash Center Validity for [H_K; -H_K]
When K < m <= 2K in Algorithm 1, we use the combination [H_K; -H_K] of two Hadamard matrices to construct the hash centers. Here we prove that the rows of [H_K; -H_K] are also valid hash centers in the K-dimensional Hamming space. According to Definition 1, if the average Hamming distance between the row vectors of [H_K; -H_K] is equal to or larger than K/2, the row vectors are valid hash centers.
We consider the following three cases for the Hamming distance between any two row vectors h_a and h_b of [H_K; -H_K]:
1) Both row vectors belong to the upper half H_K, or both belong to the lower half -H_K. Then h_a and h_b are distinct rows of a Hadamard matrix (up to a global sign), so they are orthogonal with inner product 0, and D_H(h_a, h_b) = K/2.
2) One row vector belongs to H_K and the other to -H_K, and they do not come from the same row of H_K. The two vectors are still orthogonal, so D_H(h_a, h_b) = K/2.
3) One row vector belongs to H_K and the other to -H_K, and they come from the same row of H_K, i.e., h_b = -h_a. Then the inner product is -K, and D_H(h_a, h_b) = K.
In summary, every pairwise Hamming distance is either K/2 or K, so the average Hamming distance is larger than or equal to K/2, and the row vectors of [H_K; -H_K] are valid hash centers in the K-dimensional Hamming space.
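The three cases can also be checked numerically for a small K. The snippet below (our own verification, not from the paper) confirms that only the distances K/2 and K occur among the rows of [H_K; -H_K]:

```python
import numpy as np

def hadamard(K: int) -> np.ndarray:
    """Sylvester's construction of a K x K Hadamard matrix (K a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < K:
        H = np.block([[H, H], [H, -H]])
    return H

K = 8
stack = np.vstack([hadamard(K), -hadamard(K)])    # the [H; -H] stack
B = (stack > 0).astype(int)                       # binary centers in {0,1}^K
m = B.shape[0]
d = np.array([np.sum(B[i] != B[j])
              for i in range(m) for j in range(i + 1, m)])
# Only K/2 (orthogonal rows) and K (a row vs. its negation) occur,
# so the average is at least K/2, matching the three cases above.
print(sorted(set(d.tolist())), float(d.mean()) >= K / 2)   # [4, 8] True
```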