Hashing [14, 22, 15, 16, 17, 5, 11, 10] has attracted considerable attention from researchers working on large-scale image retrieval in recent years. The goal of hashing is to transform multimedia data from the original high-dimensional space into a compact binary space while preserving data similarities. Constant or sub-linear search speed can be achieved via Hamming distance measurement, which is computed with XOR and POPCNT operations on modern CPUs or GPUs. Efficient storage and search make hashing popular for large-scale multimedia retrieval.
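As a small illustration of the XOR-and-popcount search mentioned above, the following Python sketch packs binary codes into integers and ranks a toy database by Hamming distance. The helper names (`pack_bits`, `hamming_distance`, `rank_by_hamming`) and the 8-bit codes are assumptions made for this example, and the linear scan stands in for whatever index structure a real system would use.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python integer."""
    code = 0
    for bit in bits:
        code = (code << 1) | (1 if bit else 0)
    return code


def hamming_distance(code_a, code_b):
    """XOR marks the differing bit positions; counting the set bits gives the distance."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query, database):
    """Return database indices sorted by Hamming distance to the query (linear scan)."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Toy 8-bit codes (illustrative values only).
query = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
database = [
    pack_bits([1, 0, 1, 1, 0, 0, 1, 1]),  # differs from the query in 1 bit
    pack_bits([0, 1, 0, 0, 1, 1, 0, 1]),  # differs from the query in all 8 bits
    pack_bits([1, 0, 1, 1, 0, 0, 1, 0]),  # identical to the query
]
ranking = rank_by_hamming(query, database)
```

On hardware, the same two steps map to one XOR and one population-count instruction per machine word, which is why the comparison is cheap even for large databases.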
Generally, existing hashing approaches can be divided into two categories: data-independent and data-dependent hashing methods. Data-independent hashing methods map the data points from the original feature space into a binary-code space by using random projections as hash functions. These methods provide theoretical guarantees that nearby data points are mapped to the same hash codes with high probability. However, they need long binary codes to achieve high precision. Data-dependent hashing methods (i.e., learning-to-hash methods) [14, 22, 15, 5, 16, 12, 18, 11, 24, 21, 10] learn hash functions and compact binary codes from training data. They can be further divided into unsupervised hashing methods [22, 5, 7, 16, 14] and supervised hashing methods [12, 20, 18], based on whether or not the semantic (label) information is used. In many real applications, supervised hashing methods demonstrate superior performance over unsupervised hashing methods.
Recently, deep learning based hashing methods [6, 11, 24, 21, 10, 15] have demonstrated superior performance over these traditional hashing methods [22, 5, 12, 16, 14]. The main reason is that deep hashing methods can perform simultaneous feature learning and hash-code learning in an end-to-end framework. Existing deep supervised hashing methods [1, 9, 23, 11, 24, 21, 15, 6] mainly utilize pairwise or triplet supervised information for hash learning, while ignoring the semantic class information that can improve the semantic discriminative ability of hash codes. More recently, some deep supervised hashing methods [17, 24, 10] improve the retrieval performance by introducing the essential semantic structure of the data in the form of class labels. One of them constructs a two-stream network architecture, with one stream for classification and the other for hashing. However, the semantic labels do not directly guide the hash learning. Another assumes that the learned binary codes should be ideal for linear classification, which restricts its scalability in complex scenes. A third weakens this assumption to support nonlinear classification. However, it utilizes the geometrical center of the semantically relevant sub-centers as supervision information for multi-label hashing, which destroys the intrinsic manifold structure of the sub-center space.
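To make the data-independent category concrete, the sketch below hashes vectors with random hyperplanes, one common form of random-projection hashing. It is an illustrative example under stated assumptions rather than any specific method cited above: the function names, the Gaussian hyperplanes, the fixed seed and the 16-bit code length are all choices made for this example.

```python
import random


def make_random_hyperplanes(num_bits, dim, seed=0):
    """Draw one Gaussian hyperplane normal per output bit (no training data involved)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_vector(x, hyperplanes):
    """Each bit records on which side of a random hyperplane the vector lies."""
    return [1 if sum(w * v for w, v in zip(plane, x)) >= 0 else 0
            for plane in hyperplanes]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(p != q for p, q in zip(a, b))


planes = make_random_hyperplanes(num_bits=16, dim=4, seed=0)
x = [1.0, 0.5, -0.2, 0.3]
x_scaled = [2.0 * v for v in x]      # same direction as x
x_opposite = [-v for v in x]         # opposite direction

code_x = hash_vector(x, planes)
code_scaled = hash_vector(x_scaled, planes)
code_opposite = hash_vector(x_opposite, planes)
```

Vectors pointing in the same direction always receive the same code, while opposite vectors disagree on every bit; vectors at intermediate angles disagree on a fraction of bits that grows with the angle, which is why many bits are needed for a precise ranking.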
In this paper, we propose a hierarchy neighborhood discriminative hashing method (HNDH). Specifically, we construct a bipartite graph to build a coarse semantic neighborhood relationship between the sub-class feature centers and the embedding features. Moreover, we utilize the pairwise supervised information to construct a fine semantic neighborhood relationship between the embedding features. Finally, we propose a hierarchy neighborhood discriminative hashing loss to unify the single-label and multi-label image retrieval problems with a one-stream deep neural network architecture. Compared with the multi-label classification loss that uses the geometrical center of the relevant sub-centers as supervision information, the proposed method exploits the intrinsic manifold structure of these sub-centers to learn discriminative hash codes. Some preliminary results on the multi-task learning based semantic hashing framework were presented in earlier work, while the extension to the hierarchy neighborhood discriminative hashing loss and the unification of single-label and multi-label hash learning within a one-stream network are novel.
2 Hierarchy Neighborhood Discriminative Hashing
2.1 Problem Definition
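Assume we have a training set $X = \{x_i\}_{i=1}^{N}$, where $N$ denotes the number of training samples. The label information is denoted as $Y = \{y_i\}_{i=1}^{N} \in \{0,1\}^{K \times N}$, where $K$ denotes the number of categories. In addition, we define a pairwise supervision matrix $S \in \{0,1\}^{N \times N}$ with $s_{ij} = 1$ if $x_i$ and $x_j$ are semantically similar, and $s_{ij} = 0$ otherwise. Under the supervised information $Y$ and $S$, supervised hashing aims to learn a hash function that transforms the training data $X$ into a collection of $r$-bit compact binary codes $B = \{b_i\}_{i=1}^{N} \in \{-1,+1\}^{r \times N}$. The Hamming distance between $b_i$ and $b_j$ is calculated as $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}\left(r - b_i^\top b_j\right)$. Therefore, we can utilize the inner product to measure the similarity of hash codes.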
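The two ingredients of this problem definition can be sketched in a few lines of Python. The example assumes that two samples count as semantically similar when they share at least one label (a common convention for multi-label data, stated here as an assumption), and it checks the relation between the Hamming distance and the inner product of codes in $\{-1,+1\}^r$. All function names are hypothetical.

```python
def pairwise_similarity(labels):
    """s_ij = 1 if samples i and j share at least one label (assumed convention), else 0.

    labels[i] is the multi-hot label vector of sample i.
    """
    n = len(labels)
    return [[1 if any(a and b for a, b in zip(labels[i], labels[j])) else 0
             for j in range(n)]
            for i in range(n)]


def hamming_from_inner_product(b_i, b_j):
    """For codes in {-1, +1}^r, the Hamming distance equals (r - <b_i, b_j>) / 2."""
    r = len(b_i)
    inner = sum(p * q for p, q in zip(b_i, b_j))
    return (r - inner) // 2


def hamming_by_counting(b_i, b_j):
    """Reference implementation: count the positions where the codes differ."""
    return sum(p != q for p, q in zip(b_i, b_j))


labels = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
S = pairwise_similarity(labels)

b1 = [1, -1, 1, 1]
b2 = [1, 1, -1, 1]
```

Because each agreeing bit contributes +1 and each differing bit contributes -1 to the inner product, a larger inner product always means a smaller Hamming distance, which is what allows the loss functions to be written in terms of inner products.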
2.2 Network Architecture
As illustrated in Fig. 1, the proposed one-stream deep architecture mainly contains two components: (1) the feature extraction subnetwork, which consists of the conv1 to fc7 layers of a pre-trained VGG-19 network; and (2) the output layer, which generates discriminative embedding features and hash codes for retrieval and classification.
2.3 Proposed Method
2.3.1 Coarse neighborhood discriminative hashing loss
The output layer includes an $r$-dimensional fully-connected layer (4096-$r$) and a tangent layer to approximate the sign function. We use $U = \{u_i\}_{i=1}^{N}$ to denote the real-valued output of the network, i.e., the embedding features. For each sub-class, we compute its feature center as follows:
$$c_k = \frac{1}{n_k} \sum_{i:\, y_{ki} = 1} u_i,$$
where $c_k$ represents the centroid of the $k$-th sub-class and $n_k$ is the number of training samples that belong to the $k$-th sub-class. Therefore, we can construct a complete bipartite graph to build the relationship between the sub-class feature centers $\{c_k\}_{k=1}^{K}$ and the embedding features $\{u_i\}_{i=1}^{N}$. The edge weight between the vertex $u_i$ and the vertex $c_k$ is denoted as $w_{ik}$. Inspired by prior work, we can use a softmax normalization over the edge weights that connect the vertex $u_i$ to define the neighbor probability $p_{ik}$:
$$p_{ik} = \frac{\exp(w_{ik})}{\sum_{l=1}^{K} \exp(w_{il})},$$
where $p_{ik}$ is the probability of $u_i$ selecting $c_k$ as its neighbor. The neighborhood relationship is coarse, since we only consider the relations between the embedding features and the sub-class centers. If the $k$-th sub-class is contained in the assigned labels of the embedding feature $u_i$, then the sub-class center $c_k$ is a relevant semantic neighbor of $u_i$. Therefore, $c_k$ can participate in the class label voting for $u_i$. Under this definition, the probability that image $x_i$ will be correctly classified can be computed as
$$p_i = \sum_{k=1}^{K} y_{ki}\, p_{ik}.$$
The negative log-likelihood function can be defined as:
$$L_c = -\sum_{i=1}^{M} \log p_i,$$
where $M$ refers to the batch size. This function can minimize the intra-class variation and maximize the inter-class variation simultaneously to generate powerful embedding representations. It is worth noting that if $y_i$ is a one-hot vector, this function becomes the single-label classification function in [17, 13]. However, this function is not restricted to the single-label classification problem. It extends naturally to the multi-label classification problem. Compared with the multi-label hashing loss that uses the geometrical center of the relevant semantic neighborhood sub-centers as supervision information, the proposed method exploits the intrinsic manifold structure of the sub-center space to learn discriminative embedding features.
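The coarse loss described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the edge weight is assumed to be the negative squared Euclidean distance between an embedding and a center (a hypothetical choice for this example), embeddings and centers are plain lists instead of network outputs, and the function names are invented here.

```python
import math


def neighbour_probabilities(u, centers):
    """Softmax over the edge weights between one embedding and every sub-class center."""
    # Assumed edge weight: negative squared Euclidean distance (hypothetical choice).
    weights = [-sum((a - b) ** 2 for a, b in zip(u, c)) for c in centers]
    shift = max(weights)  # subtract the maximum for numerical stability
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_loss(embeddings, labels, centers):
    """Negative log-likelihood that each embedding selects one of its relevant centers."""
    loss = 0.0
    for u, y in zip(embeddings, labels):
        p = neighbour_probabilities(u, centers)
        # Relevant centers (label entry set to 1) vote for the correct classification.
        p_correct = sum(p_k for p_k, y_k in zip(p, y) if y_k)
        loss -= math.log(p_correct)
    return loss


centers = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
embeddings = [[0.9, 0.8], [-1.0, -0.7]]
labels = [[1, 0, 1], [0, 1, 0]]          # first sample is multi-label
wrong_labels = [[0, 1, 0], [1, 0, 1]]    # labels that contradict the geometry

loss_consistent = coarse_loss(embeddings, labels, centers)
loss_inconsistent = coarse_loss(embeddings, wrong_labels, centers)
```

With a one-hot label vector only one term survives in the sum over relevant centers, so the expression reduces to the usual softmax cross-entropy; with several relevant labels, an embedding is rewarded for being close to any of its relevant centers rather than to their average.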
2.3.2 Fine neighborhood discriminative hashing loss
In the classification task above, we only consider the discrimination and polymerization of the embedding features. The semantic neighborhood relationship between the embedding features is ignored, which may make the distance between dissimilar embedding features smaller than the distance between similar embedding features. To overcome this limitation, we introduce the following pairwise constraint, which is commonly used to preserve the semantic similarity in retrieval tasks:
$$L_f = -\sum_{s_{ij} \in S} \left( s_{ij}\,\theta_{ij} - \log\left(1 + e^{\theta_{ij}}\right) \right),$$
where $\theta_{ij} = \frac{1}{2} u_i^\top u_j$ and $u_i$ is the embedding feature of image $x_i$. The semantic similarity matrix $S$ encodes a fine neighborhood relationship between the embedding features, i.e., $s_{ij} = 1$ denotes that image $x_i$ and image $x_j$ are neighbors in the semantic space, and $s_{ij} = 0$ otherwise. The neighborhood relationship is fine, since we consider the relations between all the embedding features.
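A pairwise constraint of this kind can be sketched as below. The example assumes the widely used pairwise negative log-likelihood with $\theta_{ij} = \frac{1}{2} u_i^\top u_j$; whether this matches the exact form used in the paper is an assumption, and the function name and toy values are hypothetical.

```python
import math


def pairwise_loss(embeddings, similarity):
    """Sum over all pairs of log(1 + exp(theta_ij)) - s_ij * theta_ij.

    theta_ij is half the inner product of the two embeddings (assumed form).
    """
    loss = 0.0
    n = len(embeddings)
    for i in range(n):
        for j in range(n):
            theta = 0.5 * sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            # Numerically stable evaluation of log(1 + exp(theta)).
            log_term = max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))
            loss += log_term - similarity[i][j] * theta
    return loss


embeddings = [
    [1.0, 1.0, -1.0, -1.0],
    [1.0, 1.0, -1.0, 1.0],
    [-1.0, -1.0, 1.0, 1.0],
]
# Samples 0 and 1 are similar to each other; sample 2 is dissimilar to both.
similarity = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
flipped = [[1 - s for s in row] for row in similarity]

loss_consistent = pairwise_loss(embeddings, similarity)
loss_inconsistent = pairwise_loss(embeddings, flipped)
```

The loss is small when similar pairs have a large inner product and dissimilar pairs a small one, which, given the inner-product relation of Section 2.1, corresponds to small and large Hamming distances respectively.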
2.4 Objective Function and Learning Algorithm
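We formulate the proposed Hierarchy Neighborhood Discriminative Hashing method as the following multi-task learning framework:
$$\min_{\Theta,\, C} \; L = L_c + \lambda L_f,$$
where $L_c$ and $L_f$ are the coarse and fine neighborhood discriminative hashing losses of Sections 2.3.1 and 2.3.2, and $\lambda$ balances the impact of the different number of factors between the first term and the second term. We use an alternating optimization over the class sub-centers $C = \{c_k\}_{k=1}^{K}$ and the CNN parameters $\Theta$ as follows:
Fix $\Theta$ and optimize $C$. We can update the feature center of the $k$-th sub-class directly as follows:
$$c_k = \frac{1}{n_k} \sum_{i:\, y_{ki} = 1} u_i,$$
where $u_i$ is the embedding feature of the $i$-th sample and $n_k$ is the number of samples that belong to the $k$-th sub-class.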
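The center update step can be sketched as a per-sub-class mean of the embedding features. The snippet below is an illustration only: it assumes multi-hot label vectors, lets a multi-label sample contribute to every sub-class it belongs to, and leaves open whether the mean is taken over a mini-batch or the whole training set. The function name is hypothetical.

```python
def update_centers(embeddings, labels, num_classes):
    """Recompute each sub-class center as the mean of the embeddings carrying that label."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for u, y in zip(embeddings, labels):
        for k in range(num_classes):
            if y[k]:
                counts[k] += 1
                for d in range(dim):
                    sums[k][d] += u[d]
    # Sub-classes without any sample keep a zero vector in this sketch.
    return [[s / counts[k] for s in sums[k]] if counts[k] else sums[k]
            for k in range(num_classes)]


embeddings = [[1.0, 0.0], [3.0, 2.0], [-1.0, -1.0]]
labels = [[1, 0], [1, 1], [0, 1]]   # the second sample belongs to both sub-classes
centers = update_centers(embeddings, labels, num_classes=2)
```

Because this step has a closed form, only the network parameters need gradient-based updates in the other half of the alternating scheme.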
Table 1: Statistics of the two dataset splits.
| Dataset | Total | Query / Retrieval / Training | Labels |
|---|---|---|---|
| CIFAR-10 | 60,000 | 1,000 / 59,000 / 5,000 | 10 |
| NUS-WIDE | 195,834 | 2,100 / 193,734 / 10,500 | 21 |
3 Experimental Results
3.1 Datasets and Experimental Settings
We evaluate the performance of several deep hashing methods on two public datasets: CIFAR-10 and NUS-WIDE. We split each dataset into a query set and a retrieval set. The training set is randomly selected from the retrieval set. For the CIFAR-10 dataset, 100 images per class are randomly selected as the query set and the remaining images are used as the retrieval set, following [11, 21, 10]. Moreover, 500 images per class are randomly sampled from the retrieval set as the training set. For the NUS-WIDE dataset, we only use the images that belong to the 21 most frequent labels, each of which covers at least 5,000 images. We randomly sample 2,100 images (100 images per class) as the query set and the remaining images form the retrieval set. Moreover, 500 images per class from the retrieval set are used as the training set. The statistics of the two dataset splits are summarized in Table 1. For CIFAR-10, we use Mean Average Precision (MAP) as the evaluation metric, following [11, 21, 10]. For NUS-WIDE, MAP@5K is evaluated on the top 5,000 retrieved images, as in [11, 21, 10].
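For reference, MAP and its truncated variant can be computed from ranked relevance lists as in the sketch below. The normalization used here (dividing by the number of relevant items found within the cut-off) is one common convention in the hashing literature and is an assumption of this example; the function names are hypothetical.

```python
def average_precision(relevance, top_k=None):
    """Average precision of one ranked list of 0/1 relevance flags.

    If top_k is given, only the first top_k retrieved items are considered.
    """
    if top_k is not None:
        relevance = relevance[:top_k]
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists, top_k=None):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r, top_k) for r in relevance_lists) / len(relevance_lists)


# Two toy queries; 1 marks a retrieved item sharing a label with the query.
ranked_relevance = [[1, 0, 1, 0], [0, 1]]
map_full = mean_average_precision(ranked_relevance)
map_at_2 = mean_average_precision(ranked_relevance, top_k=2)
```

In a hashing pipeline, each relevance list would be obtained by sorting the retrieval set by Hamming distance to the query code; MAP@5K corresponds to `top_k=5000`.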
We use two NVIDIA TITAN Xp GPUs and MatConvNet as the platform to implement the proposed model. The pre-trained VGG-19 model is utilized to initialize the base network in HNDH, and the other parameters of the network are randomly initialized. The number of iterations of the proposed HNDH is set to 100 and the batch size is fixed to 128 for all datasets. The learning rate of the base network is gradually reduced during training for both the CIFAR-10 and NUS-WIDE datasets. The learning rate for the newly added layers is set to 10 times that of the layers of the base network. For both datasets, the balance parameter of the objective function is set via cross-validation on the training sets.
Table 2: MAP on CIFAR-10 for 12, 24, 32 and 48 bits.
Table 3: MAP@5K on NUS-WIDE for 12, 24, 32 and 48 bits.
Table 4: Ablation results on CIFAR-10 and NUS-WIDE for 12, 24, 32 and 48 bits.
3.2 Results and Discussions
The MAP results on CIFAR-10 and NUS-WIDE are presented in Table 2 and Table 3, respectively. The results of the deep supervised baselines, including [6, 11, 21], on CIFAR-10 and NUS-WIDE are cited from previously published work. It can be seen that the proposed method outperforms the other baselines in most cases. On the CIFAR-10 dataset, the average MAP of the proposed method is 0.824, which is 1.1 percentage points higher than DDSH's average of 0.813. On the NUS-WIDE dataset, the proposed method performs consistently better than the other baselines across all code lengths. Its average MAP@5K is 0.831, which is 0.7 percentage points higher than MLDH's average of 0.824. A likely reason is that the embedding features learned by MLDH tend to lie at the geometrical center of their semantically relevant sub-class centers, which destroys the manifold structure in the sub-center space. When the hash code is short (e.g., 12 bits), the proposed method and MLDH perform much better than the other state-of-the-art methods. The reason is that the semantic label information is employed to learn the discrimination and polymerization of hash codes. Although DSDH also utilizes the label information, it ignores the polymerization of hash codes.
Compared to the single-task learning based hashing methods, including DTSH, DPSH, HashNet, DHN, DNNH and CNNH, the multi-task learning based hashing methods, including the proposed method, MLDH and DSDH, jointly consider the retrieval task and the classification task for learning discrete discriminative hash codes. From Table 2 and Table 3, it can be seen that the multi-task learning based hashing methods generally perform better than the single-task learning based hashing methods. In addition, different from DSDH, the proposed method also considers the polymerization of hash codes.
3.3 Ablation Experiments
We report the effect of the different components of HNDH on the two benchmark datasets with different numbers of bits in Table 4. The MAP results verify the effectiveness of combining the coarse neighborhood discriminative hashing loss and the fine neighborhood discriminative hashing loss. In addition, the proposed coarse neighborhood discriminative hashing loss on its own performs better than the individual multi-label hashing loss of prior work. To facilitate understanding, we focus on two HNDH variants: (a) HNDH-C, which removes the fine neighborhood discriminative hashing loss; and (b) HNDH-F, which removes the coarse neighborhood discriminative hashing loss. Fig. 6 shows the t-SNE visualization of the deep representations of HNDH-F, HNDH-C and HNDH with 48 bits on the query set of the CIFAR-10 dataset. As shown in Fig. 6, the image embeddings generated by HNDH show the most compact and discriminative structures with the clearest boundaries.
3.4 Convergence Analysis
The training convergence curves of the proposed model at 48 bits on the CIFAR-10 and NUS-WIDE datasets are shown in Fig. 6 (d). It can be observed that the proposed model converges within 100 iterations on both datasets, which supports the iteration budget used in our experiments.
4 Conclusion
In this paper, we propose a hierarchy neighborhood discriminative hashing method. First, we construct a bipartite graph to build a coarse semantic neighborhood relationship between the sub-class feature centers and the embedding features. Moreover, we utilize the pairwise supervised information to construct a fine semantic neighborhood relationship between the embedding features. Finally, we propose a hierarchy neighborhood discriminative hashing loss to unify the single-label and multi-label image retrieval problems with a one-stream deep neural network architecture. In future work, we plan to extend the proposed single-modal hashing method to cross-modal hashing.
References
- [1] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen. Deep quantization network for efficient image retrieval. In AAAI, pages 3457–3463, 2016.
- [2] Z. Cao, M. Long, J. Wang, and P. S. Yu. Hashnet: Deep learning to hash by continuation. In ICCV, pages 5609–5618, 2017.
- [3] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of International Conference on Very Large Data Bases, pages 518–529, 1999.
- [4] J. Goldberger, S. T. Roweis, G. E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, pages 513–520, 2004.
- [5] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
- [6] Q. Jiang, X. Cui, and W. Li. Deep discrete supervised hashing. IEEE Transactions on Image Processing, 27(12):5996–6009, 2018.
- [7] W. Kang, W. Li, and Z. Zhou. Column sampling based discrete supervised hashing. In AAAI, pages 1230–1236, 2016.
- [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
- [9] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
- [10] Q. Li, Z. Sun, R. He, and T. Tan. Deep supervised discrete hashing. In NIPS, pages 2479–2488, 2017.
- [11] W. Li, S. Wang, and W. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.
- [12] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
- [13] Y. Liu, H. Li, and X. Wang. Rethinking feature discrimination and polymerization for large-scale recognition. In NIPS, 2017.
- [14] L. Ma, H. Li, F. Meng, Q. Wu, and K. N. Ngan. Learning efficient binary codes from high-level feature representations for multilabel image retrieval. IEEE Trans. Multimedia, 19(11):2545–2560, 2017.
- [15] L. Ma, H. Li, F. Meng, Q. Wu, and K. N. Ngan. Global and local semantics-preserving based deep hashing for cross-modal retrieval. Neurocomputing, 312:49–62, 2018.
- [16] L. Ma, H. Li, F. Meng, Q. Wu, and L. Xu. Manifold-ranking embedded order preserving hashing for image semantic retrieval. J. Visual Communication and Image Representation, 44:29–39, 2017.
- [17] L. Ma, H. Li, Q. Wu, C. Shang, and K. Ngan. Multi-task learning for deep semantic hashing. In VCIP, pages 1–4, 2018.
- [18] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.
- [19] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 9:2579–2605, 2008.
- [20] J. Wang, S. Kumar, and S. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393–2406, 2012.
- [21] X. Wang, Y. Shi, and K. M. Kitani. Deep supervised hashing with triplet labels. In ACCV, pages 70–84, 2016.
- [22] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2008.
- [23] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, pages 2156–2162, 2014.
- [24] T. Yao, F. Long, T. Mei, and Y. Rui. Deep semantic-preserving and ranking-based hashing for image retrieval. In IJCAI, pages 3931–3937, 2016.