Due to its fast retrieval speed and low storage cost, similarity-preserving hashing has been widely used for approximate nearest neighbor (ANN) search [Arya et al.1998, Zhu et al.2016]. The central idea of hashing is to map the data points from the original feature space into binary codes in the Hamming space and preserve the pairwise similarities in the original space. With the binary-code representation, hashing enables constant or sub-linear time complexity for ANN search [Gong and Lazebnik2011, Zhang et al.2014]. Moreover, hashing can reduce the storage cost dramatically.
Compared with traditional data-independent hashing methods such as Locality Sensitive Hashing (LSH) [Gionis et al.1999], which do not use any data for training, data-dependent hashing methods can achieve better accuracy with shorter codes by learning hash functions from training data [Gong and Lazebnik2011, Liu et al.2012, Liu et al.2014, Zhang et al.2014]. Existing data-dependent methods can be further divided into three categories: unsupervised methods [He et al.2013, Gong and Lazebnik2011, Liu et al.2014, Shen et al.2015], semi-supervised methods [Wang et al.2010b, Wang et al.2014, Zhang et al.2016], and supervised methods [Liu et al.2012, Zhang et al.2014, Kang et al.2016]. Unsupervised hashing works by preserving the Euclidean similarity between the attributes of training points, while semi-supervised and supervised hashing try to preserve the semantic similarity constructed from the semantic labels of the training points [Zhang et al.2014, Norouzi and Fleet2011, Kang et al.2016]. Although some works exploit other types of supervised information, such as ranking information, for hashing [Li et al.2013, Norouzi et al.2012], the semantic information is usually given in the form of pairwise labels indicating whether two data points are known to be similar or dissimilar. Meanwhile, some recent supervised methods that perform simultaneous feature learning and hash code learning with deep neural networks have shown better performance [Zhang et al.2015, Zhao et al.2015, Li et al.2015, Li et al.2016]. Notably, these semi-supervised and supervised methods are mainly designed for data with non-hierarchical labels.
However, a large amount of data carries hierarchical labels, such as ImageNet [Deng et al.2009], IAPRTC-12 (http://imageclef.org/SIAPRdata), and CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html). Intuitively, we can simply treat hierarchical labeled data as non-hierarchical labeled data and then apply the existing algorithms. However, this cannot achieve optimal performance, because most existing methods are designed for non-hierarchical labeled data and do not consider the special characteristics of hierarchical labels. For example, in Figure 1, if the hierarchical labels are treated as non-hierarchical ones, two images share the label “Rose”, a third image is labeled “Sunflower”, and the remaining two are labeled “Ock” and “Tiger”, respectively. Given a query with the ground truth label “Rose”, a method that ignores the hierarchy may return the two “Rose” images first, followed by the “Ock” and “Tiger” images, with the “Sunflower” image last. It does not make sense for the “Ock” and “Tiger” images to rank higher than the “Sunflower” image, because the latter is also a flower although it is not a rose.
To address the aforementioned problem, we propose a novel supervised hierarchical deep hashing method for hierarchical labeled data, denoted as SHDH. Specifically, we define a novel similarity formula for hierarchical labeled data by weighting each layer, and design a deep convolutional neural network to obtain a hash code for each data point. Extensive experiments on several real-world public datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task.
2.1 Hierarchical Similarity
Images can have different similarities at different layers of a hierarchy. For example, in Figure 1, a “Rose” image and a “Sunflower” image are similar at the third layer because both are flowers. However, they are dissimilar at the fourth layer because one belongs to “Rose” and the other to “Sunflower”. In light of this, we have to define a hierarchical similarity for two images in hierarchical labeled data.
Definition 1 (Layer Similarity)
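For two images $i$ and $j$, the similarity at the $k$-th layer in the hierarchy is defined as:
$$s^{k}_{ij} = \begin{cases} 1, & \text{if } Ancestor_k(i) = Ancestor_k(j), \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$
where $Ancestor_k(i)$ is the ancestor node of image $i$ at the $k$-th layer.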
Equation (1) means that if images $i$ and $j$ share a common ancestor node at the $k$-th layer, they are similar at this layer; otherwise, they are dissimilar. For example, in Figure 1, the layer similarities between images and at different layers are: , , and .
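The layer similarity can be illustrated with a minimal sketch. It assumes each image is described by its root-to-leaf label path; the paths below (including the second-layer names "plant" and "animal") are hypothetical and only mirror the rose/sunflower example, not the actual hierarchy of Figure 1.

```python
from typing import Sequence


def layer_similarity(path_i: Sequence[str], path_j: Sequence[str], k: int) -> int:
    """Return 1 if both images share the same ancestor at layer k (1-indexed), else 0.

    In a tree, two nodes at layer k are the same node exactly when their
    root-to-node prefixes coincide, so the prefixes are compared.
    """
    if not 1 <= k <= min(len(path_i), len(path_j)):
        raise ValueError("k must be between 1 and the depth of both label paths")
    return int(tuple(path_i[:k]) == tuple(path_j[:k]))


# Hypothetical four-layer label paths, root first (illustrative names only).
rose = ("root", "plant", "flower", "rose")
sunflower = ("root", "plant", "flower", "sunflower")
tiger = ("root", "animal", "cat", "tiger")

for k in range(1, 5):
    print(k, layer_similarity(rose, sunflower, k), layer_similarity(rose, tiger, k))
```

With these paths, the rose and sunflower images agree up to the third layer and differ at the fourth, while the rose and tiger images agree only at the root.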
Intuitively, the higher layer is more important, because we cannot reach the right descendant nodes if we choose a wrong ancestor. We thus have to consider the weight for each layer in a hierarchy.
Definition 2 (Layer Weight)
The importance of the $k$-th layer in a hierarchy whose height is $K$ can be estimated as:
Note that the root layer is given zero weight, because the root has no discriminative ability for any data point. It is easy to prove that the weight of a layer is larger than that of the layer below it, which satisfies the requirement that the influence of ancestor nodes is greater than that of descendant nodes. And .
Based upon the two definitions above, the final hierarchical similarity between images $i$ and $j$ can be calculated as below:
where $K$ is the height of the hierarchy. Equation (3) guarantees that the more hierarchical labels an image pair has in common, the more similar the two images are.
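The following sketch shows how layer weights and layer similarities might be combined. The exact forms of Eq. (2) and Eq. (3) are not reproduced in this text, so both functions are assumptions: `layer_weights` uses a linearly decreasing weighting chosen only because it gives the root zero weight, decreases with depth below the root, and sums to one; `hierarchical_similarity` is a plain weighted sum, and the original formula may rescale it. The label paths are hypothetical.

```python
from typing import List, Sequence


def layer_weights(height: int) -> List[float]:
    """Assumed weights u_1..u_K: u_1 = 0, linearly decreasing for k >= 2, summing to 1."""
    if height < 2:
        raise ValueError("the hierarchy needs a root plus at least one more layer")
    denom = height * (height - 1)
    return [0.0] + [2.0 * (height + 1 - k) / denom for k in range(2, height + 1)]


def hierarchical_similarity(path_i: Sequence[str], path_j: Sequence[str],
                            weights: Sequence[float]) -> float:
    """Weighted sum of per-layer agreement between two root-to-leaf label paths."""
    if not (len(path_i) == len(path_j) == len(weights)):
        raise ValueError("paths and weights must all have one entry per layer")
    total = 0.0
    for k, u_k in enumerate(weights, start=1):
        if tuple(path_i[:k]) == tuple(path_j[:k]):  # same ancestor at layer k
            total += u_k
    return total


# Hypothetical four-layer label paths, root first (illustrative names only).
rose = ("root", "plant", "flower", "rose")
sunflower = ("root", "plant", "flower", "sunflower")
tiger = ("root", "animal", "cat", "tiger")

weights = layer_weights(4)
print(weights)
print(hierarchical_similarity(rose, sunflower, weights))
print(hierarchical_similarity(rose, tiger, weights))
```

Under this assumed weighting, the rose/sunflower pair scores higher than the rose/tiger pair, which is the ordering the hierarchy is meant to enforce.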
2.2 Supervised Hierarchical Deep Hashing
Figure 2 shows the deep learning architecture of the proposed method. Our SHDH model consists of two parts: feature learning and hash function learning. The feature learning part includes a convolutional neural network (CNN) component and two fully-connected layers. The CNN component contains five convolutional layers. After the CNN component, the architecture holds two fully-connected layers that have 4,096 units. The mature neural network architecture in [Chatfield et al.2014] is adopted in our work. Other CNN architectures, such as AlexNet [Krizhevsky et al.2012] and GoogLeNet [SzegedyLJSRAEVR15], can also be used here.
The hash function learning part includes a hashing layer and an independent weighting layer. The hash functions are learnt by the hashing layer, whose size is the length of the hash codes. No activation function is used in this layer. Note that the hashing layer is divided into $K$ segments, where $K$ is the height of the hierarchical labeled data. The size of segments is , where L is the length of hash codes. And the size of the segment is . The size of segment is denoted as . Here, there is an implicit assumption that is larger than . Besides, the values in the weighting layer are the weights computed by Eq. (2) from the hierarchical labeled data, which are used to adjust the Hamming distance among hash codes. Each value in the weighting layer weights the corresponding segment in the hashing layer. The parameters in the hashing layer, including weights and biases, are initialized to small real numbers between 0 and 0.001.
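As an illustration of the segmentation described above, the sketch below splits a code of a given length into a given number of segments. The segment sizes are not recoverable from this text, so the rule used here (equal floor-sized segments, with the last segment absorbing the remainder) is an assumption, as is the function name.

```python
from typing import List


def segment_sizes(code_length: int, num_segments: int) -> List[int]:
    """Assumed split: floor(L / K) bits per segment, remainder added to the last one."""
    if num_segments < 1 or code_length < num_segments:
        raise ValueError("the code length must be at least the number of segments")
    base = code_length // num_segments
    sizes = [base] * (num_segments - 1)
    sizes.append(code_length - base * (num_segments - 1))
    return sizes


print(segment_sizes(64, 3))  # [21, 21, 22]
print(segment_sizes(48, 4))  # [12, 12, 12, 12]
```

The check on the code length reflects the implicit assumption that there are enough bits to give every segment at least one.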
2.2.1 Objective Function
Given a hierarchical labeled dataset where
is the feature vector for data point and is the number of data points. Its semantic matrix can be built via Eq. (3), where . The goal of our SHDH is to learn a -bit binary code vector for each point . Assume there are layers in our deep network, and there are units for the layer, where . For a given sample , the output of the first layer is: , where is the projection matrix to be learnt at the first layer of the network, is the bias, and is the activation function. The output of the first layer is then considered as the input for the second layer, so that , where and
are the projection matrix and bias vector for the second layer, respectively. Similarly, the output for the layer is: , and the output at the top layer of our network is:
where the mapping is parameterized by , . Now, we can perform hashing for the output at the top layer of the network to obtain binary codes as follows: . The procedure above is forward. To learn the parameters of our network, we have to define an objective function.
First, for an image , its hash code is consisting of , where is the hash code in the segment, . Thus, the weighted Hamming distance between images and can be defined as:
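A minimal sketch of a segment-weighted Hamming distance is given below. It follows the description that each value in the weighting layer weights one segment of the hash code; the exact form of the original equation is not reproduced in this text, so the weighted sum of per-segment mismatch counts is an assumption, and the segment sizes and weights in the example are made up.

```python
from typing import Sequence


def weighted_hamming(code_i: Sequence[int], code_j: Sequence[int],
                     sizes: Sequence[int], weights: Sequence[float]) -> float:
    """Sum over segments of (segment weight) x (number of differing bits in that segment)."""
    if len(code_i) != len(code_j) or sum(sizes) != len(code_i):
        raise ValueError("codes must have equal length matching the segment sizes")
    if len(sizes) != len(weights):
        raise ValueError("one weight is required per segment")
    distance = 0.0
    start = 0
    for size, weight in zip(sizes, weights):
        seg_i = code_i[start:start + size]
        seg_j = code_j[start:start + size]
        mismatches = sum(a != b for a, b in zip(seg_i, seg_j))
        distance += weight * mismatches
        start += size
    return distance


# Made-up 6-bit codes split into three 2-bit segments with illustrative weights.
code_a = [1, -1, 1, 1, -1, -1]
code_b = [1, 1, 1, 1, 1, 1]
print(weighted_hamming(code_a, code_b, sizes=[2, 2, 2], weights=[0.5, 0.25, 0.25]))  # 1.0
```

A mismatch in a heavily weighted segment costs more than a mismatch in a lightly weighted one, which is how the hierarchy influences the ranking.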
We define the similarity-preserving objective function:
Eq. (5) encourages similar images to share the same hash code in each segment.
Second, to maximize the information provided by each hash bit, each bit should be a balanced partition of the dataset. Thus, we maximize the entropy as follows:
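The balance property can be checked with a small diagnostic. The sketch below is not the objective term of the paper (which is not reproduced in this text); it only computes the empirical entropy of each bit over a set of codes, assuming bits take the values +1 and -1. A perfectly balanced bit reaches the maximum of one bit of entropy.

```python
import math
from typing import List, Sequence


def bit_entropies(codes: Sequence[Sequence[int]]) -> List[float]:
    """Empirical entropy (in bits) of each hash bit over a list of +1/-1 codes."""
    n = len(codes)
    if n == 0:
        raise ValueError("at least one code is required")
    entropies = []
    for column in zip(*codes):  # iterate over bit positions
        p = sum(1 for b in column if b > 0) / n  # fraction of +1 values
        if p in (0.0, 1.0):
            entropies.append(0.0)  # a constant bit carries no information
        else:
            entropies.append(-(p * math.log2(p) + (1 - p) * math.log2(1 - p)))
    return entropies


# Made-up codes: the first bit is balanced, the second is skewed 3:1.
codes = [[1, 1], [1, -1], [-1, 1], [-1, 1]]
print(bit_entropies(codes))
```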
Assume that is all the hash codes for data points, and thus the objective function Eq. (2.2.1) could be transformed into the matrix form as below:
, where denotes the element-wise sign function, which returns if the element is positive and returns otherwise; is the weight of the layer in our SHDH, is the bias vector, and is the output of the layer. is a diagonal matrix. It can be divided into small diagonal matrices corresponding to . The diagonal value of is . Since the elements in are discrete integers, is not differentiable. So, we relax it as
from discrete to continuous by removing the sign function. Stochastic gradient descent (SGD) [Paras2014Stochastic] is used to learn the parameters. We use back-propagation (BP) to update the parameters:
where is calculated as below:
where denotes element-wise multiplication.
Then, the parameters are updated by using the following gradient descent algorithm until convergence:
where is the step size. In addition, we use the lookup table technique proposed in [Zhang et al.2015] to speed up the search process. The proposed supervised hierarchical deep hashing (SHDH) method is outlined in Algorithm 1.
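The relaxation described above can be sketched as follows: during training the hashing layer produces real-valued outputs (a linear map with no activation), and the sign function is applied only when binary codes are needed. The function names, the toy parameters, and the convention that the sign function maps non-positive values to -1 are assumptions made for illustration.

```python
from typing import List, Sequence


def relaxed_output(weights: Sequence[Sequence[float]], bias: Sequence[float],
                   features: Sequence[float]) -> List[float]:
    """Real-valued hashing-layer output W x + b (no activation), as used for training."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def binarize(outputs: Sequence[float]) -> List[int]:
    """Element-wise sign: +1 for positive entries, -1 otherwise (assumed convention)."""
    return [1 if v > 0 else -1 for v in outputs]


# Toy 2-bit hashing layer on a 2-D feature vector (made-up numbers).
W = [[0.5, -1.0], [0.25, 0.25]]
b = [0.0, -1.0]
x = [2.0, 0.5]

h = relaxed_output(W, b, x)
print(h)            # continuous values that gradients can flow through
print(binarize(h))  # binary code used at retrieval time
```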
3.1 Datasets and Setting
We carried out experiments on two public benchmark datasets: CIFAR-100 and IAPRTC-12. CIFAR-100 is an image dataset containing 60,000 colour images of 32×32 pixels. It has 100 classes, and each class contains 600 images. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image has a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). Thus, the height of the hierarchical labels with a “root” node in CIFAR-100 is three. The IAPRTC-12 dataset has 20,000 segmented images. Each image has been manually segmented, and the resulting regions have been annotated according to a predefined vocabulary of labels. The vocabulary is organized according to a hierarchy of concepts. The height of the hierarchical labels in IAPRTC-12 is seven. For both datasets, we randomly selected 90% of the data as the training set and the remaining 10% as the test set. The hyper-parameter in SHDH is empirically set to one.
We compared our method with six state-of-the-art hashing methods, of which four are supervised and two are unsupervised. The four supervised methods are DPSH [Li et al.2015, Li et al.2016], COSDISH [Kang et al.2016], LFH [Zhang et al.2014], and KSH [Liu et al.2012]. The two unsupervised methods are KMH [He et al.2013] and ITQ [Gong and Lazebnik2011]. For all six baselines, we employed the implementations provided by the original authors and used the default parameters recommended by the corresponding papers. Moreover, to study the influence of hierarchical labels separately, we replaced the values in the similarity matrix for KSH and COSDISH with the hierarchical similarity to obtain two new methods, KSH+H and COSDISH+H, where “H” denotes the hierarchical version. ITQ and KMH cannot be modified because they are unsupervised. LFH and DPSH cannot be modified because their algorithm structures do not readily accommodate hierarchical label information.
We resized all images to 32×32 pixels and directly used the raw images as input for the deep hashing methods, including SHDH and DPSH. The remaining six methods use hand-crafted features: we represented each image in CIFAR-100 and IAPRTC-12 by a 512-D GIST vector.
3.2 Evaluation Criterion
We measured the ranking quality of retrieved list for different methods by Average Cumulative Gain (ACG), Discounted Cumulative Gain (DCG), Normalized Discounted Cumulative Gain (NDCG) [Järvelin and Kekäläinen2000] and Weighted Recall. Note that we proposed the Weighted Recall metric to measure the recall in the scenario of hierarchical labeled data, defined as:
where is the number of top returned data points, represents the similarity between the query and data point in the ranking list, is the length of the ranking list.
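The ranking metrics can be sketched as below. The input is the list of hierarchical similarities between the query and each item of the ranked list, in ranked order. The Weighted Recall formula is not reproduced in this text, so the form used here (similarity mass in the top-n results divided by the similarity mass of the whole list) is an assumption inferred from the description of its symbols. The DCG gain and discount are one common variant and may differ from the one used in the experiments; similarities are assumed to be non-negative.

```python
import math
from typing import Sequence


def acg(sims: Sequence[float], n: int) -> float:
    """Average cumulative gain: mean similarity of the top-n results."""
    return sum(sims[:n]) / n


def dcg(sims: Sequence[float], n: int) -> float:
    """Discounted cumulative gain with gain 2^s - 1 and log2(rank + 1) discount."""
    return sum((2.0 ** s - 1.0) / math.log2(rank + 1)
               for rank, s in enumerate(sims[:n], start=1))


def ndcg(sims: Sequence[float], n: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering of the same list."""
    ideal = dcg(sorted(sims, reverse=True), n)
    return dcg(sims, n) / ideal if ideal > 0 else 0.0


def weighted_recall(sims: Sequence[float], n: int) -> float:
    """Assumed form: similarity mass in the top-n over similarity mass of the full list."""
    total = sum(sims)
    return sum(sims[:n]) / total if total > 0 else 0.0


# Made-up similarities of a four-item ranked list to the query.
sims = [1.0, 0.5, 0.0, 0.5]
print(acg(sims, 2), weighted_recall(sims, 2), ndcg(sims, 2))
```

Unlike binary recall, this assumed Weighted Recall gives partial credit for retrieving items that share only some ancestors with the query.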
3.3 Results on CIFAR-100
Table 1 summarizes the comparative results of different hashing methods on the CIFAR-100 dataset. We have several observations from Table 1: (1) our SHDH outperforms the other supervised and unsupervised baselines for different code lengths. For example, compared with the best competitor (DPSH), the results of our SHDH have a relative increase of 12.5% to 18.4% on ACG, 10.7% to 16.7% on DCG, and 8.7% to 11.4% on NDCG; (2) the hierarchical semantic labels can improve the performance of hashing methods. For example, COSDISH+H and KSH+H perform better than COSDISH and KSH, respectively, which means the inherent hierarchical information is valuable for improving hashing performance; (3) among all the supervised approaches, the deep learning based approaches (SHDH and DPSH) give relatively better results, which confirms that the representations learnt by a deep network from raw images are more effective than hand-crafted features for learning hash codes.
Figure 3 (a)–(c) show the Weighted Recall curves for different methods over different weighted Hamming distances at 32, 48, and 64 bits, respectively, which show that our method has a consistent advantage over the baselines. Figure 3 (g)–(i) show the Weighted Recall results over the top-$n$ retrieved results, where $n$ ranges from 1 to 5,000. Our approach also outperforms the other state-of-the-art hashing methods. The Weighted Recall curves at different hash code lengths are also illustrated in Figure 4 (a). From the figure, our SHDH model performs better than the baselines, especially when the code length increases. This is because, when the code length increases, the learnt hash functions gain discriminative ability for the hierarchical similarity among images.
3.4 Results on IAPRTC-12
Table 2 shows the performance comparison of different hashing methods on the IAPRTC-12 dataset; our SHDH performs better than the other approaches regardless of the code length. None of the baselines achieves optimal performance on hierarchical labeled data. Figure 3 (j)–(l) show the Weighted Recall results over the top-$n$ returned neighbors, where $n$ ranges from 1 to 5,000. These curves show a consistent advantage over the baselines. Moreover, our SHDH provides the best performance at different code lengths, as shown in Figure 4 (b).
The results of the Weighted Recall over different weighted Hamming distances are shown in Figure 3 (d)–(f). In these figures, our method is not the best one. The reason is that our SHDH has better discriminative ability at the same weighted Hamming distance because it considers the hierarchical relation. For example, DPSH returns 4,483 data points while our SHDH returns only 2,065 points when the weighted Hamming distance is zero and the code length is 64 bits. Thus, the better discriminative ability leads to better precision (Table 2) but a lower Weighted Recall.
3.5 Sensitivity to Hyper-Parameter
Figure 5 shows the effect of the hyper-parameter on CIFAR-100. We find that SHDH is not sensitive to this hyper-parameter. For example, SHDH achieves good performance on both datasets with values between 0.5 and 2. A similar conclusion holds on the IAPRTC-12 dataset; the corresponding figure is not included in this paper due to space limitations.
In this paper, we have proposed a novel supervised hierarchical deep hashing method for hierarchical labeled data. To the best of our knowledge, SHDH is the first method to utilize the hierarchical labels of images in the supervised hashing area. Extensive experiments on two real-world public datasets have shown that the proposed SHDH method outperforms the state-of-the-art hashing algorithms.
In the future, we will explore more hashing methods to process hierarchical labeled data, and further improve the performance of hashing methods for non-hierarchical labeled data by constructing their hierarchy automatically.
- [Andoni and Indyk2008] A Andoni and P Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122, 2008.
- [Arya et al.1998] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM, 45(6):891–923, 1998.
- [Beis and Lowe1997] Jeffrey S. Beis and David G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Conference on Computer Vision and Pattern Recognition, pages 1000–1006, San Juan, Puerto Rico, 1997.
- [Cai2016] Deng Cai. A revisit of hashing algorithms for approximate nearest neighbor search. CoRR, abs/1612.07545, 2016.
- [Chatfield et al.2014] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
- [Datar et al.2004] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the annual Symposium on Computational geometry, pages 253–262, 2004.
- [Deng et al.2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, pages 248–255, Miami, Florida, USA, 2009. IEEE Computer Society.
- [Do et al.2016] Thanh-Toan Do, Anh-Dzung Doan, and Ngai-Man Cheung. Learning to hash with binary deep neural network. In European Conference on Computer Vision, pages 219–234, Amsterdam, The Netherlands, 2016. Springer.
- [Gionis et al.1999] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In International Conference on Very Large Data Bases, pages 518–529, Edinburgh, Scotland, UK, 1999. Morgan Kaufmann.
- [Gong and Lazebnik2011] Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Conference on Computer Vision and Pattern Recognition, pages 817–824, Colorado Springs, CO, USA, 2011. IEEE Computer Society.
- [He et al.2013] Kaiming He, Fang Wen, and Jian Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Conference on Computer Vision and Pattern Recognition, pages 2938–2945, Portland, OR, USA, 2013. IEEE Computer Society.
- [Heo et al.2012] J. P. Heo, Y. Lee, J. He, S. F. Chang, and S. E. Yoon. Spherical hashing. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2957–2964, 2012.
- [Indyk and Motwani1998] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM Symposium on the Theory of Computing, pages 604–613, Dallas, Texas, USA, 1998.
- [Järvelin and Kekäläinen2000] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Conference on Research and Development in Information Retrieval, pages 41–48, Athens, Greece, 2000. ACM.
- [Kang et al.2016] Wang-Cheng Kang, Wu-Jun Li, and Zhi-Hua Zhou. Column sampling based discrete supervised hashing. In Conference on Artificial Intelligence, pages 1230–1236, Phoenix, Arizona, USA, 2016. AAAI Press.
- [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems, pages 1106–1114, Lake Tahoe, Nevada, United States, 2012.
- [Kulis and Grauman2009] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In International Conference on Computer Vision, pages 2130–2137, Kyoto, Japan, 2009.
- [Kulis et al.2009] Brian Kulis, Prateek Jain, and Kristen Grauman. Fast similarity search for learned metrics. IEEE Trans. Pattern Anal. Mach. Intell., 31(12):2143–2157, 2009.
- [Li et al.2013] Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Anthony R. Dick. Learning hash functions using column generation. In International Conference on Machine Learning, pages 142–150, Atlanta, GA, USA, 2013. JMLR.org.
- [Li et al.2015] Wu Jun Li, Sheng Wang, and Wang Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. Computer Science, 2015.
- [Li et al.2016] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. In International Joint Conference on Artificial Intelligence, pages 1711–1717, New York, NY, USA, 2016. IJCAI/AAAI Press.
- [Liu et al.2012] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In Conference on Computer Vision and Pattern Recognition, pages 2074–2081, Providence, RI, USA, 2012.
- [Liu et al.2013] Xianglong Liu, Junfeng He, and Bo Lang. Reciprocal hash tables for nearest neighbor search. In Conference on Artificial Intelligence, Bellevue, Washington, USA, 2013. AAAI Press.
- [Liu et al.2014] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. Discrete graph hashing. In Conference on Neural Information Processing Systems, pages 3419–3427, Montreal, Quebec, Canada, 2014.
- [Norouzi and Fleet2011] Mohammad Norouzi and David J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the International Conference on Machine Learning, pages 353–360, Bellevue, Washington, USA, 2011. Omnipress.
- [Norouzi et al.2012] Mohammad Norouzi, David J. Fleet, and Ruslan Salakhutdinov. Hamming distance metric learning. In Conference on Neural Information Processing Systems, pages 1070–1078, Lake Tahoe, Nevada, United States, 2012.
- [Raginsky and Lazebnik2009] Maxim Raginsky and Svetlana Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Advances in Neural Information Processing Systems, pages 1509–1517, 2009.
- [Shen et al.2015] Fumin Shen, Wei Liu, Shaoting Zhang, Yang Yang, and Heng Tao Shen. Learning binary codes for maximum inner product search. In International Conference on Computer Vision, pages 4148–4156, Santiago, Chile, 2015. IEEE Computer Society.
- [Smeulders et al.2000] Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh C. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell., 22(12):1349–1380, 2000.
- [Wang et al.2010a] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas S. Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In Conference on Computer Vision and Pattern Recognition, pages 3360–3367, San Francisco, CA, USA, 2010.
- [Wang et al.2010b] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. In Conference on Computer Vision and Pattern Recognition, pages 3424–3431, San Francisco, CA, USA, 2010. IEEE Computer Society.
- [Wang et al.2014] Qifan Wang, Luo Si, and Dan Zhang. Learning to hash with partial tags: Exploring correlation between tags and hashing bits for large scale image retrieval. In European Conference on Computer Vision, pages 378–392, Zurich, Switzerland, 2014. Springer.
- [Zhang et al.2014] Peichao Zhang, Wei Zhang, Wu-Jun Li, and Minyi Guo. Supervised hashing with latent factor models. In International Conference on Research and Development in Information Retrieval, pages 173–182, Gold Coast, QLD, Australia, 2014. ACM.
- [Zhang et al.2015] Ruimao Zhang, Liang Lin, Rui Zhang, Wangmeng Zuo, and Lei Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. Image Processing, 24(12):4766–4779, 2015.
- [Zhang et al.2016] Jian Zhang, Yuxin Peng, and Junchao Zhang. SSDH: semi-supervised deep hashing for large scale image retrieval. CoRR, abs/1607.08477, 2016.
- [Zhao et al.2015] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In Conference on Computer Vision and Pattern Recognition, pages 1556–1564, Boston, MA, USA, 2015. IEEE Computer Society.
- [Zheng et al.2015] Liang Zheng, Shengjin Wang, Lu Tian, Fei He, Ziqiong Liu, and Qi Tian. Query-adaptive late fusion for image search and person re-identification. In Conference on Computer Vision and Pattern Recognition, pages 1741–1750, Boston, MA, USA, 2015.
- [Zhu et al.2016] Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. In Conference on Artificial Intelligence, pages 2415–2421, Phoenix, Arizona, USA, 2016. AAAI Press.