Unsupervised Semantic Deep Hashing

03/19/2018 ∙ by Sheng Jin, et al.

In recent years, deep hashing methods have proved effective because they employ convolutional neural networks to learn features and hashing codes simultaneously. However, these methods are mostly supervised, and in real-world applications annotating a large number of images is time-consuming and laborious. In this paper, we propose a novel unsupervised deep hashing method for large-scale image retrieval. Our method, unsupervised semantic deep hashing (USDH), uses semantic information preserved in the CNN feature layer to guide the training of the network. We enforce four criteria on hashing-code learning based on the VGG-19 model: 1) preserving the relevant information of the feature space in the hashing space; 2) minimizing the quantization loss between binary-like codes and hashing codes; 3) improving the usage of each bit of the hashing codes by maximizing the information entropy; and 4) invariance to image rotation. Extensive experiments on CIFAR-10 and NUSWIDE demonstrate that USDH outperforms several state-of-the-art unsupervised hashing methods for image retrieval. We also conduct experiments on the Oxford 17 dataset for fine-grained classification to verify its effectiveness on other computer vision tasks.




1 Introduction

With the explosive increase of data, searching for content-relevant images or other media remains challenging because of the computational cost and accuracy requirements involved. In the early stage, researchers focused on data-independent methods; Locality-Sensitive Hashing [1] and its variants were proposed. However, these methods achieve lower accuracy since the semantic information of the data is not considered during the coding process. In recent years, data-dependent hashing methods [2] have attracted more attention owing to their compact representations and superior accuracy. Compared with data-independent hashing methods, data-dependent hashing methods improve retrieval performance by training on the dataset.
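To make the data-independent baseline concrete, the following is a minimal sketch of random-hyperplane LSH in NumPy: each bit is the sign of a random projection, fixed in advance without any training data. The bit count, dimensionality, and perturbation scale are illustrative choices, not values from any cited work.

```python
import numpy as np

def lsh_hash(x, planes):
    # Random-hyperplane LSH: one bit per hyperplane, no training data needed.
    return (x @ planes.T > 0).astype(np.uint8)

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 128))   # 16 hash bits for 128-d features
a = rng.standard_normal(128)
b = a + 0.01 * rng.standard_normal(128)   # a slightly perturbed copy of a
ha, hb = lsh_hash(a, planes), lsh_hash(b, planes)
hamming = int(np.sum(ha != hb))           # near-duplicates collide on most bits
```

Because the hyperplanes ignore the data distribution, semantically similar but geometrically distant images receive unrelated codes, which is exactly the weakness data-dependent methods address.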

Data-dependent methods mainly include supervised hashing methods [3], unsupervised hashing methods [4], and semi-supervised hashing methods [5]. Supervised methods make use of the class information provided by manual labels, where the supervised information takes three forms: point-wise labels, pair-wise labels, and ranking labels. Representative works include Semantic Hashing [6], Binary Reconstruction Embedding [7], Minimal Loss Hashing [8], Kernel-based Supervised Hashing [3], Hamming Distance Metric Learning [9], and Column Generation Hashing [10]. Although supervised and semi-supervised hashing methods have been shown to achieve better accuracy with more compact hashing codes, annotating labels is time-consuming and laborious in practical applications. In the past years, some classical unsupervised hashing methods have also been developed, e.g. Isotropic Hashing [11], Spherical Hashing [12], Discrete Graph Hashing [13], Locally Linear Hashing [14], Asymmetric Inner-product Binary Coding [15], and Scalable Graph Hashing [16].

In these traditional hashing methods, each image is initially represented by a hand-crafted feature. However, these features may not preserve accurate semantic information and may not be suitable for generating binary codes; as a result, retrieval accuracy falls short of requirements. Over the last five years, deep learning has proved effective in computer vision because it automatically extracts high-level semantic features that are robust to object variances. Salakhutdinov and Hinton [6] first proposed a hashing method based on deep neural networks. However, in [6], the input of the network is still hand-crafted features, which is its most crucial limitation.

Very recently, Convolutional Neural Network Hashing [17] introduced an end-to-end network for learning better hashing codes. However, this method cannot perform feature learning and hashing-code learning simultaneously. Following [17], new variants of deep hashing have been proposed, e.g. Deep Neural Network Hashing [18], Deep Semantic Ranking Hashing [19], Deep Supervised Hashing [20], and DeepBit [21], which extract features and learn hashing codes simultaneously. These methods are more effective and efficient in image retrieval tasks. However, most of these deep hashing methods, except DeepBit [21] and DBD-MQ [22], are purely supervised. DBD-MQ [22] proposes a quantization method for hashing learning: it does not use a rigid sign function for binarization and instead treats binarization as a multi-quantization task. DeepBit [21] tries to make hashing codes invariant to rotation by minimizing the difference between the hashing codes of a reference image and those of its rotated copy. However, this method only considers the rotation invariance of individual images, and invariance among different images with the same class label cannot be guaranteed.

In this paper, motivated by the success of DeepBit [21], we propose a novel unsupervised deep hashing method, called unsupervised semantic deep hashing (USDH).

Figure 1: We enforce four criteria on the loss function to learn efficient hashing codes based on the VGG-19 architecture. In the training stage, the hashing codes are learned from image pairs. In the first stage, we train the deep model by minimizing the quantization loss and the information loss, and use the mid-level features to guide the learning of the hashing codes. In the second stage, we augment the dataset with rotations, and the hashing codes are made invariant to rotation by minimizing the distance between the codes of the reference image and those of its rotated copy.

The main contributions of USDH are outlined as follows:

USDH is an unsupervised end-to-end deep hashing framework. Compared with the DeepBit method, USDH not only considers the rotation invariance of a single image but also preserves the semantic information of image pairs.

USDH preserves the similarity information of the feature space. It regards the output of the fully-connected layer as the representation descriptor of an image, and the loss function requires the hashing codes learned by the deep network to approximate the similarity computed from these descriptors.

Experiments on standard datasets show that USDH outperforms other unsupervised methods and achieves state-of-the-art performance in image retrieval. It is also quite effective for fine-grained classification.

2 Unsupervised Semantic Deep Hashing

In this paper, we introduce a novel unsupervised deep hashing method. Compared with existing methods, our method utilizes the relevant information preserved in the feature space to guide the learning of hashing codes. Based on this motivation, the cost function of unsupervised semantic deep hashing contains four components: 1) preserving the relevant information of the feature space in the hashing codes via a semantic loss; 2) minimizing the quantization loss between binary-like codes and hashing codes; 3) improving the usage of each bit of the hashing codes by maximizing the information entropy; and 4) keeping the learned hashing codes invariant to rotation by pulling together the hashing codes of a reference image and those of its rotated copy. The whole deep model is shown in Figure 1. The cost function is written as:

$$J = J_1 + \lambda J_2 + \mu J_3 + \gamma J_4,$$

where $J_1$ represents the semantic loss, $J_2$ the quantization loss, $J_3$ the information loss, and $J_4$ the rotation loss, and $\lambda$, $\mu$, $\gamma$ are weighting parameters.

2.1 Semantic Loss

To preserve the semantic information of the feature space, we first adopt a suitable feature to represent images and a proper formula to measure image similarity in the feature space; we then require the similarity computed from the hashing codes of each image pair to approximate the similarity measured in the original feature space.

Firstly, we adopt the VGG-19 model to process the images and use the output of the second fully-connected layer as our image feature. Much research has shown that the high-level features of a convolutional neural network carry sufficient semantic information and that these mid-level features are robust to within-class variances such as rotation, shape, and color. Various metrics exist for measuring similarity in the feature space; we adopt a widely-used one defined as:


Where denotes the dimension of the second full-connected layer and is a positive constant parameter. can represent a similarity degree of the images and . The hashing codes of image is denoted as .

We require the hashing codes to preserve this semantic information. More specifically, if $s_{ij}$ is near 1, the hashing codes $b_i$ and $b_j$ should have a smaller distance; if $s_{ij}$ is near 0, $b_i$ and $b_j$ should have a larger distance. For each training batch, we thus obtain a similarity matrix and use the similarity degrees in the feature space to guide the learning of the hashing codes. To do so, we also define a similarity measure in Hamming space and require it to match the similarity matrix defined in the original feature space as closely as possible.

According to this constraint, points that are neighbors in the feature space remain neighbors in the hashing space. Specifically, the binary code $b_i \in \{0,1\}^m$ is relaxed to a continuous code $h_i \in [0,1]^m$, and the hashing codes are then linearly transformed to

$$u_i = 2h_i - 1,$$

where $u_i \in [-1,1]^m$. The inner product of $u_i$ and $u_j$ lies in the range $[-m, m]$, where $m$ is the length of the hashing codes. The inner product is then linearly transformed to $[0,1]$ via $\hat{s}_{ij} = \frac{1}{2}\left(1 + \frac{1}{m} u_i^{\top} u_j\right)$. The result of this linear transformation is also regarded as a similarity degree, and it fits the assumption in the information loss that each bit of the hashing codes plays the same role. The semantic loss is written as:

$$J_1 = \sum_{i,j} \left| s_{ij} - \hat{s}_{ij} \right|.$$
With this loss function, the deep model is trained by back-propagation with batch gradient descent, which requires the gradient of the semantic loss. Since the $\ell_1$ norm is non-differentiable at certain points, we employ the sub-gradient, defined at those points to equal the right-hand derivative. The gradient of the semantic loss is then:

$$\frac{\partial J_1}{\partial u_i} = \sum_{j} \operatorname{sgn}\left(\hat{s}_{ij} - s_{ij}\right) \frac{u_j}{m}.$$
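As a concrete illustration, the batch-wise semantic loss described above can be sketched in NumPy. The Gaussian form of the feature similarity and the names `feats`, `codes`, and `sigma` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def semantic_loss(feats, codes, sigma):
    # Sketch of the semantic loss: the L1 gap between pairwise similarity in
    # the feature space and in the (relaxed) hashing space.
    # feats: (n, d) image descriptors; codes: (n, m) relaxed codes in [0, 1].
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
    s = np.exp(-d2 / sigma)                  # feature-space similarity in (0, 1]
    u = 2.0 * codes - 1.0                    # map relaxed codes [0,1] -> [-1,1]
    m = codes.shape[1]                       # code length
    s_hat = 0.5 * (1.0 + (u @ u.T) / m)      # inner product rescaled to [0, 1]
    return float(np.abs(s - s_hat).mean())
```

When two images have identical features and identical codes, both similarities equal 1 and the loss vanishes; codes that disagree on every bit for identical features are maximally penalized.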

2.2 Quantization Loss

Since it is difficult to directly optimize a discrete loss function, we relax the objective function to turn the discrete problem into a continuous optimization problem. As discussed in [20], widely-used relaxation schemes based on non-linear functions such as sigmoid and tanh would inevitably slow down or even prevent the convergence of the network [23]. To overcome this limitation, we keep the ReLU function as the activation of the second fully-connected layer and quantize the output of the network to binary codes. The quantization function is written as:

$$b_i = B(h_i), \qquad B(x) = \begin{cases} 1, & x \geq 0.5 \\ 0, & \text{otherwise}, \end{cases}$$

where $B(\cdot)$ denotes the element-wise binarization function.

To decrease this loss, we push the values of the network's output toward 1 or 0. First, the hashing codes are linearly transformed to $2h_i - 1$ as before; the element-wise absolute value of the result should then be near 1. The quantization loss is defined as:

$$J_2 = \alpha \sum_{i} \left\| \, |2h_i - 1| - \mathbf{1} \, \right\|_1,$$

where $|\cdot|$ denotes the element-wise absolute value, $\|\cdot\|_1$ denotes the $\ell_1$ norm, $\mathbf{1}$ is the all-ones vector, and $\alpha$ is a weighting parameter.
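The quantization penalty above can be sketched in a few lines of NumPy; the name `quantization_loss` and the default weight are illustrative:

```python
import numpy as np

def quantization_loss(codes, alpha=1.0):
    # Sketch: |2h - 1| equals 1 exactly when h is 0 or 1, so the L1 gap to the
    # all-ones vector measures how far the relaxed codes are from binary.
    return alpha * float(np.abs(np.abs(2.0 * codes - 1.0) - 1.0).sum())
```

The loss is zero for exactly binary codes and largest for codes stuck at the ambiguous value 0.5.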

To train the model, the gradient of $J_2$ is needed. We again take the sub-gradient in place of the gradient of $J_2$ because of the non-differentiable points of the absolute value and the $\ell_1$ norm. The gradient is written as:

$$\frac{\partial J_2}{\partial h_i} = 2\alpha \, \operatorname{sgn}\left(|2h_i - 1| - \mathbf{1}\right) \odot \operatorname{sgn}(2h_i - 1),$$

where $\odot$ denotes the element-wise product.
2.3 Information Loss

As assumed in the semantic loss, each bit of the hashing codes should play an equivalent role, which means each bit should have the same mean value. Inspired by the effectiveness of the DeepBit [21] method, we also maximize the information capacity of each bit of the hashing codes by requiring each bit to be one with probability one half. Based on this constraint, the balanced-distribution criterion can be written as:

$$J_3 = \left\| \mu - \tfrac{1}{2}\mathbf{1} \right\|_2^2, \qquad \mu_k = \frac{1}{N} \sum_{i=1}^{N} h_i^{(k)},$$

where $\mu_k$ denotes the mean value of the $k$-th bit of the hashing codes, $\|\cdot\|_2$ denotes the $\ell_2$ norm, and $N$ denotes the size of the training batch.
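A minimal NumPy sketch of this balance criterion, under the assumption that the relaxed codes lie in [0, 1]:

```python
import numpy as np

def information_loss(codes):
    # Sketch: penalize each bit's batch mean for deviating from 0.5, pushing
    # every bit toward a balanced (maximum-entropy) on/off distribution.
    mu = codes.mean(axis=0)                  # per-bit mean over the batch
    return float(((mu - 0.5) ** 2).sum())
```

A batch in which each bit fires on exactly half the images incurs zero loss, while a bit that is always on or always off carries no information and is penalized most.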

2.4 Rotation Loss

Widely-used hand-crafted features are designed to be invariant to rotation and scale. Motivated by this, we also rotate the images and pull together the hashing codes of the reference image and those of its rotated copies. The proposed rotation-invariance criterion can be written as:

$$J_4 = \sum_{i} \sum_{\theta \in \Theta} \left\| b_i - b_i^{\theta} \right\|_2^2,$$

where $b_i^{\theta}$ denotes the hashing codes of image $i$ rotated by angle $\theta$, and $\Theta$ is the set of rotation angles used for augmentation.
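The rotation criterion can be sketched as follows; `np.rot90` (multiples of 90 degrees) stands in for the paper's rotation augmentation, and `code_fn` is any assumed image-to-code mapping, both purely for illustration:

```python
import numpy as np

def rotation_loss(code_fn, image, ks=(1, 2, 3)):
    # Sketch: mean squared distance between the code of a reference image and
    # the codes of its rotated copies. code_fn maps an HxW array to a vector.
    ref = code_fn(image)
    return float(np.mean([((ref - code_fn(np.rot90(image, k))) ** 2).sum()
                          for k in ks]))
```

A perfectly rotation-invariant encoder incurs zero loss; any dependence on orientation shows up as a positive penalty.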

3 Experiment

To evaluate the proposed method, we conduct experiments on three datasets: two widely used image retrieval datasets, CIFAR-10 and NUSWIDE, and one recognition dataset, Oxford flower17. As in other image retrieval work, our method is evaluated by mean average precision (MAP) at top 1000. Compared with representative unsupervised hashing methods, including KMH [24], SphH [12], SpeH [4], PCAH [5], LSH [1], PCA-ITQ [25], DH [26], DeepBit [21], and DBD-MQ [22], the experimental results verify that our proposed method outperforms these existing unsupervised hashing methods. To show that our method generalizes to other computer vision applications, we also conduct experiments on fine-grained recognition using the Oxford flower17 dataset.
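For reference, average precision at top k for a single query can be computed as below. This is a generic sketch, not the paper's evaluation code; one common convention (normalizing by the number of relevant items found in the top k) is assumed, while other implementations divide by the total number of relevant items:

```python
import numpy as np

def average_precision_at_k(rels, k=1000):
    # Sketch of AP@k for one query: rels is the 0/1 relevance of the ranked
    # retrieval list, ordered by decreasing similarity to the query.
    rels = np.asarray(rels[:k], dtype=float)
    if rels.sum() == 0:
        return 0.0
    prec = np.cumsum(rels) / np.arange(1, len(rels) + 1)
    return float((prec * rels).sum() / rels.sum())
```

MAP is then the mean of this quantity over all queries.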

3.1 Dataset

The CIFAR-10 dataset consists of 60,000 32×32 images in 10 classes, with 6,000 images per class; each image belongs to exactly one class. The dataset is divided into a training set (5,000 images per class) and a test set (1,000 images per class).

NUSWIDE is a multi-label dataset containing nearly 270k images associated with 81 semantic concepts. Following [17], we select the 21 most frequent concepts, each associated with at least 5,000 images. The dataset is split into a training set and a test set: we sample 100 images from each concept to form the test set, and the remaining images form the training set.

The Oxford 17 flower dataset consists of 1,360 images belonging to 17 mutually exclusive classes, with 80 images per class. The dataset is divided into a train set, a test set, and a validation set, with 40, 20, and 20 images per class respectively. In our experiments, we ignore the validation set.

3.2 Implementation Details

Our method is implemented in Caffe and the deep model is trained by batch gradient descent. As shown in Figure 1, we use VGG-19 as the base model, pre-trained on the ImageNet dataset; the output layer of VGG-19 is then replaced by a hashing layer. In the training stage, images are fed in batches and every two images in the same batch form an image pair. The parameters of the deep model are updated by minimizing the objective function, comprising the semantic, quantization, information, and rotation losses. We conduct experiments learning 16-bit, 32-bit, and 48-bit hashing codes on the CIFAR-10 and NUSWIDE datasets. Since this paper proposes multiple loss functions, we further evaluate them individually: the semantic loss proves most important, and our quantization loss also improves performance. Given the effectiveness of the semantic loss, we additionally analyze its robustness by varying the parameter $\sigma$ in the semantic loss; the constant parameters are set relative to $d$, the dimension of the output of the second fully-connected layer. To demonstrate the generality of the hashing codes learned by USDH, we also conduct experiments on other computer vision tasks, such as fine-grained classification.

3.3 Results on image retrieval

Similar to the DeepBit [21] method, the dataset is split into two parts. More specifically, for CIFAR-10, 10,000 images are randomly selected as queries and retrieval is performed on the remaining images. We define similarity labels from the semantic-level labels: images from the same class are considered similar. For NUSWIDE, we follow the setting in [17]: two images are considered similar if they share at least one label. The mean average precision (MAP, %) at top 1000 of different unsupervised hashing methods on CIFAR-10 is shown in Table 1. The results in Table 1 show that USDH outperforms the existing best retrieval performance (DBD-MQ) by 4.60%, 10.06%, and 7.42%, and improves on the DeepBit method by 6.70%, 11.70%, and 11.54%, for 16, 32, and 64 bits respectively. We also conduct experiments on large-scale image retrieval: as shown in Table 2, our method achieves absolute increases of 18.79%, 18.86%, and 17.87% in MAP over the best competing method for the different bit lengths on NUSWIDE. These results show that USDH is effective for image retrieval and that exploiting the semantic information among different images in the feature space significantly improves performance.

Method 16-bit 32-bit 64-bit
KMH 13.59 13.93 14.46
SphH 13.98 14.58 15.38
SpeH 12.55 12.42 12.56
PCAH 12.91 12.60 12.10
LSH 12.55 13.76 15.07
PCA-ITQ 15.67 16.20 16.64
DH 16.17 16.62 16.96
DeepBit 19.43 24.86 27.73
DBD-MQ 21.53 26.50 31.85
USDH 26.13 36.56 39.27
Table 1: Mean Average Precision (MAP, %) results for different numbers of bits on CIFAR-10
Method 16-bit 32-bit 48-bit
SphH 41.30 42.40 43.10
SpeH 43.30 42.60 42.30
PCAH 42.90 43.70 41.40
LSH 40.30 42.60 42.30
PCA-ITQ 45.28 46.82 47.70
DH 42.20 44.80 48.00
DeepBit 38.30 40.10 41.20
USDH 64.07 65.68 65.87
Table 2: MAP (%) results for different numbers of bits on NUSWIDE

Component analysis of loss function: Our loss function consists of four components. In this section, we evaluate the effectiveness of the two major components: the semantic loss and the quantization loss. The results on CIFAR-10 are shown in Table 3. It is worth noting that the semantic loss improves performance by 7.62% over the DeepBit method, and the quantization loss proposed in this paper further improves performance by 4.07%.

Method MAP
DeepBit 24.86
DeepBit+semantic loss 32.48
our method 36.55
Table 3: Effectiveness (MAP, 32 bits) of the different loss functions

Robustness analysis of semantic loss: Since the above results have shown the effectiveness of the semantic loss, we next examine the influence of different parameter settings. We set the parameter $\sigma$ to several different values and learn 64-bit hashing codes on CIFAR-10, where $d$ denotes the dimension of the second fully-connected layer. Table 4 reveals that the semantic loss is robust to the value of $\sigma$. The results suggest that the hashing codes learned by USDH depend on the relative relationships of image features rather than their exact similarity values.

MAP 39.27 39.02 39.20 39.11
Table 4: Image retrieval MAP of USDH for different values of the parameter $\sigma$

3.4 Results on fine-grained classification

Unlike supervised hashing methods, USDH learns hashing codes without label information. It therefore has broader practical potential, benefiting not only image retrieval but also other computer vision tasks such as fine-grained classification. To verify this, we conduct fine-grained classification experiments against traditional features such as SIFT, HOG, and HSV. It is worth noting that DeepBit is also a deep unsupervised hashing method; however, DeepBit [21] only requires hashing codes to be invariant to rotation and does not consider the within-class variance among different images. Fine-grained classification is a classic computer vision task that discriminates between sub-classes of the same super class, which requires image descriptors invariant to within-class variance. More specifically, for flower classification, within-class variances include color differences, shape deformation, and pose. We select a multi-class SVM as the classifier and conduct experiments with different features. Table 5 and Figure 2 show the classification accuracy of these experiments. Since within-class variance limits the effectiveness of traditional color descriptors and hand-crafted shape descriptors, the hashing codes learned by the deep network achieve superior performance, improving on the SIFT-Internal feature by 10.7%. Compared with DeepBit, our method still improves accuracy by 6.2%. Additionally, our method is as fast as DeepBit and much faster than the traditional descriptors owing to its low dimensionality. These experiments show that the proposed USDH method is also effective for classification tasks.

4 Conclusions

In this paper, we propose a novel unsupervised deep hashing method, named unsupervised semantic deep hashing (USDH). The parameters of the deep neural network are fine-tuned according to four loss functions: 1) semantic loss; 2) quantization loss; 3) information loss; and 4) rotation loss. Compared with previous unsupervised deep hashing methods, USDH requires the hashing codes to preserve the relevant semantic information of the feature space. Extensive experiments on the CIFAR-10 and NUSWIDE datasets demonstrate that our proposed method outperforms existing unsupervised hashing methods for image retrieval. The experimental results on the Oxford17 dataset also show that the hashing codes learned by USDH are effective for other computer vision tasks, such as fine-grained classification.

Feature Accuracy(%) Training time(sec)
Colour 60.9 ± 2.1 3
Texture 70.2 ± 1.3 4
HOG 63.7 ± 2.7 3
HSV 58.5 ± 4.5 4
SIFT-Boundary 59.4 ± 3.3 4
SIFT-Internal 70.6 ± 1.6 4
DeepBit 75.1 ± 2.5 0.07
USDH 81.3 ± 2.1 0.07
Table 5: Recognition accuracy for fine-grained classification on the Oxford17 dataset with different features
Figure 2: Confusion matrix of Oxford 17 flower classification using the proposed USDH.


  • [1] Alexandr Andoni and Piotr Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 459–468.
  • [2] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and David Suter, “Fast supervised hashing with decision trees for high-dimensional data,” in CVPR, 2014, pp. 1963–1970.
  • [3] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang, “Supervised hashing with kernels,” in CVPR. IEEE, 2012, pp. 2074–2081.
  • [4] Yair Weiss, Antonio Torralba, and Rob Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
  • [5] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang, “Semi-supervised hashing for scalable image retrieval,” in CVPR. IEEE, 2010, pp. 3424–3431.
  • [6] Ruslan Salakhutdinov and Geoffrey Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
  • [7] Brian Kulis and Trevor Darrell, “Learning to hash with binary reconstructive embeddings,” in Advances in neural information processing systems, 2009, pp. 1042–1050.
  • [8] Mohammad Norouzi and David M Blei, “Minimal loss hashing for compact binary codes,” in ICML-11, 2011, pp. 353–360.
  • [9] Mohammad Norouzi, David J Fleet, and Ruslan R Salakhutdinov, “Hamming distance metric learning,” in Advances in neural information processing systems, 2012, pp. 1061–1069.
  • [10] Xi Li, Guosheng Lin, Chunhua Shen, Anton Hengel, and Anthony Dick, “Learning hash functions using column generation,” in ICML, 2013, pp. 142–150.
  • [11] Weihao Kong and Wu-Jun Li, “Isotropic hashing,” in Advances in Neural Information Processing Systems, 2012, pp. 1646–1654.
  • [12] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon, “Spherical hashing,” in CVPR. IEEE, 2012, pp. 2957–2964.
  • [13] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang, “Discrete graph hashing,” in Advances in Neural Information Processing Systems, 2014, pp. 3419–3427.
  • [14] Go Irie, Zhenguo Li, Xiao-Ming Wu, and Shih-Fu Chang, “Locally linear hashing for extracting non-linear manifolds,” in CVPR, 2014, pp. 2115–2122.
  • [15] Fumin Shen, Wei Liu, Shaoting Zhang, Yang Yang, and Heng Tao Shen, “Learning binary codes for maximum inner product search,” in CVPR, 2015, pp. 4148–4156.
  • [16] Qing-Yuan Jiang and Wu-Jun Li, “Scalable graph hashing with feature transformation.,” in IJCAI, 2015, pp. 2248–2254.
  • [17] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan, “Supervised hashing for image retrieval via image representation learning.,” in AAAI, 2014, vol. 1, pp. 2156–2162.
  • [18] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in CVPR, 2015, pp. 3270–3278.
  • [19] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in CVPR, 2015, pp. 1556–1564.
  • [20] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen, “Deep supervised hashing for fast image retrieval,” in CVPR, 2016, pp. 2064–2072.
  • [21] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou, “Learning compact binary descriptors with unsupervised deep neural networks,” in CVPR, 2016, pp. 1183–1192.
  • [22] Yueqi Duan, Jiwen Lu, Ziwei Wang, Jianjiang Feng, and Jie Zhou, “Learning deep binary descriptor with multi-quantization,” in CVPR, July 2017.
  • [23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [24] Kaiming He, Fang Wen, and Jian Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in CVPR, 2013, pp. 2938–2945.
  • [25] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on PAMI, vol. 35, no. 12, pp. 2916–2929, 2013.
  • [26] Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen, “Deep learning of binary hash codes for fast image retrieval,” in CVPR workshops, 2015, pp. 27–35.