1 Introduction
With the explosive growth of data, searching for content-relevant images or other media remains challenging because of the large computational cost and the accuracy requirements. In the early stage, researchers focused on data-independent methods: Locality-Sensitive Hashing [1] and its variants were proposed, but they achieve lower accuracy since the semantic information of the data is not considered during the coding process. In recent years, data-dependent hashing methods [2] have attracted more attention owing to their compact representations and superior accuracy. Compared with data-independent methods, data-dependent hashing methods improve retrieval performance by training on the dataset.
Data-dependent methods mainly include supervised hashing methods [3], unsupervised hashing methods [4], and semi-supervised hashing methods [5]. Supervised methods make use of the class information provided by manual labels, where the supervised information is used in three forms: pointwise labels, pairwise labels, and ranking labels. Representative works include Semantic Hashing [6], Binary Reconstructive Embedding [7], Minimal Loss Hashing [8], Kernel-based Supervised Hashing [3], Hamming Distance Metric Learning [9], and Column Generation Hashing [10]. Although supervised and semi-supervised hashing methods have been shown to achieve better accuracy with more compact hashing codes, collecting labels is a time-consuming and labor-intensive task in practical applications. In the past years, several classical unsupervised hashing methods have also been developed, e.g., Isotropic Hashing [11], Spherical Hashing [12], Discrete Graph Hashing [13], Locally Linear Hashing [14], Asymmetric Inner-product Binary Coding [15], and Scalable Graph Hashing [16].
In these traditional hashing methods, each image is initially represented by a hand-crafted feature. However, such features may not preserve accurate semantic information, and they may not be suitable for generating binary codes. As a result, the retrieval accuracy often cannot meet practical requirements. Over the last five years, deep learning has proved effective in computer vision because it automatically extracts high-level semantic features that are robust to object variances. Salakhutdinov and Hinton [6] first proposed a hashing method based on a deep neural network; however, the input of their network is still a hand-crafted feature, which is the most crucial limitation. More recently, Convolutional Neural Network Hashing [17] introduced an end-to-end network for learning better hashing codes, but it cannot perform feature learning and hashing-code learning simultaneously. Following [17], new variants of deep hashing have been proposed, e.g., Deep Neural Network Hashing [18], Deep Semantic Ranking Hashing [19], Deep Supervised Hashing [20], and DeepBit [21], which extract features and learn hashing codes simultaneously. These methods are more effective and efficient in image retrieval tasks. However, most deep hashing methods, except DeepBit [21] and DBD-MQ [22], are purely supervised. DBD-MQ [22] proposes a quantization method for hashing learning: it does not use the rigid sign function for binarization and instead treats binarization as a multi-quantization task. DeepBit [21] makes hashing codes invariant to rotation by minimizing the difference between the codes of a reference image and those of its rotated version. However, DeepBit only considers the rotation invariance of individual images; the invariance among different images with the same class label cannot be guaranteed. In this paper, motivated by the success of DeepBit [21], we propose a novel unsupervised deep hashing method, called unsupervised semantic deep hashing (USDH).
The main contributions of USDH are outlined as follows:
- USDH is an unsupervised end-to-end deep hashing framework. Compared with DeepBit, USDH not only considers rotation invariance within a single image but also preserves the semantic information of image pairs.
- USDH preserves the similarity information of the feature space: it regards the output of a fully-connected layer as the representation descriptor of an image, and its loss function requires the hashing codes learned by the deep network to approximate the similarities computed from these descriptors.
- Experiments on benchmark datasets show that USDH outperforms other unsupervised methods and achieves state-of-the-art performance in image retrieval. It is also quite effective for fine-grained classification.
2 Unsupervised Semantic Deep Hashing
In this paper, we introduce a novel unsupervised deep hashing method. Compared with existing methods, our method utilizes the relevant information preserved in the feature space to guide the learning of hashing codes. Based on this motivation, the cost function of unsupervised semantic deep hashing contains four components: 1) preserving the relevant information of the feature space in the hashing codes via a semantic loss; 2) minimizing the quantization loss between binary-like codes and hashing codes; 3) improving the usage of each bit of the hashing codes by maximizing the information entropy; and 4) keeping the learned hashing codes invariant to rotation by pulling together the hashing codes of a reference image and those of its rotated version. The whole deep model is shown in Figure 1. The cost function is written as:
$$\mathcal{L} = \mathcal{L}_s + \mathcal{L}_q + \mathcal{L}_i + \mathcal{L}_r \quad (1)$$

where $\mathcal{L}_s$ represents the semantic loss, $\mathcal{L}_q$ the quantization loss, $\mathcal{L}_i$ the information loss, and $\mathcal{L}_r$ the rotation loss.
2.1 Semantic Loss
To preserve the semantic information of the feature space, we first adopt a suitable feature to represent images and a proper formula to measure image similarity in that space; we then let the similarity computed from the hashing codes of an image pair approximate the similarity measured in the original feature space.
Firstly, we adopt the VGG19 model to process the images and use the output of the second fully-connected layer as the image feature. Many studies have shown that the high-level features of a convolutional neural network carry sufficient semantic information and are robust to within-class variances such as rotation, shape, and color. Different metrics exist for measuring similarity in the feature space; we adopt a widely-used one, defined as:
$$s_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert_2^2}{t \cdot d}\right) \quad (2)$$

where $x_i$ denotes the feature of image $i$, $d$ denotes the dimension of the second fully-connected layer, and $t$ is a positive constant parameter. $s_{ij}$ represents the similarity degree of images $i$ and $j$. The hashing code of image $i$ is denoted as $b_i$.
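As a concrete illustration, the similarity of Eq. (2) can be computed for a whole batch at once. The sketch below (in NumPy; function and variable names are our own, not from any released code) builds the pairwise similarity matrix that later supervises the codes:

```python
import numpy as np

def similarity_matrix(feats, t):
    """Pairwise similarities s_ij = exp(-||x_i - x_j||^2 / (t * d)).

    feats: (m, d) batch of second fully-connected-layer features;
    t: the positive constant of Eq. (2). Names are illustrative.
    """
    d = feats.shape[1]
    sq = np.sum(feats ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    dist2 = np.maximum(dist2, 0.0)  # clamp tiny negatives from round-off
    return np.exp(-dist2 / (t * d))
```

Each row $i$ then holds the similarity of image $i$ to every other image in the batch; the diagonal is 1 by construction, and the matrix is symmetric.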
We require the hashing codes to preserve this semantic information. More specifically, if $s_{ij}$ is near 1, we require the hashing codes $b_i$ and $b_j$ to have a small distance; if $s_{ij}$ is near 0, $b_i$ and $b_j$ should have a large distance. For each training batch we obtain a similarity matrix, and we use the similarity degrees in the feature space to guide the learning of the hashing codes. To do so, we also define a similarity measure in Hamming space and require it to be as close as possible to the similarity matrix defined in the original feature space.
Under this constraint, neighboring points in the feature space remain neighbors in the Hamming space. Specifically, the binary code $b_i$ is relaxed to a real-valued code $h_i \in [0,1]^k$, which is linearly transformed to $\hat{h}_i$:

$$\hat{h}_i = 2h_i - 1 \quad (3)$$

so that $\hat{h}_i \in [-1,1]^k$. The inner product of $\hat{h}_i$ and $\hat{h}_j$ lies in the range $[-k, k]$, where $k$ is the length of the hashing codes. The inner product is then linearly transformed to $[0,1]$ via $p_{ij} = \frac{1}{2}\left(1 + \frac{1}{k}\hat{h}_i^{\top}\hat{h}_j\right)$. The result of this linear transformation is also regarded as a similarity degree, and it fits the assumption of the information loss that each bit of the hashing codes plays the same role. The semantic loss is written as:
$$\mathcal{L}_s = \sum_{i,j} \lVert p_{ij} - s_{ij} \rVert_1 \quad (4)$$
With this loss function, the deep model is trained by the back-propagation algorithm with batch gradient descent, which requires the gradient of the semantic loss. Since the $\ell_1$ norm is non-differentiable at certain points, we employ a subgradient, defined at those points to equal the right-hand derivative. The subgradient of the semantic loss with respect to the relaxed codes is:

$$\frac{\partial \mathcal{L}_s}{\partial h_i} = \frac{2}{k} \sum_{j} \delta_{ij}\, \hat{h}_j \quad (5)$$

where $\delta_{ij} = \operatorname{sign}(p_{ij} - s_{ij})$, with $\operatorname{sign}(0) := 1$ by the right-hand-derivative convention.
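The subgradient of Eq. (5) can be verified numerically against a finite difference. The sketch below (NumPy; for clarity we restrict the loss to off-diagonal pairs $i \neq j$ and assume a symmetric similarity matrix, as Eq. (2) produces; all names are ours):

```python
import numpy as np

def semantic_loss(H, S):
    """L1 semantic loss over off-diagonal pairs: sum_{i != j} |p_ij - s_ij|."""
    m, k = H.shape
    Ht = 2.0 * H - 1.0                 # relaxed codes mapped to [-1, 1]
    P = 0.5 * (1.0 + (Ht @ Ht.T) / k)  # code similarity rescaled to [0, 1]
    D = np.abs(P - S)
    return D.sum() - np.trace(D)

def semantic_subgradient(H, S):
    """Subgradient of the loss w.r.t. H, with sign(0) := +1.

    Assumes S is symmetric, so the (i, j) and (j, i) terms combine.
    """
    m, k = H.shape
    Ht = 2.0 * H - 1.0
    P = 0.5 * (1.0 + (Ht @ Ht.T) / k)
    sgn = np.where(P - S >= 0.0, 1.0, -1.0)
    np.fill_diagonal(sgn, 0.0)         # off-diagonal pairs only
    # d p_ij / d h_i = ht_j / k; the symmetric pair (j, i) doubles it
    return 2.0 * (sgn @ Ht) / k
```

Because the loss is piecewise linear in each entry of $H$, a central difference matches the subgradient exactly away from the kinks.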
2.2 Quantization Loss
Since it is difficult to directly optimize a discrete loss function, we relax the objective to transform the discrete problem into a continuous one. As discussed in [20], widely-used relaxation schemes based on non-linear functions such as sigmoid and tanh would inevitably slow down or even restrain the convergence of the network [23]. To avoid this limitation, we keep the ReLU function as the activation of the second fully-connected layer, and quantize the output of the network to binary codes with:

$$b_i = \mathbb{1}\left[h_i \ge 0.5\right]$$

where $\mathbb{1}[\cdot]$ denotes the element-wise binarization function.
To decrease the quantization loss, we push the values of the network's output toward 1 or 0. First, the relaxed code $h_i$ is linearly transformed to $\hat{h}_i = 2h_i - 1$ as before; the element-wise absolute value of $\hat{h}_i$ should then be near 1. The quantization loss is defined as:

$$\mathcal{L}_q = \eta \sum_{i} \left\lVert\, \lvert \hat{h}_i \rvert - \mathbf{1} \right\rVert_1 \quad (6)$$

where $\lvert \cdot \rvert$ denotes the element-wise absolute value, $\lVert \cdot \rVert_1$ denotes the $\ell_1$ norm, $\mathbf{1}$ is the all-ones vector, and $\eta$ is a weighting parameter.
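The binarization and the quantization loss of Eq. (6) can be sketched as follows (NumPy; the 0.5 threshold and the function names are our assumptions, consistent with ReLU outputs pushed toward 0 or 1):

```python
import numpy as np

def binarize(H):
    """Hard binarization b = 1[h >= 0.5]; applied at retrieval time."""
    return (H >= 0.5).astype(np.int8)

def quantization_loss(H, eta=1.0):
    """Eq. (6): push |2h - 1| toward 1 so each relaxed entry nears 0 or 1.

    H: (m, k) relaxed codes; eta: the weighting parameter.
    """
    return eta * np.abs(np.abs(2.0 * H - 1.0) - 1.0).sum()
```

Codes already at exactly 0 or 1 incur zero loss, while maximally ambiguous entries at 0.5 incur the largest per-bit penalty of 1.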
To train the model, the gradient of $\mathcal{L}_q$ must also be computed. As with the semantic loss, a subgradient replaces the gradient at the non-differentiable points introduced by the absolute value and the $\ell_1$ norm.
2.3 Information Loss
As assumed by the semantic loss, each bit of the hashing codes should have an equivalent impact, which means each bit should have the same mean value. Inspired by the efficiency of the DeepBit [21] method, we also maximize the information carried by each bit of the hashing codes by requiring that each bit has a half chance of being one. Under this constraint, the balanced-distribution criterion can be written as:
$$\mathcal{L}_i = \gamma \sum_{j=1}^{k} \left( \mu_j - 0.5 \right)^2, \qquad \mu_j = \frac{1}{m} \sum_{i=1}^{m} h_{ij} \quad (7)$$

where $\mu_j$ denotes the mean value of the $j$-th bit of the hashing codes over the batch, the squared $\ell_2$ norm is used, $m$ denotes the size of the training batch, and $\gamma$ is a weighting parameter.
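A minimal NumPy sketch of the balance criterion in Eq. (7) (names and the default weight are illustrative):

```python
import numpy as np

def information_loss(H, gamma=1.0):
    """Eq. (7): each bit's mean over the batch should be 0.5.

    H: (m, k) relaxed codes of one training batch; gamma: weighting parameter.
    """
    mu = H.mean(axis=0)  # per-bit mean over the batch
    return gamma * np.sum((mu - 0.5) ** 2)
```

A batch whose bits are half zeros and half ones incurs zero loss; a batch whose bits are all ones is maximally penalized.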
2.4 Rotation Loss
Widely-used hand-crafted features are designed to be invariant to rotation and scale. Inspired by this, we also rotate the images and pull together the hashing codes of the reference image and those of its rotated version. The proposed rotation-invariance criterion can be written as:
$$\mathcal{L}_r = \sum_{i} \sum_{\theta \in \Theta} \left\lVert h_i - h_i^{\theta} \right\rVert_2^2 \quad (8)$$

where $h_i^{\theta}$ denotes the hashing codes of image $i$ rotated by angle $\theta$, and $\Theta$ is the set of rotation angles.
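The criterion of Eq. (8) reduces to a sum of squared distances between the codes of an image and those of its rotated copies; a NumPy sketch (the rotated codes would come from separate forward passes of the network; names are ours):

```python
import numpy as np

def rotation_loss(H_ref, H_rot):
    """Eq. (8): pull the codes of rotated copies toward the reference codes.

    H_ref: (m, k) codes of the reference images;
    H_rot: list of (m, k) code arrays, one per rotation angle in Theta.
    """
    return sum(np.sum((H_ref - Hr) ** 2) for Hr in H_rot)
```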
3 Experiment
To test the performance of the proposed method, we conduct experiments on three datasets: two widely used image retrieval datasets, CIFAR-10 and NUS-WIDE, and one recognition dataset, Oxford Flower-17. As in other image retrieval work, our method is evaluated by mean average precision (MAP) at top 1000. Compared with representative unsupervised hashing methods, such as KMH [24], SphH [12], SpeH [4], PCAH [5], LSH [1], PCA-ITQ [25], DH [26], DeepBit [21], and DBD-MQ [22], the experimental results verify that our method outperforms these existing unsupervised hashing methods. To show that our method is flexible for other computer vision applications, we also conduct fine-grained recognition experiments on the Oxford Flower-17 dataset.
3.1 Dataset
The CIFAR-10 dataset consists of 60,000 32×32 images in 10 classes, with 6,000 images per class; each image belongs to exactly one class. The dataset is divided into a training set (5,000 images per class) and a test set (1,000 images per class).
NUS-WIDE is a multi-label dataset containing nearly 270k images associated with 81 semantic concepts. Following [17], we select the 21 most frequent concepts, each associated with at least 5,000 images. The dataset is split into a training set and a test set: we sample 100 images from each concept to form the test set, and the remaining images are treated as the training set.
The Oxford Flower-17 dataset consists of 1,360 images belonging to 17 mutually exclusive classes, with 80 images per class. The dataset is divided into three parts: a training set, a test set, and a validation set with 40, 20, and 20 images per class, respectively. In our experiments, we ignore the validation set.
3.2 Implementation Details
The USDH method is implemented in Caffe, and the deep model is trained by batch gradient descent. As shown in Figure 1, we use VGG19, pre-trained on the ImageNet dataset, as the base model, and replace its output layer with a hashing layer. In the training stage, images are fed in batches, and every two images in the same batch form an image pair. The parameters of the deep model are updated by minimizing the objective function, comprising the semantic, quantization, information, and rotation losses. We conduct experiments learning 16-bit, 32-bit, and 64-bit hashing codes on the CIFAR-10 dataset and 16-bit, 32-bit, and 48-bit codes on the NUS-WIDE dataset. Since we propose a loss function with multiple components, we further evaluate them individually: the semantic loss proves to be the most important, and our quantization loss also improves performance. Given the effectiveness of the semantic loss, we additionally analyze its robustness by conducting experiments with different values of the parameter $t$ in the semantic loss; the constant parameters are set as fixed multiples of $d$, the dimension of the output of the second fully-connected layer. To demonstrate the efficiency of the hashing codes learned by USDH, we also conduct experiments in another computer vision field, fine-grained classification.

3.3 Results on image retrieval
Similar to the DeepBit [21] protocol, each dataset is split into two parts. For CIFAR-10, 10,000 images are randomly selected as queries, and retrieval is performed over the remaining images. We define similarity labels at the semantic level: images from the same class are considered similar. For NUS-WIDE, we follow the setting in [17]: two images are considered similar if they share at least one label. The mean average precision (MAP, %) at top 1000 of different unsupervised hashing methods on the CIFAR-10 dataset is shown in Table 1. The results show that USDH surpasses the best existing retrieval performance (DBD-MQ) by 4.60%, 10.06%, and 7.42%, and improves on the DeepBit method by 6.70%, 11.70%, and 11.54%, for 16-bit, 32-bit, and 64-bit codes, respectively. We also conduct experiments for large-scale image retrieval: as shown in Table 2, our method achieves absolute MAP increases of 18.79%, 18.86%, and 17.87% over the best competing method for the different code lengths on NUS-WIDE. These results demonstrate that USDH is effective for image retrieval and that exploiting the semantic information among different images in the feature space significantly improves performance.
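For reference, the evaluation metric works as follows: for each query, database items are ranked by Hamming distance, average precision is computed over the top 1000 results, and the mean over all queries gives MAP. A NumPy sketch of AP@n (a standard definition; names are ours):

```python
import numpy as np

def average_precision_at_n(ranked_relevance, n=1000):
    """AP over the top-n items; ranked_relevance is a 0/1 array ordered
    by increasing Hamming distance to the query."""
    rel = np.asarray(ranked_relevance[:n], dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.nonzero(rel)[0]
    precision_at_hits = np.cumsum(rel)[hits] / (hits + 1.0)
    return precision_at_hits.mean()
```

MAP at top 1000 is then the mean of this value over all query images.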
Table 1: MAP (%) at top 1000 on the CIFAR-10 dataset.

Method | 16-bit | 32-bit | 64-bit
KMH | 13.59 | 13.93 | 14.46
SphH | 13.98 | 14.58 | 15.38
SpeH | 12.55 | 12.42 | 12.56
PCAH | 12.91 | 12.60 | 12.10
LSH | 12.55 | 13.76 | 15.07
PCA-ITQ | 15.67 | 16.20 | 16.64
DH | 16.17 | 16.62 | 16.96
DeepBit | 19.43 | 24.86 | 27.73
DBD-MQ | 21.53 | 26.50 | 31.85
USDH | 26.13 | 36.56 | 39.27
Table 2: MAP (%) at top 1000 on the NUS-WIDE dataset.

Method | 16-bit | 32-bit | 48-bit
SphH | 41.30 | 42.40 | 43.10
SpeH | 43.30 | 42.60 | 42.30
PCAH | 42.90 | 43.70 | 41.40
LSH | 40.30 | 42.60 | 42.30
PCA-ITQ | 45.28 | 46.82 | 47.70
DH | 42.20 | 44.80 | 48.00
DeepBit | 38.30 | 40.10 | 41.20
USDH | 64.07 | 65.68 | 65.87
Component analysis of the loss function: Our loss function consists of four components. In this section, we evaluate the effectiveness of the two major ones: the semantic loss and the quantization loss. The results on CIFAR-10 (32-bit codes) are shown in Table 3. The semantic loss improves performance by 7.62% over the DeepBit method, and the quantization loss proposed in this paper further improves performance by 4.07%.
Table 3: Component analysis on CIFAR-10 (MAP, %).

Method | MAP
DeepBit | 24.86
DeepBit + semantic loss | 32.48
Our method | 36.55
Robustness analysis of the semantic loss: Since the above results show the effectiveness of the semantic loss, we next examine the influence of different parameter settings. We set the parameter $t$ to four different values, expressed as multiples of $d$ (the dimension of the second fully-connected layer), and learn 64-bit hashing codes on CIFAR-10. Table 4 reveals that the semantic loss is robust to the value of $t$: the results suggest that the hashing codes learned by USDH capture the relative relationships of image features rather than their exact similarity values.
Table 4: MAP (%) of 64-bit codes on CIFAR-10 under four settings of $t$.

MAP | 39.27 | 39.02 | 39.20 | 39.11
3.4 Results on fine-grained classification
Unlike supervised hashing methods, USDH learns hashing codes without label information. It therefore has more practical potential, benefiting not only image retrieval but also other computer vision tasks such as fine-grained classification. To verify this, we conduct fine-grained classification experiments and compare with traditional features such as SIFT, HOG, and HSV. DeepBit is also a deep unsupervised hashing method; however, DeepBit [21] only requires the hashing codes to be invariant to rotation and does not consider the within-class variance among different images. Fine-grained classification is a classic computer vision task that discriminates subcategories belonging to the same super class, and it requires image descriptors to be invariant to within-class variance; for flower classification, within-class variances include color differences, shape deformation, and pose. We select a multi-class SVM as the classifier and conduct experiments with different features. Table 5 and Figure 2 show the classification accuracy of these experiments. Since within-class variance limits the effectiveness of traditional color descriptors and hand-crafted shape descriptors, the hashing codes learned by the deep network perform better, improving on the best hand-crafted feature (SIFT-Internal) by 10.7%. Compared with DeepBit, our method still improves accuracy by 6.2%. Additionally, our method is as fast as DeepBit and much faster than the traditional descriptors owing to its low dimensionality. These experiments show that the proposed USDH method is also effective for classification tasks.

4 Conclusions
In this paper, we propose a novel unsupervised deep hashing method, named unsupervised semantic deep hashing (USDH). The parameters of the deep neural network are fine-tuned according to four loss functions: 1) semantic loss; 2) quantization loss; 3) information loss; and 4) rotation loss. Compared with previous unsupervised deep hashing methods, USDH requires the hashing codes to preserve the relevant semantic information of the feature space. Extensive experiments on the CIFAR-10 and NUS-WIDE datasets demonstrate that our method outperforms existing unsupervised hashing methods for image retrieval, and the results on the Oxford Flower-17 dataset show that the hashing codes learned by USDH are also effective for other computer vision tasks, such as fine-grained classification.
Table 5: Fine-grained classification accuracy and training time on Oxford Flower-17.

Feature | Accuracy | Training time (sec)
Colour | 60.9 ± 2.1% | 3
Texture | 70.2 ± 1.3% | 4
HOG | 63.7 ± 2.7% | 3
HSV | 58.5 ± 4.5% | 4
SIFT-Boundary | 59.4 ± 3.3% | 4
SIFT-Internal | 70.6 ± 1.6% | 4
DeepBit | 75.1 ± 2.5% | 0.07
USDH | 81.3 ± 2.1% | 0.07
References
[1] Alexandr Andoni and Piotr Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in FOCS, 2006, pp. 459–468.
[2] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and David Suter, "Fast supervised hashing with decision trees for high-dimensional data," in CVPR, 2014, pp. 1963–1970.
[3] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang, "Supervised hashing with kernels," in CVPR, 2012, pp. 2074–2081.
[4] Yair Weiss, Antonio Torralba, and Rob Fergus, "Spectral hashing," in NIPS, 2009, pp. 1753–1760.
[5] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang, "Semi-supervised hashing for scalable image retrieval," in CVPR, 2010, pp. 3424–3431.
[6] Ruslan Salakhutdinov and Geoffrey Hinton, "Semantic hashing," International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[7] Brian Kulis and Trevor Darrell, "Learning to hash with binary reconstructive embeddings," in NIPS, 2009, pp. 1042–1050.
[8] Mohammad Norouzi and David M. Blei, "Minimal loss hashing for compact binary codes," in ICML, 2011, pp. 353–360.
[9] Mohammad Norouzi, David J. Fleet, and Ruslan R. Salakhutdinov, "Hamming distance metric learning," in NIPS, 2012, pp. 1061–1069.
[10] Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Anthony Dick, "Learning hash functions using column generation," in ICML, 2013, pp. 142–150.
[11] Weihao Kong and Wu-Jun Li, "Isotropic hashing," in NIPS, 2012, pp. 1646–1654.
[12] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon, "Spherical hashing," in CVPR, 2012, pp. 2957–2964.
[13] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang, "Discrete graph hashing," in NIPS, 2014, pp. 3419–3427.
[14] Go Irie, Zhenguo Li, Xiao-Ming Wu, and Shih-Fu Chang, "Locally linear hashing for extracting non-linear manifolds," in CVPR, 2014, pp. 2115–2122.
[15] Fumin Shen, Wei Liu, Shaoting Zhang, Yang Yang, and Heng Tao Shen, "Learning binary codes for maximum inner product search," in CVPR, 2015, pp. 4148–4156.
[16] Qing-Yuan Jiang and Wu-Jun Li, "Scalable graph hashing with feature transformation," in IJCAI, 2015, pp. 2248–2254.
[17] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan, "Supervised hashing for image retrieval via image representation learning," in AAAI, 2014, pp. 2156–2162.
[18] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan, "Simultaneous feature learning and hash coding with deep neural networks," in CVPR, 2015, pp. 3270–3278.
[19] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan, "Deep semantic ranking based hashing for multi-label image retrieval," in CVPR, 2015, pp. 1556–1564.
[20] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen, "Deep supervised hashing for fast image retrieval," in CVPR, 2016, pp. 2064–2072.
[21] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou, "Learning compact binary descriptors with unsupervised deep neural networks," in CVPR, 2016, pp. 1183–1192.
[22] Yueqi Duan, Jiwen Lu, Ziwei Wang, Jianjiang Feng, and Jie Zhou, "Learning deep binary descriptor with multi-quantization," in CVPR, 2017.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1097–1105.

[24] Kaiming He, Fang Wen, and Jian Sun, "K-means hashing: An affinity-preserving quantization method for learning binary compact codes," in CVPR, 2013, pp. 2938–2945.
[25] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin, "Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval," IEEE Transactions on PAMI, vol. 35, no. 12, pp. 2916–2929, 2013.
[26] Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen, "Deep learning of binary hash codes for fast image retrieval," in CVPR Workshops, 2015, pp. 27–35.