1 Introduction
In the era of big data, large-scale visual search is vital for accessing and processing huge collections of images, and it is of great importance in many fields of computer vision. Compared to tree-based approaches, hashing-based methods use hash functions to project image features into compact binary codes, making them better suited to large-scale visual search [1, 2]. Hashing approaches are therefore becoming increasingly appealing for dealing with high-dimensional data.
Broadly speaking, existing hashing approaches can be classified into two categories:
data-independent methods [3, 4, 5] and data-dependent methods [2, 6, 7, 8, 9]. Data-independent methods project the data into a Hamming space with random hash functions, whereas data-dependent methods (also referred to as learning-based hashing) learn hash functions from training data by optimizing an objective function. Because their projections are random, data-independent methods require many binary bits to achieve good performance; with growing data sizes, they tend to suffer from memory constraints. Learning-based methods, on the other hand, can learn more discriminative hash functions with different objective functions, and the number of hash bits does not grow linearly with the data size. Hence, learning-based hashing is clearly more suitable for large-scale data and is the focus of recent research.
Most hashing approaches use linear models to map the data into binary codes, and therefore do not capture well the nonlinear relationships within images. Although several improvements have been proposed, e.g. by adding kernelization [10], it remains challenging to select an appropriate kernel function for specific data. As deep learning techniques have been shown to capture nonlinear relationships within data well, a deep architecture can effectively boost the learning of hash functions.
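As a concrete illustration of the data-independent style, random hyperplane projections in the spirit of LSH [3, 4] need no training at all. The sketch below is a minimal example; the dimensions, seed and {0, 1} bit convention are illustrative, not taken from any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_hash(dim, n_bits):
    """Data-independent hashing: one random hyperplane per bit, no training."""
    W = rng.standard_normal((dim, n_bits))
    return lambda X: (X @ W > 0).astype(np.uint8)

h = make_random_hash(dim=512, n_bits=64)
X = rng.standard_normal((10, 512))      # 10 toy 512-D feature vectors
codes = h(X)                            # (10, 64) array of {0, 1} bits
```

Nearby points tend to fall on the same side of most hyperplanes, so similar features receive similar codes; the cost of the approach is the large number of bits such random codes need, as noted above.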
In this paper, we propose a novel deep hashing method that learns hierarchical and nonlinear hash functions to obtain compact binary codes. Our proposed deep architecture includes two heterogeneous components: autoencoder layers and a Restricted Boltzmann Machine (RBM) layer. The autoencoder layers generate the initial binary codes, whereas the RBM layer reduces the dimensionality of the binary codes. To learn the deep autoencoders and the RBM, we introduce new objective functions that minimize the reconstruction error and the energy function under constraints of balanced and uncorrelated bits. Extensive experimental analysis on large-scale visual search demonstrates the validity and competitiveness of our approach compared to state-of-the-art methods.
2 Related Work
Learning-Based Hashing. Depending on whether semantic information is used, learning-based hashing can be divided into three categories: unsupervised, semi-supervised and supervised hashing. Unsupervised hashing approaches do not use semantic information (such as tags), whereas supervised hashing approaches learn hash functions with semantic information; semi-supervised approaches model the data using both labeled and unlabeled samples. In the first category, Spectral Hashing (SH) [6], Iterative Quantization (ITQ) [7] and K-Means Hashing (KMH) [8] used different objective functions with constraints on the binarization loss and/or the variance of the binary bits. In the second category, the Semi-Supervised Hashing (SSH) of Wang et al. [2] constructed an objective function minimizing the binarization loss of labeled data while maximizing the variance over unlabeled data; the approach was later extended with nonlinear hash functions [9]. In the third category, Linear Discriminant Analysis (LDA) [11] and multiple linear SVMs [12] were used as hash functions and trained with a large-margin criterion. While most methods seek a single linear mapping, we propose a new solution based on a deep learning framework to explore hierarchical and nonlinear hash mappings.
Deep Learning.
Recently, several deep learning algorithms have been proposed in machine learning and applied to visual object detection and recognition, image classification, face verification and many other research problems [13]. Building on foundational frameworks such as Convolutional Neural Networks (CNN) [14], Stacked Auto-Encoders (SAE) [15] and Deep Belief Networks (DBN) [16], numerous deep learning approaches have been developed, and some have been applied to learning binary codes. Liong et al. [17] presented a framework that minimizes a global quantization loss with two constraints to learn binary codes. In [18, 19, 20], convolutional neural networks were used to extract visual features, combined with a hashing layer that learns binary codes through supervised learning. In this context, we propose a novel deep learning approach with a heterogeneous architecture and specific constraints for image hashing. The model parameters are learned with layer-wise unsupervised training.
3 Deep Hashing
Fig. 1 illustrates the framework of our proposed deep hashing method. The framework contains two heterogeneous components: (1) several deep autoencoder layers; (2) an RBM layer. Given a feature vector $\mathbf{x} \in \mathbb{R}^{d}$, the deep hashing framework transforms the input vector into a binary vector $\mathbf{b} \in \{-1, 1\}^{K}$, where $K \ll d$.
3.1 SAE Layers
Let us assume that there are $M$ layers in our deep autoencoder, and that the hash function in the $m$th layer is $f^{(m)}(\mathbf{x}^{(m)}) = \sigma(W^{(m)}\mathbf{x}^{(m)} + \mathbf{b}^{(m)})$, where $\sigma(\cdot)$ is the activation function. $\mathbf{x}^{(m)}$ represents the input vector of the $m$th layer, with $\mathbf{x}^{(1)}$ being the initial input $\mathbf{x}$. The output vector of the $m$th layer is denoted as $\mathbf{h}^{(m)} = f^{(m)}(\mathbf{x}^{(m)})$. To learn a multi-layer autoencoder, layer-by-layer training has been proposed [15] to minimize the reconstruction error. As shown in Fig. 2, the deep autoencoder (i.e. SAE) can be divided into several three-layer autoencoders, one for each hidden layer of the SAE. $\tilde{\mathbf{x}}^{(m)}$ represents the reconstructed vector of $\mathbf{x}^{(m)}$. The optimization problem of a conventional autoencoder is to minimize the reconstruction error for each hidden layer:
$$\min_{W, \mathbf{b}} \; J_1 = \frac{1}{2N} \sum_{n=1}^{N} \left\| \tilde{\mathbf{x}}_n - \mathbf{x}_n \right\|^2 \quad (1)$$
where $N$ is the number of training samples.
Besides preserving similarity in the projected space by minimizing the reconstruction error, representative hash codes should be balanced and uncorrelated [6]. For the $i$th bit to be balanced, we should have $\sum_{n=1}^{N} h_i(\mathbf{x}_n) = 0$ for codes in $\{-1, 1\}$. For each bit to be more informative, the code bits should also be uncorrelated, which is satisfied by setting $\frac{1}{N} H H^{\top} = I$, where $H = [\mathbf{h}_1, \dots, \mathbf{h}_N]$ is the code matrix. Solving problem (Eq. 1) under these constraints is nontrivial as the problem is NP-hard. Since our goal is to obtain hash codes whose bits are as balanced and uncorrelated as possible, we propose to add the above constraints as regularization terms and to seek a suboptimal solution. The regularized optimization problem is defined by
$$\min_{W, \mathbf{b}} \; J_2 = \frac{1}{2N} \left\| \tilde{X} - X \right\|_F^2 + \frac{\lambda_1}{2} \left\| \frac{1}{N} H \mathbf{1} \right\|^2 + \frac{\lambda_2}{2} \left\| \frac{1}{N} H H^{\top} - I \right\|_F^2 \quad (2)$$
where $\|\cdot\|_F$ represents the Frobenius norm, $\tilde{X}$ is the reconstructed matrix computed as $\tilde{X} = \sigma(W_2 H + \mathbf{b}_2 \mathbf{1}^{\top})$, and $H$ is the output matrix of the hidden layer computed as $H = \sigma(W_1 X + \mathbf{b}_1 \mathbf{1}^{\top})$.
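The regularized objective can be sketched numerically. The sketch below is illustrative, not the paper's implementation: it assumes sigmoid activations with codes in [0, 1] (so a balanced bit has mean activation 0.5 rather than 0), column-wise samples, and arbitrary penalty weights `lam1`/`lam2`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_ae_loss(X, W1, b1, W2, b2, lam1=1.0, lam2=1.0):
    """Reconstruction error plus balance and decorrelation penalties.

    X holds one sample per column. Because sigmoid codes live in [0, 1],
    a balanced bit here targets mean activation 0.5 (for {-1, 1} codes
    the target mean is 0). lam1/lam2 are illustrative penalty weights.
    """
    N = X.shape[1]
    H = sigmoid(W1 @ X + b1)                        # hidden codes
    X_rec = sigmoid(W2 @ H + b2)                    # reconstruction
    rec = 0.5 / N * np.sum((X_rec - X) ** 2)        # reconstruction term
    balance = np.sum((H.mean(axis=1) - 0.5) ** 2)   # balanced-bit penalty
    Hc = H - H.mean(axis=1, keepdims=True)
    C = Hc @ Hc.T / N                               # code covariance
    decor = np.sum((C - np.diag(np.diag(C))) ** 2)  # uncorrelated-bit penalty
    return rec + lam1 * balance + lam2 * decor

rng = np.random.default_rng(0)
X = rng.random((8, 20))                             # 8-D toy features, 20 samples
W1 = 0.1 * rng.standard_normal((4, 8)); b1 = np.zeros((4, 1))
W2 = 0.1 * rng.standard_normal((8, 4)); b2 = np.zeros((8, 1))
loss = regularized_ae_loss(X, W1, b1, W2, b2)
```

The three terms correspond one-to-one to the three terms of the regularized objective: similarity preservation, balance, and decorrelation.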
To learn the model parameters of the three-layer autoencoders, we employ the Back-Propagation (BP) algorithm to solve the optimization problem (Eq. 2). $W$ and $\mathbf{b}$ are updated as
$$W \leftarrow W - \eta \frac{\partial J_2}{\partial W}, \qquad \mathbf{b} \leftarrow \mathbf{b} - \eta \frac{\partial J_2}{\partial \mathbf{b}} \quad (3)$$
where $\eta$ is the learning rate.
The gradients of the parameters are derived as
$$\frac{\partial J_2}{\partial W_1} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{\delta}_n \mathbf{x}_n^{\top}, \qquad \frac{\partial J_2}{\partial \mathbf{b}_1} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{\delta}_n \quad (4)$$
where the sample index $n$ is marked as a subscript for clarity. In Eq. 4, the local gradient $\boldsymbol{\delta}_n$ and the derivative of the activation function $\sigma'(\cdot)$ are computed as
$$\boldsymbol{\delta}_n = \left( W_2^{\top} \boldsymbol{\delta}_n^{\mathrm{out}} \right) \odot \sigma'(\mathbf{a}_n), \qquad \sigma'(\mathbf{a}_n) = \sigma(\mathbf{a}_n) \odot \left( 1 - \sigma(\mathbf{a}_n) \right) \quad (5)$$
where "$\odot$" denotes element-wise multiplication, $\mathbf{a}_n$ is the pre-activation of the hidden layer, and $\boldsymbol{\delta}_n^{\mathrm{out}} = (\tilde{\mathbf{x}}_n - \mathbf{x}_n) \odot \sigma'(\tilde{\mathbf{a}}_n)$ is the local gradient at the output layer.
In Eq. 5, the parameters $W_2$ and $\mathbf{b}_2$ also need to be learned as the reconstruction errors are back-propagated. These parameters are learned in the same manner as $W_1$ and $\mathbf{b}_1$.
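The BP update loop described above can be sketched as follows. This is a minimal sketch of the reconstruction term only; the regularizer gradients would be added to the same parameter updates. Shapes, learning rate and the column-wise data layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rec_loss(X, W1, b1, W2, b2):
    """Reconstruction error of a three-layer autoencoder (Eq. 1 style)."""
    H = sigmoid(W1 @ X + b1)
    X_rec = sigmoid(W2 @ H + b2)
    return 0.5 / X.shape[1] * np.sum((X_rec - X) ** 2)

def ae_backprop_step(X, W1, b1, W2, b2, lr=0.1):
    """One BP update (reconstruction term only): forward pass,
    local gradients ("deltas"), then a gradient step."""
    N = X.shape[1]
    H = sigmoid(W1 @ X + b1)                    # hidden activations
    X_rec = sigmoid(W2 @ H + b2)                # reconstruction
    # local gradients: delta = error * sigma'(a), with sigma' = s * (1 - s)
    d2 = (X_rec - X) * X_rec * (1.0 - X_rec)    # output layer
    d1 = (W2.T @ d2) * H * (1.0 - H)            # back-propagated to hidden layer
    # gradient step with learning rate lr
    W2 = W2 - lr * (d2 @ H.T) / N
    b2 = b2 - lr * d2.mean(axis=1, keepdims=True)
    W1 = W1 - lr * (d1 @ X.T) / N
    b1 = b1 - lr * d1.mean(axis=1, keepdims=True)
    return W1, b1, W2, b2
```

Repeated calls to `ae_backprop_step` drive the reconstruction error down, which is exactly the behavior layer-by-layer pre-training relies on.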
3.2 RBM Layer
We further employ an RBM layer (Fig. 3) to reduce the dimensionality of the binary codes. Since the variables in the RBM layer are binary, the sign function is applied to the output vector of the SAE layers so that each input unit of the RBM layer takes a binary value.
Let us assume that the visible layer and the hidden layer are denoted as $\mathbf{v}$ and $\mathbf{h}$ respectively, and that $\mathbf{b}$, $\mathbf{c}$ and $W$ are the biases of the visible and hidden layers and the weights between them. The energy of the RBM model is defined as $E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h}$ and the joint probability of $(\mathbf{v}, \mathbf{h})$ is $p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$, where $Z$ is the partition function. The optimization problem of a conventional RBM is to maximize the likelihood of the training samples as follows
$$\max_{W, \mathbf{b}, \mathbf{c}} \; J_3 = \sum_{n=1}^{N} \log p(\mathbf{v}_n) \quad (6)$$
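The standard binary-RBM energy and its (unnormalized) joint probability can be written directly. In this sketch the partition function $Z$ is omitted, since it sums over all joint states and is intractable for realistic sizes; the toy dimensions below are illustrative.

```python
import numpy as np

def rbm_energy(v, h, b, c, W):
    """E(v, h) = -b^T v - c^T h - v^T W h (standard binary RBM energy)."""
    return -b @ v - c @ h - v @ W @ h

def rbm_joint_unnormalized(v, h, b, c, W):
    """p(v, h) is proportional to exp(-E(v, h)); the partition function Z
    (a sum over all joint states) is intractable and omitted here."""
    return np.exp(-rbm_energy(v, h, b, c, W))

b = np.ones(3)                  # visible bias favoring active units
c = np.zeros(2)
W = np.zeros((3, 2))
v_on, v_off, h0 = np.ones(3), np.zeros(3), np.zeros(2)
e_on = rbm_energy(v_on, h0, b, c, W)    # -3.0: the bias rewards active units
e_off = rbm_energy(v_off, h0, b, c, W)  # 0.0
```

Lower-energy configurations receive proportionally larger probability mass, which is what maximizing the likelihood in Eq. 6 exploits.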
In order to keep the binary codes balanced and uncorrelated, we integrate the above constraints into the optimization problem (Eq. 6). Thus, the regularized problem is defined as
$$\max_{W, \mathbf{b}, \mathbf{c}} \; J_4 = \sum_{n=1}^{N} \log p(\mathbf{v}_n) - \frac{\mu_1}{2} \left\| \frac{1}{N} H \mathbf{1} \right\|^2 - \frac{\mu_2}{2} \left\| \frac{1}{N} H H^{\top} - I \right\|_F^2 \quad (7)$$
where $H = \mathrm{sgn}(W^{\top} V + \mathbf{c} \mathbf{1}^{\top})$ collects the binary codes of the hidden layer. Since the derivative of the sign function is an impulse function, the problem (Eq. 7) is intractable to optimize directly. To seek an approximate solution, we replace the sign function with a differentiable surrogate $g(\cdot)$. By computing the gradients $\frac{\partial J_4}{\partial W}$, $\frac{\partial J_4}{\partial \mathbf{b}}$ and $\frac{\partial J_4}{\partial \mathbf{c}}$, the parameters are updated as
$$W \leftarrow W + \eta \frac{\partial J_4}{\partial W}, \qquad \mathbf{b} \leftarrow \mathbf{b} + \eta \frac{\partial J_4}{\partial \mathbf{b}}, \qquad \mathbf{c} \leftarrow \mathbf{c} + \eta \frac{\partial J_4}{\partial \mathbf{c}} \quad (8)$$
We utilize the Contrastive Divergence (CD) algorithm [21] to seek a numerical solution of the problem (Eq. 7). The gradients of the likelihood term are estimated with Gibbs sampling as follows
$$\frac{\partial}{\partial W} \sum_{n=1}^{N} \log p(\mathbf{v}_n) \approx \left\langle \mathbf{v} \mathbf{h}^{\top} \right\rangle_0 - \left\langle \mathbf{v} \mathbf{h}^{\top} \right\rangle_r \quad (9)$$
where $\langle \cdot \rangle_r$ represents the expectation under $r$-step Gibbs sampling and $\langle \cdot \rangle_0$ the expectation under the data distribution. The derivative of the differentiable surrogate replaces the impulse function when the regularization terms are back-propagated.
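A minimal CD-k sketch of this estimator follows, covering the likelihood term only (without the balance/decorrelation regularizers); the batch layout, toy sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_step(V0, W, b, c, k=1, lr=0.1):
    """One CD-k update for the likelihood term: the gradient
    <v h>_0 - <v h>_r is estimated by k steps of Gibbs sampling
    started from the data batch V0 (one sample per row)."""
    N = V0.shape[0]
    Ph0 = sigmoid(V0 @ W + c)                            # p(h=1 | v) on the data
    V, Ph = V0, Ph0
    for _ in range(k):
        H = (rng.random(Ph.shape) < Ph).astype(float)    # sample hidden units
        Pv = sigmoid(H @ W.T + b)                        # p(v=1 | h)
        V = (rng.random(Pv.shape) < Pv).astype(float)    # sample visible units
        Ph = sigmoid(V @ W + c)
    # positive phase minus negative phase, averaged over the batch
    W = W + lr * (V0.T @ Ph0 - V.T @ Ph) / N
    b = b + lr * (V0 - V).mean(axis=0)
    c = c + lr * (Ph0 - Ph).mean(axis=0)
    return W, b, c

V0 = (rng.random((16, 6)) > 0.5).astype(float)  # toy binarized SAE outputs
W = 0.01 * rng.standard_normal((6, 3))
b, c = np.zeros(6), np.zeros(3)
W, b, c = cd_k_step(V0, W, b, c, k=1, lr=0.1)
```

With k = 1 this is the classic CD-1 recipe; larger k (i.e. larger r in Eq. 9) trades computation for a less biased gradient estimate.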
The detailed deep hashing algorithm is summarized in Algorithm 1.
4 Experimental Analysis
To evaluate our proposed method, we performed extensive experiments on two datasets: CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) and MIRFLICKR-25K (http://press.liacs.nl/mirflickr/). The CIFAR-10 dataset consists of 60,000 color images in 10 classes, with 50,000 training images and 10,000 test images. The MIRFLICKR-25K dataset contains 25,000 color images in 26 classes, from which 20,000 training images and 5,000 test images are randomly selected. The concatenated 512-D GIST [22] and 512-D Bag-of-Features (BoF) [23] descriptors are used for image representation.
For comparative analysis, the KLSH [5], SH [6], ITQ [7] and KMH [8] algorithms (using the implementations released by their authors) are considered as baseline methods. Our approach (denoted HetDH) uses three hidden layers for the SAE and one hidden layer for the RBM, sized according to the dimensionality of the image features. To gain insight into the impact of the constraints in our deep learning framework, we performed experiments both with the constraints (denoted HetDH) and without them (denoted HetDH-WC). We report the results of all approaches in terms of precision and recall (precision-recall curves).
Fig. 4 shows the precision-recall curves on the CIFAR-10 and MIRFLICKR-25K datasets at 32 and 64 bits for all considered methods. Our proposed method (HetDH) outperforms all other methods in all configurations. The results also show that all learning-based methods work better than the data-independent method (KLSH). Compared to shallow learning models (SH, ITQ, KMH), both HetDH and HetDH-WC achieve good performance thanks to the hierarchical representations learned by our deep architecture. Comparing HetDH with HetDH-WC, the results show that the per-layer constraints effectively boost conventional deep learning and improve search performance, confirming the effectiveness of the proposed algorithm.
It is worth noting that the dimension of the hash codes (the number of binary bits) affects image search performance (see the results at 32 bits vs. 64 bits): larger code sizes improve precision and recall, but at the cost of more memory. Finally, the experiments also show that all hashing methods work better on the CIFAR-10 dataset than on MIRFLICKR-25K, perhaps due to the more diverse nature of the images in MIRFLICKR-25K.
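The retrieval step underlying these precision-recall measurements ranks database items by Hamming distance to the query code. The paper does not detail this step; the sketch below assumes bit-packed codes, where the distance is an XOR followed by a popcount, and the toy 8-bit codes are illustrative.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query.
    Codes are bit-packed uint8 rows, so distance is XOR + popcount."""
    dists = np.unpackbits(query_code ^ db_codes, axis=1).sum(axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists

# toy 8-bit codes for 4 hypothetical database images
db = np.array([[0b10101010], [0b10101011], [0b01010101], [0b11110000]],
              dtype=np.uint8)
q = np.array([[0b10101010]], dtype=np.uint8)
order, dists = hamming_rank(q, db)   # dists = [0, 1, 8, 4]
```

Because the distance computation is pure bit arithmetic, ranking millions of compact codes stays fast, which is exactly why the number of bits trades accuracy against memory as noted above.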
5 Conclusion
We proposed a heterogeneous deep learning architecture for learning hash functions. With two constraints enforcing balanced and uncorrelated binary codes, we learned the parameters of the SAE and RBM layers. Experimental results and extensive comparative analysis on large-scale image search confirmed the effectiveness of our proposed approach, which outperformed state-of-the-art unsupervised methods.
References
 [1] L. Paulevé, H. Jégou, and L. Amsaleg, “Locality sensitive hashing: A comparison of hash function types and querying mechanisms,” Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.
 [2] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for large-scale search,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 12, pp. 2393–2406, 2012.
 [3] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases, 1999, pp. 518–529.
 [4] M. Slaney and M. Casey, “Locality-sensitive hashing for finding nearest neighbors,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 128–131, 2008.
 [5] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2130–2137.
 [6] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” Advances in Neural Information Processing Systems, vol. 282, no. 3, pp. 1753–1760, 2008.
 [7] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 817–824.
 [8] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2938–2945.
 [9] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu, “Semi-supervised nonlinear hashing using bootstrap sequential projection learning,” IEEE Transactions on Knowledge & Data Engineering, vol. 25, no. 6, pp. 1380–1393, 2013.
 [10] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
 [11] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, “LDAHash: Improved matching with smaller descriptors,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 1, pp. 66–78, 2011.
 [12] M. Rastegari, A. Farhadi, and D. Forsyth, “Attribute discovery via predictable discriminative binary codes,” Lecture Notes in Computer Science on ECCV, vol. 7577, no. 1, pp. 876–889, 2012.
 [13] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.

 [14] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1090–1098.
 [15] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
 [16] R. Salakhutdinov and G. Hinton, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [17] V.E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2475–2483.

 [18] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 2156–2162.
 [19] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen, “Deep learning of binary hash codes for fast image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015, pp. 27–35.
 [20] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3270–3278.
 [21] A. Fischer and C. Igel, “Training restricted boltzmann machines: An introduction,” Pattern Recognition, vol. 47, no. 1, pp. 25–39, 2014.
 [22] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
 [23] J. Sivic and A. Zisserman, “Efficient visual search of videos cast as text retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591–606, 2009.