I Introduction
We are interested in learning binary hash codes for visual search problems. Given an input image treated as a query, a visual search system searches for visually similar images in a database. In state-of-the-art image retrieval systems [1, 2, 3, 4], images are represented as high-dimensional feature vectors which can later be searched via a classical distance such as the Euclidean or Cosine distance. However, when the database is scaled up, two main requirements arise for retrieval systems, i.e., efficient storage and fast search. Among possible solutions, binary hashing is an attractive approach for achieving those requirements [5, 6, 7, 8] due to its fast computation and efficient storage. Briefly, in the binary hashing problem, each original high-dimensional vector x ∈ R^D is mapped into a very compact binary vector b ∈ {-1, 1}^L, where L ≪ D.
Many hashing methods have been proposed in the literature. They can be divided into two categories: data-independent methods and data-dependent methods. The former [9, 10, 11, 12] rely on random projections to construct hash functions. The representative methods in this category are Locality Sensitive Hashing (LSH) [9] and its kernelized or discriminative extensions [10, 11]. The latter use the available training data to learn the hash functions in unsupervised [13, 14, 15, 16, 17] or (semi-)supervised [18, 19, 20, 21, 22, 23, 24] manners. The representative unsupervised hashing methods, e.g., Spectral Hashing [13], Iterative Quantization (ITQ) [14], K-means Hashing [15], and Spherical Hashing [16], try to learn binary codes which preserve the distance similarity between samples. The representative supervised hashing methods, e.g., ITQ-CCA [14], Binary Reconstructive Embedding [23], Two-step Hashing [22], Kernel Supervised Hashing [19], and Supervised Discrete Hashing [24], try to learn binary codes which preserve the label similarity between samples. Detailed reviews of data-independent and data-dependent hashing methods can be found in recent surveys [5, 6, 7, 8].

One difficult problem in hashing is dealing with the binary constraint on the codes. Specifically, the outputs of the hash functions have to be binary. In general, this binary constraint leads to an NP-hard mixed-integer optimization problem. To handle this difficulty, most aforementioned methods relax the constraint during the learning of hash functions. With this relaxation, the continuous codes are learned first; the codes are then binarized (e.g., by thresholding). This relaxation greatly simplifies the original binary-constrained problem. However, the solution can be suboptimal, i.e., the binary codes resulting from thresholded continuous codes could be inferior to those obtained by directly including the binary constraint in the learning.
Furthermore, a good hashing method should produce binary codes with the following properties [13]: (i) similarity preserving, i.e., (dis)similar inputs should likely have (dis)similar binary codes; (ii) independence, i.e., different bits in the binary codes are independent of each other so that no redundant information is captured; (iii) bit balance, i.e., each bit has a 50% chance of being -1 or 1. Note that the direct incorporation of the independence and balance properties can complicate the learning. Previous works have used some relaxation or approximation to overcome these difficulties [14, 20], but this may cause some performance degradation.
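The independence and balance properties above can be measured directly on a code matrix. The following rough numpy sketch (random codes and illustrative thresholds of our own choosing) makes the two criteria concrete:

```python
import numpy as np

# Sketch: measuring property (ii) independence and property (iii) balance
# on a code matrix B of L bits x m samples with entries in {-1, 1}.
rng = np.random.default_rng(0)
L, m = 8, 10000
B = rng.choice([-1.0, 1.0], size=(L, m))   # i.i.d. random codes as a baseline

# (ii) independence: (1/m) B B^T should be close to the identity matrix.
independence_err = np.linalg.norm(B @ B.T / m - np.eye(L))
# (iii) balance: each bit should average to roughly 0 over the dataset.
balance_err = np.abs(B.mean(axis=1)).max()

print(independence_err, balance_err)   # both small for i.i.d. random bits
```

Independent, balanced random bits drive both errors toward zero as m grows; a learned code can instead enforce them explicitly, which is what the penalty terms introduced later do.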
Recently, deep learning has attracted great attention in the computer vision community due to its superiority in many vision tasks such as classification, detection, and segmentation [25, 26, 27]. Inspired by this success, some researchers have used deep learning to jointly learn image representations and binary hash codes in an end-to-end deep learning-based supervised hashing framework [28, 29, 30, 31]. However, learning binary codes in deep networks is challenging because one has to deal with the binary constraint on the hash codes, i.e., the final network outputs must be binary. A naive solution is to adopt the sgn activation layer to produce binary codes. However, due to the non-smoothness of the sgn function, this causes the vanishing gradient problem when training the network with standard back-propagation [32].

Contributions: In this work, firstly, we propose a novel deep network model and a learning algorithm for unsupervised hashing. In order to achieve binary codes, instead of involving the sgn or step function as in recent works [33, 34], our proposed network design constrains one layer to directly output the binary codes (hence the network is named Binary Deep Neural Network). In addition, we propose to directly incorporate the independence and balance properties. Furthermore, we include the similarity preserving property in our objective function. The resulting optimization with these binary and direct constraints is NP-hard. We propose to attack this challenging problem with alternating optimization and careful relaxation. Secondly, in order to enhance the discriminative power of the binary codes, we extend our method to supervised hashing by leveraging the label information such that the binary codes preserve the semantic similarity between samples. Finally, to demonstrate the flexibility of our proposed method and to leverage the powerful capacity of convolutional deep neural networks, we adapt our optimization strategy and the proposed supervised hashing model to an end-to-end deep hashing network framework.
Extensive experiments on various benchmark datasets show the improvements of the proposed methods over state-of-the-art hashing methods.
A preliminary version of this work was reported in [35]. Here, we present a substantial extension of our previous work. In particular, the main extension is that we propose the end-to-end binary deep neural network framework which jointly learns the image features and the binary codes. The experimental results show that the proposed end-to-end hashing framework significantly boosts the retrieval accuracy. Other minor extensions are that we conduct more experiments (e.g., new experiments on the SUN397 dataset [36] and comparisons to state-of-the-art end-to-end hashing methods) to evaluate the effectiveness of the proposed methods.
The remainder of this paper is organized as follows. Section II presents related works. Section III presents and evaluates the proposed unsupervised hashing method. Section IV presents and evaluates the proposed supervised hashing method. Section V presents and evaluates the proposed end-to-end deep hashing network. Section VI concludes the paper.
II Related work
Our work is inspired by recent successful hashing methods which define hash functions as a neural network [37, 33, 38, 34, 28, 29, 30]. We propose an improved design to address their limitations. In Semantic Hashing [37], the model is formed by a stack of Restricted Boltzmann Machines, and a pre-training step is required. Additionally, this model does not consider the independence and balance of the codes. In Binary Autoencoder [34], a linear autoencoder is used as the hash function. As this model only uses one hidden layer, it may not capture the information of the inputs well. Extending [34] with multiple, nonlinear layers is not straightforward because of the binary constraint. The independence and balance of the codes are also not considered. In Deep Hashing [33, 38], a deep neural network is used as the hash function. However, this model does not fully take the similarity preserving into account. It also applies some relaxation in arriving at the independence and balance of the codes, which may degrade the performance. In the end-to-end deep hashing works [29, 30], the independence and balance of the codes are not considered.

In order to handle the binary constraint, Semantic Hashing [37] first solves the relaxed problem by discarding the constraint and then thresholds the resulting continuous solution. In Deep Hashing (DH) [33, 38], the output of the last layer is binarized by the sgn function, and a term is included in the objective function to reduce this binarization loss. Solving the objective function of DH [33, 38] is difficult because the sgn function is non-differentiable. The authors of [33, 38] overcome this difficulty by assuming that the sgn function is differentiable everywhere. In Binary Autoencoder (BA) [34], the outputs of the hidden layer are passed into a step function to binarize the codes. Incorporating the step function in the learning leads to a non-smooth objective function, hence the optimization is NP-complete. To handle this difficulty, the authors of [34] use binary SVMs to learn the model parameters in the case where there is only a single hidden layer.
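The difficulty with smooth surrogates of sgn discussed in this section can be seen in a small numeric sketch (generic numpy code, not from any of the cited works): tanh(βx) interpolates between a smooth activation and sgn, trading binary quantization error against vanishing gradients.

```python
import numpy as np

# Sketch of the sgn-relaxation trade-off: tanh(beta*x) approaches sgn(x)
# as beta grows, but its gradient beta*(1 - tanh(beta*x)^2) vanishes in
# the saturated region, which stalls back-propagation.
def tanh_act(x, beta):
    return np.tanh(beta * x)

def tanh_grad(x, beta):
    return beta * (1.0 - np.tanh(beta * x) ** 2)

x = 0.5
for beta in (1.0, 10.0, 100.0):
    out, g = tanh_act(x, beta), tanh_grad(x, beta)
    print(f"beta={beta:6.1f}  output={out:+.6f}  gradient={g:.2e}")
# Small beta: output far from +/-1 (large quantization loss).
# Large beta: output ~ +/-1 but the gradient underflows to ~0.
```

At β = 100 the gradient underflows to exactly zero in double precision, which is the vanishing gradient problem that motivates avoiding such surrogates altogether.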
Jointly learning image representations and binary hash codes in an end-to-end deep learning-based supervised hashing framework [28, 29, 30] has shown a considerable boost in retrieval accuracy. Through joint optimization, the produced hash codes are better able to preserve the semantic similarity between images. In those works, the network architectures often consist of a feature extraction sub-network and a subsequent hashing layer that produces the hash codes. Ideally, the hashing layer should adopt a sgn activation function to output exactly binary codes. However, due to the vanishing gradient difficulty of the sgn function, an approximation procedure must be employed. For example, sgn(x) can be approximated by a tanh-like function tanh(βx), where β is a free parameter controlling the trade-off between smoothness and the binary quantization loss [29]. However, it is non-trivial to determine an optimal β. A small β causes a large binary quantization loss, while a big β makes the outputs of the function close to binary values, but the gradient of the function then almost vanishes, making back-propagation infeasible. The problem remains when logistic-like functions [28, 30] are used.

III Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)
III-A Formulation of UH-BDNN
Notation                                                   Meaning
X = {x_i}_{i=1}^m ∈ R^{D×m}                                Set of training samples; each column of X corresponds to one sample
B = {b_i}_{i=1}^m ∈ {-1, 1}^{L×m}                          Binary codes of X
L                                                          Number of required bits to encode a sample
n                                                          Number of layers (including input and output layers)
s_l                                                        Number of units in layer l
f^{(l)}                                                    Activation function of layer l
W^{(l)} ∈ R^{s_{l+1}×s_l}                                  Weight matrix connecting layer l+1 and layer l
c^{(l)} ∈ R^{s_{l+1}×1}                                    Bias vector for units in layer l+1
H^{(l)} = f^{(l)}(W^{(l-1)}H^{(l-1)} + c^{(l-1)}1_{1×m})   Output values of layer l; convention: H^{(1)} = X
1_{a×b}                                                    Matrix with a rows, b columns, and all elements equal to 1
For ease of following, we first summarize the notations in Table I. In our work, the hash functions are defined by a deep neural network. In our proposed architecture, we use different activation functions in different layers. Specifically, we use the sigmoid function as the activation function for the intermediate layers, and the identity function as the activation function for layer n-1 and layer n. Our idea is to learn the network such that the output values of the penultimate layer (layer n-1) can be used as the binary codes. We introduce constraints in the learning algorithm such that the output values at layer n-1 have the following desirable properties: (i) belonging to {-1, 1}; (ii) similarity preserving; (iii) independence; and (iv) balance. Fig. 1 illustrates our network.

Let us start with the first two properties of the codes, i.e., belonging to {-1, 1} and similarity preserving. To achieve binary codes having these two properties, we propose to optimize the following constrained objective function
    min_{W,c} J = (1/(2m)) ||X - W^{(n-1)}H^{(n-1)} - c^{(n-1)}1_{1×m}||^2 + (λ_1/2) Σ_{l=1}^{n-1} ||W^{(l)}||^2    (1)

    s.t.  H^{(n-1)} ∈ {-1, 1}^{L×m}    (2)
The constraint (2) ensures the first property. As the activation function for the last layer is the identity function, the term W^{(n-1)}H^{(n-1)} + c^{(n-1)}1_{1×m} is the output of the last layer. The first term of (1) makes sure that the binary code gives a good reconstruction of X. It is worth noting that the reconstruction criterion has been used as an indirect approach for preserving similarity in state-of-the-art unsupervised hashing methods [14, 34, 37], i.e., it encourages (dis)similar inputs to map to (dis)similar binary codes. The second term is a regularization that tends to decrease the magnitude of the weights, which helps to prevent overfitting. Note that in our proposed design, we constrain the network to directly output the binary codes at one layer, which avoids the difficulty of the non-differentiable sgn/step function. On the other hand, our formulation (1) under the binary constraint (2) is very difficult to solve: it is a mixed-integer problem, which is NP-hard. In order to attack the problem, we introduce an auxiliary variable B and use alternating optimization. Consequently, we reformulate the objective function (1) under constraint (2) as the following
    min_{W,c,B} J = (1/(2m)) ||X - W^{(n-1)}B - c^{(n-1)}1_{1×m}||^2 + (λ_1/2) Σ_{l=1}^{n-1} ||W^{(l)}||^2    (3)

    s.t.  B = H^{(n-1)}    (4)

          B ∈ {-1, 1}^{L×m}    (5)
The benefit of introducing the auxiliary variable B is that we can decompose the difficult constrained optimization problem (1) into two sub-optimization problems. Consequently, we are able to iteratively solve the optimization problem using alternating optimization with respect to (W, c) and B, while holding the other fixed. Inspired by the quadratic penalty method [39], we relax the equality constraint (4) by converting it into a penalty term. We arrive at the following constrained objective function
    min J = (1/(2m)) ||X - W^{(n-1)}B - c^{(n-1)}1_{1×m}||^2 + (λ_1/2) Σ_{l=1}^{n-1} ||W^{(l)}||^2 + (λ_2/(2m)) ||H^{(n-1)} - B||^2    (6)

    s.t.  B ∈ {-1, 1}^{L×m}    (7)
in which the third term of (6) measures the (equality) constraint violation. By setting the penalty parameter λ_2 sufficiently large, we penalize the constraint violation severely, thereby forcing the minimizer of the penalty function (6) closer to the feasible region of the original constrained function (3).
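The behavior of the quadratic penalty can be illustrated on a toy one-dimensional problem of our own (unrelated to the paper's objective): minimize (x-3)^2 subject to x = 1, with the constraint replaced by a penalty term.

```python
# Toy illustration of the quadratic-penalty relaxation: minimize (x-3)^2
# subject to x = 1, replaced by minimize (x-3)^2 + (mu/2)*(x-1)^2.
# Setting the gradient 2(x-3) + mu*(x-1) to zero gives the unconstrained
# minimizer x(mu) = (6+mu)/(2+mu), which approaches the feasible point
# x = 1 as the penalty parameter mu grows.
for mu in (1.0, 10.0, 100.0, 1000.0):
    x = (6.0 + mu) / (2.0 + mu)
    print(f"mu={mu:7.1f}  x={x:.4f}  violation={abs(x - 1.0):.4f}")
```

The constraint violation decreases monotonically with mu, mirroring how a large λ_2 forces the network outputs toward the binary auxiliary variable.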
We now consider the two remaining properties of the codes, i.e., independence and balance. Unlike previous works which use some relaxation or approximation of the independence and balance properties [14, 33, 20], we propose to encode these properties strictly and directly, based on the binary outputs of layer n-1. Specifically, we encode the independence and balance properties of the codes by introducing the fourth and fifth terms, respectively, in the following constrained objective function
    min J = (1/(2m)) ||X - W^{(n-1)}B - c^{(n-1)}1_{1×m}||^2 + (λ_1/2) Σ_{l=1}^{n-1} ||W^{(l)}||^2 + (λ_2/(2m)) ||H^{(n-1)} - B||^2
           + (λ_3/2) ||(1/m) H^{(n-1)}(H^{(n-1)})^T - I||^2 + (λ_4/(2m)) ||H^{(n-1)} 1_{m×1}||^2    (8)

    s.t.  B ∈ {-1, 1}^{L×m}    (9)
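The final objective above, combining reconstruction, weight decay, penalty, independence, and balance terms, can be sketched in a few lines of numpy. This is a hedged sketch under our own naming and scaling assumptions (lam1–lam4, column-sample layout), not the authors' implementation:

```python
import numpy as np

# Hedged sketch of the final unsupervised objective: reconstruction,
# weight decay, quadratic penalty, independence, and balance terms.
# H: penultimate-layer outputs (L x m); B: auxiliary binary codes (L x m).
def uh_bdnn_loss(X, W_out, c_out, H, B, weights, lam1, lam2, lam3, lam4):
    m = X.shape[1]
    L = B.shape[0]
    recon = np.linalg.norm(X - W_out @ B - np.outer(c_out, np.ones(m))) ** 2 / (2 * m)
    decay = lam1 / 2 * sum(np.linalg.norm(W) ** 2 for W in weights)
    penalty = lam2 / (2 * m) * np.linalg.norm(H - B) ** 2
    indep = lam3 / 2 * np.linalg.norm(H @ H.T / m - np.eye(L)) ** 2
    balance = lam4 / (2 * m) * np.linalg.norm(H @ np.ones((m, 1))) ** 2
    return recon + decay + penalty + indep + balance

# A perfect solution (exact reconstruction; binary, balanced, independent
# codes; no penalized weights) attains zero loss:
B = np.array([[1.0, -1.0]])
W_out = np.array([[2.0], [0.0]])
c_out = np.array([0.5, -0.5])
X = W_out @ B + np.outer(c_out, np.ones(2))
print(uh_bdnn_loss(X, W_out, c_out, B, B, [], 0.1, 0.1, 0.1, 0.1))  # 0.0
```

Each term vanishes independently in this constructed example, which shows that the five terms do not conflict at an ideal solution.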
The objective function (8) under constraint (9) is our final formulation. Before discussing how to solve it, let us present the differences between our work and the recent deep learning-based hashing models Deep Hashing [33], Binary Autoencoder [34], and the end-to-end hashing methods [28, 29, 30].
The most important difference between our model and the other deep learning-based hashing methods, i.e., Deep Hashing [33], Binary Autoencoder [34], and end-to-end hashing [28, 29, 30], is the way the binary codes are achieved. Instead of involving the sgn or step function as in [33, 34], or a relaxation of sgn as in [28, 29, 30], we constrain the network to directly output the binary codes at one layer. Other differences to the most related works, Deep Hashing [33] and Binary Autoencoder [34], are presented as follows.
Comparison to Deep Hashing (DH) [33, 38]: the deep model of DH is learned by minimizing an objective that combines a binarization (quantization) loss with balancing and independence terms. The DH model does not have a reconstruction layer. The authors apply the sgn function to the outputs at the top layer of the network to obtain the binary codes, and the first term of their objective aims to minimize the quantization loss caused by this binarization. The balancing and independence properties are represented by the second and third terms. It is worth noting that minimizing DH's objective function is difficult due to the non-differentiability of the sgn function. The authors work around this difficulty by assuming that the sgn function is differentiable everywhere.
Contrary to DH, we propose a different model design. In particular, our model encourages the similarity preserving through the reconstruction layer in the network. For the balancing property, DH maximizes a relaxed criterion which, according to [20], is only an approximation to the balancing property. In our objective function, the balancing property is directly enforced on the codes by the term ||H^{(n-1)} 1_{m×1}||^2. For the independence property, DH uses a relaxed orthogonality constraint on the network weights. On the contrary, we (once again) directly constrain the code independence using the term ||(1/m) H^{(n-1)}(H^{(n-1)})^T - I||^2. Incorporating these direct constraints can lead to better performance.
Comparison to Binary Autoencoder (BA) [34]: the differences between our model and BA are quite clear. Firstly, BA, as described in [34], is a shallow linear autoencoder network with one hidden layer. Secondly, BA's hash function is a linear transformation of the input followed by a step function to obtain the binary codes. In BA, by treating the encoder layer as binary classifiers, binary SVMs are used to learn the weights of the linear transformation. In contrast, our hash function is defined by multiple, hierarchical layers of nonlinear and linear transformations. It is not clear whether the binary SVM approach of BA can be used to learn the weights of our deep architecture with multiple layers. Instead, we use alternating optimization to derive a back-propagation algorithm to learn the weights in all layers. Additionally, our model ensures the independence and balance of the binary codes while BA does not. Note that the independence and balance properties may not be easily incorporated into their framework, as these would complicate their objective function and the optimization problem may become very difficult to solve.
III-B Optimization
III-B1 (W, c) step
When fixing B, the problem becomes an unconstrained optimization over (W, c). We use the L-BFGS [40] optimizer with back-propagation to solve it. The gradients of the objective function (8) w.r.t. the different parameters are computed as follows.
For the last layer, we have

    ∂J/∂W^{(n-1)} = (1/m)(W^{(n-1)}B + c^{(n-1)}1_{1×m} - X)B^T + λ_1 W^{(n-1)}    (10)

    ∂J/∂c^{(n-1)} = (1/m)(W^{(n-1)}B + c^{(n-1)}1_{1×m} - X)1_{m×1}    (11)
For the other layers, let us define

    Δ^{(n-1)} = (λ_2/m)(H^{(n-1)} - B) + (2λ_3/m)((1/m)H^{(n-1)}(H^{(n-1)})^T - I)H^{(n-1)} + (λ_4/m)(H^{(n-1)}1_{m×1})1_{1×m}    (12)

    Δ^{(l)} = ((W^{(l)})^T Δ^{(l+1)}) ⊙ f'^{(l)}(Z^{(l)}),  ∀l = n-2, …, 2    (13)

where Z^{(l)} = W^{(l-1)}H^{(l-1)} + c^{(l-1)}1_{1×m}; ⊙ denotes the Hadamard product.

Then, ∀l = n-2, …, 1, we have

    ∂J/∂W^{(l)} = Δ^{(l+1)}(H^{(l)})^T + λ_1 W^{(l)}    (14)

    ∂J/∂c^{(l)} = Δ^{(l+1)}1_{m×1}    (15)
III-B2 B step
When fixing (W, c), we can rewrite problem (8) as
    min_B (1/(2m)) ||X - W^{(n-1)}B - c^{(n-1)}1_{1×m}||^2 + (λ_2/(2m)) ||H^{(n-1)} - B||^2    (16)

    s.t.  B ∈ {-1, 1}^{L×m}    (17)
We adapt the recent discrete cyclic coordinate descent method [24] to iteratively solve B, i.e., row by row. The advantage of this method is that if we fix all but one row of B and solve for the remaining row, we can achieve a closed-form solution for that row.
Let V = X - c^{(n-1)}1_{1×m} and Q = (W^{(n-1)})^T V + λ_2 H^{(n-1)}. For k = 1, …, L, let w_k be the k-th column of W^{(n-1)}; W̄ be the matrix W^{(n-1)} excluding w_k; q_k^T be the k-th row of Q; b_k^T be the k-th row of B; and B̄ be the matrix B excluding b_k^T. We have the closed form for b_k^T as

    b_k^T = sgn(q_k^T - w_k^T W̄ B̄)    (18)
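The row-wise B step can be sketched as follows. This is a hedged sketch with our own helper names and shapes (L bits x m samples), not the authors' implementation: with the network weights fixed, each row of B has a closed-form sign solution while the remaining rows are held fixed.

```python
import numpy as np

# Sketch of the B step via cyclic coordinate descent over the rows of B.
# X: data (D x m); W: last-layer weights (D x L); c: last-layer bias (D,);
# H: penultimate-layer outputs (L x m); lam2: penalty parameter.
def b_step(X, W, c, H, lam2, n_sweeps=5):
    L, m = H.shape
    V = X - np.outer(c, np.ones(m))        # reconstruction target
    Q = W.T @ V + lam2 * H                 # L x m
    B = np.where(H >= 0, 1.0, -1.0)        # initialize with sgn(H)
    for _ in range(n_sweeps):              # cyclic sweeps over the rows
        for k in range(L):
            rest = [j for j in range(L) if j != k]
            # closed-form sign solution for row k given the other rows
            b_k = Q[k] - (W[:, k] @ W[:, rest]) @ B[rest]
            B[k] = np.where(b_k >= 0, 1.0, -1.0)
    return B
```

Each row update exactly minimizes the objective over that row with the others fixed, so the overall cost is non-increasing across sweeps, which is the advantage of the closed-form coordinate updates described above.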
The proposed UH-BDNN method is summarized in Algorithm 1, in which B^{(t)} and (W, c)^{(t)} are the values of B and (W, c) at iteration t, respectively.
            CIFAR10                     MNIST                       SIFT1M
L           8     16    24    32       8     16    24    32        8     16    24    32
UH-BDNN     0.55  5.79  22.14 18.35    0.53  6.80  29.38 38.50     4.80  25.20 62.20 80.55
BA [34]     0.55  5.65  20.23 17.00    0.51  6.44  27.65 35.29     3.85  23.19 61.35 77.15
ITQ [14]    0.54  5.05  18.82 17.76    0.51  5.87  23.92 36.35     3.19  14.07 35.80 58.69
SH [13]     0.39  4.23  14.60 15.22    0.43  6.50  27.08 36.69     4.67  24.82 60.25 72.40
SPH [16]    0.43  3.45  13.47 13.67    0.44  5.02  22.24 30.80     4.25  20.98 47.09 66.42
KMH [15]    0.53  5.49  19.55 15.90    0.50  6.36  25.68 36.24     3.74  20.74 48.86 76.04
III-C Evaluation of Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)
            CIFAR10                        MNIST
            mAP           precision        mAP           precision
L           16     32     16     32        16     32     16     32
DH [33]     16.17  16.62  23.33  15.77     43.14  44.97  66.10  73.29
UH-BDNN     17.83  18.52  24.97  18.85     45.38  47.21  69.13  75.26
This section evaluates the proposed UH-BDNN and compares it with the following state-of-the-art unsupervised hashing methods: Spectral Hashing (SH) [13], Iterative Quantization (ITQ) [14], Binary Autoencoder (BA) [34], Spherical Hashing (SPH) [16], and K-means Hashing (KMH) [15]. For all compared methods, we use the implementations and the suggested parameters provided by the authors.
III-C1 Dataset, evaluation protocol, and implementation notes
Dataset
The CIFAR10 [41] dataset consists of 60,000 images of 10 classes. The training set (also used as the database for retrieval) contains 50,000 images. The query set contains 10,000 images. Each image is represented by an 800-dimensional feature vector extracted by PCA from the 4096-dimensional CNN feature produced by AlexNet [25].

The MNIST [42] dataset consists of 70,000 handwritten digit images of 10 classes. The training set (also used as the database for retrieval) contains 60,000 images. The query set contains 10,000 images. Each image is represented by a 784-dimensional grayscale feature vector formed by its pixel intensities.
Evaluation protocol
We follow the standard setting in unsupervised hashing [14, 16, 15, 34] which uses Euclidean nearest neighbors as the ground truths for queries. The numbers of ground truths are set as in [34]: for the CIFAR10 and MNIST datasets, for each query we use its Euclidean nearest neighbors as the ground truths; for the large-scale SIFT1M dataset, for each query we use its Euclidean nearest neighbors as the ground truths. We use the following evaluation metrics, which have been used in the state of the art [14, 34, 33], to measure the performance of the methods: 1) mean Average Precision (mAP); 2) precision of Hamming radius r (precision@r), which measures the precision over retrieved images having Hamming distance to the query of at most r (if no image satisfies this, we report zero precision). Note that as computing mAP is slow on the large dataset SIFT1M, we consider only the top returned neighbors when computing mAP.

Implementation notes

In our deep model, the number of layers, and the parameters λ_1, λ_2, λ_3, and λ_4, are empirically set by cross-validation. The maximum number of iterations is empirically set to 10. The numbers of units in the hidden layers are empirically set for each of the code lengths 8, 16, 24, and 32 bits.
III-C2 Retrieval results
Fig. 2 and Table II show the comparative mAP and the precision of Hamming radius (precision@r) of the methods, respectively. The following observations are consistent across all three datasets. In terms of mAP, the proposed UH-BDNN is comparable to or outperforms the other methods at all code lengths, and the improvement is clearer at higher code lengths. The best competitor to UH-BDNN is Binary Autoencoder (BA) [34], which is the current state-of-the-art unsupervised hashing method; at high code lengths, UH-BDNN consistently outperforms BA on all datasets. In terms of precision@r, UH-BDNN is comparable to the other methods at low code lengths, while at higher code lengths it achieves significantly better performance; in particular, the improvements of UH-BDNN over the best competitor BA [34] are clearly observed on the MNIST and SIFT1M datasets.
Comparison with Deep Hashing (DH) [33, 38]: as the implementation of DH is not available, we set up the experiments on CIFAR10 and MNIST similarly to [33] to make a fair comparison. For each dataset, we randomly sample 1,000 images (i.e., 100 images per class) as the query set; the remaining images are used as the training and database set. Following [33], for CIFAR10, each image is represented by a 512-dimensional GIST descriptor [45]. Following Deep Hashing [33], the ground truths of queries are based on their class labels. (It is worth noting that in the evaluation of unsupervised hashing, instead of using class labels as ground truths, most state-of-the-art methods [14, 16, 15, 34] use Euclidean nearest neighbors as ground truths for queries.) Similarly to [33], we report comparative results in terms of mAP and the precision of Hamming radius. The results are presented in Table III. They clearly show that the proposed UH-BDNN outperforms DH [33] at all code lengths, in both mAP and precision of Hamming radius.
IV Supervised Hashing with Binary Deep Neural Network (SH-BDNN)
In order to enhance the discriminative power of the binary codes, we extend UH-BDNN to supervised hashing by leveraging the label information. Several approaches have been proposed to leverage label information, leading to different criteria on the binary codes. In [18, 46], binary codes are learned such that they minimize the Hamming distance between samples belonging to the same class, while maximizing the Hamming distance between samples belonging to different classes. In [24], the binary codes are learned such that they are optimal for linear classification.

In this work, in order to exploit the label information, we follow the approach proposed in Kernel-based Supervised Hashing (KSH) [19]. The benefit of this approach is that it directly encourages the Hamming distances between the binary codes of within-class samples to be minimal, and the Hamming distances between the binary codes of between-class samples to be maximal. In other words, it tries to perfectly preserve the semantic similarity. To achieve this goal, it enforces the Hamming distances between the learned binary codes to be highly correlated with the precomputed pairwise label matrix.
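The link between the KSH-style criterion and Hamming distance is a simple identity: for codes in {-1, 1}^L, the Hamming distance is an affine function of the inner product. A tiny numeric check (our own example values):

```python
import numpy as np

# For codes in {-1, 1}^L, Hamming distance and inner product satisfy
# dist_H(b1, b2) = (L - b1.b2) / 2. Matching code inner products to a
# +/-1 pairwise label matrix therefore pushes within-class Hamming
# distances toward 0 and between-class distances toward L.
b1 = np.array([1, -1, 1, 1])
b2 = np.array([1, 1, -1, 1])
L = len(b1)
hamming = int(np.sum(b1 != b2))
print(hamming == (L - b1 @ b2) / 2)   # True
```

This identity is why optimizing over inner products of the codes is equivalent to optimizing over their Hamming distances.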
In general, the network structure of SH-BDNN is similar to UH-BDNN, except that the last (reconstruction) layer of UH-BDNN is removed; the layer n-1 of UH-BDNN becomes the last layer of SH-BDNN. All desirable properties, i.e., semantic similarity preserving, independence, and balance, are constrained on the outputs of the last layer of SH-BDNN.
IV-A Formulation of SH-BDNN
We define the pairwise label matrix S as

    S_{ij} = 1 if x_i and x_j belong to the same class; S_{ij} = -1 otherwise.    (19)
In order to achieve the semantic similarity preserving property, we learn the binary codes such that the Hamming distances between the learned binary codes highly correlate with the matrix S, i.e., we want to minimize the quantity ||(1/L)(H^{(n)})^T H^{(n)} - S||^2. In addition, to achieve the independence and balance properties of the codes, we want to minimize the quantities ||(1/m) H^{(n)}(H^{(n)})^T - I||^2 and ||H^{(n)} 1_{m×1}||^2, respectively.
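Building the pairwise label matrix and evaluating the similarity-preserving quantity can be sketched directly in numpy (our own function names and shapes; an illustrative example, not the authors' code):

```python
import numpy as np

# Sketch: build the pairwise label matrix S from class labels
# (S_ij = 1 for same class, -1 otherwise) and evaluate the semantic
# similarity-preserving term on code outputs H (L bits x m samples).
def pairwise_label_matrix(labels):
    labels = np.asarray(labels).reshape(-1, 1)
    return np.where(labels == labels.T, 1.0, -1.0)

labels = [0, 0, 1, 1]
S = pairwise_label_matrix(labels)

# Codes whose (scaled) inner products perfectly agree with S give zero loss.
H = np.array([[1, 1, -1, -1],
              [-1, -1, 1, 1]])            # L = 2, m = 4
sim_loss = np.linalg.norm(H.T @ H / H.shape[0] - S) ** 2
print(sim_loss)   # 0.0
```

Here same-class code pairs have inner product L and different-class pairs have inner product -L, so the scaled Gram matrix matches S exactly.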
Following the same reformulation and relaxation as for UH-BDNN (Sec. III-A), we solve the following constrained optimization, which ensures the binary constraint, the semantic similarity preserving, the independence, and the balance properties of the codes:
    min J = (1/(2m^2)) ||(1/L)(H^{(n)})^T H^{(n)} - S||^2 + (λ_1/2) Σ_{l=1}^{n-1} ||W^{(l)}||^2 + (λ_2/(2m)) ||H^{(n)} - B||^2
           + (λ_3/2) ||(1/m) H^{(n)}(H^{(n)})^T - I||^2 + (λ_4/(2m)) ||H^{(n)} 1_{m×1}||^2    (20)

    s.t.  B ∈ {-1, 1}^{L×m}    (21)
(20) under constraint (21) is our formulation for supervised hashing. The main difference between the formulations of UH-BDNN (8) and SH-BDNN (20) is that the reconstruction term preserving the neighbor similarity in UH-BDNN (8) is replaced by the term preserving the label similarity in SH-BDNN (20).
IV-B Optimization
In order to solve (20) under constraint (21), we use alternating optimization, which comprises two steps over (W, c) and B.
IV-B1 (W, c) step
When fixing B, (20) becomes an unconstrained optimization. We use the L-BFGS [40] optimizer with back-propagation to solve it. The gradients of the objective function (20) w.r.t. the different parameters are computed as follows.
Let us define

    Δ^{(n)} = (2/(Lm^2)) H^{(n)} V + (λ_2/m)(H^{(n)} - B) + (2λ_3/m)((1/m)H^{(n)}(H^{(n)})^T - I)H^{(n)} + (λ_4/m)(H^{(n)}1_{m×1})1_{1×m}    (22)

where V = (1/L)(H^{(n)})^T H^{(n)} - S, and

    Δ^{(l)} = ((W^{(l)})^T Δ^{(l+1)}) ⊙ f'^{(l)}(Z^{(l)}),  ∀l = n-1, …, 2    (23)

where Z^{(l)} = W^{(l-1)}H^{(l-1)} + c^{(l-1)}1_{1×m}; ⊙ denotes the Hadamard product.

Then, ∀l = n-1, …, 1, we have

    ∂J/∂W^{(l)} = Δ^{(l+1)}(H^{(l)})^T + λ_1 W^{(l)}    (24)

    ∂J/∂c^{(l)} = Δ^{(l+1)}1_{m×1}    (25)
IV-B2 B step
When fixing (W, c), we can rewrite problem (20) as

    min_B ||H^{(n)} - B||^2    (26)

    s.t.  B ∈ {-1, 1}^{L×m}    (27)

It is easy to see that the optimal solution of (26) under constraint (27) is B = sgn(H^{(n)}).
The proposed SH-BDNN method is summarized in Algorithm 2, in which B^{(t)} and (W, c)^{(t)} are the values of B and (W, c) at iteration t, respectively.
              CIFAR10                     MNIST                       SUN397
L             8     16    24    32       8     16    24    32        8     16    24    32
SH-BDNN       54.12 67.32 69.36 69.62    84.26 94.67 94.69 95.51     15.52 41.98 52.53 56.82
SDH [24]      31.60 62.23 67.65 67.63    36.49 93.00 93.98 94.43     13.89 40.39 49.54 53.25
ITQ-CCA [14]  49.14 65.68 67.47 67.19    54.35 79.99 84.12 84.57     13.22 37.53 50.07 53.12
KSH [19]      44.81 64.08 67.01 65.76    68.07 90.79 92.86 92.41     12.64 40.67 49.29 46.45
BRE [23]      23.84 41.11 47.98 44.89    37.67 69.80 83.24 84.61     9.26  26.95 38.36 40.36
IV-C Evaluation of Supervised Hashing with Binary Deep Neural Network (SH-BDNN)
This section evaluates our proposed SH-BDNN and compares it to the following state-of-the-art supervised hashing methods: Supervised Discrete Hashing (SDH) [24], ITQ-CCA [14], Kernel-based Supervised Hashing (KSH) [19], and Binary Reconstructive Embedding (BRE) [23]. For all compared methods, we use the implementations and the suggested parameters provided by the authors.
IV-C1 Dataset, evaluation protocol, and implementation notes
Dataset
We evaluate and compare the methods on the CIFAR10, MNIST, and SUN397 datasets. The descriptions of the first two datasets are presented in Section III-C1.

The SUN397 [36] dataset contains about 108K images from 397 scene categories. We use a subset of this dataset comprising the 42 largest categories, in which each category contains more than 500 images; this results in about 35K images in total. The query set contains 4,200 images (100 images per class) randomly sampled from the dataset. The remaining images are used as the training set and also as the database set. Each image is represented by an 800-dimensional feature vector extracted by PCA from the 4096-dimensional CNN feature produced by AlexNet [25].
Evaluation protocol
Implementation notes
The network configuration is the same as for UH-BDNN, except that the final layer is removed. The values of the parameters λ_1, λ_2, λ_3, and λ_4 are empirically set using cross-validation. The maximum number of iterations is empirically set to 5.

Following the settings in ITQ-CCA [14] and SDH [24], all training samples are used in the learning for these two methods. For SH-BDNN, KSH [19], and BRE [23], where the label information is leveraged through the pairwise label matrix, we randomly select training samples from each class and use them for learning. The ground truths of queries are defined by the class labels from the datasets.
IV-C2 Retrieval results
Fig. 3 and Table IV show the comparative results between the proposed SH-BDNN and the other supervised hashing methods on the CIFAR10, MNIST, and SUN397 datasets.

On the CIFAR10 dataset, Fig. 3(a) and Table IV clearly show that the proposed SH-BDNN outperforms all compared methods by a fair margin at all code lengths, in both mAP and precision@2. The best competitor to SH-BDNN on this dataset is ITQ-CCA [14]; larger improvements of SH-BDNN over ITQ-CCA are observed at high code lengths.

On the MNIST dataset, Fig. 3(b) and Table IV show that the proposed SH-BDNN significantly outperforms the current state-of-the-art SDH [24] at low code lengths. When the code length increases, SH-BDNN and SDH [24] achieve similar performance. SH-BDNN significantly outperforms the other methods, i.e., KSH [19], ITQ-CCA [14], and BRE [23], in both mAP and precision@2.

On the SUN397 dataset, the proposed SH-BDNN outperforms the other competitors at all code lengths in terms of both mAP and precision@2. The best competitor to SH-BDNN on this dataset is SDH [24]; at high code lengths, SH-BDNN achieves larger improvements over SDH.
V Supervised Hashing with End-to-End Binary Deep Neural Network (E2E-BDNN)
Even though the proposed SH-BDNN can significantly enhance the discriminative power of the binary codes, similar to other hashing methods its capability partially depends on the discriminative power of the image features. Recent end-to-end deep learning-based supervised hashing methods [28, 29, 30] have shown that jointly learning image representations and binary hash codes in an end-to-end fashion can boost the retrieval accuracy. Therefore, in this section, we extend the proposed SH-BDNN to an end-to-end framework. Specifically, we integrate a convolutional neural network (CNN) with our supervised hashing network (SH-BDNN) into a unified end-to-end deep architecture, namely the End-to-End Binary Deep Neural Network (E2E-BDNN), which can jointly learn visual features and binary representations of images. In the following, we first introduce the proposed network architecture, then describe the training process, and finally present experiments on various benchmark datasets.
V-A Network architecture
Fig. 4 illustrates the overall architecture of the endtoend binary deep neural network – E2EBDNN. In details, the network consists of three main components: (i) a feature extractor, (ii) a dimensional reduction layer, and (iii) a binary optimizer component. We utilize AlexNet [25] as the feature extractor component of the E2EBDNN. In our configuration, we remove the last layer of AlexNet, namely the softmax layer, and consider its last fully connected layer (fc7) as the image representation.
The dimensional reduction component (the DR layer) involves a fully connected layer for reducing the high dimensional image representations outputted by the feature extractor component into lower dimensional representations. We use the identity function as the activation function for this DR layer. The reduced representations are then used as inputs for the following binary optimizer component.
The binary optimizer component of E2E-BDNN is similar to SH-BDNN; that is, we also constrain the output codes of E2E-BDNN to be binary, and these codes have the desired properties of semantic similarity preservation, independence, and balance. Using the same design as SH-BDNN for the last component of E2E-BDNN allows us to isolate the advantages of the end-to-end architecture over SH-BDNN.
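To make the three components concrete, the pipeline can be sketched as composed maps. The code below is a purely illustrative stand-in: the `feature_extractor` here is a fixed random projection rather than AlexNet, and all dimensions and function names are our own choices, not the authors' implementation.

```python
import numpy as np

def feature_extractor(images):
    """Stand-in for AlexNet truncated at fc7 (softmax removed): maps each
    input to a 4096-D feature. Here a fixed random projection + ReLU,
    purely for illustration."""
    P = np.random.default_rng(42).normal(size=(images.shape[1], 4096))
    return np.maximum(images @ P, 0.0)

def dr_layer(feats, W, c):
    """Dimensionality-reduction layer: fully connected with identity
    activation, y = W x + c, where W has shape (d_out, d_in)."""
    return feats @ W.T + c

def binary_optimizer_head(reduced, Wh):
    """Last fully connected layer of the binary optimizer; during training
    its outputs are driven toward binary values, and the final codes are
    read out as the sign."""
    return np.sign(reduced @ Wh)
```

For a batch of images flattened to toy 10-D inputs, `binary_optimizer_head(dr_layer(feature_extractor(x), W, c), Wh)` yields a matrix of ±1 codes, one row per image.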
The training data for E2E-BDNN are labeled images, in contrast to SH-BDNN, which takes precomputed visual features such as GIST [45], SIFT [44], or deep features from convolutional networks as inputs. Given the input labeled images, we aim to learn binary codes with the aforementioned desired properties, i.e., semantic similarity preservation, independence, and balance. To achieve these properties, we use an objective function similar to that of SH-BDNN. It is important to note, however, that in SH-BDNN, owing to its non-end-to-end architecture, we can feed the whole training set to the network at once during training. This does not hold for E2E-BDNN: due to the memory consumption of the end-to-end architecture, we can only feed a mini-batch of images to the network at a time. Technically, let F be the output of the last fully connected layer of E2E-BDNN for a mini-batch of size m; let S be the similarity matrix defined over the mini-batch (using equation (19)); and let B be an auxiliary variable. Similar to SH-BDNN, we train the network to minimize the following constrained loss function
min J = 1/(2m) ||(1/L) F F^T − S||² + (λ1/2) Σ_l ||W^(l)||² + (λ2/(2m)) ||F − B||² + (λ3/2) ||(1/m) F^T F − I||² + (λ4/(2m)) ||F^T 1_{m×1}||²  (28)
s.t. B ∈ {−1, 1}^{m×L}  (29)
where L is the code length and W^(l) are the network weights.
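A numpy sketch of the mini-batch loss of (28)-(29) is given below. Since equation (19) is defined elsewhere in the paper, we assume here the common convention S_ij = 1 for same-label pairs and −1 otherwise; the symbol names and default weights are illustrative, and the weight-decay term over the network parameters is omitted.

```python
import numpy as np

def pairwise_similarity(labels):
    """Assumed similarity matrix: S_ij = 1 if samples i and j share a
    label, else -1 (stand-in for the paper's equation (19))."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1.0, -1.0)

def e2e_bdnn_loss(F, B, S, lam2=0.1, lam3=0.1, lam4=0.1):
    """Mini-batch loss terms of (28): semantic similarity preservation,
    binarization gap to the auxiliary codes B, bit independence, and
    bit balance (network weight decay omitted)."""
    m, L = F.shape
    sim = np.sum(((F @ F.T) / L - S) ** 2) / (2 * m)                 # (1/2m)||FF^T/L - S||^2
    binar = lam2 / (2 * m) * np.sum((F - B) ** 2)                    # F should be near binary B
    indep = lam3 / 2 * np.sum(((F.T @ F) / m - np.eye(L)) ** 2)      # decorrelated bits
    balance = lam4 / (2 * m) * np.sum((F.T @ np.ones((m, 1))) ** 2)  # each bit ~50/50
    return sim + binar + indep + balance
```

Note that for fixed F, the binarization term alone is minimized by B = sign(F), which is what makes the alternating optimization over B tractable.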
V-B. Training
The training procedure of E2E-BDNN is presented in Algorithm 3. In Algorithm 3, X^(t) and B^(t) denote the mini-batch sampled from the training set at iteration t and its corresponding binary codes, respectively; B denotes the binary codes of the whole training set; and W^(t) denotes the network weights after training up to iteration t.
At the beginning (line 1 of Algorithm 3), we initialize the network parameters as follows. The feature extractor component is initialized with the pretrained AlexNet model [25]. The dimensionality reduction (DR) layer is initialized by applying PCA to the AlexNet features, i.e., the outputs of the fc7 layer, of the training set. Specifically, let W and c be the weights and bias of the DR layer; we initialize them as
W = [e_1, …, e_k]^T,  c = −W x̄,
where e_1, …, e_k are the eigenvectors corresponding to the k largest eigenvalues of the covariance matrix of the training features, and x̄ is the mean of the training features. The binary optimizer component is initialized by training SH-BDNN with the AlexNet features of the training set as inputs. We then initialize the binary code matrix of the whole dataset via ITQ [14] (line 2 of Algorithm 3), again using the AlexNet features as training inputs.
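The PCA initialization of the DR layer can be sketched as follows, assuming the layer computes y = W(x − x̄), i.e., W holds the top principal directions and the bias absorbs the mean. The function name and argument names are ours.

```python
import numpy as np

def init_dr_layer(features, d_out):
    """Initialize DR-layer weights W and bias c from PCA on the training
    features (n, d_in), so that y = W x + c = W (x - mean)."""
    mu = features.mean(axis=0)
    X = features - mu
    cov = (X.T @ X) / (len(features) - 1)      # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: ascending eigenvalues
    W = eigvecs[:, ::-1][:, :d_out].T          # top-d_out principal directions
    c = -W @ mu                                # bias so that y = W (x - mu)
    return W, c
```

With this initialization, the DR layer maps the mean training feature to the zero vector, which is a natural starting point for codes that should be balanced.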
In each iteration of Algorithm 3, we sample only a mini-batch from the training set to feed to the network (line 5 of Algorithm 3); hence, after enough iterations, all training data are exhaustively sampled. In each iteration t, we first create the similarity matrix S (using equation (19)) corresponding to the mini-batch, as well as the code matrix B^(t) (lines 6 and 7 of Algorithm 3). Since B^(t) has already been computed, we fix it and update the network parameters by standard backpropagation with stochastic gradient descent (SGD) (line 8 of Algorithm 3). After the network has been exhaustively learned from the whole training set in this way, we update the code matrix B of all training samples from the outputs of the last fully connected layer for all training samples (line 10 of Algorithm 3). We then repeat the optimization procedure until a stopping criterion is reached, i.e., a maximum number of iterations.

Implementation details: The proposed E2E-BDNN is implemented in MATLAB with the MatConvNet library [47]. All experiments are conducted on a workstation with a Titan X GPU. The hyperparameters of the loss function, the learning rate, the weight decay, and the mini-batch size are set empirically.
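The alternating control flow of Algorithm 3 can be sketched with a toy example. Below, the "network" is a single linear layer and only the binarization term λ2/(2m)||F − B||² is optimized by SGD, purely to illustrate the fix-B-then-update-weights / refresh-B loop; this is our own simplification, not the authors' full loss, architecture, or ITQ-based initialization.

```python
import numpy as np

def train_e2e_sketch(X, n_bits=4, epochs=20, batch=8, lr=0.05, lam2=1.0, seed=0):
    """Toy alternating scheme: per mini-batch, fix codes B and take an SGD
    step on the weights; after each full pass, refresh B = sign(outputs)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Wn = rng.normal(scale=0.1, size=(d, n_bits))   # "network" weights
    B = np.sign(X @ Wn); B[B == 0] = 1             # init codes (ITQ in the paper)
    for _ in range(epochs):
        for start in range(0, n, batch):           # sample mini-batches (line 5)
            xb, bb = X[start:start+batch], B[start:start+batch]
            F = xb @ Wn
            grad = lam2 / len(xb) * (xb.T @ (F - bb))  # grad of lam2/(2m)||F-B||^2
            Wn -= lr * grad                        # SGD step with B fixed (line 8)
        B = np.sign(X @ Wn); B[B == 0] = 1         # refresh all codes (line 10)
    return Wn, B
```

The key design point mirrored here is that B is never relaxed: it stays exactly binary throughout, and the continuous network outputs are pulled toward it rather than the other way around.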
Table V: mAP comparison between SH-BDNN and E2E-BDNN on CIFAR-10, MNIST, and SUN397.

           CIFAR-10                     MNIST                        SUN397
           8      16     24     32     8      16     24     32     8      16     24     32
SH-BDNN    57.15  66.04  68.81  69.66  84.65  94.24  94.80  95.25  33.06  47.13  57.02  61.89
E2E-BDNN   64.83  71.02  72.37  73.56  88.82  98.03  98.26  98.21  34.15  48.21  59.51  64.58
V-C. Evaluation of the End-to-End Binary Deep Neural Network (E2E-BDNN)
Since we have already compared SH-BDNN to other supervised hashing methods in Section IV-C, in this experiment we focus on comparing E2E-BDNN with SH-BDNN. We also compare the proposed E2E-BDNN to other end-to-end hashing methods [29, 28, 30, 31].
Comparison between SH-BDNN and E2E-BDNN
Table V presents the comparative mAP between SH-BDNN and E2E-BDNN. The results show that E2E-BDNN consistently improves over SH-BDNN at all code lengths on all datasets. The largest improvements of E2E-BDNN over SH-BDNN are observed on the CIFAR-10 and MNIST datasets, especially at low code lengths: on CIFAR-10, E2E-BDNN outperforms SH-BDNN by 7.7% and 5% at code lengths 8 and 16, respectively; on MNIST, E2E-BDNN outperforms SH-BDNN by 4.2% and 3.8% at code lengths 8 and 16, respectively. On the SUN397 dataset, the improvements of E2E-BDNN over SH-BDNN are clearer at high code lengths: E2E-BDNN outperforms SH-BDNN by 2.5% and 2.7% at code lengths 24 and 32, respectively. These improvements confirm the effectiveness of the proposed end-to-end architecture for learning discriminative binary codes.
Comparison between E2E-BDNN and other end-to-end supervised hashing methods
We also compare our proposed deep networks, SH-BDNN and E2E-BDNN, with other end-to-end supervised hashing architectures, i.e., Hashing with Deep Neural Networks (DNNH) [30], Deep Hashing Network (DHN) [48], Deep Quantization Network (DQN) [31], Deep Semantic Ranking Hashing (DSRH) [28], and Deep Regularized Similarity Comparison Hashing (DRSCH) [29]. In those works, the authors propose frameworks in which the image features and hash codes are simultaneously learned by combining CNN layers with a binary quantization layer in a single large model. However, their binary mapping layers only apply simple operations, e.g., a smooth approximation of the sign function [28, 30, 29] or a norm-based approximation of the binary constraints [48]. Our SH-BDNN and E2E-BDNN advance over those works in how the image features are mapped to binary codes. Furthermore, our learned codes are guaranteed good properties, i.e., independence and balance, whereas [30, 29, 31] do not consider such properties and [28] only considers the balance of the codes. It is worth noting that different evaluation settings are used in [30, 28, 29, 48]. For a fair comparison, following those works, we set up two experimental settings as follows.
Table VI shows the comparative mAP between our methods and DNNH [30], DHN [48], and DQN [31] on the CIFAR-10 dataset under setting 1. Interestingly, even with its non-end-to-end approach, our SH-BDNN outperforms DNNH and DQN at all code lengths. This confirms the effectiveness of the proposed approach for dealing with binary constraints, as well as of the desired properties, i.e., independence and balance, imposed on the produced codes. The end-to-end approach further boosts the performance of SH-BDNN: the proposed E2E-BDNN outperforms all compared methods, DNNH [30], DHN [48], and DQN [31]. It is worth noting that in Lai et al. [30], increasing the code length does not necessarily boost the retrieval accuracy, i.e., they report an mAP of 55.80 at a longer code length while reporting a higher mAP of 56.60 at a shorter one. In contrast, both SH-BDNN and E2E-BDNN improve mAP as the code length increases.
Table VII presents the comparative mAP between the proposed SH-BDNN and E2E-BDNN and the competitors DSRH [28] and DRSCH [29] on the CIFAR-10 dataset under setting 2. The results clearly show that the proposed E2E-BDNN significantly outperforms DSRH [28] and DRSCH [29] at all code lengths. Compared to the strongest competitor, DRSCH [29], the improvements of E2E-BDNN range from 5% to 6% across code lengths. Furthermore, even with its non-end-to-end approach, the proposed SH-BDNN also outperforms DSRH [28] and DRSCH [29].
VI. Conclusion
In this paper, we propose three deep hashing neural networks for learning compact binary representations. First, we introduce UH-BDNN and SH-BDNN for unsupervised and supervised hashing, respectively. Our network designs directly constrain one layer to produce binary codes, and they ensure good properties for the codes: similarity preservation, independence, and balance. Together with the network designs, we also propose alternating optimization schemes that effectively handle the binary constraints on the codes. We then extend SH-BDNN to an end-to-end binary deep neural network (E2E-BDNN) framework that jointly learns the image representations and the binary codes. Solid experimental results on various benchmark datasets show that the proposed methods compare favorably with or outperform state-of-the-art hashing methods.
References
 [1] R. Arandjelovic and A. Zisserman, “All about VLAD,” in CVPR, 2013.
 [2] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in CVPR, 2010.
 [3] T.T. Do and N.M. Cheung, “Embedding based on function approximation for large scale image search,” TPAMI, 2017.
 [4] H. Jégou and A. Zisserman, “Triangulation embedding and democratic aggregation for image search,” in CVPR, 2014.
 [5] K. Grauman and R. Fergus, “Learning binary hash codes for large-scale image search,” Machine Learning for Computer Vision, 2013.
 [6] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” CoRR, 2014.
 [7] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” TPAMI, 2017.
 [8] J. Wang, W. Liu, S. Kumar, and S. Chang, “Learning to hash for indexing big data  A survey,” Proceedings of the IEEE, pp. 34–57, 2016.
 [9] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB, 1999.
 [10] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in ICCV, 2009.
 [11] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in NIPS, 2009.
 [12] B. Kulis, P. Jain, and K. Grauman, “Fast similarity search for learned metrics,” TPAMI, pp. 2143–2157, 2009.
 [13] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, 2008.
 [14] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” TPAMI, pp. 2916–2929, 2013.
 [15] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in CVPR, 2013.
 [16] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-e. Yoon, “Spherical hashing,” in CVPR, 2012.
 [17] W. Kong and W.J. Li, “Isotropic hashing,” in NIPS, 2012.
 [18] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, “LDAhash: Improved matching with smaller descriptors,” TPAMI, pp. 66–78, 2012.
 [19] W. Liu, J. Wang, R. Ji, Y.G. Jiang, and S.F. Chang, “Supervised hashing with kernels,” in CVPR, 2012.
 [20] J. Wang, S. Kumar, and S. Chang, “Semi-supervised hashing for large-scale search,” TPAMI, pp. 2393–2406, 2012.
 [21] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in NIPS, 2012.
 [22] G. Lin, C. Shen, D. Suter, and A. van den Hengel, “A general two-step approach to learning-based hashing,” in ICCV, 2013.
 [23] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in NIPS, 2009.
 [24] F. Shen, C. Shen, W. Liu, and H. Tao Shen, “Supervised discrete hashing,” in CVPR, 2015.

 [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
 [26] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in CVPRW, 2014.
 [27] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
 [28] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multilabel image retrieval,” in CVPR, 2015.
 [29] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” TIP, pp. 4766–4779, 2015.
 [30] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in CVPR, 2015.
 [31] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen, “Deep quantization network for efficient image retrieval,” in AAAI, 2016.
 [32] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, pp. 1527–1554, 2006.
 [33] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in CVPR, 2015.
 [34] M. A. CarreiraPerpinan and R. Raziperchikolaei, “Hashing with binary autoencoders,” in CVPR, 2015.
 [35] T.T. Do, A.D. Doan, and N.M. Cheung, “Learning to hash with binary deep neural network,” in ECCV, 2016.

 [36] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010.
 [37] R. Salakhutdinov and G. E. Hinton, “Semantic hashing,” Int. J. Approx. Reasoning, pp. 969–978, 2009.
 [38] J. Lu, V. E. Liong, and J. Zhou, “Deep hashing for scalable image search,” TIP, 2017.
 [39] J. Nocedal and S. J. Wright, Numerical Optimization, Chapter 17, 2nd ed. Springer, 2006.
 [40] D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large scale optimization,” Mathematical Programming, vol. 45, pp. 503–528, 1989.
 [41] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.

 [42] Y. Lecun and C. Cortes, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/.
 [43] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” TPAMI, pp. 117–128, 2011.
 [44] D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” IJCV, pp. 91–110, 2004.
 [45] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, pp. 145–175, 2001.
 [46] V. A. Nguyen, J. Lu, and M. N. Do, “Supervised discriminative hashing for compact binary codes,” in ACM Multimedia, 2014.
 [47] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in ACM Multimedia, 2015.
 [48] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in AAAI, 2016.