Compact Hash Code Learning with Binary Deep Neural Network

12/08/2017 ∙ by Thanh-Toan Do, et al. ∙ The University of Adelaide

In this work, we first propose deep network models and learning algorithms for learning binary hash codes given image representations, in both unsupervised and supervised manners. Then, by leveraging the powerful capacity of convolutional neural networks, we propose an end-to-end architecture which jointly learns to extract visual features and produce binary hash codes. Our novel network designs constrain one hidden layer to directly output the binary codes. This addresses a challenging issue in some previous works: optimizing non-smooth objective functions caused by binarization. Additionally, we incorporate the independence and balance properties, in their direct and strict forms, into the learning schemes. Furthermore, we also include the similarity preserving property in our objective functions. The resulting optimizations, involving these binary, independence, and balance constraints, are difficult to solve. We propose to attack them with alternating optimization and careful relaxation. Experimental results on benchmark datasets show that our proposed methods compare favorably with the state of the art.


I Introduction

We are interested in learning binary hash codes for visual search problems. Given an input image treated as a query, a visual search system searches for visually similar images in a database. In state-of-the-art image retrieval systems [1, 2, 3, 4], images are represented as high-dimensional feature vectors which can later be searched via a classical distance such as the Euclidean or cosine distance. However, when the database is scaled up, two main requirements arise for retrieval systems: efficient storage and fast search. Among possible solutions, binary hashing is an attractive approach for achieving those requirements [5, 6, 7, 8] due to its fast computation and efficient storage.

Briefly, in the binary hashing problem, each original high-dimensional vector x ∈ R^D is mapped into a very compact binary vector b ∈ {−1, 1}^L, where L ≪ D.
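For ±1 codes, Hamming distance reduces to a dot product, which is what makes search over binary codes fast. A minimal NumPy sketch (toy codes; the function name is ours):

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distance between one binary code and a database of codes.

    Codes are assumed to be +/-1 vectors (a common convention in hashing
    papers); for +/-1 codes the Hamming distance is (L - <b1, b2>) / 2.
    """
    L = query_code.shape[0]
    return (L - db_codes @ query_code) / 2

# Toy example with L = 4 bits and 3 database items.
db = np.array([[ 1,  1, -1,  1],
               [-1,  1, -1,  1],
               [-1, -1,  1, -1]])
q = np.array([1, 1, -1, 1])
d = hamming_distances(q, db)   # distances 0, 1, and 4
```

In practice the same computation is done with bit-packed codes and XOR/popcount, but the dot-product form above is the one the learning objectives work with.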

Many hashing methods have been proposed in the literature. They can be divided into two categories: data-independent methods and data-dependent methods. The former [9, 10, 11, 12] rely on random projections to construct hash functions; the representative methods in this category are Locality Sensitive Hashing (LSH) [9] and its kernelized or discriminative extensions [10, 11]. The latter use the available training data to learn the hash functions in unsupervised [13, 14, 15, 16, 17] or (semi-)supervised [18, 19, 20, 21, 22, 23, 24] manners. Representative unsupervised hashing methods, e.g., Spectral Hashing [13], Iterative Quantization (ITQ) [14], K-means Hashing [15], and Spherical Hashing [16], try to learn binary codes that preserve the distance similarity between samples. Representative supervised hashing methods, e.g., ITQ-CCA [14], Binary Reconstructive Embedding [23], Kernel-based Supervised Hashing [19], Two-step Hashing [22], and Supervised Discrete Hashing [24], try to learn binary codes that preserve the label similarity between samples. Detailed reviews of data-independent and data-dependent hashing methods can be found in the recent surveys [5, 6, 7, 8].

One difficult problem in hashing is to deal with the binary constraint on the codes. Specifically, the outputs of the hash functions have to be binary. In general, this binary constraint leads to an NP-hard mixed-integer optimization problem. To handle this difficulty, most aforementioned methods relax the constraint during the learning of hash functions. With this relaxation, the continuous codes are learned first. Then, the codes are binarized (e.g., by thresholding). This relaxation greatly simplifies the original binary constrained problem. However, the solution can be suboptimal, i.e., the binary codes resulting from thresholded continuous codes could be inferior to those that are obtained by directly including the binary constraint in the learning.
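The two-stage relax-then-binarize strategy can be sketched as follows (a minimal illustration using a PCA projection as the continuous relaxation and sign thresholding as the binarization; the specific projection is illustrative, not any particular method's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 128))   # 1000 samples, 128-D features
X = X - X.mean(axis=0)                 # zero-center

# Relaxation stage: learn a continuous embedding (here, PCA to L dims).
L = 16
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:L].T                           # top-L principal directions
H = X @ W                              # continuous codes

# Binarization stage: threshold the continuous codes after the fact.
B = np.sign(H)
B[B == 0] = 1                          # resolve ties to +1
```

The thresholding step is where suboptimality can enter: the binary constraint plays no role while the continuous codes are learned, so the thresholded codes need not be the best binary codes for the original objective.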

Furthermore, a good hashing method should produce binary codes with the following properties [13]: (i) similarity preserving, i.e., (dis)similar inputs should likely have (dis)similar binary codes; (ii) independence, i.e., different bits in the binary codes are independent of each other so that no redundant information is captured; (iii) bit balance, i.e., each bit has a 50% chance of being 1 or −1. Note that the direct incorporation of the independence and balance properties can complicate the learning. Previous works have used some relaxation or approximation to overcome the difficulties [14, 20], but this may cause some performance degradation.
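The independence and balance properties can be checked numerically for a candidate code matrix B ∈ {−1, 1}^{L×m} (the helper names below are ours; the two measures mirror the regularization terms used later in the paper's objectives):

```python
import numpy as np

def independence_violation(B):
    """||(1/m) B B^T - I||_F: zero iff bits are uncorrelated with unit variance."""
    L, m = B.shape
    return np.linalg.norm(B @ B.T / m - np.eye(L))

def balance_violation(B):
    """||B 1||: zero iff every bit is +1 on exactly half the samples."""
    return np.linalg.norm(B.sum(axis=1))

# A perfectly balanced, pairwise-independent 2-bit code over 4 samples.
B_good = np.array([[ 1,  1, -1, -1],
                   [ 1, -1,  1, -1]], dtype=float)
# A degenerate code: bit 2 just copies bit 1 (redundant information).
B_bad = np.array([[ 1,  1, -1, -1],
                  [ 1,  1, -1, -1]], dtype=float)
```

For `B_good` both violations are zero; `B_bad` is balanced but fails the independence check, illustrating why both terms are needed.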

Recently, deep learning has attracted great attention in the computer vision community due to its superiority in many vision tasks such as classification, detection, and segmentation [25, 26, 27]. Inspired by this success, some researchers have used deep learning to jointly learn image representations and binary hash codes in end-to-end deep learning-based supervised hashing frameworks [28, 29, 30, 31]. However, learning binary codes in deep networks is challenging because one has to deal with the binary constraint on the hash codes, i.e., the final network outputs must be binary. A naive solution is to adopt a sgn activation layer to produce the binary codes; however, due to the non-smoothness of the sgn function, this causes the vanishing gradient problem when training the network with standard backpropagation [32].

Contributions: In this work, we first propose a novel deep network model and a learning algorithm for unsupervised hashing. In order to achieve binary codes, instead of involving the sgn or step function as in recent works [33, 34], our proposed network design constrains one layer to directly output the binary codes (hence the network is named Binary Deep Neural Network). In addition, we propose to directly incorporate the independence and balance properties. Furthermore, we include similarity preserving in our objective function. The resulting optimization with these binary and direct constraints is NP-hard. We propose to attack this challenging problem with alternating optimization and careful relaxation. Second, in order to enhance the discriminative power of the binary codes, we extend our method to supervised hashing by leveraging the label information so that the binary codes preserve the semantic similarity between samples. Finally, to demonstrate the flexibility of our proposed method and to leverage the powerful capacity of convolutional neural networks, we adapt our optimization strategy and the proposed supervised hashing model to an end-to-end deep hashing network framework. Extensive experiments on various benchmark datasets show the improvements of the proposed methods over state-of-the-art hashing methods.

A preliminary version of this work was reported in [35]. In this paper, we present a substantial extension to our previous work. The main extension is the end-to-end binary deep neural network framework, which jointly learns the image features and the binary codes; the experimental results show that the proposed end-to-end hashing framework significantly boosts the retrieval accuracy. Other minor extensions include additional experiments (e.g., new experiments on the SUN397 dataset [36] and comparisons to state-of-the-art end-to-end hashing methods) to evaluate the effectiveness of the proposed methods.

The remainder of this paper is organized as follows. Section II presents related works. Section III presents and evaluates the proposed unsupervised hashing method. Section IV presents and evaluates the proposed supervised hashing method. Section V presents and evaluates the proposed end-to-end deep hashing network. Section VI concludes the paper.

II Related work

Our work is inspired by recent successful hashing methods which define hash functions as a neural network [37, 33, 38, 34, 28, 29, 30]; we propose an improved design to address their limitations. In Semantic Hashing [37], the model is formed by a stack of Restricted Boltzmann Machines, and a pretraining step is required. Additionally, this model does not consider the independence and balance of the codes. In Binary Autoencoder [34], a linear autoencoder is used as the hash function. As this model only uses one hidden layer, it may not capture the information of the inputs well, and extending [34] to multiple, nonlinear layers is not straightforward because of the binary constraint. They also do not consider the independence and balance of the codes. In Deep Hashing [33, 38], a deep neural network is used as the hash function; however, this model does not fully take the similarity preserving into account, and it applies some relaxation in arriving at the independence and balance of the codes, which may degrade the performance. In the end-to-end deep hashing works [29, 30], the independence and balance of the codes are not considered.

In order to handle the binary constraint, Semantic Hashing [37] first solves the relaxed problem by discarding the constraint and then thresholds the resulting continuous solution. In Deep Hashing (DH) [33, 38], the output of the last layer is binarized by the sgn function, and a term is included in the objective function to reduce this binarization loss. Solving the objective function of DH [33, 38] is difficult because the sgn function is non-differentiable; the authors of [33, 38] overcome this difficulty by assuming that the sgn function is differentiable everywhere. In Binary Autoencoder (BA) [34], the outputs of the hidden layer are passed through a step function to binarize the codes. Incorporating the step function in the learning leads to a non-smooth objective function, and the optimization is NP-complete. To handle this difficulty, the authors of [34] use binary SVMs to learn the model parameters in the case where there is only a single hidden layer.

Jointly learning image representations and binary hash codes in an end-to-end deep learning-based supervised hashing framework [28, 29, 30] has shown a considerable boost in retrieval accuracy. Through joint optimization, the produced hash codes better preserve the semantic similarity between images. In those works, the network architectures often consist of a feature extraction sub-network and a subsequent hashing layer that produces the hash codes. Ideally, the hashing layer should adopt a sgn activation function to output exactly binary codes. However, due to the vanishing gradient difficulty of the sgn function, an approximation procedure must be employed. For example, sgn(x) can be approximated by a tanh-like function tanh(βx), where β is a free parameter controlling the trade-off between smoothness and the binary quantization loss [29]. However, it is non-trivial to determine an optimal β: a small β causes a large binary quantization loss, while a big β makes the output of the function close to binary values but also makes the gradient of the function almost vanish, rendering back-propagation infeasible. The problem remains when logistic-like functions [28, 30] are used.
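The trade-off controlled by β can be seen numerically, using d/dx tanh(βx) = β(1 − tanh²(βx)) (a small sketch; the specific values of β are arbitrary):

```python
import numpy as np

def tanh_approx(x, beta):
    """Smooth surrogate for sgn(x)."""
    return np.tanh(beta * x)

def tanh_approx_grad(x, beta):
    # d/dx tanh(beta*x) = beta * (1 - tanh(beta*x)^2)
    return beta * (1.0 - np.tanh(beta * x) ** 2)

x = 1.0
# Small beta: smooth, but far from sgn(x) = 1 (large quantization loss).
small = tanh_approx(x, 0.5)            # ~0.46, far from 1
# Large beta: output is nearly binary, but the gradient almost vanishes.
large = tanh_approx(x, 20.0)           # ~1.0, nearly binary
g_small = tanh_approx_grad(x, 0.5)     # noticeably non-zero
g_large = tanh_approx_grad(x, 20.0)    # ~0, back-propagation stalls
```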

III Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)

III-A Formulation of UH-BDNN

Notation: Meaning
X = {x_i}_{i=1}^m ∈ R^{D×m}: set of m training samples; each column of X corresponds to one sample
B ∈ {−1, 1}^{L×m}: binary codes of X
L: number of required bits to encode a sample
n: number of layers (including input and output layers)
s_l: number of units in layer l
f^(l): activation function of layer l
W^(l) ∈ R^{s_{l+1}×s_l}: weight matrix connecting layer l+1 and layer l
c^(l) ∈ R^{s_{l+1}}: bias vector for units in layer l+1
H^(l) = f^(l)(W^(l−1) H^(l−1) + c^(l−1) 1_{1×m}): output values of layer l; convention: H^(1) = X
1_{a×b}: matrix with a rows, b columns, and all elements equal to 1
TABLE I: Notations and their corresponding meanings.
Fig. 1: The illustration of our UH-BDNN. In our proposed network design, the outputs of layer n−1 are constrained to {−1, 1} and are used as the binary codes. During training, these codes are used to reconstruct the input samples at the final layer.

For ease of following, we first summarize our notations in Table I. In our work, the hash functions are defined by a deep neural network. In our proposed architecture, we use different activation functions in different layers. Specifically, we use the sigmoid function as the activation function for the intermediate hidden layers, and the identity function as the activation function for layer n−1 and layer n. Our idea is to learn the network such that the output values of the penultimate layer (layer n−1) can be used as the binary codes. We introduce constraints in the learning algorithm such that the output values at layer n−1 have the following desirable properties: (i) belonging to {−1, 1}; (ii) similarity preserving; (iii) independence; and (iv) balance. Fig. 1 illustrates our network.
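A minimal NumPy sketch of this forward pass (all layer sizes and the 0.1 weight scale are arbitrary choices for illustration): sigmoid activations on the intermediate layers, identity on the code layer (layer n−1) and the output layer (layer n).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Forward pass of the layer design sketched above: sigmoid on the
    intermediate layers, identity on the code layer (layer n-1) and the
    reconstruction layer (layer n). One sample per column, H^(1) = X.
    """
    H = X
    n = len(weights) + 1                         # total number of layers
    for l, (W, c) in enumerate(zip(weights, biases), start=2):
        Z = W @ H + c                            # affine map into layer l
        H = Z if l >= n - 1 else sigmoid(Z)      # identity on last two layers
        if l == n - 1:
            codes = H                            # pre-binarization code layer
    return codes, H                              # codes and reconstruction

rng = np.random.default_rng(0)
D, m, L = 8, 5, 4                                # input dim, samples, code length
sizes = [D, 16, L, D]                            # n = 4 layers: D -> 16 -> L -> D
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) * 0.1 for i in range(3)]
biases = [rng.standard_normal((sizes[i + 1], 1)) * 0.1 for i in range(3)]
codes, recon = forward(rng.standard_normal((D, m)), weights, biases)
```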

Let us start with the first two properties of the codes, i.e., belonging to {−1, 1} and similarity preserving. To achieve binary codes having these two properties, we propose to optimize the following constrained objective function

min_{W,c} J = (1/2m) ||X − (W^(n−1) H^(n−1) + c^(n−1) 1_{1×m})||² + (λ1/2) Σ_{l=1}^{n−1} ||W^(l)||²   (1)
s.t. H^(n−1) ∈ {−1, 1}^{L×m}   (2)

The constraint (2) ensures the first property. As the activation function for the last layer is the identity function, the term W^(n−1) H^(n−1) + c^(n−1) 1_{1×m} is the output of the last layer. The first term of (1) makes sure that the binary code gives a good reconstruction of X. It is worth noting that the reconstruction criterion has been used as an indirect approach for preserving the similarity in state-of-the-art unsupervised hashing methods [14, 34, 37], i.e., it encourages (dis)similar inputs to map to (dis)similar binary codes. The second term is a regularization that tends to decrease the magnitude of the weights, which helps to prevent overfitting. Note that in our proposed design, we constrain the network to directly output the binary codes at one layer, which avoids the difficulty of the non-differentiable sgn/step function. On the other hand, our formulation (1) under the binary constraint (2) is very difficult to solve: it is a mixed-integer problem which is NP-hard. In order to attack the problem, we propose to introduce an auxiliary variable B and use alternating optimization. Consequently, we reformulate the objective function (1) under constraint (2) as the following

min_{W,c,B} J = (1/2m) ||X − (W^(n−1) B + c^(n−1) 1_{1×m})||² + (λ1/2) Σ_{l=1}^{n−1} ||W^(l)||²   (3)
s.t. B = H^(n−1)   (4)
B ∈ {−1, 1}^{L×m}   (5)

The benefit of introducing the auxiliary variable B is that we can decompose the difficult constrained optimization problem (1) into two sub-optimization problems. Consequently, we are able to iteratively solve the optimization problem by using alternating optimization with respect to (W, c) and B while holding the other fixed. Inspired by the quadratic penalty method [39], we relax the equality constraint (4) by converting it into a penalty term. We achieve the following constrained objective function

min_{W,c,B} J = (1/2m) ||X − (W^(n−1) B + c^(n−1) 1_{1×m})||² + (λ1/2) Σ_{l=1}^{n−1} ||W^(l)||² + (λ2/2) ||H^(n−1) − B||²   (6)
s.t. B ∈ {−1, 1}^{L×m}   (7)

in which the third term in (6) measures the (equality) constraint violation. By setting the penalty parameter λ2 sufficiently large, we penalize the constraint violation severely, thereby forcing the minimizer of the penalty function (6) closer to the feasible region of the original constrained function (3).
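The quadratic-penalty idea can be illustrated on a scalar toy problem (the function and constraint below are hypothetical, chosen only for illustration): as the penalty weight grows, the unconstrained minimizer approaches the feasible point.

```python
# Quadratic-penalty sketch: minimize f(x) = (x - 3)^2 subject to x = 1,
# by minimizing f(x) + (mu/2) * (x - 1)^2 for increasing penalty mu.
def penalized_minimizer(mu):
    # Setting d/dx [(x - 3)^2 + (mu/2)(x - 1)^2] = 0 gives x = (6 + mu) / (2 + mu).
    return (6.0 + mu) / (2.0 + mu)

xs = [penalized_minimizer(mu) for mu in (1.0, 10.0, 1000.0)]
# As mu grows, the minimizer moves monotonically toward the feasible x = 1.
```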

We now consider the two remaining properties of the codes, i.e., independence and balance. Unlike previous works which use some relaxation or approximation of the independence and balance properties [14, 33, 20], we propose to encode these properties strictly and directly based on the (nearly binary) outputs of our layer n−1. Specifically, we encode the independence and balance properties of the codes by introducing the fourth and the fifth terms, respectively, in the following constrained objective function

min_{W,c,B} J = (1/2m) ||X − (W^(n−1) B + c^(n−1) 1_{1×m})||² + (λ1/2) Σ_{l=1}^{n−1} ||W^(l)||² + (λ2/2) ||H^(n−1) − B||² + (λ3/2) ||(1/m) H^(n−1) (H^(n−1))^T − I||² + (λ4/2) ||(1/m) H^(n−1) 1_{m×1}||²   (8)
s.t. B ∈ {−1, 1}^{L×m}   (9)

The objective function (8) under constraint (9) is our final formulation. Before discussing how to solve it, let us present the differences between our work and the recent deep learning based hashing models Deep Hashing [33], Binary Autoencoder [34], and end-to-end hashing methods [28, 29, 30].

The most important difference between our model and the other deep learning-based hashing methods, i.e., Deep Hashing [33], Binary Autoencoder [34], and end-to-end hashing [28, 29, 30], is the way the binary codes are achieved. Instead of involving the sgn or step function as in [33, 34], or relaxing sgn with a logistic-like function as in [28, 30] or a tanh-like function as in [29], we constrain the network to directly output the binary codes at one layer. Other differences from the most related works, Deep Hashing [33] and Binary Autoencoder [34], are presented as follows.

Comparison to Deep Hashing (DH) [33, 38]

The deep model of DH is learned by minimizing an objective function that combines a binarization loss with balance and independence terms. The DH model does not have a reconstruction layer. The authors apply the sgn function to the outputs of the top layer of the network to obtain the binary codes. The first term of their objective aims to minimize the quantization loss between the outputs of the top layer and the binary codes obtained by the sgn function; the balance and independence properties are encoded in the second and third terms. It is worth noting that minimizing DH's objective function is difficult due to the non-differentiability of the sgn function. The authors work around this difficulty by assuming that the sgn function is differentiable everywhere.

Contrary to DH, we propose a different model design. In particular, our model encourages similarity preserving by incorporating a reconstruction layer in the network. For the balance property, DH maximizes the variance of the learned codes; according to [20], maximizing this term is only an approximation for achieving balance. In our objective function, the balance property is directly enforced on the codes by the term ||(1/m) H^(n−1) 1_{m×1}||². For the independence property, DH uses a relaxed orthogonality constraint on the network weights W. On the contrary, we (once again) directly constrain the code independence using the term ||(1/m) H^(n−1) (H^(n−1))^T − I||². Incorporating the direct constraints can lead to better performance.

Comparison to Binary Autoencoder (BA) [34]

The differences between our model and BA are quite clear. First, BA, as described in [34], is a shallow linear autoencoder network with one hidden layer. Second, BA's hash function is a linear transformation of the input followed by a step function to obtain the binary codes. In BA, by treating the encoder layer as binary classifiers, the authors use binary SVMs to learn the weights of the linear transformation. On the contrary, our hash function is defined by multiple, hierarchical layers of nonlinear and linear transformations. It is not clear whether the binary SVM approach in BA can be used to learn the weights in our deep architecture with multiple layers. Instead, we use alternating optimization to derive a backpropagation algorithm that learns the weights in all layers. Additionally, our model ensures the independence and balance of the binary codes while BA does not. Note that the independence and balance properties may not be easily incorporated in their framework, as they would complicate the objective function and the optimization problem may become very difficult to solve.

III-B Optimization

In order to solve (8) under constraint (9), we propose to use alternating optimization over (W, c) and B.

III-B1 (W, c) step

When fixing B, the problem becomes an unconstrained optimization over (W, c). We use the L-BFGS [40] optimizer with backpropagation to solve it. The gradients of the objective function (8) w.r.t. the different parameters are computed as follows.

At the last layer, we have

∂J/∂W^(n−1) = −(1/m) (X − W^(n−1) B − c^(n−1) 1_{1×m}) B^T + λ1 W^(n−1)   (10)
∂J/∂c^(n−1) = −(1/m) (X − W^(n−1) B − c^(n−1) 1_{1×m}) 1_{m×1}   (11)

For the other layers, let us define

Δ^(n−1) = λ2 (H^(n−1) − B) + (2λ3/m) ((1/m) H^(n−1) (H^(n−1))^T − I) H^(n−1) + (λ4/m²) H^(n−1) 1_{m×m}   (12)
Δ^(l) = ((W^(l))^T Δ^(l+1)) ⊙ f'^(l)(Z^(l)), ∀l = n−2, …, 2   (13)

where Z^(l) = W^(l−1) H^(l−1) + c^(l−1) 1_{1×m}; ⊙ denotes the Hadamard product. (Layer n−1 uses the identity activation, so no activation derivative appears in (12).)

Then, ∀l = n−2, …, 1, we have

∂J/∂W^(l) = Δ^(l+1) (H^(l))^T + λ1 W^(l)   (14)
∂J/∂c^(l) = Δ^(l+1) 1_{m×1}   (15)

III-B2 B step

When fixing (W, c), we can rewrite problem (8) as

min_B (1/2m) ||X − (W^(n−1) B + c^(n−1) 1_{1×m})||² + (λ2/2) ||H^(n−1) − B||²   (16)
s.t. B ∈ {−1, 1}^{L×m}   (17)

We adapt the recent discrete cyclic coordinate descent method [24] to iteratively solve B, i.e., row by row. The advantage of this method is that if we fix all rows of B except one, we can achieve a closed-form solution for the remaining row.

Let V = X − c^(n−1) 1_{1×m} and W = W^(n−1). For k = 1, …, L, let w_k be the k-th column of W; W̄ be the matrix W excluding w_k; q_k be the k-th row of H^(n−1); b_k be the k-th row of B; and B̄ be the matrix B excluding b_k. We have the closed form for b_k as

b_k = sgn( (1/m) (V − W̄ B̄)^T w_k + λ2 q_k )   (18)
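The row-by-row idea of discrete cyclic coordinate descent can be sketched on the generic quadratic objective min_B ||Y − W B||² over binary B (a simplified stand-in for the B step; the actual update additionally carries the coupling term to the network outputs):

```python
import numpy as np

def dcc_binary(Y, W, n_sweeps=5, seed=0):
    """Discrete cyclic coordinate descent for min_B ||Y - W B||_F^2 over
    B in {-1,+1}^{L x m}: fix all rows of B but one; the free row has the
    closed form b_k = sgn(R^T w_k), where R is the residual excluding row k.
    (A generic sketch of the DCC idea, not the paper's exact update.)
    """
    L, m = W.shape[1], Y.shape[1]
    rng = np.random.default_rng(seed)
    B = rng.choice([-1.0, 1.0], size=(L, m))
    for _ in range(n_sweeps):
        for k in range(L):
            w_k = W[:, k]
            R = Y - W @ B + np.outer(w_k, B[k])   # residual without row k
            b = np.sign(R.T @ w_k)                # per-row global optimum
            b[b == 0] = 1
            B[k] = b
    return B

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 3))
B_true = rng.choice([-1.0, 1.0], size=(3, 10))
B_hat = dcc_binary(W @ B_true, W)
```

Each row update is globally optimal for that row, so the objective never increases across sweeps; this monotonicity is what makes the scheme practical despite the combinatorial feasible set.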

The proposed UH-BDNN method is summarized in Algorithm 1. In Algorithm 1, B(t) and (W, c)(t) are the values of B and (W, c) at iteration t, respectively.

Input: X ∈ R^{D×m}: training data; L: code length; T: maximum iteration number; n: number of layers; {s_l}: number of units of layer l (note: s_{n−1} = L, s_n = D); λ1, λ2, λ3, λ4.
Output: Parameters {W^(l), c^(l)}, l = 1, …, n−1
1: Initialize B(0) ∈ {−1, 1}^{L×m} using ITQ [14]
2: Initialize {c^(l)} = 0. Initialize {W^(l)} by getting the top s_{l+1} eigenvectors from the covariance matrix of H^(l).
3: Fix B(0), compute (W, c)(0) with the (W, c) step, using the initialization of line 2 as the starting point for L-BFGS.
4: for t = 1 → T do
5:      Fix (W, c)(t−1), compute B(t) with the B step
6:      Fix B(t), compute (W, c)(t) with the (W, c) step, using (W, c)(t−1) as the starting point for L-BFGS.
7: end for
8: Return (W, c)(T)
Algorithm 1 Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)
(a) CIFAR10
(b) MNIST
(c) SIFT1M
Fig. 2: mAP comparison between UH-BDNN and state-of-the-art unsupervised hashing methods on CIFAR10, MNIST, and SIFT1M.
          |        CIFAR10         |         MNIST          |         SIFT1M
L         |  8    16    24    32   |  8    16    24    32   |  8    16    24    32
UH-BDNN   | 0.55  5.79 22.14 18.35 | 0.53  6.80 29.38 38.50 | 4.80 25.20 62.20 80.55
BA [34]   | 0.55  5.65 20.23 17.00 | 0.51  6.44 27.65 35.29 | 3.85 23.19 61.35 77.15
ITQ [14]  | 0.54  5.05 18.82 17.76 | 0.51  5.87 23.92 36.35 | 3.19 14.07 35.80 58.69
SH [13]   | 0.39  4.23 14.60 15.22 | 0.43  6.50 27.08 36.69 | 4.67 24.82 60.25 72.40
SPH [16]  | 0.43  3.45 13.47 13.67 | 0.44  5.02 22.24 30.80 | 4.25 20.98 47.09 66.42
KMH [15]  | 0.53  5.49 19.55 15.90 | 0.50  6.36 25.68 36.24 | 3.74 20.74 48.86 76.04
TABLE II: Precision at Hamming distance r = 2 comparison between UH-BDNN and state-of-the-art unsupervised hashing methods on CIFAR10, MNIST, and SIFT1M.

III-C Evaluation of Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)

          |           CIFAR10           |            MNIST
          |    mAP      | precision@2   |    mAP      | precision@2
L         |  16    32   |  16    32     |  16    32   |  16    32
DH [33]   | 16.17 16.62 | 23.33 15.77   | 43.14 44.97 | 66.10 73.29
UH-BDNN   | 17.83 18.52 | 24.97 18.85   | 45.38 47.21 | 69.13 75.26
TABLE III: Comparison with Deep Hashing (DH) [33].

This section evaluates the proposed UH-BDNN and compares it with the following state-of-the-art unsupervised hashing methods: Spectral Hashing (SH) [13], Iterative Quantization (ITQ) [14], Binary Autoencoder (BA) [34], Spherical Hashing (SPH) [16], K-means Hashing (KMH) [15]. For all compared methods, we use the implementations and the suggested parameters provided by the authors.

III-C1 Dataset, evaluation protocol, and implementation notes

Dataset

CIFAR10 [41] consists of 60,000 images in 10 classes. The training set (also used as the database for retrieval) contains 50,000 images; the query set contains 10,000 images. Each image is represented by an 800-dimensional feature vector extracted by PCA from the 4096-dimensional CNN features produced by AlexNet [25].

MNIST [42] consists of 70,000 handwritten digit images in 10 classes. The training set (also used as the database for retrieval) contains 60,000 images; the query set contains 10,000 images. Each image is represented by a 784-dimensional grayscale feature vector formed by its pixel intensities.

SIFT1M [43] contains 128-dimensional SIFT vectors [44]. There are 1M vectors used as the database for retrieval, 100K vectors for training (separate from the retrieval database), and 10K vectors for queries.

Evaluation protocol

We follow the standard setting in unsupervised hashing [14, 16, 15, 34], which uses Euclidean nearest neighbors as the ground truths for queries; the number of ground-truth neighbors per query is set as in [34] for CIFAR10, MNIST, and the large-scale SIFT1M dataset. We use the following evaluation metrics, which have been used in the state of the art [14, 34, 33], to measure the performance of the methods: 1) mean Average Precision (mAP); 2) precision of Hamming radius 2 (precision@2), which measures the precision over retrieved images having Hamming distance ≤ 2 to the query (if no images satisfy this, we report zero precision). Note that as computing mAP is slow on the large-scale SIFT1M dataset, we consider only the top-ranked returned neighbors when computing mAP.
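The precision-of-Hamming-radius metric can be sketched as follows (a simplified NumPy implementation under the ±1 code convention; the function and variable names are ours):

```python
import numpy as np

def precision_at_radius(query_codes, db_codes, gt, r=2):
    """Precision of Hamming radius r: for each query, the fraction of
    retrieved items (Hamming distance <= r) that are true neighbors.
    Queries that retrieve nothing contribute zero precision, as in the
    protocol above. Codes are +/-1 matrices with one code per row;
    gt[i, j] = 1 iff database item j is a ground-truth neighbor of query i.
    """
    L = query_codes.shape[1]
    dist = (L - query_codes @ db_codes.T) / 2
    precisions = []
    for i in range(query_codes.shape[0]):
        retrieved = dist[i] <= r
        if retrieved.sum() == 0:
            precisions.append(0.0)          # nothing within radius r
        else:
            precisions.append(gt[i, retrieved].mean())
    return float(np.mean(precisions))

# Toy check: one query, 4-bit codes.
q = np.array([[1, 1, 1, 1]])
db = np.array([[ 1,  1,  1,  1],    # distance 0 -> retrieved
               [ 1,  1, -1, -1],    # distance 2 -> retrieved
               [-1, -1, -1, -1]])   # distance 4 -> not retrieved
gt = np.array([[1, 0, 1]])          # items 0 and 2 are true neighbors
p = precision_at_radius(q, db, gt, r=2)
```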

Implementation notes

In our deep model, the parameters λ1, λ2, λ3, and λ4 are empirically set by cross-validation. The maximum iteration number T is empirically set to 10. The number of layers and the numbers of units in the hidden layers are set empirically for each code length (8, 16, 24, and 32 bits).

III-C2 Retrieval results

Fig. 2 and Table II show the comparative mAP and precision of Hamming radius 2 (precision@2) of the methods, respectively.

We find the following observations consistent across all three datasets. In terms of mAP, the proposed UH-BDNN is comparable to or outperforms the other methods at all code lengths, and the improvement is clearer at high code lengths. The best competitor to UH-BDNN is Binary Autoencoder (BA) [34], the current state-of-the-art unsupervised hashing method; at high code lengths, UH-BDNN consistently outperforms BA on all datasets. In terms of precision@2, UH-BDNN is comparable to the other methods at low code lengths, while at high code lengths it achieves significantly better performance. Specifically, the improvements of UH-BDNN over the best competitor BA [34] are clearly observed on the MNIST and SIFT1M datasets.

Comparison with Deep Hashing (DH) [33, 38] As the implementation of DH is not available, we set up the experiments on CIFAR10 and MNIST similarly to [33] to make a fair comparison. For each dataset, we randomly sample 1,000 images (i.e., 100 images per class) as the query set; the remaining images are used as the training and database set. Following [33], each CIFAR10 image is represented by a 512-dimensional GIST descriptor [45].

Following Deep Hashing [33], the ground truths of queries are defined by their class labels. (It is worth noting that in the evaluation of unsupervised hashing, instead of using class labels as ground truths, most state-of-the-art methods [14, 16, 15, 34] use Euclidean nearest neighbors as ground truths for queries.) Similarly to [33], we report comparative results in terms of mAP and the precision of Hamming radius 2. The results are presented in Table III, which clearly shows that the proposed UH-BDNN outperforms DH [33] at all code lengths, in both mAP and precision of Hamming radius 2.

IV Supervised Hashing with Binary Deep Neural Network (SH-BDNN)

In order to enhance the discriminative power of the binary codes, we extend UH-BDNN to supervised hashing by leveraging the label information. There are several approaches proposed to leverage the label information, leading to different criteria on binary codes. In [18, 46], binary codes are learned such that they minimize Hamming distances between samples belonging to the same class, while maximizing the Hamming distance between samples belonging to different classes. In [24], the binary codes are learned such that they are optimal for linear classification.

In this work, in order to exploit the label information, we follow the approach proposed in Kernel-based Supervised Hashing (KSH) [19]. The benefit of this approach is that it directly encourages the Hamming distance between the binary codes of within-class samples to be minimal, and the Hamming distance between the binary codes of between-class samples to be maximal. In other words, it tries to perfectly preserve the semantic similarity. To achieve this goal, it enforces the Hamming distances between learned binary codes to be highly correlated with the pre-computed pairwise label matrix.

In general, the network structure of SH-BDNN is similar to that of UH-BDNN, except that the last layer of UH-BDNN, which reconstructs the input, is removed; the layer n−1 of UH-BDNN becomes the last layer of SH-BDNN. All desirable properties, i.e., semantic similarity preserving, independence, and balance, are constrained on the outputs of the last layer of SH-BDNN.

IV-A Formulation of SH-BDNN

We define the pairwise label matrix S as

S_ij = 1 if x_i and x_j belong to the same class, and S_ij = −1 otherwise.   (19)

In order to achieve the semantic similarity preserving property, we learn the binary codes such that the Hamming distance between learned binary codes highly correlates with the matrix S, i.e., we want to minimize the quantity ||(1/L) (H^(n))^T H^(n) − S||². In addition, to achieve the independence and balance properties of the codes, we want to minimize the quantities ||(1/m) H^(n) (H^(n))^T − I||² and ||(1/m) H^(n) 1_{m×1}||², respectively.
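A small NumPy sketch of the pairwise label matrix and the similarity-preserving term (the 1/(2m) scaling is one plausible normalization chosen for illustration; the function names are ours):

```python
import numpy as np

def pairwise_label_matrix(labels):
    """S_ij = 1 if samples i and j share a class label, -1 otherwise."""
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1.0, -1.0)

def semantic_loss(H, S):
    """Similarity term ||(1/L) H^T H - S||_F^2 / (2m), where H is the L x m
    output of the last layer; for binary H, (1/L) H^T H is an affine
    function of the pairwise Hamming distance.
    """
    L, m = H.shape
    return np.linalg.norm(H.T @ H / L - S) ** 2 / (2 * m)

labels = np.array([0, 0, 1])
S = pairwise_label_matrix(labels)
# Codes whose normalized inner products exactly match S incur zero loss:
H = np.array([[ 1,  1, -1],
              [ 1,  1, -1]], dtype=float)    # L = 2, m = 3
loss = semantic_loss(H, S)
```

Here same-class columns are identical (Hamming distance 0) and different-class columns are antipodal (maximal Hamming distance), so the term vanishes, which is exactly the "perfect semantic preservation" the text describes.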

Following the same reformulation and relaxation as in UH-BDNN (Sec. III-A), we solve the following constrained optimization, which ensures the binary constraint and the semantic similarity preserving, independence, and balance properties of the codes.

min_{W,c,B} J = (1/2m) ||(1/L) (H^(n))^T H^(n) − S||² + (λ1/2) Σ_{l=1}^{n−1} ||W^(l)||² + (λ2/2) ||H^(n) − B||² + (λ3/2) ||(1/m) H^(n) (H^(n))^T − I||² + (λ4/2) ||(1/m) H^(n) 1_{m×1}||²   (20)
s.t. B ∈ {−1, 1}^{L×m}   (21)

Formulation (20) under constraint (21) is our formulation for supervised hashing. The main difference in formulation between UH-BDNN (8) and SH-BDNN (20) is that the reconstruction term preserving the neighbor similarity in UH-BDNN (8) is replaced by the term preserving the label similarity in SH-BDNN (20).

IV-B Optimization

Input: X ∈ R^{D×m}: labeled training data; L: code length; T: maximum iteration number; n: number of layers; {s_l}: number of units of layer l (note: s_n = L); λ1, λ2, λ3, λ4.
Output: Parameters {W^(l), c^(l)}, l = 1, …, n−1
1: Compute the pairwise label matrix S using (19).
2: Initialize B(0) ∈ {−1, 1}^{L×m} using ITQ [14]
3: Initialize {c^(l)} = 0. Initialize {W^(l)} by getting the top s_{l+1} eigenvectors from the covariance matrix of H^(l).
4: Fix B(0), compute (W, c)(0) with the (W, c) step, using the initialization of line 3 as the starting point for L-BFGS.
5: for t = 1 → T do
6:      Fix (W, c)(t−1), compute B(t) with the B step
7:      Fix B(t), compute (W, c)(t) with the (W, c) step, using (W, c)(t−1) as the starting point for L-BFGS.
8: end for
9: Return (W, c)(T)
Algorithm 2 Supervised Hashing with Binary Deep Neural Network (SH-BDNN)

In order to solve (20) under constraint (21), we use alternating optimization, which comprises two steps, over (W, c) and over B.

IV-B1 (W, c) step

When fixing B, (20) becomes an unconstrained optimization. We use the L-BFGS [40] optimizer with backpropagation to solve it. The gradients of the objective function (20) w.r.t. the different parameters are computed as follows.

Let us define

Δ^(n) = (2/(mL)) H^(n) ((1/L) (H^(n))^T H^(n) − S) + λ2 (H^(n) − B) + (2λ3/m) ((1/m) H^(n) (H^(n))^T − I) H^(n) + (λ4/m²) H^(n) 1_{m×m}   (22)

where the last layer uses the identity activation, and

Δ^(l) = ((W^(l))^T Δ^(l+1)) ⊙ f'^(l)(Z^(l)), ∀l = n−1, …, 2   (23)

where Z^(l) = W^(l−1) H^(l−1) + c^(l−1) 1_{1×m}; ⊙ denotes the Hadamard product.

Then, ∀l = n−1, …, 1, we have

∂J/∂W^(l) = Δ^(l+1) (H^(l))^T + λ1 W^(l)   (24)
∂J/∂c^(l) = Δ^(l+1) 1_{m×1}   (25)

IV-B2 B step

When fixing (W, c), we can rewrite problem (20) as

min_B ||H^(n) − B||²   (26)
s.t. B ∈ {−1, 1}^{L×m}   (27)

It is easy to see that the optimal solution for (26) under constraint (27) is B = sgn(H^(n)).
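This closed form is easy to verify numerically: the squared Frobenius norm decomposes per entry, and (h − b)² over b ∈ {−1, 1} is minimized by b = sgn(h), so elementwise sign is globally optimal. A brute-force check on a tiny matrix (values chosen arbitrarily):

```python
import itertools
import numpy as np

H = np.array([[0.3, -1.2], [-0.1, 2.0]])
B = np.sign(H)   # elementwise sign: per-entry minimizer of (h - b)^2

# Enumerate all 2^4 binary 2x2 matrices to confirm sgn(H) is globally optimal.
brute = min(
    np.linalg.norm(H - np.array(c).reshape(2, 2)) ** 2
    for c in itertools.product([-1.0, 1.0], repeat=4)
)
```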

The proposed SH-BDNN method is summarized in Algorithm 2. In Algorithm 2, B(t) and (W, c)(t) are the values of B and (W, c) at iteration t, respectively.

(a) CIFAR10
(b) MNIST
(c) SUN397
Fig. 3: mAP comparison between SH-BDNN and state-of-the-art supervised hashing methods on CIFAR10, MNIST and SUN397 datasets.
             |           CIFAR10           |            MNIST            |           SUN397
L            |   8     16     24     32    |   8     16     24     32    |   8     16     24     32
SH-BDNN      | 54.12  67.32  69.36  69.62  | 84.26  94.67  94.69  95.51  | 15.52  41.98  52.53  56.82
SDH [24]     | 31.60  62.23  67.65  67.63  | 36.49  93.00  93.98  94.43  | 13.89  40.39  49.54  53.25
ITQ-CCA [14] | 49.14  65.68  67.47  67.19  | 54.35  79.99  84.12  84.57  | 13.22  37.53  50.07  53.12
KSH [19]     | 44.81  64.08  67.01  65.76  | 68.07  90.79  92.86  92.41  | 12.64  40.67  49.29  46.45
BRE [23]     | 23.84  41.11  47.98  44.89  | 37.67  69.80  83.24  84.61  |  9.26  26.95  38.36  40.36
TABLE IV: Precision at Hamming distance r = 2 comparison between SH-BDNN and state-of-the-art supervised hashing methods on CIFAR10, MNIST, and SUN397 datasets.

IV-C Evaluation of Supervised Hashing with Binary Deep Neural Network (SH-BDNN)

This section evaluates our proposed SH-BDNN and compares it with the following state-of-the-art supervised hashing methods: Supervised Discrete Hashing (SDH) [24], ITQ-CCA [14], Kernel-based Supervised Hashing (KSH) [19], and Binary Reconstructive Embedding (BRE) [23]. For all compared methods, we use the implementations and the suggested parameters provided by the authors.

IV-C1 Dataset, evaluation protocol, and implementation notes

Dataset

We evaluate and compare the methods on the CIFAR10, MNIST, and SUN397 datasets. The descriptions of the first two datasets are presented in Section III-C1.

SUN397 [36] contains about 108K images from 397 scene categories. We use a subset of this dataset comprising the 42 largest categories, each of which contains more than 500 images; this results in about 35K images in total. The query set contains 4,200 images (100 images per class) randomly sampled from the dataset. The remaining images are used as the training set and also as the database set. Each image is represented by an 800-dimensional feature vector extracted by PCA from the 4096-dimensional CNN features produced by AlexNet [25].

Evaluation protocol

Following the literature [24, 14, 19], we report the retrieval results with two metrics: 1) mean Average Precision (mAP) and 2) precision at Hamming radius 2 (precision@2).
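For concreteness, both metrics can be computed from binary codes and class labels as in the sketch below (written in Python/NumPy for illustration; the paper's experiments use MATLAB). Codes are assumed to be in {-1, +1}, average precision is computed over the full ranked list, and queries with an empty Hamming ball are counted as zero precision, which are implementation conventions rather than details stated in the paper:

```python
import numpy as np

def hamming_dist(query_codes, db_codes):
    # Codes are in {-1, +1}: inner product equals L - 2 * Hamming distance.
    L = query_codes.shape[1]
    return (L - query_codes @ db_codes.T) // 2

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    dists = hamming_dist(query_codes, db_codes)
    aps = []
    for i in range(len(query_codes)):
        order = np.argsort(dists[i], kind="stable")       # rank database by distance
        rel = (db_labels[order] == query_labels[i]).astype(float)
        if rel.sum() == 0:
            aps.append(0.0)
            continue
        prec_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append((prec_at_k * rel).sum() / rel.sum())    # AP over the full list
    return float(np.mean(aps))

def precision_at_radius(query_codes, db_codes, query_labels, db_labels, r=2):
    dists = hamming_dist(query_codes, db_codes)
    precs = []
    for i in range(len(query_codes)):
        mask = dists[i] <= r                               # Hamming ball of radius r
        if mask.sum() == 0:
            precs.append(0.0)                              # convention: empty ball -> 0
            continue
        precs.append((db_labels[mask] == query_labels[i]).mean())
    return float(np.mean(precs))
```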

Implementation notes

The network configuration is the same as UH-BDNN, except that the final layer is removed. The hyperparameter values are set empirically using cross validation, and the maximum number of iterations is set to 5.

Following the settings in ITQ-CCA [14] and SDH [24], all training samples are used in the learning for these two methods. For SH-BDNN, KSH [19], and BRE [23], where the label information is leveraged through the pairwise label matrix, we randomly sample training data from each class and use them for learning. The ground truth of a query is defined by the class labels of the datasets.
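The pairwise label matrix mentioned above can be built directly from class labels. The sketch below assumes the common KSH-style convention (+1 for same-class pairs, -1 otherwise); the paper's exact definition (its equation (19)) is not reproduced in this section, so this convention is an assumption:

```python
import numpy as np

def pairwise_similarity(labels):
    """S[i, j] = +1 if samples i and j share a class label, else -1.

    Follows the common KSH-style convention; the paper's equation (19)
    may differ in details.
    """
    labels = np.asarray(labels).reshape(-1, 1)
    return np.where(labels == labels.T, 1.0, -1.0)
```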

IV-C2 Retrieval results

Fig. 3 and Table IV show comparative results between the proposed SH-BDNN and other supervised hashing methods on CIFAR10, MNIST, and SUN397 datasets.

On the CIFAR10 dataset, Fig. 3(a) and Table IV clearly show that the proposed SH-BDNN outperforms all compared methods by a fair margin at all code lengths, in both mAP and precision@2. The best competitor to SH-BDNN on this dataset is CCA-ITQ [14]. The improvements of SH-BDNN over CCA-ITQ are larger at high code lengths, where SH-BDNN outperforms CCA-ITQ by about 4%.

On the MNIST dataset, Fig. 3(b) and Table IV show that the proposed SH-BDNN significantly outperforms the current state-of-the-art SDH [24] at the low code length of 8 bits. As the code length increases, SH-BDNN and SDH [24] achieve similar performance. SH-BDNN significantly outperforms the other methods, i.e., KSH [19], ITQ-CCA [14], and BRE [23], in both mAP and precision@2.

On the SUN397 dataset, the proposed SH-BDNN outperforms all competitors at all code lengths in terms of both mAP and precision@2. The best competitor to SH-BDNN on this dataset is SDH [24]; the improvements of SH-BDNN over SDH are larger at high code lengths.

V Supervised Hashing with End-to-End Binary Deep Neural Network (E2E-BDNN)

Fig. 4: The illustration of the end-to-end hashing framework (E2E-BDNN).

Even though the proposed SH-BDNN can significantly enhance the discriminative power of the binary codes, its capability, like that of other hashing methods, partially depends on the discriminative power of the input image features. Recent end-to-end deep learning-based supervised hashing methods [28, 29, 30] have shown that jointly learning image representations and binary hash codes in an end-to-end fashion can boost retrieval accuracy. Therefore, in this section, we extend the proposed SH-BDNN to an end-to-end framework. Specifically, we integrate a convolutional neural network (CNN) with our supervised hashing network (SH-BDNN) into a unified end-to-end deep architecture, namely the End-to-End Binary Deep Neural Network (E2E-BDNN), which jointly learns visual features and binary representations of images. In the following, we first introduce the proposed network architecture, then describe the training process, and finally present experiments on benchmark datasets.

V-A Network architecture

Fig. 4 illustrates the overall architecture of the end-to-end binary deep neural network (E2E-BDNN). In detail, the network consists of three main components: (i) a feature extractor, (ii) a dimensionality reduction layer, and (iii) a binary optimizer component. We utilize AlexNet [25] as the feature extractor of E2E-BDNN. In our configuration, we remove the last layer of AlexNet, namely the softmax layer, and treat its last fully connected layer (fc7) as the image representation.

The dimensionality reduction component (the DR layer) is a fully connected layer that reduces the high-dimensional image representations output by the feature extractor to lower-dimensional representations. We use the identity function as the activation function of this DR layer. The reduced representations are then used as inputs to the following binary optimizer component.

The binary optimizer component of E2E-BDNN is similar to SH-BDNN. That is, we also constrain the output codes of E2E-BDNN to be binary, and these codes have the desired properties of semantic similarity preservation, independence, and balance. Using the same design as SH-BDNN for the last component of E2E-BDNN allows us to isolate the advantages of the end-to-end architecture over SH-BDNN.

The training data for E2E-BDNN are labelled images, in contrast to SH-BDNN, which uses visual features such as GIST [45], SIFT [44], or deep features from convolutional networks. Given the input labeled images, we aim to learn binary codes with the aforementioned desired properties, i.e., semantic similarity preservation, independence, and balance. To achieve these properties, we use a similar objective function to that of SH-BDNN. It is important to note, however, that in SH-BDNN, owing to its non-end-to-end architecture, we can feed the whole training set to the network at once for training. This does not hold for E2E-BDNN: due to the memory consumption of the end-to-end architecture, we can only feed a minibatch of images to the network at a time. Technically, let

F ∈ R^(m x L) be the output of the last fully connected layer of E2E-BDNN for a minibatch of size m; S be the similarity matrix defined over the minibatch (using equation (19)); and B be an auxiliary variable. Similar to SH-BDNN, we train the network to minimize the following constrained loss function

(28)  min J = 1/(2m) ||(1/L) F F^T - S||^2 + (lambda_1/2) sum_l ||W^(l)||^2 + (lambda_2/(2m)) ||F - B||^2 + (lambda_3/2) ||(1/m) F^T F - I||^2 + (lambda_4/(2m)) ||F^T 1||^2
(29)  s.t.  B ∈ {-1, +1}^(m x L)

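The individual penalty terms of such an objective (similarity preservation, closeness to binary codes, bit independence, and bit balance) are straightforward to compute. The NumPy sketch below is illustrative only: the variable names (F, S, B) and the normalization constants are assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def bdnn_loss_terms(F, S, B, L):
    """Penalty terms of an SH-BDNN-style objective (illustrative sketch).

    F: (m, L) real-valued outputs of the last layer for a minibatch
    S: (m, m) pairwise similarity matrix
    B: (m, L) binary codes in {-1, +1}
    """
    m = F.shape[0]
    sim = 0.5 / m * np.linalg.norm(F @ F.T / L - S) ** 2          # similarity preserving
    binz = 0.5 / m * np.linalg.norm(F - B) ** 2                   # closeness to binary codes
    indep = 0.5 * np.linalg.norm(F.T @ F / m - np.eye(L)) ** 2    # bit independence
    balance = 0.5 / m * np.linalg.norm(F.T @ np.ones(m)) ** 2     # bit balance
    return sim, binz, indep, balance
```

With perfectly balanced, independent binary outputs and a similarity matrix consistent with them, all four terms vanish, which is the ideal the optimization drives towards.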
V-B Training

Input: labeled training images; minibatch size; code length; maximum iteration numbers; hyperparameters.
Output: Network parameters
1: Initialize the network
2: Initialize the binary codes of the training set via ITQ [14]
3: for each outer iteration do
4:     for each inner iteration do
5:         Sample a minibatch from the training set
6:         Compute the similarity matrix corresponding to the minibatch
7:         From the current binary code matrix, take the codes corresponding to the minibatch
8:         Fix these codes and optimize the network parameters via SGD
9:     end for
10:    Update the binary code matrix from the network outputs on the whole training set
11: end for
12: Return the network parameters
Algorithm 3 End-to-End Binary Deep Neural Network (E2E-BDNN) Learning

The training procedure of E2E-BDNN is presented in Algorithm 3. In each inner iteration, a minibatch is sampled from the training set together with its corresponding binary codes; after every pass over the training data, the binary codes of the whole training set are updated from the current network outputs.
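The alternating scheme of Algorithm 3 can be sketched in toy form as follows. Here the CNN is replaced by a single linear map trained with plain SGD on the binarization term only (the similarity, independence, and balance terms are omitted for brevity), so this illustrates the structure of the loop, not the actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_e2e_sketch(X, L=8, T=3, K=10, batch=16, lr=1e-2):
    """Toy sketch of Algorithm 3's alternating optimization.

    X: (n, d) input features; the real method optimizes AlexNet + DR
    layer, which is replaced here by one linear map W for illustration.
    """
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, L))
    B = np.sign(X @ W)                       # stand-in for the ITQ initialization
    B[B == 0] = 1
    for t in range(T):                       # outer iterations
        for k in range(K):                   # inner iterations over minibatches
            idx = rng.choice(n, size=batch, replace=False)
            Xb, Bb = X[idx], B[idx]          # codes are fixed during this step
            grad = 2 * Xb.T @ (Xb @ W - Bb) / batch   # d/dW ||Xb W - Bb||^2
            W -= lr * grad                   # SGD update of the "network"
        B = np.sign(X @ W)                   # update codes from network outputs
        B[B == 0] = 1
    return W, B
```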

At the beginning (line 1 in Algorithm 3), we initialize the network parameters as follows. The feature extractor component is initialized with the pretrained AlexNet model [25]. The dimensionality reduction (DR) layer is initialized by applying PCA to the AlexNet features, i.e., the outputs of the fc7 layer, of the training set. Specifically, the weights of the DR layer are set to the top eigenvectors (those corresponding to the largest eigenvalues) of the covariance matrix of the training features, and its bias to the negative projection of the mean of the training features, so that the DR layer initially computes the PCA projection.
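This PCA-based initialization can be sketched as follows; the function and variable names are hypothetical, and the layer is assumed to compute y = W x + b:

```python
import numpy as np

def init_dr_layer(train_features, out_dim):
    """Initialize a DR layer so that it computes the PCA projection.

    y = W x + b with W = top eigenvectors (as rows) of the feature
    covariance matrix and b = -W @ mean, i.e. y = W (x - mean).
    """
    mu = train_features.mean(axis=0)
    centered = train_features - mu
    cov = centered.T @ centered / (len(train_features) - 1)
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :out_dim].T       # (out_dim, d): top components as rows
    b = -W @ mu                               # centering absorbed into the bias
    return W, b
```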

The binary optimizer component is initialized by training SH-BDNN with the AlexNet features of the training set as inputs. We then initialize the binary code matrix of the whole dataset via ITQ [14] (line 2 in Algorithm 3); here, the AlexNet features are used as training inputs for ITQ.
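ITQ itself alternates between binarizing the rotated data and solving an orthogonal Procrustes problem for the rotation. The following is a standard re-implementation sketch of ITQ [14], not the authors' code:

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Iterative Quantization [14]: learn a rotation R so that
    sign(V R) best matches V R.  V: (n, L) PCA-projected features.
    """
    rng = np.random.default_rng(seed)
    # Random orthogonal initialization of the rotation.
    R, _ = np.linalg.qr(rng.normal(size=(V.shape[1], V.shape[1])))
    for _ in range(n_iter):
        B = np.sign(V @ R)                    # binarize the rotated data
        B[B == 0] = 1
        # Orthogonal Procrustes: argmin_R ||B - V R||_F over orthogonal R.
        U, _, St = np.linalg.svd(V.T @ B)
        R = U @ St
    B = np.sign(V @ R)
    B[B == 0] = 1
    return B, R
```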

In each inner iteration of Algorithm 3, we sample a minibatch from the training set and feed it to the network (line 5 in Algorithm 3), so that after enough inner iterations all training data have been exhaustively sampled. For each minibatch, we first create the similarity matrix (using equation (19)) and gather the corresponding binary codes (lines 6 and 7 in Algorithm 3). Since these binary codes have already been computed, we can fix them and update the network parameters by standard backpropagation with Stochastic Gradient Descent (SGD) (line 8 in Algorithm 3). After the network has been exhaustively trained on the whole training set, we update the binary code matrix (line 10 in Algorithm 3) by binarizing the outputs of the last fully connected layer for all training samples. We then repeat the optimization procedure until a stopping criterion is reached, i.e., after a maximum number of iterations.

Implementation details The proposed E2E-BDNN is implemented in MATLAB with the MatConvNet library [47]. All experiments are conducted on a workstation with a Titan X GPU. The hyperparameters of the loss function, the learning rate, the weight decay, and the minibatch size are all set empirically.

                   CIFAR10                    MNIST                     SUN397
Method          8     16    24    32      8     16    24    32      8     16    24    32
SH-BDNN       57.15 66.04 68.81 69.66   84.65 94.24 94.80 95.25   33.06 47.13 57.02 61.89
E2E-BDNN      64.83 71.02 72.37 73.56   88.82 98.03 98.26 98.21   34.15 48.21 59.51 64.58
TABLE V: mAP comparison between SH-BDNN and E2E-BDNN on CIFAR10, MNIST, and SUN397 datasets.

V-C Evaluation of End-to-End Binary Deep Neural Network (E2E-BDNN)

Since we have already compared SH-BDNN to other supervised hashing methods in Section IV-C, in this experiment we focus on comparing E2E-BDNN with SH-BDNN. We also compare the proposed E2E-BDNN to other end-to-end hashing methods [29, 28, 30, 31].

Comparison between SH-BDNN and E2E-BDNN

Table V presents the comparative mAP between SH-BDNN and E2E-BDNN. The results show that E2E-BDNN consistently improves over SH-BDNN at all code lengths on all datasets. The largest improvements of E2E-BDNN over SH-BDNN are observed on the CIFAR10 and MNIST datasets, especially at low code lengths: on CIFAR10, E2E-BDNN outperforms SH-BDNN by 7.7% and 5% at L = 8 and L = 16, respectively; on MNIST, E2E-BDNN outperforms SH-BDNN by 4.2% and 3.8% at L = 8 and L = 16, respectively. On the SUN397 dataset, the improvements of E2E-BDNN over SH-BDNN are clearer at high code lengths: E2E-BDNN outperforms SH-BDNN by 2.5% and 2.7% at L = 24 and L = 32, respectively. These improvements confirm the effectiveness of the proposed end-to-end architecture for learning discriminative binary codes.

Comparison between E2E-BDNN and other end-to-end supervised hashing methods

We also compare our proposed deep networks SH-BDNN and E2E-BDNN with other end-to-end supervised hashing architectures, i.e., Hashing with Deep Neural Network (DNNH) [30], Deep Hashing Network (DHN) [48], Deep Quantization Network (DQN) [31], Deep Semantic Ranking Hashing (DSRH) [28], and Deep Regularized Similarity Comparison Hashing (DRSCH) [29]. In those works, the image features and hash codes are simultaneously learned by combining CNN layers with a binary quantization layer in a single model. However, their binary mapping layers only apply simple operations, e.g., a smooth approximation of the sign function [28, 30, 29] or a norm-based relaxation of the binary constraints [48]. Our SH-BDNN and E2E-BDNN advance over those works in how the image features are mapped to binary codes. Furthermore, our learned codes are guaranteed to have good properties, i.e., independence and balance, whereas [30, 29, 31] do not consider such properties and [28] only considers the balance of the codes. It is worth noting that different evaluation settings are used in [30, 28, 29, 48]. For a fair comparison, following those works, we set up two experimental settings as follows:

  • Setting 1: following [30, 31, 48], we randomly sample 100 images per class to form a test set of 1K images. The remaining 59K images are used as the database. Furthermore, 500 images per class are sampled from the database to form a training set of 5K images.

  • Setting 2: following [28, 29], we randomly sample 1K images per class to form a test set of 10K images. The remaining 50K images serve as the training set. In the test phase, each test image is used to query the test set via a leave-one-out procedure.

Table VI shows the comparative mAP between our methods and DNNH [30], DHN [48], and DQN [31] on the CIFAR10 dataset under Setting 1. Interestingly, even with a non-end-to-end approach, our SH-BDNN outperforms DNNH and DQN at all code lengths. This confirms the effectiveness of the proposed approach for handling the binary constraints and for ensuring the desired properties, i.e., independence and balance, of the produced codes. The end-to-end approach boosts performance further: the proposed E2E-BDNN outperforms all compared methods, DNNH [30], DHN [48], and DQN [31]. It is worth noting that in Lai et al. [30], increasing the code length does not necessarily boost the retrieval accuracy: they report an mAP of 55.80 at L = 32, while a higher mAP of 56.60 is reported at L = 24. In contrast, both SH-BDNN and E2E-BDNN improve mAP as the code length increases.

Table VII presents the comparative mAP between the proposed SH-BDNN and E2E-BDNN and the competitors DSRH [28] and DRSCH [29] on the CIFAR10 dataset under Setting 2. The results clearly show that the proposed E2E-BDNN significantly outperforms DSRH [28] and DRSCH [29] at all code lengths. Compared to the best competitor, DRSCH [29], the improvements of E2E-BDNN range from 5% to 6% across code lengths. Furthermore, even with a non-end-to-end approach, the proposed SH-BDNN also outperforms DSRH [28] and DRSCH [29].

Method         24    32    48
E2E-BDNN     60.02 61.35 63.59
SH-BDNN      57.30 58.66 60.08
DNNH [30]    56.60 55.80 58.10
DQN [31]     55.80 56.40 58.00
DHN [48]     59.40 60.30 62.10
TABLE VI: mAP comparison between E2E-BDNN, SH-BDNN and DNNH [30], DQN [31], and DHN [48] on CIFAR10 (Setting 1).
Method         24    32    48
E2E-BDNN     67.16 68.72 69.23
SH-BDNN      65.21 66.22 66.53
DSRH [28]    61.08 61.74 61.77
DRSCH [29]   62.19 62.87 63.05
TABLE VII: mAP comparison between E2E-BDNN, SH-BDNN and the end-to-end hashing methods DSRH [28] and DRSCH [29] on CIFAR10 (Setting 2).

VI Conclusion

In this paper, we propose three deep hashing neural networks for learning compact binary representations. First, we introduce UH-BDNN and SH-BDNN for unsupervised and supervised hashing, respectively. Our network designs constrain one layer to directly produce binary codes, and also ensure good properties of the codes: similarity preservation, independence, and balance. Together with the designs, we propose alternating optimization schemes that effectively handle the binary constraints on the codes. We then extend SH-BDNN to an end-to-end binary deep neural network (E2E-BDNN) framework that jointly learns the image representations and the binary codes. Solid experimental results on benchmark datasets show that the proposed methods compare favorably with or outperform state-of-the-art hashing methods.

References

  • [1] R. Arandjelovic and A. Zisserman, “All about VLAD,” in CVPR, 2013.
  • [2] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in CVPR, 2010.
  • [3] T.-T. Do and N.-M. Cheung, “Embedding based on function approximation for large scale image search,” TPAMI, 2017.
  • [4] H. Jégou and A. Zisserman, “Triangulation embedding and democratic aggregation for image search,” in CVPR, 2014.
  • [5] K. Grauman and R. Fergus, “Learning binary hash codes for large-scale image search,” Machine Learning for Computer Vision, 2013.
  • [6] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” CoRR, 2014.
  • [7] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” TPAMI, 2017.
  • [8] J. Wang, W. Liu, S. Kumar, and S. Chang, “Learning to hash for indexing big data - A survey,” Proceedings of the IEEE, pp. 34–57, 2016.
  • [9] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB, 1999.
  • [10] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in ICCV, 2009.
  • [11] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in NIPS, 2009.
  • [12] B. Kulis, P. Jain, and K. Grauman, “Fast similarity search for learned metrics,” TPAMI, pp. 2143–2157, 2009.
  • [13] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, 2008.
  • [14] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” TPAMI, pp. 2916–2929, 2013.
  • [15] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in CVPR, 2013.
  • [16] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-e. Yoon, “Spherical hashing,” in CVPR, 2012.
  • [17] W. Kong and W.-J. Li, “Isotropic hashing,” in NIPS, 2012.
  • [18] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, “LDAhash: Improved matching with smaller descriptors,” TPAMI, pp. 66–78, 2012.
  • [19] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in CVPR, 2012.
  • [20] J. Wang, S. Kumar, and S. Chang, “Semi-supervised hashing for large-scale search,” TPAMI, pp. 2393–2406, 2012.
  • [21] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in NIPS, 2012.
  • [22] G. Lin, C. Shen, D. Suter, and A. van den Hengel, “A general two-step approach to learning-based hashing,” in ICCV, 2013.
  • [23] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in NIPS, 2009.
  • [24] F. Shen, C. Shen, W. Liu, and H. Tao Shen, “Supervised discrete hashing,” in CVPR, 2015.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [26] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in CVPRW, 2014.
  • [27] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [28] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in CVPR, 2015.
  • [29] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” TIP, pp. 4766–4779, 2015.
  • [30] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in CVPR, 2015.
  • [31] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen, “Deep quantization network for efficient image retrieval,” in AAAI, 2016.
  • [32] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, pp. 1527–1554, 2006.
  • [33] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in CVPR, 2015.
  • [34] M. A. Carreira-Perpinan and R. Raziperchikolaei, “Hashing with binary autoencoders,” in CVPR, 2015.
  • [35] T.-T. Do, A.-D. Doan, and N.-M. Cheung, “Learning to hash with binary deep neural network,” in ECCV, 2016.
  • [36] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010.
  • [37] R. Salakhutdinov and G. E. Hinton, “Semantic hashing,” Int. J. Approx. Reasoning, pp. 969–978, 2009.
  • [38] J. Lu, V. E. Liong, and J. Zhou, “Deep hashing for scalable image search,” TIP, 2017.
  • [39] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed.   Springer, 2006.
  • [40] D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large scale optimization,” Mathematical Programming, vol. 45, pp. 503–528, 1989.
  • [41] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
  • [42] Y. Lecun and C. Cortes, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/.
  • [43] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” TPAMI, pp. 117–128, 2011.
  • [44] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, pp. 91–110, 2004.
  • [45] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, pp. 145–175, 2001.
  • [46] V. A. Nguyen, J. Lu, and M. N. Do, “Supervised discriminative hashing for compact binary codes,” in ACM Multimedia, 2014.
  • [47] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in ACM Multimedia, 2015.
  • [48] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in AAAI, 2016.