Deep Hashing: A Joint Approach for Image Signature Learning

08/12/2016 ∙ by Yadong Mu, et al. ∙ AT&T, Peking University

Similarity-based image hashing is a crucial technique for reducing visual data storage and expediting image search. Conventional hashing schemes typically feed hand-crafted features into hash functions, which separates the procedures of feature extraction and hash function learning. In this paper, we propose a novel algorithm that concurrently performs feature engineering and non-linear supervised hashing function learning. Our technical contributions are two-fold: 1) deep network optimization is usually achieved by gradient propagation, which critically requires a smooth objective function; the discrete nature of hash codes makes them unamenable to gradient-based optimization. To address this issue, we propose an exponentiated hashing loss function and its bilinear smooth approximation, which enable effective gradient calculation and propagation. 2) Pre-training is an important trick in supervised deep learning, yet its impact on hash code quality has not been discussed in the deep hashing literature. We propose a pre-training scheme inspired by recent advances in deep-network-based image classification, and experimentally demonstrate its effectiveness. Comprehensive quantitative evaluations are conducted on several widely-used image benchmarks. On all benchmarks, our proposed deep hashing algorithm outperforms all state-of-the-art competitors by significant margins. In particular, our algorithm achieves a near-perfect 0.99 Hamming ranking accuracy with only 12 bits on MNIST, and a new record of 0.74 on the CIFAR10 dataset. In comparison, the best accuracies previously obtained on CIFAR10 by hashing algorithms without and with deep networks are 0.36 and 0.58, respectively.


1 Introduction

Recent years have witnessed spectacular progress on similarity-based hash code learning in a variety of computer vision tasks, such as image search [7], object recognition [26] and local descriptor compression [5]. The hash codes are highly compact (e.g., several bytes per image) in most cases, which significantly reduces the overhead of storing visual big data and also expedites similarity-based image search. The theoretical ground of similarity-oriented hashing is rooted in the Johnson-Lindenstrauss lemma [8], which states that for any n samples there exists a subspace of dimension O(log n / ε²), computable in polynomial time, such that the pairwise affinities among these samples are preserved within a factor of 1 ± ε when the samples are embedded into it. This seminal theoretical discovery sheds light on trading a small loss in similarity preservation for high compression of a large data set. The classic locality-sensitive hashing (LSH) [11] is a good demonstration of the above tradeoff, instantiated for various similarity metrics such as Hamming distance [11], cosine similarity [6], ℓp distance with p ∈ (0, 2] [9], the Jaccard index [4] and Euclidean distance [2].
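To make this tradeoff concrete, the sketch below (ours, not part of the method proposed in this paper) implements the classic random-projection LSH for cosine similarity: each bit is the sign of a random Gaussian projection, and the Hamming distance between codes approximates the angular distance between the original vectors.

```python
import numpy as np

def lsh_sign_random_projection(X, num_bits, seed=0):
    """Cosine-similarity LSH: each bit is the sign of a random Gaussian projection."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], num_bits))  # one random hyperplane per bit
    return (X @ R >= 0).astype(np.uint8)             # binary codes, one row per sample

def hamming_distance(codes_a, codes_b):
    """Pairwise Hamming distances between two sets of binary codes."""
    return (codes_a[:, None, :] != codes_b[None, :, :]).sum(axis=2)

# Toy usage: similar vectors tend to receive codes with small Hamming distance.
X = np.random.default_rng(1).standard_normal((5, 128))
codes = lsh_sign_random_projection(X, num_bits=32)
print(hamming_distance(codes, codes))
```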

Images are often accompanied with supervised information in various forms, such as semantically similar / dissimilar data pairs. Supervised hash code learning [21, 27] harnesses such supervisory information during parameter optimization and has demonstrated superior image search accuracy compared with unsupervised hashing algorithms [1, 28, 10]. Exemplar supervised hashing schemes include LDAHash [5], two-step hashing [18], and kernel-based supervised hashing [20] etc.

Importantly, two factors are known to be crucial for hashing-based image search accuracy: the discriminative power of the features and the choice of hashing functions. In a typical pipeline of existing hashing methods, these two factors are treated separately. Each image is often represented by a vector of hand-crafted visual features (such as a SIFT-based bag-of-words feature or sparse codes). Regarding hash functions, a large body of existing works have adopted linear functions owing to their simplicity. More recently, researchers have also explored a number of non-linear hashing functions, such as the anchor-based kernelized hashing function [20] and decision-tree-based functions [18].

This paper attacks the problem of supervised hashing by concurrently conducting visual feature engineering and hash function learning. Most existing image features are designed for general computer vision tasks. Intuitively, by unifying these two sub-tasks in the same formulation, one can expect the extracted image features to be more amenable to the hashing purpose. Our work is inspired by the recent prevalence and success of deep learning techniques [17, 3, 13]. Though the unreasonable effectiveness of deep learning has been successfully demonstrated in tasks like image classification [13] and face analysis [25], deep learning for supervised hashing still remains inadequately explored in the literature.

Salakhutdinov et al. proposed semantic hashing in [24], where stacked Restricted Boltzmann Machines (RBMs) are employed for hash code generation. Nonetheless, the algorithm is primarily devised for indexing textual data and its extension to visual data is unclear. Xia et al. [29] adopted a two-step hashing strategy similar to [18]: it first factorizes the data similarity matrix to obtain a target binary code for each image; in the next stage, the target codes and the image labels are jointly utilized to guide the network parameter optimization. Since the target codes are not updated once approximately learned in the first stage, the final model is only sub-optimal. Lai et al. [16] developed a convolutional deep network for hashing, comprised of shared sub-networks and a divide-and-encode module. However, the parameters of these two components are still learned separately: after the shared sub-networks are initialized, their parameters (including all convolutional/pooling layers) are frozen while the divide-and-encode module is optimized. Intrinsically, the method in [16] should be categorized as two-step hashing rather than simultaneous feature/hash-function learning. Liong et al. [19] presented a binary encoding network built purely with fully-connected layers. The method essentially assumes that the visual features (e.g., GIST, as used in the experiments therein) have been learned elsewhere and are fed into its first layer as the input.

As the literature overview above reveals, a deep hashing method that simultaneously learns the features and the hash codes is still missing from this research field, which inspires our work. The key contributions of this work include:

  • We propose the first deep hashing algorithm of its kind, which performs concurrent feature and hash function learning over a unified network.

  • We investigate the key pitfalls in designing such deep networks. Particularly, there are two major obstacles: the gradient calculation from non-differentiable binary hash codes, and network pre-training in order to eventually stay at a “good” local optimum. To address the first issue, we propose an exponentiated hashing loss function and devise its bilinear smooth approximation. Effective gradient calculation and propagation are thereby enabled. Moreover, an efficient pre-training scheme is also proposed. We verify its effectiveness through comprehensive evaluations on real-world visual data.

  • The proposed deep hashing method establishes new performance records on four image benchmarks that are widely used in this research area. For instance, on the CIFAR10 dataset, our method achieves a mean average precision of 0.73 for Hamming-ranking-based image search, a drastic improvement over the state-of-the-art methods (0.58 for [16] and 0.36 for [20]).

2 The Proposed Method

Figure 1: Illustration of our proposed deep network and the pre-training / fine-tuning process. Due to space limits, non-linear activation layers are not plotted in the diagram. See the text for more explanation.

Throughout this paper we will use bold symbols to denote vectors or matrices, and italic ones for scalars unless otherwise instructed. Suppose a data set with supervision information is provided as the input. Prior works on supervised hashing have considered various forms of supervision, including triplets of items (x, x⁺, x⁻) in which the pair (x, x⁺) is more alike than the pair (x, x⁻) [21, 23, 16], pairwise similar/dissimilar relations [20], or the label of each sample. Observing that triplet-type supervision incurs tremendous complexity during hashing function learning and that semantic-level sample labels can be effortlessly converted into pairwise relations, the discussion hereafter focuses on supervision in pairwise fashion. Let S and D collect all similar and dissimilar pairs, respectively. For notational convenience, we further introduce a supervision matrix with entries s_ij, defined as

s_ij = +1 if (x_i, x_j) ∈ S, and s_ij = −1 if (x_i, x_j) ∈ D.    (1)
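Since all benchmarks used later have mutually exclusive classes, this matrix can be built from label consensus alone. A minimal sketch under that assumption (variable names are ours; treating self-pairs as neither similar nor dissimilar is our own convention):

```python
import numpy as np

def supervision_matrix(labels):
    """Entries s_ij: +1 for same-label (similar) pairs, -1 for different-label (dissimilar) pairs."""
    labels = np.asarray(labels).reshape(-1, 1)
    S = np.where(labels == labels.T, 1, -1).astype(np.int8)
    np.fill_diagonal(S, 0)  # exclude self-pairs from the supervision (our convention)
    return S

print(supervision_matrix([0, 0, 1, 2]))
```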

Figure 1 illustrates our proposed pipeline for learning a deep convolutional network for supervised hashing. The network is comprised of two components: a topmost layer meticulously customized for the hashing task, and other conventional layers. The network takes raw multi-channel images of a fixed size as input. The neurons on the top layer output either −1 or 1 as the hash code; formally, each top neuron represents a hashing function h_k(x), k = 1, …, B, where x denotes the 3-D raw image and B is the number of hash bits. For notational clarity, let us denote the response vector on the second topmost layer by z = f(x; θ), where f(·; θ) implicitly defines the highly non-linear mapping from the raw data to that intermediate layer.

For the topmost layer, we adopt a simple linear transformation followed by a signum operation, formally presented as

h_k(x) = sgn(w_k^T z),  k = 1, …, B,    (2)

where w_k collects the parameters of the k-th hash bit and W = [w_1, …, w_B].
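A minimal sketch of this top layer (our notation; a bias term, if used in the actual model, is omitted), together with the smooth tanh surrogate that appears later during training:

```python
import numpy as np

def hash_layer_forward(Z, W):
    """Binary codes h_k(x) = sgn(w_k^T z) in {-1, +1}.

    Z: (n, d) responses of the second topmost layer, one row per image.
    W: (d, B) parameters, one column w_k per hash bit.
    """
    A = Z @ W                        # linear responses, shape (n, B)
    return np.where(A >= 0, 1, -1)   # signum, with ties sent to +1

def hash_layer_relaxed(Z, W):
    """Smooth surrogate used for gradient-based training: tanh in place of sgn."""
    return np.tanh(Z @ W)
```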

The remainder of this section first introduces the hashing loss function and the calculation of its smoothed surrogate in Section 2.1. More algorithmic details of our proposed pretraining-finetuning procedure are given in Section 2.2.

2.1 Exponentiated Code Product Optimization

The key purpose of supervised hashing is to elevate image search accuracy. The goal can be intuitively achieved by generating discriminative hash codes, such that similar data pairs can be perfectly distinguished from dissimilar pairs according to the Hamming distances calculated over the hash codes. A number of hashing loss functions have been devised following the above design principle. In particular, Norouzi et al. [22] propose a hinge-like loss function; critically, the hinge loss is known to be non-smooth and thus complicates gradient-based optimization. Two other works [20, 18] adopt smooth losses defined on the inner product between hash codes.

How to design an optimal hashing loss function for perceptron-like learning largely remains an open question. The major complication stems from the discrete nature of hash codes, which prohibits direct gradient computation and propagation as in typical deep networks. As such, prior works have investigated several tricks to mitigate this issue, for example optimizing a variational upper bound of the original non-smooth loss [22], or simply computing heuristic sub-gradients [16]. In this work we advocate an exponential discrete loss function which directly optimizes the hash code product and enjoys a bilinear smoothed approximation. Compared with alternative hashing losses, we first argue that the proposed exponential loss is more amenable to mini-batch based iterative updates, and later demonstrate its empirical superiority in the experiments.

Let b(x) = (h_1(x), …, h_B(x))^T ∈ {−1, 1}^B denote the B hash bits in vector form for data object x. We also use the notations h_k(x) and b_{-k}(x) to stand for bit k of b(x) and the hash code with bit k absent, respectively. As a widely-known fact in the hashing literature [20], the code product admits a one-to-one correspondence to the Hamming distance and is comparably easier to manipulate. A normalized version of the code product, ranging over [−1, 1], is described as

p(x_i, x_j) = (1/B) Σ_{k=1..B} h_k(x_i) h_k(x_j) = (1/B) b(x_i)^T b(x_j),    (3)

and when bit k is absent, the code product using the partial hash codes is

p_{-k}(x_i, x_j) = (1/B) Σ_{k' ≠ k} h_{k'}(x_i) h_{k'}(x_j).    (4)

Exponential Loss: Given the observation that p(x_i, x_j) faithfully indicates the pairwise similarity, we propose to minimize an exponentiated objective function defined as the accumulation over all supervised data pairs:

J(W, θ) = Σ_{(x_i, x_j) ∈ S ∪ D} ℓ(x_i, x_j),    (5)

where θ represents the collection of parameters in the deep network excluding the hashing-loss layer. The atomic loss term is

ℓ(x_i, x_j) = exp( −s_ij · p(x_i, x_j) ).    (6)

This novel loss function enjoys some elegant traits desired by deep hashing compared with those in BRE [14], MLH [22] and KSH [20]. It establishes a more direct connection to the hashing function parameters by maximizing the correlation between the code product and the pairwise labels. In comparison, BRE and MLH optimize the parameters by aligning Hamming distances with the original metric distances or by enforcing Hamming distances to be larger/smaller than pre-specified thresholds; both formulations incur complicated optimization procedures, and their optimality conditions are unclear. KSH adopts a least-squares formulation for regressing the code product onto the target labels, where a smooth surrogate for gradient computation is proposed. However, the surrogate heavily deviates from the original loss function due to its high non-linearity.
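For illustration, the sketch below evaluates the code product of Eqn. (3) and the exponential objective of Eqns. (5)-(6) on exact ±1 codes, following the formulas as reconstructed above (our own sketch, not the authors' implementation):

```python
import numpy as np

def normalized_code_product(codes):
    """Eqn. (3): p(x_i, x_j) = (1/B) b(x_i)^T b(x_j), in [-1, 1]; codes is (n, B) in {-1, +1}."""
    return codes @ codes.T / codes.shape[1]

def exponential_hashing_loss(codes, S):
    """Eqns. (5)-(6): sum of exp(-s_ij * p(x_i, x_j)) over all supervised pairs (s_ij != 0)."""
    P = normalized_code_product(codes)
    mask = S != 0
    return float(np.sum(np.exp(-S[mask] * P[mask])))
```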

Gradient Computation: A prominent advantage of the exponential loss is its easy conversion into a multiplicative form, which elegantly simplifies the derivation of its gradient. For presentation clarity, we hereafter focus only on the calculation conducted over the topmost hashing-loss layer; namely, for bit k, h_k(x) = sgn(w_k^T z), where z is the response vector at the second topmost layer and w_k are the parameters to be learned for bit k (k = 1, …, B).

Following the common practice in deep learning, two groups of quantities, ∂J/∂w_k (k = 1, …, B) and ∂J/∂z_i (i ranging over the index set of the current mini-batch), need to be estimated on the hashing-loss layer at each iteration. The former group is used for updating w_1, …, w_B, and the latter is propagated backwards to the bottom layers. The additive algebra of the hash code product in Eqn. (3) inspires us to estimate the gradients in a leave-one-out mode. For the atomic loss in Eqn. (6), it is easily verified that

ℓ(x_i, x_j) = exp( −s_ij · p_{-k}(x_i, x_j) ) · exp( −(s_ij / B) · h_k(x_i) h_k(x_j) ),

where only the latter factor is related to bit k. Since the product h_k(x_i) h_k(x_j) can only be −1 or 1, we can linearize the latter factor by exhaustively enumerating all its possible values, namely

exp( −(s_ij / B) · h_k(x_i) h_k(x_j) ) = α_ij + β_ij · h_k(x_i) h_k(x_j),    (7)

where α_ij and β_ij are two sample-specific constants, calculated by α_ij = ( exp(−s_ij/B) + exp(s_ij/B) ) / 2 and β_ij = ( exp(−s_ij/B) − exp(s_ij/B) ) / 2. Since the hardness of calculating the gradient of Eqn. (7) lies in the bit product h_k(x_i) h_k(x_j), we replace the signum function with the sigmoid-shaped function tanh(·), obtaining

h_k(x_i) h_k(x_j) ≈ tanh(w_k^T z_i) · tanh(w_k^T z_j).    (8)

Freezing the partial code product p_{-k}(x_i, x_j), we define an approximate atomic loss with only bit k active:

ℓ_k(x_i, x_j) = exp( −s_ij · p_{-k}(x_i, x_j) ) · ( α_ij + β_ij · tanh(w_k^T z_i) tanh(w_k^T z_j) ),    (9)

where the first factor plays the role of re-weighting the specific data pair, conditioned on the remaining bits. Iterating over all k's, the original loss function can now be approximated by

J(W, θ) ≈ Σ_{(x_i, x_j) ∈ S ∪ D} Σ_{k=1..B} ℓ_k(x_i, x_j).    (10)
1:  Input: Training set {x_i} with data labels, and step size η;
2:  Output: network parameters W = [w_1, …, w_B] for the hashing-loss layer, and θ for all other layers;
pre-training stage #1: initialize θ
3:  Concatenate all layers (excluding the top hashing-loss layer) with a softmax layer that defines an image classification task;
4:  Apply AlexNet-style [13] supervised parameter learning, obtaining θ;
5:  Calculate the neuron responses on the second topmost layer through z_i = f(x_i; θ);
pre-training stage #2: initialize W
6:  Replicate all z_i's from the previous stage;
7:  while not converged do
8:     Forward computation starting from the z_i's;
9:     for k = 1 to B do
10:         Update w_k by minimizing the image classification error;
11:     end for
12:  end while
simultaneous supervised fine-tuning
13:  while not converged do
14:     Forward computation starting from the raw images;
15:     for k = 1 to B do
16:         Estimate ∂J/∂w_k;
17:         Update w_k ← w_k − η · ∂J/∂w_k;
18:     end for
19:     Estimate ∂J/∂z_i for all i in the current mini-batch;
20:     Propagate the gradients to the bottom layers, updating θ;
21:  end while
Algorithm 1 DeepHash Algorithm

Compared with other sigmoid-based approximations in previous hashing algorithms (e.g., KSH [20]), ours only requires that the product of the projections w_k^T z_i and w_k^T z_j (rather than both of them individually) be sufficiently large. This bilinearity-oriented relaxation is more favorable for reducing the approximation error, which will be corroborated by the subsequent experiments.

Since the objective in Eqn. (5) is a composition of atomic losses on data pairs, we only need to instantiate the gradient computation for a specific data pair (x_i, x_j). Writing t_i = tanh(w_k^T z_i) and t_j = tanh(w_k^T z_j), applying basic calculus rules to Eqn. (9) and discarding some scaling factors, we first obtain

∂ℓ_k/∂t_i ∝ exp( −s_ij · p_{-k}(x_i, x_j) ) · β_ij · t_j,   ∂ℓ_k/∂t_j ∝ exp( −s_ij · p_{-k}(x_i, x_j) ) · β_ij · t_i,

and further using the calculus chain rule brings

∂ℓ_k/∂w_k ∝ exp( −s_ij · p_{-k}(x_i, x_j) ) · β_ij · ( t_j · ∂t_i/∂w_k + t_i · ∂t_j/∂w_k ),   ∂ℓ_k/∂z_i ∝ exp( −s_ij · p_{-k}(x_i, x_j) ) · β_ij · t_j · ∂t_i/∂z_i.

Importantly, the formulas below obviously hold by the construction of tanh(·):

∂t_i/∂w_k = (1 − t_i²) · z_i,   ∂t_i/∂z_i = (1 − t_i²) · w_k.    (11)

The gradient computations on the other deep network layers simply follow the regular calculus rules, and we thus omit the details.
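Under the reconstruction above, the per-bit surrogate of Eqn. (9) and the gradients emitted by the hashing-loss layer for a single pair and a single bit can be sketched as follows (a didactic NumPy version with our own variable names; the actual layer is implemented inside Caffe):

```python
import numpy as np

def per_bit_loss_and_grads(w_k, z_i, z_j, s_ij, p_rest, B):
    """Smoothed atomic loss for bit k (Eqn. 9) and its gradients w.r.t. w_k, z_i and z_j.

    w_k, z_i, z_j: 1-D arrays of equal length; s_ij in {-1, +1};
    p_rest: the frozen partial code product p_{-k}(x_i, x_j); B: number of bits.
    """
    t_i, t_j = np.tanh(w_k @ z_i), np.tanh(w_k @ z_j)
    # Linearization constants of exp(-(s_ij/B) * t) over t in {-1, +1}  (Eqn. 7)
    alpha = (np.exp(-s_ij / B) + np.exp(s_ij / B)) / 2.0
    beta = (np.exp(-s_ij / B) - np.exp(s_ij / B)) / 2.0
    weight = np.exp(-s_ij * p_rest)            # pair re-weighting by the frozen bits
    loss = weight * (alpha + beta * t_i * t_j)
    g_i, g_j = 1.0 - t_i ** 2, 1.0 - t_j ** 2  # d tanh(u)/du = 1 - tanh(u)^2  (Eqn. 11)
    d_w = weight * beta * (t_j * g_i * z_i + t_i * g_j * z_j)
    d_zi = weight * beta * t_j * g_i * w_k
    d_zj = weight * beta * t_i * g_j * w_k
    return loss, d_w, d_zi, d_zj
```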

2.2 Two-Stage Supervised Pre-Training

Deep hashing algorithms (including ours) mostly strive to optimize pairwise (or even triplet, as in [16]) similarity in Hamming space. This marks an intrinsic distinction from conventional applications of deep networks (such as image classification via AlexNet [13]): the total count of data pairs increases quadratically with the number of training samples, whereas in conventional applications the number of atomic losses in the objective only grows linearly. This entails a much larger mini-batch size in order to combat the numerical instability caused by under-sampling (for instance, a training set with 100,000 samples demands a mini-batch of 1,000 samples for a 1% sampling rate in image classification, whereas capturing pairwise similarity in deep hashing at a comparable rate requires a tremendous mini-batch of 10,000 samples), which unfortunately often exceeds the memory capacity of modern CPUs/GPUs.

We adopt a simple two-stage supervised pre-training approach as an effective network pre-conditioner, initializing the parameter values in an appropriate range for further supervised fine-tuning. In the first stage, the network (excluding the hashing-loss layer) is concatenated with a regular softmax layer. The network parameters are learned by optimizing the objective of a relevant semantics-learning task (e.g., image classification). After stage one is complete, we extract the neuron outputs of all training samples from the second topmost layer (i.e., the variables z in Section 2.1), feed them into another two-layer shallow network as shown in Figure 1, and initialize the hashing parameters w_1, …, w_B. Finally, all layers are jointly optimized in a fine-tuning process, minimizing the hashing loss objective J(W, θ). The entire procedure is illustrated in Figure 1 and detailed in Algorithm 1.
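For concreteness, the skeleton below mirrors the three phases of Algorithm 1 in PyTorch. It is our own sketch under the interpretation above, not the authors' Caffe implementation: the placeholder backbone, the 48-bit code length, the 10-class softmax heads, the learning rates and the full-batch updates are all assumptions, and the fine-tuning step uses autograd on the tanh-relaxed exponential loss instead of the leave-one-out linearization derived in Section 2.1.

```python
import torch
import torch.nn as nn

# Placeholder layers standing in for the actual convolutional architecture.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
hash_layer = nn.Linear(512, 48, bias=False)   # topmost linear transform; its sign gives the bits
classifier = nn.Linear(512, 10)               # softmax head used only in pre-training stage 1

def pretrain_stage1(images, labels, steps=100):
    """Stage 1: train backbone + softmax head on plain image classification."""
    opt = torch.optim.SGD(list(backbone.parameters()) + list(classifier.parameters()), lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(classifier(backbone(images)), labels)
        loss.backward()
        opt.step()

def pretrain_stage2(images, labels, steps=100):
    """Stage 2: freeze the backbone and initialize the hashing layer on its cached responses."""
    with torch.no_grad():
        Z = backbone(images)                   # responses of the second topmost layer
    head = nn.Linear(48, 10)                   # shallow head on top of the (relaxed) code layer
    opt = torch.optim.SGD(list(hash_layer.parameters()) + list(head.parameters()), lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(torch.tanh(hash_layer(Z))), labels)
        loss.backward()
        opt.step()

def finetune(images, S, steps=100):
    """Fine-tuning: all layers trained jointly on the smoothed exponential hashing loss.

    S: (n, n) float tensor of pairwise supervision with entries in {-1, 0, +1}.
    """
    opt = torch.optim.SGD(list(backbone.parameters()) + list(hash_layer.parameters()), lr=0.001)
    for _ in range(steps):
        opt.zero_grad()
        codes = torch.tanh(hash_layer(backbone(images)))   # relaxed bits in (-1, 1)
        P = codes @ codes.t() / codes.shape[1]             # normalized code product, Eqn. (3)
        loss = torch.exp(-S * P)[S != 0].sum()             # exponential loss, Eqns. (5)-(6)
        loss.backward()
        opt.step()
```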

3 Experiments

This section reports the quantitative evaluations between our proposed deep hashing algorithm and other competitors.

Description of Datasets: We conduct quantitative comparisons over four image benchmarks that represent different visual classification tasks: MNIST (http://yann.lecun.com/exdb/mnist/) for handwritten digit recognition; CIFAR10 (http://www.cs.toronto.edu/~kriz/cifar.html), a subset of the 80 Million Tiny Images dataset (http://groups.csail.mit.edu/vision/TinyImages/) consisting of images from ten animal or object categories; Kaggle-Face (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge), a Kaggle-hosted facial expression classification dataset created to stimulate research on facial feature representation learning; and SUN397 [30], a large-scale scene image dataset of 397 categories. Figure 2 shows exemplar images. For all selected datasets, different classes are completely mutually exclusive, so the similarity/dissimilarity sets in Eqn. (1) can be computed purely from label consensus. Table 1 summarizes the key statistics of these experimental data, where the feature dimension column refers to the number of neurons on the second topmost layer (i.e., the dimension of the feature vector z).

Dataset Train/Query Set #Class #Dim Feature
MNIST 50,000 / 10,000 10 500 CNN
CIFAR10 50,000 / 10,000 10 1,024 CNN
Kaggle-Face 315,799 / 7,178 7 2,304 CNN
SUN397 87,003 / 21,751 397 9,216 CNN
Table 1: Summary of the experimental benchmarks. Feature dimensions correspond to the neuron counts on the second topmost layer.
Figure 2: Exemplar images from MNIST, CIFAR10, Kaggle-Face and SUN397 datasets.

Implementation and Model Specification: We have implemented a substantially customized version of the open-source Caffe [12]. The proposed hashing-loss layer is patched into the original package and we also largely enrich Caffe's model specification grammar. Moreover, to ensure that mini-batches more faithfully represent the real distribution of pairwise affinities, we re-shuffle the training set at each iteration. This is approximately accomplished by the trick of random skipping, namely skipping the next few samples in the image database after adding one into the mini-batch (the number of skipped samples varies at each operation, parameterized by a random integer uniformly drawn from a pre-specified range in our experiments).
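A minimal sketch of the random-skipping trick (ours; the skip range is an assumed placeholder, since the exact range used in the experiments is not reproduced above):

```python
import random

def random_skip_minibatch(dataset_size, batch_size, max_skip=10, start=0, seed=None):
    """Approximate re-shuffling: after adding a sample, skip a random number of samples."""
    rng = random.Random(seed)
    batch, idx = [], start
    while len(batch) < batch_size:
        batch.append(idx % dataset_size)        # wrap around the image database
        idx += 1 + rng.randint(0, max_skip)     # skip the next few samples
    return batch

print(random_skip_minibatch(dataset_size=1000, batch_size=8, max_skip=10, seed=0))
```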

MNIST CIFAR10 Kaggle-Face SUN397
12 bits 24 bits 48 bits 12 bits 24 bits 48 bits 12 bits 24 bits 48 bits 12 bits 24 bits 48 bits
LSH [6] 0.3717 0.4933 0.5725 0.1311 0.1619 0.2034 0.1911 0.2011 0.1976 0.0057 0.0060 0.0071
ITQ [10] 0.7578 0.8132 0.8293 0.2711 0.2825 0.2909 0.2435 0.2513 0.2514 0.0268 0.0361 0.0471
PCAH [15] 0.4997 0.4607 0.3641 0.2056 0.1867 0.1695 0.2169 0.2058 0.1991 0.0218 0.0261 0.0315
SH [28] 0.5175 0.5330 0.4898 0.1935 0.1921 0.1750 0.2117 0.2054 0.2015 0.0210 0.0236 0.0273
LDAH [5] 0.5052 0.3685 0.3093 0.2187 0.1794 0.1587 0.2154 0.2032 0.1961 0.0224 0.0262 0.0306
BRE [14] 0.6950 0.7498 0.7785 0.2552 0.2668 0.2864 0.2414 0.2522 0.2587 0.0226 0.0293 0.0372
MLH [22] 0.6731 0.4404 0.4258 0.1737 0.1675 0.1737 0.2000 0.2115 0.2162 0.0070 0.0100 0.0210
KSH [20] 0.9537 0.9713 0.9817 0.3441 0.4617 0.5482 0.2862 0.3668 0.4132 0.0194 0.0261 0.0325
DH-1 [29] 0.957 0.963 0.960 0.439 0.511 0.522 – – – – – –
DH-1* [29] 0.969 0.975 0.975 0.465 0.521 0.532 – – – – – –
DH-2 [19] 0.4675 0.5101 0.5250 0.1880 0.2083 0.2251 – – – – – –
DH-3 [16] – – – 0.552 0.566 0.581 – – – – – –
DeepHash 0.9918 0.9931 0.9938 0.6874 0.7289 0.7410 0.5487 0.5552 0.5615 0.0748 0.1054 0.1293
Table 2: Experimental results in terms of mean average precision (mAP) under various hash bit lengths. The mAP scores are calculated based on Hamming ranking and lie in the numerical range [0, 1]. Best scores are highlighted in bold. We directly cite the performance reported in [29, 19, 16] since the source code is not publicly shared; "–" indicates that the corresponding scores are not available. Refer to the text for more details.

We designate the network layers for each dataset by referring to Caffe's model zoo [12]. Table 5 presents the deep network structure used for Kaggle-Face. The non-linear transform layers (e.g., ReLU and local normalization layers) are omitted due to space limits. Specifically, the softmax layer is used only for pre-training the first 7 layers and is not included during fine-tuning. We provide the network configuration information in the format of Caffe's grammar in the supplemental material.

Baselines and Evaluation Protocol: All the evaluations are conducted on a large-scale private cluster equipped with 12 NVIDIA Tesla K20 GPUs and 8 K40 GPUs. We denote the proposed algorithm as DeepHash. On the chosen benchmarks, DeepHash is compared against classic or state-of-the-art hashing schemes, including unsupervised methods such as random-projection-based LSH [6], PCAH, SH [28] and ITQ [10], and supervised methods such as LDAH [5], MLH [22], BRE [14] and KSH [20]. LSH and PCAH are evaluated using our own implementations. For the remaining baselines, we thank the authors for publicly sharing their code and adopt the parameters suggested in the original software packages. Moreover, to make the comparisons comprehensive, four previous deep hashing algorithms are also contrasted, denoted as DH-1 and DH-1* (two variants) from [29], DH-2 [19], and DH-3 [16]. Since the authors do not share their source code or model specifications, we instead cite the accuracies reported under identical (or similar) experimental settings.

Importantly, the performance of a hashing algorithm critically hinges on the semantic discriminative power of its input features. Previous deep hashing works [29, 16] use traditional hand-crafted features (e.g., GIST and SIFT bag-of-words) for all baselines, which is not an optimal setting for a fair comparison with deep hashing. To rule out the effect of less discriminative features, we strictly feed all baselines (except for the four deep hashing algorithms from [29, 19, 16]) with features extracted from an intermediate layer of the corresponding networks used in deep hashing. Specifically, after the first supervised pre-training stage in Algorithm 1 is completed, we re-arrange the neuron responses on the layer right below the hashing-loss layer (e.g., layer #7 in Table 5) into vector format (namely the variables z) and feed them into the baselines.

All methods share identical training and query sets. After the hashing functions are learned on the training set, all methods produce binary hash codes for the query data. There exist multiple search strategies using hash codes for image search, such as hash table lookup [1] and sparse-coding-style criteria [18]. Following recent hashing works, we only carry out Hamming ranking once the hashing functions are learned, which refers to ranking the retrieved samples by their Hamming distances to the query. Under the Hamming ranking protocol, we measure each algorithm using both mean-average-precision (mAP) scores and precision-recall curves.
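For reference, Hamming-ranking mAP can be computed as in the following sketch (ours; ties are broken by stable sort order):

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """Hamming-ranking mAP: codes are {-1, +1} matrices, labels are 1-D class arrays."""
    B = query_codes.shape[1]
    # For +/-1 codes, Hamming distance = (B - code product) / 2.
    dist = (B - query_codes @ db_codes.T) / 2
    aps = []
    for q in range(len(query_codes)):
        order = np.argsort(dist[q], kind="stable")                  # nearest first
        relevant = (db_labels[order] == query_labels[q]).astype(float)
        if relevant.sum() == 0:
            continue
        precision_at_k = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```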

Figure 3: Experimental results in terms of mean average precision (mAP) under varying hash code lengths for all algorithms. Best viewed in color.
MNIST CIFAR10 Kaggle-Face SUN397
12 bits 24 bits 48 bits 12 bits 24 bits 48 bits 12 bits 24 bits 48 bits 12 bits 24 bits 48 bits
DeepHash (random init.) 0.9806 0.9862 0.9873 0.5728 0.6503 0.6585 0.4125 0.4473 0.4620 0.0211 0.0384 0.0360
DeepHash (pre-training) 0.9673 0.9753 0.9796 0.4986 0.5588 0.5966 0.4282 0.4484 0.4589 0.0335 0.0430 0.0592
DeepHash (fine-tuning) 0.9918 0.9931 0.9938 0.6874 0.7289 0.7410 0.5487 0.5552 0.5615 0.0748 0.1054 0.1293
Table 3: Comparisons of three strategies of parameter initialization and learning for the proposed DeepHash. See text for more details.

Investigation of Hamming Ranking Results: Table 2 and Figure 3 show the mAP scores for our proposed DeepHash algorithms (with supervised pre-training and fine-tuning) and all baselines. To clearly depict the evolving accuracies with respect to the search radius, Figure 4 displays the precision-recall curves for all algorithms with 32 hash bits. There are three key observations from these experimental results that we would highlight:

1) On all four datasets, our proposed DeepHash algorithm performs significantly better than all baselines in terms of mAP. Among the algorithms not based on deep networks, KSH achieves the best accuracies on MNIST, CIFAR10 and Kaggle-Face, and ITQ shows the top performance on SUN397. Using 48 hash bits, the best mAP scores obtained by KSH or ITQ are 0.9817, 0.5482, 0.4132 and 0.0471 on MNIST / CIFAR10 / Kaggle-Face / SUN397, respectively. In comparison, our proposed DeepHash performs nearly perfectly on MNIST (0.9938), and defeats KSH and ITQ by very large margins on the other three datasets, scoring 0.7410, 0.5615 and 0.1293, respectively.

2) We also include four deep hashing algorithms by referring to the accuracies reported in their original publications. Recall that the evaluations in [29, 16] feed baseline algorithms with non-CNN features (e.g., GIST). Interestingly, our experiments reveal that, when conventional hashing algorithms take CNN features as input, the relative performance gain of prior deep hashing algorithms becomes marginal. For example, under 48 hash bits, KSH's mAP score of 0.5482 is comparable to DH-3's 0.581. We attribute the striking superiority of our proposed deep hashing algorithm to jointly conducting feature engineering and hash function learning (i.e., the fine-tuning process in Algorithm 1).

3) Elevating inter-bit complementarity is crucial for the final performance. For methods that generate hash bits independently (such as LSH) or by enforcing performance-irrelevant inter-bit constraints (such as LDAH), the mAP scores only show slight gains, or even drop, when the hash code length increases. Among all algorithms, the two code-product oriented algorithms, KSH and our proposed DeepHash, show steady improvement with more hash bits. Moreover, our results also validate some known insights from previous works, such as the advantage of supervised hashing methods over unsupervised alternatives.

Figure 4: Precision-recall curves under 32 hash bits on all image benchmarks.
CIFAR10
Hand-Crafted Feature CNN Feature
16 bits 32 bits 16 bits 32 bits
LSH 0.1215 0.1385 0.1354 0.1752
ITQ 0.1528 0.1604 0.2757 0.2862
BRE 0.1308 0.1362 0.2634 0.2803
MLH 0.1373 0.1334 0.1810 0.1800
KSH 0.2191 0.2081 0.3958 0.5039
DeepHash 0.2166 0.2304 0.5472 0.5674
SUN397
Hand-Crafted Feature CNN Feature
16 bits 32 bits 16 bits 32 bits
LSH 0.0067 0.0072 0.0059 0.0063
ITQ 0.0159 0.0157 0.0309 0.0410
BRE 0.0070 0.0075 0.0252 0.0319
MLH 0.0148 0.0147 0.0083 0.0144
KSH 0.0105 0.0095 0.0216 0.0300
DeepHash 0.0166 0.0189 0.0387 0.0525
Table 4: mAP scores using hand-crafted features and CNN features in hashing-based image search. The method “DeepHash” refers to the variant without fine-tuning. The top and bottom tables correspond to the results on CIFAR10 and SUN397 respectively.

Effect of Supervised Pre-Training: We now further highlight the effectiveness of the two-stage supervised pre-training process. To this end, Table 3 shows the mAP scores achieved by three different strategies of learning the network parameters. The scheme "DeepHash (random init.)" initializes all parameters with random numbers without any pre-training; a typical supervised gradient back-propagation procedure as in AlexNet [13] is then used. The second scheme, "DeepHash (pre-training)", initializes the network using the two-stage pre-training in Algorithm 1, without any subsequent fine-tuning; it serves as an appropriate baseline for assessing the benefit of the fine-tuning process used in the third scheme, "DeepHash (fine-tuning)". In all cases, the learning rate of gradient descent drops by a constant factor (0.1 in all of our experiments) until training converges.

There are two major observations from the results in Table 3. First, simultaneously tuning all the layers (including the hashing-loss layer) often significantly boosts the performance. As a key piece of evidence, "DeepHash (random init.)" demonstrates prominent superiority on MNIST and CIFAR10 compared with "DeepHash (pre-training)"; the joint parameter tuning of "DeepHash (random init.)" is supposed to compensate for the low-quality random parameter initialization. Secondly, positioning the initial solution near a "good" local optimum is crucial for learning on challenging data. For example, SUN397 has as many as 397 unique scene categories, yet due to GPU memory limitations, even a K40 GPU with 12GB memory supports a mini-batch of at most 600 samples. Stated differently, each mini-batch comprises only about 1.5 samples per category on average, which results in heavily biased sampling of the pairwise affinities. We attribute the relatively low accuracies of "DeepHash (random init.)" on SUN397 to this issue. In contrast, training the deep networks with both supervised pre-training and fine-tuning (i.e., the third scheme in Table 3) exhibits robust performance over all datasets.

Comparison with Hand-Crafted Features: To complement a comparison missing from other deep hashing works [29, 19, 16], we also compare the hashing performance obtained with conventional hand-crafted features and with CNN features extracted from our second topmost layer. Following the choices in the relevant literature, we extract an 800-D GIST feature from CIFAR10 images and a 5000-D DenseSIFT bag-of-words feature from SUN397 images. The comparisons under 16 and 32 hash bits are found in Table 4, exhibiting huge performance gaps between the two kinds of features. This clearly reveals how feature quality impacts the final performance of a hashing algorithm, and shows that a fair setting should be established when comparing conventional and deep hashing algorithms.

4 Concluding Remarks

In this paper a novel image hashing technique is presented. We attribute the success of the proposed deep hashing to the following aspects: 1) it jointly performs feature engineering and hash function learning, rather than feeding hand-crafted visual features to hashing algorithms; 2) the proposed exponential loss function fits the paradigm of mini-batch based training well, and the treatment in Eqn. (10) naturally encourages inter-bit complementarity; and 3) to combat the under-sampling issue in the training phase, we introduce the idea of two-stage supervised pre-training and validate its effectiveness through comparisons.

Our comprehensive quantitative evaluations consistently demonstrate the power of deep hashing for the data hashing task. The proposed algorithm enjoys both scalability to large training data and millisecond-level testing time for processing a new image. We thus believe that deep hashing is promising for efficiently analyzing visual big data.

References

  • [1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
  • [2] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn. Beyond locality-sensitive hashing. CoRR, abs/1306.1547, 2013.
  • [3] Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, Jan. 2009.
  • [4] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60:327–336, 1998.
  • [5] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 2012.
  • [6] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
  • [7] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In BMVC, 2008.
  • [8] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
  • [9] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In STOC, 2004.
  • [10] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
  • [11] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
  • [12] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [14] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
  • [15] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE Transactions Pattern Analysis and Machine Intelligence, 34(6):1092–1104, 2012.
  • [16] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, 2015.
  • [17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
  • [18] G. Lin, C. Shen, and A. van den Hengel. Supervised hashing using graph cuts and boosted decision trees. IEEE Trans. Pattern Anal. Mach. Intell., 37(11):2317–2331, 2015.
  • [19] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.
  • [20] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. In CVPR, 2012.
  • [21] Y. Mu, J. Shen, and S. Yan. Weakly-supervised hashing in kernel space. In CVPR, 2010.
  • [22] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
  • [23] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, 2012.
  • [24] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009.
  • [25] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
  • [26] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
  • [27] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393–2406, 2012.
  • [28] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
  • [29] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, 2014.
  • [30] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

Appendix: Network Configurations

Tables 6-8 present the configurations of the deep networks used for the selected benchmarks. The non-linear transform layers are mainly ReLU (rectified linear unit) and LRN (local response normalization). Specifically, the softmax layers are used only for pre-training the convolutional/innerProduct layers and are not included during fine-tuning; they are thus not assigned layer IDs in these tables.

Layer ID Layer Type Filter / Stride #Dim of Output
1 data N/A
2 convolution / 1
3 max-pooling / 2
4 convolution / 1
5 avg-pooling / 2
6 convolution / 1
7 avg-pooling / 2
soft-max N/A 10
8 hash-loss N/A #(bit number)
Table 5: Network configuration for the deep hashing task on Kaggle-Face. Information for other three benchmarks is found in the supplemental material.
Layer ID Layer Type Filter / Stride #Dim of Output
1 data N/A
2 convolution / 1
3 max-pooling / 2
4 convolution / 1
5 max-pooling / 2
6 innerProduct N/A 500
7 ReLU N/A 500
soft-max N/A 10
8 hash-loss N/A #(bit number)
Table 6: Network configuration for the deep hashing task on the MNIST digit data.
Layer ID Layer Type Filter / Stride #Dim of Output
1 data N/A
2 convolution / 1
3 max-pooling / 2
4 ReLU N/A
5 LRN N/A
6 convolution / 1
7 ReLU N/A
8 avg-pooling / 2
9 LRN N/A
10 convolution / 1
11 ReLU N/A
12 avg-pooling / 2
soft-max N/A 7 or 10
13 hash-loss N/A #(bit number)
Table 7: Network configuration for the deep hashing tasks on Kaggle-Face and CIFAR10. The dimension of output in the softmax layer is 7 for Kaggle-Face and 10 for CIFAR10.
Layer ID Layer Type Filter / Stride #Dim of Output
1 data N/A
2 convolution / 4
3 ReLU N/A
4 max-pooling / 2
5 LRN N/A
6 convolution / 1
7 ReLU N/A
8 max-pooling / 2
9 LRN N/A
10 convolution / 1
11 ReLU N/A
12 convolution / 1
13 ReLU N/A
14 convolution / 1
15 ReLU N/A
16 max-pooling / 2
soft-max N/A 397
17 hash-loss N/A #(bit number)
Table 8: Network configuration for the deep hashing task on the SUN397 image benchmark.