1 Introduction
Recent years have witnessed substantial progress on similarity-based hash code learning in a variety of computer vision tasks, such as image search [7], object recognition [26], and local descriptor compression [5]. Hash codes are highly compact in most cases (e.g., several bytes per image), which significantly reduces the overhead of storing large-scale visual data and expedites similarity-based image search. The theoretical foundation of similarity-oriented hashing is rooted in the Johnson-Lindenstrauss theorem [8], which states that for arbitrary $n$ samples, some $O(\log n)$-dimensional subspace exists and can be found in polynomial time. When the samples are embedded into this subspace, their pairwise affinities are preserved with tight approximation error bounds. This seminal theoretical result sheds light on trading similarity preservation for high compression of large data sets. The classic locality-sensitive hashing (LSH) [11] is a good demonstration of this tradeoff, instantiated for various similarity metrics such as Hamming distance [11][6], $\ell_p$ distance with $p \in (0, 2]$ [9][4], and Euclidean distance [2].
Images are often accompanied by supervised information in various forms, such as semantically similar / dissimilar data pairs. Supervised hash code learning [21, 27] harnesses such supervisory information during parameter optimization and has demonstrated superior image search accuracy compared with unsupervised hashing algorithms [1, 28, 10]. Exemplar supervised hashing schemes include LDAHash [5], two-step hashing [18], and kernel-based supervised hashing [20].
Importantly, two factors are known to be crucial for hashing-based image search accuracy: the discriminative power of the features and the choice of hashing functions. In a typical pipeline of existing hashing methods, these two factors are treated separately. Each image is often represented by a vector of hand-crafted visual features (such as SIFT-based bag-of-words features or sparse codes). Regarding hash functions, a large body of existing work has adopted linear functions owing to their simplicity. More recently, researchers have also explored a number of nonlinear hashing functions, such as the anchor-based kernelized hashing function [20] and the decision-tree-based function [18].
This paper attacks the problem of supervised hashing by concurrently conducting visual feature engineering and hash function learning. Most existing image features are designed for general computer vision tasks. Intuitively, by unifying these two sub-tasks in the same formulation, one can expect the extracted image features to be more amenable to the hashing purpose. Our work is inspired by the recent prevalence and success of deep learning techniques [17, 3, 13]. Though the effectiveness of deep learning has been successfully demonstrated in tasks like image classification [13] and face analysis [25], deep learning for supervised hashing remains inadequately explored in the literature.
Salakhutdinov et al. proposed semantic hashing in [24], where stacked Restricted Boltzmann Machines (RBMs) are employed for hash code generation. Nonetheless, the algorithm is primarily devised for indexing textual data, and its extension to visual data is unclear. Xia et al. [29] adopted a two-step hashing strategy similar to [18]. It first factorizes the data similarity matrix to obtain the target binary code for each image. In the next stage, the target codes and the image labels are jointly utilized to guide the network parameter optimization. Since the target codes are not updated once approximately learned in the first stage, the final model is only sub-optimal. Lai et al. [16] developed a convolutional deep network for hashing, comprised of shared sub-networks and a divide-and-encode module. However, the parameters of these two components are still learned separately. After the shared sub-networks are initialized, their parameters (including all convolutional/pooling layers) are frozen while the divide-and-encode module is optimized. Intrinsically, the method in [16] should be categorized as two-step hashing rather than simultaneous feature / hashing learning. Liong et al. [19] presented a binary encoding network built purely with fully-connected layers. The method essentially assumes that the visual features (e.g., GIST as used in the experiments therein) have been learned elsewhere and are fed into its first layer as the input.
As revealed by the above literature overview, a deep hashing method that simultaneously learns the features and hash codes is still missing in this research field, which inspires our work. The key contributions of this work include:

We propose the first deep hashing algorithm of its kind, which performs concurrent feature and hash function learning over a unified network.

We investigate the key pitfalls in designing such deep networks. In particular, there are two major obstacles: computing gradients through non-differentiable binary hash codes, and pre-training the network so that optimization eventually settles at a "good" local optimum. To address the first issue, we propose an exponentiated hashing loss function and devise its bilinear smooth approximation, which enables effective gradient calculation and propagation. Moreover, an efficient pre-training scheme is proposed. We verify its effectiveness through comprehensive evaluations on real-world visual data.

The proposed deep hashing method establishes new performance records on four image benchmarks that are widely used in this research area. For instance, on the CIFAR-10 dataset, our method achieves a mean average precision of 0.73 for Hamming-ranking-based image search, a substantial improvement over the state-of-the-art methods (0.58 for [16] and 0.36 for [20]).
2 The Proposed Method
Throughout this paper we use bold symbols to denote vectors or matrices and italic ones for scalars, unless otherwise stated. Suppose a data set $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ with supervision information is provided as the input. Prior works on supervised hashing have considered various forms of supervision, including triplets of items $(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-})$ where the pair $(\mathbf{x}, \mathbf{x}^{+})$ is more alike than the pair $(\mathbf{x}, \mathbf{x}^{-})$ [21, 23, 16], pairwise similar/dissimilar relations [20], or the label of each sample. Observing that triplet-type supervision incurs tremendous complexity during hashing function learning and that semantic-level sample labels can be effortlessly converted into pairwise relations, hereafter the discussion focuses on pairwise supervision. Let $\mathcal{S}$ and $\mathcal{D}$ collect all similar and dissimilar pairs, respectively. For notational convenience, we further introduce a supervision matrix $\mathbf{Y} \in \{-1, 0, 1\}^{n \times n}$ as
$$Y_{i,j} = \begin{cases} 1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}, \\ -1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{D}, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$
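As a minimal sketch of how class labels convert into pairwise supervision, the snippet below builds a supervision matrix from a label vector. It assumes the convention that same-label pairs are similar (entry 1) and different-label pairs are dissimilar (entry -1); the function name `supervision_matrix` is hypothetical and not taken from the paper.

```python
import numpy as np

def supervision_matrix(labels):
    """Return Y with Y[i, j] = 1 for same-label pairs and -1 otherwise.

    Assumption: every sample is labeled, so no pair is left unspecified
    (no zero entries arise in this fully supervised case).
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1, -1)

labels = [0, 1, 0, 2]
Y = supervision_matrix(labels)
print(Y)
```

With partially labeled data, unlabeled pairs would keep the value 0 and contribute no supervision.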
Figure 1 illustrates our proposed pipeline for learning a deep convolutional network for supervised hashing. The network is comprised of two components: a topmost layer customized for the hashing task, and other conventional layers. The network takes $d \times d$ images with $c$ channels as the inputs. The $K$ neurons on the top layer output either $-1$ or $1$ as the hash code. Formally, each top neuron represents a hashing function $h_k(\mathbf{x}): \mathbb{R}^{d \times d \times c} \mapsto \{-1, 1\}$, $k = 1, \ldots, K$, where $\mathbf{x}$ denotes the 3D raw image. For notational clarity, let us denote the response vector on the second topmost layer as $\mathbf{z} = \phi(\mathbf{x})$, where $\phi(\cdot)$ implicitly defines the highly nonlinear mapping from the raw data to a specified intermediate layer.
For the topmost layer, we adopt a simple linear transformation followed by a signum operation, which is formally presented as
$$h_k(\mathbf{x}) = \mathrm{sign}\left(\mathbf{w}_k^{\top} \mathbf{z}\right) = \mathrm{sign}\left(\mathbf{w}_k^{\top} \phi(\mathbf{x})\right). \tag{2}$$
The remainder of this section first introduces the hashing loss function and the calculation of its smoothed surrogate in Section 2.1. More algorithmic details of our proposed pre-training and fine-tuning procedure are given in Section 2.2.
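The topmost layer described above (a linear transformation followed by a signum operation) can be sketched as follows. This is an illustration only: the sizes, the random weights, and the name `hash_code` are hypothetical, and the feature vector stands in for the response of the second topmost layer.

```python
import numpy as np

rng = np.random.default_rng(0)
num_bits, feat_dim = 12, 500                     # hypothetical sizes
W = rng.standard_normal((num_bits, feat_dim))    # one weight row per hash bit
z = rng.standard_normal(feat_dim)                # stand-in for the second-topmost layer response

def hash_code(W, z):
    """Linear transform followed by signum; sign(0) is mapped to +1 by assumption."""
    return np.where(W @ z >= 0, 1, -1)

code = hash_code(W, z)
print(code)
```

Mapping a zero response to +1 is a convention chosen here so that every bit lies in {-1, +1}.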
2.1 Exponentiated Code Product Optimization
The key purpose of supervised hashing is to elevate image search accuracy. This goal can be intuitively achieved by generating discriminative hash codes, such that similar data pairs can be distinguished from dissimilar pairs according to the Hamming distances calculated over the hash codes. A number of hashing loss functions have been devised following this design principle. In particular, Norouzi et al. [22] propose a hinge-like loss function. Critically, hinge loss is non-smooth and thus complicates gradient-based optimization. Two other works [20, 18] adopt smooth losses defined on the inner product between hash codes.
It remains largely unclear how to design optimal hashing loss functions in perceptron-like learning. The major complication stems from the discrete nature of hash codes, which prohibits direct gradient computation and propagation as in typical deep networks. Prior works have therefore investigated several tricks to mitigate this issue. Examples include optimizing a variational upper bound of the original non-smooth loss [22], or simply computing some heuristic-oriented sub-gradients [16]. In this work we advocate an exponential discrete loss function which directly optimizes the hash code product and enjoys a bilinear smoothed approximation. Compared with alternative hashing losses, we first show that the proposed exponential loss is arguably more amenable to mini-batch-based iterative updates, and later exhibit its empirical superiority in the experiments.
Let $\mathbf{b}(\mathbf{x}) = (h_1(\mathbf{x}), \ldots, h_K(\mathbf{x}))^{\top} \in \{-1, 1\}^{K}$ denote the $K$ hash bits in vector format for data object $\mathbf{x}$. We also use the notations $b_k(\mathbf{x})$ and $\mathbf{b}_{\backslash k}(\mathbf{x})$ to stand for bit $k$ of $\mathbf{b}(\mathbf{x})$ and the hash code with bit $k$ absent, respectively. As is widely known in the hashing literature [20], the code product admits a one-to-one correspondence to Hamming distance and is comparatively easier to manipulate. A normalized version of the code product, ranging over $[-1, 1]$, is
$$\mathbf{b}(\mathbf{x}_i) \circ \mathbf{b}(\mathbf{x}_j) = \frac{1}{K} \sum_{k=1}^{K} b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j), \tag{3}$$
and when bit $k$ is absent, the code product using partial hash codes is
$$\mathbf{b}_{\backslash k}(\mathbf{x}_i) \circ \mathbf{b}_{\backslash k}(\mathbf{x}_j) = \mathbf{b}(\mathbf{x}_i) \circ \mathbf{b}(\mathbf{x}_j) - \frac{1}{K}\, b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j). \tag{4}$$
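The one-to-one correspondence between code product and Hamming distance can be checked numerically. The sketch below assumes codes in {-1, +1} and a code product normalized by the code length K, in which case the product equals 1 - 2 * d_H / K; the function names and the example codes are illustrative.

```python
import numpy as np

def code_product(bi, bj):
    """Normalized code product (1/K) * sum_k b_k(x_i) * b_k(x_j), in [-1, 1]."""
    return float(bi @ bj) / len(bi)

def hamming_distance(bi, bj):
    """Number of bit positions where the two codes disagree."""
    return int(np.sum(bi != bj))

bi = np.array([1, -1, 1, 1, -1, 1, -1, -1])
bj = np.array([1, 1, 1, -1, -1, 1, -1, 1])
K = len(bi)

product = code_product(bi, bj)
distance = hamming_distance(bi, bj)
# One-to-one correspondence: product = 1 - 2 * distance / K.
recovered = 1.0 - 2.0 * distance / K

# Leave-one-out additivity: removing bit k subtracts its term from the product.
k = 3  # zero-based index, chosen arbitrarily
partial = product - bi[k] * bj[k] / K
print(product, distance, recovered, partial)
```

The additivity in the last step is what allows one bit to be treated at a time while the contribution of the remaining bits is held fixed.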
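Exponential Loss: Given the observation that $\mathbf{b}(\mathbf{x}_i) \circ \mathbf{b}(\mathbf{x}_j)$ faithfully indicates the pairwise similarity, we propose to minimize an exponentiated objective function defined as the accumulation over all data pairs:
$$\min_{\theta,\, \mathbf{w}_1, \ldots, \mathbf{w}_K} \; \mathcal{Q} = \sum_{i,j} \ell(\mathbf{x}_i, \mathbf{x}_j), \tag{5}$$
where $\theta$ represents the collection of parameters in the deep network excluding the hashing loss layer. The atomic loss term is
$$\ell(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-Y_{i,j} \cdot \left(\mathbf{b}(\mathbf{x}_i) \circ \mathbf{b}(\mathbf{x}_j)\right)\right). \tag{6}$$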
This loss function has several traits desirable for deep hashing compared with those in BRE [14], MLH [22], and KSH [20]. It establishes a more direct connection to the hashing function parameters by maximizing the correlation between code product and pairwise labeling. In comparison, BRE and MLH optimize the parameters by aligning Hamming distances with the original metric distances, or by enforcing the Hamming distances to be larger/smaller than pre-specified thresholds. Both formulations incur complicated optimization procedures, and their optimality conditions are unclear. KSH adopts a least-squares formulation for regressing the code product onto the target labels, for which a smooth surrogate for gradient computation is proposed. However, the surrogate deviates heavily from the original loss function due to its high nonlinearity.
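To make the exponentiated objective concrete, here is a small sketch that evaluates it on toy codes. It assumes the atomic loss takes the form exp(-Y_ij * (b_i ∘ b_j)) with the code product normalized by K, and that the sum runs over all ordered pairs including i = j; these conventions and the name `exponential_hashing_loss` are assumptions for illustration.

```python
import numpy as np

def exponential_hashing_loss(B, Y):
    """Sum over all pairs of exp(-Y_ij * normalized code product).

    B: (n, K) array of codes in {-1, +1}; Y: (n, n) supervision matrix.
    """
    K = B.shape[1]
    products = (B @ B.T) / K            # normalized code products in [-1, 1]
    return float(np.exp(-Y * products).sum())

labels = np.array([0, 0, 1, 1])
Y = np.where(labels[:, None] == labels[None, :], 1, -1)

# Codes that agree with the labels versus codes that contradict them.
B_good = np.array([[1, 1], [1, 1], [-1, -1], [-1, -1]])
B_bad = np.array([[1, 1], [-1, -1], [1, 1], [-1, -1]])

loss_good = exponential_hashing_loss(B_good, Y)
loss_bad = exponential_hashing_loss(B_bad, Y)
print(loss_good, loss_bad)
```

Codes whose products correlate with the supervision matrix give the smaller loss, which is the behavior the text describes.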
Gradient Computation: A prominent advantage of the exponential loss is its easy conversion into multiplicative form, which simplifies the derivation of its gradient. For presentation clarity, we hereafter focus only on the calculation conducted over the topmost hashing loss layer. Namely, $h_k(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_k^{\top} \mathbf{z})$ for bit $k$, where $\mathbf{z} = \phi(\mathbf{x})$ are the response values at the second top layer and $\mathbf{w}_k$ are the parameters to be learned for bit $k$ ($k = 1, \ldots, K$).
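Following the common practice in deep learning, two groups of quantities, $\partial \mathcal{Q} / \partial \mathbf{w}_k$, $k = 1, \ldots, K$, and $\partial \mathcal{Q} / \partial \mathbf{z}_i$ ($i$ ranges over the index set of the current mini-batch), need to be estimated on the hashing loss layer at each iteration. The former group of quantities is used for updating $\mathbf{w}_k$, $k = 1, \ldots, K$, and the latter is propagated backwards to the bottom layers. The additive algebra of the hash code product in Eqn. (3) inspires us to estimate the gradients in a leave-one-out mode. For the atomic loss in Eqn. (6), it is easily verified that
$$\ell(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-Y_{i,j} \cdot \left(\mathbf{b}_{\backslash k}(\mathbf{x}_i) \circ \mathbf{b}_{\backslash k}(\mathbf{x}_j)\right)\right) \cdot \exp\left(-\frac{1}{K}\, Y_{i,j}\, b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j)\right),$$
where only the latter factor is related to $\mathbf{w}_k$. Since the product $b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j)$ can only be $-1$ or $1$, we can linearize the latter factor by exhaustively enumerating all possible values, namely
$$\exp\left(-\frac{1}{K}\, Y_{i,j}\, b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j)\right) = c_{i,j} - c'_{i,j} \cdot b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j), \tag{7}$$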
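The linearization rests on an elementary identity: for p in {-1, +1}, exp(-a * p) = cosh(a) - p * sinh(a). The check below assumes a = Y_ij / K and that the two sample-specific constants are the corresponding cosh and sinh values; the name `linearization_constants` and the value K = 12 are illustrative.

```python
import math

def linearization_constants(y, K):
    """Constants (c, c_prime) with exp(-(y/K) * p) = c - c_prime * p for p in {-1, +1}."""
    a = y / K
    return math.cosh(a), math.sinh(a)

K = 12  # hypothetical code length
max_error = 0.0
for y in (-1, 0, 1):          # possible supervision values
    c, c_prime = linearization_constants(y, K)
    for p in (-1, 1):         # possible values of the bit product
        exact = math.exp(-y * p / K)
        max_error = max(max_error, abs(exact - (c - c_prime * p)))
print(max_error)
```

Because the bit product takes only two values, the linear form is exact rather than an approximation; the approximation enters only when the signum is later relaxed.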
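where $c_{i,j}$ and $c'_{i,j}$ are two sample-specific constants, calculated by $c_{i,j} = \frac{1}{2}\left(e^{Y_{i,j}/K} + e^{-Y_{i,j}/K}\right)$ and $c'_{i,j} = \frac{1}{2}\left(e^{Y_{i,j}/K} - e^{-Y_{i,j}/K}\right)$. Since the hardness of calculating the gradient of Eqn. (7) lies in the bit product $b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j) = \mathrm{sign}\left(\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k\right)$, we replace the signum function with a sigmoid-shaped function $\sigma(\cdot)$, obtaining
$$\exp\left(-\frac{1}{K}\, Y_{i,j}\, b_k(\mathbf{x}_i)\, b_k(\mathbf{x}_j)\right) \approx c_{i,j} - c'_{i,j} \cdot \sigma\left(\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k\right). \tag{8}$$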
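Freezing the partial code product $\mathbf{b}_{\backslash k}(\mathbf{x}_i) \circ \mathbf{b}_{\backslash k}(\mathbf{x}_j)$, we define an approximate atomic loss with only bit $k$ active:
$$\ell_k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-Y_{i,j} \cdot \left(\mathbf{b}_{\backslash k}(\mathbf{x}_i) \circ \mathbf{b}_{\backslash k}(\mathbf{x}_j)\right)\right) \cdot \left(c_{i,j} - c'_{i,j} \cdot \sigma\left(\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k\right)\right), \tag{9}$$
where the first factor plays the role of re-weighting the specific data pair, conditioned on the remaining bits. Iterating over all $k$'s, the original loss function can now be approximated by
$$\mathcal{Q} \approx \frac{1}{K} \sum_{k=1}^{K} \sum_{i,j} \ell_k(\mathbf{x}_i, \mathbf{x}_j). \tag{10}$$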
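Compared with other sigmoid-based approximations in previous hashing algorithms (e.g., KSH [20]), ours only requires that $|\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k|$ (rather than both $|\mathbf{w}_k^{\top} \mathbf{z}_i|$ and $|\mathbf{w}_k^{\top} \mathbf{z}_j|$) is sufficiently large. This bilinearity-oriented relaxation is more favorable for reducing the approximation error, which will be corroborated by the subsequent experiments.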
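The benefit of relaxing the product of two signs jointly, rather than each sign separately, can be illustrated with a toy comparison. The sketch assumes a tanh-based sigmoid-shaped function with range (-1, 1), which may differ from the exact surrogate used in the paper; `u` and `v` are hypothetical stand-ins for the two linear responses of one hash bit on a data pair.

```python
import math

def sigma(x):
    """Assumed sigmoid-shaped surrogate for sign; equals 2 / (1 + exp(-x)) - 1."""
    return math.tanh(x / 2.0)

# One response has small magnitude, but the product of the two is large.
u, v = 0.1, 100.0

exact = math.copysign(1.0, u) * math.copysign(1.0, v)   # sign(u) * sign(v)
bilinear = sigma(u * v)          # relax the sign of the product jointly
separate = sigma(u) * sigma(v)   # relax each sign on its own

print(exact, bilinear, separate)
```

The joint relaxation stays close to the exact bit product as long as the product of responses is large, whereas the separate relaxation degrades whenever either response alone is small.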
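Since the objective in Eqn. (5) is a composition of atomic losses on data pairs, we only need to instantiate the gradient computation on a specific data pair $(\mathbf{x}_i, \mathbf{x}_j)$. Applying basic calculus rules and discarding some scaling factors, we first obtain
$$\frac{\partial \ell_k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \left(\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k\right)} \propto -\exp\left(-Y_{i,j} \cdot \left(\mathbf{b}_{\backslash k}(\mathbf{x}_i) \circ \mathbf{b}_{\backslash k}(\mathbf{x}_j)\right)\right) \cdot c'_{i,j} \cdot \sigma'\left(\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k\right),$$
and further using the chain rule brings
$$\frac{\partial \ell_k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \mathbf{w}_k} = \frac{\partial \ell_k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \left(\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k\right)} \cdot \left(\mathbf{z}_i \mathbf{z}_j^{\top} + \mathbf{z}_j \mathbf{z}_i^{\top}\right) \mathbf{w}_k, \qquad \frac{\partial \ell_k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \mathbf{z}_i} = \frac{\partial \ell_k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \left(\mathbf{w}_k^{\top} \mathbf{z}_i \mathbf{z}_j^{\top} \mathbf{w}_k\right)} \cdot \left(\mathbf{w}_k^{\top} \mathbf{z}_j\right) \mathbf{w}_k.$$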
Importantly, the formulas below obviously hold by the construction of :
(11) 
The gradient computations on the other network layers follow the regular chain rule, and we therefore omit the details.
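The hashing-layer gradients can be sanity-checked numerically. The sketch below assumes a per-bit approximate loss of the form r * (cosh(y/K) - sinh(y/K) * sigma(s)) with s = (w·z_i)(w·z_j), where r is the frozen re-weighting factor from the remaining bits and sigma is a tanh-based surrogate; all names, sizes, and values are hypothetical, and the analytic gradients are compared against central finite differences.

```python
import numpy as np

def sigma(x):
    return np.tanh(x / 2.0)                       # assumed sigmoid-shaped surrogate

def sigma_grad(x):
    return 0.5 * (1.0 - np.tanh(x / 2.0) ** 2)    # derivative of the surrogate

def atomic_loss(w, zi, zj, y, K, r):
    s = (w @ zi) * (w @ zj)                       # bilinear form w^T z_i z_j^T w
    return r * (np.cosh(y / K) - np.sinh(y / K) * sigma(s))

def atomic_grads(w, zi, zj, y, K, r):
    ui, uj = w @ zi, w @ zj
    coeff = -r * np.sinh(y / K) * sigma_grad(ui * uj)
    grad_w = coeff * (uj * zi + ui * zj)          # (z_i z_j^T + z_j z_i^T) w
    grad_zi = coeff * uj * w                      # (w^T z_j) w
    return grad_w, grad_zi

def numeric_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for d in range(x.size):
        step = np.zeros_like(x)
        step[d] = eps
        g[d] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w, zi, zj = (0.3 * rng.standard_normal(5) for _ in range(3))
y, K, r = 1, 12, 0.8                              # hypothetical pair label, code length, weight

grad_w, grad_zi = atomic_grads(w, zi, zj, y, K, r)
num_w = numeric_grad(lambda v: atomic_loss(v, zi, zj, y, K, r), w)
num_zi = numeric_grad(lambda v: atomic_loss(w, v, zj, y, K, r), zi)
print(np.abs(grad_w - num_w).max(), np.abs(grad_zi - num_zi).max())
```

The gradient with respect to the weights is used for the parameter update, while the gradient with respect to the features is what would be propagated to the lower layers.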
2.2 Two-Stage Supervised Pre-Training
Deep hashing algorithms (including ours) mostly strive to optimize pairwise (or even triplet, as in [16]) similarity in Hamming space. This raises an intrinsic distinction from conventional applications of deep networks (such as image classification via AlexNet [13]). The total count of data pairs grows quadratically with the number of training samples, whereas in conventional applications the number of atomic losses in the objective grows only linearly. This entails a much larger mini-batch size in order to combat the numerical instability caused by under-sampling, which unfortunately often exceeds the memory available on modern CPUs/GPUs. For instance, a training set with 100,000 samples demands a mini-batch of 1,000 samples for a 1% sampling rate in image classification. In contrast, in deep hashing, capturing pairwise similarity at the same rate requires a much larger mini-batch of 10,000 samples.
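The arithmetic behind this comparison is easy to reproduce. The sketch below counts ordered pairs without self-pairs, which is an assumed convention; the function names are illustrative.

```python
def sample_rate(batch_size, num_samples):
    """Fraction of training samples covered by one mini-batch."""
    return batch_size / num_samples

def pair_rate(batch_size, num_samples):
    """Fraction of all data pairs covered by the pairs inside one mini-batch."""
    return (batch_size * (batch_size - 1)) / (num_samples * (num_samples - 1))

n = 100_000
print(sample_rate(1_000, n))    # per-sample coverage for classification
print(pair_rate(1_000, n))      # pair coverage with the same mini-batch
print(pair_rate(10_000, n))     # pair coverage with a ten times larger mini-batch
```

A mini-batch of 1,000 covers 1% of the samples but only about 0.01% of the pairs; reaching roughly 1% pair coverage takes about 10,000 samples.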
We adopt a simple two-stage supervised pre-training approach as an effective network pre-conditioner, initializing the parameter values in an appropriate range for further supervised fine-tuning. In the first stage, the network (excluding the hashing loss layer) is connected to a regular softmax layer. The network parameters are learned by optimizing the objective of a relevant semantics learning task (e.g., image classification). After stage one is complete, we extract the neuron outputs of all training samples from the second topmost layer (i.e., the variables $\mathbf{z}$ in Section 2.1), feed them into another two-layer shallow network as shown in Figure 1, and initialize the hashing parameters $\mathbf{w}_k$, $k = 1, \ldots, K$. Finally, all layers are jointly optimized in a fine-tuning process that minimizes the hashing loss objective $\mathcal{Q}$. The entire procedure is illustrated in Figure 1 and detailed in Algorithm 1.
3 Experiments
This section reports quantitative comparisons between our proposed deep hashing algorithm and competing methods.
Description of Datasets: We conduct quantitative comparisons over four image benchmarks which represent different visual classification tasks. They include MNIST (http://yann.lecun.com/exdb/mnist/) for handwritten digit recognition; CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html), which is a subset of the 80 Million Tiny Images dataset (http://groups.csail.mit.edu/vision/TinyImages/) and consists of images from ten animal or object categories; Kaggle-Face (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge), which is a Kaggle-hosted facial expression classification dataset intended to stimulate research on facial feature representation learning; and SUN397 [30], which is a large-scale scene image dataset of 397 categories. Figure 2 shows exemplar images. For all selected datasets, different classes are mutually exclusive, such that the similarity/dissimilarity sets as in Eqn. (1) can be calculated purely based on label consensus. Table 1 summarizes the critical information of these experimental data, wherein the column of feature dimension refers to the number of neurons on the second topmost layer (i.e., the dimension of the feature vector $\mathbf{z}$).
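Table 1: Summary of the experimental benchmarks.

| Dataset | Train / Query Set | #Class | #Dim | Feature |
|---|---|---|---|---|
| MNIST | 50,000 / 10,000 | 10 | 500 | CNN |
| CIFAR-10 | 50,000 / 10,000 | 10 | 1,024 | CNN |
| Kaggle-Face | 315,799 / 7,178 | 7 | 2,304 | CNN |
| SUN397 | 87,003 / 21,751 | 397 | 9,216 | CNN |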
Implementation and Model Specification: We have implemented a substantially customized version of the open-source Caffe [12]. The proposed hashing loss layer is patched into the original package, and we also largely enrich Caffe's model specification grammar. Moreover, to ensure that mini-batches more faithfully represent the real distribution of pairwise affinities, we re-shuffle the training set at each iteration. This is approximately accomplished by the trick of random skipping, namely skipping the next few samples in the image database after adding one into the mini-batch. The number of skipped samples varies at each operation and is given by a random integer drawn uniformly from a fixed interval in our experiments.

Table 2: mAP scores of DeepHash and all baselines under 12, 24, and 48 hash bits. Entries marked "–" are not reported.

| Method | MNIST 12 bits | MNIST 24 bits | MNIST 48 bits | CIFAR-10 12 bits | CIFAR-10 24 bits | CIFAR-10 48 bits | Kaggle-Face 12 bits | Kaggle-Face 24 bits | Kaggle-Face 48 bits | SUN397 12 bits | SUN397 24 bits | SUN397 48 bits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSH [6] | 0.3717 | 0.4933 | 0.5725 | 0.1311 | 0.1619 | 0.2034 | 0.1911 | 0.2011 | 0.1976 | 0.0057 | 0.0060 | 0.0071 |
| ITQ [10] | 0.7578 | 0.8132 | 0.8293 | 0.2711 | 0.2825 | 0.2909 | 0.2435 | 0.2513 | 0.2514 | 0.0268 | 0.0361 | 0.0471 |
| PCAH [15] | 0.4997 | 0.4607 | 0.3641 | 0.2056 | 0.1867 | 0.1695 | 0.2169 | 0.2058 | 0.1991 | 0.0218 | 0.0261 | 0.0315 |
| SH [28] | 0.5175 | 0.5330 | 0.4898 | 0.1935 | 0.1921 | 0.1750 | 0.2117 | 0.2054 | 0.2015 | 0.0210 | 0.0236 | 0.0273 |
| LDAH [5] | 0.5052 | 0.3685 | 0.3093 | 0.2187 | 0.1794 | 0.1587 | 0.2154 | 0.2032 | 0.1961 | 0.0224 | 0.0262 | 0.0306 |
| BRE [14] | 0.6950 | 0.7498 | 0.7785 | 0.2552 | 0.2668 | 0.2864 | 0.2414 | 0.2522 | 0.2587 | 0.0226 | 0.0293 | 0.0372 |
| MLH [22] | 0.6731 | 0.4404 | 0.4258 | 0.1737 | 0.1675 | 0.1737 | 0.2000 | 0.2115 | 0.2162 | 0.0070 | 0.0100 | 0.0210 |
| KSH [20] | 0.9537 | 0.9713 | 0.9817 | 0.3441 | 0.4617 | 0.5482 | 0.2862 | 0.3668 | 0.4132 | 0.0194 | 0.0261 | 0.0325 |
| DH1 [29] | 0.957 | 0.963 | 0.960 | 0.439 | 0.511 | 0.522 | – | – | – | – | – | – |
| DH1 [29] | 0.969 | 0.975 | 0.975 | 0.465 | 0.521 | 0.532 | – | – | – | – | – | – |
| DH2 [19] | 0.4675 | 0.5101 | 0.5250 | 0.1880 | 0.2083 | 0.2251 | – | – | – | – | – | – |
| DH3 [16] | – | – | – | 0.552 | 0.566 | 0.581 | – | – | – | – | – | – |
| DeepHash | 0.9918 | 0.9931 | 0.9938 | 0.6874 | 0.7289 | 0.7410 | 0.5487 | 0.5552 | 0.5615 | 0.0748 | 0.1054 | 0.1293 |
We design the network layers for each dataset by referring to Caffe's model zoo [12]. Table 5 presents the deep network structure used for Kaggle-Face. The nonlinear transform layers (e.g., ReLU and local normalization layers) are omitted due to the space limit. The softmax layer is used only for pre-training the first 7 layers and is not included during fine-tuning. We provide the network configuration information in the format of Caffe's grammar in the supplemental material.
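The random-skipping trick for forming mini-batches, as described in the implementation notes, can be sketched as a simple generator. This is not the authors' Caffe implementation: the function name, the wrap-around behavior, and the `max_skip` parameter are hypothetical, since the source does not state the skip range.

```python
import random

def random_skip_batches(num_samples, batch_size, max_skip, seed=0):
    """Yield mini-batches of sample indices using random skipping.

    After each sample is added, a random number of the following samples
    (between 0 and max_skip, an assumed range) is skipped. The scan wraps
    around at the end of the database.
    """
    rng = random.Random(seed)
    pos = 0
    while True:
        batch = []
        while len(batch) < batch_size:
            batch.append(pos % num_samples)
            pos += 1 + rng.randint(0, max_skip)
        yield batch

batches = random_skip_batches(num_samples=1000, batch_size=8, max_skip=5)
first = next(batches)
second = next(batches)
print(first)
print(second)
```

Because the skip lengths are random, consecutive passes over the database produce different mini-batch compositions without an explicit global shuffle.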
Baselines and Evaluation Protocol: All evaluations are conducted on a large-scale private cluster equipped with 12 NVIDIA Tesla K20 GPUs and 8 K40 GPUs. We denote the proposed algorithm as DeepHash. On the chosen benchmarks, DeepHash is compared against classic or state-of-the-art competing hashing schemes, including unsupervised methods such as random-projection-based LSH [6], PCAH, SH [28], and ITQ [10], and supervised methods such as LDAH [5], MLH [22], BRE [14], and KSH [20]. LSH and PCAH are evaluated using our own implementations. For the remaining baselines, we thank the authors for publicly sharing their code and adopt the parameters suggested in the original software packages. Moreover, to make the comparisons comprehensive, four previous deep hashing algorithms are also contrasted: two variants from [29] (both listed as DH1), DH2 [19], and DH3 [16]. Since the authors do not share the source code or model specifications, we instead cite their reported accuracies under identical (or similar) experimental settings.
Importantly, the performance of a hashing algorithm critically hinges on the semantic discriminatory power of its input features. Previous deep hashing works [29, 16] use traditional hand-crafted features (e.g., GIST and SIFT bag-of-words) for all baselines, which is not an optimal setting for a fair comparison with deep hashing. To rule out the effect of less discriminative features, we feed all baselines (except for the four deep hashing algorithms from [29, 19, 16]) with features extracted from an intermediate layer of the corresponding networks used in deep hashing. Specifically, after the first supervised pre-training stage in Algorithm 1 is completed, we re-arrange the neuron responses on the layer right below the hashing loss layer (e.g., layer #7 in Table 5) into vector format (namely the variables $\mathbf{z}$) and feed them into the baselines.
All methods share identical training and query sets. After the hashing functions are learned on the training set, each method produces binary hash codes for the query data. There exist multiple strategies for using hash codes in image search, such as hash table lookup [1] and sparse-coding-style criteria [18]. Following recent hashing works, we only carry out Hamming ranking once the hashing functions are learned, which refers to the process of ranking the retrieved samples based on their Hamming distances to the query. Under the Hamming ranking protocol, we measure each algorithm using both mean-average-precision (mAP) scores and precision-recall curves.
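The Hamming ranking protocol and the mAP measure can be sketched as follows. The tie-breaking rule (stable sort by database order) and the exact average-precision definition are assumptions, since the text does not spell them out; the toy codes and labels are illustrative.

```python
import numpy as np

def hamming_distances(query, database):
    """Hamming distances between one {-1,+1} code and every database code."""
    K = database.shape[1]
    # For +/-1 codes, <q, b> = K - 2 * d_H, hence d_H = (K - <q, b>) / 2.
    return (K - database @ query) // 2

def average_precision(query, query_label, database, db_labels):
    """Average precision of the Hamming ranking for a single query."""
    order = np.argsort(hamming_distances(query, database), kind="stable")
    relevant = db_labels[order] == query_label
    if not relevant.any():
        return 0.0
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_rank = np.cumsum(relevant) / ranks
    return float((precision_at_rank * relevant).sum() / relevant.sum())

database = np.array([[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [1, -1, -1, -1]])
db_labels = np.array([0, 0, 1, 1])
queries = np.array([[1, 1, 1, 1], [-1, -1, -1, -1]])
query_labels = np.array([0, 0])

aps = [average_precision(q, l, database, db_labels)
       for q, l in zip(queries, query_labels)]
mean_ap = float(np.mean(aps))
print(aps, mean_ap)
```

The first query is coded consistently with its class and ranks both relevant items first; the second is coded inconsistently, which pushes the relevant items to the end of the ranking and lowers its average precision.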
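Table 3: mAP scores of DeepHash under three strategies of learning the network parameters.

| Method | MNIST 12 bits | MNIST 24 bits | MNIST 48 bits | CIFAR-10 12 bits | CIFAR-10 24 bits | CIFAR-10 48 bits | Kaggle-Face 12 bits | Kaggle-Face 24 bits | Kaggle-Face 48 bits | SUN397 12 bits | SUN397 24 bits | SUN397 48 bits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepHash (random init.) | 0.9806 | 0.9862 | 0.9873 | 0.5728 | 0.6503 | 0.6585 | 0.4125 | 0.4473 | 0.4620 | 0.0211 | 0.0384 | 0.0360 |
| DeepHash (pre-training) | 0.9673 | 0.9753 | 0.9796 | 0.4986 | 0.5588 | 0.5966 | 0.4282 | 0.4484 | 0.4589 | 0.0335 | 0.0430 | 0.0592 |
| DeepHash (fine-tuning) | 0.9918 | 0.9931 | 0.9938 | 0.6874 | 0.7289 | 0.7410 | 0.5487 | 0.5552 | 0.5615 | 0.0748 | 0.1054 | 0.1293 |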
Investigation of Hamming Ranking Results: Table 2 and Figure 3 show the mAP scores for our proposed DeepHash algorithm (with supervised pre-training and fine-tuning) and all baselines. To clearly depict the evolving accuracies with respect to the search radius, Figure 4 displays the precision-recall curves for all algorithms with 32 hash bits. We highlight three key observations from these experimental results:
1) On all four datasets, our proposed DeepHash algorithm performs significantly better than all baselines in terms of mAP. Among the non-deep-network-based algorithms, KSH achieves the best accuracies on MNIST, CIFAR-10, and Kaggle-Face, and ITQ shows the top performance on SUN397. Using 48 hash bits, the best mAP scores obtained by KSH or ITQ are 0.9817, 0.5482, 0.4132, and 0.0471 on MNIST / CIFAR-10 / Kaggle-Face / SUN397, respectively. In comparison, our proposed DeepHash performs nearly perfectly on MNIST (0.9938), and defeats KSH and ITQ by very large margins, scoring 0.7410, 0.5615, and 0.1293 on the other three datasets, respectively.
2) We also include four deep hashing algorithms by referring to the accuracies reported in the original publications. Recall that the evaluations in [29, 16] feed baseline algorithms with non-CNN features (e.g., GIST). Interestingly, our experiments reveal that, when conventional hashing algorithms take CNN features as the input, the relative performance gain of prior deep hashing algorithms becomes marginal. For example, under 48 hash bits on CIFAR-10, KSH's mAP score of 0.5482 is comparable to DH3's 0.581. We attribute the superiority of our proposed deep hashing algorithm to the importance of jointly conducting feature engineering and hash function learning (i.e., the fine-tuning process in Algorithm 1).
3) Elevating inter-bit mutual complementarity is crucial for the final performance. For those methods that generate hash bits independently (such as LSH) or by enforcing performance-irrelevant inter-bit constraints (such as LDAH), the mAP scores show only slight gains, or even drop, when the hash code length increases. Among all algorithms, the two code-product-oriented algorithms, KSH and our proposed DeepHash, show steady improvement when using more hash bits. Moreover, our results also validate some known insights from previous works, such as the advantage of supervised hashing methods over the unsupervised alternatives.
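Table 4: mAP scores with hand-crafted features versus CNN features under 16 and 32 hash bits.

CIFAR-10:

| Method | Hand-Crafted, 16 bits | Hand-Crafted, 32 bits | CNN, 16 bits | CNN, 32 bits |
|---|---|---|---|---|
| LSH | 0.1215 | 0.1385 | 0.1354 | 0.1752 |
| ITQ | 0.1528 | 0.1604 | 0.2757 | 0.2862 |
| BRE | 0.1308 | 0.1362 | 0.2634 | 0.2803 |
| MLH | 0.1373 | 0.1334 | 0.1810 | 0.1800 |
| KSH | 0.2191 | 0.2081 | 0.3958 | 0.5039 |
| DeepHash | 0.2166 | 0.2304 | 0.5472 | 0.5674 |

SUN397:

| Method | Hand-Crafted, 16 bits | Hand-Crafted, 32 bits | CNN, 16 bits | CNN, 32 bits |
|---|---|---|---|---|
| LSH | 0.0067 | 0.0072 | 0.0059 | 0.0063 |
| ITQ | 0.0159 | 0.0157 | 0.0309 | 0.0410 |
| BRE | 0.0070 | 0.0075 | 0.0252 | 0.0319 |
| MLH | 0.0148 | 0.0147 | 0.0083 | 0.0144 |
| KSH | 0.0105 | 0.0095 | 0.0216 | 0.0300 |
| DeepHash | 0.0166 | 0.0189 | 0.0387 | 0.0525 |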
Effect of Supervised Pre-Training: We now further highlight the effectiveness of the two-stage supervised pre-training process. To this end, Table 3 shows the mAP scores achieved by three different strategies of learning the network parameters. The scheme "DeepHash (random init.)" refers to initializing all parameters with random numbers without any pre-training. A typical supervised gradient back-propagation procedure as in AlexNet [13] is then used. The second scheme, "DeepHash (pre-training)", refers to initializing the network using the two-stage pre-training in Algorithm 1, without any subsequent fine-tuning. It serves as an appropriate baseline for assessing the benefit of the fine-tuning process in the third scheme, "DeepHash (fine-tuning)". In all cases, the learning rate in gradient descent drops by a constant factor (0.1 in all of our experiments) until the training converges.
There are two major observations from the results in Table 3. First, simultaneously tuning all the layers (including the hashing loss layer) often significantly boosts the performance. As key evidence, "DeepHash (random init.)" demonstrates prominent superiority over "DeepHash (pre-training)" on MNIST and CIFAR-10. The joint parameter tuning of "DeepHash (random init.)" presumably compensates for the low-quality random parameter initialization. Second, positioning the initial solution near a "good" local optimum is crucial for learning on challenging data. For example, the SUN397 dataset has as many as 397 unique scene categories. However, due to the limitation of GPU memory, even a K40 GPU with 12GB memory only supports a mini-batch of at most 600 samples. Stated differently, each mini-batch comprises only about 1.5 samples per category on average, which results in a heavily biased sampling of the pairwise affinities. We attribute the relatively low accuracies of "DeepHash (random init.)" to this issue. In contrast, training deep networks with both supervised pre-training and fine-tuning (i.e., the third scheme in Table 3) exhibits robust performance over all datasets.
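The under-sampling on SUN397 can be quantified with a short calculation. The sketch assumes, for illustration only, that the 397 categories are equally likely and that samples are drawn independently, which need not hold for the actual dataset.

```python
from math import comb

batch_size, num_classes = 600, 397

# Average number of samples per category in one mini-batch.
per_class = batch_size / num_classes

# Number of unordered pairs in one mini-batch.
total_pairs = comb(batch_size, 2)

# Under the balanced-class assumption, a random pair shares a label with
# probability 1 / num_classes, so this is the expected count of similar pairs.
expected_similar = total_pairs / num_classes

print(per_class, total_pairs, expected_similar)
```

Under these assumptions, only about a quarter of one percent of the pairs in a mini-batch are similar, so the similar-pair statistics seen by the hashing loss are very sparse.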
Comparison with Hand-Crafted Features: To complement a comparison missing from other deep hashing works [29, 19, 16], we also compare the hashing performance obtained with conventional hand-crafted features and with CNN features extracted from our second topmost layers. Following the choices in the relevant literature, we extract 800-D GIST features from CIFAR-10 images, and 5000-D Dense-SIFT bag-of-words features from SUN397 images. The comparisons under 16 and 32 hash bits are found in Table 4, exhibiting large performance gaps between these two kinds of features. The results clearly reveal how feature quality impacts the final performance of a hashing algorithm, and indicate that a fair setting should be established when comparing conventional and deep hashing algorithms.
4 Concluding Remarks
In this paper a novel image hashing technique is presented. We attribute the success of the proposed deep hashing to the following aspects: 1) it jointly conducts feature engineering and hash function learning, rather than feeding hand-crafted visual features to hashing algorithms; 2) the proposed exponential loss function fits the paradigm of mini-batch-based training well, and the treatment in Eqn. (10) naturally encourages inter-bit complementarity; and 3) to combat the under-sampling issue in the training phase, we introduce the idea of two-stage supervised pre-training and validate its effectiveness by comparisons.
Our comprehensive quantitative evaluations consistently demonstrate the power of deep hashing for the data hashing task. The proposed algorithm enjoys both scalability to large training data and millisecond-level testing time for processing a new image. We thus believe that deep hashing is promising for efficiently analyzing large-scale visual data.
References
[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
[2] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn. Beyond locality-sensitive hashing. CoRR, abs/1306.1547, 2013.
[3] Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, Jan. 2009.
[4] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60:327–336, 1998.
[5] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 34(1), 2012.
[6] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[7] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In BMVC, 2008.
[8] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
[9] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In STOC, 2004.
[10] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
[11] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
[12] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
[15] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell., 34(6):1092–1104, 2012.
[16] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, 2015.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
[18] G. Lin, C. Shen, and A. van den Hengel. Supervised hashing using graph cuts and boosted decision trees. IEEE Trans. Pattern Anal. Mach. Intell., 37(11):2317–2331, 2015.
[19] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.
[20] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. In CVPR, 2012.
[21] Y. Mu, J. Shen, and S. Yan. Weakly-supervised hashing in kernel space. In CVPR, 2010.
[22] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[23] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, 2012.
[24] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009.
[25] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
[26] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[27] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393–2406, 2012.
[28] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[29] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, 2014.
[30] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
Appendix: Network Configurations
The following tables present the configurations of the deep networks used for the selected benchmarks. The nonlinear transform layers are mainly ReLU (rectified linear unit) and LRN (local response normalization). The softmax layers are used only for pre-training the convolutional/innerProduct layers and are not included during fine-tuning; they are therefore not numbered in these tables.
Layer ID  Layer Type  Filter / Stride  #Dim of Output 

1  data  N/A  
2  convolution  / 1  
3  maxpooling  / 2  
4  convolution  / 1  
5  avgpooling  / 2  
6  convolution  / 1  
7  avgpooling  / 2  
softmax  N/A  10  
8  hashloss  N/A  #(bit number) 
Layer ID  Layer Type  Filter / Stride  #Dim of Output 

1  data  N/A  
2  convolution  / 1  
3  maxpooling  / 2  
4  convolution  / 1  
5  maxpooling  / 2  
6  innerProduct  N/A  500 
7  ReLU  N/A  500 
softmax  N/A  10  
8  hashloss  N/A  #(bit number) 
Layer ID  Layer Type  Filter / Stride  #Dim of Output 

1  data  N/A  
2  convolution  / 1  
3  maxpooling  / 2  
4  ReLU  N/A  
5  LRN  N/A  
6  convolution  / 1  
7  ReLU  N/A  
8  avgpooling  / 2  
9  LRN  N/A  
10  convolution  / 1  
11  ReLU  N/A  
12  avgpooling  / 2  
softmax  N/A  7 or 10  
13  hashloss  N/A  #(bit number) 
Layer ID  Layer Type  Filter / Stride  #Dim of Output 

1  data  N/A  
2  convolution  / 4  
3  ReLU  N/A  
4  maxpooling  / 2  
5  LRN  N/A  
6  convolution  / 1  
7  ReLU  N/A  
8  maxpooling  / 2  
9  LRN  N/A  
10  convolution  / 1  
11  ReLU  N/A  
12  convolution  / 1  
13  ReLU  N/A  
14  convolution  / 1  
15  ReLU  N/A  
16  maxpooling  / 2  
softmax  N/A  397  
17  hashloss  N/A  #(bit number) 