Recent years have witnessed spectacular progress on similarity-based hash code learning in a variety of computer vision tasks, such as image search, object recognition and local descriptor compression. The hash codes are highly compact (e.g., several bytes per image) in most cases, which significantly reduces the overhead of storing visual big data and also expedites similarity-based image search. The theoretical ground of similarity-oriented hashing is rooted in the Johnson-Lindenstrauss theorem [8], which states that for arbitrary $n$ samples, some $O(\log n)$-dimensional subspace exists and can be found in polynomial time. When embedded into this subspace, pairwise affinities among these samples are preserved with tight approximation error bounds. This seminal theoretical discovery sheds light on trading similarity preservation for high compression of large data sets. The classic locality-sensitive hashing (LSH) is a good demonstration of this tradeoff, instantiated for various similarity metrics such as Hamming distance, Euclidean distance and others [6, 4].
Images are often accompanied by supervised information in various forms, such as semantically similar / dissimilar data pairs. Supervised hash code learning [21, 27] harnesses such supervisory information during parameter optimization and has demonstrated superior image search accuracy compared with unsupervised hashing algorithms [1, 28, 10]. Exemplar supervised hashing schemes include LDAHash [5], two-step hashing, and kernel-based supervised hashing [20].
Importantly, two factors are known to be crucial for hashing-based image search accuracy: the discriminative power of the features and the choice of hashing functions. In a typical pipeline of existing hashing methods, these two factors are treated separately. Each image is often represented by a vector of hand-crafted visual features (such as SIFT-based bag-of-words features or sparse codes). Regarding hash functions, a large body of existing work has adopted linear functions owing to their simplicity. More recently, researchers have also explored a number of non-linear hashing functions, such as anchor-based kernelized hashing functions and decision-tree-based functions.
This paper attacks the problem of supervised hashing by concurrently conducting visual feature engineering and hash function learning. Most existing image features are designed for general computer vision tasks. Intuitively, by unifying these two sub-tasks in the same formulation, one can expect the extracted image features to be more amenable to the hashing purpose. Our work is inspired by the recent prevalence and success of deep learning techniques [17, 3, 13]. Though the unreasonable effectiveness of deep learning has been successfully demonstrated in tasks like image classification and face analysis, deep learning for supervised hashing remains inadequately explored in the literature.
Salakhutdinov et al. proposed semantic hashing in [24], where stacked Restricted Boltzmann Machines (RBMs) are employed for hash code generation. Nonetheless, the algorithm is primarily devised for indexing textual data and its extension to visual data is unclear. Xia et al. [29] adopted a two-step hashing strategy similar to earlier two-step hashing methods. It first factorizes the data similarity matrix to obtain the target binary code for each image. In the next stage, the target codes and the image labels are jointly utilized to guide the network parameter optimization. Since the target codes are not updated once approximately learned in the first stage, the final model is only sub-optimal. Lai et al. [16] developed a convolutional deep network for hashing, comprised of shared sub-networks and a divide-and-encode module. However, the parameters of these two components are still learned separately. After the shared sub-networks are initialized, their parameters (including all convolutional/pooling layers) are frozen while the divide-and-encode module is optimized. Intrinsically, the method in [16] should be categorized as two-step hashing, rather than simultaneous feature / hashing learning. Liong et al. [19] presented a binary encoding network built purely from fully-connected layers. The method essentially assumes that the visual features (e.g., GIST as used in the experiments therein) have been learned elsewhere and are fed into its first layer as the input.
As revealed by the above literature overview, a deep hashing method that simultaneously learns the features and hash codes remains missing in this research field, which inspires our work. The key contributions of this work include:
- We propose the first deep hashing algorithm of its kind, which performs concurrent feature and hash function learning over a unified network.
- We investigate the key pitfalls in designing such deep networks. In particular, there are two major obstacles: the gradient calculation from non-differentiable binary hash codes, and network pre-training in order to eventually stay at a “good” local optimum. To address the first issue, we propose an exponentiated hashing loss function and devise its bilinear smooth approximation. Effective gradient calculation and propagation are thereby enabled. Moreover, an efficient pre-training scheme is also proposed. We verify its effectiveness through comprehensive evaluations on real-world visual data.
- The proposed deep hashing method establishes new performance records on four image benchmarks which are widely used in this research area. For instance, on the CIFAR10 dataset, our method achieves a mean average precision of 0.73 for Hamming ranking based image search, which represents a drastic improvement over the state-of-the-art methods (0.58 and 0.36 for two previously reported deep hashing methods).
2 The Proposed Method
Throughout this paper we use bold symbols to denote vectors or matrices, and italic ones for scalars unless otherwise stated. Suppose a data set $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ with supervision information is provided as the input. Prior works on supervised hashing have considered various forms of supervision, including triplets of items $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k)$ where the pair $(\mathbf{x}_i, \mathbf{x}_j)$ is more alike than the pair $(\mathbf{x}_i, \mathbf{x}_k)$ [21, 23, 16], pairwise similar/dissimilar relations, or a label specified for each sample. Observing that triplet-type supervision incurs tremendous complexity during hashing function learning and that semantic-level sample labels can be effortlessly converted into pairwise relations, hereafter the discussion focuses on supervision in pairwise fashion. Let $\mathcal{S}$, $\mathcal{D}$ collect all similar / dissimilar pairs respectively. For notational convenience, we further introduce a supervision matrix $\mathbf{Y} \in \{-1, 0, 1\}^{n \times n}$ as

$$Y(i,j) = \begin{cases} 1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}, \\ -1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{D}, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$
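As an illustration of how sample labels convert into pairwise relations, the following minimal sketch builds a supervision matrix from class labels. It assumes the convention +1 for similar pairs and -1 for dissimilar pairs, and that classes are mutually exclusive so that no pair is left unlabeled; the function name `supervision_matrix` and the toy labels are hypothetical.

```python
def supervision_matrix(labels):
    """Build Y with Y[i][j] = 1 for same-label pairs and -1 otherwise.

    Assumption: every sample has exactly one label, so no pair needs the
    value 0 (which would mark a pair without supervision).
    """
    n = len(labels)
    return [[1 if labels[i] == labels[j] else -1 for j in range(n)]
            for i in range(n)]


labels = [0, 1, 0, 2]  # hypothetical class labels of four samples
Y = supervision_matrix(labels)
for row in Y:
    print(row)
```

The resulting matrix is symmetric, and its size grows quadratically with the number of samples.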
Figure 1 illustrates our proposed pipeline of learning a deep convolutional network for supervised hashing. The network is comprised of two components: a topmost layer meticulously customized for the hashing task, and other conventional layers. The network takes $p \times q$-sized images with $c$ channels as the inputs. The $K$ neurons on the top layer output either -1 or 1 as the hash code. Formally, each top neuron represents a hashing function $h_k(\mathbf{x}): \mathbb{R}^{p \times q \times c} \mapsto \{-1, 1\}$, where $\mathbf{x}$ denotes the 3-D raw image. For notational clarity, let us denote the response vector on the second topmost layer as $\mathbf{z} = \phi(\mathbf{x})$, where $\phi(\cdot)$ implicitly defines the highly non-linear mapping from the raw data to a specified intermediate layer.
For the topmost layer, we adopt a simple linear transformation with parameters $\mathbf{w}_k$, followed by a signum operation, which is formally presented as

$$h_k(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_k^\top \mathbf{z}) = \mathrm{sign}(\mathbf{w}_k^\top \phi(\mathbf{x})), \quad k = 1, \ldots, K. \tag{2}$$
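The topmost layer described above (a linear transformation followed by a signum operation) can be sketched in a few lines. This is only an illustration: the weight rows, the feature vector and the convention sign(0) = 1 are assumptions, and the names `sign` and `hash_code` are hypothetical.

```python
def sign(v):
    # Signum with the assumed convention sign(0) = 1, so each bit is -1 or 1.
    return 1 if v >= 0 else -1


def hash_code(W, z):
    """Return one bit per weight row: sign of the inner product with z."""
    return [sign(sum(w * x for w, x in zip(w_k, z))) for w_k in W]


W = [[0.5, -1.0, 0.2],   # hypothetical weights for 2 hash bits, 3-D feature
     [-0.3, 0.4, 0.9]]
z = [1.0, -0.5, -1.0]    # hypothetical response of the second topmost layer
print(hash_code(W, z))
```

Because the signum is piecewise constant, its gradient is zero almost everywhere, which is why a smooth surrogate is needed for training.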
The remainder of this section first introduces the hashing loss function and the calculation of its smoothed surrogate in Section 2.1. More algorithmic details of our proposed pretraining-finetuning procedure are delivered in Section 2.2.
2.1 Exponentiated Code Product Optimization
The key purpose of supervised hashing is to elevate the image search accuracy. The goal can be intuitively achieved by generating discriminative hash codes, such that similar data pairs can be perfectly distinguished from dissimilar pairs according to the Hamming distances calculated over the hash codes. A number of hashing loss functions have been devised using the above design principle. In particular, Norouzi et al. propose a hinge-like loss function. Critically, hinge loss is known to be non-smooth and thus complicates gradient-based optimization. Two other works [20, 18] adopt smooth losses defined on the inner product between hash codes.
How to design optimal hashing loss functions in perceptron-like learning remains largely unclear. The major complication stems from the discrete nature of hash codes, which prohibits direct gradient computation and propagation as in typical deep networks. As such, prior works have investigated several tricks to mitigate this issue. Examples include optimizing a variational upper bound of the original non-smooth loss, or simply computing some heuristic-oriented sub-gradients. In this work we advocate an exponential discrete loss function which directly optimizes the hash code product and enjoys a bilinear smoothed approximation. Compared with alternative hashing losses, we first show that the proposed exponential loss is arguably more amenable to mini-batch based iterative updates, and later exhibit its empirical superiority in the experiments.
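Let $\mathbf{b}_i = (b_i(1), \ldots, b_i(K))^\top \in \{-1, 1\}^K$ denote the $K$ hash bits in vector format for data object $\mathbf{x}_i$. We also use the notations $b_i(k)$, $\mathbf{b}_i(\backslash k)$ to stand for bit $k$ of $\mathbf{b}_i$ and the hash code with bit $k$ absent respectively. As a widely-known fact in the hashing literature, the code product admits a one-to-one correspondence to the Hamming distance and is comparably easier to manipulate. A normalized version of the code product ranging over $[-1, 1]$ is described as

$$\mathbf{b}_i \circ \mathbf{b}_j = \frac{1}{K} \sum_{k=1}^{K} b_i(k)\, b_j(k), \tag{3}$$

and when bit $k$ is absent, the code product using partial hash codes is

$$\mathbf{b}_i(\backslash k) \circ \mathbf{b}_j(\backslash k) = \mathbf{b}_i \circ \mathbf{b}_j - \frac{1}{K}\, b_i(k)\, b_j(k). \tag{4}$$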
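The correspondence between the normalized code product and the Hamming distance can be made concrete with a small sketch. It assumes codes with entries in {-1, 1} and normalization by the code length K; the helper names are hypothetical.

```python
def code_product(b_i, b_j):
    """Normalized code product: mean of bitwise products, in [-1, 1]."""
    return sum(p * q for p, q in zip(b_i, b_j)) / len(b_i)


def partial_code_product(b_i, b_j, k):
    """Code product with bit k left out (still normalized by the full K)."""
    return code_product(b_i, b_j) - b_i[k] * b_j[k] / len(b_i)


def hamming(b_i, b_j):
    """Number of positions where the two codes disagree."""
    return sum(p != q for p, q in zip(b_i, b_j))


b_i = [1, -1, 1, 1]
b_j = [1, 1, -1, 1]
K = len(b_i)
# Agreeing bits contribute +1 and disagreeing bits -1, hence
# code_product = 1 - 2 * hamming / K.
print(code_product(b_i, b_j), 1 - 2 * hamming(b_i, b_j) / K)
```

Since the mapping between the two quantities is affine, ranking by code product (descending) is equivalent to ranking by Hamming distance (ascending).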
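Exponential Loss: Given the observation that the code product $\mathbf{b}_i \circ \mathbf{b}_j$ faithfully indicates the pairwise similarity, we propose to minimize an exponentiated objective function defined as the accumulation over all data pairs:

$$(\theta^*, \{\mathbf{w}_k^*\}) = \arg\min_{\theta, \{\mathbf{w}_k\}} \mathcal{Q}, \qquad \mathcal{Q} = \sum_{i,j} \ell(\mathbf{x}_i, \mathbf{x}_j), \tag{5}$$

where $\theta$ represents the collection of parameters in the deep network excluding the hashing loss layer. The atomic loss term is

$$\ell(\mathbf{x}_i, \mathbf{x}_j) = \exp\big(-Y(i,j)\, (\mathbf{b}_i \circ \mathbf{b}_j)\big), \tag{6}$$

where $Y(i,j) \in \{-1, 0, 1\}$ is the supervision label of the pair.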
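To make the exponentiated objective concrete, here is a minimal sketch that evaluates it on given binary codes. It assumes an atomic loss of the form exp(-y * code product) with pair label y in {-1, 1}, and it skips pairs labeled 0 (an assumption, since such pairs only add a constant); the function names are hypothetical.

```python
import math


def atomic_loss(y, b_i, b_j):
    """exp(-y * normalized code product) for a single labeled pair."""
    prod = sum(p * q for p, q in zip(b_i, b_j)) / len(b_i)
    return math.exp(-y * prod)


def total_loss(Y, codes):
    """Accumulate the atomic loss over all labeled pairs (Y[i][j] != 0)."""
    n = len(codes)
    return sum(atomic_loss(Y[i][j], codes[i], codes[j])
               for i in range(n) for j in range(n) if Y[i][j] != 0)


codes = [[1, 1], [1, 1], [-1, -1]]       # hypothetical 2-bit codes
Y = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]
print(total_loss(Y, codes))  # every pair attains the minimum exp(-1)
```

A similar pair with identical codes, or a dissimilar pair with opposite codes, reaches the smallest possible value exp(-1); a violated pair is penalized by up to exp(1).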
This novel loss function enjoys some elegant traits desired by deep hashing compared with those in BRE [14], MLH [22] and KSH [20]. It establishes a more direct connection to the hashing function parameters by maximizing the correlation of code product and pairwise labeling. In comparison, BRE and MLH optimize the parameters by aligning Hamming distances with original metric distances or by enforcing the Hamming distance to be larger/smaller than pre-specified thresholds. Both formulations incur complicated optimization procedures, and their optimality conditions are unclear. KSH adopts a least-squares formulation for regressing the code product onto the target labels, where a smooth surrogate for gradient computation is proposed. However, the surrogate deviates heavily from the original loss function due to its high non-linearity.
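Gradient Computation: A prominent advantage of the exponential loss is its easy conversion into multiplicative form, which elegantly simplifies the derivation of its gradient. For presentation clarity, we hereafter focus only on the calculation conducted over the topmost hashing loss layer. Namely, for bit $k$, $b_i(k) = \mathrm{sign}(\mathbf{w}_k^\top \mathbf{z}_i)$, where $\mathbf{z}_i$ are the response values at the second top layer and $\mathbf{w}_k$ are parameters to be learned for bit $k$ ($k = 1, \ldots, K$).
Following the common practice in deep learning, two groups of quantities, $\partial \ell / \partial \mathbf{w}_k$ and $\partial \ell / \partial \mathbf{z}_i$ ($i$ ranges over the index set of the current mini-batch), need to be estimated on the hashing loss layer at each iteration. The former group is used for updating $\mathbf{w}_k$, and the latter is propagated backwards to the bottom layers. The additive algebra of the hash code product in Eqn. (3) inspires us to estimate the gradients in a leave-one-out mode. For the atomic loss in Eqn. (6), it is easily verified that

$$\ell(\mathbf{x}_i, \mathbf{x}_j) = \exp\big(-Y(i,j)\, (\mathbf{b}_i(\backslash k) \circ \mathbf{b}_j(\backslash k))\big) \cdot \exp\Big(-\frac{1}{K}\, Y(i,j)\, b_i(k)\, b_j(k)\Big), \tag{7}$$

where only the latter factor is related to $\mathbf{w}_k$. Since the product $b_i(k)\, b_j(k)$ can only be -1 or 1, we can linearize the latter factor by exhaustively enumerating all possible values, namely

$$\exp\Big(-\frac{1}{K}\, Y(i,j)\, b_i(k)\, b_j(k)\Big) = c_{ij} - c'_{ij}\, b_i(k)\, b_j(k), \tag{8}$$

where $c_{ij}$, $c'_{ij}$ are two sample-specific constants, calculated by $c_{ij} = \frac{1}{2}\big(e^{-Y(i,j)/K} + e^{Y(i,j)/K}\big)$ and $c'_{ij} = \frac{1}{2}\big(e^{Y(i,j)/K} - e^{-Y(i,j)/K}\big)$. Since the hardness of calculating the gradient of Eqn. (7) lies in the bit product $b_i(k)\, b_j(k) = \mathrm{sign}(\mathbf{w}_k^\top \mathbf{z}_i \mathbf{z}_j^\top \mathbf{w}_k)$, we replace the signum function with a sigmoid-shaped function $\sigma(\cdot)$ taking values in $(-1, 1)$, obtaining

$$b_i(k)\, b_j(k) \approx \sigma(u_{ij}^{(k)}), \qquad u_{ij}^{(k)} = \mathbf{w}_k^\top \mathbf{z}_i \mathbf{z}_j^\top \mathbf{w}_k. \tag{9}$$

Freezing the partial code product $\mathbf{b}_i(\backslash k) \circ \mathbf{b}_j(\backslash k)$, we define an approximate atomic loss with only bit $k$ active:

$$\ell^{(k)}(\mathbf{x}_i, \mathbf{x}_j) = \exp\big(-Y(i,j)\, (\mathbf{b}_i(\backslash k) \circ \mathbf{b}_j(\backslash k))\big) \cdot \big(c_{ij} - c'_{ij}\, \sigma(u_{ij}^{(k)})\big), \tag{10}$$

where the first factor plays the role of re-weighting the specific data pair, conditioned on the remaining bits. Iterating over all $k$'s, the original loss function can now be approximated by

$$\mathcal{Q} \approx \frac{1}{K} \sum_{k=1}^{K} \sum_{i,j} \ell^{(k)}(\mathbf{x}_i, \mathbf{x}_j). \tag{11}$$

Compared with other sigmoid-based approximations in previous hashing algorithms (e.g., KSH [20]), ours only requires that $|\mathbf{w}_k^\top \mathbf{z}_i \mathbf{z}_j^\top \mathbf{w}_k|$ (rather than both $|\mathbf{w}_k^\top \mathbf{z}_i|$ and $|\mathbf{w}_k^\top \mathbf{z}_j|$) is sufficiently large. This bilinearity-oriented relaxation is more favorable for reducing approximation error, which will be corroborated by the subsequent experiments.
Since the objective in Eqn. (5) is a composition of atomic losses on data pairs, we only need to instantiate the gradient computation on a specific data pair $(\mathbf{x}_i, \mathbf{x}_j)$. Applying basic calculus rules and discarding some scaling factors, we first obtain

$$\frac{\partial \ell^{(k)}(\mathbf{x}_i, \mathbf{x}_j)}{\partial u_{ij}^{(k)}} \propto -\exp\big(-Y(i,j)\, (\mathbf{b}_i(\backslash k) \circ \mathbf{b}_j(\backslash k))\big) \cdot c'_{ij} \cdot \sigma'(u_{ij}^{(k)}), \tag{12}$$

and further using the calculus chain rule brings

$$\frac{\partial \ell^{(k)}}{\partial \mathbf{w}_k} = \frac{\partial \ell^{(k)}}{\partial u_{ij}^{(k)}} \cdot \frac{\partial u_{ij}^{(k)}}{\partial \mathbf{w}_k}, \qquad \frac{\partial \ell^{(k)}}{\partial \mathbf{z}_i} = \frac{\partial \ell^{(k)}}{\partial u_{ij}^{(k)}} \cdot \frac{\partial u_{ij}^{(k)}}{\partial \mathbf{z}_i}. \tag{13}$$

Importantly, the formulas below hold by the construction of $u_{ij}^{(k)}$:

$$\frac{\partial u_{ij}^{(k)}}{\partial \mathbf{w}_k} = (\mathbf{z}_i \mathbf{z}_j^\top + \mathbf{z}_j \mathbf{z}_i^\top)\, \mathbf{w}_k, \qquad \frac{\partial u_{ij}^{(k)}}{\partial \mathbf{z}_i} = (\mathbf{w}_k^\top \mathbf{z}_j)\, \mathbf{w}_k, \qquad \frac{\partial u_{ij}^{(k)}}{\partial \mathbf{z}_j} = (\mathbf{w}_k^\top \mathbf{z}_i)\, \mathbf{w}_k. \tag{14}$$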
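The leave-one-out factorization and the linearization by enumeration discussed above can be checked numerically. The sketch below assumes the single-bit factor has the form exp(-y * p / K), with pair label y and bit product p both in {-1, 1}; the constants are derived from that assumption, and K = 8 as well as all function names are illustrative.

```python
import math


def constants(y, K):
    """Constants c, c' such that exp(-y * p / K) = c - c' * p for p = +/-1."""
    c = 0.5 * (math.exp(-y / K) + math.exp(y / K))
    c_prime = 0.5 * (math.exp(y / K) - math.exp(-y / K))
    return c, c_prime


def linearized(y, p, K):
    c, c_prime = constants(y, K)
    return c - c_prime * p


def factorized_loss(y, b_i, b_j, k):
    """Loss written as (factor from the other bits) * (factor from bit k)."""
    K = len(b_i)
    rest = sum(b_i[m] * b_j[m] for m in range(K) if m != k) / K
    return math.exp(-y * rest) * linearized(y, b_i[k] * b_j[k], K)


K = 8
for y in (-1, 1):
    for p in (-1, 1):
        assert abs(linearized(y, p, K) - math.exp(-y * p / K)) < 1e-12

b_i = [1, -1, 1, 1, -1, 1, -1, -1]
b_j = [1, 1, 1, -1, -1, 1, 1, -1]
full = math.exp(-1 * sum(p * q for p, q in zip(b_i, b_j)) / K)
print(full, factorized_loss(1, b_i, b_j, k=3))
```

The linear form is exact on binary bit products; an approximation error appears only once the bit product is replaced by a smooth surrogate.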
The gradient computations on the other deep network layers simply follow the regular calculus rules. We thus omit the details.
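The chain-rule computation on the hashing loss layer can be sanity-checked against finite differences. The sketch below is not the paper's implementation: it assumes the bilinear surrogate sigma(u) with u = (w . z_i)(w . z_j), picks sigma(x) = tanh(x / 2) as one possible sigmoid-shaped function, and treats the re-weighting factor r of the pair as a frozen constant. All names and numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sigma(x):
    # Assumed sigmoid-shaped surrogate for sign(x), with values in (-1, 1).
    return math.tanh(0.5 * x)


def c_prime(y, K):
    return 0.5 * (math.exp(y / K) - math.exp(-y / K))


def approx_loss(w, z_i, z_j, y, r, K):
    """r * (c - c' * sigma(u)), where u = (w . z_i) * (w . z_j)."""
    c = 0.5 * (math.exp(-y / K) + math.exp(y / K))
    return r * (c - c_prime(y, K) * sigma(dot(w, z_i) * dot(w, z_j)))


def grad_w(w, z_i, z_j, y, r, K):
    """Analytic gradient of approx_loss with respect to w."""
    a, b = dot(w, z_i), dot(w, z_j)
    # d sigma / du = 0.5 * (1 - tanh(u / 2) ** 2)
    dl_du = -r * c_prime(y, K) * 0.5 * (1.0 - sigma(a * b) ** 2)
    # du/dw = (z_i z_j^T + z_j z_i^T) w = b * z_i + a * z_j
    return [dl_du * (b * p + a * q) for p, q in zip(z_i, z_j)]


def numeric_grad_w(w, z_i, z_j, y, r, K, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grads = []
    for d in range(len(w)):
        hi, lo = list(w), list(w)
        hi[d] += eps
        lo[d] -= eps
        grads.append((approx_loss(hi, z_i, z_j, y, r, K)
                      - approx_loss(lo, z_i, z_j, y, r, K)) / (2 * eps))
    return grads


w = [0.3, -0.2, 0.5]
z_i, z_j = [1.0, 0.5, -1.0], [0.2, -0.4, 0.8]
analytic = grad_w(w, z_i, z_j, y=1, r=0.9, K=8)
numeric = numeric_grad_w(w, z_i, z_j, y=1, r=0.9, K=8)
print(max(abs(p - q) for p, q in zip(analytic, numeric)))
```

The gradient with respect to z_i follows the same pattern, with du/dz_i = (w . z_j) w in place of du/dw.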
2.2 Two-Stage Supervised Pre-Training
Deep hashing algorithms (including ours) mostly strive to optimize pairwise (or even triplet) similarity in Hamming space. This raises an intrinsic distinction compared with conventional applications of deep networks (such as image classification via AlexNet [13]). The total count of data pairs increases quadratically with the number of training samples, whereas in conventional applications the number of atomic losses in the objective grows only linearly. This entails a much larger mini-batch size in order to combat numerical instability caused by under-sampling, which unfortunately often exceeds the maximal memory space on modern CPUs/GPUs. (For instance, a training set with 100,000 samples demands a mini-batch of 1,000 data for a 1% sampling rate in image classification. In contrast, in deep hashing, capturing pairwise similarity at the same rate requires a tremendous mini-batch of 10,000 data.)
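The arithmetic behind the mini-batch sizes in this example can be reproduced with a short sketch. It assumes the sampling rate is measured as the fraction of atomic losses covered by one mini-batch, counting ordered pairs for the pairwise case; the function names are hypothetical.

```python
import math


def sample_rate(n_train, batch):
    """Fraction of per-sample losses covered by one mini-batch."""
    return batch / n_train


def pair_rate(n_train, batch):
    """Fraction of (ordered) pairwise losses covered by one mini-batch."""
    return (batch / n_train) ** 2


def batch_for_pair_rate(n_train, rate):
    """Mini-batch size needed to cover the given fraction of all pairs."""
    return round(n_train * math.sqrt(rate))


n = 100_000
print(sample_rate(n, 1_000))         # per-sample losses, batch of 1,000
print(pair_rate(n, 1_000))           # pairwise losses, same batch
print(batch_for_pair_rate(n, 0.01))  # batch needed for 1% of the pairs
```

With the same mini-batch of 1,000, the pairwise coverage drops by a further factor of 100 relative to the per-sample coverage.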
We adopt a simple two-stage supervised pre-training approach as an effective network pre-conditioner, initializing the parameter values in an appropriate range for further supervised fine-tuning. In the first stage, the network (excluding the hashing loss layer) is connected to a regular softmax layer. The network parameters are learned by optimizing the objective of a relevant semantics learning task (e.g., image classification). After stage one is complete, we extract the neuron outputs of all training samples from the second topmost layer (i.e., the vectors $\mathbf{z}_i$ in Section 2.1), feed them into another two-layer shallow network as shown in Figure 1, and initialize the hashing parameters $\mathbf{w}_k$, $k = 1, \ldots, K$. Finally, all layers are jointly optimized in a fine-tuning process, minimizing the hashing loss objective $\mathcal{Q}$. The entire procedure is illustrated in Figure 1 and detailed in Algorithm 1.
3 Experiments
This section reports the quantitative evaluations between our proposed deep hashing algorithm and other competitors.
Description of Datasets: We conduct quantitative comparisons over four image benchmarks which represent different visual classification tasks. They include MNIST (http://yann.lecun.com/exdb/mnist/) for handwritten digit recognition; CIFAR10 (http://www.cs.toronto.edu/~kriz/cifar.html), which is a subset of the 80 Million Tiny Images dataset (http://groups.csail.mit.edu/vision/TinyImages/) and consists of images from ten animal or object categories; Kaggle-Face (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge), which is a Kaggle-hosted facial expression classification dataset intended to stimulate research on facial feature representation learning; and SUN397 [30], which is a large-scale scene image dataset of 397 categories. Figure 2 shows exemplar images. For all selected datasets, different classes are completely mutually exclusive, such that the similarity/dissimilarity sets as in Eqn. (1) can be calculated purely based on label consensus. Table 1 summarizes the critical information of these experimental data, wherein the column of feature dimension refers to the number of neurons on the second topmost layer (i.e., the dimension of the feature vector $\mathbf{z}$).
|Dataset||Train / Query Set||#Class||#Dim Feature||Feature Type|
|MNIST||50,000 / 10,000||10||500||CNN|
|CIFAR10||50,000 / 10,000||10||1,024||CNN|
|Kaggle-Face||315,799 / 7,178||7||2,304||CNN|
|SUN397||87,003 / 21,751||397||9,216||CNN|
Implementation and Model Specification: We have implemented a substantially customized version of the open-source Caffe [12]. The proposed hashing loss layer is patched into the original package, and we also largely enrich Caffe’s model specification grammar. Moreover, to ensure that mini-batches more faithfully represent the real distribution of pairwise affinities, we re-shuffle the training set at each iteration. This is approximately accomplished by using the trick of random skipping, namely skipping the next few samples in the image database after adding one to the mini-batch (the number of skipped samples varies at each operation and is given by a random integer drawn uniformly from a fixed range in our experiments).
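The random-skipping trick can be sketched as a small generator. This is a hypothetical illustration rather than the Caffe patch used here: the upper bound `max_skip` on the skip length is an assumed parameter (the range used in the experiments is not given in this text), and wrapping around the end of the dataset is likewise an assumption.

```python
import random


def random_skip_batches(dataset, batch_size, max_skip, seed=None):
    """Yield mini-batches; after each picked sample, skip 0..max_skip items.

    The read position wraps around the dataset cyclically, so consecutive
    passes over the data visit different subsets in a different phase.
    """
    rng = random.Random(seed)
    n = len(dataset)
    pos = 0
    while True:
        batch = []
        for _ in range(batch_size):
            batch.append(dataset[pos])
            pos = (pos + 1 + rng.randint(0, max_skip)) % n
        yield batch


batches = random_skip_batches(list(range(100)), batch_size=8, max_skip=5, seed=0)
first = next(batches)
print(first)
```

Compared with a full re-shuffle of the image database, this only needs sequential reads with a moving cursor, at the cost of being an approximate shuffle.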
Table 2: mAP scores of all methods with 12, 24 and 48 hash bits on MNIST, CIFAR10, Kaggle-Face and SUN397 (table body not recovered).
Table 5: Network configuration. Some layers (e.g., ReLU and local normalization layers) are omitted due to space limits. Specifically, the softmax layer is used only for pre-training the first 7 layers and is not included during fine-tuning. We provide the network configuration information in the format of Caffe’s grammar in the supplemental material.
Baselines and Evaluation Protocol: All the evaluations are conducted on a large-scale private cluster, equipped with 12 NVIDIA Tesla K20 GPUs and 8 K40 GPUs. We denote the proposed algorithm as DeepHash. On the chosen benchmarks, DeepHash is compared against classic or state-of-the-art competing hashing schemes, including unsupervised methods such as random projection-based LSH, PCAH, SH [28] and ITQ [10], and supervised methods like LDAH [5], MLH [22], BRE [14] and KSH [20]. LSH and PCAH are evaluated using our own implementations. For the remaining baselines, we thank the authors for publicly sharing their code and adopt the parameters suggested in the original software packages. Moreover, to make the comparisons comprehensive, four previous deep hashing algorithms [29, 19, 16] are also contrasted, denoted as DH-1 (two variants from the same work), DH-2 and DH-3. Since the authors do not share the source code or model specifications, we instead cite their reported accuracies under identical (or similar) experimental settings.
Importantly, the performance of a hashing algorithm critically hinges on the semantic discriminatory power of its input features. Previous deep hashing works [29, 16] use traditional hand-crafted features (e.g., GIST and SIFT bag-of-words) for all baselines, which is not an optimal setting for fair comparison with deep hashing. To rule out the effect of less discriminative features, we strictly feed all baselines (except for the four deep hashing algorithms from [29, 19, 16]) with features extracted from an intermediate layer of the corresponding networks used in deep hashing. Specifically, after the first supervised pre-training stage in Algorithm 1 is completed, we re-arrange the neuron responses on the layer right below the hashing loss layer (e.g., layer #7 in Table 5) into vector format (namely the vectors $\mathbf{z}_i$) and feed them into the baselines.
All methods share identical training and query sets. After the hashing functions are learned on the training set, each method produces binary hash codes for the query data. There exist multiple search strategies using hash codes for image search, such as hash table lookup and sparse-coding-style criteria. Following recent hashing works, we only carry out Hamming ranking once the hashing functions are learned, which refers to the process of ranking the retrieved samples based on their Hamming distances to the query. Under the Hamming ranking protocol, we measure each algorithm using both mean-average-precision (mAP) scores and precision-recall curves.
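The Hamming ranking protocol and the mAP score can be sketched as follows. This is a minimal illustration, assuming that a retrieved item counts as relevant when it shares the query's class label and that ties in Hamming distance are broken by database order; the actual evaluation code may differ in such details, and all names are hypothetical.

```python
def hamming(a, b):
    """Number of positions where the two codes disagree."""
    return sum(p != q for p, q in zip(a, b))


def average_precision(query_code, query_label, db_codes, db_labels):
    """AP of one query under Hamming ranking of the whole database."""
    # sorted() is stable, so ties keep their original database order.
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if db_labels[i] == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries, db_codes, db_labels):
    """queries is a list of (code, label) pairs."""
    aps = [average_precision(c, y, db_codes, db_labels) for c, y in queries]
    return sum(aps) / len(aps)


db_codes = [[1, 1, 1], [1, 1, -1], [-1, -1, -1]]
db_labels = [0, 1, 0]
ap = average_precision([1, 1, 1], 0, db_codes, db_labels)
print(ap)  # relevant items at ranks 1 and 3: (1/1 + 2/3) / 2
```

Precision-recall curves can be obtained from the same ranking by recording precision and recall at every rank instead of averaging.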
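|Dataset||MNIST||MNIST||MNIST||CIFAR10||CIFAR10||CIFAR10||Kaggle-Face||Kaggle-Face||Kaggle-Face||SUN397||SUN397||SUN397|
|Method||12 bits||24 bits||48 bits||12 bits||24 bits||48 bits||12 bits||24 bits||48 bits||12 bits||24 bits||48 bits|
|DeepHash (random init.)||0.9806||0.9862||0.9873||0.5728||0.6503||0.6585||0.4125||0.4473||0.4620||0.0211||0.0384||0.0360|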
Investigation of Hamming Ranking Results: Table 2 and Figure 3 show the mAP scores for our proposed DeepHash algorithm (with supervised pre-training and fine-tuning) and all baselines. To clearly depict the evolving accuracies with respect to the search radius, Figure 4 displays the precision-recall curves for all algorithms with 32 hash bits. There are three key observations from these experimental results that we would like to highlight:
1) On all four datasets, our proposed DeepHash algorithm performs significantly better than all baselines in terms of mAP. Among all non-deep-network-based algorithms, KSH achieves the best accuracies on MNIST, CIFAR10 and Kaggle-Face, and ITQ shows top performance on SUN397. Using 48 hash bits, the best mAP scores obtained by KSH or ITQ are 0.9817, 0.5482, 0.4132, and 0.0471 on MNIST / CIFAR10 / Kaggle-Face / SUN397 respectively. In comparison, our proposed DeepHash performs nearly perfectly on MNIST (0.9938), and defeats KSH and ITQ by very large margins, scoring 0.7410, 0.5615, and 0.1293 on the other three datasets respectively.
2) We also include four deep hashing algorithms by referring to the accuracies reported in the original publications. Recall that the evaluations in [29, 16] feed baseline algorithms with non-CNN features (e.g., GIST). Interestingly, our experiments reveal that, when conventional hashing algorithms take CNN features as the input, the relative performance gain of prior deep hashing algorithms becomes marginal. For example, under 48 hash bits, KSH’s mAP score of 0.5482 is comparable to DH-3’s 0.581. We attribute the striking superiority of our proposed deep hashing algorithm to the importance of jointly conducting feature engineering and hash function learning (i.e., the fine-tuning process in Algorithm 1).
3) Elevating inter-bit mutual complementarity is crucial for the final performance. For those methods that generate hash bits independently (such as LSH) or by enforcing performance-irrelevant inter-bit constraints (such as LDAH), the mAP scores show only slight gains or even drop when increasing the hash code length. Among all algorithms, the two code-product oriented algorithms, KSH and our proposed DeepHash, show steady improvement when using more hash bits. Moreover, our results also validate some known insights exposed by previous works, such as the advantage of supervised hashing methods over the unsupervised alternatives.
Table 4: mAP scores with hand-crafted features versus CNN features, under 16 and 32 hash bits, on CIFAR10 and SUN397 (table body not recovered).
Effect of Supervised Pre-Training: We now further highlight the effectiveness of the two-stage supervised pre-training process. To this end, in Table 3 we show the mAP scores achieved by three different strategies of learning the network parameters. The scheme “DeepHash (random init.)” refers to initializing all parameters with random numbers without any pre-training. A typical supervised gradient back-propagation procedure as in AlexNet [13] is then used. The second scheme “DeepHash (pre-training)” refers to initializing the network using the two-stage pre-training in Algorithm 1, without any subsequent fine-tuning. It serves as an appropriate baseline for assessing the benefit of the fine-tuning process used in the third scheme “DeepHash (fine-tuning)”. In all cases, the learning rate in gradient descent drops by a constant factor (0.1 in all of our experiments) until the training converges.
There are two major observations from the results in Table 3. First, simultaneously tuning all the layers (including the hashing loss layer) often significantly boosts the performance. As key evidence, “DeepHash (random init.)” demonstrates prominent superiority on MNIST and CIFAR10 compared with “DeepHash (pre-training)”. The joint parameter tuning of “DeepHash (random init.)” presumably compensates for the low-quality random parameter initialization. Second, positioning the initial solution near a “good” local optimum is crucial for learning on challenging data. For example, the SUN397 dataset has as many as 397 unique scene categories. However, due to the limitation of GPU memory, even a K40 GPU with 12GB memory only supports a mini-batch of 600 samples at maximum. Stated differently, each mini-batch comprises only 1.5 samples per category on average, which results in a heavily biased sampling of the pairwise affinities. We attribute the relatively low accuracies of “DeepHash (random init.)” to this issue. In contrast, training deep networks with both supervised pre-training and fine-tuning (i.e., the third scheme in Table 3) exhibits robust performance over all datasets.
Comparison with Hand-Crafted Features: To complement a missing comparison in other deep hashing works [29, 19, 16], we also compare the hashing performance with conventional hand-crafted features and with CNN features extracted from our second topmost layers. Following the choices in the relevant literature, we extract 800-D GIST features from CIFAR10 images, and 5000-D DenseSIFT Bag-of-Words features from SUN397 images. The comparisons under 16 and 32 hash bits are found in Table 4, exhibiting huge performance gaps between these two kinds of features. It clearly reveals how the feature quality impacts the final performance of a hashing algorithm, and that a fair setting should be established when comparing conventional and deep hashing algorithms.
4 Concluding Remarks
In this paper a novel image hashing technique is presented. We attribute the success of the proposed deep hashing to the following aspects: 1) it jointly performs feature engineering and hash function learning, rather than feeding hand-crafted visual features to hashing algorithms, 2) the proposed exponential loss function fits the paradigm of mini-batch based training well, and the treatment in Eqn. (10) naturally encourages inter-bit complementarity, and 3) to combat the under-sampling issue in the training phase, we introduce the idea of two-stage supervised pre-training and validate its effectiveness by comparisons.
Our comprehensive quantitative evaluations consistently demonstrate the power of deep hashing for the data hashing task. The proposed algorithm enjoys both scalability to large training data and millisecond-level testing time for processing a new image. We thus believe that deep hashing is promising for efficiently analyzing visual big data.
References
- [1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
- [2] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn. Beyond locality-sensitive hashing. CoRR, abs/1306.1547, 2013.
- [3] Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, Jan. 2009.
- [4] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60:327–336, 1998.
- [5] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 34(1), 2012.
- [6] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
- [7] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In BMVC, 2008.
- [8] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
- [9] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In STOC, 2004.
- [10] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
- [11] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
- [12] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
- [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [14] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
- [15] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell., 34(6):1092–1104, 2012.
- [16] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, 2015.
- [17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
- [18] G. Lin, C. Shen, and A. van den Hengel. Supervised hashing using graph cuts and boosted decision trees. IEEE Trans. Pattern Anal. Mach. Intell., 37(11):2317–2331, 2015.
- [19] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.
- [20] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. In CVPR, 2012.
- [21] Y. Mu, J. Shen, and S. Yan. Weakly-supervised hashing in kernel space. In CVPR, 2010.
- [22] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
- [23] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, 2012.
- [24] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009.
- [25] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
- [26] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
- [27] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393–2406, 2012.
- [28] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
- [29] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, 2014.
- [30] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
Appendix: Network Configurations
The tables in this appendix present the configurations of the deep networks used for the selected benchmarks. The non-linear transform layers are mainly ReLU (rectified linear unit) and LRN (local response normalization). Specifically, the softmax layers are used only for pre-training the convolutional/innerProduct layers and are not included during fine-tuning; they are thus not enumerated in these tables.
Network configuration tables, one per benchmark, with columns Layer ID, Layer Type, Filter / Stride and #Dim of Output (table bodies not recovered, apart from a soft-max pre-training layer with 7 or 10 outputs).