1 Introduction
Efficient computation of similarity between entries in large-scale databases has attracted increasing interest, given the explosive growth of data that has to be collected, processed, stored, and searched. This problem arises naturally in applications such as image-based retrieval, ranking, classification, detection, tracking, and registration. In all these problems, given a query object (usually represented as a feature vector), one has to determine the closest entries (nearest neighbors) in a large (or huge) database. Since the notion of similarity of (for example) visual objects is rather elusive and cannot be measured explicitly, one often resorts to machine learning techniques that allow constructing similarity from examples of data. Such methods are generally referred to as similarity or metric learning.
Traditionally, similarity learning methods are divided into unsupervised and supervised, with the former relying on the data only, without using any side information. PCA-type methods (Schoelkopf et al., 1997) use the global structure of the data, while manifold learning techniques such as locally linear embedding (Roweis & Saul, 2000), Laplacian eigenmaps (Belkin & Niyogi, 2003), and diffusion maps (Coifman & Lafon, 2006) consider the data as a low-dimensional manifold and use its local intrinsic structure to represent similarity. Supervised methods assume that additional information, such as class labels (Johnson & Wichern, 2002; Mika et al., 1999; Weinberger & Saul, 2009; Xing et al., 2002), distances, similar and dissimilar pairs (Davis et al., 2007), or order relations (McFee & Lanckriet, 2009; Shen et al., 2009), is provided together with the data examples. Many similarity learning methods use some representation of the distance, e.g., in the form of a parametric embedding from the original data space to some target space. In the simplest case, such an embedding is a linear projection acting as a dimensionality reduction, and the metric of the target space is the Euclidean or Mahalanobis distance (Shen et al., 2009; Weinberger & Saul, 2009).
More recently, motivated by the need for efficient techniques for big data, there has been an increased interest in similarity learning methods based on embedding the data in spaces of binary codes with the Hamming metric (Gong et al., 2012; Gong & Lazebnik, 2011; Kulis & Darrell, 2009; Liu et al., 2012; Norouzi et al., 2012; Norouzi & Fleet, 2011; Wang et al., 2010). Such an embedding can be considered a hashing function acting on the data that tries to preserve some underlying similarity. Notable examples of the unsupervised setting of this problem include locality-sensitive hashing (LSH) (Gionis et al., 1999) and spectral hashing (Weiss et al., 2008; Liu et al., 2011), which try to approximate some trusted standard similarity such as the Jaccard index or the cosine distance. Similarly, Yagnik et al. (2011) proposed computing ordinal embeddings based on partial order statistics such that the Hamming distance in the resulting space closely correlates with rank similarity measures. Unsupervised methods cannot be used to learn semantic similarities given by example data. Shakhnarovich et al. (2003) proposed to construct optimal LSH-like similarity-sensitive hashes (SSH) for data with a given binary similarity function using boosting, considering each dimension of the hashing function as a weak classifier. In the same setting, a simple method based on the eigendecomposition of the covariance matrices of positive and negative samples was proposed by Strecha et al. (2012). Masci et al. (2011) posed the problem as neural network learning. Hashing methods have been used successfully in various vision applications such as large-scale retrieval (Torralba et al., 2008b), feature descriptor learning (Strecha et al., 2012; Masci et al., 2011), image matching (Korman & Avidan, 2011), and alignment (Bronstein et al., 2010).
The appealing property of such similarity-preserving hashing methods is the compactness of the representation and the low complexity involved in distance computation: finding similar objects is done by determining hash collisions (i.e., looking for nearest neighbors in Hamming metric balls of radius zero), with complexity practically constant in the database size. In practice, however, most methods consider nearest neighbors lying at radii larger than zero, which cannot be done as efficiently. The reason is the difficulty for simple hash functions (typically low-dimensional linear projections) to achieve simultaneously high precision and recall when requiring only hash collisions.
Main contributions. In this paper, we propose to introduce structure into the binary representation at the expense of its length, an idea that has been shown to be spectacularly powerful and has led to numerous applications of sparse redundant representation and compressed sensing techniques. We introduce a sparse similarity-preserving hashing technique, SparseHash, and show substantial evidence of its superior recall, at precision comparable to that of state-of-the-art methods, on top of its intrinsic computational benefits. To the best of our knowledge, this is the first time sparse structures are employed in similarity-preserving hashing. We also show that the proposed sparse hashing technique can be thought of as a feed-forward neural network, whose architecture is motivated by the iterative shrinkage algorithms used for sparse representation pursuit (Daubechies et al., 2004). The network is trained using stochastic gradient descent, scalable to very large training sets. Finally, we present an extension of SparseHash to multimodal data, allowing its use in multimodal and cross-modality retrieval tasks.
2 Background
Let $X$ be the data (or feature) space with a binary similarity function $s : X \times X \to \{0, 1\}$. In some cases, the similarity function can be obtained by thresholding some trusted metric on $X$, such as the Euclidean metric; in other cases, the data form a (typically latent) low-dimensional manifold, whose geodesic metric is more meaningful than that of the embedding Euclidean space. In yet other cases, $s$ represents a semantic rather than geometric notion of similarity, and may thus violate metric properties. It is customary to partition $X \times X$ into similar pairs of points (positives) $\mathcal{P} = \{(x, x') : s(x, x') = 1\}$, and dissimilar pairs of points (negatives) $\mathcal{N} = \{(x, x') : s(x, x') = 0\}$.
Similarity-preserving hashing is the problem of representing the data from the space $X$ in the space $\mathcal{H}^m = \{0, 1\}^m$ of $m$-dimensional binary vectors with the Hamming metric $d_H$, by means of an embedding $\xi : X \to \mathcal{H}^m$ that preserves the original similarity relation, in the sense that there exist two radii, $r_+ < r_-$, such that with high probability $d_H(\xi(x), \xi(x')) \le r_+$ for $(x, x') \in \mathcal{P}$ and $d_H(\xi(x), \xi(x')) \ge r_-$ for $(x, x') \in \mathcal{N}$. In practice, the similarity $s$ is frequently unknown and hard to model; however, it is possible to sample it on some subset of the data. In this setting, the problem of similarity-preserving hashing boils down to finding an embedding minimizing the aggregate of false positive and false negative rates,

$$\min_{\xi}\; \mathbb{E}\big\{ d_H(\xi(x), \xi(x')) \;\big|\; \mathcal{P} \big\} \;+\; \alpha\, \mathbb{E}\big\{ \big[M - d_H(\xi(x), \xi(x'))\big]_+ \;\big|\; \mathcal{N} \big\}, \qquad (1)$$

where $[t]_+ = \max(t, 0)$, $M$ is a margin, and $\alpha$ governs the tradeoff between the false negative (first term) and false positive (second term) rates.
Problem (1) is highly nonlinear and nonconvex. We list below several methods for its optimization.
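As a concrete illustration of what (1) measures, the following minimal sketch (toy data and a hypothetical one-bit-per-coordinate embedding of our own, not the paper's) empirically estimates the aggregate of false negative and false positive rates at a given Hamming radius:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def empirical_loss(xi, positives, negatives, r=0):
    """Empirical aggregate of the false negative rate (positive pairs
    mapped farther apart than radius r) and the false positive rate
    (negative pairs mapped within radius r) for an embedding xi."""
    fn = np.mean([hamming(xi(x), xi(y)) > r for x, y in positives])
    fp = np.mean([hamming(xi(x), xi(y)) <= r for x, y in negatives])
    return float(fn + fp)

# Hypothetical embedding: one bit per coordinate, the sign pattern of x.
xi = lambda x: (np.asarray(x) > 0).astype(np.uint8)
P = [([1.0, 2.0], [0.5, 1.0])]     # a similar pair (same sign pattern)
N = [([1.0, 2.0], [-1.0, -2.0])]   # a dissimilar pair
print(empirical_loss(xi, P, N))    # 0.0: no false negatives or positives
```

Optimizing over a parametric family of embeddings replaces these hard 0/1 counts with the smooth surrogates discussed next.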
Similarity-sensitive hashing (SSH). Shakhnarovich et al. (2003) studied a particular setting of problem (1) with an embedding of the form $\xi(x) = \mathrm{sign}(Px + a)$, where $P$ is an $m \times n$ projection matrix and $a$ is an $m \times 1$ bias vector, and proposed the SSH algorithm, which constructs the dimensions of $\xi$ one by one using boosting. The expectations in (1) are weighted, with stronger weights given to the pairs misclassified at the previous iteration.
Diff-hash (DH). Strecha et al. (2012) linearized the embedding $\xi(x) = \mathrm{sign}(Px + a)$ and, dropping the sign non-linearity, observed that in this case (1) can be written as

$$\min_{P}\; \mathrm{tr}\big( P\, (\Sigma_{\mathcal{P}} - \alpha\, \Sigma_{\mathcal{N}})\, P^{\mathrm{T}} \big), \qquad (2)$$

where $\Sigma_{\mathcal{P}}$ and $\Sigma_{\mathcal{N}}$ are the covariance matrices of the differences of positive and negative samples, respectively. Solving (2) w.r.t. the projection matrix $P$ amounts to finding the smallest eigenvectors of the covariance difference matrix $\Sigma_{\mathcal{P}} - \alpha\, \Sigma_{\mathcal{N}}$. The bias vector $a$ is found separately, independently for each dimension.

Neural network hashing (NNhash). Masci et al. (2011) realized the function $\xi(x) = \mathrm{sign}(Px + a)$ as a single-layer neural network with a $\tanh$ activation function, where the coefficients $P$ and $a$ act as the layer weights and bias, respectively. Coupling two such networks with identical parameters in a so-called siamese architecture (Hadsell et al., 2006; Taylor et al., 2011), one can represent the loss (1) as

$$L(P, a) = \tfrac{1}{2}\, \mathbb{E}\big\{ \|\xi(x) - \xi(x')\|_2^2 \;\big|\; \mathcal{P} \big\} + \tfrac{\alpha}{2}\, \mathbb{E}\big\{ \big[M - \|\xi(x) - \xi(x')\|_2\big]_+^2 \;\big|\; \mathcal{N} \big\}. \qquad (3)$$

The second term in (3) is a hinge loss, providing robustness to outliers and producing a mapping for which negatives are pulled apart. Finding the network parameters $(P, a)$ minimizing loss function (3) is done using standard neural network learning techniques, e.g. the backpropagation algorithm (LeCun, 1985). Compared to SSH and DH, NNhash attempts to solve the full nonlinear problem rather than relying on the often suboptimal solutions of relaxed linearized or separable problems such as (2).

3 Sparse similarity-preserving hashing
The selection of the number of bits $m$ and the rejection Hamming radius $r$ in a similarity-preserving hash has an important influence on the tradeoff between precision and recall. Increasing $m$ increases the precision, as a higher-dimensional embedding space allows representing more complicated decision boundaries. At the same time, as $m$ grows, the relative volume of the ball of radius $r$ containing the positives decays exponentially fast, a phenomenon known as the curse of dimensionality, resulting in a rapid decrease of the recall. This is a well-documented phenomenon that affects all hashing techniques (Grauman & Fergus, 2013). For instance, in the context of LSH, it can be shown that the collision probability between two points decreases exponentially with the code length (Goemans & Williamson, 1995). Furthermore, increasing $m$ slows down the retrieval.
The low recall typical of long codes can be improved by increasing the rejection radius $r$. However, this comes at the expense of increased query time, since the search complexity directly depends on $r$. For $r = 0$ (collision), a lookup table (LUT) is used: the query code is fed into the LUT, which contains all entries in the database having the same code. The complexity is $O(1)$, independent of the database size $N$, but often with a large constant. For small $r > 0$ (partial collision), the search is done as for $r = 0$ using perturbations of the query: at most $r$ bits of the query are changed, and each perturbed code is fed into the LUT; the final result is the union of all the retrieved results. The complexity in this case is $O\big(\sum_{k=0}^{r} \binom{m}{k}\big)$, growing rapidly with $r$. Finally, for large radii it is often cheaper in practice to use exhaustive search, with complexity $O(N)$ (for typical code lengths and database sizes used in vision applications, using radii greater than 2 is slower than brute-force search (Grauman & Fergus, 2013)). Consequently, practical retrieval based on similarity-preserving hashing schemes suffers from a fundamental limitation of the precision-recall-speed tradeoff: one has to choose between fast retrieval (small $r$ and $m$, resulting in low recall), high recall (large $r$, small $m$, slow retrieval), or high precision (large $m$, small recall, slow retrieval).
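The LUT-based search with bit perturbations described above can be sketched as follows (a minimal illustration with codes packed into integers; all function names are ours):

```python
from itertools import combinations

def build_lut(codes):
    """Hash table mapping each binary code (packed into an int) to the ids stored under it."""
    lut = {}
    for i, c in enumerate(codes):
        lut.setdefault(c, []).append(i)
    return lut

def search(lut, query, m, r):
    """Retrieve all database ids within Hamming radius r of `query`
    (an m-bit code) by probing every perturbation of at most r bits:
    sum_{k<=r} C(m,k) lookups, independent of the database size."""
    results = []
    for k in range(r + 1):
        for bits in combinations(range(m), k):
            probe = query
            for b in bits:
                probe ^= 1 << b          # flip the selected bits
            results.extend(lut.get(probe, []))
    return results

db = [0b0000, 0b0001, 0b1111]
lut = build_lut(db)
print(search(lut, 0b0000, m=4, r=1))     # [0, 1]: entries at distance 0 and 1
```

The number of probes grows combinatorially with r, which is exactly why large rejection radii push practical systems back to brute-force scanning.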
The key idea of this paper is to control the exploding volume of the embedding space by introducing structure into the binary code. While different types of structure can be considered in principle, we limit our attention to sparse hash codes. A number of recent studies have demonstrated that sparse overcomplete representations have several theoretical and practical advantages when modeling compressible data, leading to state-of-the-art results in many applications in computer vision and machine learning. We argue, and show experimentally in Section 5, that compared to its “dense” counterpart, an $m$-bit sparse similarity-preserving hash can enjoy the high precision typical of long hashes, while having higher recall, roughly comparable to that of a dense hashing scheme with fewer bits and the same number of degrees of freedom as the $m$-bit sparse hash.

SparseHash. In order to achieve sparsity, a regularization term needs to be incorporated into problem (1) so that the obtained embedding produces codes having only a small number of non-zero elements. In this work we employ an $\ell_1$-norm regularization, extensively used in the compressed sensing literature to promote sparsity. Specifically, the loss considered in the minimization of the proposed SparseHash framework is given by the average of

$$\ell(x, x') = \frac{s}{2}\, \|\xi(x) - \xi(x')\|_2^2 + \frac{\alpha (1 - s)}{2}\, \big[M - \|\xi(x) - \xi(x')\|_2\big]_+^2 + \lambda \big( \|\xi(x)\|_1 + \|\xi(x')\|_1 \big) \qquad (4)$$

over the training set, where $s = s(x, x')$ is the ground-truth similarity function (1: similar, 0: dissimilar), $\lambda$ is a parameter controlling the level of sparsity, $\alpha$ is a weighting parameter governing the tradeoff between the false positive and false negative rates, and $M$ is a margin.
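Under our reading of the text, a pairwise loss combining a quadratic attraction term for similar pairs, an alpha-weighted hinge repulsion with margin M for dissimilar pairs, and an l1 sparsity penalty can be sketched as (default parameter values are placeholders, not the paper's settings):

```python
import numpy as np

def sparsehash_loss(zx, zy, s, lam=0.01, alpha=0.1, M=3.0):
    """Pairwise loss sketch: s-weighted quadratic attraction,
    (1-s)-weighted alpha-scaled hinge repulsion with margin M, and an
    l1 penalty (weight lam) promoting sparse codes.  zx, zy are the
    real-valued (pre-binarization) codes of the pair; s is the
    ground-truth similarity (1: similar, 0: dissimilar)."""
    d = float(np.linalg.norm(zx - zy))
    attract = 0.5 * s * d ** 2
    repel = 0.5 * (1 - s) * alpha * max(0.0, M - d) ** 2
    sparsity = lam * (np.abs(zx).sum() + np.abs(zy).sum())
    return attract + repel + sparsity
```

The training objective is this quantity averaged over the labeled pairs of the training set.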
With the new loss function given in (4), solving (1) produces a sparse embedding that minimizes the aggregate of false positive and false negative rates for a given parametrization. The question then is: which parametrized family of embedding functions leads to the best sparse similarity-preserving hash codes? While there is no absolute answer to this question, recent approaches aimed at finding fast approximations of sparse codes have shed some light on this issue from a practical perspective (Gregor & LeCun, 2010; Sprechmann et al., 2012), and we use the same criterion for our proposed framework.
Gregor and LeCun (2010) proposed tailored feed-forward architectures capable of producing highly accurate approximations of the true sparse codes. These architectures were designed to mimic the iterations of successful first-order optimization algorithms such as the iterative shrinkage-thresholding algorithm (ISTA) (Daubechies et al., 2004). The close relation between the iterative solvers and the network architectures plays a fundamental role in the quality of the approximation. This particular design of the encoder architecture was shown to lead to considerably better approximations than other off-the-shelf feed-forward neural networks (Gregor & LeCun, 2010). These ideas can be generalized to many different uses of sparse coding; in particular, they can be very effective in discriminative scenarios, performing similarly to or better than exact algorithms at a small fraction of the computational cost (Sprechmann et al., 2012). These architectures are flexible enough to approximate the sparse hash in (4).¹

¹The hash codes produced by the proposed architecture can only be made sparse on average, by tuning the parameter $\lambda$. In order to guarantee that the codes contain no more than $k$ non-zeros, one can resort to the CoD encoders, derived from the coordinate-descent pursuit algorithm (Gregor & LeCun, 2010), wherein $k$ is upper-bounded by the number of network layers. We stress that in our application exact sparsity is not important, since we obtain the same qualitative behavior.
Implementation. We implement SparseHash by coupling two ISTA-type networks sharing the same set of parameters, as described in (Gregor & LeCun, 2010; Sprechmann et al., 2012), and trained using the loss (4). The architecture of an ISTA-type network (Figure 1) can also be seen as a recurrent network with a soft-threshold activation function. A conventional ISTA network, designed to obtain sparse representations with fixed complexity, has continuous output units. We follow the approach of (Masci et al., 2011) to obtain binary codes by adding a $\tanh$ activation function. Such a smooth approximation of the binary outputs is also similar to the logistic function used by KSH (Liu et al., 2012). We initialize the projection matrix with a unit-length normalized random subset of the training vectors, as in the original ISTA algorithm (considering the projection matrix as a dictionary), and the thresholds with zeros. The shrinkage activation is defined as $\mathrm{shrink}_\theta(x) = \mathrm{sign}(x)\, \max(0, |x| - \theta)$.
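A minimal sketch of such an ISTA-type (LISTA-style) encoder in the spirit of Gregor & LeCun (2010), with hypothetical parameter names and random untrained weights:

```python
import numpy as np

def shrink(x, theta):
    """Soft-thresholding (shrinkage) activation: sign(x) * max(|x| - theta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def ista_encoder(x, W, S, theta, n_layers=3):
    """Feed-forward encoder mimicking ISTA iterations: the input is
    projected by W, then a fixed number of recurrent shrinkage steps
    with matrix S refine the code.  W, S, theta would be learned; a
    final tanh squashes the code towards binary values, as in the text."""
    b = W @ x
    z = shrink(b, theta)
    for _ in range(n_layers - 1):
        z = shrink(b + S @ z, theta)
    return np.tanh(z)          # smooth approximation of a binary code

# Toy forward pass with random (untrained) parameters.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((16, 8)) / np.sqrt(8)
S = rng.standard_normal((16, 16)) * 0.1
code = ista_encoder(x, W, S, theta=0.5)
print((np.abs(code) > 1e-8).sum(), "non-zeros out of", code.size)
```

The soft threshold zeroes out small responses, which is what makes the output codes sparse by construction rather than by post-hoc truncation.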
The application of the learned hash function to a new data sample involves a few matrix multiplications and an element-wise soft thresholding, and is on par with the fastest methods available, such as SSH (Shakhnarovich et al., 2003), DH (Strecha et al., 2012), and AGH (Liu et al., 2011).
4 Multimodal SparseHash
In modern retrieval applications, a single object is often represented by more than one data modality. For example, images are frequently accompanied by textual tags, and video by an audio track. The need to search multimodal data requires comparing objects that are incommensurable in their original representations. Similarity-preserving hashing can address this need by mapping the modalities into a common space, thus making them comparable in terms of a single similarity function, such as the Hamming metric. For simplicity, we will henceforth limit our discussion to two modalities, though the presented ideas can be straightforwardly generalized to any number of modalities.
We assume the data come from two distinct data spaces $X$ and $Y$, equipped with intra-modality similarity functions $s_X$ and $s_Y$, respectively. We furthermore assume the existence of an inter-modality similarity $s_{XY}$. Typically, examples of similar and dissimilar objects across modalities are more expensive to obtain than their intra-modality counterparts. We construct two embeddings, $\xi : X \to \mathcal{H}^m$ and $\eta : Y \to \mathcal{H}^m$, in such a way that the Hamming metric preserves the similarity relations of the modalities. We distinguish between cross-modal similarity-preserving hashing, preserving only the inter-modality similarity (Bronstein et al., 2010), and the full multimodal setting, also preserving the intra-modality similarities.
Our sparse similarity-preserving hashing technique can be generalized to both settings. We construct an independent SparseHash network for each modality and train them by minimizing an aggregate loss of the form

$$L(\Theta_X, \Theta_Y) = L_{XY} + \beta_X L_X + \beta_Y L_Y$$

with respect to the parameters $\Theta_X, \Theta_Y$ of the two networks, where $L_{XY}$ denotes the inter-modality loss and $L_X, L_Y$ the intra-modality losses, each given by an average of pairwise terms of the form (4). The parameters $\beta_X$ and $\beta_Y$ control the relative importance of the intra-modality similarities, and are set to zero in the cross-modal regime. We refer to the networks constructed this way as MMSparseHash.
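The aggregation of intra- and inter-modality losses described above can be sketched as follows (all names are ours; `pair_loss` stands for any pairwise loss such as (4)):

```python
def multimodal_loss(pair_loss, xi, eta, pairs_x, pairs_y, pairs_xy,
                    beta_x=0.5, beta_y=0.5):
    """Aggregate objective sketch: the inter-modality loss plus
    beta-weighted intra-modality losses.  Each term averages pair_loss
    over labeled pairs (a, b, s); xi and eta are the per-modality
    encoders.  beta_x = beta_y = 0 recovers the purely cross-modal regime."""
    def avg(enc1, enc2, pairs):
        return sum(pair_loss(enc1(a), enc2(b), s) for a, b, s in pairs) / len(pairs)
    return (avg(xi, eta, pairs_xy)
            + beta_x * avg(xi, xi, pairs_x)
            + beta_y * avg(eta, eta, pairs_y))

# Toy check with identity encoders and an absolute-difference pair loss.
ident = lambda v: v
pl = lambda a, b, s: abs(a - b)
pairs = [(1.0, 3.0, 1)]
print(multimodal_loss(pl, ident, ident, pairs, pairs, pairs))  # 4.0
```

In practice the two encoders are the per-modality SparseHash networks, trained jointly on this combined objective.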
Hamming radius  Hamming radius  
Method  mAP  Prec.  Recall  F1  Prec.  Recall  F1  
17.42  –  –  –  –  –  –  
KSH  48  31.10  18.22  0.86  0.44  5.39  5.6  0.11  
64  32.49  10.86  0.13  0.26  2.49  9.6  1.9  
AGH1  48  14.55  15.95  2.8  1.4  4.88  2.2  4.4  
64  14.22  6.50  4.1  8.1  3.06  1.2  2.4  
AGH2  48  15.34  17.43  7.1  3.6  5.44  3.5  6.9  
64  14.99  7.63  7.2  1.4  3.61  1.4  2.7  
SSH  48  15.78  9.92  6.6  1.3  0.30  5.1  1.0  
64  17.18  1.52  3.0  6.1  1.0  1.69  3.3  
DH  48  13.13  3.0  1.0  5.1  1.0  1.7  3.4  
64  13.07  1.0  1.7  3.3  0.00  0.00  0.00  
NN  48  30.18  32.69  1.45  0.74  9.47  5.2  0.10  
64  34.74  22.78  0.28  5.5  5.70  8.8  1.8  
Sparse  48  16  0.01  0.1  23.07  32.69  1.81  0.93  16.65  5.0  0.10 
7  0.001  0.1  21.08  26.03  17.00  12.56  26.65  3.04  5.46  
64  11  0.005  0.1  23.80  31.74  6.87  11.30  31.12  0.86  1.70  
7  0.001  0.1  21.29  21.41  41.68  28.30  25.27  10.17  14.50  

5 Experimental results
We compare SparseHash to several stateoftheart supervised and semisupervised hashing methods: DH (Strecha et al., 2012), SSH (Shakhnarovich et al., 2003), AGH (Liu et al., 2011), KSH (Liu et al., 2012), and NNhash (Masci et al., 2011)
, using codes provided by the authors. For SparseHash, we use fully online training via stochastic gradient descent with an annealed learning rate and momentum, fixing the maximum number of epochs to 250. A single-layer ISTA network is used in all experiments. Dense hash methods produce codes with only modest average sparsity per sample, whereas SparseHash achieves much sparser and more structured codes, most markedly on CIFAR10 with hash length 128. Both sparse and dense codes are well distributed, i.e., the variance of the number of non-zero components per code is small.
Evaluation. We use several criteria to evaluate the performance of the methods: precision and recall (PR) for different Hamming radii, and the F1 score (their harmonic mean); mean average precision at $R$, defined as $\mathrm{mAP@}R = \frac{1}{R}\sum_{r=1}^{R} \mathrm{rel}(r)\, P(r)$, where $\mathrm{rel}(r)$ is the relevance of the $r$-th result (one if relevant and zero otherwise) and $P(r)$ is the precision at $r$ (the percentage of relevant results among the first $r$ top-ranked matches); and the mean precision (MP), defined as the percentage of correct matches for a fixed number of retrieved elements. For the PR curves we use the ranking induced by the Hamming distance between the query and the database samples. For a fixed Hamming radius, we consider only the results falling into the Hamming ball of that radius.
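A minimal sketch of precision-at-k and the mAP@R-style average precision described above, for a single query (function names are ours):

```python
import numpy as np

def precision_at_k(rel, k):
    """Fraction of relevant results among the first k ranked matches."""
    return np.mean(rel[:k])

def average_precision(rel, R):
    """Average precision at R for one query, following the definition
    in the text: (1/R) * sum_r rel(r) * P(r), where rel(r) is 1 if the
    r-th ranked result is relevant and 0 otherwise."""
    rel = np.asarray(rel[:R], dtype=float)
    ks = np.arange(1, len(rel) + 1)
    return float(np.sum(rel * np.cumsum(rel) / ks) / R)

# Example: relevant results at ranks 1 and 3 out of R = 4.
print(average_precision([1, 0, 1, 0], R=4))   # (1 + 2/3) / 4 ≈ 0.4167
```

mAP is then the mean of this quantity over all queries.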
CIFAR10 (Krizhevsky, 2009) is a standard set of 60K labeled images belonging to 10 different classes, sampled from the 80M tiny images benchmark (Torralba et al., 2008a). The images are represented using 384-dimensional GIST descriptors. Following (Liu et al., 2012), we used a training set of 200 images per class; for testing, we used a disjoint query set of 100 images per class and the remaining 59K images as the database.
Figure 4 shows examples of nearest-neighbor retrieval by SparseHash. The performance of different methods is compared in Table 5 and Figures 5–6. In Figure 6 (left), we observe two phenomena: first, the recall of dense hash methods drops significantly as the hash length increases (as expected from our analysis in Section 3; increasing the hash length is needed for precision), while the recall of SparseHash, being dependent on the number of non-zero elements rather than on the hash length, remains approximately unchanged. Second, SparseHash has significantly higher recall at low Hamming radii compared to other methods. This is also evinced in Figure 3, where we show the tradeoff between precision, recall, and retrieval time. We used efficient implementations of LUT-based and brute-force search and took the faster of the two; on the CIFAR10 dataset, LUT-based search showed a significant speedup for small radii, while for larger radii brute-force search was faster. In order to further analyze this behavior, we measured the average number of codes mapped to the same point for each of the methods. Results are reported in Table 2.
Method  Unique codes  Avg. # of neighbors
KSH  57368  3.95  12.38  27.21 
AGH2  55863  1.42  2.33  4.62 
SSH  59733  1.01  1.12  1.88 
DH  59999  1.00  1.00  1.00 
NN  54259  4.83  20.12  56.70 
Sparse  9828  798.47  2034.73  3249.86 
NUS (Chua et al., 2009) is a dataset containing 270K annotated images from Flickr. Every image is associated with one or more of 81 different concepts, and is described using a 500-dimensional bag of features. In training and evaluation, we followed the protocol of (Liu et al., 2011): two images were considered neighbors if they share at least one common concept (only the 21 most frequent concepts are considered). Testing was done on a query set of 100 images per concept; training was performed on 100K pairs of images.
Performance is shown in Table 6 and Figures 5–6; retrieved neighbors are shown in Figure 2. We again observe behavior consistent with our analysis, and SparseHash significantly outperforms the other methods.
Multimodal hashing. We repeated the experiment on the NUS dataset with the same indices of positive and negative pairs, adding the Tags modality represented as 1K-dimensional bags of words. The training set contained pairs of similar and dissimilar Images, Tags, and cross-modality Tags-Images pairs. We compare our MMSparseHash to the cross-modal SSH (CMSSH) method (Bronstein et al., 2010). Results are shown in Table 4 and Figure 9. MMSparseHash significantly outperforms CMSSH in the previously reported state-of-the-art cross-modality (Images-Tags and Tags-Images) retrieval. Both methods outperform the baseline in intra-modal (Tags-Tags and Images-Images) retrieval; since the two modalities complement each other, we attribute the improvement to the ability of the model to pick up such correlations.
Hamming radius  Hamming radius  
Method  mAP@10  MP@5K  Prec.  Recall  F1  Prec.  Recall  F1  
68.67  32.77  –  –  –  –  –  –  
KSH  64  72.85  42.74  83.80  6.1  1.2  84.21  1.7  3.3  
256  73.73  45.35  84.24  1.4  2.9  84.24  1.4  2.9  
AGH1  64  69.48  47.28  69.43  0.11  0.22  73.35  3.9  7.9  
256  73.86  46.68  75.90  1.5  2.9  81.64  3.6  7.1  
AGH2  64  68.90  47.27  68.73  0.14  0.28  72.82  5.2  0.10  
256  73.00  47.65  74.90  5.3  0.11  80.45  1.1  2.2  
SSH  64  72.17  44.79  60.06  0.12  0.24  81.73  1.1  2.2  
256  73.52  47.13  84.18  1.8  3.5  84.24  1.5  2.9  
DH  64  71.33  41.69  84.26  1.4  2.9  84.24  1.4  2.9  
256  70.73  39.02  84.24  1.4  2.9  84.24  1.4  2.9  
NN  64  76.39  59.76  75.51  1.59  3.11  81.24  0.10  0.20  
256  78.31  61.21  83.46  5.8  0.11  83.94  4.9  9.8  
Sparse  64  7  0.05  0.3  74.17  56.08  71.67  1.99  3.98  81.11  0.46  0.92 
7  0.05  1.0  74.15  51.52  69.08  0.53  1.06  81.67  0.15  0.30  
16  0.005  0.3  74.51  55.54  79.09  1.21  2.42  82.76  0.17  0.34  
256  4  0.05  1.0  74.05  60.73  78.82  3.85  7.34  81.82  1.20  2.37  
4  0.05  1.0  74.48  59.42  81.95  1.18  2.33  83.24  0.35  0.70  
6  0.005  0.3  71.73  54.76  78.34  6.10  11.30  80.85  1.02  2.01 
mAP@10 %  MP@5K %  
Method  ImageImage  TagTag  ImageTag  TagImage  ImageImage  TagTag  ImageTag  TagImage 
68.67  71.38  –  –  32.77  32.85  –  –  
CMSSH  75.19  83.05  55.55  50.43  49.69  61.60  37.05  39.13 
MMSparse  73.79  84.49  61.52  59.52  58.13  66.59  57.35  57.29 
6 Conclusions
We presented a new method for learning sparse similarity-preserving hashing functions. The hashing is obtained by solving an $\ell_1$-regularized minimization of the aggregate of false positive and false negative rates. The embedding function is learned using ISTA-type neural networks (Gregor & LeCun, 2010). These networks have a particular architecture that is very effective for learning discriminative sparse codes. We also showed that, once the similarity-preserving hashing problem is stated as training a neural network, it can be straightforwardly extended to the multimodal setting. While in this work we only used networks with a single layer, more generic embeddings could be learned within this exact framework simply by considering multiple layers.
A key contribution of this paper is to show that more accurate nearest-neighbor retrieval can be obtained by introducing sparsity into the hashing code. SparseHash can achieve significantly higher recall at the same levels of precision than dense hashing schemes with a similar number of degrees of freedom. At the same time, the sparsity of the hash codes allows retrieving partial collisions at much lower computational complexity than dense codes in a Hamming ball of the same radius. Extensive experimental results back up these claims, showing that the proposed SparseHash framework produces results comparable or superior to state-of-the-art methods.
References
 Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
 Bronstein et al. (2010) Bronstein, M. M. et al. Data fusion through crossmodality metric learning using similaritysensitive hashing. In Proc. CVPR, 2010.
 Chua et al. (2009) Chua, T.S. et al. NUSWIDE: A realworld web image database from national university of Singapore. In Proc. CIVR, 2009.
 Coifman & Lafon (2006) Coifman, R. R. and Lafon, S. Diffusion maps. App. Comp. Harmonic Analysis, 21(1):5–30, 2006.
 Daubechies et al. (2004) Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure and App. Math., 57(11):1413–1457, 2004.
 Davis et al. (2007) Davis et al. Informationtheoretic metric learning. In Proc. ICML, 2007.
 Gionis et al. (1999) Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In Proc. VLDB, 1999.
 Goemans & Williamson (1995) Goemans, M. and Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145, 1995.
 Gong & Lazebnik (2011) Gong, Y. and Lazebnik, S. Iterative quantization: A procrustean approach to learning binary codes. In Proc. CVPR, 2011.
 Gong et al. (2012) Gong, Y. et al. Angular quantizationbased binary codes for fast similarity search. In Proc. NIPS, 2012.
 Grauman & Fergus (2013) Grauman, K. and Fergus, R. Learning binary hash codes for largescale image search. In Machine Learning for Computer Vision, pp. 49–87. Springer, 2013.
 Gregor & LeCun (2010) Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In ICML, 2010.
 Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
 Johnson & Wichern (2002) Johnson, R. A. and Wichern, D. W. Applied multivariate statistical analysis, volume 4. Prentice Hall, 2002.
 Korman & Avidan (2011) Korman, S. and Avidan, S. Coherency sensitive hashing. In Proc. ICCV, 2011.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
 Kulis & Darrell (2009) Kulis, B. and Darrell, T. Learning to hash with binary reconstructive embeddings. In Proc. NIPS, 2009.
 LeCun (1985) LeCun, Y. Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pp. 599–604, 1985.
 Liu et al. (2011) Liu, W. et al. Hashing with graphs. In Proc. ICML, 2011.
 Liu et al. (2012) Liu, W. et al. Supervised hashing with kernels. In Proc. CVPR, 2012.
 Masci et al. (2011) Masci, J. et al. Descriptor learning for omnidirectional image matching. Technical Report arXiv:1112.6291, 2011.
 McFee & Lanckriet (2009) McFee, B. and Lanckriet, G. R. G. Partial order embedding with multiple kernels. In Proc. ICML, 2009.
 Mika et al. (1999) Mika, S. et al. Fisher discriminant analysis with kernels. In Proc. Neural Networks for Signal Processing, 1999.
 Norouzi & Fleet (2011) Norouzi, M. and Fleet, D. Minimal loss hashing for compact binary codes. In Proc. ICML, 2011.
 Norouzi et al. (2012) Norouzi, M., Fleet, D., and Salakhutdinov, R. Hamming distance metric learning. In Proc. NIPS, 2012.
 Roweis & Saul (2000) Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.

 Schoelkopf et al. (1997) Schoelkopf, B., Smola, A., and Mueller, K. R. Kernel principal component analysis. Artificial Neural Networks, pp. 583–588, 1997.
 Shakhnarovich et al. (2003) Shakhnarovich, G., Viola, P., and Darrell, T. Fast pose estimation with parameter-sensitive hashing. In Proc. CVPR, 2003.
 Shen et al. (2009) Shen, C. et al. Positive semidefinite metric learning with boosting. In Proc. NIPS, 2009.
 Sprechmann et al. (2012) Sprechmann, P., Bronstein, A. M., and Sapiro, G. Learning efficient sparse and low rank models. Technical Report arXiv:1010.3467, 2012.
 Strecha et al. (2012) Strecha, C. et al. LDAHash: Improved matching with smaller descriptors. PAMI, 34(1):66–78, 2012.
 Taylor et al. (2011) Taylor, G. W. et al. Learning invariance through imitation. In Proc. CVPR, 2011.

 Torralba et al. (2008a) Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny images: A large data set for nonparametric object and scene recognition. PAMI, 30(11):1958–1970, 2008a.
 Torralba et al. (2008b) Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. In Proc. CVPR, 2008b.
 Wang et al. (2010) Wang, J., Kumar, S., and Chang, S.F. Sequential projection learning for hashing with compact codes. In Proc. ICML, 2010.
 Weinberger & Saul (2009) Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
 Weiss et al. (2008) Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Proc. NIPS, 2008.
 Xing et al. (2002) Xing, E. P. et al. Distance metric learning with application to clustering with sideinformation. In Proc. NIPS, 2002.
 Yagnik et al. (2011) Yagnik, J. et al. The power of comparative reasoning. In Proc. CVPR, 2011.
Supplementary Material for Sparse similarity-preserving hashing
Hamming radius  Hamming radius  
Method  mAP  Prec.  Recall  F1  Prec.  Recall  F1  
17.42  –  –  –  –  –  –  
KSH  48  31.10  18.22  0.44  0.86  5.39  5.6  0.11  
64  32.49  10.86  0.13  0.26  2.49  9.6  1.9  
128  33.50  2.91  3.3  6.5  0.67  4.5  8.9  
AGH1  48  14.55  15.95  1.4  2.8  4.88  2.2  4.4  
64  14.22  6.50  4.1  8.1  3.06  1.2  2.4  
128  13.53  2.89  1.1  2.2  1.58  3.4  6.8  
AGH2  48  15.34  17.43  3.6  7.1  5.44  3.5  6.9  
64  14.99  7.63  7.2  1.4  3.61  1.4  2.7  
128  14.38  3.78  1.6  3.2  1.43  3.9  7.8  
SSH  48  15.78  9.92  6.6  1.3  0.30  5.1  1.0  
64  17.18  1.52  3.1  6.1  1.0  1.7  3.3  
128  17.20  0.30  5.1  1.0  0.10  1.7  3.4  
DH  48  13.13  3.0  5.1  1.0  1.0  1.7  3.4  
64  13.07  1.0  1.7  3.3  0.00  0.00  0.00  
128  13.12  0.00  0.00  0.00  0.00  0.00  0.00  
NN  48  30.18  32.69  0.74  1.45  9.47  5.2  0.10  
64  34.74  22.78  0.28  5.5  5.70  8.8  1.8  
128  37.89  5.38  2.9  5.7  1.39  2.2  4.4  
Sparse  48  16  0.01  0.1  23.07  32.69  0.93  1.81  16.65  5.0  0.10 
7  0.001  0.1  21.08  26.03  12.56  17.00  26.65  3.04  5.46  
64  11  0.005  0.1  23.80  31.74  6.87  11.30  31.12  0.86  1.70  
7  0.001  0.1  21.29  21.41  41.68  28.30  25.27  10.17  14.50  
128  16  0  0.1  21.97  25.94  18.11  21.30  27.99  3.81  6.71 
Hamming radius  Hamming radius  
Method  mAP@10  MP@5K  Prec.  Recall  F1  Prec.  Recall  F1  

68.67  32.77  
KSH  64  72.85  42.74  83.80  6.1  1.2  84.21  1.7  3.3  
80  72.76  43.32  84.21  1.8  3.6  84.23  1.4  2.9  
256  73.73  45.35  84.24  1.4  2.9  84.24  1.4  2.9  
AGH1  64  69.48  47.28  69.43  0.11  0.22  73.35  3.9  7.9  
80  69.62  47.23  71.15  7.5  0.15  74.14  2.5  5.1  
256  73.86  46.68  75.90  1.5  2.9  81.64  3.6  7.1  
AGH2  64  68.90  47.27  68.73  0.14  0.28  72.82  5.2  0.10  
80  69.73  47.32  70.57  0.12  0.24  73.85  4.2  8.3  
256  73.00  47.65  74.90  5.3  0.11  80.45  1.1  2.2  
SSH  64  72.17  44.79  60.06  0.12  0.24  81.73  1.1  2.2  
80  72.58  46.96  83.96  1.9  3.9  80.91  1.3  2.6  
256  73.52  47.13  84.18  1.8  3.5  84.24  1.5  2.9  
DH  64  71.33  41.69  84.26  1.4  2.9  84.24  1.4  2.9  
80  70.34  37.75  84.24  4.9  9.8  84.24  4.9  9.8  
256  70.73  39.02  84.24  1.4  2.9  84.24  1.4  2.9  
NN  64  76.39  59.76  75.51  1.59  3.11  81.24  0.10  0.20  
80  75.51  59.59  77.17  2.02  3.94  81.89  0.24  0.48  
256  78.31  61.21  83.46  5.8  0.11  83.94  4.9  9.8  
Sparse  64  7  0.05  0.3  74.17  56.08  71.67  1.99  3.98  81.11  0.46  0.92 
7  0.05  1.0  74.15  51.52  69.08  0.53  1.06  81.67  0.15  0.30  
16  0.005  0.3  74.51  55.54  79.09  1.21  2.42  82.76  0.17  0.34  
256  4  0.05  1.0  74.05  60.73  78.82  3.85  7.34  81.82  1.20  2.37  
4  0.05  1.0  74.48  59.42  81.95  1.18  2.33  83.24  0.35  0.70  
6  0.005  0.3  71.73  54.76  78.34  6.10  11.30  80.85  1.02  2.01 