Efficient computation of similarity between entries in large-scale databases has attracted increasing interest, given the explosive growth of data that has to be collected, processed, stored, and searched. This problem arises naturally in applications such as image-based retrieval, ranking, classification, detection, tracking, and registration. In all these problems, given a query object (usually represented as a feature vector), one has to determine the closest entries (nearest neighbors) in a large (or huge) database. Since the notion of similarity of (for example) visual objects is rather elusive and cannot be measured explicitly, one often resorts to machine learning techniques that allow constructing similarity from examples of data. Such methods are generally referred to as similarity or metric learning.
Traditionally, similarity learning methods are divided into unsupervised and supervised, with the former relying on the data only, without using any side information. PCA-type methods (Schoelkopf et al., 1997) use the global structure of the data, while manifold learning techniques such as locally linear embedding (Roweis & Saul, 2000), eigenmaps (Belkin & Niyogi, 2003), and diffusion maps (Coifman & Lafon, 2006) consider the data as a low-dimensional manifold and use its local intrinsic structure to represent similarity. Supervised methods assume that additional information, such as class labels (Johnson & Wichern, 2002; Mika et al., 1999; Weinberger & Saul, 2009; Xing et al., 2002), distances, similar and dissimilar pairs (Davis et al., 2007), or order relations (McFee & Lanckriet, 2009; Shen et al., 2009), is provided together with the data examples. Many similarity learning methods use some representation of the distance, e.g., in the form of a parametric embedding from the original data space to some target space. In the simplest case, such an embedding is a linear projection acting as dimensionality reduction, and the metric of the target space is the Euclidean or Mahalanobis distance (Shen et al., 2009; Weinberger & Saul, 2009).
More recently, motivated by the need for efficient techniques for big data, there has been an increased interest in similarity learning methods based on embedding the data in spaces of binary codes with the Hamming metric (Gong et al., 2012; Gong & Lazebnik, 2011; Kulis & Darrell, 2009; Liu et al., 2012; Norouzi et al., 2012; Norouzi & Fleet, 2011; Wang et al., 2010). Such an embedding can be considered a hashing function acting on the data that tries to preserve some underlying similarity. Notable examples in the unsupervised setting of this problem include locality-sensitive hashing (LSH) (Gionis et al., 1999) and spectral hashing (Weiss et al., 2008; Liu et al., 2011), which try to approximate some trusted standard similarity such as the Jaccard index or the cosine distance. Similarly, Yagnik et al. (2011) proposed computing ordinal embeddings based on partial order statistics, such that the Hamming distance in the resulting space closely correlates with rank similarity measures. Unsupervised methods, however, cannot learn semantic similarities given by example data. Shakhnarovich et al. (2003) proposed to construct optimal LSH-like similarity-sensitive hashes
(SSH) for data with a given binary similarity function using boosting, considering each dimension of the hashing function as a weak classifier. In the same setting, a simple method based on the eigendecomposition of the covariance matrices of positive and negative samples was proposed by Strecha et al. (2012). Masci et al. (2011) posed the problem as neural network learning. Hashing methods have been successfully used in various vision applications such as large-scale retrieval (Torralba et al., 2008b), feature descriptor learning (Strecha et al., 2012; Masci et al., 2011), image matching (Korman & Avidan, 2011), and alignment (Bronstein et al., 2010).
The appealing properties of such similarity-preserving hashing methods are the compactness of the representation and the low complexity of distance computation: finding similar objects amounts to determining hash collisions (i.e., looking for nearest neighbors in Hamming balls of radius zero), with complexity practically constant in the database size. In practice, however, most methods must consider nearest neighbors lying at radii larger than zero, which cannot be done as efficiently. The reason is that simple hash functions (typically low-dimensional linear projections) have difficulty achieving simultaneously high precision and recall when only hash collisions are allowed.
Main contributions. In this paper, we propose to introduce structure into the binary representation at the expense of its length, an idea that has proven spectacularly powerful and has led to numerous applications of sparse redundant representations and compressed sensing techniques. We introduce a sparse similarity-preserving hashing technique, SparseHash, and show substantial evidence of its superior recall at precision comparable to that of state-of-the-art methods, on top of its intrinsic computational benefits. To the best of our knowledge, this is the first time sparse structures are employed in similarity-preserving hashing. We also show that the proposed sparse hashing technique can be thought of as a feed-forward neural network, whose architecture is motivated by the iterative shrinkage algorithms used for sparse representation pursuit (Daubechies et al., 2004). The network is trained using stochastic gradient descent, which scales to very large training sets. Finally, we present an extension of SparseHash to multi-modal data, allowing its use in multi-modal and cross-modality retrieval tasks.
Let $\mathcal{X}$ denote the data (or feature) space, equipped with a binary similarity function $s : \mathcal{X} \times \mathcal{X} \to \{0,1\}$. In some cases, the similarity function can be obtained by thresholding some trusted metric on $\mathcal{X}$, such as the $\ell_2$ metric; in other cases, the data form a (typically latent) low-dimensional manifold, whose geodesic metric is more meaningful than that of the embedding Euclidean space. In yet other cases, $s$ represents a semantic rather than geometric notion of similarity, and may thus violate metric properties. It is customary to partition $\mathcal{X} \times \mathcal{X}$ into similar pairs of points (positives) $\mathcal{P}$ and dissimilar pairs of points (negatives) $\mathcal{N}$.
Similarity-preserving hashing is the problem of representing the data from the space $\mathcal{X}$ in the space $\mathbb{H}^m = \{0,1\}^m$ of $m$-dimensional binary vectors with the Hamming metric $d_{\mathbb{H}}$, by means of an embedding $\xi : \mathcal{X} \to \mathbb{H}^m$ that preserves the original similarity relation, in the sense that there exist two radii $r < R$ such that, with high probability, $d_{\mathbb{H}}(\xi(x), \xi(x')) \le r$ for $(x, x') \in \mathcal{P}$ and $d_{\mathbb{H}}(\xi(x), \xi(x')) \ge R$ for $(x, x') \in \mathcal{N}$.
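As an illustration of this two-radius criterion (not part of the original formulation; the function and variable names below are ours), the following sketch checks it on toy 4-bit codes:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary codes given as 0/1 arrays."""
    return int(np.sum(a != b))

def preserves_similarity(codes, positives, negatives, r, R):
    """Check the embedding criterion: positive pairs should lie within
    radius r, negative pairs at distance >= R (with r < R)."""
    ok_pos = all(hamming(codes[i], codes[j]) <= r for i, j in positives)
    ok_neg = all(hamming(codes[i], codes[j]) >= R for i, j in negatives)
    return ok_pos and ok_neg

# Toy 4-bit codes for three items: items 0 and 1 are similar, 2 is dissimilar.
codes = {0: np.array([1, 0, 1, 0]),
         1: np.array([1, 0, 1, 1]),
         2: np.array([0, 1, 0, 1])}
print(preserves_similarity(codes, [(0, 1)], [(0, 2), (1, 2)], r=1, R=3))
```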
In practice, the similarity is frequently unknown and hard to model; however, it is possible to sample it on some subset of the data. In this setting, the problem of similarity-preserving hashing boils down to finding an embedding minimizing the aggregate of false positive and false negative rates,

$$\min_{\xi}\ \Pr\big(d_{\mathbb{H}}(\xi(x), \xi(x')) > r \mid (x, x') \in \mathcal{P}\big) \;+\; \Pr\big(d_{\mathbb{H}}(\xi(x), \xi(x')) < R \mid (x, x') \in \mathcal{N}\big). \qquad (1)$$
Problem (1) is highly non-linear and non-convex. We list below several methods for its optimization.
Similarity-sensitive hashing (SSH). Shakhnarovich et al. (2003) studied a particular setting of problem (1) with an embedding of the form $\xi(x) = \mathrm{sign}(Px + t)$, where $P$ is an $m \times n$ projection matrix and $t$ is an $m \times 1$ bias vector, and proposed the SSH algorithm, which constructs the dimensions of $\xi$ one by one using boosting. The expectations in (1) are weighted, with stronger weights given to pairs misclassified at the previous iteration.
Diff-hash (DH). Strecha et al. (2012) considered a relaxed version of problem (1) with the same parametrization of the embedding, leading to

$$\min_{P}\ \mathrm{tr}\left(P\,(\Sigma_{\mathcal{P}} - \Sigma_{\mathcal{N}})\,P^{\mathrm{T}}\right), \qquad (2)$$

where $\Sigma_{\mathcal{P}}$ and $\Sigma_{\mathcal{N}}$ are the covariance matrices of the differences of positive and negative samples, respectively. Solving (2) w.r.t. the projection matrix $P$ amounts to finding the $m$ smallest eigenvectors of the covariance difference matrix $\Sigma_{\mathcal{P}} - \Sigma_{\mathcal{N}}$. The bias vector $t$ is found separately, independently for each dimension.
Neural network hashing (NN-hash). Masci et al. (2011) realized the function $\xi(x) = \mathrm{sign}(Px + t)$ as a single-layer neural network with $\tanh$ activation, where the coefficients $P$ and $t$ act as the layer weights and bias, respectively. Coupling two such networks with identical parameters in a so-called siamese architecture (Hadsell et al., 2006; Taylor et al., 2011), one can represent the loss (1) as

$$L(P, t) = \mathbb{E}\left[\|\xi(x) - \xi(x')\|_2^2 \,\middle|\, \mathcal{P}\right] \;+\; \mathbb{E}\left[\max\{0,\ \mu - \|\xi(x) - \xi(x')\|_2\}^2 \,\middle|\, \mathcal{N}\right]. \qquad (3)$$

The second term in (3) is a hinge loss providing robustness to outliers and producing a mapping under which negatives are pulled apart.
Finding the network parameters $P$ and $t$ minimizing the loss function (3) is done using standard NN learning techniques, e.g., the back-propagation algorithm (LeCun, 1985). Compared to SSH and DH, NN-hash attempts to solve the full non-linear problem rather than resorting to often-suboptimal solutions of relaxed linearized or separable problems such as (2).
3 Sparse similarity-preserving hashing
The selection of the number of bits $m$ and the rejection Hamming radius $r$ in a similarity-preserving hash has an important influence on the tradeoff between precision and recall. Increasing $m$ increases the precision, as a higher-dimensional embedding space allows representing more complicated decision boundaries. At the same time, as $m$ increases, the relative volume of the Hamming ball of radius $r$ containing the positives decays exponentially fast, a phenomenon known as the curse of dimensionality, resulting in a rapid decrease of the recall. This is a well-documented phenomenon affecting all hashing techniques (Grauman & Fergus, 2013). For instance, in the context of LSH, it can be shown that the collision probability between two points decreases exponentially with the code length (Goemans & Williamson, 1995). Furthermore, increasing $m$ slows down the retrieval.
The low recall typical of long codes can be improved by increasing the rejection radius $r$. However, this comes at the expense of increased query time, since the search complexity directly depends on $r$. For $r = 0$ (exact collision), a look-up table (LUT) is used: the query code is fed into the LUT, which returns all entries in the database having the same code. The complexity is $\mathcal{O}(1)$, independent of the database size $N$, but often with a large constant. For small $r > 0$ (partial collision), the search is done as for $r = 0$ using perturbations of the query: at most $r$ bits of the query are flipped, and each perturbed code is fed into the LUT. The final result is the union of all the retrieved results. The complexity in this case is $\mathcal{O}\!\left(\sum_{i=0}^{r} \binom{m}{i}\right)$. Finally, for large radii it is often cheaper in practice to use exhaustive search, with complexity $\mathcal{O}(N)$ (for typical code lengths and database sizes used in vision applications, radii larger than a few bits are slower than brute-force search (Grauman & Fergus, 2013)). Consequently, practical retrieval based on similarity-preserving hashing schemes suffers from a fundamental precision-recall-speed tradeoff: one has to choose between fast retrieval (small $m$ and $r$, resulting in low recall), high recall (large $r$, small $m$, slow retrieval), or high precision (large $m$, small $r$, low recall, and slow retrieval).
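The LUT-with-perturbations search described above can be sketched as follows (an illustrative toy implementation with our own names, using small integer codes as hash keys):

```python
import itertools
from collections import defaultdict

def build_lut(db_codes):
    """Index database entries by their exact binary code (r = 0 collisions)."""
    lut = defaultdict(list)
    for idx, code in enumerate(db_codes):
        lut[code].append(idx)
    return lut

def search(lut, query, m, r):
    """Retrieve all entries within Hamming radius r by perturbing the query:
    flip every subset of at most r of the m bits and probe the LUT.
    Number of probes: sum_{i<=r} C(m, i), which grows quickly with r."""
    results = []
    for k in range(r + 1):
        for bits in itertools.combinations(range(m), k):
            probe = query
            for b in bits:
                probe ^= (1 << b)  # flip bit b
            results.extend(lut.get(probe, []))
    return results

db = [0b1010, 0b1011, 0b0101, 0b1110]
lut = build_lut(db)
print(sorted(search(lut, 0b1010, m=4, r=1)))  # → [0, 1, 3]
```

For large $r$ the number of probes exceeds the database size, which is why brute-force scanning wins in that regime.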
The key idea of this paper is to control the exploding volume of the embedding space by introducing structure into the binary code. While different types of structure can be considered in principle, we limit our attention to sparse hash codes. A number of recent studies have demonstrated that sparse over-complete representations have several theoretical and practical advantages when modeling compressible data, leading to state-of-the-art results in many applications in computer vision and machine learning. We argue, and show experimentally in Section 5, that compared to its "dense" counterpart, an $m$-bit $k$-sparse similarity-preserving hash can enjoy the high precision typical of long hashes, while having recall roughly comparable to that of a dense hashing scheme with about $\log_2 \binom{m}{k}$ bits (which has the same number of degrees of freedom as the $m$-bit $k$-sparse hash).
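The degrees-of-freedom comparison can be made concrete: a $k$-sparse $m$-bit code can take $\binom{m}{k}$ values, the capacity of a dense code of $\log_2 \binom{m}{k}$ bits. The values of $m$ and $k$ below are arbitrary illustrations:

```python
from math import comb, log2

m, k = 128, 6  # hypothetical: a 128-bit code with ~6 active bits
n_sparse_codes = comb(m, k)              # number of k-sparse m-bit codes
equiv_dense_bits = log2(n_sparse_codes)  # dense code length of equal capacity
print(f"{equiv_dense_bits:.1f} dense bits match a {k}-sparse {m}-bit code")
```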
SparseHash. In order to achieve sparsity, a regularization needs to be incorporated into problem (1), so that the obtained embedding produces codes having only a small number of non-zero elements. In this work we employ an $\ell_1$-norm regularization, extensively used in the compressed sensing literature to promote sparsity. Specifically, the loss considered in the minimization of the proposed SparseHash framework is given by the average of

$$\ell(x, x') = s\,\|\xi(x) - \xi(x')\|_1 \;+\; \beta\,(1 - s)\,\max\{0,\ \mu - \|\xi(x) - \xi(x')\|_1\} \;+\; \alpha\left(\|\xi(x)\|_1 + \|\xi(x')\|_1\right) \qquad (4)$$

over the training set, where $s$ is the groundtruth similarity function (1: similar, 0: dissimilar), $\alpha$ is a parameter controlling the level of sparsity, $\beta$ is a weighting parameter governing the false positive and negative rate tradeoff, and $\mu$ is a margin.
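A contrastive loss with an $\ell_1$ sparsity penalty of this kind can be evaluated on toy codes as follows. The implementation below is only an illustration of the interplay of the three terms (attraction, margin repulsion, sparsity); the symbol names and default values are ours, not the paper's exact formulation:

```python
import numpy as np

def sparse_hash_loss(zx, zy, s, alpha=0.1, beta=1.0, mu=4.0):
    """Illustrative contrastive loss with an l1 sparsity penalty:
    pull similar codes together, push dissimilar ones at least mu apart,
    and encourage few non-zero components in each code."""
    d = np.abs(zx - zy).sum()                  # l1 distance between codes
    attract = s * d                            # similar pairs: want small d
    repel = beta * (1 - s) * max(0.0, mu - d)  # dissimilar pairs: want d >= mu
    sparsity = alpha * (np.abs(zx).sum() + np.abs(zy).sum())
    return attract + repel + sparsity

zx = np.array([0.0, 1.0, 0.0, 0.0])
zy = np.array([0.0, 1.0, 0.0, 1.0])
print(sparse_hash_loss(zx, zy, s=1))  # similar pair: penalized by distance
print(sparse_hash_loss(zx, zy, s=0))  # dissimilar pair: penalized if too close
```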
With the new loss function (4), solving (1) produces a sparse embedding that minimizes the aggregate of false positive and false negative rates for a given parametrization. The question is then which parametrized family of embedding functions leads to the best sparse similarity-preserving hashing codes. While there is no absolute answer to this question, recent approaches aimed at finding fast approximations of sparse codes have shed some light on the issue from a practical perspective (Gregor & LeCun, 2010; Sprechmann et al., 2012), and we adopt the same criterion for our proposed framework.
Gregor and LeCun (2010) proposed tailored feed-forward architectures capable of producing highly accurate approximations of the true sparse codes. These architectures were designed to mimic the iterations of successful first-order optimization algorithms such as the iterative shrinkage-thresholding algorithm (ISTA) (Daubechies et al., 2004). The close relation between the iterative solvers and the network architectures plays a fundamental role in the quality of the approximation: this particular design of the encoder was shown to lead to considerably better approximations than other off-the-shelf feed-forward neural networks (Gregor & LeCun, 2010). These ideas generalize to many different uses of sparse coding; in particular, they can be very effective in discriminative scenarios, performing similarly to or better than exact algorithms at a small fraction of the computational cost (Sprechmann et al., 2012). Such architectures are flexible enough to approximate the sparse hash in (4).¹

¹The hash codes produced by the proposed architecture can only be made $k$-sparse on average, by tuning the regularization parameter. In order to guarantee that the codes contain no more than $k$ non-zeros, one can resort to the CoD encoders, derived from the coordinate descent pursuit algorithm (Gregor & LeCun, 2010), in which $k$ is upper-bounded by the number of network layers. We stress that in our application exact sparsity is not important, since we obtain the same qualitative behavior.
Implementation. We implement SparseHash by coupling two ISTA-type networks sharing the same set of parameters, as described in (Gregor & LeCun, 2010; Sprechmann et al., 2012), trained using the loss (4). The architecture of an ISTA-type network (Figure 1) can also be seen as a recurrent network with a soft-threshold activation function. A conventional ISTA network, designed to obtain sparse representations with fixed complexity, has continuous output units; we follow the approach of Masci et al. (2011) and obtain binary codes by adding a $\tanh$ activation function. Such a smooth approximation of the binary outputs is also similar to the logistic function used by KSH (Liu et al., 2012). We initialize the network weights with a unit-length normalized random subset of the training vectors, used as the dictionary as in the original ISTA algorithm, and the thresholds $\lambda$ with zeros. The shrinkage activation is defined as $\sigma_\lambda(z) = \mathrm{sign}(z)\max\{0, |z| - \lambda\}$.
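The forward pass of such an ISTA-type encoder can be sketched as follows. This is a minimal illustration with arbitrary dimensions and untrained, randomly initialized parameters, not the trained SparseHash network:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrinkage activation: sign(z) * max(0, |z| - lam)."""
    return np.sign(z) * np.maximum(0.0, np.abs(z) - lam)

def ista_net_forward(x, W, S, lam, n_layers=3):
    """LISTA-style encoder sketch: b = Wx is the input injection, S the
    mutual-inhibition matrix; each layer applies one shrinkage step.
    A tanh on the output gives a smooth approximation of binary codes."""
    b = W @ x
    z = soft_threshold(b, lam)
    for _ in range(n_layers - 1):
        z = soft_threshold(b + S @ z, lam)
    return np.tanh(z)  # near-binary, sparse output

rng = np.random.default_rng(0)
n, m = 16, 8                      # input and code dimensions (illustrative)
x = rng.standard_normal(n)
W = rng.standard_normal((m, n)) / np.sqrt(n)
S = np.eye(m) - 0.1 * (W @ W.T)   # classical ISTA coupling, step size 0.1
code = ista_net_forward(x, W, S, lam=0.5)
print(np.count_nonzero(code), "non-zeros out of", m)
```

In training, $W$, $S$, and the thresholds are learned by back-propagating the similarity-preserving loss through this unrolled recursion.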
Applying the learned hash function to a new data sample involves a few matrix multiplications and an element-wise soft thresholding, and is on par with the fastest methods available, such as SSH (Shakhnarovich et al., 2003), DH (Strecha et al., 2012), and AGH (Liu et al., 2011).
4 Multi-modal SparseHash
In modern retrieval applications, a single object is often represented by more than one data modality. For example, images are frequently accompanied by textual tags, and video by an audio track. The need to search multi-modal data requires comparing objects incommensurable in their original representation. Similarity-preserving hashing can address this need by mapping the modalities into a common space, thus making them comparable in terms of a single similarity function, such as the Hamming metric. For simplicity, we will henceforth limit our discussion to two modalities, though the presented ideas can be straightforwardly generalized to any number of modalities.
We assume the data come from two distinct spaces $\mathcal{X}$ and $\mathcal{Y}$, equipped with intra-modality similarity functions $s_{\mathcal{X}}$ and $s_{\mathcal{Y}}$, respectively. We furthermore assume the existence of an inter-modality similarity $s_{\mathcal{X}\mathcal{Y}}$. Typically, examples of similar and dissimilar objects across modalities are more expensive to obtain than their intra-modality counterparts. We construct two embeddings, $\xi : \mathcal{X} \to \mathbb{H}^m$ and $\eta : \mathcal{Y} \to \mathbb{H}^m$, in such a way that the Hamming metric preserves the similarity relations of the modalities. We distinguish between cross-modal similarity-preserving hashing, preserving only the inter-modality similarity (Bronstein et al., 2010), and the full multi-modal setting, also preserving the intra-modal similarities.
Our sparse similarity-preserving hashing technique generalizes to both settings. We construct an independent SparseHash network for each modality and train them by minimizing, with respect to the parameters of the two networks, an aggregate loss of the form

$$L = L_{\mathcal{X}\mathcal{Y}} + \gamma_{\mathcal{X}} L_{\mathcal{X}} + \gamma_{\mathcal{Y}} L_{\mathcal{Y}}, \qquad (5)$$

where each term is a loss of the form (4) computed over the corresponding inter- or intra-modality training pairs. The parameters $\gamma_{\mathcal{X}}$ and $\gamma_{\mathcal{Y}}$ control the relative importance of the intra-modality similarities, and are set to zero in the cross-modal regime. We refer to the networks constructed this way as MM-SparseHash.
5 Experimental results
We compare SparseHash to several state-of-the-art supervised and semi-supervised hashing methods: DH (Strecha et al., 2012), SSH (Shakhnarovich et al., 2003), AGH (Liu et al., 2011), KSH (Liu et al., 2012), and NN-hash (Masci et al., 2011), using code provided by the authors. For SparseHash, we use fully online training via stochastic gradient descent with annealed learning rate and momentum, fixing the maximum number of epochs to 250. A single-layer ISTA network is used in all experiments. The dense hash methods produce codes with a much larger fraction of active bits per sample, whereas SparseHash achieves much sparser and more structured codes (e.g., on CIFAR10 with hash length 128). Both sparse and dense codes are well distributed, i.e., the number of non-zero components per code has small variance.
Evaluation. We use several criteria to evaluate the performance of the methods: precision and recall (PR) for different Hamming radii, and the F1 score (their harmonic mean); mean average precision at $K$, defined as the mean over queries of $\frac{1}{\sum_{k} \mathrm{rel}(k)} \sum_{k=1}^{K} P(k)\,\mathrm{rel}(k)$, where $\mathrm{rel}(k)$ is the relevance of the $k$th result (one if relevant and zero otherwise) and $P(k)$ is the precision at $k$ (the percentage of relevant results among the first $k$ top-ranked matches); and the mean precision (MP), defined as the percentage of correct matches for a fixed number of retrieved elements. For the PR curves we use the ranking induced by the Hamming distance between the query and the database samples. When computing the latter criteria, we considered only the results falling into the Hamming ball of radius $r$.
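These ranking criteria can be sketched as follows. The code uses one common convention for average precision (normalizing by the number of relevant results retrieved); the paper's exact normalization may differ:

```python
def precision_at_k(rel, k):
    """Fraction of relevant results among the first k ranked matches."""
    return sum(rel[:k]) / k

def average_precision_at_k(rel, k):
    """AP@k: mean of precision@i taken at every relevant position i <= k."""
    hits = [precision_at_k(rel, i) for i in range(1, k + 1) if rel[i - 1]]
    return sum(hits) / len(hits) if hits else 0.0

# rel[i] = 1 if the (i+1)-th ranked result is relevant, else 0.
rel = [1, 0, 1, 1, 0]
print(precision_at_k(rel, 5))  # → 0.6
print(average_precision_at_k(rel, 5))
```

Mean average precision (mAP) is then the mean of AP@k over all queries.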
CIFAR10 (Krizhevsky, 2009) is a standard set of 60K labeled images belonging to 10 different classes, sampled from the 80M tiny images benchmark (Torralba et al., 2008a). The images are represented using 384-dimensional GIST descriptors. Following Liu et al. (2012), we used a training set of 200 images per class; for testing, we used a disjoint query set of 100 images per class and the remaining 59K images as the database.
Figure 4 shows examples of nearest-neighbor retrieval by SparseHash. The performance of the different methods is compared in Table 5 and Figures 5–6. In Figure 6 (left), we observe two phenomena: first, the recall of dense hash methods drops significantly as the hash length increases (as expected from our analysis in Section 3; increasing the hash length is needed for precision), while the recall of SparseHash, which depends on the number of non-zero elements rather than on the hash length, remains approximately unchanged. Second, SparseHash has significantly higher recall at low Hamming radii than the other methods. This is also evident in Figure 3, which shows the tradeoff between precision, recall, and retrieval time. We used efficient implementations of LUT-based and brute-force search and took the fastest of the two; with the codes used on the CIFAR10 dataset, LUT-based search showed a significant speedup for small radii, while for larger radii brute-force search was faster. To further analyze this behavior, we measured the average number of codes mapped to the same point for each of the methods; results are reported in Table 2.
NUS (Chua et al., 2009) is a dataset containing 270K annotated images from Flickr. Every image is associated with one or more of 81 different concepts and is described by a 500-dimensional bag-of-features vector. For training and evaluation, we followed the protocol of Liu et al. (2011): two images were considered neighbors if they shared at least one common concept (only the 21 most frequent concepts are considered). Testing was done on a query set of 100 images per concept; training was performed on 100K pairs of images.
Performance is shown in Table 6 and Figures 5–6; retrieved neighbors are shown in Figure 2. We again observe behavior consistent with our analysis: SparseHash significantly outperforms the other methods.
Multi-modal hashing. We repeated the experiment on the NUS dataset with the same indices of positive and negative pairs, adding the Tags modality represented as 1K-dimensional bags of words. The training set contained pairs of similar and dissimilar Images, Tags, and cross-modality Tags-Images pairs. We compare our MM-SparseHash to the cross-modal SSH (CM-SSH) method (Bronstein et al., 2010). Results are shown in Table 4 and Figure 9. MM-SparseHash significantly outperforms CM-SSH, the previously reported state of the art, in cross-modality (Images-Tags and Tags-Images) retrieval. Both methods outperform the baseline in intra-modal (Tags-Tags and Images-Images) retrieval; since the two modalities complement each other, we attribute the improvement to the ability of the model to pick up such correlations.
We presented a new method for learning sparse similarity-preserving hashing functions. The hashing is obtained by solving an $\ell_1$-regularized minimization of the aggregate of false positive and false negative rates. The embedding function is learned using ISTA-type neural networks (Gregor & LeCun, 2010), whose particular architecture is very effective for learning discriminative sparse codes. We also showed that, once the similarity-preserving hashing problem is stated as training a neural network, it can be straightforwardly extended to the multi-modal setting. While in this work we only used networks with a single layer, more generic embeddings could be learned within the exact same framework simply by considering multiple layers.
A key contribution of this paper is to show that more accurate nearest-neighbor retrieval can be obtained by introducing sparsity into the hashing code. SparseHash can achieve significantly higher recall, at the same levels of precision, than dense hashing schemes with a similar number of degrees of freedom. At the same time, the sparsity of the hash codes allows retrieving partial collisions at much lower computational complexity than dense codes in a Hamming ball of the same radius. Extensive experimental results back up these claims, showing that the proposed SparseHash framework produces results comparable or superior to state-of-the-art methods.
- Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
- Bronstein et al. (2010) Bronstein, M. M. et al. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proc. CVPR, 2010.
- Chua et al. (2009) Chua, T.-S. et al. NUS-WIDE: A real-world web image database from national university of Singapore. In Proc. CIVR, 2009.
- Coifman & Lafon (2006) Coifman, R. R. and Lafon, S. Diffusion maps. App. Comp. Harmonic Analysis, 21(1):5–30, 2006.
- Daubechies et al. (2004) Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure and App. Math., 57(11):1413–1457, 2004.
- Davis et al. (2007) Davis et al. Information-theoretic metric learning. In Proc. ICML, 2007.
- Gionis et al. (1999) Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In Proc. VLDB, 1999.
- Goemans & Williamson (1995) Goemans, M. and Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145, 1995.
- Gong & Lazebnik (2011) Gong, Y. and Lazebnik, S. Iterative quantization: A procrustean approach to learning binary codes. In Proc. CVPR, 2011.
- Gong et al. (2012) Gong, Y. et al. Angular quantization-based binary codes for fast similarity search. In Proc. NIPS, 2012.
- Grauman & Fergus (2013) Grauman, K. and Fergus, R. Learning binary hash codes for large-scale image search. In Machine Learning for Computer Vision, pp. 49–87. Springer, 2013.
- Gregor & LeCun (2010) Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In ICML, 2010.
- Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
- Johnson & Wichern (2002) Johnson, R. A. and Wichern, D. W. Applied multivariate statistical analysis, volume 4. Prentice Hall, 2002.
- Korman & Avidan (2011) Korman, S. and Avidan, S. Coherency sensitive hashing. In Proc. ICCV, 2011.
- Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
- Kulis & Darrell (2009) Kulis, B. and Darrell, T. Learning to hash with binary reconstructive embeddings. In Proc. NIPS, 2009.
- LeCun (1985) LeCun, Y. Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pp. 599–604, 1985.
- Liu et al. (2011) Liu, W. et al. Hashing with graphs. In Proc. ICML, 2011.
- Liu et al. (2012) Liu, W. et al. Supervised hashing with kernels. In Proc. CVPR, 2012.
- Masci et al. (2011) Masci, J. et al. Descriptor learning for omnidirectional image matching. Technical Report arXiv:1112.6291, 2011.
- McFee & Lanckriet (2009) McFee, B. and Lanckriet, G. R. G. Partial order embedding with multiple kernels. In Proc. ICML, 2009.
- Mika et al. (1999) Mika, S. et al. Fisher discriminant analysis with kernels. In Proc. Neural Networks for Signal Processing, 1999.
- Norouzi & Fleet (2011) Norouzi, M. and Fleet, D. Minimal loss hashing for compact binary codes. In Proc. ICML, 2011.
- Norouzi et al. (2012) Norouzi, M., Fleet, D., and Salakhutdinov, R. Hamming distance metric learning. In Proc. NIPS, 2012.
- Roweis & Saul (2000) Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.
- Schoelkopf et al. (1997) Schoelkopf, B., Smola, A., and Mueller, K. R. Kernel principal component analysis. Artificial Neural Networks, pp. 583–588, 1997.
- Shakhnarovich et al. (2003) Shakhnarovich, G., Viola, P., and Darrell, T. Fast pose estimation with parameter-sensitive hashing. In Proc. CVPR, 2003.
- Shen et al. (2009) Shen, C. et al. Positive semidefinite metric learning with boosting. In Proc. NIPS, 2009.
- Sprechmann et al. (2012) Sprechmann, P., Bronstein, A. M., and Sapiro, G. Learning efficient sparse and low rank models. Technical Report arXiv:1010.3467, 2012.
- Strecha et al. (2012) Strecha, C. et al. LDAHash: Improved matching with smaller descriptors. PAMI, 34(1):66–78, 2012.
- Taylor et al. (2011) Taylor, G. W. et al. Learning invariance through imitation. In Proc. CVPR, 2011.
- Torralba et al. (2008a) Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny images: A large data set for nonparametric object and scene recognition. PAMI, 30(11):1958–1970, 2008a.
- Torralba et al. (2008b) Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. In Proc. CVPR, 2008b.
- Wang et al. (2010) Wang, J., Kumar, S., and Chang, S.-F. Sequential projection learning for hashing with compact codes. In Proc. ICML, 2010.
- Weinberger & Saul (2009) Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
- Weiss et al. (2008) Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Proc. NIPS, 2008.
- Xing et al. (2002) Xing, E. P. et al. Distance metric learning with application to clustering with side-information. In Proc. NIPS, 2002.
- Yagnik et al. (2011) Yagnik, J. et al. The power of comparative reasoning. In Proc. CVPR, 2011.
Supplementary Material for Sparse similarity-preserving hashing