Sparse similarity-preserving hashing

12/19/2013 · by Jonathan Masci, et al.

In recent years, a lot of attention has been devoted to efficient nearest neighbor search by means of similarity-preserving hashing. One of the main limitations of existing hashing techniques is the intrinsic trade-off between performance and computational complexity: while longer hash codes allow for lower false positive rates, it is very difficult to increase the embedding dimensionality without incurring very high false negative rates or prohibitive computational costs. In this paper, we propose a way to overcome this limitation by enforcing the hash codes to be sparse. Sparse high-dimensional codes enjoy the low false positive rates typical of long hashes, while keeping false negative rates similar to those of a shorter dense hashing scheme with an equal number of degrees of freedom. We use a tailored feed-forward neural network as the hashing function. Extensive experimental evaluation involving visual and multi-modal data shows the benefits of the proposed method.


1 Introduction

Efficient computation of similarity between entries in large-scale databases has attracted increasing interest, given the explosive growth of data that has to be collected, processed, stored, and searched for. This problem arises naturally in applications such as image-based retrieval, ranking, classification, detection, tracking, and registration. In all these problems, given a query object (usually represented as a feature vector), one has to determine the closest entries (nearest neighbors) in a large (or huge) database. Since the notion of similarity of (for example) visual objects is rather elusive and cannot be measured explicitly, one often resorts to machine learning techniques that allow constructing similarity from examples of data. Such methods are generally referred to as similarity or metric learning.

Traditionally, similarity learning methods can be divided into unsupervised and supervised, with the former relying on the data only, without using any side information. PCA-type methods (Schoelkopf et al., 1997) use the global structure of the data, while manifold learning techniques such as locally linear embedding (Roweis & Saul, 2000), eigenmaps (Belkin & Niyogi, 2003), and diffusion maps (Coifman & Lafon, 2006) consider the data as a low-dimensional manifold and use its local intrinsic structure to represent similarity. Supervised methods assume that additional information, such as class labels (Johnson & Wichern, 2002; Mika et al., 1999; Weinberger & Saul, 2009; Xing et al., 2002), distances, similar and dissimilar pairs (Davis et al., 2007), or order relations (McFee & Lanckriet, 2009; Shen et al., 2009), is provided together with the data examples. Many similarity learning methods use some representation of the distance, e.g., in the form of a parametric embedding from the original data space to some target space. In the simplest case, such an embedding is a linear projection acting as dimensionality reduction, and the metric of the target space is the Euclidean or Mahalanobis distance (Shen et al., 2009; Weinberger & Saul, 2009).

More recently, motivated by the need for efficient techniques for big data, there has been an increased interest in similarity learning methods based on embedding the data in spaces of binary codes with the Hamming metric (Gong et al., 2012; Gong & Lazebnik, 2011; Kulis & Darrell, 2009; Liu et al., 2012; Norouzi et al., 2012; Norouzi & Fleet, 2011; Wang et al., 2010). Such an embedding can be considered as a hashing function acting on the data and trying to preserve some underlying similarity. Notable examples of the unsupervised setting of this problem include locality sensitive hashing (LSH) (Gionis et al., 1999) and spectral hashing (Weiss et al., 2008; Liu et al., 2011), which try to approximate some trusted standard similarity such as the Jaccard index or the cosine distance. Similarly, Yagnik et al. (2011) proposed computing ordinal embeddings based on partial order statistics such that the Hamming distance in the resulting space closely correlates with rank similarity measures. Unsupervised methods cannot be used to learn semantic similarities given by example data. Shakhnarovich et al. (2003) proposed to construct optimal LSH-like similarity-sensitive hashes (SSH) for data with a given binary similarity function using boosting, considering each dimension of the hashing function as a weak classifier. In the same setting, a simple method based on the eigendecomposition of covariance matrices of positive and negative samples was proposed by Strecha et al. (2012). Masci et al. (2011) posed the problem as neural network learning. Hashing methods have been used successfully in various vision applications such as large-scale retrieval (Torralba et al., 2008b), feature descriptor learning (Strecha et al., 2012; Masci et al., 2011), image matching (Korman & Avidan, 2011) and alignment (Bronstein et al., 2010).

The appealing property of such similarity-preserving hashing methods is the compactness of the representation and the low complexity involved in distance computation: finding similar objects is done by determining hash collisions (i.e., looking for nearest neighbors in Hamming metric balls of radius zero), with complexity practically constant in the database size. In practice, however, most methods consider nearest neighbors lying at radii larger than zero, which cannot be done as efficiently. The reason is that simple hash functions (typically low-dimensional linear projections) have difficulty achieving simultaneously high precision and recall when only hash collisions are required.

Main contributions. In this paper, we propose to introduce structure into the binary representation at the expense of its length, an idea that has been shown to be spectacularly powerful and has led to numerous applications of sparse redundant representation and compressed sensing techniques. We introduce a sparse similarity-preserving hashing technique, SparseHash, and show substantial evidence of its superior recall at precision comparable to that of state-of-the-art methods, on top of its intrinsic computational benefits. To the best of our knowledge, this is the first time sparse structures are employed in similarity-preserving hashing. We also show that the proposed sparse hashing technique can be thought of as a feed-forward neural network, whose architecture is motivated by the iterative shrinkage algorithms used for sparse representation pursuit (Daubechies et al., 2004). The network is trained using stochastic gradient descent, scalable to very large training sets. Finally, we present an extension of SparseHash to multi-modal data, allowing its use in multi-modal and cross-modality retrieval tasks.

2 Background

Let $\mathcal{X} \subseteq \mathbb{R}^n$ be the data (or feature) space with a binary similarity function $s : \mathcal{X} \times \mathcal{X} \to \{0,1\}$. In some cases, the similarity function can be obtained by thresholding some trusted metric on $\mathcal{X}$ such as the $\ell_2$ metric; in other cases, the data form a (typically latent) low-dimensional manifold, whose geodesic metric is more meaningful than that of the embedding Euclidean space. In yet other cases, $s$ represents a semantic rather than geometric notion of similarity, and may thus violate metric properties. It is customary to partition $\mathcal{X} \times \mathcal{X}$ into similar pairs of points (positives) $\mathcal{P} = \{(x,x') : s(x,x') = 1\}$, and dissimilar pairs of points (negatives) $\mathcal{N} = \{(x,x') : s(x,x') = 0\}$.

Similarity-preserving hashing is the problem of representing the data from the space $\mathcal{X}$ in the space $\mathbb{H}^m$ of $m$-dimensional binary vectors with the Hamming metric $d_{\mathbb{H}^m}$, by means of an embedding $\xi : \mathcal{X} \to \mathbb{H}^m$ that preserves the original similarity relation, in the sense that there exist two radii $r < R$ such that, with high probability, $d_{\mathbb{H}^m}(\xi(x), \xi(x')) \le r$ for $(x,x') \in \mathcal{P}$ and $d_{\mathbb{H}^m}(\xi(x), \xi(x')) \ge R$ for $(x,x') \in \mathcal{N}$.

In practice, the similarity is frequently unknown and hard to model; however, it is possible to sample it on some subset of the data. In this setting, the problem of similarity-preserving hashing boils down to finding an embedding minimizing the aggregate of false positive and false negative rates,

$$\min_{\xi}\; \mathbb{E}\left\{ d_{\mathbb{H}^m}(\xi(x), \xi(x')) \mid \mathcal{P} \right\} \;-\; \alpha\, \mathbb{E}\left\{ d_{\mathbb{H}^m}(\xi(x), \xi(x')) \mid \mathcal{N} \right\}. \qquad (1)$$

Problem (1) is highly non-linear and non-convex. We list below several methods for its optimization.

Similarity-sensitive hashing (SSH). Shakhnarovich et al. (2003) studied a particular setting of problem (1) with an embedding of the form $\xi(x) = \mathrm{sign}(\mathbf{P}x + \mathbf{b})$, where $\mathbf{P}$ is an $m \times n$ projection matrix and $\mathbf{b}$ is an $m \times 1$ bias vector, and proposed the SSH algorithm constructing the dimensions of $\xi$ one-by-one using boosting. The expectations in (1) are weighted, with stronger weights given to pairs misclassified at the previous iteration.

Diff-hash (DH). Strecha et al. (2012) linearized the embedding, replacing $\xi(x) = \mathrm{sign}(\mathbf{P}x + \mathbf{b})$ with $\mathbf{P}x$, observing that in this case (1) can be written as

$$\min_{\mathbf{P}\mathbf{P}^{\mathrm{T}} = \mathbf{I}}\; \mathrm{tr}\left( \mathbf{P} \left( \boldsymbol{\Sigma}_{\mathcal{P}} - \alpha \boldsymbol{\Sigma}_{\mathcal{N}} \right) \mathbf{P}^{\mathrm{T}} \right), \qquad (2)$$

where $\boldsymbol{\Sigma}_{\mathcal{P}}$ and $\boldsymbol{\Sigma}_{\mathcal{N}}$ are the covariance matrices of the differences of positive and negative samples, respectively. Solving (2) w.r.t. the projection matrix $\mathbf{P}$ amounts to finding the smallest eigenvectors of the covariance difference matrix $\boldsymbol{\Sigma}_{\mathcal{P}} - \alpha \boldsymbol{\Sigma}_{\mathcal{N}}$. The bias vector $\mathbf{b}$ is found separately, independently for each dimension.

Neural network hashing (NN-hash). Masci et al. (2011) realized the function $\xi(x) = \mathrm{sign}(\mathbf{P}x + \mathbf{b})$ as a single-layer neural network with $\tanh$ activation function, where the coefficients $\mathbf{P}$ and $\mathbf{b}$ act as the layer weights and bias, respectively. Coupling two such networks with identical parameters in a so-called siamese architecture (Hadsell et al., 2006; Taylor et al., 2011), one can represent the loss (1) as

$$L(\mathbf{P}, \mathbf{b}) = \mathbb{E}\left\{ \|\xi(x) - \xi(x')\|_2^2 \mid \mathcal{P} \right\} + \alpha\, \mathbb{E}\left\{ \max\left(0,\, M - \|\xi(x) - \xi(x')\|_2\right)^2 \mid \mathcal{N} \right\}. \qquad (3)$$

The second term in (3) is a hinge loss providing robustness to outliers and producing a mapping for which negatives are pulled $M$-apart. Finding the network parameters $\mathbf{P}$ and $\mathbf{b}$ minimizing loss function (3) is done using standard NN learning techniques, e.g., the back-propagation algorithm (LeCun, 1985). Compared to SSH and DH, NN-hash attempts to solve the full non-linear problem rather than relying on often suboptimal solutions of the relaxed linearized or separable problem such as (2).
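To make the structure of this siamese loss concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code): it evaluates a contrastive loss of the form (3) for a batch of pairs, with a single tanh layer standing in for the hashing network and the values of alpha and M chosen arbitrarily.

```python
import numpy as np

def nn_hash_loss(X1, X2, s, P, b, alpha=1.0, M=4.0):
    """Siamese contrastive loss for a single-layer tanh hashing network.

    X1, X2  : (n, d) arrays of paired samples
    s       : (n,) array with 1 for similar pairs, 0 for dissimilar pairs
    P, b    : shared projection matrix (m, d) and bias (m,)
    alpha, M: weight of the negative term and margin (illustrative values)
    """
    Y1 = np.tanh(X1 @ P.T + b)    # smooth approximation of binary codes
    Y2 = np.tanh(X2 @ P.T + b)
    dist = np.linalg.norm(Y1 - Y2, axis=1)
    pos = s * dist**2                                      # pull similar pairs together
    neg = alpha * (1 - s) * np.maximum(0.0, M - dist)**2   # push dissimilar pairs M apart
    return np.mean(pos + neg)

# toy usage
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
s = rng.integers(0, 2, size=8)
P, b = 0.1 * rng.normal(size=(12, 16)), np.zeros(12)
print(nn_hash_loss(X1, X2, s, P, b))
```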

3 Sparse similarity-preserving hashing

The selection of the number of bits $m$ and the rejection Hamming radius $r$ in a similarity-preserving hash has an important influence on the tradeoff between precision and recall. Increasing $m$ increases the precision, as a higher-dimensional embedding space allows representing more complicated decision boundaries. At the same time, with the increase of $m$, the relative volume of the Hamming ball of radius $r$ containing the positives decays exponentially fast, a phenomenon known as the curse of dimensionality, resulting in a rapid decrease of the recall. This is a well-documented phenomenon that affects all hashing techniques (Grauman & Fergus, 2013). For instance, in the context of LSH, it can be shown that the collision probability between two points decreases exponentially with the code length (Goemans & Williamson, 1995). Furthermore, increasing $m$ slows down the retrieval.

The low recall typical of long codes can be improved by increasing the rejection radius $r$. However, this comes at the expense of increased query time, since the search complexity directly depends on the rejection radius. For $r = 0$ (collision), a look-up table (LUT) is used: the query code is fed into the LUT, which contains all entries in the database having the same code. The complexity is $\mathcal{O}(1)$, independent of the database size $n$, but often with a large constant. For small $r > 0$ (partial collision), the search is done as for $r = 0$ using perturbations of the query: at most $r$ bits of the query are changed, and each perturbed code is fed into the LUT. The final result is the union of all the retrieved results. The complexity in this case is $\mathcal{O}(m^r)$. Finally, for large radii it is often cheaper in practice to use exhaustive search with complexity $\mathcal{O}(n)$ (for typical code lengths and database sizes used in vision applications, using large $r$ is slower than brute-force search (Grauman & Fergus, 2013)). Consequently, practical retrieval based on similarity-preserving hashing schemes suffers from a fundamental limitation of the precision-recall-speed tradeoff: one has to choose between fast retrieval (small $r$ and $m$, resulting in low recall), high recall (large $r$ and small $m$, slow retrieval), or high precision (large $m$, low recall, and slow retrieval).
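For illustration, the following sketch (our own code, under the assumption that codes are stored as 0/1 tuples in a Python dictionary) implements LUT-based retrieval with query perturbation: all items within Hamming radius r of the query are found by probing the table with every code obtained by flipping at most r bits of the query, which is why the cost grows roughly as O(m^r) and only small radii are practical.

```python
from itertools import combinations
from collections import defaultdict

def build_lut(codes):
    """codes: dict id -> tuple of 0/1 bits. Returns bucket table code -> list of ids."""
    lut = defaultdict(list)
    for idx, code in codes.items():
        lut[code].append(idx)
    return lut

def query_lut(lut, q, r):
    """Return ids whose code lies within Hamming radius r of the query code q."""
    m = len(q)
    results = []
    for k in range(r + 1):                      # flip 0, 1, ..., r bits
        for bits in combinations(range(m), k):  # C(m, k) perturbations at level k
            probe = list(q)
            for i in bits:
                probe[i] ^= 1
            results.extend(lut.get(tuple(probe), []))
    return results

# toy usage with 4-bit codes
db = {0: (0, 0, 0, 0), 1: (0, 0, 0, 1), 2: (1, 1, 1, 1)}
lut = build_lut(db)
print(query_lut(lut, (0, 0, 0, 0), r=1))   # -> [0, 1]
```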

Figure 1: Schematic representation of our ISTA-type network which we use to realize sparse hashing. Two such networks, with the same parametrization, are used.

The key idea of this paper is to control the exploding volume of the embedding space by introducing structure into the binary code. While different types of structure can be considered in principle, we limit our attention to sparse hash codes. A number of recent studies have demonstrated that sparse over-complete representations have several theoretical and practical advantages when modeling compressible data, leading to state-of-the-art results in many applications in computer vision and machine learning. We argue, and show experimentally in Section 5, that compared to its “dense” counterpart, an $m$-bit $k$-sparse similarity-preserving hash can enjoy the high precision typical of long hashes, while having recall roughly comparable to that of a shorter dense hashing scheme with the same number of degrees of freedom as the $m$-bit sparse hash.
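As a back-of-the-envelope illustration of this degrees-of-freedom argument (our own calculation; the values of m and k are arbitrary and the binomial count ignores the signs of the non-zero entries):

```python
from math import comb, log2

m, k = 256, 6                    # hypothetical long sparse code: m bits with k non-zeros
n_codes = comb(m, k)             # number of distinct k-sparse binary codes
print(f"A {m}-bit {k}-sparse code can represent about 2^{log2(n_codes):.1f} distinct codes, "
      f"i.e. roughly as many as a {round(log2(n_codes))}-bit dense code.")
```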

SparseHash. In order to achieve sparsity, a regularization needs to be incorporated into problem (1) so that the obtained embedding produces codes having only a small number of non-zero elements. In this work we employ an $\ell_1$-norm regularization, extensively used in the compressed sensing literature to promote sparsity. Specifically, the loss considered in the minimization of the proposed SparseHash framework is given by the average of

$$\ell(x, x') = s\, \|\xi(x) - \xi(x')\|_2^2 + \alpha (1 - s) \max\left(0,\, M - \|\xi(x) - \xi(x')\|_2\right)^2 + \lambda \left( \|\xi(x)\|_1 + \|\xi(x')\|_1 \right) \qquad (4)$$

over the training set, where $s = s(x,x')$ is the groundtruth similarity function (1: similar, 0: dissimilar), $\lambda$ is a parameter controlling the level of sparsity, $\alpha$ is a weighting parameter governing the false positive and false negative rate tradeoff, and $M$ is a margin.
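The following NumPy sketch (our own illustration; the exact norms and the placement of the weights may differ from the authors' implementation) evaluates a loss with the three ingredients of (4): a similarity-weighted attraction term, a margin-based repulsion term weighted by alpha, and an l1 sparsity penalty weighted by lambda. In training, y1 and y2 would be the outputs of the two coupled networks described below.

```python
import numpy as np

def sparse_hash_loss(y1, y2, s, alpha=1.0, lam=0.01, M=4.0):
    """Pairwise loss with an l1 sparsity penalty on the codes.

    y1, y2 : (n, m) arrays of (approximately binary) codes for a batch of pairs
    s      : (n,) array with 1 for similar pairs, 0 for dissimilar pairs
    """
    dist = np.linalg.norm(y1 - y2, axis=1)
    attract = s * dist**2                                    # keep similar pairs close
    repel = alpha * (1 - s) * np.maximum(0.0, M - dist)**2   # push dissimilar pairs M apart
    sparsity = lam * (np.abs(y1).sum(axis=1) + np.abs(y2).sum(axis=1))
    return np.mean(attract + repel + sparsity)
```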

With the new loss function given in (4), solving (1) will produce a sparse embedding that minimizes the aggregate of false positive and false negative rates for a given parametrization. The question now is which parametrized family of embedding functions would lead to the best sparse similarity-preserving hash codes. While there is no absolute answer to this question, recent approaches aimed at finding fast approximations of sparse codes have shed some light on this issue from a practical perspective (Gregor & LeCun, 2010; Sprechmann et al., 2012), and we adopt the same criterion in our proposed framework.

Gregor and LeCun (2010) proposed tailored feed-forward architectures capable of producing highly accurate approximations of the true sparse codes. These architectures were designed to mimic the iterations of successful first-order optimization algorithms such as the iterative shrinkage-thresholding algorithm (ISTA) (Daubechies et al., 2004). The close relation between the iterative solvers and the network architectures plays a fundamental role in the quality of the approximation. This particular design of the encoder architecture was shown to lead to considerably better approximations than other off-the-shelf feed-forward neural networks (Gregor & LeCun, 2010). These ideas can be generalized to many different uses of sparse coding; in particular, they can be very effective in discriminative scenarios, performing similarly to or better than exact algorithms at a small fraction of the computational cost (Sprechmann et al., 2012). These architectures are flexible enough for approximating the sparse hash in (4).¹

¹ The hash codes produced by the proposed architecture can only be made $k$-sparse on average, by tuning the parameter $\lambda$. In order to guarantee that the codes contain no more than $k$ non-zeros, one can resort to the CoD encoders derived from the coordinate descent pursuit algorithm (Gregor & LeCun, 2010), wherein $k$ is upper-bounded by the number of network layers. We stress that in our application the exact sparsity is not important, since we get the same qualitative behavior.

Implementation. We implement SparseHash by coupling two ISTA-type networks sharing the same set of parameters, as described in (Gregor & LeCun, 2010; Sprechmann et al., 2012), and trained using the loss (4). The architecture of an ISTA-type network (Figure 1) can also be seen as a recurrent network with a soft-threshold activation function. A conventional ISTA network designed to obtain sparse representations with fixed complexity has continuous output units. We follow the approach of Masci et al. (2011) to obtain binary codes by adding a $\tanh$ activation function. Such a smooth approximation of the binary outputs is also similar to the logistic function used by KSH (Liu et al., 2012). We initialize the encoder weights with a unit-length normalized random subset of the training vectors, as in the original ISTA algorithm (which treats them as a dictionary), and the shrinkage thresholds $\theta$ with zeros. The shrinkage activation is defined as $h_\theta(z) = \mathrm{sign}(z) \max(0, |z| - \theta)$.
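A minimal sketch of the forward pass of such an encoder is given below (our own simplification; the parameter names We, S, and theta and the number of unrolled layers are assumptions consistent with Gregor & LeCun (2010) and the description above).

```python
import numpy as np

def shrink(z, theta):
    """Elementwise soft threshold h_theta(z) = sign(z) * max(|z| - theta, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def ista_hash(x, We, S, theta, n_layers=1):
    """Unrolled ISTA-type encoder followed by a tanh 'binarization' stage.

    x     : (d,) input descriptor
    We    : (m, d) encoder matrix (initialized from normalized training vectors)
    S     : (m, m) mutual-inhibition matrix of the unrolled iterations
    theta : (m,) non-negative shrinkage thresholds (initialized to zero)
    """
    b = We @ x
    z = shrink(b, theta)
    for _ in range(n_layers):          # each layer mimics one ISTA iteration
        z = shrink(b + S @ z, theta)
    return np.tanh(z)                  # smooth approximation of a binary sparse code
```

With n_layers=1 this corresponds to the single-layer ISTA network used in the experiments of Section 5.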

The application of the learned hash function to a new data sample involves a few matrix multiplications and the computation of an element-wise soft threshold, and is on par with the fastest methods available, such as SSH (Shakhnarovich et al., 2003), DH (Strecha et al., 2012), and AGH (Liu et al., 2011).

4 Multi-modal SparseHash

In modern retrieval applications, a single object is often represented by more than one data modality. For example, images are frequently accompanied by textual tags, and video by an audio track. The need to search multi-modal data requires comparing objects incommensurable in their original representation. Similarity-preserving hashing can address this need by mapping the modalities into a common space, thus making them comparable in terms of a single similarity function, such as the Hamming metric. For simplicity, we will henceforth limit our discussion to two modalities, though the presented ideas can be straightforwardly generalized to any number of modalities.

We assume the data come from two distinct data spaces $\mathcal{X}$ and $\mathcal{Y}$, equipped with intra-modality similarity functions $s_{\mathcal{X}}$ and $s_{\mathcal{Y}}$, respectively. We furthermore assume the existence of an inter-modality similarity $s_{\mathcal{XY}}$. Typically, examples of similar and dissimilar objects across modalities are more expensive to obtain than their intra-modality counterparts. We construct two embeddings $\xi : \mathcal{X} \to \mathbb{H}^m$ and $\eta : \mathcal{Y} \to \mathbb{H}^m$, in such a way that the Hamming metric preserves the similarity relations of the modalities. We distinguish between cross-modal similarity-preserving hashing, preserving only the inter-modality similarity (Bronstein et al., 2010), and the full multi-modal setting, also preserving the intra-modal similarities.

Our sparse similarity-preserving hashing technique can be generalized to both settings. We construct an independent SparseHash network for each modality, and train them by minimizing an aggregate loss of the form

$$L = L_{\mathcal{XY}} + \beta_{\mathcal{X}} L_{\mathcal{X}} + \beta_{\mathcal{Y}} L_{\mathcal{Y}}$$

with respect to the parameters of the two networks, where $L_{\mathcal{XY}}$, $L_{\mathcal{X}}$, and $L_{\mathcal{Y}}$ are losses of the form (4) computed on the inter-modality and intra-modality training pairs, respectively. The parameters $\beta_{\mathcal{X}}$ and $\beta_{\mathcal{Y}}$ control the relative importance of the intra-modality similarities, and are set to zero in the cross-modal regime. We refer to the networks constructed this way as MM-SparseHash.
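For illustration, the aggregate objective can be sketched by reusing the pairwise loss from Section 3 (a sketch under our own assumptions; sparse_hash_loss is the hypothetical helper from the earlier snippet, and the pair-list format is ours).

```python
import numpy as np

def mm_sparse_hash_loss(codes_x, codes_y, pairs_xy, pairs_x=(), pairs_y=(),
                        beta_x=0.0, beta_y=0.0, **kw):
    """Aggregate loss over inter- and intra-modality pairs (cf. Section 4).

    codes_x, codes_y : (n_x, m) and (n_y, m) codes produced by the two networks
    pairs_*          : lists of (i, j, s) index pairs with similarity label s
    beta_x, beta_y   : intra-modality weights (zero in the cross-modal regime)
    """
    def pair_loss(c1, c2, pairs):
        i, j, s = (np.array(t) for t in zip(*pairs))
        return sparse_hash_loss(c1[i], c2[j], s, **kw)   # helper from the earlier sketch

    loss = pair_loss(codes_x, codes_y, pairs_xy)         # inter-modality term
    if beta_x and pairs_x:
        loss += beta_x * pair_loss(codes_x, codes_x, pairs_x)
    if beta_y and pairs_y:
        loss += beta_y * pair_loss(codes_y, codes_y, pairs_y)
    return loss
```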

Method  m  settings  mAP  |  Hamming radius ≤ 2: Prec. Recall F1  |  Hamming radius 0: Prec. Recall F1
17.42
KSH 48 31.10 18.22 0.86 0.44 5.39 5.6 0.11
64 32.49 10.86 0.13 0.26 2.49 9.6 1.9
AGH1 48 14.55 15.95 2.8 1.4 4.88 2.2 4.4
64 14.22 6.50 4.1 8.1 3.06 1.2 2.4
AGH2 48 15.34 17.43 7.1 3.6 5.44 3.5 6.9
64 14.99 7.63 7.2 1.4 3.61 1.4 2.7
SSH 48 15.78 9.92 6.6 1.3 0.30 5.1 1.0
64 17.18 1.52 3.0 6.1 1.0 1.69 3.3
DH 48 13.13 3.0 1.0 5.1 1.0 1.7 3.4
64 13.07 1.0 1.7 3.3 0.00 0.00 0.00
NN 48 30.18 32.69 1.45 0.74 9.47 5.2 0.10
64 34.74 22.78 0.28 5.5 5.70 8.8 1.8
Sparse 48 16 0.01 0.1 23.07 32.69 1.81 0.93 16.65 5.0 0.10
7 0.001 0.1 21.08 26.03 17.00 12.56 26.65 3.04 5.46
64 11 0.005 0.1 23.80 31.74 6.87 11.30 31.12 0.86 1.70
7 0.001 0.1 21.29 21.41 41.68 28.30 25.27 10.17 14.50

Table 1: Performance (in %) of different hashing methods on the CIFAR10 dataset with different parameter settings and hash lengths $m$. An extended version of this table is shown in the supplementary materials.
Figure 2: Five nearest neighbors retrieved by different hashing methods for two different queries (marked in gray) in the NUS dataset. Correct matches are marked in green. Note that in the NUS dataset each image has multiple class labels, and our groundtruth positives are defined as having at least one class label in common.

5 Experimental results

We compare SparseHash to several state-of-the-art supervised and semi-supervised hashing methods: DH (Strecha et al., 2012), SSH (Shakhnarovich et al., 2003), AGH (Liu et al., 2011), KSH (Liu et al., 2012), and NNhash (Masci et al., 2011), using codes provided by the authors. For SparseHash, we use fully online training via stochastic gradient descent with annealed learning rate and momentum, fixing the maximum number of epochs to 250. A single-layer ISTA network is used in all experiments. All dense hashing methods produce codes with a much larger average fraction of active bits per sample, whereas SparseHash achieves much sparser and more structured codes, e.g., on CIFAR10 with hash length 128. Both sparse and dense codes are well distributed, i.e., the number of non-zero components per code has small variance.

Evaluation. We use several criteria to evaluate the performance of the methods: precision and recall (PR) for different Hamming radii, and the F1 score (their harmonic mean); the mean average precision at $k$ (mAP@$k$), computed from the relevance $\mathrm{rel}(r)$ of the $r$-th result (one if relevant and zero otherwise) and the precision $P(r)$ at $r$ (the percentage of relevant results among the first $r$ top-ranked matches); and the mean precision (MP), defined as the percentage of correct matches for a fixed number of retrieved elements. For the PR curves we use the ranking induced by the Hamming distance between the query and the database samples. When a rejection radius is used, we consider only the results falling into the Hamming ball of that radius.
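A minimal sketch of the radius-based precision/recall/F1 evaluation described above (our own code; the representation of the groundtruth as per-query sets of relevant database indices is an assumption):

```python
import numpy as np

def pr_at_radius(query_codes, db_codes, relevant, r):
    """Precision, recall and F1 averaged over queries at Hamming radius r.

    query_codes : (q, m) 0/1 array of query codes
    db_codes    : (n, m) 0/1 array of database codes
    relevant    : list of sets, relevant[i] = indices of db items relevant to query i
    """
    precisions, recalls = [], []
    for i, q in enumerate(query_codes):
        dists = np.count_nonzero(db_codes != q, axis=1)   # Hamming distances to query
        retrieved = set(np.flatnonzero(dists <= r))
        tp = len(retrieved & relevant[i])
        precisions.append(tp / len(retrieved) if retrieved else 0.0)
        recalls.append(tp / len(relevant[i]) if relevant[i] else 0.0)
    p, rec = np.mean(precisions), np.mean(recalls)
    f1 = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f1
```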

CIFAR10 (Krizhevsky, 2009) is a standard set of 60K labeled images belonging to 10 different classes, sampled from the 80M tiny image benchmark (Torralba et al., 2008a). The images are represented using 384-dimensional GIST descriptors. Following (Liu et al., 2012), we used a training set of 200 images for each class; for testing, we used a disjoint query set of 100 images per class and the remaining 59K images as database.

Figure 4 shows examples of nearest-neighbor retrieval by SparseHash. The performance of different methods is compared in Table 1 and Figures 5 and 6. In Figure 6 (left), we observe two phenomena: first, the recall of dense hashing methods drops significantly with the increase of hash length (as expected from our analysis in Section 3; increasing the hash length is needed for precision), while the recall of SparseHash, being dependent on the number of non-zero elements rather than the hash length, remains approximately unchanged. Second, SparseHash has significantly higher recall at low rejection radii compared to other methods. This is also evinced in Figure 3, where we show the tradeoff between precision, recall, and retrieval time for hashes of fixed length. We used efficient implementations of LUT-based and brute-force search and took the faster of the two; with these codes on the CIFAR10 dataset, LUT-based search showed a significant speedup for the smaller rejection radii, while brute-force search was faster for the larger ones. In order to further analyze this behavior, we measured the average number of codes that are mapped to the same point for each of the methods. Results are reported in Table 2.

Figure 3: Tradeoff between recall, precision, and query time (in sec) of different hashing methods on the CIFAR10 dataset for three rejection radii (circle, triangle, and plus markers). Retrieval for the smaller radii was implemented using a LUT; for the largest radius, brute-force search was more efficient. Diff-hash produced very low recall and is not shown.
Figure 4: Ten nearest neighbors retrieved by SparseHash for five different queries (marked in gray) in the CIFAR10 dataset.
Method  Unique codes  Avg. # of $r$-neighbors (three increasing radii $r$)
KSH 57368 3.95 12.38 27.21
AGH2 55863 1.42 2.33 4.62
SSH 59733 1.01 1.12 1.88
DH 59999 1.00 1.00 1.00
NN 54259 4.83 20.12 56.70
Sparse 9828 798.47 2034.73 3249.86
Table 2: Total number of unique codes for the entire CIFAR10 dataset and average number of retrieved results for searches at various Hamming radii. Hashes of length 48.

NUS (Chua et al., 2009) is a dataset containing 270K annotated images from Flickr. Every image is associated with one or more of 81 different concepts, and is described using a 500-dimensional bag-of-features vector. In training and evaluation, we followed the protocol of (Liu et al., 2011): two images were considered neighbors if they share at least one common concept (only the 21 most frequent concepts are considered). Testing was done on a query set of 100 images per concept; training was performed on 100K pairs of images.

Performance is shown in Table 3 and Figures 5 and 6; retrieved neighbors are shown in Figure 2. We again see behavior consistent with our analysis, and SparseHash significantly outperforms the other methods.

Figure 5: Precision/recall characteristics of hashing methods for three different Hamming radii (solid, dotted, and dashed curves) on the CIFAR10 (left) and NUS (right) datasets. Some settings result in zero recall and the corresponding curves are not shown. While all methods show comparable performance at large radii, only SparseHash performs well for small radii.
Figure 6: Recall as a function of the Hamming radius for hash codes of different length (left: CIFAR10 dataset; right: NUS dataset; solid and dotted curves correspond to two different code lengths). Note the dramatic drop in recall of dense hashing methods when increasing the code length $m$, while the proposed framework maintains its performance.

Multi-modal hashing. We repeated the experiment on the NUS dataset with the same indices of positive and negative pairs, adding the Tags modality represented as 1K-dimensional bags of words. The training set contained pairs of similar and dissimilar Images, pairs of Tags, and cross-modality Tags-Images pairs. We compare our MM-SparseHash to the cross-modal SSH (CM-SSH) method (Bronstein et al., 2010). Results are shown in Table 4 and in Figure 9. MM-SparseHash significantly outperforms CM-SSH, the previously reported state of the art, in cross-modality (Images-Tags and Tags-Images) retrieval. Both methods outperform the baseline for intra-modal (Tags-Tags and Images-Images) retrieval; since the two modalities complement each other, we attribute the improvement to the ability of the model to pick up such correlations.

Method  m  settings  mAP@10  MP@5K  |  Hamming radius ≤ 2: Prec. Recall F1  |  Hamming radius 0: Prec. Recall F1
68.67 32.77
KSH 64 72.85 42.74 83.80 6.1 1.2 84.21 1.7 3.3
256 73.73 45.35 84.24 1.4 2.9 84.24 1.4 2.9
AGH1 64 69.48 47.28 69.43 0.11 0.22 73.35 3.9 7.9
256 73.86 46.68 75.90 1.5 2.9 81.64 3.6 7.1
AGH2 64 68.90 47.27 68.73 0.14 0.28 72.82 5.2 0.10
256 73.00 47.65 74.90 5.3 0.11 80.45 1.1 2.2
SSH 64 72.17 44.79 60.06 0.12 0.24 81.73 1.1 2.2
256 73.52 47.13 84.18 1.8 3.5 84.24 1.5 2.9
DH 64 71.33 41.69 84.26 1.4 2.9 84.24 1.4 2.9
256 70.73 39.02 84.24 1.4 2.9 84.24 1.4 2.9
NN 64 76.39 59.76 75.51 1.59 3.11 81.24 0.10 0.20
256 78.31 61.21 83.46 5.8 0.11 83.94 4.9 9.8
Sparse 64 7 0.05 0.3 74.17 56.08 71.67 1.99 3.98 81.11 0.46 0.92
7 0.05 1.0 74.15 51.52 69.08 0.53 1.06 81.67 0.15 0.30
16 0.005 0.3 74.51 55.54 79.09 1.21 2.42 82.76 0.17 0.34
256 4 0.05 1.0 74.05 60.73 78.82 3.85 7.34 81.82 1.20 2.37
4 0.05 1.0 74.48 59.42 81.95 1.18 2.33 83.24 0.35 0.70
6 0.005 0.3 71.73 54.76 78.34 6.10 11.30 80.85 1.02 2.01
Table 3: Performance (in %) of hashing methods of different length $m$ on the NUS dataset. An extended version of this table is shown in the supplementary materials.
mAP@10 (%)  |  MP@5K (%)
Method Image-Image Tag-Tag Image-Tag Tag-Image Image-Image Tag-Tag Image-Tag Tag-Image
68.67 71.38 32.77 32.85
CM-SSH 75.19 83.05 55.55 50.43 49.69 61.60 37.05 39.13
MM-Sparse 73.79 84.49 61.52 59.52 58.13 66.59 57.35 57.29
Table 4: Performance (in %) of CM-SSH and MM-SparseHash on the NUS multi-modal dataset with hashes of length 64.
Figure 7: Cross-modality retrieval on the NUS dataset. Left plate: first five Image results to a Tags query (shown in gray) obtained using CM-SSH (top) and MM-SparseHash (bottom). Correct matches are marked in green. Note how many if not most of the results from CM-SSH, considered one of the state-of-the-art techniques for this task, are basically uncorrelated with the query. Right plate: union of the top five Tags results (right) to an Image query (shown in gray) obtained using CM-SSH (top) and MM-SparseHash (bottom). Tags matching the ground-truth tags are shown in green. Note how the proposed approach not only detects significantly more matching words, but also how the “non-matching” ones actually make a lot of sense, indicating that, while they were not among the original image tags, underlying connections were learned. An extended version of this figure is shown in the supplementary materials.

6 Conclusions

We presented a new method for learning sparse similarity-preserving hashing functions. The hashing is obtained by solving an $\ell_1$-regularized minimization of the aggregate of false positive and false negative rates. The embedding function is learned using ISTA-type neural networks (Gregor & LeCun, 2010). These networks have a particular architecture that is very effective for learning discriminative sparse codes. We also show that, once the similarity-preserving hashing problem is stated as training a neural network, it can be straightforwardly extended to the multi-modal setting. While in this work we only used networks with a single layer, more generic embeddings could be learned within the exact same framework simply by considering multiple layers.

A key contribution of this paper is to show that more accurate nearest-neighbor retrieval can be obtained by introducing sparsity into the hashing code. SparseHash can achieve significantly higher recall at the same levels of precision than dense hashing schemes with a similar number of degrees of freedom. At the same time, the sparsity of the hash codes allows retrieving partial collisions at much lower computational complexity than their dense counterparts in a Hamming ball of the same radius. Extensive experimental results back up these claims, showing that the proposed SparseHash framework produces results comparable or superior to those of state-of-the-art methods.

References

  • Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
  • Bronstein et al. (2010) Bronstein, M. M. et al. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proc. CVPR, 2010.
  • Chua et al. (2009) Chua, T.-S. et al. NUS-WIDE: A real-world web image database from national university of Singapore. In Proc. CIVR, 2009.
  • Coifman & Lafon (2006) Coifman, R. R. and Lafon, S. Diffusion maps. App. Comp. Harmonic Analysis, 21(1):5–30, 2006.
  • Daubechies et al. (2004) Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure and App. Math., 57(11):1413–1457, 2004.
  • Davis et al. (2007) Davis et al. Information-theoretic metric learning. In Proc. ICML, 2007.
  • Gionis et al. (1999) Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In Proc. VLDB, 1999.
  • Goemans & Williamson (1995) Goemans, M. and Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145, 1995.
  • Gong & Lazebnik (2011) Gong, Y. and Lazebnik, S. Iterative quantization: A procrustean approach to learning binary codes. In Proc. CVPR, 2011.
  • Gong et al. (2012) Gong, Y. et al. Angular quantization-based binary codes for fast similarity search. In Proc. NIPS, 2012.
  • Grauman & Fergus (2013) Grauman, K. and Fergus, R. Learning binary hash codes for large-scale image search. In Machine Learning for Computer Vision, pp. 49–87. Springer, 2013.
  • Gregor & LeCun (2010) Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In ICML, 2010.
  • Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
  • Johnson & Wichern (2002) Johnson, R. A. and Wichern, D. W. Applied multivariate statistical analysis, volume 4. Prentice Hall, 2002.
  • Korman & Avidan (2011) Korman, S. and Avidan, S. Coherency sensitive hashing. In Proc. ICCV, 2011.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Kulis & Darrell (2009) Kulis, B. and Darrell, T. Learning to hash with binary reconstructive embeddings. In Proc. NIPS, 2009.
  • LeCun (1985) LeCun, Y. Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pp. 599–604, 1985.
  • Liu et al. (2011) Liu, W. et al. Hashing with graphs. In Proc. ICML, 2011.
  • Liu et al. (2012) Liu, W. et al. Supervised hashing with kernels. In Proc. CVPR, 2012.
  • Masci et al. (2011) Masci, J. et al. Descriptor learning for omnidirectional image matching. Technical Report arXiv:1112.6291, 2011.
  • McFee & Lanckriet (2009) McFee, B. and Lanckriet, G. R. G. Partial order embedding with multiple kernels. In Proc. ICML, 2009.
  • Mika et al. (1999) Mika, S. et al. Fisher discriminant analysis with kernels. In Proc. Neural Networks for Signal Processing, 1999.
  • Norouzi & Fleet (2011) Norouzi, M. and Fleet, D. Minimal loss hashing for compact binary codes. In Proc. ICML, 2011.
  • Norouzi et al. (2012) Norouzi, M., Fleet, D., and Salakhutdinov, R. Hamming distance metric learning. In Proc. NIPS, 2012.
  • Roweis & Saul (2000) Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.
  • Schoelkopf et al. (1997) Schoelkopf, B., Smola, A., and Mueller, K. R. Kernel principal component analysis. Artificial Neural Networks, pp. 583–588, 1997.
  • Shakhnarovich et al. (2003) Shakhnarovich, G., Viola, P., and Darrell, T. Fast pose estimation with parameter-sensitive hashing. In Proc. CVPR, 2003.
  • Shen et al. (2009) Shen, C. et al. Positive semidefinite metric learning with boosting. In Proc. NIPS, 2009.
  • Sprechmann et al. (2012) Sprechmann, P., Bronstein, A. M., and Sapiro, G. Learning efficient sparse and low rank models. Technical Report arXiv:1010.3467, 2012.
  • Strecha et al. (2012) Strecha, C. et al. LDAHash: Improved matching with smaller descriptors. PAMI, 34(1):66–78, 2012.
  • Taylor et al. (2011) Taylor, G. W. et al. Learning invariance through imitation. In Proc. CVPR, 2011.
  • Torralba et al. (2008a) Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny images: A large data set for nonparametric object and scene recognition. PAMI, 30(11):1958–1970, 2008a.
  • Torralba et al. (2008b) Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. In Proc. CVPR, 2008b.
  • Wang et al. (2010) Wang, J., Kumar, S., and Chang, S.-F. Sequential projection learning for hashing with compact codes. In Proc. ICML, 2010.
  • Weinberger & Saul (2009) Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
  • Weiss et al. (2008) Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Proc. NIPS, 2008.
  • Xing et al. (2002) Xing, E. P. et al. Distance metric learning with application to clustering with side-information. In Proc. NIPS, 2002.
  • Yagnik et al. (2011) Yagnik, J. et al. The power of comparative reasoning. In Proc. CVPR, 2011.

Supplementary Material for Sparse similarity-preserving hashing

Figure 8: Precision as a function of the Hamming radius for hash codes of different length (left: CIFAR10 dataset; right: NUS dataset; solid and dotted curves correspond to two different code lengths).

Figure 9: Cross-modality retrieval: first five Image results to three Tags queries (left) obtained using CM-SSH (odd rows) and MM-SparseHash (even rows) on the multimodal NUS dataset. Note how many if not most of the results from CM-SSH, considered one of the state-of-the-art techniques for this task, are basically uncorrelated with the query.

Method  m  settings  mAP  |  Hamming radius ≤ 2: Prec. Recall F1  |  Hamming radius 0: Prec. Recall F1
17.42
KSH 48 31.10 18.22 0.44 0.86 5.39 5.6 0.11
64 32.49 10.86 0.13 0.26 2.49 9.6 1.9
128 33.50 2.91 3.3 6.5 0.67 4.5 8.9
AGH1 48 14.55 15.95 1.4 2.8 4.88 2.2 4.4
64 14.22 6.50 4.1 8.1 3.06 1.2 2.4
128 13.53 2.89 1.1 2.2 1.58 3.4 6.8
AGH2 48 15.34 17.43 3.6 7.1 5.44 3.5 6.9
64 14.99 7.63 7.2 1.4 3.61 1.4 2.7
128 14.38 3.78 1.6 3.2 1.43 3.9 7.8
SSH 48 15.78 9.92 6.6 1.3 0.30 5.1 1.0
64 17.18 1.52 3.1 6.1 1.0 1.7 3.3
128 17.20 0.30 5.1 1.0 0.10 1.7 3.4
DH 48 13.13 3.0 5.1 1.0 1.0 1.7 3.4
64 13.07 1.0 1.7 3.3 0.00 0.00 0.00
128 13.12 0.00 0.00 0.00 0.00 0.00 0.00
NN 48 30.18 32.69 0.74 1.45 9.47 5.2 0.10
64 34.74 22.78 0.28 5.5 5.70 8.8 1.8
128 37.89 5.38 2.9 5.7 1.39 2.2 4.4
Sparse 48 16 0.01 0.1 23.07 32.69 0.93 1.81 16.65 5.0 0.10
7 0.001 0.1 21.08 26.03 12.56 17.00 26.65 3.04 5.46
64 11 0.005 0.1 23.80 31.74 6.87 11.30 31.12 0.86 1.70
7 0.001 0.1 21.29 21.41 41.68 28.30 25.27 10.17 14.50
128 16 0 0.1 21.97 25.94 18.11 21.30 27.99 3.81 6.71
Table 5: Performance (in %) of different hashing methods on the CIFAR10 dataset with different parameter settings and hash lengths $m$.
Method  m  settings  mAP@10  MP@5K  |  Hamming radius ≤ 2: Prec. Recall F1  |  Hamming radius 0: Prec. Recall F1

68.67 32.77
KSH 64 72.85 42.74 83.80 6.1 1.2 84.21 1.7 3.3
80 72.76 43.32 84.21 1.8 3.6 84.23 1.4 2.9
256 73.73 45.35 84.24 1.4 2.9 84.24 1.4 2.9
AGH1 64 69.48 47.28 69.43 0.11 0.22 73.35 3.9 7.9
80 69.62 47.23 71.15 7.5 0.15 74.14 2.5 5.1
256 73.86 46.68 75.90 1.5 2.9 81.64 3.6 7.1
AGH2 64 68.90 47.27 68.73 0.14 0.28 72.82 5.2 0.10
80 69.73 47.32 70.57 0.12 0.24 73.85 4.2 8.3
256 73.00 47.65 74.90 5.3 0.11 80.45 1.1 2.2
SSH 64 72.17 44.79 60.06 0.12 0.24 81.73 1.1 2.2
80 72.58 46.96 83.96 1.9 3.9 80.91 1.3 2.6
256 73.52 47.13 84.18 1.8 3.5 84.24 1.5 2.9
DH 64 71.33 41.69 84.26 1.4 2.9 84.24 1.4 2.9
80 70.34 37.75 84.24 4.9 9.8 84.24 4.9 9.8
256 70.73 39.02 84.24 1.4 2.9 84.24 1.4 2.9
NN 64 76.39 59.76 75.51 1.59 3.11 81.24 0.10 0.20
80 75.51 59.59 77.17 2.02 3.94 81.89 0.24 0.48
256 78.31 61.21 83.46 5.8 0.11 83.94 4.9 9.8
Sparse 64 7 0.05 0.3 74.17 56.08 71.67 1.99 3.98 81.11 0.46 0.92
7 0.05 1.0 74.15 51.52 69.08 0.53 1.06 81.67 0.15 0.30
16 0.005 0.3 74.51 55.54 79.09 1.21 2.42 82.76 0.17 0.34
256 4 0.05 1.0 74.05 60.73 78.82 3.85 7.34 81.82 1.20 2.37
4 0.05 1.0 74.48 59.42 81.95 1.18 2.33 83.24 0.35 0.70
6 0.005 0.3 71.73 54.76 78.34 6.10 11.30 80.85 1.02 2.01
Table 6: Performance (in %) of hashing methods of different length $m$ on the NUS dataset. Comparable degrees of freedom are 256-bit SparseHash and 80-bit dense hashes.

Figure 10: Cross-modality retrieval: union of the top five Tags results (right) to three Image queries (left) obtained using CM-SSH (odd rows) and MM-SparseHash (even rows) on the multimodal NUS dataset. Tags matching the ground-truth tags are shown in green. Note how the proposed approach not only detects significantly more matching words, but also how the “non-matching” ones actually make a lot of sense, indicating that, while they were not among the original image tags, underlying connections were learned.