1 Introduction
The need to represent data efficiently, in a compact and convenient way, has led to similarity-sensitive hashing methods, first considered in [11] and later in [3, 24, 17, 32, 22]. Similarity-sensitive hashing methods can be regarded as a particular instance of supervised metric learning [2, 31], where one tries to construct a hashing function on the data space that preserves known similarity on the training set. Typically, the similarity is binary and can be related to hash collision probability (similar points should collide, and dissimilar points should not). Such methods have enjoyed increasing popularity in the computer vision and pattern recognition community, with applications in image analysis and retrieval [13, 27, 14, 15, 30, 16], video copy detection [5], and shape retrieval [7].
Shakhnarovich [24] considered parametric hashing functions consisting of an affine transformation of the data vectors (projection matrix and threshold vector) followed by the sign function. He posed the problem of similarity-sensitive hash construction as boosted classification, where each dimension of the hash acts as a weak binary classifier. The parameters of the hashing function were learned using AdaBoost. In [25], we used the same problem setting and proposed a much simpler algorithm, in which the projections are selected as eigenvectors of the ratio or difference of the covariance matrices of similar and dissimilar pairs of data points; the former method was dubbed LDA-hash and the latter diff-hash. Applied to SIFT local image features [18], these methods produced very compact and accurate binary descriptors.
The inspiration for this paper is the diff-hash method [25]. While remarkably simple and efficient, this method suffers from two major limitations. First, the length of the hash is limited by the descriptor dimensionality. In some situations this is a clear disadvantage, as longer hashes allow more accurate matching. Second, the affine hashing functions are in many cases too simple and fail to represent the structure of the data correctly. In this paper, we propose a kernel formulation of the diff-hash algorithm that efficiently resolves both problems. We evaluate the algorithm on the problem of image descriptor matching using the patches dataset from [33] and show that it outperforms the original diff-hash.
2 Background
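Let $X \subseteq \mathbb{R}^n$ denote the data space. We denote by $\mathcal{P} \subset X \times X$ the set of pairs of similar data points (positives) and by $\mathcal{N} \subset X \times X$ the set of pairs of dissimilar data points (negatives). The problem of similarity-sensitive hashing is to represent the data in a common space $\mathbb{H}^m = \{\pm 1\}^m$ of $m$-dimensional binary vectors with the Hamming metric $d_{\mathbb{H}^m}(a, b) = \frac{m}{2} - \frac{1}{2}\sum_{i=1}^{m} a_i b_i$ by means of a map $\xi : X \to \mathbb{H}^m$ such that $d_{\mathbb{H}^m}(\xi(x), \xi(x')) \approx 0$ on $\mathcal{P}$ and $d_{\mathbb{H}^m}(\xi(x), \xi(x')) \gg 0$ on $\mathcal{N}$. Alternatively, this can be expressed as having $\mathbb{E}\{d_{\mathbb{H}^m}(\xi(x), \xi(x')) \mid \mathcal{P}\} \approx 0$ (i.e., the hash has high collision probability on the set of positives) and $\mathbb{E}\{d_{\mathbb{H}^m}(\xi(x), \xi(x')) \mid \mathcal{N}\} \gg 0$. The former can be interpreted as the false negative rate (FNR) and the latter as the false positive rate (FPR).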
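As a small illustration of the Hamming metric on binary vectors with entries in {-1, +1}, the following Python sketch computes the distance in two ways: from the inner product (the number of differing coordinates equals $(m - \langle a, b \rangle)/2$) and from bit-packed codes via XOR and bit counting. The function names and the bit-packing convention (+1 maps to bit 1) are illustrative assumptions, not part of the original method description.

```python
def hamming_pm1(a, b):
    """Hamming distance between two {-1,+1} vectors via their inner product."""
    m = len(a)
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return (m - dot) // 2  # equals m/2 - (1/2) * <a, b>; m - dot is always even


def pack_bits(a):
    """Pack a {-1,+1} vector into a Python int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for ai in a:
        word = (word << 1) | (1 if ai > 0 else 0)
    return word


def hamming_packed(u, v):
    """Hamming distance between packed codes: XOR, then count the set bits."""
    return bin(u ^ v).count("1")


a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, +1, -1, -1, -1, -1, -1]
d_inner = hamming_pm1(a, b)
d_xor = hamming_packed(pack_bits(a), pack_bits(b))
print(d_inner, d_xor)
```

Both computations agree; the packed variant is the one that maps to the XOR and bit-count instructions available on modern CPUs.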
2.1 Similarity-sensitive hashing (SSH)
To further simplify the problem, Shakhnarovich [24] considered parametric hashing functions of the form $\xi(x) = \mathrm{sign}(Px + t)$, where $P$ is an $m \times n$ projection matrix and $t$ is an $m \times 1$ threshold vector. The similarity-sensitive hashing (SSH) algorithm treats the hash construction as boosted binary classification, where each hash dimension acts as a weak binary classifier. For each dimension $i$, AdaBoost is used to maximize the following objective:
$$r_i = \sum_{k=1}^{K} w_i(k)\, s_k\, \mathrm{sign}(p_i^T x_k + t_i)\, \mathrm{sign}(p_i^T x'_k + t_i), \qquad (1)$$
where $(x_k, x'_k)$, $k = 1, \dots, K$, are the training pairs, $s_k = +1$ for $(x_k, x'_k) \in \mathcal{P}$ and $s_k = -1$ for $(x_k, x'_k) \in \mathcal{N}$, and $w_i(k)$ is the AdaBoost weight for pair $k$ at the $i$th iteration. Shakhnarovich [24] restricted $p_i$ to axis projections, selecting the one that optimizes the objective. In [5, 8], problem (1) was relaxed in the following way: first, removing the sign nonlinearity and setting $t_i = 0$, find the projection vector $p_i$; then, fixing the projection $p_i$, find the threshold $t_i$. The disadvantages of the boosting-based SSH are, first, its high computational complexity and, second, its tendency to find unnecessarily long hashes (the second problem can be partially resolved by sequential probability testing [6], which creates hashes of minimum expected length).
2.2 Diff-hash
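In [25], we proposed a simpler approach, computing the similarity-sensitive hash by minimizing
$$\alpha\, \mathbb{E}\{\|\xi(x) - \xi(x')\|^2 \mid \mathcal{P}\} - \mathbb{E}\{\|\xi(x) - \xi(x')\|^2 \mid \mathcal{N}\} \qquad (2)$$
w.r.t. the map $\xi$. Problem (2) is equivalent, up to constants, to minimizing the correlations
$$\mathbb{E}\{\mathrm{sign}(Px + t)^T \mathrm{sign}(Px' + t) \mid \mathcal{N}\} - \alpha\, \mathbb{E}\{\mathrm{sign}(Px + t)^T \mathrm{sign}(Px' + t) \mid \mathcal{P}\} \qquad (3)$$
w.r.t. the projection matrix $P$ and threshold vector $t$. The first and second terms in (3) can be thought of as FPR and FNR, respectively. The parameter $\alpha$ controls the tradeoff between FPR and FNR. The limit case $\alpha \gg 1$ effectively considers only the positive pairs, ignoring the negative set.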
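The equivalence between (2) and (3) can be checked directly. The short derivation below is a sketch under two assumptions: that (2) is the $\alpha$-weighted expected squared distance between hashes on the positives minus the same quantity on the negatives, and that the hash $\xi$ takes values in $\{\pm 1\}^m$. The symbols $\xi$, $\mathcal{P}$ and $\mathcal{N}$ (hash map, positive pairs, negative pairs) are notation assumed here for illustration.

```latex
\begin{align*}
\|\xi(x) - \xi(x')\|^2
  &= \|\xi(x)\|^2 + \|\xi(x')\|^2 - 2\,\xi(x)^T \xi(x')
   = 2m - 2\,\xi(x)^T \xi(x'), \\
\alpha\,\mathbb{E}\{\|\xi(x) - \xi(x')\|^2 \mid \mathcal{P}\}
  - \mathbb{E}\{\|\xi(x) - \xi(x')\|^2 \mid \mathcal{N}\}
  &= 2m(\alpha - 1)
   + 2\left( \mathbb{E}\{\xi(x)^T \xi(x') \mid \mathcal{N}\}
   - \alpha\,\mathbb{E}\{\xi(x)^T \xi(x') \mid \mathcal{P}\} \right).
\end{align*}
```

The term $2m(\alpha - 1)$ does not depend on the hash, so minimizing the distance-based objective amounts to minimizing the difference of correlations.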
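Problem (3) is a highly non-convex nonlinear optimization problem that is difficult to solve directly. Following [5, 8], we simplify the problem in the following way. First, ignore the threshold and solve a simplified problem without the sign nonlinearity for the projection matrix $P$,
$$\min_{P P^T = I}\ \mathbb{E}\{(Px)^T (Px') \mid \mathcal{N}\} - \alpha\, \mathbb{E}\{(Px)^T (Px') \mid \mathcal{P}\} = \min_{P P^T = I}\ \mathrm{tr}\big(P (\Sigma_{\mathcal{N}} - \alpha \Sigma_{\mathcal{P}}) P^T\big), \qquad (4)$$
where $\Sigma_{\mathcal{P}} = \mathbb{E}\{x x'^T \mid \mathcal{P}\}$ and $\Sigma_{\mathcal{N}} = \mathbb{E}\{x x'^T \mid \mathcal{N}\}$ denote the covariance matrices of the positive and negative data. The solution of (4) is given explicitly by taking as the rows of $P$ the $m$ eigenvectors of the matrix of weighted covariance differences $\Sigma_{\mathcal{N}} - \alpha \Sigma_{\mathcal{P}}$ with the smallest eigenvalues (the name diff-hash in fact refers to this covariance difference matrix).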
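The projection step can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation of [25]: it assumes zero-mean data, estimates the covariance matrices as empirical second moments of the pairs, assumes the weighted difference has the form "negative covariance minus alpha times positive covariance", and imposes orthonormal projections. The function name, the toy data and the value of `alpha` are all hypothetical.

```python
import numpy as np


def diffhash_projections(pos, neg, m, alpha=1.0):
    """pos, neg: tuples (X, Xp) of arrays of shape (num_pairs, n).

    Returns an (m, n) matrix whose rows are the eigenvectors of
    Sigma_N - alpha * Sigma_P with the m smallest eigenvalues.
    """
    def cross_cov(X, Xp):
        C = X.T @ Xp / X.shape[0]   # empirical E{x x'^T}, data assumed zero-mean
        return 0.5 * (C + C.T)      # symmetrizing leaves tr(P C P^T) unchanged

    D = cross_cov(*neg) - alpha * cross_cov(*pos)
    _, eigvecs = np.linalg.eigh(D)  # eigenvalues are returned in ascending order
    return eigvecs[:, :m].T


# Toy data (assumption): positives are small perturbations of the same point,
# negatives are independent draws.
rng = np.random.default_rng(0)
n, num = 8, 2000
X = rng.standard_normal((num, n))
pos = (X, X + 0.1 * rng.standard_normal((num, n)))
neg = (X, rng.standard_normal((num, n)))
P = diffhash_projections(pos, neg, m=4, alpha=1.0)
print(P.shape)
```

Since the matrix being decomposed is n-by-n, at most n projections can be obtained this way, which is the limitation on the hash length noted in the text.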
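Second, fixing the projections $P$, find the optimal threshold vector $t$,
$$\min_{t}\ \mathbb{E}\{\mathrm{sign}(Px + t)^T \mathrm{sign}(Px' + t) \mid \mathcal{N}\} - \alpha\, \mathbb{E}\{\mathrm{sign}(Px + t)^T \mathrm{sign}(Px' + t) \mid \mathcal{P}\}.$$
The problem is separable and can be solved independently in each dimension $i$. The above terms correspond to the false positive and false negative rates as functions of the threshold $t_i$,
$$\mathrm{FPR}(t_i) = \Pr\{\mathrm{sign}(p_i^T x + t_i) = \mathrm{sign}(p_i^T x' + t_i) \mid \mathcal{N}\}$$
and
$$\mathrm{FNR}(t_i) = \Pr\{\mathrm{sign}(p_i^T x + t_i) \neq \mathrm{sign}(p_i^T x' + t_i) \mid \mathcal{P}\}.$$
The above probabilities can be estimated from histograms (cumulative distributions) of $\min\{p_i^T x, p_i^T x'\}$ and $\max\{p_i^T x, p_i^T x'\}$ on the positive and negative sets. The optimal threshold
$$t_i^* = \arg\min_{t_i}\ \mathrm{FPR}(t_i) + \alpha\, \mathrm{FNR}(t_i) \qquad (5)$$
is obtained by means of one-dimensional exhaustive search.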
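The one-dimensional exhaustive search can be illustrated as follows. This is a sketch under assumptions: the hash bit is taken to be sign(y + t) for a scalar projection y, a false negative is a positive pair whose two members receive different bits, a false positive is a negative pair whose members receive the same bit, and the cost is FPR plus alpha times FNR. The function name and the toy projections are hypothetical, and the rates are computed by direct counting rather than from cumulative histograms.

```python
import numpy as np


def select_threshold(y_pos, yp_pos, y_neg, yp_neg, alpha=1.0):
    """Exhaustive 1-D search for the threshold t of the bit sign(y + t).

    y_*, yp_*: 1-D arrays with the projections of the two members of each pair.
    Returns (best_t, best_cost) with cost = FPR + alpha * FNR.
    """
    vals = np.unique(np.concatenate([y_pos, yp_pos, y_neg, yp_neg]))
    # Candidates: minus the midpoints between consecutive projection values,
    # so that y + t is never exactly zero on the training data.
    candidates = -(vals[:-1] + vals[1:]) / 2.0
    best_t, best_cost = None, np.inf
    for t in candidates:
        fnr = np.mean((y_pos + t > 0) != (yp_pos + t > 0))  # positives split
        fpr = np.mean((y_neg + t > 0) == (yp_neg + t > 0))  # negatives collide
        cost = fpr + alpha * fnr
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy projections (assumption): positive pairs lie on the same side of zero,
# negative pairs on opposite sides, so a threshold at zero separates perfectly.
y_pos = np.array([-2.0, -1.5, 1.5, 2.0])
yp_pos = np.array([-1.8, -1.6, 1.7, 1.9])
y_neg = np.array([-2.0, -1.0, 1.0, 2.0])
yp_neg = np.array([1.5, 2.0, -1.5, -2.0])
best_t, best_cost = select_threshold(y_pos, yp_pos, y_neg, yp_neg)
print(best_t, best_cost)
```

For large training sets, sorting the projections once and accumulating counts would avoid recomputing the rates from scratch for every candidate.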
3 Kernel diff-hash
An obvious disadvantage of diff-hash (and spectral methods in general) compared to AdaBoost-based methods is that it must be dimensionality-reducing: since we compute the projections as the eigenvectors of a covariance matrix of size $n \times n$, the dimensionality of the embedding space must satisfy $m \le n$. This restriction is limiting in many cases: first, it depends on the data dimensionality, and second, such a dimensionality may be too low, so that a longer hash would achieve better performance. Furthermore, the affine parametric form of the embedding is in many cases an oversimplification, and a more generic map is required.
In this paper, we address both problems using a kernel formulation, which transforms the data into a feature space that is never dealt with explicitly (only inner products in this space, referred to as the kernel [23], are required). To simplify the following discussion, since the problem is separable (as we have seen, the projection in each dimension corresponds to an eigenvector of the covariance difference matrix), we consider one-dimensional projections. The whole method is summarized in Algorithm 1.
3.1 Projection computation
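Let $K : X \times X \to \mathbb{R}$ be a positive semidefinite kernel, and let $\phi$ denote the corresponding feature map. Thus, $\phi$ maps the data into some feature space $V$, which we represent here as a Hilbert space (possibly of infinite dimension) with an inner product $\langle \cdot, \cdot \rangle_V$, and satisfies $K(x, x') = \langle \phi(x), \phi(x') \rangle_V$.
The idea of kernelization is to replace the original data $x$ with the corresponding feature vectors $\phi(x)$, replacing the linear projection $p^T x$ with $\langle p, \phi(x) \rangle_V = \sum_{j=1}^{l} a_j K(x_j, x)$, where $p = \sum_{j=1}^{l} a_j \phi(x_j)$. Here, $a = (a_1, \dots, a_l)^T$ is a vector of unknown linear combination coefficients, and $x_1, \dots, x_l$ denote some representative points in the data space.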
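In this formulation, at the projection computation stage we minimize, for each dimension, the kernelized counterpart of the simplified objective (4) with respect to the coefficient vector. The expectations over the positive and negative sets are expressed through matrices of kernel values between the representative points and the training pairs. The optimal projection coefficients are given by the smallest eigenvectors of the resulting weighted kernel covariance difference matrix.
The kernel can be selected to account correctly for the structure of the data space $X$. In our formulation, the dimensionality of the hash is bounded by the number of basis vectors, $l$, which is limited only by the training set size and computational complexity.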
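One possible reading of the kernelized projection step is sketched below. It rests on several assumptions that are not confirmed by the text: each projection is written as a linear combination of kernel values between a point and the representative points, the matrices being compared are empirical second moments of these kernel vectors over the negative and positive pairs, the coefficient vectors are constrained to be orthonormal, and an isotropic Gaussian kernel stands in for the kernel of choice. All function names, the toy data and the parameter values are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Isotropic Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def kernel_diffhash_coeffs(pos, neg, basis, m, alpha=1.0, kernel=gaussian_kernel):
    """Returns an (m, l) matrix of expansion coefficients, one row per hash bit.

    pos, neg: tuples (X, Xp) of arrays of shape (num_pairs, n).
    basis: (l, n) array of representative points.
    """
    def kernel_cov(X, Xp):
        KX, KXp = kernel(X, basis), kernel(Xp, basis)  # (num_pairs, l) each
        C = KX.T @ KXp / X.shape[0]                    # second moment of kernel vectors
        return 0.5 * (C + C.T)

    D = kernel_cov(*neg) - alpha * kernel_cov(*pos)    # l-by-l difference matrix
    _, eigvecs = np.linalg.eigh(D)                     # ascending eigenvalues
    return eigvecs[:, :m].T


# Toy data (assumption): 2-dimensional points, 20 representative points, 6 bits.
rng = np.random.default_rng(0)
num, n, l, m = 500, 2, 20, 6
X = rng.standard_normal((num, n))
pos = (X, X + 0.05 * rng.standard_normal((num, n)))
neg = (X, rng.standard_normal((num, n)))
basis = X[:l]
A = kernel_diffhash_coeffs(pos, neg, basis, m)
print(A.shape)
```

In this toy setting the hash has six bits although the data are only two-dimensional, which illustrates how the bound on the hash length moves from the data dimensionality to the number of representative points.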
3.2 Threshold selection
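As before, the threshold should be selected to minimize the false positive and false negative rates, which, denoting the kernel projection by $y(x) = \sum_{j=1}^{l} a_j K(x_j, x)$, can be expressed as
$$\mathrm{FPR}(t) = \Pr\{\mathrm{sign}(y(x) + t) = \mathrm{sign}(y(x') + t) \mid \mathcal{N}\}, \qquad \mathrm{FNR}(t) = \Pr\{\mathrm{sign}(y(x) + t) \neq \mathrm{sign}(y(x') + t) \mid \mathcal{P}\}.$$
The optimal threshold is obtained as
$$t^* = \arg\min_{t}\ \mathrm{FPR}(t) + \alpha\, \mathrm{FNR}(t). \qquad (6)$$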
3.3 Hash function application
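Once the coefficients $a_i$ and thresholds $t_i$ are computed for each dimension $i = 1, \dots, m$, given a new data point $x$, the corresponding $m$-dimensional binary hash vector is constructed as $\xi(x) = (\xi_1(x), \dots, \xi_m(x))$ with $\xi_i(x) = \mathrm{sign}\big(\sum_{j=1}^{l} a_{ij} K(x_j, x) + t_i\big)$. Note that this embedding is kernel-dependent and has a more generic form than the affine transformation used in [24, 25].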
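Applying the learned hash to new points then reduces to one kernel evaluation against the representative points, a matrix product and a sign. The sketch below assumes that each bit is the sign of a kernel expansion plus a threshold; the isotropic Gaussian kernel, the function names, and the hand-picked coefficients and thresholds are illustrative assumptions only.

```python
import numpy as np


def gaussian_kernel(X, basis, sigma=1.0):
    """Isotropic Gaussian kernel between the rows of X and the rows of basis."""
    diff = X[:, None, :] - basis[None, :, :]
    return np.exp(-(diff**2).sum(-1) / (2.0 * sigma**2))


def kernel_hash(X, basis, A, t, kernel=gaussian_kernel):
    """Maps the rows of X to {-1,+1}^m codes.

    Bit i of a point x is sign(sum_j A[i, j] * K(x, basis[j]) + t[i]).
    """
    Y = kernel(X, basis) @ A.T + t
    return np.where(Y > 0, 1, -1)


# Hand-picked toy parameters (assumption): two representative points, two bits.
basis = np.array([[0.0, 0.0], [1.0, 0.0]])
A = np.array([[1.0, -1.0], [1.0, 1.0]])
t = np.array([0.0, -1.0])
X_new = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
codes = kernel_hash(X_new, basis, A, t)
print(codes)
```

The resulting codes can be bit-packed and compared with the Hamming distance exactly as codes produced by an affine hash.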
4 Results
In order to test our approach, we applied it to the problem of image feature matching. This problem is at the core of many modern Internet-scale computer vision applications, including city-scale reconstruction [1]. The basic underlying task in these problems, repeated millions or billions of times, is the comparison of local image features (SIFT [18] or similar methods [21, 4, 26]). Typically, these features are represented by means of multidimensional descriptor vectors (e.g., SIFT is 128-dimensional) and compared using the Euclidean distance. With very large datasets of feature points, severe scalability issues are encountered, including problems of storage and similarity queries on feature descriptors. Efficient representation and comparison of feature descriptors have been addressed in many recent works in the computer vision community (see, e.g., [20, 19, 28, 12, 33, 34, 10, 9]).
In [25], we proposed using similarity-sensitive hashing methods to produce compact binary descriptors. Such descriptors have several appealing properties that make them especially suitable for large-scale applications. First, they are compact, requiring far fewer bits than the standard SIFT descriptor, and easy to store in standard databases. Second, the comparison of binary descriptors is done using the Hamming metric, which amounts to an XOR followed by a bit count, an operation that can be carried out extremely efficiently on modern CPU architectures, significantly faster than the computation of Euclidean or other distances. Finally, the construction of the binarization transformation involves metric learning, thus modeling more correctly the distance between the descriptors, which is usually non-Euclidean. In particular, this makes it possible to compensate for the imperfect invariance of the descriptor (since viewpoint transformations are only approximately locally affine) and to cope with descriptor variability in pairs of images with a wide baseline. As a result of this last property, the use of similarity-sensitive hashing reduces the descriptor size while actually improving its performance [25], unlike other methods, where compactness typically comes at the price of decreased performance.
In our experiments, we used data from [33]. The datasets contained rectified and normalized patches extracted from multiple images depicting three different scenes (Trevi fountain, Notre Dame cathedral, and Half Dome). The first two scenes were similar, representing architectural landmarks; the last scene was different, representing a natural mountain environment. In each scene, a total of nearly K patches corresponding to around K different feature points were available; each feature appeared multiple times. For training, we used K pairs of patches corresponding to different views of the same points as positives, and K pairs of patches from different points as negatives (Figure 1). For testing, a different subset of the dataset containing K positive and K negative pairs was used.
In each patch, a 128-dimensional (8 bits per dimension) SIFT descriptor was computed using the toolbox of Vedaldi [29]. We compared the performance of the binary descriptors obtained by means of the diff-hash method of Strecha et al. [25] (DIF) and our kernel version (kDIF). Diff-hash was the best-performing algorithm in the extensive set of evaluations done in [25]; since kDIF is an extension of DIF, we chose to compare against this method. In both methods, we used the value of the tradeoff parameter that was experimentally found to produce the best results. In kDIF, we used a Gaussian kernel with the Mahalanobis distance. The same training and testing data were used for all methods. For reference, we also show the performance of the Euclidean distance between the original SIFT descriptors.
Figures 2–3 show the performance of the different hashing algorithms as a function of the hash length $m$ on different datasets. Several conclusions can be drawn from these figures. First, kDIF consistently outperforms DIF on all three scenes for the same hash length $m$. Second, for sufficiently large $m$, our method outperforms SIFT while still being more compact. Third, the learned hashing functions generalize gracefully to other scenes, though a slight performance degradation is noticeable when training on the mountain scene (Half Dome) and using the learned hash on an architectural scene (Notre Dame).
Figure 4 compares the performance of the different descriptors in terms of FNR at two low FPR operating points. First, binary descriptors outperform raw SIFT while being 2-4 times more compact (to say nothing of the lower computational complexity of the Hamming distance compared to the Euclidean distance). Second, kDIF consistently outperforms DIF. Third, one can see that using a longer hash increases the performance.
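Reporting FNR at a fixed low FPR requires choosing the distance threshold from the negative pairs and then measuring how many positive pairs it rejects. The following sketch shows one way to compute this, assuming a pair is declared a match when its distance does not exceed the threshold; the function name and the toy distances are hypothetical and are not taken from the experiments.

```python
import numpy as np


def fnr_at_fpr(d_pos, d_neg, target_fpr):
    """Lowest FNR over distance thresholds whose FPR does not exceed target_fpr.

    d_pos, d_neg: 1-D arrays of distances for positive and negative pairs.
    A pair is declared a match when its distance is <= threshold.
    """
    thresholds = np.unique(np.concatenate([d_pos, d_neg]))
    best = 1.0  # if no threshold is admissible, every positive pair is missed
    for th in thresholds:
        fpr = np.mean(d_neg <= th)      # negatives wrongly declared matches
        if fpr <= target_fpr:
            best = min(best, float(np.mean(d_pos > th)))  # positives missed
    return best


# Toy Hamming distances (assumption): 5 positive and 10 negative pairs.
d_pos = np.array([0, 1, 1, 2, 5])
d_neg = np.array([3, 4, 6, 7, 8, 9, 10, 11, 12, 13])
fnr_low = fnr_at_fpr(d_pos, d_neg, 0.1)
fnr_high = fnr_at_fpr(d_pos, d_neg, 0.2)
print(fnr_low, fnr_high)
```

The same routine applies unchanged to Euclidean distances between raw descriptors, which allows hashed and unhashed descriptors to be compared at the same operating point.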
Figure 5 shows a few examples of first matches between patch descriptors obtained using the Euclidean distance on SIFT descriptors and the Hamming distance on the descriptors hashed using our method. In these examples, our method provides superior performance.
Figure 5: First matches using the Euclidean distance between SIFT descriptors (odd rows) and the Hamming distance between $m$-dimensional binary vectors constructed using our kDIF hashing algorithm (even rows). The query image is shown on the left; the first five matches are shown on the right. Numbers indicate the distance from the query. Wrong matches are marked in red, correct matches in green.
5 Conclusions
We presented a kernel formulation of the diff-hash similarity-sensitive hashing algorithm and showed how it can be used to produce efficient and compact binary feature descriptors. Though we showed results with SIFT, the method is generic and can be applied to any local feature descriptor. Our method showed superior results compared to the original diff-hash proposed in [25], and is more generic, as it allows hashes of any length to be obtained and incorporates nonlinearity through the choice of the kernel.
References
 [1] S. Agarwal, N. Snavely, I. Simon, S.M. Seitz, and R. Szeliski. Building Rome in one day. In Proc. ICCV, 2009.
 [2] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. Boostmap: a method for efficient approximate similarity ranking. In Proc. CVPR, 2004.
 [3] M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In Proc. Int. Conf. World Wide Web, pages 651–660. ACM, 2005.
 [4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. CVIU, 110(3):346–359, 2008.
 [5] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. The video genome. Technical Report arXiv:1003.5320v1, 2010.
 [6] A. M. Bronstein, M. M. Bronstein, M. Ovsjanikov, and L. J. Guibas. WaldHash: sequential similarity-preserving hashing. Technical Report CIS-2010-03, Technion, Israel, 2010.
 [7] A.M. Bronstein, M.M. Bronstein, M. Ovsjanikov, and L.J. Guibas. Shape Google: geometric words and expressions for invariant shape retrieval. ACM TOG, 2010.
 [8] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proc. CVPR, 2010.
 [9] M. Brown, G. Hua, and S. A. Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(PrePrints), 2010.
 [10] V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, R. Grzeszczuk, and B. Girod. CHoG: Compressed histogram of gradients, a low bit-rate feature descriptor. In Proc. CVPR, pages 2504–2511, 2009.
 [11] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Int. Conf. Very Large Databases, 2004.
 [12] G. Hua, M. Brown, and S. Winder. Discriminant embedding for local image descriptors. In Proc. ICCV, 2007.
 [13] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proc. CVPR, 2008.
 [14] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proc. ECCV, pages 304–317, 2008.
 [15] H. Jégou, M. Douze, and C. Schmid. Packing Bag-of-Features. In Proc. ICCV, 2009.
 [16] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. Trans. PAMI, 2010.
 [17] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proc. NIPS, pages 1042–1050, 2009.
 [18] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, 2004.
 [19] K. Mikolajczyk and J. Matas. Improving descriptors for fast tree matching by optimal linear projection. In Proc. ICCV, 2007.
 [20] K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. In Proc. CVPR, pages 257–263, June 2003.
 [21] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1/2):43–72, 2005.
 [22] M. Raginsky and S. Lazebnik. Locality-Sensitive Binary Codes from Shift-Invariant Kernels. Proc. NIPS, 2009.
 [23] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. Proc. ICANN, pages 583–588, 1997.
 [24] G. Shakhnarovich. Learning Task-Specific Similarity. PhD thesis, MIT, 2005.
 [25] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: improved matching with smaller descriptors. Trans. PAMI, 2011.
 [26] E. Tola, V. Lepetit, and P. Fua. Daisy: an Efficient Dense Descriptor Applied to Wide Baseline Stereo. Trans. PAMI, 32(5):815–830, 2010.
 [27] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for nonparametric object and scene recognition. Trans. PAMI, 30(11):1958–1970, 2008.
 [28] T. Tuytelaars and C. Schmid. Vector quantizing feature space with a regular lattice. Proc. ICCV, 2007.
 [29] A. Vedaldi. An open implementation of the SIFT detector and descriptor. Technical Report 070012, UCLA CSD, 2007.
 [30] J. Wang, S. Kumar, and S. F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.
 [31] J. Wang, S. Kumar, and S. F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.
 [32] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. Proc. NIPS, 21:1753–1760, 2009.
 [33] S. A. Winder and M. Brown. Learning local image descriptors. In Proc. CVPR, Minneapolis, MN, June 2007.
 [34] S. A. Winder, G. Hua, and M. Brown. Picking the best DAISY. In Proc. CVPR, June 2009.