I Introduction
Large amounts of data are being collected every day in the sciences and industry. Analysing such truly big data sets even by linear methods can become infeasible, so sublinear methods such as locality sensitive hashing (LSH) have become an important analysis tool. For some data collections, the purpose can be clearly expressed from the start, for example, text/image/video/speech analysis or recommender systems. In other cases, such as drug discovery or the material genome project, the ultimate query structure for such data may not yet be fully fixed. In other words, measurements, simulations or observations may be recorded without being able to spell out the full specific purpose (although the general goal, e.g., better drugs or more potent materials, is clear). Motivated by the latter case, we consider how one can use LSH schemes without defining any specific dissimilarity at the data acquisition and preprocessing phase.
LSH, one of the key technologies for big data analysis, enables approximate nearest neighbor search (ANNS) in sublinear time [1, 2]. With LSH functions for a required dissimilarity measure in hand, each data sample is assigned to a hash bucket in the preprocessing stage. At runtime, ANNS with theoretical guarantees can be performed by restricting the search to the samples that lie within the hash bucket to which the query point is assigned, along with the samples lying in the neighbouring buckets.
A challenge in developing LSH without defining a specific purpose is that the existing LSH schemes, designed for different dissimilarity measures, provide significantly different hash codes. Therefore, a naive realization requires us to prepare as many hash tables as there are possible target dissimilarities, which is not realistic if we need to adjust the importance of multiple criteria. In this paper, we propose three variants of multiple purpose LSH (mpLSH), which support L2, cosine, and inner product (IP) dissimilarities, and their weighted sums, where the weights can be adjusted at query time.
The first proposed method, called mpLSH with vector augmentation (mpLSHVA), maps the data space into an augmented vector space, so that the squared L2 distance in the augmented space matches the required dissimilarity measure up to a constant. This scheme can be seen as an extension of recent developments of LSH for maximum IP search (MIPS) [3, 4, 5, 6]. The significant difference from the previous methods is that our method is designed to modify the dissimilarity by changing the augmented query vector. We show that mpLSHVA is locality sensitive for L2 and IP dissimilarities and their weighted sums. However, its performance for the L2 dissimilarity is significantly inferior to the standard L2LSH [7]. In addition, mpLSHVA does not support the cosine distance.
Our second proposed method, called mpLSH with code concatenation (mpLSHCC), concatenates the hash codes for L2, cosine, and IP dissimilarities, and constructs a special structure, called cover tree [8], which enables efficient NNS with the weights for the dissimilarity measures controlled by adjusting the metric in the code space. Although mpLSHCC is conceptually simple and its performance is guaranteed by the original LSH scheme for each dissimilarity, it is not memory efficient, which also results in increased query time.
Considering the drawbacks of the aforementioned two variants led us to our final and recommended proposal, called mpLSH with code augmentation and transformation (mpLSHCAT). It supports L2, cosine, and IP dissimilarities by augmenting the hash codes, instead of the original vector. mpLSHCAT is memory efficient, since it shares most information over the hash codes for different dissimilarities, so that the augmentation is minimized.
We theoretically and empirically analyze the performance of mpLSH methods, and demonstrate their usefulness on real-world data sets. Our mpLSH methods also allow us to modify the importance of predefined groups of features. Adjustability of the dissimilarity measure at query time is not only useful in the absence of future analysis plans, but also applicable to multi-criteria searches. The following lists some sample applications of multi-criteria queries in diverse areas:

In recommender systems, suggesting items which are similar to a user-provided query and also match the user’s preference.

In material science, finding materials which are similar to a query material and also possess desired properties such as stability, conductivity, and medical utility.

In video retrieval, adjusting the importance of multimodal information such as brightness, color, audio, and text at query time.
II Background
In this section, we briefly overview previous locality sensitive hashing (LSH) techniques.
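Assume that we have a sample pool $\mathcal{X} = \{x^{(n)} \in \mathbb{R}^L\}_{n=1}^N$ in $L$-dimensional space. Given a query $q \in \mathbb{R}^L$, nearest neighbor search (NNS) solves the following problem:
$$x^* = \mathop{\mathrm{argmin}}_{x \in \mathcal{X}} \mathcal{L}(q, x), \qquad (1)$$
where $\mathcal{L}(\cdot, \cdot)$ is a dissimilarity measure. A naive approach computes the dissimilarity from the query to all samples, and then chooses the most similar samples, which takes $O(N)$ time. On the other hand, approximate NNS can be performed in sublinear time. We define the following three terms:
($S_0$-near neighbor) For $S_0 > 0$, $x$ is called an $S_0$-near neighbor of $q$ if $\mathcal{L}(q, x) \le S_0$.
($c$-approximate nearest neighbor search) Given $S_0 > 0$, $\delta > 0$, and $c > 1$, $c$-approximate nearest neighbor search ($c$-ANNS) reports some $cS_0$-near neighbor of $q$ with probability $1 - \delta$, if there exists an $S_0$-near neighbor of $q$ in $\mathcal{X}$.
(Locality sensitive hashing) A family $\mathcal{H}$ of functions $h$ is called $(S_0, cS_0, p_1, p_2)$-sensitive for a dissimilarity measure $\mathcal{L}$ if the following two conditions hold for any $q, x \in \mathbb{R}^L$:
$$\mathcal{L}(q, x) \le S_0 \;\Rightarrow\; \mathbb{P}(h(q) = h(x)) \ge p_1, \qquad \mathcal{L}(q, x) \ge cS_0 \;\Rightarrow\; \mathbb{P}(h(q) = h(x)) \le p_2,$$
where $\mathbb{P}(\cdot)$ denotes the probability of the event (with respect to the random draw of hash functions). Note that $p_1 > p_2$ is required for LSH to be useful. The image of hash functions is typically binary or integer. The following proposition guarantees that locality sensitive hashing (LSH) functions enable ANNS in sublinear time.
(Proposition) [1] Given a family of $(S_0, cS_0, p_1, p_2)$-sensitive hash functions, there exists an algorithm for $c$-ANNS with $O(N^{\rho} \log N)$ query time and $O(N^{1+\rho})$ space, where $\rho = \frac{\log p_1}{\log p_2} < 1$.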
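To make the role of the exponent in the proposition concrete, the following minimal sketch computes rho = log(p1) / log(p2) from the two collision probability bounds of a sensitive family (written p1 and p2 here) and the resulting query-time scaling N^rho log N. The function names and the example values are illustrative assumptions, not taken from the paper.

```python
import math


def lsh_rho(p1, p2):
    """Exponent rho = log(p1) / log(p2) for collision bounds with p1 > p2."""
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("need 0 < p2 < p1 < 1")
    return math.log(p1) / math.log(p2)


def query_cost(n_samples, p1, p2):
    """Query-time scaling N^rho * log(N), with constants omitted."""
    return n_samples ** lsh_rho(p1, p2) * math.log(n_samples)


if __name__ == "__main__":
    n = 10**6  # hypothetical sample pool size
    print(f"rho = {lsh_rho(0.8, 0.4):.3f}")
    print(f"sublinear scaling ~ {query_cost(n, 0.8, 0.4):.0f}, linear scan ~ {n}")
```

A smaller rho means a larger gap between the two collision probabilities and therefore a cheaper query, which is why rho is used to compare LSH families.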
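Below, we introduce three LSH families. Let $\mathcal{N}_L(\mu, \Sigma)$ be the $L$-dimensional Gaussian distribution, $\mathcal{U}_L(\alpha, \beta)$ be the $L$-dimensional uniform distribution with its support $[\alpha, \beta]$ for all dimensions, and $I_L$ be the $L$-dimensional identity matrix. The sign function, $\mathrm{sign}(z)$, applies elementwise, giving $1$ for $z_l \ge 0$ and $-1$ for $z_l < 0$. Likewise, the floor operator $\lfloor \cdot \rfloor$ applies elementwise for a vector. We denote by $\angle(\cdot, \cdot)$ the angle between two vectors.
(L2LSH) [7] For the L2 distance $\mathcal{L}_{\mathrm{L2}}(q, x) = \|q - x\|_2$, the hash function
$$h(x) = \left\lfloor R^{-1}(a^\top x + b) \right\rfloor, \qquad (2)$$
where $R > 0$ is a fixed real number, $a \sim \mathcal{N}_L(0, I_L)$, and $b \sim \mathcal{U}_1(0, R)$, satisfies $\mathbb{P}(h(q) = h(x)) = F_R(\mathcal{L}_{\mathrm{L2}}(q, x))$, where
$$F_R(d) = 1 - 2\Phi(-R/d) - \frac{2}{\sqrt{2\pi}\,(R/d)} \left(1 - e^{-(R/d)^2/2}\right).$$
Here, $\Phi(\cdot)$ is the standard cumulative Gaussian.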
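As an illustration of L2LSH, the sketch below implements the standard hash of [7], floor((a.x + b) / R) with a Gaussian projection vector a and a uniform offset b, together with the closed-form collision probability as a function of the L2 distance d, and compares the latter with a Monte Carlo estimate. The helper names, the bucket width R = 2, and the dimensionality are assumptions chosen for this example only.

```python
import math
import random


def l2lsh_hash(x, a, b, R):
    """L2-LSH bucket index floor((a.x + b) / R)."""
    proj = sum(ai * xi for ai, xi in zip(a, x))
    return math.floor((proj + b) / R)


def l2lsh_collision_prob(d, R):
    """Closed-form collision probability for two points at L2 distance d > 0."""
    s = R / d
    phi_neg_s = 0.5 * (1.0 + math.erf(-s / math.sqrt(2.0)))  # Phi(-s)
    tail = 2.0 / (math.sqrt(2.0 * math.pi) * s) * (1.0 - math.exp(-s * s / 2.0))
    return 1.0 - 2.0 * phi_neg_s - tail


def empirical_collision_rate(q, x, R, trials, seed):
    """Fraction of random (a, b) draws for which q and x share a bucket."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in q]
        b = rng.uniform(0.0, R)
        if l2lsh_hash(q, a, b, R) == l2lsh_hash(x, a, b, R):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    q = [0.0] * 8
    x = [1.0] + [0.0] * 7  # L2 distance 1 from q
    print("closed form:", l2lsh_collision_prob(1.0, 2.0))
    print("Monte Carlo:", empirical_collision_rate(q, x, 2.0, 20000, seed=0))
```

The collision probability decreases monotonically with the distance, which is the property that makes the family locality sensitive.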
[6] (simpleLSH) Assume that the samples and the query are rescaled so that , . For the inner product dissimilarity (with which the NNS problem (1) is called maximum IP search (MIPS)), the asymmetric hash functions
(5)  
(6)  
satisfy .
The three LSH methods above are standard and state-of-the-art (among the data-independent LSH schemes) for each dissimilarity measure. Although all methods involve the same random projection, the resulting hash codes are significantly different from each other.
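The asymmetric construction behind simpleLSH can be sketched as follows, assuming its usual formulation: a sample x with norm at most 1 is augmented to (x; sqrt(1 - ||x||^2)), the query q is augmented to (q; 0), and both are hashed with a sign random projection. Every augmented sample then has unit norm, so the squared L2 distance in the augmented space equals ||q||^2 + 1 - 2 q.x, and ranking by it is equivalent to MIPS. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def squared_l2(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def augment_sample(x):
    """Append sqrt(1 - ||x||^2) so that the augmented sample has unit norm."""
    sq_norm = dot(x, x)
    if sq_norm > 1.0 + 1e-12:
        raise ValueError("samples must be rescaled to norm <= 1")
    return list(x) + [math.sqrt(max(0.0, 1.0 - sq_norm))]


def augment_query(q):
    """Append a zero entry, leaving inner products with samples unchanged."""
    return list(q) + [0.0]


def sign_hash(v, a):
    """One sign bit of the random projection a.v."""
    return 1 if dot(a, v) >= 0.0 else -1


def sign_collision_prob(u, v):
    """Collision probability of sign hashes: 1 - angle(u, v) / pi."""
    cos = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return 1.0 - math.acos(max(-1.0, min(1.0, cos))) / math.pi


if __name__ == "__main__":
    q, x = [0.6, 0.8], [0.3, -0.4]
    qa, xa = augment_query(q), augment_sample(x)
    print(squared_l2(qa, xa), dot(q, q) + 1.0 - 2.0 * dot(q, x))
    print(sign_collision_prob(qa, xa))
```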
III Proposed Methods and Theory
In this section, we first define the problem setting. Then, we propose three LSH methods for multiple dissimilarity measures, and conduct a theoretical analysis.
III-A Problem Setting
Similarly to the simpleLSH (Proposition II), we rescale the samples so that . We also assume . (Footnote 1: This assumption is reasonable for L2NNS if the size of the sample pool is sufficiently large, and the query follows the same distribution as the samples. For MIPS, the norm of the query can be arbitrarily modified, and we set it to .) Let us assume multimodal data, where we can separate the feature vectors into groups, i.e., , . For example, each group corresponds to monochrome, color, audio, and text features in video retrieval. We also accept multiple queries for a single retrieval task. Our goal is to perform ANNS for the following dissimilarity measure, which we call multiple purpose (MP) dissimilarity:
(7) 
where are the feature weights such that . In the single query case, where , setting corresponds to L2NNS based on the first and the third feature groups, while setting corresponds to MIPS on the same feature groups. When we would like to down-weight the importance of signal amplitude (e.g., brightness of image) of the th feature group, we should increase the weight for the cosine distance, and decrease the weight for the squared L2 distance. Multiple queries are useful when we mix NNS and MIPS, for which the queries lie in different spaces with the same dimensionality. For example, by setting , we can retrieve items which are close to the item query and match the user preference query . An important requirement for our proposal is that the weights can be adjusted at query time.
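The idea of a query-time adjustable dissimilarity can be sketched with a brute-force evaluation. The sketch assumes that the MP dissimilarity is a plain weighted sum, over queries and feature groups, of the squared L2 distance, the cosine distance (1 minus cosine similarity), and the negative inner product; the exact constants of Eq.(7) are not assumed, and the weight names gamma, eta, and lam as well as the data layout are hypothetical.

```python
import math


def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def mp_dissimilarity(queries, sample, weights):
    """Weighted sum of squared-L2, cosine-distance and negative-IP terms.

    queries: list over queries, each a list of per-group vectors
    sample:  list of per-group vectors
    weights: list over queries, each a list of (gamma, eta, lam) per group
    """
    total = 0.0
    for q_groups, w_groups in zip(queries, weights):
        for q_g, x_g, (gamma, eta, lam) in zip(q_groups, sample, w_groups):
            ip = _dot(q_g, x_g)
            sq_l2 = sum((a - b) ** 2 for a, b in zip(q_g, x_g))
            norms = math.sqrt(_dot(q_g, q_g) * _dot(x_g, x_g))
            # Assumption: cosine similarity is taken as 0 for zero vectors.
            cos_dist = 1.0 - ip / norms if norms > 0.0 else 1.0
            total += gamma * sq_l2 + eta * cos_dist - lam * ip
    return total


if __name__ == "__main__":
    query = [[[1.0, 0.0], [0.5, 0.5]]]   # one query, two feature groups
    sample = [[0.0, 1.0], [1.0, 0.0]]
    # Squared L2 on group 1, inner product on group 2; weights chosen at query time.
    print(mp_dissimilarity(query, sample, [[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]]))
```

Changing only the weight tuples switches between L2NNS, MIPS, cosine search, and mixtures without touching the stored samples, which is the requirement stated above.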
III-B Multiple purpose LSH with Vector Augmentation (mpLSHVA)
Our first method, called multiple purpose LSH with vector augmentation (mpLSHVA), is inspired by the research on asymmetric LSHs for MIPS [3, 4, 5, 6], where the query and the samples are augmented with additional entries, so that the squared L2 distance in the augmented space coincides with the target dissimilarity up to a constant. A significant difference of our proposal from the previous methods is that we design the augmentation so that we can adjust the dissimilarity measure (i.e., the feature weights in Eq.(7)) by modifying the augmented query vector. Since mpLSHVA, unfortunately, does not support the cosine distance, we set in this subsection. We define the weighted sum query by (Footnote 2: A semicolon denotes the row-wise concatenation of vectors, as in MATLAB.)
We augment the queries and the samples as follows:
where is a (vectorvalued) function of , and is a function of . We constrain the augmentation for the sample vector so that it satisfies, for a constant ,
(8) 
Under this constraint, the norm of any augmented sample is equal to , which allows us to use signLSH (Proposition II) to perform L2NNS. The squaredL2distance between the query and a sample in the augmented space can be expressed as
(9) 
For , only the choice satisfying Eq.(8) is simpleLSH (for ), given in Proposition II. We consider the case for , and design and so that Eq.(9) matches the MP dissimilarity (7).
The augmentation that matches the MP dissimilarity is not unique. Here, we introduce the following easy construction with :
(10)  
Here, we defined
With the vector augmentation (10), Eq.(9) matches Eq.(7) up to a constant (see Appendix A):
The collision probability, i.e., the probability that the query and the sample are given the same code, can be analytically computed: Assume that the samples are rescaled so that and . For the MP dissimilarity , given by Eq.(7), with , the asymmetric hash functions
where and are given by Eq.(10), satisfy
(Proof) By construction, it holds that and , and simple calculations (see Appendix A) give . Then, applying Proposition II immediately proves the theorem.
Figure 1 depicts the theoretical value of $\rho$ of mpLSHVA, computed by using Theorem IIIB, for different weight settings for . Note that $\rho$ determines the quality of LSH (smaller is better) for ANNS performance (see Proposition II). In the case for L2NNS and MIPS, the $\rho$ values of the standard LSH methods, i.e., L2LSH (Proposition II) and simpleLSH (Proposition II), are also shown for comparison.
Although mpLSHVA offers attractive flexibility with adjustable dissimilarity, Figure 1 implies its inferior performance to the standard methods, especially in the L2NNS case. The reason might be a too strong asymmetry between the query and the samples: a query and a sample are far apart in the augmented space, even if they are close to each other in the original space. We can see this from the first entries in and in Eq.(10), respectively. Those entries for the query are nonnegative, i.e., for , while the corresponding entries for the sample are nonpositive, i.e., for . We believe that there is room to improve the performance of mpLSHVA, e.g., by adding constants and changing the scales of some augmented entries, which we leave as our future work.
In the next subsections, we propose alternative approaches, where codes are as symmetric as possible, and downweighting is done by changing the metric in the code space. This effectively keeps close points in the original space close in the code space.
III-C Multiple purpose LSH with Code Concatenation (mpLSHCC)
Let , , and , and define the metricwise weighted average queries by , , and .
Our second proposal, called multiple purpose LSH with code concatenation (mpLSHCC), simply concatenates multiple LSH codes, and performs NNS under the following distance metric at query time:
(11) 
where denotes the th independent draw of the corresponding LSH code for . The distance (11) is a multimetric, a linear combination of metrics [8], in the code space. For a multimetric, we can use the cover tree [11] for efficient (exact) NNS. Assuming that all adjustable linear weights are upper-bounded by 1, the cover tree expresses the neighboring relations between samples, taking all possible weight settings into account. NNS is conducted by bounding the code metric for a given weight setting. Thus, mpLSHCC allows selective exploration of hash buckets, so that we only need to accurately measure the distance to the samples assigned to the hash buckets within a small code distance. The query time complexity of the cover tree is $O(\kappa^{12} \log N)$, where $N$ is the number of samples and $\kappa$ is a data-dependent expansion constant [12]. Another good aspect of the cover tree is that it allows dynamic insertion and deletion of samples, and therefore lends itself naturally to the streaming setting. Appendix F describes further details.
In the pure case for L2, cosine, or IP dissimilarity, the hash code of mpLSHCC is equivalent to the base LSH code, and therefore, the performance is guaranteed by Propositions II–II, respectively. However, mpLSHCC is not optimal in terms of memory consumption and NNS efficiency. This inefficiency comes from the fact that it redundantly stores the same angular (or cosinedistance) information into each of the L2, sign, and simpleLSH codes. Note that the information of a vector is dominated by its angular components unless the dimensionality is very small.
III-D Multiple purpose LSH with Code Augmentation and Transformation (mpLSHCAT)
Our third proposal, called multiple purpose LSH with code augmentation and transformation (mpLSHCAT), offers significantly less memory requirement and faster NNS than mpLSHCC by sharing the angular information for all considered dissimilarity measures. Let
We essentially use signhash functions that we augment with norm information of the data, giving us the following augmented codes:
(12)  
(13)  
(14)  
for a partitioned vector and . Here, each entry of follows .
For two matrices in the transformed hash code space, we measure the distance with the following multimetric:
(15) 
where and .
Although the hash codes consist of entries, we do not need to store all the entries, and computation can be simpler and faster by first computing the total number of collisions in the signLSH part (14) for :
(16) 
Note that this computation, which dominates the computation cost for evaluating code distances, can be performed efficiently with bit operations. With the total number of collisions (16), the metric (15) between a query set and a sample can be expressed as
(17) 
Given a query set, this can be computed from and for . Therefore, we only need to store the pure signbits, which is required by signLSH alone, and additional float numbers.
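The collision count in the sign part can be sketched with bit operations as follows, assuming each sign code is packed into a Python integer with bit 1 for a non-negative projection. The number of collisions is the code length minus the Hamming distance, and the Hamming distance is the population count of the XOR of the two codes. The names pack_signs and count_collisions are hypothetical.

```python
def pack_signs(signs):
    """Pack a sequence of +1/-1 sign hashes into an integer bit code."""
    code = 0
    for s in signs:
        code = (code << 1) | (1 if s >= 0 else 0)
    return code


def count_collisions(code_a, code_b, n_bits):
    """Number of agreeing sign bits: n_bits minus popcount(code_a XOR code_b)."""
    hamming = bin(code_a ^ code_b).count("1")
    return n_bits - hamming


if __name__ == "__main__":
    query_signs = [1, -1, 1, 1, -1]
    sample_signs = [1, 1, 1, -1, -1]
    n_bits = len(query_signs)
    print(count_collisions(pack_signs(query_signs), pack_signs(sample_signs), n_bits))
```

Only the packed sign bits need to be stored per sample for this step; the norm-dependent terms of the code distance can then be added from the few additional floats.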
Similarly to mpLSHCC, we use the cover tree for efficient NNS based on the code distance (15). In the cover tree construction, we set the metric weights to their upperbounds, i.e., , and measure the distance between samples by
(18) 
Since the collision probability can be zero, we cannot directly apply the standard LSH theory with the value guaranteeing the ANNS performance. Instead, we show that the metric (15) of mpLSHCAT approximates the MP dissimilarity (7), and the quality of ANNS is guaranteed.
For , it is
with . (proof is given in Appendix D).
with
The error with a maximum of ranges one order of magnitude below the MP dissimilarity having itself a range of . Note that Corollary IIID implies good approximation for the boundary cases, squaredL2, IP and cosinedistance, through mpLSHCAT since they are special cases of weights: For example, mpLSHCAT approximates squaredL2distance when setting . The following theorem guarantees ANNS to succeed with mpLSHCAT for pure MIPS case with specified probability (proof is given in Appendix E): Let , and set
where depend on and (see Appendix E for details). With probability larger than mpLSHCAT guarantees ANNS with respect to (MIPS). Note that it is straightforward to show Theorem IIID for the squared L2 and cosine distances.
In Section IV, we will empirically show the good performance of mpLSHCAT in general cases.
III-E Memory Requirement
For all LSH schemes, one can trade off memory consumption against accuracy by changing the hash bit length . However, the memory consumption per hash differs so strongly between hashing schemes that comparing their performance at a globally shared hash bit length is inadequate. In this subsection, we derive individual numbers of hashes for each scheme, given a fixed memory budget.
We count the theoretically minimal number of bits required to store the hash code of one data point. The two fundamental components we are confronted with are signhashes and discretized reals. Signhashes can be represented by exactly one bit. For the reals we choose a resolution such that their discretizations take values in a set of fixed size. The L2hash function
is a random variable with potentially infinite, discrete values. Nevertheless we can come up with a realistic upperbound of values the L2hash essentially takes. Note that
follows a distribution and . Then . Therefore L2hash essentially takes one of discrete values stored by bits. Namely, for L2hash requires 13 bits. We also store the norm part of mpLSHCAT using 13 bits.
Denote by the required storage of mpLSHCAT. Then , which we set as our fixed memory budget for a given . The baselines signLSH and simpleLSH, as well as mpLSHVA, are pure sign hashes, thus giving them a budget of hashes. As discussed above, L2LSH may take hashes. For mpLSHCC we allocate a third of the budget for each of the three components, giving . This consideration is used when we compare mpLSHCC and mpLSHCAT in Section IVB.
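|  | Recall@1 | Recall@5 | Recall@10 | Query time (msec), k=1 | Query time (msec), k=5 | Query time (msec), k=10 |  |  |
|---|---|---|---|---|---|---|---|---|
| L2 | 0.53 | 0.76 | 0.82 | 2633.83 | 2824.06 | 2867.00 | 31351 | 4344 bytes |
| MIPS | 0.69 | 0.77 | 0.82 | 3243.51 | 3323.20 | 3340.36 | 31351 | 4344 bytes |
| L2+MIPS (.5,.5) | 0.29 | 0.50 | 0.60 | 3553.63 | 3118.93 | 3151.44 | 31351 | 4344 bytes |

|  | Recall@1 | Recall@5 | Recall@10 | Query time (msec), k=1 | Query time (msec), k=5 | Query time (msec), k=10 |  |  |
|---|---|---|---|---|---|---|---|---|
| L2 | 0.52 | 0.80 | 0.89 | 583.85 | 617.02 | 626.02 | 41958 | 224 bytes |
| MIPS | 0.64 | 0.76 | 0.85 | 593.11 | 635.72 | 645.14 | 41958 | 224 bytes |
| L2+MIPS (.5,.5) | 0.29 | 0.52 | 0.62 | 476.62 | 505.63 | 515.77 | 41958 | 224 bytes |

|  | Recall@1 | Recall@5 | Recall@10 | Query time (msec), k=1 | Query time (msec), k=5 | Query time (msec), k=10 |  |  |
|---|---|---|---|---|---|---|---|---|
| L2 | 0.35 | 0.49 | 0.59 | 1069.29 | 1068.97 | 1074.40 | 4244 | 280 bytes |
| MIPS | 0.32 | 0.56 | 0.56 | 363.61 | 434.49 | 453.35 | 4244 | 280 bytes |
| L2+MIPS (.5,.5) | 0.04 | 0.07 | 0.08 | 811.72 | 839.91 | 847.35 | 4244 | 280 bytes |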
IV Experiment
Here, we conduct an empirical evaluation on several realworld data sets.
IV-A Collaborative Filtering
We first evaluate our methods on collaborative filtering data, the MovieLens10M (Footnote 3: http://www.grouplens.org/) and the Netflix datasets [13]. Following the experiment in [3, 5], we applied PureSVD [14] to get dimensional user and item vectors, where for MovieLens and for Netflix. We centered the samples so that , which affects neither the L2NNS nor the MIPS solution.
Regarding the dimensional vector as a single feature group (), we evaluated the performance in L2NNS (), MIPS (), and their weighted sum (). The queries for L2NNS were chosen randomly from the items, while the queries for MIPS were chosen from the users. For each query, we found its nearest neighbors in terms of the MP dissimilarity (7) by linear search, and used them as the ground truth. We set the hash bit length to , and ranked the samples (items) based on the Hamming distance for the baseline methods and mpLSHVA. For mpLSHCC and mpLSHCAT, we ranked the samples based on their code distances (11) and (15), respectively. After that, we drew the precision-recall curve, defined as and for different , where “relevant seen” is the number of the true nearest neighbors that are ranked within the top positions by the LSH methods. Figures 2 and 3 show the results on MovieLens10M for and and Netflix for