Large amounts of data are being collected every day in the sciences and industry. Analysing such truly big data sets even by linear methods can become infeasible; sublinear methods such as locality sensitive hashing (LSH) have therefore become an important analysis tool. For some data collections, the purpose can be clearly expressed from the start, for example, text/image/video/speech analysis or recommender systems. In other cases, such as drug discovery or the material genome project, the ultimate query structure for such data may not yet be fixed. In other words, measurements, simulations, or observations may be recorded without being able to spell out the full specific purpose (although the general goal, e.g., better drugs or more potent materials, is clear). Motivated by the latter case, we consider how one can use LSH schemes without defining any specific dissimilarity at the data acquisition and pre-processing phase.
LSH, one of the key technologies for big data analysis, enables approximate nearest neighbor search (ANNS) in sublinear time [1, 2]. With LSH functions for a required dissimilarity measure in hand, each data sample is assigned to a hash bucket in the pre-processing stage. At runtime, ANNS with theoretical guarantees can be performed by restricting the search to the samples that lie within the hash bucket to which the query point is assigned, along with the samples lying in the neighbouring buckets.
A challenge in developing LSH without defining a specific purpose is that the existing LSH schemes, designed for different dissimilarity measures, provide significantly different hash codes. Therefore, a naive realization requires as many hash tables as there are possible target dissimilarities, which is not realistic if we need to adjust the importance of multiple criteria. In this paper, we propose three variants of multiple purpose LSH (mp-LSH), which support L2, cosine, and inner product (IP) dissimilarities, and their weighted sums, where the weights can be adjusted at query time.
The first proposed method, called mp-LSH with vector augmentation (mp-LSH-VA), maps the data space into an augmented vector space, so that the squared-L2-distance in the augmented space matches the required dissimilarity measure up to a constant. This scheme can be seen as an extension of recent developments of LSH for maximum IP search (MIPS) [3, 4, 5, 6]. The significant difference from the previous methods is that our method is designed to modify the dissimilarity by changing the augmented query vector. We show that mp-LSH-VA is locality sensitive for L2 and IP dissimilarities and their weighted sums. However, its performance for the L2 dissimilarity is significantly inferior to the standard L2-LSH. In addition, mp-LSH-VA does not support the cosine-distance.
Our second proposed method, called mp-LSH with code concatenation (mp-LSH-CC), concatenates the hash codes for L2, cosine, and IP dissimilarities, and constructs a special structure, called a cover tree, which enables efficient NNS with the weights for the dissimilarity measures controlled by adjusting the metric in the code space. Although mp-LSH-CC is conceptually simple and its performance is guaranteed by the original LSH scheme for each dissimilarity, it is not memory efficient, which also results in increased query time.
Considering the drawbacks of the aforementioned two variants led us to our final and recommended proposal, called mp-LSH with code augmentation and transformation (mp-LSH-CAT). It supports L2, cosine, and IP dissimilarities by augmenting the hash codes, instead of the original vector. mp-LSH-CAT is memory efficient, since it shares most information over the hash codes for different dissimilarities, so that the augmentation is minimized.
We theoretically and empirically analyze the performance of mp-LSH methods, and demonstrate their usefulness on real-world data sets. Our mp-LSH methods also allow us to modify the importance of pre-defined groups of features. Adjustability of the dissimilarity measure at query time is not only useful in the absence of future analysis plans, but also applicable to multi-criteria searches. The following lists some sample applications of multi-criteria queries in diverse areas:
In recommender systems, suggesting items which are similar to a user-provided query and also match the user’s preference.
In material science, finding materials which are similar to a query material and also possess desired properties such as stability, conductivity, and medical utility.
In video retrieval, adjusting the importance of multimodal information such as brightness, color, audio, and text at query time.
In this section, we briefly overview previous locality sensitive hashing (LSH) techniques.
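Assume that we have a sample pool $\mathcal{X} = \{x^{(n)} \in \mathbb{R}^L\}_{n=1}^N$ in $L$-dimensional space. Given a query $q \in \mathbb{R}^L$, nearest neighbor search (NNS) solves the following problem:

$$x^* = \operatorname*{argmin}_{x \in \mathcal{X}} \mathcal{L}(q, x), \tag{1}$$

where $\mathcal{L}(\cdot, \cdot)$ is a dissimilarity measure. A naive approach computes the dissimilarity from the query to all samples, and then chooses the most similar samples, which takes $O(N)$ time. On the other hand, approximate NNS can be performed in sublinear time. We define the following three terms:

($S_0$-near neighbor) For $S_0 > 0$, $x$ is called an $S_0$-near neighbor of $q$ if $\mathcal{L}(q, x) \le S_0$.

($c$-approximate nearest neighbor search) Given $S_0 > 0$, $\delta > 0$, and $c > 1$, $c$-approximate nearest neighbor search ($c$-ANNS) reports some $cS_0$-near neighbor of $q$ with probability $1 - \delta$, if there exists an $S_0$-near neighbor of $q$ in $\mathcal{X}$.

(Locality sensitive hashing) A family $\mathcal{H} = \{h: \mathbb{R}^L \to \mathcal{K}\}$ of functions is called $(S_0, cS_0, p_1, p_2)$-sensitive for a dissimilarity measure $\mathcal{L}: \mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}$, if the following two conditions hold for any $q, x \in \mathbb{R}^L$:

$$\mathcal{L}(q, x) \le S_0 \;\Rightarrow\; P(h(q) = h(x)) \ge p_1, \qquad \mathcal{L}(q, x) \ge cS_0 \;\Rightarrow\; P(h(q) = h(x)) \le p_2,$$

where $P(\cdot)$ denotes the probability of the event (with respect to the random draw of hash functions). Note that $p_1 > p_2$ is required for LSH to be useful. The image $\mathcal{K}$ of hash functions is typically binary or integer. The following proposition guarantees that locality sensitive hashing (LSH) functions enable $c$-ANNS in sublinear time.

(Proposition) Given a family of $(S_0, cS_0, p_1, p_2)$-sensitive hash functions, there exists an algorithm for $c$-ANNS with $O(N^{\rho} \log N)$ query time and $O(N^{1+\rho})$ space, where $\rho = \frac{\log p_1}{\log p_2} < 1$.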
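As a small numerical illustration of the proposition above, the exponent $\rho = \log p_1 / \log p_2$ of the standard LSH guarantee can be evaluated directly. The sketch below is ours, not part of the original analysis: the function names `rho` and `query_cost` are hypothetical, and the cost model simply evaluates $N^{\rho} \log N$ with all constant factors ignored.

```python
import math


def rho(p1: float, p2: float) -> float:
    """Exponent rho = log(p1) / log(p2) of a (S0, c*S0, p1, p2)-sensitive family."""
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("need 0 < p2 < p1 < 1")
    return math.log(p1) / math.log(p2)


def query_cost(n: int, p1: float, p2: float) -> float:
    """Illustrative cost N^rho * log(N); constant factors are ignored."""
    return n ** rho(p1, p2) * math.log(n)


if __name__ == "__main__":
    n = 10 ** 6
    for p1, p2 in [(0.9, 0.5), (0.7, 0.5), (0.55, 0.5)]:
        print(f"p1={p1}, p2={p2}: rho={rho(p1, p2):.3f}, "
              f"cost={query_cost(n, p1, p2):.1f} vs. linear scan N={n}")
```

The closer $p_1$ is to $p_2$, the closer $\rho$ gets to 1 and the smaller the gain over a linear scan, which is why a smaller $\rho$ indicates a better hash family.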
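Below, we introduce three LSH families. Let $\mathcal{N}_L(\mu, \Sigma)$ be the $L$-dimensional Gaussian distribution, $\mathcal{U}_L(\alpha, \beta)$ be the $L$-dimensional uniform distribution with its support $[\alpha, \beta]$ for all dimensions, and $I_L$ be the $L$-dimensional identity matrix. The sign function, $\mathrm{sign}(z)$, applies element-wise, giving $1$ for $z_l \ge 0$ and $-1$ for $z_l < 0$. Likewise, the floor operator $\lfloor \cdot \rfloor$ applies element-wise for a vector. We denote by $\measuredangle(\cdot, \cdot)$ the angle between two vectors.

(L2-LSH) For the L2-distance $\mathcal{L}_{\mathrm{L2}}(q, x) = \|q - x\|_2$, the hash function

$$h^{\mathrm{L2}}_{a,b}(x) = \left\lfloor R^{-1} (a^\top x + b) \right\rfloor,$$

where $R > 0$ is a fixed real number, $a \sim \mathcal{N}_L(0, I_L)$, and $b \sim \mathcal{U}_1(0, R)$, satisfies $P(h^{\mathrm{L2}}_{a,b}(q) = h^{\mathrm{L2}}_{a,b}(x)) = F^{\mathrm{L2}}_R(\mathcal{L}_{\mathrm{L2}}(q, x))$, where

$$F^{\mathrm{L2}}_R(d) = 1 - 2\Phi(-R/d) - \frac{2}{\sqrt{2\pi}\,(R/d)} \left(1 - e^{-(R/d)^2/2}\right).$$

Here, $\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, dy$ is the standard cumulative Gaussian.

(sign-LSH) For the cosine-distance $\mathcal{L}_{\cos}(q, x) = 1 - \cos \measuredangle(q, x) = 1 - \frac{q^\top x}{\|q\|_2 \|x\|_2}$, the hash function

$$h^{\mathrm{sign}}_{a}(x) = \mathrm{sign}(a^\top x),$$

where $a \sim \mathcal{N}_L(0, I_L)$, satisfies $P(h^{\mathrm{sign}}_{a}(q) = h^{\mathrm{sign}}_{a}(x)) = F^{\mathrm{sign}}(\mathcal{L}_{\cos}(q, x))$, where

$$F^{\mathrm{sign}}(d) = 1 - \frac{1}{\pi} \cos^{-1}(1 - d).$$

(simple-LSH) Assume that the samples and the query are rescaled so that $\max_{x \in \mathcal{X}} \|x\|_2 \le 1$, $\|q\|_2 = 1$. For the inner product dissimilarity $\mathcal{L}_{\mathrm{ip}}(q, x) = -q^\top x$ (with which the NNS problem (1) is called maximum IP search (MIPS)), the asymmetric hash functions

$$h^{\mathrm{smp\text{-}q}}_{a}(q) = \mathrm{sign}(a^\top \tilde{q}), \quad \tilde{q} = (q^\top, 0)^\top, \qquad h^{\mathrm{smp\text{-}x}}_{a}(x) = \mathrm{sign}(a^\top \tilde{x}), \quad \tilde{x} = \left(x^\top, \sqrt{1 - \|x\|_2^2}\right)^\top,$$

where $a \sim \mathcal{N}_{L+1}(0, I_{L+1})$, satisfy $P(h^{\mathrm{smp\text{-}q}}_{a}(q) = h^{\mathrm{smp\text{-}x}}_{a}(x)) = F^{\mathrm{sign}}(1 + \mathcal{L}_{\mathrm{ip}}(q, x))$.

These three LSH methods above are standard and state-of-the-art (among the data-independent LSH schemes) for each dissimilarity measure. Although all methods involve the same random projection $a^\top x$, the resulting hash codes are significantly different from each other.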
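The three families can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions rather than the authors' implementation: all function names are hypothetical, a single hash bit (or bucket index) is computed per call, and the simple-LSH variant assumes a unit-norm query and samples with norm at most 1. The empirical collision rate of the sign hash can be compared with the well-known closed form $1 - \measuredangle(q, x)/\pi$.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_vector(dim, rng):
    """Draw a ~ N(0, I) of the given dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def l2_hash(x, a, b, r):
    """L2-LSH: integer bucket floor((a.x + b) / r), with b drawn from U(0, r)."""
    return math.floor((dot(a, x) + b) / r)


def sign_hash(x, a):
    """sign-LSH: +1 if a.x >= 0, else -1."""
    return 1 if dot(a, x) >= 0.0 else -1


def simple_hash_query(q, a):
    """simple-LSH, query side: sign hash of (q; 0). Assumes ||q|| = 1."""
    return sign_hash(list(q) + [0.0], a)


def simple_hash_sample(x, a):
    """simple-LSH, sample side: sign hash of (x; sqrt(1 - ||x||^2)). Assumes ||x|| <= 1."""
    slack = math.sqrt(max(0.0, 1.0 - dot(x, x)))
    return sign_hash(list(x) + [slack], a)


def sign_collision_probability(q, x):
    """Closed-form sign-LSH collision probability, 1 - angle(q, x) / pi."""
    cosine = dot(q, x) / math.sqrt(dot(q, q) * dot(x, x))
    return 1.0 - math.acos(max(-1.0, min(1.0, cosine))) / math.pi


def empirical_sign_collision(q, x, trials, rng):
    """Fraction of random projections under which q and x receive the same sign."""
    hits = 0
    for _ in range(trials):
        a = gaussian_vector(len(q), rng)
        hits += sign_hash(q, a) == sign_hash(x, a)
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    q, x = [1.0, 0.0], [1.0, 1.0]
    print("closed form:", sign_collision_probability(q, x))
    print("Monte Carlo:", empirical_sign_collision(q, x, 20000, rng))
```

Note that all three hashes start from the same projection `dot(a, x)`; they differ only in the augmentation applied before it and the quantization applied after it.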
III Proposed Methods and Theory
In this section, we first define the problem setting. Then, we propose three LSH methods for multiple dissimilarity measures, and conduct a theoretical analysis.
III-A Problem Setting
Similarly to the simple-LSH (Proposition II), we rescale the samples so that . We also assume . (Footnote 1: This assumption is reasonable for L2-NNS if the size of the sample pool is sufficiently large, and the query follows the same distribution as the samples. For MIPS, the norm of the query can be arbitrarily modified, and we set it to .) Let us assume multi-modal data, where we can separate the feature vectors into groups, i.e., , . For example, the groups may correspond to monochrome, color, audio, and text features in video retrieval. We also accept multiple queries for a single retrieval task. Our goal is to perform ANNS for the following dissimilarity measure, which we call multiple purpose (MP) dissimilarity:
where are the feature weights such that . In the single query case, where , setting corresponds to L2-NNS based on the first and the third feature groups, while setting corresponds to MIPS on the same feature groups. When we would like to down-weight the importance of signal amplitude (e.g., image brightness) of the -th feature group, we should increase the weight for the cosine-distance and decrease the weight for the squared-L2-distance. Multiple queries are useful when we mix NNS and MIPS, for which the queries lie in different spaces with the same dimensionality. For example, by setting , we can retrieve items that are close to the item query and match the user preference query . An important requirement for our proposal is that the weights can be adjusted at query time.
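To make the role of query-time weights concrete, the following sketch evaluates a weighted sum of the three dissimilarities per feature group and runs an exhaustive search with it. It is an illustration under stated assumptions only: the exact form and scaling of the MP dissimilarity are not reproduced here, so the code assumes a plain weighted sum of the squared L2 distance, the cosine distance, and the negative inner product, for a single query; the names `mp_dissimilarity`, `nearest`, `w_l2`, `w_cos`, and `w_ip` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def mp_dissimilarity(q_groups, x_groups, w_l2, w_cos, w_ip):
    """Assumed weighted sum over feature groups (single query).

    q_groups, x_groups: lists of per-group feature vectors.
    w_l2, w_cos, w_ip: per-group non-negative weights for the squared L2
    distance, the cosine distance, and the negative inner product.
    """
    total = 0.0
    for q, x, g, c, e in zip(q_groups, x_groups, w_l2, w_cos, w_ip):
        inner = dot(q, x)
        if g:
            total += g * sum((a - b) ** 2 for a, b in zip(q, x))
        if c:
            total += c * (1.0 - inner / math.sqrt(dot(q, q) * dot(x, x)))
        if e:
            total -= e * inner
    return total


def nearest(q_groups, pool, w_l2, w_cos, w_ip):
    """Exhaustive search: index of the pool sample with the smallest dissimilarity."""
    return min(range(len(pool)),
               key=lambda n: mp_dissimilarity(q_groups, pool[n], w_l2, w_cos, w_ip))


if __name__ == "__main__":
    query = [[1.0, 0.0]]                      # one feature group
    pool = [[[0.5, 0.0]], [[0.8, 0.6]]]       # two samples, one group each
    print("L2-NNS picks sample", nearest(query, pool, [1.0], [0.0], [0.0]))
    print("MIPS picks sample", nearest(query, pool, [0.0], [0.0], [1.0]))
```

The same pool yields different nearest neighbors depending only on the weights passed at query time, which is the behavior the hashing schemes in this section aim to preserve in sublinear time.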
III-B Multiple purpose LSH with Vector Augmentation (mp-LSH-VA)
Our first method, called multiple purpose LSH with vector augmentation (mp-LSH-VA), is inspired by the research on asymmetric LSHs for MIPS [3, 4, 5, 6], where the query and the samples are augmented with additional entries, so that the squared-L2-distance in the augmented space coincides with the target dissimilarity up to a constant. A significant difference of our proposal from the previous methods is that we design the augmentation so that we can adjust the dissimilarity measure (i.e., the feature weights in Eq.(7)) by modifying the augmented query vector. Since mp-LSH-VA, unfortunately, does not support the cosine-distance, we set in this subsection. We define the weighted sum query by (Footnote 2: a semicolon denotes the row-wise concatenation of vectors, as in MATLAB)
We augment the queries and the samples as follows:
where is a (vector-valued) function of , and is a function of . We constrain the augmentation for the sample vector so that it satisfies, for a constant ,
Under this constraint, the norm of any augmented sample is equal to , which allows us to use sign-LSH (Proposition II) to perform L2-NNS. The squared-L2-distance between the query and a sample in the augmented space can be expressed as
The augmentation that matches the MP dissimilarity is not unique. Here, we introduce the following easy construction with :
Here, we defined
The collision probability, i.e., the probability that the query and the sample are given the same code, can be analytically computed: Assume that the samples are rescaled so that and . For the MP dissimilarity , given by Eq.(7), with , the asymmetric hash functions
where and are given by Eq.(10), satisfy
Figure 1 depicts the theoretical value of $\rho$ of mp-LSH-VA, computed by using Theorem III-B, for different weight settings for . Note that $\rho$ determines the quality of LSH (smaller is better) for $c$-ANNS performance (see Proposition II). In the case for L2-NNS and MIPS, the $\rho$ values of the standard LSH methods, i.e., L2-LSH (Proposition II) and simple-LSH (Proposition II), are also shown for comparison.
Although mp-LSH-VA offers attractive flexibility with adjustable dissimilarity, Figure 1 implies its inferior performance to the standard methods, especially in the L2-NNS case. The reason might be a too strong asymmetry between the query and the samples: a query and a sample are far apart in the augmented space, even if they are close to each other in the original space. We can see this from the first entries in and in Eq.(10), respectively. Those entries for the query are non-negative, i.e., for , while the corresponding entries for the sample are non-positive, i.e., for . We believe that there is room to improve the performance of mp-LSH-VA, e.g., by adding constants and changing the scales of some augmented entries, which we leave as our future work.
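The principle behind vector augmentation, namely that a squared L2 distance in an augmented space can equal a target dissimilarity up to a constant, can be checked numerically on the simpler, known simple-LSH-style augmentation. The sketch below deliberately does not reproduce the mp-LSH-VA construction of this subsection; it only verifies the identity $\|\tilde{q} - \tilde{x}\|_2^2 = \|q\|_2^2 + 1 - 2 q^\top x$ for $\tilde{q} = (q; 0)$ and $\tilde{x} = (x; \sqrt{1 - \|x\|_2^2})$, with hypothetical function names.

```python
import math


def inner(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def squared_l2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def augment_query(q):
    """Query side: append a zero entry."""
    return list(q) + [0.0]


def augment_sample(x):
    """Sample side: append sqrt(1 - ||x||^2), so the augmented norm is 1 (needs ||x|| <= 1)."""
    return list(x) + [math.sqrt(max(0.0, 1.0 - inner(x, x)))]


if __name__ == "__main__":
    q = [0.6, 0.8]
    for x in ([0.3, 0.4], [0.0, 0.9], [-0.5, 0.1]):
        lhs = squared_l2(augment_query(q), augment_sample(x))
        rhs = inner(q, q) + 1.0 - 2.0 * inner(q, x)
        print(f"x={x}: augmented squared L2 = {lhs:.6f}, ||q||^2 + 1 - 2 q.x = {rhs:.6f}")
```

Because the first two terms do not depend on the sample, ranking by augmented squared L2 distance is equivalent to ranking by inner product. The check also shows the asymmetry discussed above: the last augmented entry is zero for the query but non-negative for every sample.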
In the next subsections, we propose alternative approaches, where codes are as symmetric as possible, and down-weighting is done by changing the metric in the code space. This effectively keeps close points in the original space close in the code space.
III-C Multiple purpose LSH with Code Concatenation (mp-LSH-CC)
Let , , and , and define the metric-wise weighted average queries by , , and .
Our second proposal, called multiple purpose LSH with code concatenation (mp-LSH-CC), simply concatenates multiple LSH codes, and performs NNS under the following distance metric at query time:
where denotes the -th independent draw of the corresponding LSH code for . The distance (11) is a multi-metric, a linear combination of metrics , in the code space. For a multi-metric, we can use the cover tree for efficient (exact) NNS. Assuming that all adjustable linear weights are upper-bounded by 1, the cover tree expresses the neighboring relation between samples, taking all possible weight settings into account. NNS is conducted by bounding the code metric for a given weight setting. Thus, mp-LSH-CC allows selective exploration of hash buckets, so that we only need to accurately measure the distance to the samples assigned to the hash buckets within a small code distance. The query time complexity of the cover tree is $O(\kappa^{12} \log N)$, where $\kappa$ is a data-dependent expansion constant. Another good aspect of the cover tree is that it allows dynamic insertion and deletion of samples, and therefore, it lends itself naturally to the streaming setting. Appendix F describes further details.
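The idea of controlling the dissimilarity through the metric in code space can be sketched as follows. This is only an illustration under our own assumptions: each sample stores a triple of code lists (L2 codes, sign codes, simple-LSH codes), the per-part code distance is taken to be the L1 distance between codes, and a plain linear scan stands in for the cover tree used in the text. The names `cc_distance` and `rank_by_code_distance` are hypothetical.

```python
def l1(u, v):
    """L1 distance between two equal-length code lists."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cc_distance(code_q, code_x, w_l2, w_cos, w_ip):
    """Weighted sum of per-part code distances (an assumed form of the multi-metric).

    code_q, code_x: triples (l2_codes, sign_codes, simple_codes).
    Each part is a metric on its code, so any non-negative weighting
    is again a (pseudo-)metric on the concatenated code.
    """
    l2_q, sign_q, simple_q = code_q
    l2_x, sign_x, simple_x = code_x
    return (w_l2 * l1(l2_q, l2_x)
            + w_cos * l1(sign_q, sign_x)
            + w_ip * l1(simple_q, simple_x))


def rank_by_code_distance(code_q, codes, w_l2, w_cos, w_ip):
    """Linear scan over stored codes (a stand-in for the cover tree search)."""
    return sorted(range(len(codes)),
                  key=lambda n: cc_distance(code_q, codes[n], w_l2, w_cos, w_ip))


if __name__ == "__main__":
    code_q = ([3, 1], [1, -1], [1, 1])
    codes = [([3, 1], [-1, 1], [1, 1]),
             ([5, 2], [1, -1], [1, 1])]
    print("L2 weighting:", rank_by_code_distance(code_q, codes, 1.0, 0.0, 0.0))
    print("cosine weighting:", rank_by_code_distance(code_q, codes, 0.0, 1.0, 0.0))
```

The stored codes never change; only the weights passed at query time do, at the price of storing all three code types for every sample.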
In the pure case for L2, cosine, or IP dissimilarity, the hash code of mp-LSH-CC is equivalent to the base LSH code, and therefore, the performance is guaranteed by Propositions II–II, respectively. However, mp-LSH-CC is not optimal in terms of memory consumption and NNS efficiency. This inefficiency comes from the fact that it redundantly stores the same angular (or cosine-distance) information into each of the L2-, sign-, and simple-LSH codes. Note that the information of a vector is dominated by its angular components unless the dimensionality is very small.
III-D Multiple purpose LSH with Code Augmentation and Transformation (mp-LSH-CAT)
Our third proposal, called multiple purpose LSH with code augmentation and transformation (mp-LSH-CAT), requires significantly less memory and offers faster NNS than mp-LSH-CC by sharing the angular information across all considered dissimilarity measures. Let
We essentially use sign-hash functions that we augment with norm information of the data, giving us the following augmented codes:
for a partitioned vector and . Here, each entry of follows .
For two matrices in the transformed hash code space, we measure the distance with the following multi-metric:
where and .
Although the hash codes consist of entries, we do not need to store all the entries, and computation can be simpler and faster by first computing the total number of collisions in the sign-LSH part (14) for :
Note that this computation, which dominates the computation cost for evaluating code distances, can be performed efficiently with bit operations. With the total number of collisions (16), the metric (15) between a query set and a sample can be expressed as
Given a query set, this can be computed from and for . Therefore, we only need to store the pure sign-bits, which is required by sign-LSH alone, and additional float numbers.
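Counting sign collisions with bit operations can be sketched as below. This is a minimal illustration with hypothetical helper names: the sign bits of a code are packed into a Python integer, and the number of collisions is the code length minus the population count of the XOR of the two packed codes.

```python
def pack_signs(signs):
    """Pack a sequence of +1/-1 values into an int; bit t is set iff signs[t] == +1."""
    word = 0
    for t, s in enumerate(signs):
        if s > 0:
            word |= 1 << t
    return word


def count_collisions(code_q, code_x, num_bits):
    """Number of positions where two packed sign codes agree.

    XOR marks the differing positions; counting its set bits gives the
    Hamming distance, and collisions = num_bits - Hamming distance.
    """
    differing = bin(code_q ^ code_x).count("1")
    return num_bits - differing


if __name__ == "__main__":
    q_signs = [1, -1, 1, 1, -1, -1, 1, -1]
    x_signs = [1, 1, -1, 1, -1, -1, 1, 1]
    cq, cx = pack_signs(q_signs), pack_signs(x_signs)
    print("collisions:", count_collisions(cq, cx, len(q_signs)))
```

In a compiled implementation the same computation would typically map to a hardware popcount over machine words, which is why the sign part can dominate the cost of the code distance without making it slow.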
Similarly to mp-LSH-CC, we use the cover tree for efficient NNS based on the code distance (15). In the cover tree construction, we set the metric weights to their upper-bounds, i.e., , and measure the distance between samples by
Since the collision probability can be zero, we cannot directly apply the standard LSH theory with the $\rho$ value guaranteeing the ANNS performance. Instead, we show that the metric (15) of mp-LSH-CAT approximates the MP dissimilarity (7), and the quality of ANNS is guaranteed.
For , it is
with (proof is given in Appendix C).
For , it is
with . (proof is given in Appendix D).
The error with a maximum of ranges one order of magnitude below the MP dissimilarity having itself a range of . Note that Corollary III-D implies good approximation for the boundary cases, squared-L2-, IP- and cosine-distance, through mp-LSH-CAT since they are special cases of weights: For example, mp-LSH-CAT approximates squared-L2-distance when setting . The following theorem guarantees ANNS to succeed with mp-LSH-CAT for the pure MIPS case with a specified probability (proof is given in Appendix E): Let , and set
where depend on and (see Appendix E for details). With probability larger than mp-LSH-CAT guarantees -ANNS with respect to (MIPS). Note that it is straightforward to show Theorem III-D for squared-L2- and cosine-distance.
In Section IV, we will empirically show the good performance of mp-LSH-CAT in general cases.
III-E Memory Requirement
For all LSH schemes, one can trade off the memory consumption and accuracy performance by changing the hash bit length . However, the memory consumption per hash differs considerably between hashing schemes, so a performance comparison at a globally shared is not adequate. In this subsection, we derive individual numbers of hashes for each scheme, given a fixed memory budget.
We count the theoretically minimal number of bits required to store the hash code of one data point. The two fundamental components we are confronted with are sign-hashes and discretized reals. Sign-hashes can be represented by exactly one bit. For the reals we choose a resolution such that their discretizations take values in a set of fixed size. The L2-hash function
is a random variable with potentially infinite, discrete values. Nevertheless, we can come up with a realistic upper-bound on the number of values the L2-hash essentially takes. Note that follows a distribution and . Then . Therefore, the L2-hash essentially takes one of discrete values stored by bits. Namely, for the L2-hash requires 13 bits. We also store the norm-part of mp-LSH-CAT using 13 bits.
Denote by the required storage of mp-LSH-CAT. Then , which we set as our fixed memory budget for a given . The baselines sign-LSH and simple-LSH, as well as mp-LSH-VA, are pure sign-hashes, thus giving them a budget of hashes. As discussed above, L2-LSH may take hashes. For mp-LSH-CC, we allocate a third of the budget to each of the three components, giving . This consideration is used when we compare mp-LSH-CC and mp-LSH-CAT in Section IV-B.
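The budget arithmetic of this subsection can be sketched as follows. The per-item costs (1 bit per sign-hash, 13 bits per L2-hash or stored norm) are taken from the text; everything else is an assumption made for illustration: the mp-LSH-CAT storage is modeled as the number of sign bits plus 13 bits per stored norm, hash counts are rounded down, and the names `cat_storage_bits` and `hashes_per_scheme` are hypothetical.

```python
BITS_PER_SIGN = 1    # one bit per sign-hash (from the text)
BITS_PER_REAL = 13   # bits per L2-hash value or stored norm (from the text)


def cat_storage_bits(num_sign_bits, num_norms):
    """Assumed mp-LSH-CAT storage: sign bits plus 13 bits per stored norm."""
    return BITS_PER_SIGN * num_sign_bits + BITS_PER_REAL * num_norms


def hashes_per_scheme(budget_bits):
    """Number of hashes each scheme can afford within the same bit budget."""
    third = budget_bits // 3  # mp-LSH-CC: a third of the budget per component
    return {
        "sign-LSH": budget_bits // BITS_PER_SIGN,
        "simple-LSH": budget_bits // BITS_PER_SIGN,
        "mp-LSH-VA": budget_bits // BITS_PER_SIGN,
        "L2-LSH": budget_bits // BITS_PER_REAL,
        "mp-LSH-CC": {
            "L2": third // BITS_PER_REAL,
            "sign": third // BITS_PER_SIGN,
            "simple": third // BITS_PER_SIGN,
        },
    }


if __name__ == "__main__":
    budget = cat_storage_bits(num_sign_bits=128, num_norms=1)
    print("budget (bits):", budget)
    for scheme, count in hashes_per_scheme(budget).items():
        print(scheme, count)
```

Under this accounting, an L2-hash is 13 times as expensive as a sign-hash, so a fair comparison gives the pure sign-based schemes many more hashes than L2-LSH for the same memory.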
| | Recall@k | | | Query time (msec) | | | | |
| L2+MIPS (.5,.5) | 0.29 | 0.50 | 0.60 | 3553.63 | 3118.93 | 3151.44 | 31351 | 4344 bytes |

| | Recall@k | | | Query time (msec) | | | | |
| L2+MIPS (.5,.5) | 0.29 | 0.52 | 0.62 | 476.62 | 505.63 | 515.77 | 41958 | 224 bytes |

| | Recall@k | | | Query time (msec) | | | | |
| L2+MIPS (.5,.5) | 0.04 | 0.07 | 0.08 | 811.72 | 839.91 | 847.35 | 4244 | 280 bytes |
Here, we conduct an empirical evaluation on several real-world data sets.
IV-A Collaborative Filtering
We first evaluate our methods on collaborative filtering data, the MovieLens10M (http://www.grouplens.org/) and the Netflix datasets. Following the experiment in [3, 5], we applied PureSVD to get -dimensional user and item vectors, where for MovieLens and for Netflix. We centered the samples so that , which affects neither the L2-NNS nor the MIPS solution.
Regarding the -dimensional vector as a single feature group (), we evaluated the performance in L2-NNS (), MIPS (), and their weighted sum (). The queries for L2-NNS were chosen randomly from the items, while the queries for MIPS were chosen from the users. For each query, we found its nearest neighbors in terms of the MP dissimilarity (7) by linear search, and used them as the ground truth. We set the hash bit length to , and rank the samples (items) based on the Hamming distance for the baseline methods and mp-LSH-VA. For mp-LSH-CC and mp-LSH-CAT, we rank the samples based on their code distances (11) and (15), respectively. After that, we drew the precision-recall curve, defined as and for different , where “relevant seen” is the number of the true nearest neighbors that are ranked within the top positions by the LSH methods. Figures 2 and 3 show the results on MovieLens10M for and and NetFlix for