
DB-LSH: Locality-Sensitive Hashing with Query-based Dynamic Bucketing

Among many solutions to the high-dimensional approximate nearest neighbor (ANN) search problem, locality sensitive hashing (LSH) is known for its sub-linear query time and robust theoretical guarantee on query accuracy. Traditional LSH methods can generate a small number of candidates quickly from hash tables but suffer from large index sizes and hash boundary problems. Recent studies to address these issues often incur extra overhead to identify eligible candidates or remove false positives, making query time no longer sub-linear. To address this dilemma, in this paper we propose a novel LSH scheme called DB-LSH which supports efficient ANN search for large high-dimensional datasets. It organizes the projected spaces with multi-dimensional indexes rather than using fixed-width hash buckets. Our approach can significantly reduce the space cost by avoiding the need to maintain many hash tables for different bucket sizes. During the query phase of DB-LSH, a small number of high-quality candidates can be generated efficiently by dynamically constructing query-based hypercubic buckets with the required widths through index-based window queries. For a dataset of n d-dimensional points with approximation ratio c, our rigorous theoretical analysis shows that DB-LSH achieves a smaller query cost O(n^{ρ*} d log n), where ρ* is bounded by 1/c^α, while the bound is 1/c in the existing work. An extensive range of experiments on real-world data demonstrates the superiority of DB-LSH over state-of-the-art methods in both efficiency and accuracy.


I. Introduction

The nearest neighbor (NN) search finds the closest point in a point dataset to a given query point. As points that are closer to each other can often be considered 'similar' in many applications when a proper distance measure is used, this search operation plays a vital role in a wide range of areas, such as pattern recognition [1], information retrieval [36], and data mining [13]. However, it is well known that finding the exact NN in large-scale high-dimensional datasets can be very time-consuming, so approximate nearest neighbor (ANN) searches are often conducted instead [18, 35]. The c-approximate nearest neighbor (c-ANN) search and the (r,c)-nearest neighbor ((r,c)-NN) search are two representative queries that trade result accuracy for query efficiency. Specifically, the c-ANN search aims to find a point o whose distance to the query point q is bounded by c times the distance from q to its exact NN, where c is a given approximation ratio (see Definition 1, Section III). The (r,c)-NN search can be considered a decision version of c-ANN search, which aims to determine whether there exists a point whose distance to q is at most r, where r is a given search range (see Definition 2, Section III).

Category       Algorithm        Bucketing   Query Mode
(K,L)-index    DB-LSH           Dynamic     Query-centric
(K,L)-index    E2LSH [4]        Static      Query-oblivious
(K,L)-index    LSB-Forest [35]  Static      Query-oblivious
C2             QALSH [14]       Dynamic     Query-centric
C2             VHP [27]         Dynamic     Query-centric
C2             R2LSH [26]       Dynamic     Query-centric
MQ             SRS [34]         Dynamic     Query-centric
MQ             PM-LSH [38]      Dynamic     Query-centric
TABLE I: Comparison of Typical LSH Methods

Locality-Sensitive Hashing (LSH) [9, 10, 35, 39, 11, 4] is one of the most popular tools for computing c-ANN in high-dimensional spaces. LSH maps data points into buckets using a set of hash functions, such that nearby points in the original space have a higher probability of being hashed into the same bucket than points that are far away. When a query arrives, the probability of finding its c-ANN is guaranteed to be sufficiently high by checking only the points in the bucket where the query point falls. To achieve this goal, the original LSH-based method (E2LSH) [4] designs a set of K independent hash functions with which all data points in the original d-dimensional space are mapped into a K-dimensional space, K ≪ d. These K-dimensional points are assigned into a range of buckets that are K-dimensional hypercubes. This process is repeated L times to generate L sets of K-dimensional hash buckets (we term this type of approach (K,L)-index). Intuitively, as K increases, the probability of two different points being hashed into the same bucket decreases. On the contrary, the collision probability, which is the probability of two different points being mapped into the same bucket at least once, increases as L increases, because two points are considered a 'collision' as long as they are mapped into the same bucket in at least one of the L repetitions. As shown in [11, 8], by choosing K = ⌈log_{1/p₂} n⌉ and L = ⌈n^ρ⌉, where ρ = ln p₁ / ln p₂ is a constant depending on p₁ and p₂ (for the meaning of p₁ and p₂, see Definition 3, Section III), E2LSH can solve the (r,c)-NN problem in sub-linear time with a constant success probability. Accordingly, E2LSH finds c-ANN in sub-linear time by answering a series of (r,c)-NN queries with increasing radii r = 1, c, c², .... However, to achieve good accuracy, E2LSH needs to prepare a (K,L)-index for each (r,c)-NN query, and L is typically large, which causes prohibitively large storage costs for the indexes. LSB [35] alleviates this issue by building a (K,L)-index for the smallest radius and repeatedly merging small hash buckets into larger ones, which effectively enlarges r. However, LSB only works for (r,c)-NN queries at certain discrete radii, which imposes the limitation that LSB cannot answer the c-ANN query with a small approximation ratio. C2LSH [9] proposes a new LSH scheme called collision counting (C2). By relaxing the collision condition from exactly K collisions to any l collisions, where l is a given threshold, C2LSH only needs to maintain K one-dimensional hash tables (instead of L K-dimensional hash tables). However, the query cost of C2 is no longer sub-linear [9], because it is expensive to count the number of collisions between a large number of data points and the query point dimension by dimension.

In addition to the dilemma between space and time, the above methods also suffer from the candidate quality issue (a.k.a. the hash boundary issue). That is, no matter how large the hash buckets are, some points close to a query point may still be partitioned into different buckets. Several dynamic bucketing techniques have been proposed to address this issue. The main idea of dynamic bucketing is to postpone the bucketing process to the query phase, in the hope of generating buckets such that nearby points are more likely to be in the same bucket as the query point. The C2 approach has been extended to dynamic scenarios by using B-trees to locate the points falling in a query-centric bucket in each dimension [14, 27, 26], at the cost of increased query time because of the large number of one-dimensional searches. [34, 38] explore a new dynamic metric query (MQ) based LSH scheme that maps data points from the high-dimensional space into a low-dimensional projected space via independent LSH functions and determines the c-ANN by exact nearest neighbor searches in the projected space. However, even in a low-dimensional space, finding the exact NN is still inherently computationally expensive. More importantly, at least βn exact distance computations need to be performed to avoid missing the correct c-ANN, which incurs a linear time complexity. Here β is an estimated ratio for the number of low-dimensional NN searches required so that the high-dimensional ANN results can be found safely [34, 38].

Table I compares typical LSH methods. Among the existing solutions to the c-ANN search problem, (K,L)-index based methods are the only ones that achieve a sub-linear query cost, i.e., O(n^ρ d log n), where ρ < 1 is proven to be bounded by 1/c; L = O(n^ρ) in E2LSH is the number of (K,L)-indexes prepared ahead of time [35]. Note that the value of ρ is bounded by 1/c only when the bucket size is very large [8], which in turn implies that a very large value of K is necessary to effectively differentiate points based on their distances. It remains a significant challenge to find a smaller and truly bounded ρ without using a very large bucket size.

Motivated by the aforementioned limitations, in this paper we propose a novel (K,L)-index approach with a query-centric dynamic bucketing strategy, called DB-LSH, to solve the high-dimensional c-ANN search problem. DB-LSH decouples the hashing and bucketing processes of the (K,L)-index, making it possible to answer (r,c)-NN queries for any r and c-ANN queries for any c with only one suit of indexes (i.e., without the need to perform LSH L times for each possible r). In this way the space cost is reduced significantly, and a reduction of the value of L becomes possible. DB-LSH builds dynamic query-centric buckets and conducts multi-dimensional window queries to eliminate the hash boundary issue when selecting candidates. Different from other query-centric methods, the regions of our buckets are still multi-dimensional cubes, as in static (K,L)-index methods, which enables DB-LSH not only to generate high-quality candidates but also to achieve a sub-linear query cost, as shown in Table I. Furthermore, DB-LSH achieves a much smaller exponent, denoted as ρ*, at a proper and finite bucket size; ρ* is bounded by 1/c^α for a constant α > 1 determined by the width of the initial hypercubic bucket. With a theoretical analysis and an extensive range of experiments, we show that DB-LSH outperforms the existing LSH methods significantly in both efficiency and accuracy. The main contributions of this paper include:

  • We propose a novel LSH framework, called DB-LSH, to solve the high-dimensional c-ANN search problem. It is the first work that combines the static (K,L)-index approach with a dynamic bucketing strategy in the query phase. By taking advantages from both sides, DB-LSH can reduce the index size and improve query efficiency simultaneously.

  • A rigorous theoretical analysis shows that DB-LSH achieves the lowest query time complexity so far for any approximation ratio c > 1. DB-LSH answers a c²-ANN query with a constant success probability in O(n^{ρ*} d log n) time, where ρ* is bounded by 1/c^α for a constant α > 1 determined by the initial bucket width, which is smaller than the bound 1/c in other (K,L)-index methods.

  • Extensive experiments on 10 real datasets with different sizes and dimensionality have been conducted to show that DB-LSH can achieve better efficiency and accuracy than the existing LSH methods.

The rest of the paper is organized as follows. Related work is reviewed in Section II. Section III introduces the basic concepts and formally defines the research problem. The construction and query algorithms of DB-LSH are presented in Section IV, followed by a theoretical analysis in Section V and an experimental study in Section VI. We conclude this paper in Section VII.

II. Related Work

LSH was originally proposed in [18, 11]. Due to its simple structure, sub-linear query cost, and rigorous quality guarantee, it has been a prominent approach for processing approximate nearest neighbor queries in high-dimensional spaces [11, 8, 6, 28]. We give a brief overview of the existing LSH methods in this section.

II-A Mainstream LSH Methods

(K,L)-index based methods. Although the basic LSH [11] was designed for the Hamming space, (K,L)-index methods extend it to provide a universal and well-adopted LSH framework for answering the c-ANN problem in other metric spaces. E2LSH [4] is a popular (K,L)-index method in the Euclidean space and adopts the 2-stable distribution-based function proposed in [8] as the LSH function. Its applications are limited by the hash boundary problem and undesirably large index sizes. These two shortcomings are shared by other (K,L)-index methods due to the fact that static buckets are used. To reduce index sizes, Tao et al. [35] consider answering (r,c)-NN queries at different radii via an elegant LSB-Tree framework, although it only works for c-ANN queries with a large approximation ratio. SK-LSH [25] is another approach based on the idea of the static (K,L)-index, but proposes a novel search framework to find more candidates. To address the limitations of static (K,L)-index methods, dynamic query strategies have been developed to find high-quality candidates using smaller indexes. These methods can be classified into two categories as follows.

Collision counting based methods (C2). The core idea of C2 is to generate candidates based on collision numbers. It was proposed in C2LSH [9], which uses the techniques of collision counting and virtual rehashing to reduce space consumption. QALSH [14] improves C2LSH by adopting query-aware buckets rather than static ones, which alleviates the hash boundary issue. R2LSH [26] improves the performance of QALSH by mapping data into multiple two-dimensional projected spaces rather than the one-dimensional projected spaces used in QALSH. VHP [27] views the buckets in QALSH as hyper-planes and introduces the concept of a virtual hyper-sphere to achieve smaller space complexity than QALSH. C2 can find high-quality candidates with a larger probability, but its cost of finding the candidates is expensive due to the unbounded search regions, which makes all points likely to be counted once in the worst case.

Dynamic metric query based methods (MQ). SRS [34] and PM-LSH [38] are representative dynamic MQ approaches that map data into a low-dimensional projected space and determine candidates based on their Euclidean distances via queries in the projected space. It has been proven that this strategy can accurately estimate the distance between two points in high-dimensional spaces [38]. However, answering metric queries in the projected space is still computationally expensive, and as many as βn candidates have to be checked to ensure a constant success probability, where β is the constant mentioned earlier. Therefore, MQ can incur a high query cost that is linear in n.

II-B Additional LSH Methods

There are other LSH methods that come from two categories: methods that design different hash functions and methods that adopt alternative query strategies. The former includes studies that aim to propose novel LSH functions in the Euclidean space with smaller ρ [3, 2, 5]. However, these functions are highly theoretical and difficult to use. The latter focuses on finding better query strategies to further reduce the query time or index size [6, 28, 20, 31, 32, 39, 24, 23]. LSH Forest [6] offers each point a variable-length hash value instead of the fixed-length hash value used in (K,L)-index methods. It can improve the quality guarantee of LSH for skewed data distributions while retaining the same space consumption and query cost. Multi-Probe LSH [28] examines multiple hash buckets in the order of a probing sequence derived from a hash table. It reduces the space requirement of E2LSH at the cost of the quality guarantee. Entropy-based LSH [31] and BayesLSH [32] adopt similar multi-probing strategies as Multi-Probe LSH, but have a more rigorous theoretical analysis. Their theoretical analysis relies on a strong assumption on the data distribution which can be hard to satisfy, leading to poor performance on some datasets. LazyLSH [39] supports c-ANN queries in multiple l_p-norm spaces with only one suit of indexes, thus effectively reducing the space consumption. I-LSH [24] and EI-LSH [23] design a set of adaptive early termination conditions so that the query process can stop early once a good enough result is found. Developed upon SK-LSH [25] and the Suffix Array [29], Lei et al. [20] propose a dynamic concatenating search framework, LCCS-LSH, that also achieves sub-linear query time and sub-quadratic space. Recently, researchers have adapted the LSH framework to solve other kinds of queries, such as maximum inner product search [33, 30, 16, 37] and point-to-hyperplane NN search [15] in high-dimensional spaces. These examples demonstrate the superior performance and great scalability of LSH.

III. Preliminaries

In this section, we present the definition of the ANN search problem, the concepts of LSH, and an important observation. Frequently used notations are summarized in Table II.

Notation     Description
R^d          The d-dimensional Euclidean space
D            The dataset
n            The cardinality of the dataset D
o            A data point
q            A query point
∥o₁, o₂∥     The Euclidean distance between points o₁ and o₂
φ(·)         The pdf of the standard normal distribution
h(·)         A hash function
TABLE II: List of Key Notations.

III-A Problem Definitions

Let R^d be the d-dimensional Euclidean space, and let ∥o₁, o₂∥ denote the Euclidean distance between points o₁ and o₂.

Definition 1 (c-ANN Search).

Given a dataset D ⊂ R^d, a query point q and an approximation ratio c > 1, the c-ANN search returns a point o ∈ D satisfying ∥o, q∥ ≤ c·∥o*, q∥, where o* is the exact nearest neighbor of q.

Remark 1.

(c,k)-ANN search is a natural generalization of c-ANN search. It returns k points o₁, o₂, ..., o_k, sorted in ascending order w.r.t. their distances to q, such that for 1 ≤ i ≤ k we have ∥oᵢ, q∥ ≤ c·∥oᵢ*, q∥, where oᵢ* is the i-th nearest neighbor of q.

(r,c)-nearest neighbor search is often used as a subroutine when finding the c-ANN. Following [35], it is defined formally as follows:

Definition 2 ((r,c)-NN Search).

Given a dataset D, a query point q, an approximation ratio c > 1 and a distance r, the (r,c)-NN search returns:

  • a point o satisfying ∥o, q∥ ≤ cr, if there exists a point o′ such that ∥o′, q∥ ≤ r;

  • nothing, if there is no point o such that ∥o, q∥ ≤ cr;

  • otherwise, the result is undefined.

The result of case 3 remains undefined since cases 1 and 2 suffice to ensure the correctness of a c-ANN query. By setting r = ∥o*, q∥, where o* is the nearest neighbor of q, a c-ANN can be found directly by answering an (r,c)-NN query. As ∥o*, q∥ is not known in advance, a c-ANN query is processed by conducting a series of (r,c)-NN queries with increasing radius, i.e., it begins by searching a region around q using a small r value. Without loss of generality, we assume the initial radius is r = 1. Then, it keeps enlarging the search radius in multiples of c, i.e., r = 1, c, c², ..., until a point is returned. In this way, as shown in [18, 11, 4], a c-ANN query can be answered with an approximation ratio of c².
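To see where the ratio c² comes from, the following short derivation (ours, following the standard argument in [18, 11]) tracks the radius sequence, with r* denoting the exact NN distance ∥o*, q∥:

Let r_j = c^j be the first radius in the sequence 1, c, c², ... with r_j ≥ r*, so that r_j ≤ c·r*. The (r_j, c)-NN query is then guaranteed to return some point o (case 1 of Definition 2 applies because o* lies within distance r_j), and

  ∥o, q∥ ≤ c·r_j ≤ c·(c·r*) = c²·∥o*, q∥,

hence the returned point is a c²-approximate nearest neighbor.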

Example 1.

Figure 1 shows an example where D has 12 data points. Suppose the approximation ratio is c = 2. Consider the first (r,c)-NN search with r = 1 (the yellow circle). Since there is no point o such that ∥o, q∥ ≤ cr = 2 (the red circle), it returns nothing. Then, consider the (r,c)-NN search with r = 2. Since there exists no point o such that ∥o, q∥ ≤ 2, but there is a point within distance cr = 4 (the blue circle), the returned result is undefined, i.e., it is correct to return either nothing or any such found point. Finally, consider the (r,c)-NN search with r = 4. Since there exists a point within distance 4, the query must return a point, which can be any point within distance cr = 8 (the green circle). The above procedure also illustrates the process of answering a c-ANN query: any of the points within the green circle can be returned, and they are all correct c²-ANN results of q.

Fig. 1: An illustration of (r,c)-NN and c-ANN

III-B Locality-Sensitive Hashing

Locality-sensitive hashing is the foundation of our method. For a hash function h(·), two points o₁ and o₂ are said to collide over h if h(o₁) = h(o₂), i.e., they are mapped into the same bucket by h. The formal definition of LSH is given below [11]:

Definition 3 (LSH).

Given a distance r and an approximation ratio c, a family H of hash functions is called (r, cr, p₁, p₂)-locality-sensitive if, for any o₁, o₂ ∈ R^d, it satisfies both conditions below:

  • If ∥o₁, o₂∥ ≤ r, then Pr[h(o₁) = h(o₂)] ≥ p₁;

  • If ∥o₁, o₂∥ > cr, then Pr[h(o₁) = h(o₂)] ≤ p₂,

where h is chosen at random from H, p₁ and p₂ are collision probabilities, and p₁ > p₂.

A typical LSH family for the Euclidean space in static LSH methods (e.g., E2LSH) is defined as follows [8]:

(1)  h(o) = ⌊(a⃗ · o⃗ + b) / w⌋

where o⃗ is the vector representation of a point o ∈ D, a⃗ is a d-dimensional vector whose entries are chosen independently from a 2-stable distribution, i.e., the standard normal distribution, b is a real number chosen uniformly from [0, w), and w is a pre-defined constant (the bucket width). Denote the distance between two points o₁ and o₂ as s = ∥o₁, o₂∥; the collision probability under such a hash function can then be computed as:

(2)  p(s) = Pr[h(o₁) = h(o₂)] = ∫₀^w (1/s)·2φ(t/s)·(1 − t/w) dt = 1 − 2Φ(−w/s) − (2s/(√(2π)·w))·(1 − e^{−w²/(2s²)})

where φ(·) is the probability density function (pdf) and Φ(·) the cumulative distribution function of the standard normal distribution. For a given w, it is easy to see that p(s) decreases monotonically with s. Therefore, the hash family defined by Equation 1 is (r, cr, p(r), p(cr))-locality-sensitive, with p₁ = p(r) and p₂ = p(cr).
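To make Equations 1 and 2 concrete, here is a small C++ sketch (ours, not the authors' released code; the width w = 4, the dimensionality and the distances are illustrative) that draws one 2-stable hash function and evaluates the closed form of p(s):

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

double normal_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Collision probability of h(o) = floor((a.o + b)/w) for two points at
// Euclidean distance s (Eq. 2, closed form from Datar et al. [8]).
double collision_prob(double s, double w) {
  double r = w / s;
  return 1.0 - 2.0 * normal_cdf(-r)
         - (2.0 / (std::sqrt(2.0 * M_PI) * r)) * (1.0 - std::exp(-r * r / 2.0));
}

struct E2LSHFunction {
  std::vector<double> a;  // entries drawn i.i.d. from N(0,1) (2-stable)
  double b, w;            // offset b ~ Uniform[0, w), bucket width w
  E2LSHFunction(int d, double w_, std::mt19937& g) : a(d), b(0), w(w_) {
    std::normal_distribution<double> n01(0.0, 1.0);
    for (auto& ai : a) ai = n01(g);
    b = std::uniform_real_distribution<double>(0.0, w)(g);
  }
  long bucket(const std::vector<double>& o) const {
    double dot = b;
    for (size_t j = 0; j < a.size(); ++j) dot += a[j] * o[j];
    return static_cast<long>(std::floor(dot / w));
  }
};

int main() {
  std::mt19937 gen(42);
  std::vector<double> o1(16, 0.0), o2(16, 0.0);
  o2[0] = 2.0;  // distance s = 2 between o1 and o2
  double w = 4.0;
  E2LSHFunction h(16, w, gen);
  std::printf("same bucket: %d, p(2) = %.4f\n",
              h.bucket(o1) == h.bucket(o2), collision_prob(2.0, w));
  for (double s : {0.5, 1.0, 2.0, 4.0})  // p(s) decreases with s
    std::printf("p(%.1f) = %.4f\n", s, collision_prob(s, w));
}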

III-C Locality-Sensitive Hashing with Dynamic Bucketing

A typical dynamic LSH family for the Euclidean space is defined as follows [14]:

(3)  h(o) = a⃗ · o⃗

where a⃗ is the same as in Equation 1. For a hash function h(·) and a bucket width w, two points o₁ and o₂ are said to collide over h if |h(o₁) − h(o₂)| ≤ w/2. In this sense, the collision probability at distance s = ∥o₁, o₂∥ can be computed as:

(4)  p(s) = Pr[|h(o₁) − h(o₂)| ≤ w/2] = ∫_{−w/(2s)}^{w/(2s)} φ(t) dt = 2Φ(w/(2s)) − 1

It is easy to see that the hash family defined by Equation 3 is (r, cr, p(r), p(cr))-locality-sensitive, with p₁ = p(r) and p₂ = p(cr). In what follows, H refers to the LSH family defined by Equation 3 and p(·) refers to the corresponding collision probability in Equation 4 unless otherwise stated. Next, we introduce a simple but important observation that inspires us to design a dynamic (K,L)-index.

Observation 1.

The hash family H is (r, cr, p(1), p(c))-locality-sensitive for any search radius r when the bucket width is set to w·r, where w is a positive constant and p(·) is evaluated at width w.

Proof.

It is easy to see that for any search radius r and any distance s, the following equation holds:

(5)  Pr[|h(o₁) − h(o₂)| ≤ wr/2] = 2Φ(wr/(2s)) − 1

which depends only on the ratio s/r. In particular, at s = r the collision probability is 2Φ(w/2) − 1 = p(1), and at s = cr it is 2Φ(w/(2c)) − 1 = p(c). That is, H with bucket width w·r is (r, cr, p(1), p(c))-locality-sensitive. ∎

By the above observation, we do not need to physically maintain multiple (K,L)-indexes built from (r, cr, p₁, p₂)-locality-sensitive hash families in advance to support the corresponding (r,c)-NN queries with different r. Instead, we can dynamically construct buckets with the width required by each query on top of only one (K,L)-index, where K = ⌈log_{1/p₂}(n/t)⌉, L = ⌈(n/t)^{ρ*}⌉ with ρ* = ln p₁ / ln p₂, and t is a constant that balances the query efficiency and space consumption (see Remark 2, Section V). As explained in Section V, this choice of K and L guarantees the correctness of DB-LSH for (r,c)-NN search and c-ANN search. This is the key observation that leads to our novel approach, presented next.
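Observation 1 is easy to verify numerically. The following sketch (ours; the values of w, c and the radii are illustrative) evaluates the Equation 4 probability with the bucket width scaled by r and shows that p₁ and p₂ stay constant:

#include <cmath>
#include <cstdio>

double normal_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Collision probability of the query-centric family (Eq. 4) at distance s
// with bucket width w: p_w(s) = 2*Phi(w/(2s)) - 1.
double p(double s, double w) { return 2.0 * normal_cdf(w / (2.0 * s)) - 1.0; }

int main() {
  double w = 4.0, c = 2.0;
  for (double r : {1.0, 3.0, 10.0}) {
    // With bucket width w*r, the collision probability at distance r (resp. c*r)
    // equals p(1) (resp. p(c)) at width w, independently of the radius r.
    std::printf("r=%5.1f  p1=%.6f  p2=%.6f\n", r, p(r, w * r), p(c * r, w * r));
  }
}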

IV. Our Method

DB-LSH consists of an indexing phase for mapping and a query phase for dynamic bucketing. We first give an overview of this novel approach, followed by detailed descriptions of the two separate phases.

IV-A Overview of DB-LSH

Considering the limitations of C2 and MQ discussed earlier, we propose to keep the basic idea of the static (K,L)-index, which provides an opportunity to answer c-ANN queries at a sub-linear query cost. To remove the inherent obstacles in static (K,L)-index methods, DB-LSH develops a dynamic bucketing strategy that constructs query-centric hypercubic buckets with the required width in the query phase. In the indexing phase, DB-LSH projects each data point into L K-dimensional spaces via independent LSH functions. Unlike static (K,L)-index methods that quantize the projected points with a fixed bucket size, we index the points in each K-dimensional space with a multi-dimensional index. In the query phase, an (r,c)-NN query with a sufficiently small radius, say r = 1, is issued at the beginning. To answer this query, query-centric hypercubic buckets with width w·r are constructed, and the points in them are found by window queries. If a retrieved point is within distance cr of the query q, DB-LSH returns it as a correct c-ANN result. Otherwise, the next (r,c)-NN query with radius cr is issued, and the width of the dynamic hypercubic bucket is enlarged from wr to wcr accordingly. By gradually extending the search radius r and bucket width wr, DB-LSH finds a c-ANN with a constant success probability on top of just one (K,L)-index after accessing a bounded number of points.

Fig. 2: Search regions of DB-LSH and other LSH methods

Figure 2 gives an intuitive explanation of the advantage of DB-LSH in terms of the search region. The dotted purple square is the search region in E2LSH. Notice that points close to the query q might be hashed to a different bucket, especially when q is near the bucket boundary, which jeopardizes the accuracy. The gray cross-like region is the search region of C2. Such an unbounded region is much bigger than that of DB-LSH (the red square), which can make the number of points accessed arbitrarily large in the worst case and thus incurs a large query cost. The dotted blue circle is the search region of MQ. Although it is a bounded region, finding the points in it is more complex than in the other regions. DB-LSH still uses hypercubic buckets (search regions) as in static (K,L)-index methods, but achieves much better accuracy: the query-centric bucketing strategy eliminates the hash boundary issue, and the overhead of dynamic bucketing is affordable because of efficient window queries via multi-dimensional indexes. In summary, DB-LSH can be expected to reach a given accuracy with the least query cost among all these methods. In what follows, we present everything that a practitioner needs to know to apply DB-LSH.

IV-B Indexing Phase

The indexing phase consists of two steps: constructing projected spaces and indexing points with multi-dimensional indexes.

Constructing projected spaces. Given an (r, cr, p(1), p(c))-locality-sensitive hash family H, let G be the set of all compound hash functions formed by K hash functions chosen independently from H, i.e., each element G ∈ G is a K-dimensional compound hash of the form:

(6)  G(o) = (h₁(o), h₂(o), ..., h_K(o))

where hⱼ ∈ H for 1 ≤ j ≤ K. Then, we sample L instances independently from G, denoted as G₁, G₂, ..., G_L, and compute L projections of each data object o ∈ D as follows:

(7)  Gᵢ(o) = (h_{i,1}(o), h_{i,2}(o), ..., h_{i,K}(o)),  1 ≤ i ≤ L

Indexing points by multi-dimensional indexes. In each K-dimensional projected space, we index the points with a multi-dimensional index. The only requirement on the index is that it can efficiently answer a window query in the low-dimensional space. In this paper, we simply choose the R-Tree [17] as our index, owing to the wealth of optimizations and toolboxes that enable the R-Tree to perform robustly in practice. The CR-Tree [19], X-tree [7] or multi-dimensional learned indexes [21] can certainly be used to potentially further improve our approach.
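The projection step of Equations 6-7 is straightforward to implement. The following C++ sketch (ours; the values of K, L and d are illustrative) computes the compound hashes G₁, ..., G_L; in the real system, the n projections under each Gᵢ would then be bulk-loaded into an R-Tree or a comparable window-queryable index:

#include <random>
#include <vector>

using Vec = std::vector<double>;

// A compound hash G maps a d-dimensional point to a K-dimensional projection
// (Eq. 6): G(o) = (h_1(o), ..., h_K(o)) with h_j(o) = a_j . o (Eq. 3).
struct CompoundHash {
  std::vector<Vec> a;  // K direction vectors with i.i.d. N(0,1) entries
  CompoundHash(int K, int d, std::mt19937& gen) : a(K, Vec(d)) {
    std::normal_distribution<double> n01;
    for (auto& ak : a) for (auto& v : ak) v = n01(gen);
  }
  Vec operator()(const Vec& o) const {
    Vec g(a.size(), 0.0);
    for (size_t k = 0; k < a.size(); ++k)
      for (size_t j = 0; j < o.size(); ++j) g[k] += a[k][j] * o[j];
    return g;
  }
};

int main() {
  std::mt19937 gen(42);
  const int K = 8, L = 4, d = 128;
  std::vector<CompoundHash> G;        // L independent compound hashes (Eq. 7)
  for (int i = 0; i < L; ++i) G.emplace_back(K, d, gen);
  Vec o(d, 1.0);
  Vec g = G[0](o);                    // image of o in the first projected space
  return g.size() == (size_t)K ? 0 : 1;
}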

IV-C Query Phase

DB-LSH can directly answer an (r,c)-NN query with any search radius r by exploiting the (K,L)-index built in the indexing phase, as described in Section IV-B. Algorithm 1 outlines the query processing. To find the (r,c)-NN of a query q, we consider the L K-dimensional projected spaces in order. For each space, we first compute the hash values of q, i.e., Gᵢ(q) (Line 3). Then, a window query, denoted as WQᵢ(q, wr), is conducted using the R-Tree. To be more specific, WQᵢ(q, wr) returns the points falling in the following hypercubic region:

(8)  [h_{i,1}(q) − wr/2, h_{i,1}(q) + wr/2] × ... × [h_{i,K}(q) − wr/2, h_{i,K}(q) + wr/2]

Without confusion, we also use WQᵢ(q, wr) to denote the region above. For each point falling in such a region, we compute its distance to q. If the distance is at most cr, or we have verified 2tL + 1 points, the algorithm reports the current best point and stops. Otherwise, the algorithm returns nothing. According to Lemma 2, to be introduced in Section V, DB-LSH correctly answers an (r,c)-NN query with a constant success probability.

Input: q: a query point; r: query radius; c: the approximation ratio; t: a positive constant
Output: A point o or nothing
1: cnt ← 0; o_min ← nothing
2: for i ← 1 to L do
3:   compute Gᵢ(q) and issue the window query WQᵢ(q, wr);
4:   while a point o ∈ WQᵢ(q, wr) is found do
5:     cnt ← cnt + 1; o_min ← the nearer of o_min and o to q;
6:     if ∥o_min, q∥ ≤ cr or cnt ≥ 2tL + 1 then return o_min;
7:   end while
8: end for
9: return nothing
Algorithm 1: (r,c)-NN Query

Input: q: a query point; c: the approximation ratio
Output: A point o
1: r ← 1;
2: while TRUE do
3:   o ← (r,c)-NN(q, r, c, t);  (Algorithm 1)
4:   if o ≠ nothing then return o;
5:   else r ← c·r;
6: end while
Algorithm 2: c-ANN Query
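For illustration, the following self-contained C++ sketch mirrors Algorithms 1 and 2 end to end (ours; the R-Tree window query WQᵢ(q, wr) is replaced by a linear scan over the projected points, and K, L, w, c, t are small illustrative values, so this shows the logic rather than the performance):

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& x, const Point& y) {
  double s = 0;
  for (size_t j = 0; j < x.size(); ++j) s += (x[j] - y[j]) * (x[j] - y[j]);
  return std::sqrt(s);
}

struct DBLSH {
  int K = 2, L = 3;                      // hashes per space / number of spaces
  double w = 4.0, c = 2.0, t = 1.0;      // width constant, ratio, budget constant
  std::vector<Point> data;
  std::vector<std::vector<Point>> dirs;  // dirs[i][k]: Gaussian vector of h_{i,k}
  std::vector<std::vector<Point>> proj;  // proj[i][p] = G_i(o_p)

  explicit DBLSH(std::vector<Point> pts) : data(std::move(pts)) {
    std::mt19937 gen(7);
    std::normal_distribution<double> n01;
    int d = (int)data[0].size();
    dirs.assign(L, {}); proj.assign(L, {});
    for (int i = 0; i < L; ++i) {
      for (int k = 0; k < K; ++k) {
        Point a(d);
        for (auto& v : a) v = n01(gen);
        dirs[i].push_back(std::move(a));
      }
      for (const auto& o : data) proj[i].push_back(project(i, o));
    }
  }
  Point project(int i, const Point& o) const {
    Point g(K, 0.0);
    for (int k = 0; k < K; ++k)
      for (size_t j = 0; j < o.size(); ++j) g[k] += dirs[i][k][j] * o[j];
    return g;
  }
  // Algorithm 1: (r,c)-NN; returns the index of a point, or -1 for "nothing".
  int rcNN(const Point& q, double r) const {
    int budget = (int)(2 * t * L) + 1, cnt = 0, best = -1;
    for (int i = 0; i < L; ++i) {
      Point gq = project(i, q);
      for (size_t p = 0; p < data.size(); ++p) {
        bool inside = true;  // query-centric hypercubic bucket of width w*r (Eq. 8)
        for (int k = 0; k < K; ++k)
          if (std::fabs(proj[i][p][k] - gq[k]) > w * r / 2) { inside = false; break; }
        if (!inside) continue;
        ++cnt;
        if (best < 0 || dist(data[p], q) < dist(data[best], q)) best = (int)p;
        if (dist(data[best], q) <= c * r || cnt >= budget) return best;
      }
    }
    return -1;
  }
  // Algorithm 2: c-ANN by enlarging r in multiples of c.
  int cANN(const Point& q) const {
    for (double r = 1.0;; r *= c) {
      int o = rcNN(q, r);
      if (o >= 0) return o;
    }
  }
};

int main() {
  std::mt19937 gen(1);
  std::uniform_real_distribution<double> u(0.0, 50.0);
  std::vector<Point> pts(200, Point(8));
  for (auto& p : pts) for (auto& v : p) v = u(gen);
  DBLSH index(pts);
  Point q(8, 25.0);
  int o = index.cANN(q);
  std::printf("returned point %d at distance %.3f\n", o, dist(pts[o], q));
}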

c-ANN. A c-ANN query can be answered by conducting a series of (r,c)-NN queries with increasing r. Algorithm 2 shows the details of finding a c-ANN. Given a query q and an approximation ratio c, the algorithm starts with the (1,c)-NN query. After that, if we have found a satisfying object or have accessed enough points, i.e., 2tL + 1 in total (Lines 3-4 of Algorithm 2), the algorithm reports the current point and terminates immediately. Otherwise, it enlarges the query radius by a factor of c and invokes the (r,c)-NN query (Algorithm 1) again until the termination conditions are satisfied. According to Theorem 1, to be introduced in Section V, DB-LSH correctly answers a c-ANN query with a constant success probability.

Example 2.

Figure 3 gives an example of answering a c-ANN query by DB-LSH, where we choose K = 2 and L = 1 for simplicity. Figures 3(a) and 3(b) show the points in the original and projected space, respectively. First of all, we issue a (1,c)-NN query in the original space (the yellow circle in Figure 3(a)). To answer this query, we conduct the window query WQ(q, w) in the projected space (the yellow square in Figure 3(b)). Since no point is found, an (r,c)-NN query with the larger radius r = c (the red circle in Figure 3(a)) is issued, and the window query WQ(q, wc) (the red square in Figure 3(b)) is performed accordingly. A point is then found as a candidate, and we verify it by computing its original distance to q. Since this distance is at most cr (the blue circle in Figure 3(a)), the point is returned as the result.

(a) Original Space
(b) Projected Space
Fig. 3: An example of c-ANN search using DB-LSH

(c,k)-ANN. Algorithm 2 can be easily adapted to answer (c,k)-ANN queries. Specifically, it suffices to modify the two termination conditions to the following:

  • At a certain (r,c)-NN query, the total number of objects accessed so far reaches 2tL + k (corresponding to the count condition in Line 6 of Algorithm 1).

  • At a certain (r,c)-NN query, the k-th nearest neighbor found so far is within distance cr of q (corresponding to the distance condition in Line 6 of Algorithm 1).

DB-LSH terminates if and only if one of these situations happens. Also, Line 6 in Algorithm 1 (and hence Line 4 in Algorithm 2) should return the k nearest neighbors found so far instead of a single point.

V. Theoretical Analysis

It is essential to provide a theoretical analysis of DB-LSH. First, we discuss the quality guarantees of DB-LSH. Then, we prove that DB-LSH achieves lower query time and space complexities, with an emphasis on deriving a smaller exponent ρ*.

V-A Quality Guarantees

We demonstrate that DB-LSH is able to correctly answer a c-ANN query. Before proving it, we first define two events as follows:

  • E1: If there exists a point o satisfying ∥o, q∥ ≤ r, then Gᵢ(o) ∈ WQᵢ(q, wr) for some 1 ≤ i ≤ L;

  • E2: The number of points satisfying the two conditions below is no more than 2tL: 1) ∥o, q∥ > cr; and 2) Gᵢ(o) ∈ WQᵢ(q, wr) for some 1 ≤ i ≤ L.

Lemma 1.

For given r and c, by setting K = ⌈log_{1/p₂}(n/t)⌉ and L = ⌈(n/t)^{ρ*}⌉, where ρ* = ln p₁ / ln p₂, p₁ = p(1) and p₂ = p(c), the probability that E1 occurs is at least 1 − 1/e and the probability that E2 occurs is at least 1/2.

Proof.

If there exists a point o satisfying ∥o, q∥ ≤ r, then the LSH property implies that for any h ∈ H with bucket width wr, Pr[|h(o) − h(q)| ≤ wr/2] ≥ p₁. Then, the probability that Gᵢ(o) ∈ WQᵢ(q, wr) is at least p₁^K = (n/t)^{−ρ*}, and thus the probability that E1 does not occur will not exceed (1 − (n/t)^{−ρ*})^L ≤ 1/e. Therefore, Pr[E1] ≥ 1 − 1/e when K and L are set as above. Likewise, if a point o satisfies ∥o, q∥ > cr, we have Pr[|h(o) − h(q)| ≤ wr/2] ≤ p₂. Then, the probability that Gᵢ(o) ∈ WQᵢ(q, wr) is at most p₂^K = t/n, and thus the expected number of such points in a certain projected space does not exceed t. Therefore, the expected number of such points over all L projected spaces is upper bounded by tL. By Markov's inequality, we have Pr[E2] ≥ 1 − tL/(2tL) = 1/2. ∎

It is easy to see that the probability that E1 and E2 hold at the same time is at least 1 − (1/e + 1/2) = 1/2 − 1/e, a constant. Next, we demonstrate that when E1 and E2 hold at the same time, Algorithm 1 correctly answers an (r,c)-NN query.
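The parameter setting of Lemma 1 is easy to compute numerically. A small sketch (ours; the values of w, c, n and t are illustrative):

#include <cmath>
#include <cstdio>

double normal_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }
double p(double s, double w) { return 2.0 * normal_cdf(w / (2.0 * s)) - 1.0; }

int main() {
  double w = 4.0, c = 2.0, n = 1e6, t = 1.0;
  double p1 = p(1.0, w), p2 = p(c, w);
  double rho = std::log(p1) / std::log(p2);            // rho* = ln p1 / ln p2
  int K = (int)std::ceil(std::log(n / t) / std::log(1.0 / p2));
  int L = (int)std::ceil(std::pow(n / t, rho));
  // Pr[E1] >= 1 - (1 - p1^K)^L >= 1 - 1/e; Pr[E2] >= 1/2 by Markov's inequality,
  // so the overall success probability is at least 1/2 - 1/e.
  double pE1 = 1.0 - std::pow(1.0 - std::pow(p1, K), L);
  std::printf("rho*=%.3f K=%d L=%d  Pr[E1]>=%.3f  success >= %.3f\n",
              rho, K, L, pE1, 0.5 - 1.0 / M_E);
}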

Lemma 2.

Algorithm 1 answers the (r,c)-NN query with at least a constant probability of 1/2 − 1/e.

Proof.

Assume that E1 and E2 hold at the same time, which occurs with probability at least 1/2 − 1/e. In this case, if Algorithm 1 terminates after accessing 2tL + 1 points, then at least one accessed point must satisfy ∥o, q∥ ≤ cr due to E2, so the returned point, being the nearest one accessed, is a correct result. If Algorithm 1 terminates because a point satisfying ∥o, q∥ ≤ cr is found, this point is obviously a correct (r,c)-NN. If all window queries finish without triggering either condition, then E1 implies that no point satisfies ∥o, q∥ ≤ r (otherwise such a point would have been retrieved and returned, since its distance is at most r ≤ cr). According to the definition of (r,c)-NN search, it is then correct to return nothing. Therefore, when E1 and E2 hold at the same time, an (r,c)-NN query is always correctly answered when Algorithm 1 terminates. That is, Algorithm 1 answers the (r,c)-NN query with at least a constant probability of 1/2 − 1/e. ∎

Theorem 1.

Algorithm 2 returns a c²-ANN with at least a constant probability of 1/2 − 1/e.

Proof.

We show that when E1 and E2 hold at the same time, Algorithm 2 returns a correct c²-ANN result. Let o* be the exact NN of the query point q in D and r* = ∥o*, q∥. Without loss of generality, we assume r* ≥ 1. Then there must exist an integer j ≥ 0 such that c^j ≤ r* < c^{j+1}. Due to E1, the (r,c)-NN query at radius r = c^{j+1} > r* must return a point, so the radius at the termination of Algorithm 2 is at most c^{j+1} ≤ c·r*. In this case, according to Lemma 2, the returned point o satisfies ∥o, q∥ ≤ cr ≤ c²·r*, and is thus a correct c²-ANN result. Clearly, if Algorithm 1 already stops at a smaller radius under either condition, the returned point also satisfies ∥o, q∥ ≤ cr ≤ c²·r*. Therefore, Algorithm 2 returns a c²-ANN with at least a constant probability of 1/2 − 1/e. ∎

Remark 2.

Unlike the classic (K,L)-index methods, where K and L are set as ⌈log_{1/p₂} n⌉ and ⌈n^ρ⌉, we introduce a constant t to lessen K and L. In this manner, the total space consumption is greatly reduced. The overhead of this strategy is the need to examine at most 2tL + 1 candidates instead of 2L + 1, which seems to cause a higher query cost. However, in fact, none of the efficient LSH methods really build n^ρ hash indexes and check only a couple of candidates in each index; usually, far fewer than n^ρ hash indexes are already able to return a sufficiently accurate c-ANN. Therefore, by introducing t, we tend to obtain more candidates from each index. This kind of parameter setting is more reasonable and practical.

V-B The Bound of ρ*

As proven in [8], ρ is strictly bounded by 1/c only when the bucket width w is large enough. Such a large bucket size cannot be used in practice, since it implies that a very large value of K is needed to effectively differentiate points based on their distances. In contrast, we find that ρ* has a bound smaller than 1/c that can be attained even when the bucket width is not too large. For better understanding and to simplify the proof, we prove the bound of ρ* in a special case where w is chosen as a suitable function of c.

Lemma 3.

By setting w ≥ w₀, where w₀ is a threshold determined by c, ρ* can be bounded by 1/c^α for a constant α > 1, where φ(·) is the pdf of the standard normal distribution.

Proof.

Recall from Equation 4 that p₁ = p(1) = 2Φ(w/2) − 1 and p₂ = p(c) = 2Φ(w/(2c)) − 1, so we have

(9)  ρ* = ln p₁ / ln p₂ = ln(2Φ(w/2) − 1) / ln(2Φ(w/(2c)) − 1)

Given a c, we prove that ρ* ≤ 1/c^α holds for any w ≥ w₀, which is equivalent to proving the following inequality:

(10)  c^α · ln(2Φ(w/2) − 1) ≥ ln(2Φ(w/(2c)) − 1)

Define the function F(w) = c^α · ln(2Φ(w/2) − 1) − ln(2Φ(w/(2c)) − 1), whose derivative is F′(w) = c^α · φ(w/2)/(2Φ(w/2) − 1) − (1/c) · φ(w/(2c))/(2Φ(w/(2c)) − 1). Since φ(w/2) decays much faster than φ(w/(2c)) as w grows, there exists a threshold w₀ such that F′(w) ≤ 0 for all w ≥ w₀, i.e., F decreases monotonically with w beyond w₀. Moreover, F(w) approaches 0 as w approaches infinity. Therefore F(w) ≥ 0 for all w ≥ w₀, which establishes Inequality 10, and thus ρ* is always bounded by 1/c^α when w ≥ w₀. ∎

ρ* ≤ 1/c^α holds whenever w ≥ w₀, which provides a bound smaller than 1/c. The achievable α increases with w, so the query cost can be made very small when w is large enough. However, a large bucket size implies a very large K in order to reduce the number of false positives, so w should typically be set to a similar range as in other (K,L)-index methods; with a bucket size comparable to that of LSB [35], Lemma 3 still yields the bound 1/c^α, compared to the bound of 1/c in [35]. Note that 1/c is just an asymptotic bound of ρ that is approachable only with a very large bucket size: as Figure 4(a) shows, at moderate w the actual ρ can even exceed 1/c, i.e., it is not bounded by 1/c, while ρ* is always bounded by 1/c^α and smaller than ρ. Besides, it is not necessary to set w extremely large, since that would imply a very large value of K and make (K,L)-index based methods impractical. Figure 4(b) gives a clear comparison of the decided advantage of ρ* over ρ at a reasonable bucket width: ρ stays very close to 1/c, while ρ* is much smaller and decreases rapidly with c.
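The gap between ρ* and 1/c at practical widths can also be checked numerically; a small sketch (ours, using the Equation 4 probabilities; the grids of c and w are illustrative):

#include <cmath>
#include <cstdio>

double normal_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }
double p(double s, double w) { return 2.0 * normal_cdf(w / (2.0 * s)) - 1.0; }

int main() {
  // rho* = ln p(1) / ln p(c) drops well below 1/c already at moderate w.
  for (double c : {1.5, 2.0, 3.0}) {
    for (double w : {2.0, 4.0, 8.0, 16.0}) {
      double rho = std::log(p(1.0, w)) / std::log(p(c, w));
      std::printf("c=%.1f  w=%5.1f  rho*=%.4f  1/c=%.4f\n", c, w, rho, 1.0 / c);
    }
  }
}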

Fig. 4: ρ v.s. ρ*

V-C Complexity Analysis

Similar to other (K,L)-index based methods, whose time and space complexities are affected by ρ, the complexities of DB-LSH are affected by ρ*.

Theorem 2.

DB-LSH answers a c²-ANN query in O(n^{ρ*} d log n) time with O(n^{1+ρ*} log n) index size, where ρ* is bounded by 1/c^α and smaller than the ρ of static (K,L)-index methods.

Proof.

It is obvious that K = O(log n) and L = O(n^{ρ*}). Therefore, the index size is O(nKL) = O(n^{1+ρ*} log n). In DB-LSH, we first need to compute the KL hash values of the query point, with a computational cost of O(dKL) = O(n^{ρ*} d log n). When finding candidates, it takes O(log n) time to locate a candidate using the R-Trees. Since we retrieve at most 2tL + 1 candidate points, the cost of generating candidates is O(n^{ρ*} log n). In the verification phase, each candidate point spends O(d) time on the distance computation, so the total verification cost is O(n^{ρ*} d). Therefore, the query time of DB-LSH is bounded by O(n^{ρ*} d log n). ∎

VI. Experimental Study

We implement DB-LSH (source code: https://github.com/Jacyhust/DB-LSH) and the competitors in C++, compiled with g++ using O3 optimization and running in a single thread. All experiments are conducted on a server running 64-bit Ubuntu 20.04 with 2 Intel(R) Xeon(R) Gold 5218 CPUs @ 2.30GHz and 254 GB RAM.

VI-A Experimental Settings

Datasets and Queries. We employ 10 real-world datasets varying in cardinality, dimensionality and type, which are widely used in existing LSH work [26, 20, 21, 27, 38]. For the sake of fairness, we make sure that each dataset is used by at least one of our competitors. Table III summarizes the statistics of the datasets. Note that both SIFT10M and SIFT100M consist of points randomly chosen from the SIFT1B dataset (http://corpus-texmex.irisa.fr/). For queries, we randomly select 100 points as queries and remove them from the datasets.

Datasets Cardinality Dim. Types
Audio 54,387 192 Audio
MNIST 60,000 784 Image
Cifar 60,000 1024 Image
Trevi 101,120 4096 Image
NUS 269,648 500 SIFT Description
Deep1M 1,000,000 256 DEEP Description
Gist 1,000,000 960 GIST Description
SIFT10M 10,000,000 128 SIFT Description
TinyImages80M 79,302,017 384 GIST Description
SIFT100M 100,000,000 128 SIFT Description
TABLE III: Summary of Datasets

Competitors. We compare DB-LSH with 5 LSH methods introduced in Section II, i.e., LCCS-LSH [20], PM-LSH [38], VHP [27], R2LSH [26] and LSB-Forest [35]. LCCS-LSH adopts a query-oblivious LSH indexing strategy with a novel search framework. PM-LSH is a typical dynamic MQ method that adopts the PM-Tree to index the projected data. R2LSH and VHP are representative C2 methods that improve QALSH from the perspective of search regions. LSB-Forest is a static (K,L)-index method that can answer c-ANN queries with only one suit of indexes. In addition, to study the effectiveness of the query-centric dynamic bucketing strategy in DB-LSH, we design a static (K,L)-index method called Fixed Bucketing-LSH (FB-LSH) by replacing the dynamic bucketing part of DB-LSH with fixed bucketing. Note that FB-LSH is not equivalent to E2LSH, since only one suit of (K,L)-index is used.

Algorithms:          DB-LSH | FB-LSH | LCCS-LSH | PM-LSH | R2LSH | VHP | LSB-Forest

Audio
  Query Time (ms):   4.962 | 5.434 | 5.797 | 5.459 | 8.748 | 11.32 | 18.52
  Overall Ratio:     1.003 | 1.008 | 1.006 | 1.003 | 1.005 | 1.006 | 1.005
  Recall:            0.9268 | 0.8512 | 0.8204 | 0.9212 | 0.868 | 0.8580 | 0.4676
  Indexing Time (s): 0.099 | 0.164 | 2.126 | 0.166 | 2.764 | 1.626 | 19.55

MNIST
  Query Time (ms):   7.684 | 9.304 | 19.89 | 13.87 | 12.95 | 15.37 | 37.35
  Overall Ratio:     1.005 | 1.018 | 1.007 | 1.005 | 1.005 | 1.008 | 1.010
  Recall:            0.9130 | 0.7580 | 0.8038 | 0.9098 | 0.8756 | 0.8426 | 0.3734
  Indexing Time (s): 0.149 | 0.192 | 1.942 | 0.189 | 6.231 | 5.457 | 92.26

Cifar
  Query Time (ms):   12.54 | 16.37 | 17.66 | 17.53 | 21.81 | 19.31 | 59.66
  Overall Ratio:     1.002 | 1.006 | 1.006 | 1.004 | 1.003 | 1.014 | 1.010
  Recall:            0.9156 | 0.8018 | 0.7150 | 0.8742 | 0.8784 | 0.6322 | 0.1496
  Indexing Time (s): 0.149 | 0.209 | 1.941 | 0.199 | 8.261 | 6.844 | 146.27

Trevi
  Query Time (ms):   48.20 | 61.74 | 113.7 | 52.23 | 53.10 | 176.47 | 271.56
  Overall Ratio:     1.001 | 1.010 | 1.003 | 1.002 | 1.003 | 1.003 | 1.007
  Recall:            0.9338 | 0.6818 | 0.7816 | 0.8918 | 0.8100 | 0.8798 | 0.1588
  Indexing Time (s): 0.232 | 0.374 | 6.572 | 0.386 | 46.08 | 44.05 | 1347.9

NUS
  Query Time (ms):   36.07 | 58.75 | 79.15 | 68.38 | 93.13 | 103.33 | 155.72
  Overall Ratio:     1.0008 | 1.011 | 1.004 | 1.011 | 1.012 | 1.010 | 1.009
  Recall:            0.5532 | 0.4656 | 0.5376 | 0.4637 | 0.4494 | 0.4972 | 0.1080
  Indexing Time (s): 0.768 | 1.655 | 40.032 | 1.190 | 23.40 | 15.86 | 798.45

Deep1M
  Query Time (ms):   127.16 | 170.24 | 163.24 | 327.58 | 188.84 | 243.53 | 377.60
  Overall Ratio:     1.004 | 1.010 | 1.004 | 1.004 | 1.005 | 1.014 | 1.003
  Recall:            0.8784 | 0.7376 | 0.8530 | 0.8594 | 0.8354 | 0.5048 | 0.4524
  Indexing Time (s): 5.704 | 7.856 | 159.41 | 6.141 | 61.79 | 34.57 | 3498.3

Gist
  Query Time (ms):   164.03 | 265.90 | 335.67 | 339.63 | 288.63 | 384.77 | 761.02
  Overall Ratio:     1.004 | 1.007 | 1.003 | 1.006 | 1.010 | 1.016 | 1.005
  Recall:            0.8098 | 0.7360 | 0.7248 | 0.7566 | 0.6442 | 0.5180 | 0.2736
  Indexing Time (s): 6.056 | 7.811 | 178.74 | 8.038 | 139.93 | 105.98 | 11907

SIFT10M
  Query Time (ms):   963.17 | 2633.9 | 2774.66 | 1922.4 | 3998 | 9723.4 | 2667.9
  Overall Ratio:     1.001 | 1.002 | 1.002 | 1.001 | 1.001 | 1.006 | 1.001
  Recall:            0.9602 | 0.9420 | 0.9192 | 0.9469 | 0.9560 | 0.8248 | 0.7206
  Indexing Time (s): 86.49 | 123.46 | 159.31 | 101.71 | 506.13 | 263.19 | 23631

TinyImages80M
  Query Time (ms):   14511 | 28854 | 21101 | 29023 | 35396 | 164194 | \
  Overall Ratio:     1.002 | 1.004 | 1.002 | 1.005 | 1.035 | 1.014 | \
  Recall:            0.8922 | 0.8144 | 0.8384 | 0.8164 | 0.6303 | 0.7720 | \
  Indexing Time (s): 1198.9 | 2663.3 | 23911 | 2153.5 | 6508.1 | 4265.1 | \

SIFT100M
  Query Time (ms):   7961.6 | 10287 | 25342 | 26724 | 25467 | 163531 | \
  Overall Ratio:     1.001 | 1.009 | 1.004 | 1.001 | 1.019 | 1.006 | \
  Recall:            0.9618 | 0.7960 | 0.8568 | 0.9597 | 0.6180 | 0.7980 | \
  Indexing Time (s): 1638.1 | 3414.3 | 10912 | 2552.6 | 5404.6 | 3442.9 | \

TABLE IV: Performance Overview ('\' = not run; see Section VI-B)

Parameter Settings. By default, all algorithms are conducted to answer (c,k)-ANN queries with a fixed k. For DB-LSH, we set the approximation ratio c, the bucket-width constant w and K to fixed defaults, and use a larger L for the datasets with cardinality greater than 1M than for the rest. The parameter settings of the competitors follow their original papers or source codes. Specifically, for LCCS-LSH, PM-LSH and R2LSH we adopt the recommended numbers of hash functions and probing parameters; for VHP, a larger setting is used for Gist, Trevi and Cifar, since they have much higher dimensionality. For LSB-Forest, the parameters are set based on the dimensionality of the datasets, from which its K and L are computed; to achieve query accuracy comparable to the competitors, we increase the total number of leaf entries examined in LSB-Forest. For FB-LSH, we set the same approximation ratio c as DB-LSH, fix K, and let L range with the cardinality of the datasets.

Evaluation Metrics. There are five metrics in total. Two metrics are used to evaluate the indexing performance, namely index size and indexing time. Three metrics are used to evaluate the query performance: query time, overall ratio and recall. For a (c,k)-ANN query, let the returned set be R = {o₁, o₂, ..., o_k}, with the points sorted in ascending order of their distances to the query point, and let the exact k-NN set be R* = {o₁*, o₂*, ..., o_k*}; then the overall ratio and recall are defined as follows [38]:

(11)  Overall Ratio = (1/k) · Σᵢ₌₁ᵏ ∥oᵢ, q∥ / ∥oᵢ*, q∥

(12)  Recall = |R ∩ R*| / |R*|

We repeatedly run each algorithm over all queries and report the average query time, overall ratio and recall. Since LSB-Forest, R2LSH and VHP are disk-based methods, we only take their CPU time as the query time for fairness. For FB-LSH, we omit the time for locating candidates in the R-Tree when computing the query time, so as to mimic the fast lookup of candidates through hash tables in static (K,L)-index methods; such time cannot be omitted in DB-LSH.
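For reference, the two metrics of Equations 11-12 can be computed as follows (a minimal sketch with made-up inputs):

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

// dR[i], dS[i]: distances of the i-th returned / i-th exact neighbor to q,
// both sorted ascending; idR, idS: the corresponding point identifiers.
double overall_ratio(const std::vector<double>& dR, const std::vector<double>& dS) {
  double sum = 0;
  for (size_t i = 0; i < dR.size(); ++i) sum += dR[i] / dS[i];
  return sum / dR.size();
}

double recall(std::vector<int> idR, std::vector<int> idS) {
  std::sort(idR.begin(), idR.end());
  std::sort(idS.begin(), idS.end());
  std::vector<int> both;
  std::set_intersection(idR.begin(), idR.end(), idS.begin(), idS.end(),
                        std::back_inserter(both));
  return (double)both.size() / idS.size();
}

int main() {
  std::vector<double> dR = {1.0, 1.2, 2.0}, dS = {1.0, 1.1, 1.5};
  std::vector<int> idR = {7, 3, 9}, idS = {7, 3, 5};
  std::printf("ratio=%.3f recall=%.3f\n", overall_ratio(dR, dS), recall(idR, idS));
}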

VI-B Performance Overview

In this subsection, we provide an overview of the average query time, overall ratio, recall and indexing time of all algorithms with the default parameter settings on all datasets, as shown in Table IV. We do not run LSB-Forest on TinyImages80M and SIFT100M, since its storage consumption would be prohibitively large (more than 10TB to store the indexes).

VI-B1 DB-LSH and FB-LSH

We first make a brief comparison of DB-LSH and FB-LSH, where the number of hash functions is set to the same value; the only difference between them is whether a query-centric bucket is used. As we can see from Table IV, DB-LSH consistently reduces the query time compared to FB-LSH while reaching a higher recall and a smaller overall ratio. In other words, DB-LSH achieves better accuracy with higher efficiency. The main reason is that although DB-LSH spends more time searching for candidates in the R-Trees, the number of required candidates is reduced due to the high quality of candidates in query-centric buckets.

VI-B2 Indexing Performance

The indexing time and index size of all algorithms with the default settings are considered in this set of experiments. Since the index size of all algorithms except LSB-Forest can be easily estimated from the number of hash functions, we compare index sizes by the number of hash functions used in each algorithm, as given in the parameter settings, and do not list them again in Table IV. We can see that the index sizes are close for all algorithms except PM-LSH, which demonstrates that DB-LSH eliminates the space consumption issue of (K,L)-index methods. In LSB-Forest, the data points are also stored in each index, which leads to extremely large space consumption; moreover, the number of trees in LSB-Forest grows with the cardinality and dimensionality, which also makes LSB-Forest ill-adapted to large-scale datasets such as Gist and SIFT10M. For the indexing time, as shown in Table IV, we have the following observations: (1) DB-LSH achieves the smallest indexing time on all datasets. The reason is twofold. First, DB-LSH adopts a bulk-loading strategy to construct the R-Trees, which is more efficient than conventional one-by-one insertion, so it takes less time to construct the R-Trees than PM-LSH takes to build a PM-Tree. Second, DB-LSH requires only L indexes, fewer than those used by LCCS-LSH, R2LSH and VHP. In addition, R2LSH and VHP have close indexing times since they both adopt B-Trees as indexes. LCCS-LSH has a much longer indexing time than the other algorithms due to its complex index structure, CSA. The indexing time of LSB-Forest is also very long because LSB-Forest uses several times the number of indexes of the other algorithms. (2) The indexing time is almost determined by the cardinality of the dataset, and it increases super-linearly with cardinality for all algorithms. For example, MNIST and Cifar have the same cardinality and almost the same indexing time, while all algorithms take more than 10 times longer to build indexes on SIFT100M than on SIFT10M. This implies that index construction is time-consuming for very large-scale datasets, and therefore the smallest indexing time gives DB-LSH a great advantage.

VI-B3 Query Performance

In this set of experiments, we study the average query time, recall and overall ratio of all algorithms under the default settings. According to the results shown in Table IV, we have the following observations: (1) DB-LSH offers the best query performance on all datasets. The higher recall, smaller overall ratio and shorter query time indicate that DB-LSH outperforms all competitors in both efficiency and accuracy. In particular, on the very large-scale datasets TinyImages80M and SIFT100M, DB-LSH not only takes about half the query time of PM-LSH, R2LSH and VHP or less, but also reaches a higher accuracy; only LCCS-LSH and FB-LSH achieve comparable query times on these two large-scale datasets. The reasons DB-LSH achieves the best performance can be summarized as follows: a) compared with query-oblivious methods (LCCS-LSH, LSB-Forest), query-centric methods obtain higher-quality candidates since they address the hash boundary issue; b) compared with other query-centric methods (C2), both MQ and DB-LSH perform better due to their bounded search regions; c) compared with MQ, which adopts only one index, DB-LSH uses L indexes and thus misses fewer exact NNs, achieving better recall and ratio. (2) The query accuracy, especially recall, varies with the datasets. All algorithms achieve relatively high recall on most datasets. On NUS, all algorithms perform slightly worse due to its intrinsically complex distribution (which can be quantified by relative contrast and local intrinsic dimensionality [38, 12, 22]), but DB-LSH still has a lead. (3) The query performance of VHP and R2LSH is considerably worse than the other algorithms on the large-scale datasets TinyImages80M and SIFT100M: VHP takes almost as long as a linear scan, and R2LSH has difficulty reaching an acceptable recall or overall ratio. Therefore, we do not report their results on TinyImages80M and SIFT100M in the subsequent experiments. (4) No matter the dataset, LSB-Forest always needs the longest query time to reach a similar accuracy, and its query time grows rapidly with the cardinality and dimensionality of the dataset. The huge number of index accesses makes LSB-Forest not comparable to the others, so we do not report it in the remaining experiments either.

VI-C Evaluation of Query Performance

VI-C1 Effect of n

To investigate how the dataset cardinality affects the query performance, we randomly sample subsets of different cardinalities from the original dataset and compare the query performance of all algorithms on them under the default parameters. Due to space limitations, we only report the results on Gist and TinyImages80M, which are representative due to their different cardinality and dimensionality. The comparative results are shown in Figures 5-7.

(a) Gist
(b) TinyImages80M
Fig. 5: Query Time when Varying n
(a) Gist
(b) TinyImages80M
Fig. 6: Recall when Varying n
(a) Gist
(b) TinyImages80M
Fig. 7: Overall Ratio when Varying n

DB-LSH has a clear advantage over all competitors under all evaluation metrics when varying the cardinality. Although the query time increases with the cardinality, it grows much more slowly for DB-LSH than for the other algorithms, because DB-LSH truly achieves a sub-linear query cost. In terms of query accuracy, all algorithms, especially DB-LSH, LCCS-LSH and PM-LSH, achieve relatively stable recall and overall ratio, because query accuracy depends mainly on the data distribution: although the cardinality increases, the data distribution remains essentially the same, and therefore the accuracy does not change much. The accuracy of FB-LSH is a bit unsteady due to the hash boundary issue. Overall, DB-LSH keeps performing better than all competitors.