I Introduction
The nearest neighbor (NN) search finds the closest point in a point dataset to a given query point. Since points that are close to each other can often be considered 'similar' in many applications when a proper distance measure is used, this search operation plays a vital role in a wide range of areas, such as pattern recognition [1], information retrieval [36], and data mining [13]. However, it is well known that finding the exact NN in large-scale high-dimensional datasets can be very time-consuming, so approximate nearest neighbor (ANN) searches are often conducted instead [18, 35]. The c-ANN search and the (r,c)-NN search are two representative queries that trade result accuracy for query efficiency. Specifically, c-ANN search aims to find a point o whose distance to the query point q is bounded by c·r*, where r* is the distance from q to its exact NN and c is a given approximation ratio (see Definition 1, Section III). (r,c)-NN search can be considered as a decision version of c-ANN search, which aims to determine whether there exists a point whose distance to q is at most c·r, where r is a given search range (see Definition 2, Section III).

Table I: Comparison of representative LSH methods.

Category  Algorithms  Indexing  Bucketing
(K,L)-index  DB-LSH (ours)  Dynamic  Query-centric
(K,L)-index  E2LSH [4]  Static  Query-oblivious
(K,L)-index  LSB-Forest [35]  Static  Query-oblivious
C2  QALSH [14]  Dynamic  Query-centric
C2  VHP [27]  Dynamic  Query-centric
C2  R2LSH [26]  Dynamic  Query-centric
MQ  SRS [34]  Dynamic  Query-centric
MQ  PM-LSH [38]  Dynamic  Query-centric
Locality-Sensitive Hashing (LSH) [9, 10, 35, 39, 11, 4] is one of the most popular tools for computing ANN in high-dimensional spaces. LSH maps data points into buckets using a set of hash functions such that nearby points in the original space have a higher probability of being hashed into the same bucket than faraway points. When a query arrives, the probability of finding its ANN is guaranteed to be sufficiently high by only checking the points in the bucket where the query point falls. To achieve this goal, the original LSH-based method (E2LSH) [4] designs a set of K independent hash functions with which all data points in the original d-dimensional space are mapped into a K-dimensional space, where K ≪ d. These K-dimensional points are assigned to a range of buckets which are K-dimensional hypercubes. This process is repeated L times to generate L sets of K-dimensional hash buckets (we term this type of approach a (K,L)-index). Intuitively, as K increases, the probability of two different points being hashed into the same bucket decreases. On the contrary, the collision probability, i.e., the probability of two different points being mapped into the same bucket, increases as L increases, because two points are considered a 'collision' as long as they are mapped into the same bucket at least once. As shown in [11, 8], by choosing K = log_{1/p2} n and L = n^ρ, where ρ = ln(1/p1)/ln(1/p2), and p1, p2 are constants depending on r and c (for the meaning of p1 and p2, see Definition 3, Section III), E2LSH can solve the (r,c)-NN problem in sublinear time with a constant success probability. Accordingly, E2LSH finds ANN in sublinear time by answering a series of (r,c)-NN queries with increasing radii r = 1, c, c², .... However, to achieve good accuracy, E2LSH needs to prepare a (K,L)-index for each (r,c)-NN query, and L is typically large, which causes prohibitively large storage costs for the indexes. LSB [35] alleviates this issue by building one (K,L)-index and repeatedly merging small hash buckets into a large one, which effectively enlarges r. However, LSB only works for (r,c)-NN queries at some discrete radii, which imposes the limitation that LSB cannot answer the c-ANN query for arbitrary c. C2LSH [9] proposes a new LSH scheme called collision counting (C2). By relaxing the collision condition from colliding in all dimensions to colliding in any l dimensions, where l is a given value, C2LSH only needs to maintain one-dimensional hash tables (instead of K-dimensional hash tables).
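To make the (K,L)-index construction concrete, here is a minimal, illustrative sketch (not the authors' implementation): K Gaussian projections per table define a K-dimensional bucket id, the scheme is repeated over L tables, and a query inspects only the buckets it falls into. The class name and all parameter values are our own choices.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class StaticKLIndex:
    """Minimal (K,L)-index in the E2LSH style: L tables of K-dim bucket ids."""
    def __init__(self, dim, K=4, L=8, w=4.0):
        self.a = rng.normal(size=(L, K, dim))    # 2-stable (Gaussian) projections
        self.b = rng.uniform(0, w, size=(L, K))  # random offsets in [0, w)
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]

    def _bucket(self, i, p):
        # K-dimensional bucket id of point p in table i
        return tuple(np.floor((self.a[i] @ p + self.b[i]) / self.w).astype(int))

    def insert(self, idx, p):
        for i in range(len(self.tables)):
            self.tables[i][self._bucket(i, p)].append(idx)

    def query(self, q, data):
        # a point is a candidate if it collides with q in at least one table
        cand = {j for i in range(len(self.tables))
                  for j in self.tables[i].get(self._bucket(i, q), [])}
        return min(cand, key=lambda j: np.linalg.norm(data[j] - q), default=None)

data = rng.normal(size=(1000, 16))
index = StaticKLIndex(16)
for j, p in enumerate(data):
    index.insert(j, p)
print(index.query(data[0], data))  # data[0] collides with itself -> 0
```

Note how the bucket grid is fixed at build time; the hash boundary issue discussed below stems exactly from this static quantization.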
However, the query cost of C2 is no longer sublinear [9], because it is expensive to count the number of collisions between a large number of data points and the query point dimension by dimension. In addition to the dilemma between space and time, the above methods also suffer from the candidate quality issue (a.k.a. the hash boundary issue). That is, no matter how large the hash buckets are, some points close to a query point may still be partitioned into different buckets. Several dynamic bucketing techniques have been proposed to address this issue. The main idea of dynamic bucketing is to defer the bucketing process to the query phase, in the hope of generating buckets such that nearby points are more likely to be in the same bucket as the query point. The C2 approach is extended to dynamic scenarios by using B+-trees to locate points falling in a query-centric bucket in each dimension [14, 27, 26], at the cost of increased query time because of a large number of one-dimensional searches. [34, 38] explore a new dynamic metric query (MQ) based LSH scheme that maps data points from a high-dimensional space into a low-dimensional projected space via independent LSH functions, and determines ANN by exact nearest neighbor searches in the projected space. However, even in a low-dimensional space, finding the exact NN is still inherently computationally expensive. More importantly, at least βn exact distance computations need to be performed to avoid missing the correct ANN, which incurs a linear time complexity. Here β is an estimated ratio for the number of low-dimensional NN searches such that the high-dimensional ANN results can be found safely [34, 38]. Table I summarizes the query and space costs of typical LSH methods. As shown in the table, among the existing solutions to the ANN search problem, (K,L)-index based methods are the only ones that can achieve sublinear query cost, i.e., O(n^ρ·d), where ρ is proven to be bounded by 1/c and L in E2LSH is the number of indexes prepared ahead [35]. Note that the value of ρ is bounded by 1/c only when the bucket size is very large [8]. This implies that a very large value of K is necessary to effectively differentiate points based on their distances. It remains a significant challenge to find a smaller and truly bounded ρ without using a very large bucket size. Motivated by the aforementioned limitations, in this paper we propose a novel (K,L)-index approach with a query-centric dynamic bucketing strategy, called DB-LSH, to solve the high-dimensional ANN search problem. DB-LSH decouples the hashing and bucketing processes of the (K,L)-index, making it possible to answer (r,c)-NN queries for any r and c-ANN queries for any c with only one suite of indexes (i.e., without the need to build an index for each possible r). In this way the space cost is reduced significantly, and a reduction of the ρ value becomes possible. DB-LSH builds dynamic query-centric buckets and conducts multi-dimensional window queries to eliminate the hash boundary issue when selecting candidates. Different from other query-centric methods, the regions of our buckets are still multi-dimensional cubes, as in static (K,L)-index methods, which enables DB-LSH not only to generate high-quality candidates but also to achieve sublinear query cost, as shown in Table I. Furthermore, DB-LSH achieves a much smaller bound on the exponent at a proper and finite bucket size; the resulting exponent, denoted as ρ*, is strictly smaller than 1/c for a suitable choice of the initial hypercubic bucket width.
With theoretical analysis and an extensive range of experiments, we show that DB-LSH outperforms the existing LSH methods significantly in both efficiency and accuracy. The main contributions of this paper include:
We propose a novel LSH framework, called DB-LSH, to solve the high-dimensional ANN search problem. It is the first work that combines the static (K,L)-index approach with a dynamic search strategy for bucketing. By taking advantage of both sides, DB-LSH can reduce the index size and improve query efficiency simultaneously.

A rigorous theoretical analysis shows that DB-LSH achieves the lowest query time complexity so far for any approximation ratio c > 1. DB-LSH answers a c²-ANN query with a constant success probability in O(n^{ρ*} d log n) time, where the exponent ρ* is strictly smaller than the exponent ρ (bounded by 1/c) in other (K,L)-index methods.

Extensive experiments on 10 real datasets with different sizes and dimensionality have been conducted to show that DB-LSH achieves better efficiency and accuracy than the existing LSH methods.
The rest of the paper is organized as follows. The related work is reviewed in Section II. Section III introduces the basic concepts and the research problem formally. The construction and query algorithms of DB-LSH are presented in Section IV, with a theoretical analysis in Section V and an experimental study in Section VI. We conclude this paper in Section VII.
II Related Work
LSH was originally proposed in [18, 11]. Due to its simple structure, sublinear query cost, and rigorous quality guarantee, it has been a prominent approach for processing approximate nearest neighbor queries in high-dimensional spaces [11, 8, 6, 28]. We give a brief overview of the existing LSH methods in this section.
II-A Mainstream LSH Methods
(K,L)-index based methods. Although the basic LSH [11] was designed for the Hamming space, (K,L)-index methods extend it to provide a universal and well-adopted LSH framework for answering the ANN problem in other metric spaces. E2LSH [4] is a popular (K,L)-index method in the Euclidean space and adopts the p-stable distribution-based function proposed in [8] as the LSH function. Its applications are limited by the hash boundary problem and undesirably large index sizes. These two shortcomings are shared by other (K,L)-index methods due to the fact that static buckets are used in these methods. To reduce index sizes, Tao et al. [35] consider answering (r,c)-NN queries at different radii via an elegant LSB-Tree framework, although it only works for c-ANN queries with some fixed c. SK-LSH [25] is another approach based on the idea of a static (K,L)-index, but proposes a novel search framework to find more candidates. To address the limitations of static (K,L)-index methods, dynamic query strategies have been developed to find high-quality candidates using smaller indexes. These methods can be classified into two categories as follows.
Collision counting based methods (C2). The core idea of C2 is to generate candidates based on collision numbers. It was proposed in C2LSH [9], which uses the techniques of collision counting and virtual rehashing to reduce space consumption. QALSH [14] improves C2LSH by adopting query-aware buckets rather than static ones, which alleviates the hash boundary issue. R2LSH [26] improves the performance of QALSH by mapping data into multiple two-dimensional projected spaces rather than one-dimensional projected spaces as in QALSH. VHP [27] considers the buckets in QALSH as hyperplanes and introduces the concept of virtual hypersphere to achieve a smaller space complexity than QALSH. C2 can find high-quality candidates with a larger probability, but its cost of finding the candidates is expensive due to the unbounded search regions, which may cause all points to be counted in the worst case. Dynamic metric query based methods (MQ). SRS [34] and PM-LSH [38] are representative dynamic MQ approaches that map data into a low-dimensional projected space and determine candidates based on their Euclidean distances via queries in the projected space. It has been proven that this strategy can accurately estimate the distance between two points in high-dimensional spaces [38]. However, answering metric queries in the projected space is still computationally expensive, and as many as βn candidates have to be checked to ensure a constant success probability, where β is the constant mentioned earlier. Therefore, MQ methods can incur a query cost linear in n.

II-B Additional LSH Methods
There are other LSH methods, which come from two categories: methods that design different hash functions and methods that adopt alternative query strategies. The former includes studies that aim to propose novel LSH functions in the Euclidean space with a smaller exponent ρ [3, 2, 5]. However, these functions are highly theoretical and difficult to use. The latter focuses on finding better query strategies to further reduce the query time or index size [6, 28, 20, 31, 32, 39, 24, 23]. LSH Forest [6] offers each point a variable-length hash value instead of a fixed-length hash value as in (K,L)-index methods. It can improve the quality guarantee of LSH for skewed data distributions while retaining the same space consumption and query cost. Multi-Probe LSH [28] examines multiple hash buckets in the order of a probing sequence derived from a hash table. It reduces the space requirement of E2LSH at the cost of the quality guarantee. Entropy-based LSH [31] and BayesLSH [32] adopt similar multi-probing strategies as Multi-Probe LSH, but have a more rigorous theoretical analysis. Their theoretical analysis relies on a strong assumption on the data distribution which can be hard to satisfy, leading to poor performance on some datasets. LazyLSH [39] supports ANN queries in multiple norm spaces with only one suite of indexes, thus effectively reducing the space consumption. I-LSH [24] and EI-LSH [23] design a set of adaptive early termination conditions so that the query process can stop early if a good enough result is found. Developed upon SK-LSH [25] and Suffix Array [29], Lei et al. [20] propose a dynamic concatenating search framework, LCCS-LSH, that also achieves sublinear query time and subquadratic space. Recently, researchers have adopted the LSH framework to solve other kinds of queries, such as maximum inner product search [33, 30, 16, 37] and point-to-hyperplane NN search [15] in high-dimensional spaces. These examples demonstrate the superior performance and great scalability of LSH.

III Preliminaries
In this section, we present the definition of the ANN search problem, the concepts of LSH, and an important observation. Frequently used notations are summarized in Table II.

Notation  Description
R^d  The d-dimensional Euclidean space
D  The dataset
n  The cardinality of the dataset D
o  A data point
q  A query point
||o, q||  The distance between o and q
φ(·)  The pdf of the standard normal distribution
h(·)  A hash function
III-A Problem Definitions
Let R^d be the d-dimensional Euclidean space, and let ||o, q|| denote the Euclidean distance between points o and q.
Definition 1 (c-ANN Search).
Given a dataset D, a query point q and an approximation ratio c > 1, a c-ANN search returns a point o in D satisfying ||o, q|| ≤ c·||o*, q||, where o* is the exact nearest neighbor of q.
Remark 1.
(c,k)-ANN search is a natural generalization of c-ANN search. It returns k points, say o1, ..., ok, sorted in ascending order w.r.t. their distances to q, such that for each i in [1, k] we have ||oi, q|| ≤ c·||oi*, q||, where oi* is the i-th nearest neighbor of q.
(r,c)-NN search is often used as a subroutine when finding a c-ANN. Following [35], it is defined formally as follows:
Definition 2 ((r,c)-NN Search).
Given a dataset D, a query point q, an approximation ratio c > 1 and a distance r, an (r,c)-NN search returns:

a point o satisfying ||o, q|| ≤ c·r, if there exists a point o' such that ||o', q|| ≤ r;

nothing, if there is no point o such that ||o, q|| ≤ c·r;

otherwise, the result is undefined.
The result of case 3 remains undefined since cases 1 and 2 suffice to ensure the correctness of a c-ANN query. By setting r = ||o*, q||, where o* is the nearest neighbor of q, a c-ANN can be found directly by answering an (r,c)-NN query. As ||o*, q|| is not known in advance, a c-ANN query is processed by conducting a series of (r,c)-NN queries with increasing radius, i.e., it begins by searching a region around q using a small r value. Without loss of generality, we assume the initial radius is 1. Then, it keeps enlarging the search radius in multiples of c, i.e., r = 1, c, c², ..., until a point is returned. In this way, as shown in [18, 11, 4], such a sequence of (r,c)-NN queries answers the ANN query with an approximation ratio of c².
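The radius-expansion reduction just described can be sketched in a few lines. The oracle below brute-forces the (r,c)-NN decision on 1-D data purely for illustration (a real method would answer it with an index), and all names are ours.

```python
def rc_nn(data, q, r, c):
    """Naive (r,c)-NN oracle: if some point lies within r of q, return a point
    within c*r of q; otherwise it may legitimately return nothing."""
    if any(abs(p - q) <= r for p in data):
        return next(p for p in data if abs(p - q) <= c * r)
    return None

def c_ann(data, q, c):
    """Answer an ANN query via (r,c)-NN with r = 1, c, c^2, ..."""
    r = 1.0
    while True:
        p = rc_nn(data, q, r, c)
        if p is not None:
            return p
        r *= c  # enlarge the search radius in multiples of c

data = [3.5, 9.0, 20.0]
q, c = 0.0, 2.0
p = c_ann(data, q, c)
r_star = min(abs(x - q) for x in data)   # exact NN distance
print(abs(p - q) <= c * c * r_star)      # overall ratio bounded by c^2 -> True
```

The final check reflects the c² guarantee: the loop can overshoot the unknown r* by at most one factor of c, and the oracle adds another factor of c.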
Example 1.
Figure 1 shows an example where D has 12 data points. Consider the first (r,c)-NN search with the initial radius r (the yellow circle). Since there is no point o such that ||o, q|| ≤ c·r (the red circle), it returns nothing. Then, consider the (r,c)-NN search with radius c·r. Since there exists no point within c·r of q, but there is a point within c²·r (the blue circle), the returned result is undefined, i.e., it is correct to return either nothing or any found point within the blue circle. Finally, consider the (r,c)-NN search with radius c²·r. Since a point lies within c²·r of q, the query must return a point, which can be any point within c³·r (the green circle). The above procedure also illustrates the process of answering a c-ANN query: any point that may be returned in the final step is a correct c²-ANN result of q.
III-B Locality-Sensitive Hashing
Locality-sensitive hashing is the foundation of our method. For a hash function h(·), two points o1 and o2 are said to collide over h if h(o1) = h(o2), i.e., they are mapped into the same bucket by h. The formal definition of LSH is given below [11]:
Definition 3 (LSH).
Given a distance r and an approximation ratio c, a family H of hash functions is called (r, cr, p1, p2)-locality-sensitive, if for any o1, o2 in R^d, it satisfies both conditions below:

If ||o1, o2|| ≤ r, then Pr[h(o1) = h(o2)] ≥ p1;

If ||o1, o2|| > cr, then Pr[h(o1) = h(o2)] ≤ p2,

where h is chosen at random from H, p1 and p2 are collision probabilities, and p1 > p2.
A typical LSH family for the Euclidean space in static LSH methods (e.g., E2LSH) is defined as follows [8]:

h(o) = ⌊(a·o + b) / w⌋,  (1)

where o is the vector representation of a point, a is a d-dimensional vector whose entries are chosen independently from a 2-stable distribution, i.e., the standard normal distribution, b is a real number chosen uniformly from [0, w), and w is a predefined bucket width. Denote the distance between two points o1 and o2 as s; then the collision probability under such a hash function can be computed as:

p(s, w) = Pr[h(o1) = h(o2)] = ∫_0^w (2/s)·φ(t/s)·(1 − t/w) dt,  (2)

where φ(·) is the probability density function (pdf) of the standard normal distribution. For a given w, it is easy to see that p(s, w) decreases monotonically with s. Therefore, the hash family defined by Equation 1 is (r, cr, p1, p2)-locality-sensitive, where p1 = p(r, w) and p2 = p(cr, w).
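The collision probability of the static family can be checked numerically. The snippet below evaluates the integral in Equation 2 by the midpoint rule and compares it against the closed form known from [8], confirming that the probability decreases with the distance s (hence p1 > p2). Function names and parameter values are ours.

```python
import math

def phi(t):  # pdf of the standard normal distribution
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(t):  # cdf of the standard normal distribution
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def p_static(s, w, steps=20000):
    """p(s, w) = integral over [0, w] of (2/s) phi(t/s) (1 - t/w) dt."""
    h = w / steps
    return sum((2 / s) * phi((k + 0.5) * h / s) * (1 - (k + 0.5) * h / w) * h
               for k in range(steps))

def p_static_closed(s, w):
    # closed form of the same integral (Datar et al. [8])
    return (2 * Phi(w / s) - 1
            - (2 * s / (math.sqrt(2 * math.pi) * w)) * (1 - math.exp(-w * w / (2 * s * s))))

w, r, c = 4.0, 1.0, 2.0
p1, p2 = p_static(r, w), p_static(c * r, w)
print(p1 > p2)                                            # True: monotone in s
print(abs(p_static(r, w) - p_static_closed(r, w)) < 1e-4) # True: forms agree
```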
III-C Locality-Sensitive Hashing with Dynamic Bucketing
A typical dynamic LSH family for the Euclidean space is defined as follows [14]:

h(o) = a·o,  (3)

where a is the same as in Equation 1. For a hash function h(·) and a bucket width w, two points o1 and o2 are said to collide over h if |h(o1) − h(o2)| ≤ w/2. In this sense, the collision probability can be computed as:

p(s, w) = Pr[|h(o1) − h(o2)| ≤ w/2] = ∫_{−w/(2s)}^{w/(2s)} φ(t) dt,  (4)

since h(o1) − h(o2) follows the normal distribution N(0, s²). It is easy to see that the hash family defined by Equation 3 is (r, cr, p1, p2)-locality-sensitive, where p1 = p(r, w) and p2 = p(cr, w). In what follows, H refers to the LSH family identified by Equation 3 and p(s, w) refers to the corresponding collision probability in Equation 4, unless otherwise stated. Next, we introduce a simple but important observation that inspires us to design a dynamic (K,L)-index.
Observation 1.
The hash family H is (r, cr, p(1, w), p(c, w))-locality-sensitive for any search radius r when the bucket width is set to w·r, where w is a positive constant.
Proof.
It is easy to see that for any search radius r and distance s, the following equation holds:

p(s·r, w·r) = ∫_{−wr/(2sr)}^{wr/(2sr)} φ(t) dt = p(s, w).  (5)

That is, H with bucket width w·r is (r, cr, p(1, w), p(c, w))-locality-sensitive. ∎
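This scale invariance is easy to verify numerically for the family of Equation 3: scaling the bucket width with the radius leaves the collision probability unchanged. A small self-check, with names of our own:

```python
import math

def Phi(t):  # cdf of the standard normal distribution
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def p_dyn(s, w):
    """Pr[|a.o1 - a.o2| <= w/2] for points at distance s: since a.(o1 - o2)
    follows N(0, s^2), this equals 2*Phi(w/(2s)) - 1."""
    return 2 * Phi(w / (2 * s)) - 1

s, w = 0.7, 1.5
for r in (1.0, 2.0, 8.0, 100.0):
    # scaling both the distance and the bucket width by r changes nothing
    assert abs(p_dyn(s * r, w * r) - p_dyn(s, w)) < 1e-12
print("scale-invariant:", round(p_dyn(s, w), 4))
```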
By the above observation, we do not need to physically maintain multiple indexes built from locality-sensitive hash families with different bucket widths in advance to support (r,c)-NN queries with different r. Instead, we can dynamically partition buckets with the width w·r required by different queries via only one index, where K = ⌈β·log_{1/p2} n⌉, L = ⌈n^{βρ}⌉, and β in (0, 1] is a constant to balance the query efficiency and space consumption (see Remark 2, Section V). As explained in Section V, the choice of K and L guarantees the correctness of DB-LSH for (r,c)-NN search and ANN search. This is a key observation that leads to our novel approach to be presented next.
IV Our Method
DB-LSH consists of an indexing phase for mapping and a query phase for dynamic bucketing. We first give an overview of this novel approach, followed by detailed descriptions of the two separate phases.
IV-A Overview of DB-LSH
Considering the limitations of C2 and MQ discussed earlier, we propose to keep the basic idea of the static (K,L)-index, which provides an opportunity to answer ANN queries with sublinear query cost. To remove the inherent obstacles in static (K,L)-index methods, DB-LSH develops a dynamic bucketing strategy that constructs query-centric hypercubic buckets with the required width in the query phase. In the indexing phase, DB-LSH projects each data point into L K-dimensional spaces using independent LSH functions. Unlike static (K,L)-index methods that quantize the projected points with a fixed bucket size, we index the points in each K-dimensional space with a multi-dimensional index. In the query phase, an (r,c)-NN query with a sufficiently small r, say r = 1, is issued at the beginning. To answer this query, query-centric hypercubic buckets with width w·r are constructed, and the points in them are found by window queries. If a retrieved point is within distance c·r of q, DB-LSH returns it as a correct ANN result. Otherwise, the next (r,c)-NN query with radius c·r is issued, and the width of the dynamic hypercubic bucket is updated from w·r to c·w·r accordingly. By gradually extending the search radius and bucket width, DB-LSH finds an ANN with a constant success probability on top of just one index, after accessing a bounded number of candidate points.
Figure 2 gives an intuitive explanation of the advantage of DB-LSH in terms of the search region. The dotted purple square is the search region of E2LSH. We can see that points close to the query might be hashed to a different bucket, especially when the query is near a bucket boundary, which jeopardizes the accuracy. The gray cross-like region is the search region of C2. Such an unbounded region is much bigger than that of DB-LSH (the red square), which can make the number of points accessed arbitrarily large in the worst case and thus incurs a large query cost. The dotted blue circle is the search region of MQ. Although it is a bounded region, finding the points in it is more complex than in the other regions. DB-LSH still uses hypercubic buckets (search regions), as in static (K,L)-index methods, but achieves much better accuracy: the query-centric bucketing strategy eliminates the hash boundary issue. The overhead of dynamic bucketing is affordable because of efficient window queries via multi-dimensional indexes. To summarize, DB-LSH is expected to reach a given accuracy with the least query cost among all these methods. In what follows, we give everything that a practitioner needs to know to apply DB-LSH.
IV-B Indexing Phase
The indexing phase consists of two steps: constructing projected spaces and indexing points with multi-dimensional indexes.
Constructing projected spaces. Given a locality-sensitive hash family H, let H^K be the set of all subsets of K hash functions chosen independently from H, i.e., each element G in H^K is a K-dimensional compound hash function of the form:

G(o) = (h^1(o), h^2(o), ..., h^K(o)),  (6)

where h^j is drawn from H for 1 ≤ j ≤ K. Then, we sample L instances independently from H^K, denoted as G1, G2, ..., GL, and compute L projections of each data object o as follows:

G_i(o) = (h_i^1(o), h_i^2(o), ..., h_i^K(o)), 1 ≤ i ≤ L.  (7)

Indexing points with multi-dimensional indexes. In each K-dimensional projected space, we index the points with a multi-dimensional index. The only requirement on the index is that it can efficiently answer a window query in the low-dimensional space. In this paper, we simply choose the R-Tree [17] as our index, because an ocean of optimizations and toolboxes enables the R-Tree to perform robustly in practice. The CR-Tree [19], X-tree [7] or a multi-dimensional learned index [21] can certainly be used to potentially further improve our approach.
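As a minimal sketch of the projection step (Equations 6 and 7), the snippet below draws K·L Gaussian LSH functions and computes the L K-dimensional projections of every point; in the real method each projected space would then be bulk-loaded into an R-tree, which we omit here. All names and parameter values are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def build_projections(data, K=2, L=3):
    """Indexing-phase sketch: L compound hashes G_i(o) = (a_i1.o, ..., a_iK.o)."""
    d = data.shape[1]
    A = rng.normal(size=(L, K, d))            # K*L independent LSH functions
    proj = np.einsum('lkd,nd->lnk', A, data)  # proj[i] = G_i applied to all points
    return A, proj

data = rng.normal(size=(500, 32))
A, proj = build_projections(data)
print(proj.shape)  # (3, 500, 2): L projected spaces, n points, K dimensions
```

Each `proj[i]` is exactly the set of K-dimensional points that the i-th R-tree would index.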
IV-C Query Phase
DB-LSH can directly answer an (r,c)-NN query with any search radius r by exploiting the index built in the indexing phase, as described in Section IV-B. Algorithm 1 outlines the query processing. To find the (r,c)-NN of a query q, we consider the L K-dimensional projected spaces in order. For each space, we first compute the hash values of q, i.e., G_i(q) (Line 1). Then, a window query, denoted as WQ_i(q, w·r), is conducted using the R-Tree. To be more specific, WQ_i(q, w·r) is a query that returns the points in the following hypercubic region:

[h_i^1(q) − wr/2, h_i^1(q) + wr/2] × ... × [h_i^K(q) − wr/2, h_i^K(q) + wr/2].  (8)

Without confusion, we also use WQ_i(q, w·r) to denote the region above. For each point falling in such a region, we compute its distance to q. If the distance is at most c·r, or we have already verified the maximum allowed number of points, the algorithm reports the current point and stops. Otherwise, the algorithm returns nothing. According to Lemma 2, to be introduced in Section V, DB-LSH is able to correctly answer an (r,c)-NN query with a constant success probability.
ANN. An ANN query can be answered by conducting a series of (r,c)-NN queries with r = 1, c, c², .... Algorithm 2 gives the details. Given a query q and an approximation ratio c, the algorithm starts with the (1,c)-NN query. After that, if we have found a satisfying point or have accessed enough points (Line 2), the algorithm reports the current point and terminates immediately. Otherwise, it enlarges the query radius by a factor of c and invokes the (r,c)-NN query (Algorithm 1) again until the termination conditions are satisfied. According to Theorem 1, to be introduced in Section V, DB-LSH is able to correctly answer a c²-ANN query with a constant success probability.
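Putting Algorithms 1 and 2 together, a toy end-to-end sketch looks as follows. A brute-force filter stands in for the R-tree window query, the candidate budget `max_cand` is our placeholder for the theoretical bound, and the remaining names and parameter values are ours as well.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K, L, w, c = 2000, 24, 2, 4, 1.0, 2.0
data = rng.normal(size=(n, d))
A = rng.normal(size=(L, K, d))            # compound hashes G_1 .. G_L
proj = np.einsum('lkd,nd->lnk', A, data)  # indexed projections

def window_query(i, gq, width):
    """Stand-in for an R-tree window query WQ_i(q, width): ids of points whose
    projection lies in the query-centric hypercube of side `width`."""
    inside = np.all(np.abs(proj[i] - gq) <= width / 2, axis=1)
    return np.nonzero(inside)[0]

def ann_query(q, max_cand=200):
    gq = np.einsum('lkd,d->lk', A, q)
    r, best, seen = 1.0, None, 0
    while True:
        for i in range(L):
            for j in window_query(i, gq[i], w * r):  # bucket width grows with r
                dist = np.linalg.norm(data[j] - q)
                seen += 1
                if best is None or dist < best[1]:
                    best = (j, dist)
                if dist <= c * r or seen >= max_cand:  # termination tests
                    return best
        if best is not None and best[1] <= c * r:
            return best
        r *= c  # next (r,c)-NN query with a c-times larger radius

j, dist = ann_query(data[7])  # query sitting on a data point
print(j, round(dist, 3))
```

Since the query coincides with point 7, its projection lies in every query-centric bucket, so the loop terminates already at r = 1 with a point within c·r of the query.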
Example 2.
Figure 3 gives an example of answering an ANN query by DB-LSH, where we choose K = 2 and L = 1 for simplicity. Figures 3(a) and 3(b) show the points in the original and projected space, respectively. Assume the initial radius is r = 1. First of all, we issue a (1,c)-NN query in the original space (the yellow circle in Figure 3(a)). To answer this query, we conduct the window query WQ_1(q, w) in the projected space (the yellow square in Figure 3(b)). Since no point is found, an (r,c)-NN query with a larger radius r = c (the red circle in Figure 3(a)) is issued, and the window query WQ_1(q, c·w) (the red square in Figure 3(b)) is performed accordingly. Then, a point is found as a candidate and we verify it by computing its original distance to q. Since its distance to q is at most c·r (the blue circle in Figure 3(a)), it is returned as the result.
(c,k)-ANN. Algorithm 2 can be easily adapted to answer (c,k)-ANN queries. Specifically, it suffices to modify the two termination conditions so that the algorithm stops once k satisfying points have been found or enough points have been accessed. DB-LSH terminates if and only if one of the two situations happens. Also, Line 1 in Algorithm 1 (or Line 2 in Algorithm 2) should apparently return the k nearest neighbors found.
V Theoretical Analysis
It is essential to provide a theoretical analysis of DB-LSH. First, we discuss the quality guarantees of DB-LSH. Then, we prove that DB-LSH achieves lower query time and space complexities, with an emphasis on deriving a smaller exponent ρ*.
V-A Quality Guarantees
We demonstrate that DB-LSH is able to correctly answer an ANN query. Before proving it, we first define two events as follows:

E1: If there exists a point o satisfying ||o, q|| ≤ r, then G_i(o) falls in the bucket WQ_i(q, w·r) for some i in [1, L];

E2: The number of points satisfying the two conditions below is no more than 2L·n^{1−β}: 1) ||o, q|| > c·r; and 2) G_i(o) falls in the bucket WQ_i(q, w·r) for some i in [1, L].
Lemma 1.
For given r and c, by setting K = ⌈β·log_{1/p2} n⌉ and L = ⌈n^{βρ}⌉, where ρ = ln(1/p1)/ln(1/p2) and β in (0, 1], the probability that E1 occurs is at least 1 − 1/e and the probability that E2 occurs is at least 1/2.
Proof.
If there exists a point o satisfying ||o, q|| ≤ r, then the LSH property implies that for any i, Pr[G_i(o) falls in WQ_i(q, w·r)] ≥ p1^K. Then, the probability that G_i(o) falls outside the bucket in all L projected spaces is at most (1 − p1^K)^L ≤ 1/e, and thus the probability that E1 does not occur will not exceed 1/e when K and L are set as above. Likewise, if a point o satisfies ||o, q|| > c·r, we have Pr[G_i(o) falls in WQ_i(q, w·r)] ≤ p2^K. Then, the expected number of such points in a certain projected space does not exceed n·p2^K ≤ n^{1−β}, and thus the expected number of such points over all L projected spaces is upper bounded by L·n^{1−β}. By Markov's inequality, the probability that more than 2L·n^{1−β} such points exist is at most 1/2, i.e., E2 occurs with probability at least 1/2. ∎
It is easy to see that the probability that E1 and E2 hold at the same time is at least a constant, which can be computed as 1 − 1/e − 1/2 = 1/2 − 1/e. Next, we demonstrate that when E1 and E2 hold at the same time, Algorithm 1 correctly answers an (r,c)-NN query.
Lemma 2.
Algorithm 1 answers an (r,c)-NN query with at least a constant probability of 1/2 − 1/e.
Proof.
Assume that E1 and E2 hold at the same time, which occurs with at least a constant probability 1/2 − 1/e. In this case, if Algorithm 1 terminates because it has accessed 2L·n^{1−β} + 1 points, then the current point must satisfy ||o, q|| ≤ c·r due to E2, and thus a correct result is found. If Algorithm 1 terminates because it finds a point satisfying ||o, q|| ≤ c·r, this point is obviously a correct (r,c)-NN. If all L window queries finish without the algorithm terminating for either reason (i.e., neither accessing enough points nor finding a point within c·r of q), then no point satisfies ||o, q|| ≤ r due to E1. According to the definition of (r,c)-NN search, it is then correct to return nothing. Therefore, when E1 and E2 hold at the same time, an (r,c)-NN query is always correctly answered when Algorithm 1 terminates. That is, Algorithm 1 answers the (r,c)-NN query with at least a constant probability of 1/2 − 1/e. ∎
Theorem 1.
Algorithm 2 returns a c²-ANN with at least a constant probability of 1/2 − 1/e.
Proof.
We show that when E1 and E2 hold at the same time, Algorithm 2 returns a correct c²-ANN result. Let o* be the exact NN of the query point q in D and r* = ||o*, q||. Without loss of generality, we assume r* ≥ 1. Obviously, there must exist an integer t ≥ 0 such that c^t ≤ r* < c^{t+1}. When enlarging the search radius as r = 1, c, c², ..., we know that r at the termination of Algorithm 2 is at most c^{t+1} due to E1. In this case, according to Lemma 2, the returned point o satisfies ||o, q|| ≤ c·c^{t+1} = c²·c^t ≤ c²·r*, and is thus a correct c²-ANN result. Clearly, if Algorithm 1 stops at a smaller r for either condition, the returned point also satisfies ||o, q|| ≤ c·r ≤ c²·r*. Therefore, Algorithm 2 returns a c²-ANN with at least a constant probability of 1/2 − 1/e. ∎
Remark 2.
Unlike the classic (K,L)-index methods, where K and L are set as log_{1/p2} n and n^ρ, we introduce a constant β to lessen K and L. In this manner, the total space consumption is greatly reduced. The overhead of this strategy is the need to examine at most 2L·n^{1−β} candidates instead of 2L, which seems to cause a higher query cost. However, in fact, none of the efficient LSH methods really build n^ρ hash indexes and check only a couple of candidates in each index. Usually, many fewer than n^ρ hash indexes are already able to return a sufficiently accurate ANN. Therefore, by introducing β, we tend to get n^{1−β} candidates in one index. This kind of parameter setting is more reasonable and practical.
V-B The Bound of ρ*
As proven in [8], ρ is strictly bounded by 1/c only when the bucket width w is large enough. Such a large bucket size cannot be used in practice, since it implies a very large value of K to effectively differentiate points based on their distances. In contrast, we find that ρ* has a bound smaller than 1/c that can be attained even when the bucket width is not too large. For ease of understanding and to simplify the proof, we prove the bound of ρ* in the special case where the bucket width w is set as a fixed function of c.
Lemma 3.
With a proper setting of w as a function of c, ρ* = ln(1/p1)/ln(1/p2), where p1 = p(1, w) and p2 = p(c, w), is bounded by a quantity strictly smaller than 1/c.
Proof.
Recalling the expression for p(s, w) in Equation 4, bounds on p1 and p2 follow from Lemma 1 in [8]:

(9)

Given a c, we prove that the claimed bound on ρ* holds for any w in the stated range, which is equivalent to proving the following inequality:

(10)

Define an auxiliary function g. Inequality 10 holds when g decreases monotonically with w, and this can be ensured by showing that the derivative of g is non-positive in the stated range. It can further be shown that the remaining term increases monotonically there, from which the claimed bound on ρ* always holds. ∎
The bound holds once w exceeds a moderate threshold, which subsequently provides a bound smaller than 1/c. The bound improves as w increases, so the query cost can be very small when w is large enough. However, a large bucket size implies a very large K in order to reduce the number of false positives, so w should typically be set in a similar range as in other (K,L)-index methods. Recall that LSB [35] sets the bucket size according to the approximation ratio; we can set w equivalently so that the settings are comparable, and then, according to Lemma 3, the resulting bound on ρ* is smaller than the 1/c bound in [35]. Note that 1/c is only an asymptotic bound on ρ, approachable only with a very large bucket size, while the bound on ρ* is a non-asymptotic result and ρ* is always much smaller than ρ. Besides, it is not necessary to choose an overly large w, since that again implies a very large value of K and would make (K,L)-index based methods impractical. Figure 4(a) gives an example in which ρ exceeds 1/c at moderate w, i.e., ρ is not truly bounded by 1/c there, while ρ* always stays below its bound and smaller than ρ. Figure 4(b) gives a clear comparison of the decided advantage of ρ* over ρ at a reasonable bucket width: ρ stays very close to 1/c, while ρ* has a much smaller bound and decreases rapidly as w grows.
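As a concrete sanity check of this section's claim (not a reproduction of the exact constants in Lemma 3), one can evaluate ρ = ln(1/p1)/ln(1/p2) for the dynamic family of Equation 4 and observe that it falls well below 1/c already at moderate bucket widths. Function names are ours.

```python
import math

def Phi(t):  # cdf of the standard normal distribution
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def p_dyn(s, w):
    # collision probability of Equation 4: 2*Phi(w/(2s)) - 1
    return 2 * Phi(w / (2 * s)) - 1

def rho(c, w, r=1.0):
    """rho = ln(1/p1) / ln(1/p2) with p1 = p(r, w), p2 = p(cr, w)."""
    return math.log(1 / p_dyn(r, w)) / math.log(1 / p_dyn(c * r, w))

c = 2.0
for w in (2.0, 4.0, 8.0):
    print(w, round(rho(c, w), 4), "vs 1/c =", 1 / c)  # all well below 0.5
```

For c = 2 the printed exponents drop from roughly 0.4 at w = 2 to far below 0.01 at w = 8, illustrating that moderate bucket widths already beat the asymptotic 1/c bound of the static family.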
V-C Complexity Analysis
Similar to other (K,L)-index based methods, whose time and space complexities are determined by ρ, the complexities of DB-LSH are determined by ρ*.
Theorem 2.
DB-LSH answers an ANN query in O(n^{ρ*} d log n) time with an O(n^{1+ρ*}) index size, where ρ* is bounded as in Lemma 3 and is smaller than the ρ of static (K,L)-index methods.
Proof.
It is obvious that K = O(log n) and L = O(n^{ρ*}). Therefore, the index size is O(nL) = O(n^{1+ρ*}). In the query phase, we first compute the K·L hash values of the query point, the computational cost of which is O(d·K·L) = O(n^{ρ*} d log n). When finding candidates, it takes O(log n) time to locate a candidate using the R-Trees. Since we need to retrieve at most O(n^{ρ*}) candidate points, the cost of generating candidates is O(n^{ρ*} log n). In the verification phase, each candidate point takes O(d) time for a distance computation, so the total verification cost is O(n^{ρ*} d). Therefore, the query time of DB-LSH is bounded by O(n^{ρ*} d log n). ∎
VI Experimental Study
We implement DB-LSH (source code: https://github.com/Jacyhust/DBLSH) and the competitors in C++, compiled with g++ using O3 optimization and run in a single thread. All experiments are conducted on a server running 64-bit Ubuntu 20.04 with 2 Intel(R) Xeon(R) Gold 5218 CPUs @ 2.30GHz and 254 GB RAM.
VI-A Experimental Settings
Datasets and Queries. We employ 10 real-world datasets varying in cardinality, dimensionality and type, which are widely used in existing LSH work [26, 20, 21, 27, 38]. For fairness, we make sure that each dataset is used by at least one of our competitors. Table III summarizes the statistics of the datasets. Note that both SIFT10M and SIFT100M consist of points randomly chosen from the SIFT1B dataset (http://corpus-texmex.irisa.fr/). For queries, we randomly select 100 points as queries and remove them from the datasets.
Datasets  Cardinality  Dim.  Types 
Audio  54,387  192  Audio 
MNIST  60,000  784  Image 
Cifar  60,000  1024  Image 
Trevi  101,120  4096  Image 
NUS  269,648  500  SIFT Description 
Deep1M  1,000,000  256  DEEP Description 
Gist  1,000,000  960  GIST Description 
SIFT10M  10,000,000  128  SIFT Description 
TinyImages80M  79,302,017  384  GIST Description 
SIFT100M  100,000,000  128  SIFT Description 
Competitors. We compare DB-LSH with 5 LSH methods introduced in Section II: LCCS-LSH [20], PM-LSH [38], VHP [27], R2LSH [26] and LSB-Forest [35]. LCCS-LSH adopts a query-oblivious LSH indexing strategy with a novel search framework. PM-LSH is a typical dynamic MQ method that adopts a PM-tree to index the projected data. R2LSH and VHP are representative C2 methods that improve QALSH from the perspective of search regions. LSB-Forest is a static index method that can answer ANN queries for any with only one set of indexes. In addition, to study the effectiveness of the query-centric dynamic bucketing strategy in DB-LSH, we design a static index method called Fixed Bucketing-LSH (FB-LSH) by replacing the dynamic bucketing part in DB-LSH with fixed bucketing. Note that FB-LSH is not equivalent to E2LSH since only one set of indexes is used.
DB-LSH  FB-LSH  LCCS-LSH  PM-LSH  R2LSH  VHP  LSB-Forest  
Query Time (ms)  4.962  5.434  5.797  5.459  8.748  11.32  18.52  
Overall Ratio  1.003  1.008  1.006  1.003  1.005  1.006  1.005  
Recall  0.9268  0.8512  0.8204  0.9212  0.868  0.8580  0.4676  
Audio  Indexing Time (s)  0.099  0.164  2.126  0.166  2.764  1.626  19.55 
Query Time (ms)  7.684  9.304  19.89  13.87  12.95  15.37  37.35  
Overall Ratio  1.005  1.018  1.007  1.005  1.005  1.008  1.010  
Recall  0.9130  0.7580  0.8038  0.9098  0.8756  0.8426  0.3734  
MNIST  Indexing Time (s)  0.149  0.192  1.942  0.189  6.231  5.457  92.26 
Query Time (ms)  12.54  16.37  17.66  17.53  21.81  19.31  59.66  
Overall Ratio  1.002  1.006  1.006  1.004  1.003  1.014  1.010  
Recall  0.9156  0.8018  0.7150  0.8742  0.8784  0.6322  0.1496  
Cifar  Indexing Time (s)  0.149  0.209  1.941  0.199  8.261  6.844  146.27 
Query Time (ms)  48.20  61.74  113.7  52.23  53.10  176.47  271.56  
Overall Ratio  1.001  1.010  1.003  1.002  1.003  1.003  1.007  
Recall  0.9338  0.6818  0.7816  0.8918  0.8100  0.8798  0.1588  
Trevi  Indexing Time (s)  0.232  0.374  6.572  0.386  46.08  44.05  1347.9 
Query Time (ms)  36.07  58.75  79.15  68.38  93.13  103.33  155.72  
Overall Ratio  1.0008  1.011  1.004  1.011  1.012  1.010  1.009  
Recall  0.5532  0.4656  0.5376  0.4637  0.4494  0.4972  0.1080  
NUS  Indexing Time (s)  0.768  1.655  40.032  1.190  23.40  15.86  798.45 
Query Time (ms)  127.16  170.24  163.24  327.58  188.84  243.53  377.60  
Overall Ratio  1.004  1.010  1.004  1.004  1.005  1.014  1.003  
Recall  0.8784  0.7376  0.8530  0.8594  0.8354  0.5048  0.4524  
Deep1M  Indexing Time (s)  5.704  7.856  159.41  6.141  61.79  34.57  3498.3 
Query Time (ms)  164.03  265.90  335.67  339.63  288.63  384.77  761.02  
Overall Ratio  1.004  1.007  1.003  1.006  1.010  1.016  1.005  
Recall  0.8098  0.7360  0.7248  0.7566  0.6442  0.5180  0.2736  
Gist  Indexing Time (s)  6.056  7.811  178.74  8.038  139.93  105.98  11907 
Query Time (ms)  963.17  2633.9  2774.66  1922.4  3998  9723.4  2667.9  
Overall Ratio  1.001  1.002  1.002  1.001  1.001  1.006  1.001  
Recall  0.9602  0.9420  0.9192  0.9469  0.9560  0.8248  0.7206  
SIFT10M  Indexing Time (s)  86.49  123.46  159.31  101.71  506.13  263.19  23631 
Query Time (ms)  14511  28854  21101  29023  35396  164194  \  
Overall Ratio  1.002  1.004  1.002  1.005  1.035  1.014  \  
Recall  0.8922  0.8144  0.8384  0.8164  0.6303  0.7720  \  
TinyImages80M  Indexing Time (s)  1198.9  2663.3  23911  2153.5  6508.1  4265.1  \ 
Query Time (ms)  7961.6  10287  25342  26724  25467  163531  \  
Overall Ratio  1.001  1.009  1.004  1.001  1.019  1.006  \  
Recall  0.9618  0.7960  0.8568  0.9597  0.6180  0.7980  \  
SIFT100M  Indexing Time (s)  1638.1  3414.3  10912  2552.6  5404.6  3442.9  \ 
Parameter Settings. By default, all algorithms answer ANN queries with . For DB-LSH, we set the approximation ratio and . is fixed as . for the datasets with cardinality greater than 1M and for the remaining datasets. The parameter settings of the competitors follow the original papers or their source codes. Specifically, for LCCS-LSH, we set and . For PM-LSH, we set and use hash functions, . For R2LSH, we follow the authors' recommendation and set , and to , and . For VHP, we set and for all datasets except Gist, Trevi and Cifar; for these three datasets, is set to since they have much higher dimensionality. For LSB-Forest, we set based on the dimensionality of the datasets. Then and can be computed by and . To achieve query accuracy comparable with the competitors, we increase the total number of leaf entries in LSB-Forest from to . For FB-LSH, we set the approximation ratio and . is fixed as and ranges from to based on the cardinality of the datasets.
Evaluation Metrics. There are five metrics in total. Two evaluate the indexing performance: index size and indexing time. Three evaluate the query performance: query time, overall ratio and recall. For a ANN query, let the returned set be $R=\{o_1,\dots,o_k\}$, with points sorted in ascending order of their distances to the query point $q$, and let the exact NNs be $R^*=\{o_1^*,\dots,o_k^*\}$; then the overall ratio and recall are defined as follows [38].
$OR = \frac{1}{k}\sum_{i=1}^{k}\frac{\|o_i,q\|}{\|o_i^*,q\|}$  (11)
$Recall = \frac{|R\cap R^*|}{k}$  (12)
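Both metrics are straightforward to compute from the returned and exact result sets. The sketch below follows the standard definitions used in [38]; the function names and array layout are our own, and it assumes the query point itself is not in the dataset (so no distance is zero):

```python
import numpy as np

def overall_ratio(returned, exact, q, data):
    # returned / exact: index lists of equal length k, each sorted in
    # ascending order of distance to the query q; data: (n, d) array.
    dr = np.linalg.norm(data[returned] - q, axis=1)  # returned distances
    de = np.linalg.norm(data[exact] - q, axis=1)     # exact kNN distances
    return float(np.mean(dr / de))                   # mean per-rank ratio

def recall(returned, exact):
    # Fraction of the exact kNNs that appear in the returned set.
    return len(set(returned) & set(exact)) / len(exact)
```

An overall ratio of 1.0 and a recall of 1.0 correspond to an exact answer; larger ratios and smaller recalls indicate lower accuracy.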
We run each algorithm times for all queries and report the average query time, overall ratio and recall. Since LSB-Forest, R2LSH and VHP are disk-based methods, we only count their CPU time as the query time for fairness. For FB-LSH, we omit the time spent searching for candidates in the R-trees when computing the query time, so as to mimic the fast lookup of candidates through hash tables in static index methods. This time cannot be omitted for DB-LSH.
VI-B Performance Overview
In this subsection, we provide an overview of the average query time, overall ratio, recall and indexing time of all algorithms with the default parameter settings on all datasets, as shown in Table IV. We do not run LSB-Forest on TinyImages80M and SIFT100M, since its storage consumption is prohibitively large (more than 10TB to store the indexes).
VI-B1 DB-LSH vs. FB-LSH
We first make a brief comparison between DB-LSH and FB-LSH, where the number of hash functions is set to the same value. The only difference between them is whether a query-centric bucket is used. As we can see from Table IV, DB-LSH saves of the query time compared to FB-LSH while reaching a higher recall and a smaller overall ratio. In other words, DB-LSH achieves better accuracy with higher efficiency. The main reason is that although DB-LSH spends more time searching for candidates in the R-trees, the number of required candidates is reduced due to the higher quality of candidates in query-centric buckets.
VI-B2 Indexing Performance
This set of experiments considers the indexing time and index size of all algorithms under the default settings. Since the index size of all algorithms except LSB-Forest can be easily estimated by , we compare index sizes by the number of hash functions used in each algorithm, as given in the parameter settings, and do not list them again in Table IV. We can see that the index sizes are close for all algorithms except PM-LSH, which demonstrates that DB-LSH eliminates the space consumption issue of index methods. In LSB-Forest, the data points are also stored in each index, which leads to extremely large space consumption. Besides, the value of in LSB-Forest is , which also makes it ill-suited to large-scale datasets. For example, reaches for Gist and for SIFT10M. For the indexing time, as shown in Table IV, we have the following observations: (1) DB-LSH achieves the smallest indexing time on all datasets. The reason is twofold. First, DB-LSH adopts a bulk-loading strategy to construct its R-trees, which is more efficient than conventional one-by-one insertion; constructing the R-trees thus takes less time than building the PM-tree in PM-LSH. Second, DB-LSH requires only indexes, which is much smaller than the numbers used in LCCS-LSH, R2LSH and VHP. In addition, R2LSH and VHP have close indexing times since they both adopt B-trees as indexes. LCCS-LSH has a much longer indexing time than the other algorithms due to its complex index structure, CSA. The indexing time of LSB-Forest is also very long because it uses several times more indexes than the other algorithms. (2) The indexing time is mainly determined by the cardinality of the dataset and increases superlinearly with cardinality for all algorithms. For example, MNIST and Cifar have the same cardinality and almost the same indexing time, and all algorithms take more than 10 times longer to build indexes on SIFT100M than on SIFT10M.
This implies that constructing indexes for very large-scale datasets is time-consuming, and therefore the smallest indexing time gives DB-LSH a great advantage.
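Bulk loading, credited above for DB-LSH's small indexing time, packs a sorted dataset directly into tree pages instead of inserting points one at a time. A common scheme is Sort-Tile-Recursive (STR) packing; the sketch below builds only the leaf level for 2-D points and is our simplified illustration, not the paper's implementation:

```python
import math
import numpy as np

def str_pack_leaves(points, capacity):
    # Sort-Tile-Recursive (STR) leaf packing for 2-D points:
    # sort by x, cut into vertical slices, sort each slice by y,
    # then pack consecutive runs of `capacity` points into leaf MBRs.
    n = len(points)
    n_leaves = math.ceil(n / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))   # slices along the x axis
    per_slice = n_slices * capacity             # points per vertical slice
    pts = points[np.argsort(points[:, 0])]      # sort all points by x
    leaves = []
    for s in range(0, n, per_slice):
        sl = pts[s:s + per_slice]
        sl = sl[np.argsort(sl[:, 1])]           # sort the slice by y
        for i in range(0, len(sl), capacity):
            grp = sl[i:i + capacity]
            leaves.append((grp.min(axis=0), grp.max(axis=0)))  # leaf MBR
    return leaves
```

Upper levels are built by recursively applying the same packing to the leaf MBR centers; since every page is filled near capacity in one pass over sorted data, construction is much faster than repeated top-down insertions.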
VI-B3 Query Performance
In this set of experiments, we study the average query time, recall and overall ratio of all algorithms under the default settings. From the results in Table IV, we have the following observations: (1) DB-LSH offers the best query performance on all datasets. The higher recall, smaller overall ratio and shorter query time indicate that DB-LSH outperforms all competitors in both efficiency and accuracy. In particular, on the very large-scale datasets TinyImages80M and SIFT100M (s and s), DB-LSH not only takes only about half the query time of PM-LSH, R2LSH, VHP and LSB-Forest, but also reaches a higher accuracy. Only LCCS-LSH and FB-LSH achieve comparable query times on these two large-scale datasets (s and s). The reasons DB-LSH achieves the best performance can be summarized as follows: a) compared with query-oblivious methods (LCCS-LSH, LSB-Forest), query-centric methods obtain higher-quality candidates since they address the hash boundary issue; b) compared with other query-centric methods (C2), both MQ and DB-LSH perform better due to the bounded search region; c) compared with MQ, which adopts only one index, DB-LSH uses indexes and thus misses fewer exact NNs, achieving better recall and ratio. (2) The query accuracy, especially recall, varies across datasets. All algorithms can achieve recall on most datasets. On NUS, all algorithms perform slightly worse due to its intrinsically complex distribution (which can be quantified by relative contrast and local intrinsic dimensionality [38, 12, 22]), but DB-LSH still leads. (3) The query performance of VHP and R2LSH is considerably worse than that of the other algorithms on the large-scale datasets TinyImages80M and SIFT100M: VHP takes as long as a linear scan (s and s), and R2LSH struggles to reach an acceptable recall ( and ) or overall ratio. Therefore, we do not report their results on TinyImages80M and SIFT100M in the subsequent experiments.
(4) On all datasets, LSB-Forest needs the longest query time to reach a similar accuracy, and its query time grows rapidly with the cardinality and dimensionality of the dataset. As many as index accesses make LSB-Forest uncompetitive, so we do not report it in the remaining experiments.
VI-C Evaluation of Query Performance
VI-C1 Effect of
To investigate how the dataset cardinality affects the query performance, we randomly sample and data points from the original dataset and compare the query performance of all algorithms with the default parameters. Due to space limitations, we only report the results on Gist and TinyImages80M, which are representative due to their different cardinalities and dimensionalities. The comparative results are shown in Figures 5-7.
Clearly, DB-LSH maintains a clear lead over all competitors under all evaluation metrics as the cardinality varies. Although the query time increases with the cardinality, that of DB-LSH grows much more slowly than those of the other algorithms, because DB-LSH truly achieves a sublinear query cost. In terms of query accuracy, all algorithms, especially DB-LSH, LCCS-LSH and PM-LSH, achieve relatively stable recall and overall ratio, because query accuracy depends mainly on the data distribution: although the cardinality increases, the distribution remains essentially the same, and therefore the accuracy does not change much. The accuracy of FB-LSH is a bit unsteady due to the hash boundary issue. Overall, DB-LSH consistently performs better than all competitors.