1 Motivation
Nearest neighbor (NN) search is a fundamental problem with applications in information retrieval, data mining, multimedia and scientific databases, and social networks, to name a few. Given a query point q, the NN problem is to identify an object o* from a dataset D of n objects such that o* is closer to q than any other object. A useful extension is to identify the k nearest neighbours (kNN), i.e., an ordered set of the top-k closest objects to q.
While there exists a host of scalable methods to efficiently answer NN queries in low- and medium-dimensional spaces [14, 61, 28, 9, 21, 12, 20, 10, 55, 11], they do not scale gracefully with the increase in dimensionality of the dataset. Under the strict requirement that the returned answer set be completely accurate, the problem of high dimensionality in similarity searches, commonly dreaded as the “curse of dimensionality” [14], renders these methods no better than a simple linear scan of the entire database [71].
In many practical applications, however, this strict correctness requirement can be sacrificed to mitigate the dimensionality curse and achieve practical running times [66, 26, 64]. A web-scale search for similar images, e.g., Google Images (www.google.com/imghp), Tineye (www.tineye.com), or RevIMG (www.revimg.com), perhaps best exemplifies such an application. Moreover, a single query is often just an intermediate step, and aggregation of results over many such queries is required for the overall retrieval process. For example, in image search, multiple image keypoints are generally searched. The strict accuracy of a single keypoint search is, therefore, not critical, since the overall aggregation can generally tolerate small errors.
With this relaxation, the NN and kNN problems are transformed into the ANN (approximate nearest neighbor) and kANN problems respectively, where an error, controlled by a parameter c that defines the approximation ratio (defined formally in Def. 1), is allowed. The retrieved objects are such that their distance from q is at most c times the distance of the corresponding actual nearest objects.
The locality sensitive hashing (LSH) family [34] is one of the most popular methods capable of efficiently answering kANN queries in sublinear time. However, these methods are not scalable owing to the large space required for constructing the index, which is superlinear in the size of the dataset [24]. While there have been many recent improvements [65, 26, 44] of the original LSH method aimed at reducing the indexing space, they still require asymptotically superlinear space and, thus, do not scale to very large datasets. A recent proposition, SRS [64], addresses this problem by introducing a linear-sized index while retaining high efficiency, and is the state of the art in this class of techniques.
A notable practical limitation of these techniques with theoretical guarantees is that the error is measured using the approximation ratio c. Given that the task is to identify an ordered set of the k nearest objects to a query point q, the rank at which an object is retrieved is significant. Since c is characterized by the ratio of distances between points in a vector space, a low (close to 1) approximation ratio does not always guarantee a similar relative ordering of the retrieved objects. This problem worsens with increasing dimensionality d, as the distances tend to lose their meaning and become more and more similar to each other [14]. Mathematically, $\lim_{d \to \infty} \frac{D_{max} - D_{min}}{D_{min}} \to 0$, where $D_{max}$ and $D_{min}$ are the maximum and minimum pairwise distances between a finite set of points. Hence, it is not surprising that c rapidly approaches 1 in extremely high-dimensional spaces. Thus, while c might be a good metric in low- and medium-dimensional spaces, it loses its effectiveness with an increasing number of dimensions.
Thus, for assessing the quality of kANN search, especially in high-dimensional spaces, a metric that gives importance to the retrieval rank is warranted. Mean Average Precision (MAP), which is widely used and accepted by the information retrieval community [49] for the evaluation of ranked retrieval tasks, fits perfectly. MAP@k (defined formally in Def. 3) takes into account the relative ordering of the retrieved objects by imparting higher weights to objects returned at ranks closer to their original rank. Additionally, it has been shown to possess good discrimination and stability [49]. Thus, MAP should be the target metric for evaluating quality.
To highlight the empirical difference between the two metrics, we plot both MAP@10 and the approximation ratio for different techniques and real datasets (details in Sec. 5) in Fig. 1 (these and other figures look better in color). It is clear that good approximation ratios can lead to poor MAP values (e.g., in Fig. 1(a), C2LSH and SRS both achieve good, near-optimal approximation ratios, yet markedly lower MAP values). Moreover, the relative orders may be flipped (such as in Fig. 1(b), where C2LSH offers a better ratio but poorer MAP than SRS). On the whole, this shows that MAP cannot be improved by just having better approximation ratios.
To the best of our knowledge, there is no method that can guarantee a bound on the MAP. Thus, the existing techniques that provide bounds on the approximation ratio cannot be deemed advantageous in practice for the task of kANN in high-dimensional spaces.
We next analyze another class of techniques that rely on space-filling curves (such as the Hilbert or Z-order curves) [59] for indexing high-dimensional spaces [66, 51, 46, 42, 40]. Although devoid of theoretical guarantees, these methods exhibit good retrieval efficiency and quality. The main idea is to construct one or more single-dimensional orderings of the high-dimensional points, which are then indexed using a standard structure such as the B+-tree for efficient querying. These structures are based on the guarantee that points closer in the transformed space are necessarily close in the original space as well. The converse, however, is not true, and serves as the source of inaccuracy.
The leaf node of a B+-tree can either store the entire object, which improves efficiency while hampering scalability, or store a pointer to the database object, which improves scalability at the cost of efficiency. In high dimensions, only a few objects can fit in a node; hence, these methods do not scale gracefully to high-dimensional datasets due to their large space requirements.
In light of the above discussion, the need is to design a scheme that congregates the best of both worlds (the one-dimensional representation of space-filling curves and the hierarchical pruning of B+-trees, with leaves modified as explained later) to ensure both efficiency and scalability.
To this end, in this paper, we propose a hybrid index structure, HD-Index (High-Dimensional Index). We focus on solving the kANN query for large disk-resident datasets residing in extremely high-dimensional spaces.
We first divide the d-dimensional space into p disjoint partitions. For each lower-dimensional partitioned space, we construct a hierarchical tree structure using Hilbert space-filling curve orderings. We then propose a novel design called RDB-trees (Reference Distance B+-trees), where, instead of object descriptors or pointers, the distances to a fixed set of reference objects are stored in the leaves. While querying, in addition to distance approximation using the triangle inequality, we incorporate a more involved lower-bounding filter known as the Ptolemaic inequality [30]. Instead of using a single reference object, it approximates the distance using a pair of reference objects. Although costlier to compute, it yields tighter lower bounds. The novel design of the RDB-tree leaves allows easy application of these filtering techniques using the reference objects.
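The two lower-bounding filters can be sketched as follows. Given pre-computed distances to reference objects, both bound the query-to-object distance from below without touching the object itself (a minimal illustration; the function and variable names are ours):

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triangle_lb(d_q_p, d_o_p):
    """Triangle-inequality lower bound on d(q, o) via one reference p:
    d(q, o) >= |d(q, p) - d(o, p)|."""
    return abs(d_q_p - d_o_p)

def ptolemaic_lb(d_q_p1, d_q_p2, d_o_p1, d_o_p2, d_p1_p2):
    """Ptolemaic lower bound on d(q, o) via a pair of references p1, p2.
    Euclidean space satisfies Ptolemy's inequality, which yields
    d(q, o) >= |d(q, p1) * d(o, p2) - d(q, p2) * d(o, p1)| / d(p1, p2)."""
    return abs(d_q_p1 * d_o_p2 - d_q_p2 * d_o_p1) / d_p1_p2
```

An object whose lower bound already exceeds the current k-th nearest distance can be discarded without a single full d-dimensional distance computation.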
Contributions. In sum, our contributions in this paper are:


We propose a novel, scalable, practical, and yet simple indexing scheme for large, extremely high-dimensional Euclidean spaces, called HD-Index (Sec. 3).

We design a novel structure called the RDB-tree, with leaves designed to store distances to reference objects in order to cater to extremely high-dimensional spaces (Sec. 3.2).

We devise an efficient and effective querying algorithm (Sec. 4) that leverages lower-bounding filters to prune the search space.

We advocate the use of mean average precision, or MAP (Def. 3), as a better metric than the approximation ratio for benchmarking techniques in terms of the quality of the retrieved top-k objects, especially in high-dimensional spaces. We motivate the need for MAP in real-world applications through an actual information retrieval application, that of image search (Sec. 5.5).

Through an in-depth empirical analysis (Sec. 5), we show that HD-Index scales gracefully to massive high-dimensional datasets on commodity hardware, and provides order-of-magnitude speed-ups in running time while providing significantly better quality results than the state-of-the-art techniques.
2 Background
2.1 Preliminaries
We consider a dataset D with n d-dimensional points spanning a vector space. Given a query q and the Euclidean distance function dist ($\ell_2$ norm), the k-nearest-neighbor (kNN) query retrieves the ordered set $O = \{o_1, \dots, o_k\}$ of the k nearest neighbors of q from D such that $dist(q, o_1) \leq dist(q, o_2) \leq \dots \leq dist(q, o_k)$. The approximate k-nearest-neighbor (kANN) problem relaxes this strict requirement and aims to retrieve a set of size k that contains close neighbors of q.
We next describe two metrics to assess the quality of the retrieved kANN answer set. Suppose the true k nearest neighbors of q are $O = \{o_1, \dots, o_k\}$ in order, i.e., $o_1$ is closest to q, $o_2$ is the next closest, and so on. Assume that the k approximate nearest neighbors returned are $A = \{a_1, \dots, a_k\}$, again in order.
Definition 1 (Approximation Ratio)
The approximation ratio c is simply the average of the ratios of the distances:
$c = \frac{1}{k} \sum_{i=1}^{k} \frac{dist(q, a_i)}{dist(q, o_i)}$   (1)
Definition 2 (Average Precision)
Consider the returned item $a_i$ at rank $i$. If $a_i$ is not relevant, i.e., it does not figure in the true set $O$ at any rank, its precision is $0$. Otherwise, the precision at position $i$ is captured: if, out of the items $a_1, \dots, a_i$, $j$ of them figure in the true set, then the precision is $P(i) = j/i$. The average of these values for ranks $1$ to $k$ is the average precision at $k$, denoted by AP@k:
$AP@k = \frac{1}{k} \sum_{i=1}^{k} P(i) \cdot \mathbb{1}[a_i \in O]$   (2)
$\mathbb{1}$ is the indicator function: $\mathbb{1}[a_i \in O] = 1$ if $a_i \in O$; and $0$ otherwise.
The AP function, thus, rewards correct answers at earlier ranks.
Definition 3 (Mean Average Precision)
The mean average precision, denoted by MAP@k, is the mean over the AP@k values for a set Q of queries:
$MAP@k = \frac{1}{|Q|} \sum_{q \in Q} AP@k(q)$   (3)
AP and, thus, MAP lie in the range $[0, 1]$.
Example 1
Suppose the true ranked order for a query is $O = (o_1, o_2, o_3)$, and consider the returned order $A = (o_4, o_3, o_1)$. Since $o_4$ is not relevant, its precision is $0$. Next, at rank $2$, only $o_3$ is relevant; thus, the precision is $1/2$. Similarly, the precision at rank $3$ is $2/3$. The AP, therefore, is $(0 + 1/2 + 2/3)/3 = 7/18 \approx 0.39$.
The AP for the same set of returned elements but in a different ranked order, $A' = (o_3, o_1, o_4)$, is $(1/1 + 2/2 + 0)/3 = 2/3 \approx 0.67$.
Hence, the MAP for $A$ and $A'$ is $(7/18 + 2/3)/2 = 19/36 \approx 0.53$.
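The AP@k and MAP@k computations can be reproduced with a short script (helper names are ours):

```python
def average_precision(returned, truth):
    """AP@k (Def. 2): average over ranks i of precision-at-i,
    counting only ranks where the returned item is relevant."""
    truth_set = set(truth)
    hits, total = 0, 0.0
    for i, item in enumerate(returned, start=1):
        if item in truth_set:        # indicator function
            hits += 1
            total += hits / i        # precision P(i) = j/i
    return total / len(returned)

def mean_average_precision(runs, truth):
    """MAP@k (Def. 3): mean of AP@k over a set of result lists."""
    return sum(average_precision(r, truth) for r in runs) / len(runs)

truth = ["o1", "o2", "o3"]
ap1 = average_precision(["o4", "o3", "o1"], truth)   # (0 + 1/2 + 2/3) / 3 = 7/18
ap2 = average_precision(["o3", "o1", "o4"], truth)   # (1 + 1 + 0) / 3 = 2/3
```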
2.2 Related Work
The problem of nearest neighbor searches, more popularly known as the kNN query, has been investigated by the database community for decades. The basic technique of scanning the entire data does not scale for large disk-resident databases and, consequently, a host of indexing techniques (e.g., the R-tree [28] and M-tree [21] families) have been designed to efficiently perform the kNN query [14, 61].
2.2.1 Hierarchical Structures
The R-tree family of structures builds a hierarchy of data objects, with the leaf pages containing the actual data points. At each level, the MBR (minimum bounding rectangle) corresponding to an internal node covers all its children. The branching factor indicates how many child nodes can be packed within one disk page. The MBRs can be hyper-rectangular (R-tree [28], R*-tree [9]), hyper-spherical (SS-tree [72]), or mixed (SR-tree [38]).
Unfortunately, for datasets with large dimensionality, the infamous “curse of dimensionality” [14] renders almost all these hierarchical indexing techniques inefficient, and linear scan becomes the only feasible solution [71]. Structures such as the VA-file [71] have, therefore, been designed to compress the data and perform the unavoidable linear scan faster. Techniques to specifically handle the problem of high dimensionality have also been proposed (e.g., the pyramid technique [11], iMinMax [55], iDistance [73], etc.). These structures map a high-dimensional point to a single number and then index that number using a B+-tree [22]. A specialized hierarchical structure, the X-tree [12], on the other hand, adjusts itself with growing dimensionality; in the extreme case, it resembles a simple flat file and the search degrades to a linear scan. To beat the curse of dimensionality, compression through quantization has also been employed. The A-tree [60] uses a fixed number of bits per dimension to quantize the data, thereby enabling more child nodes to be packed in an internal node. The IQ-tree [10] uses a different number of quantization bits depending on the number of points in a data MBR, and is restricted to only two levels of hierarchy before it indexes the leaf pages directly. Both of them, however, produce the exact correct answer in the end. A survey of such high-dimensional techniques, especially in the context of multimedia databases, is available in [15].
Hierarchical index structures of the M-tree family [21] exploit the triangle inequality of metric distances. These structures do not need a vector space representation of the points and work directly with the distances between pairs of points. Nevertheless, at high dimensions, the Euclidean distance itself tends to lose its meaning [14], and these structures fail as well.
2.2.2 Reference-Object-Based Structures
Instead of hierarchical tree-based structures, another important line of research has been the use of reference objects, also called landmarks or pivots [18, 56]. Certain database points are marked as reference objects, and the distances of all other points from one or more of these reference objects are pre-computed and stored. Once the query arrives, its distances to the reference objects are computed. Using the triangle inequality (assuming that the underlying distance measure is a metric), certain points can be pruned without actually computing their distance to the query. AESA [58, 68] and LAESA [52] use this generic scheme.
Marin et al. [50] employed clustering and used the cluster representatives as reference objects. Amato et al. [7] proposed inverted distance mappings from a set of representative points, based on the premise that, in a metric space, if two objects are close to each other, then the relative orderings of the data points from these two objects are also similar. Chen et al. [20] used representative objects to obtain a vector of distances for each point; the one-dimensional ordering of these vectors was then indexed using a B+-tree.
The choice of representative points has been governed by various factors, such as maximizing the distance to the already chosen set [16, 18], maximizing the minimum distance among themselves [23], location on the boundary of the dataset [36], whether they are outliers [17], a nearest-neighbor-distance loss function [29], minimizing the correlation among themselves [41], or a sample of queries [67]. Chen et al. [20] include a survey of methods to choose representative objects.
2.2.3 Space-Filling Curves
Space-filling curves [59, 14] provide an intuitive way of approximating the underlying data space by partitioning it into grid cells. A one-dimensional ordering is then imposed on the grid cells such that the resultant curve covers each grid cell exactly once. The sequence number of an occupied grid cell along the curve is its one-dimensional index, and a multi-dimensional point is indexed by the one-dimensional index of the grid cell it occupies. The basic property of these curves ensures that when two points are close in the one-dimensional ordering, they are necessarily close in the original high-dimensional space. The reverse may not be true, though, due to “boundary” effects, where the curve visits a neighboring cell only after a long time. Among the different types of space-filling curves, the Hilbert curve is considered the most appropriate for indexing purposes [37].
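As a small illustration of such a linearization, the classic iterative algorithm for the two-dimensional Hilbert curve maps a grid cell to its one-dimensional key (a sketch for intuition only; higher-dimensional curves as used later require the Butz algorithm):

```python
def xy2d(n, x, y):
    """Map grid cell (x, y) of an n x n grid (n a power of two) to its
    1-D Hilbert key, using the classic iterative bit-manipulation scheme."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the pattern recurses correctly
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Each of the n² cells receives a distinct key, and cells with consecutive keys are spatially adjacent, which is exactly the property the index exploits.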
To diminish the boundary effects, multiple space-filling curves have been employed. Deterministic translation vectors were used in [42], whereas [62] used random transformations or scalings of the data points. Valle et al. [66] proposed an alternative where a different subset of the dimensions is responsible for each curve. This helps reduce the dimensionality of the space handled by each curve. Different B+-trees, each corresponding to a particular Hilbert curve, are built. When a query point arrives, the nearest neighbor searches in the different B+-trees aggregate points, and the approximate nearest neighbors are finally chosen from this aggregated set.
2.2.4 Locality-Sensitive Hashing
The locality-sensitive hashing (LSH) family [34] of randomized algorithms adopts a completely different approach. The database points are probabilistically hashed to multiple hash tables. The hash functions follow two crucial properties: (1) if two points are close enough, they are very likely to fall in the same hash bucket, and (2) if two points are far off, they are quite unlikely to fall in the same hash bucket. At run time, the query point is also hashed, and the points falling in the same hash buckets corresponding to each hash table form the candidates. The final approximate nearest neighbor set is chosen from these candidates.
The basic LSH scheme [34] was extended for use in Euclidean spaces by E2LSH [24]. C2LSH [26] was specifically designed on top of E2LSH to work with high-dimensional spaces. LSB-forest [65] builds multiple trees to adjust to the NN distance. Sun et al. devised SRS [64] to have a small index footprint so that the entire index structure can fit in main memory.
Recently, a query-aware data-dependent LSH scheme, QALSH, has been proposed [33]. The bucket boundaries are not decided a priori, but only after the query arrives, in relation to the position of the query. As a result, accuracy improves.
Although the LSH family [34] provides nice theoretical guarantees in terms of running time, space, and approximation ratio, importantly, it provides no guarantees on MAP.
Symbol  Definition

D  Database
n  Size of database
o_i  Database object
d  Dimensionality of each object
D_j  Partition of dimensions
p  Total number of partitions
H_j  Hilbert curve corresponding to D_j
t  Dimensionality of each Hilbert curve (t = d/p)
h  Order of Hilbert curve
T_j  RDB-tree corresponding to H_j
f  Order of RDB-tree leaf
P  Disk page size
R  Set of reference objects
m  Number of reference objects
q  Query object
c_1  Number of candidates after RDB-tree search
c_2  Number of candidates after triangle inequality
c_3  Number of candidates after Ptolemaic inequality
C  Final candidate set
|C|  Size of final candidate set
A  Final answer set
k  Number of nearest neighbors returned
2.2.5 In-Memory Indexing Techniques
Flann [53] uses an ensemble of randomized KD-forests [63] and priority K-means trees to perform fast approximate nearest neighbor search. Its key novelty lies in the automatic selection of the best method in the ensemble, along with the identification of optimal parameters for a given dataset. Recently, Houle et al. [31] proposed a probabilistic index structure called the Rank Cover Tree (RCT) that entirely avoids numerical pruning techniques such as the triangle inequality and instead uses rank-based pruning. This alleviates the curse of dimensionality to a large extent.
Proximity-graph-based kANN methods [70, 69, 8, 47] have gained popularity, among which the hierarchical navigable small world method, HNSW [48], has been particularly successful.
Quantization-based methods aim at significantly reducing the size of the index structure, thereby facilitating in-memory querying of approximate nearest neighbors. The product quantization (PQ) method [35] divides the feature space into disjoint subspaces and quantizes them independently. Each high-dimensional data point is then represented by just a short code composed of the subspace quantization indices, which facilitates efficient distance estimation. The Cartesian K-means (CK-Means) [54] and optimized product quantization (OPQ) [27] methods were proposed later as an improvement and a generalization of PQ, respectively.
Despite their attempts to reduce the index size, the techniques discussed above find it difficult to scale to massive high-dimensional databases on commodity hardware, and instead require machines possessing enormous amounts of main memory.
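The PQ idea can be conveyed with a toy sketch (randomly drawn codebooks stand in for the k-means-trained ones, the sizes are deliberately tiny, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_sub, n_centroids = 8, 2, 4       # toy sizes; real PQ uses e.g. 8 subspaces x 256 centroids
sub_d = d // n_sub

# One codebook of centroids per subspace (trained with k-means in real PQ;
# random here purely for illustration).
codebooks = rng.normal(size=(n_sub, n_centroids, sub_d))

def encode(x):
    """Replace each subvector of x by the index of its nearest centroid."""
    code = []
    for j in range(n_sub):
        sub = x[j * sub_d:(j + 1) * sub_d]
        code.append(int(np.argmin(np.linalg.norm(codebooks[j] - sub, axis=1))))
    return code

def decode(code):
    """Approximate reconstruction of a vector from its short code."""
    return np.concatenate([codebooks[j][c] for j, c in enumerate(code)])

x = rng.normal(size=d)
code = encode(x)          # d floats compressed to n_sub small integers
x_hat = decode(code)      # lossy reconstruction used for distance estimation
```

Distances to the query can then be estimated from the codes alone, which is what makes the index fit in memory.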
2.2.6 Representative Methods
For comparison, we choose the following representative methods: iDistance [73] (an exact search technique), Multicurves [66] (hierarchical index using space-filling curves), SRS [64] (low memory footprint), C2LSH [26] (low running time), QALSH [33] (data-dependent hashing with high quality), OPQ [27] (product quantization), and HNSW [48] (graph-based method), since they claim to outperform the other competing methods in terms of running time, space overhead, or quality.
3 HD-Index Construction
In this section, we describe in detail the construction of our index structure, HD-Index (Algo. 1). Fig. 2 shows the overall scheme of our method. We use the terminology listed in Table 1.
Since we specifically target datasets in high-dimensional vector spaces, we first divide the entire set of d dimensions into smaller partitions. A Hilbert curve is then passed through each partition. The resultant one-dimensional values are indexed using a novel structure called the RDB-tree. While the upper levels of an RDB-tree resemble those of a B+-tree, the leaves are modified.
We next explain each of the fundamental steps in detail.
3.1 Space-Filling Curves
The total number of dimensions d is first divided into p disjoint and equal partitions. Although the partitioning can be non-equal and/or overlapping as well, we choose an equal and contiguous partitioning scheme for two reasons: (1) it is simpler, and (2) the other partitioning schemes do not give any tangible benefit since we assume that the dimensions are independent. This claim is empirically verified in Sec. 5.2.1.
For each partition, a Hilbert curve, catering to t = d/p dimensions, is constructed using the Butz algorithm [19]. The order h of a Hilbert curve specifies the number of bits used to describe every dimension. Thus, if the order is h, each dimension is divided into 2^h equal grid partitions. The single-dimensional value of an object in the Hilbert curve ordering is called its Hilbert key.
As an example, consider the 4-dimensional dataset of objects shown in Table 2. The dimensions are broken into two partitions, {1, 2} and {3, 4}. The Hilbert curves for these two partitions are shown in parts (a) and (b) of the corresponding figure. The Hilbert keys for the objects are obtained by considering the sorted order along the curve, and are shown in the last two columns of Table 2.
Multiple Hilbert curves are beneficial not only in reducing the dimensionality but also in tackling the “boundary” effect: it is unlikely that nearby objects will have far-off Hilbert keys in all the partitions. In part (a), for instance, one pair of objects with adjacent Hilbert keys is indeed nearby in space, while another pair, despite being close in space, receives far-off Hilbert keys; in part (b), the keys of the latter pair are adjacent. It is enough for an object that is a true nearest neighbor to have a close Hilbert key in only one partition to be chosen as a candidate.
Object  Dim 1  Dim 2  Dim 3  Dim 4  HK 1  HK 2

o1  0.20  0.74  0.68  0.73  3  5
o2  0.84  0.34  0.49  0.81  6  5
o3  0.97  0.64  0.32  0.93  5  3
o4  0.42  0.86  0.12  0.82  4  2
o5  0.62  0.09  0.56  0.07  7  7
o6  0.84  0.59  0.49  0.73  5  4
o7  0.05  0.43  0.52  0.82  2  6
o8  0.40  0.24  0.10  0.64  1  1
o9  0.18  0.87  0.76  0.23  –  –
3.2 RDB-trees
Corresponding to the p Hilbert curves, p RDB-trees are next constructed using the Hilbert keys of the objects.
The reasons for not adopting the standard B+-tree leaf structure are as follows. If only pointers to the actual objects are stored in the leaf, then accessing the objects requires random disk accesses. The number of such random disk accesses can be reduced by storing the entire object descriptor (and not a pointer to it) in the leaf itself. However, in high dimensions, the number of objects that can then be stored in a leaf is extremely low; for example, with a page size of 4 KB, only a handful of objects with dimensionality in the hundreds can fit in a page. Hence, neither design is suitable for high-dimensional spaces.
We, therefore, modify the leaves to store distances to reference objects. Reference objects are specially designated objects from the dataset that are used to quickly obtain approximate distances to the query. We explain the querying process using reference objects in Sec. 4 and how they are chosen in Sec. 3.3. In this section, we assume that the set R of m reference objects is known.
For each object, the distances to all m reference objects are stored as part of its description. Hence, as long as m < d, where d is the dimensionality of the object, more objects can fit in a page of an RDB-tree than of a B+-tree (assuming that the storage requirement for a distance value is the same as that for a dimension).
In Sec. 5, we show that even for extremely large datasets, a reference set of size m = 10 suffices. Thus, for a d-dimensional space with d ≫ m, the number of random disk accesses required to retrieve the same number of candidate objects is reduced by a factor of almost d/m.
Dataset  Dimensions, d  Hilbert order, h  Dimensions per curve, t  Reference objects, m  RDB-tree leaf order, f

SIFTn  128  8  16  10  63
Yorck  128  32  16  10  36
SUN  512  32  64  10  13
Audio  192  32  24  10  28
Enron  1369  16  37  10  18
Glove  100  32  10  10  40
The order f of the RDB-tree leaves is computed as follows. Storing the Hilbert key of an object requires t·h/8 bytes, since each of the t dimensions is described by h bits. The distances to the m reference objects require m·s_d bytes, where s_d bytes store a single distance value, and the pointer to the complete object descriptor consumes a further s_p bytes. Thus, if there are f objects in an RDB-tree leaf, its storage requirement is f·(t·h/8 + m·s_d + s_p) bytes. Adding the overheads of the left and right pointers to the sibling leaves (s_p bytes each) and an indicator byte marking the node as a leaf, if the page size is P bytes,
f · (t·h/8 + m·s_d + s_p) + 2·s_p + 1 ≤ P   (4)
The order f is the largest integer that satisfies Eq. (4). Table 3 lists the RDB-tree leaf orders for the datasets we use.
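Under illustrative byte-size assumptions (4-byte distance values and pointers; these constants are ours, not from the text), the leaf order implied by Eq. (4) can be computed as:

```python
def leaf_order(page_bytes, t, h, m, dist_bytes=4, ptr_bytes=4):
    """Largest number of entries f that fit in one RDB-tree leaf page.
    Each entry holds the Hilbert key (t dimensions at h bits each),
    m reference distances, and a pointer to the full object descriptor."""
    key_bytes = (t * h + 7) // 8                    # round bits up to bytes
    entry_bytes = key_bytes + m * dist_bytes + ptr_bytes
    overhead = 2 * ptr_bytes + 1                    # sibling pointers + leaf flag
    return (page_bytes - overhead) // entry_bytes

f = leaf_order(4096, t=16, h=8, m=10)               # assuming a 4 KB page
```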
Assuming that two reference objects are chosen from the example dataset, the RDB-trees constructed using the Hilbert keys are shown in part (c) of the example figure. The leaves contain the actual distances to the two reference objects.
3.3 Reference Object Selection
To avoid computing the actual distance between every object and the query at run time, reference objects are used to approximate the distances and quickly filter out objects. We describe the details of how the distance is approximated later in Sec. 4. In this section, we describe the desirable properties of the set of reference objects and how they are chosen.
The nature of the distance approximation is such that the closer a reference object is to the query, the better the approximation. Hence, the set of reference objects should be well spread out in the data space. This ensures that no matter where the query is located, there is at least one reference object that is not too far away from it and, hence, the approximation of distances to other objects using this reference object is good. The above objective assumes that any reference object can be used to approximate the distance for any other object; in other words, the purview of every reference object is the entire dataset. Thus, the distances from every object to all the reference objects must be pre-computed and stored.
Considering the above issues, a good choice for reference object selection is the sparse spatial selection (SSS) algorithm [56]. The method first estimates the largest distance, d_max, between any two objects in the database using the following heuristic. A random object is selected and its farthest neighbor is found. This neighbor is then selected, and the process of finding the farthest object is repeated. The process continues till convergence, or for a fixed number of iterations.
Once d_max is estimated, an object is chosen randomly as the first reference object. In every iteration, the dataset is scanned till an object is found whose distance to all the previously selected reference objects is greater than a fraction α of d_max. This object is added to the set R. The process is continued till |R| = m.
The SSS dynamic (SSS-Dyn) [18] method extends the above by continuing the process beyond the first m selections. For every new object that satisfies the fraction condition, it chooses a victim from the current set: the reference object that contributes the least towards approximating the distances between a set of (pre-selected and fixed) object pairs. If the contribution of the new object is better than that of the victim, it replaces the victim as a reference object. This goes on till there are no more objects that satisfy the condition. The choice of the method is discussed in Sec. 5.2.2.
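The basic SSS procedure can be sketched as follows (the farthest-hop iteration count, the fraction α = 0.4, and taking the first object rather than a random one as the initial reference are illustrative simplifications of ours):

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate_dmax(data, iters=5):
    """Heuristic estimate of the largest pairwise distance: repeatedly
    hop to the farthest neighbor of the current object."""
    cur = random.choice(data)
    best = 0.0
    for _ in range(iters):
        far = max(data, key=lambda o: dist(cur, o))
        best = max(best, dist(cur, far))
        cur = far
    return best

def sss(data, m, alpha=0.4):
    """Sparse spatial selection: scan the data, keeping any object whose
    distance to every reference chosen so far exceeds alpha * d_max,
    until m references are selected."""
    dmax = estimate_dmax(data)
    refs = [data[0]]
    for o in data[1:]:
        if len(refs) == m:
            break
        if all(dist(o, r) > alpha * dmax for r in refs):
            refs.append(o)
    return refs
```

On a uniform grid, the selected references end up spread across the corners of the space, which is exactly the "well spread" property argued for above.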
The number of reference objects, m, determines the amount of pre-processing and storage required for each database object. More importantly, since the RDB-tree leaves store the distances to these reference objects, it determines the order of the RDB-trees at the leaf level, as shown in Eq. (4).
The offline index-building phase is completed with the construction of the RDB-trees. We refer to this union of the p RDB-trees, each storing Hilbert keys of the database points from a t-dimensional subspace, with the leaves storing the distances to the reference objects, as the HD-Index (High-Dimensional Index).
3.4 Tuning of Parameters
The order h of the Hilbert curve should be such that there is no large loss in the representation after quantization using h bits. It therefore depends on the domain from which the individual dimensions of the descriptors draw their values. Table 4 shows the details of the datasets, including their domain sizes, while Table 3 lists the Hilbert curve orders and other parameters.
If the number of partitions p is larger, the larger number of Hilbert curves increases the number of RDB-tree searches, thereby increasing the overall search time. Simultaneously, however, t decreases, which results in a better approximation by each Hilbert curve, thereby increasing the accuracy. Hence, the value of p (or, equivalently, t) provides a handle to balance the trade-off between running time and accuracy. We discuss the tuning of these parameters experimentally in Sec. 5.
From our experiments, we observed that the effect of varying the fraction α on the overall quality and efficiency of the reference object selection method is minimal. Hence, we chose a fixed value of α.
3.5 Analysis
3.5.1 Time and Space Complexities
We first analyze the index construction time of HD-Index.
Estimating d_max for reference object selection requires O(n) distance computations, since the farthest-neighbor iterations are continued for at most a (constant) threshold number of times. During the scan, each potential reference object is checked against all the existing reference objects. Assuming a worst case of n potential reference objects, each checked against at most m references, the total time for the selection is O(n·m·d).
Computing the distances between the n objects and the m references having dimensionality d consumes O(n·m·d) time.
Mapping a single point to a Hilbert key of order h catering to a dimensionality of t requires O(h·t) time; thus, the time for n objects over all p curves is O(n·p·h·t) = O(n·h·d). Constructing one RDB-tree requires O(n·log_B n) time, where B is the branching factor of the internal nodes. Since there are p RDB-trees, the total time for this phase is O(p·n·log_B n).
The total construction time of HD-Index is, therefore, O(n·m·d + n·h·d + p·n·log_B n).
We next analyze the space requirements of HD-Index. The original data objects require O(n·d) storage. In each of the p RDB-trees, and for each of the n database objects, the following are stored: (1) the Hilbert key composed from t dimensions, (2) a pointer to the original object, and (3) the distances to the m reference objects. This requires a total of O(p·n·(t·h + m)) space. The rest of each RDB-tree requires O(n) space to store the internal nodes consisting of Hilbert keys. Hence, the total space requirement of HD-Index is O(n·d + p·n·(t·h + m)).
3.5.2 Discussion
In practice, these parameters, including the height of the RDB-tree, are small constants (as verified experimentally in Sec. 5 as well). Hence, the construction time and space complexities of HD-Index are essentially linear in the size of the input. This linear scaling allows HD-Index to scale gracefully to extremely large datasets and high dimensionality.
3.6 Handling Updates
In this section, we describe how updates are handled. We only discuss insertions, since deletions can be handled by simply marking the object as "deleted" and not returning it as an answer. Since B+-trees are naturally update-friendly, inserting a new point into an RDB-tree requires only a small amount of computation: the Hilbert key of the new object and its distances to the reference objects.
Although the set of reference objects may change with insertions, since the performance with even random reference objects is quite close in quality to that of a carefully chosen set (Sec. 5.2.2), we do not recompute the set with every new insertion. Also, the number of updates, if any, is generally insignificant with respect to the total number of objects and, hence, there is little merit in updating the reference set.
4 Querying
We next describe the kNN search procedure for a query object residing in the same Euclidean space as the dataset (Algo. 2).
The querying process follows three broad steps: (i) retrieving candidate objects from the RDB-trees using the Hilbert keys, (ii) refining the candidate set by applying distance approximation inequalities, and (iii) retrieving the approximate nearest neighbors by computing the actual distances from the candidates.
We next describe each of them in detail.
4.1 Retrieving Candidates from RDB-trees
The dimensions of the query are divided into groups according to the same partitioning scheme as the dataset.
For each of these partitions, the corresponding RDB-tree is searched with the Hilbert key of the query computed over the dimensions of that partition. The nearest candidate objects are returned. These objects are close in the subspace represented by the partition and, therefore, have the potential to be close in the entire space as well.
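The leaf-level search can be sketched as a binary search over the sorted 1-D keys, returning the candidates whose keys are nearest to the query's key (the tree descent is abstracted away; all names are illustrative):

```python
import bisect

def nearest_by_key(sorted_keys, qkey, c):
    """Return indices of the c entries whose 1-D keys are closest to qkey,
    expanding outward from the insertion point (stand-in for the leaf scan
    that follows an RDB-tree descent)."""
    i = bisect.bisect_left(sorted_keys, qkey)
    lo, hi = i - 1, i
    out = []
    while len(out) < min(c, len(sorted_keys)):
        if lo < 0:                       # ran off the left end
            out.append(hi); hi += 1
        elif hi >= len(sorted_keys):     # ran off the right end
            out.append(lo); lo -= 1
        elif qkey - sorted_keys[lo] <= sorted_keys[hi] - qkey:
            out.append(lo); lo -= 1      # left neighbor is closer
        else:
            out.append(hi); hi += 1      # right neighbor is closer
    return out
```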
Returning to our example in Fig. 3, assuming , the first RDB-tree returns while the second one returns .
4.2 Refining Candidates using Approximation
The objects are then successively refined by applying two distance approximation schemes, the triangular inequality and the Ptolemaic inequality, using the reference objects. As a prerequisite, the distances of the query to all the reference objects are computed.
The triangular inequality using a reference object r lower bounds the distance between the query q and an object o as follows: d(q, o) ≥ |d(q, r) − d(o, r)|.
Using m reference objects produces m lower bounds for an object o. The best lower bound is the one that is maximal and is, therefore, considered as the approximate distance of o from q:
d_tri(q, o) = max_{i=1,…,m} |d(q, r_i) − d(o, r_i)|.  (5)
Using this approximate distance, the candidates are sorted and the nearest ones to the query are chosen as the next set of candidates for the application of the next distance approximation, the Ptolemaic inequality [30], which is respected by the Euclidean distance.
The Ptolemaic inequality uses two reference objects r_i and r_j to lower bound the distance: d(q, o) ≥ |d(q, r_i) · d(o, r_j) − d(q, r_j) · d(o, r_i)| / d(r_i, r_j).
The best (i.e., maximal) lower bound among the m(m − 1)/2 reference-pair possibilities is considered as the approximate distance:
d_pto(q, o) = max_{i ≠ j} |d(q, r_i) · d(o, r_j) − d(q, r_j) · d(o, r_i)| / d(r_i, r_j).  (6)
The nearest candidates, chosen from the previous candidates using Eq. (6), form the final candidate set from the RDB-tree.
The order of applying the inequalities is important as the approximation using Ptolemaic inequality is costlier but better than the triangular inequality.
The candidates from each RDB-tree are merged to form the final candidate set. Its size is smallest when all the partitions produce the same set and largest when the partitions produce completely disjoint sets.
4.3 Retrieving k Nearest Neighbors
Finally, the complete descriptions of the candidate objects in the merged set are accessed using the corresponding object pointers from the RDB-trees, and the actual distances from the query are computed. The nearest objects are returned as the kANN answer set.
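This final refinement step can be sketched as follows (fetch stands in for the object-pointer lookup; names are illustrative):

```python
import heapq
import numpy as np

def refine(query, candidates, fetch, k):
    """Fetch the full vectors of the merged candidate ids (simulating the
    object-pointer lookups) and return the k closest to the query by
    actual Euclidean distance."""
    scored = ((float(np.linalg.norm(query - fetch(oid))), oid)
              for oid in candidates)
    return [oid for _, oid in heapq.nsmallest(k, scored)]
```

Using a bounded heap keeps the selection cost low even when the merged candidate set is much larger than k.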
4.4 Analysis
4.4.1 Time and Space Complexities
We first analyze the time complexity of the querying phase.
The cost of distance computation from the query to all the reference objects is .
For each of the partitions, the following costs are paid. Retrieval of candidates from an RDB-tree requires time where is the branching factor of the RDB-tree internal nodes. Computing the triangular inequalities for the candidates using the reference objects takes time due to preprocessing. By using a heap, the nearest candidates are chosen in time. Subjecting the candidates to the Ptolemaic inequalities requires time. Finally, the candidates are extracted using a heap in time. The total cost over the partitions is, therefore, .
Computing the actual distances for the candidates requires time. Since the merged candidate set is bounded, this phase runs in time.
Adding up the costs of each phase gives the total time.
The extra space overhead is that of the candidates. Since the RDB-trees are processed sequentially, the extra space overhead remains , which is essentially a constant.
Since HD-Index is a disk-based structure, we next analyze the number of random disk accesses needed for a kANN query. The computation of distances from the query to the reference objects requires disk accesses. However, since the reference object set is small, it can be assumed to fit in memory and, therefore, no I/O cost is incurred. Assuming the order of an RDB-tree leaf to be , retrieving the objects requires disk accesses. The computation of the distance inequalities does not require any more disk accesses since the distances to the reference objects are stored in the leaves themselves. Computing the exact distances from the candidates requires disk accesses. Hence, the total number of disk accesses is .
4.4.2 Discussion
As shown in Sec. 5, it suffices to have for even extremely large datasets. Consequently, , and are essentially constants in the overall analysis. Since and are also small constants, the running time of a kANN query can be summarized as , i.e., it is linear in the number of RDB-tree searches, the height of an RDB-tree, and the dimensionality. The extra space consumed during the search is , which is negligible. The number of disk accesses is linear in the number of trees, the height of the trees, and the candidate size, thus allowing HD-Index to scale gracefully to large datasets and high-dimensional spaces.
5 Experiments
All the experiments were done using code written in C++ on an Intel(R) Core(TM) i7-4770 machine with a 3.4 GHz CPU, 32 GB RAM, and 2 TB of external disk space running Linux Ubuntu 12.04. The results are summarized in Sec. 5.6.
Datasets: We present results on large-scale high-dimensional real-world and benchmark datasets, taken from various publicly available sources [6, 5, 3, 4, 2, 1], as described in Table 4. Our datasets constitute a judicious mix of scale, dimensionality, and domain types.


The Audio dataset comprises audio features extracted using the Marsyas library from the DARPA TIMIT speech database [2].
The SUN dataset comprises extremely high-dimensional GIST features of images [5].

The SIFT datasets comprise SIFT features [1] at varying scales.

The Yorck dataset comprises SURF features extracted from images in the Yorck project [6].

The Enron dataset comprises feature vectors of bigrams from a collection of emails [3].

The Glove dataset comprises word feature vectors extracted from tweets [4].
Dataset  Type  Dimensionality  # Objects  Domain of values  # Queries
SIFT10K  Tiny  128  10,000  [0,255]  100
Audio  Small  192  54,287  [-1,1]  10,000
SUN  Small  512  80,006  [0,1]  100
SIFT1M  Medium  128  1,000,000  [0,255]  10,000
SIFT10M  Medium  128  10,000,000  [0,255]  10,000
Yorck  Medium  128  15,120,935  [-1,1]  1,254
SIFT100M  Large  128  100,000,000  [0,255]  10,000
SIFT1B  Large  128  999,494,170  [0,255]  10,000
Enron  Small  1369  93,986  [0,252429]  1,000
Glove  Medium  100  1,183,514  [-10,10]  10,000
Note that SIFT and Enron features are integers, while all other datasets possess floating-point feature values. The Hilbert curve and RDB-tree orders are shown in Table 3.
Next, we state the methods, their parameters and the evaluation metrics used for obtaining the results.
Methods: We compare HD-Index for effectiveness, efficiency, and scalability against a number of representative state-of-the-art methods, namely, Multicurves, C2LSH, SRS, QALSH, OPQ, and HNSW [66, 26, 64, 33, 27, 48]. In addition, we also consider iDistance [73], which is an exact method for kNN queries and will, therefore, always exhibit perfect quality. For all these techniques, we adopt the C++ implementations made available by their authors.
Parameters: The disk page size for the machine on which the experiments were run is bytes. We set the number of nearest neighbors . The parameters for the competing techniques were set as follows: iDistance: initial radius , ; Multicurves: , ; C2LSH: approximation ratio , interval size , false-positive percentage , error probability ; QALSH: approximation ratio , false-positive percentage , error probability ; SRS: algorithm SRS12, approximation ratio , number of 2-stable random projections , early termination threshold , maximum percentage of examined points ; OPQ: number of subspaces ; HNSW: number of nearest neighbors . The search parameters of OPQ and HNSW were set such that their MAP values were close to those of HD-Index. Note that most of these parameters have been set according to the recommendations of their respective authors.
Evaluation Metrics: We consider the following metrics:


Efficiency: We evaluate the efficiency of the methods using the running time of the queries. For fairness, we turn off buffering and caching effects in all the experiments.

Scalability: To evaluate the scalability, we measure (1) the size and (2) construction time of the index structures, and main memory consumption during (3) index construction and (4) querying.
All the reported values are averages over the entire query set.
5.1 Comments
All the datasets are preprocessed to remove duplicate points, if any. We used the sets of queries provided with the Audio and SIFT datasets for our experiments. For the datasets where query sets were not provided, namely, SUN, Yorck, Enron, and Glove, we reserved , , , and random data points respectively as queries. Since the implementations of some of the techniques do not support floating-point data, to enable them to run on datasets with floating-point feature values, the values are scaled by a sufficiently large number, as suggested by the corresponding authors of C2LSH and SRS.
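Such scaling can be sketched as follows (the scaling factor here is an assumed choice; any sufficiently large constant approximately preserves the distance ranking for integer-only implementations):

```python
import numpy as np

def scale_to_ints(X, factor=10**5):
    """Scale floating-point features to integers by a large constant
    and round, so that implementations accepting only integer
    coordinates can index the data."""
    return np.rint(np.asarray(X) * factor).astype(np.int64)
```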
Although it is not a requirement of the algorithms, the publicly available implementations of C2LSH, QALSH, and iDistance load the entire dataset into main memory during the index construction step. Thus, they crashed on our machine for the two largest datasets, SIFT100M and SIFT1B, which require 52GB and 520GB of RAM respectively to be loaded. Further, the underlying implementation of R-trees used by SRS suffers from memory leaks. Hence, to enable SRS to run on SIFT100M and SIFT1B, as suggested by the authors, these datasets were divided into chunks, indexed separately, and then combined into a single index for the whole dataset. This discussion highlights the importance of purely disk-based methods, as even with today's hardware, methods using very large amounts of main memory on a single machine to build and load the index structure cannot be deemed scalable.
5.2 Internal Parameter Tuning
Since the HD-Index structure and its querying mechanism use several parameters, the first set of experiments is performed to determine a good range of their values. This is a critical step. The parameter tuning process follows an order such that the most general parameter, i.e., the one that is independent of the others, is tuned first, and so on. In general, we follow the intuitive order of first tuning the parameters specific to HD-Index construction, followed by those used during querying.
5.2.1 Indexing: Subspace Partitioning Scheme
During the index construction step, HD-Index requires the data to be partitioned into different lower-dimensional subspaces. For ease of use, we sequentially partition the feature space. However, that is not necessary, since HD-Index is not dependent upon a particular partitioning scheme. To substantiate this claim, we perform querying on indices constructed by randomly partitioning the full space into multiple subspaces. The MAP@10 values over the indices for the different datasets are as follows: (a) SIFT10K: , (b) Audio: , (c) SUN: , (d) SIFT1M: . Considering the small standard deviations and the comparable MAP@10 results for contiguous partitioning (as shown later in Figs. 5e–5h), we conclude that quality does not depend significantly on the choice of partitioning scheme.
5.2.2 Indexing: Reference Object Selection Method
As mentioned earlier in Sec. 2.2, there are multiple methods of choosing the reference objects. We have experimented with three algorithms: the random method, where random objects are chosen as reference objects, SSS, and SSS-Dyn. Even random reference object selection has accuracy levels comparable to the more sophisticated methods (the MAP values are within 90% of those of SSS). The figures are shown in Fig. 10 in Appendix A.
This indicates that the design of HD-Index by itself is good enough to achieve high accuracy, and the choice of reference objects is not very critical. The SSS method is faster than, and has similar MAP values to, SSS-Dyn. Also, the larger the dataset, the smaller the difference between the methods. Consequently, we recommend SSS as the reference object selection algorithm.
5.2.3 Indexing: Number of Reference Objects
The first important parameter is the number of reference objects. To tune it effectively, we consider the top datasets of up to medium size from Table 4. Fig. 4a shows that the query time increases in a sublinear fashion with the number of reference objects. The size of the final index (Fig. 4b) grows linearly with it (y-axis in log scale). However, for all the datasets, the quality metrics saturate beyond a point, with little or no further change (Fig. 4c and Fig. 4d). Thus, we choose the saturation point as the recommended number of reference objects and use this value for the rest of the experiments.
5.2.4 Indexing: Number of RDBtrees
Next, we analyze the effect of the number of RDBtrees on the different evaluation metrics. We use the top datasets from Table 4. Figs. 4e and 4f show that, as expected, the running time and the size of the final index grow linearly with increasing , for all the datasets. However, Figs. 4g and 4h portray that quality saturates after a while. The MAP@10 values for all the datasets, except SUN, saturates after . A similar effect is observed for the approximation ratio as well. Thus, we choose as the recommended number of RDBtrees. However, for the SUN dataset, which has a very high dimensionality of , choosing makes each Hilbert curve cater to a dimensionality of . Consequently, for such a high dimensionality, the MAP performance suffers. The MAP performance increases significantly by using double the number of trees, i.e., . Hence, while the default value is , for extremely high dimensional datasets (500+), we recommend . For datasets where the dimensionality is not a power of , we choose parameters that are as close as possible to the recommended values. Thus, for Enron, we use trees, each catering to dimensions, while for Glove, we use trees, each handling dimensions.
5.2.5 Querying: Importance of Ptolemaic Inequality
Having tuned the parameters specific to HD-Index, we next analyze the effect of the parameters associated with the querying mechanism. We first evaluate the effect of the lower-bounding filters, the triangular and Ptolemaic inequalities. Fig. 5 presents a detailed analysis of using (1) just the triangular inequality (as explained in Sec. 4.2), and (2) a combination of both the triangular and Ptolemaic inequalities, for an initial candidate set size of on different datasets. While it is clear from Figs. 5e–5h that the MAP@10 obtained using the combination of triangular and Ptolemaic inequality filters is always better than that using the triangular inequality alone, the gain is not significant under all scenarios. Notable gains are observed only for particular filter sizes. This is because the tighter lower bounds of the Ptolemaic inequality are more robust to large reductions during the filtering process. On the other hand, it is evident from Figs. 5a–5d that the running times obtained using the combined filters are consistently slower than those using the triangular inequality alone. A similar pattern is followed for different values as well (additional results are shown in Appendix B).
In sum, although the use of the Ptolemaic inequality results in better MAP, the gain in quality is not significant in the majority of cases compared to the increase in query time, which almost doubles. Therefore, it is more prudent to use the triangular inequality alone, trading a small amount of quality (loss in MAP) for much higher query efficiency. Thus, we henceforth choose the triangular inequality alone as the recommended lower-bounding filter.
As an alternate analysis, although the application of the Ptolemaic inequality is slower, importantly, it does not affect the number of disk I/Os. Since all the candidates are processed in main memory without the need to access the disk, the Ptolemaic inequality only requires more CPU time. Thus, if the performance metric of an application is the number of disk I/Os (instead of the wall-clock running time), it is better to use the Ptolemaic inequality, since it produces better quality results using the same number of disk accesses.
5.2.6 Querying: Filter Parameters
Since we do not use the Ptolemaic inequality, the intermediate filter parameter is not needed any further. We next fine-tune the other filter parameters by plotting (Fig. 6) MAP@10 and running time for the top datasets from Table 4. Figs. 6a, 6c, and 6e show that, as expected, the running time scales linearly with the initial candidate set size. However, Figs. 6b, 6d, and 6f portray that the quality saturates after a while. Thus, we choose the saturation point as the recommended value.
However, for SIFT1M and Yorck, this value is not sufficient since these datasets are extremely large. MAP improves when double the value is used and saturates beyond that. Hence, for very large datasets, we recommend the doubled value.
Having fixed the initial candidate set size, we analyze the MAP@10 and running times for all the datasets while varying the final candidate set size. Fig. 6g shows that the running time scales linearly with it. However, once again, MAP@10 saturates for all the datasets. Thus, we choose the corresponding saturation points as the recommended values.
5.2.7 Robustness with Number of Nearest Neighbors
We also evaluate the MAP@k and running times of the techniques with varying k on the tiny, small, and medium datasets (results in Appendix C). While the running times of the other techniques increase with k, they remain almost constant for HD-Index and Multicurves. This is mainly due to the design of these structures, where candidates are first selected and then refined to return the answers at the end. Since the candidate set is much larger than k, the running times are not affected by k.
5.2.8 Limitations
It should be noted that the current parameter tuning of HD-Index is almost purely empirical. In the future, we would like to analyze the parameters from a theoretical standpoint and understand what values are reasonable for a given dataset.
Also, if the dataset exhibits truly ultra-high dimensionality (on the order of hundreds of thousands), then HD-Index is unable to process the queries efficiently since there are too many B+-trees. On that note, however, we would like to clarify that, to the best of our knowledge, there is no disk-based database indexing method that can handle such ultra-high dimensionality. Moreover, our method can be easily parallelized and/or distributed with few synchronization steps, since it builds and queries multiple independent RDB-trees.
For small datasets, HD-Index is not as efficient as some of the other techniques (as explained in more detail in Sec. 5.4). However, as the dataset size increases, it scales better.
5.3 Quality Metrics
With the HD-Index parameters fine-tuned, we evaluate the two quality metrics, MAP@k and approximation ratio, vis-à-vis the other methods. It is evident from Figs. 7a–7e that even for good approximation ratios, MAP values can be surprisingly and significantly low. Moreover, with increasing dimensionality, this effect becomes more and more prominent. For instance, in Fig. 7c, the approximation ratios of C2LSH and SRS are good, while their MAP values are low.
This analysis highlights a notable limitation of the techniques that possess theoretical guarantees on the approximation ratio, as it empirically shows that the ratio loses its significance. Thus, henceforth, we use only MAP@k.
5.4 Comparative Study
In this section, we perform an indepth comparison of the different methods with respect to three aspects: (a) quality of results, (b) efficiency of retrieval in terms of running time, and (c) scalability in terms of space usage for both main memory and on disk.
5.4.1 Quality
Figs. 8a, 8f, 8k show a comparison of the MAP@100 of the different techniques. Being an exact technique, iDistance possesses the best (perfect) quality. However, it is neither efficient nor scalable. Table 5 shows that HD-Index significantly outperforms all the other techniques except HNSW and OPQ in terms of quality. The empty bars in the figures indicate that the corresponding methods could not be run on those datasets (e.g., in Fig. 8a, Multicurves was unable to run on SUN).
5.4.2 Efficiency
Figs. 8b, 8g, 8l show a comparison of the running times of HD-Index and the other methods. iDistance possesses the highest running times, almost as slow as a linear scan, and is, therefore, not efficient at all. OPQ and HNSW, being memory-based techniques, run extremely fast. As indicated later, this is possible because they consume and utilize a large amount of main memory. Compared to purely disk-based methods that aim to scale to large datasets, these methods enjoy an unfair advantage in terms of running time. C2LSH is the most efficient disk-based technique; it, however, crashed on SIFT100M. HD-Index is comparatively not very efficient for smaller datasets, as is clear from Fig. 8b. However, its running time scales gracefully with dataset size, and it becomes increasingly efficient at larger data sizes.
5.4.3 Scalability
The final set of results deals with the scalability of the various techniques. Figs. 8c, 8h, 8m show a comparison of the index sizes of the different techniques. Multicurves possesses the largest index, consuming up to 1.2TB of disk space for the SIFT100M dataset. Owing to this, it is unable to scale to the SIFT1B dataset on our machine, as its projected index size in this case is 12TB. HNSW, although extremely fast and fairly accurate, requires a very large amount of main memory. For SIFT1M, it consumes 1.43GB of main memory. Hence, its projected main-memory requirement for SIFT100M is beyond the 32GB capacity of our machine. Not surprisingly, it also crashes on this dataset. iDistance possesses the smallest index. However, in addition to index size, the memory consumption during the index construction step is also significant. iDistance consumes a lot of main memory (RAM) during index construction, as shown in Figs. 8d, 8i, and 8n. Thus, it crashed on SIFT100M during the index construction step. C2LSH, QALSH, and OPQ suffer from the same problem and, thus, also crashed on SIFT100M.
With this, the only two scalable techniques left are HD-Index and SRS, with the index size of SRS being 2–3 times smaller than that of HD-Index. The implementation of SRS possesses a memory leak, which is evident from its increasing RAM consumption with dataset size. To enable SRS to run on larger datasets, as mentioned in Sec. 5.1, the memory leak was circumvented by chunking the large datasets. Due to this chunking, SRS exhibits a stable memory consumption of 2GB for Yorck and larger datasets. The results depict that HD-Index possesses the smallest memory footprint during the index construction step, as low as 100MB across the different datasets. This establishes HD-Index as the most scalable technique.
Dataset  Query Time (ms)  Gain of HD-Index in Query Time over (C2LSH  SRS  Multicurves  QALSH  OPQ  HNSW)  MAP@100  Gain of HD-Index in MAP@100 over (C2LSH  SRS  Multicurves  QALSH  OPQ  HNSW)
SIFT10K  19.46  0.06x  0.17x  2.88x  0.20x  0.10x  0.004x  0.98  2.44x  4.45x  0.98x  1.81x  0.99x  0.98x 
Audio  44.18  0.02x  0.88x  2.44x  0.21x  0.02x  0.0007x  0.86  14.33x  6.61x  3.05x  1.28x  0.98x  0.99x 
SUN  105.78  0.12x  0.22x  NP  0.15x  0.02x  0.007x  0.69  3.83x  23.00x  NP  1.72x  1.00x  0.88x 
SIFT1M  25.10  5.30x  1.56x  22.98x  11.27x  0.05x  0.002x  0.56  2.80x  28.00x  0.97x  1.19x  1.00x  0.92x 
Yorck  262.29  0.27x  27.56x  2.21x  33.54x  0.01x  0.002x  0.39  1542.51x  39.18x  1.24x  1.01x  1.00x  1.02x 
SIFT100M  732.04  CR  2.17x  1.73x  CR  CR  CR  0.40  CR  75.72x  1.13x  CR  CR  CR 
SIFT1B  4855.20  CR  DNF  CR  CR  CR  CR  0.25  CR  DNF  CR  CR  CR  CR 
Enron  242.69  0.02x  0.06x  NP  NP  0.06x  0.002x  0.92  5.12x  12.07x  NP  NP  0.99x  0.99x 
Glove  85.20  0.75x  0.06x  2.6x  2.28x  0.05x  0.0003x  0.24  8.87x  80.00x  1.26x  1.43x  0.99x  0.31x 
We also analyze the main memory consumption during the querying stage. Figs. 8e, 8j, 8o show that apart from HD-Index, Multicurves, and QALSH (each consuming less than 40MB), all the other techniques consume a significant amount of RAM. Utilizing a large amount of RAM during the querying phase puts these methods at an unfairly advantageous position vis-à-vis the strictly disk-based methods where running times are concerned.
5.4.4 Billion Scale Dataset
We could not compare the results on the largest SIFT1B dataset with any other technique, as none of them was able to run on it. C2LSH, iDistance, QALSH, OPQ, and HNSW had already crashed on SIFT100M (we use a commodity machine with 32GB of RAM and 2TB of HDD). Multicurves consumed 1.2TB of disk space on SIFT100M and, hence, could not scale to the 10 times larger SIFT1B dataset on our machine. Although the memory leak in the index construction step of SRS was circumvented by chunking the dataset, even after running for more than 25 days, it was unable to complete the index construction for SIFT1B.
HD-Index required around 10 days and 1.2TB of space to construct the index. Querying produced a MAP@100 of 0.25 in 4.8s per query while consuming 30MB of main memory.
5.5 RealWorld Application: Image Search
We finally highlight the importance of kANN searches and the MAP measure through an actual image retrieval application. We use the Yorck dataset [6], which consists of SURF descriptors from 10,000 art images. We mention only the quality results here and do not repeat the running times.
All the descriptors from the query image are searched for nearest neighbors. The retrieved results, i.e., the database descriptors, are then aggregated using the Borda count [57]. In the Borda count, a score is given to each image based on the depth of its descriptors in the kANN results of the descriptors in the query image (details are in Appendix D). The images with the highest Borda counts are retrieved as the top images similar to the query image. The top image results for a representative query for the different methods are shown in Appendix E.
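The aggregation step can be sketched as follows, using a simple linear Borda variant (the exact scoring used in Appendix D may differ; owner and the scoring form are assumptions for illustration):

```python
from collections import Counter

def borda_aggregate(knn_lists, owner, k, top):
    """knn_lists: for each query-image descriptor, the ranked list of
    retrieved database descriptor ids; owner maps a database descriptor
    id to its image id. Each image earns (k - rank) points per retrieved
    descriptor; the `top` highest-scoring images are returned."""
    score = Counter()
    for ranked in knn_lists:
        for rank, did in enumerate(ranked):
            score[owner[did]] += k - rank
    return [img for img, _ in score.most_common(top)]
```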
In general, HD-Index, QALSH, OPQ, and HNSW possess the highest overlap with the ground truth produced by the linear scan method. While linear scan has impractical running times, HD-Index, OPQ, HNSW, and C2LSH have better running times than SRS and QALSH. However, C2LSH has poor image retrieval quality, while SRS exhibits moderate quality.
HD-Index is faster than QALSH but slower than both OPQ and HNSW. However, it has a much smaller memory footprint than OPQ and HNSW during both indexing and querying.
5.6 Summary
A good indexing technique for approximate nearest neighbor search stands on three pillars: quality of retrieved objects, running time efficiency, and memory footprint (both external memory while storing the index and main memory while querying). Our experimental analyses compared HD-Index with state-of-the-art techniques across all of these features. A quantitative summary of the results is presented in Table 5, while a broad qualitative summary is depicted in Fig. 9.
QALSH [33], SRS [64], and C2LSH [26] are the state-of-the-art techniques with respect to quality, memory footprint, and running time efficiency respectively. However, optimizing just one category is not sufficient. For instance, it is useless to produce spurious results extremely fast with a high memory overhead. Therefore, for a technique to be useful, it should be classified at the intersection of at least two classes, preferably "QE" or "QM", since poor-quality results are of no use. If quality is sacrificed, even a technique producing random results without any computation would fall in the class "ME". To this end, Multicurves [66], HNSW [48], and OPQ [27] (classified as "QE") are useful; however, they possess several disadvantages. First, owing to its large index space requirements, Multicurves finds difficulty in scaling to massive datasets. Next, due to its inherent design, it cannot scale to extremely high-dimensional datasets such as the 512-dimensional SUN dataset. HNSW and OPQ, being memory-based techniques, are quite fast and fairly accurate; however, their memory requirements are impractically large. They, therefore, fail to run on large datasets. Thus, as verified empirically, none of these methods could run on the billion-scale dataset.
In sum, HD-Index (classified as "QME") stands strong on all three pillars, and is an effective, efficient, and scalable index structure that scales gracefully to massive high-dimensional datasets even on commodity hardware.
6 Conclusions and Future Work
In this paper, we addressed the problem of approximate nearest neighbor searching in massive high-dimensional datasets within practical compute times while simultaneously exhibiting high quality. We designed an efficient indexing scheme, HD-Index, based on a novel index structure, the RDB-tree. Experiments portrayed the effectiveness, efficiency, and scalability of HD-Index.
In the future, we would like to attempt a parallel implementation to cater to higher-dimensional datasets. We would also like to analyze the effect of the parameters from a theoretical point of view.
References
 [1] Evaluation of approximate nearest neighbors: Large datasets. http://corpus-texmex.irisa.fr/.
 [2] The CASS Project Repository. http://cs.princeton.edu/cass/audio.tar.gz.
 [3] The Enron Database. http://www.cs.cmu.edu/ enron/.
 [4] The Glove Database. https://nlp.stanford.edu/projects/glove/.
 [5] The SUN Database. http://groups.csail.mit.edu/vision/SUN/.
 [6] The Yorck Project. https://commons.wikimedia.org/wiki/Category:PDArt_(Yorck_Project).
 [7] G. Amato and P. Savino. Approximate similarity search in metric spaces using inverted files. In ICST, volume 6, pages 1–10, 2008.
 [8] K. Aoyama, K. Saito, H. Sawada, and N. Ueda. Fast approximate similarity search based on degree-reduced neighborhood graphs. In KDD, pages 1055–1063, 2011.
 [9] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322–331, 1990.
 [10] S. Berchtold, C. Böhm, H. V. Jagadish, H.-P. Kriegel, and J. Sander. Independent quantization: An index compression technique for high-dimensional data spaces. In ICDE, pages 577–588, 2000.
 [11] S. Berchtold, C. Böhm, and H.-P. Kriegel. The pyramid-technique: towards breaking the curse of dimensionality. In SIGMOD, pages 142–153, 1998.
 [12] S. Berchtold, D. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In VLDB, pages 28–39, 1996.
 [13] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, pages 97–104, 2006.
 [14] A. Bhattacharya. Fundamentals of Database Indexing and Searching. CRC Press, 2014.
 [15] C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comp. Sur. (CSUR), 33(3):322–373, 2001.
 [16] N. R. Brisaboa, A. Farina, O. Pedreira, and N. Reyes. Similarity search using sparse pivots for efficient multimedia information retrieval. In ISM, pages 881–888, 2006.
 [17] B. Bustos, G. Navarro, and E. Chavez. Pivot selection techniques for proximity searching in metric spaces. Patt. Recog. Lett., 24(14):2357–2366, 2003.
 [18] B. Bustos, O. Pedreira, and N. Brisaboa. A dynamic pivot selection technique for similarity search. In SISAP (ICDE Workshops), pages 394–401, 2008.
 [19] A. R. Butz. Alternative algorithm for Hilbert’s space-filling curve. IEEE Trans. Computers, C-20(4):424–426, 1971.
 [20] L. Chen, Y. Gao, X. Li, C. Jensen, and G. Chen. Efficient metric indexing for similarity search. In ICDE, pages 591–602, 2015.
 [21] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426–435, 1997.
 [22] D. Comer. The ubiquitous B-tree. ACM Comp. Sur., 11(2):121–137, 1979.

 [23] S. Dasgupta. Performance guarantees for hierarchical clustering. J. Comput. Syst. Sci., 70(4):555–569, 2005.
 [24] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, 2004.
 [25] R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In SIGMOD, pages 301–312. ACM, 2003.
 [26] J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, pages 541–552, 2012.
 [27] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, pages 2946–2953, 2013.
 [28] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47–57, 1984.
 [29] C. Hennig and L. J. Latecki. The choice of vantage objects for image retrieval. Pattern Recognition, 36(9):2187–2196, 2003.
 [30] M. L. Hetland, T. Skopal, J. Lokoc, and C. Beecks. Ptolemaic access methods: Challenging the reign of the metric space model. Inf. Syst., 38(7):989–1006, 2013.
 [31] M. E. Houle and M. Nett. Rank-based similarity search: Reducing the dimensional dependence. PAMI, 37(1):136–150, 2015.
 [32] M. E. Houle and J. Sakuma. Fast approximate similarity search in extremely high-dimensional data sets. In ICDE, pages 619–630, 2005.
 [33] Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. PVLDB, 9(1):1–12, 2015.
 [34] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.
 [35] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. PAMI, 33(1):117–128, 2011.
 [36] C. Traina Jr., R. F. S. Filho, A. J. M. Traina, M. R. Vieira, and C. Faloutsos. The Omni-family of all-purpose access methods: A simple and effective way to make similarity search more efficient. VLDB J., 16(4):483–505, 2007.
 [37] I. Kamel and C. Faloutsos. On packing R-trees. In CIKM, pages 490–499, 1993.
 [38] N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In SIGMOD, pages 369–380, 1997.
 [39] R. Krauthgamer and J. R. Lee. Navigating nets: simple algorithms for proximity search. In SODA, pages 798–807, 2004.
 [40] J. K. Lawder and P. J. H. King. Using space-filling curves for multidimensional indexing. In British National Conf. on Databases, pages 20–35, 2000.
 [41] R. H. V. Leuken and R. C. Veltkamp. Selecting vantage objects for similarity indexing. ACM Trans. Mult. Comput., Commun., and Appl., 7(3):16, 2011.
 [42] S. Liao, M. A. Lopez, and S. T. Leutenegger. High dimensional similarity search with space filling curves. In ICDE, pages 615–622, 2001.
 [43] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In NIPS, pages 825–832, 2004.
 [44] Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen. SK-LSH: An efficient index structure for approximate nearest neighbor search. PVLDB, 7(9):745–756, 2014.
 [45] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950–961, 2007.
 [46] G. Mainar-Ruiz and J.-C. Perez-Cortes. Approximate nearest neighbor search using a single space-filling curve and multiple representations of the data points. In ICPR, volume 2, pages 502–505, 2006.
 [47] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, 2014.
 [48] Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv:1603.09320 [cs.DS], 2016.
 [49] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
 [50] M. Marin, V. Gil-Costa, and R. Uribe. Hybrid index for metric space databases. In ICCS, pages 327–336, 2008.
 [51] N. Megiddo and U. Shaft. Efficient nearest neighbor indexing based on a collection of space filling curves. Technical Report IBM Research Report RJ 10093 (91909), IBM Almaden Research Center, San Jose California, 1997.
 [52] M. L. Micó, J. Oncina, and E. Vidal. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Patt. Recog. Lett., 15(1):9–17, 1994.
 [53] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. PAMI, 36, 2014.
 [54] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR, pages 3017–3024, 2013.
 [55] B. C. Ooi, K.-L. Tan, C. Yu, and S. Bressan. Indexing the edges – a simple and yet efficient approach to high-dimensional indexing. In PODS, pages 166–174, 2000.
 [56] O. Pedreira and N. R. Brisaboa. Spatial selection of sparse pivots for similarity search in metric spaces. In SOFSEM (1), pages 434–445, 2007.
 [57] B. Reilly. Social choice in the south seas: Electoral innovation and the borda count in the pacific island countries. Int. Political Sc. Rev., 23, 2002.
 [58] E. V. Ruiz. An algorithm for finding nearest neighbors in (approximately) constant average time. Patt. Recogn. Lett., 4(3):145–157, 1986.
 [59] H. Sagan. Space-Filling Curves. Springer, 1994.
 [60] Y. Sakurai, M. Yoshikawa, S. Uemura, and H. Kojima. The A-tree: An index structure for high-dimensional spaces using relative approximation. In VLDB, pages 516–526, 2000.
 [61] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2005.
 [62] J. Shepherd, X. Zhu, and N. Megiddo. A fast indexing method for multidimensional nearest neighbor search. In SPIE, pages 350–355, 1999.
 [63] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In CVPR, pages 1–8, 2008.
 [64] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: Solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index. PVLDB, 8(1):1–12, 2014.
 [65] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, pages 563–576, 2009.
 [66] E. Valle, M. Cord, and S. Philipp-Foliguet. High-dimensional descriptor indexing for large multimedia databases. In CIKM, pages 739–748, 2008.
 [67] J. Venkateswaran, T. Kahveci, C. M. Jermaine, and D. Lachwani. Reference-based indexing for metric spaces with costly distance measures. VLDB J., 17(5):1231–1251, 2008.
 [68] E. Vidal. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Patt. Recog. Lett., 15(1):1–7, 1994.
 [69] J. Wang and S. Li. Query-driven iterated neighborhood graph search for large scale indexing. In ACM Multimedia, pages 179–188, 2012.
 [70] J. Wang, J. Wang, G. Zeng, R. Gan, S. Li, and B. Guo. Fast neighborhood graph search using Cartesian concatenation. In Multimedia Data Mining and Analytics, pages 397–417, 2015.
 [71] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
 [72] D. A. White and R. Jain. Similarity indexing with the SS-tree. In ICDE, pages 516–523, 1996.
 [73] C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish. Indexing the distance: An efficient method to kNN processing. In VLDB, pages 421–430, 2001.
Appendix A Reference Object Selection
The comparison of different reference object selection methods is shown in Fig. 10. The method selected is SSS.
Appendix B Ptolemaic Inequality
Fig. 11 and Fig. 12 describe the detailed results of using the triangular inequality alone versus a combination of the triangular and Ptolemaic inequalities. For a fixed parameter setting and the same amount of reduction through the filtering process, using only the triangular inequality provides the fastest response times, being faster than when both inequalities are used. However, it is clear from the highlighted entries that the best MAP@10 is usually obtained when the Ptolemaic inequality is employed, either alone or in conjunction with the triangular inequality. Since the Ptolemaic inequality provides tighter bounds, the improvements in MAP values are most prominent when the amount of reduction through filtering is large.
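To make the two filters concrete, the following is a minimal Python sketch (not the paper's implementation; the function names are ours) of the triangle-inequality and Ptolemaic lower bounds, computed from precomputed query-to-pivot and object-to-pivot distances:

```python
def triangle_lb(d_qp, d_op):
    """Triangle-inequality lower bound on d(q, o) via one pivot p:
    d(q, o) >= |d(q, p) - d(o, p)|."""
    return abs(d_qp - d_op)

def ptolemaic_lb(d_qp1, d_qp2, d_op1, d_op2, d_p1p2):
    """Ptolemaic lower bound on d(q, o) via a pivot pair (p1, p2).
    In a Ptolemaic metric space (e.g., Euclidean distance):
    d(q, o) * d(p1, p2) >= |d(q, p1)*d(o, p2) - d(q, p2)*d(o, p1)|."""
    return abs(d_qp1 * d_op2 - d_qp2 * d_op1) / d_p1p2

def can_prune(lower_bound, knn_radius):
    """A candidate o can be discarded without computing d(q, o) if even
    a lower bound on d(q, o) exceeds the current kNN radius."""
    return lower_bound > knn_radius
```

The Ptolemaic bound combines two pivots multiplicatively and is often tighter, but it costs more arithmetic per candidate, which matches the observed trade-off between response time and MAP.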
Appendix C Number of Nearest Neighbors
The results of varying the number of nearest neighbors, k, for all the methods on the tiny, small and medium datasets are shown in Fig. 13. Since iDistance is an exact method, it always returns correct answers and, hence, possesses a MAP@k of 1. Consequently, we do not show its MAP results. Moreover, the results of the Multicurves technique are not shown for SUN (Figs. 13(c) and 13(h)) due to reasons explained in Sec. 5.4.
The running time remains almost constant for HD-Index and Multicurves, while for the other techniques, it increases with k. This is mainly due to the design of HD-Index and Multicurves, as explained in Sec. 5.2.7.
Moreover, as shown in Figs. 13(a) to 13(e), even the MAP@k for HD-Index and Multicurves remains almost constant with varying k. For the other techniques, MAP@k is unstable with varying k. This is due to the fact that when these techniques are allowed to examine more objects, i.e., when k is increased, some actual answers are sometimes retrieved. Since the answers are returned in sorted order, the true answers find their correct places and MAP increases.
Overall, HD-Index reports the best MAP@k and scales gracefully with dataset size.
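For reference, the MAP@k quality measure used in these experiments can be computed as sketched below. This follows the common convention for average precision at cutoff k (the exact normalization used in the experiments may differ slightly), and the function names are ours:

```python
def average_precision_at_k(retrieved, relevant, k):
    """AP@k for one query: average of the precision values at each rank
    (within the top k) where a relevant item is retrieved."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    denom = min(len(relevant), k)          # best achievable number of hits
    return precision_sum / denom if denom else 0.0

def map_at_k(all_retrieved, all_relevant, k):
    """MAP@k: mean of AP@k over all queries."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps)
```

An exact method always retrieves all true answers in order, so every AP@k is 1 and MAP@k is 1, as noted above for iDistance.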
Appendix D Borda Count
Consider a query image Q with m descriptors. Each descriptor q_i, i = 1, …, m, is queried against the database, and its kANN result set R_i, containing k descriptors sorted by their distance to q_i, is retrieved. The Borda count method [57] is then employed to obtain the nearest neighbor images of Q.
Consider a database image I. A score is given to I based on the depths of its descriptors in the kANN result sets of Q. For each result set R_i, if a descriptor of image I is found at position j, then the image accumulates a score of (k − j + 1). These scores are then summed over all the result sets of Q to obtain the Borda count of I against Q:
BC(Q, I) = \sum_{i=1}^{m} \sum_{j=1}^{k} (k - j + 1) \cdot \delta(I, i, j)   (7)
where \delta(I, i, j) is the indicator function that is 1 only when a descriptor from image I is found at position j of result set R_i, and 0 otherwise.
The images with the largest Borda counts are returned, in order, as the nearest neighbors of the query image Q.
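The aggregation above can be sketched in a few lines of Python. This is a minimal illustration (not the paper's implementation), assuming each result set is given as a list of the image ids of the retrieved descriptors, sorted by ascending distance, and using 0-based positions so a hit at position j earns k − j points (equivalently k − j + 1 for 1-based positions). Function and variable names are ours:

```python
from collections import defaultdict

def borda_count(result_sets, k, top=10):
    """Aggregate per-descriptor kANN result sets into an image-level
    ranking via Borda counts: shallow hits earn higher scores."""
    scores = defaultdict(int)
    for result in result_sets:            # one kANN result set per descriptor
        for j, image_id in enumerate(result):
            scores[image_id] += k - j     # Borda score for 0-based position j
    # Images with the largest accumulated counts come first.
    return sorted(scores, key=lambda img: -scores[img])[:top]
```

For instance, with three result sets in which image B consistently ranks at or near the top, B accumulates the largest count and is returned first.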
Appendix E Image Search Results
The top results for a representative query, for each of the methods (Linear scan, HD-Index, SRS, C2LSH, Multicurves and QALSH), are shown in Table 6.