Many large multimedia retrieval applications require efficient processing of nearest neighbor queries in high-dimensional spaces. Exact tree-based indexing structures, such as the KD-tree and the SR-tree, work well for low-dimensional spaces but suffer from the notorious curse of dimensionality in high-dimensional spaces, where they are often outperformed by brute-force linear scans. One solution to this problem is to search for good enough approximate results instead. Approximate techniques sacrifice some accuracy for a significant improvement in the overall processing time. In many applications where 100% accuracy is not needed, this tradeoff is very useful in saving time. The goal of the approximate version of the nearest neighbor problem, also called c-approximate Nearest Neighbor search, is to return points that are within distance $c \cdot R$ from the query point. Here, $c$ is a user-defined approximation ratio and $R$ denotes the distance between the query point and its nearest neighbor.
1.1 Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) [9] is one of the most popular techniques for finding approximate nearest neighbors in high-dimensional spaces. LSH was first introduced in [9] for the Hamming distance, but was later extended to several distances, such as the popular Euclidean distance [7]. LSH uses random hash projections to map the original high-dimensional space to a projected low-dimensional space. The main idea behind LSH is that nearby points in the original high-dimensional space map to the same hash bucket in the low-dimensional space with a higher probability than points that are far apart. Since LSH was first proposed in [9], several works have focused on improving the search accuracy and/or performance [3, 8, 10, 17, 19, 24, 16, 5].
1.2 Motivation for using LSH
Locality Sensitive Hashing (LSH) is known for two main advantages: its sub-linear query performance (in terms of the data size) and its theoretical guarantees on the query accuracy. Additionally, LSH uses random hash functions that are data-independent (i.e. data properties such as the data distribution are not needed to generate them), so generating these hash functions is a simple process that takes negligible time. Hence, in applications where data is changing or where newer data is coming in, these hash functions do not require any change during runtime. While the original LSH index structure suffered from large index sizes (in order to obtain a high query accuracy) [3, 19], state-of-the-art LSH techniques [8, 10] have alleviated this issue by using advanced methods such as Collision Counting and Virtual Rehashing. In addition to their fast index maintenance, fast query performance, and theoretical guarantees on the query accuracy, LSH algorithms are easy to implement as external memory-based algorithms, and hence are more scalable than in-memory algorithms (such as graph-based ANN algorithms).
1.3 Motivation of our Survey
Locality Sensitive Hashing techniques have two dominant costs for finding nearest neighbors: 1) the cost of reading the index files from the external memory to the main memory (which we call Index I/Os), and 2) the cost of finding candidates and removing false positives (which we call Algorithm time). As mentioned in Section 1.2, one of the benefits of LSH is that it is a scalable algorithm. However, some of the existing LSH implementations (e.g. C2LSH [8] and QALSH [10]) are not entirely external memory-based: even though the indexes are stored on the disk, the implementations require the entire dataset and the indexes to fit into the main memory during the index creation phase. Thus, some existing works do not compare their results with C2LSH and QALSH on large datasets, since these datasets do not fit in the main memory. Additionally, some recent works only compare the Index I/Os without comparing the equally important Algorithm time. This leads other recent papers (such as [15, 14, 26]) to compare their Algorithm time unfairly with QALSH or I-LSH [16], since these are deemed the state-of-the-art LSH techniques.
1.4 Contributions of this Survey paper
We modify the implementations of C2LSH and QALSH to create fully external memory-based implementations, such that neither the entire dataset nor the entire index needs to be in the main memory during index generation or query processing. (These implementations will be made public.)
We show the importance of experimentally analyzing and comparing the Index I/Os and Algorithm time of all algorithms.
We compare C2LSH, QALSH, and I-LSH on real datasets with different characteristics under differing system parameters.
2 Related Work
The nearest neighbor problem is important for applications in many diverse domains, such as multimedia retrieval, image processing, and machine learning. Since tree-based index structures can be outperformed by a linear scan in high-dimensional spaces due to the curse of dimensionality, approximate techniques are preferred for their fast performance at the expense of some accuracy. Due to the importance of the nearest neighbor problem in various domains, several diverse techniques have been proposed by researchers. These techniques can be broadly classified into three main categories: Hashing-based methods, Partition-based methods, and Graph-based methods. (We refer the reader to a recent survey for an in-depth discussion of these categories.) Hashing-based methods can be further classified into learning-based hashing techniques and random hashing techniques. The benefits of random hashing techniques, such as Locality Sensitive Hashing [9], are that they are easy to construct, need no training data, and are easy to maintain and update. Additionally, LSH provides sub-linear (in terms of the data size) query performance and theoretical guarantees on the query accuracy.
Locality Sensitive Hashing and its variants: The main idea of Locality Sensitive Hashing is to create random projections and hash data points in these random projections such that nearby data points in the original high-dimensional space are mapped to the same hash bucket with a high probability (and conversely, data points that are far apart from each other in the original high-dimensional space are mapped to the same hash bucket with a low probability). It was originally proposed in [9] for the Hamming distance and later extended to the popular Euclidean distance [7]. In this original work on Euclidean distance (E2LSH), instead of a single hash function (or projection), a hash table consisted of several hash functions (represented by Compound Hash Keys) in order to reduce false positives. But this also generated false negatives. Hence, several hash tables had to be used to reduce the number of false positives and false negatives while keeping the accuracy of the query high. The main drawbacks of this approach were the size of the index structure (since a large number of hash tables were required to return the desired number of results with a high accuracy) and the need to determine the width of the hash bucket during index creation (a larger width returned enough results but risked too many false positives, whereas a smaller width risked misses that resulted in insufficient results). This user-defined width, which depended mainly on the data distribution, often had to be determined through trial and error.
LSH-Forest [3] was proposed, where the compound hash keys were stored hierarchically such that the algorithm could stop at a higher level in the tree if more results were needed. In Multi-probe LSH [19], the authors proposed a technique to probe into neighboring buckets when more results were needed. The intuition is that neighboring buckets are more likely to contain nearby points. Hence, if the bucket width was underestimated (which is better than overestimation, which can lead to significant wasteful processing), neighboring buckets were probed to find the desired number of results.
Later, C2LSH [8] introduced the two main concepts of Collision Counting and Virtual Rehashing, which solved the two main drawbacks of E2LSH [7]. In C2LSH, the authors proposed to create base hash functions and to choose candidate points based on how many times a data point collides with the query point. Hence, instead of several hash tables of several hash functions each, only one table of base hash functions is needed, which reduces the size of the index structure. Additionally, in Virtual Rehashing, the neighboring buckets in each hash function are read incrementally when a sufficient number of results is not found.
In SK-LSH [17], the authors propose a linear ordering on the Compound Hash Keys (using a space-filling curve) such that nearby Compound Hash Keys are stored on the same (or a nearby) page on the disk, thus reducing the total number of I/Os. The design of SK-LSH is still built on the original E2LSH, and hence suffers from the parameter tuning problem, where the user is expected to enter important parameters such as the number of hash functions and the radius at which results will be found. A wrong choice of parameters can negatively affect the accuracy and efficiency of the algorithm.
QALSH [10] was later proposed. It builds query-aware hash functions, such that the hash value of the query point is used as the anchor bucket during query processing. This idea solves the issue of points close to a query being partitioned into different buckets when the query is near a bucket boundary. Additionally, B+-trees are built on each hash function for efficient lookups into neighboring buckets (which translate to range queries). QALSH utilizes the concepts of Collision Counting and Virtual Rehashing.
HD-Index [1] generates Hilbert keys of the dataset points and also stores the distances of the points to each other in order to prune the results efficiently using distance filters. HD-Index stores the Hilbert keys using modified B+-trees, called RDB-trees. Due to its reliance on space-filling curves (Hilbert curves) and B+-trees, HD-Index cannot scale for moderately high-dimensional datasets.
SRS [23] uses the Euclidean distance between two points in the projected space to estimate their distance in the original space. In order to find the next nearest neighbor in the projected space, SRS uses an R-tree to index the points in the projected space. This incremental finding of the NN is similar to I-LSH. The main goal of SRS is to introduce a very lightweight index structure to solve the ANN problem. SRS is shown to suffer from memory leaks and slow running times as compared with C2LSH, and hence is not included in our work.
Recently, I-LSH [16], which is considered to be the state-of-the-art LSH technique, was proposed to improve the Virtual Rehashing process of QALSH (where the range of the lookups is incremented exponentially). In I-LSH, the authors propose to increase the range of the lookups based on the distance to the nearest point (in the projected space) instead of increasing the range exponentially. While this strategy results in fewer disk I/Os, it also leads to a high number of disk seeks (random I/Os) and a high algorithm time, as we show in Section 5.
Very recently, PM-LSH [26] was proposed, where the idea was to estimate the Euclidean distance based on a tunable confidence interval value such that the overall query processing time is reduced. (The code of PM-LSH was not released before the submission date of SISAP.)
3 Background and Preliminaries
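Hash Functions: A hash function family $H$ is $(R, cR, p_1, p_2)$-sensitive if it satisfies the following conditions for any two points $x$ and $y$ in a $d$-dimensional dataset $D \subset \mathbb{R}^d$:
if $\|x - y\| \le R$, then $\Pr[h(x) = h(y)] \ge p_1$, and
if $\|x - y\| > cR$, then $\Pr[h(x) = h(y)] \le p_2$.
Here, $p_1$ and $p_2$ are probabilities and $c$ is an approximation ratio. LSH requires that $c > 1$ and $p_1 > p_2$. The above definition states that the two points $x$ and $y$ are hashed to the same bucket with a very high probability $\ge p_1$ if they are close to each other (i.e. the distance between the two points is less than or equal to $R$), and if they are not close to each other (i.e. the distance between the two points is greater than $cR$), then they will be hashed to the same bucket with a low probability $\le p_2$. In the original LSH scheme for Euclidean distance, each hash function is defined as $h_{\vec{a},b}(x) = \left\lfloor \frac{\vec{a} \cdot x + b}{w} \right\rfloor$, where $\vec{a}$ is a $d$-dimensional random vector whose entries are drawn independently from the standard normal distribution and $b$ is a real number chosen uniformly from $[0, w)$, such that $w$ is the width of the hash bucket [7]. This leads to the following collision probability function $P(r)$, which states that if $\|x - y\| = r$, then the probability that $x$ and $y$ map to the same hash bucket for a given hash function is: $P(r) = \int_0^w \frac{1}{r} f_2\!\left(\frac{t}{r}\right)\left(1 - \frac{t}{w}\right) dt$, where $f_2$ is the probability density function of the absolute value of the standard normal distribution. Here, the collision probability $P(r)$ is decreasing in $r$ for a given $w$. For a , which is the largest absolute value of a coordinate of point in , and for every uniformly drawn from the interval and for some we have that is -sensitive, where and .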
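To make the hash function and its collision probability concrete, the following Python sketch assumes the standard p-stable construction for Euclidean distance, $h(x) = \lfloor (\vec{a} \cdot x + b)/w \rfloor$ with a Gaussian vector $\vec{a}$, and evaluates the collision probability through its closed form for the Gaussian case. The names `make_hash` and `collision_probability` are illustrative and are not taken from any of the implementations compared in this paper.

```python
import math
import random


def make_hash(d, w, rng):
    """Draw one hash function h(x) = floor((a . x + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # Gaussian (2-stable) projection vector
    b = rng.uniform(0.0, w)                      # random offset within one bucket width

    def h(x):
        return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

    return h


def collision_probability(r, w):
    """Probability that two points at Euclidean distance r > 0 share a bucket of width w.

    Closed form of the integral of (1/r) f2(t/r) (1 - t/w) over [0, w],
    where f2 is the density of |N(0, 1)|.
    """
    s = w / r
    tail = (2.0 / (s * math.sqrt(2.0 * math.pi))) * (1.0 - math.exp(-s * s / 2.0))
    return math.erf(s / math.sqrt(2.0)) - tail


if __name__ == "__main__":
    rng = random.Random(7)
    h = make_hash(d=4, w=4.0, rng=rng)
    print(h([1.0, 2.0, 3.0, 4.0]), h([1.1, 2.0, 3.0, 4.0]))
    for r in (0.5, 1.0, 2.0, 4.0):
        print(r, round(collision_probability(r, w=4.0), 4))
```

For a fixed bucket width, the printed probabilities shrink as the distance grows, which is the monotonicity that the sensitivity definition relies on.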
4 State-of-the-art Techniques
In Section 2, we explained the benefits and drawbacks of different LSH techniques. In this paper, we experimentally analyze the three state-of-the-art external memory-based LSH techniques: C2LSH [8], QALSH [10], and I-LSH [16]. In this section, we describe the concepts introduced by these techniques. C2LSH [8] introduced the concepts of Collision Counting and Virtual Rehashing. In [8], the authors theoretically show that two close points $x$ and $y$ collide in at least $l$ hash layers with a probability $1 - \delta$, when the total number, $m$, of hash layers is equal to: $m = \left\lceil \frac{\ln(1/\delta)}{2(p_1 - p_2)^2}(1 + z)^2 \right\rceil$. Here, $z = \sqrt{\ln(2/\beta)/\ln(1/\delta)}$, where $\beta$ is the allowed false positive percentage (i.e. the allowed number of points whose distance with a query point is greater than $cR$). C2LSH sets $\beta = 100/n$, where $n$ is the cardinality of the dataset. Further, only those points that collide at least $l$ times are considered as candidates, where $l$ is the collision count threshold, which is calculated as follows: $l = \lceil \alpha m \rceil$, where the collision threshold percentage, $\alpha$, is $\frac{z p_1 + p_2}{1 + z}$. C2LSH creates only one hash function per hash table, and hence the number of hash functions is equal to the number of hash tables.
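The parameter choices of C2LSH can be illustrated with a short calculation. The sketch below assumes the formulas as they are commonly stated for C2LSH: $z = \sqrt{\ln(2/\beta)/\ln(1/\delta)}$, $m = \lceil \ln(1/\delta)(1+z)^2 / (2(p_1-p_2)^2) \rceil$, $\alpha = (z p_1 + p_2)/(1+z)$, and $l = \lceil \alpha m \rceil$, with $\beta = 100/n$. The default $\delta = 1/e$ and the function name `c2lsh_parameters` are assumptions made for this example only.

```python
import math


def c2lsh_parameters(n, p1, p2, delta=1.0 / math.e, beta=None):
    """Return (m, l, alpha): number of hash functions, collision threshold, threshold percentage.

    n      -- dataset cardinality (assumed large enough that beta < 2)
    p1, p2 -- collision probabilities for near and far points, with p1 > p2
    delta  -- allowed error probability (default value is an assumption of this sketch)
    beta   -- allowed false positive percentage; defaults to 100 / n
    """
    if beta is None:
        beta = 100.0 / n
    z = math.sqrt(math.log(2.0 / beta) / math.log(1.0 / delta))
    m = math.ceil(math.log(1.0 / delta) / (2.0 * (p1 - p2) ** 2) * (1.0 + z) ** 2)
    alpha = (z * p1 + p2) / (1.0 + z)  # weighted average of p1 and p2
    l = math.ceil(alpha * m)
    return m, l, alpha


if __name__ == "__main__":
    m, l, alpha = c2lsh_parameters(n=1_000_000, p1=0.6, p2=0.4)
    print(m, l, round(alpha, 4))
```

Because $\alpha$ is a weighted average of $p_1$ and $p_2$, the threshold $l$ always lies between the expected collision counts of far and near points, and $m$ grows only logarithmically with the dataset cardinality.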
Instead of assuming a magic radius (as traditional LSH methods did), C2LSH sets the initial radius $R$ to 1. It is possible that with $R = 1$, there are not enough results for a top-$k$ query to be returned. C2LSH increases the radius of the query in the following sequence: $R = 1, c, c^2, c^3, \ldots$ If enough candidates are not found at level $R$, the radius is increased until enough query results are found. This exponential expansion process is called Virtual Rehashing.
Moreover, C2LSH uses two terminating conditions to stop the algorithm. These conditions specify that 1) at the end of each virtual rehashing, at least $k$ candidates have been found whose Euclidean distance to the query is less than or equal to $cR$, or 2) at any point, $k + \beta n$ candidates have been found.
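The interplay of Collision Counting, Virtual Rehashing, and the terminating conditions can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions and not the external memory implementation evaluated in this paper: it assumes that a level-$R$ bucket is obtained by integer division of the base hash value by $R$, that the search stops once $k$ candidates lie within distance $cR$ or once $k + \beta n$ candidates have been collected, and it shifts all hash values to be non-negative so that buckets merge consistently. The function name `c2lsh_query` and all default parameter values are hypothetical, and the dataset is assumed to be non-empty.

```python
import math
import random


def c2lsh_query(data, q, k, c=2, m=32, l=16, w=1.0, beta_n=100, seed=0):
    """Toy in-memory k-NN search with collision counting and virtual rehashing."""
    rng = random.Random(seed)
    d = len(q)
    # m base hash functions h_i(x) = floor((a_i . x + b_i) / w)
    funcs = [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
             for _ in range(m)]

    def bucket(a, b, x):
        return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

    table = [[bucket(a, b, x) for a, b in funcs] for x in data]
    q_keys = [bucket(a, b, q) for a, b in funcs]
    # Assumption of this sketch: shift keys to be non-negative so that integer
    # division by R merges neighbouring buckets (and, for large R, all of them).
    low = [min(min(row[i] for row in table), q_keys[i]) for i in range(m)]
    table = [[v - low[i] for i, v in enumerate(row)] for row in table]
    q_keys = [v - low[i] for i, v in enumerate(q_keys)]

    candidates = {}  # point index -> true Euclidean distance to q
    R = 1
    while True:
        for idx, row in enumerate(table):
            if idx in candidates:
                continue
            collisions = sum(1 for v, qv in zip(row, q_keys) if v // R == qv // R)
            if collisions >= l:  # frequent collider: verify with the real distance
                candidates[idx] = math.dist(data[idx], q)
                if len(candidates) >= k + beta_n:  # second terminating condition
                    break
        close = sum(1 for dist in candidates.values() if dist <= c * R)
        if close >= k or len(candidates) >= min(k + beta_n, len(data)):
            break
        R *= c  # virtual rehashing: widen every bucket by a factor of c
    return sorted(candidates, key=candidates.get)[:k]


if __name__ == "__main__":
    points = [[float(i), float(i % 3)] for i in range(20)]
    print(c2lsh_query(points, [7.0, 1.0], k=3))
```

A real implementation counts collisions incrementally while reading only the newly exposed buckets from disk; the sketch recounts from scratch at every radius to keep the control flow visible.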
QALSH [10] introduces query-aware hash functions $h_{\vec{a}}(o) = \vec{a} \cdot o$. For a query $q$, once the query projection is found by computing $h_{\vec{a}}(q)$, QALSH uses the query as the “anchor” to find the anchor bucket with width $w$, i.e. the interval $[h_{\vec{a}}(q) - \frac{w}{2}, h_{\vec{a}}(q) + \frac{w}{2}]$. If the projected location of a point $o$ falls in the same anchor bucket as $q$, i.e., $|h_{\vec{a}}(o) - h_{\vec{a}}(q)| \le \frac{w}{2}$, then QALSH considers that $o$ has collided with $q$ under $h_{\vec{a}}$. QALSH also utilizes the concepts of Collision Counting and Virtual Rehashing on top of its query-aware hash functions. Another main difference of QALSH is that it uses B+-trees to represent the hash tables. An exponential expansion in each hash table is thus the same as a range query on a B+-tree. By using query-aware hash functions and B+-trees, QALSH improves the theoretical bounds by reducing the total number of hash functions required to satisfy the quality guarantee. Additionally, QALSH can work for any approximation ratio, $c$, greater than 1, while C2LSH can only work for $c \ge 2$. While the reduction in the number of hash functions generates a smaller index, the overhead of using B+-trees makes QALSH much slower, as we experimentally show in Section 5.
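The query-aware bucket can be illustrated with a sorted list standing in for one B+-tree. The sketch below assumes the projection $h(o) = \vec{a} \cdot o$ and treats a point as colliding with the query at radius $R$ when $|h(o) - h(q)| \le wR/2$; the helper names `build_projection` and `colliding_points` are hypothetical, and Python's `bisect` module replaces the B+-tree range query.

```python
import bisect


def build_projection(data, a):
    """Sort (projection, index) pairs; a stand-in for one B+-tree over a . o."""
    return sorted((sum(ai * oi for ai, oi in zip(a, o)), idx)
                  for idx, o in enumerate(data))


def colliding_points(proj, a, q, w, R=1):
    """Indices of points whose projection lies in [h(q) - w*R/2, h(q) + w*R/2]."""
    hq = sum(ai * qi for ai, qi in zip(a, q))  # the query anchors the bucket
    half = w * R / 2.0
    lo = bisect.bisect_left(proj, (hq - half, -1))
    hi = bisect.bisect_right(proj, (hq + half, float("inf")))
    return [idx for _, idx in proj[lo:hi]]


if __name__ == "__main__":
    data = [[0.0], [1.0], [2.0], [5.0]]
    a = [1.0]
    proj = build_projection(data, a)
    for R in (1, 2, 4):  # virtual rehashing widens the range around the query
        print(R, colliding_points(proj, a, [1.2], w=2.0, R=R))
```

Since the bucket is centred on the query, a point just across a fixed bucket boundary is no longer missed, and each radius expansion becomes a wider range query over the same sorted structure.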
I-LSH [16] uses the query-aware hash functions proposed by QALSH and introduces an incremental expansion strategy to reduce the overall index I/Os. To do so, I-LSH finds the next closest point in each projection. While this process leads to fewer overall index I/Os, it still requires disk seeks and, as we show in Section 5, the algorithm overhead far exceeds the savings in disk I/Os.
5 Experimental Analysis
In this section, we first explain our carefully designed experimental evaluation plan. We experimentally analyze C2LSH, QALSH, and I-LSH on different datasets and report the results for varying criteria. All experiments were run on the nodes of the Bigdat cluster (supported by NSF Award #1337884) with the following specifications: two Intel Xeon E5-2695 processors, 256GB RAM, and the CentOS 6.5 operating system. All code was written in C++11 and compiled with gcc v4.7.2 with the -O3 optimization flag. As mentioned in Section 1.4, we extend the implementations of C2LSH and QALSH to be completely external memory-based (i.e. neither the entire dataset nor the index files need to be in the main memory in order to construct the LSH indexes).
We use the following six diverse high-dimensional datasets with varying cardinality and dimensionality:
P53 consists of 5409-dimensional points which are generated based on the biophysical features of mutant p53 proteins and can be used to predict p53 transcriptional activity. The values of this dataset are normalized between zero and and duplicate rows are removed.
Sift1M consists of 128-dimensional points that were created by running the SIFT feature extraction algorithm on real images. The values of this dataset are integers between zero and .
Mnist8M This dataset, also known as the InfiMNIST dataset, contains 784-dimensional points that represent grayscale images of the digits 0 to 9 of size 28 × 28.
Tiny80M This dataset contains 384-dimensional points generated using Gist feature extraction algorithm on million colored images and its values are normalized between zero and .
All datasets are normalized to contain only integers, since C2LSH requires integer-valued data.
5.2 Evaluation Criteria and Parameters
The goal of our paper is to present a detailed analysis of the performance of the state-of-the-art LSH techniques. We also compare the accuracy of these algorithms. We randomly choose 50 queries from the dataset and report the average of the results of these 50 queries. We used the same parameters suggested in their papers ( for QALSH and for C2LSH). We choose and (since C2LSH cannot give guarantees for ). Since our goal is to present the performance analysis of query processing, we do not compare the index size and index construction time of these three algorithms. Since I-LSH uses the same hash functions as QALSH, their index size and index construction time are the same. Prior work shows the difference between these two criteria for C2LSH and QALSH for different datasets, and hence we omit it in this paper.
After careful analysis of the performance of LSH techniques, we present the following breakdown of the Query Processing Time (QPT):
Index Read Cost: LSH techniques need to read index files (from the external memory) in order to find the candidates. This dominant cost of reading index files can be further broken down into the number of disk seeks (i.e. random I/Os) and the total amount of data read. Following prior work, we consider both the number of disk seeks and the amount of data read in our cost formulation.
Algorithm Time: Another dominant cost in LSH processing is the processing of index files once they are read into the main memory. LSH techniques need to find points that are considered as candidates. Techniques such as Collision Counting (explained in Section 4) are included in this cost.
False Positive Removal Cost: Once a point is deemed a candidate, the LSH technique brings the actual data point into the main memory (resulting in a random seek) to calculate its Euclidean distance to the query point. Since the state-of-the-art LSH techniques have an upper bound on the number of candidates that are generated, this cost is negligible compared to the previous two costs.
It is well-known that random I/Os are much more expensive than sequential I/Os. Additionally, the difference in cost changes significantly depending on whether the external storage medium is an HDD or an SSD. The difference between the costs of random I/Os and sequential I/Os is significantly larger on HDDs than on SSDs (mainly because random disk seeks are faster on SSDs than on HDDs). We noticed that the number of disk seeks differs significantly among these state-of-the-art LSH techniques due to their strategies for finding neighboring points in the projected spaces. Hence, we model the overall Query Processing Time (QPT) for both HDDs and SSDs. For an HDD, we use the reported benchmarks for the Seagate Barracuda HDD with 7200 RPM and 1TB: an average disk seek requires 8.5 ms and the average data read rate is 0.156 MB/ms [22]. Similarly, for an SSD, we use the reported benchmarks for the Seagate Barracuda 120 SSD with 1TB storage: an average disk seek requires 0.01 ms and the average data read rate is 0.56 MB/ms [21].
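The cost breakdown above can be turned into a small calculator. The additive formula used here (seeks times seek time, plus data read divided by read rate, plus algorithm time, plus false positive removal time) is an assumption of this sketch, since the text only lists the components and the device benchmarks; the names `Disk` and `query_processing_time_ms` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    seek_ms: float         # average cost of one random seek, in milliseconds
    read_mb_per_ms: float  # average sequential read rate, in MB per millisecond


# Benchmarks quoted in the text for the two storage devices.
HDD = Disk(seek_ms=8.5, read_mb_per_ms=0.156)
SSD = Disk(seek_ms=0.01, read_mb_per_ms=0.56)


def query_processing_time_ms(disk, num_seeks, data_read_mb, algorithm_ms, fp_removal_ms=0.0):
    """Modelled QPT: index read cost + algorithm time + false positive removal cost."""
    index_read_ms = num_seeks * disk.seek_ms + data_read_mb / disk.read_mb_per_ms
    return index_read_ms + algorithm_ms + fp_removal_ms


if __name__ == "__main__":
    # Hypothetical workload: 100 seeks, 15.6 MB of index data, 5 ms of algorithm time.
    for name, disk in (("HDD", HDD), ("SSD", SSD)):
        print(name, round(query_processing_time_ms(disk, 100, 15.6, 5.0), 2))
```

With these benchmarks, a single HDD seek costs as much as reading roughly 1.3 MB sequentially, which is why a technique that saves data volume but adds seeks can still lose on an HDD.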
We use the same accuracy measure, the overall ratio, used in several prior works [8, 10, 17, 16]: $\frac{1}{k}\sum_{i=1}^{k}\frac{\|o_i, q\|}{\|o_i^*, q\|}$. Here, $o_i$ is the $i$th point returned by the technique and $o_i^*$ is the true $i$th nearest point from $q$ (ground truth). A ratio of 1 means the returned results have the same distance from the query as the ground truth. The closer the ratio is to 1, the higher the accuracy of the LSH technique.
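As a small worked example of this measure, the sketch below assumes the usual definition of the overall ratio: the mean, over the $k$ returned positions, of the distance of the $i$th returned point to the query divided by the distance of the true $i$th nearest neighbor. Both lists are assumed to be sorted by distance to the query, and the ground-truth distances are assumed to be non-zero; the function name `overall_ratio` is illustrative.

```python
import math


def overall_ratio(query, returned, ground_truth):
    """Mean of dist(returned[i], q) / dist(ground_truth[i], q) over all k positions."""
    if len(returned) != len(ground_truth) or not returned:
        raise ValueError("both lists must contain the same non-zero number of points")
    ratios = [math.dist(o, query) / math.dist(t, query)
              for o, t in zip(returned, ground_truth)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    q = [0.0, 0.0]
    truth = [[1.0, 0.0], [0.0, 2.0]]
    approx = [[2.0, 0.0], [0.0, 2.0]]  # first neighbor is twice as far as the true one
    print(overall_ratio(q, truth, truth), overall_ratio(q, approx, truth))
```

In the example, the exact result scores 1 and the approximate result scores (2/1 + 2/2)/2 = 1.5, so larger values indicate neighbors that are proportionally farther away than the true ones.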
5.3 Discussion of the Performance Results
Number of Disk Seeks: Figure 1 shows the number of disk seeks (random I/Os) required by the evaluated techniques. The interesting observation is that I-LSH performs the best for the P53, LabelMe, Sift, and Deep datasets. However, its performance degrades as the dataset size becomes large (i.e. greater than approximately one million points). This is because I-LSH needs to find the closest projected point each time the radius needs to be expanded, which requires reading the indexed points from the disk several times. We also observe that QALSH performs better than C2LSH for smaller datasets (i.e. P53), but as the dataset size (number of points) increases, its number of seeks becomes significantly higher than that of C2LSH and I-LSH. This happens because the search radii of QALSH are larger than those of C2LSH for larger datasets, which results in more radius expansions and, in turn, more disk seeks.
Amount of Data Read: Figure 2 shows the total amount of data that was read from the index files. I-LSH always reads the least amount of data for all datasets because it incrementally searches for the nearest points in the projections instead of using buckets with fixed widths. However, we later show that these I/O savings are offset by the processing time of finding these nearest points. C2LSH reads more data than QALSH for most datasets (except Mnist) because it has more projections to process (QALSH uses fewer hash projections because they are query-aware).
Algorithm Time: Figure 3 shows the time needed by an algorithm to find the candidates (excluding the I/O times). This figure shows the huge overhead of I-LSH, which is caused by its incremental search for the nearest projected neighbors. Also, since I-LSH and QALSH both use B+-trees, which become huge for the larger datasets, their performance degrades heavily in these cases while searching for candidates. Since C2LSH does not have the overhead of additional index structures (such as B+-trees), it has the lowest Algorithm Time for all datasets. In terms of Algorithm Time, I-LSH is faster than QALSH (except for the P53 dataset, which is the smallest dataset in our experiments), mainly because it has to process fewer hash functions than QALSH.
False Positive Removal Time: We also analyzed the time it takes to read the actual data points from the external memory in order to calculate their Euclidean distance to the query (for removing false positives). Since all three algorithms have an upper bound on the number of candidates they produce, all algorithms took a similar amount of time, which was less than 0.5 ms. Due to space limitations, we do not show these results.
Query Processing Time (on HDD): Figure 4 shows the overall time required to solve a given k-NN query on a Hard Disk Drive. I-LSH performs the best for smaller datasets (P53 and LabelMe) because its Algorithm Time overhead is small, but as the dataset size increases, the Algorithm Time overhead offsets the savings in disk seeks and I-LSH performs worse than C2LSH (but better than QALSH). Except for the smallest dataset (P53), QALSH is the slowest of the three algorithms. It works well for smaller datasets (P53) but does not scale well to moderate and large datasets. For larger datasets, C2LSH is always the fastest technique since it has a better Algorithm Time and fewer disk seeks than the other two algorithms.
Query Processing Time (on SSD): Figure 5 shows the overall time required to solve a given k-NN query on a Solid State Drive. In SSDs, I/O operations are much faster and the overall Query Processing Time is mainly dominated by the algorithm time. Therefore, C2LSH (which has the best Algorithm time) always performs the best on SSDs (for all datasets) followed by I-LSH (except for the smallest dataset, P53).
Accuracy Ratio: Figure 6 shows the accuracy of the compared techniques. Having a ratio equal to 1 equates to highest accuracy. Except for the Mnist dataset, C2LSH produces the best accuracy among the three algorithms. QALSH is more accurate than I-LSH, which we believe is mainly because it uses more hash functions than I-LSH. Except for C2LSH’s accuracy on the Mnist dataset, all three algorithms produce accurate results for all datasets.
Overall, we find that C2LSH can find k-NN results faster than QALSH and I-LSH. Additionally, all three algorithms produce accurate results (with C2LSH producing slightly more accurate results than QALSH and I-LSH for most datasets).
Approximate similarity search in high-dimensional spaces has been an important problem in many diverse domains. In this paper, we focused on Locality Sensitive Hashing based techniques and presented a detailed experimental analysis of three well-known LSH algorithms: C2LSH, QALSH, and I-LSH. For this analysis, we used datasets of various sizes and different yet important evaluation metrics. The results showed that a technique that performs better for smaller datasets may not be scalable and may not work well for larger datasets. We also observed that improvements in one portion of the LSH pipeline (e.g. I/O operations) do not necessarily result in overall improvements. Thus, trade-offs and different evaluation metrics should always be considered when comparing different techniques. In the future, we plan to also analyze the effect of changing the user-defined parameters on the performance of different techniques.
[1] Arora, A., et al., “HD-Index: Pushing the scalability-accuracy boundary for approximate kNN search,” VLDB 2018.
[2] Babenko, A., et al., “Efficient indexing of billion-scale datasets of deep descriptors,” CVPR 2016.
[3] Bawa, M., et al., “LSH Forest: Self-tuning indexes for similarity search,” WWW 2005.
[4] Chávez, E., et al., “Searching in metric spaces,” CSUR 2001.
[5] Christiani, T., “Fast locality-sensitive hashing frameworks for approximate near neighbor search,” SISAP 2019.
[6] Danziger, S.A., et al., “Predicting positive p53 cancer rescue regions using most informative positive (MIP) active learning,” PLoS Computational Biology 2009.
[7] Datar, M., et al., “Locality-sensitive hashing scheme based on p-stable distributions,” SOCG 2004.
[8] Gan, J., et al., “Locality-sensitive hashing scheme based on dynamic collision counting,” SIGMOD 2012.
[9] Gionis, A., et al., “Similarity search in high dimensions via hashing,” VLDB 1999.
[10] Huang, Q., et al., “Query-aware locality-sensitive hashing for approximate nearest neighbor search,” VLDB 2015.
[11] Jegou, H., et al., “Product quantization for nearest neighbor search,” TPAMI 2010.
[12] Kim, A., et al., “Optimally leveraging density and locality for exploratory browsing and sampling,” HILDA 2018.
[13] Leis, V., et al., “Query optimization through the looking glass, and what we found running the join order benchmark,” VLDB 2018.
[14] Li, M., et al., “I/O efficient approximate nearest neighbour search based on learned functions,” ICDE 2020.
[15] Li, W., et al., “Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement,” TKDE 2019.
[16] Liu, W., et al., “I-LSH: I/O efficient c-approximate nearest neighbor search in high-dimensional space,” ICDE 2019.
[17] Liu, Y., et al., “SK-LSH: An efficient index structure for approximate nearest neighbor search,” VLDB 2014.
[18] Loosli, G., et al., “Training invariant support vector machines using selective sampling,” Large Scale Kernel Machines 2007.
[19] Lv, Q., et al., “Multi-probe LSH: Efficient indexing for high-dimensional similarity search,” VLDB 2007.
[20] Russell, B.C., et al., “LabelMe: a database and web-based tool for image annotation,” IJCV 2008.
[21] Seagate Barracuda 120 SSD Manual: https://www.seagate.com/www-content/datasheets/pdfs/barracuda-120-sata-DS2022-1-1909US-en_US.pdf
[22] Seagate ST2000DM001 Manual: https://www.seagate.com/files/staticfiles/docs/pdf/datasheet/disc/barracuda-ds1737-1-1111us.pdf
[23] Sun, Y., et al., “SRS: Solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index,” VLDB 2014.
[24] Tao, Y., et al., “Efficient and accurate nearest neighbor and closest pair search in high-dimensional space,” TODS 2010.
[25] Torralba, A., et al., “80 million tiny images: A large data set for nonparametric object and scene recognition,” TPAMI 2008.
[26] Zheng, B., et al., “PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search,” VLDB 2020.