spark-neighbors
Spark-based approximate nearest neighbor search using locality-sensitive hashing
Similarity search (nearest neighbor search) is the problem of finding, from a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work on locality-sensitive hashing. We divide the hashing algorithms into two main categories: locality-sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measures, and search schemes in the hash-coding space.
The problem of similarity search, also known as nearest neighbor search, proximity search, or close item search, is to find the item that is nearest to a query item, called the nearest neighbor, under some distance measure, from a search (reference) database. When the reference database is very large, or when the distance computation between the query item and a database item is costly, it is often computationally infeasible to find the exact nearest neighbor. Thus, a lot of research effort has been devoted to approximate nearest neighbor search, which has been shown to be sufficient and useful for many practical problems.
Hashing is one of the popular solutions for approximate nearest neighbor search. In general, hashing is an approach of transforming a data item into a low-dimensional representation, or equivalently a short code consisting of a sequence of bits. The application of hashing to approximate nearest neighbor search proceeds in two ways: indexing data items using hash tables, which are formed by storing the items with the same code in the same hash bucket, and approximating the distance by the one computed with short codes.
The former way regards the items lying in the buckets corresponding to the codes of the query as the nearest neighbor candidates, exploiting the locality-sensitive property that similar items have a higher probability of being mapped to the same code than dissimilar items. The main research efforts along this direction consist of designing hash functions satisfying the locality-sensitive property and designing efficient search schemes using, and going beyond, hash tables.
The latter way ranks the items according to the distances computed using the short codes, exploiting the property that distance computation over short codes is efficient. The main research effort along this direction is to design effective ways to compute the short codes and to design distance measures over the short codes that guarantee computational efficiency while preserving the similarity.
Nearest neighbor search, also known as similarity search, proximity search, or close item search, is defined as follows: Given a query item q, the goal is to find an item NN(q), called the nearest neighbor, from a set of items X = {x1, x2, …, xn}, so that NN(q) = arg min_{x∈X} d(q, x), where d(q, x) is a distance computed between q and x. A straightforward generalization is K-NN search, where the K nearest neighbors are to be found.
The problem is not fully specified without the distance between an arbitrary pair of items q and x. As a typical example, the search (reference) database lies in a d-dimensional space and the distance is induced by an ℓ_p norm, d(q, x) = ‖q − x‖_p. The search problem under the Euclidean distance, i.e., the ℓ_2 norm, is widely studied. Other notions of search database, such as each item being formed by a set, and other distance measures, such as the χ² distance, cosine similarity, and so on, are also possible.
The fixed-radius near neighbor (R-near neighbor) problem, an alternative to nearest neighbor search, is defined as follows: Given a query item q, the goal is to find the items x that are within distance R of q, i.e., d(q, x) ≤ R.
There exist efficient algorithms for the exact nearest neighbor and R-near neighbor search problems in low-dimensional cases. However, the problems become hard in the large-scale high-dimensional case, where most algorithms even take a higher computational cost than the naive solution, linear scan. Therefore, a lot of recent effort has moved to approximate nearest neighbor search problems. The (1 + ε)-approximate nearest neighbor search problem, ε > 0, is defined as follows: Given a query q, the goal is to find an item x so that d(q, x) ≤ (1 + ε) d(q, x*), where x* is the true nearest neighbor. The (1 + ε)-approximate R-near neighbor search problem is defined as follows: Given a query q, the goal is to find some item x, called a (1 + ε)-approximate R-near neighbor, so that d(q, x) ≤ (1 + ε)R.
The randomized search problem aims to report the (approximate) nearest (or near) neighbors with a certain probability instead of deterministically. There are two widely studied randomized search problems: randomized c-approximate R-near neighbor search and randomized R-near neighbor search. The former is defined as follows: Given a query q, the goal is to report some cR-near neighbor of the query with probability 1 − δ, where 0 < δ < 1. The latter is defined as follows: Given a query q, the goal is to report some R-near neighbor of the query with probability 1 − δ.
The hashing approach aims to map the reference and/or query items to target items so that approximate nearest neighbor search can be performed efficiently and accurately using the target items and possibly a small subset of the raw reference items. The target items are called hash codes (also known as hash values, or simply hashes). In this paper, we may also use the terms short code and compact code interchangeably.
Formally, a hash function is defined as y = h(x), where y is the hash code and h(·) is the function. In the application to approximate nearest neighbor search, usually several hash functions are used together to compute the hash code y = [y1, y2, …, yM], where ym = hm(x), m = 1, 2, …, M. Here we use a vector y to represent the hash code for presentation convenience. There are two basic strategies for using hash codes to perform nearest (near) neighbor search: hash table lookup and fast distance approximation.
The hash table is a data structure composed of buckets, each of which is indexed by a hash code. Each reference item x is placed into the bucket indexed by its hash code y(x). Different from conventional hashing algorithms in computer science, which avoid collisions (i.e., avoid mapping two items into the same bucket), the hashing approach using a hash table aims to maximize the probability of collision of near items. Given the query q, the items lying in the bucket indexed by y(q) are retrieved as near items of q.
To improve the recall, L hash tables are constructed, and the items lying in the L hash buckets y_l(q) (l = 1, …, L) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_l, needs to be a long code, which means that the total number of buckets is too large to index directly. Thus, only the nonempty buckets are retained, by resorting to conventional hashing of the hash codes y_l(x).
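As an illustration, the table construction and lookup just described can be sketched in a few lines of Python. This is a minimal sketch, not the survey's implementation: the per-bit functions here are sign random projections, and the helper names (`make_hash_fn`, `build_tables`, `query`) and the parameters `k` and `L` are our own choices.

```python
import random
from collections import defaultdict

random.seed(0)

def make_hash_fn(dim, k):
    """Sample k random hyperplanes; the code is the sign pattern of the projections."""
    planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    def h(x):
        return tuple(int(sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0)
                     for p in planes)
    return h

def build_tables(points, dim, k=8, L=4):
    """Build L independent hash tables, each keyed by a k-bit compound code."""
    fns = [make_hash_fn(dim, k) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for idx, x in enumerate(points):
        for h, t in zip(fns, tables):
            t[h(x)].append(idx)
    return fns, tables

def query(q, fns, tables):
    """Return the union of candidates from the L buckets the query falls into."""
    cands = set()
    for h, t in zip(fns, tables):
        cands.update(t.get(h(q), []))
    return cands
```

The candidates returned by `query` would then be reranked with true distances, as described below.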
The direct way is to perform an exhaustive search: compare the query with each reference item by quickly computing the distance between the query and the hash code of the reference item, and retrieve the reference items with the smallest distances as the candidate nearest neighbors. This is usually followed by a reranking step: rerank the candidates retrieved with hash codes according to the true distances computed using the original features, and attain the nearest neighbors or R-near neighbors.
This strategy exploits two advantages of hash codes. The first is that the distance between hash codes can be computed efficiently, at a cost much smaller than that of the computation in the input space. The second is that the hash codes are much smaller than the input features and hence can be loaded into memory, reducing the disk I/O cost when the original features are too large to fit in memory.
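The fast-distance-approximation strategy can be sketched as follows; packing each code into an integer makes the per-item distance a single XOR plus popcount. The function names are illustrative, not from the survey.

```python
def pack(bits):
    """Pack a bit sequence into a single integer code."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

def hamming(a, b):
    """Hamming distance between two packed codes via XOR + popcount."""
    return bin(a ^ b).count("1")

def rank_by_code(query_code, db_codes, topk=3):
    """Exhaustively rank database items by code distance to the query."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    return order[:topk]
```

The top-ranked indices would then be reranked using the original features.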
One practical way of speeding up the search is to perform a non-exhaustive search: first retrieve a set of candidates using an inverted index, and then compute the distances of the query to the candidates using the short codes. Other research efforts include organizing the hash codes with a data structure, such as a tree or a graph, to avoid exhaustive search.
The organization of the remaining part is given as follows. Section 3 presents the definition of the locality-sensitive hashing (LSH) family and the instances of LSH with various distances. Section 4 presents research on how to perform efficient search given LSH codes and on modeling and analyzing LSH. Sections 5, 6, and 7 review the learning-to-hash algorithms. Finally, Section 9 concludes this survey.
The term “locality-sensitive hashing” (LSH) was introduced in 1998 [42] to name a randomized hashing framework for efficient approximate nearest neighbor (ANN) search in high-dimensional space. It is based on the definition of an LSH family, a family of hash functions mapping similar input items to the same hash code with higher probability than dissimilar items. However, the first specific LSH family, min-hash, was invented in 1997 by Andrei Broder [11] for near-duplicate web page detection and clustering, and it remains one of the most popular LSH methods, extensively studied in theory and widely used in practice.
Locality-sensitive hashing was first studied by the theoretical computer science community. The theoretical research mainly focuses on three aspects. The first is on developing different LSH families for various distances or similarities, for example, p-stable distribution LSH for the ℓ_p distance [20], sign random projection (or sim-hash) for the angle-based distance [13], and min-hash for the Jaccard coefficient [11, 12]; many variants are developed based on these basic LSH families [19]. The second is on exploring the theoretical boundary of the LSH framework, including the bound on the search efficiency (both time and space) that the best possible LSH family can achieve for certain distances and similarities [20, 94, 105], the tight characteristics for a similarity measure to admit an LSH family [13, 16], and so on. The third focuses on improving the search scheme of the LSH methods, to achieve theoretically provable better search efficiency [107, 19].
Shortly after it was proposed by the theoretical computer science community, the database and related communities began to study LSH, aiming at building real database systems for high-dimensional similarity search. Research from this side mainly focuses on developing better data structures and search schemes that lead to better search quality and efficiency in practice [91, 25]. The quality criteria include precision and recall; the efficiency criteria commonly include query time, storage requirement, I/O consumption, and so on. Some of these works also provide theoretical guarantees on the search quality of their algorithms [25].

In recent years, LSH has attracted extensive attention from other communities, including computer vision (CV), machine learning, statistics, and natural language processing (NLP). For example, in computer vision, high-dimensional features are often required for various tasks, such as image matching and classification. LSH, as a probabilistic dimension reduction method, has been used in various CV applications, which often reduce to approximate nearest neighbor search
[17, 18]. However, the performance of LSH is limited by the fact that it is totally probabilistic and data-independent: it does not take the data distribution into account. On the other hand, inspired by LSH, the concept of “small code” or “compact code” has become the focus of many researchers in the CV community, and many learning-based hashing methods have come into being [135][125][126][30][83][139][127][29][82][130][31]. These methods aim at learning hash functions that better fit the data distribution and labeling information, thus overcoming the drawback of LSH. This line of research often takes LSH as the baseline for comparison.

The machine learning and statistics communities also contribute to the study of LSH. Research from this side often views LSH as a probabilistic similarity-preserving dimensionality reduction method, from which the produced hash codes can provide estimates of some pairwise distance or similarity. This part of the study mainly focuses on developing variants of LSH functions that provide an (unbiased) estimator of a certain distance or similarity with smaller variance [68, 52, 73, 51], smaller storage requirement for the hash codes [70, 71], or faster computation of hash functions [69, 73, 51, 118]. Besides, the machine learning community is also devoted to developing learning-based hashing methods.

In practice, LSH is widely and successfully used in the IT industry, for near-duplicate web page and image detection, clustering, and so on. Specifically, the AltaVista search engine uses min-hash to detect near-duplicate web pages [11, 12], while Google uses sim-hash to fulfill the same goal [92].
In the subsequent sections, we will first introduce different LSH families for various kinds of distances or similarities, and then review the studies focusing on the search scheme and the work devoted to modeling the LSH and ANN problems.
The locality-sensitive hashing (LSH) algorithm is introduced in [42, 27] to solve the R-near neighbor problem. It is based on the definition of an LSH family H, a family of hash functions mapping similar input items to the same hash code with higher probability than dissimilar items. Formally, an LSH family is defined as follows:

A family H of hash functions is called (r1, r2, p1, p2)-sensitive if for any two items x and y,

if d(x, y) ≤ r1, then Pr[h(x) = h(y)] ≥ p1,

if d(x, y) ≥ r2, then Pr[h(x) = h(y)] ≤ p2.

Here r1 < r2 and p1 > p2. The parameter ρ = log(1/p1)/log(1/p2) governs the search performance: the smaller ρ, the better the search performance. Given such an LSH family for a distance measure d, there exists an algorithm for the c-approximate R-near neighbor problem which uses O(dn + n^(1+ρ)) space, with query time dominated by O(n^ρ) distance computations and O(n^ρ log n) evaluations of hash functions [20].
The LSH scheme indexes all items in hash tables and searches for near items via hash table lookup. The hash table is a data structure composed of buckets, each of which is indexed by a hash code. Each reference item x is placed into the bucket g(x). Different from conventional hashing algorithms in computer science, which avoid collisions (i.e., avoid mapping two items into the same bucket), the LSH approach aims to maximize the probability of collision of near items. Given the query q, the items lying in the bucket g(q) are considered as near items of q.
Given an LSH family H, the LSH scheme amplifies the gap between the high probability p1 and the low probability p2 by concatenating several functions. In particular, for a parameter K, K functions h1, …, hK, chosen independently and uniformly at random from H, form a compound hash function g(x) = (h1(x), …, hK(x)). The output of this compound hash function identifies a bucket id in a hash table. However, the concatenation of K functions also reduces the chance of collision between similar items. To improve the recall, L such compound hash functions g1, …, gL are sampled independently, each of which corresponds to a hash table. These functions are used to hash each data point into L hash codes, and L hash tables are constructed to index the buckets corresponding to these hash codes respectively. The items lying in the L hash buckets g1(q), …, gL(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search).
In practice, to guarantee the precision, each of the hash codes g_l(x) needs to be a long code (i.e., K is large), and thus the total number of possible buckets is too large to index directly. Therefore, only the nonempty buckets are retained, by resorting to conventional hashing of the hash codes g_l(x).
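The effect of the parameters K and L can be made concrete: if a single function from the family collides on a given pair with probability p, a table of K concatenated functions collides with probability p^K, and at least one of L independent tables collides with probability 1 − (1 − p^K)^L. A small sketch (the function name is ours):

```python
def table_collision_prob(p, K, L):
    """Probability that a pair colliding with per-function probability p
    falls into the same bucket in at least one of L tables, each keyed
    by K concatenated hash functions."""
    return 1.0 - (1.0 - p ** K) ** L
```

Increasing K sharpens the gap between near pairs (p close to p1) and far pairs (p close to p2); increasing L recovers the recall lost by the concatenation.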
There are different kinds of LSH families for different distances or similarities, including the ℓ_p distance, the cosine or angular distance, the Hamming distance, the Jaccard coefficient, and so on.
The LSH scheme based on p-stable distributions, presented in [20], is designed to solve the search problem under the ℓ_p distance, where p ∈ (0, 2]. A p-stable distribution is defined as follows: A distribution D is called p-stable if for any n real numbers v1, …, vn and i.i.d. variables X1, …, Xn with distribution D, the random variable Σ_i vi Xi has the same distribution as the variable (Σ_i |vi|^p)^(1/p) X, where X is a random variable with distribution D. The well-known Gaussian distribution N(0, 1), defined by the density function f(x) = (1/√(2π)) e^(−x²/2), is 2-stable.
In the case that p = 1, the exponent ρ is equal to 1/c, and it is later shown in [94] that a substantially smaller exponent is impossible within the LSH framework. Recent study in [105] provides more lower-bound analysis for the Hamming distance, Euclidean distance, and Jaccard distance.
The LSH scheme using the p-stable distribution to generate hash codes is described as follows. The hash function is formulated as h(x) = ⌊(aᵀx + b)/w⌋. Here, a is a d-dimensional vector with entries chosen independently from a p-stable distribution, b is a real number chosen uniformly from the range [0, w], and w is the window size, a positive real number.
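A minimal sketch of this hash function for the Euclidean case (p = 2, Gaussian entries); the factory name `pstable_hash` is ours:

```python
import random

random.seed(1)

def pstable_hash(dim, w):
    """One p-stable LSH function for p = 2:
    h(x) = floor((a . x + b) / w), with a ~ N(0, I) and b ~ U[0, w)."""
    a = [random.gauss(0, 1) for _ in range(dim)]
    b = random.uniform(0, w)
    def h(x):
        return int((sum(ai * xi for ai, xi in zip(a, x)) + b) // w)
    return h
```

Nearby points fall into the same window (and hence the same bucket) with high probability when w is large relative to their distance.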
The following equation can be proved:

Pr[h(x) = h(y)] = ∫₀^w (1/c) f_p(t/c) (1 − t/w) dt,   (1)

where c = ‖x − y‖_p and f_p denotes the probability density function of the absolute value of the p-stable distribution, which means that such a hash function belongs to the LSH family under the ℓ_p distance.
Specifically, to solve the search problem under the Euclidean distance, the 2-stable distribution, i.e., the Gaussian distribution, is chosen to generate the random projection a. In this case (p = 2), the exponent ρ drops strictly below 1/c for some (carefully chosen) finite value of w.
It is claimed that uniform quantization [72] without the offset b, i.e., h(x) = ⌊aᵀx/w⌋, is more accurate and uses fewer bits than the scheme with the offset.
Leech lattice LSH [1] is an LSH algorithm for search in the Euclidean space. It is a multi-dimensional version of the aforementioned approach: the data points are first randomly projected into a t-dimensional space, where t is a small super-constant (t = 1 in the aforementioned approach). The space is then partitioned into cells using the Leech lattice, a constellation in 24 dimensions. The nearest point in the Leech lattice can be found using a (bounded) decoder which performs only a few hundred floating-point operations per decoded point, and the resulting exponent ρ is quite attractive. The E8 lattice may be used instead because its decoding is much cheaper than that of the Leech lattice (though its quantization performance is slightly worse). A comparison of LSH methods for the Euclidean distance is given in [108].
Spherical LSH [123] is an LSH algorithm designed for points that lie on a unit hypersphere in the Euclidean space. The idea is to consider a regular polytope (a simplex, orthoplex, or hypercube, for example) that is inscribed in the hypersphere and rotated at random. The hash function maps a vector on the hypersphere to the closest polytope vertex, meaning that the buckets of the hash function are the Voronoi cells of the polytope vertices. Though there is no theoretical analysis of the exponent ρ, Monte Carlo simulation shows an improvement over the Leech lattice approach [1].
Beyond LSH [3] improves ANN search in the Euclidean space, specifically solving c-ANN. It consists of a two-level hashing structure: an outer hash table and inner hash tables. The outer hash scheme aims to partition the data into buckets, with a filtering process ensuring that the distance between any pair of points within a bucket is bounded by a threshold, and to find an approximation to the minimum enclosing ball of the remaining points. The inner hash tables are constructed by first computing the center of the ball corresponding to a nonempty bucket in the outer hash table and then partitioning the points belonging to the ball into a set of overlapping subsets, such that within each subset the differences of the distances of the points to the center are bounded, as is the distance of the overlapped area to the center. For each subset, an LSH scheme is applied. The query process first locates a bucket in the outer hash table for a query. If the bucket is empty, the algorithm stops. If the distance of the query to the bucket center is small enough, the points in the bucket are output as the results. Otherwise, the process further checks the subsets in the bucket whose distances to the query lie in a specific range and then performs the LSH query in those subsets.
The LSH algorithm based on random projection [2, 13], also known as sim-hash, is developed to solve the near neighbor search problem under the angle between vectors, θ(x, y). The hash function is formulated as h(x) = sgn(wᵀx), where w follows the standard Gaussian distribution. It is easily shown that Pr[h(x) = h(y)] = 1 − θ(x, y)/π, where θ(x, y) is the angle between x and y; thus such a hash function belongs to the LSH family with the angle-based distance.
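A sketch of one sign-random-projection bit together with its analytic collision probability (function names are ours):

```python
import math
import random

random.seed(2)

def simhash_bit(w, x):
    """One sign-random-projection bit: 1 if w . x >= 0, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= 0)

def collision_prob(theta):
    """Pr[h(x) = h(y)] = 1 - theta/pi for angle theta between x and y."""
    return 1.0 - theta / math.pi
```

Identical directions always collide, opposite directions never do, and the probability interpolates linearly in the angle between them.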
Super-bit LSH [52] aims to improve the above hashing functions for the arccos (angular) similarity by dividing the random projections into groups and then orthogonalizing the random projections within each group, obtaining a new set of projections and thus so-called super-bits. It is shown that the Hamming distance over the super-bits is an unbiased estimator of the angular distance, with a variance smaller than that of the above random projection algorithm.
Kernel LSH [64, 65] aims to build LSH functions with the angle defined in the kernel space, θ(x, y) = arccos(k(x, y)/√(k(x, x) k(y, y))). The key challenge lies in constructing a projection vector that follows the Gaussian distribution in the kernel space. Define z_t = (1/t) Σ_{x∈S} φ(x), where t is a natural number and S is a set of t database items chosen i.i.d. The central limit theorem shows that, for sufficiently large t, the random variable z̃_t = √t Σ^(−1/2)(z_t − μ) follows a Normal distribution N(0, I). Then the hash function is given as

h(φ(x)) = sgn(φ(x)ᵀ z̃_t).   (2)

The covariance matrix Σ and the mean μ are estimated over a set of randomly chosen database items, using a technique similar to that used in kernel principal component analysis.
Multi-kernel LSH [133, 132] uses multiple kernels instead of a single kernel to form the hash functions, assigning the same number of bits to each kernel hash function. A boosted version of multi-kernel LSH is presented in [137], which adopts a boosting scheme to automatically assign various numbers of bits to each kernel hash function.
Semi-supervised LSH [45, 46, 66] first learns a Mahalanobis metric M from the semi-supervised information and then forms the hash functions according to the pairwise similarity xᵀMy, where M is the metric learnt from the semi-supervised information. An extension, distribution-aware LSH [146], is proposed, which partitions the data along each projection direction into multiple parts instead of only two.
Concomitant LSH [23] is an LSH algorithm that uses concomitant rank order statistics to form the hash functions for the cosine similarity. There are two schemes: concomitant min hash and concomitant min k-multi-hash.
Concomitant min hash is formulated as follows: generate D random projections w1, …, wD, each of which is drawn independently from the standard normal distribution N(0, I). The hash code is computed in two steps: compute the projections along the D projection directions, then output the index of the projection direction along which the projection value is the smallest, formally written as h(x) = arg min_{d∈{1,…,D}} wdᵀx. It is shown that the collision probability Pr[h(x) = h(y)] is a monotonically increasing function with respect to the cosine similarity between x and y.
Concomitant min k-multi-hash instead generates k hash codes: the indices of the projection directions along which the projection values are the top k smallest. It can be shown that the collision probability is similar to that of concomitant min hash.
Generating a hash code over D projection directions means that D random projections and D vector multiplications are required, which is too costly. To solve this problem, a cascading scheme is adopted: e.g., generate two concomitant hash functions, each over √D projection directions, and compose them, yielding a code with the same range of D values while only requiring 2√D random projections and vector multiplications. Two such schemes are proposed in [23]: cascade concomitant min & max hash, which composes the codes computed from the smallest and the largest projection values, and its multi-hash counterpart, which is formed using the indices of the top k smallest and largest projection values.
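The basic concomitant min hash is only a few lines; this sketch (the factory name is ours) returns the index of the smallest projection:

```python
import random

random.seed(3)

def concomitant_min_hash(dim, D):
    """Draw D Gaussian directions; the hash of x is the index of the
    direction with the smallest projection value."""
    dirs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(D)]
    def h(x):
        projs = [sum(wi * xi for wi, xi in zip(w, x)) for w in dirs]
        return min(range(D), key=projs.__getitem__)
    return h
```

Note that positively scaling the input scales all projections by the same factor, so the hash depends only on the direction of x, as a cosine-similarity hash should.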
The goal of searching the nearest neighbors to a query hyperplane is to retrieve the points x from the database X that are closest to a query hyperplane whose normal is given by n. The Euclidean distance of a point x to the hyperplane with normal n is:

d(x, hyperplane) = |nᵀx| / ‖n‖.   (3)
The hyperplane hashing family [47, 124], under the assumption that the hyperplane passes through the origin and that the data points x and the normal n have unit norm (which indicates that hyperplane hashing corresponds to search with minimum absolute cosine similarity), is defined as follows,

h(z) = [hu(z), hv(z)] if z is a database point, and h(z) = [hu(z), hv(−z)] if z is a query hyperplane normal.   (4)

Here hu(a) = sgn(uᵀa) and hv(a) = sgn(vᵀa), where u and v are sampled independently from a standard Gaussian distribution.
It is shown that the above hashing family belongs to LSH: it is (r, r(1+ε), p1, p2)-sensitive for the angle distance d(x, n) = (θ_{x,n} − π/2)², where ε > 0. This angle distance is equivalent to the distance of a point to the query hyperplane.
The family below, called XOR 1-bit hyperplane hashing,

h(z) = hu(z) ⊕ hv(z) if z is a database point, and h(z) = hu(z) ⊕ hv(−z) if z is a query hyperplane normal,   (5)

is shown to be (r, r(1+ε), p1, p2)-sensitive for the angle distance d(x, n) = (θ_{x,n} − π/2)², where ε > 0.
Embedded hyperplane hashing transforms a database vector x (and similarly the normal n of the query hyperplane) into a high-dimensional vector,

V(a) = vec(aaᵀ) = [a1², a1a2, …, a1ad, a2a1, …, ad²]ᵀ.   (6)

Assuming x and n to be unit vectors, the Euclidean distance between the embeddings V(x) and −V(n) is √(2 + 2cos²θ_{x,n}), which means that minimizing the distance between the two embeddings is equivalent to minimizing |cos θ_{x,n}|.
The embedded hyperplane hash function family is defined as

h(z) = sgn(wᵀV(z)) if z is a database point, and h(z) = sgn(−wᵀV(z)) if z is a query hyperplane normal.   (7)

It is shown to be (r, r(1+ε), p1, p2)-sensitive for the angle distance d(x, n) = (θ_{x,n} − π/2)², where ε > 0.
It is also shown that the exponent ρ for embedded hyperplane hashing is similar to that for XOR 1-bit hyperplane hashing and stronger (smaller) than that for hyperplane hashing.
An LSH function for the Hamming distance over binary vectors is proposed in [42]: h(x) = xi, where i is a randomly sampled index. It can be shown that Pr[h(x) = h(y)] = 1 − Ham(x, y)/d, where d is the vector dimension. It is proven that the exponent ρ is 1/c.
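Bit sampling is arguably the simplest LSH family; a sketch (the name is ours):

```python
import random

random.seed(4)

def bit_sampling_hash(dim):
    """LSH for the Hamming distance on binary vectors: return one
    randomly chosen coordinate, so Pr[h(x) = h(y)] = 1 - Ham(x, y)/dim."""
    i = random.randrange(dim)
    return lambda x: x[i]
```

Two vectors collide under this hash exactly when they agree on the sampled coordinate, which happens on a 1 − Ham(x, y)/dim fraction of the coordinates.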
The Jaccard coefficient, a similarity measure between two sets A, B ⊆ U, is defined as sim(A, B) = |A ∩ B| / |A ∪ B|. Its corresponding distance is taken as 1 − sim(A, B). Min-hash [11, 12] is an LSH function for the Jaccard similarity. Min-hash is defined as follows: pick a random permutation π of the ground universe U, and define h(A) = min_{a∈A} π(a). It is easily shown that Pr[h(A) = h(B)] = sim(A, B). Given K hash values of two sets, the Jaccard similarity is estimated as the fraction of the K permutations on which the two min-hash values agree, where each hash corresponds to a random permutation that is independently generated.
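A minimal min-hash sketch with explicit random permutations (the helper names are ours; production implementations typically use universal hash functions instead of stored permutations):

```python
import random

random.seed(5)

def minhash_fns(universe_size, num_hashes):
    """Build a signature function from num_hashes random permutations
    of the ground universe {0, ..., universe_size - 1}."""
    perms = []
    for _ in range(num_hashes):
        p = list(range(universe_size))
        random.shuffle(p)
        perms.append(p)
    def signature(s):
        # Min position of the set's elements under each permutation.
        return [min(p[e] for e in s) for p in perms]
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching min-hash values estimates |A n B| / |A u B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical sets always produce identical signatures, and disjoint sets never share a min-hash value under the same permutation.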
The k-min sketch [11, 12] is a generalization of the min-wise sketch used for min-hash, forming the hash values from the k smallest nonzeros under a single permutation. It also provides an unbiased estimator of the Jaccard coefficient, with a smaller variance, but unlike min-hash it cannot be used for approximate nearest neighbor search with hash tables. Conditional random sampling [68, 67] also takes the k smallest nonzeros from one permutation, and is shown to be a more accurate similarity estimator. One-permutation hashing [73] also uses a single permutation, but breaks the space into k bins, stores the smallest nonzero position in each bin, and concatenates them to generate a sketch. However, it is not directly applicable to nearest neighbor search by building hash tables, because of empty bins. This issue is solved by performing rotation over one-permutation hashing [118]: if a bin is empty, the hashed value from the first nonempty bin on the right (circularly) is borrowed as the key of this bin, which supplies an unbiased estimate of the resemblance, unlike [73].
Min-max hash [51], instead of keeping only the smallest hash value of each random permutation, keeps both the smallest and the largest values of each random permutation. Min-max hash can thus generate 2K hash values using only K random permutations, while still providing an unbiased estimator of the Jaccard coefficient, with a slightly smaller variance than min-hash.
Sim-min-hash [149] extends min-hash to compare sets of real-valued vectors. This approach first quantizes the real-valued vectors and assigns an index (word) to each real-valued vector. Then, like conventional min-hash, several random permutations are used to generate the hash keys. The difference is that the similarity is estimated in a weighted form: for each matching word, the real-valued vectors (or Hamming embeddings) assigned to that word are compared under a similarity measure.
χ²-LSH [33] is a locality-sensitive hashing function for the χ² distance. The χ² distance over two vectors x and y is defined as

χ²(x, y) = √( Σ_{i=1}^{d} (xi − yi)² / (xi + yi) ).   (8)

The distance can also be defined without the square root, and the developments below still hold by substituting accordingly in all the equations.

The χ²-LSH function is defined as

(9)

where each entry of the projection vector is drawn from a 2-stable distribution and the offset is drawn from a uniform distribution over the quantization window. It can be shown that

(10)

where the density appearing in the collision probability denotes the probability density function of the absolute value of the 2-stable distribution, N(0, 1). It can be shown that the collision probability decreases monotonically with respect to the χ² distance; thus the function belongs to the LSH family.
Winner-take-all (WTA) hash [140] is a sparse embedding method that transforms the input feature space into binary codes such that the Hamming distance in the resulting space closely correlates with a rank-similarity measure. The rank-similarity measure is shown to be more useful for high-dimensional features than the Euclidean distance, in particular for normalized feature vectors (e.g., with ℓ2 norm equal to 1). The similarity measure used is a pairwise-order function that counts, over pairs of dimensions, the agreements in the relative ordering of the corresponding feature values of the two vectors.
WTA hash generates a set of K random permutations. Each permutation πk is used to reorder the elements of x, yielding a new vector x̄. The kth hash code is computed as the index of the maximum element among the first T elements of x̄, taking a value between 0 and T − 1. The final hash code is a concatenation of the K values, each corresponding to a permutation. It is shown that WTA hash codes satisfy the LSH property and that min-hash is a special case of WTA hash.
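A sketch of WTA hash (the names and parameters are ours); note that the code depends only on the ordering of the feature values, not on their magnitudes:

```python
import random

random.seed(6)

def wta_hash(dim, num_perms, window):
    """Winner-take-all hash: for each random permutation, take the
    index of the maximum among the first `window` permuted elements."""
    perms = []
    for _ in range(num_perms):
        p = list(range(dim))
        random.shuffle(p)
        perms.append(p[:window])
    def h(x):
        return [max(range(window), key=lambda j: x[p[j]]) for p in perms]
    return h
```

Because only the argmax matters, any monotonically increasing transformation of the features leaves the code unchanged, which is the rank-correlation property the method is built on.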
Locality-sensitive binary coding using shift-invariant kernel hashing [109] exploits the property that a binary mapping of the original data can be guaranteed to preserve the value of a shift-invariant kernel, via random Fourier features (RFF) [110]. The RFF is defined as

ψ(x) = √2 cos(wᵀx + b),   (13)

where w is drawn from the distribution induced by the kernel and b ~ U[0, 2π]. For example, for the Gaussian kernel K(x − y) = exp(−γ‖x − y‖²), w ~ N(0, 2γI). It can be shown that E[ψ(x)ψ(y)] = K(x − y).
The binary code is computed as

h(x) = (1/2)(1 + sgn(cos(wᵀx + b) + t)),   (14)

where t is a random threshold, t ~ U[−1, 1]. It is shown that the normalized Hamming distance (i.e., the Hamming distance divided by the number of bits in the code string) is both lower bounded and upper bounded in terms of the kernel value, and that the codes therefore preserve the similarity in a probabilistic way.
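A sketch of this binary coding for the Gaussian kernel (the factory name and parameter names are ours):

```python
import math
import random

random.seed(7)

def rff_binary_code(dim, gamma, num_bits):
    """Binary code from random Fourier features of a Gaussian kernel:
    bit = 1 if cos(w . x + b) >= t, with w ~ N(0, 2*gamma*I),
    b ~ U[0, 2*pi), and a random threshold t ~ U[-1, 1]."""
    params = []
    for _ in range(num_bits):
        w = [random.gauss(0, math.sqrt(2 * gamma)) for _ in range(dim)]
        b = random.uniform(0, 2 * math.pi)
        t = random.uniform(-1, 1)
        params.append((w, b, t))
    def code(x):
        return [int(math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b) >= t)
                for (w, b, t) in params]
    return code
```

The normalized Hamming distance between the codes of two points concentrates around a quantity determined by the kernel value between them.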
Non-metric LSH [98] extends LSH to non-metric data by embedding the data from the original space into an implicit reproducing kernel Kreĭn space where the hash function is defined. The Kreĭn space K, with an indefinite inner product, admits an orthogonal decomposition as a direct sum K = K+ ⊕ K−, where K+ and K− are separable Hilbert spaces with their corresponding positive definite inner products. The indefinite inner product is then computed as

⟨x, y⟩_K = ⟨x+, y+⟩_{K+} − ⟨x−, y−⟩_{K−}.   (15)

Given the orthogonality of K+ and K−, the pairwise distance in K is computed as

d²_K(x, y) = d²_{K+}(x+, y+) − d²_{K−}(x−, y−).   (16)

The projections onto the spaces with definite inner products, K+ and K−, can be computed using the technique of kernel LSH, denoted by u+ and u−, respectively. The hash function, whose input is x and whose output is two binary bits, is defined as

(17)

where the projections are assumed to be normalized and a threshold is drawn uniformly at random. It can be shown that the collision probability is monotone with respect to the distance, which indicates that the hash function belongs to the LSH family.
The basic idea of distance-based hashing (DBH) [4] is to use a line projection function
$f_{x_1, x_2}(x) = \dfrac{d^2(x, x_1) + d^2(x_1, x_2) - d^2(x, x_2)}{2\, d(x_1, x_2)}$  (18)
to formulate a hash function,
$h(x) = 1 \text{ if } f_{x_1, x_2}(x) \in [t_1, t_2], \text{ and } 0 \text{ otherwise}.$  (19)
Here, $x_1$ and $x_2$ are randomly selected data items, $d(\cdot, \cdot)$ is the distance measure, and $t_1$ and $t_2$ are two thresholds, selected so that half of the data items are hashed to $1$ and the other half to $0$.
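The projection in Equation (18) needs only pairwise distances, which is what makes DBH applicable to arbitrary distance measures. A minimal pure-Python sketch (function names and the toy dataset are mine, not from [4]) of the projection, the thresholded bit, and the median-splitting choice of thresholds:

```python
import math

def line_projection(x, x1, x2, dist):
    """FastMap-style projection of x onto the 'line' through the pivots
    x1, x2, computed from pairwise distances only."""
    d1, d2, d12 = dist(x, x1), dist(x, x2), dist(x1, x2)
    return (d1 * d1 + d12 * d12 - d2 * d2) / (2.0 * d12)

def dbh_bit(x, x1, x2, t1, t2, dist):
    """Hash to 1 if the projection falls inside [t1, t2], else 0."""
    return 1 if t1 <= line_projection(x, x1, x2, dist) <= t2 else 0

euclid = math.dist
data = [[float(i), float((i * 7) % 11)] for i in range(20)]
x1, x2 = data[0], data[-1]  # two randomly chosen pivots (fixed here)

# Choose [t1, t2] as the middle half of the projections, so that roughly
# half of the items hash to 1.
projs = sorted(line_projection(p, x1, x2, euclid) for p in data)
t1, t2 = projs[len(projs) // 4], projs[3 * len(projs) // 4]
bits = [dbh_bit(p, x1, x2, t1, t2, euclid) for p in data]
```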
Similar to LSH, distance-based hashing generates a compound hash function from several distance-based hash functions and, accordingly, several compound hash functions, yielding multiple hash tables. However, it cannot be shown that the theoretical guarantee of LSH holds for DBH. Some other schemes are discussed in [4], including optimizing the thresholds $t_1$ and $t_2$ from the dataset, applying DBH hierarchically so that different sets of queries use different parameters, and so on.
The entropy-based search algorithm [107], given a query point $q$, picks a set of random points from $B(q, r)$, the ball centered at $q$ with radius $r$, and searches in the buckets that these points are hashed to, in order to find the $r$-near neighbors of $q$. The number of sampled points is governed by the entropy of the hash value of a random point in $B(q, r)$ and by the upper bound on the probability that two points that are at least a certain distance apart will be hashed to the same bucket. In addition, the search algorithm suggests building a single hash table with a sufficiently large number of hash bits.
The paper [107] presents theoretical evidence guaranteeing the search quality.
LSH forest [9] represents each hash table built from LSH as a tree, pruning subtrees (nodes) that contain no database points and restricting the depth of each leaf node to be no larger than a threshold. Different from the conventional scheme, which finds candidates in the hash buckets corresponding to the hash codes of the query point, the search algorithm finds the points contained in subtrees of the LSH forest having the largest prefix match, using a two-phase approach: the first, top-down phase descends each LSH tree to find the leaf having the largest prefix match with the hash code of the query; the second, bottom-up phase backtracks each tree from the leaf discovered in the first phase, in largest-prefix-match-first order, to find subtrees having the largest prefix match with the hash code of the query.
The basic idea of adaptive LSH [48] is to select the most relevant hash codes based on a relevance value. The relevance value is computed by accumulating the differences between the projection value and the mean of the corresponding line segment along the projection direction (equivalently, the difference between the projection values along the projection directions and the center of the corresponding bucket).
The basic idea of multi-probe LSH [91] is to intelligently probe multiple buckets in a hash table that are likely to contain query results, even though their hash values may not be the same as the hash value of the query vector. Given a query $q$ with hash code $g(q)$, multi-probe LSH finds a sequence of hash perturbation vectors $\{\Delta_1, \Delta_2, \dots\}$ and sequentially probes the hash buckets $g(q) + \Delta_1, g(q) + \Delta_2, \dots$. A score, computed as $\sum_i x_i(\delta_i)^2$, where $x_i(\delta_i)$ is the distance of the query's projection from the boundary of the perturbed slot along the $i$th hash function, is used to sort the perturbation vectors, so that the buckets are accessed in order of increasing score. The paper [91] also proposes to use the expectation $\mathbb{E}[x_i(\delta_i)^2]$, estimated under the assumption that the projection is uniformly distributed within a slot of width $W$ ($W$ is the width of the hash function used for Euclidean LSH), in place of the exact score for sorting the perturbation vectors. Compared with conventional LSH, to achieve the same search quality, multi-probe LSH has similar time efficiency while reducing the number of hash tables by an order of magnitude.
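A simplified illustration of the probing order for single-coordinate perturbations (the common special case of $\pm 1$ perturbations on one slot at a time; function names and the slot width are mine, not from [91]): the closer the query's projection lies to a slot boundary, the earlier the neighboring slot across that boundary is probed.

```python
W = 4.0  # slot width of the E2LSH-style hash h(v) = floor((w.v + b) / W)

def bucket(proj):
    """Compound hash code: one slot index per projection."""
    return tuple(int(p // W) for p in proj)

def probe_sequence(proj, n_probes):
    """Order single-coordinate perturbations (+1 or -1) by the distance of
    the projection from the corresponding slot boundary, smallest first."""
    base = bucket(proj)
    cands = []
    for i, p in enumerate(proj):
        offset = p - (p // W) * W          # position inside the slot, [0, W)
        cands.append((W - offset, i, +1))  # distance to the upper boundary
        cands.append((offset, i, -1))      # distance to the lower boundary
    cands.sort()
    seq = [base]
    for _, i, delta in cands[:n_probes - 1]:
        b = list(base)
        b[i] += delta
        seq.append(tuple(b))
    return seq

# Projection 7.9 sits 0.1 below a boundary, so its upper neighbor is
# probed first after the base bucket.
seq = probe_sequence([0.4, 7.9, -2.3], 4)
```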
The a posteriori multi-probe LSH algorithm presented in [56] gives a probabilistic interpretation of multi-probe LSH and presents a probabilistic score for sorting the perturbation vectors. The probabilistic score computation builds on two ideas: the likelihood, i.e., the property that the difference of the projections of two vectors along a random projection direction drawn from a Gaussian distribution itself follows a Gaussian distribution; and the prior, i.e., estimating the distribution of the neighboring points of a point from the training query points and their neighbors, under the assumption that the neighbors of a query point follow a Gaussian distribution.
The collision counting LSH scheme introduced in [25] uses a base of single hash functions to construct dynamic compound hash functions, instead of static compound hash functions each composed of a fixed number of hash functions. This scheme regards a data vector that collides with the query vector over at least a threshold number of the single hash functions in the base as a good $cR$-NN candidate. The theoretical analysis shows that, with the base size and the collision threshold appropriately chosen, such a scheme can guarantee the search quality. In the case that no data is returned for a query (i.e., no data vector reaches the collision threshold with the query), a virtual reranking scheme is presented, whose essential idea is to gradually expand the window width of the hash function in E2LSH to increase the collision chance, until enough data vectors reaching the collision threshold are found.
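The collision-counting idea can be sketched in a few lines. This is a toy pure-Python illustration (the base size, threshold, and names are mine, not from [25]): a base of E2LSH-style hash functions is built, and a point is kept as a candidate only if it collides with the query on at least a threshold number of them.

```python
import random

def collision_count(code_q, code_x):
    """Number of single hash functions on which two items collide."""
    return sum(a == b for a, b in zip(code_q, code_x))

# A base of m single E2LSH-style hash functions h(v) = floor((w.v + b) / W).
random.seed(1)
d, m, W = 3, 20, 4.0
base = [([random.gauss(0.0, 1.0) for _ in range(d)], random.uniform(0.0, W))
        for _ in range(m)]

def code(v):
    return [int((sum(wi * vi for wi, vi in zip(w, v)) + b) // W)
            for w, b in base]

q = [0.0, 0.0, 0.0]
near = [0.1, -0.1, 0.05]
far = [9.0, -7.0, 8.0]

# Keep a point as a candidate only if it reaches the collision threshold.
threshold = m // 2
is_candidate = collision_count(code(q), code(near)) >= threshold
```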
The goal of Bayesian LSH [113] is to estimate the probability distribution of the true similarity of a pair, given that a certain number of bits match between the hash codes of the query vector and a NN candidate, and to prune the candidate if the probability that the similarity exceeds a threshold is too small. In addition, if the probability mass concentrates near the mode of the distribution, the similarity evaluation is stopped early and such a pair is regarded as similar enough, which serves as an alternative to computing the exact similarity of the pair in the original space. The paper [113] instantiates Bayesian LSH for the Jaccard similarity and the arccos similarity.

Fast LSH [19] presents two algorithms, ACHash and DHHash, that formulate compound hash functions. ACHash preconditions the input vector using a random diagonal matrix and a Hadamard transform, and then applies a sparse Gaussian matrix followed by a rounding. DHHash performs the same preconditioning and then applies a random permutation, followed by a random diagonal Gaussian matrix and another Hadamard transform. It is shown that both ACHash and DHHash compute the hash codes in less time than the straightforward scheme. The algorithms are also extended to the angle-based similarity, where the query time to approximate the angle between two vectors is reduced.
The first level of bi-level LSH [106] uses a random-projection tree to divide the dataset into subgroups with bounded aspect ratios. The second level is an LSH table, which is basically implemented by randomly projecting data points into a low-dimensional space and then partitioning the low-dimensional space into cells. The table is enhanced using a hierarchical structure. The hierarchy, implemented using the space-filling Morton curve (a.k.a. the Lebesgue or Z-order curve), is useful when not enough candidates are retrieved for the multi-probe LSH algorithm. In addition, the $E_8$ lattice is used for partitioning the low-dimensional space, to overcome the curse of dimensionality caused by the basic $\mathbb{Z}^d$ lattice.

SortingKeys LSH [88] aims at improving the search scheme of LSH by reducing random I/O operations when retrieving candidate data points. The paper defines a distance measure between compound hash keys to estimate the true distance between data points and introduces a linear order on the set of compound hash keys. The method sorts all the compound hash keys in ascending order and stores the corresponding data points on disk according to this order, so that close data points are likely to be stored locally. During ANN search, only a limited number of disk pages, which are "close" to the query in terms of the distance defined between compound hash keys, need to be accessed for sufficient candidate generation, leading to much shorter response time due to the reduction of random I/O operations, yet with higher search accuracy.
The purpose of [21] is to model the recall and the selectivity, and to use the model to determine the optimal parameters of LSH: the window size, the number of hash functions forming the compound hash function, the number of tables, and the number of bins probed in each table. The recall is defined as the percentage of the true NNs among the retrieved NN candidates. The selectivity is defined as the ratio of the number of retrieved candidates to the number of database points. The two factors are formulated as a function of the data distribution, for which the squared distance is assumed to follow a Gamma distribution estimated from the real data. The estimated distance distributions of the first, second, and subsequent nearest neighbors are used to compute the recall and selectivity. Finally, the optimal parameters are computed by minimizing the selectivity under the constraint that the recall is not less than a required value. A similar and more complete analysis for parameter optimization is given in [119].

[36] introduces a new measure, relative contrast, for analyzing the meaningfulness and difficulty of nearest neighbor search. The relative contrast for a query $q$, given a dataset, is defined as $C_r(q) = \frac{D_{\mathrm{mean}}(q)}{D_{\min}(q)}$, the ratio of the mean distance from $q$ to the database points to the distance from $q$ to its nearest neighbor; the expected relative contrast is taken with respect to the queries.
Define a random variable $R = d^p(q, x)$, the $p$th power of the distance from the query to a random database point, and let its mean be $\mu$ and its variance be $\sigma^2$. Define the normalized variance $\sigma'^2 = \sigma^2 / \mu^2$. It is shown that if the per-dimension distance contributions are independent and satisfy Lindeberg's condition, the expected relative contrast is approximated as
$C_r \approx \dfrac{1}{\left[1 + \sigma' \phi^{-1}\!\left(\frac{1}{N} + \phi\!\left(-\frac{1}{\sigma'}\right)\right)\right]^{1/p}},$  (20)
where $N$ is the number of database points, $\phi$ is the cumulative distribution function of the standard Gaussian, $\sigma' = \sigma / \mu$ is the normalized standard deviation, and $p$ is the distance metric norm. It can also be generalized to the relative contrast for the $k$th nearest neighbor,
$C_r^{(k)} \approx \dfrac{1}{\left[1 + \sigma' \phi^{-1}\!\left(\frac{k}{N} + \phi\!\left(-\frac{1}{\sigma'}\right)\right)\right]^{1/p}},$  (21)
where the minimum distance is replaced by the distance of the query to the $k$th nearest neighbor.
Given the approximate relative contrast, it is clear how the data dimensionality $d$, the database size $N$, the metric norm $p$, and the sparsity of the data vectors (which determine $\sigma'$) influence the relative contrast.
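The empirical version of the measure is easy to compute directly from its definition $C_r(q) = D_{\mathrm{mean}}(q) / D_{\min}(q)$. A pure-Python sketch (the helper names and toy data are mine, not from [36]) that also exhibits the dimensionality effect: distances concentrate as the dimension grows, so the contrast shrinks toward 1 and search becomes harder.

```python
import random

def relative_contrast(query, data, p=2):
    """Empirical relative contrast C_r(q) = D_mean / D_min: the mean
    distance from the query to the database over the distance to its
    nearest neighbor (larger means easier search)."""
    dists = [sum(abs(a - b) ** p for a, b in zip(query, x)) ** (1.0 / p)
             for x in data]
    return (sum(dists) / len(dists)) / min(dists)

random.seed(0)
def uniform_points(n, d):
    return [[random.random() for _ in range(d)] for _ in range(n)]

low_d = uniform_points(200, 4)
high_d = uniform_points(200, 400)
q4 = [random.random() for _ in range(4)]
q400 = [random.random() for _ in range(400)]

c_low = relative_contrast(q4, low_d)
c_high = relative_contrast(q400, high_d)
```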
It is shown that LSH, under the $\ell_p$-norm distance, can find the exact nearest neighbor with high probability by returning a number of candidate points that is a monotonically decreasing function of the relative contrast, and that, in the context of linear hashing $h(x) = \mathrm{sgn}(w^{\top} x)$, the optimal projection direction $w$ maximizing the relative contrast can be derived from the data distribution.
The LSH scheme has very nice theoretical properties. However, as the hash functions are data-independent, the practical performance is not as good as expected in certain applications. Therefore, many follow-up methods learn the hash functions from the data.
Learning to hash is the task of learning a compound hash function mapping an input item to a compact code, such that nearest neighbor search in the coding space is efficient and its result is an effective approximation of the true nearest neighbor search result in the input space. An instance of the learning-to-hash approach includes three elements: the hash function, the similarity measure in the coding space, and the optimization criterion. Here the term similarity is used as a general concept, and may mean distance or other forms of similarity.
Hash function.
The hash function can be based on linear projection, spherical functions, kernels, neural networks, or even a non-parametric function. One popular hash function is the linear hash function $y = \mathrm{sgn}(w^{\top} x + b)$, where $\mathrm{sgn}(z) = 1$ if $z \geq 0$ and $0$ otherwise. Another widely used hash function is based on nearest vector assignment: $y = \arg\min_{k} \|x - c_k\|_2$, where $\{c_1, \dots, c_K\}$ is a set of centers computed by some algorithm, e.g., $k$-means. The choice of hash function type influences both the efficiency of computing the hash codes and the flexibility of the hash codes, i.e., the flexibility of partitioning the space. The optimization of the hash function parameters depends on both the distance measure and the distance-preserving criterion.
Similarity measure. There are two main distance measure schemes in the coding space: the Hamming distance with its variants, and the Euclidean distance. The Hamming distance is widely used when the hash function maps a data point into a Hamming code whose entries are each either 0 or 1, and is defined as the number of bit positions at which the corresponding values differ. There are some variants, such as the weighted Hamming distance, distance computation by table lookup, and so on. The Euclidean distance is used in the approaches based on nearest vector assignment and is evaluated between the vectors corresponding to the hash codes, i.e., the nearest center vectors assigned to the data vectors; it is efficiently computed by looking up a precomputed distance table. There is a variant, the asymmetric Euclidean distance, in which only one vector is approximated by its nearest center while the other is not. There are also some works that learn a distance table between hash codes, assuming the hash codes are already given.
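Three of these measures can be sketched concretely (the function names are mine; the Hamming distance via XOR and popcount is the standard implementation trick for code words packed into integers):

```python
def hamming(a, b):
    """Hamming distance between two code words stored as integers,
    via XOR and popcount."""
    return bin(a ^ b).count("1")

def weighted_hamming(a, b, weights):
    """Weighted Hamming distance: each differing bit position contributes
    its own weight instead of 1 (weights[i] belongs to bit i)."""
    diff = a ^ b
    return sum(w for i, w in enumerate(weights) if (diff >> i) & 1)

def asymmetric_euclidean(x, code, centers):
    """Asymmetric Euclidean distance: the query x stays exact, while the
    database item is replaced by its assigned center."""
    c = centers[code]
    return sum((u - v) ** 2 for u, v in zip(x, c)) ** 0.5
```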
Optimization criterion. The approximate nearest neighbor search result is evaluated by comparing it with the true search result, that is, the result according to the distance computed in the input space. Most similarity-preserving criteria are designed as surrogates of such an evaluation.
The most straightforward form is to directly compare the order of the ANN search result with that of the true result (using the reference data points as queries), which is called the order-preserving criterion. Empirical results show that the ANN search result usually approaches the true search result with higher probability if the distance computed in the coding space accurately approximates the distance computed in the input space. This motivates the so-called similarity alignment criterion, which directly minimizes the differences between the distances (similarities) computed in the coding space and in the input space. An alternative surrogate is the coding consistency criterion, which penalizes larger distances in the coding space for pairs with larger similarities in the input space (called coding consistent to similarity, shortened to coding consistent since the majority of algorithms use it), and encourages smaller (larger) distances in the coding space for pairs with smaller (larger) distances in the input space (called coding consistent to distance). One typical approach, the space partitioning approach, assumes that space partitioning has already implicitly preserved the similarity to some degree.
Besides similarity preserving, another widely used criterion is coding balance, which means that the reference vectors should be uniformly distributed across the buckets (each corresponding to a hash code). Other related criteria, such as bit balance, bit independence, search efficiency, and so on, are essentially (degraded) forms of coding balance.
type | abbreviation |
---|---|
linear | LI |
bilinear | BILI |
Laplacian eigenfunction | LE
kernel | KE |
quantizer | QU |
1D quantizer | OQ
spline | SP |
neural network | NN |
spherical function | SF |
classifier | CL |
type | abbreviation |
---|---|
Hamming distance | HD |
normalized Hamming distance | NHD |
asymmetric Hamming distance | AHD |
weighted Hamming distance | WHD |
query-dependent weighted Hamming distance | QWHD |
normalized Hamming affinity | NHA |
Manhattan | MD |
asymmetric Euclidean distance | AED |
symmetric Euclidean distance | SED |
lower bound | LB |
type | abbreviation |
---|---|
Hamming embedding | |
coding consistent | CC |
coding consistent to distance | CCD |
code balance | CB |
bit balance | BB |
bit uncorrelation | BU |
projection uncorrelation | PU |
mutual information maximization | MIM |
minimizing differences between distances | MDD |
minimizing differences between similarities | MDS |
minimizing differences between similarity distribution | MDSD |
hinge-like loss | HL |
rank order loss | ROL |
triplet loss | TL |
classification error | CE |
space partitioning | SP |
complementary partitioning | CP |
pair-wise bit balance | PBB |
maximum margin | MM |
Quantization | |
bit allocation | BA |
quantization error | QE |
equal variance | EV |
maximum cosine similarity | MCS |
method | input sim. | hash function | dist. measure | optimization criteria |
---|---|---|---|---|
spectral hashing [135] | E | LE | HD | CC + BB + BU |
kernelized spectral hashing [37] | S, E | KE | HD | CC + BB + BU |
Hypergraph spectral hashing [153, 89] | S | CL | HD | CC + BB + BU |
Topology preserving hashing [145] | E | LI | HD | CC + CCD + BB + BU |
hashing with graphs [83] | S | KE | HD | CC + BB |
ICA Hashing [35] | E | LI, KE | HD | CC + BB + BU + MIM |
Semi-supervised hashing [125, 126, 127] | S, E | LI | HD | CC + BB + PU |
LDA hash [122] | S | LI | HD | CC + PU |
binary reconstructive embedding [63] | E | LI, KE | HD | MDD |
supervised hashing with kernels [82] | E, S | LI, KE | HD | MDS |
spec hashing [78] | S | CL | HD | MDSD |
bilinear hyperplane hashing [84] | ACS | BILI | HD | MDS |
minimal loss hashing [101] | E, S | LI | HD | HL |
order preserving hashing [130] | E | LI | HD | ROL |
Triplet loss hashing [103] | E, S | Any | HD, AHD | TL |
listwise supervision hashing [128] | E, S | LI | HD | TL |
Similarity sensitive coding (SSC) [114] | S | CL | WHD | CE |
parameter sensitive hashing [115] | S | CL | WHD | CE |
column generation hashing [75] | S | CL | WHD | CE |
complementary projection hashing [55] | E | LI, KE | HD | SP + CP + PBB |
label-regularized maximum margin hashing [96] | E, S | KE | HD | SP + MM + BB |
Random maximum margin hashing [57] | E | LI, KE | HD | SP + MM + BB |
spherical hashing [38] | E | SF | NHD | SP + PBB |
density sensitive hashing [79] | E | LI | HD | SP + BB |
multi-dimensional spectral hashing [134] | E | LE | WHD | CC + BB + BU |
Weighted hashing [131] | E | LI | WHD | CC + BB + BU |
Query-adaptive bit weights [53, 54] | S | LI (all) | QWHD | CE |
Query adaptive hashing [81] | S | LI | QWHD | CE |
In the following, we review Hamming embedding based hashing algorithms. Table IV presents a summary of the algorithms reviewed from Section 5.1 to Section 5.5, with some concepts given in Tables I, II, and III.
Coding consistent hashing refers to a category of hashing algorithms that formulate the objective function by minimizing the similarity-weighted distance, $\sum_{ij} s_{ij} \|y_i - y_j\|^2$ (and possibly maximizing a dissimilarity-weighted counterpart). Here, $s_{ij}$ is the similarity between items $x_i$ and $x_j$, computed from the input space or given by semantic meaning.
Spectral hashing [135], the pioneering coding consistency hashing algorithm, aims to find an easily evaluated hash function such that (1) similar items are mapped to similar hash codes under the Hamming distance (coding consistency) and (2) only a small number of hash bits is required. The second requirement is a form similar to coding balance, which is transformed into two requirements: bit balance and bit uncorrelation. The balance requirement means that each bit has around a 50% chance of being $1$ or $-1$. The uncorrelation requirement means that different bits are uncorrelated.
Let $\{y_n\}_{n=1}^{N}$ be the hash codes of the $N$ data items, each a binary vector of length $M$. Let $s_{ij}$ be the similarity, correlated with the Euclidean distance. The formulation is given as follows:
$\min \sum_{ij} s_{ij} \|y_i - y_j\|^2$  (22)
$\mathrm{s.t.}\ \ y_n \in \{-1, 1\}^M,\ n = 1, \dots, N,$  (23)
$\sum_{n=1}^{N} y_n = 0,$  (24)
$\frac{1}{N} \sum_{n=1}^{N} y_n y_n^{\top} = I,$  (25)
where $Y = [y_1, \dots, y_N]^{\top}$ is a matrix of size $N \times M$, $D$ is a diagonal matrix with $D_{ii} = \sum_j s_{ij}$, and $S = [s_{ij}]$. $L = D - S$ is called the Laplacian matrix, and the objective (22) can be written as $\mathrm{trace}(Y^{\top} L Y)$. Constraint (24) corresponds to the bit balance requirement. Constraint (25) corresponds to the bit uncorrelation requirement.
Rather than solving the problem in Equations (22)–(25) directly, a simple approximate solution under the assumption of a uniform data distribution is presented in [135]. The algorithm is as follows:
Find the principal components of the reference data items using principal component analysis (PCA).
Compute the 1D Laplacian eigenfunctions with the smallest eigenvalues along each PCA direction.
Pick the eigenfunctions with the smallest eigenvalues among all the computed eigenfunctions.
Threshold the eigenfunctions at zero, obtaining the binary codes.
The 1D Laplacian eigenfunction for the case of a uniform distribution on $[a, b]$ is $\phi_k(x) = \sin\big(\frac{\pi}{2} + \frac{k\pi}{b - a} x\big)$, and the corresponding eigenvalue is $\lambda_k = 1 - e^{-\frac{\epsilon^2}{2} \left|\frac{k\pi}{b - a}\right|^2}$.
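The approximate algorithm can be sketched as a toy in pure Python. This sketch assumes the data is already PCA-aligned (so the PCA step is skipped) and ranks the per-dimension eigenfunctions by $(k / (b - a))^2$, the quantity the eigenvalue grows with; function names are illustrative, and every dimension is assumed to have nonzero spread.

```python
import math

def spectral_hash_codes(data, nbits):
    """Simplified spectral hashing sketch: enumerate the 1D eigenfunctions
    sin(pi/2 + k*pi*t), t = (x - a)/(b - a), per dimension; keep the nbits
    (dimension, k) pairs with the smallest (k/(b - a))^2 (i.e., the
    smallest eigenvalues); threshold at zero."""
    dim = len(data[0])
    lo = [min(x[j] for x in data) for j in range(dim)]
    hi = [max(x[j] for x in data) for j in range(dim)]
    cands = sorted(((k / (hi[j] - lo[j])) ** 2, j, k)
                   for j in range(dim) for k in range(1, nbits + 1))
    chosen = cands[:nbits]
    codes = []
    for x in data:
        bits = []
        for _, j, k in chosen:
            t = (x[j] - lo[j]) / (hi[j] - lo[j])
            v = math.sin(math.pi / 2 + k * math.pi * t)
            bits.append(1 if v >= 0 else 0)
        codes.append(bits)
    return codes

# 1D toy data: the k = 1 eigenfunction splits the range in half, and the
# k = 2 eigenfunction keeps the two outer quarters.
data = [[float(i)] for i in range(10)]
codes = spectral_hash_codes(data, 2)
```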