On the Problem of p_1^-1 in Locality-Sensitive Hashing

05/25/2020 ∙ by Thomas Dybdahl Ahle, et al. ∙ IT University of Copenhagen

A Locality-Sensitive Hash (LSH) function is called (r, cr, p_1, p_2)-sensitive if two data points with a distance less than r collide with probability at least p_1, while data points with a distance greater than cr collide with probability at most p_2. These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular, one may build a c-approximate nearest neighbour data structure with query time Õ(n^ρ/p_1), where ρ = log(1/p_1)/log(1/p_2) ∈ (0,1). That is, sub-linear time, as long as p_1 is not too small. This is significant since most high dimensional nearest neighbour problems suffer from the curse of dimensionality, and can't be solved exactly faster than a brute force linear-time scan of the database. Unfortunately, the best LSH functions tend to have very low collision probabilities p_1 and p_2, including the best functions for Cosine and Jaccard Similarity. This means that the n^ρ/p_1 query time of LSH is often not sub-linear after all, even for approximate nearest neighbours! In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to Õ(n^ρ/p_1^(1-ρ)) (and the space usage correspondingly). Since n^ρ p_1^(ρ-1) < n ⇔ p_1 > n^(-1), our algorithm always obtains sub-linear query time, for any collision probabilities at least 1/n. For p_1 and p_2 small enough, our improvement over all previous methods can be up to a factor n in both query time and space. The improvement comes from a simple change to the Indyk-Motwani algorithm, which can easily be implemented in existing software packages.

1 Introduction

The Locality-Sensitive Hashing (LSH) framework [18] is one of the most efficient approaches to the nearest neighbour search problem in high dimensional spaces. It comes with theoretical guarantees, and it has the advantage of adapting easily to nearly any metric or similarity function one might want to search under.

The (r, cr)-near neighbour problem is defined as follows: Given a set P of n points, we build a data structure such that, given a query q, we can quickly find a point in P with distance at most cr to q, or determine that P has no points with distance at most r to q. Given a solution to this "gap" problem, one can obtain a c-approximate nearest neighbour data structure, or even an exact solution, using known reductions [2, 14, 17]. (In general we expect the exact problem to be impossible to solve in sub-linear time, given the hardness results of [5, 1]; however, for practical datasets it is often possible.)

For any measure of similarity, the gap problem can be solved by LSH: we find a distribution H of hash functions such that Pr[h(x) = h(y)] ≥ p_1 when x and y are similar, and Pr[h(x) = h(y)] ≤ p_2 when x and y are dissimilar, for h drawn from H. Such a distribution is called (r, cr, p_1, p_2)-sensitive. If p_1 > p_2, the LSH framework gives a data structure with query time Õ(n^ρ/p_1) for ρ = log(1/p_1)/log(1/p_2) ∈ (0,1), which is usually significantly faster than the alternatives.

At least when p_1 is not too small.

The two most common families of LSH are Cross-Polytope (or Spherical) LSH [6] for Cosine Similarity and MinHash [11, 10] for Jaccard Similarity.

Cross-Polytope LSH is the basis of the Falconn software package [21], and solves the (r, cr)-near neighbour problem on the sphere in time Õ(n^ρ/p_1). Here p_1 = exp(-(r^2/(4-r^2)) ln(d) (1+o(1))) [6], where r is the distance between two close points and d is the dimension. We see that already at r = √2 (which corresponds to near orthogonal vectors) the 1/p_1 factor results in roughly a factor d slow-down. For larger r the slow-down can grow arbitrarily large. Using dimensionality reduction techniques, like the Johnson-Lindenstrauss transform, one may assume d = O(ε^(-2) log n) at the cost of a factor (1+ε) distortion of the distances. However, even then 1/p_1 is a power of log n whose exponent grows with r, and for moderate ε and realistic dataset sizes this factor can exceed n^(1-ρ), in which case n^ρ/p_1 is larger than n. So worse than a brute force scan of the database!
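
As a rough illustration of the size of this factor, the sketch below evaluates the collision-probability bound from [6] (ignoring its lower-order terms); the dimensions are arbitrary and chosen only for illustration.

    import math

    # Rough illustration, not code from the paper: plug the collision-probability
    # bound for Cross-Polytope LSH from Andoni et al. [6], ln(1/p_1) ≈ r^2/(4-r^2) * ln d
    # (ignoring lower-order terms), to see how fast the 1/p_1 factor grows with the
    # dimension d for near-orthogonal points (r = sqrt(2)).
    def cross_polytope_p1(r: float, d: int) -> float:
        """Approximate collision probability of two unit vectors at distance r."""
        return math.exp(-(r * r / (4 - r * r)) * math.log(d))

    for d in (128, 1024, 8192):
        p1 = cross_polytope_p1(math.sqrt(2), d)
        print(f"d = {d:5d}   p_1 ≈ {p1:.1e}   1/p_1 ≈ {1 / p1:,.0f}")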

The MinHash algorithm was introduced by Broder et al. for the AltaVista search engine, but is used today for similarity search on sets in everything from natural language processing to gene sequencing. MinHash solves the (j_1, j_2) gap similarity search problem, where j_1 is the Jaccard Similarity of similar sets and j_2 is that of dissimilar sets, in time Õ(n^ρ/j_1), where ρ = log(1/j_1)/log(1/j_2). (In particular, MinHash is (j_1, j_2)-sensitive in the sense defined above, with p_1 = j_1 and p_2 = j_2.) Now consider the case where j_1 and j_2 are both small, which is fairly common as illustrated in fig. 1(a). As soon as j_1 < n^(ρ-1) we end up with n^ρ/j_1 > n. Again worse than a brute force scan of the database!

In this paper we reduce the query time of LSH to Õ(n^ρ/p_1^(1-ρ)), which is less than n for all p_1 > 1/n. In the MinHash example above, the 1/j_1 factor shrinks to 1/j_1^(1-ρ), which can amount to more than an order of magnitude improvement(!) In general the improvement over n^ρ/p_1 may be as large as a factor of n when p_1 and p_2 are both close to 1/n. This is illustrated in fig. 1(b), and a small worked example follows below.
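
The following sketch, which is ours and not code from the paper, illustrates both points: a random MinHash function collides with probability exactly the Jaccard similarity, and for small (hypothetical) similarities the classical n^ρ/j_1 bound exceeds n while the n^ρ/j_1^(1-ρ) bound of this paper stays sublinear.

    import math
    import random

    # MinHash as an LSH family for Jaccard similarity: a random permutation of the
    # universe makes two sets collide with probability exactly their Jaccard
    # similarity, so p_1 = j_1 and p_2 = j_2.
    def sample_minhash(universe_size: int, rng: random.Random):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        return lambda s: min(perm[x] for x in s)  # hash of a non-empty set of ints

    # Hypothetical parameters (not taken from the paper) where classical LSH is
    # slower than a linear scan while the bound of this paper is still sublinear.
    n, j1, j2 = 10**6, 1e-3, 1e-4
    rho = math.log(1 / j1) / math.log(1 / j2)        # = 0.75
    print(n**rho / j1)             # classical bound n^rho / j_1      ≈ 3.2e7 > n
    print(n**rho / j1**(1 - rho))  # this paper's bound               ≈ 1.8e5 < n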

The improvement to LSH comes from a simple observation: in the algorithm of Indyk and Motwani, a certain "amplification" procedure has to be applied log(n)/log(1/p_2) times. When this quantity is not an integer, which is extremely likely, the amount of amplification has to be rounded up to the next integer. We propose instead an ensemble of LSH tables with different amounts of amplification, which, when analysed sufficiently precisely, yields the improvements described above.

(a) Density plots of pairwise Jaccard Similarities in the datasets studied by Mann et al. [20]. The similarities are normalized by the dataset sizes, so the effect of the similarity can be compared with the effect of n; we see that reasonable values for the similarity of close sets are small on those datasets.
(b) Saving possible, in query time and space, over classical LSH as a function of p_1 and p_2.
Figure 1: Overview of the available savings

1.1 Related Work

We review various directions in which LSH has been improved and generalized, and how those results relate to what is presented in the present article.

In many cases, the time required to sample and evaluate the hash functions dominates the running time of LSH. Recent papers [16, 12] have reduced the number of distinct hash function evaluations needed. The most recent paper in this line of work is [12], which reduces the number of such calls substantially. On top of that, however, these methods still require Õ(n^ρ/p_1) work, so the issue with small p_1 isn't touched upon. In fact, some of the algorithms in [12] increase the dependency on 1/p_1 even further.

Other work has sought to generalize the concept of Locality-Sensitive Hashing to so-called Locality-Sensitive Filtering, LSF [9]. However, the best algorithms for set similarity search based on LSF [4, 13] still have factors similar in spirit to 1/p_1. E.g., the Chosen Path algorithm in [13] uses query time Õ(n^ρ/b_1), where b_1 is the similarity of close sets.

A third line of work has sought to derandomize LSH, resulting in so-called Las Vegas LSH [3, 22]. Here the families are built combinatorially, rather than at random, to guarantee that the data structure always returns a near neighbour when one exists. While these methods don't have collision probabilities in the usual sense, they still end up with similar factors, for similar reasons.

As mentioned, the reason the 1/p_1 factor shows up in all these different approaches is that they all rely on the same amplification procedure, which has to be applied an integer number of times. One might wonder whether tree-based methods, which do an adaptive amount of amplification, could get rid of the dependency. However, as evidenced by the classical and current work [8, 7, 14, 15], these methods still have a 1/p_1 factor. We leave it open whether this might be avoidable with a better analysis, perhaps inspired by the results in this paper.

2 Preliminaries

Before we give the new LSH algorithm, we will recap the traditional analysis. For a more comprehensive introduction to LSH, see Chapter 3 of the Mining of Massive Datasets book [19]. In the remainder of the article we use the notation Õ(f) = f · (log n)^O(1) to suppress polylogarithmic factors.

Assume we are given a (r, cr, p_1, p_2)-sensitive LSH family, H, as defined in the introduction. Let k and L be integers defined later, and let R be the range of the hash functions, h : X → R. Let n be an upper bound on the number of points to be inserted. (If we don't know how many points will be inserted, several black-box reductions allow transforming LSH into a dynamic data structure.) The Indyk-Motwani data structure consists of L hash tables, each with one bucket per key in R^k.

To insert a point x, we draw kL functions from H, denoted h_{i,j} for i = 1, …, L and j = 1, …, k. In each table i we insert x into the bucket keyed by (h_{i,1}(x), …, h_{i,k}(x)). Given a query point q, the algorithm iterates over the L tables and retrieves the data points hashed into the same buckets as q. The process stops as soon as a point is found within distance cr from q.
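
Below is a minimal sketch of the data structure just described; it is our illustration, not the paper's code. The callable family() is assumed to sample one hash function from H, and is_close(q, x) is assumed to check whether x is within distance cr of q; both names are ours.

    from collections import defaultdict

    class LSHTables:
        """Minimal sketch of the Indyk-Motwani LSH data structure."""

        def __init__(self, family, k: int, L: int):
            # k hash functions per table, L tables in total.
            self.hashes = [[family() for _ in range(k)] for _ in range(L)]
            self.tables = [defaultdict(list) for _ in range(L)]

        def _key(self, i, x):
            # Bucket key in table i: the k concatenated hash values of x.
            return tuple(h(x) for h in self.hashes[i])

        def insert(self, x):
            for i, table in enumerate(self.tables):
                table[self._key(i, x)].append(x)

        def query(self, q, is_close):
            # Scan the L tables; stop at the first point within distance cr of q.
            for i, table in enumerate(self.tables):
                for x in table.get(self._key(i, q), ()):
                    if is_close(q, x):
                        return x
            return None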

The algorithm as described has the performance characteristics listed below. Here we assume the hash functions can be sampled and evaluated in constant time. If this is not the case, one can use the improvements discussed in the related work.

  • Query time: Õ(n^ρ/p_1).

  • Space: O(n^(1+ρ)/p_1) plus the space to store the data points.

  • Success probability: Ω(1).

To get these bounds, we have defined k = ⌈log(n)/log(1/p_2)⌉ and

L = ⌈p_1^(-k)⌉, where p_1^(-k) ≤ p_1^(-log(n)/log(1/p_2) - 1) = n^ρ/p_1.   (1)

It's clear from this analysis that the 1/p_1 factor is only necessary when log(n)/log(1/p_2) is not an integer. However in those cases it appears necessary, since there is no obvious way to make a non-integer number of function evaluations. We also cannot round k down instead of up, since the number of false positives would explode: rounding down would result in a factor of 1/p_2 instead of 1/p_1, which is much worse.
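
For concreteness, the following snippet computes the classical parameters for some hypothetical values of n, p_1 and p_2 (ours, not taken from the paper); the ceiling on k is exactly where the 1/p_1 factor enters.

    import math

    # Classical Indyk-Motwani parameter setting from the analysis above.
    # The values of n, p_1, p_2 below are hypothetical, chosen only for illustration.
    def classical_parameters(n: int, p1: float, p2: float):
        k = math.ceil(math.log(n) / math.log(1 / p2))  # rounded *up*
        L = math.ceil((1 / p1) ** k)
        return k, L

    n, p1, p2 = 10**6, 0.05, 0.005
    k, L = classical_parameters(n, p1, p2)
    rho = math.log(1 / p1) / math.log(1 / p2)
    print(k, L, n**rho, n**rho / p1)  # n^rho <= L <= n^rho / p_1, up to rounding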

3 LSH with High-Low Tables

The idea of the main algorithm is to create some LSH tables with the amplification parameter k = log(n)/log(1/p_2) rounded down, and some with it rounded up. We call those respectively "high probability" tables and "low probability" tables; in short, "LSH with High-Low Tables".

The main theorem is the following:

Theorem 3.1

Let H be a (r, cr, p_1, p_2)-sensitive LSH family, and let ρ = log(1/p_1)/log(1/p_2). Assume p_2 < p_1 and p_1 ≥ 1/n. Then there exists a solution to the (r, cr)-near neighbour problem with the following properties:

  • Query time: Õ(n^ρ/p_1^(1-ρ)).

  • Space: O(n^(1+ρ)/p_1^(1-ρ)) plus the space to store the data points.

  • Success probability: Ω(1).

Proof

Assume n, p_1 and p_2 are given. Define k = log(n)/log(1/p_2), ρ = log(1/p_1)/log(1/p_2), and α = ⌈k⌉ - k. (If k is an integer the classical analysis already gives L = n^ρ, so we may assume α ∈ (0,1).) We build L = L_h + L_l tables (for L_h and L_l to be defined below), where the first L_h use the hash function concatenated ⌊k⌋ times as keys, and the remaining L_l use it concatenated ⌈k⌉ times.

The total number of tables to query is then L = L_h + L_l. The expected total number of far points we have to retrieve is

L_h · n p_2^⌊k⌋ + L_l · n p_2^⌈k⌉   (2)
= L_h · n p_2^k · p_2^(⌊k⌋-k) + L_l · n p_2^k · p_2^(⌈k⌉-k)   (3)
= L_h · p_2^(⌊k⌋-k) + L_l · p_2^(⌈k⌉-k)   (4)
= L_h · p_2^(α-1) + L_l · p_2^α.   (5)

For the second equality, we used the definition of k: n p_2^k = 1. We only count the expected number of points seen that are at least cr away from the query. This is because the algorithm, like classical LSH, terminates as soon as it sees a point with distance less than cr.

Given any point in the database within distance r of the query, we must be able to find it with high enough probability. This requires that the query and the point share a hash bucket in at least one of the L tables. The probability that this is not the case is

(1 - p_1^⌊k⌋)^(L_h) · (1 - p_1^⌈k⌉)^(L_l)   (6)
≤ exp(-L_h p_1^⌊k⌋ - L_l p_1^⌈k⌉)   (7)
= exp(-L_h p_1^k p_1^(α-1) - L_l p_1^k p_1^α)   (8)
= exp(-(L_h p_1^(α-1) + L_l p_1^α) n^(-ρ)).   (9)

For the last equality, we used the definition of ρ and k: p_1^k = n^(-ρ). The inequality is just 1 - x ≤ e^(-x). Throughout we may assume p_2 ≥ 1/n, so that ⌊k⌋ ≥ 1, since otherwise we could just get the theorem from the classical LSH algorithm.

We now define L_h and L_l such that L_h p_2^(α-1) + L_l p_2^α ≤ L_h + L_l and L_h p_1^(α-1) + L_l p_1^α ≥ n^ρ. By the previous calculations this will guarantee that the number of false positives is no more than the number of tables, and a constant success probability.

We can achieve this by taking

L_h = n^ρ (1 - p_2^α) / (p_1^(α-1)(1 - p_2^α) + p_1^α(p_2^(α-1) - 1)) and
L_l = n^ρ (p_2^(α-1) - 1) / (p_1^(α-1)(1 - p_2^α) + p_1^α(p_2^(α-1) - 1)),   (10)

rounded up to the nearest integers.

We note that both values are non-negative, since p_2 ≤ 1 implies p_2^(α-1) ≥ 1 ≥ p_2^α.

When actually implementing the LSH algorithm with High-Low tables, these are the values you should use for the number of, respectively, the high and low probability tables. That will ensure you take full advantage of the cases where α is not worst case, and you may do even better than the theorem promises. A sketch of this computation is given below.
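
The sketch below mirrors that recipe under the conventions of the proof above (it is our reading, not the author's reference implementation): it computes the two table counts by solving the two constraints from the previous paragraphs with equality.

    import math

    # L_h "high probability" tables with floor(k) concatenations and L_l "low
    # probability" tables with ceil(k) concatenations, chosen so that (i) the
    # expected number of far points is at most the number of tables and (ii) the
    # success probability is a constant.
    def high_low_parameters(n: int, p1: float, p2: float):
        k = math.log(n) / math.log(1 / p2)
        rho = math.log(1 / p1) / math.log(1 / p2)
        alpha = math.ceil(k) - k
        denom = p1 ** (alpha - 1) * (1 - p2 ** alpha) + p1 ** alpha * (p2 ** (alpha - 1) - 1)
        L_h = math.ceil(n ** rho * (1 - p2 ** alpha) / denom)
        L_l = math.ceil(n ** rho * (p2 ** (alpha - 1) - 1) / denom)
        return (math.floor(k), L_h), (math.ceil(k), L_l)

    # With the same hypothetical parameters as before, this uses noticeably fewer
    # tables than the classical choice L = ceil(p_1^(-ceil(k))).
    print(high_low_parameters(10**6, 0.05, 0.005))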

To complete the theorem we need to prove that L_h + L_l = O(n^ρ p_1^(ρ-1)). For this we bound

(11)
(12)
(13)
(14)

Here D(x ‖ y) = x log(x/y) + (1 - x) log((1 - x)/(1 - y)) is the Kullback-Leibler divergence. The two inequalities are proven in the appendix as lemma 1 and lemma 3. The first bound comes from maximizing over α, so in principle we might be able to do better if k is close to an integer. The second bound is harder, but the realization that the left-hand side can be written in the form of a divergence helps a lot. The bound is tight up to a factor 2, so no significant improvement is possible.

Finally we can boost the success probability from a constant to 99% by repeating the entire data structure 16 times.
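
To make the boosting step concrete: if each copy of the data structure succeeds with probability at least δ, then t independent copies succeed with probability 1 - (1 - δ)^t. The value δ = 1/4 below is a hypothetical constant used only to check that 16 repetitions reach 99%.

    delta, t = 0.25, 16          # hypothetical per-copy success probability, repetitions
    print(1 - (1 - delta) ** t)  # ≈ 0.99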

References

  • [1] Abboud, A., Rubinstein, A., Williams, R.: Distributed pcp theorems for hardness of approximation in p. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). pp. 25–36. IEEE (2017)
  • [2] Ahle, T.D., Aumüller, M., Pagh, R.: Parameter-free locality sensitive hashing for spherical range reporting. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 239–256. SIAM (2017)
  • [3] Ahle, T.D.: Optimal las vegas locality sensitive data structures. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). pp. 938–949. IEEE (2017)
  • [4] Ahle, T.D., Knudsen, J.B.T.: Subsets and supermajorities: Optimal hashing-based set similarity search. arXiv preprint arXiv:1904.04045 (2020)
  • [5] Ahle, T.D., Pagh, R., Razenshteyn, I., Silvestri, F.: On the complexity of inner product similarity join. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. pp. 151–164. ACM (2016)
  • [6] Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., Schmidt, L.: Practical and optimal lsh for angular distance. In: Advances in Neural Information Processing Systems. pp. 1225–1233 (2015)
  • [7] Andoni, A., Razenshteyn, I., Nosatzki, N.S.: Lsh forest: Practical algorithms made theoretical. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 67–78. SIAM (2017)
  • [8] Bawa, M., Condie, T., Ganesan, P.: Lsh forest: self-tuning indexes for similarity search. In: Proceedings of the 14th international conference on World Wide Web. pp. 651–660 (2005)
  • [9] Becker, A., Ducas, L., Gama, N., Laarhoven, T.: New directions in nearest neighbor searching with applications to lattice sieving. In: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 10–24. SIAM (2016)
  • [10] Broder, A.Z.: On the resemblance and containment of documents. In: Compression and Complexity of Sequences 1997. Proceedings. pp. 21–29. IEEE (1997)
  • [11] Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. pp. 327–336. ACM (1998)

  • [12] Christiani, T.: Fast locality-sensitive hashing frameworks for approximate near neighbor search. In: International Conference on Similarity Search and Applications. pp. 3–17. Springer (2019)
  • [13] Christiani, T., Pagh, R.: Set similarity search beyond minhash. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017. pp. 1094–1107 (2017). https://doi.org/10.1145/3055399.3055443, https://doi.org/10.1145/3055399.3055443
  • [14] Christiani, T., Pagh, R., Thorup, M.: Confirmation sampling for exact nearest neighbor search. arXiv preprint arXiv:1812.02603 (2018)
  • [15] Christiani, T.L., Pagh, R., Aumüller, M., Vesterli, M.E.: Puffinn: Parameterless and universally fast finding of nearest neighbors. In: European Symposium on Algorithms. pp. 1–16 (2019)
  • [16] Dahlgaard, S., Knudsen, M.B.T., Thorup, M.: Fast similarity sketching. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). pp. 663–671. IEEE (2017)
  • [17] Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the twentieth annual symposium on Computational geometry. pp. 253–262. ACM (2004)
  • [18] Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. pp. 604–613. ACM (1998)
  • [19] Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of massive data sets. Cambridge university press (2020)
  • [20] Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment 9(9), 636–647 (2016)
  • [21] Razenshteyn, I., Schmidt, L.: Falconn: fast lookups of cosine and other nearest neighbors (2018)
  • [22] Wei, A.: Optimal las vegas approximate near neighbors in lp. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 1794–1813. SIAM (2019)

4 Appendix

Lemma 1

For all we have

(15)

where .

Proof

We first show that is log-concave, which implies it is maximized at the unique such that . Log-concavity follows easily by noting

(16)

Meanwhile

(17)

which implies is maximized in

(18)

Plugging into yields the lemma.

Note that the function is not concave in the ordinary sense as and get small enough. Hence the use of log-concavity is necessary.

Next, we state a useful inequality, which is needed for the last proof.

Lemma 2

Let , then

(19)
Proof

We have , so is convex as a function of . Since is its tangent (at ) we get the first inequality.

For the second inequality, define . Then and is non-decreasing, since

(20)

This shows that for we have , which is what we wanted to prove.

Lemma 3

Let and let , then

(21)

where .

Proof

Using the upper bound of lemma 2 it follows directly that . This suffices to show the first inequality of (21), since for we have and so the second term of is non-positive.

For the second inequality, we note that it is equivalent after manipulations to . Plugging in , and after more simple manipulations, that’s equivalent in the range to , which is lemma 2.

This finishes the proof of lemma 3.

It’s somewhat surprising that the last argument in the proof of lemma 3 works, since if we had plugged the lower bound from lemma 2 directly into the problem we would have had

(22)

which is much weaker than what we prove, and not even defined for .