Improved approximate near neighbor search without false negatives for l_2

09/28/2017
by   Piotr Wygocki, et al.
University of Warsaw

We present a new algorithm for the c-approximate nearest neighbor search without false negatives for l_2^d. We enhance the dimension reduction method presented in [14] and combine it with the standard results of Indyk and Motwani [10]. We present an efficient algorithm with Las Vegas guarantees for any c > 1. This improves over the previous results, which require c = ω(n) [14], where n is the number of the input points. Moreover, we improve both the query time and the pre-processing time. Our algorithm is tunable, which allows for different compromises between the query and the pre-processing times. In order to illustrate this flexibility, we present two variants of the algorithm. The "efficient query" variant involves a query time of O(d^2) and a polynomial pre-processing time. The "efficient pre-processing" variant involves a pre-processing time equal to O(d^{ω-1} n) and a query time sub-linear in n, where ω is the exponent in the complexity of fast matrix multiplication. In addition, we introduce batch versions of the mentioned algorithms, where the queries come in batches of size d. In this case, the amortized query time of the "efficient query" algorithm is reduced to O(d^{ω-1}).


1 Introduction

The near neighbor problem has various applications in image processing, search engines, recommendation engines, prediction and machine learning. We define the near neighbor problem as follows: for a given input set, a query point and a distance r, return a point (optionally all points) from the input set closer than r to the query point in some metric (usually an l_p metric), or report that such a point does not exist (some authors refer to this problem as the Point Location in Equal Balls, PLEB [10]). The input set and the distance r are known in advance. Because of this, the input set can be preprocessed, which can afterwards shorten the query time. The problem of finding the nearest neighbor, with no r given, can be efficiently reduced to the problem defined as above [10].
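For concreteness, the following is a minimal brute-force sketch of the problem just defined. It only illustrates the definition, not the data structures developed in this paper; the function name and parameters are ours.

```python
import numpy as np

def near_neighbors_bruteforce(points, query, r):
    """Return all input points within distance r of the query (l_2 metric),
    or an empty list if no such point exists."""
    points = np.asarray(points, dtype=float)
    query = np.asarray(query, dtype=float)
    dists = np.linalg.norm(points - query, axis=1)
    return points[dists <= r].tolist()

# Example: one point inside the radius r = 1, one outside.
pts = [[0.0, 0.0], [3.0, 4.0]]
print(near_neighbors_bruteforce(pts, [0.5, 0.0], r=1.0))  # -> [[0.0, 0.0]]
```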

Unfortunately, the near neighbor search is hard for high-dimensional spaces such as l_2^d for large d. The existence of an algorithm with the query time sub-linear in the input set size and non-exponential in d, and the pre-processing time non-exponential in d, would contradict the Strong Exponential Time Hypothesis [15]. In order to overcome this obstacle, the c-approximate near neighbor problem was introduced. In this problem, the query result is allowed to contain points located at a distance of at most cr from the query point. In other words, the points located at a distance smaller than r from the query point are classified as neighbors, points further than cr are classified as "far away" points, while the rest can be classified in either of these two categories. Naturally, we consider c > 1. This assumption makes the problem easier for many metric spaces, such as l_1, l_2 or the Hamming space [10]. On the one hand, the queries are sub-linear in the input size; on the other hand, the query and pre-processing times are polynomial in the dimension of the space.

Many previously known algorithms for the c-approximate near neighbor use locality sensitive hashing (LSH) and give Monte Carlo guarantees for the returned points (see, for example, [2, 6, 10]). That is, any input point within the distance r from the query point is classified as a neighbor only with some probability, which means there may be false negatives. Locality sensitive hashing functions are functions which roughly preserve distances, i.e., given two points, the distance between their hashes approximates the distance between them with high probability. A common choice for the hash functions is a projection onto a random direction, possibly quantized, where the direction a is a vector of numbers drawn independently from some probability distribution [2, 10, 16]. For the Gaussian distribution, the projection a · x is also Gaussian with zero mean and standard deviation equal to ‖x‖_2. It is easy to see that these are LSH functions for l_2, but as mentioned above, they only provide probabilistic guarantees. In this paper, we aim to enhance this by focusing on the c-approximate near neighbor search without false negatives for l_2. In other words, we consider algorithms where a point 'close' to the query point is guaranteed to be returned. Such a class of guarantees is often called Las Vegas. An algorithm with Las Vegas guarantees can be adjusted to one with Monte Carlo guarantees: Markov's inequality implies that if the expected computation time is small, then with large probability the computation time is also small, so we can stop the computation after a certain amount of time has passed and return an empty result, which gives Monte Carlo guarantees.
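To make the projection-based hashing concrete, here is a small numerical sketch. The exact hash family of [2, 10, 16] is not reproduced in this text, so the quantized projection below is only an illustrative stand-in; the sketch also checks the fact used above, namely that for a Gaussian vector a the projection a · x is Gaussian with mean zero and standard deviation ‖x‖_2.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.normal(size=d)

# a ~ N(0, I_d) implies a.x ~ N(0, ||x||^2): the projection roughly preserves lengths.
projections = np.array([rng.normal(size=d) @ x for _ in range(20000)])
print(projections.std(), np.linalg.norm(x))   # the two values should be close

def projection_hash(point, a, b, w):
    """An illustrative LSH-style bucket: quantized projection onto direction a."""
    return int(np.floor((a @ point + b) / w))

a = rng.normal(size=d)
b = rng.uniform(0.0, 4.0)
print(projection_hash(x, a, b, w=4.0))
```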

Throughout this paper, we assume that both the size of the input set n and the dimension d are large. This represents a situation where the exhaustive scan through all the input points, as well as the usage of data structures exponentially dependent on d, become intractable. If not explicitly specified, all statements assume the usage of the l_2 norm.

2 Related Work

2.1 Algorithms for constant dimension

There is a number of algorithms for the c-approximate near neighbor problem assuming a constant dimension d [5, 11, 7]. In each of them, either the pre-processing time or the query time depends exponentially on d. Nevertheless, these are the best fully deterministic algorithms that are known [5, 10]. A particularly interesting algorithm is presented in [10]: its pre-processing time is exponential in d, while its query time is polynomial in d. We will use this algorithm to obtain our results.

2.2 Monte Carlo algorithms

There exists an efficient Monte Carlo c-approximate near neighbor algorithm for the Hamming space with the query and the pre-processing complexity roughly n^{1/c} and n^{1+1/c}, respectively [10]. For l_2, in turn, there exists a near to optimal algorithm with the query and the pre-processing complexity roughly n^{1/c^2} and n^{1+1/c^2}, respectively [2, 12]. Moreover, the algorithms presented in [10] work for l_p for a range of p. There are also data dependent algorithms, taking into account the actual distribution of the input set [4], which achieve query time n^{ρ+o(1)} and space n^{1+ρ+o(1)}, where ρ = 1/(2c^2 - 1).

Recently, the optimal hashing-based time–space trade-offs for the c-approximate near neighbor in l_2 were considered [3]. For any exponents ρ_q, ρ_u ≥ 0 such that:

c^2 √(ρ_q) + (c^2 - 1) √(ρ_u) ≥ √(2c^2 - 1),

there is a c-approximate near neighbor algorithm with the storage n^{1+ρ_u+o(1)} and the query time n^{ρ_q+o(1)}.

2.3 Las Vegas algorithms

Pagh [13] considered the c-approximate near neighbor search without false negatives for the Hamming space, obtaining results close to those presented in [10]. He showed that the bounds of his algorithm differ by at most a factor of ln 4 in the exponent in comparison to the bounds in [10]. Recently, Ahle showed an optimal algorithm for the c-approximate near neighbor without false negatives for the Hamming space and the Braun-Blanquet metric [1, 12]. Indyk [8] provided a deterministic algorithm for l_∞ with polynomial storage and sub-linear query time for some tunable parameter. He proved that the c-approximate near neighbor without false negatives for l_∞ with small approximation factors is as hard as the subset query problem, a long-standing combinatorial problem. This indicates that the c-approximate near neighbor without false negatives for l_∞ might be hard to solve for any small c.

Indyk [9] considered deterministic mappings of l_2 into l_1, which might be useful for constructing efficient algorithms for the c-approximate near neighbor without false negatives. If we were able to efficiently embed l_1 into the Hamming space (which is just the Boolean cube with the l_1 distance function) with additional guarantees for false negatives, it would also give an algorithm without false negatives for l_2. (To the best of the author's knowledge, such an embedding will be presented at FOCS 2017 in the conference version of the paper of Ahle [1].)

Algorithms without false negatives for l_2 are presented in [16]. Two hashing function families are considered, giving different trade-offs between the execution times and the conditions on c. Unfortunately, these algorithms work only for sufficiently large c. In further work, Sankowski et al. [14] showed a dimension reduction technique with Las Vegas guarantees. Applying the algorithm introduced in [16] to the problem with the reduced dimension results in an algorithm with a weaker requirement on c, which can be relaxed further [14].

In this work, we use the dimension reduction introduced in [14] and apply the algorithm of Indyk and Motwani [10] to the problem in the reduced space. After a slight strengthening of the results of [14], we obtain algorithms for any c > 1.

2.3.1 Las Vegas dimension reduction

Sankowski et al. [14] showed that:

Lemma 1.

[Reduction Lemma – Lemma 1 in [14]]

For any parameter m ≤ d, there exist linear mappings A_1, …, A_{d/m} from l_2^d to l_2^m, such that:

  1. for each point p such that ‖p‖ ≤ 1, there exists i which satisfies ‖A_i p‖ ≤ 1,

  2. for each point p such that ‖p‖ ≥ c and for each i, the probability that ‖A_i p‖ ≤ 1 is bounded by a quantity that decays exponentially in m.

The above bound is non-trivial only for an appropriate range of the parameters. Applying the reduction lemma gives the following reduction for the c-approximate near neighbor without false negatives:

Corollary 1.

[generalization of Corollary 3 in [14]]

For any c > 1, the c-approximate near neighbor problem without false negatives in dimension d can be reduced to a number of instances of the problem in a reduced dimension, for a tunable parameter controlling the trade-off between the query and the pre-processing times; the reduction itself contributes O(d^{ω-1} n) to the pre-processing time.

If the queries are provided in batches of size d, the reduction contributes an amortized O(d^{ω-1}) to the query time.

In the above version of Corollary 1, we generalize Corollary 3 from [14] in the following way:

  • We introduce an additional tunable parameter, which allows us to set different compromises between the pre-processing time and the query time.

  • We observe that, under the assumptions on n and d, the preprocessing of all n points can be expressed as the multiplication of matrices of dimensions d × d and d × n, which can be performed in time O(d^{ω-1} n). An analogous argument leads to the conclusion that the amortized query time of the batch version is O(d^{ω-1}) (see the sketch after this list). In further theorems, we provide the amortized query times for the batch version. This can be easily turned into a non-batch version by substituting the d^{ω-1} term with d^2 in the query complexities.

  • We present a slightly stronger bound than the one in [14].
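A minimal sketch of the batching observation above, assuming the d/m mappings are stacked into one d × d matrix (here an arbitrary orthogonal matrix serves as a stand-in): projecting a batch of d points is then a single d × d by d × d matrix product, i.e., O(d^ω) work, or O(d^{ω-1}) amortized per point.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Stand-in for the d/m mappings stacked into one d x d matrix.
M = np.linalg.qr(rng.normal(size=(d, d)))[0].T

# A batch of d points stored as columns: pre-processing (or querying) the
# whole batch is one matrix product instead of d separate matrix-vector products.
batch = rng.normal(size=(d, d))
projected = M @ batch
print(projected.shape)   # (64, 64); rows can then be split back into the d/m blocks
```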

The proof of Corollary 1 is essentially the same as the proof presented in [14]. For the reader’s convenience, this proof is included in Appendix A.

Combining the above corollary with the results introduced in [16], we can achieve an algorithm with the polynomial pre-processing time and the sub-linear query time, but only for sufficiently large c [16]. In this paper, we relax this restriction and show an algorithm for any c > 1.

3 Our contribution

Recently, efficient algorithms were proposed for solving the c-approximate near neighbor search without false negatives in l_2^d, but only under a lower bound on the approximation factor c [16, 14]. The main problem with these algorithms is the constraint on c. The contribution of this paper is relaxing this condition and improving the complexity of the algorithms for l_2:

The c-approximate near neighbor problem without false negatives in l_2^d can be solved with the amortized query time O(d^{ω-1}) and a pre-processing time polynomial in n and d, with two regimes of c treated separately in Section 5.1 and a tunable parameter controlling the trade-off between the query and the pre-processing times. As mentioned before, we assume that the queries are provided in batches of size d. This assumption can be omitted, which leads to an algorithm with the query time of O(d^2). The above is also valid for the other presented algorithms (in particular, for Theorem 3). We focus on the batch version to avoid unnecessary complexity.

In particular, for small c, we achieve an algorithm with the query time O(d^2) (amortized O(d^{ω-1}) in the batch setting) and a polynomial pre-processing time. For small c, our results are incomparable with the previously discussed algorithms; in particular, they are similar to the results of the algorithm presented recently in [3], although the results presented in [2, 3] give only the weaker Monte Carlo guarantees. Increasing the tunable parameter allows us to reduce the pre-processing complexity; a suitable setting gives the algorithm with a polynomial pre-processing time independent of the approximation factor.

In addition, we show the pre-processing efficient versions of the algorithms, which have a pre-processing complexity linear in n (optimal in terms of n). The c-approximate near neighbor problem without false negatives in l_2^d can be solved with the pre-processing time O(d^{ω-1} n) and an amortized query time sub-linear in n, again with two regimes of c treated separately in Section 5.2 and a tunable parameter controlling the trade-off.

This gives new results for probably the most interesting case from the practical point of view. In particular, for this range of c, we achieve a query time sub-linear in n. Again, our algorithm gives results similar to the one presented in [3].

All of the presented algorithms give Las Vegas guarantees, which are stronger than the previously considered Monte Carlo guarantees. The provided algorithms are practical in terms of implementation.

author      guarantees       query    pre-processing    space
[1]         Las Vegas        (Hamming space)
[8]         deterministic
[14]        Las Vegas
[3]         Monte Carlo
this work   Las Vegas
[3]         Monte Carlo
this work   Las Vegas

Table 1: Comparison of the results for the c-approximate near neighbor. We present only the "fast query" and "fast pre-processing" parts of the results for possibly small c. The results presented in [3] hold under an additional assumption on the parameters. The results in [8] are for a tunable parameter.

4 Notations

The input set is always assumed to contain n points. We consider the c-approximate near neighbor search without false negatives with approximation parameter c and the dimension of the space equal to d. The considered problem is solved in the standard Euclidean space, i.e., R^d with the standard l_2 norm. The query and pre-processing time complexities are always understood in expectation.

In this paper, we use the term "pre-processing" to refer to the sum of the actual pre-processing time and the storage required. W.l.o.g., throughout this work we assume that r, the given radius, equals 1 (otherwise, all vectors' lengths may be rescaled by 1/r). In this work, it is often convenient to use a parameter derived from c instead of c itself; whenever it appears, it is assumed to be defined as above. Finally, we use ω to denote the exponent in the complexity of fast matrix multiplication (currently ω < 2.373).

5 Nearest neighbors without false negatives for any

In this section, we show an efficient algorithm for solving the c-approximate near neighbor problem without false negatives in l_2^d. Indyk and Motwani [10] showed an algorithm whose pre-processing time is exponential in the dimension and whose query time is polynomial in it. The idea of the algorithm is the following. We start with a quantization of the given space, which reduces the problem to finding the near neighbor in a space with integer coefficients. After the quantization, there is a finite number of points which have a neighbor in the input set. It is enough to provide a data structure which stores all such points together with the accompanying near neighbors from the input set. It is proved that the number of such neighbors of each input point is bounded (exponentially in the dimension), so in total we need to store n times that many points. We can fetch a point from the data structure in time proportional to the size of this point, thus the query time is proportional to the bit size of the query point.
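A toy sketch of the quantization idea for very small d follows; the cell size, the exact enumeration rule and the approximation guarantee of [10] are not recoverable from this text, so every parameter below is illustrative.

```python
import numpy as np
from itertools import product

def grid_cells_near(point, r, cell):
    """All integer grid cells whose centers lie within distance r of `point`
    (exhaustive box enumeration; only meant for tiny d)."""
    point = np.asarray(point, dtype=float)
    lo = np.floor((point - r) / cell).astype(int)
    hi = np.floor((point + r) / cell).astype(int)
    cells = []
    for idx in product(*[range(l, h + 1) for l, h in zip(lo, hi)]):
        center = (np.array(idx) + 0.5) * cell
        if np.linalg.norm(center - point) <= r:
            cells.append(idx)
    return cells

def build_index(points, r, cell):
    """Pre-processing: for every grid cell close to some input point,
    remember one witnessing input point."""
    index = {}
    for p in points:
        for c in grid_cells_near(p, r, cell):
            index.setdefault(c, p)
    return index

def query(index, q, cell):
    """A query only quantizes itself and does a single lookup; the answer is an
    input point within about r of q (up to quantization error), or None."""
    key = tuple(np.floor(np.asarray(q, dtype=float) / cell).astype(int))
    return index.get(key)

pts = [[0.0, 0.0], [5.0, 5.0]]
idx = build_index(pts, r=1.0, cell=0.5)
print(query(idx, [0.6, 0.2], cell=0.5))   # -> [0.0, 0.0]
```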

The only issue left is to provide the appropriate storage. Indyk and Motwani [10] provided a hash-map storage. Let us consider the following standard, deterministic construction instead. For each of the stored points, we store its bit representation in a binary tree. This way, the length of the branch representing a point equals its bit-length. Hence, the query time is proportional to the bit size of the query. The size of the whole tree is bounded by the total size of the binary representation of all the stored points.
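A minimal sketch of such a deterministic storage: a binary trie keyed by the bit representation of a quantized point, so a lookup costs time proportional to the key's bit-length and the total size is bounded by the total bit-length of the stored keys. The fixed 16-bit-per-coordinate encoding is an arbitrary illustrative choice.

```python
class BitTrie:
    """Deterministic replacement for a hash map: store the bit representation
    of each (quantized) point in a binary trie."""

    def __init__(self):
        self.root = {}

    @staticmethod
    def _bits(key):
        # key: tuple of non-negative ints (a quantized point), 16 bits per coordinate
        for coord in key:
            for i in range(16):
                yield (coord >> i) & 1

    def insert(self, key, value):
        node = self.root
        for b in self._bits(key):
            node = node.setdefault(b, {})
        node["value"] = value

    def lookup(self, key):
        node = self.root
        for b in self._bits(key):
            if b not in node:
                return None
            node = node[b]
        return node.get("value")

trie = BitTrie()
trie.insert((3, 7), "input point #0")
print(trie.lookup((3, 7)), trie.lookup((3, 8)))   # -> input point #0 None
```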

The above construction gives an algorithm for the c-approximate near neighbor in l_2 with an efficient query time. Unfortunately, unless the dimension is very small, the pre-processing time is exponential. If the dimension is larger than the reduced dimension provided by Corollary 1, we may reduce the complexity of the pre-processing by reducing the dimension of the input space.

5.1 Fast query

In this section we prove Theorem 3:

  • The first range of c:

    We set the parameter in Corollary 1 so that the problem is reduced to a number of instances of the near neighbor problem in a reduced dimension, and:

    • the query time equals the amortized O(d^{ω-1}) plus the cost of querying the sub-problems,

    • the pre-processing time equals O(d^{ω-1} n) plus the cost of pre-processing the sub-problems.

    The reduced dimension is logarithmic in n for this choice of the parameter. Consequently, by [10], each of the instances is solved with a pre-processing time polynomial in n and a query time polynomial in the reduced dimension.

  • The second range of c:

    After setting the parameter in Corollary 1 to a suitable constant value, we reduce the problem to a number of instances of the near neighbor problem in a reduced dimension. Each of these instances is solved by [10] with the pre-processing time and the query time claimed in the theorem.

This ends the proof of Theorem 3.

5.2 Fast pre-processing

In this section, we prove Theorem 3. To achieve the algorithm with the fast, linear pre-processing, we will store all input points in a hash-map. During the query phase, we will ask the hash-map for all quantized points which are close to the query point. This way, most of the computation is moved from the pre-processing to the query phase. The dimension reduction is used in a similar manner as in Section 5.1. We skip the computation steps which are analogous to the corresponding ones in the previous section.
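A toy sketch of this "work moved to the query" variant, again only for very small d and with illustrative parameters: pre-processing only files each input point under its own grid cell, and the query enumerates all cells that could contain a point within distance r.

```python
import numpy as np
from itertools import product

def cell_of(p, cell):
    return tuple(np.floor(np.asarray(p, dtype=float) / cell).astype(int))

def build_index(points, cell):
    """Linear-time pre-processing: each input point goes into the hash map
    under its own grid cell only."""
    index = {}
    for p in points:
        index.setdefault(cell_of(p, cell), []).append(p)
    return index

def query(index, q, r, cell):
    """The work moves to the query: probe every grid cell that could contain
    a point within distance r of q (box enumeration; only sensible for tiny d)."""
    q = np.asarray(q, dtype=float)
    lo = np.floor((q - r) / cell).astype(int)
    hi = np.floor((q + r) / cell).astype(int)
    hits = []
    for idx in product(*[range(l, h + 1) for l, h in zip(lo, hi)]):
        for p in index.get(idx, []):
            if np.linalg.norm(np.asarray(p, dtype=float) - q) <= r:
                hits.append(p)
    return hits

idx = build_index([[0.0, 0.0], [5.0, 5.0]], cell=0.5)
print(query(idx, [0.6, 0.2], r=1.0, cell=0.5))   # -> [[0.0, 0.0]]
```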

  • For the first range of c, after setting the parameter in Corollary 1 appropriately, we get an algorithm with:

    • the query time consisting of the amortized O(d^{ω-1}) term, the expected number of false positives and the cost of querying the sub-problems,

    • the pre-processing time O(d^{ω-1} n) plus the cost of pre-processing the sub-problems.

    Balancing these terms by the choice of the tunable parameter yields the query time claimed in the theorem, which is sub-linear in n.

  • For the second range of c, after setting the parameter in Corollary 1 to a suitable constant value, we get the algorithm with:

    • the query time claimed in the theorem,

    • the pre-processing time O(d^{ω-1} n).

This ends the proof of Theorem 3.

One can produce Monte Carlo versions of the above theorems, which have only slightly better complexities (some factors in the bounds would be removed), because the dimension reduction is simpler in this case.

6 Conclusion and Future Work

We have presented a c-approximate near neighbor algorithm without false negatives in l_2^d for any c > 1. Future work concerns reducing the time complexity of the algorithm or proving that these restrictions are essential. We wish to match the time complexities given in [10] or show that the achieved bounds are optimal.

References

Appendix A Dimension Reduction

In this section, we repeat the arguments presented in [14]. Let us start with the well-known Johnson-Lindenstrauss Lemma, which is crucial to our results:

[Johnson-Lindenstrauss] Let x be chosen uniformly from the surface of the d-dimensional unit sphere. Let y be the projection of x onto its first m coordinates, where m < d. Then the squared length ‖y‖² is concentrated around its expectation m/d: for any fixed relative deviation, the probability that ‖y‖² falls outside the corresponding interval around m/d decays exponentially in m.

The basic idea is the following: we will introduce a number of linear mappings to transform the d-dimensional problem into a number of problems of reduced dimension.

We will introduce d/m linear mappings A_1, …, A_{d/m}, where A_i: R^d → R^m (for simplicity, let us assume that m divides d; this can be achieved by padding the extra dimensions with 0's), and show the following properties:

  1. for each point p, such that ‖p‖ ≤ 1, there exists i, such that ‖A_i p‖ ≤ 1,

  2. for each point p, such that ‖p‖ ≥ c, the probability that there exists i, such that ‖A_i p‖ ≤ 1, is bounded.

The first property states that for a given 'short' vector (with a length smaller than 1), there is always at least one mapping which transforms this vector to a vector of length smaller than 1. Moreover, we will show that there exists at least one mapping A_i which does not increase the length of the vector, i.e., such that ‖A_i p‖ ≤ ‖p‖. The second property states that we can bound the probability of a 'long' vector (‖p‖ ≥ c) being mapped to a 'short' one (‖A_i p‖ ≤ 1). Using standard concentration of measure arguments, we will prove that this probability decays exponentially in m.

A.1 Linear mappings

In this section, we will introduce linear mappings satisfying properties 1 and 2. Our technique will depend on the concentration bound used to prove the classic Johnson-Lindenstrauss Lemma. In Lemma A, we take a random vector and project it onto the first m vectors of the standard basis of R^d. In our setting, we will instead project the given vector onto a random orthonormal basis, which gives the same guarantees. The mapping A_i consists of m consecutive vectors from the random basis of the space, scaled by √(d/m). The following lemma, presented also in the related work section, describes the basic properties of our construction; we now prove Lemma 1.

Proof.

Let b_1, …, b_d be a random orthonormal basis of R^d. Each of the mappings A_i is represented by an m × d matrix. We will use A_i for denoting both the mapping and the corresponding matrix. The j-th row of the matrix A_i equals √(d/m) · b_{(i-1)m+j}. In other words, the rows of A_i consist of m consecutive vectors from the random basis of the space scaled by √(d/m).

To prove the first property, observe that ∑_{i=1}^{d/m} ‖A_i p‖² = (d/m) ‖p‖², since the norm is independent of the basis. Assume on the contrary that for each i, ‖A_i p‖ > ‖p‖. It follows that ∑_{i=1}^{d/m} ‖A_i p‖² > (d/m) ‖p‖². This contradiction ends the proof of the first property.

For any p such that ‖p‖ ≥ c, the event ‖A_i p‖ ≤ 1 means that the projection of the unit vector p/‖p‖ onto the m random directions defining A_i has squared norm at most m/(d·c²), i.e., at least a factor of c² below its expectation m/d. Applying Lemma A ends the proof. ∎
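A small numerical sketch of the construction from this proof, assuming the random orthonormal basis is obtained from the QR factorization of a Gaussian matrix (one way to generate such a basis); it checks the identity ∑_i ‖A_i p‖² = (d/m)‖p‖² and the resulting property 1.

```python
import numpy as np

def reduction_mappings(d, m, rng):
    """Take a random orthonormal basis of R^d, group its vectors into d/m
    blocks of m consecutive rows, and scale each block by sqrt(d/m)."""
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    basis = q.T                               # rows form a random orthonormal basis
    scale = np.sqrt(d / m)
    return [scale * basis[i * m:(i + 1) * m] for i in range(d // m)]

rng = np.random.default_rng(0)
d, m = 64, 8
A = reduction_mappings(d, m, rng)

p = rng.normal(size=d)
norms = np.array([np.linalg.norm(Ai @ p) for Ai in A])

# sum_i ||A_i p||^2 = (d/m) ||p||^2, so at least one block cannot increase the norm.
print(np.allclose(np.sum(norms**2), (d / m) * np.linalg.norm(p)**2))  # True
print(norms.min() <= np.linalg.norm(p))                               # True (property 1)
```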

The algorithm works as follows: for each i, we project the input set to R^m using A_i and solve the corresponding problem in the smaller space. For each query point, we need to merge the solutions obtained for each sub-problem. This results in reducing the d-dimensional problem to d/m instances of the m-dimensional problem.

[Lemma 2 in [14]] For m ≤ d and c > 1, the c-approximate near neighbor problem without false negatives in dimension d can be reduced to d/m instances of the problem in dimension m. The expected pre-processing time equals the cost of the reduction (computing the random basis and projecting all input points) plus the expected pre-processing time of the sub-problems, and the expected query time equals the cost of projecting the query point plus the expected number of false positives and the expected query time of the sub-problems.

Proof.

We use the assumptions on n and d from Section 1 to simplify the complexities. The pre-processing time consists of:

  • the time of computing a random orthonormal basis of R^d,

  • the time of changing the basis of all n input points to the random one,

  • the time of computing A_i p for all i and for all input points p,

  • the expected pre-processing time of all sub-problems.

The query time consists of:

  • the amortized time of changing the basis of the query point to the random one,

  • the expected number of false positives (by Lemma 1),

  • the expected query time of all sub-problems.

The following corollary simplifies the formulas used in Lemma A.1 and shows that the d-dimensional problem can be reduced to a number of problems of smaller dimension in an efficient way. Namely, setting the reduced dimension appropriately, we obtain Corollary 1 (presented also in the related work section).