
A Theory-Based Evaluation of Nearest Neighbor Models Put Into Practice

In the k-nearest neighborhood model (k-NN), we are given a set of points P, and we shall answer queries q by returning the k nearest neighbors of q in P according to some metric. This concept is crucial in many areas of data analysis and data processing, e.g., computer vision, document retrieval and machine learning. Many k-NN algorithms have been published and implemented, but often the relation between parameters and accuracy of the computed k-NN is not explicit. We study property testing of k-NN graphs in theory and evaluate it empirically: given a point set P ⊂ R^δ and a directed graph G = (P, E), is G a k-NN graph, i.e., every point p ∈ P has outgoing edges to its k nearest neighbors, or is it ϵ-far from being a k-NN graph? Here, ϵ-far means that one has to change more than an ϵ-fraction of the edges in order to make G a k-NN graph. We develop a randomized algorithm with one-sided error that decides this question, i.e., a property tester for the k-NN property, with complexity O(√(n) k^2 / ϵ^2) measured in terms of the number of vertices and edges it inspects, and we prove a lower bound of Ω(√(n / (ϵ k))). We evaluate our tester empirically on the k-NN models computed by various algorithms and show that it can be used to detect k-NN models with bad accuracy in significantly less time than the building time of the k-NN model.


1 Introduction

The k-nearest neighborhood (k-NN) of a point q with respect to some set of points P is one of the most fundamental concepts used in data analysis tasks such as classification, regression and machine learning. In the past decades, many algorithms have been proposed in theory as well as in practice to efficiently answer k-NN queries [e.g., FriAlg75, FukBra75, CalDec95, ConFas10, IndApp98, CheFas09, MujFas09, PedSci11, Nms13, MaiKNN17, ZhaEff18, AlgKgr18]. For example, one can construct a k-NN graph of a point set P, i.e., a directed graph G = (P, E) of size n = |P| such that E contains an edge (p, q) for every k-nearest neighbor q of p for every p ∈ P, in time O(n log n + kn) for constant dimension δ [CalDec95]. Due to restrictions on computational resources, approximations and heuristics are often used instead (see, e.g., [CheFas09, ConFas10] and the discussion therein for details). Given the output graph G of such a randomized approximation algorithm or heuristic, one might want to check whether G resembles a k-NN graph before using it, e.g., in a data processing pipeline. However, the time required for exact verification might cancel out the advantages gained by using an approximation algorithm or a heuristic. On the other hand, testing whether G is at least close to a k-NN graph will suffice for many purposes. Property testing is a framework for the theoretical analysis of decision and verification problems that are relaxed in favor of sublinear complexity. One motivation of property testing is to fathom the theoretical foundations of efficiently assessing approximation and heuristic algorithms’ outputs.

Property testing [RubRob96], and in particular property testing of graphs [GolPro98], has been studied quite extensively since its founding. A one-sided error ϵ-tester for a property Π of graphs with average degree bounded by d has to accept every graph G ∈ Π and it has to reject every graph H that is ϵ-far from Π with probability at least 2/3 (i.e., if graphs that are ϵ-far are relevant, it has precision 1 and recall 2/3). A graph G of size n is ϵ-far from some property Π if more than ϵdn edges have to be added or removed to transform it into a graph that is in Π. A two-sided error ϵ-tester may also err with probability less than 1/3 if the graph has the property. The computational complexity of a property tester is the number of adjacency list entries it reads, referred to as its queries. Many works in graph property testing focus on testing plain graphs that contain only the pure combinatorial information. However, most graphs that model real data contain some additional information that may, for example, indicate the type of an atom, the bandwidth of a data link or spatial information of an object that is represented by a vertex or an edge, respectively. In this work, we consider geometric graphs with bounded average degree. In particular, the graphs are embedded into R^δ, i.e., every vertex has a coordinate x ∈ R^δ. The coordinate of a vertex may be obtained by a query.

Main Results

Our first result is a property tester with one-sided error for the property that a given geometric graph G with bounded average degree is a k-nearest neighborhood graph of its underlying point set (i.e., it has precision 1 and recall 2/3 when taking ϵ-far graphs as relevant).

Theorem 1.

Given an input graph G = (V, E) of size n = |V| with bounded average degree d, there exists a one-sided error ϵ-tester that tests whether G is a k-nearest neighbourhood graph. It has query complexity c · √(n) k^2 ψ_δ / ϵ^2, where ψ_δ is the δ-dimensional kissing number and c is a universal constant.

We emphasize that it is not necessary to compute the ground truth (i.e., the k-NN of P) in order to run the property tester. Furthermore, the tester can be easily adapted for graphs G = (P ∪ Q, E) such that P ∩ Q = ∅ and we only require that for every q ∈ Q, E contains an edge (q, p) for every k-nearest neighbor p of q in P. This is more natural when we think of P as a training set and Q as a test set or query domain. To complement this result, we prove a lower bound that holds even for two-sided error testers.

Theorem 2.

Testing whether a given input graph G = (V, E) of size n = |V| is a k-nearest neighbourhood graph with one-sided or two-sided error requires Ω(√(n / (ϵ k))) queries.

Finally, we provide an experimental evaluation of our property tester on approximate nearest neighbor (ANN) indices computed by various ANN algorithms. Our results indicate that the tester requires significantly less time than the ANN algorithm needs to build the ANN index, most times just a small fraction of it. Therefore, it can often detect badly chosen parameters of the ANN algorithm at almost no additional cost and before the ANN index is fed into the remaining data processing pipeline.

Related Work

We give an overview of sublinear algorithms for geometric graphs, which is the topic of research that is most relevant to our work. As mentioned above, the research on k-NN algorithms is very broad and diverse. See, e.g., [DasNea91, ShaNea05] for surveys. Testing whether a geometric graph that is embedded into the plane is a Euclidean minimum spanning tree has been studied by ben2007lower and czumaj2008testing. In ben2007lower, the authors prove lower bounds on the number of queries that any non-adaptive tester and any adaptive tester has to make. In czumaj2008testing, a one-sided error tester with sublinear query complexity is given. In a fashion similar to property testing, CzuApp05 estimate the weight of Euclidean minimum spanning trees, and CzuEst09 approximate the weight of metric minimum spanning trees, both in sublinear time for constant dimension. hellweg2010testing develop a tester for Euclidean spanners. Property testers for many other geometric problems can, for example, be found in [Czumaj2000, ParTes01].

2 Preliminaries

Let δ, d, k and ϵ be fixed parameters. In this paper, we consider property testing on directed geometric graphs with bounded average degree d.

Definition 1 (geometric graph).

A graph G = (V, E) with an associated function coord: V → R^δ is a geometric graph, where each vertex v ∈ V is assigned a coordinate coord(v). Given v ∈ V, we denote its degree by deg(v) and the set of adjacent vertices by N(v). The Euclidean distance between two points x, y ∈ R^δ is denoted by dist(x, y). For the sake of simplicity, we write dist(u, v) = dist(coord(u), coord(v)) for two vertices u, v ∈ V. When there is no ambiguity, we also refer to coord(v) by simply writing v. We denote the size of the graph at hand by n = |V|.

Definition 2 (k-nearest neighborhood graph).

A geometric graph G = (V, E) is a k-nearest neighbourhood (k-NN) graph if for every v ∈ V, the k points u_1, …, u_k ∈ V \ {v} that lie nearest to v according to the Euclidean distance are neighbors of v in G, i.e., (v, u_i) ∈ E for all i ∈ {1, …, k} (breaking ties arbitrarily).
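The definition can be checked directly by brute force, which is the exact (and expensive) verification that the tester is meant to avoid. The following is a minimal sketch; the function name `is_knn_graph`, the list-of-tuples coordinate format and the adjacency dictionary are assumptions made for illustration, and ties are broken by vertex index.

```python
import math

def is_knn_graph(coords, adj, k):
    """Return True if every vertex has out-edges to its k nearest neighbors.

    coords: list of coordinate tuples, one per vertex (hypothetical format)
    adj:    dict mapping a vertex index to the list of its out-neighbors
    """
    n = len(coords)
    for v in range(n):
        # Rank all other vertices by Euclidean distance to v.
        # sorted() is stable, so ties are broken by vertex index.
        ranked = sorted((u for u in range(n) if u != v),
                        key=lambda u: math.dist(coords[v], coords[u]))
        if not set(ranked[:k]) <= set(adj[v]):
            return False
    return True

# Four points on a line; each vertex points to its single nearest neighbor.
coords = [(0.0,), (1.0,), (3.0,), (7.0,)]
exact = {0: [1], 1: [0], 2: [1], 3: [2]}
broken = {0: [1], 1: [0], 2: [3], 3: [2]}   # vertex 2 misses its neighbor 1
print(is_knn_graph(coords, exact, 1), is_knn_graph(coords, broken, 1))
```

This check ranks all n − 1 candidates for every vertex, so it inspects the whole point set, in contrast to the sublinear number of queries a property tester is allowed.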

Let G = (V, E) be a geometric graph. We say that a graph G is ϵ-far from a geometric graph property Π if at least ϵdn edges of G have to be modified in order to convert it into a graph that satisfies the property Π. We assume that the graph G is represented by a neighbor function f_G, where f_G(v, i) denotes the i-th neighbor of v if v has at least i neighbors (otherwise, f_G(v, i) is a special symbol indicating that no such neighbor exists), a degree function that outputs the degree of a vertex and a coordinate function that outputs the coordinates of a vertex.
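To make the access model concrete, the three functions can be wrapped in an oracle that counts queries. This is a sketch under assumptions: the class name `GraphOracle`, its method names, the use of `None` for a missing neighbor, and the choice to charge one unit per call of any of the three functions are all illustrative and not taken from the paper.

```python
class GraphOracle:
    """Query access to a geometric graph with a running query counter."""

    def __init__(self, coords, adj):
        self._coords = coords   # list of coordinate tuples
        self._adj = adj         # dict: vertex -> list of out-neighbors
        self.queries = 0

    def neighbor(self, v, i):
        """i-th neighbor of v (1-indexed), or None if v has fewer than i."""
        self.queries += 1
        nbrs = self._adj[v]
        return nbrs[i - 1] if 1 <= i <= len(nbrs) else None

    def degree(self, v):
        self.queries += 1
        return len(self._adj[v])

    def coordinate(self, v):
        self.queries += 1
        return self._coords[v]

oracle = GraphOracle([(0.0, 0.0), (1.0, 0.0)], {0: [1], 1: []})
first = oracle.neighbor(0, 1)     # existing neighbor
missing = oracle.neighbor(1, 1)   # vertex 1 has no neighbors
deg = oracle.degree(0)
```

A tester written against such an oracle can report its query count at the end, which is the quantity the complexity bounds refer to.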

Definition 3 (ϵ-tester).

A one-sided (error) ϵ-tester for a property Π with query complexity q is a randomized algorithm that makes q queries to the neighbor, degree and coordinate functions of a graph G. The algorithm accepts if G has the property Π. If G is ϵ-far from Π, then it rejects with probability at least 2/3.

The motivation to consider query complexity is that accessing the graph, e.g., through an ANN index, is costly and this cost cannot be influenced. Therefore, one should minimize access to the graph.

Definition 4 (witness).

Let denote the number of vertices that lie nearer to than . Further let denote the set of ’s -nearest neighbors. Let define the subset of that is not adjacent to . If or , we call incomplete, and we call elements of the witnesses of .
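As an illustration of the witness notion, the set of true k-nearest neighbors of a vertex that are missing from its adjacency list can be computed by brute force. This is a sketch for intuition only; the function name `witnesses` and the data layout are assumptions, and a tester would never compute this set exhaustively.

```python
import math

def witnesses(coords, adj, v, k):
    """k-nearest neighbors of v that are not adjacent to v (brute force)."""
    n = len(coords)
    ranked = sorted((u for u in range(n) if u != v),
                    key=lambda u: math.dist(coords[v], coords[u]))
    return set(ranked[:k]) - set(adj[v])

# Points on a line; vertex 2 (at 3.0) points to vertex 3 instead of vertex 1.
line = [(0.0,), (1.0,), (3.0,), (7.0,)]
graph = {0: [1], 1: [0], 2: [3], 3: [2]}
w_incomplete = witnesses(line, graph, 2, 1)   # vertex 1 is a witness of 2
w_complete = witnesses(line, graph, 0, 1)     # no witness for vertex 0
```

A vertex with a non-empty witness set is missing at least one edge to a true nearest neighbor, which is what the tester tries to detect by sampling.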

If G is ϵ-far from being a k-nearest neighborhood graph, an ϵ-fraction of its vertices are incomplete. The proof follows from common arguments in property testing.

Lemma 5.

If is -far from being a -nearest neighborhood graph, at least vertices are incomplete.


Proof.

Assume the contrary. For every incomplete vertex v, delete edges such that the distance to the property does not increase and insert the missing edges from v to its k nearest neighbors. By the assumption, the total number of inserted or deleted edges is less than ϵdn. Therefore, G is ϵ-close to being a k-nearest neighborhood graph. ∎

The main challenge for the property tester will be to find matching witnesses for a fixed set of incomplete vertices. The following result from coding theory for Euclidean codes bounds the maximum number of points that can have the same fixed point as nearest neighbor.

Lemma 6.

[333884] Given a point set P ⊂ R^δ and p ∈ P, the maximum number of points that can have p as nearest neighbour is bounded by the δ-dimensional kissing number ψ_δ, where 2^(0.2075 δ (1 + o(1))) ≤ ψ_δ [wyner1965capabilities] and ψ_δ ≤ 2^(0.401 δ (1 + o(1))) [kabatiansky1978bounds] (asymptotic notation with respect to δ).
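A small planar experiment illustrates the lemma. The kissing number of the plane is 6; placing m points evenly on a unit circle around a center shows that five points can all have the center as their strict nearest neighbor, while seven cannot. The function name and the circle construction are illustrative assumptions, not part of the paper.

```python
import math

def count_with_center_as_nearest(m, radius=1.0):
    """Place m >= 2 points evenly on a circle around the origin and count
    how many of them have the origin as their strict nearest neighbor."""
    center = (0.0, 0.0)
    pts = [(radius * math.cos(2 * math.pi * i / m),
            radius * math.sin(2 * math.pi * i / m)) for i in range(m)]
    count = 0
    for i, p in enumerate(pts):
        d_center = math.dist(p, center)
        d_other = min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
        if d_center < d_other:
            count += 1
    return count

five = count_with_center_as_nearest(5)    # side length 2*sin(pi/5) > 1
seven = count_with_center_as_nearest(7)   # side length 2*sin(pi/7) < 1
```

With six points the circle neighbors are at exactly the same distance as the center, so the hexagon is the tie case in which the bound of 6 is attained.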

3 Upper Bound

The idea of the tester is as follows (see Algorithm 1). Two samples are drawn uniformly at random: S, which shall contain many incomplete vertices if G is ϵ-far from being a k-nearest neighborhood graph, and T, which shall contain at least one witness of an incomplete vertex in S. For every v ∈ S, the algorithm should query its degree, its coordinate as well as every adjacent vertex and their coordinates and calculate the distance to them. If this local information already shows that v is incomplete or if one of the vertices in T is a witness of v, the algorithm found an incomplete vertex, and hence rejects. Otherwise, it accepts.

However, we have to deal with the case that some vertices in S have large, non-constant degree, such that querying all their adjacent vertices would require too many queries. To this end, we prove that one can prune these vertices to obtain a subset S′ ⊆ S of low degree vertices that still contains many incomplete vertices with sufficient probability.

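Data: geometric graph G = (V, E) and the fixed parameters
Result: accept or reject
S ← sample of vertices from V drawn u.a.r. without replacement;
T ← sample of vertices from V drawn u.a.r. with replacement;
S′ ← low-degree vertices of S;
for v ∈ S′ do
       if v is found to be incomplete, in particular if some u ∈ T is a witness of v then
             reject;
       end if
end for
accept;
Algorithm 1 Tester for k-nearest neighborhood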
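The structure of Algorithm 1 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the sample sizes `s` and `t` and the degree threshold `max_deg` are left as caller-supplied parameters because their values come from the analysis, and the concrete rejection rule (fewer than k out-edges, or a sampled non-neighbor that is strictly closer than the k-th closest adjacent vertex) is one way to detect a witness without ever rejecting a true k-NN graph.

```python
import math
import random

def knn_property_tester(coords, adj, k, s, t, max_deg, rng=random):
    """Return True (accept) or False (reject); rejects only on a violation.

    s, t and max_deg are hypothetical parameters standing in for the
    sample sizes and the low-degree threshold of the analysis.
    """
    n = len(coords)
    S = rng.sample(range(n), min(s, n))            # without replacement
    T = [rng.randrange(n) for _ in range(t)]       # with replacement
    S_low = [v for v in S if len(adj[v]) <= max_deg]   # prune high degree
    for v in S_low:
        nbrs = set(adj[v]) - {v}
        if len(nbrs) < k:
            return False                           # cannot cover k neighbors
        dists = sorted(math.dist(coords[v], coords[u]) for u in nbrs)
        kth = dists[k - 1]                         # k-th closest adjacent vertex
        for u in T:
            # A non-adjacent vertex strictly closer than the k-th closest
            # neighbor must belong to the true k-NN of v: a witness.
            if u != v and u not in nbrs and math.dist(coords[v], coords[u]) < kth:
                return False
    return True

def brute_force_ranking(coords, v):
    """All other vertices sorted by distance to v (helper for the demo)."""
    return sorted((u for u in range(len(coords)) if u != v),
                  key=lambda u: math.dist(coords[v], coords[u]))

demo_rng = random.Random(1)
pts = [(demo_rng.random(), demo_rng.random()) for _ in range(40)]
exact_graph = {v: brute_force_ranking(pts, v)[:3] for v in range(40)}
far_graph = {v: brute_force_ranking(pts, v)[-3:] for v in range(40)}
```

On `exact_graph` the sketch always accepts, since no non-adjacent vertex can be strictly closer than the third nearest neighbor. On `far_graph`, where every vertex points to its three farthest points, almost any sampled vertex is a witness, so a modest `t` suffices to reject.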

Proof of Theorem 1

We prove that Algorithm 1 is an ϵ-tester as claimed by Theorem 1. Since Algorithm 1 never rejects a k-nearest neighbourhood graph, assume without loss of generality that G is ϵ-far from being a k-nearest neighborhood graph. Algorithm 1 only queries the neighbors of the low-degree vertices in its sample S, and therefore its query complexity is within the claimed bound. It remains to prove the correctness.

In the following, we consider the set of all vertices in V that have low degree, the set of incomplete vertices among them, and the set of incomplete vertices in the pruned sample. By an averaging argument, all but a small fraction of the vertices have low degree. It follows from Lemma 5 that the set of low-degree vertices contains many incomplete vertices, and therefore we focus on finding incomplete vertices that have low degree. The following random variable identifies witnesses of incomplete vertices in the pruned sample.

Definition 7.

Given a vertex u, let X_u be a random variable that is 1 if u is a witness of an incomplete vertex and 0 otherwise.

The proof of Theorem 1 follows from the following three claims. First, note that the pruned sample is a uniform sample without replacement from the set of low-degree vertices, and its size is random. However, it is sufficiently large with constant probability. This claim follows from Markov’s inequality.

Claim 8.

With probability at least , .


Proof.

The expected cardinality of is . Therefore, the probability that is less than is at most by Markov’s inequality. ∎

In the subsequent sections, we prove the following two claims. Given that the pruned sample is sufficiently large, it will contain sufficiently many incomplete vertices with constant probability.

Claim 9 (Lemma 11).

If , it holds with probability at least that .

Finally, we show that if the pruned sample contains sufficiently many incomplete vertices, then T will contain at least one witness of such an incomplete vertex with constant probability.

Claim 10 (Lemma 14).

If , with probability at least , .

The correctness follows by a union bound over these three bad events.

Analysis of the Sample S: Proof of Claim 9

We bound the cardinality of S such that the pruned sample contains sufficiently many incomplete vertices.

Lemma 11.

If , then with probability at least .

Proof.

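Since S was sampled without replacement, the random variable that counts the incomplete vertices in the sample follows the hypergeometric distribution. Let Y be a random variable that denotes the number of draws that are needed to obtain the required number of incomplete vertices, which therefore follows the negative hypergeometric distribution. By Lemma 5, the expectation of Y is bounded from above. By the definition of the two random variables, the sample contains too few incomplete vertices exactly if more draws are needed than the sample provides. We apply Markov’s inequality to bound the probability of this event. It follows that the chosen sample size ensures sufficiently many incomplete vertices with sufficient probability. ∎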
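The argument pattern of this proof can be made concrete numerically. For a population of N vertices of which K are incomplete, the number of draws without replacement needed to see r incomplete vertices has expectation r(N + 1)/(K + 1), a standard fact about the negative hypergeometric distribution, and Markov's inequality turns this into a sample size. The sketch below uses hypothetical function names and example numbers; the paper's actual constants are not reproduced here.

```python
import math
import random

def expected_draws(r, population, successes):
    """Expected number of draws without replacement until r successes."""
    return r * (population + 1) / (successes + 1)

def markov_sample_size(r, population, successes, failure_prob):
    """Sample size s with Pr[draws needed >= s] <= failure_prob (Markov)."""
    s = math.ceil(expected_draws(r, population, successes) / failure_prob)
    return min(s, population)

def simulate_draws(r, population, successes, rng):
    """Draw without replacement until r successes; return number of draws."""
    items = [1] * successes + [0] * (population - successes)
    rng.shuffle(items)
    seen = 0
    for position, item in enumerate(items, start=1):
        seen += item
        if seen == r:
            return position
    raise ValueError("fewer than r successes in the population")

sim_rng = random.Random(0)
trials = [simulate_draws(3, 100, 20, sim_rng) for _ in range(2000)]
empirical_mean = sum(trials) / len(trials)
size_for_half = markov_sample_size(3, 100, 20, 0.5)
```

Markov's inequality is loose, so the resulting sample size is conservative; it is used here because it only needs the expectation.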

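Analysis of the Sample T: Proof of Claim 10

We prove a lower bound on the number of distinct witnesses of the incomplete vertices in the pruned sample, which will imply a bound on the size of T. The bound relies on the following proposition, which we prove by k-reducing the general case to the case k = 1.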

Proposition 12.

Given a point set P ⊂ R^δ, k ∈ N and p ∈ P, the maximum number of points that can have p as k-nearest neighbor is bounded by k · ψ_δ.

We note that this bound is tight, as shown in LABEL:thm:kreducing_tight.

Definition 13 (k-reducing).

Let be an arbitrary point. Fix . Repeat the following steps until .

  • Pick a point that lies furthest from and let .

  • Set .

Proof of Proposition 12.

We apply Definition 13 to the set of points that have p as k-nearest neighbor and prove that the size of this set at the beginning of the process is at most k · ψ_δ, which proves the claim.

At first we show that every vertex that is picked by stays in : Let be arbitrary points that are picked by in the process of -reducing, with being picked in an earlier iteration than . The latter implies . Assume that at the time is selected, and therefore is removed from . Since is deleted by , it holds that , which is a contradiction as has been selected before .

We continue to bound the maximum number of vertices that share their k-nearest neighbor: Because p is the nearest point for the remaining vertices, we apply Lemma 6 and conclude that at most ψ_δ vertices are remaining after k-reducing. Since every iteration removed at most k − 1 points from the set, the cardinality of the set at the beginning of the process was at most k · ψ_δ. ∎

Since at most k · ψ_δ vertices can share a witness by Proposition 12, the number of distinct witnesses of the incomplete vertices in the pruned sample is at least the number of these vertices divided by k · ψ_δ. We employ this bound to calculate the size of the sample T such that it contains at least one witness of an incomplete vertex in the pruned sample with constant probability.
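The calculation behind such a sample size is a standard one for sampling with replacement: if w of the n vertices are witnesses, a sample of t vertices misses all of them with probability (1 − w/n)^t ≤ exp(−t w / n). The sketch below turns this into a sample size; the function names and the example numbers are assumptions for illustration and do not reproduce the constants of Lemma 14.

```python
import math

def miss_probability(t, num_witnesses, n):
    """Probability that t uniform samples with replacement hit no witness."""
    return (1 - num_witnesses / n) ** t

def sample_size_for_witness(num_witnesses, n, failure_prob):
    """Smallest t guaranteed by (1 - w/n)^t <= exp(-t*w/n) <= failure_prob."""
    return math.ceil(n / num_witnesses * math.log(1 / failure_prob))

t_needed = sample_size_for_witness(10, 1000, 0.1)
miss = miss_probability(t_needed, 10, 1000)
```

The required size grows with n / w, so a lower bound on the number of distinct witnesses translates directly into an upper bound on the size of T.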

Lemma 14.

If and , then .

Proof.

Since every vertex is sampled uniformly at random with replacement, the event that one vertex is a witness is a Bernoulli trial with probability . Therefore . We have