On k Nearest Neighbor Queries in the Plane for General Distance Functions

05/05/2018
by   Chih-Hung Liu, et al.
ETH Zurich

We study k nearest neighbor queries in the plane for general (convex, pairwise disjoint) sites of constant description complexity (such as line segments, disks, and quadrilaterals) and with respect to a general family of distance functions including the L_p-norms and constant-size convex distance functions. We develop a data structure with O(n log log n) space, O(log n + k) query time, and expected O(n polylog n) preprocessing time, removing a (log^2 n log log n)-factor from the O(n (log^2 n) (log log n)^2) space of Bohler et al.'s recent SoCG'16 work. In addition, our dynamic version (which allows insertions and deletions of sites) improves the space of Kaplan et al.'s recent SODA'17 work from O(n log^3 n) to O(n log n), and removes a (log^2 n)-factor from their deletion time. We obtain these improvements based on linear-size shallow cuttings, which are a standard technique for the k nearest neighbor problem for point sites in the Euclidean metric. Kaplan et al. have generalized shallow cuttings to general distance functions, but the size of their version has an extra double logarithmic factor. We design linear-size shallow cuttings for general distance functions, indicating that for general distance functions it could still be possible to achieve the same complexities as for point sites in the Euclidean metric. Our key to achieving the linear size is a new random sampling technique (for the configuration space) that employs relatively many local conflicts to prevent relatively few global conflicts. We believe this new random sampling technique has its own merit for further applications.


1 Introduction

One of the most classical problems in computational geometry is to search for nearest neighbors in the plane, dating back to Shamos and Hoey (1975) [34]. Given a set S of n geometric sites in the plane and a distance measure, the k nearest neighbor problem is to build a data structure that answers, for a query point p and a query integer k, the k nearest sites of p in S. A related problem, called the circular range problem, is instead to answer, for a query point p and a query radius δ, all the sites in S whose distance to p is at most δ. A circular range query can be answered through k nearest neighbor queries for increasing values of k until all the sites inside the circular range have been found. For point sites in the Euclidean metric, these two problems have received considerable attention [34, 9, 22, 17, 7, 30, 13, 33, 1, 14, 15, 27]. Many practical scenarios, however, entail non-point sites and non-Euclidean distance measures, which have been extensively discussed in Kaplan et al.’s recent work [27]. Therefore, for potential practical applications, it is beneficial to study the k nearest neighbor problem for general distance functions.
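The reduction from circular range queries to k nearest neighbor queries can be sketched as follows. This is only an illustration under our own assumptions: the sites are points, the distance is Euclidean, the k nearest neighbor structure is replaced by a brute-force stand-in, and k is doubled between queries (one common choice). The names k_nearest and circular_range are hypothetical.

```python
import math


def k_nearest(sites, q, k, dist=math.dist):
    """Brute-force stand-in for a k nearest neighbor data structure."""
    return sorted(sites, key=lambda s: dist(s, q))[:k]


def circular_range(sites, q, radius, dist=math.dist):
    """Report all sites within `radius` of q using only k-NN queries."""
    k = 1
    while True:
        nearest = k_nearest(sites, q, k, dist)
        # Stop once every site has been reported, or once the farthest
        # reported site already lies outside the query disk.
        if len(nearest) == len(sites) or dist(nearest[-1], q) > radius:
            return [s for s in nearest if dist(s, q) <= radius]
        k *= 2  # assumption: double k between consecutive queries


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (3, 0), (0, 2), (5, 5)]
    print(circular_range(pts, (0, 0), 2))
```

With doubling, the last query asks for at most roughly twice as many sites as are finally reported, so the total output-sensitive cost stays proportional to the answer size plus the per-query overhead.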

The key technique for the standard case is shallow cuttings, a notion to be defined momentarily. Kaplan et al. [27] have generalized this notion to general distance functions, but the size of their version has an extra double logarithmic factor, preventing the derivation of an optimal data structure. Our main contribution is to devise a linear-size shallow cutting for general distance functions, shedding light on achieving the same complexities as the standard case. Based on our linear-size shallow cuttings, we build a data structure for the k nearest neighbor problem under general distance functions with O(n log log n) space and O(log n + k) query time, removing a (log^2 n log log n)-factor from the space of previous works [11, 17]. In addition, our shallow cuttings also enable a dynamic version that allows insertions and deletions of sites with O(n log n) space, improving the space of Kaplan et al.’s work [27] by a (log^2 n)-factor. Our key ingredient is a new random sampling technique that employs relatively many local conflicts to prevent relatively few global conflicts, providing a new way to analyze geometric structures.

We now describe the setting in detail, which is identical to Kaplan et al.’s [27]. Let be a set of pairwise disjoint sites that are simply-shaped compact convex regions in the plane (e.g. line segments, disks, squares, etc.) and let be a continuous distance function between two points in the plane. and the sites in are defined by a constant number of polynomial equations and inequalities of constant maximum degree. For each site , its distance function with respect to points is defined by . Let denote the collection of distance functions ; the lower envelope of is the pointwise minimum . Assume that for any subset of , has faces, edges, and vertices; this assumption holds for many practical applications [8, 27].

Each of these distance functions can be interpreted as an xy-monotone surface in R^3, where the z-coordinate is the distance from the xy-coordinates to the respective site. For example, the surface for a point site (a, b) in the L_1 metric is the inverted pyramid z = |x − a| + |y − b|. Then, the k nearest sites of a point p correspond to the k lowest surfaces along the vertical line passing through p.

For point sites in the Euclidean metric, a standard lifting technique can map each site to a plane tangent to the unit paraboloid [31]. An optimal data structure for the k lowest plane problem has recently been developed with O(n) space, O(log n + k) query time, and O(n log n) preprocessing time [1, 15]. The dynamic version allows O(log^2 n + k) query time, O(log^3 n) amortized insertion time, and O(log^5 n) amortized deletion time [14, 15, 27].
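The lifting technique can be checked numerically with a minimal sketch. We assume here the sign convention in which a site (a, b) is mapped to the plane z = −2ax − 2by + a^2 + b^2, tangent to the paraboloid z = −(x^2 + y^2); the height of this plane above a query point q equals |q − s|^2 − |q|^2, so ordering planes from lowest to highest along the vertical line through q orders the sites by Euclidean distance. The function names are hypothetical.

```python
import math


def lift(site):
    """Map a point site (a, b) to the plane z = -2ax - 2by + a^2 + b^2."""
    a, b = site
    return (-2.0 * a, -2.0 * b, a * a + b * b)


def height(plane, q):
    """Height of the plane (cx, cy, c0) on the vertical line through q."""
    cx, cy, c0 = plane
    return cx * q[0] + cy * q[1] + c0


def k_lowest_planes(sites, q, k):
    """Indices of the k lowest lifted planes on the vertical line through q."""
    planes = [lift(s) for s in sites]
    order = sorted(range(len(sites)), key=lambda i: height(planes[i], q))
    return order[:k]


def k_nearest_sites(sites, q, k):
    """Indices of the k nearest sites by Euclidean distance (brute force)."""
    order = sorted(range(len(sites)), key=lambda i: math.dist(sites[i], q))
    return order[:k]
```

On inputs without distance ties, both functions return the same indices, which is exactly the correspondence the k lowest plane problem relies on.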

Shallow Cuttings.

Let be a set of planes in , and define the level of a point in as the number of planes in lying vertically below it and the -level of as the set of points in with level of at most . An -shallow -cutting for is a set of disjoint semi-unbounded vertical triangular prisms, i.e., tetrahedra with a vertex at , covering the -level of such that each prism intersects at most planes. We abbreviate the -shallow -cutting as -shallow-cutting. Since a -shallow-cutting covers the -level of , if and each prism stores the planes intersecting it, then the lowest planes of a query vertical line can be answered by locating the prism intersected by the line and checking the stored planes.
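As a small illustration of these definitions, the sketch below computes the level of a point with respect to a set of planes and answers a lowest-planes query from the list of planes stored with a single prism. Planes are represented as coefficient triples (a, b, c) for z = ax + by + c; this representation and the function names are our own assumptions, and point location among the prisms is omitted.

```python
def plane_height(plane, x, y):
    """Height of the plane z = a*x + b*y + c above the point (x, y)."""
    a, b, c = plane
    return a * x + b * y + c


def level(planes, point):
    """Number of planes lying strictly vertically below the point."""
    x, y, z = point
    return sum(1 for h in planes if plane_height(h, x, y) < z)


def lowest_planes(stored_planes, x, y, count):
    """Lowest `count` planes on the vertical line through (x, y), taken
    from the planes stored with the prism that the line intersects."""
    ranked = sorted(stored_planes, key=lambda h: plane_height(h, x, y))
    return ranked[:count]
```

The query is correct only when the prism's stored list contains every plane among the requested lowest ones, which is what the covering property of the shallow cutting is meant to guarantee.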

Matoušek [30] proved the existence of a -shallow-cutting of tetrahedra, Ramos [33] proposed an -time randomized algorithm to construct -shallow-cuttings for , and Chan [13] turned tetrahedra into semi-unbounded triangular prisms, leading to a data structure for the lowest plane problem with space, query time, and expected preprocessing time. Afshani and Chan [1] exploited Matoušek’s shallow partition theorem [30] to achieve the space. Chan [14] also designed a dynamic version based on shallow cuttings with query time, expected amortized insertion time and expected amortized deletion time. Chan and Tsakalidis [15] proposed a deterministic algorithm for the -shallow-cuttings, making the above-mentioned time complexities deterministic; Kaplan et al. [27] improved the deletion time to amortized .

Generalization and Difficulty.

Recently, Kaplan et al. [27] integrated the vertical decomposition of surfaces [18, 3] and the (, )-approximations [26] to design shallow cuttings for -monotone surfaces in each of which corresponds to the distance function of a geometric site in the plane. The expected size of their -shallow-cutting is , directly yielding a data structure for the nearest neighbor problem with space, query time and expected preprocessing time, where is the maximum length of a Davenport-Schinzel sequence of order and is a constant dependent on surfaces. Their dynamic version (that allows insertions and deletions of sites) achieves query time, and expected amortized insertion time and expected amortized deletion time. (Although they only claimed for the case , the general case directly follows from Chan’s original idea [14].) To achieve or smaller space, -shallow-cuttings of size would be required, but the existence of such shallow cuttings is still open.

Matoušek’s method [30] for a -shallow-cutting picks planes randomly, builds the canonical triangulation for the arrangement of the planes, includes all tetrahedra, called relevant, that intersect the -level of the input planes, and if a relevant tetrahedron intersects more than planes, refines this “heavy” one into smaller “light” ones. Although his method works for the vertical decomposition of surfaces, his analysis does not seem directly applicable.

His analysis counts relevant tetrahedra by their vertices, and a vertex is also an intersection point among three of the “input” planes. Since a tetrahedron intersects no “sample” plane, the probability that it intersects at least planes is , i.e., using relatively few local conflicts to prevent relatively many global ones. Thus, the probability that an intersection point with level at least is a vertex of a relevant tetrahedron is , leading to an expected sum of . In other words, his analysis bounds the number of local configurations, i.e., vertices of relevant tetrahedra, through global configurations, i.e., intersections among input hyperplanes.

However, the vertical decomposition of surfaces consists of pseudo-prisms, and a vertex of a pseudo-prism is not necessarily an intersection point among three surfaces, preventing a direct application of Matoušek’s analysis. To overcome this difficulty, a new random sampling technique that enables a direct analysis for the number of relevant pseudo-prisms would be required; for an example in this direction, see Section 1.1.

Other General Results.

Agarwal et al. [4, 6] studied the range search problem with semialgebraic sets. They considered a set of points in , and a collection of ranges each of which is a subset of and defined by a constant number of constant-degree polynomial inequalities. They constructed an -space data structure in time that for a query range , reports all the points inside within time, where is unknown before the query. Their data structure can be applied to the circular range problem by mapping each geometric site to a point site in higher dimensions, e.g. a disk and a line segment can be mapped to a point in and a point in , respectively.

Bohler et al. [10] generalized the order- Voronoi diagram [29, 8] to Klein’s abstract setting [28], which is based on a bisecting curve system for sites rather than concrete geometric sites and distance measures. They also proposed randomized divide-and-conquer and incremental construction algorithms [12, 11]. A combination of their results and Chazelle et al’s nearest neighbor algorithm [17] yields a data structure with space, query time, and expected preprocessing time for the nearest neighbor problem.

1.1 Our Contributions

We propose a new random sampling technique (Theorem 2.2) for the configuration space (Section 2.1). At a high level, our technique says that if the local conflict size is large, the global conflict size is unlikely to be small, while the existing ones say that if the local conflict size is small, the global conflict size is unlikely to be large. More precisely, for a set of objects and an -element random subset of , we prove that if a configuration conflicts with objects in , the probability that it conflicts with at most objects in decreases factorially in . By contrast, many state-of-the-art techniques [21, 19, 24, 5] indicate that if a configuration conflicts with no object in , the probability that it conflicts with at least objects in decreases exponentially in .

This conceptual contrast provides an alternative way to develop and analyze algorithms. Roughly speaking, to bound the number of configurations satisfying certain properties, with the existing techniques one would adopt global configurations, e.g., Matoušek’s analysis for -shallow-cuttings, while with ours one could make use of local configurations. To illustrate the difference, we give an alternative analysis for the expected number of relevant tetrahedra. Since a relevant tetrahedron intersects the -level of the planes, it lies above at most planes. Our technique can show that if a tetrahedron lies above sample planes, the probability that it lies above at most planes is . Since the triangulation of the sample planes has such tetrahedra [36], the expected number of relevant tetrahedra is . Therefore, we believe our random sampling technique has its own merit for further applications.

Based on our random sampling technique, we design a k-shallow-cutting using the vertical decomposition of surfaces and prove its expected size to be linear, i.e., O(n/k), indicating that for general distance functions, it could still be possible to achieve the same complexities as for point sites in the Euclidean metric. From the viewpoint of the standard version, our proof conceptually confirms that a tetrahedron lying above relatively many “sample” planes is unlikely to be “relevant.”

Then, we adopt Afshani and Chan’s ideas [1] to compose our shallow cuttings and Agarwal et al.’s data structure [6] into a data structure for the k nearest neighbor problem under general distance functions with O(n log log n) space and the optimal O(log n + k) query time, improving the combination of Bohler et al.’s and Chazelle et al.’s methods [11, 17] by a (log^2 n log log n)-factor in space. Our general version works for point sites in any constant-size algebraic convex distance function, and for disjoint line segments, disks, and constant-size convex polygons in the L_p norms or under the Hausdorff metric.

The preprocessing time is for which we modify Kaplan et al.’s construction algorithm [27] to compute our shallow cuttings; for the constant , please see Section 3.1. Replacing the shallow cuttings in Kaplan et al.’s dynamic data structure with ours attains O(n log n) space, query time, expected amortized insertion time, and expected amortized deletion time, improving their space from O(n log^3 n) to O(n log n) and removing a (log^2 n)-factor from their deletion time.

The remaining challenges are the optimal space and the optimal preprocessing time. For the former, a generalization of the shallow partition theorem [30] to general distance functions would be advantageous, while the original proof significantly depends on certain geometric properties of hyperplanes. For the latter, the traversal idea by Chan [13] and Ramos [33] seems not to work directly since a pseudo-prism is possibly adjacent to a “non-constant” number of pseudo-prisms in the vertical decomposition.

This paper is organized as follows. Section 2 introduces the configuration space and derives the random sampling technique; Section 3 formulates surfaces, designs the k-shallow-cutting, and bounds its expected size; Section 4 presents the construction algorithm for shallow cuttings; Section 5 composes the data structure for the k nearest neighbor problem.

2 Random Sampling

We first introduce the configuration space and discuss several classical random sampling techniques. Then, we propose a new random sampling technique that utilizes relatively many local conflicts to prevent relatively few global conflicts, in contrast to most state-of-the-art works, which adopt relatively few local conflicts to prevent relatively many global conflicts; the two approaches lead to different applications. Finally, since our new technique requires some boundary conditions, we further prove that those boundary conditions are sufficient with high probability. Our random sampling technique is very general, and for further applications, we describe it in an abstract form.

2.1 Configuration Space

Let be a set of objects, and for a subset , define a collection of “configurations.” For example, objects are planes in three dimensions, and a configuration is a tetrahedron in the so-called canonical triangulation [2, 32] for the arrangement formed by those planes. Let be the set of configurations defined by all subsets of .

For each configuration , we associate with two subsets . , called the defining set, defines in a suitable geometric sense. For instance, is a tetrahedron, and is the set of planes that define the vertices of . Assume that for every , for a constant . , called the conflict set, comprises objects being said to conflict with . The meaning of depends on the subject. For computing the arrangement of planes, is a tetrahedron, and is the set of planes intersecting . Let .

Furthermore, we let be the set of configurations with (i.e., the local conflict size is ), let be the set of configurations with (i.e., the global conflict size is ), and let be .

Most existing works focus on , and Agarwal et al. [5] considered two conditions:

  (i) For any , and .

  (ii) If and is a subset of with , then .

They generalized Chazelle and Friedman’s concept [19] to bound the expected number of configurations that conflict with no object in an -element sample but at least objects: For an -element random subset of , if satisfies Conditions (i) and (ii),

where is a parameter with and is a random subset of of size .

In addition to the expected results, several high probability results exist if satisfies a property called bounded valency: for all subsets , , and for all configurations , . The following lemma is a corollary of [32, Theorem 5.1.2].

Lemma 2.1. If satisfies the bounded valency, for a random subset of of size and a sufficiently large constant , with probability greater than , every configuration in that conflicts with no object in conflicts with at most objects in .
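To get a feel for this kind of statement, the following sketch simulates a one-dimensional analogue. The objects are the integers 0, ..., n − 1, the sample is a uniform r-element subset, a configuration is a maximal gap between consecutive sample elements (so it conflicts with no sample object), and its conflict size is the number of non-sampled integers inside the gap. The setting, the function name, and the reference scale (n/r) ln r are our own illustrative assumptions; they are not the constants of the lemma.

```python
import math
import random


def empty_gap_conflicts(n, r, seed=0):
    """Conflict sizes of the r + 1 gaps left by a random r-subset of range(n)."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(range(n), r))
    # Sentinels -1 and n close the two unbounded gaps.
    boundaries = [-1] + sample + [n]
    # A gap (lo, hi) contains the integers lo + 1, ..., hi - 1.
    return [hi - lo - 1 for lo, hi in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    n, r = 100_000, 1_000
    conflicts = empty_gap_conflicts(n, r)
    scale = (n / r) * math.log(r)
    print("largest conflict size:", max(conflicts))
    print("reference scale (n/r) ln r:", round(scale))
```

Running the script for several seeds gives an empirical impression of how the largest conflict size compares with the reference scale.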

2.2 Many Local Conflicts Prevent Few Global Conflicts

Potential applications would need to utilize relatively many local conflicts to prevent relatively few global conflicts, as in our alternative proof for the expected number of relevant tetrahedra in Section 1.1.

To realize this utilization, we generalize Conditions (i) and (ii) as follows:

  (I) For any , and .

  (II) If and is a subset of with and , then .

One could assume to be empty by replacing with , but the main issue is that is not necessarily empty, distinguishing the technical details of bounding in the proof of Theorem 2.2 from previous works [5, 19, 24, 30].

We establish Theorem 2.2, which roughly states that if the local conflict size of a configuration is , the probability that its global conflict size is linear in roughly decreases factorially in .

Configurations defined by
Configurations in in conflict with objects in
Configurations in in conflict with objects in
Configurations in in conflict with objects in and objects in
Configurations defined by all possible subsets .
Configurations in in conflict with at most objects in
objects that define
size of
objects that conflict with
size of
Table 1: Symbol Table.

Theorem 2.2. Let be an -element random subset of with , and let be an integer with . If satisfies Conditions (I) and (II), then

where is an -element random subset of . ( is also a random subset of .)

Proof.

To simplify descriptions, let be and be the set of configurations in that conflict with at most objects in . Consider a configuration , and let be . It is clear that . We attempt to prove

(1)

where is an -element random subset of . Then, we have

Let be a fixed configuration. Also let us assume that holds; otherwise, must not belong to , making the claim (1) obvious.

Let be the event that and , and let be the event that and , the latter of which implies that .

According to Condition (I), we have

and

So, we have

According to Condition (II), we have

implying that

(2)

Let be , be and be , i.e., . Recall that since .

which derives the claim (1) from the claim (2). The second-to-last inequality follows from the fact that , and the last inequality follows from the fact that (since ).

Since needs to contain all the objects in and exactly objects in , we let be at least to allow the case that and ; we also let be at least to allow the case that . These two settings lead to the condition that . ∎

2.3 Logarithmic Local Conflicts are Enough

Theorem 2.2 requires to be at most , but could be in the worst case. Therefore, we will prove that with high probability, is . First of all, we analyze the probability that a configuration conflicts with few elements in but relatively many elements in .

Lemma 2.3. Let be an -element random subset of , let be a configuration with , let be , and let be . If and , then the probability that is at most .

Proof.

Let be , and be . It is clear that . Since must contain the elements of and elements of , the probability is

Since , we have , implying that the probability is at most

The first inequality follows from Stirling’s approximation, the second inequality follows from the fact that is inversely proportional to and that , and the third inequality follows from the fact that , i.e., . ∎
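The probability analyzed in this proof is of hypergeometric type, and it can be evaluated exactly for small parameters. The sketch below reflects our reading of the argument: the sample must contain all defining objects of the configuration and a prescribed number of its conflicting objects. The parameter names (n objects, sample size r, d defining objects, w conflicting objects, t conflicting objects in the sample) and the function name are assumptions introduced for illustration.

```python
from math import comb


def prob_exact_local_conflicts(n, r, d, w, t):
    """Probability that a uniform r-element sample of n objects contains
    all d defining objects and exactly t of the w conflicting objects."""
    rest = n - d - w  # objects that neither define nor conflict
    need = r - d - t  # sample slots left for those objects
    if min(rest, need, t) < 0 or t > w or need > rest:
        return 0.0
    return comb(w, t) * comb(rest, need) / comb(n, r)


if __name__ == "__main__":
    # Few global conflicts (w = 20) make many local conflicts unlikely.
    for t in (0, 2, 5, 10):
        print(t, prob_exact_local_conflicts(n=1000, r=100, d=3, w=20, t=t))
```

With d = 0 the values sum to one over all t, which is a quick sanity check of the formula (Vandermonde's identity).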

Then, we assume that satisfies the bounded valency, i.e., for any subset , and for any configuration , , and prove the following theorem.

Theorem 2.4. Let be an -element random subset of . If satisfies the bounded valency and , then the probability that there exists a configuration in in conflict with at most objects in but at least objects in is .

Proof.

Since satisfies the bounded valency, . By Lemma 2.3 and the union bound, the probability is . ∎

Corollary 4.3 by Clarkson [20] can also lead to the same result, but he adopted random sampling with replacement. Since is in our situation, his random sampling yields a multi-set with high probability, and thus his result cannot be directly applied. (In his applications, either is far from or a multi-set is feasible, but neither holds in our case.) Of course, there could be a way to extend his proof to serve our purpose, but we did not find an obvious one.

3 Shallow Cutting

We first formulate the function graphs (surfaces) and introduce the vertical decomposition of surfaces. Then, we design a -shallow-cutting for the surfaces using vertical decompositions. Finally, we adopt our new random sampling technique to prove the expected size of our -shallow-cutting to be .

3.1 Surfaces and Vertical Decomposition

Let be a set of bivariate functions that are continuous, totally defined, and algebraic. Assume that the graph of each function in is a semialgebraic set, defined by a constant number of polynomial equalities and inequalities of constant maximum degree. The lower envelope of is the graph of the pointwise minimum of the functions in ; the upper envelope is defined symmetrically. We further assume that for any subset , has faces, edges, and vertices, which holds for many applications as discussed by Kaplan et al. [27].

For conceptual simplicity, we view each function in as an xy-monotone surface in R^3. We make a general position assumption on : no more than three surfaces intersect at a common point, no more than two surfaces intersect in a one-dimensional curve, no pair of surfaces are tangent to each other, and if two surfaces intersect, their intersection consists of one-dimensional curves. Moreover, we define as the maximum number of co-vertical pairs of points with , over all quadruples of surfaces of , and assume to be a constant. For a point , the level of with respect to is the number of surfaces in lying below , and the -level of is the set of points in whose level with respect to is at most .

For a subset , let be the arrangement formed by the surfaces in . For each cell in , its boundary consists of upper and lower hulls. The upper hull is a part of the lower envelope of surfaces in lying above , and the lower hull is a part of the upper envelope of surfaces in lying below . The topmost (resp. bottommost) cell in does not have an upper (resp. lower) hull. If the level of with respect to is , then the vertical line through a point in intersects the boundary of at its and lowest surfaces in .

The vertical decomposition of , proposed by Chazelle et al. [18], decomposes each cell of into pseudo-prisms or shortly prisms, a notion to be defined below; we also refer to [35, Section 8.3]. First, we project the lower and upper hulls of , namely their edges and vertices, onto the -plane, and overlap the two projections. Then, we build the so-called vertical trapezoidal decomposition [23, 32] for the overlapping by erecting a -vertical segment from each vertex, from each intersection point between edges, and from each locally -extreme point on the edges, which yields a collection of pseudo-trapezoids. Finally, we extend each pseudo-trapezoid to a trapezoidal prism , and form a prism .

Each prism has six faces: top, bottom, left, right, front, and back. Its top (resp. bottom) face is a part of a surface in . Its left (resp. right) face is a part of a plane perpendicular to the x-axis. Its front (resp. back) face is a part of a vertical wall through an intersection curve between two surfaces in . Its top and bottom faces are pseudo-trapezoids on their respective surfaces, so a prism is also the collection of points lying vertically between the two pseudo-trapezoids.

contains prisms and can be computed in time [18]. The prisms in the topmost and bottommost cells of are semi-unbounded. For our algorithmic aspect, we imagine a surface , so that each prism in the topmost cell has a top face lying on . For each prism , let denote the set of surfaces in intersecting .

It is not difficult to verify that a prism is defined by at most 10 surfaces under the general position assumption. First, its top (resp. bottom) face belongs to a surface, and we call this surface top (resp. bottom). Then, we look at the pseudo-trapezoid that is the xy-projection of the prism. A pseudo-trapezoid is defined by bisecting curves in the plane [11], each of which is the xy-projection of an intersection curve between two surfaces. Since a bisecting curve defining the pseudo-trapezoid must be associated with the top surface or the bottom surface of the prism, it is sufficient to bound the number of bisecting curves that define the pseudo-trapezoid, each of which counts for one additional surface. The upper (resp. lower) edge of the pseudo-trapezoid belongs to one bisecting curve. The left (resp. right) edge of the pseudo-trapezoid belongs to a vertical line passing through the left (resp. right) endpoint of the upper edge, the left (resp. right) endpoint of the lower edge, or an x-extreme point of a bisecting curve. Each of the first two cases results from one additional bisecting curve, namely each counts for one additional surface. Although the last case may occur more than once, as in Chazelle’s algorithm [16], we can introduce zero-width trapezoids to resolve such degenerate situations. In conclusion, for the top and bottom faces, we count 2 surfaces, for the upper and lower edges, we count 2 additional surfaces, and for the left and right edges, we count 6 additional surfaces, leading to a sum of 10.

3.2 Design of Shallow Cutting

A -shallow-cutting for is a set of disjoint prisms satisfying the following three conditions:

  1. They cover the ()-level of .

  2. Each of them is intersected by surfaces in .

  3. They are semi-unbounded, i.e., no bottom face.

To design such a -shallow-cutting, we take an -element random subset of and adopt to generate prisms satisfying the three conditions.

For condition (a), it is natural to consider the prisms in that intersect the ()-level of , but it is hard to compute those prisms exactly. Thus, we instead select a superset that consists of prisms in lying above at most surfaces in .

For condition (b), if a prism intersects more than surfaces in , we will refine it into smaller prisms, and select the ones lying above at most surfaces in . Let be , where is the set of surfaces in intersecting . If , we refine as follows:

  1. Take a random subset of of size , and construct by clipping each surface in with , building the vertical decomposition on the clipped surfaces plus the top and bottom faces of , and including the prisms lying inside .

  2. If one prism in intersects more than surfaces in , then repeat Step 1.

  3. For each prism , if lies above more than surfaces in , discard .

Lemma 2.1 guarantees the existence of that satisfies the requirement of Step 2. By Section