Practical Low-Dimensional Halfspace Range Space Sampling

04/30/2018
by   Michael Matheny, et al.

We develop, analyze, implement, and compare new algorithms for creating ε-samples of range spaces defined by halfspaces which have size sub-quadratic in 1/ε, and have runtime linear in the input size and near-quadratic in 1/ε. The key to our solution is an efficient construction of partition trees. Despite not requiring any techniques developed after the early 1990s, apparently such a result was never explicitly described. We demonstrate that our implementations, including new implementations of several variants of partition trees, do indeed run in time linear in the input, appear to run in time linear in the output size, and exhibit smaller error for the same sample size compared to the ubiquitous random sample (which requires size quadratic in 1/ε). This result has direct applications in speeding up discrepancy evaluation, approximate range counting, and spatial anomaly detection.

1 Introduction

Taming the relationship between a point set X ⊂ ℝ^d and its interaction with halfspaces ℋ_d has long been a focus of computational geometry. Understanding and controlling this interaction is at the heart of problems in range searching, linear classification, coresets, and spatial anomaly detection. This pair (X, ℋ_d) describes a range space, the combinatorial set of all subsets of X defined by X ∩ h for any halfspace h ∈ ℋ_d. In this paper we focus on two specific and closely-interrelated (as it turns out) constructions for (X, ℋ_d): ε-samples and partitions, defined next.

An ε-sample of (X, ℋ_d) is a small point set S ⊂ X that approximately preserves density with respect to halfspaces: for all h ∈ ℋ_d, and error parameter ε > 0, it bounds

    | |X ∩ h|/|X| − |S ∩ h|/|S| | ≤ ε.

It is known that ε-samples of size O(1/ε^{2d/(d+1)}) exist for halfspaces [2], and in general this size may be required [21]. For many years (cf. [22, 7]) such proofs were not constructive, as they relied on the "partial coloring lemma"; until in 2010 Bansal [4] introduced a polynomial time construction. The runtime of the low-discrepancy coloring on n points was later reduced [16] to O(n³); within the standard merge-reduce framework [9] this results in an O(n/ε^{4d/(d+1)}) runtime for sample construction, which is still not very efficient. For instance for d = 2, this requires O(n/ε^{8/3}) time. A random sample, which can be generated in O(n) time, is an ε-sample of size

    O((1/ε²)(d + log(1/δ)))

with probability at least 1 − δ [28, 15]. The above discrepancy-based algorithm can be run on the output of this random sample to get optimal size, but this only reduces the overall runtime of the ε-sample construction to O(n + 1/ε^{2 + 4d/(d+1)}).

There are other constructions for ε-samples, which either focus on small space (to work in a stream) [27, 3] or have better performance in practice without size guarantees below that of random sampling [13]. As with the optimal algorithms, these require the enumeration of all combinatorial halfspaces associated with a set of size roughly the size of the final ε-sample; since s points induce Θ(s^d) combinatorially distinct halfspaces, this requires at least Ω(1/ε^{2d²/(d+1)}) time. Indeed Suri et al. [27] conclude with: "The high computational complexity of the currently known algorithms for these subroutines may be prohibitive for data stream applications. It is a long standing open problem to find efficient exact or approximation algorithms for either of them."

A partition of X is a set of pairs {(X_1, Δ_1), …, (X_t, Δ_t)} where each Δ_i is a small-complexity region which contains X_i, and X is the disjoint union of the X_i s. It is a (t, Λ)-partition when there are t pairs, each |X_i| ≤ 2n/t, and each halfspace h ∈ ℋ_d crosses at most Λ cells. The smallest possible guarantee for Λ is O(t^{1−1/d}), and an algorithm for such a construction was provided by Matoušek [20], that takes O(n log t) time after O(n^{1+δ}) preprocessing time for any constant δ > 0. Chan [6] provided a refined algorithm which takes O(n log n) expected time, and has a few nicer structural properties. There are other algorithms which generate (t, Λ)-partitions for larger values of Λ. For instance, in ℝ² Edelsbrunner and Welzl [10] describe an algorithm with Λ = O(t^{0.695}), and a structure similar to a kd-tree leads to Λ = O(t^{log_4 3}) ≈ O(t^{0.7925}) [30].

Our results.

In this paper, we use partition construction algorithms to efficiently create ε-samples for halfspaces. Our algorithm takes O(n + (1/ε²) log(1/ε)) time and produces an ε-sample of size O((1/ε)^{2d/(d+1)} log^{d/(d+1)}(1/ε)), nearly matching the Ω((1/ε)^{2d/(d+1)}) lower bound.

We also implement several variants of these algorithms in ℝ². We know of no other implementation of ε-sample construction for halfspaces which is guaranteed to get subquadratic size in 1/ε. We know of no implementations of optimal partitions, although Har-Peled [12] has implemented a related concept called a cutting, which (as we will explain) is a key ingredient for creating partitions. We choose to build our own implementation of cuttings, and explain why we did not use Har-Peled's in Section 4.

We are able to demonstrate that our algorithms indeed scale linearly in the input size, scale linearly in the output size, and produce ε-samples with less measured error than random samples of the same size.

Our initial goal in fast ε-sample construction comes from finding approximately maximal ranges in range spaces, as part of a large-scale spatial anomaly detection framework [18, 17]. At a high level, these algorithms follow two phases: (1) create an ε-sample S, (2) use S to find an approximately maximal range. The second step takes time super-linear (roughly quadratic) in |S|, so it is only worth using a smaller ε-sample of size roughly 1/ε^{2d/(d+1)}, instead of the 1/ε² from random sampling, if it takes less time to create than the second step takes to run. We show this is the case in theory, and in practice. Similar overall runtime gains exist when using S for classification, or approximate range counting, or other tasks where the use of S is more expensive than the new construction time.

2 Overview and Proof for Fast ε-Samples

The key to our construction of an ε-sample for a range space (X, ℋ_d) is to first create a partition over a random sample of X. Given such a partition algorithm, our algorithm constructs an ε-sample as follows. Randomly sample X' ⊂ X, construct the partition {(X_1, Δ_1), …, (X_t, Δ_t)} on X', and return a single point chosen at random from each X_i, weighted by |X_i|.

Theorem 2. For a range space (X, ℋ_d) with |X| = n and constant d ≥ 2, with constant probability an ε-sample of size O((1/ε)^{2d/(d+1)} log^{d/(d+1)}(1/ε)) can be constructed in O(n + (1/ε²) log(1/ε)) time.

Proof.

Take a random uniform sample X' ⊂ X of size m = O(1/ε²); then X' is an (ε/2)-sample of (X, ℋ_d) with constant probability. Next we build a (t, O(t^{1−1/d}))-partition {(X_1, Δ_1), …, (X_t, Δ_t)} on X' in O(m log m) time [6]; this results in a set of t cells of X', each containing at most 2m/t points, such that any halfspace in ℋ_d will only cross O(t^{1−1/d}) of them. From each cell X_i we will choose a single point at random to put in our result S, and weight it proportional to the number of points in the cell.

In our construction any cell contained completely inside a halfspace, or completely outside it, does not contribute to the error of the sample. Only cells crossing the boundary of the halfspace contribute to the error. The error in each boundary cell is an independent bounded random variable with value in the range [−2m/t, 2m/t]. There are at most C·t^{1−1/d} boundary cells for some constant C, so we can apply Hoeffding's inequality, with failure probability

    δ ≤ 2 exp( −2((ε/2)m)² / (C t^{1−1/d} (4m/t)²) ) = 2 exp( −ε² t^{1+1/d} / (32C) ).

Rearranging the last inequality gives that with t = O(((1/ε²) log(1/δ))^{d/(d+1)}), for any one halfspace h ∈ ℋ_d, the error on h is more than (ε/2)m with probability at most δ.

There are O(m^d) combinatorially distinct halfspaces in (X', ℋ_d), so setting δ = c/m^d for some constant c, and invoking the additivity property of ε-approximations [7] (an (ε/2)-approximation of an (ε/2)-approximation is an ε-approximation), gives an ε-approximation of (X, ℋ_d) with constant probability. Since log(1/δ) = O(d log m) = O(log(1/ε)), the total error is ε and the size of the ε-sample is t = O((1/ε)^{2d/(d+1)} log^{d/(d+1)}(1/ε)) for constant d. Creating X' takes O(n) time, the partition tree construction takes O((1/ε²) log(1/ε)) time since m = O(1/ε²), and the re-weighting and sampling step takes O(m) time. In total therefore the entire algorithm takes O(n + (1/ε²) log(1/ε)) time. ∎

The same proof technique will work with other (t, Λ)-partitions in place of Chan's [6]. In general, a scheme that generates a (t, Λ)-partition where any halfspace crosses at most Λ = O(t^α) of the t cells results in an ε-sample of size Õ(1/ε^{2/(2−α)}). For instance in ℝ², Edelsbrunner and Welzl's result [10] (α ≈ 0.695) gives an ε-sample of size roughly Õ(1/ε^{1.53}). Alternatively, Willard's result in ℝ² [30] (α = log_4 3 ≈ 0.7925) results in an ε-sample of size roughly Õ(1/ε^{1.66}).
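To make this concrete, the following is a minimal sketch (ours, not the paper's code) of the construction behind Theorem 2 in ℝ², assuming a hypothetical helper partition(pts, t) that returns the cells of a (t, O(√t))-partition as lists of points:

    import math, random

    def eps_sample(points, eps, partition):
        # Step 1: uniform random sample X' of size m = O(1/eps^2)
        # (constants omitted for illustration).
        m = min(len(points), int(1.0 / eps ** 2))
        xprime = random.sample(points, m)
        # Step 2: a (t, O(sqrt(t)))-partition on X' with
        # t = O(eps^{-4/3} log^{2/3}(1/eps)), per Theorem 2 (d = 2).
        t = max(1, int((math.log(1.0 / eps) / eps ** 2) ** (2.0 / 3.0)))
        cells = partition(xprime, t)
        # Step 3: one random point per cell, weighted by the cell's share of X'.
        return [(random.choice(cell), len(cell) / m) for cell in cells if cell]

The returned pairs form the weighted ε-sample; each chosen point stands in for the |X_i| points of its cell.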

3 Overview of Algorithms for Constructing the Partition

Random sampling, and sampling a point from each cell of a partition, are straightforward; the challenge in our implementation of Theorem 2 is the creation of the partition. In this section we describe the key components of the two prominent optimal (Λ = O(t^{1−1/d})) algorithms: Matoušek's efficient partitioning [20] (ComputePartition-Mat) and Chan's optimal partitioning [6] (ComputePartition-Chan).

These algorithms rely on a related object called a cutting, defined over ℝ^d and a set of m hyperplanes H. For a parameter r, a (1/r)-cutting is a decomposition of ℝ^d into constant-complexity cells Δ_1, …, Δ_s, so no cell is crossed by more than m/r hyperplanes in H. Such cuttings of size s = O(r^d) exist and can be computed in O(m r^{d−1}) time [8, 19].

Cuttings are almost enough to compute partitions. A set of n points in ℝ^d induces O(n^d) combinatorially distinct halfspaces. Letting t = r^d, a (1/r)-cutting has O(t) cells and each hyperplane crosses O(r^{d−1}) of them, so the total number of hyperplane-cell crossings will be O(m r^{d−1}), and the average number of crossing hyperplanes per cell will be O(m/r). Also, ignoring dependences, the average cell contains n/t points, as desired. The main challenge is ensuring that these average properties of the cutting map to the specific per-cell properties required for the partition. In short, we can create an appropriate cutting, detect where it does not satisfy the partition properties, and then amend it so it does.
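For concreteness, consider illustrative numbers (ours, not the paper's) in ℝ²: take t = 1024 cells, so r = t^{1/2} = 32, and m test lines. Each line crosses O(r) = O(32) of the O(r²) = O(1024) cells, so there are O(32·m) crossings total and an average of O(m/32) = O(m/r) crossing lines per cell; with n = 10⁶ input points the average cell holds n/t ≈ 977 points. The amending step is what turns these average bounds into the worst-case per-cell bounds the partition requires.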

We specifically focus our implementations on the d = 2 setting, which for instance is enough for our original application of spatial anomaly detection mentioned previously [17], even in higher dimensions. Our implementations are similar to the existing implementation of cuttings by Har-Peled [12], but add several features which will aid in computing the partition. Our implementation builds a cutting by iteratively adding lines in a random order while keeping track of the number of lines crossing each cell in an arrangement. From a practical point of view, it is important to force the cells of the partition to be constant complexity. We have focused on two methods for this, a vertical trapezoidal decomposition (Trapezoid), or a hierarchy of constant size polygons (PolyTree).

Constructing a (1/r)-cutting over the entire set of O(n²) combinatorially distinct halfspaces would lead to a runtime of Ω(n²), which would be prohibitively slow. Instead of using the full set of halfspaces, a smaller set (a test set) can be constructed, such that the number of partition cells crossed by any halfspace, measured against this test set, will not be too different from that over the full set ℋ_d.

In particular, a test set is a set of halfspaces Q which applies to any partition 𝒫 and point set of size n, in the sense that if every q ∈ Q crosses at most Λ cells of 𝒫, then any h ∈ ℋ_d crosses at most O(Λ) cells. Here a halfspace h crosses the cells Δ_i which its boundary intersects, but which it does not completely contain. Test sets can be built a number of ways, including randomly sampling lines, randomly sampling points and using the lines they induce, and using the dual arrangement.

4 Implementation Particulars of Partitions

Our implementation of partition trees is in Python. It relies on an efficient way to construct and maintain an arrangement of lines and associated points. At each step of the construction we maintain a tree whose leaves correspond to cells of an arrangement. Each cell maintains a list of contained points and crossing lines.

As part of the construction, so that the result is a (t, Λ)-partition with the desired parameters, cells can be refined by applying various operations to them. For instance a cutting can be constructed locally inside of a cell, or a cell can be partitioned into a set of sub-cells.

Geometric Primitives.

All of our algorithms rely on operations over line segments. The most important operation is being able to test, within a region, whether a line lies completely above a line segment or whether it crosses the line segment. This fairly simple operation is slightly complicated by numerical issues that can occur. For instance, when constructing a test set using the BuildTestSet-Points or BuildTestSet-Dual method (see below), many lines will potentially meet at the same point. Line segments that meet in this point could be mistaken as crossing. To handle numerical issues we use Python's implementation of math.isclose for point comparisons. This method allows us to treat two floating point numbers as equal if their relative values are sufficiently close [5]. Moreover, all methods that compare line segments have closed and open versions, where closed versions allow end-point overlap and open versions do not. The method segment.above_closed(line) returns true if the line intersects the segment at one of the segment's end points, but is otherwise above the segment, while segment.above_open(line) returns false in this case. This allows us in our experiments to effectively handle degeneracies while avoiding slower exact precision libraries.

Internally our segment objects are represented by the slope, the y-intercept, and the interval on the x-axis over which the segment is defined. This representation makes many operations easy, but also results in several challenges, most notably: vertical lines are undefined, unbounded segments (e.g., full lines and rays) require extra logic to handle crossing queries, lines which are nearly vertical can become numerically unstable, and the duals of unbounded polygons require significant extra logic to handle correctly. However, we have implemented stable functions for intersect and above relations for pairs of segments in a cell.
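As an illustration of these predicates under the slope-intercept representation, a minimal sketch (our simplification; the actual implementation handles more cases, such as unbounded segments):

    import math

    def _close(x, y):
        # Floating point equality with a relative tolerance (PEP 485 [5]).
        return math.isclose(x, y, rel_tol=1e-9)

    class Line:
        def __init__(self, a, b):
            self.a, self.b = a, b            # the line y = a * x + b
        def evaluate(self, x):
            return self.a * x + self.b

    class Segment(Line):
        def __init__(self, a, b, x_left, x_right):
            super().__init__(a, b)
            self.x_left, self.x_right = x_left, x_right
        def above_closed(self, line):
            # Segment at or above `line` over its x-interval;
            # contact at an end point is allowed.
            yl, yr = self.evaluate(self.x_left), self.evaluate(self.x_right)
            ll, lr = line.evaluate(self.x_left), line.evaluate(self.x_right)
            return (yl > ll or _close(yl, ll)) and (yr > lr or _close(yr, lr))
        def above_open(self, line):
            # Strict version: end-point contact counts as not above.
            return (self.evaluate(self.x_left) > line.evaluate(self.x_left)
                    and self.evaluate(self.x_right) > line.evaluate(self.x_right))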

Using line segments and points as the primitives, we also define more complicated structures, notably polygons, dual wedges, vertical line segments, and trapezoids.

PolyTrees.

There are a number of ways to maintain the structure of an arrangement. A common method is to store each cell with a corresponding list of pointers to adjacent cells. Inserting a line involves finding the leftmost crossing cell, identifying the next adjacent cell the line crosses, splitting the crossed cell into an upper and lower cell, and then repeating this operation for each crossed cell.

This has a number of downsides: there are special cases if a line crosses a vertex of a cell, inserting points into the arrangement requires the maintenance of a secondary structure, and cells require a significant amount of adjacency information that must be maintained. Instead of maintaining this structure, we force each cell to be simple and to follow certain restrictions, as introduced by Seidel [26] and refined by Har-Peled [12].

In particular, we either maintain a decomposition into constant complexity polygons (polygons with a constant number of boundary segments) or a trapezoidal decomposition. In both cases we maintain a tree where each node consists of a line segment that separates a cell into two cells. With trapezoids, an inserted line could in some cases divide a trapezoid vertically into 2 separate trapezoids and then horizontally into up to 4 separate trapezoids. In the case of polygons, the inserted line splits the polygon into two separate polygons, which may be further split if the number of sides in either of the resulting polygons exceeds a chosen constant. We also enforce that no vertical segments are used, to avoid the limitations of our line segment representation.

Given a line ℓ and a decomposition 𝒟, the zone of ℓ is the set of regions of 𝒟 that ℓ intersects; we represent this as zone(ℓ, 𝒟). To find the zone of a line in this structure we treat the line as an infinite length segment and traverse it down the tree. At each node there are three cases: the portion of the line contained in the node lies completely above, lies completely below, or crosses the current node's line segment. In the completely above or below cases we merely traverse to the above or below child of the node. In the crossing case we split the portion of the line contained in the node into two segments, above and below, and recursively query the above and below children. Point information is easy to maintain with this method since a point always lies on one side of a line segment, so the tree structure can be used to insert or remove points in time logarithmic in the number of cells.
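A sketch of this zone traversal (with hypothetical node fields is_leaf/cell/splitter/above/below, and the segment primitives crosses, above_closed, and split_at assumed from the segment class):

    def zone(node, seg, out):
        # Collect the leaf cells of the tree intersected by segment `seg`.
        if node.is_leaf():
            out.append(node.cell)
            return
        s = node.splitter
        if seg.crosses(s):
            upper, lower = seg.split_at(s)   # cut seg where it meets the splitter
            zone(node.above, upper, out)
            zone(node.below, lower, out)
        elif seg.above_closed(s):            # seg lies entirely above the splitter
            zone(node.above, seg, out)
        else:                                # seg lies entirely below
            zone(node.below, seg, out)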

More complicated structures can also be queried on these trees, most notably wedges and polygons. Wedge queries are particularly useful in ComputePartition-Chan since a wedge is the dual of a line segment, so the number of points contained in a wedge corresponds in the dual to the number of lines crossing a line segment.

1:  𝒟 ← {ℝ²}
2:  for h ∈ H (ordered by a random weighted permutation) do
3:     Find zone(h, 𝒟).
4:     For all Δ ∈ zone(h, 𝒟) with w(H_Δ) > w(H)/r, replace Δ in 𝒟 by split(Δ, h)
5:  return 𝒟
Algorithm 1 ComputeCutting(H, r)

Cuttings.

Our cutting algorithm (Algorithm 1) follows closely Algorithm 1 of Har-Peled [12]. We implement the cutting with respect to weighted lines as this speeds up and somewhat simplifies the later partitioning algorithms. We require a weighted permutation of the lines, computed using [11]; this ensures that the probability we see a line after some point in the permutation is equivalent to the probability we would have seen at least one instance after seeing that many distinct lines in a variant where weights are multiplicities (as advocated by Chan [6]), and each copy is treated independently in a uniform random permutation. For notational convenience, for a subset H' ⊆ H, let w(H') = Σ_{h ∈ H'} w(h), where w(h) is the weight implicitly stored with each halfspace h.
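The weighted permutation of [11] is simple to generate: each line draws an independent uniform key u and the lines are sorted by u^{1/w} in decreasing order. A sketch (assuming positive weights):

    import random

    def weighted_permutation(lines, weight):
        # Efraimidis-Spirakis keys [11]: sort by u**(1/w) descending, so
        # heavier lines tend to appear earlier, as if each unit of weight
        # were an independent copy in a uniform random permutation.
        keyed = [(random.random() ** (1.0 / weight(l)), l) for l in lines]
        keyed.sort(key=lambda kl: kl[0], reverse=True)
        return [l for _, l in keyed]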

Figure 1: Cuttings with 25 lines and at most 5 lines crossing each cell. From left to right PolyTree with at most 4 sides, PolyTree with at most 8 sides, and Trapezoid.

In practice, we implement Algorithm 1 slightly differently than described. Instead of choosing the h to process in the random weighted permutation, our approach is centered around the violated cells. We choose a cell Δ which is too heavy (i.e., w(H_Δ) > w(H)/r), then choose some halfspace h ∈ H_Δ, and only replace Δ (not the entire zone(h, 𝒟)) with the result of split(Δ, h), which divides Δ into two parts separated by h. This change has two advantages. First, we do not need to find zone(h, 𝒟), which involves traversing the PolyTree, so our approach is slightly faster. Second, we can choose the h ∈ H_Δ to use in the split wisely; e.g., as the one that maximizes the smaller of the two resulting cells. We find the second heuristic produces slightly smaller cuttings in practice, but is significantly slower, and is not used in our experiments.
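A sketch of this violated-cell loop (with hypothetical cell methods crossing_weight, crossing_lines, and split; our simplification of the actual code):

    import random

    def compute_cutting_cells(lines, weight, r, root_cell):
        # Split any cell whose crossing weight exceeds w(H)/r until none remain.
        total = sum(weight(l) for l in lines)
        done, heavy = [], [root_cell]     # root_cell is crossed by all of `lines`
        while heavy:
            cell = heavy.pop()
            if cell.crossing_weight() <= total / r:
                done.append(cell)         # satisfies the cutting bound
                continue
            line = random.choice(cell.crossing_lines())  # or pick one that
            heavy.extend(cell.split(line))               # balances the split
        return done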

How split(Δ, h) is implemented is the difference between the Trapezoid-based cutting and the polygon-based cutting we refer to as PolyTree.

For the Trapezoid-based method each cell is a trapezoid to begin with. The split operation first inserts up to two new vertical cuts, one for each intersection of the line with the top or bottom of the cell, and then horizontally cuts the resulting cells using the inserted line. For PolyTree, the first important detail is that we store the halfspaces H_Δ as line segments restricted to where they intersect Δ. On a split, we need to maintain which halfspaces are in each child; if a segment intersects both children, then we split it into two line segments at the point where it intersects h, and store the corresponding segment in each child. If it is in only one child, because we store halfspaces as segments it is easy to check which child it goes into.

Both of these algorithms are then fairly straightforward to implement once given structures for efficiently maintaining arrangements of line segments.

We find that PolyTree is faster and produces a smaller cutting than the Trapezoid-based method; see Figure 2 and Figure 3, which show the runtime and cutting size as a function of the input size and the choice of r. For this reason we will primarily focus on PolyTree hereafter.

Figure 2: Size of cutting (divided by r²) and time (in seconds) vs. the number of input lines.
Figure 3: Size of cutting (divided by r²) and time (in seconds) as r increases, for a fixed number of input lines.
1:  Sample Y ⊂ X; Y* is the dual set of lines of Y.
2:  𝒟 ← ComputeCutting(Y*, r)
3:  return the dual V* of V, where V is the set of vertices of the cells of 𝒟.
Algorithm 2 BuildTestSet-Dual(X, r)

Test Set Generation.

There are a number of ways to generate test sets (BuildTestSet-Dual, BuildTestSet-Points, BuildTestSet-Lines), but these do not appear to have a significant effect on the runtime of the final algorithms; again see Figure 2 and Figure 3 for comparisons of the PolyTree and Trapezoid methods. The simplest method, BuildTestSet-Lines, simply samples halfplanes from those defined by lines passing through pairs of points in X. The next simplest, BuildTestSet-Points, samples points Y ⊂ X, and then the test set is all halfplanes passing through pairs (d-tuples in general) chosen from Y. Finally, the most complicated approach is BuildTestSet-Dual (see Algorithm 2); it produces the smallest size test set [20], and thus is the one we advocate. It samples points Y ⊂ X; it considers the dual set of halfplanes Y* of the primal points Y; it creates a (1/r)-cutting of Y* (in the dual); and then it returns the primal halfspaces defined by the vertices of the cutting in the dual. Each halfspace in the test set is implicitly endowed with a weight w(h), which by default is 1 for all h.
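For example, BuildTestSet-Points might look as follows in the slope-intercept representation (a sketch; the sampled lines are returned unweighted, i.e., with weight 1):

    import itertools, random

    def build_test_set_points(points, k):
        # Sample k points and return the O(k^2) lines through each pair.
        sample = random.sample(points, k)
        lines = []
        for (x1, y1), (x2, y2) in itertools.combinations(sample, 2):
            if x1 == x2:
                continue                # our representation excludes vertical lines
            a = (y2 - y1) / (x2 - x1)   # slope
            lines.append((a, y1 - a * x1))
        return lines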

1:  if (|P| ≤ n/t) then return {(P, Δ)} where Δ contains P.
2:  Q ← BuildTestSet(P)
3:  P' ← P; 𝒫 ← ∅
4:  while (|P'| > |P|/2) do
5:     𝒟 ← ComputeCutting(Q, r)
6:     Find Δ ∈ 𝒟 so |Δ ∩ P'| ≥ |P'|/(c r²); shrink Δ so |Δ ∩ P'| = ⌈|P'|/(c r²)⌉ exactly.
7:     Add (Δ ∩ P', Δ) to 𝒫; remove Δ ∩ P' from P'
8:     Double the weight of each q ∈ Q which crosses Δ
9:  Recursively partition each (X_i, Δ_i) ∈ 𝒫 until every cell meets the final size
10: return 𝒫 ∪ ComputePartition-Mat(P', t)
Algorithm 3 ComputePartition-Mat(P, t)

Matoušek Partitioning.

We have implemented Matoušek's efficient partition trees [20]. At a high level this algorithm computes a cutting of a test set and then finds a single good cell of the cutting that contains at least a constant fraction of the remaining points (roughly a 1/(c r²) fraction for a constant c). It adds this cell to the partition, doubles the weight of all halfspaces in the test set crossing that cell, and then computes a new cutting and good cell. It repeats until the number of remaining points has been cut in half, and then it recurses on the remainder of the points at half the precision. Doing this directly for all t cells would be too expensive, so after this first phase we recursively partition each cell until the result is a (t, Λ)-partition as desired. The branching factor of the partition tree is not fixed on each level, but will be roughly b on average. Algorithm 3 presents this approach, and is initially called on the full point set X.

Note that Line 9 is the refinement step where each cell is further partitioned. Since the first level is the most important for good ε-samples, faster algorithms could be used at later recursive calls of this step. In contrast, the recursive call at Line 10 handles points not handled in the first pass, where each pass handles roughly half of the data.

Chan Partitioning.

Chan’s optimal partition trees [6] are faster in theory than Matoušek’s algorithm, but are more complicated to implement. The algorithm works by processing each node at a certain level in the tree in a random order. For each node it creates a cutting of approximately size for an appropriately large branching parameter (our implementation uses as a default). It then further splits the cells of the cutting to contain fraction of points at that node. It multiplicatively updates weights for halfplanes that cross each cell. This multiplicative update influences subsequent cuttings by biasing away from creating cells that are crossed by already heavily weighted lines (lines that cross many cells). After splitting all of the cells in this level of the tree the algorithm recurses on the newly created level. The ultimate partition are the leaf nodes of the tree. Algorithm 4 presents this approach, calling initially.

1:  Trim each Δ_v to its contained points P_v; if |P_v| ≤ n/t for all cells, return 𝒫
2:  Q' ← random sample of the test set Q
3:  for each (P_v, Δ_v) ∈ 𝒫 (in random order) do
4:     Sample Q_v ⊆ Q', proportional to the weights w(q)
5:     𝒟_v ← ComputeCutting(Q_v, r); with r chosen so |𝒟_v| ≈ b
6:     For all Δ ∈ 𝒟_v, further split Δ until |Δ ∩ P_v| ≤ |P_v|/b
7:     Replace (P_v, Δ_v) in 𝒫 with {(Δ ∩ P_v, Δ) : Δ ∈ 𝒟_v}
8:     Update all weights w(q) multiplicatively by the number of cells of 𝒟_v crossed by q.
9:  return ComputePartition-Chan(𝒫, Q)
Algorithm 4 ComputePartition-Chan(𝒫, Q)

Implementing the algorithm as described is too slow asymptotically, so Chan presents a faster variant, which requires two additional sampling parameters. Roughly (see [6] for details), the first determines the probability that a line ends up in the reduced test set Q' (Line 2). The second (again, see [6] for details) affects the number of cells that are used to update the weights on Line 8 (we sample cells with this probability, as opposed to dividing the weight update by this number). Also, Line 4, where Q_v is sampled from Q', can be made more efficient by only minimally updating Q_v each pass through the loop, since it generally contains the large-weight lines and that set is fairly stable.

Near the bottom of the tree, Line 8 can be expensive. We make this step efficient with a crucial observation: the test set was generated by computing a cutting over the dual space, so these halfspaces are the duals of the vertices of a PolyTree structure, and we can search over that PolyTree to determine the number of crossing lines. A cell of the partitioning is a polygon consisting of a constant number of line segments. A line crossing the polygon will cross at least one of the line segments, and in the dual this will correspond to a point contained inside a double wedge. For each line segment of the polygon we take its dual (a double wedge) and query the PolyTree that was used to construct the test set to determine the number of vertices contained inside of it. Since we only return the overlapped polygons, and each polygon consists of at most a constant number of edges, the number of queried cells can only be a constant factor larger than the number of lines crossed by the line segment.
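The underlying primal-dual predicate is simple; a sketch (using the standard duality where the line y = a·x + b maps to the point (a, −b); the conventions in the actual code may differ):

    def line_crosses_segment(line, p1, p2):
        # A line y = a*x + b crosses segment (p1, p2) exactly when the two
        # end points lie on opposite (closed) sides of the line.
        a, b = line
        s1 = p1[1] - (a * p1[0] + b)   # signed vertical offset of p1
        s2 = p2[1] - (a * p2[0] + b)
        return s1 * s2 <= 0

    def dual_point_in_wedge(dual_pt, p1, p2):
        # Dual phrasing: the dual point (a, -b) of a line lies in the double
        # wedge of segment (p1, p2) iff the primal line crosses the segment,
        # so counting dual points in wedges counts crossing lines.
        a, neg_b = dual_pt
        return line_crosses_segment((a, -neg_b), p1, p2)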

However, code profiling shows that the two steps involving sampling, Q' and Q_v, and the sampled weight update are the most expensive parts of the algorithm. As a result we also consider a variant ComputePartition-Chan-Simple which avoids these sampling steps that were supposed to speed things up. In the context of Algorithm 4 this basically sets Q' = Q_v = Q, so Line 4 is not required.

The given algorithm is only guaranteed to compute the partition in time with extra log factors, incurred due to the height of the partition tree. Chan removes these log factors with a method he calls bootstrapping. We do not do this since our branching factor b is high, so the depth of the tree is low, and this method is not worth the overhead.

In our implementation, we only compute the test set once at the beginning. On each recursive call (Line 9) we can reuse it, but simply reset all of the weights to be uniform.

Why not use Har-Peled’s implementation and CGAL?

It may seem at first that we could simply use Har-Peled's implementation of (1/r)-cuttings [12]. However, our initial goal was to use this as part of code for spatial anomaly detection [18, 17], and there were several issues that made this less feasible. (1) We wanted to use floating point arithmetic rather than infinite precision arithmetic. Har-Peled reports that switching to exact precision representations results in a 30-factor slow down, but it was necessary for degeneracy issues. We managed these precision issues while using floating point arithmetic with careful use of open and closed operators for line above/below and intersection. (2) We can measure wedges and line segments on the PolyTree structure, which is very useful in ComputePartition-Chan. (3) Har-Peled's code creates a cutting inside of a box. This makes computing dual cuttings difficult, as we first have to normalize the lines to lie in such a region, but computing the correct normalization quickly would require us to re-implement much of the PolyTree algorithm. Ultimately we opted to build our (1/r)-cutting code from scratch rather than modify the previous code.

Har-Peled [12] also reported the cutting constant (the number of cells divided by r²) for various of his algorithms, for both polygons and trapezoids. These roughly match the numbers we observe in Figure 2 and Figure 3.

Another option for computing cuttings and managing the partition trees is the current 2d-arrangement implementation in CGAL [29]. This would most likely have made portions of this project much easier to implement and removed various hurdles. However, the possibility of a several-factor slow-down from exact precision arithmetic would have potentially resulted in no ultimate gains in the spatial anomaly application demonstrated below.

5 Ham-Sandwich Tree

Figure 4: Ham Tree Sample on the left (single vertical line and then ham-sandwich cut) and Double Ham Tree on the right (ham-sandwich cut at every level).

We implement two alternative partitioning methods, by Willard [30] (Ham Tree Sample) and by Edelsbrunner and Welzl [10] (Double Ham Tree). The first method, Ham Tree Sample, provides a partitioning with Λ = O(t^{log_4 3}), which gives a sample of size roughly O(1/ε^{1.66}), and constructs a tree with a branching factor of 4. The second method, Double Ham Tree, provides a partitioning with Λ = O(t^{0.695}), leading to a sample size of roughly O(1/ε^{1.53}). Both methods are much simpler than the earlier partitioning methods and potentially faster.

With Ham Tree Sample, at each level we split the point set in half with a single vertical line, and on the two resulting sets we find a single (roughly horizontal) line that divides both the left and right point sets in half. Such a separator is guaranteed by the ham-sandwich theorem, and can be computed in linear time [25], but is complicated to implement. We instead approximate the ham-sandwich cut by computing a number of test lines and choosing the best separator from these. This is simple to implement, gives good cuts in practice, and the resulting imbalance can be bounded [24].
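A sketch of this approximate ham-sandwich step (the candidate count k = 25 is illustrative, not our tuned value):

    import itertools, random

    def approx_ham_sandwich(left_pts, right_pts, k=25):
        # Try the O(k^2) lines through pairs of k sampled points; keep the
        # line whose worst imbalance over the two point sets is smallest.
        k = min(k, len(left_pts) + len(right_pts))
        candidates = random.sample(left_pts + right_pts, k)

        def imbalance(a, b):
            worst = 0.0
            for pts in (left_pts, right_pts):
                above = sum(1 for (x, y) in pts if y > a * x + b)
                worst = max(worst, abs(above / len(pts) - 0.5))
            return worst

        best, best_score = None, float("inf")
        for (x1, y1), (x2, y2) in itertools.combinations(candidates, 2):
            if x1 == x2:
                continue
            a = (y2 - y1) / (x2 - x1)
            b = y1 - a * x1
            score = imbalance(a, b)
            if score < best_score:
                best, best_score = (a, b), score
        return best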

Double Ham Tree differs in the way the levels are handled. We substitute the vertical cut from above with a ham-sandwich cut. Starting with a pair of roughly equal sized point sets, each is cut in half using a ham-sandwich cut. We then recursively cut the resulting pairs again with more ham-sandwich cuts.

Figure 5: Input size vs. Time and Error for various test-line sample sizes with Ham Tree Sample and Double Ham Tree.

For both methods we compute the set of potential, approximate ham-sandwich cuts in the same way. We randomly select a small set of points, and consider the separating lines passing through each pair. As the number of sampled points increases we approximate the ham-sandwich cut with higher accuracy, but the cut also takes more time to compute, so we test various sample sizes to determine a good compromise between accuracy and performance. We construct samples of the Chicago crime data [1] with roughly 6.5 million data points. These experiments follow closely the setup from Section 6; see there for more details. We find that in practice there is little difference between the accuracy and time of Double Ham Tree and Ham Tree Sample, so in general we recommend the simpler Ham Tree Sample. We also find that smaller sample sizes seem to provide accuracies nearly as good as large ones, but are significantly faster; however, the very smallest sizes offer minimal speedup while having a minor, but observable, increase in Error. Ultimately, among these variants, we recommend Ham Tree Sample with a moderate number of test lines, and this is the variant used in further experiments.

Figure 6: Output size vs. Time and Error for various test-line sample sizes with Ham Tree Sample and Double Ham Tree.

6 Experiments on ε-Samples and Applications

In this section we explore the efficacy of our ε-sample algorithms based on partitions. As input X we use the Chicago crime data [1] with roughly 6.5 million data points.

A key step of the analysis is measuring the accuracy of an ε-sample. That is, for a sampled S we measure err(S) = max_{h ∈ ℋ_d} | |X ∩ h|/|X| − |S ∩ h|/|S| |, which unfortunately requires enumerating all O(n²) combinatorially distinct halfplanes, infeasible for large n. Instead we use techniques [18, 13, 17] which provide a guaranteed approximation of this function, designed with spatial anomaly detection in mind. We have set the parameters large enough that the noise in computing Error is insignificant compared to the quantities we are evaluating. We evaluated the accuracy and efficiency of computing ε-samples with the following methods (a brute-force version of the error measure is sketched after this list).

  • There are 3 algorithms based on sampling one element per cell from Matoušek's partition algorithm with polygonal cells, using test sets created by lines (Mat Poly Lines), points (Mat Poly Points), or the dual approach (Mat Poly Dual).

  • We consider 2 algorithms based on Chan's partition algorithm: Chan and Chan Simple. Each uses polygonal cells and the dual approach for the test set, since this specific type of test set allows for an optimization in the reweighting step. The Chan variant includes subsampling among cells for the purpose of reweighting, while Chan Simple simply uses all of the cells and does not require the sampling step, which in practice was inefficient.

  • Then Ham Tree Sample draws samples from the cells of Willard's partitioning; these are also cells of a partition, but with worse theoretical size-accuracy bounds.

  • Finally we consider two baselines: random sampling (Random Sample), and another approach, Biased-L2 [13], which is a greedy, but slow, algorithm that has similar worst case guarantees to random sampling, but achieves better error in practice.
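For intuition, a brute-force version of err(S), usable only to validate the fast estimators on tiny inputs (our sketch; it enumerates halfplanes bounded by lines through pairs of points of X, so it takes O(n³) time):

    import itertools

    def halfplane_error(X, S, weights=None):
        # err(S) = max over halfplanes h of | |X ∩ h|/|X| - w(S ∩ h) |.
        if weights is None:
            weights = [1.0 / len(S)] * len(S)
        err = 0.0
        for (x1, y1), (x2, y2) in itertools.combinations(X, 2):
            if x1 == x2:
                continue
            a = (y2 - y1) / (x2 - x1)
            b = y1 - a * x1
            fx = sum(1 for (x, y) in X if y >= a * x + b) / len(X)
            fs = sum(w for (x, y), w in zip(S, weights) if y >= a * x + b)
            err = max(err, abs(fx - fs))
        return err

Complementary (lower) halfplanes give the same absolute difference, so testing upper halfplanes suffices.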

Figure 7: The Branching Factor vs. Time and Error using the default parameters.

In testing these algorithms we can control three parameters: the Branching Factor b, the Input Size (by default a subsample of the crime data set), and the Output Size. We do not create a random sample before creating the partition as analyzed in Theorem 2; we just create the partition on the input points, then sample a point from each cell for the ε-sample. The branching factor only affects ComputePartition-Mat and ComputePartition-Chan and is held constant during the execution of the algorithm.

Figure 8: Input size vs. Time and Error using the default parameters.
Figure 9: Output size vs. Time and Error using the default parameters.

Sample Evaluation Results.

We do not plot Biased-L2 since it was quite slow as a function of the Output Size. Even at small output sizes it required far more time than any other algorithm, and it became nearly intractable at larger output sizes. We do note however that its measured Error at small output sizes is competitive with the best of our partitioning based methods.

Figure 7 shows how the Branching Factor affects the time and error. Matoušek-based algorithms seem to gradually decrease in Error as the branching factor grows, but the trend is very small. For Chan Simple, the Error encounters a phase shift at a moderate branching factor, where the error suddenly becomes significantly worse for larger values, probably as an effect of the data set size. The timing is fairly unaffected by the branching factor for Chan-based algorithms, but increases noticeably and linearly for the Matoušek-based algorithms. We conclude that a branching factor below this phase shift is a good choice for Chan-based algorithms, and a small branching factor is a good choice for Matoušek-based algorithms.

Figure 8 shows the Input Size relationship to time and Error. As prescribed by the theory, Input Size has no noticeable effect on Error. Moreover, also as expected, the runtimes of all algorithms scale linearly with Input Size.

Figure 9 plots the Output Size against the time and Error. As Output Size increases, as expected, the error for all methods decreases; note that Error is plotted on a log-scale. The Ham Tree Sample and Mat Poly Lines methods achieve the smallest Error, with Ham Tree Sample doing the best, and all proposed methods appear to improve upon the default Random Sample in terms of Error. In particular, at the larger output sizes both Mat Poly Points and Ham Tree Sample achieve noticeably smaller Error than Random Sample. For the Matoušek-based partitioning algorithms, the choice of test set does not have much effect on Error, and they perform slightly better than those based on Chan's partitioning.

Moreover, as Output Size increases, the observed run time of all algorithms increases at most linearly. In some cases (e.g., Ham Tree Sample and Mat Poly Lines) the increase is sublinear, as these are hierarchical methods and the largest cost is incurred at the top of the hierarchy. Here, as in other plots, we observe that Random Sample is absurdly faster than any other approach. However, even at the largest Output Size, our methods Ham Tree Sample, Chan Simple, and Mat Poly Points take only a matter of seconds.

Spatial Anomaly Detection Evaluation.

As a concrete demonstration of the usefulness of efficient ε-samples in practice, we apply our new algorithms to a framework for approximately detecting spatial anomalies, i.e., maximizing the spatial scan statistic [14]. Specifically, each point is endowed with two measures (a baseline quantity, like population, and a measured quantity, like disease instances); let m(h) and b(h) be the fraction of all measured and baseline counts within range h, respectively. The main computational problem of exact scan statistics is to find h* = argmax_{h ∈ ℋ_d} Φ(m(h), b(h)), where for simplicity we use Φ(m(h), b(h)) = |m(h) − b(h)|.
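For a fixed range h the statistic is trivial to evaluate; a sketch (with in_range any membership predicate for h):

    def discrepancy(measured, baseline, in_range):
        # Phi = |m(h) - b(h)|, where m(h) and b(h) are the fractions of
        # measured and baseline points falling inside the range h.
        m_h = sum(1 for p in measured if in_range(p)) / len(measured)
        b_h = sum(1 for p in baseline if in_range(p)) / len(baseline)
        return abs(m_h - b_h)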

Figure 10: Smooth Discrepancy Error vs. time.

Approximate scan statistics [18, 17] depend on creating two samples: an ε-net N, which approximates the density of the regions, and an ε-sample S, which approximates the density of the points. Together these allow the algorithm to find a range ĥ where Φ(m(ĥ), b(ĥ)) ≥ Φ(m(h*), b(h*)) − ε, and this is still statistically powerful [18]. In particular, we consider an algorithm whose runtime depends on the ε-sample size and its construction time; we fix the ε-net size and vary only the ε-sample. We find approximate anomalies on the crime data set with a planted region chosen so that Φ will be anomalously large. Namely, we plant a region containing a fraction of the points, where inside that region points are placed in the measured set with higher probability than in the baseline set, and outside with correspondingly lower probabilities. In Figure 10 we plot the difference between the planted value of Φ and the recovered value as a function of the overall runtime of the algorithms. Note that ĥ is found with respect to the samples, so it is possible to find a range scoring higher than the planted one, but this difference serves as a useful proxy for error. We find that Ham Tree Sample generally outperforms Random Sample; for instance, to reach the same error, Ham Tree Sample requires a fraction of the time Random Sample does. Mat Poly Points also usually performs better than Ham Tree Sample, while Chan and Chan Simple perform comparably to random sampling, albeit with high variance, even though their sampling procedure is hundreds of times slower.

Conclusion.

Overall we recommend Ham Tree Sample for computing ε-samples if moderate computation beyond random sampling can be tolerated. This method significantly reduces the size and error versus random sampling, and is not difficult to implement. If post-processing is not extensive, Random Sample is still a simple and reasonable choice in many settings.

One thing to consider when implementing these algorithms is whether the extra complexity of a method such as Chan is necessary. Chan creates a tree structure that is very similar to Ham Tree Sample or Double Ham Tree, but differs in that it switches between partitioning based on points (as in Ham Tree Sample) and partitioning based on lines (cuttings). Maintaining the global information for the cuttings leads to the much higher overhead of this method, but is only really necessary to avoid the possible construction of bad partitions. In most situations Ham Tree Sample will perform just as well or better, since the data set will not be adversarial. For instance, if the point set is uniformly distributed, then even a kd-tree-constructed partitioning will give an optimal crossing bound [23]. In applications where consistently small sample sizes are extremely important and the data can be manipulated by a third party, such guarantees become necessary.

References

  • [1] Crimes in Chicago. https://www.kaggle.com/currie32/crimes-in-chicago, 2017.
  • [2] J. Ralph Alexander. Geometric methods in the theory of uniform distribution. Combinatorica, 10:115–136, 1990.
  • [3] Amitabha Bagchi, Amitabh Chaudhary, David Eppstein, and Michael T. Goodrich. Deterministic sampling and range counting in geometric data streams. ACM Transactions on Algorithms, 3(A16), 2007.
  • [4] Nikhil Bansal. Constructive algorithms for discrepancy minimization. In Proceedings 51st Annual IEEE Symposium on Foundations of Computer Science, pages 407–414, 2010.
  • [5] Christopher Barker. Pep 485 – a function for testing approximate equality. https://www.python.org/dev/peps/pep-0485/, Jan 2015.
  • [6] Timothy M. Chan. Optimal partition trees. In Proceedings 26th Annual ACM Symposium on Computational Geometry, pages 1–10, 2010.
  • [7] Bernard Chazelle. The Discrepancy Method. Cambridge, 2000.
  • [8] Bernard Chazelle and Joel Friedman. A deterministic view of random sampling and its use in geometry. Combinatorica, 10:229–249, 1990.
  • [9] Bernard Chazelle and Jiri Matousek. On linear-time deterministic algorithms for optimization problems in fixed dimensions. Journal of Algorithms, 21:579–597, 1996.
  • [10] Herbert Edelsbrunner and Emo Welzl. Halfplanar range search in linear space and O(n^0.695) query time. Information Processing Letters, 23(5):289–293, 1986. URL: http://www.sciencedirect.com/science/article/pii/0020019086900888, doi:10.1016/0020-0190(86)90088-8.
  • [11] Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181 – 185, 2006.
  • [12] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput., 29(6):2016–2039, 2000.
  • [13] Huseyin Akcan, Herve Bronnimann, and Robert Marini. Practical and efficient geometric ε-approximations. In Proceedings of the 18th Canadian Conference on Computational Geometry, pages 120–125, 2006.
  • [14] Martin Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26:1481–1496, 1997.
  • [15] Yi Li, Philip M. Long, and Aravind Srinivasan. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences, 62:516–527, 2001.
  • [16] Shachar Lovett and Raghu Meka. Constructive discrepancy minimization by walking on the edges. SIAM Journal on Computing, 44:1573–1582, 2015.
  • [17] Michael Matheny and Jeff M. Phillips. Computing approximate statistical discrepancy. CoRR, abs/1804.11287, 2018. arXiv:1804.11287.
  • [18] Michael Matheny, Raghvendra Singh, Liang Zhang, Kaiqiang Wang, and Jeff M. Phillips. Scalable spatial scan statistics through sampling. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2016.
  • [19] Jiri Matoušek. Approximations and optimal geometric divide-and-conquer. In Proceedings 23rd Symposium on Theory of Computing, pages 505–511, 1991.
  • [20] Jiri Matoušek. Efficient partition trees. Discrete & Computational Geometry, 8:315–334, 1992.
  • [21] Jiri Matoušek. Tight upper bounds for the discrepancy of halfspaces. Discrete and Computational Geometry, 13:593–601, 1995.
  • [22] Jiri Matoušek. Geometric Discrepancy. Springer, 2009.
  • [23] Jiří Matoušek. Geometric range searching. ACM Computing Surveys, 26(4):422–461, December 1994. URL: http://doi.acm.org/10.1145/197405.197408, doi:10.1145/197405.197408.
  • [24] Jiří Matoušek, Chi-Yuan Lo, and William Steiger. Ham-sandwich cuts in R^d. In Proceedings of the Twenty-fourth Annual ACM Symposium on Theory of Computing, STOC '92, pages 539–545, New York, NY, USA, 1992. ACM.
  • [25] Nimrod Megiddo. Partitioning with two lines in the plane. Journal of Algorithms, 6(3):430–433, 1985.
  • [26] Raimund Seidel. A simple and fast incremental randomized algorithm for computing trapezoidal decompositions and for triangulating polygons. Computational Geometry, 1:51–64, 1991.
  • [27] Subhash Suri, Csaba D. Tóth, and Yunhong Zhou. Range counting over multidimensional data streams. In Proceedings 20th Symposium on Computational Geometry, pages 160–169, 2004.
  • [28] Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
  • [29] Ron Wein, Eric Berberich, Efi Fogel, Dan Halperin, Michael Hemmer, Oren Salzman, and Baruch Zukerman. 2D arrangements. In CGAL User and Reference Manual. CGAL Editorial Board, 4.12 edition, 2018.
  • [30] D. E. Willard. Polygon retrieval. SIAM Journal on Computing, 11(1):149–165, 1982.