Computing Approximate Statistical Discrepancy

04/30/2018 ∙ by Michael Matheny, et al. ∙ The University of Utah

Consider a geometric range space (X, 𝒜) where each data point x ∈ X has two or more values (say r(x) and b(x)). Also consider a function Φ(A) defined on any range A ∈ 𝒜 in terms of the sums of values in that range, e.g., r_A = ∑_{x ∈ A} r(x) and b_A = ∑_{x ∈ A} b(x). The Φ-maximum range is A^* = argmax_{A ∈ 𝒜} Φ(A). Our goal is to find some Â such that |Φ(Â) − Φ(A^*)| ≤ ε. We develop algorithms for this problem for range spaces with bounded VC-dimension, as well as significant improvements for those defined by balls, halfspaces, and axis-aligned rectangles. This problem has applications in many areas, including discrepancy evaluation, classification, and spatial scan statistics.


1 Introduction

Let X be a set of n points in ℝ^d for constant d. Let X be the union (possibly not disjoint) of two sets: R, the red set, and B, the blue set. Also consider an associated range space (X, 𝒜); we are particularly interested in range spaces defined by geometric shapes such as axis-aligned rectangles in ℝ^d, disks in ℝ^2, and d-dimensional halfspaces.

Let r_A and b_A be the fractions of red and blue points, respectively, in a range A ∈ 𝒜. We study the discrepancy function Φ(A) = |r_A − b_A|, which for brevity we typically write as just Φ. A typical goal is to compute the range A^* and value Φ(A^*) that maximize this function. Our goal is to find a range Â ∈ 𝒜 that satisfies |Φ(Â) − Φ(A^*)| ≤ ε.

The exact version of this problem arises in many scenarios, formally as the classic discrepancy maximization problem [3, 7]. The rectangle version is a core subroutine in algorithms ranging from computer graphics [8] to association rules in data mining [10]. For instance, in the world of discrepancy theory [20, 6], this is the task of evaluating how large the discrepancy of a given coloring is. For the halfspace setting, this maps to the minimum disagreement problem in machine learning (i.e., building a linear classifier) [17]. When Φ is replaced with a statistically motivated form [13, 14], then this task (typically focusing on disks or rectangles) is the core subroutine in the GIScience goal of computing the spatial scan statistic [12, 22, 2, 1] to identify spatial anomalies. Indeed, this statistical problem can be reduced to the approximate variant with the simple discrepancy maximization form [2].

The approximate versions of these problems are often just as useful. Low-discrepancy colorings [20, 6] are often used to create the associated ε-approximations of range spaces, so an approximate evaluation is typically as good. It is common in machine learning to allow some classification error. In spatial scan statistics, the approximate versions are as statistically powerful as the exact versions and significantly more scalable [19].

While the exact versions take super-linear polynomial time in n (e.g., the rectangle version with linear functions requires roughly quadratic time in n, conditional on a result of Backurs et al. [3]), we show approximation algorithms whose runtime is linear in n plus a term depending only on 1/ε. This improvement is imperative when considering massive spatial data, such as geotagged social media, road networks, wildlife sightings, or population/census data; in each case the data size can reach into the hundreds of millions.

While most prior work has focused on improving the polynomial runtimes of the exact algorithms for various shapes [15, 25], or on using heuristics to ignore regions [28, 22], little work exists on approximate versions. Exceptions include [1], which introduced generic sampling bounds; [19], which showed that a two-stage random sampling can provide some error guarantees; and [27], which showed approximation guarantees under the Bernoulli model. In this paper, we apply a variety of techniques from combinatorial geometry to produce significantly faster algorithms; see Table 1.

Our results.

Our work involves constructing a two-part coreset of the initial range space (X, 𝒜): it approximates both the ground set X and the set of ranges 𝒜. This needs to be done in a way so that ranges can still be efficiently enumerated and the r_A and b_A values tabulated. We develop fast coreset constructions, and then extend and adapt exact scanning algorithms to the sparsified range space.

We develop notation and review known solutions in Section 2; also see Table 1. Then we describe a general sampling result in Section 3 for ranges with bounded VC-dimension. In particular, many of these results can be seen as formalizations and refinements (in theory and practice) of the two-stage random sampling ideas introduced in [19].

In Section 4 we describe improvements for halfspaces and disks. We first improve upon the sampling analysis used to approximate the set of ranges. By carefully annotating and traversing the dual arrangement built from the approximate range space, we improve further upon the general construction.

Then in Section 5 we describe our improved results for rectangles. We significantly extend the exact algorithm of Barbay et al. [4] to obtain a faster approximation algorithm; this is improved further with a more careful analysis in Appendix A. The result nearly matches a new conditional lower bound, assuming current algorithms for APSP are optimal [3].

In Section 6 we show how to approximate a statistical discrepancy function (sdf, defined there), as well as any general function Φ. These require altered scanning approaches, and the sdf approximation requires a reduction to a number of calls to the generic (“linear”) case; we reduce the number of such calls needed relative to [2].

Finally, in Section 7 we demonstrate, for rectangles, strong empirical improvements over the state of the art [19].

Known Exact Known Approx [19] New Runtime Bounds
General Range Space
Halfspaces  [8]
Disks  [8]
Rectangles  [4]  [2, 1]
Rectangles (sdf)
Rectangles (General)
Table 1: Algorithm times for (ε-approximately) maximizing Φ over different range spaces. Here the dimension d, the VC-dimension ν, and the probability of failure δ are all constants. For rectangles we also show a conditional lower bound, assuming hardness of APSP.

2 Background on Geometric Range Spaces

Figure 1: The first two panels show that the disk range space has a conforming map defined by the smallest enclosing disk. The last panel shows a range space corresponding to triangles, and that a mapping defined by the minimum-area enclosing triangle is not conforming; it does not recover the original range.

To review, a range space (X, 𝒜) is composed of a ground set X (for instance a set of points in ℝ^d) and a family 𝒜 of subsets of that set. In this paper we are interested in geometrically defined range spaces, where the ranges are induced by simple geometric shapes. We formalize the requirements of this geometry via a conforming geometric mapping; see Figure 1. Specifically, it maps a subset of X to a subset of ℝ^d. Typically the result is a Lebesgue-measurable subset of ℝ^d; for instance, for the disk range space, the mapping could take a subset to its smallest enclosing disk.

We say this mapping is conforming to (X, 𝒜) if it has the following properties:

  • for any subset that is itself a range A ∈ 𝒜, the points of X inside the mapped region are exactly A [the mapping recovers the same subset]

  • for any subset Y ⊆ X, the points of X inside the mapped region of Y form a range of 𝒜 [the mapping is always in 𝒜]

2.1 Basic Combinatorial Properties of Geometric Range Spaces

We highlight two general combinatorial properties of geometric range spaces. These are critical for sparsifying the data and the ranges, and for enumerating the ranges.

Sparsification.

An ε-sample S of a range space (X, 𝒜) preserves the density of every range: for all A ∈ 𝒜, the fraction of S inside A is within ε of the fraction of X inside A. An ε-net N of a range space hits large ranges: for all ranges A such that |A ∩ X| ≥ εn, we guarantee that A ∩ N ≠ ∅. Consider a range space with VC-dimension ν. Then a random sample of size O((1/ε²)(ν + log(1/δ))) is an ε-sample with probability at least 1 − δ [26, 16]. Also, a random sample of size O((ν/ε) log(1/(εδ))) is an ε-net with probability at least 1 − δ. For our ranges of interest, the VC-dimension is 2d for axis-aligned rectangles, 3 for disks in the plane, and d + 1 for halfspaces.
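To make these sampling bounds concrete, here is a minimal Python sketch of the sample sizes and the sampling step; the unspecified constant c (which the cited results fix) and the function names are assumptions of this sketch.

```python
import math
import random

def eps_sample_size(eps, nu, delta, c=1.0):
    # One standard bound: (c/eps^2) * (nu + log(1/delta)) points suffice
    # for an eps-sample with probability >= 1 - delta.
    return math.ceil(c / eps**2 * (nu + math.log(1.0 / delta)))

def eps_net_size(eps, nu, delta, c=1.0):
    # One standard bound: (c*nu/eps) * log(1/(eps*delta)) points suffice
    # for an eps-net with probability >= 1 - delta.
    return math.ceil(c * nu / eps * math.log(1.0 / (eps * delta)))

def random_sample(points, m):
    # Draw m points uniformly without replacement (all points if m >= n).
    return random.sample(points, min(m, len(points)))
```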

Enumeration.

For the range spaces we will consider, each range A ∈ 𝒜 can be defined by a basis B ⊆ X, where B is a small point set. Given a conforming geometric map and a subset Y ⊆ X, a basis of Y is a minimal subset B ⊆ Y such that B maps to the same range as Y, but any strict subset of B maps to a different (and usually smaller, under some measure) region. We use β to denote the maximum size of a basis for any subset. For instance, β = d for halfspaces, β = 3 for disks, and β = 2d for axis-aligned rectangles. Recall, by Sauer’s Lemma [23], that if a range space has VC-dimension ν, then it has at most O(n^ν) distinct ranges.

This implies that for n points, there are at most O(n^β) different ranges to consider. We assume β is constant; then it is possible to construct the mapped region of a basis in O(1) time, and to determine if it contains a point in O(1) time. This means we can enumerate all possible bases in O(n^β) time, construct their maps in as much time, and, for all of them, count which points are inside and evaluate Φ to find A^*, in O(n^{β+1}) time.
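As a concrete (if naive) instance of this brute-force enumeration, the following Python sketch evaluates Φ over halfplane ranges in ℝ² by enumerating bases of two points (β = 2), for O(n³) total work. The function names are illustrative only.

```python
from itertools import combinations

def left_of(a, b, p):
    # True if p lies in the closed halfplane to the left of the directed line a->b.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

def exact_max_halfplane_discrepancy(red, blue):
    # Every basis (pair of points) defines a candidate halfplane range; counting
    # inside each gives O(n^(beta+1)) = O(n^3) work.  The complementary halfplane
    # attains the same |r_A - b_A|, so it need not be enumerated separately.
    pts = red + blue
    best_phi, best_basis = 0.0, None
    for a, b in combinations(pts, 2):
        r = sum(left_of(a, b, p) for p in red) / len(red)
        bl = sum(left_of(a, b, p) for p in blue) / len(blue)
        if abs(r - bl) > best_phi:
            best_phi, best_basis = abs(r - bl), (a, b)
    return best_phi, best_basis
```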

For the specific range spaces we study, the time to find A^* can be improved by faster enumeration techniques. For halfspaces, Dobkin and Eppstein [7] reduced the runtime to find A^* from the generic O(n^{d+1}) bound to O(n^d); this implies that for disks the runtime is reduced from O(n^4) to O(n^3). For rectangles, Barbay et al. [4] show how to find A^* in O(n^2) time; this was recently shown tight [3] in the plane, assuming APSP takes cubic time.

2.2 Coverings

Our main approach towards efficient approximate range maximization is to sparsify the range space (X, 𝒜). This has two parts: the first is simply replacing X with an ε-sample; the second is sparsifying the ranges 𝒜, using a concept we refer to as an ε-covering.

Recall that the symmetric difference of two sets A and A' is A △ A' = (A ∖ A') ∪ (A' ∖ A). Define an ε-covering of a range space (X, 𝒜) as a subset of ranges 𝒜' ⊆ 𝒜 such that for any A ∈ 𝒜 there exists an A' ∈ 𝒜' with |A △ A'| ≤ εn. See Figure 2 for an illustration of this concept. If a set of ranges satisfies the above condition for any one specific range A, but not necessarily all ranges simultaneously, then it is a weak ε-covering of (X, 𝒜).
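In code, the covering condition is simple to state; the sketch below represents ranges explicitly as subsets of the ground set, which is an illustrative choice rather than how the algorithms store them.

```python
def sym_diff_fraction(range_a, range_b, n):
    # |A symmetric-difference A'| / n, with ranges given as subsets of the ground set.
    return len(set(range_a) ^ set(range_b)) / n

def is_eps_covering(candidate_ranges, all_ranges, n, eps):
    # Brute-force check: every range must be within eps*n symmetric difference
    # of some candidate range.
    return all(
        any(sym_diff_fraction(A, Ap, n) <= eps for Ap in candidate_ranges)
        for A in all_ranges
    )
```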

We will use subsets of the ground set to define subsets of the ranges. For a subset X' ⊆ X, let the restriction of 𝒜 to X' be the ranges of 𝒜 intersected with X'. We will define the covering using X' or a subset thereof. However, as each such range is a subset of X', which itself is a subset of X, we need the conforming map to take such a subset and map it back to a range of 𝒜, a subset of X. Given a subset Y (which is X' or a subset thereof), the corresponding range is the set of points of X contained in the conforming image of Y.

Figure 2: The first panel shows the ground set X. The second panel shows the set of disks induced by a small subset of X. The third panel shows a range (defined by the disk in blue); it has small symmetric difference over X (shown in orange) with the range defined by the disk (in green) induced by a subset.

A small ε-covering is implied by a result of Haussler [11]: for every range space (X, 𝒜) of VC-dimension ν, with |X| = n, there always exists a maximal set of ranges of size O((1/η)^ν) in which every pair of ranges has symmetric difference at least ηn. Setting η = ε, maximality implies every range of 𝒜 is within symmetric difference εn of some range in this set, so it is an ε-covering.

Symmetric difference nets.

We can construct an ε-net over the symmetric difference range space of (X, 𝒜) and then use these points to define the covering ranges.

For a family of ranges 𝒜, let 𝒜_△ be the family of ranges made up of the symmetric differences of pairs of ranges of 𝒜; specifically 𝒜_△ = {A △ A' : A, A' ∈ 𝒜}. If the range space (X, 𝒜) has VC-dimension ν, then (X, 𝒜_△) has VC-dimension at most a constant factor larger [21]. Thus for constant ν we can use asymptotically the same size random sample as before. Matheny et al. [19] pointed out two important properties connecting nets over symmetric difference range spaces, ε-coverings, and finding Â.

  • (P1) An ε-net N for the symmetric difference range space induces a set of ranges (those defined by subsets of N) which is an ε-covering of (X, 𝒜) [19].

  • (P2) Given an ε-covering and an ε-sample of X, then for any range A ∈ 𝒜 there exists a range A' in the covering so that evaluating Φ(A') on the ε-sample yields a value within O(ε) of Φ(A) [19].

For an appropriate constant c, by constructing (ε/c)-nets of size O((1/ε) log(1/ε)) on the red and the blue points, also constructing (ε/c)-samples of size O(1/ε²) on the red and the blue points, and invoking (P2) on the results, Matheny et al. [19] observed we can maximize Φ over the induced ranges to find an ε-approximate Â. They construct the ε-nets and ε-samples using random sampling, and apply the results to scan disk and rectangle range spaces toward finding Â. Enumerating all ranges in the covering and counting the intersections with the ε-samples, when the failure probability δ is a constant, is sufficient to find an Â in time polynomial in 1/ε (after the initial sampling) for both disks and rectangles.

We can hereafter ignore the distinction between the red and blue points, and focus on three aspects of this problem which can be further optimized: (1) more efficiently constructing a sparse set of ε-covering ranges; (2) more efficiently constructing a smaller ε-sample of X; and (3) more efficiently scanning the resulting ranges.

3 General Results via ε-Coverings

For general range spaces of constant VC-dimension ν we can directly apply the work of Matheny et al. [19] to get a first bound. A random sample N of size O((1/ε) log(1/ε)) induces an ε-covering with constant probability by (P1). A random sample S of size O(1/ε²) induces an ε-sample with constant probability. By (P2), scanning the ranges induced by N, evaluating Φ on each range using S, and returning the maximum yields the desired ε-approximation of Φ(A^*). Including the time to compute N and S we obtain the following result.

Consider a range space (X, 𝒜) with constant VC-dimension ν, with |X| = n, and a conforming map. For ε > 0, with probability at least 1 − δ, in time linear in n plus a term depending only on 1/ε, ν, and log(1/δ), we can find a range Â so that |Φ(Â) − Φ(A^*)| ≤ ε.

Proof.

First compute random samples N and S of size O((1/ε) log(1/ε)) and O(1/ε²), respectively. The algorithm naively considers all subsets of N of size β, and calculates the quantity Φ for the range each induces, as estimated on S. By (P2), this ε-approximates Φ(A) for any range A which has less than ε symmetric difference with the induced range. Moreover, since the ranges induced by N form an ε-cover with constant probability, any range — in particular A^* — is within symmetric difference at most εn of one induced by some subset of N. Thus, with constant probability we observe some range Â for which |Φ(Â) − Φ(A^*)| ≤ ε (after adjusting constants in the sizes of N and S). To amplify the probability of success to 1 − δ, we repeat this process O(log(1/δ)) times, and return the Â with median score. ∎
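To make the construction concrete, here is a minimal Python sketch of this two-stage procedure for halfplanes in the plane (β = 2). The sample-size constants and the number of repetitions are illustrative assumptions, not the constants from the proof.

```python
import math
import random
from itertools import combinations

def left_of(a, b, p):
    # True if p lies in the closed halfplane to the left of the directed line a->b.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

def approx_max_halfplane_discrepancy(red, blue, eps, trials=5):
    # Two-stage sampling: the net N induces candidate ranges (bases of two points),
    # and the samples Sr, Sb estimate r_A and b_A.
    net_size = math.ceil((8.0 / eps) * math.log(8.0 / eps))   # illustrative constant
    smp_size = math.ceil(8.0 / eps ** 2)                      # illustrative constant
    results = []
    for _ in range(trials):
        N = random.sample(red + blue, min(net_size, len(red) + len(blue)))
        Sr = random.sample(red, min(smp_size, len(red)))
        Sb = random.sample(blue, min(smp_size, len(blue)))
        best_phi, best_basis = 0.0, None
        for a, b in combinations(N, 2):
            r = sum(left_of(a, b, p) for p in Sr) / len(Sr)
            bl = sum(left_of(a, b, p) for p in Sb) / len(Sb)
            if abs(r - bl) > best_phi:
                best_phi, best_basis = abs(r - bl), (a, b)
        results.append((best_phi, best_basis))
    results.sort(key=lambda t: t[0])
    return results[len(results) // 2]     # repetition with the median score
```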

4 Halfspaces

Our general additive-error results applied to arbitrary halfspaces would require a comparatively large runtime; in this section, we improve it. First, a recent paper [18] shows that, with constant probability, a small ε-sample for halfspaces can be constructed efficiently.

Second, we create a weak ε-covering of the halfspace ranges using a random sample N whose size depends only on d and 1/ε. This construction is fast and succeeds with constant probability.

Then, we show how to enumerate these ranges while maintaining the counts from the ε-sample with less overhead than the previous brute-force approaches.

4.1 Smaller Coverings

We show that a random sample of only a number of points depending on d and 1/ε (independent of n) induces a range space which is a weak ε-covering of the halfspace ranges.

A sufficiently large random sample — of size depending only on d and 1/ε — induces a weak ε-covering for the halfspace ranges with constant probability.

Proof.

For any halfspace range A, we aim to show that the sample induces some range within symmetric difference εn of A (recall n = |X|). Given A, we can translate and rotate the boundary of a geometric halfspace realizing it until at least d points are incident to its boundary (some of these points may be inside the range and some may be just outside). We let h denote this geometric halfspace and A_h the range it defines.

We can fix the last d − 1 of these incident points, and consider a rotation of h so that those points stay incident to its boundary. This defines an ordering over all points in X. Denote the points closest to h in this ordering, enough to account for a symmetric difference of εn/d, as the set W_1. Replacing the free boundary point with any point of W_1 induces another range that has symmetric difference of size at most εn/d with A_h. A randomly chosen point from X is in W_1 with probability ε/d. If we randomly choose t points iid, then none of these will be from W_1 with probability at most (1 − ε/d)^t. Hence, setting t sufficiently large results in at least one point within W_1 with large constant probability. Let the halfspace induced by this chosen point (together with the fixed points) be h_1.

Assuming we have chosen some point from W_1, we fix this point on the boundary of h_1, and again keep all but one of the remaining incident points fixed, so that d − 1 points are fixed. This again lets us order the points by rotating a halfspace incident to these fixed points, and we place the points within an εn/d change of h_1 in this ordering into W_2. Then again, if we iid sample another t points from X, with large constant probability one falls into W_2. Let such a point be the next chosen point, let the halfspace it induces be h_2, and note the symmetric difference between the ranges of h_1 and h_2 is at most εn/d.

We repeat this for the remaining steps, until we obtain a set of d chosen sample points. This succeeds with constant probability overall. The induced halfspace is within symmetric difference εn of A_h, by the triangle inequality.

Choosing the per-step sample size t and the per-step failure probability appropriately, this implies that after sufficiently many iid samples from X, with constant probability, the resulting set induces a weak ε-covering of the halfspace ranges. ∎
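As a sanity check on the per-step probability calculation (a sketch under the assumptions that each draw hits the current window independently with probability p and that t iid draws are made per step; p, t, and δ′ here are stand-ins for the lemma's parameters):

```latex
\Pr[\text{no draw hits the window}] \;=\; (1-p)^{t} \;\le\; e^{-pt} \;\le\; \delta'
\qquad \text{whenever } t \ge \tfrac{1}{p}\ln\tfrac{1}{\delta'} ,
```

and a union bound over the d steps bounds the total failure probability by dδ′.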

4.2 Fast Enumeration of Halfspaces

Now, using our sets N and S, we enumerate over the ranges in the weak ε-covering, and for each range we count the intersection with the ε-sample S of X. We first consider the planar case; the general case will reduce to it.

Our technique follows that of Dobkin and Eppstein [7]. This first builds the dual arrangement of N, where in the dual the points of N become halfspace boundaries (lines, in the planar case). Each such line intersects at most |N| − 1 others, and takes O(|N|) time to insert into the arrangement; thus constructing the arrangement takes O(|N|²) total time. The arrangement also has O(|N|²) vertices and edges, each edge representing a combinatorial range over N. Each vertex is incident to a constant number of edges, and these correspond to ranges that only toggle the inclusion of the two relevant points whose dual lines cross at that vertex.
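For readers less familiar with point-line duality, here is a minimal Python sketch of one standard convention (the specific convention used in [7] may differ); it verifies that above/below relations are preserved under the transform.

```python
def dual_of_point(p):
    # Point (a, b) maps to the dual line y = a*x - b, returned as (slope, intercept).
    a, b = p
    return (a, -b)

def dual_of_line(slope, intercept):
    # Line y = m*x + c maps to the dual point (m, -c).
    return (slope, -intercept)

def above(point, line):
    # Is the point strictly above the line y = slope*x + intercept?
    x, y = point
    slope, intercept = line
    return y > slope * x + intercept

# Above/below is preserved: a primal point p lies above a primal line L
# exactly when the dual point of L lies above the dual line of p.
p = (1.0, 3.0)          # primal point
L = (2.0, 0.5)          # primal line y = 2x + 0.5
assert above(p, L) == above(dual_of_line(*L), dual_of_point(p))
```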

Then our technique extends that of Dobkin and Eppstein [7] in that we annotate each edge of the arrangement with the halfspaces dual to the points of the ε-sample S. Each such dual line intersects at most O(|N|) edges of the arrangement (its zone), and these edges can be found and annotated in O(|N|) time; annotating all of S takes O(|S||N|) time. This annotation describes how many sample points are gained or lost when moving between vertices in the arrangement. By considering these counts at all vertices of the arrangement, we evaluate Φ on all ranges in the weak ε-covering.

We can traverse the arrangement using a topological sweep [9] in time proportional to its size. Starting at the far left, corresponding to ranges that contain no points, we maintain at each vertex of the arrangement how many of the annotated halfspaces we are below, and thus the (approximate) values r_A and b_A of each corresponding range.

Combining the weak ε-covering of Lemma 4.1 and the ε-sample construction of [18], both for constant d, we can state our result for the planar case.

Consider the halfspace range space in the plane with |X| = n and maximum range A^*. With constant probability, we can find a range Â so that |Φ(Â) − Φ(A^*)| ≤ ε; the running time is linear in n plus a term depending only on 1/ε.

To extend this to ℝ^d we start with a weak ε-covering from Lemma 4.1 and a random sample of size O(1/ε²), which is an ε-sample with constant probability [16].

To search, we consider all subsets of N of size d − 2. For each subset we define a projection orthogonal to the span of the subset, resulting in a 2-dimensional space. Restricting to this 2-dimensional space, we follow Dobkin and Eppstein [7] and scan under this projection, which enforces that the candidate halfspaces have boundaries incident to the chosen subset. Here we can again construct each pertinent dual arrangement, now in 2 dimensions, in time quadratic in the number of projected net points.

We can then reduce the sample to a smaller set which is an ε-sample restricted to this 2-dimensional subproblem; this takes additional time per subset [18].

Then we annotate the 2-dimensional dual arrangement with the reduced sample. The topological sweep and the evaluation of each range do not dominate the cost. Ultimately, summing over all subsets gives the total running time, for constant d. To amplify the success probability to at least 1 − δ, we repeat the entire construction O(log(1/δ)) times, and return the Â with the median value Φ(Â).

Consider the halfspace range space in ℝ^d with |X| = n. For ε > 0, with probability at least 1 − δ, we can find a range Â so that |Φ(Â) − Φ(A^*)| ≤ ε; the running time is linear in n plus a term depending only on 1/ε, d, and log(1/δ).

4.3 Application to Disks and other Ranges

Many other geometric ranges can be mapped to halfspaces through various lifting maps, including disks, ellipses, slabs, and annuli. For disks, we can solve the approximate maximum range problem by using the lifting (x, y) ↦ (x, y, x² + y²), which maps disk ranges in ℝ² to halfspace ranges in ℝ³. Then invoking the result for halfspaces from Theorem 4.2 with d = 3 gives the corresponding runtime.
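A minimal Python sketch of this standard lifting map (function names are illustrative): a point lies inside a disk exactly when its lifted image lies below the plane determined by that disk.

```python
def lift(p):
    # Lifting (x, y) -> (x, y, x^2 + y^2).
    x, y = p
    return (x, y, x * x + y * y)

def in_disk_via_halfspace(p, cx, cy, rad):
    # (x-cx)^2 + (y-cy)^2 <= rad^2 is equivalent to the lifted point lying below
    # the plane z = 2*cx*x + 2*cy*y + (rad^2 - cx^2 - cy^2).
    x, y, z = lift(p)
    return z <= 2 * cx * x + 2 * cy * y + (rad * rad - cx * cx - cy * cy)

# Quick check: (1, 1) is inside the unit disk centered at (1, 0.5).
assert in_disk_via_halfspace((1.0, 1.0), 1.0, 0.5, 1.0)
```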

5 Rectangles

For the case of rectangles, we will describe two classes of algorithms. One simply creates an ε-cover and evaluates each rectangle in this cover on an ε-sample, as before. The other takes specific advantage of the orthogonal structure of the rectangles and of the “linearity” of Φ; this algorithm can find the maximum without considering every possible range. Our techniques are inspired by several algorithms [4, 24, 8] for the exact maximization problem, but require new ideas to efficiently take advantage of using both the cover and the sample. Common to all techniques will be an efficient way to compute an ε-cover based on a grid.

5.1 Grid ε-Covers for Rectangles

We will create a grid via a set of r cells along each axis; the grid is then the cross-product of these cells on each axis. The grid is defined so that no row or column contains more than an ε/(2d) fraction of the points. Given any rectangle, define its rounding to these grid boundaries. Rounding changes the point count by at most one row or column of points on each side, so that in total over all 2d sides there is at most εn error.

We can sort along each axis in O(n log n) time, and take a set of r points along the i-th axis evenly spaced in this sorted order; these define the grid boundaries on each axis. If we choose r = O(d/ε), then no row or column contains more than an ε/(2d) fraction of the points, as desired.

We label the rectangular ranges restricted to this grid boundary as the grid ranges, and as argued above they form an ε-cover of the rectangle ranges. We can then efficiently annotate each grid cell with approximately how many points it contains from X: each point is charged to the count of its grid cell in O(log r) time via binary search, for constant d. Since any rectangle over the grid is also a rectangle in the full range space, the count of any rectangle in the full range space can be estimated within the stated error by examining the grid counts. For the rectangle range space with |X| = n, the construction of the grid takes O(n log n) time, has O(d/ε) cells on each side, and induces an ε-cover for constant d.
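A small NumPy sketch of this grid construction (evenly spaced order statistics per axis, then per-cell counts). It assumes distinct coordinate values for simplicity; the parameter names are illustrative.

```python
import numpy as np

def build_grid(points, r):
    # r+1 boundaries per axis, chosen as evenly spaced order statistics, so that
    # no row or column holds more than roughly n/r points.
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    boundaries = []
    for ax in range(d):
        vals = np.sort(pts[:, ax])
        idx = np.linspace(0, n - 1, num=r + 1).astype(int)
        boundaries.append(vals[idx])
    return boundaries

def cell_counts(points, boundaries):
    # Histogram the points into the r^d grid cells.
    pts = np.asarray(points, dtype=float)
    counts, _ = np.histogramdd(pts, bins=boundaries)
    return counts
```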

Simple enumeration algorithm.

Given the point set X we take an ε-sample S and then construct a grid on it in O(|S| log |S|) time. S is also an ε-sample for intervals along each axis, so running our grid construction on S induces an ε-cover of the rectangle ranges. Once we have the approximate count in each grid cell, we can build subset-sum (prefix-sum) information: in each of the r^d grid cells, we compute the sum of counts over all grid cells with smaller or equal indexes in each dimension. Straightforward dynamic programming solves this in O(r^d) time. Then any rectangle on the grid can have its count calculated in O(1) time (for constant d) using inclusion-exclusion over a constant number of subset-sum values; for instance, when d = 2, four values are required.

We evaluate all rectangles on the grid by enumerating all pairs of grid cells (defining upper and lower corners in each dimension), and calculating the count (technically a separate red count from the red sample and a blue count from the blue sample) for each, in O(r^{2d}) total time.
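A straightforward Python/NumPy sketch of this enumeration for d = 2, using prefix sums and inclusion-exclusion so that each rectangle's red and blue fractions are obtained in constant time; the names and normalization are illustrative.

```python
import numpy as np

def max_rect_discrepancy_on_grid(red_counts, blue_counts):
    # red_counts, blue_counts: 2d arrays of per-cell counts built from an eps-sample.
    R = np.cumsum(np.cumsum(red_counts, axis=0), axis=1) / max(red_counts.sum(), 1)
    B = np.cumsum(np.cumsum(blue_counts, axis=0), axis=1) / max(blue_counts.sum(), 1)

    def box(P, i0, i1, j0, j1):
        # Inclusion-exclusion over prefix sums; cell ranges are inclusive.
        s = P[i1, j1]
        if i0 > 0: s -= P[i0 - 1, j1]
        if j0 > 0: s -= P[i1, j0 - 1]
        if i0 > 0 and j0 > 0: s += P[i0 - 1, j0 - 1]
        return s

    rows, cols = R.shape
    best_phi, best_box = 0.0, None
    for i0 in range(rows):
        for i1 in range(i0, rows):
            for j0 in range(cols):
                for j1 in range(j0, cols):
                    phi = abs(box(R, i0, i1, j0, j1) - box(B, i0, i1, j0, j1))
                    if phi > best_phi:
                        best_phi, best_box = phi, (i0, i1, j0, j1)
    return best_phi, best_box
```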

Consider the rectangle range space with |X| = n and a Lipschitz-continuous function Φ (as a function of r_A and b_A) with maximum range A^*. With probability at least 1 − δ, in time linear in n plus a term depending only on 1/ε, d, and log(1/δ), we can find a range Â so that |Φ(Â) − Φ(A^*)| ≤ ε.

5.2 Algorithms for Decomposable Functions

Here we exploit a critical “linear” property of Φ: a rectangle A can be decomposed into two disjoint parts A₁ and A₂, and then r_A = r_{A₁} + r_{A₂} and b_A = b_{A₁} + b_{A₂}. Technically, we maximize the two linear functions r_A − b_A and b_A − r_A separately, and take their max. In particular, this allows us (following exact algorithms [4]) to decompose the problem along a separating line. The solution then either lies completely in one half, or spans the line. In the exact case on n points, this ultimately leads to a runtime recurrence of T(n) = 2T(n/2) + S(n), where S(n) is the time to solve the problem spanning the line. The line-spanning problem can then be handled using a different recurrence, leading to a total runtime of O(n²) for the exact problem [4].

First we show we can efficiently construct a special sample of the appropriate size, but this alone would still leave a comparatively large runtime.

Our approximate algorithm will significantly improve upon this by compressing the representation at various points, at the cost of some extra bookkeeping and a somewhat more complicated recurrence to analyze. In short, we can map to an r × r grid (using Lemma 5.1), and then the recurrence only depends on the dyadic y-intervals of the grid. We can compress each such interval so it carries only a small share of the error, since each query only touches about 2 log r of these intervals. The challenge then falls to maintaining this compressed structure efficiently during the recurrence.

The dense exact case on an r × r grid is also well studied. There exists a practically efficient O(r³)-time method [5] based on Kadane’s algorithm (which performs best as gridScan_linear; see Section 7), and a more complicated method taking slightly subcubic time [24]. By allowing approximation, we ultimately reduce this runtime further.
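For intuition, here is the textbook Kadane-based O(r³) sketch for the maximum-weight submatrix on a grid of signed cell weights (e.g., red minus blue fraction per cell). This is the standard method underlying a gridScan_linear-style approach, not the implementation from [5].

```python
def max_weight_rectangle_kadane(W):
    # W: r x c grid (list of lists) of signed weights.
    # For each pair of row boundaries, collapse the rows into one array of
    # column sums and run 1d Kadane over it: O(r^2 * c) total.
    r, c = len(W), len(W[0])
    best = float('-inf')
    for top in range(r):
        col = [0.0] * c
        for bottom in range(top, r):
            for j in range(c):
                col[j] += W[bottom][j]
            cur = 0.0
            for j in range(c):
                cur = max(col[j], cur + col[j])   # extend or restart the interval
                best = max(best, cur)
    return best
```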

We will focus on the 2d case; this is where the advantage over the Theorem 5.1 bound is most notable. Generalization to higher dimensions is straightforward: enumerate over pairs of grid coordinates to fix the extent in all but two of the dimensions, then apply the 2-dimensional result on the remaining two dimensions.

Tree and slab approximation.

The algorithm builds a binary tree over the rows (the y-values) of the grid. We will assume that the number of cells along each axis is a power of 2 (otherwise we can round up), so the tree is perfectly balanced.

At the i-th level of the tree, each node contains r/2^i rows and there are 2^i nodes. We refer to the family of rows represented by a subtree as a slab. Any grid-aligned rectangle can be defined as the intersection of its x-extent with at most 2 log r slabs in the y-coordinate – the classic dyadic decomposition. This implies we can tolerate an ε/(2 log r) share of additive error in each slab to have at most ε additive error overall (which implies the fractions of red and of blue points in each range have ε additive error).
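A short Python sketch of this dyadic decomposition: any interval of rows is covered by at most 2 log₂(size) canonical tree nodes (names are illustrative; size is assumed to be a power of 2).

```python
def dyadic_cover(lo, hi, size):
    # Canonical nodes of a perfect binary tree over [0, size) covering [lo, hi).
    def rec(node_lo, node_hi):
        if hi <= node_lo or node_hi <= lo:
            return []                              # disjoint from the query
        if lo <= node_lo and node_hi <= hi:
            return [(node_lo, node_hi)]            # fully contained: take this node
        mid = (node_lo + node_hi) // 2
        return rec(node_lo, mid) + rec(mid, node_hi)
    return rec(0, size)

# Example: rows 3..10 of a 16-row grid are covered by 4 canonical nodes.
print(dyadic_cover(3, 11, 16))   # [(3, 4), (4, 8), (8, 10), (10, 11)]
```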

Since the rectangle will span the entire vertical extent (y-direction) of each slab in this decomposition, the additive error of a slab can be accounted for along just the horizontal (x) direction. Thus, we can scan cells from left to right within a slab, and only retain the cumulative weight in a cell when it exceeds the slab’s error budget. We refer to this operation as ε-compression. We call each column (and x-value) within a slab where it has retained a non-zero value active; all other columns are inactive. We store the active cells in a linked list.

Since each row contains only a small number of points (by the grid construction), a slab consisting of a single row (a leaf of the tree) can be approximated with weight retained in only a few active cells, and a slab at a higher level (originally containing the points of many rows) can be approximated by accumulating weight into a correspondingly bounded number of active cells. For the coarsest levels, this substantially compresses the points in that slab.

We can compress all slabs in the tree in one pass over each level, so that a slab at level i contains only a bounded number of active columns, determined by its error budget.
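A minimal Python sketch of the ε-compression of one slab; the threshold `thresh` stands in for the slab's error budget, and the list of (column, weight) pairs plays the role of the linked list of active cells.

```python
def eps_compress_slab(weights, thresh):
    # Scan a slab's column weights left to right; emit (column, weight) only when
    # the running total reaches `thresh`, leaving the other columns inactive.
    active, acc = [], 0.0
    for j, w in enumerate(weights):
        acc += w
        if acc >= thresh:
            active.append((j, acc))
            acc = 0.0
    return active
```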

Interval Preprocessing and Merging.

Now consider a subproblem where we seek to find a rectangle maximizing the total weight, restricted to a given horizontal extent (e.g., within a slab). We reduce this to a 1d problem by summing the weights for each x-coordinate. Then there is an often-used [4, 7, 2] way to preprocess intervals so they can be merged and updated. It maintains maximal-weight subintervals: (1) the maximal-weight subinterval anywhere in the interval, (2) the maximal-weight subinterval including the left boundary, and (3) the maximal-weight subinterval including the right boundary. Given two preprocessed adjacent intervals, we can update these subintervals for their union in O(1) time [4]. Thus, given a horizontal extent with m active intervals, we can find the maximum-weight subinterval in O(m) time.
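A minimal Python sketch of this preprocessing: each interval stores its total weight and the three maximal subinterval weights, and two adjacent intervals merge in O(1) time (the same bookkeeping used in [4]; subintervals are assumed non-empty, and the names are illustrative).

```python
from dataclasses import dataclass

@dataclass
class Interval:
    total: float   # total weight of the interval
    best: float    # maximal-weight subinterval anywhere inside
    left: float    # maximal-weight subinterval touching the left boundary
    right: float   # maximal-weight subinterval touching the right boundary

def leaf(w: float) -> Interval:
    # An interval consisting of a single column of weight w.
    return Interval(w, w, w, w)

def merge(a: Interval, b: Interval) -> Interval:
    # O(1) merge of two preprocessed adjacent intervals (a to the left of b).
    return Interval(
        total=a.total + b.total,
        best=max(a.best, b.best, a.right + b.left),
        left=max(a.left, a.total + b.left),
        right=max(b.right, b.total + a.right),
    )
```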

Recursive construction.

Now we can describe our recursive algorithm for finding the maximal-weight rectangle on the grid. We find the maximum-weight rectangle through 3 options: (1) completely in the top child’s subtree, (2) completely in the bottom child’s subtree, (3) overlapping both the top and bottom child’s subtrees. The total time can be written as a recurrence T(r) = 2T(r/2) + S(r), where S(r) is the time to solve case (3).

Case (3) requires another recurrence to understand, and it closely follows the “strip-constrained” algorithm of Barbay et al. [4]; our version will account for the dense grid.

We consider the strip-constrained grid search problem. First fix a strip, which is a consecutive set of rows. Then consider two slabs, one directly above (on top of) the strip and one directly below it. A column of the strip is active if it is active in either adjacent slab. Counts in active columns of the strip are maintained, and intervals of the strip described by consecutive inactive columns have been merged and preprocessed. The goal is to find the maximum-weight rectangle whose top edge lies in the upper slab and whose bottom edge lies in the lower slab (so it must cross the strip).

We specifically want to solve this problem when the strip is empty, the two slabs are the top and bottom children of the root, and all columns are initially active. We call this the instance of size r, since all r rows are still in play.

The strip-constrained grid search problem of size r over an ε-compressed binary tree can be solved in time bounded by the total number of active columns encountered over all levels of its recursion.

Proof.

Following Barbay et al. [4] we split the problem into subcases, following the subtrees of the slabs. The top slab has a top and a bottom sub-slab, and similarly for the bottom slab. We then consider four recursive cases, each with a new (wider) strip: the top edge of the rectangle lies in one of the two sub-slabs of the top slab, and its bottom edge in one of the two sub-slabs of the bottom slab; whichever sub-slabs are skipped over are absorbed into the strip. The cost of a recursive step is the preprocessing of the new strip. We will describe the largest case (1), in which both inner sub-slabs are absorbed into the strip; the others are similar.

The strip already maintains preprocessed intervals of inactive columns. When one of the absorbed sub-slabs has an active column which is inactive in the two remaining sub-slabs, we treat this as a new inactive interval that needs to be maintained within the strip. The weights from the absorbed sub-slabs are added to that column’s entry for the strip. If inactive intervals of the strip become adjacent to each other, they are merged, in O(1) time each. This completes the recursive step for case (1).

In the base case, when the slabs are single rows (at the deepest level), the range maximum is restricted to use their active columns. We sum weights on active columns across the two rows and the strip. Then, also considering the inactive intervals of the strip, we invoke the interval merging procedure [4] to find the maximal range, in time proportional to the number of active intervals.

The cost of recursing in any case is also proportional to the number of active columns, since this bounds the number of potential merges and the time it takes to scan the linked lists of active columns to detect where merging is needed. At level i this is bounded by the number of active columns guaranteed by the compression above.

At level i of this recursion there are 4^i sub-instances, each touching a bounded number of active columns, and therefore the merging work at that level is proportional to their total. The cost is asymptotically dominated by the last level. ∎

Letting (since ) as it is in Lemma 5.2 we have a bound of . We can solve the first recurrence of