Kaggle surveys [KaggleSurvey] show that the majority of learning tasks faced by data scientists involve relational data, which is typically stored in multiple tables in a relational database. However, standard algorithms for standard learning problems generally assume that the input consists of points in Euclidean space [HundredPage], and thus are not designed to operate directly on relational data. Hence the current standard practice for a data scientist confronted with a learning task on relational data is the following:
Issue a feature extraction query to extract the data from the relational database by joining together multiple tables to materialize a design matrix.
Then interpret each row of this design matrix as a point in a Euclidean space.
Finally, import this design matrix into a standard learning algorithm to train the model.
Independent of the learning task, this standard practice necessarily has exponential worst-case time and space complexity as the design matrix can be exponentially larger than the underlying relational tables. Thus a natural research question is:
The Relational Learning Question:
Which standard learning algorithms can be implemented as relational algorithms, which informally are algorithms that are efficient when the input is in relational form?
And for those standard algorithms that are not implementable by a relational algorithm, is there an alternative relational algorithm that has similar performance guarantees to the standard algorithm?
In this paper we address the relational learning question for two well-known algorithms for the k-means clustering problem, a common/standard learning problem (for example, k-means is one of the handful of learning models implemented in Google’s BigQuery ML package [bigqueryml]). The input to the k-means problem consists of a collection of points in a Euclidean space and a positive integer k. A feasible output is k points, which we call centers, in this Euclidean space. The objective is to minimize the aggregate squared distance from each original point to its nearest center. The algorithms we consider are:
The first algorithm is the k-means++ algorithm from [DBLP:conf/soda/ArthurV07]. In the k-means++ algorithm the k
centers are picked iteratively from the collection of original points, where the probability that a point is picked as the next center is proportional to its squared distance to the closest previously picked center.
The second algorithm is the adaptive k-means algorithm from [aggarwal2009adaptive]. In this algorithm centers are selected as in the k-means++ algorithm. Then each center is given a weight, which should be interpreted as a multiplicity, equal to the number of original points that are closest to it. It is shown in [aggarwal2009adaptive] that if sufficiently many centers are selected then this weighted instance is an ε-approximate sketch with high probability. An ε-approximate sketch is a collection of weighted points, where the weights are interpreted as multiplicities, with the property that any γ-approximate solution on this weighted instance is also a (1+ε)γ-approximate solution on the original instance.
So plan A is to find a relational implementation of each algorithm. And if plan A fails, plan B is to find an alternative algorithm that is relationally implementable and that has similar performance guarantees to the standard algorithm.
Finding relational algorithms is a spiritual sibling to finding algorithms in the streaming model [Muthukrishnan:2005], and finding algorithms in the Massively Parallel Computation (MPC) model [ImMS17, KarloffSV10]. All these models are motivated by application areas in which the large size of the data is an issue, and due to time and space limitations, the way in which an algorithm can access the data is restricted in some way. So our broader research goals, beyond learning problems, are the same as for these sibling restricted access models: determining which problems admit efficient algorithms under these restrictions, and developing generally applicable algorithmic design and analysis tools.
1.1 Warm-up: Efficiently implementing 1-means++ and 2-means++
To illustrate the concepts introduced so far, let us consider relationally implementing 1-means++ and 2-means++ (the k-means++ algorithm in the special cases k = 1 and k = 2), and relationally implementing the computation of the weights for two centers in the manner of the adaptive k-means algorithm, on a commonly considered special type of join, namely a path join. We first show that plan A works out for 1-means++ and 2-means++ in that both have efficient relational implementations. We then show that it is highly unlikely that plan A could work out for computing the weights for two centers, as this computation is #P-hard, so an efficient relational implementation would imply P = NP.
In a path join each table T_i, for 1 ≤ i ≤ d−1, has two features/columns, x_i and x_{i+1}. Assume for simplicity that each table contains n rows. The design matrix J is formed by taking the natural join of the tables. So J has d features, one for each possible feature in the tables, and the rows in J are exactly the d-tuples (x_1, …, x_d) where, for all i, the entries for features x_i and x_{i+1} appear as a row in table T_i. The design matrix J could contain as many as n^{d/2} rows, and thus d · n^{d/2} entries. Thus the standard practice could require time and space Ω(n^{d/2}) in the worst case, as this much space could be required just to materialize J.
A Relational Implementation of 1-means++:
Conceptually, consider a layered directed acyclic graph G with one layer for each feature. In G there is one vertex in layer i for each value that appears in column x_i of table T_{i−1} or table T_i. Further, in G there is a directed edge between a vertex u in layer i and a vertex v in layer i+1 if and only if (u, v) is a row in table T_i. Then there is a one-to-one correspondence between full paths in G, which are paths from layer 1 to layer d, and rows in the design matrix. Implementing the 1-means++ algorithm is thus equivalent to generating a full path uniformly at random from G. This can be done by counting the number of full paths using dynamic programming and topological sorting, where for each vertex v the number of paths from layer 1 to v is stored. Using these path counts, it is then straightforward to generate a full path uniformly at random. It is also straightforward to implement this algorithm so that it does not need to explicitly construct G, and the path counts are stored in an additional column in each table. The resulting running time is O(d n log n), so conceptually the total time of this relational algorithm is dominated by the time to sort each table.
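The uniform-path-sampling idea above can be sketched directly. The helper below is ours (not from the paper), and for illustration it represents each table as an in-memory list of pairs; it counts suffix paths (paths to layer d) rather than prefix paths as in the text, which is symmetric:

```python
import random
from collections import defaultdict

def sample_uniform_row(tables):
    """Sample one row of a path join's design matrix uniformly at random,
    without materializing the join.  tables[i] is a list of (x_i, x_{i+1})
    pairs; a row is a tuple (x_1, ..., x_d) whose consecutive pairs all
    appear in the corresponding tables."""
    d = len(tables) + 1
    # suffix[i][v] = number of full paths from vertex v in layer i to layer d
    suffix = [defaultdict(int) for _ in range(d)]
    for (_, b) in tables[-1]:
        suffix[d - 1][b] = 1
    for i in range(d - 2, -1, -1):
        for (a, b) in tables[i]:
            suffix[i][a] += suffix[i + 1][b]
    # walk forward, picking each vertex proportionally to its suffix count
    verts = list(suffix[0])
    row = [random.choices(verts, weights=[suffix[0][v] for v in verts])[0]]
    for i in range(d - 1):
        nbrs = [b for (a, b) in tables[i] if a == row[-1] and suffix[i + 1][b]]
        wts = [suffix[i + 1][b] for b in nbrs]
        row.append(random.choices(nbrs, weights=wts)[0])
    return tuple(row)
```

Each draw touches only the tables and the per-vertex counts, never the exponentially large join.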
A Relational Implementation for 2-means++:
Assume for simplicity, and without loss of generality, that the first center was the origin. If we conceptually think of each vertex in the layered graph as having a cost equal to the square of its feature value, then implementing 2-means++ is equivalent to generating a full path with probability proportional to its aggregate cost. This can again be implemented in O(d n log n) time using dynamic programming and topological sorting, where for each vertex v the number of paths from layer 1 to v, and the aggregate cost of those paths, are stored.
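A sketch of this cost-weighted variant, under the same illustrative in-memory representation of the tables (the helper name is ours): each vertex stores suffix path counts N and aggregate suffix costs C, and the forward walk weights a continuation b by N[b] times the running prefix cost plus C[b].

```python
import random
from collections import defaultdict

def sample_costly_row(tables):
    """Sample a row of a path join with probability proportional to its
    squared 2-norm (its cost with respect to a first center at the origin).
    tables[i] is a list of (x_i, x_{i+1}) pairs; assumes some row of the
    join has nonzero cost."""
    d = len(tables) + 1
    N = [defaultdict(int) for _ in range(d)]    # suffix path counts
    C = [defaultdict(float) for _ in range(d)]  # aggregate suffix costs
    for (_, b) in tables[-1]:
        N[d - 1][b] = 1
        C[d - 1][b] = float(b * b)
    for i in range(d - 2, -1, -1):
        for (a, b) in tables[i]:
            N[i][a] += N[i + 1][b]
            C[i][a] += C[i + 1][b] + a * a * N[i + 1][b]
    # forward sampling: the weight of extending a prefix with running cost c
    # by vertex b is N[b] * c + C[b], the total cost of all such completions
    firsts = list(N[0])
    row = [random.choices(firsts, weights=[C[0][v] for v in firsts])[0]]
    c = row[0] ** 2
    for i in range(d - 1):
        nbrs = [b for (a, b) in tables[i] if a == row[-1]]
        wts = [N[i + 1][b] * c + C[i + 1][b] for b in nbrs]
        row.append(random.choices(nbrs, weights=wts)[0])
        c += row[-1] ** 2
    return tuple(row)
```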
#P-hardness of Relationally Computing the Weights for Two Centers:
We prove #P-hardness by a reduction from the well-known #P-hard Knapsack Counting problem. The input to the Knapsack Counting problem consists of a set W = {w_1, …, w_d} of nonnegative integer weights and a nonnegative integer L. The output is the number of subsets of W with aggregate weight at most L. To construct the relational instance, for each w_i, we define a table whose rows allow the value w_i to either be selected or not:
Let centers y_1 and y_2 be arbitrary points such that the points closer to y_1 than to y_2 are exactly those points whose aggregate weight is at most L. Then there are 2^d full paths in G, and hence 2^d rows in J, since each w_i can either be selected or not selected in level/feature i. The weight of y_1 is the number of points in J closer to y_1 than to y_2, which is in turn exactly the number of subsets of W with aggregate weight at most L.
#P-hardness of Implementing One Step of Lloyd’s Algorithm:
We can also show that computing the new centers in one step of the commonly used Lloyd’s algorithm [lloyd1982least] is #P-hard (see Appendix D).
1.2 Related Work
Constant-factor approximations are known for the k-means problem in the standard computational setting [LiS16, KanungoMNPSW04], although the most commonly used algorithm in practice is a local search algorithm called Lloyd’s algorithm, sometimes confusingly just called “the k-means algorithm”. The k-means++ algorithm from [DBLP:conf/soda/ArthurV07] is an O(log k)-approximation algorithm, and is commonly used in practice to seed Lloyd’s algorithm. Some sort of sketching has been used before to design algorithms for the k-means problem in other restricted access computational models, including streaming [GuhaMMMO03, BravermanFLSY17] and the MPC model [EneIM11, BahmaniMVKV12], as well as for speeding up sequential methods [MeyersonOP04, SohlerW18].
We now briefly cover the results from the limited literature on relational algorithms that are most critical for our purposes. The most important result is the existence of relational algorithms for SumProd queries. A SumProd query consists of:
A collection of m tables in which each column has an associated feature. There are standard methods to convert categorical features to numerical features [HundredPage], and as we do not innovate with respect to this process, we will assume that all features are a priori numerical. Let F be the collection of all features, and let d = |F| be the number of features. The design matrix J is the natural join of the tables. We use n to denote the number of rows in the largest input table and use N to denote the number of rows in J.
A function g_e for each feature e ∈ F, mapping feature values into some base set S. We generally assume each g_e is easy to compute.
Binary operations ⊕ and ⊗ on S such that (S, ⊕, ⊗) forms a commutative semiring. Most importantly, this means that ⊗ distributes over ⊕.
Evaluating the SumProd query results in the following element of the semiring:
Q = ⊕_{x ∈ J} ⊗_{e ∈ F} g_e(x_e),
where x is a row in the design matrix and x_e is the value for feature e in that row. For example, if each g_e is identically 1, ⊗ is multiplication, and ⊕ is addition, then the resulting SumProd query computes the number of rows in the design matrix J. As another example, if the base set consists of pairs, with (c_1, s_1) ⊕ (c_2, s_2) = (c_1 + c_2, s_1 + s_2), (c_1, s_1) ⊗ (c_2, s_2) = (c_1 c_2, c_1 s_2 + c_2 s_1), and g_e(x_e) = (1, (x_e − y_e)^2) for a fixed point y, then one can verify that this forms a commutative semiring, and the resulting SumProd query computes a pair (c, s) where c is the number of rows in J and s is the aggregate squared 2-norm distance from y over the points/rows in J (see appendix section A for more details).
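The pair semiring of the second example can be sketched concretely; the helper names (oplus, otimes, g, sumprod_path_join) are ours, and for illustration the query is evaluated over a path join given as in-memory lists of pairs:

```python
from collections import defaultdict

# Pair semiring: (c1, s1) ⊕ (c2, s2) = (c1 + c2, s1 + s2) and
# (c1, s1) ⊗ (c2, s2) = (c1*c2, c1*s2 + c2*s1).  With per-feature factors
# g(x) = (1, (x - y_e)^2), the SumProd query evaluates to the pair
# (number of rows of J, aggregate squared distance of J's rows from y).
ZERO = (0, 0.0)

def oplus(p, q):
    return (p[0] + q[0], p[1] + q[1])

def otimes(p, q):
    return (p[0] * q[0], p[0] * q[1] + q[0] * p[1])

def g(x, y_e):  # per-feature factor
    return (1, (x - y_e) ** 2)

def sumprod_path_join(tables, y):
    """Evaluate the SumProd query on a path join (tables[i] holds
    (x_i, x_{i+1}) pairs) by message passing from layer d back to layer 1,
    which is valid because ⊗ distributes over ⊕."""
    d = len(tables) + 1
    msg = {}
    for (_, b) in tables[-1]:
        msg[b] = g(b, y[d - 1])
    for i in range(d - 2, -1, -1):
        nxt = defaultdict(lambda: ZERO)
        for (a, b) in tables[i]:
            if b in msg:
                nxt[a] = oplus(nxt[a], otimes(g(a, y[i]), msg[b]))
        msg = dict(nxt)
    total = ZERO
    for v in msg:
        total = oplus(total, msg[v])
    return total
```

For instance, on the single-table join {(1, 3), (2, 3)} with y the origin, the rows are (1, 3) and (2, 3), so the query evaluates to (2, 10 + 13) = (2, 23).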
A SumProd query grouped by a table T computes, for each row r of T, the value of
Q(r) = ⊕_{x ∈ J : x extends r} ⊗_{e ∈ F} g_e(x_e).
That is, it computes for each row r the value of the query under the assumption that r was the only row in T. For example, evaluating the SumProd query grouped by table T_i, for the path join discussed in subsection 1.1, would compute, for each edge between layer i and layer i+1, the number of paths in the graph passing through this edge. Algorithms for SumProd queries have been rediscovered multiple times. It will be useful to state the running times of the algorithms that we develop in terms of the time to evaluate a single SumProd query (perhaps grouped by a table), under the assumption that the operators ⊕ and ⊗ can be evaluated in constant time.
Finding a good formal definition of a relational algorithm is challenging. Firstly, if the tables have a complicated structure, implementing any non-trivial algorithm runs into complexity-theoretic barriers, as it is NP-hard to even determine whether the join of an arbitrary collection of tables is empty. Secondly, for each candidate definition of a “relational algorithm” there are plausible situations in which it is not the “right” definition. In this paper, we will go with the following definition, which seems to most commonly be the “right” one:
A relational algorithm is an algorithm in which the running time is bounded by a function that consists of a product of factors where:
one factor is a function of n and is at most poly-logarithmic in n,
one factor is a function of d and is at most polynomial in d,
one factor is a function of m and is at most polynomial in m, and
one factor is n^{fhtw}, where fhtw is the fractional hypertree width of the join.
So a typical run time for a relational algorithm might be something like O(d^2 m n^{fhtw} log n). Note that typically d and m are orders of magnitude smaller than n. A simpler way to address this challenge, which we suggest the reader adopt on the first reading, is to assume that the join is acyclic, which is normally the case, and then replace the n^{fhtw} factor with a linear factor of n. For acyclic joins this is an equivalent definition, as [khamis2016joins] gives an algorithm for SumProd queries (perhaps grouped by a table) that achieves this running time with fhtw = 1 for acyclic joins. (In the appendix section A we discuss cyclicity and extensions to cyclic relational tables.)
Another useful building block for us will be SumProd queries with constraints, which consist of a SumProd query Q, a table T, and a constraint (or perhaps a collection of constraints) C. Evaluating such a query results in the evaluation of the SumProd query grouped by table T, when J is restricted to only those rows that satisfy the constraint (or equivalently, the rows not satisfying C are removed from J). SumProd queries with additive constraints were introduced in [faqai]. An additive constraint consists of a function h_e for each feature e and a bound b, and the constraint is that ∑_{e ∈ F} h_e(x_e) ≤ b. Examples of standard learning algorithms that can be viewed as using SumProd queries with additive constraints are given in [faqai, abokhamis2020approximate]. [faqai] gives an algorithm for SumProd queries with additive inequalities that conceptually improves on standard practice, as this algorithm does not have to explicitly compute the last join, but this algorithm’s running time is still exponential in the worst case. [abokhamis2020approximate] shows that computing a SumProd query with a single additive constraint is NP-hard. However, [abokhamis2020approximate] shows that there is a relational algorithm to compute a (1+ε)-approximation when the operators satisfy some additional natural properties (which are too complicated to go into here). For our purposes it is sufficient to know that this result from [abokhamis2020approximate] yields a relational algorithm to compute a (1+ε)-approximation of the number of points that lie inside a specified hypersphere. The main algorithmic design innovation in this algorithm was the use of what we called structured semirings, which intuitively are semirings in which the base elements are arrays of numbers, instead of numbers. (This is explained in somewhat more detail in the appendix section C.) Finally, [abokhamis2020approximate] shows that computing an O(1)-approximation to the number of rows in the design matrix that satisfy two additive inequalities is NP-hard, even for acyclic joins, although some sort of nontraditional approximation is possible.
[Rkmeans] gives an O(1)-approximation algorithm for k-means when the input is in relational form. The algorithm solves the k-means problem separately for each table, and from this produces an instance of points with weights/multiplicities that is shown to be an approximate sketch. Thus one can then obtain an O(1)-approximation by running your favorite O(1)-approximation algorithm for weighted k-means on this sketch. Note that the time complexity of this algorithm is exponential in the number of features d, and it is thus not a relational algorithm under the definition that we use here.
1.3 Our Results
The first of our two main results is a relational implementation of the k-means++ algorithm (so plan A works out). To appreciate the issues, consider relationally implementing 3-means++: a point is chosen as the third center with probability proportional to its squared 2-norm distance to the closer of the two previous centers y_1 and y_2. However, as it is NP-hard to even count the number of points closer to y_1 than to y_2, generating a center according to this probability distribution initially seems challenging. But it turns out that this is indeed possible using two algorithmic design techniques that we believe will likely be useful in the future for designing relational algorithms for other problems.
The first algorithmic design technique is the use of SumProd queries with axis-parallel hyperplane constraints. An (axis-parallel) hyperplane constraint specifies that the point has to be on a particular side of a particular (axis-parallel) hyperplane. So a typical axis-parallel hyperplane constraint is x_e ≤ a, where x_e is some feature. While evaluating SumProd queries with arbitrary hyperplane constraints is NP-hard [abokhamis2020approximate], SumProd queries with axis-parallel hyperplane constraints are easy to evaluate, as one can just discard the portions of each table not satisfying these constraints before evaluating the SumProd query on the remaining tables. For example, for the constraint x_e ≤ a one can throw out all rows in all tables with a value greater than a for feature x_e before evaluating the SumProd query.
The second algorithmic design technique is the use of rejection sampling, which under the right conditions allows one to sample from a “hard” distribution p using an “easy” distribution q. In our setting, the hard distribution p is the distribution used by the k-means++ algorithm, and the easy distribution q is defined by axis-parallel hyperplanes, which makes it easy to sample from. To prove that this method is efficient we then need that q is, in a suitable aggregate sense, a reasonably close approximation to p. We explain our implementation of the k-means++ algorithm in Section 2, starting as a warm-up with 3-means++.
We then turn to implementing the adaptive k-means algorithm. As we have already mentioned, assuming P ≠ NP, we will not be able to relationally implement this algorithm exactly. Thus we must resort to plan B. The second of our two main results is an alternative algorithm for computing the weights that is relationally implementable, and that (like the original adaptive k-means algorithm) produces an ε-approximate sketch. (Recall that we define an ε-approximate sketch as a collection of weighted points where, if one uses a γ-approximation for k-means on the weighted instance, then this will result in a (1+ε)γ-approximation on the original input.) Typically it is trivial to efficiently construct the weights for centers. However, in the relational setting, we know that it is #P-hard to weight the points exactly, and this proof can be used to show that the weights are NP-hard to even individually approximate. While our algorithm does not approximate every weight, the algorithm will construct weights that on aggregate result in a good sketch.
Firstly, our algorithm assumes that the number of centers selected using the k-means++ algorithm is O(k). The main algorithmic design idea is that for each center c_i we consider a collection of hyperspheres/balls B_{i,1}, B_{i,2}, … around c_i, where B_{i,j} contains approximately 2^j points. Using the algorithm from [abokhamis2020approximate], our relational algorithm approximately uniformly samples a poly-log sized collection of points from each B_{i,j}, and then increases the weight for c_i by approximately 2^j times the fraction of the sample that are found to be in the outer half of the ball and to be closer to c_i than any other center. We show that these computed weights are accurate enough in aggregate to form an ε-approximate sketch.
In summary, we show that there is a relational algorithm to compute the centers of k-means++ and approximately weight them. Using an efficient constant approximation algorithm for weighted k-means to cluster these weighted points down to exactly k centers, we obtain our main result:
There is a relational algorithm that is O(1)-approximate for the k-means problem.
2 The k-means++ Algorithm
In this section we describe a relational implementation of the k-means++ algorithm. As a warmup, we first explain how to implement the algorithm when k = 3.
2.1 Relational Implementation of 3-means++
Recall that the 3-means++ algorithm picks a point x to be the third center with probability p_x = d²(x)/Z_p, where d²(x) = min(‖x − y_1‖², ‖x − y_2‖²) is the squared distance from x to the closest of the first two centers y_1 and y_2, and Z_p is a normalizing constant. Conceptually think of p as being a “hard” distribution to sample from.
Description of the Implementation:
The implementation first constructs two identically-sized axis-parallel hypercubes/boxes H_1 and H_2, centered around y_1 and y_2 respectively, that are as large as possible subject to the constraints that the side lengths have to be non-negative integral powers of 2, and that H_1 and H_2 cannot intersect. Such side lengths can be found since, up to scaling, we may assume y_1 and y_2 are sufficiently far away from each other. Conceptually the implementation also considers a box H_0 that is the whole Euclidean space.
To define our “easy” distribution q, for each point x define c(x) to be ‖x − y_i‖² if x ∈ H_i (for i ∈ {1, 2}), and ‖x − y_1‖² otherwise (one may think of y_1 as the representative of H_0).
Then q_x is defined to be c(x)/Z_q, where Z_q is a normalizing constant. The implementation then repeatedly samples a point x with probability q_x. After sampling x, the implementation can either (A) reject x and then resample, or (B) accept x, which means setting the third center to be x. The probability that x is accepted after it is sampled is d²(x)/c(x), and thus the probability that x is rejected is 1 − d²(x)/c(x).
It is straightforward to see how to compute Z_q, and how to compute d²(x) and c(x) for a particular point x. Thus the only non-straightforward part is sampling a point x with probability q_x, which we explain now:
The implementation uses a SumProd query to compute the aggregate squared 2-norm distance from y_1, over all points, grouped by table T_1. (Recall that this SumProd query was explained in the introduction.) Let the resulting vector be u. So u_r is the aggregate squared 2-norm distance from y_1 of all rows in the design matrix that are extensions of row r in T_1.
Then the implementation uses a SumProd query to compute the aggregate squared 2-norm distance from y_1, constrained to points in H_2, and grouped by T_1. Let the resulting vector be v. (Notice that an axis-parallel box constraint can be expressed as a collection of axis-parallel hyperplane constraints.)
Then the implementation uses a SumProd query to compute the aggregate squared 2-norm distance from y_2, constrained to points in H_2, and grouped by T_1. Let the resulting vector be w.
The implementation then picks a row r of T_1 with probability proportional to u_r − v_r + w_r.
The implementation then replaces T_1 by a table consisting only of the picked row r.
The implementation then repeats this process on table T_2, then table T_3, etc.
At the end J will consist of one point/row x, where the probability that a particular point x ends up as this final row is q_x. To see this, note that u − v essentially computes the aggregate squared 2-norm distances to y_1 for all points not in H_2, and w computes the aggregate squared distances of the points in H_2 to y_2.
We now claim that this implementation guarantees that x is accepted with probability p_x. We can see this using the standard rejection sampling calculation. At each iteration of sampling from q, let S_x be the event that point x is sampled and A_x be the event that x is accepted. Then,
Pr[S_x and A_x] = Pr[S_x] · Pr[A_x | S_x] = (c(x)/Z_q) · (d²(x)/c(x)) = d²(x)/Z_q.
Thus x is accepted with probability proportional to d²(x), as desired.
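The rejection sampling scheme itself can be sketched generically. The helper below is ours and, purely for illustration, enumerates an explicit support; the relational implementation instead draws from the easy distribution via SumProd queries. The precondition that the easy values dominate the hard values holds in our setting because the easy value is a squared distance to one particular center while the hard value is the minimum over all centers:

```python
import random

def rejection_sample(support, p_unnorm, q_unnorm):
    """Sample x with probability proportional to p_unnorm(x), by drawing
    from the distribution proportional to q_unnorm(x) and accepting with
    probability p_unnorm(x)/q_unnorm(x).  Requires
    p_unnorm(x) <= q_unnorm(x) for every x in the support."""
    weights = [q_unnorm(x) for x in support]
    while True:
        x = random.choices(support, weights=weights)[0]
        if random.random() < p_unnorm(x) / q_unnorm(x):
            return x
```

The accepted mass at x is (q_unnorm(x)/Z_q) · (p_unnorm(x)/q_unnorm(x)) = p_unnorm(x)/Z_q, i.e., proportional to the hard distribution, matching the calculation above.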
As the number of times that the implementation has to sample from q
is geometrically distributed, the expected number of times that it will have to sample is the inverse of the per-round probability of success, which is Z_q/Z_p. It is not too difficult to see (we prove it formally in Lemma 2.3) that this ratio is small. It takes O(m) SumProd queries to sample from q. Therefore, the expected running time of our implementation of 3-means++ is the time for O(m · Z_q/Z_p) SumProd query evaluations.
2.2 Relational Implementation of k-means++ for General k
In this section we give a relational implementation of k-means++ for general k. It is sufficient to explain how the next center is picked given the previous centers. Recall that the k-means++ algorithm picks a point x to be the next center with probability p_x = d²(x)/Z_p, where d²(x) is the squared distance from x to the closest previous center and Z_p is a normalizing constant. The implementation consists of two parts. The first part, described in subsubsection 2.2.1, involves the construction of a laminar collection of axis-parallel hyperrectangles (meaning that for any two hyperrectangles, either they have no intersection or one of them contains the other), which we will henceforth call boxes. The second part, described in subsubsection 2.2.2, samples according to the probability distribution p using rejection sampling and an “easy” distribution q that is derived from the boxes constructed in the first part.
2.2.1 Box Construction
The algorithm maintains two collections B_a and B_f of tuples, each consisting of a box and a point in that box that we refer to as the representative of the box. When the algorithm terminates, B_f will be a laminar collection of boxes that we will use to define the “easy” probability distribution q.
Initially B_a consists of a hypercube centered at each previous center c_h, where each face is at distance 1 from c_h, with the representative point being c_h. And initially B_f is empty. Without loss of generality we can scale so that no pair of these initial boxes intersect. Over time, some of the boxes in B_a will grow in size, some boxes will be added to B_f, and some boxes will be moved from B_a to B_f. So one can think of B_a as a collection of active boxes that might change in the future, and think of B_f as a collection of inactive boxes that are frozen.
The algorithm repeats the following steps. If no pair of boxes in B_a intersect, then a doubling step is performed. In a doubling step every box in B_a is doubled, which means that each face is moved twice as far away from its representative point. Otherwise the algorithm picks two arbitrary intersecting boxes from B_a, say B with representative y and B′ with representative y′, and executes what we call a meld step. A meld step consists of:
Computing the smallest box B″ that contains both B and B′.
Adding B″ to B_a, with y as its representative.
Deleting B and B′ from B_a.
If B was created before the last doubling step (that is, B was not melded with another box since the last doubling step), the implementation computes a box from B by halving, which means that each face is moved so that its distance to the box’s representative is halved. Then this halved box is added to B_f.
If B′ was created before the last doubling step, then the implementation similarly computes a box from B′ by halving, and this halved box is added to B_f.
The algorithm terminates when there is only one element left in B_a, at which point the algorithm adds a box that contains the whole Euclidean space, with the representative of that last remaining box as its representative, to B_f. Note that at each iteration of the doubling and melding, the boxes which are added to B_f are the ones that after doubling were melded with other boxes (and they are added at their size before doubling). Pseudo-code for this algorithm can be found in the appendix. ∎
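The doubling/meld construction can be sketched in one dimension, where boxes are intervals; the helper below is ours, and the real algorithm applies the same bookkeeping per dimension of its axis-parallel boxes:

```python
def build_laminar_boxes(centers):
    """One-dimensional sketch of the doubling/meld construction, with boxes
    as intervals [lo, hi] carrying a representative point.  Returns the
    frozen collection as (lo, hi, rep) triples, plus the whole-space box."""
    # active box: [lo, hi, rep, doubled], where `doubled` records whether the
    # box has been doubled since it was created (by initialization or a meld)
    active = [[c - 1.0, c + 1.0, c, False] for c in centers]
    frozen = []

    def halve(box):  # pre-doubling size of a box, recovered by halving
        lo, hi, rep, _ = box
        return (rep - (rep - lo) / 2, rep + (hi - rep) / 2, rep)

    while len(active) > 1:
        hit = next(((i, j) for i in range(len(active))
                    for j in range(i + 1, len(active))
                    if active[i][0] <= active[j][1]
                    and active[j][0] <= active[i][1]), None)
        if hit is None:  # doubling step: move faces twice as far from the rep
            for b in active:
                b[0], b[1], b[3] = (b[2] - 2 * (b[2] - b[0]),
                                    b[2] + 2 * (b[1] - b[2]), True)
            continue
        i, j = hit       # meld step
        for b in (active[i], active[j]):
            if b[3]:     # doubled since creation: freeze pre-doubling size
                frozen.append(halve(b))
        merged = [min(active[i][0], active[j][0]),
                  max(active[i][1], active[j][1]), active[i][2], False]
        active = [b for h, b in enumerate(active) if h not in hit] + [merged]
    frozen.append((float('-inf'), float('inf'), active[0][2]))
    return frozen
```

On well-separated centers this produces a nested family of intervals around each center, with merged intervals appearing at larger scales, as the laminarity lemma below asserts.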
The collection of boxes in B_f constructed by the above algorithm is laminar.
One can prove by induction that just before each doubling step the boxes in B_a are disjoint, and for every box B′ in B_f there exists a box B in B_a such that B′ ⊆ B. Laminarity of B_f is a straightforward consequence of this. ∎
To define our easy distribution q, let B(x) be the minimal box in B_f that contains point x, and let y(x) be the representative of B(x). Define c(x) = ‖x − y(x)‖², and q_x = c(x)/Z_q, where Z_q is a normalizing constant.
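For intuition, the unnormalized value of the easy distribution can be sketched in one dimension; the helper name is ours, and the real algorithm never enumerates points or boxes this way:

```python
def q_unnorm(x, boxes):
    """Unnormalized easy-distribution value at a 1-D point x: the squared
    distance from x to the representative of the minimal box in the laminar
    collection that contains x.  `boxes` are (lo, hi, rep) intervals, and
    the whole-space box guarantees that a containing box always exists."""
    containing = [b for b in boxes if b[0] <= x <= b[1]]
    lo, hi, rep = min(containing, key=lambda b: b[1] - b[0])
    return (x - rep) ** 2
```

A point deep inside a small box is charged its distance to that box's nearby representative, while a far-away point falls through to the whole-space box.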
The implementation then repeatedly samples a point x with probability q_x. After sampling x, the implementation can either (A) reject x and then resample, or (B) accept x, which means setting the next center to be x. The probability that x is accepted after it is sampled is d²(x)/c(x), and thus the probability that x is rejected is 1 − d²(x)/c(x).
If S_x is the event of initially sampling x from distribution q, and A_x is the event of subsequently accepting x, we can calculate the probability of accepting x in a particular round using the standard rejection sampling calculation:
Pr[S_x and A_x] = Pr[S_x] · Pr[A_x | S_x] = (c(x)/Z_q) · (d²(x)/c(x)) = d²(x)/Z_q.
Thus we can see that the probability that x is accepted is proportional to d²(x), as desired.
We now explain how to relationally implement the generation of a point x with probability q_x. The implementation first generates a single row from table T_1, then a single row from table T_2, etc. So it is sufficient to explain how to implement the generation of a row from an arbitrary table T_i. To generate a row from T_i, the implementation recursively computes a vector u that has one entry for each row of T_i. Initially u is the all-zeros vector. The recursion starts with the box in B_f that is the whole Euclidean space. Assume that the recursion is currently operating on a box B with representative y. First, u is incremented by the aggregate squared 2-norm distances from y of the points in B, grouped by T_i, which can be computed by a SumProd query with box constraint B and grouped by T_i. Then let B_1, …, B_l be the children of B in the laminar decomposition B_f. If no such boxes exist, then this is a base case of the recursion, and no further action is taken. Otherwise, for each child B_h, u is decremented by the aggregate squared 2-norm distances from y of the points in B_h, grouped by T_i, which can be computed by a SumProd query with box constraint B_h grouped by table T_i. Then the implementation recurses on each B_h. Once u is computed, a row r is selected from T_i with probability proportional to u_r, and T_i is replaced by a table with the single row r. Pseudo-code for this implementation can be found in Appendix B. ∎
Consider the state of the implementation just before it is going to execute doubling step t. Consider an arbitrary box B in B_a at this time, and let k_B be the number of centers in B at this time. Let c_h be an arbitrary one of these centers. Then:
The distance from c_h to any face of B is at least 2^{t−1}.
Each side length of B is at most k_B · 2^t.
The first statement is a direct consequence of the definitions of doubling and melding, since at any point in time, just before doubling step t, the distance from each center in a box to each face of that box is at least 2^{t−1}. To prove the second statement, we define an assignment of the centers to the boxes as follows. Consider the centers inside each box right before the first doubling step; we call these the centers assigned to that box, and denote the number of them by k_B. When two boxes B and B′ are melded into a box B″, we assign their assigned centers to B″.
We prove that each side length of B is at most k_B · 2^t by induction on the number of executed doubling steps. Since the centers assigned to B are exactly the centers inside B right before each doubling, this will prove the second statement. The statement is obvious in the base case t = 1. The statement also obviously holds by induction after a doubling step, as t is incremented, the side lengths double, and the number of assigned centers does not change. It also holds after every meld step, because each side length of the newly created larger box is at most the aggregate of the maximum side lengths of the smaller boxes that are melded, and the number of assigned centers in the newly created larger box is the aggregate of the assigned centers in the two smaller boxes. Note that since for any box B all the assigned centers of B are inside B at all times, k_B is the number of centers inside B before the next doubling. ∎
For all points x, c(x) = O(d k²) · d²(x).
Consider an arbitrary point x. Let c_h be the prior center that is closest to x under the 2-norm distance. Let t be minimal such that x is contained in a box in B_a just before the t-th doubling round. We argue about the state of the algorithm at two times: the time just before doubling round t−1, and the time just before doubling round t. Let B be a minimal box in B_a that contains x just before doubling round t, and let y be the representative for box B. By Lemma 2.2, the squared distance from x to y is at most O(d k² 2^{2t}). So it is sufficient to show that the squared distance from x to c_h is Ω(2^{2t}).
Let B′ be the box in B_a that contains c_h just before doubling round t−1. Note that x could not have been inside B′ at that time, by the definition of t. Then by Lemma 2.2 the distance from c_h to the faces of B′ at that time is at least 2^{t−2}, and hence the distance from x to c_h is also at least 2^{t−2}, as x is outside of B′. ∎
The expected time complexity for this implementation of k-means++ is O(d k⁴ m) SumProd query evaluations.
When picking a center, a point x can be sampled with probability q_x in the time of O(k m) SumProd query evaluations. This is because B_f has size O(k), as the laminar decomposition can be thought of as a tree with O(k) leaves, and the implementation needs to group by each of the m tables. By Lemma 2.3, the expected number of times that the implementation will have to sample from q is O(d k²). Summing over the k centers to be chosen, we get the claimed bound. ∎
3 The Adaptive k-means Algorithm
The first step of the adaptive k-means algorithm from [aggarwal2009adaptive] samples a collection of centers using the k-means++ algorithm. Here we will take the number of sampled centers to be βk, where β is a sufficiently large constant. In subsection 3.1 we describe a relational algorithm to compute a collection of alternative weights, one weight w_i for each center c_i. Then in subsection 3.2 we show that these alternative weights w_i (like the original weights) form an ε-approximate sketch using these O(k) points.
3.1 Algorithm for Computing Alternative Weights
Fix some . Initialize the alternative weight for each center to zero. For each center and for each the following steps are taken:
A radius is computed such that the number of points in the hypersphere/ball of radius centered at lies in the range , where is a constant. This can be accomplished using one application per center of the approximation algorithm for SumProd queries with one additive constraint from [abokhamis2020approximate]. Further elaboration is given in Appendix C.
Let be a constant that is at least . A collection of “test” points is independently and approximately-uniformly sampled with replacement from every ball . Here “approximately-uniformly” means that every point in is sampled with probability on each draw. Again this can be accomplished using techniques from [abokhamis2020approximate]; further elaboration is given in the appendix.
Let be the points in that lie in the “donut” . Then the cardinality is computed. That is .
The number of these points that are closer to than to any other center in is computed. That is, , where is the index of the center in closest to the point , that is .
The ratio is computed (if then ).
If then is incremented by (otherwise, for analysis purposes only, the donut is classified as undersampled).
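The per-donut counting loop above can be sketched as follows in plain (non-relational) Python. The donut decomposition, the test-point sampler, and the threshold `tau` are illustrative stand-ins for the SumProd machinery of [abokhamis2020approximate]; also, since the exact increment is elided in the text above, this sketch credits each counted donut with its hit ratio times the approximate ball count, which is one plausible reading.

```python
import math

def alternative_weights(centers, donuts, sample_tests, tau=0.5):
    """Sketch: for each center i and each donut (inner_r, outer_r, approx_count)
    around it, estimate the fraction of donut test points whose nearest center
    is i. If the fraction clears the threshold tau, credit that donut's
    approximate count to w2[i]; otherwise mark the donut undersampled.
    `sample_tests(i, outer_r)` returns test points from the ball around center i."""
    w2 = [0.0] * len(centers)
    undersampled = []
    for i, c in enumerate(centers):
        for (r_in, r_out, n_ball) in donuts[i]:
            tests = sample_tests(i, r_out)
            in_donut = [p for p in tests if r_in <= math.dist(p, c) <= r_out]
            if not in_donut:
                undersampled.append((i, r_in, r_out))
                continue
            close = sum(
                1 for p in in_donut
                if min(range(len(centers)), key=lambda j: math.dist(p, centers[j])) == i
            )
            ratio = close / len(in_donut)
            if ratio >= tau:
                w2[i] += ratio * n_ball
            else:
                undersampled.append((i, r_in, r_out))
    return w2, undersampled

# Tiny illustrative run: two far-apart centers, one donut each.
centers = [(0.0, 0.0), (10.0, 0.0)]
donuts = [[(0.0, 2.0, 4)], [(0.0, 2.0, 4)]]
def sample_tests(i, r_out):
    cx, cy = centers[i]
    return [(cx + dx, cy) for dx in (0.5, 1.0, -0.5, -1.0)]
w2, under = alternative_weights(centers, donuts, sample_tests)
```

In the tiny run, every test point in each donut is closest to its own center, so both donuts are counted and neither is undersampled.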
The running time of a naive implementation of this algorithm would be dominated by the sampling of the test points. Sampling a single test point can be accomplished with applications of the algorithm from [abokhamis2020approximate], setting the approximation error to . Recall that the running time of the algorithm from [abokhamis2020approximate] is . Thus, the time to sample all test points is . Substituting for , and noting that , we obtain a total time of for a naive implementation.
3.2 Approximation Analysis
The goal in this subsection is to prove Theorem 3.1 which states that the alternative weights form an -approximate sketch with high probability. Throughout our analysis, “with high probability” means that for any constant the probability of the statement not being true can be made less than asymptotically by appropriately setting the constants in the algorithm.
The centers , along with the computed weights , form an -approximate sketch with high probability.
Intuitively, if a decent fraction of the points in each donut are closer to center than to any other center, then Theorem 3.1 can be proven by a straightforward application of Chernoff bounds showing that each alternative weight is likely close to the true weight . The conceptual difficulty is that if most of the points in a donut are closer to other centers than to , then is likely undersampled. Thus the “uncounted” points in would contribute no weight to the computed weight , and so a computed weight may poorly approximate the actual weight . To address this, we need to prove that omitting the weight from these uncounted points does not have a significant impact on the objective value. We break our proof into four parts. The first part, described in subsubsection 3.2.1, defines a fractional weight for each center . Conceptually, the fractional weights are computed by fractionally assigning the weight of the uncounted points to various “near” centers in a somewhat complicated manner. The second part, described in subsubsection 3.2.2, establishes various properties of the fractional weights that we will need. The third part, described in subsubsection 3.2.3, shows that each fractional weight indeed likely closely approximates the computed weight . The fourth part, described in subsubsection 3.2.4, shows that the fractional weights for the centers in form a -approximate sketch. Subsubsection 3.2.4 also contains the proof of Theorem 3.1.
3.2.1 Defining the Fractional Weights
To define the fractional weights we first define an auxiliary directed acyclic graph whose vertices are the input points . Let be an arbitrary point in . Let be the closest center to , that is . Let be the donut around that contains . If is not undersampled then has one outgoing edge . So let us now assume that is undersampled. Defining the outgoing edges from in this case is a bit more complicated. Let be the points that are closer to than to any other center in , that is, the such that . If then contains only the point , and the only outgoing edge from goes to . So let us now assume . Let be the center that is closest to the most points in , the next donut in toward from . That is, . Let be the points in that are closer to than to any other center; that is, is the collection of such that . Then there is a directed edge from to each point in . Before defining how to derive the fractional weights from , let us take a detour to note that is acyclic.
Consider a directed edge . Let be the center in that is closest to, and the donut around that contains . Then since , it must be the case that . Since , it must be the case that . Thus . Thus the closest center to must be closer to than the closest center to is to . Hence, as one travels along a directed path in , although the identity of the closest center can change, the distance to the closest center is monotonically decreasing. Thus must be acyclic. ∎
We explain how to compute a fractional weight for each point using the network . Initially each is set to 1. Then, conceptually, these weights flow toward the sinks in , splitting evenly over all outgoing edges at each vertex. More formally, the following flow step is repeated until it is no longer possible to do so:
Flow Step: Let be an arbitrary point that currently has positive fractional weight and positive outdegree in . Then for each directed edge in , increment by . Finally, set to zero.
As the sinks in are exactly the centers in , the centers in will be the only points that end up with positive fractional weight. Thus we use to refer to the resulting fractional weight on center .
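Because the graph is acyclic, the flow step can be applied to nodes in topological order, pushing each node's weight once. The following is a minimal sketch under the assumption that the DAG is given as adjacency lists over nodes indexed 0..n-1 and that the centers are exactly its sinks.

```python
from collections import deque

def fractional_weights(n, edges):
    """Push one unit of initial weight from every node of a DAG toward its
    sinks, splitting evenly over outgoing edges (the Flow Step, applied in
    topological order via Kahn's algorithm). `edges[u]` lists the successors
    of node u. Only sinks end up with positive weight."""
    indeg = [0] * n
    for u in range(n):
        for v in edges[u]:
            indeg[v] += 1
    w = [1.0] * n
    q = deque(u for u in range(n) if indeg[u] == 0)
    while q:
        u = q.popleft()
        if not edges[u]:
            continue  # sink: keeps its accumulated weight
        share = w[u] / len(edges[u])
        w[u] = 0.0
        for v in edges[u]:
            w[v] += share
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return w

# Example: nodes 2 and 3 are sinks; node 0 splits its weight between them.
weights = fractional_weights(4, [[2, 3], [3], [], []])
```

Note that the total weight is conserved: the sinks' final weights sum to the number of nodes, matching the fact that every point's unit weight ends up on some center.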
3.2.2 Properties of the Fractional Weights
We show that, with high probability, either the estimated ratio is a good approximation of , or the real ratio is very small.
We show in Lemma 3.7 that the maximum flow through any node is bounded by when is big enough. This follows by induction, because each point has neighbors and every point can have in-degree from one set of nodes per center. We further know that every point that is not uncounted actually contributes to its center's weight.
Consider Bernoulli trials $X_1, X_2, \ldots, X_n$. Let $X = \sum_{i=1}^{n} X_i$ and $\mu = \mathbb{E}[X]$. Then, for any $\delta > 0$:
$$\Pr[X \geq (1+\delta)\mu] \leq e^{-\frac{\delta^2 \mu}{2+\delta}} \qquad \text{(Upper Chernoff Bound)}$$
$$\Pr[X \leq (1-\delta)\mu] \leq e^{-\frac{\delta^2 \mu}{2}} \qquad \text{(Lower Chernoff Bound)}$$
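As a quick sanity check on these tail bounds, one can compare them against empirical tail frequencies for i.i.d. Bernoulli trials. The bound forms used in this sketch, $e^{-\delta^2\mu/(2+\delta)}$ for the upper tail and $e^{-\delta^2\mu/2}$ for the lower tail, are the standard multiplicative Chernoff bounds; the sample sizes and parameters below are arbitrary choices for illustration.

```python
import math, random

def chernoff_upper(mu, delta):
    # Pr[X >= (1+delta)*mu] <= exp(-delta^2 * mu / (2 + delta))
    return math.exp(-delta * delta * mu / (2.0 + delta))

def chernoff_lower(mu, delta):
    # Pr[X <= (1-delta)*mu] <= exp(-delta^2 * mu / 2)
    return math.exp(-delta * delta * mu / 2.0)

def empirical_tails(n, p, delta, trials=2000, seed=1):
    """Monte Carlo estimate of both tail probabilities for Binomial(n, p)."""
    rng = random.Random(seed)
    mu = n * p
    up = low = 0
    for _ in range(trials):
        x = sum(rng.random() < p for _ in range(n))
        up += x >= (1 + delta) * mu
        low += x <= (1 - delta) * mu
    return up / trials, low / trials

# Example: 200 fair coin flips (mu = 100), relative deviation delta = 0.2.
up_emp, low_emp = empirical_tails(200, 0.5, 0.2)
```

The empirical tail frequencies should land below the corresponding analytic bounds, which is exactly how the bounds are used in the case analysis below.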
With high probability either or .
Fix any center and . By applying the lower Chernoff bound from Lemma 3.3, it is straightforward to conclude that if is large, then with high probability at least a third of the test points in each are in the donut . That is, with high probability . So let us consider a particular and condition on having some fixed value that is at least . So is conditioned on being large.
, and the indicator random variables are Bernoulli trials. Further note that, by the definition of , . Further note that, as the sampling of test points is nearly uniform, . For notational convenience, let . We now break the proof into three cases, which cover the ways in which the statement of this lemma could fail. For each case, we show that with high probability the case does not occur.
Case 1: and and
We now prove that the probability of this case occurring is very low. If we set , then using the upper Chernoff bound, we have
(by the Upper Chernoff Bound)
Therefore, for and , with high probability this case does not occur.
Case 2: and and
We can use the lower Chernoff bound with to prove that the probability of this event is very small.
Therefore, for and , with high probability this case does not occur.