# A Composable Coreset for k-Center in Doubling Metrics

A set of points P in a metric space and a constant integer k are given. The k-center problem finds k points of P as centers such that the maximum distance r from any point of P to its closest center is minimized. Doubling metrics are metric spaces in which, for any r, a ball of radius r can be covered using a constant number of balls of radius r/2. Fixed-dimensional Euclidean spaces are doubling metrics. The lower bound on the approximation factor of k-center is 1.822 in Euclidean spaces; however, (1+ϵ)-approximation algorithms with exponential dependency on 1/ϵ and k exist. For a given set of sets P_1,...,P_L, a composable coreset independently computes subsets C_1⊂ P_1, ..., C_L⊂ P_L, such that ∪_{i=1}^{L} C_i contains an approximation of a measure of the set ∪_{i=1}^{L} P_i. We introduce a (1+ϵ)-approximation composable coreset for k-center, which in doubling metrics has size sublinear in |P|. This results in a (2+ϵ)-approximation algorithm for k-center in MapReduce, based on parametric pruning, with a constant number of rounds in doubling metrics for any ϵ>0 and sublinear communication. We prove the exponential nature of the trade-off between the number of centers (k) and the radius (r), and give a composable coreset for a related problem called dual clustering. We also give a new version of the parametric pruning algorithm with O(nk/ϵ) running time, O(n) space and a 2+ϵ approximation factor for metric k-center.


## 1 Introduction

Coresets are subsets of points that approximate a measure of the point set. Composable coresets are one method of computing coresets on big data sets: they provide a framework for adapting constant factor approximation algorithms to streaming and MapReduce models. Composable coresets summarize distributed data so that scalability is increased while the desirable approximation factor and time complexity are kept.

There is a general algorithm for solving problems using coresets which is known by different names in different settings: mergeable summaries and merging in a tree-like structure for streaming $(1+\epsilon)$-approximation algorithms, small space (divide and conquer) for constant factor approximations in streaming, and composable coresets in MapReduce. A consequence of using constant factor approximations instead of $(1+\epsilon)$-approximations with the same merging method is that it can add an extra factor, depending on the input size, to the approximation factor of the algorithm.

Composable coresets require only a single round and sublinear communication in the MapReduce model, and the partitioning is done arbitrarily.

###### Definition 1 (Composable Coreset).

A composable coreset on a set of sets $\{S_i\}$ is a set of subsets $C(S_i) \subseteq S_i$ whose union gives an approximate solution for an objective function $f$. Formally, a composable coreset of a minimization problem is an $\alpha$-approximation if

$$f\left(\cup_i S_i\right) \le f\left(\cup_i C(S_i)\right) \le \alpha \cdot f\left(\cup_i S_i\right).$$

The maximization version is defined similarly.

A partitioned composable coreset is a composable coreset in which the initial sets form a partition, i.e. the sets are disjoint. Using Gonzalez’s algorithm for k-center, Indyk et al. designed a composable coreset for a similar problem known as the diversity maximization problem. Other variations of composable coresets are randomized composable coresets and mapping coresets. Randomized composable coresets share the same divide-and-conquer approach as other composable coresets and differ from them only in the way they partition the data: they partition the input randomly, whereas other composable coresets use an arbitrary partitioning. Mapping coresets extend composable coresets by adding a mapping between coreset points and other points to their coresets and keep almost the same amount of data in all machines. Algorithms for clustering in norms using mapping coresets are known. Further improvements of composable coresets for diversity maximization include lower bounds and multi-round composable coresets in metrics with bounded doubling dimension.

Metric k-center is an NP-hard problem for which 2-approximation algorithms that match the lower bound for the approximation factor of this problem are known [39, 21]. Among approximation algorithms for k-center is a parametric pruning algorithm, based on the minimum dominating set. In this algorithm, an approximate dominating set is computed on the disk graph of the input points. The running time of the algorithm is . The greedy algorithm for k-center requires only $O(nk)$ time and, unlike the algorithm based on the minimum dominating set, uses -nets. A -approximation coreset exists for k-center with size exponentially dependent on .
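The greedy algorithm mentioned above is commonly implemented as a farthest-first traversal. The following is a minimal sketch under that assumption; the function name `gonzalez`, the choice of the first input point as the initial center, and the use of Euclidean distance are illustrative choices, not details taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def gonzalez(points: Sequence[Point], k: int) -> Tuple[List[Point], float]:
    """Farthest-first traversal; returns (centers, covering radius)."""
    centers = [points[0]]  # arbitrary first center (assumption)
    # dist[i] is the distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # the next center is the point farthest from all current centers
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)
```

Each of the k iterations scans all n points once, which is where the O(nk) running time comes from.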

Let the optimal radius of k-center for a point set $P$ be $r$. The problem of finding the smallest set of points that covers $P$ using radius $r$ is known as the dual clustering problem.

Metric dual clustering (of k-center) has an unbounded approximation factor. In the Euclidean metric, there exists a streaming -approximation algorithm for this problem. Also, any -approximation algorithm for the minimum disk/ball cover problem gives a -approximation coreset of size for k-center, so -approximation coresets of size exist for this problem. A greedy algorithm for dual clustering of k-center has also been used as a preprocessing step of density-based clustering (DBSCAN). Implementing DBSCAN efficiently in MapReduce is an important problem [26, 13, 19, 36, 29].

Randomized algorithms for metric k-center and k-median in MapReduce exist. These algorithms take -approximation offline algorithms and return -approximation and -approximation algorithms for k-center and k-median in MapReduce, respectively. The round complexity of these algorithms depends on the probability of the algorithm for finding a good approximation.

Current best results on metric k-center in MapReduce have rounds and give the approximation factor . However, a -approximation algorithm exists if the cost of the optimal solution is known. Experiments in  suggest that running Gonzalez’s algorithm on a random partitioning and on an arbitrary partitioning results in the same approximation factor.

In doubling metrics, a -approximation algorithm exists that is based on Gonzalez’s greedy algorithm. The version with outliers has also been discussed [9, 15].

### Warm-Up

Increasing the size of coresets in the first step of computing composable coresets can improve the approximation factor of some problems. The approximation factor of k-median algorithm of  is , where and are the approximation factors of k-median and weighted k-median, respectively. This algorithm computes a composable coreset, where a coreset for k-median is the set of medians weighted by the number of points assigned to each median.

A pseudo-approximation for k-median finds median and has approximation factor . Using a pseudo-approximation algorithm in place of k-median algorithms in the first step of , it is possible to achieve a better approximation factor for k-median using the same proof as . Since any pseudo-approximation has a cost less than or equal to that of the optimal solution, replacing them will not increase the cost of clustering.

The approximation factor using  as weighted k-median coresets is , while the best k-median algorithm would give a factor using the same algorithm (). The lower bound on the approximation factor of this algorithm using the same weighted k-median algorithm but without pseudo-approximation is ().

### Contributions

We give a -approximation coreset of size for k-center in metric spaces with doubling dimension . Using composable coresets, our algorithm generalizes to the MapReduce setting, where it becomes a -approximation coreset of size , given memory , which is sublinear in the input size .

Using the composable coreset for dual clustering, we find a -approximation composable coreset for k-center, which has a sublinear size in metric spaces with constant doubling dimension. More specifically, if an -approximation exists for doubling metrics, our algorithm provides -approximation factor. It empirically improves previous metric k-center algorithms [33, 34] in MapReduce. A summary of results on k-center is shown in Table 1. Note that in the MapReduce model, each round can take a polynomial amount of time; however, the space available to each machine is sublinear.

Our algorithm achieves a trade-off between the approximation factor and the size of the coreset (see Figure 1). The approximation factor of our algorithm and the size of the resulting composable coreset for $L$ input sets are $\alpha$ and $\beta k L$, respectively. This trade-off is the main idea of our algorithm.

Figure 1: Space-approximation factor trade-off of our α-approx. coreset of size βkL for k-center in Euclidean plane.

Our composable coresets give single-pass streaming algorithms and -round approximation algorithms in MapReduce with sublinear communication, since each coreset is communicated once, and the size of the coreset is constant.

## 2 Preliminaries

First, we review some basic definitions, models and algorithms in computational geometry and MapReduce.

### 2.1 Definitions

Some geometric definitions and notations that are used in the rest of the paper are reviewed here.

###### Definition 2 (Metric Space).

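A (possibly infinite) set of points $P$ and a distance function $d(\cdot,\cdot)$ form a metric space if the following three conditions hold:

• $d(p,q) = 0$ if and only if $p = q$

• $d(p,q) = d(q,p)$, known as symmetry

• $d(p,q) + d(q,t) \ge d(p,t)$, known as triangle inequality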

Metrics with bounded doubling dimension are called doubling metrics. Constant dimension Euclidean spaces under norms and Manhattan distance are examples of doubling metrics.

The doubling constant of a metric space is the number of balls of radius $r/2$ needed to cover a ball of radius $r$. The logarithm of the doubling constant in base 2 is called the doubling dimension. Many algorithms have better approximation factors in doubling metrics compared to general metric spaces. The doubling dimension of the Euclidean plane is $\log_2 7$.

###### Definition 3 (Doubling Dimension ).

For any point $p$ in a metric space and any $r > 0$, if the ball of radius $r$ centered at $p$ can be covered with at most $D$ balls of radius $r/2$, we say the doubling dimension of the metric space is $\log_2 D$.

k-Center is an NP-hard clustering problem with clusters in the shape of $d$-dimensional balls.

###### Definition 4 (Metric k-Center ).

Given a set $P$ of points in a metric space, find a subset $C \subseteq P$ of $k$ points as cluster centers such that

$$\forall p \in P,\quad \min_{c \in C} d(p,c) \le r$$

and $r$ is minimized.

The best possible approximation factor of metric k-center is 2.
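To make the objective of Definition 4 concrete, the sketch below evaluates the radius of a given center set and finds an optimal discrete solution by exhaustive search. It is only meant for tiny inputs; the names `kcenter_radius` and `optimal_kcenter` and the Euclidean distance are assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kcenter_radius(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Largest distance from any point to its closest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


def optimal_kcenter(points: Sequence[Point], k: int) -> Tuple[List[Point], float]:
    """Exhaustive search over all k-subsets of the input (exponential in k)."""
    best = min(combinations(points, k), key=lambda cs: kcenter_radius(points, cs))
    return list(best), kcenter_radius(points, best)
```

Trying all subsets of size k takes time exponential in k, which is why the approximation algorithms reviewed in this paper matter in practice.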

Geometric intersection graphs represent intersections between a set of shapes. For a set of disks, their intersection graph is called a disk graph.

###### Definition 5 (Disk Graph).

For a set of points in a metric space with distance function and a radius , the disk graph of is a graph whose vertices are , and whose edges connect points with distance at most .

###### Definition 6 (Dominating Set).

Given a graph $G=(V,E)$, the smallest subset $Q \subseteq V$ is a minimum dominating set if every vertex of $V$ is either in $Q$ or adjacent to a vertex of $Q$.
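Definitions 5 and 6 can be illustrated with a short sketch that builds the adjacency sets of a disk graph and checks whether a vertex subset is dominating. The parameter `threshold` (the distance up to which two points are joined) and both function names are hypothetical; how `threshold` relates to the disk radius follows whichever convention Definition 5 fixes.

```python
import math
from typing import Dict, Iterable, Sequence, Set, Tuple

Point = Tuple[float, ...]


def disk_graph(points: Sequence[Point], threshold: float) -> Dict[int, Set[int]]:
    """Adjacency sets over point indices; an edge joins points within threshold."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def is_dominating_set(adj: Dict[int, Set[int]], subset: Iterable[int]) -> bool:
    """True if every vertex is in the subset or adjacent to a member of it."""
    chosen = set(subset)
    return all(v in chosen or bool(adj[v] & chosen) for v in adj)
```

This check only verifies domination; finding a minimum dominating set is itself a hard problem and is not attempted here.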

We define the following problem as a generalization of the dual clustering of  by removing the following two conditions: the radius of balls is , and the set of points are in .

###### Definition 7 (Dual Clustering).

Given a set of points $P$ and a radius $r$, the dual clustering problem finds the smallest subset of points as centers such that the distance from each point to its closest center is at most $r$.
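A simple way to obtain a feasible (not necessarily smallest) solution for Definition 7 is a greedy scan: a point becomes a center whenever no existing center is within the given radius. This is a generic heuristic shown for illustration, not the paper's Algorithm 3; the name `greedy_dual_clustering` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def greedy_dual_clustering(points: Sequence[Point], r: float) -> List[Point]:
    """Return centers such that every point is within distance r of some center."""
    centers: List[Point] = []
    for p in points:
        # p becomes a new center only if no existing center already covers it
        if all(math.dist(p, c) > r for c in centers):
            centers.append(p)
    return centers
```

The chosen centers are pairwise more than r apart, so they form a maximal independent set of the graph joining points at distance at most r.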

### 2.2 An Approximation Algorithm for Metric k-Center

Here, we review the parametric pruning algorithm of  for metric k-center.

Using this algorithm on a metric graph , a -approximation for the optimal radius can be determined. In algorithm 1, edges are added by increasing order of their length until reaching . Given this radius, another graph is built, where edges exist between points within distance at most of each other.

Hence, by definition, a minimum dominating set of $G$ is an optimal k-center of $P$. Every cluster is a star in $G$, which turns into a clique in $G^2$. Therefore, a maximal independent set of $G^2$ chooses at most one point from each cluster. Algorithm 2 computes $G^2$ and returns a maximal independent set of $G^2$.

Computing a maximal independent set takes time. The graph in Algorithm 2 only changes in each iteration of Algorithm 1 around the newly added edge, so, updating the previous graph and takes time. Therefore, the time complexity of Algorithm 1 is .
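The parametric pruning idea can be sketched as follows. This is a simplified variant written under stated assumptions: candidate radii are taken from the sorted pairwise distances, and instead of squaring the graph explicitly, a greedy maximal independent set is computed with distance threshold 2r. The function names are hypothetical and the listing is not a transcription of Algorithms 1 and 2.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def independent_centers(points: Sequence[Point], threshold: float) -> List[Point]:
    """Greedy maximal independent set of the graph joining points within threshold."""
    centers: List[Point] = []
    for p in points:
        if all(math.dist(p, c) > threshold for c in centers):
            centers.append(p)
    return centers


def parametric_pruning(points: Sequence[Point], k: int) -> Tuple[List[Point], float]:
    """Return (centers, r) for the smallest candidate r that needs at most k centers.

    Every input point is within distance 2 * r of some returned center.
    """
    radii = sorted({math.dist(p, q) for p, q in combinations(points, 2)})
    for r in [0.0] + radii:
        centers = independent_centers(points, 2 * r)
        if len(centers) <= k:
            return centers, r
    raise ValueError("k must be at least 1 and points must be non-empty")
```

Centers picked at threshold 2r are pairwise more than 2r apart, so once r reaches the optimal radius no two of them fit in one optimal cluster. At most k centers are chosen, and the covering radius 2r is at most twice the optimum.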

## 3 A Coreset for Dual Clustering in Doubling Metrics

In this section, we prove a better approximation offline coreset for the dual clustering problem. Our method is based on Algorithm 1 which first builds the disk graph with radius , then covers this graph using a set of stars. We prove the maximum degree of those stars is , where is the doubling constant. The result is an approximation algorithm for dual clustering in doubling metrics.

### 3.1 Algorithm

We add a preprocessing step to Algorithm 1 to find a better approximation factor for -center and dual clustering problems.

### 3.2 Analysis

Unlike in general metric spaces, k-center in doubling metrics admits a space-approximation factor trade-off. More specifically, doubling or halving the radius of k-center changes the number of points in the coreset by a constant factor, since the degrees of vertices in the minimum dominating set are bounded in those metric spaces.

###### Lemma 1.

For each cluster $B$ of Algorithm 3 with radius $r'$, the maximum number of points from $P$ that are required to cover all points inside $B$ with radius $r'/2$ is at most $D^2$, i.e.

$$(\Delta+1) \le D^2,$$

where $D$ is the doubling constant of the metric space.

###### Proof.

Consider a point $p$ returned by Algorithm 3. By the definition of doubling metrics, there are $D$ balls of radius $r'/2$ centered at $b_1,\dots,b_D$, called $B_1,\dots,B_D$, that cover the ball of radius $r'$ centered at $p$, called $B$.

$$\forall q \in B,\ \exists B_i,\ i=1,\dots,D:\quad d(p,b_i) \le r'/2$$

Repeating this process for each ball $B_i$ results in a set of at most $D^2$ balls $B'_{i,j}$ of radius $r'/4$ centered at $b'_{i,j}$.

$$\forall q \in B'_{i,j},\quad d(b'_{i,j},q) \le r'/4$$

Choose a point $p_{i,j} \in B'_{i,j}$. Using the triangle inequality,

$$\forall q \in B'_{i,j},\quad d(p_{i,j},q) \le d(p_{i,j},b'_{i,j}) + d(b'_{i,j},q) \le r'/4 + r'/4 = r'/2.$$

We claim any minimal solution needs at most one point from each ball $B'_{i,j}$. By contradiction, assume there are two points in the minimal solution that lie inside the same ball $B'_{i,j}$. After removing one of them, the ball with radius $r'/2$ centered at the other point $p_{i,j}$ still covers $B'_{i,j}$, since:

$$\forall q \in P,\ \exists B_i, B'_{i,j} \ni q, p_{i,j}:\quad d(q,p_{i,j}) \le d(q,b'_{i,j}) + d(b'_{i,j},p_{i,j}) \le r'/4 + r'/4 = r'/2.$$

Then we have found a point whose removal decreases the size of the solution, which means the solution was not minimal. So, the size of any minimal set of points covering $B$ is at most $D^2$. ∎

###### Lemma 2.

In a metric space with doubling constant $D$, if a dual clustering with radius $r$ has $k$ points, then a dual clustering with radius $r/2$ exists which has $D^2 k$ points.

###### Proof.

Let $c$ be a center in the k-center problem. Based on the proof of Lemma 1, there are $\Delta$ vertices adjacent to $c$ that cover the points inside the ball of radius $r$ centered at $c$, using balls of radius $r/2$ and a ball of radius $r/2$ centered at $c$. By choosing all these vertices as centers, it is possible to cover all input points with radius $r/2$. Using the same reasoning for all clusters, it is possible to cover all points using $k(\Delta+1)$ centers. Using the bound in Lemma 1, these are at most $D^2 k$ centers. ∎

###### Theorem 1.

The approximation factor of Algorithm 3 is for the dual clustering.

###### Proof.

Since the radius of balls in Lemma 2 is at most the optimal radius for -center, the approximation factor of dual clustering is the number of points chosen as centers divided by , which is . ∎

###### Theorem 2.

The approximation factor of the coreset for -center in Algorithm 3 is and its size is .

###### Proof.

Applying Lemma 2 halves the radius and multiplies the number of points by . So, applying this lemma times gives points since it might be the case that in the first step of the algorithm the optimal radius was found, and we divided it by . The radius remains because of the case where we had found a -approximation. ∎

###### Theorem 3.

Algorithm 3 given as input, is a -approximation coreset of size for the -center problem.

###### Proof.

For , the proof of Theorem 2 gives points and radius . Assume is the set of centers returned by the optimal algorithm for point-set , and is the set of centers returned by running the optimal algorithm on the coreset of . For any point , let be the center that covers and be the point that represents in the coreset. Using triangle inequality:

$$d(p,c) \le d(p,o) + d(o,c) \le r + r\epsilon = (1+\epsilon)\,r$$

So, computing a k-center on this coreset gives a $(1+\epsilon)$-approximation. ∎

## 4 A Composable Core-Set for k-Center in Doubling Metrics

Our general algorithm for constructing coresets based on dual clustering has the following steps:

• Compute the cost of an approximate solution .

• Find a composable coreset for dual clustering with cost .

• Compute a clustering on the coreset.

In this section, we use this general algorithm for solving k-center.
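The three steps above can be sketched as a small pipeline. Everything specific in the sketch is an assumption made for illustration: the local cost estimate is the farthest-first radius on each part, the local coreset is a greedy cover with radius `eps * r_i / 2`, and the final clustering again uses farthest-first traversal. The function names are hypothetical and the listing is not the paper's Algorithm 4.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def farthest_first(points: Sequence[Point], k: int) -> Tuple[List[Point], float]:
    """Greedy k-center; returns (centers, covering radius)."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)


def greedy_cover(points: Sequence[Point], r: float) -> List[Point]:
    """Greedy dual clustering: every point ends up within r of a chosen point."""
    chosen: List[Point] = []
    for p in points:
        if all(math.dist(p, c) > r for c in chosen):
            chosen.append(p)
    return chosen


def coreset_kcenter(parts: Sequence[Sequence[Point]], k: int, eps: float):
    coreset: List[Point] = []
    for part in parts:  # each part would live on its own machine
        _, r_i = farthest_first(part, k)                     # step 1: approximate cost
        coreset.extend(greedy_cover(part, eps * r_i / 2))    # step 2: local coreset
    centers, _ = farthest_first(coreset, k)                  # step 3: cluster the union
    return centers, coreset
```

Only the local coresets are communicated, so the traffic depends on the coreset size rather than on the size of the parts.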

### 4.1 Algorithm

Knowing the exact or approximate value of the optimal radius $r$, we can find a single-round $(2+\epsilon)$-approximation for metric k-center in MapReduce. Although the algorithm achieves the aforementioned approximation factor, the size of the coreset and the communication complexity of the algorithm depend highly on the doubling dimension.

Based on the running time of Algorithm 2 and Gonzalez’s algorithm, the running time of Algorithm 4 is . Since the sum of running times of machines is of the same order as the best sequential algorithm, Algorithm 4 is a work-efficient parallel algorithm.

We review the following well-known lemma:

###### Lemma 3.

For a subset $S \subseteq P$, the optimal radius of the k-center of $S$ is at most twice the radius of the k-center of $P$.

###### Proof.

Consider the set of clusters $O_1,\dots,O_k$ in the optimal k-center of $P$ centered at $c_1,\dots,c_k$ with radius $r$. If $c_i \in S$, then the points of $O_i \cap S$ are covered by $c_i$ with radius $r$, as before. Otherwise, select an arbitrary point in $O_i \cap S$ as the new center $c'_i$. Using the triangle inequality on $c_i$, $c'_i$ and any point $p \in O_i \cap S$:

$$d(p,c'_i) \le d(p,c_i) + d(c_i,c'_i) \le r + r = 2r,$$

since $c'_i$ was covered using $c_i$ with radius $r$. So, the set $S$ can be covered with radius $2r$. Note that since we choose at most one point from each set, the number of new centers is at most $k$. ∎

###### Theorem 4.

The approximation factor of Algorithm 4 is $2+\epsilon$ for metric k-center.

###### Proof.

Let be the optimal radius of -center for . Since , using Lemma 3, the radius of -center for is at most . The radius of -center inside each set is at most for the same reason. The algorithm computes a covering with balls of radius . Based on the fact that offline -center has -approximation algorithms and the triangle inequality, the approximation factor of the algorithm proves to be -approximation (Figure 3). Let , then

$$\forall s \in S_i\ \exists c \in C:\quad d(s,c) \le d(s,p) + d(p,c) \le r' + r_i\epsilon/2 \le 2r + 2r\epsilon/2 = (2+\epsilon)\,r$$

where $r_i$ is the radius of the offline k-center algorithm on $S_i$. ∎

### 4.2 Analysis

###### Lemma 4.

In a metric space with doubling constant , the union of dual clusterings of radius computed on sets is a -approximation for the dual clustering of radius of their union .

###### Proof.

Each center in the dual clustering with radius of has at most adjacent vertices covered by this center. Consider a point covered by center in a solution for . If and belong to the same set , assign to . Otherwise, pick any point that was previously covered by as the center that covers .

While this might increase the radius by a factor , it does not increase the number of centers in each set. Since the algorithm uses radius , it increases the number of centers to (based on Theorem 2 for ) but keeps the approximation factor of the radius to . There are such sets, so the size of the coreset is . ∎

###### Theorem 5.

Algorithm 4 returns a coreset of size for -center in metric spaces with fixed doubling dimension.

###### Proof.

The coreset of each set $S_i$ has a radius $r_i$ varying from the optimal radius $r$ to $2\beta r$, where $\beta$ is the approximation factor of the offline algorithm for k-center. Clearly, the lower bound holds because any radius is at least as much as the optimal (minimum) radius, which means $r \le r_i$; and Lemma 3, when applied to $S_i$, yields the upper bound.

$$r \le r_i \le 2\beta r \;\Rightarrow\; \frac{r\epsilon}{4\beta} \le \frac{r_i\epsilon}{4\beta} \le \frac{\epsilon r}{2}$$

Reaching value requires applying Theorem 4 at most times.

The size of the resulting coreset is therefore at most

$$\left(4^{\log_2 D}\right)^{\log_2 \frac{4\beta}{\epsilon}} kL = \left(\frac{4\beta}{\epsilon}\right)^{2\log_2 D} kL.$$

Here, we use the best approximation factor for metric -center , which gives a coreset of size for fixed . ∎
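The equality in the size bound of this proof follows from standard logarithm identities. Spelled out as a short derivation, using only the symbols $D$, $\beta$ and $\epsilon$ from the proof:

```latex
\begin{align*}
4^{\log_2 D} &= \left(2^{\log_2 D}\right)^{2} = D^{2},\\
\left(D^{2}\right)^{\log_2 \frac{4\beta}{\epsilon}}
  &= 2^{\,2\log_2 D \cdot \log_2 \frac{4\beta}{\epsilon}}
   = \left(\frac{4\beta}{\epsilon}\right)^{2\log_2 D}.
\end{align*}
```

The first line is the factor $D^2$ of Lemma 1 written as a power of 4, and the second line applies it once per halving of the radius.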

### 4.3 Generalized Approximation Factor

We prove that any $\alpha$-approximation algorithm that does not choose a center from the points of another cluster can be used instead of Gonzalez’s algorithm in the MapReduce algorithm of , and a similar proof will give the approximation factor $2\alpha$. Algorithm 5 shows the generalized algorithm.

###### Theorem 6.

Algorithm 5 given an -approximation metric -center algorithm with which does not choose a center from the points of another cluster, finds a -approximation solution.

###### Proof.

Assume is the optimal -center radius of . We prove that covers with radius at most . Suppose there is a point whose distance to its nearest point from is more than , so . The distance between each pair of points from is at least , since the algorithm never chooses a point as a center if it is within distance of another center. Therefore, the set has points with distance at least from each other. There are at most optimal clusters, so at least two of these points must lie inside a cluster, which means their distance is at most . This means that , which contradicts the previous bound .

A similar proof follows for and . Using triangle inequality, the distance from any point to its local center and its final center is bounded by:

$$d(p,c(p)) + d(c(p),c'(c(p))) \le \alpha r^{*} + \alpha r^{*} = 2\alpha r^{*}.$$

Note that the parametric pruning algorithm finds a dominating set by computing a maximal independent set, so the centers returned by this algorithm do not lie inside each others’ clusters.
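The two-level scheme analyzed in Theorem 6 can be sketched as follows, here with farthest-first traversal standing in for the $\alpha$-approximation subroutine (so $\alpha = 2$ in this illustration). The function names are hypothetical, and the listing is a sketch of the general idea rather than the paper's Algorithm 5.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def farthest_first(points: Sequence[Point], k: int) -> List[Point]:
    """Greedy k-center used as the offline subroutine."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers


def two_level_kcenter(parts: Sequence[Sequence[Point]], k: int) -> List[Point]:
    # level 1: every part computes k local centers independently
    local = [c for part in parts for c in farthest_first(part, k)]
    # level 2: the local centers are clustered again on a single machine
    return farthest_first(local, k)


def radius(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Covering radius of the centers over the points."""
    return max(min(math.dist(p, c) for c in centers) for p in points)
```

A point reaches its final center through its local center, which is the two-hop path bounded by the triangle inequality in the proof.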

### 4.4 A (1+ϵ)-Composable Core-Set

The composable coreset for -center in doubling metrics can be used to obtain a -approximation for constant and . All these results also hold for dual clustering, as a result of the proven trade-off between and .

###### Theorem 7.

Algorithm 6 gives a -approximation for -center in doubling metrics, for fixed and .

###### Proof.

The approximation factor of is and its size is , based on Theorem 5. Repeating the core-set computation gives the approximation factor , and has size as proved in Theorem 3. Checking all possible choices for centers from takes polynomial time, for fixed and , since:

$$\binom{(4/\epsilon)^{2\log_2 D}\,k}{k} \le \left(\frac{e\,(4/\epsilon)^{2\log_2 D}\,k}{k}\right)^{k} = e^{k}\,(4/\epsilon)^{2k\log_2 D}.$$

Since the last step was optimal, the approximation factor of for -center is . ∎

## 5 The Exponential Nature of The Trade-off Between r and k

The same constructive algorithm yields an exponential lower bound on the trade-off between $r$ and $k$ of k-center.

We build the following example by placing a point at the center of each ball from the ball covering problem using balls of radius $r/2$ to cover a ball of radius $r$, and repeating this process recursively.

###### Example 1.

Cover the ball of radius $r$ with $D$ balls of radius $r/2$, where $D$ is the doubling constant of the metric space. Repeat this process with each of the balls of radius $r/2$. The number of balls in the $i$-th iteration of this process is $D^i$ and their radius is $r/2^i$.
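Assuming that each step replaces one ball of radius $\rho$ by $D$ balls of radius $\rho/2$ (the doubling property), the recursion of Example 1 can be unrolled as below. The symbols $N_i$ and $\rho_i$ and the target radius $\epsilon r$ are introduced only for this illustration.

```latex
\begin{align*}
N_0 &= 1, & \rho_0 &= r,\\
N_i &= D \cdot N_{i-1} = D^{i}, & \rho_i &= \rho_{i-1}/2 = r/2^{i},\\
\rho_i = \epsilon r &\iff i = \log_2(1/\epsilon), &
N_i &= D^{\log_2(1/\epsilon)} = (1/\epsilon)^{\log_2 D}.
\end{align*}
```

Each halving of the radius multiplies the number of balls by $D$, so the number of balls grows exponentially in the number of halvings.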

###### Lemma 5.

A circle packing of radius with circles of radius is an upper bound for the ball cover using circles of radius , and the circle packing using circles of radius is a lower bound for the ball cover of radius .

###### Proof.

The circle packing has the maximum number of circles, so there is no room for more circles in the empty spaces between those circles. Therefore, increasing the radius of circles to twice the previous radius will cover the circle of radius . So, the circle packing for circles of radius gives an upper bound on the minimum number of circles required to cover a circle of radius .

On the other hand, circle packing for circles of radius is a lower bound for the minimum number of circles required to cover the circle of radius , since all those circles are disjoint. ∎

###### Theorem 8.

The optimal trade-off between $r$ and $k$ is exponential.

###### Proof.

Based on Lemma 5, Theorem 2 gives both a lower bound and an upper bound on the trade-off between and for points and radius , within a constant factor for doubling metrics. Lemma 2 gives the upper bound for each step, and the lower bound in Example 1 is . Substituting this bound in the trade-off of Theorem 2 gives the ratio between the upper bound and the lower bound of in this trade-off which is , where is the radius of the balls used for covering the points. ∎

Better trade-offs in and can be achieved by replacing with the square of the bound from circle/sphere covering for radius in Theorem 8 instead.

## 6 A Comparison of The Algorithms for Metric k-Center

We consider variations of Gonzalez’s greedy algorithm and the parametric pruning algorithm in which arbitrary choices are replaced by random ones. In the worst case, even the randomized version of these algorithms cannot achieve an approximation factor better than .

We also prove the solutions of Gonzalez’s algorithm are a subset of the solutions of the parametric pruning algorithm.

###### Lemma 6.

There are examples in which randomized Gonzalez’s algorithm cannot do better than -approximation in the best case.

###### Proof.

We prove this lemma by the counterexample of Figure 4. The measures of the example are as follows:

$$d(P_i,P'_i)=\epsilon,\quad d(P'_i,P'_j)=1-\epsilon,\quad d(P_i,P_j)=1.$$

The farthest neighbor computation prevents the algorithm from choosing solutions of cost 1 such as or .

###### Lemma 7.

The solutions of Gonzalez’s algorithm are a subset of the solutions of the parametric pruning algorithm.

###### Proof.

Let be the radius and be the set of centers computed by Gonzalez’s algorithm, after removing the last centers if they do not decrease the cluster radius. Consider the graph such that is the set of input points and is the set of all pairs of points with distance at most . By the anti-cover property of Gonzalez’s algorithm, is an independent set of .

Since the maximal independent set algorithm visits vertices in an arbitrary order, use the order of visiting used in Gonzalez’s algorithm. Consider an instance of the parametric pruning algorithm that at the -th step, visits the points of in the order of Gonzalez’s algorithm after it has chosen centers. In such an instance, all the edges between the points of and their corresponding members from have already appeared in the sorted list of edges, since they have a lower edge weight than . Also, there are no edges between the points of , since Gonzalez’s algorithm chooses the farthest point from previous points, so the distance between the centers is more than . Therefore, is a maximal independent set of the disk graph of radius . All the radii less than that are checked by the parametric pruning algorithm will fail since is the minimum radius that covers using points of . For radius , the parametric pruning algorithm finds the solution with as centers.

We proved that there is an execution of the maximal independent set algorithm on the square of that finds as the set of centers. ∎

###### Lemma 8.

There are examples in which randomized parametric pruning algorithm for -center cannot do better than a -approximation in the best case.

###### Proof.

Any solution in the form of a dominating set of the unit disk graph that is not also an independent set is a solution that the parametric pruning algorithm cannot find. See Figure 5 for an example. In this example, $\{p_1,p_2\}$ are an optimal solution, but the parametric pruning algorithm does not find them. is a solution that the parametric pruning algorithm can find, because it is an independent set.

Figure 5: {p1,p2} is a dominating set which is not also an independent set.

## 7 Efficient Parametric Pruning Algorithm

We need to keep the time and space complexity of the coreset computation algorithm near linear. Since the time complexity of parametric pruning in general metrics is and its space complexity is , we give a $(2+\epsilon)$-approximation algorithm with $O(nk/\epsilon)$ time and $O(n)$ space. Later, we use this algorithm to find a -approximation algorithm for general metrics.

###### Theorem 9.

The time complexity of Algorithm 7 is $O(nk/\epsilon)$ and its memory complexity is $O(n)$.

###### Proof.

For each point, if it has been visited before, the algorithm ignores it; otherwise the algorithm chooses it as the next center, in which case it checks at most $n$ other points. Since there are at most $k$ centers, using the aggregate method for amortized analysis, the running time of the algorithm is $O(nk)$ for each candidate radius. The number of values that are checked by the algorithm is . Using Taylor series , the overall time complexity is $O(nk/\epsilon)$. ∎

###### Theorem 10.

The approximation factor of Algorithm 7 is $2+\epsilon$ for metric k-center.

###### Proof.

Consider the disk graph of points with radius , where is the optimal radius. Using this radius, each cluster turns into a clique, so the maximal independent set subroutine of parametric pruning algorithm chooses at most one point from each cluster.

Algorithm 7 computes a maximal independent set of the disk graph of radius at each step. For , at most points are marked as centers by the algorithm. The algorithm starts from a lower bound on the radius and multiplies it by . So, in the worst case the first radius that the algorithm checks which exceeds is . ∎
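The radius search described in this proof can be sketched as a geometric scan over candidate radii. The sketch makes several assumptions that are not taken from the paper: the caller supplies a positive lower bound `lo` on the optimal radius, each candidate is multiplied by `1 + eps`, and the pruning step stops as soon as more than k centers would be needed. Both function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def prune(points: Sequence[Point], threshold: float, k: int) -> Optional[List[Point]]:
    """Greedy independent centers; None as soon as more than k would be needed."""
    centers: List[Point] = []
    for p in points:
        # each point is compared with at most k centers, so one pass costs O(nk)
        if all(math.dist(p, c) > threshold for c in centers):
            if len(centers) == k:
                return None
            centers.append(p)
    return centers


def geometric_pruning(points: Sequence[Point], k: int, eps: float,
                      lo: float) -> Tuple[List[Point], float]:
    """Return (centers, covering radius); lo must satisfy 0 < lo <= optimal radius."""
    rho = lo
    while True:
        centers = prune(points, 2 * rho, k)
        if centers is not None:
            return centers, 2 * rho
        rho *= 1 + eps  # try the next, slightly larger, candidate radius
```

If `lo` is within a constant factor of the optimal radius (for example half the radius of a 2-approximation), the loop runs O(1/ϵ) times, giving O(nk/ϵ) time overall. The returned covering radius is at most 2(1+ϵ) times the optimum.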

## 8 Connectivity Preservation and Applications to DBSCAN

Computing the connected components of a graph is harder than testing the connectivity between two vertices of the graph. It has been conjectured that sparse -connectivity in rounds and single-linkage clustering in high dimensions cannot be solved using a constant number of MapReduce rounds, by reduction from connectivity problem .

In DBSCAN, a point that has at least other points within distance from it is called a core point. A cluster is a connected component of the intersection graph of balls of radius centered at core points. Any point that is not within distance of a core point is an outlier. Therefore, the algorithm can be seen as two main steps: simultaneous range counting queries, and computing the connected components of the disk graph. Both of these problems are challenging in the MapReduce model.
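The two steps described above (range counting, then connected components) can be sketched sequentially as follows. The sketch assumes the standard DBSCAN convention that two core points are linked when their distance is at most `eps`; the parameter names `eps` and `min_pts` and the quadratic-time neighbor search are illustrative choices, not the MapReduce algorithm of this paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return a cluster label per point; -1 marks outliers."""
    n = len(points)
    # step 1: range counting, here done by brute force in O(n^2) time
    near = [[j for j in range(n)
             if j != i and math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in near]
    # step 2: connected components over core points, absorbing border points
    label = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or label[i] != -1:
            continue
        label[i] = cluster
        stack = [i]
        while stack:
            u = stack.pop()
            for v in near[u]:
                if label[v] == -1:
                    label[v] = cluster
                    if core[v]:      # only core points keep expanding the cluster
                        stack.append(v)
        cluster += 1
    return label
```

Points that are neither core points nor within `eps` of a core point keep the label -1, matching the definition of an outlier given above.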

We use dual clustering to solve a non-convex clustering problem in MapReduce. Several MapReduce algorithms for density-based spatial clustering of applications with noise (DBSCAN) have been presented [26, 13, 19, 36, 29]. However, they lack theoretical guarantees. We use the abstract DBSCAN algorithm , which only differs from the original DBSCAN algorithm  in its time complexity , but computes the range counting queries prior to computing the connected components of the disk graph.

Several algorithms for range counting queries exist in MapReduce, but it is not possible in the model to run instances of single-query range search  simultaneously, since the data from one machine could be used in the solution of points from machines, for a constant . Mergeable summaries for range counting queries are randomized approximation algorithms which are also composable . Note that range queries for unit disks in can be converted into rectangular range queries in , via linearization , therefore, any algorithm for rectangular range counting also solves the problem for disk range counting.

Our core-set for dual clustering of radius , approximately preserves the connectivity of edges of weight at most between clusters.

###### Lemma 9.

For two cluster centers of clusters of radius , they are said to be connected if there is a point . Algorithm 3 with radius detects if such two cluster centers are connected or not.

###### Proof.

By the definition of clustering, the distance from each point to its cluster center is at most $\epsilon/2$, so

$$d(p,c_i)\le \epsilon/2,\qquad d(q,c_j)\le \epsilon/2.$$

If , then using the triangle inequality twice gives the following result:

$$d(c_i,c_j)\le d(c_i,p)+d(p,c_j)\le \epsilon/2+\epsilon/2=\epsilon.$$
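The bound can be checked numerically on a toy instance. The sketch below assumes that two centers count as connected when some input point lies within $\epsilon/2$ of both of them; the function name and the sample data are hypothetical.

```python
import math
from itertools import combinations

def connected_center_pairs(points, centers, eps):
    """Pairs (i, j) of centers whose eps/2-clusters share a point."""
    pairs = set()
    for i, j in combinations(range(len(centers)), 2):
        shared = any(
            math.dist(p, centers[i]) <= eps / 2
            and math.dist(p, centers[j]) <= eps / 2
            for p in points
        )
        if shared:
            # Triangle inequality: d(c_i, c_j) <= eps/2 + eps/2 = eps.
            assert math.dist(centers[i], centers[j]) <= eps + 1e-12
            pairs.add((i, j))
    return pairs

centers = [(0, 0), (1, 0), (5, 0)]
points = [(0.5, 0), (5, 0)]
pairs = connected_center_pairs(points, centers, eps=1.0)
```

Here only the first two centers share a point, and their distance is exactly $\epsilon$, which shows the bound can be tight.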

Algorithm 8 with the minimum number of points of each cluster set to one, and then using the dual clustering, can be used to solve the DBSCAN problem in doubling metrics.

###### Theorem 11.

Algorithm 8 solves DBSCAN using rounds of MapReduce, given that , where is the size of the output.

###### Proof.

Using Lemma 9, has the same connected components for the points of as the optimal solution. Therefore, connecting each point to its nearest neighbor in gives an exact DBSCAN clustering.

Let be the number of points required to represent the clusters. Based on Theorem 1, the number of points returned by the algorithm is at most . Sending data from machines to one machine requires .

Running Algorithm 3 takes rounds, computing the union and sending the clusters to all machines each take one round. So, the total number of rounds is . ∎

Note that even without sending the points to a single machine, the set in Algorithm 8 is still a composable core-set for DBSCAN.

## 9 Experimental Results

The data sets used in our experiments are described in Table 2. Euclidean distance is used for all data sets. Note that the DEXTER data set is not doubling, since it has a higher dimension than the number of its instances.

The size of data chunks used for partitioning the data is .

### 9.1 Randomized Gonzalez vs. Randomized Parametric Pruning

In Section 6, we proved that the solutions of Gonzalez’s algorithm are a subset of the solutions of the parametric pruning algorithm. We compare the randomized versions of these algorithms, in which arbitrary choices are replaced by random ones, and empirically compare the approximation factors of the resulting algorithms.

Figure 6: A comparison of Gonzalez’s greedy algorithm and the parametric pruning algorithm on Parkinson data-set.

Figure 7: A comparison of Gonzalez’s greedy algorithm and the parametric pruning algorithm on Dexter data-set.

The experiments show that the effect of randomization when choosing the points is slight; the differences between the approximation factors of the two algorithms are more significant. Figure 6 gives the results for a data-set in a low-dimensional Euclidean space, which is a doubling metric. Figure 7 shows the results for a high-dimensional Euclidean space, which is not a doubling metric.
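For reference, the greedy baseline in this comparison is the farthest-first traversal. The sketch below is a plain sequential version with a deterministic first center, not the randomized variant used in the experiments; the function name and the `first` parameter are assumptions.

```python
import math

def gonzalez(points, k, first=0):
    """Farthest-first traversal: returns (centers, covering radius)."""
    centers = [points[first]]
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        if dist[far] == 0:  # every point already coincides with a center
            break
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far]))
                for d, p in zip(dist, points)]
    return centers, max(dist)

pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
centers, radius = gonzalez(pts, k=2)
```

Each new center is the point farthest from the centers chosen so far, which corresponds to one particular ordering of the greedy independent-set choices in parametric pruning.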

### 9.2 A Comparison in MapReduce

In this experiment, we compared the approximation factor of the efficient parametric pruning algorithm (Algorithm 7) using with the greedy algorithm of Gonzalez extended to MapReduce .

Figure 8: A comparison of Gonzalez’s greedy algorithm and the parametric pruning algorithm on Higgs data-set.

Figure 9: A comparison of Gonzalez’s greedy algorithm and the parametric pruning algorithm on Power data-set.

The radii of Gonzalez’s greedy algorithm in MapReduce are times the radii of the parametric pruning algorithm on average on the Higgs (Figure 8) and Power (Figure 9) data-sets.

## 10 Conclusions

Gonzalez’s algorithm  is a special case of the parametric pruning algorithm  in which the greedy maximal independent set computation prioritizes the points with the maximum distance from the currently chosen points. Our algorithm and trade-off partially answer the open question of  about comparing and improving these two algorithms in MapReduce. We propose a modified parametric pruning algorithm with running time that achieves a better approximation factor in practice. Finding algorithms with a provable approximation factor in the worst case and better approximation factors on average remains open.

We also proved that the best possible trade-off between the approximation factor and the number of centers of k-center in doubling metrics is exponential.

Our composable coreset for dual clustering gives a constant-factor approximation for minimizing the size of DBSCAN cluster representatives, given that the neighbor counting is done prior to computing the coreset and the connected components. Finding a summarization technique that can preserve both the number of near neighbors and the connectivity between clusters in general metrics remains open. Note that in doubling metrics, keeping the number of points assigned to each center approximately solves this problem.

## References

•  P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):26, 2013.
•  P. K. Agarwal, K. Fox, K. Munagala, and A. Nath. Parallel algorithms for constructing range and nearest-neighbor searching data structures. In Proc. 35th ACM SIGMOD-SIGACT-SIGAI Sympos. Princ. Database Syst., pages 429–440. ACM, 2016.
•  P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004.
•  S. Aghamolaei, M. Farhadi, and H. Zarrabi-Zadeh. Diversity maximization via composable coresets. In Proc. 27th Canad. Conf. Computat. Geom., 2015.
•  S. Aghamolaei and M. Ghodsi. A composable coreset for k-center in doubling metrics. In Proc. 30th Canad. Conf. Computat. Geom., 2018.
•  M. Bādoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In Proc. 34th Annu. ACM Sympos. Theory Comput., pages 250–257. ACM, 2002.
•  P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.
•  M. Bateni, A. Bhaskara, S. Lattanzi, and V. Mirrokni. Distributed balanced clustering via mapping coresets. In Advances in Neural Information Processing Systems (NIPS), pages 2591–2599, 2014.
•  M. Ceccarello, A. Pietracaprina, and G. Pucci. Solving k-center clustering (with outliers) in mapreduce and streaming, almost as accurately as sequentially, 2018.
•  M. Ceccarello, A. Pietracaprina, G. Pucci, and E. Upfal. Mapreduce and streaming algorithms for diversity maximization in metric spaces of bounded doubling dimension. Proceedings of the VLDB Endowment, 10(5):469–480, 2017.
•  M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417–1440, 2004.
•  M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended abstract). In Proc. 31st Annu. ACM Sympos. Theory Comput., pages 1–10, New York, NY, USA, 1999. ACM.
•  B.-R. Dai and I.-C. Lin. Efficient map/reduce-based dbscan algorithm with optimized data partition. In 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), pages 59–66. IEEE, 2012.
•  D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
•  H. Ding. Greedy strategy works for clustering with outliers and coresets construction. arXiv preprint arXiv:1901.08219, 2019.
•  A. Ene, S. Im, and B. Moseley. Fast clustering using mapreduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pages 681–689. ACM, 2011.
•  M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), volume 96, pages 226–231, 1996.
•  T. Feder and D. Greene. Optimal algorithms for approximate clustering. In Proc. 20th Annu. ACM Sympos. Theory Comput., pages 434–444. ACM, 1988.
•  Y. X. Fu, W. Z. Zhao, and H. F. Ma. Research on parallel dbscan algorithm design based on mapreduce. In Advanced Materials Research, volume 301, pages 1133–1138. Trans Tech Publ, 2011.
•  J. Gan and Y. Tao. Dbscan revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 519–530, 2015.
•  T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science (TCS), 38:293–306, 1985.
•  S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering (TKDE), 15(3):515–528, 2003.
•  A. Gupta, R. Krauthgamer, and J. R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In Proc. 44th Annu. IEEE Sympos. Found. Comput. Sci., pages 534–543. IEEE, 2003.
•  I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the nips 2003 feature selection challenge. In Advances in Neural Information Processing Systems (NIPS), pages 545–552, 2005.
•  S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput., 35(5):1148–1184, 2006.
•  Y. He, H. Tan, W. Luo, S. Feng, and J. Fan. Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Frontiers of Computer Science, 8(1):83–99, 2014.
•  S. Im and B. Moseley. Brief announcement: Fast and better distributed mapreduce algorithms for k-center clustering. In Proc. 27th ACM Sympos. Parallel Algorithms Architect., pages 65–67. ACM, 2015.
•  P. Indyk, S. Mahabadi, M. Mahdian, and V. S. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proc. 33rd ACM SIGMOD-SIGACT-SIGAI Sympos. Princ. Database Syst., pages 100–108. ACM, 2014.
•  Y. Kim, K. Shim, M.-S. Kim, and J. S. Lee. Dbcure-mr: an efficient density-based clustering algorithm for large data using mapreduce. Inf. Syst., 42:15–35, 2014.
•  S. Li and O. Svensson. Approximating k-median via pseudo-approximation. SIAM J. Comput., 45(2):530–547, 2016.
•  C. Liao and S. Hu. Polynomial time approximation schemes for minimum disk cover problems. J. Comb. Optim., 20(4):399–412, 2010.
•  M. A. Little, P. E. McSharry, S. J. Roberts, D. A. Costello, and I. M. Moroz. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Biomedical engineering online, 6(1):23, 2007.
•  G. Malkomes, M. J. Kusner, W. Chen, K. Q. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems (NIPS), pages 1063–1071, 2015.
•  J. McClintock and A. Wirth. Efficient parallel algorithms for k-center clustering. In Parallel Processing (ICPP), 2016 45th International Conference on, pages 133–138. IEEE, 2016.
•  V. Mirrokni and M. Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In Proc. 47th Annu. ACM Sympos. Theory Comput., pages 153–162. ACM, 2015.
•  M. Noticewala and D. Vaghela. Mr-idbscan: Efficient parallel incremental dbscan algorithm using mapreduce. International Journal of Computer Applications (IJCA), 93(4), 2014.
•  E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu. Dbscan revisited, revisited: why and how you should (still) use dbscan. ACM Trans. Database Syst., 42(3):19, 2017.
•  C. D. Toth, J. O’Rourke, and J. E. Goodman. Handbook of discrete and computational geometry. Chapman and Hall/CRC, 2017.
•  V. V. Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.
•  G. Yaroslavtsev and A. Vadapalli. Massively parallel algorithms and hardness for single-linkage clustering under ℓ_p-distances. arXiv preprint arXiv:1710.01431, 2017.