1 Introduction
Coresets are subsets of points that approximate a measure of the point set. Composable coresets [28] are a method for computing coresets on big data sets: they provide a framework for adapting constant-factor approximation algorithms to the streaming and MapReduce models. Composable coresets summarize distributed data, increasing scalability while keeping the desirable approximation factor and time complexity.
There is a general algorithm for solving problems using coresets which is known by different names in different settings: mergeable summaries [1] and merging in a tree-like structure [3] for streaming approximation algorithms, small space (divide and conquer) for constant-factor approximations in streaming [22], and composable coresets in MapReduce [28]. A consequence of using constant-factor approximations instead of exact solutions with the same merging method is that it can add a constant factor to the approximation factor of the algorithm.
Composable coresets [28] require only a single round and sublinear communication in the MapReduce model, and the partitioning can be done arbitrarily.
Definition 1 (Composable Coreset).
A composable coreset on a collection of sets S_1, …, S_m is a set of subsets C_1 ⊆ S_1, …, C_m ⊆ S_m whose union gives an approximate solution for an objective function f. Formally, a composable coreset of a minimization problem is an α-approximation if
f(∪_i C_i) ≤ α · f(∪_i S_i).
The maximization version is defined similarly.
A partitioned composable coreset is a composable coreset in which the initial sets form a partitioning, i.e. the sets are disjoint. Using Gonzalez’s algorithm for k-center [21], Indyk et al. designed a composable coreset for a related problem known as the diversity maximization problem [28]. Other variations of composable coresets are randomized composable coresets and mapping coresets. Randomized composable coresets [35] share the same divide-and-conquer approach as other composable coresets and differ only in the way they partition the data: they partition the input randomly, as opposed to other composable coresets, which use an arbitrary partitioning. Mapping coresets [8] extend composable coresets by adding a mapping between coreset points and the other points, and keep almost the same amount of data in all machines. Algorithms for clustering in ℓ_p norms using mapping coresets are known [8]. Further improvements of composable coresets for diversity maximization [28] include lower bounds [4] and multi-round composable coresets in metrics with bounded doubling dimension [10].
Metric k-center is an NP-hard problem for which approximation algorithms matching the lower bound of 2 on the approximation factor are known [39, 21]. Among the approximation algorithms for k-center is a parametric pruning algorithm based on the minimum dominating set [39]. In this algorithm, an approximate dominating set is computed on the disk graph of the input points. The greedy algorithm for k-center requires only O(nk) time [21] and, unlike the algorithm based on the minimum dominating set [39], uses nets [25]. A (1+ε)-approximation coreset exists for k-center [6], with size exponentially dependent on the dimension.
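As a concrete illustration, the greedy (farthest-point) algorithm can be sketched as follows. This is a minimal sequential sketch, not the paper's pseudocode; the name `gonzalez` and the tuple-based point representation are our own illustrative choices.

```python
import math

def gonzalez(points, k, dist=math.dist):
    """Farthest-point greedy for k-center, a 2-approximation.

    `points` is a list of coordinate tuples and `dist` any metric.
    Uses O(nk) distance evaluations.
    """
    centers = [points[0]]                      # arbitrary first center
    d = [dist(p, centers[0]) for p in points]  # distance to nearest center
    for _ in range(k - 1):
        i = max(range(len(points)), key=d.__getitem__)
        centers.append(points[i])              # farthest point becomes a center
        d = [min(d[j], dist(p, points[i])) for j, p in enumerate(points)]
    return centers, max(d)                     # centers and clustering radius
```

For example, on the four points 0, 1, 10, 11 on a line with k = 2, the sketch returns two centers at radius 1.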
Let the optimal radius of k-center for a point set P be r. The problem of finding the smallest set of points that covers P using radius r is known as the dual clustering problem [11].
Metric dual clustering (of k-center) has an unbounded approximation factor [11]. In the Euclidean metric, there exists a streaming approximation algorithm for this problem [11]. Also, any approximation algorithm for the minimum disk/ball cover problem gives an approximation coreset for k-center, so such approximation coresets exist for this problem [31]. A greedy algorithm for dual clustering of k-center has also been used as a preprocessing step of density-based clustering (DBSCAN) [17]. Implementing DBSCAN efficiently in MapReduce is an important problem [26, 13, 19, 36, 29].
Randomized algorithms for metric k-center and k-median in MapReduce exist [16]. These algorithms take α-approximation offline algorithms and return a (4α+2)-approximation for k-center and a (10α+3)-approximation for k-median in MapReduce, respectively. The round complexity of these algorithms depends on the probability of the algorithm finding a good approximation.
The current best results on metric k-center in MapReduce use 2 rounds and give an approximation factor of 4 [33]. However, a 2-approximation algorithm exists if the cost of the optimal solution is known [27]. Experiments in [34] suggest that running Gonzalez’s algorithm on a random partitioning and on an arbitrary partitioning results in the same approximation factor.
In doubling metrics, a (2+ε)-approximation algorithm exists that is based on Gonzalez’s greedy algorithm [9]. The version with outliers has also been discussed [9, 15].
Warm-Up
Increasing the size of the coresets in the first step of computing composable coresets can improve the approximation factor of some problems. The approximation factor of the k-median algorithm of [22] is 2c(1+2b)+2b, where b and c are the approximation factors of k-median and weighted k-median, respectively. This algorithm computes a composable coreset, where a coreset for k-median is the set of medians weighted by the number of points assigned to each median.
A pseudo-approximation for k-median finds k+O(1) medians and has approximation factor 1+√3+ε [30]. Using a pseudo-approximation algorithm in place of the k-median algorithm in the first step of [22], it is possible to achieve a better approximation factor for k-median using the same proof as [22]. Since any pseudo-approximation has a cost less than or equal to that of the optimal solution, replacing them does not increase the cost of the clustering.
Using [12] as the weighted k-median algorithm, the pseudo-approximation-based variant achieves a better approximation factor than the best true k-median algorithm yields in the same framework; the latter is a lower bound on the approximation factor of this algorithm without pseudo-approximation.
Contributions
We give an approximation coreset for k-center in metric spaces with doubling dimension d, whose size depends only on k, the approximation parameter ε, and d. Using composable coresets, our algorithm generalizes to the MapReduce setting, where it becomes an approximation coreset of sublinear size, given memory sublinear in the input size n.
Conditions  Approx.  Reference
Metric k-center:
2 rounds  4  [33] (greedy), Theorem 6 (parametric pruning)
Known optimal cost  2  [27] (parametric pruning)
Lower bound  2  offline [39]
Doubling metrics:
2 rounds  2+ε  [9] (greedy), Theorem 4 (parametric pruning)
Lower bound  –  [18]
Dual clustering:
General metrics  unbounded  min dominating set [39], composable coreset [28]
Doubling metrics  c²  Theorem 1
Using the composable coreset for dual clustering, we find an approximation composable coreset for k-center which has sublinear size in metric spaces with constant doubling dimension. More specifically, if an α-approximation algorithm exists for doubling metrics, our algorithm provides an (α+ε)-approximation factor. It empirically improves on previous metric k-center algorithms [33, 34] in MapReduce. A summary of results on k-center is shown in Table 1. Note that in the MapReduce model, each round can take a polynomial amount of time; however, the space available to each machine is sublinear.
Our algorithm achieves a trade-off between the approximation factor and the size of the coreset (see Figure 1): for m input sets, shrinking the coreset radius by a constant factor increases the coreset size by a constant factor. This trade-off is the main idea of our algorithm.
Our composable coresets give single-pass streaming algorithms and 2-round approximation algorithms in MapReduce with sublinear communication, since each coreset is communicated once and the size of each coreset is constant.
2 Preliminaries
First, we review some basic definitions, models and algorithms in computational geometry and MapReduce.
2.1 Definitions
Some geometric definitions and notations are reviewed here, which are used in the rest of the paper.
Definition 2 (Metric Space).
A (possibly infinite) set of points X and a distance function d : X × X → ℝ≥0 create a metric space if the following three conditions hold:
d(x, y) = 0 if and only if x = y;
d(x, y) = d(y, x) for all x, y ∈ X;
d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ X, known as the triangle inequality.
Metrics with bounded doubling dimension are called doubling metrics. Constant-dimension Euclidean spaces under ℓ_p norms, including the Manhattan distance, are examples of doubling metrics.
The doubling constant [23] of a metric space is the minimum number c of balls of radius r/2 required to cover any ball of radius r. The logarithm of the doubling constant in base 2 is called the doubling dimension. Many algorithms have better approximation factors in doubling metrics compared to general metric spaces. The doubling constant of the Euclidean plane is 7, so its doubling dimension is log₂ 7.
Definition 3 (Doubling Dimension [23]).
For any point p in a metric space and any radius r > 0, if the ball of radius 2r centered at p can always be covered with at most 2^d balls of radius r, we say the doubling dimension of the metric space is d.
k-Center is an NP-hard clustering problem whose clusters are balls in the underlying metric.
Definition 4 (Metric Center [39]).
Given a set P of n points in a metric space, find a subset C ⊆ P of k points as cluster centers such that every point is assigned to its nearest center
and the clustering radius max_{p∈P} min_{c∈C} d(p, c) is minimized.
The best possible approximation factor of metric k-center is 2 [39].
Geometric intersection graphs represent intersections between a set of shapes. For a set of disks, their intersection graph is called a disk graph.
Definition 5 (Disk Graph).
For a set of points P in a metric space with distance function d and a radius r, the disk graph of P is a graph whose vertices are the points of P, and whose edges connect pairs of points with distance at most r.
Definition 6 (Dominating Set).
Given a graph G = (V, E), a smallest subset D ⊆ V is a minimum dominating set if every vertex of V is either in D or adjacent to a vertex of D.
We define the following problem as a generalization of the dual clustering of [11], obtained by removing two conditions: that the radius of the balls is 1, and that the points lie in Euclidean space.
Definition 7 (Dual Clustering).
Given a set P of points and a radius r, the dual clustering problem finds the smallest subset C ⊆ P of points as centers such that the distance from each point of P to its closest center is at most r.
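A simple greedy scan gives a feasible (though not minimum) solution to this problem: open a center at every point that is not yet covered. The resulting centers are pairwise more than r apart, i.e. they form an r-net. The sketch below, with the hypothetical name `dual_clustering_greedy`, is an illustration of this idea, not an algorithm from the paper.

```python
import math

def dual_clustering_greedy(points, r, dist=math.dist):
    """Greedy feasible solution for dual clustering (not minimum).

    Scans the points once and opens a new center whenever a point is
    not within distance r of an existing center, producing an r-net.
    """
    centers = []
    for p in points:
        if all(dist(p, c) > r for c in centers):
            centers.append(p)   # p is uncovered: open it as a center
    return centers
```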
2.2 An Approximation Algorithm for Metric k-Center
Here, we review the parametric pruning algorithm of [39] for metric k-center.
Using this algorithm on a metric graph G, a 2-approximation of the optimal radius can be determined. In Algorithm 1, edges are added in increasing order of their length until the candidate radius r is reached. Given this radius, the graph G_r is built, whose edges connect points within distance at most r of each other.
Hence, by definition, a minimum dominating set of G_r is an optimal k-center of cost at most r. Every cluster is a star in G_r, which turns into a clique in the square graph G_r². Therefore, a maximal independent set of G_r² chooses at most one point from each cluster. Algorithm 2 computes G_r² and returns a maximal independent set of it.
Computing a maximal independent set takes time linear in the size of the graph. The graph in Algorithm 2 only changes in each iteration of Algorithm 1 around the newly added edge, so updating the previous graph and its maximal independent set takes constant time per added edge. Therefore, the time complexity of Algorithm 1 is proportional to the number of pairwise distances, i.e. O(n²).
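The parametric pruning idea can be sketched as follows. This is an unoptimized quadratic-time illustration (Algorithms 1 and 2 are not reproduced verbatim, and the function name is our own): for each candidate radius, a greedy maximal independent set of the square of the disk graph is computed, and the first radius whose independent set has at most k vertices is returned.

```python
import itertools, math

def parametric_pruning(points, k, dist=math.dist):
    """Sketch of the parametric pruning 2-approximation for k-center.

    Tries the pairwise distances in increasing order as the radius r;
    a maximal independent set of the square of the disk graph of
    radius r dominates the disk graph within 2r.
    """
    n = len(points)
    radii = sorted({dist(p, q) for p, q in itertools.combinations(points, 2)})
    for r in radii:
        # adjacency of the disk graph, then two-hop (square graph) neighborhoods
        adj = [[j for j in range(n) if j != i and dist(points[i], points[j]) <= r]
               for i in range(n)]
        two_hop = [set(a) | {j2 for j in a for j2 in adj[j]} for a in adj]
        mis, removed = [], set()
        for i in range(n):                 # greedy maximal independent set of G^2
            if i not in removed:
                mis.append(i)
                removed |= two_hop[i] | {i}
        if len(mis) <= k:                  # MIS of G^2 covers G with radius 2r
            return [points[i] for i in mis], r
    return points[:1], 0.0
```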
3 A Coreset for Dual Clustering in Doubling Metrics
In this section, we give an offline coreset for the dual clustering problem with a better approximation factor. Our method is based on Algorithm 1, which first builds the disk graph with radius r and then covers this graph using a set of stars. We prove that the number of points needed to refine each star is bounded by a power of the doubling constant c. The result is an approximation algorithm for dual clustering in doubling metrics.
3.1 Algorithm
We add a preprocessing step to Algorithm 1 to achieve a better approximation factor for the k-center and dual clustering problems.
3.2 Analysis
Unlike in general metric spaces, k-center in doubling metrics admits a trade-off between space and approximation factor. More specifically, doubling or halving the radius of a k-center solution changes the number of points in the coreset by a constant factor, since the degrees of vertices in the minimum dominating set are bounded in those metric spaces.
Lemma 1.
For each cluster C_v of Algorithm 3 with radius r, the maximum number of points from C_v that are required to cover all points inside C_v with radius r/2 is at most c²,
where c is the doubling constant of the metric space.
Proof.
Assume a center v returned by Algorithm 3. By the definition of doubling metrics, there are c balls of radius r/2 centered at points x_1, …, x_c that cover the ball of radius r centered at v, called B.
Repeating this process for each ball results in a set of at most c² balls B_1, …, B_{c²} of radius r/4 centered at y_1, …, y_{c²}.
Choose a point q ∈ B_i. Using the triangle inequality, d(q, y_i) ≤ r/4.
We claim any minimal solution needs at most one point from each ball B_i. By contradiction, assume there are two points p_1, p_2 of the minimal solution that lie inside a ball B_i. After removing p_1, the ball with radius r/2 centered at p_2 still covers B_i, since for any q ∈ B_i:
d(q, p_2) ≤ d(q, y_i) + d(y_i, p_2) ≤ r/4 + r/4 = r/2.
Then we have found a point whose removal decreases the size of the solution while keeping it feasible, which means the solution was not minimal. So, the size of any minimal set of points covering C_v with radius r/2 is at most c². ∎
Lemma 2.
In a metric space with doubling constant c, if a dual clustering with radius r has k points, then a dual clustering with radius r/2 exists which has at most c²k points.
Proof.
Let v be a center in the k-center problem. Based on the proof of Lemma 1, there are at most c² points adjacent to v that cover the points inside the ball of radius r centered at v, using balls of radius r/4 obtained from the balls of radius r/2 covering that ball. By choosing all these points as centers, it is possible to cover the points of the cluster with radius r/2. Using the same reasoning for all clusters, it is possible to cover all points with radius r/2. Using the bound in Lemma 1, these are at most c²k centers. ∎
Theorem 1.
The approximation factor of Algorithm 3 is c² for the dual clustering.
Proof.
Since the radius of the balls in Lemma 2 is at most the optimal radius for k-center, the approximation factor of dual clustering is the number of points chosen as centers divided by k, which is c². ∎
Theorem 2.
After t applications of Lemma 2, the covering radius of the coreset for k-center in Algorithm 3 is at most 2r/2^t and its size is at most c^{2t} k.
Proof.
Applying Lemma 2 halves the radius and multiplies the number of points by c². So, applying this lemma t times gives at most c^{2t} k points, since it might be the case that in the first step of the algorithm the optimal radius was found, and we divided it by 2^t. The radius bound remains 2r/2^t because of the case where we had found a 2-approximation. ∎
Theorem 3.
Algorithm 3, given ε as input, is a (1+2ε)-approximation coreset of size c^{2⌈log(2/ε)⌉} k for the k-center problem.
Proof.
For t = ⌈log(2/ε)⌉, the proof of Theorem 2 gives at most c^{2t} k points and covering radius at most εr, where r is the optimal k-center radius of P. Assume b is the center that covers the representative of a point p in the solution computed on the coreset, and rep(p) is the point that represents p in the coreset. The optimal radius on the coreset is at most (1+ε)r, since replacing each optimal center of P with its representative increases the radius by at most εr. Using the triangle inequality:
d(p, b) ≤ d(p, rep(p)) + d(rep(p), b) ≤ εr + (1+ε)r.
So, computing a k-center on this coreset gives a (1+2ε)-approximation. ∎
4 A Composable Coreset for k-Center in Doubling Metrics
Our general algorithm for constructing coresets based on dual clustering has the following steps:
1. Compute the cost r of an approximate solution.
2. Find a composable coreset for dual clustering with cost r.
3. Compute a clustering on the coreset.
In this section, we use this general algorithm for solving k-center.
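Under the assumption that the local subroutines are the greedy algorithms sketched earlier (an illustrative simplification, not the paper's exact Algorithms 3 and 4), the three steps above can be drafted as one machine's work plus a union step. The names `local_coreset` and `compose` are our own.

```python
import math

def local_coreset(part, k, eps, dist=math.dist):
    """One machine's work: steps 1 and 2 of the general algorithm.

    Step 1 estimates the k-center cost on this partition with the
    farthest-point greedy; step 2 keeps an (eps * estimate)-net as
    the composable coreset for dual clustering.
    """
    # step 1: cost of an approximate local solution
    centers = [part[0]]
    d = [dist(p, centers[0]) for p in part]
    for _ in range(k - 1):
        i = max(range(len(part)), key=d.__getitem__)
        centers.append(part[i])
        d = [min(d[j], dist(p, part[i])) for j, p in enumerate(part)]
    r = max(d)
    # step 2: greedy dual clustering (an eps*r-net) as the coreset
    net = []
    for p in part:
        if all(dist(p, q) > eps * r for q in net):
            net.append(p)
    return net

def compose(parts, k, eps):
    """Union of local coresets; step 3 runs any k-center algorithm on it."""
    return [p for part in parts for p in local_coreset(part, k, eps)]
```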
4.1 Algorithm
Knowing the exact or approximate value of the optimal radius r, we can find a single-round approximation algorithm for metric k-center in MapReduce. Although the algorithm achieves the aforementioned approximation factor, the size of the coreset and the communication complexity of the algorithm depend highly on the doubling dimension.
Based on the running times of Algorithm 2 and Gonzalez’s algorithm, the running time of Algorithm 4 is dominated by the local computations. Since the sum of the running times of the machines is of the same order as that of the best sequential algorithm, Algorithm 4 is a work-efficient parallel algorithm.
We review the following wellknown lemma:
Lemma 3.
For a subset S ⊆ P, the optimal radius of the k-center of S is at most twice the optimal radius of the k-center of P.
Proof.
Consider the set of clusters C_1, …, C_k in the optimal k-center of P, centered at c_1, …, c_k with radius r. If c_i ∈ S, then the points of C_i ∩ S are covered by c_i with radius r, as before. Otherwise, select an arbitrary point c'_i ∈ C_i ∩ S as the new center. Using the triangle inequality on c_i and any point p ∈ C_i ∩ S:
d(p, c'_i) ≤ d(p, c_i) + d(c_i, c'_i) ≤ r + r = 2r,
since p and c'_i were both covered using c_i with radius r. So, the set S can be covered with radius 2r. Note that since we choose at most one point from each cluster, the number of new centers is at most k. ∎
Theorem 4.
The approximation factor of Algorithm 4 is 2+ε for k-center in doubling metrics.
Proof.
Let r* be the optimal radius of k-center for P = ∪_i S_i. Since S_i ⊆ P, using Lemma 3, the optimal radius of k-center for S_i is at most 2r*. The algorithm computes a covering of each set with balls of radius ε·r_i. Based on the fact that offline k-center has 2-approximation algorithms and the triangle inequality, the approximation factor of the algorithm proves to be 2+O(ε) (Figure 3). Let q be the coreset representative of a point p and b the center covering q in the final solution; then
d(p, b) ≤ d(p, q) + d(q, b) ≤ ε·r_i + 2r*,
where r_i is the radius of the offline k-center algorithm on S_i. ∎
4.2 Analysis
Lemma 4.
In a metric space with doubling constant c, the union of the dual clusterings of radius r/2 computed on the sets S_1, …, S_m is an mc²-approximation for the dual clustering of radius r of their union ∪_i S_i.
Proof.
Each center in an optimal dual clustering with radius r of ∪_i S_i covers a set of adjacent points. Consider a point p covered by center v in such a solution. If p and v belong to the same set S_i, assign p to v. Otherwise, pick any point of S_i that was previously covered by v as the center that covers p.
While this might increase the radius by a factor of 2, it does not increase the number of centers chosen in each set beyond the size k of the optimal solution. Since the algorithm uses radius r/2, it keeps the radius of the solution at most r while increasing the number of centers per set to at most c²k (based on Lemma 2). There are m such sets, so the size of the coreset is at most mc²k. ∎
Theorem 5.
Algorithm 4 returns a coreset of size m·k·(α/ε)^{O(d)}, which is O(mk) for k-center in metric spaces with fixed doubling dimension d and fixed ε.
Proof.
The coreset of each set S_i has a radius varying from the optimal radius r* to 2αr*, where α is the approximation factor of the offline algorithm for k-center. Clearly, the lower bound holds because any radius is at least the optimal (minimum) radius; and Lemma 3, when applied to S_i ⊆ P, yields the upper bound.
Reaching radius εr* requires applying Lemma 2 at most ⌈log(2α/ε)⌉ times.
The size of the resulting coreset per set is therefore at most c^{2⌈log(2α/ε)⌉} k = k·(2α/ε)^{O(d)}.
Here, we use the best approximation factor for metric k-center, α = 2, which gives a coreset of total size O(mk) for fixed ε and d. ∎
4.3 Generalized Approximation Factor
We prove that any approximation algorithm that does not choose a center from within the cluster of another center can be used instead of Gonzalez’s algorithm in the MapReduce algorithm of [33], and a similar proof gives the approximation factor 4. Algorithm 5 shows the generalized algorithm.
Theorem 6.
Algorithm 5, given an approximation algorithm for metric k-center which does not choose a center from the points of another cluster, finds a 4-approximation solution.
Proof.
Assume r* is the optimal k-center radius of S_i. We prove that the set of local centers C_i covers S_i with radius at most 2r*. Suppose there is a point p ∈ S_i whose distance to its nearest point of C_i is more than 2r*, so the radius r_alg of the local solution is more than 2r*. The distance between each pair of points of C_i is at least r_alg, since the algorithm never chooses a point as a center if it is within distance r_alg of another center. Therefore, the set C_i ∪ {p} has k+1 points with distance more than 2r* from each other. There are at most k optimal clusters, so at least two of these points must lie inside the same cluster, which means their distance is at most 2r*. This contradicts the previous bound.
A similar proof follows for the second round, applied to the union of the local centers. Using the triangle inequality, the distance from any point p to its final center b via its local center q is bounded by:
d(p, b) ≤ d(p, q) + d(q, b) ≤ 2r* + 2r* = 4r*.
∎
Note that the parametric pruning algorithm finds a dominating set by computing a maximal independent set, so the centers returned by this algorithm do not lie inside each other’s clusters.
4.4 A Composable Coreset
The composable coreset for k-center in doubling metrics can be used to obtain a (1+ε)-approximation for constant k and ε. All these results also hold for dual clustering, as a result of the proven trade-off between the radius and the number of centers.
Theorem 7.
Algorithm 6 gives a (1+ε)-approximation for k-center in doubling metrics, for fixed k and ε.
Proof.
The approximation factor of the composable coreset C is 1+ε and its size is O(mk) for fixed ε and d, based on Theorem 5. Repeating the coreset computation on C gives the approximation factor (1+ε)² = 1+O(ε), and the result has size independent of m, as proved in Theorem 3. Checking all possible choices of k centers from C takes polynomial time for fixed k and ε, since the number of such choices, |C| choose k, is polynomial in |C|.
Since the last step is optimal on the coreset, the approximation factor of Algorithm 6 for k-center is 1+O(ε). ∎
5 The Exponential Nature of the Trade-off Between the Radius and the Number of Centers
The same constructive algorithm yields an exponential lower bound on the trade-off between the radius and the number of centers of k-center.
We build the following example by placing a point at the center of each ball from the ball covering problem, using balls of radius r/2 to cover a ball of radius r, and repeating this process recursively.
Example 1.
Cover the ball of radius r with c balls of radius r/2, where c is the doubling constant of the metric space. Repeat this process with each of the balls of radius r/2. The number of balls in the i-th iteration of this process is c^i and their radius is r/2^i.
Lemma 5.
A maximal circle packing of a circle of radius r with circles of radius r/2 is an upper bound for the ball cover of the circle of radius r using circles of radius r, and the same packing is a lower bound for the ball cover using circles of radius r/2.
Proof.
The circle packing has the maximum number of circles, so there is no room for more circles in the empty spaces between them. Therefore, increasing the radius of the circles to twice the previous radius, i.e. to r, covers the circle of radius r. So, the circle packing with circles of radius r/2 gives an upper bound on the minimum number of circles of radius r required to cover a circle of radius r.
On the other hand, the same packing is a lower bound for the minimum number of circles of radius r/2 required to cover the circle of radius r, since all the packed circles are disjoint and each covering circle of radius r/2 contains at most one of their centers. ∎
Theorem 8.
The optimal trade-off between the radius and the number of centers is exponential.
Proof.
Based on Lemma 5, Theorem 2 gives both a lower bound and an upper bound on the trade-off between the radius and the number of centers, within a constant factor per halving step, for doubling metrics. Lemma 2 gives the upper bound c² per step, and the lower bound from Example 1 is c per step. Substituting these bounds in the trade-off of Theorem 2 shows that both bounds, and the ratio between them, are exponential in the number of halvings log(r/ρ), where ρ is the radius of the balls used for covering the points. ∎
Better trade-offs can be achieved by replacing c² in Theorem 8 with the square of the exact bound from circle/sphere covering for radius r/2.
6 A Comparison of the Algorithms for Metric k-Center
We consider variations of Gonzalez’s greedy algorithm and the parametric pruning algorithm in which arbitrary choices are replaced by random ones. In the worst case, even the randomized versions of these algorithms cannot achieve an approximation factor better than 2.
We also prove the solutions of Gonzalez’s algorithm are a subset of the solutions of the parametric pruning algorithm.
Lemma 6.
There are examples in which the randomized Gonzalez’s algorithm cannot do better than a 2-approximation, even in the best case.
Proof.
We prove this lemma by the counterexample of Figure 4, whose measures are given in the figure.
The farthest-neighbor computation prevents the algorithm from choosing any of the solutions of cost 1.
∎
Lemma 7.
The solutions of Gonzalez’s algorithm are a subset of the solutions of the parametric pruning algorithm.
Proof.
Let r be the radius and C be the set of centers computed by Gonzalez’s algorithm, after removing the last centers if they do not decrease the clustering radius. Consider the graph G = (V, E) such that V is the set of input points and E is the set of all pairs of points with distance at most r. By the anti-cover property of Gonzalez’s algorithm, C is an independent set of G.
Since the maximal independent set algorithm visits vertices in an arbitrary order, use the order of visiting used in Gonzalez’s algorithm. Consider an instance of the parametric pruning algorithm that at the i-th step visits the points of C in the order of Gonzalez’s algorithm after it has chosen the first i centers. In such an instance, all the edges between the points of V and their corresponding members of C have already appeared in the sorted list of edges, since they have a lower edge weight than r. Also, there are no edges between the points of C, since Gonzalez’s algorithm chooses the farthest point from the previous points, so the distance between the centers is more than r. Therefore, C is a maximal independent set of the disk graph of radius r. All the radii less than r that are checked by the parametric pruning algorithm will fail, since r is the minimum radius that covers V using the points of C. For radius r, the parametric pruning algorithm finds the solution with C as centers.
We proved that there is an execution of the maximal independent set algorithm on the square of the disk graph that finds C as the set of centers. ∎
Lemma 8.
There are examples in which the randomized parametric pruning algorithm for k-center cannot do better than a 2-approximation, even in the best case.
Proof.
Any solution in the form of a dominating set of the unit disk graph that is not also an independent set is a solution that the parametric pruning algorithm cannot find. See Figure 5 for an example: the optimal centers shown there form a dominating set but not an independent set, so the parametric pruning algorithm does not find them; it can only find the solution that is an independent set.
∎
7 Efficient Parametric Pruning Algorithm
We need to keep the time and space complexity of the coreset computation algorithm near-linear. Since the time complexity of parametric pruning in general metrics is quadratic in the number of points, we give a 2(1+ε)-approximation algorithm with near-linear time and linear space. Later, we use this algorithm in our MapReduce algorithms for general metrics.
Theorem 9.
The time complexity of Algorithm 7 is O((nk/ε) log(r_max/r_min)), where r_max and r_min are the largest and smallest candidate radii, and its memory complexity is O(n).
Proof.
For each point, if it has been visited before, the algorithm ignores it; otherwise the algorithm chooses it as the next center, in which case it checks at most n other points. Since there are at most k centers, using the aggregate method of amortized analysis, the running time of the algorithm is O(nk) for each candidate radius. The number of radius values that are checked by the algorithm is log_{1+ε}(r_max/r_min). Using the Taylor series ln(1+ε) ≈ ε, the overall time complexity is O((nk/ε) log(r_max/r_min)). ∎
Theorem 10.
The approximation factor of Algorithm 7 is 2(1+ε) for metric k-center.
Proof.
Consider the disk graph of the points with radius 2r*, where r* is the optimal radius. Using this radius, each cluster turns into a clique, so the maximal independent set subroutine of the parametric pruning algorithm chooses at most one point from each cluster.
Algorithm 7 computes a maximal independent set of the disk graph of the current radius at each step. For any radius of at least 2r*, at most k points are marked as centers by the algorithm. The algorithm starts from a lower bound on the radius and multiplies it by 1+ε. So, in the worst case the first radius that the algorithm checks which exceeds 2r* is 2(1+ε)r*. ∎
8 Connectivity Preservation and Applications to DBSCAN
Computing the connected components of a graph is harder than testing the connectivity between two vertices of the graph. It has been conjectured that sparse connectivity cannot be solved using a constant number of MapReduce rounds, and the same then holds for single-linkage clustering in high dimensions, by reduction from the connectivity problem [40].
In DBSCAN, a point that has at least minPts other points within distance ε of it is called a core point. A cluster is a connected component of the intersection graph of balls of radius ε centered at core points. Any point that is not within distance ε of a core point is an outlier. Therefore, the algorithm can be seen as two main steps: simultaneous range counting queries, and computing the connected components of the disk graph. Both of these problems are challenging in the MapReduce model.
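The two steps can be illustrated by the following sequential sketch (plain quadratic-time Python, not a MapReduce implementation; `abstract_dbscan` is a hypothetical name):

```python
import math
from itertools import combinations

def abstract_dbscan(points, eps, min_pts, dist=math.dist):
    """Two-step DBSCAN sketch: range counting, then connectivity.

    Step 1 answers the range counting queries to identify core
    points; step 2 computes connected components of the eps-disk
    graph on core points with union-find.
    """
    n = len(points)
    # step 1: a point with >= min_pts neighbors within eps is core
    core = [sum(dist(points[i], q) <= eps for q in points) - 1 >= min_pts
            for i in range(n)]
    # step 2: union-find over edges of the disk graph on core points
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in combinations(range(n), 2):
        if core[i] and core[j] and dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    # border points join a nearby core cluster; the rest are outliers (-1)
    labels = [-1] * n
    for i in range(n):
        if core[i]:
            labels[i] = find(i)
        else:
            near = [j for j in range(n)
                    if core[j] and dist(points[i], points[j]) <= eps]
            if near:
                labels[i] = find(near[0])
    return labels
```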
We use dual clustering to solve a non-convex clustering problem in MapReduce. Several MapReduce algorithms for density-based spatial clustering of applications with noise (DBSCAN) have been presented [26, 13, 19, 36, 29]. However, they lack theoretical guarantees. We use the abstract DBSCAN algorithm [37], which only differs from the original DBSCAN algorithm [17] in its time complexity [20], but computes the range counting queries prior to computing the connected components of the disk graph.
Several algorithms for range counting queries exist in MapReduce, but it is not possible in this model to run n instances of single-query range search [2] simultaneously, since the data from one machine could be needed in the solutions of points from many other machines. Mergeable summaries for range counting queries are randomized approximation algorithms which are also composable [1]. Note that range queries for unit disks can be converted into rectangular range queries in a higher-dimensional space via linearization [38]; therefore, any algorithm for rectangular range counting also solves the problem for disk range counting.
Our coreset for dual clustering of radius r approximately preserves the connectivity of edges of weight at most r between clusters.
Lemma 9.
Two cluster centers c_1 and c_2 of clusters of radius r are said to be connected if there is a point p that lies in both clusters. Algorithm 3 with radius r detects whether two such cluster centers are connected or not.
Proof.
By the definition of the clustering, the distance from each point to its cluster center is at most r, so d(p, c_1) ≤ r and d(p, c_2) ≤ r.
If such a point p exists, then using the triangle inequality twice gives d(c_1, c_2) ≤ d(c_1, p) + d(p, c_2) ≤ 2r, so the connection is detected by checking the pairwise distances of the computed centers.
∎
Algorithm 8, with the minimum number of points of each cluster (minPts) set to one and then using the dual clustering, can be used to solve the DBSCAN problem in doubling metrics.
Theorem 11.
Algorithm 8 solves DBSCAN using O(1) rounds of MapReduce, given that the memory of each machine is at least proportional to s, where s is the size of the output.
Proof.
Using Lemma 9, the coreset has the same connected components for the points of P as the optimal solution. Therefore, connecting each point to its nearest neighbor in the coreset gives an exact DBSCAN clustering.
Let s be the number of points required to represent the clusters. Based on Theorem 1, the number of points returned by the algorithm is at most c²s. Sending this data from all machines to one machine requires memory proportional to the total coreset size.
Running Algorithm 3 takes one round, and computing the union and sending the clusters to all machines each take one round. So, the total number of rounds is O(1). ∎
Note that in Algorithm 8, even without sending the points to a single machine, the computed set is still a composable coreset for DBSCAN.
9 Experimental Results
A description of the data sets used in our experiments is given in Table 2. The Euclidean distance is used for all data sets. Note that the DEXTER data set is not doubling, since its dimension is higher than its number of instances.
Data Set  # of Instances  # of Dimensions  Preprocessing
Parkinson [32]  5875  26  –
DEXTER [24]  2600  20000  –
Power  2049280  7  No missing values, numerical attributes only
Higgs [7]  11000000  7  High-level attributes only
The data is partitioned into chunks of equal size.
9.1 Randomized Gonzalez vs. Randomized Parametric Pruning
In Section 6, we proved that the solutions of Gonzalez’s algorithm are a subset of the solutions of the parametric pruning algorithm. Here, we compare the randomized versions of these algorithms, in which arbitrary choices are replaced by random ones, and empirically compare the approximation factors of the resulting algorithms.
The experiments show that the effect of randomization when choosing the points is slight; however, the differences between the approximation factors of the algorithms are more significant. Figure 6 gives the results for a data set in a low-dimensional Euclidean space, which is a doubling metric. Figure 7 shows the results for a high-dimensional Euclidean space, which is not doubling.
9.2 A Comparison in MapReduce
In this experiment, we compared the approximation factor of the efficient parametric pruning algorithm (Algorithm 7) with that of the greedy algorithm of Gonzalez extended to MapReduce [21].
10 Conclusions
Gonzalez’s algorithm [21] is a special case of the parametric pruning algorithm [39] in which the greedy maximal independent set computation prioritizes the points with the maximum distance from the currently chosen points. Our algorithm and trade-off partially answer the open question of [34] about comparing and improving these two algorithms in MapReduce. We propose a modified parametric pruning algorithm with near-linear running time that achieves a better approximation factor in practice. Finding algorithms with a provable worst-case approximation factor and better approximation factors on average remains open.
We also proved that the best possible tradeoff between the approximation factor and the number of centers of k-center in doubling metrics is exponential.
Our composable coreset for dual clustering gives a constant-factor approximation for minimizing the size of DBSCAN cluster representatives, provided that the neighbor counting is done prior to computing the coreset and the connected components. Finding a summarization technique that preserves both the number of near neighbors and the connectivity between clusters in general metrics remains open. Note that in doubling metrics, keeping the number of points assigned to each center approximately solves this problem.
References
 [1] P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):26, 2013.
 [2] P. K. Agarwal, K. Fox, K. Munagala, and A. Nath. Parallel algorithms for constructing range and nearest-neighbor searching data structures. In Proc. 35th ACM SIGMOD-SIGACT-SIGAI Sympos. Princ. Database Syst., pages 429–440. ACM, 2016.
 [3] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004.
 [4] S. Aghamolaei, M. Farhadi, and H. Zarrabi-Zadeh. Diversity maximization via composable coresets. In Proc. 27th Canad. Conf. Computat. Geom., 2015.
 [5] S. Aghamolaei and M. Ghodsi. A composable coreset for k-center in doubling metrics. In Proc. 30th Canad. Conf. Computat. Geom., 2018.
 [6] M. Bādoiu, S. Har-Peled, and P. Indyk. Approximate clustering via coresets. In Proc. 34th Annu. ACM Sympos. Theory Comput., pages 250–257. ACM, 2002.
 [7] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
 [8] M. Bateni, A. Bhaskara, S. Lattanzi, and V. Mirrokni. Distributed balanced clustering via mapping coresets. In Advances in Neural Information Processing Systems (NIPS), pages 2591–2599, 2014.
 [9] M. Ceccarello, A. Pietracaprina, and G. Pucci. Solving k-center clustering (with outliers) in MapReduce and streaming, almost as accurately as sequentially, 2018.
 [10] M. Ceccarello, A. Pietracaprina, G. Pucci, and E. Upfal. MapReduce and streaming algorithms for diversity maximization in metric spaces of bounded doubling dimension. Proceedings of the VLDB Endowment, 10(5):469–480, 2017.
 [11] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417–1440, 2004.
 [12] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended abstract). In Proc. 31st Annu. ACM Sympos. Theory Comput., pages 1–10, New York, NY, USA, 1999. ACM.
 [13] B.-R. Dai and I.-C. Lin. Efficient map/reduce-based DBSCAN algorithm with optimized data partition. In 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), pages 59–66. IEEE, 2012.
 [14] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
 [15] H. Ding. Greedy strategy works for clustering with outliers and coresets construction. arXiv preprint arXiv:1901.08219, 2019.
 [16] A. Ene, S. Im, and B. Moseley. Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 681–689. ACM, 2011.
 [17] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), volume 96, pages 226–231, 1996.
 [18] T. Feder and D. Greene. Optimal algorithms for approximate clustering. In Proc. 20th Annu. ACM Sympos. Theory Comput., pages 434–444. ACM, 1988.
 [19] Y. X. Fu, W. Z. Zhao, and H. F. Ma. Research on parallel DBSCAN algorithm design based on MapReduce. In Advanced Materials Research, volume 301, pages 1133–1138. Trans Tech Publ, 2011.
 [20] J. Gan and Y. Tao. DBSCAN revisited: Misclaim, unfixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 519–530, 2015.
 [21] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science (TCS), 38:293–306, 1985.
 [22] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering (TKDE), 15(3):515–528, 2003.
 [23] A. Gupta, R. Krauthgamer, and J. R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In Proc. 44th Annu. IEEE Sympos. Found. Comput. Sci., pages 534–543. IEEE, 2003.
 [24] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems (NIPS), pages 545–552, 2005.
 [25] S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput., 35(5):1148–1184, 2006.
 [26] Y. He, H. Tan, W. Luo, S. Feng, and J. Fan. MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science, 8(1):83–99, 2014.
 [27] S. Im and B. Moseley. Brief announcement: Fast and better distributed MapReduce algorithms for k-center clustering. In Proc. 27th ACM Sympos. Parallel Algorithms Architect., pages 65–67. ACM, 2015.
 [28] P. Indyk, S. Mahabadi, M. Mahdian, and V. S. Mirrokni. Composable coresets for diversity and coverage maximization. In Proc. 33rd ACM SIGMOD-SIGACT-SIGAI Sympos. Princ. Database Syst., pages 100–108. ACM, 2014.
 [29] Y. Kim, K. Shim, M.-S. Kim, and J. S. Lee. DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce. Inf. Syst., 42:15–35, 2014.
 [30] S. Li and O. Svensson. Approximating k-median via pseudo-approximation. SIAM J. Comput., 45(2):530–547, 2016.
 [31] C. Liao and S. Hu. Polynomial time approximation schemes for minimum disk cover problems. J. Comb. Optim., 20(4):399–412, 2010.
 [32] M. A. Little, P. E. McSharry, S. J. Roberts, D. A. Costello, and I. M. Moroz. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Biomedical Engineering Online, 6(1):23, 2007.
 [33] G. Malkomes, M. J. Kusner, W. Chen, K. Q. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems (NIPS), pages 1063–1071, 2015.
 [34] J. McClintock and A. Wirth. Efficient parallel algorithms for k-center clustering. In Parallel Processing (ICPP), 2016 45th International Conference on, pages 133–138. IEEE, 2016.
 [35] V. Mirrokni and M. Zadimoghaddam. Randomized composable coresets for distributed submodular maximization. In Proc. 47th Annu. ACM Sympos. Theory Comput., pages 153–162. ACM, 2015.
 [36] M. Noticewala and D. Vaghela. MR-IDBSCAN: efficient parallel incremental DBSCAN algorithm using MapReduce. International Journal of Computer Applications (IJCA), 93(4), 2014.
 [37] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans. Database Syst., 42(3):19, 2017.
 [38] C. D. Toth, J. O’Rourke, and J. E. Goodman. Handbook of Discrete and Computational Geometry. Chapman and Hall/CRC, 2017.
 [39] V. V. Vazirani. Approximation Algorithms. Springer Science & Business Media, 2013.
 [40] G. Yaroslavtsev and A. Vadapalli. Massively parallel algorithms and hardness for single-linkage clustering under distances. arXiv preprint arXiv:1710.01431, 2017.