Distributed k-Clustering for Data with Heavy Noise

10/18/2018
by Shi Li, et al.
University at Buffalo

In this paper, we consider the k-center/median/means clustering with outliers problems (or the (k, z)-center/median/means problems) in the distributed setting. Most previous distributed algorithms have communication costs that depend linearly on z, the number of outliers. Recently, Guha et al. overcame this dependence by considering bi-criteria approximation algorithms that output solutions with 2z outliers. When z is large, the extra z outliers discarded by those algorithms may be too many, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible (1+ε)z, while maintaining the O(1)-approximation ratio and the independence of the communication cost from z. The problems we consider include the (k, z)-center problem, and the (k, z)-median/means problems in Euclidean metrics. An implementation of our algorithm for (k, z)-center shows that it outperforms many previous algorithms, both in terms of communication cost and quality of the output solution.

1 Introduction

Clustering is a fundamental problem in unsupervised learning and data analytics. In many real-life datasets, noise and errors unavoidably exist. It is known that even a few noisy data points can significantly influence the quality of the clustering results. To address this issue, previous work has considered the clustering with outliers problem, where we are given a bound z on the number of outliers and need to find the optimum clustering in which we are allowed to discard z points, under some popular clustering objective such as k-center, k-median and k-means.

Due to the increase in volumes of real-life datasets, and the emergence of modern parallel computation frameworks such as MapReduce and Hadoop, computing a clustering (with or without outliers) in the distributed setting has attracted a lot of attention in recent years. The set of points is partitioned into m parts that are stored on m different machines, which collectively need to compute a good clustering by sending messages to each other. Often, the time to compute a good solution is dominated by the communication among machines. Many recent papers on distributed clustering have focused on designing O(1)-approximation algorithms with small communication cost DBLP:conf/nips/BalcanEL13 ; DBLP:conf/nips/MalkomesKCWM15 ; DBLP:conf/spaa/GuhaLZ17 .

Most previous algorithms for clustering with outliers have communication costs that depend linearly on z, the number of outliers. Such an algorithm performs poorly when the data is very noisy. Consider the scenario where distributed sensory data are collected by a crowd of people equipped with portable sensory devices. Due to the different skill levels of individuals and the varying quality of devices, it is reasonable to assume that a small constant fraction of the data points are unreliable.

Recently, Guha et al. DBLP:conf/spaa/GuhaLZ17 overcame the linear dependence issue, by giving distributed O(1)-approximation algorithms for the k-center/median/means with outliers problems with communication cost independent of z. However, the solutions produced by their algorithms have 2z outliers. Such a solution discards z more points compared to the (unknown) optimum one, which may greatly decrease the efficiency of data usage. Consider an example where a study needs to be conducted using the inliers of a dataset containing 10% noisy points; a filtering process is needed to remove the outliers. A solution with 2z outliers will only preserve 80% of the data points, as opposed to the promised 90%. As a result, the quality of the study may be reduced.

Unfortunately, a simple example (described in the supplementary material) shows that if we need to produce any multiplicatively approximate solution with only z outliers, then the linear dependence on z cannot be avoided. We show that even deciding whether the optimum clustering with z outliers has cost 0 or not, for a dataset distributed on 2 machines, requires a communication cost of Ω(z) bits. Given such a negative result and the positive results of Guha et al. DBLP:conf/spaa/GuhaLZ17 , the following question is interesting from both the practical and theoretical points of view:

Can we obtain distributed O(1)-approximation algorithms for k-center/median/means with outliers that have communication cost independent of z and output solutions with (1+ε)z outliers, for any ε > 0?

On the practical side, an algorithm discarding εz additional outliers is acceptable, as this number can be made arbitrarily small compared to both the promised number z of outliers and the number of inliers. On the theoretical side, the (1+ε)-factor for the number of outliers is the best we can hope for if we are aiming at an O(1)-approximation algorithm with communication complexity independent of z; thus answering the question in the affirmative would give a tight tradeoff between the number of outliers and the communication cost in terms of z.

In this paper, we make progress in answering the above question for many cases. For the (k, z)-center objective, we solve the problem completely by giving a (24, 1+ε)-bicriteria approximation algorithm whose communication cost is independent of z and depends only on m, k, 1/ε and log Δ, where Δ is the aspect ratio of the metric. (The first component of the pair is the approximation ratio, the second is the multiplicative factor for the number of outliers our algorithm produces; the formal definition appears later.) For the (k, z)-median/means objectives, we give a distributed (1+ε, 1+ε)-bicriteria approximation algorithm for the case of Euclidean metrics. The communication complexity of the algorithm depends on d, the dimension of the underlying Euclidean metric, but not on z. (The exact communication complexity is given in Theorem 1.2.) Using dimension reduction techniques, we can assume d = O(log n / ε²), by incurring a (1+ε)-distortion in pairwise distances. So, the setting indeed covers a broad range of applications, given that the term "k-means clustering" is defined and studied exclusively in the context of Euclidean metrics. The (1+ε, 1+ε)-bicriteria approximation ratio comes with a caveat: our algorithm has running time exponential in several parameters such as m, k and 1/ε (though it has no exponential dependence on n or z).
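Since the (k, z)-median/means results rely on this dimension-reduction step, here is a minimal sketch of how it might be carried out with a Johnson–Lindenstrauss random projection. The use of scikit-learn and the helper name reduce_dimension are our own illustration, not something prescribed by the paper.

import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

def reduce_dimension(X, eps=0.1):
    """Project an n x d point matrix X down to O(log n / eps^2) dimensions,
    preserving pairwise distances up to a (1 +/- eps) factor w.h.p."""
    n, d = X.shape
    target_dim = johnson_lindenstrauss_min_dim(n_samples=n, eps=eps)
    if target_dim >= d:                      # already low-dimensional enough
        return X
    return GaussianRandomProjection(n_components=target_dim).fit_transform(X)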

1.1 Formulation of Problems

We refer to the k-center (resp. k-median and k-means) problem with outliers as the (k, z)-center (resp. (k, z)-median and (k, z)-means) problem. Formally, we are given a set P of n points that reside in a metric space with distance function d, and two integers k and z. The goal of the problem is to find a set C of at most k centers and a set P* ⊆ P of at least n − z points so as to minimize max_{p ∈ P*} d(p, C) (resp. Σ_{p ∈ P*} d(p, C) and Σ_{p ∈ P*} d²(p, C)), where d(p, C) is the minimum distance from p to a center in C. For all 3 objectives, given a set C of centers, the best set P* can be derived from P by removing the z points p with the largest d(p, C). Thus, we shall only use a set C of centers to denote a solution to a (k, z)-center/median/means instance. The cost of a solution C is defined as max_{p ∈ P*} d(p, C), Σ_{p ∈ P*} d(p, C) and Σ_{p ∈ P*} d²(p, C), respectively, for a (k, z)-center, median and means instance, where P* is obtained by applying the optimum strategy. The points in P* and the points in P \ P* are called inliers and outliers respectively in the solution C.
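To make the definitions concrete, the following minimal Python sketch (our illustration, not code from the paper) evaluates the cost of a candidate center set under all three objectives, using the rule above that the z points with the largest d(p, C) are discarded.

import numpy as np

def kz_cost(points, centers, z, objective="center"):
    """Cost of a solution C for the (k, z)-center/median/means instance:
    the z points farthest from the center set are discarded, and the
    objective is evaluated on the remaining inliers."""
    # distance from every point to its nearest center
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    inliers = np.sort(dists)[:len(points) - z]      # drop the z largest distances
    if objective == "center":
        return inliers.max() if inliers.size else 0.0
    if objective == "median":
        return inliers.sum()
    if objective == "means":
        return (inliers ** 2).sum()
    raise ValueError("objective must be 'center', 'median' or 'means'")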

As is typical in the machine learning literature, we consider general metrics for (k, z)-center, and Euclidean metrics for (k, z)-median/means. In the (k, z)-center problem, we assume that each point in the metric space can be described using O(1) words, and given the descriptions of two points p and q, one can compute d(p, q) in O(1) time. In this case, the set of centers must come from P, since these are all the points we have. For the (k, z)-median/means problems, the points of P and the centers are from the Euclidean space R^d, and it is not required that the centers belong to P. One should treat d as a small number, since dimension reduction techniques can be applied to project points to a lower-dimensional space.

Bi-Criteria Approximation. We say an algorithm for the (k, z)-center/median/means problem achieves a bi-criteria approximation ratio (or simply approximation ratio) of (α, β), for some α, β ≥ 1, if it outputs a solution with at most βz outliers, whose cost is at most α times the cost of the optimum solution with z outliers.

Distributed Clustering. In the distributed setting, the dataset is split among m machines, where P_i is the set of data points stored on machine i. We use P to denote P_1 ∪ P_2 ∪ ⋯ ∪ P_m. Following the communication model of DBLP:conf/icml/DingLHL16 and DBLP:conf/spaa/GuhaLZ17 , we assume there is a central coordinator, and communication can only happen between the coordinator and the machines. The communication cost is measured by the total number of words sent. Communication happens in rounds, where in each round, messages are sent between the coordinator and the machines. A message sent by a party (either the coordinator or some machine) in a round can only depend on the input data given to the party and the messages received by the party in previous rounds. As is common in most of the previous results, we require the number of rounds to be small, preferably a small constant.

Our distributed algorithm needs to output a set C of centers, as well as an upper bound R on the maximum radius of the generated clusters. For simplicity, only the coordinator needs to know C and R. We do not require the coordinator to output the set of outliers, since otherwise the communication cost is forced to be at least Ω(z). In a typical clustering task, each machine can figure out the set of outliers in its own dataset based on C and R (one extra round may be needed for the coordinator to send C and R to all machines).
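As a small illustration of the last remark, a machine holding P_i can recover its local outliers from the broadcast pair (C, R) alone; the helper below is hypothetical and only mirrors the rule just described.

import numpy as np

def local_outliers(P_i, C, R):
    """Points of the local dataset P_i that no center in C covers within
    radius R; these are the outliers machine i would discard on its own."""
    d = np.linalg.norm(P_i[:, None, :] - C[None, :, :], axis=2).min(axis=1)
    return P_i[d > R]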

1.2 Prior Work

In the centralized setting, we know the best possible approximation ratio of 2 for the k-center problem and a 3-approximation DBLP:conf/soda/CharikarKMN01 for the (k, z)-center problem, so our understanding in this setting is essentially complete. There has been a long stream of research on approximation algorithms for k-median and k-means, leading to the current best approximation ratio of 2.675 BPRST17 for k-median, and to improved guarantees for k-means and Euclidean k-means (6.357 in the Euclidean case) ANSW16 . The first O(1)-approximation algorithm for (k, z)-median was given by Chen DBLP:conf/soda/Chen08 . Recently, Krishnaswamy et al. KLS18 developed a general framework that gives O(1)-approximations for both (k, z)-median and (k, z)-means.

Much of the recent work has focused on solving the k-center/median/means and (k, z)-center/median/means problems in the distributed setting DBLP:conf/kdd/EneIM11 ; DBLP:conf/nips/BalcanEL13 ; DBLP:conf/spaa/ImM15 ; DBLP:conf/nips/MalkomesKCWM15 ; DBLP:conf/icml/DingLHL16 ; DBLP:conf/nips/ChenSWZ16 ; DBLP:conf/spaa/GuhaLZ17 ; DBLP:journals/corr/abs-1805-09495 . Many distributed approximation algorithms with small communication complexity are known for these problems. However, for the (k, z)-center/median/means problems, most known algorithms have communication complexity depending linearly on z, the number of outliers. Guha et al. DBLP:conf/spaa/GuhaLZ17 overcame the dependence issue by giving (O(1), 2)-bicriteria approximation algorithms for all three objectives, i.e., algorithms that output solutions with at most 2z outliers. The communication costs of their algorithms are independent of z, up to a hidden logarithmic factor.

1.3 Our Contributions

Our main contributions are in designing (O(1), 1+ε)-bicriteria approximation algorithms for the (k, z)-center/median/means problems. The algorithm for (k, z)-center works for general metrics:

Theorem 1.1.

There is an O(1)-round distributed algorithm for the (k, z)-center problem that achieves a (24, 1+ε)-bicriteria approximation, with communication cost independent of z and depending only on m, k, 1/ε and log Δ, where Δ is the aspect ratio of the metric.

We give a high-level picture of the algorithm. By guessing, we assume that we know the optimum cost L (since we do not know it, we pay an additive term in the communication complexity for trying multiple guesses). In the first round of the algorithm, each machine i calls a procedure called Aggregating on its set P_i. This procedure performs two operations. First, it discards some points from P_i; second, it moves each of the surviving points by a distance of O(L). After the two operations, the points are aggregated at a few locations. Thus, machine i can send a compact representation of these points to the coordinator: a list of (location, weight) pairs, where the weight is the number of points aggregated at the location. The coordinator collects all the aggregated points from all the machines, and runs the algorithm of DBLP:conf/soda/CharikarKMN01 for a (k, z')-center instance on the collected points, for some suitable z'.

To analyze the algorithm, we show that the set of points collected by the coordinator well-approximates the original set P. The main lemma is that the total number of non-outliers removed by the aggregation procedure on all machines is at most εz. This incurs the additive factor of εz in the number of outliers. We prove this by showing that inside any ball of radius L around an optimum center, and for every machine i, we removed at most t points, for a suitably chosen threshold t. Since the non-outliers are contained in the union of k balls of radius L, and there are m machines, the total number of removed non-outliers is at most kmt, which is at most εz for the chosen threshold. Each remaining point is shifted by a distance of O(L), leading to an O(1)-loss in the approximation ratio of our algorithm.

We perform experiments comparing our main algorithm stated in Theorem 1.1 with many previous ones on real-world datasets. The results show that it matches the state-of-the-art method in both solution quality (objective value) and communication cost. We remark that the qualities of solutions are measured w.r.t. removing only z outliers. Theoretically, we need (1+ε)z outliers in order to achieve an O(1)-approximation ratio, and our constant 24 is big. In spite of this, empirical evaluations suggest that on real-world datasets the algorithm performs much better than what can be proved theoretically in the worst case.

For the (k, z)-median/means problems, our algorithm works for the Euclidean metric case and has communication cost depending on the dimension d of the Euclidean space. One can w.l.o.g. assume d = O(log n / ε²) by using the dimension reduction technique. Our algorithm is given in the following theorem:

Theorem 1.2.

There is an O(1)-round distributed algorithm for the (k, z)-median/means problems in d-dimensional Euclidean space that achieves a (1+ε, 1+ε)-bicriteria approximation ratio with probability 1 − δ. The communication cost of the algorithm depends on m, k, d, 1/ε, 1/δ and the aspect ratio Δ of the input points, but not on z; the exact bound differs slightly between (k, z)-median and (k, z)-means.

We now give an overview of our algorithm for (k, z)-median/means. First, it is not hard to reformulate the objective of the (k, z)-median problem as minimizing, over center sets C, the quantity sup_{L ≥ 0} ( Σ_{p ∈ P} d_L(p, C) − zL ), where d_L is obtained from d by truncating all distances at L. By discretization, we can construct a small set of interesting values that the L under the supremum operator can take. Thus, our goal becomes to find a set C of k centers that is simultaneously good for every k-median instance defined by a truncated metric d_L. Since now we are handling k-median instances (without outliers), we can use the communication-efficient algorithm of DBLP:conf/nips/BalcanEL13 to construct an ε-coreset with weights for every such L. Roughly speaking, the coreset behaves like the set P for the task of solving the k-median problem under the metric d_L. The size of each ε-coreset is small (independent of z and n up to logarithmic factors), implying the communication cost stated in the theorem. After collecting all the coresets, the coordinator can approximately solve the optimization problem on them. This leads to a (1+ε, 1+ε)-bicriteria approximate solution. The running time of the algorithm, however, is exponential in the total size of the coresets. The argument can be easily adapted to the (k, z)-means setting.
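The truncated-distance reformulation can be checked numerically. The sketch below (an illustration under our reading of the reformulation, not the paper's code) compares the (k, z)-median cost for a fixed center set, i.e. the sum of the n − z smallest distances, with the supremum over thresholds L of Σ_p d_L(p, C) − zL; since the function of L is concave and piecewise linear, the supremum is attained at one of the input distances, so scanning those suffices.

import numpy as np

def outlier_median_cost(dists, z):
    """Sum of the n - z smallest point-to-center distances."""
    return np.sort(dists)[:len(dists) - z].sum()

def truncated_reformulation(dists, z):
    """max over thresholds L of  sum_p min(d_p, L) - z * L."""
    best = 0.0
    for L in np.unique(dists):
        best = max(best, np.minimum(dists, L).sum() - z * L)
    return best

# quick numerical check of the identity on random data
d = np.random.rand(100) * 10
assert np.isclose(outlier_median_cost(d, 7), truncated_reformulation(d, 7))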

Organization. In Section 2, we prove Theorem 1.1, by giving the (24, 1+ε)-approximation algorithm. The empirical evaluations of our algorithm for (k, z)-center and the proof of Theorem 1.2 are provided in the supplementary material.

Notations

Throughout the paper, point sets are multi-sets, where each element has its own identity. By a copy of some point p, we mean a point with the same description as p but a different identity. For a set S of points, a point p, and a radius r ≥ 0, we define B_S(p, r) to be the set of points in S that have distance at most r to p. For a weight vector w on some set S of points, and a subset T ⊆ S, we use w(T) to denote the total weight of the points in T.

Throughout the paper, P is always the set of input points. We shall use d_min and d_max to denote the minimum and maximum non-zero pairwise distances between points in P. Let Δ = d_max / d_min denote the aspect ratio of the metric.

2 Distributed (k, z)-Center Algorithm with Outliers

In this section, we prove Theorem 1.1, by giving the (24, 1+ε)-approximation algorithm for (k, z)-center, with the communication cost stated in the theorem. Let L* be the cost of the optimum (k, z)-center solution (which is not given to us). We assume we are given a parameter L, and our goal is to design a main algorithm, with communication cost independent of z, that either returns a solution with at most (1+ε)z outliers and cost O(L), or certifies that L < L*. Notice that L* is either 0 or lies in [d_min, d_max]. We can obtain our (24, 1+ε)-approximation by using the main algorithm to check O(log Δ) geometrically spaced values of L in parallel, and among all L's that are not certified to be less than L*, returning the solution corresponding to the smallest such L. A naive implementation requires all the parties to know d_min and d_max in advance; we show in the supplementary material that this requirement can be removed.

In intermediate steps, we may deal with (k, z)-center instances where points have integer weights. In this case, the instance is defined as (Y, w, k, z), where Y is a set of points and w is a weight vector assigning each y ∈ Y a positive integer. The instance is equivalent to the unweighted instance on the multi-set that contains w_y copies of each y ∈ Y.

DBLP:conf/soda/CharikarKMN01 gave a 3-approximation algorithm for the (k, z)-center problem. However, our setting is slightly more general, so we cannot apply the result directly. We are given a weighted set Y of points that defines the (k, z)-center instance. The optimum set of centers, however, may come from a superset P ⊇ Y which is hidden from us. Thus, our algorithm needs to output a set of centers from Y and compare it against the optimum set of centers from P. Notice that by losing a factor of 2, we can assume centers are in Y; this leads to a worse constant approximation. Indeed, by applying the framework of DBLP:conf/soda/CharikarKMN01 more carefully, we can obtain a better constant approximation for this general setting. We state the result in the following theorem:

Theorem 2.1 (DBLP:conf/soda/CharikarKMN01 ).

Let d be a metric over the set P of points, and let Y ⊆ P. There is an algorithm (Algorithm 1) that takes as input (Y, w, k, z'), with w a weight vector on Y, the metric d restricted to Y, and a real number L > 0. In time polynomial in |Y|, the algorithm either outputs a (k, z')-center solution to the instance (Y, w, k, z') of cost O(L), or certifies that there is no (k, z')-center solution of cost at most L with centers from P and outputs "No".
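For intuition, here is a hedged sketch in the spirit of the Charikar et al. greedy for weighted k-center with outliers. The radii L and 3L are the ones from the classical analysis and are used only for illustration, since Theorem 2.1 does not pin down the exact constants used by Algorithm 1; the function name and Euclidean distances are our own choices.

import numpy as np

def greedy_center_outliers(Y, w, k, z, L):
    """CKMN-style greedy for weighted k-center with z outliers, given a guess L.
    Repeatedly pick the point whose radius-L ball has the largest remaining
    weight, then mark everything within 3L of it as covered. If weight more
    than z stays uncovered after k picks, the guess L is rejected."""
    alive = np.ones(len(Y), dtype=bool)
    centers = []
    for _ in range(k):
        if not alive.any():
            break
        balls = [w[alive & (np.linalg.norm(Y - y, axis=1) <= L)].sum() for y in Y]
        c = int(np.argmax(balls))
        centers.append(Y[c])
        alive &= np.linalg.norm(Y - Y[c], axis=1) > 3 * L   # expanded ball is covered
    if w[alive].sum() > z:
        return None        # certifies infeasibility of the guess in the classical setting
    return centers         # cost at most 3L in the classical setting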

The main algorithm is Algorithm 3, which calls an important procedure called Aggregating (Algorithm 2). We describe Aggregating and the main algorithm in Sections 2.1 and 2.2 respectively.

2.1 Aggregating Points

The procedure Aggregating, as described in Algorithm 2, takes as input the set P of points to be aggregated (which will be some P_i when we actually call the procedure), the guessed optimum cost L, and a threshold parameter t, which controls how many points may be removed from P. It returns a set Y of points obtained from aggregation, along with their weights w. Throughout, the balls considered by the procedure have radius equal to a fixed constant multiple of L; we refer to this radius as the aggregation radius.

Algorithm 1 (the adapted procedure of Charikar et al., used in Theorem 2.1)
1:  C ← ∅; mark all points of Y as uncovered
2:  for i = 1 to k do
3:      let y ∈ Y be the point whose ball (of radius proportional to L) has the largest total weight of uncovered points
4:      C ← C ∪ {y}
5:      mark as covered all points of Y in an expanded ball around y
6:  if the total weight of the uncovered points exceeds the outlier budget then return "No" else return C

Algorithm 2 (Aggregating(P, L, t))
1:  Y ← ∅
2:  while there is a point q such that at least t of the remaining points of P lie within the aggregation radius of q do
3:      Y ← Y ∪ {q}; w_q ← the number of remaining points within the aggregation radius of q
4:      remove those points from P
5:  return (Y, w)

In Aggregating, we start from Y = ∅ and the full set P, and keep removing points from P. In each iteration, we check if there is a point q such that at least t of the remaining points lie within the aggregation radius of q. If yes, we add q to Y, remove those points from P, and let w_q be the number of points removed. We repeat this procedure until such a q cannot be found. We remark that the procedure is very similar to the algorithm (Algorithm 1) in DBLP:conf/soda/CharikarKMN01 .
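Below is a minimal sketch of the Aggregating procedure as we read it. The ball radius (written here as a multiple of L via radius_factor) and the threshold t are placeholders; the exact constants of Algorithm 2 are not restated here.

import numpy as np

def aggregating(P, L, t, radius_factor=2.0):
    """Sketch of Aggregating(P, L, t): repeatedly find a point q whose ball
    still contains at least t remaining points; record q together with the
    number of points it absorbs, and remove them.  Returns (Y, w, Q):
    aggregation locations, their weights, and the discarded leftover points."""
    remaining = list(range(len(P)))
    Y, w = [], []
    r = radius_factor * L
    while True:
        pts = P[remaining]
        # for every remaining point, count the remaining points inside its ball
        counts = (np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2) <= r).sum(axis=1)
        heavy = np.where(counts >= t)[0]
        if len(heavy) == 0:
            break
        q = heavy[0]
        in_ball = np.linalg.norm(pts - pts[q], axis=1) <= r
        Y.append(pts[q])
        w.append(int(in_ball.sum()))
        remaining = [remaining[i] for i in range(len(remaining)) if not in_ball[i]]
    Q = P[remaining]                      # points left at the end are discarded
    return np.array(Y), np.array(w), Q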

We start from some simple observations about the algorithm.

Claim 2.2.

We define P' to be the set of points in P with distance at most the aggregation radius to some point in Y at the end of Algorithm 2. Then, the following statements hold at the end of the algorithm:

  1. Q = P \ P', where Q is the set of points remaining in P at the end of the algorithm.

  2. For every point p ∈ Q, fewer than t points of Q lie within the aggregation radius of p.

  3. There is a function f : P' → Y such that d(p, f(p)) is at most the aggregation radius for every p ∈ P', and w_y = |f^{-1}(y)| for every y ∈ Y.

Proof.

Q is exactly the set of points in P with distance more than the aggregation radius to every point in Y, and thus Q = P \ P'. Property 2 follows from the termination condition of the algorithm. Property 3 holds by the way we add points to Y and remove points from P: if in some iteration we added q to Y, we can define f(p) = q for every point p removed from P in that iteration. ∎

We think of Q as the set of points we discard from P and P' = P \ Q as the set of surviving points. We then move each surviving point p to f(p), and thus the surviving points are aggregated at the set Y of locations. The following crucial lemma upper bounds |Y|, the number of aggregation locations:

Lemma 2.3.

Let t ≥ 1 be the threshold used in Algorithm 2, and assume there is a (k, z)-center solution to the instance (P, k, z) with cost at most L. Then, at the end of Algorithm 2 we have |Y| ≤ k + z/t.

Proof.

Let P_out be the set of outliers according to the assumed solution C*; thus |P_out| ≤ z.

Focus on the moment before we run Step 3 in some iteration of Aggregating, in which a point q with a heavy ball is about to be added to Y. See Figure 1 for the two cases we are going to consider. In case (a), every center c ∈ C* whose ball B(c, L) could intersect the ball of q has B(c, L) empty of remaining points. In this case, every point removed in this iteration lies at distance more than L from every center of C*: otherwise, by the triangle inequality, the corresponding ball B(c, L) would contain a remaining point, contradicting the assumption of case (a). That is, all removed points are outliers, so Step 3 in this iteration decreases the number of remaining outliers by at least t.

Consider the case (b) where some center c ∈ C* close to q has B(c, L) non-empty. Then B(c, L) is removed from P entirely by Step 3 in this iteration. Thus,

  1. if case (a) happens, then the number of remaining outliers decreases by at least t in this iteration;

  2. otherwise case (b) happens; then for some c ∈ C*, the set of remaining points in B(c, L) changes from non-empty to empty.

The first event can happen for at most z/t iterations and the second event can happen for at most k iterations. Since every iteration adds exactly one point to Y, we get |Y| ≤ k + z/t. ∎

Figure 1: The two cases in the proof of Lemma 2.3. In case (a), the balls B(c, L) around the centers (red circles) are all empty of remaining points, so every removed point is an outlier. In case (b), the ball B(c, L) of some center close to q is non-empty (the red circle), and it is contained in the ball removed in this iteration.
Figure 2: Illustration for the proof of Lemma 2.7. The mapping f is indicated by the dashed lines, each of which has length at most the aggregation radius. The number of crosses in a circle is at most t.

2.2 The Main Algorithm

We are now ready to describe the main algorithm for the (k, z)-center problem, given in Algorithm 3. In the first round, each machine i calls Aggregating on P_i to obtain (Y_i, w_i). All the machines first send their corresponding summary counts to the coordinator. In Round 2, the coordinator checks whether these counts are small or not. If yes, it sends a "Yes" message to all machines; otherwise it returns "No" and terminates the algorithm. In Round 3, if a machine received a "Yes" message from the coordinator, it sends its aggregated set Y_i with the weight vector w_i to the coordinator. Finally, in Round 4, the coordinator collects all the weighted points and runs Algorithm 1 on these points.

input on all parties: k, z, ε, the guess L, the threshold t, and the number of machines m
input on machine i: the dataset P_i, with P = P_1 ∪ P_2 ∪ ⋯ ∪ P_m
output: a set C of centers, or "No" (which certifies L < L*)

Round 1 on machine i

1:  (Y_i, w_i) ← Aggregating(P_i, L, t)
2:  send |Y_i| and |P_i| to the coordinator

Round 2 on the coordinator

1:  if Σ_i |Y_i| > mk + z/t then return "No" else send "Yes" to each machine

Round 3 on machine i

1:  upon receiving a "Yes" message from the coordinator, respond by sending (Y_i, w_i)

Round 4 on the coordinator

1:  let X = Y_1 ∪ Y_2 ∪ ⋯ ∪ Y_m
2:  let w be the weight function on X obtained by merging w_1, w_2, …, w_m
3:  let z' = (1+ε)z − Σ_i (|P_i| − w_i(Y_i)), the residual outlier budget
4:  if Algorithm 1 on (X, w, k, z') with parameter L returns "No" then return "No" else return its output C
Algorithm 3
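Putting the rounds together, the following single-process simulation sketches the protocol for one guess L. The Round-2 threshold, the solver radius and the residual budget z' are written as placeholders, since their exact expressions belong to the analysis below; aggregating and greedy_center_outliers refer to the sketches given earlier.

import numpy as np

def distributed_kz_center(datasets, k, z, L, eps, t):
    """Single-process simulation of the 4-round protocol for one guess L.
    datasets: a list of numpy arrays, one per machine."""
    m = len(datasets)
    # Round 1: every machine aggregates locally and reports the size of its summary.
    summaries = [aggregating(P_i, L, t) for P_i in datasets]
    summary_size = sum(len(Y_i) for (Y_i, _, _) in summaries)

    # Round 2: reject the guess if the summaries are too large
    # (placeholder for the bound guaranteed by Lemma 2.6 when L >= L*).
    if summary_size > m * k + z / t or summary_size == 0:
        return "No"

    # Round 3: machines send their weighted aggregated points.
    Y = np.vstack([Y_i for (Y_i, _, _) in summaries if len(Y_i)])
    w = np.concatenate([w_i for (_, w_i, _) in summaries if len(w_i)])

    # Round 4: solve weighted k-center with the residual outlier budget.
    discarded = sum(len(Q_i) for (_, _, Q_i) in summaries)
    z_residual = (1 + eps) * z - discarded           # placeholder budget
    if z_residual < 0:
        return "No"
    centers = greedy_center_outliers(Y, w, k, z_residual, 3 * L)  # placeholder radius
    return "No" if centers is None else centers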

An immediate observation about the algorithm is that its communication cost is small:

Claim 2.4.

The communication cost of Algorithm 3 is O(m + Σ_i |Y_i|) = O(mk + z/t) words.

Proof.

The total communication cost of Round 1 and Round 2 is O(m). We run Round 3 only when the coordinator sent the "Yes" message, in which case the Round-2 check guarantees Σ_i |Y_i| ≤ mk + z/t, so the communication cost of Round 3 is at most O(mk + z/t). ∎

It is convenient to define some notation before we make further analysis. For every machine i, let Y_i (with weight vector w_i) be the set constructed in Round 1 on machine i. Let P_i' be the set of points in P_i that are within the aggregation radius of some point in Y_i. Notice that this is the definition of P' in Claim 2.2 for the execution of Aggregating on machine i. Let Q_i = P_i \ P_i'; this is the set Q at the end of this execution. Let f_i be the mapping from P_i' to Y_i satisfying Property 3 of Claim 2.2. Let X = Y_1 ∪ Y_2 ∪ ⋯ ∪ Y_m and let w be the weight function on X obtained by merging w_1, w_2, …, w_m. Thus w(X) = Σ_i |P_i'| and |P| − w(X) = Σ_i |Q_i|.

Claim 2.5.

If Algorithm 3 returns a set C, then C is a solution to the instance (P, k, z) with at most (1+ε)z outliers and cost O(L).

Proof.

C must be returned in Step 4 of Round 4. By Theorem 2.1 applied to the instance (X, w, k, z'), C is a (k, z')-center solution to (X, w) of cost O(L); that is, the total weight of the points of X at distance more than O(L) from C is at most z'. Notice that for each y ∈ X, the points in f^{-1}(y) are within the aggregation radius O(L) of y, and w_y = |f^{-1}(y)|. So, the number of points of P at distance more than O(L) from C is at most Σ_i |Q_i| + z', which is at most (1+ε)z by the choice of z' in Round 4. ∎

We can now assume L ≥ L*, and we need to prove that we must reach Step 4 in Round 4 and return a set C. We define C* to be a set of k centers whose (k, z)-center solution has cost at most L. Let P* be the set of "inliers" according to C* and P \ P* be the set of outliers. Thus, |P \ P*| ≤ z and d(p, C*) ≤ L for every p ∈ P*.

Lemma 2.6.

After Round 1, we have Σ_i |Y_i| ≤ mk + z/t.

Proof.

Let Z_i be the set of outliers of the optimum solution that lie in P_i; thus Σ_i |Z_i| ≤ z. Then, C* is a (k, |Z_i|)-center solution to the instance P_i with cost at most L. By Lemma 2.3, we have |Y_i| ≤ k + |Z_i|/t. Summing over the m machines gives Σ_i |Y_i| ≤ mk + z/t. ∎

Therefore, the coordinator will not return "No" in Round 2. It remains to prove the following lemma.

Lemma 2.7.

Algorithm 3 will reach Step 4 in Round 4 and return a set C.

Proof.

See Figure 2 for the illustration of the proof. By Property 2 of Claim 2.2, every point of Q_i has fewer than t points of Q_i within the aggregation radius, since Q_i is exactly the set of points left at the end of the execution of Aggregating on machine i. This implies that for every center c ∈ C* and every machine i, we have |B_{Q_i}(c, L)| < t. (Otherwise, taking an arbitrary point in the ball leads to a contradiction.)

For every surviving inlier p, the point f_i(p) it is moved to has distance at most L plus the aggregation radius, i.e. O(L), to some center in C*. Also, notice that for every machine i, the number of inliers in Q_i is less than kt, since the inliers are contained in the k balls B(c, L) with c ∈ C*.

So, the total number of discarded inliers is less than kmt ≤ εz. Hence the number of optimum outliers that survive is at most z − (Σ_i |Q_i| − εz) ≤ z', and there is a (k, z')-center solution to the instance (X, w) of cost O(L). Thus Algorithm 3 will reach Step 4 in Round 4 and return a set C. This finishes the proof of the lemma. ∎

We now briefly analyze the running times of the algorithms on all parties. The running time of Aggregating on each machine i in Round 1 is quadratic in |P_i|, and this is the bottleneck for machine i. Considering all possible values of L, the running time on machine i picks up an additional O(log Δ) factor. The running time of the Round-4 algorithm of the central coordinator for one guess L is polynomial in Σ_i |Y_i|. We sort all the interesting values of L in increasing order. The trick here is to run only the first two rounds of the main algorithm while searching: the central coordinator uses binary search to find the smallest L for which the main algorithm sends out "Yes" in Round 2, and it proceeds to Rounds 3 and 4 only for this single L. So, the running time of the central coordinator can be kept small.

The quadratic dependence of the running time of machine i on |P_i| might be an issue when |P_i| is big; we discuss how to alleviate the issue in the supplementary material.

3 Conclusion

In this paper, we give a distributed (24, 1+ε)-bicriteria approximation for the (k, z)-center problem, with communication cost independent of z. The running times of the algorithms for all parties are polynomial. We evaluate the algorithm on real-world data sets and it outperforms most previous algorithms, matching the performance of the state-of-the-art method DBLP:conf/spaa/GuhaLZ17 .

For the (k, z)-median/means problems, we give a distributed (1+ε, 1+ε)-bicriteria approximation algorithm whose communication cost is proportional to the total size of the coresets constructed using the algorithm of DBLP:conf/nips/BalcanEL13 . The central coordinator needs to solve the optimization problem of finding a solution that is simultaneously good for many k-median/means instances. Since the approximation ratio for this problem goes into both factors of the bicriteria ratio, we really need a (1+ε)-approximation for the optimization problem. Unfortunately, solving k-median/means alone is already APX-hard, and we do not know a heuristic algorithm that works well in practice (e.g., a counterpart to Lloyd's algorithm for k-means). It is interesting to study whether a different approach can lead to a polynomial-time distributed algorithm with a (1+ε, 1+ε)-approximation guarantee.

References

  • [1] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward.

    Better guarantees for k-means and euclidean k-median by primal-dual algorithms.

    In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 61–72, 2017.
  • [2] Barbara M. Anthony, Vineet Goyal, Anupam Gupta, and Viswanath Nagarajan. A plant location guide for the unsure: Approximation algorithms for min-max location problems. Math. Oper. Res., 35(1):79–101, 2010.
  • [3] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In Advances in Neural Information Processing Systems 26, NIPS 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 1995–2003, 2013.
  • [4] Jaroslaw Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median and positive correlation in budgeted optimization. ACM Trans. Algorithms, 13(2):23:1–23:31, 2017.
  • [5] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the 12th Annual Symposium on Discrete Algorithms, January 7-9, 2001, Washington, DC, USA., pages 642–651, 2001.
  • [6] Sanjay Chawla and Aristides Gionis. k-means--: A unified approach to clustering and outlier detection. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013, Austin, Texas, USA, pages 189–197, 2013.
  • [7] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practical algorithm for distributed clustering and outlier detection. CoRR, abs/1805.09495, 2018.
  • [8] Jiecao Chen, He Sun, David P. Woodruff, and Qin Zhang. Communication-optimal distributed clustering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3720–3728, 2016.
  • [9] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008, pages 826–835, 2008.
  • [10] Hu Ding, Yu Liu, Lingxiao Huang, and Jian Li. K-means clustering with distributed dimensions. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1339–1348, 2016.
  • [11] Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using mapreduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 681–689, 2011.
  • [12] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 143–152, 2017.
  • [13] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng., 15(3):515–528, 2003.
  • [14] Johan Håstad and Avi Wigderson. The randomized communication complexity of set disjointness. Theory of Computing, 3(1):211–219, 2007.
  • [15] Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic for the k-center problem. Math. Oper. Res., 10(2):180–184, 1985.
  • [16] Sungjin Im and Benjamin Moseley. Brief announcement: Fast and better distributed mapreduce algorithms for k-center clustering. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA 2015, Portland, OR, USA, June 13-15, 2015, pages 65–67, 2015.
  • [17] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 646–659, 2018.
  • [18] M. Lichman. UCI machine learning repository, 2013.
  • [19] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, 28(2):129–136, 1982.
  • [20] Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems 28, NIPS 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1063–1071, 2015.
  • [21] Athanasios Tsanas, Max A. Little, Patrick E. Mcsharry, and Lorraine O. Ramig. Enhanced classical dysphonia measures and sparse regression for telemonitoring of parkinson’s disease progression. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, 14-19 March 2010, Dallas, Texas, USA, pages 594–597, 2010.

Appendix A Necessity of Linear Dependence of Communication Cost on z for True Approximation Algorithms

In this section, we show that if one is aiming for a multiplicative approximation for the (k, z)-center, (k, z)-median, or (k, z)-means problem, then the communication cost is at least Ω(z) bits, even if there are only 2 machines. We show that deciding whether the optimum (k, z)-center solution has cost 0 or not requires Ω(z) bits of communication. This holds for any suitable combination of values for k and z. The points are all on the real line R. On machine 1, there are copies of points from a fixed set of k − 1 locations, where each location appears either once or twice; in particular, each of the k − 1 locations appears at least once. Meanwhile, machine 1 has a set A of roughly z/2 additional distinct points, machine 2 has a set B of roughly z/2 distinct points, and |A| = |B|. If A ∩ B ≠ ∅, then the cost of the optimum solution is 0: let x ∈ A ∩ B; we can discard all points of A and B except the two copies of x, so that we discarded at most z points and the remaining points sit at only k distinct locations. On the other hand, if A ∩ B = ∅, then the cost of the optimum solution is not 0. Thus deciding whether the cost is 0 or not requires us to decide whether A ∩ B = ∅, which is exactly the set disjointness problem; this requires a communication cost of Ω(z) bits between machine 1 and machine 2 (a well-known result in communication complexity theory, see e.g. [14]).

Appendix B Dealing with Various Issues of the Algorithm for (k, z)-Center

In this section, we show how to handle various issues that our (k, z)-center algorithm might face.

When d_min and d_max are not given. We can remove the assumption that d_min and d_max are given to us. Let d_min,i and d_max,i be the minimum and maximum non-zero pairwise distances between points in P_i. The crucial observation is that running Aggregating on P_i for a guess L below d_min,i is the same as running it for d_min,i, and running it for a guess above d_max,i is the same as running it for d_max,i. Thus, machine i only needs to consider values of L that are integer powers of 2 inside [d_min,i, d_max,i], plus the two endpoints, and send results for these values. Since these values all lie between d_min and d_max, the number of such values is at most O(log Δ). Also notice that the data points sent from machine i to the coordinator are all generated from P_i. Thus, the aspect ratio of the union of all points received by the coordinator is at most Δ. This guarantees that the coordinator only needs O(log log Δ) iterations in the binary search step in Round 4.
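A short sketch of the candidate guesses each machine would enumerate under this observation; the powers-of-2 convention follows the discussion above, and the quadratic pairwise-distance computation is only for illustration.

import numpy as np

def candidate_guesses(P_i):
    """Integer powers of 2 lying between the minimum and maximum non-zero
    pairwise distances of the local dataset; these are the only guesses L
    for which Aggregating on P_i can behave differently."""
    d = np.linalg.norm(P_i[:, None, :] - P_i[None, :, :], axis=2)
    nz = d[d > 0]
    lo, hi = nz.min(), nz.max()
    j_lo, j_hi = int(np.floor(np.log2(lo))), int(np.ceil(np.log2(hi)))
    return [2.0 ** j for j in range(j_lo, j_hi + 1)]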

When Δ is super big. There are many ways to handle the case when Δ is super-large. In many applications, we know the nature of the dataset and have a reasonable guess on L*. In other applications, we may only be interested in the case where the cost is below some threshold: we are happy with any clustering of cost less than the threshold, and any clustering of cost more than the threshold is meaningless. In these applications where we have inside information about the dataset, the number of guesses can be much smaller. Finally, if we allow more rounds in our algorithm, we can use binary search for the whole algorithm, not just inside Round 4. We then only need to run the main algorithm for O(log log Δ) guesses; this increases the number of rounds to O(log log Δ), but decreases the communication cost accordingly.

Handling the Quadratic Running Time of Round 1 on Machine i. In Round 1 of the algorithm, each machine i needs to run Aggregating on |P_i| points, leading to a running time quadratic in |P_i|. In cases where |P_i| is large, the algorithm might be slow. We can decrease the running time, at the price of increasing the communication cost and the running time on the coordinator. We view each machine i as a collection of g_i sub-machines, each holding roughly |P_i|/g_i points, for some integer g_i ≥ 1. Then, we run the algorithm on the set of sub-machines, instead of the original set of m machines. The communication cost of the algorithm increases with the total number of sub-machines, the running time on each machine decreases to roughly |P_i|²/g_i, and the running time of the coordinator grows correspondingly. Each machine i can choose a g_i so that its Round-1 computation terminates in an acceptable amount of time.

Appendix C Distributed Algorithms for (k, z)-Median/Means

In this section, we give our distributed algorithm for the (k, z)-median/means problems in Euclidean metrics. Let P and m be as defined in the problem setting. Let δ be the confidence parameter; i.e., our algorithm needs to succeed with probability 1 − δ. Also, we define a parameter ℓ to indicate whether the problem we are considering is (k, z)-median (ℓ = 1) or (k, z)-means (ℓ = 2).

Recall that d_min and d_max are respectively the minimum and maximum non-zero pairwise distances between points in P. It is not hard to see that the optimum solution to the instance has cost either 0 or at least a constant fraction of d_min^ℓ. For a technical reason, we redefine the distance between every pair of points by truncation: distances below a sufficiently small threshold are rounded up to that threshold, and distances above a sufficiently large threshold are rounded down to it. It is easy to see that the problem w.r.t. the new metric is equivalent to the original one up to a small multiplicative factor. In the new instance, the aspect ratio is suitably bounded.

Given an integer and a set