1 Introduction
Clustering is a fundamental unsupervised learning method that offers a compact view of data sets by grouping similar input points. Among various clustering methods,
k-means clustering is one of the most popular methods used in practice. It is defined as follows: given a set X of points in Euclidean space (the input space can be extended to an arbitrary metric space) and a target number of clusters k, the goal is to choose a set C of k points as centers, so as to minimize the k-means loss, i.e., the sum of the squared distances of every point to its closest center in C. Due to its popularity, k-means clustering has been extensively studied for decades both theoretically and empirically, and as a result, various novel algorithms and powerful underlying theories have been developed. In particular, because the clustering problem is NP-hard, several constant-factor approximation algorithms have been developed (Charikar and Guha, 1999; Kanungo et al., 2004; Kumar et al., 2004; Feldman et al., 2007), meaning that their output is always within an O(1) factor of the optimum. One of the most successful algorithms used in practice is k-means++ (Arthur and Vassilvitskii, 2007). The algorithm k-means++ is a preprocessing step used to set the initial centers when running Lloyd's algorithm (Lloyd, 1982). Lloyd's algorithm is a simple local search heuristic that alternates between updating the center of every cluster and reassigning points to their closest centers.
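For concreteness, one possible sketch of the Lloyd iteration in the plane; random initialization stands in for the k-means++ seeding, and the helper names are ours rather than from any reference implementation:

```python
import random

def lloyd(points, k, iters=20, seed=0):
    """A minimal sketch of Lloyd's algorithm on 2-D points.

    `points` is a list of (x, y) tuples; initial centers are chosen
    uniformly at random (k-means++ seeding would replace this step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach every point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers
```

Each iteration can only decrease the loss, which is why the heuristic converges to a local optimum.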
k-means++ has a provable O(log k) approximation guarantee, achieved by carefully choosing the initial centers. However, k-means clustering is highly sensitive to noise, which is present in many data sets. Indeed, it is not difficult to see that the k-means clustering objective can vary significantly with the addition of even a single point that is far away from the true clusters. In general, it is a nontrivial task to filter out noise; without knowing the true clusters, we cannot identify noise, and vice versa. While there are other clustering methods, such as density-based clustering (Ester et al., 1996), that attempt to remove noise, they do not replace k-means clustering because they are fundamentally different from k-means.
Consequently, there have been attempts to study k-means clustering in the presence of noise. The following problem formulation is the most popular in the theory (Chen, 2008; Charikar et al., 2001; McCutchen and Khuller, 2008; Guha et al., 2017; Malkomes et al., 2015; Chawla and Gionis, 2013; Li and Guo, 2018) and database (Gupta et al., 2017) communities. Note that traditional k-means clustering is the special case of this problem with z = 0. Throughout, for points x and y, we let d(x, y) denote the distance between x and y. For a subset of points C, let d(x, C) = min over c in C of d(x, c).

Definition 1 (k-Means with Outliers).
In this problem we are given as input a set X of n points in Euclidean space, a parameter k (number of centers), and a parameter z (number of outliers). The goal is to choose a set C of k centers to minimize the sum over x in X_z(C) of d(x, C)^2, where X_z(C) is the subset of n - z input points with the smallest distances to C.
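The objective in Definition 1 is easy to state in code; the following sketch (the names are ours) computes the cost of a candidate center set while ignoring the z farthest points:

```python
def cost_with_outliers(points, centers, z):
    """Sketch of the k-means-with-outliers objective: sum of squared
    distances to the closest center, over all but the z farthest points."""
    def sq_dist_to_centers(p):
        # Squared distance from p to its closest center.
        return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    costs = sorted(sq_dist_to_centers(p) for p in points)
    return sum(costs[: max(0, len(points) - z)])  # drop the z largest
```

With z = 0 this is exactly the standard k-means loss, which illustrates why the outlier problem generalizes the classical one.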
Because this problem generalizes k-means clustering, it is NP-hard, and in fact it turns out to be significantly more challenging. The only known constant-factor approximations (Chen, 2008; Krishnaswamy et al., 2018) are highly sophisticated and are based on complicated local search or linear program rounding. They are unlikely to be used in practice due to their runtime and complexity. Therefore, there have been strong efforts to develop simpler algorithms that offer good approximation guarantees when allowed to discard more than z points as outliers (Charikar et al., 2001; Meyerson et al., 2004; Gupta et al., 2017), as well as heuristics (Chawla and Gionis, 2013). Unfortunately, all existing algorithms with theoretical guarantees suffer from either high running time or an inherent loss in solution quality.

1.1 Our Results and Contributions
The algorithmic contribution of this paper is twofold, and these contributions are validated by experiments. In this section, we state our contributions and discuss them in detail in comparison to previous work.
Simple Preprocessing Step for Removing Outliers with Provable Guarantees:
In this paper we develop a simple preprocessing step, which we term NKmeans (noise removal for k-means), to effectively filter out outliers. Our proposed preprocessing step can be combined with any algorithm for k-means clustering. Despite the large amount of work on this problem, we give the first reduction to the standard k-means problem. In particular, NKmeans can be combined with the popular k-means++. The resulting algorithm is the fastest known algorithm for the k-means with outliers problem, and its speed and simplicity give it the potential to be used in practice. Formally, given an alpha-approximation algorithm for k-means clustering, we give an algorithm for k-means with outliers that is guaranteed to discard slightly more than z points such that the cost of the remaining points is at most O(alpha) times the optimum that discards exactly z points. While the theoretical guarantee on the number of discarded points exceeds z on worst-case inputs, we show that NKmeans removes at most O(z) points under the assumption that every cluster in an optimal solution is sufficiently large relative to z. We believe that this assumption captures most practical cases, since otherwise significant portions of the true clusters could be discarded as outliers. In an actual implementation, we can guarantee discarding exactly z points by discarding the z farthest points from the centers we have chosen. It is worth keeping in mind that all (practical) algorithms for the problem discard more than z points in order to obtain theoretical guarantees (Charikar et al., 2001; Meyerson et al., 2004; Gupta et al., 2017).
New Coreset Construction:
When the data set is large, a dominant way to speed up clustering is to first construct a coreset and then use the clustering result of the coreset as a solution to the original input. Informally, a set X' of (weighted) points is called a coreset of X if a good clustering of X' is also a good clustering of X (see Section 4.1 for the formal definition of coreset).
The idea is that if we can efficiently construct such an X' that is significantly smaller than X, then we can speed up any clustering algorithm with little loss of accuracy. In this paper, we give an algorithm to construct a coreset of size near-linear in k for k-means with outliers. Importantly, the coreset size is independent of z and d, the number of outliers and the dimension, respectively.
Experimental Validation:
Our new coreset enables the implementation and comparison of all potentially practical algorithms, which are based on primal-dual (Charikar et al., 2001), uniform sampling (Meyerson et al., 2004), or local search (Chawla and Gionis, 2013; Gupta et al., 2017). It is worth noting that, to the best of our knowledge, this is the first paper to implement the primal-dual based algorithm (Charikar et al., 2001) and test it on large data sets. We also implemented natural extensions of k-means++ and our algorithm NKmeans. We note that, for fair comparison, once each algorithm chose the centers, we considered all points and discarded the z farthest points. Our experiments show that NKmeans consistently outperforms the other algorithms on both synthetic and real-world data sets, with little running time overhead compared to k-means++.
1.2 Comparison to the Previous Work
Algorithms for k-Means with Outliers:
To understand the contribution of our work, it is important to contrast our algorithm with previous work. We believe a significant contribution of our work is the algorithmic simplicity and speed, as well as the theoretical bounds that our approach guarantees. In particular, we discuss why the previous algorithms are difficult to use in practice.
The first potentially practical algorithm developed is based on the primal-dual method (Charikar et al., 2001). Instead of solving a linear program (LP) and rounding the solution to an integer solution, the primal-dual approach only uses the LP and its dual to guide the algorithm. However, the algorithm does not scale well and is not easy to implement. In particular, it involves increasing variables uniformly, which requires substantial running time and extra care to handle precision issues with fractional values. As mentioned before, this algorithm was never implemented prior to this paper. Our experiments show that it considerably underperforms compared to the other algorithms.
The second potentially practical algorithm is based on uniform sampling (Meyerson et al., 2004). The main observation of Meyerson et al. (2004) is that if every cluster is large enough, then a small uniform sample can serve as a coreset. This observation leads to two algorithms for k-means clustering with outliers: (i) an (implicit) reduction to k-means clustering via conservative uniform sampling, and (ii) an (explicit) aggressive uniform sampling plus primal-dual (Charikar et al., 2001). In (i), it can be shown that a constant approximate k-means clustering of a small, conservative uniform sample is a constant approximation for k-means clustering with outliers, under the assumption that every cluster is sufficiently large. Here, the main idea is to avoid sampling any noise by sampling conservatively. Although this assumption is reasonable as discussed before, the real issue is that conservative uniform sampling does not give a sufficiently accurate sketch to be adopted in practice: to avoid the noise points, the sample must be kept very small. In (ii), a more aggressive uniform sampling is used, followed by the primal-dual algorithm (Charikar et al., 2001). It first obtains a larger uniform sample, so that only a small number of outliers survives in the sample in expectation. This aggressive uniform sampling turns out to incur very little loss in terms of accuracy. However, as mentioned before, the primal-dual algorithm underperforms compared to the other algorithms in both speed and accuracy.
Another line of algorithmic development has been based on local search (Chawla and Gionis, 2013; Gupta et al., 2017). The algorithm in Chawla and Gionis (2013) guarantees convergence to a local optimum, but has no approximation guarantee. The other algorithm (Gupta et al., 2017) is an O(1)-approximation, but theoretically it may end up discarding far more than z points as outliers. These local search algorithms are considerably slower than our method, and their theoretical guarantees require discarding many more points.
To summarize, there is a need for a fast and effective algorithm for k-means clustering with outliers.
Coresets for k-Means with Outliers:
The other main contribution of our work is a coreset for k-means with outliers whose size is near-linear in k and independent of the number of outliers z and the dimension d.
The notion of coreset we consider is related to the concept of a weak coreset in the literature; see e.g. Feldman and Langberg (2011) for a discussion of weak coresets and other types of coresets. Previous coreset constructions (some for stronger notions of coreset) have polynomial dependence on the number of outliers z (Gupta et al., 2017), inverse polynomial dependence on the fraction of outliers (Meyerson et al., 2004; Huang et al., 2018), or polynomial dependence on the dimension d (Huang et al., 2018). Thus, all coresets constructed in previous work can have large size for some values of z or for large values of d. In contrast, our construction is efficient for all values of z and yields coresets whose size has no dependence on z or d.
1.3 Overview of Our Algorithms: NKmeans and SampleCoreset
Our preprocessing step, NKmeans, is reminiscent of density-based clustering. Our algorithm tags an input point as light if it has relatively few points around it. Formally, a point is declared light if it has fewer than a threshold number of points within a certain distance threshold r, which can be set by binary search. Then a point is discarded if it has only light points within distance r. We emphasize that the threshold is chosen by the algorithm, not by the user, unlike in density-based clustering. While our preprocessing step looks similar to the algorithm for k-center clustering (Charikar et al., 2001), which optimizes the maximum loss, we find it surprising that a similar idea can be used for k-means clustering.
It can take considerable time to label each point as light or heavy. To speed up our algorithm, we develop a new coreset construction for k-means with outliers. The idea is relatively simple. We first use aggressive sampling as in Meyerson et al. (2004). The resulting sample is substantially smaller than the input and includes few outliers with high probability. Then we use k-means++ to obtain centers. As a result, we obtain a high-quality coreset of size near-linear in k. Interestingly, to the best of our knowledge, combining aggressive sampling with another coreset construction for k-means with outliers has not been considered in the literature.

1.4 Other Related Work
Due to the vast literature on clustering, we refer the reader to Aggarwal and Reddy (2013); Kogan et al. (2006); Jain et al. (1999) for overviews and surveys. k-means clustering can be generalized by considering other norms of the loss, and such extensions have been studied under different names. When the objective is the sum of (unsquared) distances, the problem is called k-medians. The k-median and k-means clustering problems are closely related, and in general an algorithm and analysis for one can be readily translated into one for the other with an O(1) factor loss in the approximation ratio. Constant-factor approximations are known for k-medians and k-means based on linear programming, primal-dual, and local search (Arya et al., 2004; Charikar et al., 2002; Charikar and Guha, 1999). While its approximation ratio is O(log k), the k-means++ algorithm is widely used in practice for k-means clustering due to its practical performance and simplicity. When the loss function is the maximum distance, the problem is known as k-centers, and a 3-approximation is known for k-center clustering with outliers (Charikar et al., 2001). For recent work on these outlier problems in distributed settings, see Malkomes et al. (2015); Li and Guo (2018); Guha et al. (2017); Chen et al. (2018).

2 Preliminaries
In this paper we consider the Euclidean k-means with outliers problem as defined in the introduction. Note that the distance d satisfies the triangle inequality, so for all points x, y, and u, we have d(x, y) <= d(x, u) + d(u, y). Further, the approximate triangle inequality d(x, y)^2 <= 2 d(x, u)^2 + 2 d(u, y)^2, which follows from the triangle inequality, will be useful in our analyses. Given a set of centers C, we say that the assignment cost of a point x to C is d(x, C)^2. For k-means with outliers, a set C of centers naturally defines a clustering of the input points as follows:
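The one-line derivation of the approximate triangle inequality from the standard one is worth recording; it only uses the elementary bound $(a+b)^2 \le 2a^2 + 2b^2$:

```latex
d(x,y)^2 \;\le\; \big(d(x,u) + d(u,y)\big)^2 \;\le\; 2\,d(x,u)^2 + 2\,d(u,y)^2 .
```

The factor 2 is exactly the constant lost when translating triangle-inequality arguments from k-median to k-means.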
Definition 2 (Clustering).
Let C = {c_1, ..., c_k} be a set of centers. A clustering of X defined by C is a partition X_1, ..., X_k of X satisfying: for all i and all x in X_i, d(x, c_i) = d(x, C), where ties between the c_i's are broken arbitrarily but consistently.
In summary, for the k-means with outliers problem, given a set of centers C, we assign each point in X to its closest center in C. Then we exclude the z points of X with the highest assignment costs from the objective function (these points are our outliers). This procedure defines a clustering of X with z outliers.
Notation: Recall that, as in the introduction, for any finite set of centers C, we define d(x, C) = min over c in C of d(x, c). For any point x and radius r >= 0, we define the ball centered at x with radius r by B(x, r) = { y : d(x, y) <= r }. For a set of k centers C and an outlier parameter z, we define the cost of C by cost_z(X, C) = sum over x in X_z(C) of d(x, C)^2, where X_z(C) is the subset of points of X excluding the z points with the highest assignment costs. Thus the cost of C is the cost of clustering X with C while excluding the z points with the highest assignment costs. As shorthand, when z = 0 (i.e., the k-means problem without outliers), we denote the cost of clustering X with C by cost(X, C). Further, we say a set of k centers is an optimal solution if it minimizes cost_z(X, C) over all choices of k centers C. We then define OPT to be the optimal objective value of the k-means with outliers instance (X, k, z). Analogously, for the problem without outliers, we denote the optimal objective value of the k-means instance (X, k) in the same way with z = 0.
3 NKmeans Algorithm
In this section, we describe our algorithm, NKmeans, which turns a k-means algorithm without outliers into an algorithm for k-means with outliers in a black-box fashion. We note that the algorithm naturally extends to k-medians with outliers and to general metric spaces. For the remainder of this section, let X, k, and z define an instance of k-means with outliers.
Algorithm Intuition: The guiding intuition behind our algorithm is as follows. We consider a ball of a certain radius r around each point x in X. If this ball contains many points, then x is likely not an outlier in the optimal solution.
More concretely, if there are more than 2z points in x's ball, then at most z of these points, i.e., at most half, can be outliers in the optimal solution. This means that the majority of x's neighbourhood consists of real (non-outlier) points in the optimal solution, so we can bound the assignment cost of x to the optimal centers. We call such points heavy.
There are two main steps to our algorithm. First, we use the concept of heavy points to decide which points are real points and which are outliers. Then we run a k-means approximation algorithm on the real points.
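A minimal sketch of the first step follows, under the assumption that "heavy" means more than 2z points within a radius-r ball; in the actual algorithm the radius and threshold are derived from z and a guess of the optimal objective, so the constants here are illustrative:

```python
def nkmeans_filter(points, z, r):
    """Heavy-point filtering sketch: a point is 'heavy' if more than 2z
    input points lie in its radius-r ball; a point survives if some heavy
    point (possibly itself) lies within distance r of it."""
    def sq(p, q):
        # Squared Euclidean distance between p and q.
        return sum((a - b) ** 2 for a, b in zip(p, q))
    heavy = [sum(1 for q in points if sq(p, q) <= r * r) > 2 * z
             for p in points]
    # Keep exactly the points with a heavy point in their ball; the rest
    # are discarded as outliers before running any k-means algorithm A.
    return [p for p in points
            if any(heavy[j] for j, q in enumerate(points)
                   if sq(p, q) <= r * r)]
```

The two nested scans mirror the O(n^2) pairwise-distance implementation described in Section 3.1.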
Formal Algorithm: Now we formally describe our algorithm NKmeans. As input, NKmeans takes a k-means with outliers instance (X, k, z) and an algorithm A for k-means without outliers, where A takes an instance of k-means as input.
We will prove that if A is an alpha-approximation for k-means and the optimal clusters are sufficiently large with respect to z, then NKmeans outputs a good clustering that discards O(z) outliers. More precisely, we will prove the following theorem about the performance of NKmeans:
Theorem 1.
Let C be the output of NKmeans on (X, k, z) with algorithm A. Suppose that A is an alpha-approximation for k-means. If every cluster in the clustering defined by an optimal solution has sufficiently large size relative to z, then the cost of C, discarding O(z) outliers, is O(alpha) times the optimal objective value.
Corollary 1.
Let C be the output of NKmeans on (X, k, z) with algorithm A. Suppose that A is an alpha-approximation. Then the cost of C, discarding O(z) outliers, is O(alpha) times the optimal objective value.
In other words, NKmeans gives a pseudo-approximation-preserving reduction from k-means with outliers to k-means, where any alpha-approximation for k-means implies an O(alpha)-pseudo-approximation for k-means with outliers that throws away O(z) points as outliers.
3.1 Implementation Details
Here we describe a simple implementation of NKmeans that achieves O(n^2 + T(n)) runtime assuming we know the optimal objective value OPT, where T(n) is the runtime of the algorithm A on inputs of size n. This assumption can be removed by running the algorithm for many guesses of OPT, say by trying all powers of 2 in a suitable range, so as to obtain a 2-approximation of OPT for the correct guess.
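The guessing loop just described can be sketched as follows; the bounds and the feasibility signal returned by `run_with_guess` are illustrative assumptions, and any geometric grid of guesses works the same way:

```python
def guess_opt(lower, upper, run_with_guess):
    """Sketch of removing the known-OPT assumption: try powers of 2
    between a lower and an upper bound on the objective, keeping the best
    outcome. `run_with_guess(g)` is assumed to return the cost of the
    candidate solution produced with guess g, or None if g is infeasible."""
    best = None
    g = lower
    while g <= upper:
        cost = run_with_guess(g)
        if cost is not None and (best is None or cost < best):
            best = cost
        g *= 2  # some guess on a factor-2 grid is within 2x of OPT
    return best
```

The number of guesses is logarithmic in the ratio upper/lower, so the overhead over the known-OPT implementation is a logarithmic factor.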
For our experiments, we implement the loop in Line 3 by enumerating over all pairs of points and computing their distance. This step takes O(n^2) time. We implement the loop in Line 9 by enumerating over all elements of X and checking, for each, whether it has a heavy point nearby. This step also takes O(n^2) time. Running A on the remaining points takes T(n) time. We summarize the result of this section in the following lemma:
Lemma 1.
Assuming that we know OPT and that A takes T(n) time on inputs of size n, NKmeans can be implemented to run in O(n^2 + T(n)) time.
4 Coreset of Near-Linear Size in k
In this section we develop a general framework to speed up any k-means with outliers algorithm, and we apply this framework to NKmeans to show that we can achieve near-linear runtime. In particular, we achieve this by constructing what is called a coreset for the k-means with outliers problem, of size near-linear in k and independent of the number of outliers z.
4.1 Coresets for k-Means with Outliers
Our coreset construction will leverage existing constructions of coresets for k-means with outliers. A coreset gives a good summary of the input instance in the following sense:
Definition 3 (Coreset for k-Means with Outliers).
Let (X, k, z) be an instance of k-means with outliers and let X' be a (possibly weighted) subset of X. We say the k-means with outliers instance (X', k, z) is a coreset for (X, k, z) if, for any set of centers whose cost on (X', k, z) is within a constant factor of the optimum while discarding roughly z outliers, the same centers have cost within a constant factor of the optimum on (X, k, z). (Note that our definition of coreset is parameterized by the number of outliers z, in contrast to previous work such as Meyerson et al. (2004) and Huang et al. (2018), whose constructions are parameterized by the fraction of outliers.)
In words, if (X', k, z) is a coreset for (X, k, z), then running any approximate k-means with outliers algorithm on X' (meaning the algorithm throws away roughly z outliers and outputs a solution whose cost is within a constant factor of the optimum) gives an approximate solution to (X, k, z).
Note that if X' is a weighted set with weights w, then the k-means with outliers problem is analogously defined, where the objective is a weighted sum of assignment costs: sum over the non-outlier points x of w(x) d(x, C)^2. Further, note that NKmeans generalizes naturally to weighted k-means with outliers with the same guarantees.
The two coresets we utilize for our construction are k-means++ (Aggarwal et al., 2009) and Meyerson's sampling coreset (Meyerson et al., 2004). The guarantees of these coresets are as follows:
Theorem 2 (k-means++).
Let C be the set of centers obtained by running k-means++ on the input points X, and let X_1, ..., X_|C| be the clustering of X defined by the centers in C. Define a weight function w by setting w(c_i) = |X_i| for all i. If the number of centers chosen is sufficiently large (linear in k and the outlier budget), then with constant probability, the weighted instance (C, k, z), where C has weights w, is a coreset for the k-means with outliers instance (X, k, z).
Theorem 3 (Sampling).
Let S be a sample from X, where every x in X is included in S independently with probability p. Then with high probability, the instance (S, k, pz), with the outlier budget scaled down by p, is a coreset for (X, k, z).
Observe that k-means++ gives a coreset of size linear in k plus z, and uniform sampling gives a coreset of expected size pn. If z is small, then k-means++ gives a very compact coreset for k-means with outliers, but if z is large, say a constant fraction of n, then k-means++ gives a coreset of linear size. However, the case where z is large is exactly the case where uniform sampling gives a small coreset.
In the next section, we show how to combine these two coresets to construct a small coreset that works for all values of z.
4.2 Our Coreset Construction: SampleCoreset
Using the above results, our strategy is as follows. Let (X, k, z) be an instance of k-means with outliers. If z is small relative to k, then k-means++ already gives a compact coreset, so we can simply run k-means++ on the input instance. Otherwise, z is large, so we first subsample X so that only roughly k outliers survive in expectation. Let S denote the resulting sample. Then we compute a coreset of S of size near-linear in k, where we scale down the number of outliers from z proportionally to the sampling probability.
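A hedged sketch of this strategy follows; the sampling probability k/z (so that roughly k of the z outliers survive in expectation) and the plain subsample in place of the second-stage k-means++ coreset are simplifying assumptions for illustration, not the paper's exact parameters:

```python
import random

def sample_coreset(points, k, z, seed=0):
    """Two-stage coreset sketch: aggressive subsampling when z is large,
    with the outlier budget scaled down proportionally."""
    rng = random.Random(seed)
    if z <= k:
        # k-means++ alone already gives a compact coreset in this regime.
        return list(points), z
    p = k / z  # assumed sampling rate: ~k outliers survive in expectation
    sample = [pt for pt in points if rng.random() < p]
    z_scaled = max(1, round(z * p))  # scaled outlier budget (~k)
    # Stage 2 (placeholder): a k-means++-style weighted coreset of
    # `sample` with the budget z_scaled would be computed here.
    return sample, z_scaled
```

The point of the two stages is that the final size depends only on k: the sample shrinks n by a factor z/k, and the second-stage coreset then only has to handle about k outliers.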
Algorithm 2 formally describes our coreset construction. We will prove that, with constant probability, SampleCoreset outputs a good coreset of size near-linear in k for the k-means with outliers instance (X, k, z). In particular, we will show:
Theorem 4.
With constant probability, SampleCoreset outputs a coreset for the k-means with outliers instance (X, k, z) whose expected size is near-linear in k.
4.3 A Near-Linear Time Algorithm for k-Means with Outliers
Using SampleCoreset, we show how to speed up NKmeans to run in near-linear time. Let X' be the result of SampleCoreset on (X, k, z). Then, to choose centers, we run NKmeans on X' with the original outlier parameter z if z is small, and otherwise with the scaled-down outlier parameter, where A is any approximate k-means algorithm with runtime T(n) on inputs of size n.
Theorem 5.
There exists an algorithm that, with constant probability, outputs a constant-factor approximate solution to k-means with outliers while discarding O(z) outliers, in expected time near-linear in the input size.
5 Experiment Results
This section presents our experimental results. The main conclusions are:

Our algorithm NKmeans almost always has the best performance and finds the largest proportion of ground truth outliers. In the cases where NKmeans is not the best, it is competitive within 5%.

Our algorithm produces a stable solution. Algorithms without theoretical guarantees have unstable objectives in some experiments.

Our coreset construction SampleCoreset allows us to run slower, more sophisticated algorithms with theoretical guarantees on large inputs. Despite their theoretical guarantees, their practical performance is not competitive.
Table 1: Number of synthetic data sets (out of 16) on which each algorithm failed.

                     PrimalDual   kmeans–   Local Search   Uniform Sample   NKmeans
  run time > 4 hrs      9/16        1/16        8/16            0/16          0/16
  precision < 0.8       2/16        0/16        0/16            4/16          0/16
  total failure        11/16        1/16        8/16            4/16          0/16
The experiments show that, for a modest preprocessing overhead, NKmeans makes k-means clustering more robust to noise.
Algorithms Implemented: Our new coreset construction makes it feasible to compare many algorithms on large data sets. Without it, most known algorithms for k-means with outliers become prohibitively slow even on modestly sized data sets. In our experiments, the coreset construction we use is SampleCoreset. More precisely, we first obtain a uniform sample by sampling each point independently with the appropriate probability. Then, we run k-means++ on the sample to choose centers; the resulting coreset has size near-linear in k.
Next we describe the algorithms tested. Besides the coreset construction, for brevity we use k-means++ to mean running k-means++ followed by Lloyd's algorithm. For more details, see Supplementary Material E. In the following, "on coreset" refers to running the algorithm on the coreset as opposed to the entire input. For fair comparison, we ensure each algorithm discards exactly z outliers regardless of its theoretical guarantee: at the end of each algorithm's execution, we discard the z farthest points from the chosen centers as outliers.
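The fair-comparison post-processing step can be sketched directly (the names are ours):

```python
def discard_farthest(points, centers, z):
    """Post-processing used for fair comparison: assign every point to
    its closest center, then report the z points with the largest
    assignment cost as the outliers."""
    def sq_dist(p):
        # Squared distance from p to its closest center.
        return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    order = sorted(points, key=sq_dist)
    cut = len(points) - z
    return order[:cut], order[cut:]  # (kept points, outliers)
```

Because this step is applied uniformly to every algorithm's chosen centers, differences in the reported objective and precision reflect only the quality of the centers.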
Algorithms Tested:

NKmeans (plus k-means++ on coreset): We use NKmeans with k-means++ as the input algorithm A. The algorithm requires a bound on the objective OPT. For this, we considered guesses that are powers of 2 over a suitable range.

k-means++ (on the original input): Note this algorithm is not designed to handle outliers.

k-means++ (on coreset): Same note as above.

Primal-dual algorithm of Charikar et al. (2001) (on coreset): A sophisticated algorithm based on constructing an approximate linear program solution.

Uniform Sample (conservative uniform sampling plus k-means++): We run k-means++ on a small, conservative uniform sample of the input.

kmeans– (Chawla and Gionis, 2013) (on coreset): This algorithm is a variant of Lloyd's algorithm that executes each iteration of Lloyd's excluding the z farthest points.

Local search of Gupta et al. (2017) (on coreset): This is an extension of the well-known k-means local search algorithm.
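Among the baselines above, the kmeans– iteration is simple enough to sketch; the following single Lloyd-style step that excludes the z current farthest points is our simplification of Chawla and Gionis (2013):

```python
def kmeans_minus_minus_step(points, centers, z):
    """One Lloyd-style iteration of the kmeans-- heuristic: drop the z
    points farthest from the current centers, then reassign and update
    centers on the remaining points."""
    def sq(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    def closest(p):
        return min(range(len(centers)), key=lambda i: sq(p, centers[i]))
    # Exclude the z points with the largest current assignment cost.
    inliers = sorted(points, key=lambda p: sq(p, centers[closest(p)]))
    inliers = inliers[: len(points) - z]
    clusters = [[] for _ in centers]
    for p in inliers:
        clusters[closest(p)].append(p)
    # Move each center to the mean of its cluster (keep it if empty).
    return [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else ctr
            for cl, ctr in zip(clusters, centers)]
```

Iterating this step converges to a local optimum, but, as noted in Section 1.2, it carries no approximation guarantee.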
Experiments: We now describe our experiments, which were conducted on both synthetic and real data sets.
Synthetic Data Experiments
We first conducted experiments with synthetic data sets of various parameters. Every data set has n equal to one million points; the values of k and z vary across experiments. We generated k random Gaussian balls: for the i-th Gaussian, we chose a center uniformly at random from a fixed range. These are the true centers. Then, we added points drawn from the i-th Gaussian around each true center. Next, we added noise: noise points were sampled uniformly at random either from the same range or from a larger range, depending on the experiment. We tagged the z points farthest from the true centers as ground-truth outliers. We considered all combinations of the parameter values and the two noise ranges.
Each experiment was conducted several times; we chose the result with the minimum objective and measured the total running time over all runs. We aborted an execution if the algorithm failed to terminate within 4 hours. All experiments were performed on a cluster using a single node with 20 cores at 2301 MHz and 128 GB of RAM. Table 1 shows the number of times each algorithm aborted due to high run time. We also measured the recall, defined as the number of ground-truth outliers reported by the algorithm divided by z, the number of points discarded. Since every algorithm discards exactly z points, the recall equals the precision in all cases, so we use precision in the remaining text. We chose 0.8 as the threshold for acceptable precision and counted the number of inputs for which each algorithm had precision lower than 0.8. Our algorithm NKmeans, k-means++ on the coreset, and k-means++ on the original input all had precision greater than 0.8 on all data sets and always terminated within 4 hours; the k-means++ results are therefore excluded from the table. Details of the quality and runtime are deferred to Supplementary Material E.
Table 2: Results on real data sets. For each algorithm, the three rows give the objective (relative to NKmeans), the precision, and the running time.

                        Skin5   Skin10  Susy5   Susy10  Power5  Power10  KddFull
NKmeans      objective  1       1       1       1       1       1        1
             precision  0.8065  0.9424  0.8518  0.9774  0.6720  0.9679   0.6187
             time       56      56      1136    1144    363     350      1027
kmeans–      objective  0.9740  1.5082  1.2096  1.1414  1.0587  1.0625   2.0259
             precision  0.7632  0.9044  0.8151  0.9753  0.6857  0.9673   0.6436
             time       86      89      672     697     291     251      122
kmeans++     objective  1.0641  1.4417  1.0150  1.0091  1.0815  1.0876   1.5825
(coreset)    precision  0.7653  0.9012  0.8622  0.9865  0.7247  0.9681   0.3088
             time       39      37      462     465     177     142      124
kmeans++     objective  0.9525  1.6676  1.0017  1.0351  1.0278  1.0535   1.5756
(original)   precision  0.7775  0.8975  0.8478  0.9814  0.7116  0.9649   0.3259
             time       34      43      6900    6054    689     943      652
Real Data Experiments
For further experiments, we used real data sets: Skin, Susy, Power, and KddFull. We used the same normalization, the same noise addition method, and the same value of z in all experiments (except KddFull, which contains natural outliers and received no artificial noise). We normalized the data so that the mean and standard deviation are 0 and 1 on each dimension, respectively. Then we sampled points uniformly at random from a bounded range and added them as noise. We discarded data points with missing entries.

Real Data Sets:

Skin (30). About 245K points. Only the first 3 features were used.

Susy (31). 5M points.

Power (18). About 2M points. Out of 9 features, we dropped the first 2, date and time, which denote when the measurements were made.

KddFull (21). About 4.9M points. Each instance has 41 features, and we excluded the non-numeric features. This data set has 23 classes, a small number of which account for the overwhelming majority of the data points. We considered the remaining data points as ground-truth outliers.
Table 2 shows our experimental results on the above real data sets. Due to their high failure rates observed in Table 1, and due to space constraints, we excluded the primal-dual, local search, and conservative uniform sampling algorithms from Table 2; all results can be found in Supplementary Material E. As before, we executed each algorithm several times. It is worth noting that NKmeans is the only algorithm in Table 2 with worst-case guarantees. This gives a candidate explanation for the stability of our algorithm's solution quality across all data sets in comparison to the other algorithms considered.
The results show that our algorithm NKmeans has the best objective on all data sets, except on Skin5 where it is within 5% of the best. Our algorithm is always competitive with the best precision. On KddFull, where we did not add artificial noise, NKmeans significantly outperformed the other algorithms in terms of objective. We can see that NKmeans pays extra running time to remove outliers, but this preprocessing enables stability and competitive performance.
6 Conclusion
This paper presents a near-linear time algorithm for removing noise from data before applying k-means clustering. We show that the algorithm has provably strong guarantees on the number of outliers discarded and on the approximation ratio. Further, NKmeans gives the first pseudo-approximation-preserving reduction from k-means with outliers to k-means without outliers. Our experiments show that the algorithm is the fastest among algorithms with provable guarantees and is more accurate than state-of-the-art algorithms. It is of interest to determine whether the algorithm achieves better guarantees when the data has more structure, such as lying in a low-dimensional Euclidean space or being well-clusterable (Braverman et al., 2011).
7 Acknowledgments
S. Im and M. Montazer Qaem were supported in part by NSF grants CCF-1617653 and 1844939. B. Moseley and R. Zhou were supported in part by NSF grants CCF-1725543, 1733873, 1845146, a Google Research Award, a Bosch junior faculty chair, and an Infor faculty award.
References
 Adaptive sampling for k-means clustering. In RANDOM, pp. 15–28.
 Data clustering: algorithms and applications. Chapman and Hall/CRC.
 k-means++: the advantages of careful seeding. In SODA, pp. 1027–1035.
 Local search heuristics for k-median and facility location problems. SIAM Journal on Computing 33 (3), pp. 544–562.
 Streaming k-means on well-clusterable data. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23–25, 2011, pp. 26–40.
 A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences 65 (1), pp. 129–149.
 Improved combinatorial algorithms for the facility location and k-median problems. In 40th Annual Symposium on Foundations of Computer Science, pp. 378–388.
 Algorithms for facility location problems with outliers. In SODA, pp. 642–651.
 k-means--: a unified approach to clustering and outlier detection. In ICDM, pp. 189–197.
 A practical algorithm for distributed clustering and outlier detection. In Advances in Neural Information Processing Systems, pp. 2248–2256.
 A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pp. 826–835.
 A density-based algorithm for discovering clusters in large spatial databases with noise. In AAAI, pp. 226–231.
 A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, L. Fortnow and S. P. Vadhan (Eds.), pp. 569–578.
 A PTAS for k-means clustering based on weak coresets. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, pp. 11–18.
 Distributed partial clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 143–152.
 Local search methods for k-means with outliers. PVLDB 10 (7), pp. 757–768.
 Epsilon-coresets for clustering (with outliers) in doubling metrics. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7–9, 2018, M. Thorup (Ed.), pp. 814–825.
 Individual household electric power consumption data set.
 Data clustering: a review. ACM Computing Surveys 31, pp. 264–323.
 A local search approximation algorithm for k-means clustering. Computational Geometry 28 (2–3), pp. 89–112.
 KDD Cup 1999 data.
 Grouping multidimensional data. Springer.
 Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25–29, 2018, pp. 646–659.
 A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, FOCS '04, pp. 454–462.
 Distributed clustering for data with heavy noise. In Advances in Neural Information Processing Systems, pp. 7838–7846.
 Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–136.
 Fast distributed k-center clustering with outliers on massive data. In NIPS, pp. 1063–1071.
 Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX, pp. 165–178.
 A k-median algorithm with running time independent of data size. Machine Learning 56 (1–3), pp. 61–87.
 Skin segmentation data set.
 SUSY data set.
Supplementary Material for: Fast Noise Removal for k-Means Clustering
Appendix A Analysis of NKmeans
The goal of this section is to prove Theorem 1. For the remainder of this section, let denote the optimal solution and denote the output of . Again, let .
We first show the benefits of optimal clusters having size at least .
Claim 1.
For each optimal center , let be the closest input point to . If the cluster defined by has size at least , then .
Proof.
Assume for contradiction that . Thus for each input point that is assigned to center in the optimal solution, we have . There are at least such points, so we can lower bound the assignment cost of these points by . This is a contradiction. ∎
Lemma 2.
If the cluster defined by in the optimal solution has size at least , then is heavy.
Proof.
Assume for contradiction that is light, so . However, at least points are assigned to in the optimal solution, so there are at least such points outside of .
Let be such a point that is assigned to in the optimal solution. By the triangle inequality, we have:
, which implies .
We conclude that for at least points assigned to in the optimal solution, their assignment costs are each at least . This is a contradiction. ∎
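Several of the proofs in this section bound squared distances via the triangle inequality. For reference, the standard form of this step, valid for any points $x$, $y$, $z$ in a metric space, is:

```latex
d(x,z)^2 \;\le\; \big(d(x,y) + d(y,z)\big)^2 \;\le\; 2\,d(x,y)^2 + 2\,d(y,z)^2 .
```

The second inequality is the elementary bound $(a+b)^2 \le 2a^2 + 2b^2$.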
Now using this result, we can upper bound the number of outliers required by to remain competitive with the optimal solution (we will show that this quantity is upper bounded by the size of at the end of .)
Lemma 3.
At the end of , .
Proof.
Let .
For each
, we will classify points into two types:

:
We have that satisfies for some . If the cluster defined by has size at least , then by Lemma 2, is heavy.
Further, , so . Thus, we will not add to if its nearest optimal cluster has size at least .

:
We claim that there are at most such satisfying . Assume for contradiction that there are at least points with . At most of these points can be outliers, so the optimal solution must cluster at least of these points. Thus we can lower bound the assignment cost of these points to by:
This is a contradiction.
We conclude that includes no points of type 1 from clusters of size at least , at most points from each cluster of size less than , and at most points of type 2. ∎
Corollary 2.
If every optimal cluster has size at least , then at the end of , .
It remains to bound the cost of . Recall that the cost of is the cost of clustering with excluding the points of largest assignment cost.
Intuitively, we do not need to worry about the points in that are clustered in both the solution and the solution – so the points in , because such points are paid for in both solutions.
We must take some care to bound the cost of the points in that are clustered by the solution but are outliers in the solution , because such points could have unbounded assignment costs to . Here we will use the following property of heavy points:
Lemma 4.
Let be a heavy point. Then there exists some optimal center such that .
Proof.
Assume for contradiction that for every . However, is heavy, so . At least points in must be clustered by the optimal solution .
Consider any such . By the triangle inequality, we have
This implies . Thus we can lower bound the assignment cost to of all points in by:
This is a contradiction. ∎
Now we are ready to prove the main theorem of this section.
Proof of Theorem 1.
By Corollary 2, we have .
Further, by construction, is an approximate means solution on . Then
so it suffices to show that .
We will consider two types of points:

, so points in that are also clustered in the optimal solution :
We have

, so points in that are outliers in the optimal solution :
Observe that by definition, , so there are at most such . By Lemma 2, for each such , we have . Thus,
We conclude that , as required. ∎
Appendix B Analysis of Coreset Construction and Near Linear Time Algorithm
The goal of this section is to prove Theorems 4 and 5. In our proof, we will use Theorems 2 and 3. For proofs of these theorems, see Sections C and D.
Proof of Theorem 4.
We consider cases: and .
If , then . Because , we have . Then , as required. Further, by Theorem 2, is a coreset for with constant probability.
Otherwise, if , then . Thus, , as required. By Theorem 3, with probability at least , is an coreset for . For the remainder of this analysis, we assume this condition holds. We also know that is a coreset for with constant probability. Assume this holds for the remainder of the analysis.
Let be a set of centers satisfying . Because is an coreset for , this implies:
Because is an coreset for , we conclude:
Thus is an coreset for . ∎
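The final step is the standard composition property of coresets. Schematically, if $S_1$ approximates the cost of the full data $X$ within a $1 \pm \epsilon$ factor and $S_2$ approximates $S_1$ within a $1 \pm \epsilon$ factor (the symbols here are generic placeholders chosen for illustration), then for any candidate set of centers $C$:

```latex
(1-\epsilon)^2\,\mathrm{cost}(X,C) \;\le\; \mathrm{cost}(S_2,C) \;\le\; (1+\epsilon)^2\,\mathrm{cost}(X,C) .
```

Since $(1+\epsilon)^2 \le 1+3\epsilon$ for $\epsilon \le 1$, the composed error remains $O(\epsilon)$.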
Proof of Theorem 5.
To analyze the runtime, note that we can compute in time . It is known that takes time (Arthur and Vassilvitskii, 2007; Aggarwal et al., 2009). Thus the runtime of SampleCoreset is dominated by the runtime of k-means++ in both cases when and , which takes time.
Note that has size in expectation, so by Lemma 1, NKmeans can be implemented to run in time on in expectation. ∎
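For context, k-means++ seeding (Arthur and Vassilvitskii, 2007) chooses each new center by D² sampling: a point is picked with probability proportional to its squared distance to the nearest center chosen so far, which takes time proportional to the number of points times k times the dimension. A minimal sketch:

```python
import random

def kmeans_pp_seed(points, k, seed=0):
    """D^2 sampling: pick k initial centers, each new center drawn
    with probability proportional to its squared distance to the
    nearest center chosen so far."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [rng.choice(points)]
    d2 = [dist2(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:  # all points coincide with a chosen center
            centers.append(centers[0])
            continue
        # Sample an index i with probability d2[i] / total.
        r = rng.random() * total
        acc = 0.0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                break
        centers.append(points[i])
        # Update each point's distance to its nearest chosen center.
        d2 = [min(d2[j], dist2(points[j], points[i]))
              for j in range(len(points))]
    return centers
```

On two well-separated groups of points, the D² weights force the second center into the group not containing the first, which is the intuition behind the seeding guarantee.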
Appendix C Proof of Theorem 2
Lemma 5.
Let . Then with probability at least .
Corollary 3.
Let . Then with probability at least .
Proof.
Let be the optimal solution to the means with outliers instance . Note that is the set of outliers in the optimal solution, so .
Then we have . Combining this inequality with the above lemma gives the desired result. ∎
Using the above corollary, we can prove Theorem 2 by a moving argument:
Proof of Theorem 2.
Let with weights as defined in the theorem statement. By the above corollary, we have with constant probability. We assume for the remainder of the proof that this condition holds.
Let be any set of centers such that . We wish to bound .
Note that by definition of , , and each weight is an integer. Thus for the remainder of the proof we interpret as a multiset such that for each , there are copies of in the multiset.
It follows that we can associate each with a unique such that (so is a unique copy of the center that is assigned to in the clustering of with centers ).
Now we partition into two sets:
For each , we want to bound its assignment cost. There are two cases:

:
We can bound