Fast Noise Removal for k-Means Clustering

03/05/2020, by Sungjin Im et al.

This paper considers k-means clustering in the presence of noise. It is known that k-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called k-means clustering with outliers. The goal of k-means clustering with outliers is to discard up to a specified number z of points as noise/outliers and then find a k-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any k-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from k-means with outliers to k-means without outliers. Secondly, we show how to construct a coreset of size O(k log n). When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.


1 Introduction

Clustering is a fundamental unsupervised learning method that offers a compact view of data sets by grouping similar input points. Among various clustering methods, k-means clustering is one of the most popular in practice. It is defined as follows: given a set of points in Euclidean space (the input space can be extended to an arbitrary metric space) and a target number of clusters k, the goal is to choose a set C of k centers so as to minimize the sum of the squared distances of every point to its closest center in C (the squared-ℓ2 loss).

Due to its popularity, k-means clustering has been extensively studied for decades both theoretically and empirically, and as a result, various novel algorithms and powerful underlying theories have been developed. In particular, because the clustering problem is NP-hard, several constant-factor approximation algorithms have been developed (Charikar and Guha, 1999; Kanungo et al., 2004; Kumar et al., 2004; Feldman et al., 2007), meaning that their output is always within a constant factor of the optimum. One of the most successful algorithms used in practice is k-means++ (Arthur and Vassilvitskii, 2007). The algorithm k-means++ is a preprocessing step used to set the initial centers when using Lloyd's algorithm (Lloyd, 1982). Lloyd's algorithm is a simple local search heuristic that alternates between updating the center of every cluster and reassigning points to their closest centers. By carefully choosing the initial centers, k-means++ has a provable approximation guarantee of O(log k).

k-means clustering is highly sensitive to noise, which is present in many data sets. Indeed, it is not difficult to see that the k-means clustering objective can vary significantly with the addition of even a single point that is far away from the true clusters. In general, it is a non-trivial task to filter out noise; without knowing the true clusters, we cannot identify the noise, and vice versa. While there are other clustering methods, such as density-based clustering (Ester et al., 1996), that attempt to remove noise, they do not replace k-means clustering because they are fundamentally different from k-means.

Consequently, there have been attempts to study k-means clustering in the presence of noise. The following problem formulation is the most popular in the theory (Chen, 2008; Charikar et al., 2001; McCutchen and Khuller, 2008; Guha et al., 2017), machine learning (Malkomes et al., 2015; Chawla and Gionis, 2013; Li and Guo, 2018), and database (Gupta et al., 2017) communities. Note that traditional k-means clustering is the special case of this problem with z = 0 outliers. Throughout, for two points p and q, we let d(p, q) denote the distance between p and q. For a subset of points S and a point p, let d(p, S) = min_{q in S} d(p, q).

Definition 1 (k-Means with Outliers).

In this problem we are given as input a set X of n points in Euclidean space, a parameter k (the number of centers), and a parameter z (the number of outliers). The goal is to choose a set C of k centers so as to minimize sum_{x in X(C, z)} d(x, C)^2, where X(C, z) is the subset of n - z input points with the smallest distances to C.
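To make the objective concrete, the following small NumPy sketch (illustrative only, not the authors' code) evaluates the k-means with outliers cost of a candidate center set C: it assigns every point to its nearest center, drops the z largest assignment costs, and sums the rest.

import numpy as np

def outlier_kmeans_cost(X, C, z):
    """Cost of clustering X (n x d) with centers C (k x d), ignoring the z worst points."""
    # Squared distances from every point to every center: shape (n, k).
    sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    assign_cost = sq_dists.min(axis=1)          # squared distance to the nearest center
    keep = np.sort(assign_cost)[: len(X) - z]   # drop the z largest assignment costs
    return keep.sum()

For z = 0 this reduces to the standard k-means objective.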

Because this problem generalizes k-means clustering, it is NP-hard, and in fact it turns out to be significantly more challenging. The only known constant approximations (Chen, 2008; Krishnaswamy et al., 2018) are highly sophisticated and are based on complicated local search or linear program rounding. They are unlikely to be implemented in practice due to their runtime and complexity. Therefore, there have been strong efforts to develop simpler algorithms that offer good approximation guarantees when allowed to discard more than z points as outliers (Charikar et al., 2001; Meyerson et al., 2004; Gupta et al., 2017), as well as heuristics (Chawla and Gionis, 2013). Unfortunately, all existing algorithms with theoretical guarantees suffer from either high running time or an inherent loss in solution quality.

1.1 Our Results and Contributions

The algorithmic contribution of this paper is two-fold, and these contributions are further validated by experiments. In this section, we state our contributions and discuss them in comparison to previous work.

Simple Preprocessing Step for Removing Outliers with Provable Guarantees:

In this paper we develop a simple preprocessing step, which we term NK-means, to effectively filter out outliers. NK-means stands for noise removal for k-means. Our proposed preprocessing step can be combined with any algorithm for k-means clustering. Despite the large amount of work on this problem, we give the first reduction from k-means with outliers to the standard k-means problem. In particular, NK-means can be combined with the popular k-means++. The resulting algorithm is the fastest known algorithm for the k-means with outliers problem, and its speed and simplicity give it the potential to be used in practice. Formally, given an α-approximation algorithm for k-means clustering, we give an algorithm for k-means with outliers that is guaranteed to discard at most O(kz) points such that the cost of the remaining points is at most O(α) times the optimum that discards exactly z points. While the theoretical guarantee on the number of discarded outliers is larger than z on worst-case inputs, we show that NK-means discards at most O(z) outliers under the assumption that every cluster in an optimal solution has Ω(z) points. We believe that this assumption captures most practical cases, since otherwise significant portions of the true clusters could be discarded as outliers. In an actual implementation, we can guarantee discarding exactly z points by discarding the z farthest points from the centers we have chosen. It is worth keeping in mind that all practical algorithms for the problem discard more than z points in order to have theoretical guarantees (Charikar et al., 2001; Meyerson et al., 2004; Gupta et al., 2017).

New Coreset Construction:

When the data set is large, a dominant way to speed up clustering is to first construct a coreset and then use the clustering result of the coreset as a solution to the original input. Informally, a set X' of (weighted) points is called a coreset of the input X if a good clustering of X' is also a good clustering of X (see Section 4.1 for the formal definition of a coreset).

The idea is that if we can efficiently construct such an X' that is significantly smaller than X, then we can speed up any clustering algorithm with little loss of accuracy. In this paper, we give an algorithm to construct a coreset of size O(k log n) for k-means with outliers. Importantly, the coreset size is independent of z and d, the number of outliers and the dimension, respectively.

Experimental Validation:

Our new coreset enables the implementation and comparison of all potentially practical algorithms, which are based on primal-dual (Charikar et al., 2001), uniform sampling (Meyerson et al., 2004), or local search (Chawla and Gionis, 2013; Gupta et al., 2017). It is worth noting that, to the best of our knowledge, this is the first paper to implement the primal-dual based algorithm (Charikar et al., 2001) and test it on large data sets. We also implemented natural extensions of k-means++ and our algorithm NK-means. For a fair comparison, once each algorithm chose its centers, we considered all points and discarded the z farthest points. Our experiments show that NK-means consistently outperforms the other algorithms on both synthetic and real-world data sets with little running time overhead compared to k-means++.

1.2 Comparison to the Previous Work

Algorithms for k-Means with Outliers:

To understand the contribution of our work, it is important to contrast our algorithm with previous work. We believe a significant contribution of our work is the algorithmic simplicity and speed, as well as the theoretical bounds that our approach guarantees. In particular, we discuss why the previous algorithms are difficult to use in practice.

The first potentially practical algorithm developed is based on primal-dual (Charikar et al., 2001). Instead of solving a linear program (LP) and converting the solution to an integer solution, the primal-dual approach only uses the LP and its dual to guide the algorithm. However, the algorithm does not scale well and is not easy to implement. In particular, it involves increasing variables uniformly, which is computationally expensive and requires extra care to handle the precision issues of fractional values. As mentioned before, this algorithm was never implemented prior to this paper. Our experiments show that it considerably under-performs compared to the other algorithms.

The second potentially practical algorithm is based on uniform sampling (Meyerson et al., 2004). The main observation of Meyerson et al. (2004) is that if every cluster is large enough, then a small uniform sample can serve as a coreset. This observation leads to two algorithms for k-means clustering with outliers: (i) an (implicit) reduction to k-means clustering via conservative uniform sampling, and (ii) an (explicit) aggressive uniform sampling followed by the primal-dual algorithm (Charikar et al., 2001). In (i), it can be shown that a constant approximate k-means clustering of a small conservative uniform sample is a constant approximation for k-means clustering with outliers, under the assumption that every cluster is sufficiently large. Here, the main idea is to avoid sampling any noise by sampling conservatively. Although this assumption is reasonable as discussed before, the real issue is that conservative uniform sampling does not give a sufficiently accurate sketch to be adopted in practice: with z noise points, the sample must be kept to roughly n/z points to avoid including noise, which can be far too few to represent the clusters accurately. In (ii), a more aggressive uniform sampling is used, followed by the primal-dual algorithm (Charikar et al., 2001); the sample is larger, and the expected number of outliers in it scales down proportionally. This aggressive uniform sampling turns out to lose very little in terms of accuracy. However, as mentioned before, the primal-dual algorithm under-performs compared to the other algorithms in both speed and accuracy.

Another line of algorithmic development is based on local search (Chawla and Gionis, 2013; Gupta et al., 2017). The algorithm in Chawla and Gionis (2013) guarantees convergence to a local optimum, but has no approximation guarantee. The other algorithm (Gupta et al., 2017) is an O(1)-approximation, but theoretically it may end up discarding O(kz log n) points as outliers. These local search algorithms are considerably slower than our method, and their theoretical guarantees require discarding many more points.

To summarize, there is a need for a fast and effective algorithm for k-means clustering with outliers.

Coresets for k-Means with Outliers:

The other main contribution of our work is a coreset for k-means with outliers of size O(k log n), independent of the number of outliers z and the dimension d.

The notion of coreset we consider is related to the concept of a weak coreset in the literature; see, e.g., Feldman and Langberg (2011) for a discussion of weak coresets and other types of coresets. Previous coreset constructions (some for stronger notions of coreset) have polynomial dependence on the number of outliers (Gupta et al., 2017), inverse polynomial dependence on the fraction of outliers (Meyerson et al., 2004; Huang et al., 2018), or polynomial dependence on the dimension (Huang et al., 2018). Thus, all coresets constructed in previous work can have large size for some values of z, e.g. when z is a constant fraction of n, or for large values of d. In contrast, our construction is efficient for all values of z and yields coresets of size O(k log n) with no dependence on z or d.

1.3 Overview of Our Algorithms: NK-means  and SampleCoreset

Our preprocessing step, NK-means, is reminiscent of density-based clustering. Our algorithm tags an input point as light if it has relatively few points around it. Formally, a point is declared light if it has fewer than 2z points within a certain distance threshold r, which can be set by binary search. Then a point is discarded if it has only light points within distance r. We emphasize that the threshold is chosen by the algorithm, not by the user, unlike in density-based clustering. While our preprocessing step looks similar to the algorithm for k-center clustering with outliers (Charikar et al., 2001), which optimizes the ℓ∞-loss, we find it surprising that a similar idea can be used for k-means clustering.

It can take considerable time to label each point as light or not. To speed up our algorithm, we develop a new coreset construction for k-means with outliers. The idea is relatively simple. We first use aggressive sampling as in Meyerson et al. (2004). The resulting sample includes only O(k log n) outliers with high probability. Then we use k-means++ to obtain the coreset centers. As a result, we obtain a high-quality coreset of size O(k log n). Interestingly, to the best of our knowledge, combining aggressive sampling with another coreset construction for k-means with outliers has not been considered in the literature.

1.4 Other Related Work

Due to the vast literature on clustering, we refer the reader to Aggarwal and Reddy (2013); Kogan et al. (2006); Jain et al. (1999) for an overview and survey of the literature. k-means clustering can be generalized by considering other norms of the loss, and such extensions have been studied under different names. When the objective is the ℓ1-norm loss, the problem is called k-median. The k-median and k-means clustering problems are closely related, and in general the algorithm and analysis for one can be readily translated into one for the other with a constant factor loss in the approximation ratio. Constant approximations are known for k-median and k-means based on linear programming, primal-dual, and local search (Arya et al., 2004; Charikar et al., 2002; Charikar and Guha, 1999). While its approximation ratio is O(log k), the k-means++ algorithm is widely used in practice for k-means clustering due to its practical performance and simplicity. When the loss function is the maximum distance (ℓ∞), the problem is known as k-center, and a 3-approximation is known for k-center clustering with outliers (Charikar et al., 2001). For recent work on these outlier problems in distributed settings, see Malkomes et al. (2015); Li and Guo (2018); Guha et al. (2017); Chen et al. (2018).

2 Preliminaries

In this paper we consider the Euclidean k-means with outliers problem as defined in the introduction. Note that the ℓ2 distance satisfies the triangle inequality, so for all points x, y, w we have d(x, w) ≤ d(x, y) + d(y, w). Further, the approximate triangle inequality will be useful in our analyses; it follows from the triangle inequality together with the bound (a + b)^2 ≤ 2a^2 + 2b^2, giving d(x, w)^2 ≤ 2 d(x, y)^2 + 2 d(y, w)^2. Given a set of centers C, we say that the assignment cost of a point x to C is d(x, C)^2. For k-means with outliers, a set C of centers naturally defines a clustering of the input points as follows:

Definition 2 (Clustering).

Let C = {c_1, ..., c_k} be a set of centers. A clustering of X defined by C is a partition X_1, ..., X_k of X satisfying: for all i and all x in X_i, d(x, c_i) = d(x, C), where ties between the c_i's are broken arbitrarily but consistently.

In summary, for the k-means with outliers problem, given a set of centers C, we assign each point in X to its closest center in C. Then we exclude the z points of X with the highest assignment costs from the objective function (these points are our outliers). This procedure defines a clustering of X with z outliers.

Notation: For a positive integer m, we define [m] = {1, ..., m}. Recall from the introduction that for any finite set S and point p, we define d(p, S) = min_{q in S} d(p, q). For any point p and radius r ≥ 0, we define the ball centered at p with radius r by B(p, r) = {x : d(x, p) ≤ r}. For a set of centers C and an integer z, we define the z-cost of C by cost_z(X, C) = sum_{x in X(C, z)} d(x, C)^2, where, as before, X(C, z) is the subset of points of X excluding the z points with the highest assignment costs. Thus the z-cost of C is the cost of clustering X with C while excluding the z points with the highest assignment costs. As shorthand, when z = 0 – so when we consider the k-means problem without outliers – we denote the cost of clustering X with C by cost(X, C). Further, we say a set C* of k centers is an optimal (k, z)-solution if it minimizes cost_z(X, C) over all choices of k centers C. We define OPT_z to be the optimal objective value of the k-means with outliers instance (X, k, z). Analogously, for the k-means without outliers problem, we denote the optimal objective value of the k-means instance (X, k) by OPT.

3 NK-means Algorithm

In this section, we describe our algorithm, NK-means, which turns an algorithm for k-means without outliers into an algorithm for k-means with outliers in a black-box fashion. We note that the algorithm naturally extends to k-median with outliers and to general metric spaces. For the remainder of this section, let X, k, and z define an instance of k-means with outliers.

Algorithm Intuition: The guiding intuition behind our algorithm is as follows. We consider a ball of radius r around each point p. If this ball contains many points, then p is likely not an outlier in the optimal solution.

More concretely, if there are at least 2z points in p's ball, then at most z of these points can be outliers in the optimal solution. This means that the majority of p's neighbourhood consists of real (non-outlier) points in the optimal solution, so we can bound the assignment cost of p to the optimal centers. We call such points heavy.
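To see why, here is a sketch using the 2z threshold above (r is the ball radius and C* an optimal set of centers; the formal analysis is in Appendix A): at least z points of B(p, r) are clustered by the optimal solution, so the cheapest of them, say q, has d(q, C*)^2 ≤ OPT_z / z, and by the approximate triangle inequality

    d(p, C^*)^2 \;\le\; 2\, d(p, q)^2 + 2\, d(q, C^*)^2 \;\le\; 2r^2 + \frac{2\,\mathrm{OPT}_z}{z}.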

There are two main steps to our algorithm. First, we use the concept of heavy points to decide which points are real points and which are outliers. Then we run a k-means approximation algorithm on the real points.

Formal Algorithm: Now we formally describe our algorithm NK-means. As input, NK-means takes a k-means with outliers instance (X, k, z) and an algorithm A for k-means without outliers, where A takes an instance of k-means as input.

We will prove that if A is an α-approximation for k-means and the optimal clusters are sufficiently large with respect to z, then NK-means outputs a good clustering that discards O(z) outliers. More precisely, we prove the following theorem about the performance of NK-means:

Theorem 1.

Let C be the output of NK-means(A). Suppose that A is an α-approximation for k-means. If every cluster in the clustering defined by the optimal (k, z)-solution has size Ω(z), then cost_{O(z)}(X, C) ≤ O(α) · OPT_z.

Corollary 1.

Let C be the output of NK-means(A). Suppose that A is an α-approximation. Then cost_{O(kz)}(X, C) ≤ O(α) · OPT_z.

In other words, NK-means gives a pseudo-approximation-preserving reduction from k-means with outliers to k-means, where any approximation algorithm for k-means yields a pseudo-approximation for k-means with outliers that throws away O(kz) points as outliers.

1:  Suppose we know the optimal objective value OPT_z
2:  Initialize Z_out ← ∅ and set the radius threshold r based on OPT_z and z
3:  for each p ∈ X do
4:     Compute |B(p, r) ∩ X|
5:     if |B(p, r) ∩ X| ≥ 2z then
6:        Mark p as heavy
7:     end if
8:  end for
9:  for each p ∈ X do
10:     if B(p, r) contains no heavy points then
11:        Update Z_out ← Z_out ∪ {p}
12:     end if
13:  end for
14:  Output A(X \ Z_out, k)
Algorithm 1 NK-means(A) for k-means with outliers
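The following Python/NumPy sketch mirrors Algorithm 1 (illustrative only, not the authors' implementation). The radius choice r = sqrt(OPT_z / z) and the heaviness threshold 2z are assumptions consistent with the intuition above; kmeans_algo stands for any k-means routine A (e.g., k-means++ followed by Lloyd's algorithm).

import numpy as np

def nk_means(X, k, z, opt_guess, kmeans_algo):
    """Sketch of NK-means: filter likely outliers, then cluster the remaining points.

    X           : (n, d) array of input points
    opt_guess   : a guess of the optimal objective OPT_z
    kmeans_algo : any k-means solver taking (points, k) and returning k centers
    """
    r = np.sqrt(opt_guess / z)                # assumed radius threshold (illustrative choice)
    # Quadratic-time/memory implementation, as described in Section 3.1; intended for
    # moderately sized inputs such as the coreset of Section 4.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    in_ball = sq_dists <= r * r               # in_ball[i, j]: does x_j lie in B(x_i, r)?
    heavy = in_ball.sum(axis=1) >= 2 * z      # heavy: at least 2z points inside the ball
    discard = ~(in_ball & heavy[None, :]).any(axis=1)   # no heavy point within distance r
    centers = kmeans_algo(X[~discard], k)     # run A on the points kept as "real"
    return centers, discard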

3.1 Implementation Details

Here we describe a simple implementation of NK-means that achieves O(n² + T(n)) runtime assuming we know the optimal objective value OPT_z, where T(n) is the runtime of the algorithm A on inputs of size n. This assumption can be removed by running the algorithm for many guesses of OPT_z, say by trying all powers of 2, which yields a 2-approximation of OPT_z for the correct guess.
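One simple way to realize this geometric search (a sketch building on the illustrative nk_means and outlier_kmeans_cost helpers above; the search range endpoints lo and hi are placeholders): try powers of 2 as guesses of OPT_z and keep the solution with the smallest resulting objective.

import numpy as np

def nk_means_with_guessing(X, k, z, kmeans_algo, lo=1e-6, hi=1e12):
    """Try geometrically spaced guesses of OPT_z and keep the best resulting solution."""
    best_cost, best_centers = np.inf, None
    guess = lo
    while guess <= hi:
        centers, _ = nk_means(X, k, z, guess, kmeans_algo)
        cost = outlier_kmeans_cost(X, centers, z)   # evaluate after discarding the z farthest points
        if cost < best_cost:
            best_cost, best_centers = cost, centers
        guess *= 2          # powers of 2: the correct guess is approximated within a factor of 2
    return best_centers, best_cost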

For our experiments, we implement the loop in Line 3 by enumerating over all pairs of points and computing their distance. This step takes O(n²) time. We implement the loop in Line 9 by enumerating, for each point p, over all elements of B(p, r) and checking whether any of them is heavy. This step also takes O(n²) time. Running A on X \ Z_out takes T(n) time. We summarize the result of this section in the following lemma:

Lemma 1.

Assuming that we know OPT_z and that A takes T(n) time on inputs of size n, NK-means can be implemented to run in O(n² + T(n)) time.

4 Coreset of Near Linear Size in k

In this section we develop a general framework to speed up any k-means with outliers algorithm, and we apply this framework to NK-means to show that we can achieve near linear runtime. In particular, we achieve this by constructing a coreset for the k-means with outliers problem of size O(k log n), which is independent of the number of outliers z.

4.1 Coresets for k-Means with Outliers

Our coreset construction will leverage existing constructions of coresets for k-means with outliers. A coreset gives a good summary of the input instance in the following sense:

Definition 3 (Coreset for k-Means with Outliers).

Let (X, k, z) be an instance of k-means with outliers and let X' be a (possibly weighted) subset of X. We say the k-means with outliers instance (X', k, z') is an α-coreset for (X, k, z) if, for any set C of k centers whose cost on (X', k, z') is at most γ times the optimum of (X', k, z') for some γ ≥ 1, the cost of C on (X, k, z) is at most αγ times the optimum of (X, k, z).

(Note that our definition of coreset is parameterized by the number of outliers, in contrast to previous work such as Meyerson et al. (2004) and Huang et al. (2018), whose constructions are parameterized by the fraction of outliers.)

In words, if (X', k, z') is an α-coreset for (X, k, z), then running any γ-approximate k-means with outliers algorithm on the coreset instance (meaning the algorithm throws away z' outliers and outputs a solution with cost at most γ times the coreset optimum) gives an αγ-approximate solution to (X, k, z).

Note that if X' is a weighted set with weights w, then the k-means with outliers problem is defined analogously, where the objective is a weighted sum of assignment costs, sum_x w(x) · d(x, C)^2, and the excluded outliers have total weight z. Further, note that NK-means generalizes naturally to weighted k-means with outliers with the same guarantees.

The two coresets we will utilize in our construction are based on k-means++ (Aggarwal et al., 2009) and Meyerson's sampling coreset (Meyerson et al., 2004). The guarantees of these coresets are as follows:

Theorem 2 (k-means++).

Let C' = kmeans++(X, m) denote the result of running k-means++ on the input points X to obtain a set C' of m centers, and let X_1, ..., X_m be the clustering of X defined by the centers c'_1, ..., c'_m, respectively. We define a weight function w by w(c'_i) = |X_i| for all i. Suppose m = O(k + z). Then with constant probability, the instance (C', k, z), where C' has weights w, is an O(1)-coreset for the k-means with outliers instance (X, k, z).

Theorem 3 (Sampling).

Let X' be a sample from X, where every x ∈ X is included in X' independently with probability p (with p sufficiently large as a function of k, log n, and z/n). Then with constant probability, the instance (X', k, O(pz)) is an O(1)-coreset for (X, k, z).

Observe that k-means++ gives a coreset of size O(k + z), and uniform sampling gives a coreset of size pn in expectation. If z is small, then k-means++ gives a very compact coreset for k-means with outliers, but if z is large – say a constant fraction of n – then k-means++ gives a coreset of linear size. However, the case where z is large is exactly the case in which uniform sampling gives a small coreset.

In the next section, we show how to combine these two coresets to construct a small coreset that works for all values of z.

4.2 Our Coreset Construction: SampleCoreset

1:  Let τ = Θ(k log n)
2:  if z ≤ τ then
3:     Output kmeans++(X, O(k + z)) with weights as in Theorem 2
4:  else
5:     Let X' be a sample drawn from X, where each x ∈ X is included in X' independently with probability p = τ / z
6:     Output kmeans++(X', O(k + τ)) with weights as in Theorem 2 and the number of outliers scaled down to pz
7:  end if
Algorithm 2 Coreset Construction (SampleCoreset) for k-Means with Outliers

Using the above results, our strategy is as follows. Let (X, k, z) be an instance of k-means with outliers. If z = O(k log n), then the k-means++ coreset already has size O(k log n), so we can simply run k-means++ on the input instance to get a good coreset. Otherwise, z is large, so we first subsample the input, keeping each point with probability proportional to (k log n)/z. Let X' denote the resulting sample. Then we compute a k-means++ coreset on X' of size O(k log n), where we scale down the number of outliers from z proportionally.
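A rough NumPy sketch of this two-case strategy follows (illustrative only; the threshold tau, the sampling probability, and the generic kmeanspp routine used to pick and weight the coreset centers are assumptions, not the authors' exact choices).

import numpy as np

def sample_coreset(X, k, z, kmeanspp, c=1.0):
    """Sketch of SampleCoreset: returns (coreset points, weights, scaled outlier count)."""
    n = len(X)
    tau = int(c * k * np.log(n)) + 1          # target outlier count in the coreset (assumed threshold)
    if z <= tau:                              # case 1: z is already small
        sample, z_scaled = X, z
    else:                                     # case 2: aggressive uniform subsampling
        p = tau / z
        mask = np.random.rand(n) < p
        sample, z_scaled = X[mask], int(np.ceil(p * z))
    centers = kmeanspp(sample, k + tau)       # k-means++ seeding picks the coreset points
    # Weight each coreset point by the number of sampled points assigned to it.
    sq = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    weights = np.bincount(sq.argmin(axis=1), minlength=len(centers))
    return centers, weights, z_scaled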

Algorithm 2 formally describes our coreset construction. We will prove that, with constant probability, SampleCoreset outputs a good coreset of size O(k log n) for the k-means with outliers instance (X, k, z). In particular, we will show:

Theorem 4.

With constant probability, SampleCoreset outputs an O(1)-coreset for the k-means with outliers instance (X, k, z) of size O(k log n) in expectation.

4.3 A Near Linear Time Algorithm for k-Means With Outliers

Using SampleCoreset, we show how to speed up NK-means to run in near linear time: we first run SampleCoreset(X, k, z), and then run NK-means(A) on the resulting coreset instance (with the correspondingly scaled number of outliers), where A is any α-approximate k-means algorithm with runtime T(n) on inputs of size n.
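Putting the pieces together, the near linear time pipeline looks roughly as follows (a sketch built from the illustrative helpers above; kmeanspp again stands for any k-means++ implementation, and the coreset weights are ignored for simplicity).

def fast_kmeans_with_outliers(X, k, z, kmeanspp):
    """SampleCoreset + NK-means: choose k centers for X while tolerating z outliers."""
    coreset, weights, z_scaled = sample_coreset(X, k, z, kmeanspp)
    # Run NK-means (with OPT guessing) on the much smaller coreset.
    centers, _ = nk_means_with_guessing(coreset, k, z_scaled, kmeanspp)
    # Finally, discard the z farthest original points from the chosen centers as outliers.
    nearest_sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    outliers = nearest_sq.argsort()[-z:]
    return centers, outliers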

Theorem 5.

There exists an algorithm that, with constant probability, outputs an O(α)-approximate solution to k-means with outliers while discarding O(kz) outliers, and runs in expected time near linear in n.

5 Experiment Results

This section presents our experimental results. The main conclusions are:

  • Our algorithm NK-means  almost always has the best performance and finds the largest proportion of ground truth outliers. In the cases where NK-means  is not the best, it is competitive within 5%.

  • Our algorithm results in a stable solution. Algorithms without theoretical guarantees have unstable objectives on some experiments.

  • Our coreset construction SampleCoreset  allows us to run slower, more sophisticated, algorithms with theoretical guarantees on large inputs. Despite their theoretical guarantees, their practical performance is not competitive.

Algorithm           Primal-Dual   k-means--   Local Search   Uniform Sample   NK-means
run time > 4 hrs       9/16          1/16         8/16            0/16           0/16
precision < 0.8        2/16          0/16         0/16            4/16           0/16
total failure         11/16          1/16         8/16            4/16           0/16

Table 1: Failure rates due to high run time or low precision.

The experiments show that, for a modest preprocessing overhead, NK-means makes k-means clustering more robust to noise.

Algorithms Implemented: Our new coreset construction makes it feasible to compare many algorithms on large data sets. Without it, most known algorithms for k-means with outliers become prohibitively slow even on modestly sized data sets. In our experiments, the coreset construction we utilize is SampleCoreset. More precisely, we first obtain a uniform sample by sampling each point independently with the sampling probability described in Section 4.2. Then, we run k-means++ on the sample to choose the coreset centers; the resulting coreset has size O(k log n).

Next we describe the algorithms tested. Besides the coreset construction, we use k-means++ as shorthand for running k-means++ and then Lloyd's algorithm. For more details, see Supplementary Material E. In the following, "on coreset" refers to running the algorithm on the coreset as opposed to the entire input. For a fair comparison, we ensure that each algorithm discards exactly z outliers regardless of its theoretical guarantee: at the end of each algorithm's execution, we discard the z farthest points from the chosen centers as outliers.

Algorithms Tested:

  1. NK-means (plus k-means++ on coreset): We use NK-means with k-means++ as the input algorithm A. The algorithm requires a bound on the objective OPT_z; for this, we considered guesses that are powers of 2 over a suitable range.

  2. k-means++ (on the original input): Note this algorithm is not designed to handle outliers.

  3. k-means++ (on coreset): Same note as the above.

  4. Primal-dual algorithm of Charikar et al. (2001) (on coreset): A sophisticated algorithm based on constructing an approximate linear program solution.

  5. Uniform Sample (conservative uniform sampling plus k-means++): We run k-means++ on a conservative uniform sample of the input.

  6. k-means-- (Chawla and Gionis, 2013) (on coreset): This algorithm is a variant of Lloyd's algorithm that executes each iteration of Lloyd's while excluding the z farthest points.

  7. Local search of Gupta et al. (2017) (on coreset): This is an extension of the well-known k-means local search algorithm.

Experiments: We now describe our experiments which were done on both synthetic and real data sets.

Synthetic Data Experiments

We first conducted experiments on synthetic data sets with various parameters. Every data set has one million points in total. We generated k random Gaussian balls: for the i-th Gaussian, we chose a true center uniformly at random from a bounded range. Then we added points drawn from a Gaussian around each true center. Next, we added noise: noise points were sampled uniformly at random, either from the same range as the centers or from a larger range, depending on the experiment. We tagged the z farthest points from the true centers as ground truth outliers. We considered all combinations of the parameter values and the noise ranges.
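For concreteness, a data set of this kind can be generated along the following lines (a sketch; the ranges, standard deviation, and default counts are placeholders rather than the exact values used in the paper).

import numpy as np

def make_synthetic(n=1_000_000, k=20, d=2, z=10_000, box=100.0, sigma=1.0, noise_box=100.0, seed=0):
    """Gaussian balls around k random true centers, plus uniformly sampled noise points."""
    rng = np.random.default_rng(seed)
    true_centers = rng.uniform(-box, box, size=(k, d))          # the true centers
    pts_per_cluster = (n - z) // k
    clusters = [rng.normal(c, sigma, size=(pts_per_cluster, d)) for c in true_centers]
    noise = rng.uniform(-noise_box, noise_box, size=(z, d))     # noise, possibly from a larger range
    X = np.vstack(clusters + [noise])
    # Ground-truth outliers: the z points farthest from their nearest true center.
    nearest_sq = np.full(len(X), np.inf)
    for c in true_centers:
        nearest_sq = np.minimum(nearest_sq, ((X - c) ** 2).sum(axis=1))
    gt_outliers = nearest_sq.argsort()[-z:]
    return X, true_centers, gt_outliers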

Each experiment was conducted several times; we chose the result with the minimum objective and measured the total running time over all runs. We aborted the execution if an algorithm failed to terminate within 4 hours. All experiments were performed on a cluster using a single node with 20 cores at 2301 MHz and 128 GB of RAM. Table 1 shows the number of times each algorithm aborted due to high run time. We also measured the recall, defined as the number of ground truth outliers reported by the algorithm divided by z, the number of points discarded. The recall was the same as the precision in all cases, so we use precision in the remaining text. We chose 0.8 as the threshold for acceptable precision and counted the number of inputs for which each algorithm had precision lower than 0.8. Our algorithm NK-means, k-means++ on the coreset, and k-means++ on the original input all had precision greater than 0.8 for all data sets and always terminated within 4 hours; the k-means++ results are therefore excluded from the table. Details of the quality and runtime are deferred to Supplementary Material E.

                    Skin-5   Skin-10   Susy-5   Susy-10   Power-5   Power-10   KddFull
NK-means     obj    1        1         1        1         1         1          1
             prec   0.8065   0.9424    0.8518   0.9774    0.6720    0.9679     0.6187
             time   56       56        1136     1144      363       350        1027
k-means--    obj    0.9740   1.5082    1.2096   1.1414    1.0587    1.0625     2.0259
             prec   0.7632   0.9044    0.8151   0.9753    0.6857    0.9673     0.6436
             time   86       89        672      697       291       251        122
k-means++    obj    1.0641   1.4417    1.0150   1.0091    1.0815    1.0876     1.5825
(coreset)    prec   0.7653   0.9012    0.8622   0.9865    0.7247    0.9681     0.3088
             time   39       37        462      465       177       142        124
k-means++    obj    0.9525   1.6676    1.0017   1.0351    1.0278    1.0535     1.5756
(original)   prec   0.7775   0.8975    0.8478   0.9814    0.7116    0.9649     0.3259
             time   34       43        6900     6054      689       943        652

Table 2: Experiment results on real data sets. For each algorithm, the three rows give the objective (normalized relative to NK-means), the precision, and the run time (sec.), respectively.

Real Data Experiments

For further experiments, we used real data sets. We used the same normalization, the same noise addition method, and the same value of k in all experiments. The data sets are Skin, Susy, and Power, each used with two settings (the suffixes 5 and 10 in Table 2), plus KddFull. We normalized the data so that the mean and standard deviation on each dimension are 0 and 1, respectively. Then we sampled points uniformly at random from a bounded range and added them as noise. We discarded data points with missing entries.

Real Data Sets:

  1. Skin (30). Only the first few features were used.

  2. Susy (31).

  3. Power (18). We dropped the first two features, date and time, which denote when the measurements were made.

  4. KddFull (21). Each instance has numeric and non-numeric features; we excluded the non-numeric features. In this data set, a small number of classes account for the vast majority of the data points; we considered the remaining data points as ground truth outliers.

Table 2 shows our experimental results on the above real data sets. Due to their high failure rates observed in Table 1 and space constraints, we excluded the primal-dual, local search, and conservative uniform sampling algorithms from Table 2; all results can be found in Supplementary Material E. As before, we executed each algorithm several times. It is worth noting that NK-means is the only algorithm in Table 2 with worst case guarantees. This offers a candidate explanation for the stability of our algorithm's solution quality across all data sets in comparison to the other algorithms considered.

The results show that our algorithm NK-means has the best objective on all data sets except Skin-5, where it is within 5% of the best. Our algorithm is always competitive with the best precision. For KddFull, where we did not add artificial noise, NK-means significantly outperformed the other algorithms in terms of objective. NK-means pays extra run time to remove outliers, but this preprocessing enables its stability and competitive performance.

6 Conclusion

This paper presents a near linear time algorithm for removing noise from data before applying k-means clustering. We show that the algorithm has provably strong guarantees on the number of outliers discarded and on the approximation ratio. Further, NK-means gives the first pseudo-approximation-preserving reduction from k-means with outliers to k-means without outliers. Our experiments show that the algorithm is the fastest among algorithms with provable guarantees and is more accurate than state-of-the-art algorithms. It is of interest to determine whether the algorithm achieves better guarantees if the data has more structure, such as lying in low dimensional Euclidean space or being well-clusterable (Braverman et al., 2011).

7 Acknowledgments

S. Im and M. Montazer Qaem were supported in part by NSF grants CCF-1617653 and 1844939. B. Moseley and R. Zhou were supported in part by NSF grants CCF-1725543, 1733873, 1845146, a Google Research Award, a Bosch junior faculty chair and an Infor faculty award.

References

  • A. Aggarwal, A. Deshpande, and R. Kannan (2009) Adaptive sampling for k-means clustering. In RANDOM, pp. 15–28.
  • C. C. Aggarwal and C. K. Reddy (2013) Data clustering: algorithms and applications. Chapman and Hall/CRC.
  • D. Arthur and S. Vassilvitskii (2007) k-means++: the advantages of careful seeding. In SODA, pp. 1027–1035.
  • V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit (2004) Local search heuristics for k-median and facility location problems. SIAM Journal on Computing 33 (3), pp. 544–562.
  • V. Braverman, A. Meyerson, R. Ostrovsky, A. Roytman, M. Shindler, and B. Tagiku (2011) Streaming k-means on well-clusterable data. In SODA, pp. 26–40.
  • M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys (2002) A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences 65 (1), pp. 129–149.
  • M. Charikar and S. Guha (1999) Improved combinatorial algorithms for the facility location and k-median problems. In FOCS, pp. 378–388.
  • M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan (2001) Algorithms for facility location problems with outliers. In SODA, pp. 642–651.
  • S. Chawla and A. Gionis (2013) k-means--: a unified approach to clustering and outlier detection. In ICDM, pp. 189–197.
  • J. Chen, E. S. Azer, and Q. Zhang (2018) A practical algorithm for distributed clustering and outlier detection. In Advances in Neural Information Processing Systems, pp. 2248–2256.
  • K. Chen (2008) A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pp. 826–835.
  • M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In AAAI, pp. 226–231.
  • D. Feldman and M. Langberg (2011) A unified framework for approximating and clustering data. In STOC, pp. 569–578.
  • D. Feldman, M. Monemizadeh, and C. Sohler (2007) A PTAS for k-means clustering based on weak coresets. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, pp. 11–18.
  • S. Guha, Y. Li, and Q. Zhang (2017) Distributed partial clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 143–152.
  • S. Gupta, R. Kumar, K. Lu, B. Moseley, and S. Vassilvitskii (2017) Local search methods for k-means with outliers. PVLDB 10 (7), pp. 757–768.
  • L. Huang, S. H.-C. Jiang, J. Li, and X. Wu (2018) Epsilon-coresets for clustering (with outliers) in doubling metrics. In FOCS, pp. 814–825.
  • [18] Individual household electric power consumption data set.
  • A. K. Jain, M. N. Murty, and P. J. Flynn (1999) Data clustering: a review. ACM Computing Surveys 31, pp. 264–323.
  • T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu (2004) A local search approximation algorithm for k-means clustering. Computational Geometry 28 (2-3), pp. 89–112.
  • [21] KDD Cup 1999 data.
  • J. Kogan, C. Nicholas, M. Teboulle, et al. (2006) Grouping multidimensional data. Springer.
  • R. Krishnaswamy, S. Li, and S. Sandeep (2018) Constant approximation for k-median and k-means with outliers via iterative rounding. In STOC, pp. 646–659.
  • A. Kumar, Y. Sabharwal, and S. Sen (2004) A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pp. 454–462.
  • S. Li and X. Guo (2018) Distributed k-clustering for data with heavy noise. In Advances in Neural Information Processing Systems, pp. 7838–7846.
  • S. P. Lloyd (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–136.
  • G. Malkomes, M. Kusner, W. Chen, K. Weinberger, and B. Moseley (2015) Fast distributed k-center clustering with outliers on massive data. In NIPS, pp. 1063–1071.
  • R. M. McCutchen and S. Khuller (2008) Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX, pp. 165–178.
  • A. Meyerson, L. O'Callaghan, and S. Plotkin (2004) A k-median algorithm with running time independent of data size. Machine Learning 56 (1-3), pp. 61–87.
  • [30] Skin segmentation data set.
  • [31] SUSY data set.


Supplementary Material for: Fast Noise Removal for k-Means Clustering

Appendix A Analysis of NK-means

The goal of this section is to prove Theorem 1. For the remainder of this section, let C* denote the optimal (k, z)-solution and let C denote the output of NK-means(A). Again, let r denote the radius threshold used by the algorithm.

We first show the benefits of the optimal clusters being large.

Claim 1.

For each optimal center , let be the closest input point to . If the cluster defined by has size at least , then .

Proof.

Assume for contradiction that . Thus for each input point that is assigned to center in the optimal solution, we have . There are at least such points, so we can lower bound the assignment cost of these points by . This is a contradiction. ∎

Lemma 2.

If the cluster defined by in the optimal solution has size at least , then is heavy.

Proof.

Assume for contradiction that is light, so . However, at least points are assigned to in the optimal solution, so there are at least such points outside of .

Let be such a point that is assigned to in the optimal solution. By the triangle inequality, we have:

, which implies .

We conclude that for at least points assigned to in the optimal solution, their assignment costs are each at least . This is a contradiction. ∎

Now using this result, we can upper bound the number of outliers required by our solution to remain competitive with the optimal (k, z)-solution (we will show that this quantity is upper bounded by the size of Z_out at the end of NK-means).

Lemma 3.

At the end of NK-means, |Z_out| = O(kz).

Proof.

Let .

For each

, we will classify points into two types:

  1. :

    We have that satisfies for some . If the cluster defined by has size at least , then by Lemma 2, is heavy.

    Further, , so . Thus, we will not add to if its nearest optimal cluster has size at least .

  2. :

    We claim that there are at most such satisfying . Assume for contradiction that there are at least points with . At most of these points can be outliers, so the optimal solution must cluster at least of these points. Thus we can lower bound the assignment cost of these points to by:

    This is a contradiction.

We conclude that includes no points of type 1 from clusters of size at least , at most points from each cluster of size less than , and at most points of type 2. ∎

Corollary 2.

If every optimal cluster has size Ω(z), then at the end of NK-means, |Z_out| = O(z).

It remains to bound the cost of the clustering defined by C. Recall that this cost is the cost of clustering X with C while excluding the points of largest assignment cost.

Intuitively, we do not need to worry about the points of X that are clustered (i.e., not discarded) in both our solution and the optimal (k, z)-solution, because such points are paid for in both solutions.

We must take more care to bound the cost of the points that are clustered by our solution but are outliers in the optimal (k, z)-solution, because such points could have unbounded assignment costs to the optimal centers C*. Here we will use the following property of heavy points:

Lemma 4.

Let be a heavy point. Then there exists some optimal center such that .

Proof.

Assume for contradiction that for every . However, is heavy, so . At least points in must be clustered by the optimal -solution .

Consider any such . By the triangle inequality, we have

This implies . Thus we can lower bound the assignment cost to of all points in by:

This is a contradiction. ∎

Now we are ready to prove the main theorem of this section.

Proof of Theorem 1.

By Corollary 2, we have |Z_out| = O(z).

Further, by construction, C is an α-approximate k-means solution on X \ Z_out. Then

so it suffices to show that .

We will consider two types of points:

  1. Points in X \ Z_out that are also clustered in the optimal (k, z)-solution:

    We have

  2. Points in X \ Z_out that are outliers in the optimal (k, z)-solution:

    Observe that by definition, , so there are at most such . By Lemma 2, for each such , we have . Thus,

We conclude that , as required. ∎

Appendix B Analysis of Coreset Construction and Near Linear Time Algorithm

The goal of this section is to prove Theorems 4 and 5. In our proof, we will use Theorems 2 and 3. For proofs of these theorems, see Sections C and D.

Proof of Theorem 4.

We consider two cases, depending on whether z is small or large.

If , then . Because , we have . Then , as required. Further, by Theorem 2, is a -coreset for with constant probability.

Otherwise, if , then . Thus, , as required. By Theorem 3, with probability at least , is an -coreset for . For the remainder of this analysis, we assume this condition holds. We also know that is a -coreset for with constant probability. Assume this holds for the remainder of the analysis.

Let be a set of centers satisfying . Because is an -coreset for , this implies:

Because is an -coreset for , we conclude:

Thus is an -coreset for . ∎

Proof of Theorem 5.

The approximation guarantees follow directly from Theorems 1 and 4.

To analyze the runtime, note that we can compute the sample X' in O(n) time. It is known that kmeans++ takes time linear in the number of input points (times the number of chosen centers and the dimension) (Arthur and Vassilvitskii, 2007; Aggarwal et al., 2009). Thus the runtime of SampleCoreset is dominated by the runtime of k-means++ in both cases (z small and z large), which is near linear in n.

Note that the coreset has size O(k log n) in expectation, so by Lemma 1, NK-means can be implemented to run on it in time O((k log n)² + T(O(k log n))) in expectation. ∎

Appendix C Proof of Theorem 2

Our proof of Theorem 2 relies on the following lemma which is implicit in Aggarwal et al. (2009):

Lemma 5.

Let . Then with probability at least .

Corollary 3.

Let . Then with probability at least .

Proof.

Let be the optimal solution to the -means with outliers instance . Note that is the set of outliers in the optimal solution, so .

Then we have . Combining this inequality with the above lemma gives the desired result. ∎

Using the above corollary, we can prove Theorem 2 by a moving argument:

Proof of Theorem 2.

Let with weights as defined in the theorem statement. By the above corollary, we have with constant probability. We assume for the remainder of the proof that this condition holds.

Let be any set of centers such that . We wish to bound .

Note that by definition of , , and each weight is an integer. Thus for the remainder of the proof we interpret as a multiset such that for each , there are copies of in the multiset.

It follows that we can associate each point of X with a unique element of the multiset, namely a copy of the center that it is assigned to in the clustering of X with centers C'.

Now we partition into two sets:

For each