The rise of big data has brought the design of distributed learning algorithm to the forefront. For example, in many practical settings the large quantities of data are collected and stored at different locations, while we want to learn properties of the union of the data. For many machine learning tasks, in order to speed up the computation we need to partition the data into a number of machines for a joint computation. In a different dimension, since real-world data often contain background noise or extreme values, it is desirable for us to perform the computation on the “clean data” by discarding a small portion of the data from the input. Sometimes these outliers are interesting by themselves; for example, in the study of statistical data of a population, outliers may represent those people who deserve special attention. In this paper we studyclustering with outliers, a fundamental problem in unsupervised learning, in the distributed model where data are partitioned across multiple sites, who need to communicate to arrive at a consensus on the cluster centers and labeling of outliers.
For many clustering applications it is common to model data objects as points in , and the similarity between two objects is represented as the Euclidean distance of the two corresponding points. In this paper we assume for simplicity that each point can be sent by one unit of communication. Note that when is large, we can apply standard dimension reduction tools (for example, the Johnson-Lindenstrauss lemma) before running our algorithms.
We focus on the two well-studied objective functions -means and -median, defined in Definition 1. It is worthwhile to mention that our algorithms also work for other metrics as long as the distance oracles are given.
Definition 1 (-means/median)
Let be a set of points, and be two parameters. For the -median problem we aim for computing a set of centers of size at most and a set of outliers of size at most so that the objective function is minimized. For the -means we simply replace the objective function with .
Computation Model. We study the clustering problems in the coordinator model, a well-adopted model for distributed learning Balcan et al. (2013); Chen et al. (2016); Guha et al. (2017); Diakonikolas et al. (2017). In this model we have sites and a central coordinator; each site can communicate with the coordinator. The input data points are partitioned among the sites, who, together with the coordinator, want to jointly compute some function on the global data. The data partition can be either adversarial or random. The former can model the case where the data points are independently collected at different locations, while the latter is common in the scenario where the system uses a dispatcher to randomly partition the incoming data stream into multiple workers/sites for a parallel processing (and then aggregates the information at a central server/coordinator).
In this paper we focus on the one-round communication model (also called the simultaneous communication model), where each site sends a sketch of its local dataset to the coordinator, and then the coordinator merges these sketches and extracts the answer. This model is arguably the most practical one since multi-round communication will cost a large system overhead.
Our goals for computing -means/median in the coordinator model are the following: (1) to minimize the clustering objective functions; (2) to accurately identify the set of global outliers; and (3) to minimize the computation time and the communication cost of the system. We will elaborate on how to quantify the quality of outlier detection in Section 5.
Our Contributions. A natural way of performing distributed clustering in the simultaneous communication model is to use the two-level clustering framework (see e.g., Guha et al. (2003, 2017)). In this framework each site performs the first level clustering on its local dataset , getting a subset with each point being assigned a weight; we call the summary of . The site then sends to the coordinator, and the coordinator performs the second level clustering on the union of the summaries. We note that the second level clustering is required to output at most centers and outliers, while the summary returned by the first level clustering can possibly have more than weighted points. The size of the summary will contribute to the communication cost as well as the running time of the second level clustering.
The main contribution of this paper is to propose a simple and practical summary construction at sites with the following properties.
It is extremely fast: runs in time , where is the size of the dataset.
The summary has small size: for adversarial data partition and for random data partition.
When coupled with a second level (centralized) clustering algorithm that -approximates -means/median, we obtain an -approximation algorithm for distributed -means/median.111We say an algorithm -approximates a problem if it outputs a solution that is at most times the optimal solution.
It can be used to effectively identify the global outliers.
We emphasize that both the first and the second properties are essential to make the distributed clustering algorithm scalable on large datasets. Our extensive set of experiments have demonstrated the clear superiority of our algorithm against all the baseline algorithms in almost all metrics.
To the best of our knowledge, this is the first practical algorithm with theoretical guarantees for distributed clustering with outliers.
Related Work. Clustering is a fundamental problem in computer science and has been studied for more than fifty years. A comprehensive review of the work on -means/median is beyond the scope of this paper, and we will focus on the literature for centralized/distributed -means/median clustering with outliers and distributed -means/median clustering.
In the centralized setting, several -approximation or -approximation222We say a solution is an -approximation if the cost of the solution is while excluding points, where is the cost of the optimal solution excluding points. algorithms have been proposed Charikar et al. (2001); Chen (2009)
. These algorithms make use of linear programming and need time at least, which is prohibitive on large datasets. Feldman and Schulman (2012) studied -median via coresets, but the running times of their algorithm includes a term which is not practical.
Chawla and Gionis (2013) proposed for -means an algorithm called -means--, which is an iterative procedure and can be viewed as a generalization of Llyod’s algorithm Lloyd (1982). Like Llyod’s algorithm, the centers that -means-- outputs are not the original input points; we thus cannot use it for the summary construction in the first level clustering at sites because some of the points in the summary will be the outliers we report at the end. However, we have found that -means-- is a good choice for the second level clustering: it outputs exactly centers and outliers, and its clustering quality looks decent on datasets that we have tested, though it does not have any worst case theoretical guarantees.
Recently Gupta et al. (2017) proposed a local-search based -approximation algorithm for -means. The running time of their algorithm is ,333 hides some logarithmic factors. which is again not quite scalable. The authors mentioned that one can use the -means++ algorithm Arthur and Vassilvitskii (2007) as a seeding step to boost the running time to . We note that first, this running time is still worse than ours. And second, since in the first level clustering we only need a summary – all that we need is a set of weighted points that can be fed into the second level clustering at the coordinator, we can in fact directly use -means++ with a budget of centers for constructing a summary. We will use this approach as a baseline algorithm in our experimental studies.
In the past few years there has been a growing interest in studying -means/median clustering in the distributed models Ene et al. (2011); Bahmani et al. (2012); Balcan et al. (2013); Liang et al. (2014); Cohen et al. (2015); Chen et al. (2016). In the case of allowing outliers, Guha et al. Guha et al. (2017) gave a first theoretical study for distributed -means/median. However, their algorithms need running time at sites and are thus again not quite practical on large-scale datasets. We note that the -means algorithm proposed by Bahmani et al. (2012) can be extended (again by increasing the budget of centers from to ) and used as a baseline algorithm for comparison. The main issue with -means is that it needs rounds of communication which holds back its overall performance.
We are going to use the notations listed in Table 1.
|input dataset||, size of the dataset|
|number of centers|
|number of outliers||outliers chosen by OPT|
|of iterations in Algo 1|
|remaining points at the -th|
|iteration of Algorithm 1|
|clustered points at the -th|
|iteration of Algorithm 1|
We will also make use of the following lemmas.
Lemma 1 (Chernoff Bound)
Let be independent Bernoulli random variables such that
be independent Bernoulli random variables such that. Let , and let . It holds that and for any .
Lemma 2 (Mettu and Plaxton (2002))
Consider the classic balls and bins experiment where balls are thrown into bins, for some . Also, let be a weight associated with the -th bin, for . Assuming, the probability of each ball falling into the
. Assuming, the probability of each ball falling into the-th bin is and , the following holds:
For any , there exists a such that
Note that the dependence of on is independent of or .
3 The Summary Construction
In this section we present our summary construction for -median/means in the centralized model. In Section 4 we will show how to use this summary construction for solving the problems in the distributed model.
3.1 The Algorithm
Our algorithm is presented in Algorithm 1. It works for both the -means and -median objective functions. We note that Algorithm 1 is partly inspired by the algorithm for clustering without outliers proposed in Mettu and Plaxton (2002). But since we have to handle outliers now, the design and analysis of our algorithm require new ideas.
For a set and a scalar value , define . Algorithm 1 works in rounds indexed by . Let be the initial set of input points. The idea is to sample a set of points of size for a constant (assuming ) from , and grow a ball of radius centered at each . Let be the set of points in the union of these balls. The radius is chosen such that at least a constant fraction of points of are in .
Define . In the -th round, we add the points in to the set of centers, and assign points in to their nearest centers in . We then recurse on the rest of the points , and stop until the number of points left unclustered becomes at most . Let be the final value of . Define the weight of each point in to be the number of points in that are assigned to , and the weight of each point in to be . Our summary consists of points in together with their weights.
3.2 The Analysis
We now try to analyze the performance of Algorithm 1. The analysis will be conducted for the -median objective function, while the results also hold for -means; we will discuss this briefly at the end of this section.
Definition 2 (Information Loss)
For a summary constructed by Algorithm 1, we define the information loss of as
That is, the sum of distances of moving each point to the corresponding center (we can view each outlier as a center itself).
We will prove the following theorem, which says that the information loss of the summary constructed by Algorithm 1 is bounded by the optimal -median clustering cost on .
As a consequence of Theorem 1, we obtain by triangle inequality arguments the following corollary that directly characterizes the quality of the summary in the task of -median.
If we run a -approximation algorithm for -median on , we can obtain a set of centers and a set of outliers such that
with probability .
In the rest of this section we prove Theorem 1. We will start by bounding the information loss.
Definition 3 (, and )
Define to be the set of outliers chosen by running the optimal -median algorithm on ; we thus have . For , define and , where and are defined in Algorithm 1.
We need the following utility lemma. It says that at each iteration in the while loop in Algorithm 1, we always make sure that at least half of the remaining points are not in .
For any , where is the total number of rounds in Algorithm 1, we have .
Proof: According to the condition of the while loop in Algorithm 1 we have for any . Since , we have
The rest of the proof for Theorem 1 proceeds as follows. We first show in Lemma 4 that can be upper bounded by (Lemma 4). We then show in Lemma 5 that can be lower bounded by with high probability (Lemma 5). Theorem 1 then follows.
Lemma 4 (upper bound)
It holds that
Here is the radius we chosen in the -th round of Algorithm 1.
We now turn to the lower bound of .
Lemma 5 (lower bound)
It holds that
Before proving the lemma, we would like to introduce a few more notations.
Definition 4 ( and )
Let ; we thus have (recall in Algorithm 1 that is a fixed constant). For any , let be the minimum radius such that there exists a set of size with
The purpose of introducing is to use it as a bridge to connect and . We first have the following.
Fix an arbitrary set of size as centers. To prove Lemma 6 we will use a charging argument to connect and . To this end we introduce the following definitions and facts.
Definition 5 (, and )
For each , define . For any , define . Let .
Clearly, if and , then and are disjoint. This leads to the following fact.
For any , we have
By the definitions of and we directly have:
For any , .
Let (a constant), we have
For any , .
Proof: We first show that is a geometrically decreasing sequence.
As a result, we have that holds a least a constant fraction of points in .
For any , .
Proof:(of Lemma 6) Let
Then is at least
The lemma then follows from the fact that is chosen arbitrarily.
With probability , we have for all .
Proof: Fix an , and let be a set of size such that . Let . We assign each point in to its closest point in , breaking ties arbitrarily. Let be the set of all points in that are assigned to ; thus forms a partition of .
Recall that in Algorithm 1 is constructed by a random sampling. Define
We have the following claim.
Note that once we have , we have that for a sufficiently small constant ,
Since , we have . By the definition of , we have that . The success probability in Lemma 7 is obtained by applying a union bound over all iterations.
Denote . Since is a random sample of (of size ), by (5) we have that for each point , . For each , define a random variable such that if , and otherwise. Let ; we thus have . By applying Lemma 1 (Chernoff bound) on ’s, we have that for any positive constant and , there exists a sufficiently large constant (say, ) such that
In other words, with probability at least , . The claim follows by applying Lemma 2 on each point in as a ball, and each set as a bin with weight .
The running time. We now analyze the running time of Algorithm 1.
At the -th iteration, the sampling step at Line 1 can be done in time. The nearest-center assignments at Line 1 and 1 can be done in time. Line 1 can be done by first sorting the distances in the increasing order and then scanning the shorted list until we get enough points. In this way the running time is bounded by . Thus the total running time can be bounded by
where the first equation holds since the size of decreases geometrically, and the second equation is due to the definition of .
Finally, we comment that we can get a similar result for -means by appropriately adjusting various constant parameters in the proof.
Let and be computed by Algorithm 1. We have with probability that
We note that in the proof for the median objective function we make use of the triangle inequality in various places, while for the means objective function where the distances are squared, the triangle inequality does not hold. However we can instead use the inequality , which will only make the constant parameters in the proofs slightly worse.
3.3 An Augmentation
In the case when , which is typically the case in practice since the number of centers does not scale with the size of the dataset while the number of outliers does, we add an augmentation procedure to Algorithm 1 to achieve a better practical performance. The procedure is presented in Algorithm 2.
The augmentation is as follows, after computing the set of outliers and the set of centers in Algorithm 1, we sample randomly from an additional set of center points of size . That is, we try to make the number of centers and the number of outliers in the summary to be balanced. We then reassign each point in the set to its nearest center in . Denote the new mapping by . Finally, we include points in and , together with their weights, into the summary .
It is clear that the augmentation procedure preserves the size of the summary asymptotically. And by including more centers we have , where is the mapping returned by Algorithm 1. The running time will increase to due to the reassignment step, but our algorithm is still much faster than all the baseline algorithms, as we shall see in Section 5.
4 Distributed Clustering with Outliers
In this section we discuss distributed -median/means using the summary constructed in Algorithm 1. We will first discuss the case where the data is randomly partitioned among the sites, which is the case in all of our experiments. The algorithm is presented in Algorithm 3. We will discuss the adversarial partition case at the end. We again only show the results for -median since the same results will hold for -means by slightly adjusting the constant parameters.
We will make use of the following known results. The first lemma says that the sum of costs of local optimal solutions that use the same number of outliers as the global optimal solution does is upper bounded by the cost of the global optimal solution.
Lemma 8 (Guha et al. (2017))
For each , let where is the set of outliers produced by the optimal -median algorithm on . We have
The second lemma is a folklore for two-level clustering.
Let be the union of the summaries of the local datasets, and let be the cost function of a -approximation algorithm for -median. We have
Now by Lemma 8, Lemma 9 and Theorem 1, we have that with probability , . And by Chernoff bounds and a union bound we have for all with probability .444For the convenience of the analysis we have assumed , which is justifiable in practice since typically scales with the size of the dataset while is usually a fixed number.
it outputs a set of centers and a set of outliers such that ;
it uses one round of communication whose cost is bounded by ;
the running time at the -th site is bounded by , and the running time at the coordinator is that of the second level clustering.
In the case that the dataset is adversarially partitioned, the total communication increases to . This is because all of the outliers may go to the same site and thus in Line 3 needs to be replaced by .
5.1 Experimental Setup
5.1.1 Datasets and Algorithms
We make use of the following datasets.
gauss-. This is a synthetic dataset, generated as follows: we first sample centers from , i.e., each dimension is sampled uniformly at random from . For each center , we generate points by adding each dimension of
a random value sampled from the normal distribution. This way, we obtain M points in total. We next construct the outliers as follows: we sample points from the M points, and for each sampled point, we add a random shift sampled from .
kddFull. This dataset is from 1999 kddcup competition and contains instances describing connections of sequences of tcp packets. There are about M data points. 555More information can be found in http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html We only consider the
numerical features of this dataset. We also normalize each feature so that it has zero mean and unit standard deviation. There areclasses in this dataset, points of the dataset belong to classes (normal , neptune , and smurf ). We consider small clusters as outliers and there are outliers.
kddSp. This data set contains about points of kddFull (released by the original provider). This dataset is also normalized and there are 8752 outliers.
susy-. This data set has been produced using Monte Carlo simulations by Baldi et al. (2014). Each instance has numerical features and there are M instances in total.666More information about this dataset can be found in https://archive.ics.uci.edu/ml/datasets/SUSY. We normalize each feature as we did in kddFull. We manually add outliers as follows: first we randomly sample data points; for each data point, we shift each of its dimension by a random value chosen from .
Spatial-. This dataset is about 3D road network with elevation information from North Jutland, Denmark. It is designed for clustering and regression tasks. There are about 0.4M data points with 4 features. We normalize each feature so that it has zero mean and unit standard deviation. We add outliers as we did for susy-. 777More information can be found in https://archive.ics.uci.edu/ml/datasets/3D+Road+Network+(North+Jutland,+Denmark).
Finding appropriate and values for the task of clustering with outliers is a separate problem, and is not part of the topic of this paper. In all our experiments, and are naturally suggested by the datasets we use unless they are unknown.
We compare the performance of following algorithms, each of which is implemented using the MPI framework and run in the coordinator model. The data are randomly partitioned among the sites.
rand. Each site constructs a summary by randomly sampling points from its local dataset. Each sampled point is assigned a weight equal to the number of points in the local dataset that are closer to than other points in the summary. The coordinator then collects all weighted samples from all sites and feeds to -means-- for a second level clustering.
-means++. Each site constructs a summary of the local dataset using the -means++ algorithm Arthur and Vassilvitskii (2007), and sends it to the coordinator. The coordinator feeds the unions all summaries to -means-- for a second level clustering.
-means. An MPI implementation of the -means algorithm proposed by Bahmani et al. (2012) for distributed -means clustering. To adapt their algorithm to solve the outlier version, we increase the parameter in the algorithm to , and then feed the outputted centers to -means-- for a second level clustering.
Let and be the sets of centers and outliers respectively returned by a tested algorithm. To evaluate the quality of the clustering results we use two metrics: (a) -loss (for -median): ; (b) -loss (for -means): .
To measure the performance of outlier detection we use three metrics. Let be the set of points fed into the second level clustering -means-- in each algorithm, and let be the set of actual outliers (i.e., the ground truth), we use the following metrics: (a) preRec: the proportion of actual outliers that are included in the returned summary, defined as ; (b) recall: the proportion of actual outliers that are returned by -means--, defined as ; (c) prec: the proportion of points in that are actually outliers, defined as .
5.1.3 Computation Environments
All algorithms are implemented in C++ with Boost.MPI support. We use Armadillo Sanderson (2010) as the numerical linear library and -O3 flag is enabled when compile the code. All experiments are conducted in a PowerEdge R730 server equipped with 2 x Intel Xeon E5-2667 v3 3.2GHz. This server has 8-core/16-thread per CPU, 192GB Memeory and 1.6TB SSD.
5.2 Experimental Results
We now present our experimental results. All results take the average of runs.
We first compare the qualities of the summaries returned by ball-grow, rand and -means. Note that the size of the summary returned by ball-grow is determined by the parameters and , and we can not control the exact size. In -means, the summary size is determined by the sample ratio, and again we can not control the exact size. On the other hand, the summary sizes of rand and -means++ can be fully controlled. To be fair, we manually tune those parameters so that the sizes of summaries returned by different algorithms are roughly the same (the difference is less than ). In this set of experiments, each dataset is randomly partitioned into sites.
Table 2 presents the experimental results on gauss datasets with different . We observe that ball-grow consistently gives better -loss and -loss than -means and -means++, and rand performs the worst among all.
For outlier detection, rand fails completely. In both gauss- and gauss-, ball-grow outperforms -means++ and -means in almost all metrics. -means slightly outperforms -means++. We also observe that in all gauss datasets, ball-grow gives very high preRec, i.e., the outliers are very likely to be included in the summary constructed by ball-grow.
Table 3 presents the experimental results on kddSp and kddFull datasets. In this set of experiments, ball-grow again outperforms its competitors in all metrics. Note that -means does not scale to kddFull.
|-means||does not stop after hours|
Table 4 presents the experimental results for susy- dataset. We can observe that ball-grow produces slightly better results than -means, -means++ and rand in -loss and -loss. For outlier detection, ball-grow outperforms -means++ and -means significantly in terms of preRec and recall, while -means gives slightly better prec. Table 5 presents the results for Spatial- dataset, and ball-grow again outperforms all other baseline algorithms in all metrics.