Improved MapReduce and Streaming Algorithms for k-Center Clustering (with Outliers)

02/26/2018 · Matteo Ceccarello et al. · Università di Padova and IT University of Copenhagen

We present efficient MapReduce and Streaming algorithms for the k-center problem with and without outliers. Our algorithms exhibit an approximation factor which is arbitrarily close to the best possible, given enough resources.


1. Introduction

Center-based clustering is a fundamental unsupervised learning primitive for data management, with applications in a variety of domains such as database search, bioinformatics, pattern recognition, networking, facility location, and many more (HennigMMR15). Its general goal is to partition a set of data items into groups according to a notion of similarity, captured by closeness to suitably chosen group representatives, called centers. There is an ample and well-established literature on sequential strategies for different instantiations of center-based clustering (AwasthiB15). However, the explosive growth of data that needs to be processed often rules out the use of these strategies, which are efficient on small-sized datasets but impractical on large ones. Therefore, it is of paramount importance to devise efficient clustering strategies tailored to the typical computational frameworks for big data processing, such as MapReduce and Streaming (LeskovecRU14).

In this paper, we focus on the k-center problem, formally defined as follows. Given a set S of points in a metric space and a positive integer k, find a subset T ⊆ S of k points, called centers, so that the maximum distance between any point of S and its closest center in T is minimized. (Note that the association of each point to its closest center naturally defines a clustering of S.) Along with k-median and k-means, which require minimizing, respectively, the sum of all distances and of all squared distances to the closest centers, k-center is a very popular instantiation of center-based clustering, which has recently proved a pivotal primitive for data and graph analytics (IndykMMM14; AghamolaeiFZ15; CeccarelloPPU15; CeccarelloPPU16; CeccarelloPPU17; CeccarelloFPPV17), and whose efficient solution in the realm of big data has attracted a lot of attention in the literature (Charikar2001; McCutchen2008; EneIM11; MalkomesKCWM15).

The k-center problem is NP-hard (Gonzalez85), therefore one has to settle for approximate solutions. Also, since its objective function involves a maximum, the solution is at risk of being severely influenced by a few "distant" points, called outliers. In fact, the presence of outliers is inherent in many datasets, since these points are often artifacts of data collection, or represent noisy measurements, or simply erroneous information. To cope with this problem, k-center admits a formulation that takes outliers into account (Charikar2001): when computing the objective function, up to z points are allowed to be discarded, where z is a user-defined input parameter.

A natural approach to compute approximate solutions to large instances of combinatorial optimization problems entails efficiently extracting a much smaller subset of the input, dubbed coreset, which contains a good approximation to the global optimum, and then applying a standard sequential approximation algorithm to such a coreset. The benefits of this approach are evident when the coreset construction is substantially more efficient than running the (possibly very expensive) sequential approximation algorithm directly on the whole input, so that significant performance improvements are attained by confining the execution of such an algorithm to a small subset of the data. Using coresets much smaller than the input, the authors of (MalkomesKCWM15) present MapReduce algorithms for the k-center problem with and without outliers, whose (constant) approximation factors are, however, substantially larger than those of their best sequential counterparts. In this work, we further leverage the coreset approach and unveil interesting tradeoffs between coreset size and approximation quality, showing that better approximation is achievable through larger coresets. The obtainable tradeoffs are regulated by the doubling dimension of the underlying metric space, and allow us to obtain improved MapReduce and Streaming algorithms for the two formulations of the k-center problem, whose approximation ratios can be made arbitrarily close to those featured by the best sequential algorithms. Also, as a by-product, we obtain a sequential algorithm for the case with outliers which is considerably faster than existing ones.

1.1. Related work

Back in the 80's, Gonzalez (Gonzalez85) developed a very popular 2-approximation sequential algorithm for the k-center problem running in O(kn) time, which is referred to as gmm in the recent literature. In the same paper, the author showed that it is impossible to achieve an approximation factor 2 − ε, for any fixed ε > 0, in general metric spaces, unless P = NP. To deal with noise in the dataset, Charikar et al. (Charikar2001) introduced the k-center problem with z outliers, where the clustering is allowed to ignore up to z points of the input. For this problem, they gave a 3-approximation algorithm which runs in time cubic in the input size. Furthermore, they proved that, for this problem, it is impossible to achieve an approximation factor 3 − ε, for any fixed ε > 0, in general metric spaces, unless P = NP.

With the advent of big data, a lot of attention has been devoted to the MapReduce model of computation, where a set of processors with limited-size local memories process data in a sequence of parallel rounds (DeanG08; PietracaprinaPRSU12; LeskovecRU14). The k-center problem under this model was first studied by Ene et al. (EneIM11), who provided a 10-approximation randomized algorithm. This result was subsequently improved in (MalkomesKCWM15) with a deterministic 4-approximation algorithm requiring local memory of size O(√(nk)). As for the k-center problem with z outliers, a deterministic 13-approximation MapReduce algorithm was presented in (MalkomesKCWM15), requiring local memory of size O(√(n(k+z))). We remark that randomized multi-round MapReduce algorithms for the two formulations of the k-center problem, with small constant approximation ratios (4, in the case with outliers), have been claimed but not described in the short communication (ImM15). While, theoretically, the MapReduce algorithms proposed in our work seem competitive with the algorithms announced in (ImM15) with respect to both round complexity and space requirements, any comparison is clearly subject to the availability of more details.

As mentioned before, the algorithms in (MalkomesKCWM15) are based on the use of (composable) coresets, a very useful tool in the MapReduce setting (Agarwal2004; IndykMMM14). For a given objective function, a coreset is a small subset extracted from the input which embodies a solution whose cost is close to the cost of the optimal solution on the whole input. The additional property of composability requires that, if coresets are extracted from the distinct subsets of a partition of the input, their union embodies a close-to-optimal solution for the whole input. Composable coresets enable the development of parallel algorithms, where each processor computes the coreset relative to one subset of the partition, and the computation of the final solution is then performed by one processor that receives the union of the coresets. Composable coresets have been used for a number of problems, including diversity maximization (IndykMMM14; AghamolaeiFZ15; CeccarelloPPU17; CeccarelloPP18), submodular maximization (Mirrokni2015), and graph matching and vertex cover (Assadi2017). In (BadoiuHI02) the authors provide a coreset-based (1 + ε)-approximation sequential algorithm for the k-center problem in d-dimensional Euclidean spaces, whose running time is exponential in k and 1/ε and linear in n and d. However, the coreset construction is rather involved, not easily parallelizable, and the resulting algorithm seems to be mainly of theoretical interest.

Another option when dealing with large amounts of data is to process the data in a streaming fashion. In the Streaming model, algorithms use a single processor with limited working memory and are allowed only a few sequential passes over the input (ideally just one) (HenzingerRR98; LeskovecRU14). Originally developed for the external-memory setting, this model also captures the scenario in which data are generated on the fly and must be analyzed in real time, for instance in a streaming DBMS or in a social media platform (e.g., Twitter trend detection). Under this model, Charikar et al. (CharikarCFM04) developed a 1-pass algorithm for the k-center problem which requires O(k) working memory and computes an 8-approximation, deterministically, or a 5.43-approximation, probabilistically. Later, the result was improved in (McCutchen2008), attaining a (2 + ε)-approximation, deterministically, with a working memory of size O((k/ε) log(1/ε)). In the same paper, the authors give a deterministic (4 + ε)-approximation Streaming algorithm for the formulation with z outliers, which requires O(kz/ε) working memory.

1.2. Our contribution

The coreset-based MapReduce algorithms of (MalkomesKCWM15) for k-center, with and without outliers, use the gmm sequential approximation algorithm for k-center in a "bootstrapping" fashion: namely, in a first phase, a set of k centers (k + z centers in the case with outliers) is determined in each subset of an arbitrary partition of the input dataset, and then the final solution is computed on the coreset provided by the union of these centers, using a sequential approximation algorithm for the specific problem formulation. Our work is motivated by the following natural question: what if we select more centers from each subset of the partition in the first phase? Intuitively, we should get a better solution than if we just selected k (resp., k + z) centers. In fact, selecting more and more centers from each subset should yield a solution progressively closer to the one returned by the best sequential algorithm on the whole input, at the expense of larger space requirements.

This paper provides a thorough characterization of the space–accuracy tradeoffs achievable by exploiting the aforementioned idea for both formulations of the k-center problem (with and without outliers). We present improved MapReduce and Streaming algorithms which leverage a judicious selection of larger (composable) coresets to boost the quality of the solution embodied in the (union of the) coresets. We analyze the memory requirements of our algorithms in terms of the desired approximation quality, captured by a precision parameter ε, and of the doubling dimension D of the underlying metric space, a parameter that generalizes the dimensionality of Euclidean spaces to arbitrary metric spaces and is thus related to the difficulty of spotting good clusterings. We remark that this kind of parametrized analysis is particularly relevant in the realm of big data, where distortions introduced to account for worst-case scenarios may be too extreme to provide meaningful insights on actual algorithm performance; it has been employed in a variety of contexts including diversity maximization, clustering, nearest neighbour search, routing, machine learning, and graph analytics (see (CeccarelloPPU17) and references therein).

Our specific results are the following:

  • A deterministic 2-round, (2 + ε)-approximation MapReduce algorithm for the k-center problem, which requires local memory O(√(nk) · (8/ε)^D).

  • A deterministic 2-round, (3 + ε)-approximation MapReduce algorithm for the k-center problem with z outliers, which requires local memory O(√(n(k + z)) · (8/ε)^D).

  • A randomized 2-round, (3 + ε)-approximation MapReduce algorithm for the k-center problem with z outliers, which reduces the local memory requirements to O((√(n(k + log n)) + z) · (8/ε)^D).

  • A deterministic 1-pass, (3 + ε)-approximation Streaming algorithm for the k-center problem with z outliers, which requires working memory O((k + z) · (c/ε)^D) for a constant c, independent of the input size.

Using our coreset constructions we can also attain a (2 + ε)-approximation Streaming algorithm for k-center without outliers, which, however, would not improve on the state-of-the-art algorithm of (McCutchen2008). Nonetheless, for the sake of completeness, we will compare these two algorithms experimentally in Section 5.

Observe that, for both formulations of the problem, our algorithms feature approximation guarantees which are a mere additive term ε larger than the best achievable sequential guarantees, and yield substantial quality improvements over the state of the art (MalkomesKCWM15; McCutchen2008). Moreover, the randomized MapReduce algorithm for the formulation with outliers features smaller coresets, thus attaining a reduction in the local memory requirements which becomes substantial in plausible scenarios where the number z of outliers (e.g., due to noise) is considerably larger than the target number k of clusters, although much smaller than the input size.

While our algorithms are applicable to general metric spaces, on spaces of constant doubling dimension and for constant ε, their local space/working memory requirements are polynomially sublinear in the dataset size, in the MapReduce setting, and independent of the dataset size, in the Streaming setting. Moreover, a very desirable feature of our MapReduce algorithms is that they are oblivious to D, in the sense that the value of D (which may not be known in advance and may be hard to evaluate) is not used explicitly by the algorithms but only in their analysis. In contrast, the 1-pass Streaming algorithm makes explicit use of D, although we will show that it can be made oblivious to D at the expense of one extra pass on the input stream.

As a further important result, the MapReduce algorithm for the case with outliers admits a direct sequential implementation which substantially improves on the time performance of the state-of-the-art algorithm of (Charikar2001), while essentially preserving the approximation quality.

We also provide experimental evidence of the competitiveness of our algorithms on real-world and synthetic datasets of up to over a billion points, comparing with baselines set by the algorithms in (MalkomesKCWM15) for MapReduce, and in (McCutchen2008) for Streaming. In the MapReduce setting, the experiments show that tighter approximations than those of the algorithms in (MalkomesKCWM15) are indeed achievable with larger coresets. In fact, while our theoretical bounds on the space requirements embody large constant factors, the improvements in the approximation quality are already noticeable with a modest increase of the coreset size. In the Streaming setting, for k-center without outliers we show that the (2 + ε)-approximation algorithm based on our techniques is comparable to (McCutchen2008), whereas for k-center with outliers we obtain solutions of better quality using significantly less memory and time. The experiments also show that the Streaming algorithms feature high throughput, and that the MapReduce algorithms exhibit high scalability. Finally, we show that implementing our coreset strategy sequentially does indeed yield a substantial running-time improvement with respect to the state-of-the-art algorithm of (Charikar2001), while preserving the approximation quality.
Organization of the paper. The rest of the paper is organized as follows. Section 2 contains a number of preliminary concepts. Section 3 and Section 4 present, respectively, our MapReduce and Streaming algorithms. The experimental results are reported in Section 5. Finally, Section 6 offers some concluding remarks.

2. Preliminaries

Consider a metric space (M, d) with distance function d(·, ·). For a point p ∈ M, the ball of radius r centered at p is the set of points of M at distance at most r from p. The doubling dimension of M is the smallest D such that, for any radius r and point p ∈ M, all points in the ball of radius r centered at p are included in the union of at most 2^D balls of radius r/2 centered at suitable points. It immediately follows that, for any 0 < ε ≤ 1, a ball of radius r can be covered by at most (2/ε)^D balls of radius εr. Notable examples of metric spaces with bounded doubling dimension are Euclidean spaces and spaces induced by shortest-path distances in mildly-expanding topologies. Also, the notion of doubling dimension can be defined for an individual dataset, and it may turn out to be much lower than that of the underlying metric space (e.g., a set of collinear points in a high-dimensional Euclidean space). In fact, the space–accuracy tradeoffs of our algorithms only depend on the doubling dimension of the input dataset.
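For completeness, here is the short calculation behind the covering claim (a quick sketch; the constant (2/ε)^D is the form used throughout this text): applying the definition j = ⌈log₂(1/ε)⌉ times, a ball of radius r is covered by at most

2^{jD} ≤ 2^{D(log₂(1/ε) + 1)} = (2/ε)^D

balls of radius r/2^j ≤ εr.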

Define the distance between a point p and a set T ⊆ M as d(p, T) = min_{q ∈ T} d(p, q). Consider now a dataset S ⊆ M and a subset T ⊆ S. We define the radius of S with respect to T as

r_T(S) = max_{p ∈ S} d(p, T).

The k-center problem requires finding a subset T ⊆ S of size k such that r_T(S) is minimized. We denote by r*_k(S) the radius achieved by an optimal solution to the problem. Note that T immediately induces a partition of S into k clusters by assigning each point to its closest center, and we say that r_T(S) is the radius of such a clustering.

In Section 1.1 we mentioned the gmm algorithm (Gonzalez85), which provides a sequential 2-approximation to the k-center problem. Here we briefly review how gmm works. Given a set S, gmm builds a set T of centers incrementally, one per iteration. An arbitrary point of S is selected as the first center and is added to T. Then, the algorithm iteratively selects as the next center the point of S with maximum distance from the current T, and adds it to T, until T contains k centers. Note that, rather than fixing k a priori, gmm can be used to grow T until a target radius is achieved. In fact, the radius of S with respect to the set of centers incrementally built by gmm is a non-increasing function of the iteration number. In this paper, we will make use of the following property of gmm, which bounds its accuracy when run on a subset of the data (a short code sketch of gmm is given right after the proof of the lemma).

Lemma 1.

Let X ⊆ S. For a given h ≥ k, let T be the set of h centers returned by gmm when run on X. We have r_T(X) ≤ 2 · r*_k(S).

Proof.

We prove this lemma by rephrasing the proof by Gonzalez (Gonzalez85) in terms of subsets. We need to prove that, for every x ∈ X, d(x, T) ≤ 2 · r*_k(S). Assume by contradiction that this is not the case. Then, for some x ∈ X it holds that d(x, T) > 2 · r*_k(S). By the greedy choice of gmm, we have that for any pair of distinct points y, y′ ∈ T, d(y, y′) > 2 · r*_k(S), since otherwise x would have been included in T instead of one of them. Therefore, the set T ∪ {x} consists of at least k + 1 points at pairwise distance greater than 2 · r*_k(S) from each other. Consider now the optimal solution to k-center on the set S. Since |T ∪ {x}| ≥ k + 1, two of its points, say y and y′, must be closest to the same optimal center c. By the triangle inequality we have d(y, y′) ≤ d(y, c) + d(c, y′) ≤ 2 · r*_k(S), a contradiction. ∎
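For concreteness, the following minimal Python sketch implements gmm as described above (names such as gmm and dist are illustrative and not part of the paper's notation; any metric can be substituted for dist):

import math

def dist(p, q):
    # Euclidean distance between two points given as coordinate tuples;
    # any other metric can be plugged in here.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def gmm(points, k):
    # Farthest-first traversal: start from an arbitrary point, then repeatedly
    # add the point farthest from the current set of centers.
    centers = [points[0]]
    # d_to_centers[i] = distance of points[i] from the closest chosen center
    d_to_centers = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d_to_centers[j])
        centers.append(points[i])
        d_to_centers = [min(d, dist(p, points[i]))
                        for d, p in zip(d_to_centers, points)]
    return centers, max(d_to_centers)  # centers and the radius they achieve

Each iteration touches every point once, so selecting k centers costs O(kn) distance evaluations, matching the running time quoted in Section 1.1.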

For a given set S, the k-center problem with z outliers requires identifying a set T ⊆ S of k centers which minimizes

r_T(S \ Z_T),

where Z_T is the set of the z points of S with largest distance from T (ties broken arbitrarily). In other words, the problem allows us to discard up to the z farthest points when computing the radius of the set of centers, hence of its associated clustering. For given S, k, and z, we denote the radius of the optimal solution of this problem by r*_{k,z}(S). It is straightforward to argue that the optimal solution of the problem without outliers with k + z centers has a radius no larger than that of the optimal solution of the problem with k centers and z outliers (indeed, the k centers of the latter solution together with its z outliers form a feasible set of k + z centers), that is,

(1)    r*_{k+z}(S) ≤ r*_{k,z}(S).

2.1. Computational frameworks

A MapReduce algorithm (DeanG08; PietracaprinaPRSU12; LeskovecRU14) executes as a sequence of parallel rounds. In a round, a multiset of key-value pairs is first transformed into a new multiset of key-value pairs by applying a given map function (simply called mapper) to each individual pair, and then into a final multiset of pairs by applying a given reduce function (simply called reducer) independently to each subset of pairs having the same key. The model features two parameters, M_L, the local memory available to each mapper/reducer, and M_A, the aggregate memory across all mappers/reducers. In our algorithms, mappers are straightforward constant-space transformations, thus the memory requirements will be stated with reference to the reducers. We remark that the MapReduce algorithms presented in this paper also afford an immediate implementation, with a similar analysis, in the Massively Parallel Computation (MPC) model (BeameKS13), which is popular in the database community.

In the Streaming framework (HenzingerRR98, ; LeskovecRU14, ) the computation is performed by a single processor with a small working memory, and the input is provided as a continuous stream of items which is usually too large to fit in the working memory. Multiple passes on the input stream may be allowed. Key performance indicators are the size of the working memory and the number of passes.

The holy grail of big data algorithmics is the development of MapReduce (resp., Streaming) algorithms which work in as few rounds (resp., passes) as possible and require substantially sublinear local memory (resp., working memory) and linear aggregate memory.

3. MapReduce algorithms

The following subsections present our MapReduce algorithms for the k-center problem (Subsection 3.1) and the k-center problem with z outliers (Subsection 3.2). The algorithms are based on the use of composable coresets, which were reviewed in the introduction, and can be viewed as improved variants of those in (MalkomesKCWM15). The main novelty of our algorithms is that they leverage a judiciously increased coreset size to attain approximation qualities arbitrarily close to the ones featured by the best known sequential algorithms. Also, in the analysis, we relate the required coreset size to the doubling dimension of the underlying metric space (whose explicit knowledge, however, is not required by the algorithms), showing that coreset sizes stay small for spaces of bounded doubling dimension.

3.1. MapReduce algorithm for k-center

Consider an instance S of the k-center problem and fix a precision parameter ε > 0, which will be used to regulate the approximation ratio. The MapReduce algorithm works in two rounds. In the first round, S is partitioned into ℓ subsets of equal size, S_1, S_2, …, S_ℓ. In parallel, on each S_i we run gmm incrementally and call T_i^(j) the set of centers selected in the first j iterations of the algorithm. Let r_i^(j) denote the radius of S_i with respect to the first j centers, that is, r_i^(j) = r_{T_i^(j)}(S_i). We continue to run gmm until the first iteration τ_i ≥ k such that r_i^(τ_i) ≤ (ε/2) · r_i^(k), and define the coreset T_i = T_i^(τ_i). In the second round, the union T = T_1 ∪ T_2 ∪ ⋯ ∪ T_ℓ of the coresets is gathered into a single reducer and gmm is run on T to compute the final set X of k centers. In what follows, we show that these centers are a good solution to the k-center problem on S.
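The first-round computation on a single subset can be sketched in Python as follows (this builds on the gmm sketch from Section 2 and reuses its dist helper; the stopping threshold (ε/2) · r_i^(k) is the one stated above, and all names are illustrative):

def coreset_for_subset(points, k, eps):
    # Run gmm incrementally; remember the radius achieved after k iterations
    # and stop as soon as the current radius drops below (eps/2) times it.
    k = min(k, len(points))                    # guard for very small subsets
    centers = [points[0]]
    d_to_centers = [dist(p, centers[0]) for p in points]
    radius_at_k = None
    while True:
        radius = max(d_to_centers)
        if len(centers) == k:
            radius_at_k = radius               # r_i^(k) in the text
        if radius_at_k is not None and radius <= (eps / 2.0) * radius_at_k:
            return centers                     # this is the coreset T_i
        i = max(range(len(points)), key=lambda j: d_to_centers[j])
        centers.append(points[i])
        d_to_centers = [min(d, dist(p, points[i]))
                        for d, p in zip(d_to_centers, points)]

In the second round, the union of the returned coresets is collected and running gmm on it (as in the earlier sketch) yields the final k centers.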

The analysis relies on the following two lemmas, which state that each input point has a close-by representative in T and that T has small size. We define a proxy function p : S → T that maps each x ∈ S_i to the closest point in T_i, for every 1 ≤ i ≤ ℓ. The following lemma is an easy consequence of Lemma 1.

Lemma.

For each x ∈ S, d(x, p(x)) ≤ ε · r*_k(S).

Proof.

Fix i, with 1 ≤ i ≤ ℓ, and consider S_i and the set T_i^(k) computed by the first k iterations of gmm. Since S_i is a subset of S, by Lemma 1 we have that r_i^(k) ≤ 2 · r*_k(S). By construction, we have that r_{T_i}(S_i) ≤ (ε/2) · r_i^(k), hence r_{T_i}(S_i) ≤ ε · r*_k(S). Consider now the proxy function p. For every i and x ∈ S_i, it holds that d(x, p(x)) ≤ r_{T_i}(S_i) ≤ ε · r*_k(S). ∎

We can conveniently bound the size of T, the union of the coresets, as a function of the doubling dimension of the underlying metric space.

Lemma 2.

If S belongs to a metric space of doubling dimension D, then

|T| ≤ ℓ · k · (8/ε)^D.

Proof.

Fix an i, with 1 ≤ i ≤ ℓ. We prove an upper bound on the number of iterations of gmm needed to obtain T_i, which in turn bounds the size of T_i. Consider the k-center clustering of S_i induced by the centers in T_i^(k), whose radius is r_i^(k). By the doubling dimension property, each of the k clusters can be covered using at most (8/ε)^D balls of radius (ε/4) · r_i^(k), for a total of at most k · (8/ε)^D such balls. Consider now the execution of k · (8/ε)^D iterations of the gmm algorithm on S_i. Let T be the set of returned centers and let q be the farthest point of S_i from T. The center selection process of the gmm algorithm ensures that any two points in T ∪ {q} are at distance at least r_T(S_i) from one another. Thus, since two of these k · (8/ε)^D + 1 points must fall into one of the aforementioned balls of radius (ε/4) · r_i^(k), the triangle inequality immediately implies that

r_T(S_i) ≤ (ε/2) · r_i^(k).

Hence, after at most k · (8/ε)^D iterations we are guaranteed that gmm finds a set which meets the stopping condition. Therefore, |T_i| ≤ k · (8/ε)^D for every 1 ≤ i ≤ ℓ, and the lemma follows. ∎

We now state the main result of this subsection.

Theorem 3.

Let 0 < ε ≤ 1. If the points of S belong to a metric space of doubling dimension D, then the above 2-round MapReduce algorithm computes a (2 + ε)-approximation for the k-center problem with local memory M_L = O(n/ℓ + ℓ · k · (8/ε)^D) and linear aggregate memory.

Proof.

Let X be the solution found by gmm on T. Since T ⊆ S, from Lemma 1 it follows that r_X(T) ≤ 2 · r*_k(S). Consider an arbitrary point x ∈ S, along with its proxy p(x), as defined before. By the previous lemma we know that d(x, p(x)) ≤ ε · r*_k(S). Let c ∈ X be the center closest to p(x). It holds that d(p(x), c) ≤ 2 · r*_k(S). By applying the triangle inequality, we have that d(x, c) ≤ d(x, p(x)) + d(p(x), c) ≤ (2 + ε) · r*_k(S). The bound on M_L follows since in the first round each reducer needs to store n/ℓ points of the input and computes a coreset of size at most k · (8/ε)^D, as per Lemma 2, while in the second round one reducer needs enough memory to store ℓ such coresets. Finally, it is immediate to see that aggregate memory proportional to the input size suffices. ∎

By setting ℓ = √(n/k) in the above theorem we obtain:

Corollary.

Our 2-round MapReduce algorithm computes a (2 + ε)-approximation for the k-center problem with local memory M_L = O(√(nk) · (8/ε)^D) and linear aggregate memory. For constant ε and D, the local memory bound becomes O(√(nk)).

3.2. MapReduce algorithm for k-center with outliers

Consider an instance S of the k-center problem with z outliers and fix a precision parameter ε > 0 intended, as before, to regulate the approximation ratio. We propose the following 2-round MapReduce algorithm for the problem. In the first round, S is partitioned into ℓ equally-sized subsets S_1, S_2, …, S_ℓ and, for each S_i, in parallel, gmm is run incrementally. Let T_i^(k+z) be the set of the first k + z selected centers, and let r_i^(k+z) be the corresponding radius of S_i. We continue to run gmm until the first iteration τ_i ≥ k + z such that r_i^(τ_i) ≤ (ε/2) · r_i^(k+z). Define the coreset T_i = T_i^(τ_i). As before, for each point x ∈ S_i we define its proxy p(x) to be the point of T_i closest to x but, furthermore, we attach to each q ∈ T_i a weight w_q, which is the number of points of S_i with proxy q.
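The weighted coreset of a single subset can be sketched by composing the earlier helpers (coreset_for_subset and dist from the previous sketches; names are illustrative):

def weighted_coreset_for_subset(points, k, z, eps):
    # First-round computation for the formulation with outliers: select
    # centers with the stopping rule applied after k + z iterations, then
    # weigh each center by the number of points whose proxy it is.
    centers = coreset_for_subset(points, k + z, eps)
    weights = [0] * len(centers)
    for p in points:
        j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        weights[j] += 1                      # p's proxy is centers[j]
    return centers, weights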

In the second round, the union T = T_1 ∪ ⋯ ∪ T_ℓ of the weighted coresets is gathered into a single reducer. Before describing the details of this second round, we need to introduce a sequential algorithm, dubbed OutliersCluster (see the pseudocode below), for solving a weighted variant of the k-center problem with outliers; it is a modification of the algorithm presented in (MalkomesKCWM15) (in turn based on the unweighted algorithm of (Charikar2001)).

X ← ∅;  T′ ← T
while |X| < k and T′ ≠ ∅ do
       y ← point of T maximizing the aggregate weight of {q ∈ T′ : d(q, y) ≤ r}
       X ← X ∪ {y}
       T′ ← T′ \ {q ∈ T′ : d(q, y) ≤ 3r}
return (X, T′)
Algorithm 1 OutliersCluster(T, r, k)

OutliersCluster takes as input the weighted set T, a radius r, and the parameter k, and returns two subsets X, T′ ⊆ T such that X is a set of (at most) k centers and T′ is a set of points referred to as uncovered points. The algorithm starts with X = ∅ and T′ = T, and builds X incrementally in at most k iterations as follows. In each iteration, the next center y is chosen as the point of T maximizing the aggregate weight of the uncovered points in its ball of radius r (note that y need not be an uncovered point itself). Then, all uncovered points at distance at most 3r from y are removed from T′. The algorithm terminates when either |X| = k or T′ = ∅. By construction, the final T′ consists of all points of T at distance greater than 3r from X.
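A direct Python transcription of OutliersCluster as just described (a sketch reusing the dist helper from the earlier sketches; points and weights are parallel lists, and the selection/removal radii r and 3r mirror the pseudocode above):

def outliers_cluster(points, weights, r, k):
    # Greedy weighted center selection: repeatedly pick the point whose
    # radius-r ball covers the largest uncovered weight, then mark every
    # uncovered point within 3r of it as covered.
    uncovered = set(range(len(points)))
    centers = []
    while len(centers) < k and uncovered:
        best = max(range(len(points)),
                   key=lambda c: sum(weights[j] for j in uncovered
                                     if dist(points[j], points[c]) <= r))
        centers.append(points[best])
        uncovered = {j for j in uncovered
                     if dist(points[j], points[best]) > 3 * r}
    return centers, uncovered   # 'uncovered' plays the role of T' in the text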

Let us return to the second round of our MapReduce algorithm. The reducer that gathered T runs OutliersCluster multiple times to estimate the minimum value r_min such that the aggregate weight of the points in the uncovered set T′ returned by OutliersCluster(T, r_min, k) is at most z. More specifically, the computed estimate, say r̂, is within a multiplicative tolerance (1 + δ) from the true r_min, for a small constant δ > 0, and it is obtained through a binary search over all possible distances between points of T combined with a geometric search with step (1 + δ). To avoid storing all Θ(|T|²) distances, the pivot distance needed at each iteration of the binary search can be determined in space linear in |T| by the median-finding Streaming algorithm of (MunroP80). The output of the MapReduce algorithm is the set X of centers computed by OutliersCluster(T, r̂, k).
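A simplified sketch of this second round is given below; for brevity it uses a plain geometric search over candidate radii instead of the combined binary/geometric search described above, with delta playing the role of the tolerance (all names are illustrative, and the helpers come from the previous sketches):

def second_round(coreset, weights, k, z, delta=0.1):
    # Find (approximately) the smallest radius r at which OutliersCluster
    # leaves uncovered weight at most z, and return the resulting centers.
    gaps = [dist(p, q) for i, p in enumerate(coreset) for q in coreset[i + 1:]]
    r = min((d for d in gaps if d > 0), default=1.0)   # smallest nonzero distance
    while True:
        centers, uncovered = outliers_cluster(coreset, weights, r, k)
        if sum(weights[j] for j in uncovered) <= z:
            return centers
        r *= 1.0 + delta                               # geometric search step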

We now analyze our 2-round MapReduce algorithm. The following lemma bounds the distance between a point and its proxy.

Lemma 5.

For each x ∈ S, d(x, p(x)) ≤ ε · r*_{k,z}(S).

Proof.

Consider any subset S_i of the partition of S. By construction, we have that, for each x ∈ S_i, d(x, p(x)) ≤ r_{T_i}(S_i) ≤ (ε/2) · r_i^(k+z). Since S_i is a subset of S, Lemma 1 ensures that r_i^(k+z) ≤ 2 · r*_{k+z}(S). Hence, d(x, p(x)) ≤ ε · r*_{k+z}(S). Since r*_{k+z}(S) ≤ r*_{k,z}(S), as observed before in Eq. (1), we have d(x, p(x)) ≤ ε · r*_{k,z}(S). ∎

Next, we characterize the quality of the solution returned by OutliersCluster when run on T, the union of the weighted coresets, with a sufficiently large radius r.

Lemma 6.

For r ≥ (1 + 2ε) · r*_{k,z}(S), let (X, T′) be the pair of sets returned by OutliersCluster(T, r, k), and define S_{T′} = {x ∈ S : p(x) ∈ T′}. Then, every point of S \ S_{T′} is at distance at most 3r + ε · r*_{k,z}(S) from X, and |S_{T′}| ≤ z.

Proof.

The proof uses an argument akin to the one used for the analysis of the sequential algorithm by (Charikar2001), later adapted by (MalkomesKCWM15) to the weighted coreset setting. The first claim follows immediately from the workings of the algorithm: each point of T \ T′ belongs to some ball of radius 3r centered at a center of X, hence each x ∈ S \ S_{T′} satisfies d(x, X) ≤ d(x, p(x)) + 3r ≤ ε · r*_{k,z}(S) + 3r. We are left to show that |S_{T′}| ≤ z. Suppose first that T′ = ∅. In this case, it must be S_{T′} = ∅, hence |S_{T′}| = 0 ≤ z, and the proof follows. We now concentrate on the case |X| = k. Consider the t-th iteration of the while loop of OutliersCluster and define x_t as the center of X selected in the iteration, and U_t as the set of uncovered points at the beginning of the iteration. Recall that x_t is the point of T which maximizes the cumulative weight of the set B_t = {q ∈ U_t : d(q, x_t) ≤ r} of uncovered points at distance at most r from x_t, and that the set B̄_t = {q ∈ U_t : d(q, x_t) ≤ 3r} of all uncovered points at distance at most 3r from x_t is removed from the uncovered set at the end of the iteration. In what follows, for a subset Y ⊆ T, w(Y) denotes the aggregate weight of its points. We now show that

(2)    ∑_{t=1}^{k} w(B̄_t) ≥ |S| − z,

which will immediately imply that the aggregate weight of T′, and hence |S_{T′}|, is at most z (recall that the total weight of T equals |S|). For this purpose, let O be an optimal set of k centers for the problem instance under consideration, and let Z be the set of at most z outliers at distance greater than r*_{k,z}(S) from O. For each o ∈ O, define C_o as the set of non-outlier points of S which are closer to o than to any other center of O, with ties broken arbitrarily. To prove (2), it is sufficient to exhibit an ordering o_{i_1}, o_{i_2}, …, o_{i_k} of the centers in O so that, for every 1 ≤ t ≤ k, it holds

∑_{j=1}^{t} w(B̄_j) ≥ |C_{o_{i_1}} ∪ ⋯ ∪ C_{o_{i_t}}|.
The proof uses an inductive charging argument to assign each point in C_{o_{i_1}} ∪ ⋯ ∪ C_{o_{i_t}} to a point in B̄_1 ∪ ⋯ ∪ B̄_t, where each point q in the latter set will be in charge of at most w_q points. We define two charging rules. A point can be either charged to its own proxy (Rule 1) or to another point of T (Rule 2).

Fix some arbitrary t, with 1 ≤ t ≤ k, and assume, inductively, that the points in C_{o_{i_1}} ∪ ⋯ ∪ C_{o_{i_{t−1}}} have been charged to points in B̄_1 ∪ ⋯ ∪ B̄_{t−1}, for some choice of distinct optimal centers o_{i_1}, …, o_{i_{t−1}}. We have two cases.
Case 1. There exists an optimal center o still unchosen such that some ball B_j, with j ≤ t, contains the proxy p(x) of a point x ∈ C_o. We choose o_{i_t} = o. Hence d(p(x), x_j) ≤ r. By repeatedly applying the triangle inequality we have that, for each y ∈ C_o,

d(p(y), x_j) ≤ d(p(y), y) + d(y, o) + d(o, x) + d(x, p(x)) + d(p(x), x_j) ≤ 2ε · r*_{k,z}(S) + 2 · r*_{k,z}(S) + r ≤ 3r,

hence p(y) belongs to B̄_1 ∪ ⋯ ∪ B̄_j. Therefore we can charge each point of C_{o_{i_t}} to its proxy, by Rule 1.
Case 2. For each unchosen optimal center o and each j ≤ t, no ball B_j contains the proxy of a point of C_o. We choose o_{i_t} to be the unchosen optimal center which maximizes the cardinality of C_o. We distinguish between points x ∈ C_{o_{i_t}} whose proxy p(x) has already been removed from the uncovered set (hence p(x) ∈ B̄_1 ∪ ⋯ ∪ B̄_{t−1}), and those whose proxy still belongs to U_t. We charge each x of the first type to its own proxy, by Rule 1. As for the other points, we now show that we can charge them to the points of B_t, by Rule 2. To this purpose, we first observe that the ball of radius r centered at p(o_{i_t}) contains the proxies of all points of C_{o_{i_t}}, since for each x ∈ C_{o_{i_t}}

d(p(x), p(o_{i_t})) ≤ d(p(x), x) + d(x, o_{i_t}) + d(o_{i_t}, p(o_{i_t})) ≤ (1 + 2ε) · r*_{k,z}(S) ≤ r.

Therefore the aggregate weight of the uncovered points in this ball is at least the number of points of C_{o_{i_t}} whose proxy is still uncovered. Since Iteration t selects x_t as the center whose set B_t of close-by uncovered points has maximum aggregate weight, we have that

w(B_t) ≥ |{x ∈ C_{o_{i_t}} : p(x) ∈ U_t}|,

hence, the points in B_t have enough weight to be charged with each point x ∈ C_{o_{i_t}} whose proxy is still uncovered. Figure 1 illustrates the charging under Case 2.

Figure 1. Application of the charging rules in Case 2 of the proof: points whose proxy is still uncovered are charged to the points of B_t by Rule 2, while points whose proxy has already been covered are charged to their own proxy by Rule 1. Round points are points of S, whereas star-shaped points are proxy points in T. Arrows represent charging.

Note that the points of B_t did not receive any charging by Rule 1 in previous iterations, since they are uncovered at the beginning of Iteration t, and will not receive chargings by Rule 1 in subsequent iterations since, by the condition defining Case 2, B_t does not contain the proxy of any point belonging to the cluster of an optimal center yet to be chosen. Also, no further charging to points of B_t by Rule 2 will happen in subsequent iterations, since Rule 2 only targets points that are still uncovered, while the points of B_t are removed from the uncovered set at the end of Iteration t. These observations ensure that any point of T receives charges through either Rule 1 or Rule 2, but not both, and never in excess of its weight, and the proof follows. ∎

The following lemma bounds the size of T, the union of the weighted coresets.

Lemma 7.

If S belongs to a metric space of doubling dimension D, then

|T| ≤ ℓ · (k + z) · (8/ε)^D.

Proof.

The proof proceeds similarly to the one of Lemma 2, with the understanding that the definition of doubling dimension is applied to each of the k + z clusters induced by the points of T_i^(k+z) on S_i. ∎

Finally, we state the main result of this subsection.

Theorem 8.

Let 0 < ε ≤ 1. If the points of S belong to a metric space of doubling dimension D, then, when run with a suitable search tolerance δ = Θ(ε), the above 2-round MapReduce algorithm computes a (3 + ε)-approximation for the k-center problem with z outliers with local memory M_L = O(n/ℓ + ℓ · (k + z) · (8/ε)^D) and linear aggregate memory.

Proof.

The result of Lemma 6, combined with the stipulated tolerance of the search performed in the second round of the algorithm, implies that the radius r̂ discovered by the search satisfies r̂ ≤ (1 + δ)(1 + 2ε) · r*_{k,z}(S), while the aggregate weight of the points left uncovered by OutliersCluster(T, r̂, k) is at most z. Also, by the triangle inequality, the distance between each non-outlier point in S and its closest returned center will be at most 3r̂ + ε · r*_{k,z}(S), which proves the approximation bound once the precision parameters are suitably rescaled. The bound on M_L follows since in the first round each reducer needs enough memory to store n/ℓ points of the input, while in the second round the reducer computing the final solution requires enough memory to store the union of the ℓ coresets which, by Lemma 7, have size at most (k + z) · (8/ε)^D each. Also, globally, the reducers need only sufficient memory to store the input, hence M_A = O(n). ∎

By setting ℓ = √(n/(k + z)) in the above theorem we obtain:

Corollary.

Our 2-round MapReduce algorithm computes a (3 + ε)-approximation for the k-center problem with z outliers, with local memory M_L = O(√(n(k + z)) · (8/ε)^D) and linear aggregate memory. For constant ε and D, the local memory bound becomes O(√(n(k + z))).

Improved sequential algorithm. A simple analysis shows that, by setting ℓ = 1, our MapReduce strategy for the k-center problem with z outliers yields an efficient sequential (3 + ε)-approximation algorithm whose running time is linear in n and polynomial in the coreset size |T| = O((k + z) · (8/ε)^D), which is independent of n. For a wide range of values of k, z, and ε, this yields a substantially improved performance over the state-of-the-art algorithm of (Charikar2001), whose running time is cubic in n, at the expense of a negligibly worse approximation.
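In code, this sequential instantiation simply composes the two sketches given earlier (illustrative names as before):

def sequential_k_center_with_outliers(points, k, z, eps):
    # Single-subset (ell = 1) instantiation of the MapReduce strategy:
    # build one weighted coreset, then run the radius search on it.
    coreset, weights = weighted_coreset_for_subset(points, k, z, eps)
    return second_round(coreset, weights, k, z)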

3.2.1. Higher space efficiency through randomization

The analysis of very noisy datasets might require setting the number z of outliers much larger than k, while still much smaller than the input size n. In this circumstance, the size of the union of the coresets is proportional to ℓ · z, and may turn out too large for practical purposes, due to the large local memory requirements and to the running time of the cubic sequential approximation algorithm run on T in the second round, which may become the real performance bottleneck of the entire algorithm. In this subsection, we show that this drawback can be significantly ameliorated by simply partitioning the pointset at random in the first round, at the only expense of probabilistic, rather than deterministic, space and approximation guarantees. We say that an event related to a dataset S of n points occurs with high probability if its probability is at least 1 − 1/n^c, for some constant c ≥ 1.

The randomized variant of the algorithm works as follows. In the first round, the input set S is partitioned into ℓ subsets S_1, S_2, …, S_ℓ by assigning each point to a subset chosen uniformly at random and independently of the other points. Let ẑ = O(z/ℓ + log n) be the bound, established in Lemma 10 below, on the number of outliers of an optimal solution that end up in any single subset with high probability, and observe that, for large ℓ and z, we have that ẑ ≪ z. Then, in parallel on each partition S_i, gmm is run to yield a set T_i of centers, where |T_i| is the smallest value τ_i such that r_i^(τ_i) ≤ (ε/2) · r_i^(k+ẑ). Define the coreset T_i accordingly and, again, for each point x ∈ S_i define its proxy p(x) to be the point of T_i closest to x. The rest of the algorithm is exactly as before, using these new T_i's.

The analysis proceeds as follows. Consider an optimal solution of the k-center problem with z outliers for S, and let O be its set of k centers and Z_S the set of z outliers, that is, the z points of S most distant from O. Recall that any point of S \ Z_S is at distance at most r*_{k,z}(S) from O. The following lemma states that the outliers (i.e., the set Z_S) are well distributed among the S_i's.

Lemma 10.

With high probability, each S_i contains no more than ẑ = O(z/ℓ + log n) points of Z_S.

Proof.

The result follows by applying the Chernoff bound (4.3) of (MitzemacherU17) to the number of points of Z_S assigned to each subset (a binomial random variable with mean z/ℓ), and then the union bound over the ℓ subsets, which yield that the stated event occurs with probability at least 1 − 1/n^c, for a suitable constant c ≥ 1. ∎
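To make the calculation explicit, here is one standard way to instantiate it (a sketch; the precise constants used in the paper's proof may differ): the number X_i of points of Z_S assigned to S_i is binomial with mean μ = z/ℓ. Setting R = max{6z/ℓ, 6 log₂ n} ≥ 6μ, a standard multiplicative Chernoff bound (Pr[X_i ≥ R] ≤ 2^{−R} whenever R ≥ 6μ) gives

Pr[X_i ≥ R] ≤ 2^{−R} ≤ 2^{−6 log₂ n} = n^{−6},

and a union bound over the ℓ ≤ n subsets bounds the overall failure probability by n^{−5}, which yields the O(z/ℓ + log n) bound stated in Lemma 10.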

The rest of the analysis mimics the one of the deterministic version.

Lemma.

The statements of both Lemma 5 and Lemma 6 hold with high probability.

Proof.

We first prove that, with high probability, for each x ∈ S, d(x, p(x)) ≤ ε · r*_{k,z}(S) (same as Lemma 5). We condition on the event that each S_i contains at most ẑ points of Z_S, which, by Lemma 10, occurs with high probability. Focus on an arbitrary subset S_i. For 1 ≤ j ≤ k, let C_j^i be the set of points of S_i \ Z_S whose closest optimal center is the j-th center of O, and let Z_i = Z_S ∩ S_i. Consider the set of centers determined by the first k + ẑ iterations of the gmm algorithm on S_i and let q be the farthest point of S_i from this set. By arguing as in the proof of Lemma 2, it can be shown that any two of these k + ẑ + 1 points are at distance at least r_i^(k+ẑ) from one another and, since |Z_i| ≤ ẑ, two of these points must belong to the same C_j^i for some j, so by the triangle inequality we have that

r_i^(k+ẑ) ≤ 2 · r*_{k,z}(S).

Recall that the gmm algorithm on S_i is stopped at the first iteration τ_i such that r_i^(τ_i) ≤ (ε/2) · r_i^(k+ẑ), hence

r_{T_i}(S_i) ≤ ε · r*_{k,z}(S).

The desired bound on d(x, p(x)) immediately follows. Conditioning on this bound, the proof of Lemma 6 can be repeated identically, hence the stated property holds. ∎

By repeating the same argument used in Lemma 7, one can easily argue that, if S belongs to a metric space of doubling dimension D, then, with high probability, the size of the weighted coreset T is

|T| = O((ℓ k + z + ℓ log n) · (8/ε)^D).

This bound, together with the results of the preceding lemma, immediately implies the analogue of Theorem 8, stating that, with high probability, the randomized algorithm computes a (3 + ε)-approximation for the k-center problem with z outliers with local memory M_L = O(n/ℓ + (ℓ k + z + ℓ log n) · (8/ε)^D) and linear aggregate memory. Observe that the term ℓ · z of the deterministic bound is now replaced by the (much smaller) z + ℓ log n in the local memory bound.

By choosing ℓ = √(n/(k + log n)) we obtain:

Corollary.

With high probability, our 2-round MapReduce algorithm computes a (3 + ε)-approximation for the k-center problem with z outliers, with local memory M_L = O((√(n(k + log n)) + z) · (8/ε)^D) and linear aggregate memory. For constant ε and D, the local memory bound becomes O(√(n(k + log n)) + z).

With respect to the deterministic version, for large values of z a substantial improvement in the local memory requirements is achieved.
Remark. Thanks to the incremental nature of gmm, our coreset-based MapReduce algorithms for the k-center problem, both without and with outliers, need not know the doubling dimension D of the underlying metric space in order to attain the claimed performance bounds. This is a very desirable property since, in general, D may not be known in advance and may be hard to evaluate. Moreover, if D were known, a factor (8/ε)^{D/2} in local memory could be saved by setting ℓ to be a factor (8/ε)^{D/2} smaller than the values used above (with k replaced by k + z in the case with outliers).

4. Streaming algorithm for k-center with outliers

As mentioned in the introduction, in the Streaming setting we only consider the k-center problem with z outliers. Consider an instance S of the problem and fix a precision parameter ε > 0. Suppose that the points of S belong to a metric space of known doubling dimension D. Our Streaming algorithm also adopts a coreset-based approach. Specifically, in a single pass over the stream of points of S a suitable weighted coreset T is selected and stored in the working memory. Then, at the end of the pass, the final set of centers is determined through multiple runs of OutliersCluster on T, as was done in the second round of the MapReduce algorithm described in Subsection 3.2. Below, we focus on the coreset construction.

The algorithm computes a coreset T of κ centers which represent a good approximate solution to the κ-center problem on S (without outliers), for a suitable κ ≥ k + z. The value of κ, which will be fixed later, depends on k + z, ε, and D. The main difference with respect to the MapReduce algorithm is that we cannot exploit the incremental approach provided by gmm, since no efficient implementation of gmm in the Streaming setting is known. Hence, for the computation of T we resort to a novel weighted variant of the doubling algorithm by Charikar et al. (CharikarCFM04), which is described below.

For a given stream S of points and a target number κ of centers, the algorithm maintains a weighted set M of centers selected among the points of S processed so far, and a lower bound φ on the optimal radius r*_κ(S). M is initialized with the first κ + 1 points of S, with each point assigned weight 1, while φ is initialized to half the minimum distance between these points. For the sake of the analysis, we will also refer to a proxy function p which, however, is not explicitly stored by the algorithm. Initially, each point of M is proxy for itself. The remaining points of S are processed one at a time, maintaining the following invariants:

  (a) M contains at most κ centers.

  (b) For any two distinct centers c, c′ ∈ M, we have d(c, c′) > 4φ.

  (c) For each point x of S processed so far, d(x, p(x)) ≤ 8φ.

  (d) For each c ∈ M, w_c equals the number of points x processed so far with p(x) = c.

  (e) φ ≤ r*_κ(S).

The following two rules are applied to process each new point x. The update rule checks whether d(x, M) ≤ 8φ. If this is the case, the center c ∈ M closest to x is identified, w_c is incremented by one, and we implicitly define p(x) = c. If instead d(x, M) > 8φ, then x is added as a new center to M, setting w_x to 1 and defining p(x) = x. Note that in this latter case the size of M may exceed κ, thus violating invariant (a). When this happens, the following merge rule is invoked repeatedly until invariant (a) is re-established. Each invocation of this rule first sets φ to 2φ, which, in turn, may lead to a violation of invariant (b). If this is the case, for each pair of centers c, c′ violating invariant (b), we discard c′ and set w_c ← w_c + w_{c′}. Conceptually, this corresponds to an update of the proxy function which redefines p(x) = c for each point x for which p(x) was equal to c′.

Observe that, at the end of the initialization, invariants (a) and (b) do not hold, while invariants (c)–(e) do hold. Thus, we prescribe that the merge rule and the re-enforcement of invariant (b) be applied at the end of the initialization, before any new point is processed. This ensures that all invariants hold before the (κ + 2)-nd point of S is processed. The following lemma shows that the above rules maintain all invariants.
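The following Python sketch implements the weighted doubling procedure just described (names are illustrative, dist is the metric helper from the earlier sketches, and the attach/merge thresholds 8φ and 4φ mirror the invariants stated above):

class StreamingCoreset:
    # Weighted variant of the doubling algorithm: maintains at most kappa
    # weighted centers and a lower bound phi, as in invariants (a)-(e).
    def __init__(self, first_points, kappa):
        self.kappa = kappa
        self.centers = list(first_points)          # the first kappa + 1 points
        self.weights = [1] * len(self.centers)
        gaps = [dist(p, q) for i, p in enumerate(self.centers)
                for q in self.centers[i + 1:]]
        self.phi = 0.5 * min((g for g in gaps if g > 0), default=1.0)
        self._merge()                              # re-establish (a) and (b)

    def _merge(self):
        while len(self.centers) > self.kappa:
            self.phi *= 2.0
            kept, kept_w = [], []
            for c, w in zip(self.centers, self.weights):
                for j, c2 in enumerate(kept):
                    if dist(c, c2) <= 4 * self.phi:   # would violate (b)
                        kept_w[j] += w                # fold c into c2
                        break
                else:
                    kept.append(c)
                    kept_w.append(w)
            self.centers, self.weights = kept, kept_w

    def process(self, x):
        # Update rule: attach x to its closest center if it is within 8*phi,
        # otherwise open a new center and restore the invariants if needed.
        j = min(range(len(self.centers)), key=lambda i: dist(x, self.centers[i]))
        if dist(x, self.centers[j]) <= 8 * self.phi:
            self.weights[j] += 1
        else:
            self.centers.append(x)
            self.weights.append(1)
            self._merge()

A single pass then amounts to constructing the object from the first κ + 1 points and calling process on each remaining point; at the end of the pass the weighted centers are fed to the same OutliersCluster-based search used in the second round of the MapReduce algorithm.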

Lemma.

After the initialization, at the end of the processing of each point of the stream, all invariants hold.

Proof.

As explained above, all invariants are enforced at the end of the initialization. Consider the processing of a new point