Center-based clustering is a crucial primitive for data management, with application domains as diverse as recommendation systems, facility location, database search, bioinformatics, content distribution systems, and many more. In general terms, given a dataset, a distance function between pairs of its points, and a value $k$, a solution for center-based clustering is a set of $k$ representative points, called centers, which induce a partition of the dataset into $k$ subsets (clusters), each containing all points closest to the same center. One important formulation of center-based clustering is the $k$-center problem, where the set of centers must be chosen as a subset of the dataset which minimizes the maximum distance of any point to its closest center. It is well known that $k$-center is NP-hard, that it admits a 2-approximation algorithm, and that for any $\epsilon > 0$ it is not $(2-\epsilon)$-approximable unless P = NP.
A number of natural variants of $k$-center have been studied in the literature. The constrained variants restrict the set of returned centers to obey an additional constraint, which can be expressed either as a matroid constraint, that is, the set of centers must be an independent set of a given matroid defined on the input dataset, or as a knapsack constraint, where each point of the dataset carries a weight, and the aggregate weight of the returned centers cannot exceed a certain budget. Matroid and knapsack constraints arise naturally in the context of recommendation systems and facility location. In the former context, consider for instance the case of points in the dataset belonging to different categories, where all categories should have a given quota of representatives (centers) in the returned solution, a constraint naturally expressible as a partition matroid. In the latter, "opening" a center at a given location might carry different costs, and the final solution cannot exceed a total budget.
Another variant of the original problem is motivated by the observation that the $k$-center objective function involves a maximum, thus the optimal solution is at risk of being severely influenced by a few "distant" points of the dataset, called outliers. In fact, the presence of outliers is inherent in many datasets, since these points are often due to artifacts or errors in data collection. To cope with this issue, $k$-center admits the following robust formulation that takes outliers into account: given an additional input parameter $z$, when computing the $k$-center objective function, the $z$ points with the largest distances from their respective centers are disregarded in the computation of the maximum. Robust formulations of the constrained variants have also been studied, referred to as the Robust Matroid Center (RMC) and Robust Knapsack Center (RKC) problems, respectively.
The explosive growth of data that needs to be processed in modern computing scenarios often rules out the use of traditional sequential strategies which, while efficient on small-sized datasets, often prove to be prohibitive on massive ones. It is thus of paramount importance to devise clustering strategies amenable to the typical computational frameworks employed for big data processing, such as MapReduce and Streaming. Coreset-based strategies have recently emerged as ideal approaches for big data processing. Informally, these strategies entail the (efficient) extraction of a very succinct summary (dubbed coreset) of the dataset, so that a solution for the whole dataset can be obtained by running (suitable modifications of) the best sequential algorithm on the coreset. Coreset constructions that can be either parallelized or streamlined efficiently yield scalable and space-efficient algorithms in the big data realm. The objective of this paper is to devise novel coreset-based strategies for the RMC and RKC problems, featuring efficient sequential, MapReduce and Streaming implementations.
1.1 Previous work
Due to space constraints we only report on the works most closely related to the specific topic of this paper, and refer the interested reader to and references therein for a more comprehensive overview of center-based clustering. Sequential approximation algorithms for the RMC and RKC problems are given in [9, 12, 7]. The best algorithms to date are sequential 3-approximations for both RMC [12, 7] and RKC . None of these algorithms, however, seems immediately amenable to MapReduce or Streaming implementations. Coreset-based Streaming algorithms for RMC and RKC have been recently devised by Kale in . For , Kale’s streaming algorithms compute a coreset of size containing a -approximate solution, where is the number of outliers and is the rank of the matroid, for RMC, or the maximum cardinality of a feasible solution, for the RKC problem. The solution embedded in the coresets of can be extracted using a brute-force approach. Alternatively, one of the 3-approximate sequential algorithms in [12, 7] can be run on the coreset to yield a -approximate solution. To the best of our knowledge no MapReduce algorithms for RMC and RKC have been presented in the open literature.
Coreset-based algorithms for the MapReduce and Streaming setting for the unconstrained (robust) -center problem and related problems can be found in [19, 6, 4]. Useful techniques to deal with matroid constraints in big data scenarios have been introduced in [3, 5] in the realm of diversity maximization.
1.2 Our contribution
By leveraging ideas introduced in [3, 17], we present novel algorithms for the RMC and RKC problems which attain approximation ratios close to the best attainable ones, and feature efficient sequential implementations as well as efficient implementations in the MapReduce and Streaming settings, thus proving suitable for dealing with massive inputs. Our strategies exploit the basic $k$-center primitive to extract a small coreset from the input set, with the property that the distance between each input point and the closest coreset point is a small fraction of the cost of the optimal solution. Also, the coreset contains a good solution for the original problem, which can be computed by assigning a suitable multiplicity to each coreset point and by running the best-known sequential algorithms for RMC and RKC on the coreset, adapted to take multiplicities into account.
More specifically, for any fixed accuracy parameter $\epsilon$, our RMC and RKC algorithms feature a $(3+\epsilon)$ approximation ratio (see Corollaries 3 and 4 for a formal statement of the results). The time and space requirements of the algorithms are analyzed in terms of the number of outliers, the matroid rank (in the RMC problem) or the minimum cardinality of an optimal solution (in the RKC problem), the approximation quality, captured by $\epsilon$, and the doubling dimension of the input set, a parameter that generalizes the notion of Euclidean dimension to arbitrary metric spaces. We remark that this kind of dimensionality-aware analysis is particularly relevant in the realm of big data, and it has been employed in a variety of contexts including diversity maximization, clustering, nearest neighbour search, routing, and machine learning.
For both problems, the sequential complexity of our algorithms is , for a certain function , and it is thus linear for fixed values of and . The RMC strategy admits a 2-round MapReduce implementation requiring local memory sublinear in (Theorem 5.1), and a 1-pass Streaming implementation with working memory size dependent only on and (Theorem 5.2). The RKC strategy admits an -round MapReduce implementation requiring local memory sublinear in (Theorem 5.1), and an -pass Streaming implementation with working memory size dependent only on and (Theorem 5.2), where . For constant , the number of rounds (resp., passes) can be reduced to , at the expense of a (resp., ) increase in the local memory (resp., working memory) size. Remarkably, while the analysis of our algorithms is performed in terms of the doubling dimension of the input set, the algorithms are oblivious to this value, which, in fact, would be difficult to estimate.
Our MapReduce algorithms provide the first efficient solutions to RMC and RKC in a distributed setting and attain an approximation quality that can be made arbitrarily close to that of the best sequential algorithms. Our Streaming algorithms share the same approximation quality as the MapReduce algorithms and substantially improve upon the approximations attained in . Furthermore, all of our algorithms are very space efficient for a wide range of the parameter space. In particular, the working space of our RKC Streaming algorithm depends on the size of the smallest optimal solution rather than on the largest feasible solution as in , which might result in a considerable space-saving. Finally, it is important to observe that in the sequential and Streaming settings, for fixed values of , and , exhaustive search on the coresets yields -approximate solutions to RMC and RKC with work merely linear in .
The rest of the paper is organized as follows. Section 2 introduces some key technical notions and formally defines the RMC and RKC problems. The coreset-based strategies for RMC and RKC are described and analyzed in Sections 3 and 4, respectively, while their MapReduce and Streaming implementations are discussed in Section 5. Section 6 offers some concluding remarks.
This section introduces some key notions and basic properties that will be used throughout the paper, and defines the computational problems studied in this work.
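Let $S$ be a ground set of elements from a metric space with distance function $d(\cdot,\cdot)$ satisfying the triangle inequality. A matroid $M$ on $S$ is a pair $(S, I)$, where $I$ is a family of subsets of $S$, called independent sets, satisfying the following properties: (i) the empty set is independent; (ii) every subset of an independent set is independent (hereditary property); and (iii) if $A \in I$ and $B \in I$, and $|A| > |B|$, then there exists $a \in A \setminus B$ such that $B \cup \{a\} \in I$ (augmentation property). An independent set is maximal if it is not properly contained in another independent set. A basic property of a matroid is that all of its maximal independent sets have the same size. The notion of maximality can be naturally extended to any subset of the ground set. Namely, for $S' \subseteq S$, an independent set of maximum cardinality among all independent sets contained in $S'$ is called a maximal independent set of $S'$, and all maximal independent sets of $S'$ have the same size. We let the rank of a subset $S' \subseteq S$, denoted by $\mathrm{rank}(S')$, be the size of a maximal independent set in $S'$. The rank of the matroid is then defined as $\mathrm{rank}(S)$. An important property of the rank function is submodularity: for any $A, B \subseteq S$ it holds that $\mathrm{rank}(A \cup B) + \mathrm{rank}(A \cap B) \leq \mathrm{rank}(A) + \mathrm{rank}(B)$. The following lemma is an adaptation of [17, Lemma 3] and provides a useful property of matroids which will be exploited to derive the results of this paper.
Lemma 2.1 (Extended augmentation property). Let $M = (S, I)$ be a matroid. Consider an independent set $A \in I$, a subset $S' \subseteq S$, and a maximal independent set $B$ of $S'$. If there exists $y \in S' \setminus A$ such that $A \cup \{y\} \in I$, then there exists $x \in B \setminus A$ such that $A \cup \{x\} \in I$.
Proof. Since $B$ is maximal in $S'$, we have that $\mathrm{rank}(B \cup \{y\}) = |B|$. Also, $\mathrm{rank}(A \cup B \cup \{y\}) \geq |A| + 1$, since $A \cup \{y\} \in I$. By applying the submodularity property to sets $A \cup B$ and $B \cup \{y\}$ we have the inequality
$$\mathrm{rank}(A \cup B \cup \{y\}) + \mathrm{rank}((A \cup B) \cap (B \cup \{y\})) \leq \mathrm{rank}(A \cup B) + \mathrm{rank}(B \cup \{y\}),$$
which can be manipulated using the above relations, together with $\mathrm{rank}((A \cup B) \cap (B \cup \{y\})) \geq |B|$, to yield $|A| + 1 + |B| \leq \mathrm{rank}(A \cup B) + |B|$, whence $\mathrm{rank}(A \cup B) \geq |A| + 1$. So, there exists an independent set of $|A| + 1$ elements contained in $A \cup B$, and the lemma follows from the augmentation property. ∎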
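For concreteness, the matroid notions above can be illustrated on a partition matroid, the example mentioned in the introduction. The following is a minimal Python sketch and not part of the algorithms of this paper: the names `is_independent`, `maximal_independent_subset`, `rank`, `category` and `quota` are illustrative assumptions. It relies on the standard fact that, in a matroid, a greedy scan of a subset yields an independent set of maximum cardinality within that subset.

```python
from collections import Counter

def is_independent(points, category, quota):
    """Independence oracle of a partition matroid (illustrative assumption):
    a set is independent if it holds at most quota[c] points of category c."""
    counts = Counter(category[p] for p in points)
    return all(counts[c] <= quota.get(c, 0) for c in counts)

def maximal_independent_subset(candidates, category, quota):
    """Greedy scan: keep a point whenever the kept set stays independent.
    In a matroid, the result has maximum cardinality among the independent
    subsets of `candidates`."""
    chosen = []
    for p in candidates:
        if is_independent(chosen + [p], category, quota):
            chosen.append(p)
    return chosen

def rank(subset, category, quota):
    """Rank of a subset: the size of one of its maximal independent sets."""
    return len(maximal_independent_subset(subset, category, quota))
```

Any other matroid can be plugged in by replacing `is_independent` with the corresponding oracle; the greedy scan itself does not depend on the partition structure.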
2.2 Definitions of the problems
The well-known $k$-center problem is defined as follows. Given a set $S$ of points from a metric space with distance function $d(\cdot,\cdot)$, determine a subset $C \subseteq S$ of size $k$ which minimizes $\max_{s \in S} \min_{c \in C} d(s,c)$. For convenience, throughout the paper we will use the notation $d(s,C) = \min_{c \in C} d(s,c)$. Several variants of the $k$-center problem have been proposed and studied in the literature. Mostly, these variants impose additional constraints on the solution and/or allow a given number of points to be disregarded from the computation of the maximum in the objective function. In this paper, we focus on two of these variants, defined below using the same terminology adopted in prior work.
Let be a matroid defined over the set of points , and let be an integer, with . The Robust Matroid Center RMC problem on with parameter , requires to determine a set minimizing
We use the tuple to denote an instance of RMC. Let be a set of points. Suppose that for each a weight is given and let be an integer, with . The Robust Knapsack Center RKC problem on with parameter and weights , requires to determine a set with , minimizing
We use the tuple to denote an instance of RKC.
The RMC and RKC problems share the same cost function but exhibit different feasible solutions for the same ground set. Observe that, with $z$ outliers allowed, the cost of a solution coincides with the $(|S|-z)$-th smallest distance of a point of $S$ from the solution. In other words, the best solution is allowed to ignore the contribution of the $z$ most distant points, which can be regarded as outliers.
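The observation above translates directly into a way of evaluating the robust cost of a candidate set of centers. The snippet below is a minimal Python sketch under stated assumptions: the function name `robust_cost`, the parameter name `z` for the number of outliers, and the caller-supplied distance function `dist` are illustrative choices, and feasibility of the centers (matroid or knapsack constraint) is not checked here.

```python
def robust_cost(points, centers, z, dist):
    """Cost of `centers` when the z farthest points are disregarded.

    Sort the distances of all points from their closest center and return
    the (n - z)-th smallest one, where n is the number of points.
    """
    closest = sorted(min(dist(p, c) for c in centers) for p in points)
    kept = len(closest) - z          # number of points that must be covered
    return closest[kept - 1] if kept > 0 else 0.0
```

With `z = 0` the function returns the plain $k$-center objective, that is, the largest distance of a point from its closest center.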
The state of the art on sequential approximation algorithms for the two problems are the 3-approximation algorithms for the RMC and RKC problems presented in . The coreset-based approaches developed in this paper require the solution of generalized versions of the above two problems, where each point comes with a positive integer multiplicity . Let . The generalized versions of the two problems, dubbed RMC problem with Multiplicities (RMCM problem) and RKC problem with Multiplicities (RKCM problem), respectively, allow to vary in and modify the cost function as follows:
Letting , we use the tuples and to denote instances of RMCM and RKCM, respectively. To the best of our knowledge, prior to this work, no algorithms had been devised to solve the RMCM and RKCM problems. However, in the rest of the subsection we describe how the sequential algorithms in  can be easily adapted to solve the more general RMCM and RKCM problems, featuring the same 3-approximation guarantee as in the case without multiplicities.
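To make the role of multiplicities concrete, the following Python sketch evaluates a cost function with multiplicities under one natural reading of the generalized definition, which is an assumption on our part: the cost is the smallest radius such that the points within that radius from the centers have aggregate multiplicity at least the total multiplicity minus the number of allowed outliers. The names `robust_cost_with_multiplicities`, `mult`, `z` and `dist` are illustrative.

```python
def robust_cost_with_multiplicities(points, mult, centers, z, dist):
    """Smallest radius covering total multiplicity at least m(points) - z.

    `mult` maps each point to a positive integer multiplicity; up to z
    units of multiplicity may be left uncovered (treated as outliers).
    """
    needed = sum(mult[p] for p in points) - z
    if needed <= 0:
        return 0.0
    # Pair each point's distance from its closest center with its multiplicity.
    by_distance = sorted(
        (min(dist(p, c) for c in centers), mult[p]) for p in points
    )
    covered = 0
    for radius, m in by_distance:
        covered += m
        if covered >= needed:
            return radius
    return by_distance[-1][0]  # not reached: the loop covers everything
```

When every multiplicity equals 1, the value returned coincides with the cost without multiplicities, namely the distance of rank "number of points minus `z`" in sorted order.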
We start by giving the definition of the Robust $\mathcal{F}$-Supplier problem with Multiplicities, which generalizes the Robust $\mathcal{F}$-Supplier problem of , and recall the definition of the auxiliary $\mathcal{F}$-maximization under Partition Constraint ($\mathcal{F}$-PCM) problem .
An instance of the Robust $\mathcal{F}$-Supplier problem with Multiplicities is a tuple where is a metric space, is an integer parameter, $\mathcal{F}$ is a down-closed family of subsets of , and is a function that associates to each point of its multiplicity . The objective is to find and for which and is minimized. An instance of $\mathcal{F}$-PCM is a tuple , where is a finite set, $\mathcal{F}$ is a down-closed family of subsets of , is a sub-partition of , and is an integer-valued function consistent with in the sense that for each and for each pair , . For a set , we let . The objective of the $\mathcal{F}$-PCM problem is to compute
The following theorem extends [7, Theorem 1] to encompass multiplicities.
Let be an algorithm for the -PCM problem, and let denote its complexity. Given an instance of the Robust -Supplier problem with Multiplicities, consider the instance of -PCM. Then, there is an algorithm for the Robust -Supplier problem with Multiplicities which returns a -approximate solution to in time .
The proof of this theorem follows the same reasoning as [7, Theorem 1]; hence, we describe here only the differences with respect to that proof.
In Algorithm 1, described in [7, Section 3.1], we substitute Line 10 with the following line:
Next, we substitute the polytope defined at the beginning of [7, Section 3.2] with the one described by the constraints below. (Note that only the first constraint differs from the original ones.)
The remaining part of the proof follows exactly the same steps as the original proof. However, we need the following modified version of [7, Claim 7], whose proof requires only straightforward adaptations to accommodate multiplicities.
[Modified Claim 7 of ] Let be any feasible solution of the -PCM instance constructed by Algorithm 1. Then,
Since the RMCM and RKCM problems can be seen as instantiations of the Robust $\mathcal{F}$-Supplier problem with Multiplicities, Theorem 2.2, combined with the $\mathcal{F}$-PCM algorithm from , allows us to derive the result stated in the following theorem. There exist 3-approximate polynomial-time sequential algorithms for the RMCM and RKCM problems.
2.3 Doubling dimension
The algorithms in this paper will be analyzed in terms of the dimensionality of the ground set $S$ as captured by the well-established notion of doubling dimension. Formally, given a point $x \in S$, let the ball of radius $r$ centered at $x$ be the subset of points of $S$ at distance at most $r$ from $x$. The doubling dimension of $S$ is the smallest value $D$ such that any ball of radius $r$ centered at a point $x \in S$ is contained in the union of at most $2^D$ balls of radius $r/2$ suitably centered at points of $S$. The algorithms that will be presented in this paper adapt automatically to the doubling dimension of the input dataset and attain their best performance when $D$ is small, possibly constant. This is the case, for instance, of ground sets whose points belong to low-dimensional Euclidean spaces, or represent nodes of mildly-expanding network topologies under shortest-path distances.
The doubling dimension of a ground set allows the following interesting characterization of how the radius of a $k$-center clustering decreases as $k$ increases, which will be crucially exploited in this paper.
Let . Consider a set of size , and let . If has doubling dimension , there exists a set of size such that .
By repeatedly applying the definition of doubling dimension, it is easily seen that each ball of radius around a point in can be covered with at most smaller balls of radius . The centers of all of these smaller balls provide the desired set . ∎
3 Coreset-based strategy for the RMC problem
In this section, we present a two-phase strategy to solve the RMC problem based on the following simple high-level idea. In the first phase we extract a small coreset from the ground set , that is, a subset of with the property that each point has a suitably “close” proxy in . In the second phase, an approximate solution to the RMCM problem on is computed, where the multiplicity of each is defined as the number of distinct points whose proxy is . In what follows, we first determine sufficient conditions on the coreset which guarantee that a good solution to the RMCM problem on is also a good solution for the RMC problem on , and then describe how such a coreset can be constructed, analyzing its size in terms of the doubling dimension of .
Let be an instance of the RMC problem and be the cost of its optimal solution. Consider a coreset with proxy function , and let , for every . Let denote the restriction of matroid to the coreset , where for each , . Finally, let denote the RMCM instance defined by , and . We have:
Let be a design parameter. Suppose that the coreset with proxy function satisfies the following conditions:
For each , ;
For each independent set there exists an injective mapping such that:
is an independent set;
for each , .
There exists a solution to of cost at most ;
Every solution to of cost is also a solution to of cost .
Let us first show P1. Let be the optimal solution to the RMC instance and let . We will show that is a solution to of cost at most . By C2, and is an independent set in . Consider now a point such that with and observe that there are at least such points (e.g., all nonoutliers). We have that
Let and observe that . We have that
which concludes the proof of P1. In order to prove P2, let be a solution to of cost . Clearly, is an independent set in . Consider a generic point such that and let be the point of closest to . Observe that the points with are such that . Since , there are at least points of that are within a distance from . ∎
Later in this section (see Theorem 3) we will show that if coreset exhibits properties P1 and P2 stated in the above lemma, then a good solution to the RMC instance can be obtained by running an approximation algorithm for RMCM on . We now show how to construct a coreset satisfying Conditions C1 and C2 of Lemma 3 (hence, exhibiting properties P1 and P2 by virtue of the lemma). The construction strategy is simple and, as will be discussed in Section 5, also features efficient MapReduce and Streaming implementations. As in previous works, we assume that constant-time oracles are available to compute the distance between two elements of and to check the independence of a subset of (see e.g., ). Let be the rank of matroid . We make the reasonable assumption that is known to the algorithm. Also, for ease of presentation, we restrict the attention to matroids such that for every . This restriction can be easily removed with simple modifications to the algorithms.
In order to construct the coreset , we first compute a -approximate solution to -center on and determine . In the sequential setting, Gonzalez’s algorithm provides a 2-approximation and computes and in time. (In the Streaming setting, Gonzalez’s algorithm cannot be used, and a slightly larger value of will be needed.) Then, we compute a set of points of such that , for every , and , for every . Hence, . Clearly, the value will depend on , , and . can be computed in time by adapting the well-known greedy strategy by Hochbaum and Shmoys, namely, by performing a linear scan of and adding to (an initially empty) all those points at distance greater than from the current . (Observe that is the size of the final set but the construction does not require the knowledge of .) Let and, for , define the cluster (ties broken arbitrarily for points equidistant from two or more points of ). From each we extract a maximum independent set and define . For every and every point we set the proxy , where (ties broken arbitrarily). For each , its multiplicity is set to . We have:
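The construction just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the exact procedure analyzed in the paper: the function names, the caller-supplied threshold `delta` (which the text derives from the radius returned by Gonzalez's algorithm and the accuracy parameter), the independence oracle `independent`, and the choice of the nearest point of the cluster's independent set as proxy are all assumptions. Every singleton is assumed to be independent, as in the restriction stated above.

```python
from collections import Counter

def gonzalez(points, k, dist):
    """Farthest-first traversal: returns up to k centers and their radius."""
    centers = [points[0]]
    closest = [dist(p, points[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=closest.__getitem__)
        if closest[far] == 0:        # every point already coincides with a center
            break
        centers.append(points[far])
        closest = [min(c, dist(p, points[far])) for c, p in zip(closest, points)]
    return centers, max(closest)

def build_coreset(points, delta, dist, independent):
    """Coreset with proxies and multiplicities (sketch).

    1. Linear scan: a point becomes a pivot if it is farther than delta
       from all current pivots.
    2. Each point joins the cluster of its closest pivot.
    3. A greedy scan extracts an independent set of maximum size per cluster.
    4. Each point is mapped to its nearest coreset point in its own cluster.
    """
    pivots = []
    for p in points:
        if all(dist(p, q) > delta for q in pivots):
            pivots.append(p)
    clusters = [[] for _ in pivots]
    for p in points:
        i = min(range(len(pivots)), key=lambda j: dist(p, pivots[j]))
        clusters[i].append(p)
    coreset, proxy, mult = [], {}, Counter()
    for members in clusters:
        basis = []
        for p in members:
            if independent(basis + [p]):
                basis.append(p)
        coreset.extend(basis)
        for p in members:
            q = min(basis, key=lambda b: dist(p, b))
            proxy[p] = q
            mult[q] += 1             # multiplicity = number of points with proxy q
    return coreset, proxy, mult
```

Note that the scan never needs to know the final number of pivots in advance, in line with the remark above, and that the multiplicities sum to the number of input points by construction.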
The coreset constructed by the above algorithm satisfies Conditions C1 and C2 of Lemma 3.
First, we prove C1. Consider an arbitrary point and suppose that belongs to cluster , for some , hence belongs to and . Let be the cost of the optimal solution to the -center problem on . Since any solution to the instance of RMC, augmented with the outlier points, is a solution to -center on , it is easy to see that . Now, by using the fact that is a -approximate solution to -center on , we have
thus proving C1. As for C2, we reason as follows. Consider an arbitrary independent set . We now show that there exists an injective mapping which transforms into an independent set contained in , and such that, for each and , (i.e., and belong to the same cluster). This will immediately imply that . Let . We define the mapping incrementally, one element at a time. Suppose that we have fixed the mapping for the first elements of and assume, inductively, that is an independent set of size and that and belong to the same cluster, for . Consider now and suppose that , for some . We distinguish between the following two cases:
Case 1. If , we set , hence .
Case 2. If , we apply the extended augmentation property stated in Lemma 2.1 with , , , and to conclude that there exists a point such that is an independent set.
After iterations of the above inductive argument, we have that the mapping is completely specified and exhibits the following properties: it is injective, is independent, and, for , if then also , hence . This proves C2. ∎
The size of coreset can be conveniently bounded as a function of the doubling dimension of the ground set . If has doubling dimension , then the coreset obtained with the above construction has size .
Observe that , hence we need to bound . Consider the first set of centers computed by the coreset construction algorithm. Proposition 2.3 implies that there exists a set of at most points such that , hence can be covered with balls of radius at most . It is easy to see that in the adaptation of Hochbaum and Shmoys’ strategy  described above to construct , only one point from each such ball can be added to . Hence, , and the theorem follows. ∎
Let and . Suppose that the coreset exhibits Properties P1 and P2 of Lemma 3, for . Then, an -approximate solution to instance of RMCM is a -approximate solution to instance of RMC.
4 Coreset-based strategy for the RKC problem
In this section we present a coreset-based strategy for the RKC problem which is similar in spirit to the one presented in the previous section for the RMC problem. Consider an instance of RKC, and let denote the cost of an optimal solution. The idea is to extract a coreset from a -clustering of by picking one point per cluster so that contains a good solution for , and then to run an approximation algorithm for the RKCM problem on , using, for each , the size of its cluster as multiplicity . The cost penalty introduced by seeking the solution on rather than on the entire set will be limited by ensuring that for each , the distance is sufficiently small. The main difficulty with the above strategy is the choice of a suitable clustering granularity , hence we resort to testing geometrically increasing guesses for . Observe that in this fashion we generate a sequence of coresets, thus a sequence of RKCM instances upon which the approximation algorithm has to be run. A challenge of this approach is to devise a suitable stopping condition for detecting a good guess.
More specifically, our coreset-based strategy, dubbed RKnapCenter, works as follows. Let be an -approximation algorithm for the RKCM problem, and let be a fixed accuracy parameter. For each value in a geometric progression, we run a procedure dubbed CoresetComputeAndTest, which first computes a partition of into clusters , induced by a solution to -center on , sets coreset to contain one point of minimum weight from each cluster, and finally runs on the RKCM instance , where is the restriction of to , and , with being the size of the cluster that belongs to. CoresetComputeAndTest returns , the solution computed by , and , where . If , then the algorithm terminates and returns as final solution. (See Algorithm 1 for the pseudocode.)
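The coreset extraction step inside CoresetComputeAndTest can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `knapsack_coreset` and its parameters are hypothetical, the centers are assumed to come from any approximation algorithm for the center problem, and neither the run of the RKCM approximation algorithm nor the stopping condition of RKnapCenter is reproduced here.

```python
def knapsack_coreset(points, centers, weight, dist):
    """One minimum-weight representative per cluster, with multiplicities.

    Each point joins the cluster of its closest center; the coreset keeps
    a point of minimum weight from each nonempty cluster, and the
    multiplicity of that point is the size of its cluster.
    """
    clusters = {}
    for p in points:
        i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        clusters.setdefault(i, []).append(p)
    coreset, mult = [], {}
    for members in clusters.values():
        rep = min(members, key=lambda p: weight[p])
        coreset.append(rep)
        mult[rep] = len(members)
    return coreset, mult
```

Choosing a minimum-weight representative means that replacing a point of a feasible solution with the representative of its cluster never increases the total weight, which is the property used in the analysis of the strategy.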
For any , consider the triplet returned by one execution of CoresetComputeAndTest within RKnapCenter. Then:
is a solution to the RKC instance of cost at most .
Let us prove Point 1 first. Since is a feasible solution to and is the restriction of to the points of , is also feasible for . By definition of , there exists a subset such that and . Consider a point and suppose that . Then, by the triangle inequality, . Thus, the points in at distance at most from are at least .
As for Point 2, let be an optimal solution to and let . We now show that is a feasible solution to of cost at most , hence must have a cost of at most . Observe that since contains the points of minimum weight from each cluster, . Consider a point such that . Clearly there are at least such points (e.g., all nonoutliers). Let . Hence, . Consider a cluster with and the point . Since , contains a point with . Let be the point of closest to and suppose that belongs to cluster . Letting be the point in , by the triangle inequality we have . This immediately implies that . ∎
The following two theorems bound, respectively, the maximum value of set by the do-while loop in RKnapCenter (hence, the size of the coreset from which the final solution is extracted), and the approximation ratio featured by the algorithm.
Assume that a -approximation algorithm for -center is used in Line 8 of RKnapCenter and let be the value of at which the algorithm stops. If has doubling dimension , then , where is the minimum cardinality of an optimal solution to the RKC instance and .
Let be the optimal solution to the RKC instance of minimum cardinality , of cost . By reasoning as in the proof of Lemma 3, we conclude that the points of together with the at most outliers form a solution to -center on of cost at most . Hence, letting denote the cost of an optimal solution to -center on , we have . Proposition 2.3 implies that for every , the cost of the optimal solution to -center on is . Let be the smallest value of tested by the algorithm such that (hence, ) and let be the triplet returned by CoresetComputeAndTest . Observe that . We now show that and satisfy the stopping condition, thus proving the theorem. By Point 1 of Lemma 4, is a feasible solution to the RKC instance of cost at most , hence, combining this fact with the previous observations, we have . This implies that . By substituting and applying trivial algebra, we obtain , which proves that the stopping condition is met. ∎
Let and let be the approximation factor of the sequential algorithm