Clustering is one of the most basic unsupervised learning tools developed to infer informative patterns in data. Given a set of data points in Euclidean space, the goal in clustering problems is to find a smaller number of points, namely cluster centers, that form a good representation of the entire dataset. The quality of the clusters is usually measured using a cost function, such as the sum of the distances of the individual points to their closest cluster center.
With the constantly growing size of datasets in various domains, centralized clustering algorithms are no longer desirable and/or feasible, which highlights the need to design efficient distributed algorithms for the clustering task. In a distributed setting, we assume that the collection of data points is too large to fit in the memory of a single server. Therefore, we employ a setup with multiple compute nodes and one coordinator server. In this setup, the most natural approach is to partition the data points into subsets and assign each of these parts to a different compute node. These nodes then perform local computation on the data points assigned to them and communicate their results to the coordinator. The coordinator then combines all the local computations received from the compute nodes and outputs the final clustering. Here, we note that the overall computation of clustering may potentially involve multiple rounds of communication between the compute nodes and the coordinator. This natural approach for distributed clustering has received significant attention from the research community. Interestingly, it is possible to obtain a clustering in the distributed setup whose cost is bounded by a constant multiple of the cost of the clustering achievable by a centralized algorithm (see [1, 2, 3] and references therein).
In this paper, we aim to address the issue of stragglers that arises in the context of large scale distributed computing systems, especially the ones running on the so-called cloud. The stragglers correspond to those compute nodes that take significantly more time than expected (or fail) to complete and deliver the computation assigned to them. Various system-related issues lead to this behavior, including unexpected background tasks such as software updates being performed on the compute nodes, power outages, and congested communication networks in the computing setup. Some simple approaches used to handle stragglers include ignoring them and relying on asynchronous methods. The loss of information arising due to the straggler nodes can be traded for efficiency for specific tasks such as computing distributed gradients [4, 5, 6]. However, with the existing methods for unsupervised learning tasks such as clustering or dimensionality reduction, ignoring the stragglers can lead to extremely poor quality solutions.
Alternatively, one can distribute data to the compute nodes in a redundant manner such that the information obtained from the non-straggler nodes is sufficient to compute the desired function on the entire dataset. Following this approach, multiple coding based solutions (mainly focusing on the linear computation and first-order methods for optimization) have been recently proposed (e.g., see [4, 5, 7, 6, 8, 9]).
This paper focuses on the relatively unexplored area of designing straggler-resilient unsupervised learning methods for distributed computing setups. We study a general class of distributed clustering problems in the presence of stragglers. In particular, we consider the k-median clustering problem (cf. Section 3.2) and the subspace clustering problem (cf. Section 3.3). Note that the subspace clustering problem covers both the k-means and the principal component analysis (PCA) problems as special cases. The proposed straggler-resilient clustering methods in this paper rely on a redundant data distribution scheme that allows us to compute provably good-quality cluster centers even in the presence of a large number of straggling nodes (cf. Sections 3.1 and 3.4). In Section 4, we empirically evaluate our proposed solution for k-median clustering and demonstrate its utility.
In this section, we first provide the necessary background on the clustering problems studied in the paper. We then formally state the main objective of this work.
2.1 Distributed clustering
Given a dataset of points distributed among the compute nodes, the goal in distributed clustering problems is to find a set of k cluster centers that closely represent the entire dataset. The quality of these centers (and the associated clusters) is usually measured by a cost function. Two of the most prevalent cost functions for clustering are the k-median cost, which is the sum over all data points of the distance from a point to its closest center, and the k-means cost, which is the corresponding sum of squared distances.
Here, the distance between two points is the Euclidean distance, and the cluster associated with a center is the set of points for which it is the closest center. For any α ≥ 1, a set of k cluster centers is called an α-approximate solution to the clustering problem if its clustering cost is at most α times the optimal (minimum) clustering cost with k centers.
In certain applications, the dataset is weighted with an associated non-negative weight function. The k-means cost for such a weighted dataset is defined as the sum of squared distances from each point to its closest center, with each term multiplied by the weight of the point. The k-median cost for a weighted dataset is analogously defined.
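As a concrete illustration of these cost functions, the following minimal sketch computes the (optionally weighted) k-median and k-means costs with NumPy. The function name `clustering_cost` and its `power` parameter are illustrative assumptions rather than notation from the paper.

```python
import numpy as np

def clustering_cost(points, centers, weights=None, power=1):
    """Weighted sum of (distance to the nearest center) ** power.

    power=1 gives the k-median cost, power=2 the k-means cost.
    The function name and signature are hypothetical.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    # Pairwise Euclidean distances, shape (num_points, num_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    return float(np.sum(weights * nearest ** power))

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
median_cost = clustering_cost(points, centers, power=1)
means_cost = clustering_cost(points, centers, power=2)
weighted_median_cost = clustering_cost(points, centers, weights=np.array([2.0, 1.0, 1.0]))
```

In this toy instance the first two points are each at distance 1 from the first center and the third point coincides with the second center, so doubling the weight of the first point adds exactly one unit to the k-median cost.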
We also consider a general class of squared-error fitting problems known as the subspace clustering problem.
Definition 1 (subspace clustering).
Given a dataset, find a set of k subspaces (linear or affine), each of a prescribed dimension, that minimizes the sum, over all data points, of the squared distance from a point to its closest subspace.
Note that when the subspaces have dimension zero (i.e., they are points), this is exactly the k-means problem described above. Another special case, in which a single linear subspace is sought, is known as principal component analysis (PCA). If we consider the matrix with the data points as its rows, it is well known that the desired subspace is spanned by the top right singular vectors of this matrix.
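The PCA special case can be sketched directly with a singular value decomposition. The snippet below is a minimal illustration, assuming a data matrix with points as rows; the helper names `top_right_singular_subspace` and `projection_cost` and the parameter `dim` are hypothetical.

```python
import numpy as np

def top_right_singular_subspace(data, dim):
    """Return an orthonormal basis (as rows) of the best-fit linear subspace."""
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[:dim]

def projection_cost(data, basis):
    """Sum of squared distances from the rows of data to span(basis)."""
    residual = data - data @ basis.T @ basis
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
# Points that lie close to the x-axis in the plane.
data = np.column_stack([5.0 * rng.normal(size=50), 0.1 * rng.normal(size=50)])
basis = top_right_singular_subspace(data, dim=1)
best_cost = projection_cost(data, basis)
x_axis_cost = projection_cost(data, np.array([[1.0, 0.0]]))
y_axis_cost = projection_cost(data, np.array([[0.0, 1.0]]))
```

By the optimality of the truncated SVD, `best_cost` is no larger than the cost of any other one-dimensional linear subspace, including the two coordinate axes used for comparison.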
2.2 Coresets and clustering
In a distributed computing setup, where each compute node stores a partial dataset, one way to perform distributed clustering is to have each node communicate a summary of its local data to the coordinator. An approximate solution to the clustering problem can then be computed from the combined summary received from all the compute nodes. This summary, called a coreset, is essentially a weighted set of points that approximately represents the original set of points.
Definition 2 (-coreset).
For , an -coreset for a dataset with respect to a cost function is a weighted dataset with an associated weight function such that, for any set of centers , we have
The next result shows the utility of a coreset for clustering.
Theorem 1 ( ).
Let be an -coreset for a dataset with respect to the cost function . Any -approximate solution to the clustering problem on input , is an -approximate solution to the clustering problem on .
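To make the coreset notion concrete, here is a minimal sketch that assumes the usual two-sided guarantee: for every set of centers, the weighted cost of the summary lies within a factor of (1 ± ε) of the cost of the full dataset. The simplest instance is collapsing duplicate points into a single weighted point, which is exact. The helper names are hypothetical, and the check only samples a finite number of center sets, so it is a sanity test rather than a proof.

```python
import numpy as np

def weighted_cost(points, weights, centers, power=2):
    """Weighted sum of (distance to the nearest center) ** power."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(weights * d.min(axis=1) ** power))

def satisfies_coreset_bound(points, summary, summary_weights, centers, eps, power=2):
    """Check the assumed (1 +/- eps) coreset inequality for one set of centers."""
    full = weighted_cost(points, np.ones(len(points)), centers, power)
    approx = weighted_cost(summary, summary_weights, centers, power)
    return (1 - eps) * full <= approx <= (1 + eps) * full

points = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                   [4.0, 0.0], [4.0, 0.0], [1.0, 3.0]])
# Collapse duplicates: distinct points weighted by their multiplicity.
summary, counts = np.unique(points, axis=0, return_counts=True)

rng = np.random.default_rng(0)
checks = [satisfies_coreset_bound(points, summary, counts,
                                  rng.normal(size=(2, 2)), eps=1e-9)
          for _ in range(100)]

# A poor summary: the mean of the data carrying the total weight.
bad_summary = points.mean(axis=0, keepdims=True)
bad_ok = satisfies_coreset_bound(points, bad_summary, np.array([6.0]),
                                 np.array([[0.0, 0.0], [4.0, 0.0]]), eps=0.1)
```

The duplicate-collapsing summary passes every sampled check, while the single mean point already violates the bound for one natural choice of centers.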
2.3 Straggler-resilient distributed clustering
The main objective of this paper is to design distributed clustering methods that are robust to the presence of straggling nodes. Since the straggling nodes are unable to communicate information about their local data, the distributed clustering method may miss valuable structure in the dataset as a result of this information loss. This can potentially lead to clustering solutions of poor quality (as verified in Section 4).
Given the prevalence of the stragglers in modern distributed computing systems, it is natural to desire clustering methods that generate provably good clustering solutions despite the presence of stragglers. Let be the cost of the best clustering solution for the underlying dataset. In this paper, we explore the following question: Given a dataset and distributed computing setup with compute nodes where at most nodes may behave as stragglers, can we design a clustering method that generates a solution with the cost at most , for a small approximation factor ?
In this paper, we affirmatively answer this question for k-median clustering and subspace clustering. Our proposed solutions add to the growing literature on straggler mitigation via coded computation, which has so far focused primarily on supervised learning tasks.
3 Main Results
We propose to systematically modify the initial data assignment to the compute nodes in order to mitigate the effect of stragglers. In particular, we employ redundancy in the assignment process and map every vector in the dataset to multiple compute nodes. This way, each vector affects the local computation performed at multiple compute nodes, which allows us to obtain final clusters at the coordinator server by taking into account the contribution of most of the vectors in the dataset even when some of the compute nodes behave as stragglers.
We first introduce assignment schemes with the straggler-resilience property. This property enables us to combine local computations from non-straggling compute nodes at the coordinator while preserving most of the relevant information present in the dataset for the underlying clustering task. Subsequently, we utilize such an assignment scheme to obtain good-quality solutions to the k-median and the subspace clustering problems in Section 3.2 and Section 3.3, respectively. Finally, in Section 3.4, we propose a randomized construction of an assignment scheme with the desired straggler-resilience property.
3.1 Straggler-resilient data assignment
Let the compute nodes in the system be indexed by a set, and let each data point be assigned to a subset of the compute nodes. We can alternatively represent the overall data assignment by a binary assignment matrix whose columns and rows are associated with distinct data points and distinct compute nodes, respectively. In particular, the column corresponding to a data point is the indicator vector of the set of compute nodes to which that point is assigned. For each compute node, we refer to the set of data points allocated to it as its local dataset.
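The assignment matrix described above can be built directly from the per-point node sets. The following minimal sketch uses hypothetical helper names (`assignment_matrix`, `local_indices`) and a toy assignment chosen only for illustration.

```python
import numpy as np

def assignment_matrix(node_sets, num_nodes):
    """node_sets[j] is the set of node indices that store data point j."""
    A = np.zeros((num_nodes, len(node_sets)), dtype=int)
    for j, nodes in enumerate(node_sets):
        A[list(nodes), j] = 1   # column j is the indicator vector of nodes
    return A

def local_indices(A, i):
    """Indices of the data points allocated to compute node i."""
    return np.flatnonzero(A[i])

# Toy example: 4 data points, 3 compute nodes, each point stored twice.
node_sets = [{0, 1}, {1, 2}, {0, 2}, {0, 1}]
A = assignment_matrix(node_sets, num_nodes=3)
node0_points = local_indices(A, 0)
```

Rows of `A` correspond to compute nodes and columns to data points, so row sums give the load per node and column sums give the replication factor of each point.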
Consider the set of non-straggling compute nodes. We assume that the number of stragglers in the system does not exceed a known upper bound, so that all but at most this many nodes are non-stragglers. We also consider the submatrix of the assignment matrix that contains only the rows corresponding to the non-straggling compute nodes. For any such set of non-stragglers, we require that the assignment matrix satisfies the following property.
Property 1 (Straggler-resilience property).
Let be a given constant. For every with , there exists a recovery vector, such that for some ,
The straggler-resilience property is closely related to the gradient coding introduced in [5]. However, two key points distinguish our work from the gradient coding work. First, the recovery vector is restricted to have only non-negative coordinates. Second, and more importantly, the utilization of the redundant data assignment in this work (cf. Lemma 3) differs from that of gradient coding in [5], where gradient coding is used to recover the full gradient.
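Because the displayed condition of Property 1 is not reproduced here, the following minimal sketch rests on an assumed reading: a non-negative recovery vector b must combine the rows of the non-straggler submatrix so that every data point receives a total weight between 1 and 1 + δ. The helper `uniform_recovery` and the choice of a uniform recovery vector are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def uniform_recovery(A, non_stragglers):
    """Try a uniform non-negative recovery vector for the given non-stragglers.

    Assumed reading of Property 1: b >= 0 and 1 <= (b^T A_R)_j <= 1 + delta
    for every data point j. Returns (b, delta), or (None, None) when some
    point is stored only on straggling nodes.
    """
    A_R = A[sorted(non_stragglers)]
    col_weight = A_R.sum(axis=0)          # surviving copies of each point
    if col_weight.min() == 0:
        return None, None
    b = np.full(A_R.shape[0], 1.0 / col_weight.min())
    delta = col_weight.max() / col_weight.min() - 1.0
    return b, delta

# 4 nodes, 4 points; point j is stored on nodes j and j + 1 (mod 4).
A = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
b, delta = uniform_recovery(A, non_stragglers={0, 1, 2})   # node 3 straggles
recovered = b @ A[[0, 1, 2]]
lost = uniform_recovery(A, non_stragglers={2, 3})           # nodes 0 and 1 straggle
```

With one straggler, every point still survives on at least one node and the recovered weights lie in [1, 2]. With two adjacent stragglers, the point stored only on those two nodes is lost, so no recovery vector of any kind can exist.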
The following result, which is based on the combinatorial characterization for the assignment scheme enforced by Property 1, enables us to combine the information received from non-stragglers to generate close to optimal clustering solutions.
Let be a dataset distributed across compute nodes using an assignment matrix that satisfies Property 1. Let denote the set of non-straggler nodes. For any , let be the recovery vector corresponding to . Then, for any set of centers , and any weight function ,
Equipped with Lemma 3, we are now in the position to describe our solutions for the straggler-resilient clustering.
3.2 Straggler-resilient distributed k-median
We distribute the dataset among the compute nodes using an assignment matrix that satisfies Property 1. Each compute node sends a set of (weighted) k-median centers of its local dataset; combined at the coordinator, these centers give a summary of the entire dataset. Thus, weighted k-median clustering on this summary at the coordinator gives a good-quality clustering solution for the entire dataset. Algorithm 1 provides a detailed description of this approach.
Before assessing the quality of on the entire dataset , we show that for any set of centers , the cost incurred by the weighted dataset is close to the cost incurred by .
For any set of -centers
The following result quantifies the quality of the clustering solution returned by Algorithm 1 on the entire dataset .
Let be the optimal set of -median centers for the dataset . Then,
In Algorithm 1, each compute node sends a clustering solution for its local data, from which the coordinator is able to construct a good summary of the entire dataset despite the presence of stragglers. This summary is sufficient to generate a good-quality k-median clustering solution on the entire dataset. In Section 3.3, we show that if each compute node sends more information in the form of a coreset of its local data, the accumulated information at the coordinator is sufficient to solve the more general problem of subspace clustering in a straggler-resilient manner.
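The pipeline described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the local solver is a simple stand-in heuristic (farthest-first initialization followed by medoid updates), each local center is weighted by its cluster size times the recovery coefficient of its node, and all function names are hypothetical.

```python
import numpy as np

def weighted_k_medoids(points, weights, k, iters=10):
    """Stand-in weighted k-median heuristic; centers are restricted to input points."""
    centers = [points[0]]                       # farthest-first initialization
    for _ in range(1, k):
        d = np.linalg.norm(points[:, None] - np.array(centers)[None], axis=2).min(axis=1)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size:
                # Member minimizing the weighted sum of distances to its cluster.
                pair = np.linalg.norm(points[idx][:, None] - points[idx][None], axis=2)
                centers[c] = points[idx[np.argmin(pair @ weights[idx])]]
    labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
    return centers, labels

def straggler_resilient_k_median(points, A, non_stragglers, b, k):
    summary, summary_w = [], []
    for i, b_i in zip(non_stragglers, b):
        local = points[np.flatnonzero(A[i])]
        centers, labels = weighted_k_medoids(local, np.ones(len(local)), k)
        summary.append(centers)
        # Assumed weighting: local cluster size scaled by the recovery coefficient.
        summary_w.append(b_i * np.bincount(labels, minlength=k))
    final, _ = weighted_k_medoids(np.vstack(summary), np.concatenate(summary_w), k)
    return final

rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.vstack([c + rng.uniform(-1, 1, size=(30, 2)) for c in blob_centers])
s = 6
A = np.zeros((s, len(points)), dtype=int)
for j in range(len(points)):
    for shift in range(3):                      # each point is replicated on 3 nodes
        A[(j + shift) % s, j] = 1
non_stragglers = [2, 3, 4, 5]                   # nodes 0 and 1 straggle
b = np.ones(len(non_stragglers))                # every point survives on at least one node
final_centers = straggler_resilient_k_median(points, A, non_stragglers, b, k=2)
```

In this toy run, every point keeps between one and three surviving copies, so the unit recovery vector gives each point a total weight in [1, 3]. Because the two groups of points are well separated, the coordinator ends up with one center inside each group even though two of the six nodes never report.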
3.3 Straggler-resilient distributed subspace clustering
In this subsection, we utilize our redundant data assignment with the straggler-resilience property to combine local coresets received from the non-straggling nodes into a global coreset for the entire dataset, which further enables us to perform distributed subspace clustering in a straggler-resilient manner. In particular, we propose two approaches to perform subspace clustering, which rely on coresets and on the relaxed coresets of [10, 12], respectively.
3.3.1 Distributed subspace clustering using coresets
Here, we propose a distributed subspace clustering algorithm that uses existing coreset constructions from the literature in a black-box manner. Each compute node sends a coreset of its partial data, which, when re-weighted according to Lemma 3, gives us a coreset for the entire dataset even in the presence of stragglers. Given this global coreset, we can then construct a solution to the underlying subspace clustering problem at the coordinator (cf. Theorem 1). The complete description of this approach is given in Algorithm 2.
Before we analyze the quality of , the solution returned by Algorithm 2, we present the following result that shows the utility of an assignment scheme with Property 1 to construct a global coreset for the entire dataset from the coresets of the partial datasets in a straggler-resilient manner.
Let be distributed according to with Property 1. Let be the recovery vector for the set of non-straggler nodes . For any , let be an -coreset for the local dataset with weight function with respect to the cost function . Then, with the weight function defined as for all is a -coreset for .
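The re-weighting step can be sketched as follows: each local coreset keeps its points, and its weights are multiplied by the recovery coefficient of the node that sent it. The sketch assumes the reading of Property 1 in which every point's recovered weight lies in [1, 1 + δ], and it uses exact local summaries (the local data with unit weights) so that the resulting bound can be checked directly. The names `combine_local_coresets` and `cost` are hypothetical.

```python
import numpy as np

def cost(points, weights, centers, power=2):
    """Weighted sum of (distance to the nearest center) ** power."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(weights * d.min(axis=1) ** power))

def combine_local_coresets(local_coresets, b):
    """local_coresets[i] = (points_i, weights_i) sent by the i-th non-straggler."""
    pts = np.vstack([S for S, _ in local_coresets])
    wts = np.concatenate([b_i * w for (_, w), b_i in zip(local_coresets, b)])
    return pts, wts

rng = np.random.default_rng(1)
P = rng.normal(size=(8, 2))
A = np.zeros((4, 8), dtype=int)
for j in range(8):                      # point j lives on nodes j and j + 1 (mod 4)
    A[j % 4, j] = 1
    A[(j + 1) % 4, j] = 1
non_stragglers = [0, 1, 2]              # node 3 straggles
b = np.ones(3)                          # recovered weight of each point is 1 or 2
delta = 1.0

local = [(P[np.flatnonzero(A[i])], np.ones(int(A[i].sum()))) for i in non_stragglers]
S, w = combine_local_coresets(local, b)
C = rng.normal(size=(2, 2))
full = cost(P, np.ones(len(P)), C)
combined = cost(S, w, C)
```

For any choice of centers, `combined` lies between `full` and `(1 + delta) * full` in this example, because every point contributes its own cost between one and two times. With approximate local coresets, the approximation error of the local summaries would enter the bound as well.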
Since each is a -coreset for , it follows from Lemma 3 that is a -coreset for . This allows us to quantify the quality of as a solution to the underlying problem of -subspace clustering on .
Let be the cost of the optimal -subspace clustering solution for . Then, we have that .
Let be the optimal -subspace clustering solution for , i.e., . Since is a -coreset, we have
where and follow from Definition 2; and follows from the fact that the coordinator performs an -approximate subspace clustering on . ∎
Coreset constructions for various clustering problems with squared error were considered in [11, 13, 14, 15, 16, 17, 18, 19, 20]. However, the size of such coresets depends on the ambient dimension, which makes them prohibitive for high-dimensional datasets. Interestingly, Feldman et al. [10] and Balcan et al. [12] show that one can construct smaller relaxed coresets (of size independent of both the number of points and the ambient dimension) for the PCA problem. Further, they use these relaxed coresets for approximate PCA to solve the class of subspace clustering problems. Below we show that, by utilizing a redundant assignment scheme with Property 1, the distributed approximate PCA and, hence, the subspace clustering algorithms of [10, 12] can be made straggler-resilient.
3.3.2 Straggler-resilient PCA using relaxed coresets
In this section, we use the same symbol to denote both the set of data points and the matrix with these points as its rows. We use the set and matrix notation interchangeably throughout this section.
The goal in PCA is to find the linear subspace of a given dimension that best fits the data. It is well known that the subspace spanned by the top right singular vectors of the data matrix gives an optimal solution to the PCA problem. The main question addressed by Feldman et al. [10] is to find an approximate solution to PCA in a distributed setting by constructing relaxed coresets for the local data. In Algorithm 3, we adapt the distributed PCA algorithm of [10] to obtain a straggler-resilient distributed PCA algorithm.
Lemma 4 ().
For any , let denote the rows of . Then, for any -dimensional linear subspace , there exists such that
Let . Then, for any -dimensional linear subspace ,
Let be the optimal solution to -PCA on . Then, .
Note that is independent of the choice of . Thus,
where and follow from Lemma 5; and follows as is the optimal solution to the -PCA problem on . ∎
3.4 Construction of assignment matrix
Finally, we present a randomized construction of the assignment matrix that satisfies Property 1. For the construction, we assume a random straggler model, where every compute node behaves as a straggler independently with a fixed probability. Therefore, we receive the local computation from each compute node with the complementary probability.
Consider the following random ensemble of assignment matrices such that for some (to be chosen later) the -th entry of the assignment matrix is defined as
We show that, for an appropriate choice of the parameters of this ensemble, the random matrix satisfies Property 1 with high probability.
The two parameters of importance when constructing an assignment matrix are the load per machine and the fraction of stragglers that can be tolerated. Increasing the redundancy makes the assignment matrix robust to more stragglers while at the same time, increases the computational load on individual compute nodes. For , our construction assigns data points to each compute node and is resilient to a constant fraction of random stragglers.
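The trade-off between load and straggler tolerance can be explored numerically. The sketch below assumes that each entry of the assignment matrix is drawn independently as a Bernoulli variable with a common probability `p`, and that nodes straggle independently; the exact ensemble and its parameters are not reproduced here, so `random_assignment`, `surviving_column_weights`, and all numeric values are illustrative assumptions.

```python
import numpy as np

def random_assignment(num_nodes, num_points, p, rng):
    """Each (node, point) entry is 1 independently with probability p (assumed ensemble)."""
    return (rng.random((num_nodes, num_points)) < p).astype(int)

def surviving_column_weights(A, straggle_prob, rng):
    """Drop each node independently with probability straggle_prob; count surviving copies."""
    alive = rng.random(A.shape[0]) >= straggle_prob
    return alive, A[alive].sum(axis=0)

rng = np.random.default_rng(0)
A = random_assignment(num_nodes=40, num_points=200, p=0.5, rng=rng)
alive, col_weight = surviving_column_weights(A, straggle_prob=0.2, rng=rng)
load_per_node = A.sum(axis=1)           # number of points stored on each node

# With a uniform recovery vector b_i = 1 / min(col_weight), the ratio
# max / min of the surviving column weights plays the role of 1 + delta.
delta = None if col_weight.min() == 0 else col_weight.max() / col_weight.min() - 1.0
```

Raising `p` increases `load_per_node` and tightens the spread of `col_weight`, which is the redundancy versus load trade-off discussed above. A `delta` of `None` signals that some point was stored only on straggling nodes.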
In this section, we demonstrate the performance of our straggler-resilient distributed k-median algorithm and compare it with a non-redundant data assignment scheme. We consider the synthetic Gaussian dataset [21] with two-dimensional points. The points are distributed among the compute nodes, some of which are randomly chosen to be stragglers. The results are presented in Figures 1(a)-1(d).
Figure 1(a) shows the ground truth k-median clustering, using the centroids provided in the dataset. Figure 1(b) shows the results obtained by ignoring the local computations from the straggler nodes. We use Algorithm 1 without any redundant data assignment. The data points are randomly partitioned among the compute nodes. Each non-straggler compute node sends its local k-median centers to the coordinator. The coordinator then runs a k-median algorithm on the accumulated centers obtained from the non-stragglers. As evident from the comparison between Figure 1(a) and Figure 1(b), such a scheme can output a set of poor-quality centers.
Figure 1(c) shows the result of Algorithm 1. The assignment matrix is chosen randomly (see Section 3.4 for details) with each . Such an assignment matrix assigns data points to each compute node in expectation, leading to a non-redundant data assignment. Figure 1(d) shows the effect of increasing to , and hence the redundancy in the data assignment step. Each compute node now gets about data points. Note that the results of Figure 1(d) are very close to the ground truth clustering shown in Figure 1(a).
5 Conclusion and Future directions
It is an interesting direction to explore the tradeoff between the communication cost and the approximation factor achieved by the clustering method. In the k-median algorithm described above, each compute node returns only a set of k centers to achieve its approximation factor. In contrast, in Algorithm 2, the compute nodes send coresets of their local data, which can then be combined to construct a global coreset for the entire dataset. While the quality of the obtained centers improves, the communication cost between the compute nodes and the coordinator increases, since a coreset would contain more than k points.
Another natural question that we are currently exploring is the use of data distribution techniques to design distributed algorithms that are robust to Byzantine adversaries.
References
[1] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 359–366, 2000.
[2] P. Awasthi, M. F. Balcan, and C. White. General and robust communication-efficient algorithms for distributed clustering. CoRR, abs/1703.00830, 2017.
[3] G. Malkomes, M. J. Kusner, W. Chen, K. Q. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1063–1071, 2015.
[4] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529, March 2018.
[5] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis. Gradient coding: Avoiding stragglers in distributed learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3368–3376, 2017.
[6] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In Proc. of the AISTATS, pages 803–812, 2018.
[7] C. Karakus, Y. Sun, S. Diggavi, and W. Yin. Straggler mitigation in distributed optimization through data encoding. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 5440–5448, 2017.
[8] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr. Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding. In Proceedings of 2018 IEEE International Symposium on Information Theory (ISIT), pages 2022–2026, June 2018.
[9] S. Dutta, V. Cadambe, and P. Grover. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 2100–2108, 2016.
[10] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1434–1453, 2013.
[11] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing, pages 291–300. ACM, 2004.
[12] Maria-Florina Balcan, Vandana Kanchanapally, Yingyu Liang, and David Woodruff. Improved distributed principal component analysis. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), pages 3113–3121, 2014.
[13] Sariel Har-Peled. No, coreset, no cry. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 324–335. Springer, 2004.
[14] Michael Edwards and Kasturi Varadarajan. No coreset, no cry: II. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 107–115. Springer, 2005.
[15] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the thirty-seventh annual ACM Symposium on Theory of Computing, pages 209–217. ACM, 2005.
[16] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.
[17] Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.
[18] Michael Langberg and Leonard J. Schulman. Universal ε-approximators for integrals. In Proceedings of the twenty-first annual ACM-SIAM Symposium on Discrete Algorithms, pages 598–607. SIAM, 2010.
[19] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM Symposium on Theory of Computing, pages 569–578. ACM, 2011.
[20] Kasturi Varadarajan and Xin Xiao. A near-linear algorithm for projective clustering integer points. In Proceedings of the twenty-third annual ACM-SIAM Symposium on Discrete Algorithms, pages 1329–1342. SIAM, 2012.
[21] P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–765, 2006.
Appendix A Missing proofs from Section 2
Proof of Theorem 1.
Let and be the optimal sets of -centers for the clustering problem with the dataset and the weighted dataset as inputs, respectively. Let be an -approximate solution for the clustering problem with the weighted dataset . Our goal is to show that
where and follow from the fact that is an -coreset for (cf. Definition 2). and follow from the fact that and correspond to the optimal and -approximate clustering for , respectively. Now, for small enough (in particular, ), (3) follows from (A). ∎
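For reference, one standard way to carry out an argument of this kind is sketched below. It assumes the two-sided coreset guarantee $(1-\epsilon)\,\mathrm{cost}(P,C) \le \mathrm{cost}(S,w,C) \le (1+\epsilon)\,\mathrm{cost}(P,C)$ for every set of centers $C$. The symbols $P$, $S$, $w$, $C$, $C_P^{*}$, $C_S^{*}$, $\alpha$, and $\epsilon$, as well as the step labels and the final constant, are assumed notation and need not match the original display.

```latex
\begin{align*}
\mathrm{cost}(P, C)
  &\overset{(i)}{\le} \frac{1}{1-\epsilon}\,\mathrm{cost}(S, w, C)
   && \text{(coreset lower bound)} \\
  &\overset{(ii)}{\le} \frac{\alpha}{1-\epsilon}\,\mathrm{cost}(S, w, C_S^{*})
   && \text{($C$ is $\alpha$-approximate on $S$)} \\
  &\overset{(iii)}{\le} \frac{\alpha}{1-\epsilon}\,\mathrm{cost}(S, w, C_P^{*})
   && \text{($C_S^{*}$ is optimal on $S$)} \\
  &\overset{(iv)}{\le} \frac{\alpha\,(1+\epsilon)}{1-\epsilon}\,\mathrm{cost}(P, C_P^{*})
   && \text{(coreset upper bound)} \\
  &\le (1+3\epsilon)\,\alpha\,\mathrm{cost}(P, C_P^{*})
   && \text{for } 0 < \epsilon \le \tfrac{1}{3}.
\end{align*}
```

The last step uses $(1+\epsilon) \le (1+3\epsilon)(1-\epsilon) = 1 + 2\epsilon - 3\epsilon^{2}$, which is equivalent to $3\epsilon^{2} \le \epsilon$ and therefore holds whenever $\epsilon \le 1/3$.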
Appendix B Missing proofs from Section 3.1
Proof of Lemma 3.
We prove the result for the cost function, and the proof extends similarly to as well. The proof is independent of the choice of the distance function, and we only use the properties of the assignment matrix.
Appendix C Missing proofs from Section 3.2
Proof of Lemma 2.
Upper Bound. We first show that for any set of -centers , and any , . This ensures that the weighted -centers is a good representation of the partial data .
For any , let denote its closest center in . It follows from (C) that
where and employ the definition of and the triangle inequality, respectively. Note that follows from the optimality of the -centers on the partial dataset , i.e., . Next, note that
Lower Bound. To establish the lower bound, we start from Lemma 3. For any set of -centers , we have
Recall that is the set of -median centers for the data-set . By the definition of cluster centers, we know that for any two points , and any set of -centers, . Plugging this observation in (C), we obtain that
where is the cluster center in that is closest to . Now using triangle inequality, we get that
where employs the triangle inequality.
Appendix D Missing proofs from Section 3.3
Proof of Lemma 3.
Note that, for any , the weighted point set is an -coreset of the partial dataset . Thus, according to Definition 2, we have
for any set of -centers .
Proof of Lemma 5.
From Algorithm 3, note that
The last equality follows from the observation that , where is an orthonormal matrix. Therefore, for any -dimensional subspace .
Appendix E Missing proofs from Section 3.4
Proof of Theorem 6.
Let denote the set of non-stragglers, then for any , we have
Next, we argue that for any , we can choose large enough to ensure Property 1 with high probability. First, we analyze the weight of each column in the random matrix . For and , define an event as follows