1 Introduction
Clustering is a ubiquitous task in machine learning and data mining for partitioning a data set into groups of similar points. The $k$-means clustering problem is arguably the most well-studied problem in machine learning. However, designing provably optimal $k$-means clustering algorithms is a challenging task, as the $k$-means objective is NP-hard to optimize [WS11] (in fact, it is also NP-hard to find near-optimal solutions [ACKS15, LSW17]). A popular approach to cope with this intractability is to study average-case models for the $k$-means problem. The most widely used such statistical model for clustering is the Gaussian Mixture Model (GMM), which has a long and rich history [Tei61, Pea94, Das99, AK01, VW04, DS07, BV08, MV10, BS10, KK10].
In this model there are $k$ clusters, and the points from cluster $i$ are generated from a Gaussian in $d$ dimensions with mean $\mu_i$ and covariance matrix $\Sigma_i$ of bounded spectral norm. Each of the $n$ points in the instance is generated independently at random, and is drawn from the $i$th component with probability $w_i$ (the $w_i$ are also called mixing weights). If the means of the underlying Gaussians are separated enough, the ground truth clustering is well defined (a sufficiently large separation between every pair of means suffices w.h.p.). The algorithmic task is to recover the ground truth clustering for any data set generated from such a model (note that the parameters of the Gaussians, the mixing weights, and the cluster memberships of the points are all unknown). Starting from the seminal work of Dasgupta [Das99], there have been a variety of algorithms to provably cluster data from a GMM model. Algorithms based on PCA and distance-based clustering [AK01, VW04, AM05, KSV08]
provably recover the clustering when there is adequate separation between every pair of components. Other algorithmic approaches include the method-of-moments
[KMV10, MV10, BS10], and algebraic methods based on tensor decompositions
[HK12, GVX14, BCMV14, ABG14, GHK15] (see Section 1.1 for a more detailed comparison of the guarantees). On the other hand, the methods of choice in practice are iterative algorithms like Lloyd’s algorithm (also called the $k$-means algorithm) [Llo82] and the $k$-means++ algorithm of [AV07] (Lloyd’s algorithm initialized with centers from distance-based $D^2$ sampling). In the absence of good worst-case guarantees, a compelling direction is to use beyond-worst-case paradigms like average-case analysis to provide provable guarantees. Polynomial time guarantees for recovering the $k$-means optimal clustering by Lloyd’s algorithm and $k$-means++ are known when the points are drawn from a GMM model under sufficient separation conditions [DS07, KK10, AS12].
Although the study of Gaussian mixture models has been very fruitful in the design of a variety of efficient algorithms, real-world data rarely satisfies such strong distributional assumptions. Hence, our choice of algorithm should be informed not only by its computational efficiency but also by its robustness to errors and model misspecification. As a first step, we need theoretical frameworks that can distinguish between algorithms that are tailored to a specific probabilistic model and algorithms that are robust to modeling assumptions. In this paper we initiate such a study in the context of clustering, by studying a natural semi-random model that generalizes the GMM model and also captures robustness to certain adversarial dependencies in the data.
Semi-random models involve a set of adversarial choices, in addition to the random choices of the probabilistic model, made while generating the instance. These models have been successfully applied to the design of robust algorithms for various optimization problems [BS95, FK98, MS10, KMM11, MMV12, MMV14] (see Section 1.1). In a typical semi-random model, there is a “planted” or “ground-truth” solution, and an instance is first generated according to a simple probabilistic model. An adversary is then allowed to make “monotone” or helpful changes to the instance that only make the planted solution more pronounced. For instance, in the semi-random model of Feige and Kilian [FK98] for graph partitioning, the adversary is allowed to arbitrarily add extra edges within each cluster or delete edges between different clusters of the planted partitioning. These adversarial choices only make the planted partition more prominent; however, the choices can be dependent, and they thwart algorithms that rely on excessive independence or on strong but unrealistic structural properties of the instances.
Hence, the study of semi-random models helps us understand and identify robust algorithms. Our motivation for studying semi-random models for clustering is twofold: (a) to design algorithms that are robust to strong distributional assumptions on the data, and (b) to explain the empirical success of simple heuristics such as Lloyd’s algorithm.
Semi-random mixtures of Gaussians
In an ideal clustering instance, each point $x$ in the $i$th cluster $C_i$ is significantly closer to its own mean $\mu_i$ than to any other mean $\mu_j$ for $j \ne i$ (an analogous statement holds for the optimal solution of a general instance). Moving each point in $C_i$ toward its own mean only increases this gap between the distance to its mean and the distance to any other mean. Hence, this perturbation is a monotone perturbation that only makes the planted clustering even better. In our semi-random model, the points are first drawn from a mixture of Gaussians (this defines the planted clustering). The adversary is then allowed to move each point in the $i$th cluster closer to its mean $\mu_i$. This allows the points to be even better clustered around their respective means; however, these perturbations are allowed to have arbitrary dependencies. We now formally define the semi-random model.
Definition 1.1 (Semi-random GMM model).
Given a set of parameters $(w_i, \mu_i, \Sigma_i)_{i \in [k]}$ and $n$, a clustering instance on $n$ points is generated as follows.

1. The adversary chooses an arbitrary partition $C_1, \dots, C_k$ of $[n]$ with $|C_i| = w_i n$ for all $i \in [k]$.

2. For each $i \in [k]$ and each $u \in C_i$, a point $y_u$ is generated independently at random according to a Gaussian with mean $\mu_i$ and covariance $\Sigma_i$ with $\|\Sigma_i\| \le \sigma^2$, i.e., variance at most $\sigma^2$ in each direction.

3. The adversary then moves each point towards the mean of its component by an arbitrary amount, i.e., for each $u \in C_i$, the adversary picks a point $x_u$ arbitrarily on the segment joining $\mu_i$ and $y_u$. (Note that these choices can be correlated arbitrarily.)

The instance is $X = \{x_u : u \in [n]\}$ and is parameterized by $(k, n, \sigma)$ with the planted clustering $C_1, \dots, C_k$.
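As a concrete illustration, the generative process above can be sketched in code for the spherical case $\Sigma_i = \sigma^2 I$. The function name and the `shrink_fn` interface (the adversary's rule for how far each point moves toward its mean) are our own notational choices, not the paper's:

```python
import numpy as np

def sample_semirandom_gmm(mus, sigma, sizes, shrink_fn, rng=None):
    """Sketch of Definition 1.1, spherical case (covariance sigma^2 * I).

    mus      : (k, d) array of component means.
    sizes    : number of points per cluster (the adversary's partition sizes).
    shrink_fn: adversary's rule mapping (cluster index, Gaussian samples y) to
               scalars delta in [0, 1]; each point is placed at
               mu_i + delta * (y - mu_i), i.e., moved toward its own mean.
    """
    rng = np.random.default_rng(rng)
    points, labels = [], []
    for i, (mu, n_i) in enumerate(zip(mus, sizes)):
        y = rng.normal(mu, sigma, size=(n_i, len(mu)))  # step 2: Gaussian samples
        delta = np.clip(shrink_fn(i, y), 0.0, 1.0)      # step 3: monotone perturbation
        points.append(mu + delta[:, None] * (y - mu))
        labels.extend([i] * n_i)
    return np.vstack(points), np.array(labels)
```

With `shrink_fn` returning all ones the model reduces to an ordinary spherical GMM; returning all zeros collapses every cluster onto its mean (a dense "core").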
Data generated by mixtures of high-dimensional Gaussians has certain properties that are often not exhibited by real-world instances. High-dimensional Gaussians have strong concentration properties; for example, all the points generated from a high-dimensional Gaussian with variance $\sigma^2$ in each direction concentrate at distance roughly $\sigma\sqrt{d}$ from the mean w.h.p. In many real-world datasets, on the other hand, clusters in the ground truth often contain dense “cores” that are close to the mean. Our semi-random model admits such instances by allowing points in a cluster to move arbitrarily close to the mean.
Our Results.
Our first result studies Lloyd’s algorithm on the semi-random GMM model and gives an upper bound on the clustering error achieved by Lloyd’s algorithm with the initialization procedure used in [KK10].
Informal Theorem 1.2.
Consider any semi-random instance with $n$ points generated by the semi-random GMM model (Definition 1.1) with planted clustering $C_1, \dots, C_k$ and parameters satisfying a separation condition, of order $\sigma\sqrt{k}$ up to logarithmic factors, between every pair of means. There is a polynomial time algorithm based on Lloyd’s iterative algorithm that recovers the cluster memberships of all but a small number of points.
The bound in the above statement hides logarithmic factors; see Theorem 3.1 for a formal statement. Furthermore, we show that in the above result the Lloyd’s iterations can be initialized using the popular $k$-means++ algorithm, which uses $D^2$ sampling [AV07]. The work most closely related to ours is that of [KK10] and [AS12], who provided deterministic data conditions under which Lloyd’s algorithm converges to the optimal clustering. Along these lines, our work provides further theoretical justification for the enormous empirical success that Lloyd’s algorithm enjoys.
It is also worth noting that, in spite of being robust to semi-random perturbations, the separation requirement in our upper bound matches the separation requirement in the best known guarantees [AS12] for Lloyd’s algorithm even in the absence of any semi-random errors or perturbations (see Section 1.1 for a comparison). (We note that for clustering GMMs, the work of Brubaker and Vempala [BV08] gives a qualitatively different separation condition that does not depend on the maximum variance and can handle Gaussian mixtures that look like “parallel pancakes”; however, this condition is incomparable to [AS12] because of a potentially worse dependence on $k$.) We also remark that while the algorithm recovers a clustering of the given data that is very close to the planted clustering, this does not necessarily estimate the means of the original Gaussian components up to inverse polynomial accuracy (in fact, the centers of the planted clustering after the semi-random perturbation may be far from the original means). This differs from the recent body of work on parameter estimation in the presence of adversarial noise (see Section 1.1 for a comparison). While the monotone changes allowed in the semi-random model should only make the clustering task easier, our next result shows that the error achieved by Lloyd’s algorithm is in fact near optimal. More specifically, we provide a lower bound on the number of points that will be misclassified by any $k$-means optimal solution for the instance.
Informal Theorem 1.3.
For any sufficiently large $n$ (polynomial in the remaining parameters), there exists an instance on $n$ points in $d$ dimensions generated from the semi-random GMM model (Definition 1.1), with planted clustering whose means satisfy the separation requirement of Theorem 1.2, such that any optimal $k$-means clustering solution of the instance misclassifies a non-trivial number of points (quantified in Theorem 4.1) with high probability.
The above lower bound holds even when the semi-random (monotone) perturbations are applied to points generated from a mixture of spherical Gaussians with identical covariances and equal mixing weights. Further, the lower bound holds not just for the optimal $k$-means solution, but for any “locally optimal” clustering solution. See Theorem 4.1 for a formal statement. These two results together show that Lloyd’s algorithm essentially recovers the planted clustering up to the optimal error possible for any algorithm based on $k$-means clustering.
Unlike algorithmic results for other semi-random models, an appealing aspect of our algorithmic result is that it gives provable robustness guarantees in the semi-random model for a simple, popular algorithm that is used in practice (Lloyd’s algorithm). Further, other approaches to clustering, such as distance-based clustering, the method-of-moments, and tensor decompositions, seem inherently non-robust to these semi-random perturbations (see Section 1.1 for details). This robustness of Lloyd’s algorithm suggests an explanation for its widely documented empirical success across different application domains.
Considerations in the choice of the Semi-random GMM model.
Here we briefly discuss different semi-random models and the considerations involved in favoring Definition 1.1. Another semi-random model that comes to mind is one where the adversary can move each point closer to the mean of its own cluster just in terms of distance, regardless of direction. Intuitively this seems appealing, since it improves the cost of the planted clustering. However, in this model the optimal $k$-means clustering of the perturbed instance can be vastly different from the planted solution. This is because one can move many points in a cluster $C_i$ in such a way that they become closer to a different mean $\mu_j$ rather than $\mu_i$. For high-dimensional Gaussians it is easy to see that the distance of each point to its own mean is on the order of $\sigma\sqrt{d}$. Hence, in our regime of interest, the inter-mean separation could be much smaller than this radius of a cluster (when the separation is much smaller than $\sigma\sqrt{d}$). Consider an adversary that moves a large fraction of the points in a given cluster to the mean of another cluster. While the distance of these points to their own cluster mean has only decreased, from roughly $\sigma\sqrt{d}$ to around the inter-mean separation, these points now become closer to the mean of a different cluster! In the semi-random GMM model, on the other hand, the adversary is only allowed to move each point along the direction toward its own mean; hence, each point remains closer to its own mean than to the means of other clusters. Our results show that in such a model, the optimal clustering solution can change by at most a small number of points.
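This phenomenon is easy to check numerically. The sketch below (dimension, separation, and variance are our illustrative choices) shows that a typical point lies at distance about $\sigma\sqrt{d}$ from its own mean, so the "distance-only" adversary may legally park it at the other cluster's mean, where it is misclassified:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10_000, 1.0
mu1, mu2 = np.zeros(d), np.zeros(d)
mu2[0] = 20.0                       # inter-mean separation 20 << sigma*sqrt(d) = 100

x = rng.normal(mu1, sigma)          # a typical point of cluster 1
r_before = np.linalg.norm(x - mu1)  # concentrates around sigma*sqrt(d)

# "distance-only" adversary: park the point at the other cluster's mean
x_new = mu2.copy()
r_after = np.linalg.norm(x_new - mu1)

assert r_after < r_before                      # a legal move in that model...
assert np.linalg.norm(x_new - mu2) < r_after   # ...yet the point is now misclassified
```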
Challenges in the Semi-random GMM model and Overview of Techniques.
Lloyd’s algorithm has been analyzed before in the context of clustering mixtures of Gaussians [KK10, AS12]. Any variant of Lloyd’s algorithm consists of two stages — an initialization stage where a set of initial centers is computed, and an iterative stage which successively improves the clustering in each step. Kumar and Kannan [KK10] considered a variant of Lloyd’s method where the initialization is given by PCA together with a constant-factor approximation to the $k$-means optimization problem. The improved analysis of this algorithm in [AS12] leads to state-of-the-art results that perfectly recover all the clusters when the separation is of order $\sigma\sqrt{k}$ (up to logarithmic factors).
We analyze the variant of Lloyd’s algorithm introduced by Kumar and Kannan [KK10]. However, there are several challenges in extending the analysis of [AS12] to the semi-random setting. While the semi-random perturbations in the model only move the points of a cluster closer to its mean, these perturbations can be coordinated in a way that moves the empirical mean of the cluster significantly. For instance, Lemma 4.3 gives a simple semi-random perturbation to the points of a cluster that moves its empirical mean by a non-trivial amount along any desired direction $v$. This shift in the empirical means may now cause some of the points in a cluster $C_i$ to become closer to the empirical mean of another cluster (in particular, points that have a relatively large projection onto $v$), and vice versa. In fact, the lower bound instance in Theorem 4.1 is constructed by applying such a semi-random perturbation, given by Lemma 4.3, to the points in a cluster along a carefully picked direction, so that many points are misclassified per cluster.
The main algorithmic contribution of the paper is an analysis of Lloyd’s iterative algorithm when the points come from the semi-random GMM model. The key is to understand the number of points that can be misclassified in an intermediate step of the Lloyd’s iterations. We show in Lemma 3.3 that if each of the current estimates of the means is sufficiently close to the corresponding true mean, then the number of points misclassified by the current Lloyd’s iteration is small. This relies crucially on Lemma 2.11, which upper bounds the number of points $x_u$ in a cluster such that $x_u - \mu_i$ has a large inner product with any (potentially bad) direction $v$.
The effect of these bad points has to be carefully accounted for when analyzing both stages of the algorithm: the initialization phase and the iterative algorithm. Proposition 3.2 argues about the closeness of the initial centers to the true means. As in [KK10], these initial centers are obtained via a boosting technique that first maps the points to an expanded feature space and then uses the (SVD + $k$-means approximation) approach to get initial centers. When using this approach for semi-random data, one needs to carefully argue about how the set of bad points behaves in the expanded feature space. This is done in Lemmas B.1 and B.2. Given the initial centers, it is not hard to see that the analysis of [AS12] can be carried out to argue about the improvements made in the Lloyd’s iterative step; however, this leads to an error bound that is suboptimal by a multiplicative factor. Instead, we perform a much finer analysis for the semi-random model to control the effect of the bad points and achieve nearly optimal error bounds. This is done in Lemma 3.4.
1.1 Related Work
There has been a long line of algorithmic results on Gaussian mixture models starting from [Tei61, Tei67, Pea94]. These results fall into two broad categories: (1) clustering algorithms, which aim to recover the component/cluster memberships of the points, and (2) parameter estimation, where the goal is to estimate the parameters of the Gaussian components. When the components of the mixture are sufficiently well-separated, the Gaussians do not overlap w.h.p., and the two tasks become equivalent. We now review the different algorithms that have been designed for these two tasks, and comment on their robustness to semi-random perturbations.
Clustering Algorithms.
The first polynomial time algorithmic guarantees were given by Dasgupta [Das99], who showed how to cluster a mixture of Gaussians with identical covariance matrices when the separation between the cluster means is of order $\sigma\sqrt{d}$, where $\sigma^2$ denotes the maximum variance of any cluster along any direction. Distance-based clustering algorithms that rely on strong distance-concentration properties of high-dimensional Gaussians improved the required separation between means $\mu_i$ and $\mu_j$ to order $(\sigma_i + \sigma_j)\, d^{1/4}$ [AK01, DS07], where $\sigma_i^2$ denotes the maximum variance of points in cluster $i$ along any direction. Vempala and Wang [VW04] and subsequent results [KSV08, AM05] used PCA to project down to $k$ dimensions (when $k \ll d$), and then used the above distance-based algorithms to get state-of-the-art guarantees for many settings: for spherical Gaussians a separation of roughly $\sigma k^{1/4}$ (up to logarithmic factors) suffices [VW04], while for non-spherical Gaussians a separation on the order of $\sigma\sqrt{k}$ (with an additional dependence on the minimum mixing weight) is known to suffice [AM05, KSV08]. Brubaker and Vempala [BV08] gave a qualitative improvement on the separation requirement for non-spherical Gaussians by depending only on the variance along the direction of the line joining the respective means, as opposed to the maximum variance along any direction.
Recent work has also focused on provable guarantees for heuristics such as Lloyd’s algorithm for clustering mixtures of Gaussians [KK10, AS12]. Iterative algorithms like Lloyd’s algorithm (also called the $k$-means algorithm) [Llo82] and its variants like $k$-means++ [AV07] are the methods of choice for clustering in practice. The best known guarantee [AS12] along these lines requires a separation of order $\sigma\sqrt{k}$ between any pair of means, where $\sigma^2$ is the maximum variance among all clusters along any direction. To summarize, for a mixture of Gaussians in $d$ dimensions with the variance of each cluster bounded by $\sigma^2$ in every direction, the state-of-the-art guarantees require a separation of roughly $\sigma k^{1/4}$ between the means of any two components for spherical Gaussians [VW04], while a separation of order $\sigma\sqrt{k}$ is known to suffice for non-spherical Gaussians [AS12].
The techniques in many of the above works rely on strong distance-concentration properties of high-dimensional Gaussians. For instance, the arguments of [AK01, VW04] crucially rely on the tight concentration of the squared distance between any pair of points in the same cluster around its expectation ($2\sigma^2 d$ for a spherical cluster). These arguments do not seem to carry over to the semi-random model. Brubaker [Bru09] gave a robust algorithm for clustering a mixture of Gaussians when at most an $\varepsilon$ fraction of the points are corrupted arbitrarily. However, it is unclear whether these arguments can be modified to work under the semi-random model, since the perturbations can potentially affect all the points in the instance. On the other hand, our results show that the Lloyd’s-style algorithm of Kumar and Kannan [KK10] is robust to these semi-random perturbations.
Parameter Estimation.
A different approach is to design algorithms that estimate the parameters of the underlying Gaussian mixture model; assuming the means are well separated, accurate clustering can then be performed. A very influential line of work focuses on the method-of-moments [KMV10, MV10, BS10] to learn the parameters of the model when the number of clusters $k$ is a constant. Moment methods (necessarily) require running time (and sample complexity) exponential in $k$, but do not assume any explicit separation between the components of the mixture. Recent work [HK13, BCV14, GVX14, BCMV14, ABG14, GHK15] uses the uniqueness of tensor decompositions (of order 3 and above) to implement the method of moments and gives polynomial time algorithms assuming the means are sufficiently high dimensional and do not lie in certain degenerate configurations [HK12, GVX14, BCMV14, ABG14, GHK15].
Algorithmic approaches based on the method-of-moments and tensor decompositions rely heavily on the exact parametric form of the Gaussian distribution and on exact algebraic expressions for the moments of the distribution in terms of the parameters. These algebraic methods can be easily foiled by a monotone adversary, since the adversary can perturb any subset of points to alter the moments significantly (for example, even the first moment, i.e., the mean of a cluster, can change noticeably; see Lemma 4.3). Recent work has also focused on provable guarantees for heuristics such as maximum likelihood estimation and the Expectation Maximization (EM) algorithm for parameter estimation
[DS07, BWY14, XHM16, DTZ16]. Very recently, [RV17] considered other iterative algorithms for parameter estimation, and studied the optimal order of separation required for parameter estimation. However, we are not aware of any existing analysis showing that these iterative algorithms for parameter estimation are robust to modeling errors. Another recent line of exciting work concerns designing robust high-dimensional estimators of the mean and covariance of a single Gaussian (and of mixtures of Gaussians) when an $\varepsilon$ fraction of the points are adversarially corrupted [DKK16, LRV16, CSV17]. However, these results and similar results on agnostic learning do not necessarily recover the ground truth clustering. Further, they typically assume that only a small fraction of the points are corrupted, while potentially all the points could be perturbed in the semi-random model. On the other hand, our work does not necessarily give guarantees for estimating the means of the original Gaussians (in fact, the centers given by the planted clustering in the semi-random instance can be far from the original means). Hence, our semi-random model is incomparable to the model of robustness considered in these works.
Semi-random models for other optimization problems.
There has been a long line of work on semi-random models for various optimization problems. Blum and Spencer [BS95] initiated the study of semi-random models with the problem of graph coloring. Feige and Kilian [FK98] considered semi-random models involving monotone adversaries for various problems including graph partitioning, independent set, and clique. Makarychev et al. [MMV12, MMV14] designed algorithms for more general semi-random models for various graph partitioning problems. The work of [MPW15] studied the power of monotone adversaries in the context of community detection (stochastic block models), while [MMV16] considered the robustness of community detection to monotone adversaries and to different kinds of errors and model misspecification. Semi-random models have also been studied for correlation clustering [MS10, MMV15], noisy sorting [MMV13], and coloring [DF16].
2 Preliminaries and the Semi-random Model
We first formally define the Gaussian mixture model.
Definition 2.1.
(Gaussian Mixture Model). A Gaussian mixture model with $k$ components is defined by the parameters $\{(w_i, \mu_i, \Sigma_i) : i \in [k]\}$. Here $\mu_i \in \mathbb{R}^d$ is the mean of component $i$ and $\Sigma_i$ is the corresponding covariance matrix; $w_i$ is the mixing weight of component $i$, and $\sum_{i=1}^k w_i = 1$. An instance of $n$ points from the mixture is generated as follows: for each $u \in [n]$, sample a component $i$ independently at random with probability $w_i$; given the component, sample $x_u$ from $\mathcal{N}(\mu_i, \Sigma_i)$. The points can be naturally partitioned into $k$ clusters $C_1, \dots, C_k$, where cluster $C_i$ corresponds to the points that are sampled from component $i$. We will refer to this as the planted clustering or ground truth clustering.
Clustering data from a mixture of Gaussians is a natural average-case model for the $k$-means clustering problem. Specifically, if the means of a Gaussian mixture model are well separated, then with high probability the ground truth clustering of an instance sampled from the model corresponds to the $k$-means optimal clustering.
Definition 2.2.
($k$-means clustering). Given an instance $X$ of $n$ points in $\mathbb{R}^d$, the $k$-means problem is to find $k$ points (centers) $c_1, \dots, c_k$ so as to minimize $\sum_{x \in X} \min_{i \in [k]} \|x - c_i\|^2$.
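In code, the objective of Definition 2.2 is a one-liner; the helper name is ours:

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means objective: sum over points of the squared distance
    to the nearest center."""
    # pairwise squared distances, shape (n_points, n_centers)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()
```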
The optimal $k$ means (centers) naturally define a clustering of the data, where each point is assigned to its closest center. A key property of the $k$-means objective is that the optimal solution induces a locally optimal clustering.
Definition 2.3.
(Locally Optimal Clustering). A clustering $C_1, \dots, C_k$ of data points in $\mathbb{R}^d$ is locally optimal if for each $i \ne j \in [k]$ and each $x \in C_i$ we have $\|x - \mu(C_i)\| \le \|x - \mu(C_j)\|$. Here $\mu(C_i)$ is the average of the points in $C_i$.
Hence, given the optimal $k$-means clustering, the optimal centers can be recovered by simply computing the average of each cluster. This is the underlying principle behind the popular Lloyd’s algorithm [Llo82] for $k$-means clustering. The algorithm starts with a choice of initial centers. It then repeatedly computes new centers as the averages of the clusters induced by the current centers; hence the algorithm converges to a locally optimal clustering. Although popular in practice, the worst-case performance of Lloyd’s algorithm can be arbitrarily bad [AV05]. The choice of initial centers is very important to the success of Lloyd’s algorithm. We show that our theoretical guarantees hold when the initialization is done via the popular $k$-means++ algorithm [AV07]. There also exist more sophisticated constant factor approximation algorithms for the $k$-means problem [KMN02, ANSW16] that can be used for seeding in our framework.
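A minimal sketch of Lloyd's iterations with $k$-means++ ($D^2$ sampling) seeding, assuming Euclidean data in a NumPy array; this is an illustrative implementation, not the exact variant analyzed in the paper (which uses the [KK10] initialization):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding [AV07]: each new center is sampled with probability
    proportional to the squared distance to the nearest existing center."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.linalg.norm(X - c, axis=1) ** 2 for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def lloyd(X, k, iters=50, rng=None):
    """Lloyd's iterations [Llo82] seeded with k-means++ (sketch)."""
    rng = np.random.default_rng(rng)
    centers = kmeanspp_init(X, k, rng)
    for _ in range(iters):
        # assignment step: each point goes to its nearest current center
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # update step: recompute each center as the mean of its cluster
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break  # fixed point: the induced clustering is locally optimal
        centers = new
    return centers, labels
```

At the fixed point, each point is assigned to its nearest center and each center is the mean of its cluster, which is exactly the local optimality of Definition 2.3.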
While the clustering $C_1, \dots, C_k$ formally represents a partition of the index set $[n]$, we will sometimes abuse notation and use $C_i$ to also denote the set of points in $X$ that correspond to these indices. Finally, many of the statements are probabilistic in nature, depending on the randomness in the semi-random model. In the following sections, w.h.p. will refer to a probability of at least $1 - 1/\mathrm{poly}(n)$, unless specified otherwise.
2.1 Properties of Semi-random Gaussians
In this section we state and prove properties of semi-random mixtures that will be used throughout the analysis in subsequent sections. We first start with a couple of simple lemmas that follow directly from corresponding facts about high-dimensional Gaussians.
Lemma 2.4.
Consider any semi-random instance with parameters as above and clusters $C_1, \dots, C_k$. Then with high probability we have
(1) 
Proof.
Lemma 2.5.
Consider any semi-random instance with parameters as above and clusters $C_1, \dots, C_k$, and let $v$ be a fixed unit vector in $\mathbb{R}^d$. Then with high probability we have
(2) 
Proof.
Let $y_u$ denote the point generated in step 2 of the semi-random model (Definition 1.1) before the semi-random perturbation was applied, so that $x_u - \mu_i = \delta_u (y_u - \mu_i)$ for some $\delta_u \in [0,1]$.
Consider the sample $y_u$ for $u \in C_i$, and let $\Sigma_i$ be the covariance matrix of the $i$th Gaussian component, so $\|\Sigma_i\| \le \sigma^2$. The projection $\langle y_u, v \rangle$ is a Gaussian with mean $\langle \mu_i, v \rangle$ and variance $v^\top \Sigma_i v \le \sigma^2$, so the claimed bound holds for $y_u$ by Lemma A.1; since $|\langle x_u - \mu_i, v \rangle| = \delta_u |\langle y_u - \mu_i, v \rangle|$, the semi-random perturbation can only decrease the projection.
Hence, from a union bound over all $n$ samples, the lemma follows. ∎
The above lemma immediately implies the following after a union bound over the unit vectors along the $d$ coordinate directions.
Lemma 2.6.
Consider any semi-random instance with parameters as above and clusters $C_1, \dots, C_k$. Then with high probability we have
(3) 
We next state a lemma about how far the empirical mean of a component of a semi-random GMM can move away from the true mean.
Lemma 2.7.
Consider any semi-random instance with $n$ points generated with parameters as in Definition 1.1. Then with high probability we have that
(4) 
Proof.
For each point $x_u$ in the semi-random GMM, let $y_u$ be the original Gaussian sample that is modified to produce $x_u$. Then we know that $x_u - \mu_i = \delta_u (y_u - \mu_i)$ where $\delta_u \in [0,1]$. Hence the deviation of the empirical mean from $\mu_i$ can be written as the matrix with columns $y_u - \mu_i$ (for $u \in C_i$), times a diagonal matrix with entries $\delta_u$, applied to a unit vector; the claim then follows by bounding the spectral norm of this matrix (Lemma A.5). ∎
The next lemma bounds the variance of a component around its mean in a semi-random GMM.
Lemma 2.8.
Consider any semi-random instance with $n$ points generated with parameters as in Definition 1.1. Then with high probability we have that
(5) 
Proof.
Exactly as in the proof of Lemma 2.7, we can write $x_u - \mu_i = \delta_u (y_u - \mu_i)$ with $\delta_u \in [0,1]$. Furthermore, since the $y_u$ are points from a Gaussian, the corresponding bound holds for them with high probability; since $\delta_u \le 1$, the claim follows. ∎
We also need to argue about the mean of a large subset of points from a component of a semi-random GMM.
Lemma 2.9.
Consider any semi-random instance with $n$ points generated with parameters as in Definition 1.1 and planted clustering $C_1, \dots, C_k$. Let $Z \subseteq C_i$ be a sufficiently large subset. Then, with high probability, we have that
(6) 
Proof.
Let $C_i$ be the set of points in component $i$ and let $\mu(C_i)$ be the mean of the points in $C_i$. Notice that from Lemma 2.7 and the fact that the perturbation is semi-random, we have with high probability that $\mu(C_i)$ is close to $\mu_i$. Also, because the component is a semi-random perturbation of a Gaussian, Lemma 2.8 bounds the variance of the points around $\mu(C_i)$ with high probability.
Hence, with high probability, it remains to bound the deviation of the mean of $Z$ from $\mu(C_i)$. Writing this deviation as an average over the points of $C_i \setminus Z$ projected along the unit vector in the direction of the deviation, and applying the Cauchy–Schwarz inequality together with the variance bound above, gives the required bound. Combining the two bounds gives us the result. ∎
Finally, we argue about the variance of the entire data matrix of a semi-random GMM.
Lemma 2.10.
Consider any semi-random instance with $n$ points generated with parameters as in Definition 1.1. Let $A$ be the matrix of data points and let $M$ be the matrix composed of the means of the corresponding clusters. Then, with high probability, we have that
(7) 
Proof.
Let $A$ denote the matrix of data points, $M$ the matrix of cluster means as in the lemma, and $M'$ the matrix of true means corresponding to the cluster memberships. We can write $\|A - M\| \le \|A - M'\| + \|M' - M\|$. Using Lemma 2.7, with high probability the term $\|M' - M\|$ is small, since each column of $M' - M$ is the deviation of an empirical cluster mean from the corresponding true mean. Furthermore, Lemma 2.8 bounds, with high probability, the total squared deviation of the points from their true means, which bounds the Frobenius (and hence spectral) norm of $A - M'$. Combining the two bounds we get the claim. ∎
The following lemma is crucial in analyzing the performance of Lloyd’s algorithm. We would like to upper bound the inner product $\langle x_u - \mu_i, v \rangle$ for every direction $v$ and sample $x_u$, but this is impossible since $v$ can be aligned along $x_u - \mu_i$. The following lemma, however, upper bounds the total number of points in the dataset that can have a large projection onto any single direction. The proof involves a union bound over a net of all possible directions $v$.
Lemma 2.11 (Points in Bad Directions).
Consider any semi-random instance with $n$ points having parameters as in Definition 1.1 and planted clustering $C_1, \dots, C_k$. Then there exists a universal constant $c$ such that, with high probability, we have that
(8) 
Proof.
Set the projection threshold and the net granularity appropriately, and consider an $\varepsilon$-net over unit vectors in $\mathbb{R}^d$. Then, since the net is an $\varepsilon$-net, for any unit vector $v$ there exists some unit vector $z$ in the net satisfying
(9) 
for our choice of parameters. Consider a fixed $u \in C_i$ and a fixed direction $z$ in the net. Since the variance of $\langle x_u - \mu_i, z \rangle$ is at most $\sigma^2$, we have
The probability that many points in $C_i$ satisfy (9) for a fixed direction is therefore small. Let $E$ denote the bad event that there exists a direction in the net such that more than the claimed number of points satisfy the bad event given by (9). By a union bound over the net, the probability of $E$ is small for our choice of parameters.
∎
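The flavor of Lemma 2.11 can be illustrated empirically. The sketch below probes many random directions (a far weaker test than the $\varepsilon$-net union bound in the proof, and with parameters of our own choosing) and checks that even the worst probed direction captures only a tiny fraction of the points at threshold $3\sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, t = 2000, 50, 1.0, 3.0
X = rng.normal(0.0, sigma, size=(n, d))  # one cluster, mean at the origin

worst = 0
for _ in range(500):                      # probe many random directions
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    worst = max(worst, int(np.sum(X @ v >= t * sigma)))

# For each fixed direction the expected count is n * P[N(0,1) >= 3] ~ 2.7,
# so even the worst probed direction should flag only a tiny fraction of points.
assert worst < n * 0.02
```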
3 Upper Bounds for Semi-random GMMs
In this section we prove the following theorem, which provides algorithmic guarantees for Lloyd’s algorithm with appropriate initialization under the semi-random model for mixtures of Gaussians in Definition 1.1.
Theorem 3.1.
There exists a universal constant $c$ such that the following holds. There is a polynomial time algorithm that, for any semi-random instance on $n$ points with planted clustering generated by the semi-random GMM model (Definition 1.1) with parameters s.t.
(10) 
finds w.h.p. a clustering that agrees with the planted clustering on all but a small number of points.
In Section 4 we show that the above error bound is close to the information-theoretically optimal bound (up to logarithmic factors). Lloyd’s algorithm as described in Figure 1 consists of two stages: the initialization stage and an iterative improvement stage.
The initialization follows the same scheme as proposed by Kumar and Kannan in [KK10]. The initialization algorithm first performs an SVD of the data matrix, followed by running the $k$-means++ algorithm [AV07], which uses $D^2$ sampling, to compute seed centers. One can also use any constant factor approximation algorithm for $k$-means clustering in the projected space to obtain the initial centers [KMN02, ANSW16]. This approach works for clusters that are nearly balanced in size. However, when the cluster sizes are arbitrary, an appropriate transformation of the data is performed first that amplifies the separation between the centers. Following this transformation, the (SVD + $k$-means++) procedure is used to get the initial centers. The formal guarantee of the initialization procedure is encapsulated in the following proposition, whose proof is given in Section 3.2.
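A simplified sketch of this initialization pipeline for the nearly balanced case: project onto the top-$k$ singular subspace, seed with $D^2$ sampling, and lift the seeds back. This omits the boosting transformation used for arbitrary cluster sizes, and all names are ours:

```python
import numpy as np

def initialize_centers(X, k, rng=None):
    """Initialization sketch in the spirit of [KK10]: SVD projection to the
    top-k singular subspace, then k-means++ (D^2 sampling) seeding.
    Assumes nearly balanced clusters."""
    rng = np.random.default_rng(rng)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k]                          # top-k right singular vectors, (k, d)
    Y = X @ P.T                         # points projected to k dimensions
    # D^2 sampling in the projected space
    centers = [Y[rng.integers(len(Y))]]
    for _ in range(k - 1):
        d2 = np.min([np.linalg.norm(Y - c, axis=1) ** 2 for c in centers], axis=0)
        centers.append(Y[rng.choice(len(Y), p=d2 / d2.sum())])
    return np.array(centers) @ P        # lift the seeds back to ambient space
```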
The main algorithmic contribution of this paper is an analysis of Lloyd’s algorithm when the points come from the semi-random GMM model. For the rest of the analysis we will assume that the instance generated from the semi-random GMM model satisfies conditions (1) to (8). These eight conditions are shown to hold w.h.p. in Section 2.1 for instances generated from the model. Our analysis will in fact hold for any deterministic data set satisfying these conditions. This lets us gracefully argue about performing many iterations of Lloyd’s algorithm on the same data set without needing to draw fresh samples at each step.
Proposition 3.2.
The analysis of the Lloyd’s iterations crucially relies on the following lemma, which upper bounds the number of misclassified points when the current center estimates are relatively close to the true means.
Lemma 3.3 (Projection condition).
The following lemma quantifies the improvement in each step of Lloyd’s algorithm. The proof uses Lemma 3.3 along with the properties of semi-random Gaussians established in Section 2.1.
Lemma 3.4.
We now present the proof of Theorem 3.1.
Proof of Theorem 3.1.
First, the eight deterministic conditions (1)–(8) are shown to hold w.h.p. for the instance in Section 2.1. The proof then follows in a straightforward manner by combining Proposition 3.2, Lemma 3.4, and Lemma 3.3. Proposition 3.2 shows that the initial centers are close to the corresponding true means for all clusters. Applying Lemma 3.4, after sufficiently many iterations the center estimates are, w.h.p., close enough to the true means for Lemma 3.3 to apply. Finally, invoking Lemma 3.3, the theorem follows. ∎
3.1 Analyzing Lloyd’s Algorithm
We now analyze each iteration of Lloyd’s algorithm and show that we make progress in each step, misclassifying fewer points with successive iterations. As a first step we begin with the proof of Lemma 3.3.
Proof of Lemma 3.3.
Fix a sample $x_u$ with $u \in C_i$, let $y_u$ be the corresponding point before the semi-random perturbation, and for each $j \ne i$ let $v_{ij}$ denote the unit vector along $\mu_j - \mu_i$.
We first observe that, by projecting the Gaussian around $\mu_i$ onto the direction $v_{ij}$, we have that
(11) 
where the first inequality follows from (3), and the second inequality uses the separation condition.
Suppose $x_u$ is misclassified, i.e., it is assigned to cluster $j$ for some $j \ne i$. Then,
since . Hence, we have that if is misclassified by