Clustering Semi-Random Mixtures of Gaussians

Gaussian mixture models (GMM) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural semi-random model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. In our model, a semi-random adversary is allowed to make arbitrary "monotone" or helpful changes to the data generated from the Gaussian mixture model. Our first contribution is a polynomial time algorithm that provably recovers the ground-truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd's algorithm for k-means clustering that is the method-of-choice in practice. Our second result complements the upper bound by giving a nearly matching information-theoretic lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.


1 Introduction

Clustering is a ubiquitous task in machine learning and data mining for partitioning a data set into groups of similar points. The $k$-means clustering problem is arguably the most well-studied clustering formulation in machine learning. However, designing provably optimal $k$-means clustering algorithms is challenging, as the $k$-means objective is NP-hard to optimize [WS11] (in fact, it is also NP-hard to find near-optimal solutions [ACKS15, LSW17]). A popular approach to cope with this intractability is to study average-case models for the $k$-means problem. The most widely used such statistical model for clustering is the Gaussian Mixture Model (GMM), which has a long and rich history [Tei61, Pea94, Das99, AK01, VW04, DS07, BV08, MV10, BS10, KK10].

In this model there are $k$ clusters, and the points from cluster $i$ are generated from a Gaussian in $d$ dimensions with mean $\mu_i$ and a covariance matrix of spectral norm at most $\sigma^2$. Each of the $n$ points in the instance is generated independently at random and is drawn from the $i$th component with probability $w_i$ (the $w_i$ are also called mixing weights). If the means of the underlying Gaussians are separated enough, the ground truth clustering is well defined (a separation of order $\sigma\sqrt{\log n}$ between every pair of means suffices w.h.p.). The algorithmic task is to recover the ground truth clustering for any data set generated from such a model (note that the parameters of the Gaussians, the mixing weights, and the cluster memberships of the points are all unknown).

Starting from the seminal work of Dasgupta [Das99], there have been a variety of algorithms to provably cluster data from a GMM model. Algorithms based on PCA and distance-based clustering [AK01, VW04, AM05, KSV08] provably recover the clustering when there is adequate separation between the parameters of every pair of components. Other algorithmic approaches include the method-of-moments [KMV10, MV10, BS10], and algebraic methods based on tensor decompositions [HK12, GVX14, BCMV14, ABG14, GHK15]. (Please see Section 1.1 for a more detailed comparison of the guarantees.)

On the other hand, the methods of choice in practice are iterative algorithms like the Lloyd's algorithm (also called the $k$-means algorithm) [Llo82] and the $k$-means++ algorithm of [AV07] (Lloyd's algorithm initialized with centers from distance-based sampling). In the absence of good worst-case guarantees, a compelling direction is to use beyond-worst-case paradigms like average-case analysis to provide provable guarantees. Polynomial time guarantees for recovering the $k$-means optimal clustering by the Lloyd's algorithm and $k$-means++ are known when the points are drawn from a GMM model under sufficient separation conditions [DS07, KK10, AS12].

Although the study of Gaussian mixture models has been very fruitful in designing a variety of efficient algorithms, real world data rarely satisfies such strong distributional assumptions. Hence, our choice of algorithm should be informed not only by its computational efficiency but also by its robustness to errors and model misspecification. As a first step, we need theoretical frameworks that can distinguish between algorithms that are tailored towards a specific probabilistic model and algorithms robust to modeling assumptions. In this paper we initiate such a study in the context of clustering, by studying a natural semi-random model that generalizes the GMM model and also captures robustness to certain adversarial dependencies in the data.

Semi-random models involve a set of adversarial choices, in addition to the random choices of the probabilistic model, while generating the instance. These models have been successfully applied to the design of robust algorithms for various optimization problems [BS95, FK98, MS10, KMM11, MMV12, MMV14] (see Section 1.1). In a typical semi-random model, there is a “planted” or “ground-truth” solution, and an instance is first generated according to a simple probabilistic model. An adversary is then allowed to make “monotone” or helpful changes to the instance that only make the planted solution more pronounced. For instance, in the semi-random model of Feige and Kilian [FK98] for graph partitioning, the adversary is allowed to arbitrarily add extra edges within each cluster or delete edges between different clusters of the planted partitioning. These adversarial choices only make the planted partition more prominent; however, the choices can be dependent, and they thwart algorithms that rely on excessive independence or on strong but unrealistic structural properties of these instances.

Hence, the study of semi-random models helps us understand and identify robust algorithms. Our motivation for studying semi-random models for clustering is two-fold: a) design algorithms that are robust to strong distributional data assumptions, and b) explain the empirical success of simple heuristics such as the Lloyd’s algorithm.

Semi-random mixtures of Gaussians.

In an ideal clustering instance, each point $x$ in the $i$th cluster is significantly closer to its own mean $\mu_i$ than to any other mean $\mu_j$ for $j \neq i$ (for a general instance, the means are those of the clusters in the optimal solution). Moving each point in $C_i$ toward its own mean $\mu_i$ only increases this gap between the distance to any other mean and the distance to its own mean. Hence, this corresponds to a monotone perturbation that only makes the planted clustering even better. In our semi-random model, the points are first drawn from a mixture of Gaussians (this defines the planted clustering). The adversary is then allowed to move each point in the $i$th cluster toward its mean $\mu_i$ by an arbitrary amount. This allows the points to be even better clustered around their respective means; however, these perturbations are allowed to have arbitrary dependencies. We now formally define the semi-random model.

Definition 1.1 (Semi-random GMM model).

Given parameters $\mu_1, \dots, \mu_k \in \mathbb{R}^d$ and $\sigma > 0$, a clustering instance $X$ on $n$ points is generated as follows.

  1. The adversary chooses an arbitrary partition $C_1, \dots, C_k$ of the index set $[n]$.

  2. For each $i \in [k]$ and each $u \in C_i$, a point $y_u$ is generated independently at random according to a Gaussian with mean $\mu_i$ and covariance $\Sigma_i$ satisfying $\|\Sigma_i\| \le \sigma^2$, i.e., variance at most $\sigma^2$ in each direction.

  3. The adversary then moves each point towards the mean of its component by an arbitrary amount, i.e., for each $u \in C_i$, the adversary picks $x_u$ arbitrarily on the segment joining $y_u$ and $\mu_i$ (equivalently, $x_u = \mu_i + \alpha_u (y_u - \mu_i)$ for some $\alpha_u \in [0,1]$). (Note that these choices can be correlated arbitrarily.)

The instance is $X = \{x_u : u \in [n]\}$ and is parameterized by $(\mu_1, \dots, \mu_k, \sigma)$, with the planted clustering $C_1, \dots, C_k$. We will denote by $n_i = |C_i|$ the number of points in cluster $C_i$.
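For concreteness, the following short sketch shows one way such an instance could be generated, assuming spherical covariances $\sigma^2 I$ for simplicity; the function name and interface are illustrative and not taken from the paper.

```python
import numpy as np

def semi_random_gmm(means, sigma, cluster_sizes, adversary=None, seed=None):
    """Sketch of Definition 1.1 with spherical covariance sigma^2 * I.

    means:         (k, d) array of component means mu_i.
    cluster_sizes: the adversary's partition sizes n_1, ..., n_k (step 1).
    adversary:     optional callable (y, mu) -> alpha in [0, 1]; the point y is
                   replaced by mu + alpha * (y - mu), i.e., moved toward mu (step 3).
    """
    rng = np.random.default_rng(seed)
    k, d = means.shape
    points, labels = [], []
    for i in range(k):
        # Step 2: draw n_i points independently from N(mu_i, sigma^2 I).
        Y = means[i] + sigma * rng.standard_normal((cluster_sizes[i], d))
        # Step 3: monotone perturbation; each point only moves toward mu_i.
        if adversary is not None:
            alpha = np.array([adversary(y, means[i]) for y in Y])
            Y = means[i] + alpha[:, None] * (Y - means[i])
        points.append(Y)
        labels.extend([i] * cluster_sizes[i])
    return np.vstack(points), np.array(labels)
```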

Data generated by mixtures of high-dimensional Gaussians have certain properties that are often not exhibited by real-world instances. High-dimensional Gaussians have strong concentration properties; for example, all the points generated from a high-dimensional Gaussian lie at distance roughly $\sigma\sqrt{d}$ from the mean w.h.p., i.e., on a thin shell far from the mean. In many real-world datasets, on the other hand, clusters in the ground truth often contain dense “cores” that are close to the mean. Our semi-random model admits such instances by allowing points in a cluster to move arbitrarily close to the mean.

Our Results.

Our first result studies the Lloyd's algorithm on the semi-random GMM model and gives an upper bound on the clustering error it achieves when initialized with the procedure used in [KK10].

Informal Theorem 1.2.

Consider any semi-random instance $X$ with $n$ points generated by the semi-random GMM model (Definition 1.1) with planted clustering $C_1, \dots, C_k$ and parameters satisfying a separation of order $\sigma\sqrt{k}$ (up to logarithmic factors) between every pair of means, with each cluster sufficiently large. There is a polynomial time algorithm based on the Lloyd's iterative algorithm that recovers the cluster memberships of all but a small number of points (the precise bound is given in Theorem 3.1).

The asymptotic notation in the above statement hides polylogarithmic factors; please see Theorem 3.1 for a formal statement. Furthermore, we show that in the above result the Lloyd's iterations can be initialized using the popular $k$-means++ algorithm, which uses $D^2$-sampling [AV07]. The work most closely related to ours is that of [KK10] and [AS12], who provided deterministic data conditions under which the Lloyd's algorithm converges to the optimal clustering. Along these lines, our work provides further theoretical justification for the enormous empirical success that the Lloyd's algorithm enjoys.

It is also worth noting that, in spite of being robust to semi-random perturbations, the separation requirement of order $\sigma\sqrt{k}$ in our upper bound matches the separation requirement in the best known guarantees [AS12] for the Lloyd's algorithm even in the absence of any semi-random errors or perturbations (see Section 1.1 for a comparison). (We note that for clustering GMMs, the work of Brubaker and Vempala [BV08] gives a qualitatively different separation condition that does not depend on the maximum variance and can handle Gaussian mixtures that look like “parallel pancakes”. However, this separation condition is incomparable to that of [AS12] because of a potentially worse dependence on $k$.) We also remark that while the algorithm recovers a clustering of the given data that is very close to the planted clustering, this does not necessarily estimate the means of the original Gaussian components up to inverse polynomial accuracy (in fact, the centers of the planted clustering after the semi-random perturbation may be far from the original means). This differs from the recent body of work on parameter estimation in the presence of adversarial noise (please refer to Section 1.1 for a comparison).

While the monotone changes allowed in the semi-random model should only make the clustering task easier, our next result shows that the error achieved by the Lloyd's algorithm is in fact near optimal. More specifically, we provide a lower bound on the number of points that will be misclassified by any $k$-means optimal solution for the instance.

Informal Theorem 1.3.

Given any $n$ (a sufficiently large polynomial in $k$ and $d$) and any separation parameter $\Delta$ in the relevant range, there exists an instance $X$ on $n$ points in $d$ dimensions generated from the semi-random GMM model (Definition 1.1) with parameters $\mu_1, \dots, \mu_k, \sigma$ and planted clustering $C_1, \dots, C_k$ having separation $\Delta$, such that any optimal $k$-means clustering solution of $X$ misclassifies, with high probability, a number of points that matches the upper bound of Informal Theorem 1.2 up to a logarithmic factor.

The above lower bound also holds when the semi-random (monotone) perturbations are applied to points generated from a mixture of spherical Gaussians, each with covariance $\sigma^2 I$ and equal mixing weight. Further, the lower bound holds not just for the optimal $k$-means solution, but also for any “locally optimal” clustering solution. Please see Theorem 4.1 for a formal statement. These two results together show that the Lloyd's algorithm essentially recovers the planted clustering up to the optimal error possible for any $k$-means clustering based algorithm.

Unlike algorithmic results for other semi-random models, an appealing aspect of our algorithmic result is that it gives provably robust guarantees in the semi-random model for a simple, popular algorithm that is used in practice (the Lloyd's algorithm). Further, other approaches to clustering, such as distance-based clustering, the method-of-moments and tensor decompositions, seem inherently non-robust to these semi-random perturbations (see Section 1.1 for details). This robustness of the Lloyd's algorithm suggests an explanation for its widely documented empirical success across different application domains.

Considerations in the choice of the Semi-random GMM model.

Here we briefly discuss different semi-random models and the considerations involved in favoring Definition 1.1. Another semi-random model that comes to mind is one in which the adversary can move each point closer to the mean of its own cluster (closer just in terms of distance, regardless of direction). Intuitively this seems appealing, since it improves the cost of the planted clustering. However, in this model the optimal $k$-means clustering of the perturbed instance can be vastly different from the planted solution. This is because one can move many points in cluster $C_i$ in such a way that they become closer to a different mean $\mu_j$ than to $\mu_i$. For high-dimensional Gaussians it is easy to see that the distance of each point to its own mean will be on the order of $\sigma\sqrt{d}$. Hence, in our regime of interest, the inter-mean separation could be much smaller than the radius of any cluster (when $d \gg k$). Consider an adversary that moves a large fraction of the points in a given cluster to the mean of another cluster. While the distance of these points to their own cluster mean has only decreased, from roughly $\sigma\sqrt{d}$ to around the inter-mean distance $\Delta$, these points now become closer to the mean of a different cluster! In the semi-random GMM model, on the other hand, the adversary is only allowed to move a point along the direction towards its own mean; hence, each point only becomes relatively closer to its own mean than to the means of the other clusters. Our results show that in such a model, the optimal clustering solution can change by at most a small number of points.
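To make the scale comparison concrete, recall the standard norm concentration for a spherical Gaussian in $d$ dimensions; the regime displayed for the separation $\Delta$ is indicative of the discussion above rather than a formal requirement.

\[
\Pr_{y \sim \mathcal{N}(\mu_i,\, \sigma^2 I_d)}\Big[\, \big|\, \|y - \mu_i\| - \sigma\sqrt{d} \,\big| > c\,\sigma\sqrt{\log n} \,\Big] \le \frac{1}{\mathrm{poly}(n)},
\qquad \text{while } \ \Delta = \|\mu_i - \mu_j\| \ll \sigma\sqrt{d} \ \text{ when } d \gg k.
\]

Hence a “distance-only” adversary could relocate such a point from distance roughly $\sigma\sqrt{d}$ down to distance $\Delta$ from $\mu_i$ (for example, by placing it at $\mu_j$): this is a distance-decreasing move for the point, yet it flips which mean is nearest.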

Challenges in the Semi-random GMM model and Overview of Techniques.

Lloyd’s algorithm has been analyzed in the context of clustering mixtures of Gaussians [KK10, AS12]. Any variant of the Lloyd's algorithm consists of two steps: an initialization stage, where a set of initial centers is computed, and an iterative stage, which successively improves the clustering in each step. Kumar and Kannan [KK10] considered a variant of the Lloyd's method where the initialization is given by PCA along with a constant-factor approximation to the $k$-means optimization problem. The improved analysis of this algorithm in [AS12] leads to state-of-the-art results that perfectly recover all the clusters when the separation is of the order of $\sigma\sqrt{k}$.

We analyze the variant of Lloyd's algorithm that was introduced by Kumar and Kannan [KK10]. However, there are several challenges in extending the analysis of [AS12] to the semi-random setting. While the semi-random perturbations in the model only move points in a cluster $C_i$ closer to the mean $\mu_i$, these perturbations can be coordinated in a way that moves the empirical mean of the cluster significantly. For instance, Lemma 4.3 gives a simple semi-random perturbation to the points in $C_i$ that shifts the empirical mean of $C_i$ by a non-trivial amount along any desired direction $v$. This shift in the empirical means may now cause some of the points in cluster $C_i$ to become closer to the empirical mean of a different cluster (in particular, points that have a relatively large projection onto $v$), and vice versa. In fact, the lower bound instance in Theorem 4.1 is constructed by applying such a semi-random perturbation, given by Lemma 4.3, to the points in a cluster along a carefully picked direction, so that many points are misclassified per cluster.
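To illustrate how coordinated monotone moves can bias the empirical mean, here is a toy sketch in the spirit of, but not identical to, the perturbation behind Lemma 4.3: only points whose projection on a chosen direction $v$ is negative are shrunk toward the mean, which biases the empirical mean along $v$ even though every individual point moves only toward its own mean. The function name and interface are illustrative.

```python
import numpy as np

def biased_monotone_perturbation(Y, mu, v, shrink=0.0):
    """Monotone perturbation of one cluster that shifts its empirical mean along v.

    Y:      (m, d) points drawn around the cluster mean mu.
    mu:     (d,) cluster mean.
    v:      (d,) direction along which the adversary wants to bias the mean.
    shrink: factor in [0, 1); points with a negative projection on v are moved
            to mu + shrink * (y - mu), i.e., strictly toward mu, which is an
            allowed semi-random move.
    """
    v = v / np.linalg.norm(v)
    proj = (Y - mu) @ v
    X = Y.copy()
    neg = proj < 0
    X[neg] = mu + shrink * (Y[neg] - mu)  # each moved point only gets closer to mu
    return X

# The empirical mean of X is biased in the +v direction relative to that of Y,
# since the points pulling the mean in the -v direction were collapsed toward mu.
```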

The main algorithmic contribution of the paper is an analysis of the Lloyd's iterative algorithm when the points come from the semi-random GMM model. The key is to understand the number of points that can be misclassified in an intermediate step of the Lloyd's iterations. We show in Lemma 3.3 that if, in the current iteration of the Lloyd's algorithm, each of the current estimates of the means is sufficiently close to the corresponding true mean $\mu_i$, then the number of points misclassified in the current iteration is small. This relies crucially on Lemma 2.11, which upper bounds the number of points $x$ in a cluster $C_i$ such that $x - \mu_i$ has a large inner product with any (potentially bad) direction $v$.

The effect of these bad points has to be carefully accounted for when analyzing both stages of the algorithm: the initialization phase and the iterative algorithm. Proposition 3.2 argues about the closeness of the initial centers to the true means. As in [KK10], these initial centers are obtained via a boosting technique that first maps the points to an expanded feature space and then uses the ($k$-SVD + $k$-means approximation) procedure to get initial centers. When using this approach for semi-random data, one needs to carefully argue about how the set of bad points behaves in the expanded feature space. This is done in Lemmas B.1 and B.2. Given the initial centers, it is not hard to see that the analysis of [AS12] can be carried out to argue about the improvement made in each Lloyd's iteration; however, this leads to a sub-optimal error bound. Instead, we perform a much finer analysis for the semi-random model to control the effect of the bad points and achieve nearly optimal error bounds. This is done in Lemma 3.4.

1.1 Related Work

There has been a long line of algorithmic results on Gaussian mixture models, starting from [Tei61, Tei67, Pea94]. These results fall into two broad categories: (1) clustering algorithms, which aim to recover the component/cluster memberships of the points, and (2) parameter estimation, where the goal is to estimate the parameters of the Gaussian components. When the components of the mixture are sufficiently well separated, i.e., the separation between the means is large compared to $\sigma\sqrt{\log n}$, then the Gaussians essentially do not overlap, and the two tasks become equivalent w.h.p. We now review the different algorithms that have been designed for these two tasks and comment on their robustness to semi-random perturbations.

Clustering Algorithms.

The first polynomial time algorithmic guarantees were given by Dasgupta [Das99], who showed how to cluster a mixture of Gaussians with identical covariance matrices when the separation between the cluster means is of the order of $\sigma\sqrt{d}$, where $\sigma^2$ denotes the maximum variance of any cluster along any direction (up to additional factors depending on the mixing weights and logarithmic terms). Distance-based clustering algorithms, which are based on strong distance-concentration properties of high-dimensional Gaussians, improved the separation requirement between the means $\mu_i$ and $\mu_j$ to be of order $(\sigma_i + \sigma_j)\, d^{1/4}$ [AK01, DS07], where $\sigma_i^2$ denotes the maximum variance of points in cluster $i$ along any direction. Vempala and Wang [VW04] and subsequent results [KSV08, AM05] used PCA to project down to $k$ dimensions (when $k \ll d$), and then used the above distance-based algorithms to get state-of-the-art guarantees for many settings: for spherical Gaussians, a separation of roughly $\sigma k^{1/4}$ (up to logarithmic factors) suffices [VW04]. For non-spherical Gaussians, a separation depending polynomially on $k$ (rather than on the dimension $d$) is known to suffice [AM05, KSV08]. Brubaker and Vempala [BV08] gave a qualitative improvement on the separation requirement for non-spherical Gaussians by having a dependence only on the variance along the direction of the line joining the respective means, as opposed to the maximum variance along any direction.

Recent work has also focused on provable guarantees for heuristics such as the Lloyd's algorithm for clustering mixtures of Gaussians [KK10, AS12]. Iterative algorithms like the Lloyd's algorithm (also called the $k$-means algorithm) [Llo82] and its variants like $k$-means++ [AV07] are the methods of choice for clustering in practice. The best known guarantee along these lines [AS12] requires a separation of order $\sigma\sqrt{k}$ between any pair of means, where $\sigma^2$ is the maximum variance among all clusters along any direction. To summarize, for a mixture of $k$ Gaussians in $d$ dimensions with the variance of each cluster bounded by $\sigma^2$ in every direction, the state-of-the-art guarantees require a separation of roughly $\sigma k^{1/4}$ between the means of any two components for spherical Gaussians [VW04], while a separation of order $\sigma\sqrt{k}$ is known to suffice for non-spherical Gaussians [AS12].

The techniques in many of the above works rely on strong distance-concentration properties of high-dimensional Gaussians. For instance, the arguments of [AK01, VW04] that obtain the separation bounds above crucially rely on the tight concentration of the squared distance between any pair of points in the same cluster around its expectation (roughly $2\sigma^2 d$ for spherical clusters). These arguments do not seem to carry over to the semi-random model. Brubaker [Bru09] gave a robust algorithm for clustering a mixture of Gaussians when at most an $\epsilon$ fraction of the points are corrupted arbitrarily. However, it is unclear if those arguments can be modified to work under the semi-random model, since the perturbations can potentially affect all the points in the instance. On the other hand, our results show that the Lloyd's algorithm of Kumar and Kannan [KK10] is robust to these semi-random perturbations.

Parameter Estimation.

A different approach is to design algorithms that estimate the parameters of the underlying Gaussian mixture model; assuming the means are well separated, accurate clustering can then be performed. A very influential line of work focuses on the method-of-moments [KMV10, MV10, BS10] to learn the parameters of the model when the number of clusters $k$ is a constant. Moment methods (necessarily) require running time (and sample complexity) that is exponential in $k$, but do not assume any explicit separation between the components of the mixture. Recent work [HK13, BCV14, GVX14, BCMV14, ABG14, GHK15] uses the uniqueness of tensor decompositions (of order 3 and above) to implement the method of moments and gives polynomial time algorithms, assuming the dimension is sufficiently large and the means do not lie in certain degenerate configurations [HK12, GVX14, BCMV14, ABG14, GHK15].

Algorithmic approaches based on the method-of-moments and tensor decompositions rely heavily on the exact parametric form of the Gaussian distribution and on exact algebraic expressions for the moments of the distribution in terms of the parameters. These algebraic methods can be easily foiled by a monotone adversary, since the adversary can perturb any subset of points to alter the moments significantly (for example, even the first moment, i.e., the mean of a cluster, can change by a non-trivial amount, as in Lemma 4.3).

Recent work has also focused on provable guarantees for heuristics such as maximum likelihood estimation and the Expectation Maximization (EM) algorithm for parameter estimation [DS07, BWY14, XHM16, DTZ16]. Very recently, [RV17] considered other iterative algorithms for parameter estimation and studied the optimal order of separation required for this task. However, we are not aware of any existing analysis showing that these iterative algorithms for parameter estimation are robust to modeling errors.

Another recent line of exciting work concerns designing robust high-dimensional estimators of the mean and covariance of a single Gaussian (and of mixtures of Gaussians) when an $\epsilon$ fraction of the points are adversarially corrupted [DKK16, LRV16, CSV17]. However, these results, and similar results on agnostic learning, do not necessarily recover the ground-truth clustering. Further, they typically assume that only a small fraction of the points are corrupted, while potentially all of the points can be perturbed in the semi-random model. On the other hand, our work does not necessarily give guarantees for estimating the means of the original Gaussians (in fact, the centers given by the planted clustering in the semi-random instance can be far from the original means). Hence, our semi-random model is incomparable to the model of robustness considered in these works.

Semi-random models for other optimization problems.

There has been a long line of work on the study of semi-random models for various optimization problems. Blum and Spencer [BS95] initiated the study of semi-random models, and studied the problem of graph coloring. Feige and Kilian [FK98] considered semi-random models involving monotone adversaries for various problems including graph partitioning, independent set and clique. Makarychev et al. [MMV12, MMV14] designed algorithms for more general semi-random models for various graph partitioning problems. The work of [MPW15] studied the power of monotone adversaries in the context of community detection (stochastic block models), while [MMV16] considered the robustness of community detection to monotone adversaries and different kinds of errors and model misspecification. Semi-random models have also been studied for correlation clustering [MS10, MMV15], noisy sorting [MMV13] and coloring [DF16].

2 Preliminaries and Semi-random model

We first formally define the Gaussian mixture model.

Definition 2.1.

(Gaussian Mixture Model). A Gaussian mixture model with $k$ components is defined by the parameters $\{(\mu_i, \Sigma_i, w_i) : i \in [k]\}$. Here $\mu_i \in \mathbb{R}^d$ is the mean of component $i$ and $\Sigma_i$ is the corresponding covariance matrix; $w_i$ is the mixing weight, and we have $\sum_{i=1}^{k} w_i = 1$. An instance $X$ of $n$ points from the mixture is generated as follows: for each $u \in [n]$, sample a component $i \in [k]$ independently at random with probability $w_i$; given the component, sample $x_u$ from $\mathcal{N}(\mu_i, \Sigma_i)$. The points can be naturally partitioned into $k$ clusters $C_1, \dots, C_k$, where cluster $C_i$ corresponds to the points that are sampled from component $i$. We will refer to this as the planted clustering or ground-truth clustering.

Clustering data from a mixture of Gaussians is a natural average-case model for the $k$-means clustering problem. Specifically, if the means of a Gaussian mixture model are well separated, then with high probability the ground truth clustering of an instance sampled from the model corresponds to the $k$-means optimal clustering.

Definition 2.2.

($k$-means clustering). Given an instance $X$ of $n$ points in $\mathbb{R}^d$, the $k$-means problem is to find $k$ points (centers) $c_1, \dots, c_k$ so as to minimize $\sum_{x \in X} \min_{j \in [k]} \|x - c_j\|^2$.

The optimal centers naturally define a clustering of the data in which each point is assigned to its closest center. A key property of the $k$-means objective is that the optimal solution induces a locally optimal clustering.

Definition 2.3.

(Locally Optimal Clustering). A clustering $C_1, \dots, C_k$ of data points in $\mathbb{R}^d$ is locally optimal if for each $i \in [k]$, each $x \in C_i$, and each $j \neq i$, we have that $\|x - \bar{\mu}_i\| \le \|x - \bar{\mu}_j\|$. Here $\bar{\mu}_i$ is the average of the points in $C_i$.

Hence, given the optimal $k$-means clustering, the optimal centers can be recovered by simply computing the average of each cluster. This is the underlying principle behind the popular Lloyd's algorithm [Llo82] for $k$-means clustering. The algorithm starts with a choice of $k$ initial centers. It then repeatedly computes new centers to be the averages of the clusters induced by the current centers. Hence the algorithm converges to a locally optimal clustering. Although popular in practice, the worst-case performance of Lloyd's algorithm can be arbitrarily bad [AV05]. The choice of initial centers is very important to the success of the Lloyd's algorithm. We show that our theoretical guarantees hold when the initialization is done via the popular $k$-means++ algorithm [AV07]. There also exist more sophisticated constant factor approximation algorithms for the $k$-means problem [KMN02, ANSW16] that can be used for seeding in our framework.
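As a reference point for the discussion above, here is a compact sketch of Lloyd's iterations seeded with $k$-means++ ($D^2$-sampling); it is illustrative code rather than the exact procedure analyzed in this paper.

```python
import numpy as np

def kmeans_pp_seed(X, k, rng=None):
    """k-means++ seeding: pick each new center with probability proportional to
    the squared distance from the nearest center chosen so far (D^2 sampling)."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def lloyd(X, k, iters=50, rng=None):
    """Lloyd's algorithm: assign each point to its nearest center, then
    recenter each cluster at the mean of its assigned points."""
    centers = kmeans_pp_seed(X, k, rng)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```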

While the clustering $C_1, \dots, C_k$ typically represents a partition of the index set $[n]$, we will sometimes abuse notation and use $C_i$ to also denote the set of points in $X$ corresponding to these indices. Finally, many of the statements are probabilistic in nature, depending on the randomness in the semi-random model. In the following sections, w.h.p. will refer to a probability of at least $1 - 1/\mathrm{poly}(n)$, unless specified otherwise.

2.1 Properties of Semi-random Gaussians

In this section we state and prove properties of semi-random mixtures that will be used throughout the analysis in the subsequent sections. We first start with a couple of simple lemmas that follow directly from the corresponding lemmas about high dimensional Gaussians.

Lemma 2.4.

Consider any semi-random instance $X$ with parameters $(\mu_1, \dots, \mu_k, \sigma)$ and clusters $C_1, \dots, C_k$. Then with high probability we have

(1)
Proof.

For each $u \in C_i$, let $y_u$ denote the point generated in the semi-random model in step 2 (Definition 1.1) before the semi-random perturbation was applied, and write $x_u = \mu_i + \alpha_u (y_u - \mu_i)$ where $\alpha_u \in [0,1]$. Since the semi-random perturbation only moves each point toward its mean, the claimed bound follows by applying Lemma A.3 to the Gaussian points $y_u$. ∎

Lemma 2.5.

Consider any semi-random instance $X$ with parameters $(\mu_1, \dots, \mu_k, \sigma)$ and clusters $C_1, \dots, C_k$, and let $v$ be a fixed unit vector in $\mathbb{R}^d$. Then with probability at least $1 - 1/\mathrm{poly}(n)$ we have

(2)
Proof.

For each $u \in C_i$, let $y_u$ denote the point generated in the semi-random model in step 2 (Definition 1.1) before the semi-random perturbation was applied, and write $x_u = \mu_i + \alpha_u (y_u - \mu_i)$ where $\alpha_u \in [0,1]$.

Consider a sample $u \in C_i$. Let $\Sigma_i$ be the covariance matrix of the $i$th Gaussian component; hence $\|\Sigma_i\| \le \sigma^2$. The projection $\langle y_u - \mu_i, v \rangle$ is a Gaussian with mean $0$ and variance at most $\sigma^2$, so Lemma A.1 gives the corresponding tail bound; moreover, $|\langle x_u - \mu_i, v \rangle| = \alpha_u |\langle y_u - \mu_i, v \rangle| \le |\langle y_u - \mu_i, v \rangle|$. Hence, a union bound over all $n$ samples yields the lemma. ∎

The above lemma immediately implies the following lemma, after a union bound over the unit vectors along the directions $\mu_i - \mu_j$ for $i \neq j$.

Lemma 2.6.

Consider any semi-random instance $X$ with parameters $(\mu_1, \dots, \mu_k, \sigma)$ and clusters $C_1, \dots, C_k$. Then with high probability we have

(3)

We next state a lemma about how far the mean of the points in a component of a semi-random GMM can move away from the true parameters.

Lemma 2.7.

Consider any semi-random instance $X$ with $n$ points generated with parameters $(\mu_1, \dots, \mu_k, \sigma)$ such that each cluster $C_i$ is sufficiently large. Then with probability at least $1 - 1/\mathrm{poly}(n)$ we have that

(4)
Proof.

For each point $x_u$ in the semi-random GMM, let $y_u$ be the original Gaussian point that was modified to produce $x_u$; then $x_u - \mu_i = \alpha_u (y_u - \mu_i)$ where $\alpha_u \in [0,1]$. Hence, the deviation of the empirical mean of $C_i$ from $\mu_i$ can be written as $\frac{1}{|C_i|} M D \mathbf{1}$, where $M$ is the matrix whose columns are $y_u - \mu_i$ for $u \in C_i$, $D$ is the diagonal matrix with entries $\alpha_u$, and $\mathbf{1}$ is the all-ones vector. Since $\|D\mathbf{1}\| \le \sqrt{|C_i|}$, this deviation has norm at most $\|M\| / \sqrt{|C_i|}$, and the claim follows from the spectral norm bound on $M$ (Lemma A.5). ∎

The next lemma bounds the variance of the points of a component $C_i$ around its mean $\mu_i$ in a semi-random GMM.

Lemma 2.8.

Consider any semi-random instance $X$ with $n$ points generated with parameters $(\mu_1, \dots, \mu_k, \sigma)$ such that each cluster $C_i$ is sufficiently large. Then with probability at least $1 - 1/\mathrm{poly}(n)$ we have that

(5)
Proof.

Exactly as in the proof of Lemma 2.7, we can write $x_u - \mu_i = \alpha_u (y_u - \mu_i)$ with $\alpha_u \in [0,1]$, so that $\|x_u - \mu_i\| \le \|y_u - \mu_i\|$ for all $u$. Furthermore, since the $y_u$ are points from a Gaussian, the corresponding bound holds for the unperturbed points with probability at least $1 - 1/\mathrm{poly}(n)$. Hence, the claim follows. ∎

We will also need to argue about the mean of a large subset of points from a component of a semi-random GMM.

Lemma 2.9.

Consider any semi-random instance $X$ with $n$ points generated with parameters $(\mu_1, \dots, \mu_k, \sigma)$ and planted clustering $C_1, \dots, C_k$ such that each cluster is sufficiently large. Let $Z \subseteq C_i$ be a sufficiently large subset of one of the clusters. Then, with probability at least $1 - 1/\mathrm{poly}(n)$, we have that

(6)
Proof.

Let $C_i$ be the set of points in component $i$ and let $\hat{\mu}_i$ be the mean of the points in $C_i$. Notice that, from Lemma 2.7 and the fact that the perturbation is semi-random, $\hat{\mu}_i$ is close to $\mu_i$ with probability at least $1 - 1/\mathrm{poly}(n)$. Also, because the component is a semi-random perturbation of a Gaussian, Lemma 2.8 bounds the total variance of the points in $C_i$ around $\mu_i$ with probability at least $1 - 1/\mathrm{poly}(n)$.

Hence, it remains to bound the distance between the mean of $Z$ and $\hat{\mu}_i$, which is determined by the points in $C_i \setminus Z$. Writing this difference as an inner product with a unit vector $v$ in its direction and using the Cauchy-Schwarz inequality, the contribution of the points in $C_i \setminus Z$ is at most $\sqrt{|C_i \setminus Z|} \cdot \big(\sum_{u \in C_i} \|x_u - \mu_i\|^2\big)^{1/2}$ (up to normalization). Combining the two bounds gives us the result. ∎

Finally, we argue about the variance of the entire data matrix of a semi-random GMM.

Lemma 2.10.

Consider any semi-random instance $X$ with $n$ points generated with parameters $(\mu_1, \dots, \mu_k, \sigma)$ such that each cluster is sufficiently large. Let $A$ be the matrix of data points and let $C$ be the matrix composed of the (empirical) means of the corresponding clusters. Then, with probability at least $1 - 1/\mathrm{poly}(n)$, we have that

(7)
Proof.

Let $M$ be the matrix of true means corresponding to the cluster memberships, i.e., the row of $M$ corresponding to a point $u \in C_i$ is $\mu_i$. We can write $\|A - C\| \le \|A - M\| + \|M - C\|$. Using Lemma 2.7, with probability at least $1 - 1/\mathrm{poly}(n)$, every row of $M - C$ has small norm, which bounds $\|M - C\|$. Furthermore, $\|A - M\|^2 \le \|A - M\|_F^2 = \sum_{i \in [k]} \sum_{u \in C_i} \|x_u - \mu_i\|^2$, and from Lemma 2.8, with probability at least $1 - 1/\mathrm{poly}(n)$, this sum is bounded. Combining the two bounds we get the claim. ∎

The following lemma is crucial in analyzing the performance of the Lloyd's algorithm. We would like to upper bound the inner product $\langle x - \mu_i, v \rangle$ for every direction $v$ and every sample $x \in C_i$, but this is impossible since $v$ can be aligned along $x - \mu_i$. The following lemma instead upper bounds the total number of points in the dataset that can have a large projection onto any single direction $v$. This involves a union bound over a net of all possible directions $v$.
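Schematically, arguments of this kind combine a one-dimensional Gaussian tail bound with a union bound over an $\epsilon$-net of directions; the display below is a generic template of the argument, with illustrative thresholds and constants rather than the ones used in the formal proof.

\[
\Pr\big[\langle y_u - \mu_i, v \rangle \ge t\sigma\big] \le e^{-t^2/2} \ \text{ for any fixed unit vector } v,
\qquad |\mathcal{N}_\epsilon| \le (1 + 2/\epsilon)^d,
\]

so for a fixed direction the number of points whose projection exceeds the threshold is stochastically dominated by a $\mathrm{Binomial}(n, e^{-t^2/2})$ random variable; a Chernoff bound followed by a union bound over the net controls this count simultaneously for all directions, and the monotone perturbation only shrinks each projection in absolute value, since $\langle x_u - \mu_i, v \rangle = \alpha_u \langle y_u - \mu_i, v \rangle$ with $\alpha_u \in [0,1]$.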

Lemma 2.11 (Points in Bad Directions).

Consider any semi-random instance $X$ with $n$ points having parameters $(\mu_1, \dots, \mu_k, \sigma)$ and planted clustering $C_1, \dots, C_k$, and suppose $n$ is sufficiently large. Then there exists a universal constant $c > 0$ such that, for any suitable projection threshold, with probability at least $1 - 1/\mathrm{poly}(n)$ we have that

(8)
Proof.

Set the net granularity $\epsilon$ and the projection threshold appropriately for the argument below. Consider an $\epsilon$-net $\mathcal{N}$ over unit vectors in $\mathbb{R}^d$; hence $|\mathcal{N}| \le (1 + 2/\epsilon)^d$.

Further, since $\mathcal{N}$ is an $\epsilon$-net, for any unit vector there exists some unit vector $v' \in \mathcal{N}$ within distance $\epsilon$ of it, so that any point with a large projection onto the original direction also satisfies

(9)

for our choice of $\epsilon$. Consider a fixed $u \in C_i$ and a fixed direction $v' \in \mathcal{N}$. Since the variance of $\langle y_u - \mu_i, v' \rangle$ is at most $\sigma^2$, we have a Gaussian tail bound on the probability that $y_u$ (and hence $x_u$, since the monotone perturbation only shrinks the projection) satisfies (9).

The probability that many points in $X$ satisfy (9) for a fixed direction $v'$ is then controlled by a Chernoff bound. Let $\mathcal{E}$ represent the bad event that there exists a direction in $\mathcal{N}$ for which more than the claimed number of points satisfy (9). By a union bound over $\mathcal{N}$, the probability of $\mathcal{E}$ is at most $|\mathcal{N}|$ times the bound for a fixed direction, which is small for our choice of parameters. ∎

3 Upper Bounds for Semi-random GMMs

In this section we prove the following theorem that provides algorithmic guarantees for the Lloyd’s algorithm with appropriate initialization, under the semi-random model for mixtures of Gaussians in Definition 1.1.

Theorem 3.1.

There exists a universal constant $c > 0$ such that the following holds. There is a polynomial time algorithm that, for any semi-random instance $X$ on $n$ points with planted clustering $C_1, \dots, C_k$ generated by the semi-random GMM model (Definition 1.1) with parameters satisfying the separation condition

(10)

finds w.h.p. a clustering $C'_1, \dots, C'_k$ that misclassifies at most a small number of points with respect to the planted clustering.

In Section 4 we show that the above error bound is close to the information-theoretically optimal bound (up to a logarithmic factor). The Lloyd's algorithm, as described in Figure 1, consists of two stages: the initialization stage and an iterative improvement stage.

Let $A$ be the $n \times d$ data matrix with rows $x_u$ for $u \in [n]$. Use $A$ to compute $k$ initial centers $\nu_1^{(0)}, \dots, \nu_k^{(0)}$ as detailed in Proposition 3.2. Use these $k$ centers to seed a series of Lloyd-type iterations. That is, for $t = 0, 1, 2, \dots$ do: set $S_i^{(t)}$ to be the set of points for which the closest center among $\nu_1^{(t)}, \dots, \nu_k^{(t)}$ is $\nu_i^{(t)}$; set $\nu_i^{(t+1)}$ to be the mean of the points in $S_i^{(t)}$.

Figure 1: Lloyd’s Algorithm

The initialization follows the same scheme as proposed by Kumar and Kannan in [KK10]. The initialization algorithm first performs a $k$-SVD of the data matrix, followed by running the $k$-means++ algorithm [AV07], which uses $D^2$-sampling, to compute seed centers. One can also use any constant factor approximation algorithm for $k$-means clustering in the projected space to obtain the initial centers [KMN02, ANSW16]. This approach works for clusters that are nearly balanced in size. However, when the cluster sizes are arbitrary, an appropriate transformation of the data is performed first that amplifies the separation between the centers. Following this transformation, the ($k$-SVD + $k$-means++) procedure is used to get the initial centers. The formal guarantee of the initialization procedure is encapsulated in the following proposition, whose proof is given in Section 3.2.
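The following sketch illustrates the ($k$-SVD + $k$-means++) initialization for the nearly balanced case; the transformation used for arbitrary cluster sizes is omitted, and the function names are illustrative rather than the paper's.

```python
import numpy as np

def ksvd_project(A, k):
    """Project the rows of A onto the span of its top-k right singular vectors
    (assumes k <= min(n, d))."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    P = Vt[:k].T @ Vt[:k]            # projection matrix onto the top-k subspace
    return A @ P

def initialize_centers(A, k, seed=None):
    """k-SVD followed by k-means++ (D^2) seeding in the projected space."""
    rng = np.random.default_rng(seed)
    Ahat = ksvd_project(A, k)
    centers = [Ahat[rng.integers(len(Ahat))]]
    for _ in range(k - 1):
        # Same D^2 seeding as in the earlier Lloyd's sketch.
        d2 = np.min([np.sum((Ahat - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Ahat[rng.choice(len(Ahat), p=d2 / d2.sum())])
    return np.array(centers)
```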

The main algorithmic contribution of this paper is an analysis of the Lloyd's algorithm when the points come from the semi-random GMM model. For the rest of the analysis we will assume that the instance generated from the semi-random GMM model satisfies conditions (1) to (8). These eight conditions are shown to hold w.h.p. in Section 2.1 for instances generated from the model. Our analysis will in fact hold for any deterministic data set satisfying these conditions. This lets us argue cleanly about performing many iterations of Lloyd's algorithm on the same data set, without the need to draw fresh samples at each step.

Proposition 3.2.

In the above notation, suppose we are given an instance $X$ on $n$ points satisfying (1)-(8) together with the separation condition (10). Then after the initialization step, for every $i \in [k]$ there exists an initial center $\nu_j^{(0)}$ such that $\|\nu_j^{(0)} - \mu_i\|$ is at most a small constant fraction of the minimum separation between the means.

The analysis of the Lloyd's iterations crucially relies on the following lemma, which upper bounds the number of misclassified points when the current Lloyd's iterates are relatively close to the true means.

Lemma 3.3 (Projection condition).

In the above notation, consider an instance $X$ satisfying (1)-(8) and (10), and suppose we are given centers $\nu_1, \dots, \nu_k$ that are each sufficiently close to the corresponding true mean. Then there exists a small set of points $B$ such that for any point $x \in C_i \setminus B$, the current iteration assigns $x$ to the cluster of $\nu_i$.

The following lemma quantifies the improvement in each step of the Lloyd’s algorithm. The proof uses Lemma 3.3 along with properties of semi-random Gaussians.

Lemma 3.4.

In the above notation, suppose we are given an instance $X$ on $n$ points satisfying (1)-(8). Furthermore, suppose we are given centers $\nu_1, \dots, \nu_k$ such that each $\nu_i$ is within a prescribed distance of $\mu_i$. Then the centers $\nu'_1, \dots, \nu'_k$ obtained after one Lloyd's update are correspondingly closer to the true means, for all $i \in [k]$.

We now present the proof of Theorem 3.1.

Proof of Theorem 3.1.

Firstly, the eight deterministic conditions (1)-(8) are shown to hold w.h.p. for the instance $X$ in Section 2.1. The proof then follows in a straightforward manner by combining Proposition 3.2, Lemma 3.4 and Lemma 3.3. Proposition 3.2 shows that every true mean $\mu_i$ has a nearby initial center. Applying Lemma 3.4 repeatedly, after sufficiently many iterations the current centers become close enough to the true means w.h.p. Finally, applying Lemma 3.3 with these centers, the theorem follows. ∎

3.1 Analyzing Lloyd’s Algorithm

We now analyze each iteration of the Lloyd’s algorithm and show that we make progress in each step by misclassifying fewer points with successive iterations. As a first step we begin with the proof of Lemma 3.3.

Proof of Lemma 3.3.

Let $B$ be the set of points that have a large projection onto some direction, which is guaranteed to be small by Lemma 2.11, with the projection threshold chosen in terms of the separation.

Fix a sample $x \in C_i$, let $y$ be the corresponding point before the semi-random perturbation, and write $x = \mu_i + \alpha (y - \mu_i)$ with $\alpha \in [0,1]$. For each $j \neq i$, let $e_{ij}$ be the unit vector along $\mu_j - \mu_i$.

We first observe that, by projecting the Gaussians around $\mu_i$ onto the direction $e_{ij}$, we have that

(11)

where the first inequality follows from (3), and the second inequality uses the separation condition (10).

Suppose $x$ is misclassified, i.e., $\|x - \nu_j\| \le \|x - \nu_i\|$ for some $j \neq i$. Then, expanding this condition and using the fact that the current centers $\nu_i, \nu_j$ are close to $\mu_i, \mu_j$, we obtain a lower bound on the projection $\langle x - \mu_i, e_{ij} \rangle$. Hence, we have that if $x$ is misclassified by