Clustering subgaussian mixtures by semidefinite programming

02/22/2016 · by Dustin G. Mixon, et al.

We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei. The algorithm interprets the SDP output as a denoised version of the original data and then rounds this output to a hard clustering. We provide a generic method for proving performance guarantees for this algorithm, and we analyze the algorithm in the context of subgaussian mixture models. We also study the fundamental limits of estimating Gaussian centers by k-means clustering in order to compare our approximation guarantee to the theoretically optimal k-means clustering solution.

1 Introduction

Consider the following mixture model: For each t ∈ {1, …, k}, let 𝒟_t be an unknown subgaussian probability distribution over R^m with first moment γ_t and second moment matrix whose largest eigenvalue is σ_t². For each t, an unknown number n_t of random points is drawn independently from 𝒟_t. Given the points along with the model order k, the goal is to approximate the centers γ_1, …, γ_k. How many points must be drawn, and how separated must the centers be relative to the σ_t's, in order to have sufficient signal for successful approximation?
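For concreteness, the following is a minimal sketch (in Python with NumPy; the parameter values are illustrative and not taken from our experiments) of how data from such a mixture model might be generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: k spherical Gaussian components in R^m,
# n_per points per component, noise level sigma.
k, m, n_per, sigma = 3, 6, 100, 0.5
centers = 5.0 * rng.standard_normal((k, m))      # unknown centers (one per row)
labels = np.repeat(np.arange(k), n_per)          # planted component of each point
points = centers[labels] + sigma * rng.standard_normal((k * n_per, m))

# The clustering task: given only `points` and the model order k,
# recover good estimates of `centers` (and hence of the planted partition).
```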

For the most popular instance of this problem, where the subgaussian distributions are Gaussians, theoretical guarantees date back to the work of Dasgupta [Das99]. Dasgupta introduced an algorithm based on random projections and showed that this algorithm well-approximates centers of Gaussians in R^m that are separated on the order of σ√m. Since Dasgupta's seminal work, performance guarantees for several algorithmic alternatives have emerged, including expectation maximization [DS07], spectral methods [VW04, KK10, AS12], projections (random and deterministic) [MV10, AK05], and the method of moments [MV10]. Every existing performance guarantee has one of two forms:

  • (a) the algorithm correctly clusters all points according to Gaussian mixture component, or

  • (b) the algorithm well-approximates the center of each Gaussian (à la Dasgupta [Das99]).

Results of type (a), which include [VW04, KK10, AS12, AM07], require the minimum separation between the Gaussian centers to carry an additional multiplicative factor that is polylogarithmic in N, where N is the total number of points. This stems from a requirement that every point be closer (in some sense) to its own Gaussian's center than to the other centers, so that the problem of cluster recovery is well-posed. We note that in the case of spherical Gaussians, such highly separated Gaussian components may be truncated so as to match a different data model known as the stochastic ball model, where the semidefinite program we use in this paper is already known to be tight with high probability [ABC15, IMPV15].

Results of type (b) tend to be specifically tailored to exploit unique properties of the Gaussian distribution, and are thus not easily generalizable to other data models. For instance, if X has distribution N(γ, σ²I_m), then 𝔼‖X − γ‖² = mσ², and concentration of measure implies that in high dimensions, most of the points will reside in a thin shell centered at γ with radius about σ√m. This sort of behavior can be exploited to cluster even concentric Gaussians as long as the covariances are sufficiently different. However, algorithms that perform well even with no separation between centers require a sample complexity that is exponential in k [MV10].
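The thin-shell phenomenon is easy to check numerically; the sketch below (with illustrative parameters of our own choosing) draws points from a single high-dimensional spherical Gaussian and verifies that their norms concentrate around σ√m.

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma, n = 1000, 1.0, 5000

# Norms of points from N(0, sigma^2 I_m) concentrate around sigma * sqrt(m),
# with fluctuations of constant order (independent of m).
g = sigma * rng.standard_normal((n, m))
norms = np.linalg.norm(g, axis=1)
print(norms.mean(), norms.std())   # roughly 31.6 and 0.7 for these parameters
```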

In this paper, we provide a performance guarantee of type (b), but our approach is model-free. In particular, we consider the k-means clustering objective:

minimize   Σ_{t=1}^k Σ_{i∈A_t} ‖ x_i − (1/|A_t|) Σ_{j∈A_t} x_j ‖²        (1)
subject to   A_1 ⊔ ⋯ ⊔ A_k = {1, …, N}

Letting D denote the N×N matrix defined entrywise by D_{ij} = ‖x_i − x_j‖², a straightforward calculation gives the following "lifted" expression for the k-means objective:

Σ_{t=1}^k Σ_{i∈A_t} ‖ x_i − (1/|A_t|) Σ_{j∈A_t} x_j ‖²  =  (1/2) Tr(DX),   where  X = Σ_{t=1}^k (1/|A_t|) 1_{A_t} 1_{A_t}^T.        (2)

The matrix X necessarily satisfies various convex constraints, and relaxing to such constraints leads to the following semidefinite relaxation of (1), first introduced by Peng and Wei in [PW07]:

minimize   (1/2) Tr(DX)        (3)
subject to   Tr(X) = k,   X1 = 1,   X ≥ 0,   X ⪰ 0

Here, X ≥ 0 means that X is entrywise nonnegative, whereas X ⪰ 0 means that X is symmetric and positive semidefinite.
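For reference, the following sketch states the relaxation (3) in code. It uses the generic modeling package cvxpy (our actual experiments use SDPNAL+ [YST15]); it is only practical for small instances.

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(points, k):
    """Solve the Peng-Wei k-means SDP (3) for an N x m array of points.
    Small-scale sketch; large instances call for a specialized solver."""
    N = points.shape[0]
    sq = np.sum(points**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * points @ points.T  # D_ij = ||x_i - x_j||^2

    X = cp.Variable((N, N), symmetric=True)
    constraints = [
        X >> 0,                          # X is positive semidefinite
        X >= 0,                          # X is entrywise nonnegative
        cp.trace(X) == k,                # Tr(X) = k
        X @ np.ones(N) == np.ones(N),    # X 1 = 1
    ]
    cp.Problem(cp.Minimize(0.5 * cp.trace(D @ X)), constraints).solve()
    return X.value
```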

As mentioned earlier, this semidefinite relaxation is known to be tight for a particular data model called the stochastic ball model [NW15, ABC15, IMPV15]. In this paper, we study its performance under subgaussian mixture models, which include the stochastic ball model and the Gaussian mixture model as special cases. The SDP is not typically tight under this general model, but the optimizer can be interpreted as a denoised version of the data and can be rounded in order to produce a good estimate for the centers (and therefore produce a good clustering).

To see this, let P denote the m×N matrix whose columns are the points x_1, …, x_N. Notice that whenever X has the form (2), the columns of PX indexed by A_t all equal the centroid of the points assigned to A_t. In particular, if X is k-means-optimal, then PX reports the k-means-optimal centroids (with appropriate multiplicities). Next, we note that every SDP-feasible matrix X satisfies X1 = 1 and X ≥ 0, and so X is a stochastic matrix, meaning each column of PX is still a weighted average of columns from P. Intuitively, if the SDP relaxation (3) were close to being tight, then the SDP-optimal X would make the columns of PX close to the k-means-optimal centroids. Empirically, this appears to be the case (see Figure 1 for an illustration). Overall, we may interpret PX as a denoised version of the original data P, and we leverage this strengthened signal to identify good estimates for the k-means-optimal centroids.

Figure 1: (a) Draw points at random from each of three spherical Gaussians over R². These points form the columns of a matrix P. (b) Compute the squared-distance matrix D from the data in (a), and solve the k-means semidefinite relaxation (3) using SDPNAL+v0.3 [YST15]. (The computation takes about 16 seconds on a standard MacBook Air laptop.) Given the optimizer X, compute PX and plot its columns. We interpret this as a denoised version of the original data P. (c) The points in (b) land in three particular locations with particularly high frequency. Take these locations to be estimators of the centers of the original Gaussians. (d) Use the estimates for the centers in (c) to partition the original data into three subsets, thereby estimating the k-means-optimal partition.

What follows is a summary of our relax-and-round procedure for (approximately) solving the k-means problem (1):

Relax-and-round k-means clustering procedure. Given k and the data matrix P, do: (i) Compute the squared-distance matrix D defined by D_{ij} = ‖x_i − x_j‖². (ii) Solve the k-means semidefinite program (3), resulting in an optimizer X. (iii) Cluster the columns of the denoised data matrix PX.

For step (iii), we find that there tend to be vectors that appear as columns of PX with particularly high frequency, and so we are inclined to use these as estimators for the k-means-optimal centroids (see Figure 1, for example). Running Lloyd's algorithm for step (iii) also works well in practice. To obtain theoretical guarantees, we instead find the columns of PX for which balls of a certain radius centered at these points contain the most columns of PX (see Theorem 15 for more details); a simplified version of this rounding appears in the sketch below. An implementation of our procedure is available on GitHub [Vil16].
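The following sketch puts the three steps together, reusing the kmeans_sdp routine sketched above. The greedy ball-counting rounding below is a simplified stand-in for the rounding of Theorem 15: the radius is treated as a tuning parameter rather than the specific value used in our analysis.

```python
import numpy as np
from scipy.spatial.distance import cdist

def relax_and_round(points, k, radius):
    """Relax-and-round k-means sketch: (i) distances, (ii) SDP, (iii) rounding."""
    X = kmeans_sdp(points, k)          # steps (i)-(ii), from the sketch above
    denoised = X @ points              # rows of X @ points are weighted averages
                                       # of the data (the denoised point cloud)

    # Step (iii): repeatedly take the denoised point whose ball of the given
    # radius captures the most remaining denoised points.
    remaining = np.ones(len(points), dtype=bool)
    centers = []
    for _ in range(k):
        if not remaining.any():
            break
        cand = denoised[remaining]
        counts = (cdist(cand, cand) <= radius).sum(axis=1)
        c = cand[np.argmax(counts)]
        centers.append(c)
        remaining &= np.linalg.norm(denoised - c, axis=1) > radius
    centers = np.array(centers)

    # Assign each original point to its nearest estimated center.
    assignment = np.argmin(cdist(points, centers), axis=1)
    return centers, assignment
```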

Our contribution. We study performance guarantees for the k-means semidefinite relaxation (3) when the point cloud is drawn from a subgaussian mixture model. Adapting ideas from Guédon and Vershynin [GV14], we obtain approximation guarantees comparable with the state of the art for learning mixtures of Gaussians, despite the fact that our algorithm is a generic k-means solver and uses no model assumptions. To the best of our knowledge, no convex relaxation has been used before to provide theoretical guarantees for clustering mixtures of Gaussians. We also provide conditional lower bounds on how well a k-means solver can approximate the centers of Gaussians.

Organization of this paper. In Section 2, we present a summary of our results and give a high-level explanation of our proof techniques. We also illustrate the performance of our relax-and-round algorithm on the MNIST dataset of handwritten digits. Our theoretical results consist of an approximation theorem for the SDP (proved in Section 3), a denoising consequence of the approximation (explained in Section 4), and a rounding step (presented in Section 5). We also study the fundamental limits of estimating Gaussian centers by k-means clustering (see Section 2.2).

2 Summary of results

This paper has two main results. First, we present a relax-and-round algorithm for k-means clustering that well-approximates the centers of sufficiently separated subgaussians. Second, we provide a conditional result on the minimum separation necessary for Gaussian center approximation by k-means clustering. The first result establishes that the k-means SDP (3) performs well with noisy data (despite not being tight), and the second result helps to illustrate how sharp our analysis is. This section discusses these results and then applies our algorithm to the MNIST dataset [LC10].

2.1 Performance guarantee for the k-means SDP

Our relax-and-round performance guarantee consists of three steps.

Step 1: Approximation. We adapt an approach used by Guédon and Vershynin to provide approximation guarantees for a certain semidefinite program under the stochastic block model for graph clustering [GV14].

Given the points drawn independently from the mixture, consider the squared-distance matrix D and the corresponding minimizer of the SDP (3). We first construct a "reference" matrix R such that the SDP (3) is tight, with the intended (ground-truth) optimizer, when D is replaced by R. To this end, let X_D denote the minimizer of (3), and let X_R denote the minimizer of (3) when D is replaced by the reference matrix R defined as

(4)

which involves a parameter to be chosen later. Indeed, this choice of reference is quite technical, an artifact of the entries of D being statistically dependent. Despite its lack of beauty, our choice of reference enjoys the following important property:

Lemma 1.

Let denote the indicator function for the indices corresponding to points drawn from the th subgaussian. If whenever , then .

Proof.

Let be feasible for the SDP (3). Replacing with in (3), we may use the SDP constraints and to obtain the bound

Furthermore, since whenever , and since , we have that equality occurs precisely for the such that equals zero whenever . The other constraints on then force to have the claimed form (i.e., is the unique minimizer). ∎

Now that we know that is the solution we want, it remains to demonstrate regularity in the sense that provided the subgaussian centers are sufficiently separated. For this, we use the following scheme:

  • If then is small (Lemma 8).

  • If (in some specific sense) then (Lemmas 9 and 10).

  • If the centers are separated by , then .

What follows is the result of this analysis:

Theorem 2.

Fix . There exist universal constants such that if

then with probability provided

where is the minimal cluster center separation.

See Section 3 for the proof. Note that if we remove the assumption , we obtain the result .

Step 2: Denoising. Suppose we solve the SDP (3) for an instance of the subgaussian mixture model in which the separation between centers is sufficiently large. Then Theorem 2 ensures that the solution X_D is close to the ground truth X_R. At this point, it remains to convert X_D into an estimate for the centers γ_1, …, γ_k. Let P denote the matrix whose ith column is x_i. Then P X_R is a matrix whose ith column is the centroid of the cluster containing x_i, which converges to the corresponding center as the number of points grows, and so one might expect the columns of P X_D to be close to the γ_t's. In fact, we can interpret the columns of P X_D as a denoised version of the points (see Figure 1).

To illustrate this idea, assume the points come from in for each . Then we have

(5)

Letting denote the th column of (i.e., the th estimate of ), in Section 4 we obtain the following denoising result:

Corollary 3.

If , then

with high probability as .

Note that Corollary 3 guarantees denoising in a particular regime of separation. This is a corollary of a more technical result (Theorem 11), which guarantees denoising in a different regime for certain configurations of subgaussians (e.g., when the centers γ_t are vertices of a regular simplex).

At this point, we comment that one might expect this level of denoising from principal component analysis (PCA) when the mixture of subgaussians is sufficiently nice. To see this, suppose we have k spherical Gaussians of equal entrywise variance σ² centered at the vertices of a regular simplex. Then in the large-sample limit, we expect PCA to approach the (k−1)-dimensional affine subspace that contains the centers. Projecting onto this affine subspace does not change the variance of any Gaussian along any of the retained principal components, and so one expects the mean squared deviation of the projected points from their respective Gaussian centers to be (k−1)σ².
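This comparison is easy to test numerically. The sketch below (assuming the data, labels, and centers from the earlier sampler) implements the PCA baseline: project onto the top k−1 principal components and measure the mean squared deviation from the true centers.

```python
import numpy as np

def pca_project(points, q):
    """Project points onto the affine span of their top-q principal components."""
    mean = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - mean, full_matrices=False)
    return (points - mean) @ Vt[:q].T @ Vt[:q] + mean

# Mean squared deviation from the true centers, before and after PCA:
#   np.mean(np.sum((points - centers[labels])**2, axis=1))
#   np.mean(np.sum((pca_project(points, k - 1) - centers[labels])**2, axis=1))
# The same quantity for the SDP-denoised cloud (X @ points in the earlier
# sketch) can then be compared against this PCA baseline.
```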

By contrast, we find that in practice, the SDP denoises substantially more than PCA does. For example, Figures 1 and 3 illustrate cases in which PCA would not change the data, since the data already lies in (k−1)-dimensional space, and yet the SDP considerably enhances the signal. In fact, we observe empirically that the matrix X_D has low rank and that P X_D has repeated columns. This doesn't come as a complete surprise, considering SDP optimizers are known to exhibit low rank [Sha82, Bar95, Pat98]. Still, when clustering points from the mixture model, we observe that the optimizer tends to have rank even lower than such general bounds predict; we did not leverage this feature in our analysis, though it certainly warrants further investigation.

Step 3: Rounding. In Section 5, we present a rounding scheme that provides a clustering of the original data from the denoised results of the SDP (Theorem 15). In general, rounding costs an additional factor in the average squared deviation of our estimates. Under the same hypothesis as Corollary 3, we have that there exists a permutation π of {1, …, k} such that

(6)

where the estimate indexed by π(t) in (6) is what our algorithm chooses as the tth center estimate. Much like the denoising portion, we also have a more technical result that allows one to replace the right-hand side of (6) with a smaller quantity for sufficiently nice configurations of subgaussians. As such, we can estimate Gaussian centers with mean squared error that scales with the model order k, provided the pairwise distances between centers also scale with k. This contrasts with the seminal work of Dasgupta [Das99], which obtains its mean squared error guarantee when the pairwise distances are on the order of σ√m. In this sense, our guarantee replaces dimensionality dependence with model order dependence, which is an improvement to the state of the art when the number of clusters is small relative to the ambient dimension. In the next section, we indicate that model order dependence cannot be completely removed when using k-means to estimate the centers.

Before concluding this section, we want to clarify the nature of our approximation guarantee (6). Since centroids correspond to a partition of Euclidean space, our guarantee says something about how "close" our k-means partition is to the "true" partition. By contrast, the usual approximation guarantees for relax-and-round algorithms compare values of the objective function (e.g., the k-means value of the algorithm's output is within a factor of 2 of the minimum). Moreover, the latter sort of optimal value-based approximation guarantee cannot be used to produce the sort of optimizer-based guarantee we want. To illustrate this, imagine a relax-and-round algorithm for k-means with k = 2 that produces a near-optimal partition for data coming from a single spherical Gaussian. We expect every subspace of codimension 1 to separate the data into a near-optimal partition, but these partitions are very different from each other when the dimension is large, and so a guarantee of the form (6) will not hold.

2.2 Fundamental limits of k-means clustering

In Section 5, we provide a rounding scheme that, when applied to the output of the k-means SDP, produces estimates of the subgaussian centers. But how good is our guarantee? Observe the following two issues: (i) the amount of tolerable noise and our bound on the error both depend on the number of clusters k; (ii) our bound on the error does not vanish as the number of points grows.

In this section, we give a conditional result showing that these issues are actually artifacts of k-means; that is, both would arise if one were to estimate the Gaussian centers with the k-means-optimal centroids (though these centroids might be computationally difficult to obtain). The main trick in our argument is that, in some cases, so-called "Voronoi means" appear to serve as a good proxy for the k-means-optimal centroids. This trick is useful because the Voronoi means are much easier to analyze. We start by providing some motivation for the Voronoi means.

Given data x_1, …, x_N, let A_1 ⊔ ⋯ ⊔ A_k denote any minimizing partition of the k-means objective (1), and define the k-means-optimal centroids by

c_t := (1/|A_t|) Σ_{i∈A_t} x_i,   t = 1, …, k.

(Note that the k-means minimizer is unique for generic data.) Given centers Γ = {γ_1, …, γ_k}, consider for each t the Voronoi cell

V_t := { x ∈ R^m : ‖x − γ_t‖ ≤ ‖x − γ_s‖ for all s }.

Given a probability distribution 𝒟 over R^m, define the Voronoi means by

ν_t := 𝔼_{X∼𝒟}[ X | X ∈ V_t ],   t = 1, …, k.

Finally, we say Γ is a stable isogon if

  • (si1) ,

  • (si2) the symmetry group of Γ acts transitively on Γ, and

  • (si3) for each t, the stabilizer of γ_t has the property that

(Examples of stable isogons include regular and quasi-regular polyhedra, as well as highly symmetric frames [BW13].) With this background, we formulate the following conjecture:

Conjecture 4 (Voronoi Means Conjecture).

Draw points independently from a mixture of k equally weighted spherical Gaussians of equal variance centered at the points of a stable isogon Γ. Then the distance between the k-means-optimal centroids and the corresponding Voronoi means converges to zero in probability as the number of points grows, i.e., the k-means-optimal centroids converge to the Voronoi means.

Our conditional result will only require the Voronoi Means Conjecture in the special case where the centers form an orthoplex (see Lemma 6). Figure 2 provides some numerical evidence in favor of this conjecture (specifically in the orthoplex case). We note that the hypothesis that Γ be a stable isogon appears somewhat necessary. For example, simulations suggest that the conjecture does not hold when Γ is three equally spaced points on a line (in this case, the k-means-optimal centroids appear to be more separated than the Voronoi means). The following theorem provides some insight as to why the stable isogon hypothesis is reasonable:

Figure 2: Evidence in favor of the Voronoi Means Conjecture. For each of three sample sizes, the following experiment produces (a), (b) and (c), respectively. Perform 30 trials of the following: For each center, draw points independently from the corresponding Gaussian, and then for these points, run MATLAB's built-in implementation of k-means++ with several independent initializations; of the resulting choices of centroids, store the ones of minimal k-means value (this serves as our proxy for the k-means-optimal centroids). Plot these 30 collections of centroids in black dots, along with the Voronoi means in red circles. The Voronoi means were computed numerically in Mathematica as conditional expectations over the Voronoi cells. Importantly, the k-means-optimal centroids appear to converge toward the Voronoi means, not the Gaussian centers, as the sample size grows.
Theorem 5.

Let 𝒟 be a mixture of equally weighted spherical Gaussians of equal variance centered at the points of a stable isogon Γ. Then there exists c > 0 such that the Voronoi means satisfy ν_t = c·γ_t for each t.

See Section 6 for the proof. To interpret Theorem 5, consider k-means optimization over the distribution 𝒟 instead of a large sample drawn from 𝒟. This optimization amounts to finding the k points c_1, …, c_k in R^m that minimize

𝔼_{X∼𝒟} min_{1≤t≤k} ‖ X − c_t ‖².        (7)

Intuitively, the optimal c_1, …, c_k are a good proxy for the k-means-optimal centroids when the sample is large (and one might make this rigorous using the plug-in principle with the Glivenko–Cantelli theorem). What Theorem 5 provides is that, when Γ is a stable isogon, the Voronoi means have the same Voronoi cells as the γ_t's. As such, if one were to initialize Lloyd's algorithm at the Gaussian centers to solve (7), the algorithm would converge to the Voronoi means in one step. Overall, one should interpret Theorem 5 as a statement about how the Voronoi means locally minimize (7), whereas the Voronoi Means Conjecture is a statement about global minimization.
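The Voronoi means themselves are easy to approximate by Monte Carlo, which gives a simple way to probe the conjecture numerically. The sketch below (our own illustration, using the orthoplex in R² as an example) computes the cell-conditional means of a large sample, i.e., one Lloyd step on the empirical distribution initialized at the true centers.

```python
import numpy as np
from scipy.spatial.distance import cdist

def empirical_voronoi_means(centers, sigma, n_per, seed=0):
    """Monte Carlo estimate of the Voronoi means of an equally weighted
    mixture of spherical Gaussians N(gamma_t, sigma^2 I) at the given centers."""
    rng = np.random.default_rng(seed)
    k, d = centers.shape
    labels = np.repeat(np.arange(k), n_per)
    sample = centers[labels] + sigma * rng.standard_normal((k * n_per, d))
    cell = np.argmin(cdist(sample, centers), axis=1)   # Voronoi cell of each point
    return np.array([sample[cell == t].mean(axis=0) for t in range(k)])

# Orthoplex {+-e_1, +-e_2} in R^2: by symmetry the Voronoi means lie on the
# rays spanned by the centers (cf. Theorem 5), and they move along these rays
# as sigma varies, agreeing with the centers only in the small-sigma limit.
orthoplex = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
print(empirical_voronoi_means(orthoplex, sigma=1.0, n_per=200000))
```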

As indicated earlier, we will use the Voronoi Means Conjecture in the special case where Γ is an orthoplex:

Lemma 6.

The standard orthoplex of dimension p, given by the first p columns of I and of −I (that is, the points ±e_1, …, ±e_p), is a stable isogon.

Proof.

First, we have that , implying (si1). Next, every element of Γ can be reached from any other by an appropriate signed transposition, which permutes Γ and is therefore in the symmetry group; as such, we conclude (si2). For (si3), pick γ ∈ Γ and let j denote the index of its nonzero entry. Consider the matrix , where e_j denotes the jth identity basis element. Then , and the eigenspace of with eigenvalue is , and so we are done. ∎

What follows is the main result of this subsection:

Theorem 7.

Let k be even, and let Γ denote the standard orthoplex of dimension k/2. Then for every , either

where 𝒟_σ denotes the mixture of equally weighted spherical Gaussians of entrywise variance σ² centered at the members of Γ.

See Section 7 for the proof. In words, Theorem 7 establishes that one must accept k-dependence in either the data's noise or the estimate's error. It would be interesting to investigate whether other choices of stable isogons lead to stronger k-dependence.

2.3 Numerical example: Clustering the MNIST dataset

In this section, we apply our clustering algorithm to the MNIST handwritten digits dataset [LC10]. This dataset consists of 70,000 grayscale images of handwritten digits (28×28 pixels each, reshaped as vectors of length 784); 55,000 of them form the training set, 10,000 the test set, and the remaining 5,000 the validation set.

Clustering the raw data gives poor results (due to 4's and 9's being similar, for example), so we first learn meaningful features, and then cluster the data in feature space. To simplify feature extraction, we used the first example from the TensorFlow tutorial [AAB15]. This consists of a one-layer neural network x ↦ softmax(Wx + b), where softmax denotes the softmax function, W is a matrix to learn, and b is a vector to learn. As the tutorial shows, the neural network is trained for 1,000 iterations, each iteration using a batch of 100 random points from the training set.

Training the neural network amounts to finding W and b that fit the training set well. After selecting these parameters, we run the trained neural network on the first 1,000 elements of the test set, obtaining for each image a vector in R^10 representing the probabilities of that image being each digit. Since each such vector is a probability vector, its entries sum to 1, and so the feature space is effectively 9-dimensional.
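For readers who wish to reproduce this step with current tooling, the following is a rough sketch using tf.keras (the low-level API of the original tutorial has since been retired; the exact train/test split and number of training steps here differ slightly from those quoted above).

```python
import tensorflow as tf

# One-layer softmax network y = softmax(W x + b) on flattened MNIST images.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),   # W is 784 x 10, b is in R^10
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, batch_size=100, epochs=2, verbose=0)

# Features to cluster: 10-dimensional probability vectors for the first
# 1,000 test images (they sum to 1, so the feature space is 9-dimensional).
features = model.predict(x_test[:1000], verbose=0)
```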

For this experiment, we cluster with two different algorithms: (i) MATLAB's built-in implementation of k-means++, and (ii) our relax-and-round algorithm based on the k-means semidefinite relaxation (3). (The results of the latter alternative are illustrated in Figure 3.)

Figure 3: (a) After applying TensorFlow [AAB15] to learn a low-dimensional feature space of MNIST digits [LC10], determine the features of the first 1,000 images in the MNIST test set, compute the matrix of squared distances in feature space, and then solve the k-means semidefinite relaxation (3) using SDPNAL+v0.3 [YST15]. (The computation takes about 6 minutes on a standard MacBook Air laptop.) Convert the SDP optimizer to a grayscale image such that white pixels denote zero entries. By inspection, this matrix is not exactly of the form (2), but it looks close, and it certainly appears to have low rank. (b) Letting P denote the matrix whose columns are the feature vectors to cluster, compute the denoised data PX and identify the most popular locations (denoted by red circles) among the columns of PX (denoted by black dots). For the plot, we project the feature-space data onto a random two-dimensional subspace. (c) The most popular locations form our estimates for the centers of digits in feature space. We plot these locations relative to the original data, projected onto the same two-dimensional subspace as in (b).

Since each run of k-means++ uses a random initialization that impacts the partition, we ran this algorithm 100 times. In fact, the k-means value of the output varied quite a bit, and the all-time low was reached in only a fraction of the 100 trials. Since our relax-and-round alternative has no randomness, the outcome is deterministic, and its k-means value was identical to the all-time low from k-means++. By comparison, the k-means value of the planted solution (i.e., clustering according to the hidden digit labels) was 103.5430, and the value of the SDP (which serves as a lower bound on the optimal k-means value) was less than 1.5% below the relax-and-round value. As such, not only did our relax-and-round alternative produce the best clustering that k-means++ could find, it also provided a certificate that no clustering has a k-means value that is 1.5% better.

Recalling the nature of our approximation guarantees, we also want to know how well the relax-and-round algorithm's clustering captures the ground truth. To evaluate this, we determined a labeling of the clusters for which the resulting classification exhibited a minimal misclassification rate. (This amounts to minimizing a linear objective over all permutation matrices, which can be relaxed to a generically tight linear program over doubly stochastic matrices.) For k-means++, the all-time low misclassification rate was again accomplished by only a fraction of the 100 trials, with the all-time high and median both larger. As one might expect, the relax-and-round output achieved a misclassification rate matching this all-time low.
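Matching clusters with digit labels can be done with the linear program over doubly stochastic matrices described above, or, equivalently, with the Hungarian algorithm; the sketch below takes that shortcut using scipy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_rate(pred, truth, k):
    """Minimal error rate over all ways of labeling the k clusters with the
    k digit classes (assignment problem in place of the permutation LP)."""
    confusion = np.zeros((k, k), dtype=int)   # confusion[i, j]: cluster i, true label j
    for p, t in zip(pred, truth):
        confusion[p, t] += 1
    rows, cols = linear_sum_assignment(-confusion)   # maximize correctly matched points
    return 1.0 - confusion[rows, cols].sum() / len(truth)
```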

3 Proof of Theorem 2

By the following lemma, it suffices to bound :

Lemma 8.

.

Proof.

First, by Lemma 1, we have . We also claim that . To see this, first note that and , and so the th entry of can be interpreted as a convex combination of the entries of . Let

be an eigenvector of

with eigenvalue , and let index the largest entry of (this entry is positive without loss of generality). Then , implying that . Since the eigenvalues of lie in , we may conclude that . As such,

(8)

We will bound (8) in two different ways, and a convex combination of these bounds will give the result. For both bounds, we let denote the indices in the diagonal blocks, and the indices in the off-diagonal blocks, and denote the indices in the diagonal block for the cluster . In particular, denotes the matrix that equals on the diagonal blocks and is zero on the off-diagonal blocks. For the first bound, we use that , and that has non-negative entries (since both and have non-negative entries, and ). Recalling that , we have

(9)

For the second bound, we first write . Since , we then have

where the last and second-to-last steps use that . Next, and , and so we may continue:

(10)

where again, the last step uses the fact that . At this point, we have bounds of the form with and with (explicitly, (9) and (10)), and we seek a bound of the form . As such, we take the convex combination for such that and

Taking and and combining with (8) then gives

choosing sufficiently small and simplifying yields the result. ∎

We will bound in terms of the following: For each real symmetric matrix , let denote the value of the following program:

maximum (11)
subject to
Lemma 9.

Put and . Then .

Proof.

Since and are both feasible in (11), we have

and adding followed by reverse triangle inequality gives

(12)

Write . Note that implies , and so

Similarly, , and so

where the last step follows from the optimality of . Similarly, , and so (12) implies the result. ∎

Now it suffices to bound . For an matrix , consider the matrix norm

The following lemma will be useful:

Lemma 10.

where .

Proof.

The first bound follows from the classical version of Hölder’s inequality (recalling that and by design):

The second bound is a consequence of Von Neumann’s trace inequality: if the singular values of

and are respectively and then

Since is feasible in (11) we have and . Using that we get

Proof of Theorem 2.

Write . Then

Furthermore,

where is the matrix whose th column is , and is the column vector whose th entry is . Recall that

Then where

Considering Lemma 10 and that we will bound

(13)

For the first term:

Note that if the rows , of come from a distribution with second moment matrix , then has the same distribution as , where