# Hidden Integrality of SDP Relaxation for Sub-Gaussian Mixture Models

We consider the problem of estimating the discrete clustering structures under Sub-Gaussian Mixture Models. Our main results establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded in terms of the error of an idealized integer program. The error of the integer program, and hence that of the SDP, are further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models, and our results reveal the "global-to-local" mechanism that drives the performance of the SDP relaxation. A corollary of our results shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our results establish sufficient conditions for the SDP to correctly recover the cluster memberships of (1-δ) fraction of the points for any δ∈(0,1). As a special case, we show that under the d-dimensional Stochastic Ball Model, SDP achieves non-trivial (sometimes exact) recovery when the center separation is as small as √(1/d), which complements previous exact recovery results that require constant separation.


## 1 Introduction

We consider Sub-Gaussian Mixture Models (SGMMs), where one is given random points drawn from a mixture of sub-Gaussian distributions with different means/centers. SGMMs, particularly their special case of Gaussian Mixture Models (GMMs), are widely used in a broad range of applications including speaker identification, background modeling and online recommendation systems. In these applications, one is typically interested in two types of inference problems under SGMMs:

• Clustering: (approximately) identify the cluster membership of each point, that is, which of the mixture components generates a given point;

• Center estimation: estimate the centers of a mixture, that is, the means of the components.

Standard approaches to these problems, such as k-means clustering, typically lead to integer programming problems that are non-convex and NP-hard to optimize (Aloise et al., 2009; Jain et al., 2002; Mahajan et al., 2009). Consequently, much work has been done in developing computationally tractable algorithms for SGMMs, including expectation maximization (Dempster et al., 1977), Lloyd's algorithm (Lloyd, 1982), spectral methods (Vempala and Wang, 2004), the method of moments (Pearson, 1936), and many more. Among them, convex relaxations, including those based on linear programming (LP) and semidefinite programming (SDP), have emerged as an important approach for clustering SGMMs. This approach has several attractive properties: (a) it is solvable in polynomial time, and does not require a good initial solution to be provided; (b) it has the flexibility to incorporate different quality metrics and additional constraints; (c) it is not restricted to specific forms of SGMMs (such as Gaussian distributions), and is robust against model misspecification (Peng and Xia, 2005; Peng and Wei, 2007; Nellore and Ward, 2015); (d) it can provide a certificate for optimality.

Theoretical performance guarantees for convex relaxation methods have been studied in a body of classical and recent work. As will be discussed in the related work section (Section 2), these existing results typically take one of two forms:

1. How well the (rounded) solution of a relaxation optimizes a particular objective function (e.g., the k-means or k-medians objective) compared to the original integer program, as captured by an approximation factor (Charikar et al., 1999; Kanungo et al., 2004; Peng and Wei, 2007; Li and Svensson, 2016);

2. When the solution of a relaxation corresponds exactly to the ground-truth clustering, a phenomenon known as exact recovery and studied in a more recent line of work (Nellore and Ward, 2015; Awasthi et al., 2015; Mixon et al., 2017; Iguchi et al., 2017; Li et al., 2017).

In many practical scenarios, optimizing a particular objective function, and designing approximation algorithms for doing so, are often only a means to solving the two inference problems above, namely learning the true underlying model that generates the observed data. Results on exact recovery are more directly relevant to this goal; however, such results often require very stringent conditions on the separation or signal-to-noise ratio (SNR) of the model. In practice, convex relaxation solutions are rarely exact, even when the data are generated from the assumed model. On the other hand, researchers have observed that the solutions, while not exact or integer-valued, are often a good approximation to the desired solution that represents the ground truth (Mixon et al., 2017). Such a phenomenon is not captured by the results on exact recovery.

In this paper, we aim to strengthen our understanding of the convex relaxation approach to SGMMs. In particular, we study the regime where solutions of convex relaxations are not integral in general, and seek to directly characterize the estimation errors of the solutions, namely their distance to the desired integer solution corresponding to the true underlying model.

### 1.1 Our contributions

For a class of SDP relaxations for SGMMs, our results reveal a perhaps surprising property: while the SDP solutions are not integer-valued in general, their errors can be controlled by the errors of the solutions of an idealized integer program (IP), in which one estimates the cluster memberships after an oracle reveals the true centers of the SGMM. We refer to the latter program as the Oracle Integer Program. In particular, we show that, in a precise sense to be formalized later, the estimation errors of the SDP and Oracle IP satisfy the relationship (Theorem 1):

 error(SDP)≲error(IP)

under certain conditions. We refer to this property as hidden integrality of the SDP relaxation; its proof in fact involves showing that the optimal solutions of certain intermediate linear optimization problems are integral. We then further upper bound the error of the Oracle IP and show that it decays exponentially in terms of the SNR (Theorem 2):

 error(IP) ≲ exp[−Ω(SNR²)],

where the SNR is defined as the ratio of the center separation to the standard deviation of the sub-Gaussian components. Combining these two results immediately leads to explicit bounds on the error of the SDP solutions (Corollary 1).

#### 1.1.1 Consequences

When the SNR is sufficiently large, the above results imply that the SDP solutions are integral and exact up to numerical errors, hence recovering (and sometimes improving) existing results on exact recovery as a special case. Moreover, if the SNR is low and the SDP solutions are fractional, one may obtain an explicit clustering from the SDP solutions via a simple, optimization-free rounding procedure. We show that the error of this explicit clustering (in terms of the fraction of points misclassified) is also bounded by the error of the Oracle IP and hence also decays exponentially in the SNR (Theorem 3). As a consequence, we obtain sufficient conditions for misclassifying at most a δ fraction of the points for any given δ ∈ (0,1). Finally, we show that the SDP solutions also lead to an efficient estimator of the cluster centers, for which estimation error bounds are established (Theorem 4).

Significantly, our results often match and sometimes improve upon state-of-the-art performance guarantees in settings for which known results exist, and lead to new guarantees in other, less studied settings of SGMMs. For instance, a corollary of our results shows that under the Stochastic Ball Model, SDP achieves meaningful (sometimes exact) recovery even when the center separation is as small as √(1/d), where d is the dimension (Section 4.5). For high dimensional settings, this bound generalizes existing results that focus on exact recovery and require constant separation. Detailed discussions of the implications of our results and comparisons with existing ones will be provided after we state our main theorems.

#### 1.1.2 The “global-to-local” phenomenon

Our results above are obtained in two steps: (a) relating the SDP to the Oracle IP, and (b) bounding the Oracle IP errors. Conceptually, this two-step approach allows us to decouple two types of mechanisms that determine the performance of the SDP relaxation approach.

• On the one hand, step (a) is done by leveraging the structure of the entire dataset of n points. In particular, certain global spectral properties of the data ensure that the error of the SDP is non-trivial and bounded (in terms of the Oracle IP error). This step is relatively insensitive to the specific structure of the SGMM.

• On the other hand, as shall become clear in the sequel, the Oracle IP essentially reduces to n independent clustering problems, one for each data point. Knowing the true cluster centers, the Oracle IP is optimal in terms of the clustering errors: no other algorithm (including SDP relaxations) can achieve a strictly better error, due to the inherent randomness of the individual data points. Step (b) above hence captures the local mechanism that determines fine-grained error rates as a function of the SNR.

Our two-step analysis establishes the hidden integrality property as the interface between these two types of mechanisms. As a clustering algorithm, the SDP approach is powerful enough to capture these two mechanisms simultaneously, without requiring a good initial solution or sophisticated pre-processing/post-processing steps.

We note that recent work on clustering under Stochastic Block Models reveals a related "local-to-global amplification" phenomenon, which connects the original clustering problem with that of recovering a single point when the memberships of all the other points are known (Abbe and Sandon, 2015; Abbe et al., 2016; Abbe, 2017). Our results share a similar spirit with this line of work, though our models, algorithms and proof techniques are quite different.

### 1.2 Paper Organization

The remainder of the paper is organized as follows. In Section 2, we discuss related work on SGMMs and its special cases. In Section 3, we describe the problem setup for SGMMs and provide a summary of our clustering algorithms. In Section 4, we present our main results, discuss some of their consequences and compare them with existing results. The paper is concluded with a discussion of future directions in Section 5. The proofs of our main theorems are deferred to the Appendix.

## 2 Related work

The study of SGMMs has a long history and is still an active area of research. Here we review the most relevant results with theoretical guarantees.

Dasgupta (1999) is among the first to obtain performance guarantees for GMMs. Subsequent work has obtained improved guarantees, achieved by various algorithms including spectral methods. These results often establish sufficient conditions, in terms of the separation between the cluster centers (or equivalently the SNR), for achieving (near-)exact recovery of the cluster memberships. Vempala and Wang (2004) obtain one of the best such separation conditions for spherical Gaussians, which is later generalized and extended in a long line of work including Achlioptas and McSherry (2005); Kumar and Kannan (2010); Awasthi and Sheffet (2012). We compare these results with ours in Section 4.

Expectation-Maximization (EM) and Lloyd's algorithm are among the most popular methods for GMMs. Despite their empirical effectiveness, non-asymptotic statistical guarantees for them have been established only recently. In particular, convergence and center estimation error bounds for EM under GMMs with two components are derived in Balakrishnan et al. (2017); Klusowski and Brinda (2016), with extensions to multiple components given in Yan et al. (2017). The work of Lu and Zhou (2016) provides a general convergence analysis for Lloyd's algorithm, which implies clustering and center estimation guarantees for random models including SGMMs. All these results assume that one has access to a sufficiently good initial solution, typically obtained by spectral methods. A recent breakthrough was made by Daskalakis et al. (2016); Xu et al. (2016), who establish global convergence of randomly-initialized EM for GMMs with two symmetric components. Complementarily, Jin et al. (2016) show that EM may fail to converge under GMMs with three or more components due to the existence of bad local minima. Robustness of Lloyd's algorithm under a semi-random GMM is studied in Awasthi and Vijayaraghavan (2017).

Most relevant to us is work on convex relaxation methods for GMMs and k-means/k-medians problems. A class of SDP relaxations is developed in the seminal work of Peng and Xia (2005); Peng and Wei (2007). Thanks to convexity, these methods do not suffer from the bad local minima faced by EM and Lloyd's algorithm, though it is far from trivial to round their (typically fractional) solutions into valid clustering solutions with provable quality guarantees. In this direction, Awasthi et al. (2015); Iguchi et al. (2017); Li et al. (2017) establish conditions for LP/SDP relaxations to achieve exact recovery. The work of Mixon et al. (2017) considers SDP relaxations as a denoising method, and proves error bounds for a form of approximate recovery. Most of these results are directly comparable to ours, and we discuss them in more detail in Section 4 after presenting our main theorems.

Clustering problems under Stochastic Block Models (SBMs) have also witnessed fruitful progress on understanding convex relaxation methods; see Abbe (2017) for a survey and further references. Much work has been done on exact recovery guarantees for SDP relaxations of SBMs (Krivelevich and Vilenchik, 2006; Oymak and Hassibi, 2011; Ames and Vavasis, 2014; Chen et al., 2014; Amini and Levina, 2018). A more recent line of work establishes approximate recovery guarantees of the SDPs (Guédon and Vershynin, 2016; Montanari and Sen, 2016) in the low SNR regime. Particularly relevant to us is the work by Fei and Chen (2017), who also establish exponentially decaying error bounds. Despite the apparent similarity in the forms of the error bounds, our results require very different analytical techniques, due to the fundamental difference between the geometric and probabilistic structures of SBMs and SGMMs; moreover, our results reveal the more subtle hidden integrality property of SDP relaxations, which we believe holds more broadly beyond specific models like SBMs and SGMMs.

## 3 Models and algorithms

In this section, we formally set up the clustering problem under SGMMs and describe our SDP relaxation approach.

### 3.1 Notations

We first introduce some notation. Vectors and matrices are denoted by bold letters such as v and M. For a vector v, we denote by v_i its i-th entry. For a matrix M, tr(M) denotes its trace, M_ij its (i,j)-th entry, diag(M) the vector of its diagonal entries, ‖M‖₁ its entry-wise ℓ₁ norm, M_{i·} its i-th row and M_{·j} its j-th column. We write M ⪰ 0 if M is symmetric positive semidefinite (psd). The trace inner product between two matrices M and N of the same dimension is denoted by ⟨M, N⟩ := tr(M⊤N). For a number b, M ≥ b means that M_ij ≥ b for all (i, j). We denote by 1_n the all-one column vector of dimension n. For a positive integer n, let [n] := {1, …, n}. For two non-negative sequences (a_n) and (b_n), we write a_n ≲ b_n if there exists a universal constant C > 0 such that a_n ≤ C·b_n for all n, and write a_n ≍ b_n if a_n ≲ b_n and b_n ≲ a_n.

We recall that the sub-Gaussian norm of a random variable X is defined as

 ‖X‖_{ψ₂} := inf{ t > 0 : E exp(X²/t²) ≤ 2 },

and X is called sub-Gaussian if ‖X‖_{ψ₂} < ∞. Note that normal and bounded random variables are sub-Gaussian. Denote by S^{d−1} the unit sphere in R^d. A random vector g in R^d is sub-Gaussian if the one-dimensional marginals ⟨g, x⟩ are sub-Gaussian random variables for all x ∈ S^{d−1}. The sub-Gaussian norm of g is defined as ‖g‖_{ψ₂} := sup_{x ∈ S^{d−1}} ‖⟨g, x⟩‖_{ψ₂}.
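To make the definition concrete, here is a small numerical sketch (ours, not from the paper) computing ‖X‖_{ψ₂} for a standard normal X, using the closed-form moment E exp(X²/t²) = (1 − 2/t²)^{−1/2} valid for t² > 2:

```python
import math

def chi_moment(t):
    # E[exp(X^2 / t^2)] for X ~ N(0, 1); finite iff t^2 > 2.
    return 1.0 / math.sqrt(1.0 - 2.0 / t**2)

def subgaussian_norm_std_normal(tol=1e-10):
    # Bisection for the smallest t with E[exp(X^2 / t^2)] <= 2.
    lo, hi = math.sqrt(2.0) + 1e-9, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi_moment(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

# Analytically, (1 - 2/t^2)^(-1/2) <= 2  <=>  t^2 >= 8/3,
# so the infimum is sqrt(8/3) = 1.6329...
print(subgaussian_norm_std_normal())
```

The bisection agrees with the closed-form value √(8/3) to numerical precision, illustrating that the ψ₂ norm is a constant multiple of the standard deviation for Gaussians.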

### 3.2 Sub-Gaussian Mixture Models

We focus on Sub-Gaussian Mixture Models (SGMMs) with balanced clusters.

###### Model 1 (Sub-Gaussian Mixture Models).

Let μ₁, …, μ_k ∈ R^d be unknown cluster centers. We observe n random points in R^d of the form

 h_i := μ_{σ*(i)} + g_i,   i ∈ [n],

where σ*(i) ∈ [k] is the unknown cluster label of the i-th point, and g₁, …, g_n are independent, zero-mean sub-Gaussian random vectors with sub-Gaussian norms ‖g_i‖_{ψ₂} ≤ τ.

We assume that the ground-truth clusters have equal sizes, that is, |{ i ∈ [n] : σ*(i) = a }| = n/k for each a ∈ [k].

Note that we do not require the g_i to be identically distributed or isotropic. Model 1 includes several important mixture models as special cases:

• The spherical GMM, where the g_i are Gaussian with covariance matrix proportional to the identity I_d.

• More general GMMs with non-identical and non-diagonal covariance matrices for the clusters.

• The Stochastic Ball Model (Nellore and Ward, 2015), where the distributions of the g_i are supported on the unit ball in R^d; we discuss this model in detail in Section 4.5.

Throughout the paper we impose mild conditions on n, k and d to avoid degeneracy. Let σ* ∈ [k]^n be the vector of the true cluster labels; that is, its i-th coordinate is σ*_i = σ*(i) (we use the two notations interchangeably throughout the paper). The task is to estimate the underlying clustering σ* given the observed data h₁, …, h_n. The separation of the centers of clusters a and b is denoted by Δ_ab := ‖μ_a − μ_b‖₂, and Δ := min_{a≠b} Δ_ab is the minimum separation of the centers. Playing a crucial role in our results is the quantity

 s := Δ/τ, (1)

which is a measure of the SNR of the SGMM.
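As a concrete illustration (our own sketch, not part of the model definition), the following generates a balanced two-component spherical GMM in the notation above and computes the separation Δ and the SNR s = Δ/τ, taking τ proportional to the noise level σ as for Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d, sigma = 200, 2, 5, 0.5

# Two cluster centers and balanced ground-truth labels sigma*.
mu = np.stack([np.ones(d), -np.ones(d)])       # mu_1, mu_2 in R^d
sigma_star = np.repeat(np.arange(k), n // k)   # exactly n/k points per cluster

# h_i = mu_{sigma*(i)} + g_i with spherical Gaussian noise g_i.
H = mu[sigma_star] + sigma * rng.standard_normal((n, d))

Delta = np.linalg.norm(mu[0] - mu[1])          # minimum center separation
snr = Delta / sigma                            # s = Delta / tau, up to constants
print(H.shape, round(Delta, 3), round(snr, 3))
```

Here Δ = 2√d ≈ 4.472, so the instance is well separated; shrinking σ or moving the centers together directly changes s.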

### 3.3 Semidefinite programming relaxation

We now describe our SDP relaxation for clustering SGMMs. To begin, note that each candidate clustering of the n points into k clusters can be represented using an assignment matrix F ∈ {0,1}^{n×k}, where

 F_ia = 1 if point i is assigned to cluster a, and F_ia = 0 otherwise.

Let 𝓕 denote the set of all possible assignment matrices. Given the points to be clustered, a natural approach is to find an assignment that minimizes the total within-cluster pairwise distance. Arranging the pairwise squared distances as a matrix A with

 A_ij = ‖h_i − h_j‖²₂   for each (i, j) ∈ [n] × [n],

we can express the above objective as

 ∑_{i,j} A_ij · 1{points i and j are assigned to the same cluster} = ∑_{i,j} A_ij (FF⊤)_ij = ⟨FF⊤, A⟩.

Therefore, the approach described above is equivalent to solving the integer program (2) below:

 min_F   ⟨FF⊤, A⟩ (2)
 s.t.    F ∈ 𝓕,
         1⊤_n F = (n/k)·1⊤_k,

and its lifted counterpart

 min_Y   ⟨Y, A⟩ (3)
 s.t.    Y·1_n = (n/k)·1_n,
         Y is psd,
         diag(Y) = 1_n,
         Y ∈ {0,1}^{n×n},  rank(Y) = k.

In program (2) the additional constraint 1⊤_n F = (n/k)·1⊤_k enforces that all clusters have the same size n/k, as we are working with an SGMM whose true clusters are balanced. Under this balanced model, it is not hard to see that program (2) is equivalent to the classical k-means formulation. With the change of variable Y = FF⊤, we may lift program (2) to the space of n×n matrices and obtain the equivalent formulation (3). Both programs (2) and (3) involve non-convex combinatorial constraints and are computationally hard to solve. To obtain a tractable formulation, we drop the non-convex rank constraint in (3) and replace the integer constraint Y ∈ {0,1}^{n×n} with the box constraint 0 ≤ Y ≤ 1 (the upper-bound constraint Y ≤ 1 is in fact redundant). This leads to the following SDP relaxation:
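The lifting identity ⟨FF⊤, A⟩ = total within-cluster pairwise squared distance is easy to verify numerically; a NumPy sketch (ours, with a hypothetical toy assignment):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
H = rng.standard_normal((n, 3))                      # toy data points
labels = np.array([0, 0, 0, 1, 1, 1])                # a candidate balanced assignment

# Assignment matrix F and pairwise squared-distance matrix A.
F = np.zeros((n, k))
F[np.arange(n), labels] = 1.0
A = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)   # A_ij = ||h_i - h_j||^2

lifted = np.sum((F @ F.T) * A)                       # <FF^T, A>
direct = sum(A[i, j] for i in range(n) for j in range(n) if labels[i] == labels[j])
assert np.isclose(lifted, direct)
print(lifted)
```

Since (FF⊤)_ij is exactly the indicator that i and j share a cluster, the two quantities coincide for every assignment.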

 Ŷ ∈ argmin_{Y ∈ R^{n×n}}   ⟨Y, A⟩ (4)
 s.t.    Y·1_n = (n/k)·1_n,
         Y is psd,
         diag(Y) = 1_n,
         Y ≥ 0.

The performance of the SDP is measured against the true cluster matrix Y*, where for each (i, j) ∈ [n] × [n],

 Y*_ij = 1 if σ*(i) = σ*(j), i.e., points i and j are in the same cluster;
 Y*_ij = 0 if σ*(i) ≠ σ*(j), i.e., points i and j are in different clusters,

with the convention that Y*_ii = 1 for all i ∈ [n]. The true cluster matrix Y* encodes the ground-truth clustering σ*, and is feasible to program (4). We view an optimal solution Ŷ to (4) as an estimate of the true Y*. Our goal is to characterize the cluster recovery/estimation error ‖Ŷ − Y*‖₁ in terms of the number of points n, the number of clusters k, the data dimension d and the SNR s defined in (1). Note that here we measure the error of Ŷ in the entry-wise ℓ₁ metric; as we shall see later, this metric is directly related to the clustering error (i.e., the fraction of misclassified points).
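The feasibility of the true cluster matrix is easy to check numerically; the sketch below (ours) builds Y* for a balanced toy labeling and verifies every constraint of the SDP (4):

```python
import numpy as np

n, k = 6, 2
sigma_star = np.array([0, 0, 0, 1, 1, 1])                 # balanced ground-truth labels

# True cluster matrix: Y*_ij = 1 iff i and j share a cluster.
Y = (sigma_star[:, None] == sigma_star[None, :]).astype(float)

assert np.allclose(Y @ np.ones(n), (n / k) * np.ones(n))  # row sums equal n/k
assert np.allclose(np.diag(Y), 1.0)                       # unit diagonal
assert np.all(Y >= 0)                                     # nonnegativity
assert np.min(np.linalg.eigvalsh(Y)) >= -1e-10            # positive semidefinite
print("Y* is feasible for the SDP (4)")
```

Here Y* is block diagonal (after sorting by label) with all-ones blocks, whose eigenvalues are n/k (with multiplicity k) and 0, confirming psd-ness.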

We remark that the SDP (4) is somewhat different from the more classical and well-known SDP relaxation of k-means proposed by Peng and Wei (2007). The SDP (4) is closely related to the one considered by Amini and Levina (2018) in the context of Stochastic Block Models, though it seems to be less studied under SGMMs, with the notable exception of Li et al. (2017).

### 3.4 Explicit clustering

Our main results in the next section directly concern the SDP solution , which is not integral in general and hence does not directly correspond to an explicit clustering. In case an explicit clustering is desired, we may easily extract cluster memberships from the solution using a simple procedure.

The procedure consists of two steps, given as Algorithms 1 and 2, respectively. In the first step, we treat the rows of Ŷ as points in R^n, and consider the balls centered at each row with a certain radius. The ball that contains the most rows is identified, and the indices of the rows in this ball are output and removed. The process continues iteratively with the remaining rows of Ŷ. This step outputs a number of sets whose sizes are no larger than n/k but may not be equal to each other.

In the second step, we convert the sets output by Algorithm 1 into k equal-size clusters. This is done by identifying the k largest sets among them, and distributing the points in the remaining sets across the chosen sets so that each of them contains exactly n/k points.

Combining the above two algorithms gives our final algorithm, cluster, for extracting an explicit clustering from the SDP solution Ŷ. This procedure is given as Algorithm 3.
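A simplified sketch (ours; the paper's Algorithms 1–3 differ in details such as the choice of radius) of the greedy ball-picking step followed by rebalancing:

```python
import numpy as np

def cluster(Y_hat, n, k, radius=0.5):
    """Greedy extraction of k balanced clusters from an (approximate) cluster matrix."""
    remaining = list(range(n))
    sets = []
    while remaining:
        # Pick the row whose l1-ball (radius scaled by n) captures the most rows.
        counts = [sum(np.abs(Y_hat[j] - Y_hat[i]).sum() <= radius * n for j in remaining)
                  for i in remaining]
        center = remaining[int(np.argmax(counts))]
        ball = [j for j in remaining
                if np.abs(Y_hat[j] - Y_hat[center]).sum() <= radius * n][: n // k]
        sets.append(ball)
        remaining = [j for j in remaining if j not in ball]
    # Rebalance: keep the k largest sets, spread leftover points over them.
    sets.sort(key=len, reverse=True)
    chosen = sets[:k]
    leftovers = [j for s in sets[k:] for j in s]
    for s in chosen:
        while len(s) < n // k and leftovers:
            s.append(leftovers.pop())
    labels = np.empty(n, dtype=int)
    for a, s in enumerate(chosen):
        labels[s] = a
    return labels

# On the exact cluster matrix of a balanced labeling, the procedure recovers it.
sigma_star = np.array([0, 0, 1, 1])
Y_star = (sigma_star[:, None] == sigma_star[None, :]).astype(float)
print(cluster(Y_star, 4, 2))   # -> [0 0 1 1]
```

On a noisy, fractional Ŷ the same greedy step groups rows that are close in ℓ₁ distance, which is exactly why the ℓ₁ error of Ŷ controls the misclassification rate.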

The output of the above procedure

 σ̂ := cluster(Ŷ, n, k)

is a vector in [k]^n such that point i is assigned to the σ̂_i-th cluster. We are interested in controlling the clustering error of σ̂ relative to the ground-truth clustering σ*. Let S_k denote the symmetric group consisting of all permutations of [k]. The clustering error is defined by

 err(σ̂, σ*) := min_{π ∈ S_k} (1/n) · |{ i ∈ [n] : σ̂_i ≠ π(σ*_i) }|, (5)

which is the proportion of points that are misclassified, modulo permutations of the cluster labels.
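For small k, definition (5) can be evaluated directly by enumerating the permutations; a sketch (ours):

```python
from itertools import permutations

def clustering_error(sigma_hat, sigma_star, k):
    # err = min over label permutations pi of the misclassified fraction.
    n = len(sigma_hat)
    return min(sum(sh != pi[st] for sh, st in zip(sigma_hat, sigma_star)) / n
               for pi in permutations(range(k)))

# Labels agree up to swapping the two cluster names, except for one point.
print(clustering_error([1, 1, 0, 0], [0, 0, 1, 0], 2))  # -> 0.25
```

The minimization over S_k makes the error invariant to relabeling the clusters, which is necessary since the cluster names themselves are unidentifiable.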

Variants of the above cluster procedure have been considered before by Makarychev et al. (2016); Mixon et al. (2017). Our results in the next section establish that the clustering error of σ̂ is always upper bounded by the ℓ₁ error of the SDP solution Ŷ.

## 4 Main results

In this section, we establish the connection between the estimation error of the SDP relaxation (4) and that of what we call the Oracle Integer Program. Using this connection, we derive explicit bounds on the error of the SDP, and explore their implications for clustering and center estimation.

### 4.1 Oracle Integer Program

Consider an idealized setting where an oracle reveals the true cluster centers μ₁, …, μ_k. Moreover, we are given the data points h̄_i := μ_{σ*(i)} + (1/2)·g_i for i ∈ [n], where the g_i are the same realizations of the random variables in the original SGMM. In other words, the h̄_i are the same as the original data points h_i, except that the standard deviation (or more generally, the sub-Gaussian norm) of the noise is scaled by 1/2.

To cluster in this idealized setting, a natural approach is to simply assign each point to the closest cluster center, so that the total distance of the points to their assigned centers is minimized. We may formulate this procedure as an integer program, by representing each candidate clustering using an assignment matrix as before. Then, for each assignment matrix F ∈ 𝓕, the quantity

 η(F) := ∑_j ∑_a ‖h̄_j − μ_a‖²₂ · F_ja

is exactly the sum of the squared distances of the points to their assigned cluster centers. The clustering procedure above thus amounts to solving the following Oracle Integer Program (IP):

 min_F  η(F),   s.t.  F ∈ 𝓕. (6)

As this program is separable across the rows of F, it can be reduced to n independent optimization problems, one for each data point h̄_j.
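Because of this separability, solving (6) is just a nearest-center assignment; a sketch (ours, with hypothetical toy centers standing in for the oracle):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, d = 8, 2, 3
mu = np.stack([np.full(d, 2.0), np.full(d, -2.0)])           # oracle-revealed centers
sigma_star = np.repeat(np.arange(k), n // k)
H_bar = mu[sigma_star] + 0.5 * rng.standard_normal((n, d))   # idealized data points

# eta(F) is minimized row by row: assign each point to its nearest center.
dists = ((H_bar[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # n x k squared distances
assignment = dists.argmin(axis=1)                            # an optimal IP solution
print(assignment)   # -> [0 0 0 0 1 1 1 1]
```

With this large separation the nearest-center rule recovers σ* exactly; as the separation shrinks, each point independently risks being closer to the wrong center, which is the source of the exponential error rate in Theorem 2.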

Let F* be the assignment matrix associated with the true underlying clustering of the SGMM; that is, F*_{i,σ*(i)} = 1 for each i ∈ [n]. For each feasible solution F to the Oracle IP, it is easy to see that the quantity (1/2)·‖F − F*‖₁ is exactly the number of points that are assigned differently in F and F*, and hence measures the clustering error of F with respect to the ground truth σ*.

A priori, there is no obvious connection between the estimation error of a solution to the above Oracle IP and that of a solution to the SDP. In particular, the latter involves a continuous relaxation, for which the solutions are fractional in general and the true centers are unknown. Surprisingly, we are able to establish a formal relationship between the two, and in particular show that the error of the SDP is bounded by the error of the IP in an appropriate sense.

### 4.2 Errors of SDP relaxation and Oracle IP

To establish the connection between the SDP and the Oracle IP, we begin with the following observation: for a solution F to potentially be an optimal solution of the Oracle IP (6), it must satisfy η(F) ≤ η(F*), since F* is feasible to (6). Consequently, the quantity

 max{ (1/2)·‖F − F*‖₁ : F ∈ 𝓕, η(F) ≤ η(F*) } (7)

represents the worst-case error over all potentially optimal solutions to the Oracle IP. This quantity turns out to be an upper bound on the error of the optimal solution to the SDP relaxation, as shown in the theorem below.

###### Theorem 1 (IP bounds SDP).

Under Model 1, there exist universal constants for which the following holds. If the SNR satisfies

 s² ≥ C_s·( √( (kd/n)·log n ) + kd/n + k ), (8)

then we have

 ‖Ŷ − Y*‖₁ / ‖Y*‖₁ ≤ 2 · max{ ‖F − F*‖₁ / ‖F*‖₁ : η(F) ≤ η(F*), F ∈ 𝓕 }

with high probability.

The proof is given in Section B, and consists of two main steps: (i) showing that with high probability the SDP error is upper bounded by the objective value of a linear program (LP), and (ii) showing that the LP admits an integral optimal solution and relating this solution to the quantity (7). We note that the key step (ii), which involves establishing certain hidden integrality properties, is completely deterministic; the SNR condition (8) is required only in the probabilistic step (i). As we elaborate in Sections 4.3–4.5, our SNR condition holds even in the regime where exact recovery is impossible, and is often milder than the conditions in existing results on convex relaxations. A sharper analysis of step (i) would lead to potentially more relaxed conditions on the SNR.

To obtain an explicit bound on the SDP error, it suffices to upper bound the error of the Oracle IP. This turns out to be a relatively easy task compared to directly controlling the error of the SDP. The reason is that the Oracle IP has only finitely many feasible solutions, allowing one to use a union-bound-like argument. Our analysis establishes that the error of the Oracle IP decays exponentially in the SNR, as summarized in the theorem below.

###### Theorem 2 (Exponential rate of IP).

Under Model 1, there exist universal constants C_g, C_e > 0 for which the following holds. If the SNR s is at least a sufficiently large universal constant, then we have

 max{ ‖F − F*‖₁ / ‖F*‖₁ : η(F) ≤ η(F*), F ∈ 𝓕 } ≤ C_g·exp[ −s²/C_e ]

with high probability.

The proof is given in Section C. An immediate consequence of Theorems 1 and 2 is that the SDP (4) also achieves an exponentially decaying error rate.

###### Corollary 1 (Exponential rate of SDP).

Under Model 1 and the SNR condition (8), there exist universal constants C_m, C_e > 0 such that

 ‖Ŷ − Y*‖₁ / ‖Y*‖₁ ≤ C_m·exp[ −s²/C_e ]

with high probability.

Our last theorem concerns the explicit clustering σ̂ extracted from Ŷ using the cluster procedure described in Section 3.4. In particular, we show that the fraction of misclassified points is upper bounded by the error in Ŷ, and in turn by the error of the Oracle IP; consequently, σ̂ misclassifies at most an exponentially decaying fraction of the points.

###### Theorem 3 (Clustering error).

The clustering error of σ̂ is always upper bounded by the error of Ŷ:

 err(σ̂, σ*) ≲ ‖Ŷ − Y*‖₁ / ‖Y*‖₁.

Consequently, under Model 1 and the SNR condition (8), there exist universal constants C_m, C_e > 0 such that

 err(σ̂, σ*) ≤ C_m·exp[ −s²/C_e ]

with high probability.

The proof is given in Section E. Note that the above bound, in terms of the clustering error, is optimal (up to a constant in the exponent) in view of the minimax results in Lu and Zhou (2016).

### 4.3 Consequences

We explore the consequences of our error bounds in Corollary 1 and Theorem 3.

• Exact recovery: If the SNR satisfies the condition (8) and moreover C_m·exp[−s²/C_e] < 1/n, then Theorem 3 guarantees that err(σ̂, σ*) < 1/n, which means that err(σ̂, σ*) = 0 and the true underlying clustering is recovered exactly. Note that these conditions can be simplified to s² ≳ k + log n when kd ≲ n. In fact, in this case Corollary 1 guarantees that the SDP solution satisfies an entry-wise error bound small enough that simply rounding Ŷ element-wise produces the ground-truth cluster matrix Y*. Therefore, the SDP relaxation is able to achieve exact recovery (sometimes called strong consistency in the literature on Stochastic Block Models (Abbe, 2017)) of the underlying clusters when the SNR is sufficiently large.

In fact, our results are applicable even in regimes with a lower SNR, for which exact recovery of the clusters is impossible due to the overlap between points from different clusters. In such regimes, Corollary 1 and Theorem 3 imply approximate recovery guarantees for the SDP relaxation:

• Almost exact recovery: If s satisfies the condition (8) and s → ∞ as n → ∞, then Theorem 3 implies that err(σ̂, σ*) = o(1) with high probability. That is, the SDP recovers the cluster memberships of almost all points asymptotically, which is sometimes called weak consistency.

• Recovery with δ-error: More generally, for any number δ ∈ (0, 1), Theorem 3 implies the following non-asymptotic recovery guarantee: if s satisfies the condition (8) and s² ≥ C_e·log(C_m/δ), then err(σ̂, σ*) ≤ δ. That is, the SDP correctly recovers the cluster memberships of at least a (1 − δ) fraction of the points.

We compare the above results with existing ones in Section 4.4 to follow.

### 4.4 Comparison with existing results

Table 1 summarizes several of the most representative results in the literature on clustering under SGMMs/GMMs. Most of them take the form of SNR conditions required to achieve exact recovery of the underlying clusters. Note that our results imply sufficient conditions for both exact and approximate recovery.

Most relevant to us is the work of Li et al. (2017), which considers similar SDP relaxation formulations. They show that exact recovery is achieved under a certain SNR condition. In comparison, a special case of our Corollary 1 guarantees exact recovery under a condition that is milder than the one in Li et al. (2017).

The work in Lu and Zhou (2016) also proves an exponentially decaying clustering error rate, but for a different algorithm (Lloyd's algorithm). To achieve non-trivial approximate recovery of the clusters, they require additional conditions on the SNR and the model parameters as n → ∞. Our SNR condition in (8) has a milder dependency on these parameters, though the dependencies on k and d are a bit more subtle. We do note that under their more restrictive SNR condition, Lu and Zhou (2016) are able to obtain tight constants in the exponent of the error rate.

Finally, the work of Mixon et al. (2017) considers the SDP relaxation introduced by Peng and Wei (2007) and provides bounds on center estimation when . An intermediate result of theirs concerns errors of the SDP solutions; under the setting of balanced clusters, their error bound can be compared with ours after appropriate rescaling. In particular, their result implies the error bound when is sufficiently large. This bound is non-trivial when since . Under the same conditions on and , our results imply the exponential error bound

$$\|\hat{Y}-Y^*\|_F^2 \;\le\; \|\hat{Y}-Y^*\|_1 \;\lesssim\; \frac{n^2}{k}\,e^{-s^2},$$

which is strictly better.

To sum up, corollaries of our results provide more relaxed conditions for exact or approximate recovery compared to most of the existing results listed in Table 1. Our results are weaker by a factor than the one in Vempala and Wang (2004), which considers spectral clustering methods and focuses on exact recovery under spherical Gaussian mixtures. In comparison, our results apply to the more general sub-Gaussian setting, and imply exponential error bounds for approximate recovery guarantees under more general SNR conditions.

### 4.5 Stochastic Ball Model

We illustrate the power of our main results for the Stochastic Ball Model introduced in Nellore and Ward (2015).

###### Model 2 (Stochastic Ball Model).

Under Model 1, we assume in addition that each is sampled from a rotationally invariant distribution supported on the unit ball in .

Under the Stochastic Ball Model, each data point is sampled from the unit ball around its cluster center in a rotationally invariant fashion. This model is a special case of SGMM with its sub-Gaussian norm given below:

###### Fact 1.

Under Model 2, each has sub-Gaussian norm for some universal constant .

For completeness we prove this claim in Section F. Specializing Corollary 1 and Theorem 3 to the Stochastic Ball Model, we obtain the following sufficient conditions on the minimum center separation for various types of recovery:

The state-of-the-art results for Stochastic Ball Models are given in Awasthi et al. (2015); Iguchi et al. (2017); Li et al. (2017), which establish that SDP achieves exact recovery when is sufficiently large and for some non-negative function . Regardless of the values of and , these results all require the separation to satisfy and thus the balls to be disjoint. In contrast, our results above are applicable to a small-separation regime that is not covered by these results. In particular, when is large and , our results guarantee that SDP achieves approximate recovery when , which can be arbitrarily smaller than when the dimension grows. Moreover, the recovery is exact if , which can again be arbitrarily small as long as does not grow exponentially fast (i.e., ). Therefore, in the high dimensional setting, our results guarantee strong performance of the SDP even when the centers are very close and the balls overlap with each other.

It may appear a bit counter-intuitive that exact/approximate recovery is achievable when the separation is so small. Such a result is a manifestation of the geometry in high dimensions: the relative volume of the intersection of two balls vanishes as the dimension grows. As a passing note, our exact recovery result above does not contradict the necessary condition given in Li et al. (2017, Corollary 4.3), as they allow to grow arbitrarily fast, in which case with high probability some points will land in the intersection.

### 4.6 Cluster center estimation

To conclude this section, we show that our main theorems also imply a guarantee for center estimation. In particular, given the estimated cluster labels produced by the SDP relaxation, we may obtain an estimate of the cluster centers by

$$\hat{\mu}_a \coloneqq \frac{k}{n}\sum_{i:\,\hat{\sigma}_i=a} h_i, \qquad a\in[k].$$

That is, we simply compute the empirical means of the points within each estimated cluster. As a corollary of our bounds on clustering errors, we obtain the following center estimation guarantee.

###### Theorem 4 (Cluster center estimation error).

Suppose that for some universal constant . Under Model 1 and the SNR condition (8), there exist universal constants such that

$$\max_{a\in[k]}\min_{\pi\in S_k}\big\|\hat{\mu}_a-\mu_{\pi(a)}\big\|_2 \;\le\; C_m\,\tau\left(\sqrt{\frac{k(d+\log n)}{n}} \;+\; \big(\sqrt{d}+\log n\big)\cdot\exp\!\left[-\frac{s^2}{C_e}\right]\right)$$

with probability at least .

The proof is given in Section G. Note that the error is again measured up to a permutation of the cluster labels. Our error bound in Theorem 4 consists of two terms. The first term, , corresponds to the error of estimating a -dimensional cluster center vector using the data points from that cluster. This error is unavoidable even when the true cluster labels are known. The second term captures the error due to incorrect cluster labels for some of the points. When and , we achieve the minimax optimal rate for center estimation.
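In code, the plug-in estimator amounts to a per-cluster average. Below is a minimal sketch (function and variable names are ours); with balanced clusters of size n/k, the within-cluster mean coincides with the (k/n)-scaled sum in the display above.

```python
import numpy as np

def estimate_centers(H, labels, k):
    """Plug-in center estimates: average the rows of the n x d data
    matrix H over each estimated cluster.  For balanced clusters of
    size n/k, this equals (k/n) times the within-cluster sum."""
    H = np.asarray(H, dtype=float)
    labels = np.asarray(labels)
    return np.vstack([H[labels == a].mean(axis=0) for a in range(k)])
```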

## 5 Conclusion

In this paper, we consider clustering problems under SGMMs using an SDP relaxation. Our analysis consists of two steps: (a) we establish the main result of this paper that the clustering error of the SDP can be controlled by that of the idealized Oracle IP provided that an SNR condition is satisfied; (b) we show that the error of the Oracle IP decays exponentially in the SNR. As immediate corollaries, we obtain sufficient conditions for the SDP to achieve exact and approximate recovery, as well as error bounds for estimating the mixture centers.

As mentioned, this two-step approach allows for a certain decoupling of the computational and statistical mechanisms that determine the performance of the SDP approach. We expect that further progress in the understanding of SDP relaxations is likely to come from improvements in step (a). On the other hand, by modifying and sharpening step (b), one may generalize our results to other variants of SGMMs.

Our work points to several interesting future directions. An immediate problem is extending our results to the case of imbalanced clusters. It is also of interest to study the robustness of SDP relaxations for SGMMs by considering various semi-random models that allow for adversarial attacks, arbitrary outliers and model misspecification (Awasthi and Vijayaraghavan, 2017). Other directions worth exploring include obtaining better constants in the error bounds, identifying sharp thresholds for different types of recovery, and obtaining tight localized proximity conditions along the lines of Li et al. (2017).

## Acknowledgement

Y. Fei and Y. Chen were partially supported by the National Science Foundation CRII award 1657420 and grant 1704828.

## Appendix A Notation

We define the shorthand . For a matrix , we write as its entry-wise norm, and as its spectral norm (maximum singular value). We let and be the identity matrix and the all-one matrix, respectively. For a real number , denotes its ceiling. We denote by the set of indices of points in cluster , and we define .

## Appendix B Proof of Theorem 1

In this section, we prove Theorem 1, which relates the errors of the SDP and Oracle IP formulations.

### b.1 Preliminaries

We may decompose the input matrix of pairwise squared distances as

$$A = C + C^\top - 2HH^\top,$$

where is the matrix whose -th row is the point , and is the matrix where the entries in the -th row are identical and equal to . The row-sum constraint in the program (4) ensures that the matrix has zero row sum. Consequently, we have which implies .
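This decomposition of the pairwise squared-distance matrix is straightforward to verify numerically. In the sketch below (ours, for illustration only), the rows of `H` are the points, and `C` is the matrix whose i-th row is constant and equal to the squared norm of the i-th point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
H = rng.standard_normal((n, d))  # rows of H are the data points

# A_ij = ||h_i - h_j||^2: matrix of pairwise squared distances
A = np.square(H[:, None, :] - H[None, :, :]).sum(axis=-1)

# C: every entry of row i equals ||h_i||^2 (constant rows)
C = np.tile((H * H).sum(axis=1, keepdims=True), (1, n))

# The identity A = C + C^T - 2 H H^T holds exactly
assert np.allclose(A, C + C.T - 2 * H @ H.T)
```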

Let be the centered version of . We can compute

$$\begin{aligned}
HH^\top &= (G+\mathbb{E}H)(G+\mathbb{E}H)^\top\\
&= GG^\top + G(\mathbb{E}H)^\top + (\mathbb{E}H)G^\top + (\mathbb{E}H)(\mathbb{E}H)^\top
\end{aligned}$$

and

$$\mathbb{E}\big[HH^\top\big] = \mathbb{E}\big[GG^\top\big] + (\mathbb{E}H)(\mathbb{E}H)^\top.$$

Therefore, we have

$$HH^\top - \mathbb{E}\big[HH^\top\big] = \big(GG^\top - \mathbb{E}[GG^\top]\big) + G(\mathbb{E}H)^\top + (\mathbb{E}H)G^\top.$$

Let be the matrix of the left singular vectors of ; note that is simply a scaled version of the true assignment matrix and takes the form

$$U_{ia} = \frac{1}{\sqrt{\ell}}\,F^*_{ia} = \frac{1}{\sqrt{\ell}}\cdot\mathbb{1}\{\sigma^*(i)=a\}.$$

For each , define the projection and its orthogonal complement . The fact that is optimal and is feasible for the program (4) implies that

$$\begin{aligned}
0 &\le -\tfrac{1}{2}\big\langle \hat{Y}-Y^*,\, A\big\rangle\\
&= \big\langle \hat{Y}-Y^*,\; GG^\top-\mathbb{E}[GG^\top]+G(\mathbb{E}H)^\top+(\mathbb{E}H)G^\top\big\rangle + \big\langle \hat{Y}-Y^*,\, \mathbb{E}[HH^\top]\big\rangle\\
&\overset{(i)}{=} \big\langle \hat{Y}-Y^*,\, GG^\top\big\rangle + \big\langle \hat{Y}-Y^*,\; G(\mathbb{E}H)^\top+(\mathbb{E}H)G^\top\big\rangle + \big\langle \hat{Y}-Y^*,\, \mathbb{E}[HH^\top]\big\rangle\\
&= \big\langle \hat{Y}-Y^*,\, P_T(GG^\top)\big\rangle + \big\langle \hat{Y}-Y^*,\, P_{T^\perp}(GG^\top)\big\rangle + 2\big\langle \hat{Y}-Y^*,\, G(\mathbb{E}H)^\top\big\rangle + \big\langle \hat{Y}-Y^*,\, \mathbb{E}[HH^\top]\big\rangle\\
&\eqqcolon S_1 + S_2 + 2S_3 + S_4,
\end{aligned}$$

where step (i) holds since is a diagonal matrix and has zero diagonal. The following propositions control the terms , and .

###### Proposition 1.

If for some universal constant , then with probability at least .

###### Proposition 2.

If for some universal constant , then with probability at least .

###### Proposition 3.

We have , where for each .

The proofs are given in Sections B.4, B.5 and B.6, respectively. Combining the above propositions, we have and therefore

$$0 \le S_3 + \tfrac{1}{4}S_4 \qquad (9)$$

with probability at least for some universal constant .

Let us take a closer look at the quantity . Let . We have

$$\begin{aligned}
S_3 &= \sum_j\sum_a\sum_{i\in C^*_a} B_{ji}\,\langle \mu_a, g_j\rangle\\
&= \ell\sum_j\sum_a \langle \mu_a, g_j\rangle \left(\frac{1}{\ell}\sum_{i\in C^*_a} B_{ji}\right)\\
&= \ell\sum_j\sum_{a\ne\sigma^*(j)} \big\langle \mu_a-\mu_{\sigma^*(j)},\, g_j\big\rangle \left(\frac{1}{\ell}\sum_{i\in C^*_a} B_{ji}\right),
\end{aligned}$$

where the last step holds since for each , which follows from the row-sum constraint of program (4). By Proposition 3, we have

$$S_4 = -\ell\sum_j\sum_{a\ne\sigma^*(j)} \frac{1}{2}\Delta^2_{\sigma^*(j),a} \left(\frac{1}{\ell}\sum_{i\in C^*_a} B_{ji}\right).$$

Therefore, together with the inequality (9), we obtain that

$$\begin{aligned}
0 &\le \frac{1}{\ell}\Big(S_3+\frac{1}{4}S_4\Big)\\
&= \sum_j\sum_{a\ne\sigma^*(j)} \left(\big\langle \mu_a-\mu_{\sigma^*(j)},\, g_j\big\rangle - \frac{1}{8}\Delta^2_{\sigma^*(j),a}\right) \left(\frac{1}{\ell}\sum_{i\in C^*_a} B_{ji}\right)\\
&= \sum_j\sum_{a\ne\sigma^*(j)} \beta_{ja} \left(\frac{1}{\ell}\sum_{i\in C^*_a} B_{ji}\right), \qquad (10)
\end{aligned}$$

where and .

To control the RHS of (10), we recall that and observe that the constraints of the SDP (4) imply that

$$\sum_{j\in[n]}\sum_{a\ne\sigma^*(j)} \left(\frac{1}{\ell}\sum_{i\in C^*_a} B_{ji}\right) = \frac{\gamma}{2\ell} \in (0, n].$$

On the other hand, consider the linear program

$$\begin{aligned}
\max_{X}\quad & \sum_j\sum_{a\ne\sigma^*(j)} \beta_{ja} X_{ja}\\
\text{s.t.}\quad & 0 \le X_{ja} \le 1, \quad \forall\, a\ne\sigma^*(j),\; j\in[n]\\
& \sum_{a\ne\sigma^*(j)} X_{ja} \le 1, \quad \forall\, j\in[n] \qquad (11)\\
& \sum_j\sum_{a\ne\sigma^*(j)} X_{ja} = R.
\end{aligned}$$

The above program is parameterized by the number . Let us denote by the optimal value of the above program (with the convention if the program is infeasible). Inspecting the equation (10) and the program (11), we find that the RHS of (10) is upper bounded by . We therefore conclude that

$$0 \le \frac{1}{\ell}\Big(S_3+\frac{1}{4}S_4\Big) \le V\!\Big(\frac{\gamma}{2\ell}\Big). \qquad (12)$$
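Although we only use the optimal value V(R) abstractly, the program (11) has a transparent structure: within each row, mass is best spent on the largest coefficient, and across rows the budget R is filled greedily down the sorted list of per-row maxima. The sketch below is our own illustration of this closed form (the function name and conventions are ours; each inner list holds the coefficients of one row):

```python
import numpy as np

def V_greedy(beta_rows, R):
    """Optimal value of a budgeted assignment LP like (11): each row
    contributes at most one unit of mass, placed on its largest
    coefficient; the total mass across rows must equal R.  Sorting the
    per-row maxima and filling greedily attains the optimum, since the
    exact-budget constraint forces us down the sorted list."""
    m = np.sort([max(row) for row in beta_rows])[::-1]  # row maxima, descending
    full = int(np.floor(R))
    val = m[:full].sum()
    if R > full and full < len(m):
        val += (R - full) * m[full]  # at most one fractional row
    return float(val)
```

For instance, with row maxima 5, 3, 2 and budget R = 2.5, the value is 5 + 3 + 0.5 · 2 = 9.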

### b.2 Controlling γ by a linear program

We next convert the inequality (12) into an upper bound on in terms of the objective value of an LP that is related to the above program (11).

If , then the conclusion of Theorem 1 holds trivially. For , we consider the following two cases:

1. If , it follows from equation (12) that the error must satisfy

 0