# Clustering subgaussian mixtures by semidefinite programming

We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei. The algorithm interprets the SDP output as a denoised version of the original data and then rounds this output to a hard clustering. We provide a generic method for proving performance guarantees for this algorithm, and we analyze the algorithm in the context of subgaussian mixture models. We also study the fundamental limits of estimating Gaussian centers by k-means clustering in order to compare our approximation guarantee to the theoretically optimal k-means clustering solution.


## 1 Introduction

Consider the following mixture model: For each t ∈ {1, …, k}, let D_t be an unknown subgaussian probability distribution over R^m, with first moment γ_t and second moment matrix with largest eigenvalue σ_t^2. For each t, an unknown number n_t of random points is drawn independently from D_t. Given the points {x_i}_{i=1}^N along with the model order k, the goal is to approximate the centers {γ_t}_{t=1}^k. How large must N be relative to m and k, and how large must the separation between the centers be relative to the σ_t's, in order to have sufficient signal for successful approximation?

For the most popular instance of this problem, where the subgaussian distributions are Gaussians, theoretical guarantees date back to the work of Dasgupta [Das99]. Dasgupta introduced an algorithm based on random projections and showed that this algorithm well-approximates centers of Gaussians in R^m that are separated by Ω(σ√m). Since Dasgupta's seminal work, performance guarantees for several algorithmic alternatives have emerged, including expectation maximization [DS07], spectral methods [VW04, KK10, AS12], projections (random and deterministic) [MV10, AK05], and the method of moments [MV10]. Every existing performance guarantee has one of two forms:

• (a) the algorithm correctly clusters all points according to Gaussian mixture component, or

• (b) the algorithm well-approximates the center of each Gaussian (à la Dasgupta [Das99]).

Results of type (a), which include [VW04, KK10, AS12, AM07], require the minimum separation between the Gaussian centers to have a multiplicative factor of polylog(N), where N is the total number of points. This stems from a requirement that every point be closer (in some sense) to its own Gaussian's center than to the other centers, so that the problem of cluster recovery is well-posed. We note that in the case of spherical Gaussians, such highly separated Gaussian components may be truncated so as to match a different data model known as the stochastic ball model, where the semidefinite program we use in this paper is already known to be tight with high probability [ABC15, IMPV15].

Results of type (b) tend to be specifically tailored to exploit unique properties of the Gaussian distribution, and are thus not easily generalizable to other data models. For instance, if X has distribution N(γ, σ^2 I_m), then E‖X − γ‖_2^2 = mσ^2, and concentration of measure implies that in high dimensions, most of the points will reside in a thin shell with center γ and radius about σ√m. This sort of behavior can be exploited to cluster even concentric Gaussians as long as the covariances are sufficiently different. However, algorithms that perform well even with no separation between centers require a sample complexity which is exponential in k [MV10].

In this paper, we provide a performance guarantee of type (b), but our approach is model-free. In particular, we consider the k-means clustering objective:

 minimize    ∑_{t=1}^k ∑_{i∈A_t} ‖x_i − (1/|A_t|) ∑_{j∈A_t} x_j‖_2^2   (1)
 subject to  A_1 ∪ ⋯ ∪ A_k = {1, …, N},  A_i ∩ A_j = ∅  ∀ i, j ∈ [k], i ≠ j

Letting D denote the N×N matrix defined entrywise by D_ij := ‖x_i − x_j‖_2^2, a straightforward calculation gives the following "lifted" expression for the k-means objective:

 minimize    (1/2) Tr(DX)   (2)
 subject to  X = ∑_{t=1}^k (1/|A_t|) 1_{A_t} 1_{A_t}^⊤,  {A_t}_{t=1}^k a partition of {1, …, N}
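As a sanity check on this lifted formulation, the following numpy sketch (our own; the data and the hard clustering are arbitrary) verifies numerically that the partition objective in (1) coincides with (1/2)Tr(DX) when X is the lifted matrix built from that partition:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, k = 30, 5, 3
points = rng.normal(size=(N, m))
labels = np.arange(N) % k            # arbitrary hard clustering

# k-means objective computed directly from the partition, as in (1).
obj_direct = 0.0
for t in range(k):
    cluster = points[labels == t]
    centroid = cluster.mean(axis=0)
    obj_direct += ((cluster - centroid) ** 2).sum()

# Lifted form (2): D_ij = ||x_i - x_j||^2 and X = sum_t (1/|A_t|) 1_{A_t} 1_{A_t}^T.
D = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
X = np.zeros((N, N))
for t in range(k):
    idx = labels == t
    X[np.ix_(idx, idx)] = 1.0 / idx.sum()

obj_lifted = 0.5 * np.trace(D @ X)
assert abs(obj_direct - obj_lifted) < 1e-8
```

The identity is exact, so the two values agree up to floating-point error.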

The matrix X necessarily satisfies various convex constraints, and relaxing to such constraints leads to the following semidefinite relaxation of (1), first introduced by Peng and Wei in [PW07]:

 minimize    Tr(DX)   (3)
 subject to  Tr(X) = k,  X1 = 1,  X ≥ 0,  X ⪰ 0

Here, X ≥ 0 means that X is entrywise nonnegative, whereas X ⪰ 0 means that X is symmetric and positive semidefinite.
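To make the constraint set concrete, here is a small numpy routine (ours, not from the paper) that checks feasibility for (3); any hard clustering, lifted as in (2), is feasible:

```python
import numpy as np

def is_sdp_feasible(X, k, tol=1e-8):
    """Check the constraints of (3): Tr(X)=k, X1=1, X entrywise >= 0, X PSD."""
    N = X.shape[0]
    if abs(np.trace(X) - k) > tol:
        return False
    if np.max(np.abs(X @ np.ones(N) - 1)) > tol:
        return False
    if X.min() < -tol:
        return False
    if not np.allclose(X, X.T, atol=tol):
        return False
    # PSD: all eigenvalues nonnegative (up to tolerance).
    return np.linalg.eigvalsh((X + X.T) / 2).min() > -tol

# The hard-clustering matrix from (2) is feasible.
N, k = 12, 3
labels = np.arange(N) % k
X = np.zeros((N, N))
for t in range(k):
    idx = labels == t
    X[np.ix_(idx, idx)] = 1.0 / idx.sum()
assert is_sdp_feasible(X, k)
```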

As mentioned earlier, this semidefinite relaxation is known to be tight for a particular data model called the stochastic ball model [NW15, ABC15, IMPV15]. In this paper, we study its performance under subgaussian mixture models, which include the stochastic ball model and the Gaussian mixture model as special cases. The SDP is not typically tight under this general model, but the optimizer can be interpreted as a denoised version of the data and can be rounded in order to produce a good estimate for the centers (and therefore produce a good clustering).

To see this, let Φ denote the m×N matrix whose columns are the points x_1, …, x_N. Notice that whenever X has the form (2), then for each t, ΦX has columns equal to the centroid of the points assigned to A_t. In particular, if X is k-means-optimal, then ΦX reports the k-means-optimal centroids (with appropriate multiplicities). Next, we note that every SDP-feasible matrix X satisfies X1 = 1 and X ≥ 0, and so X is a stochastic matrix, meaning each column of ΦX is still a weighted average of columns from Φ. Intuitively, if the SDP relaxation (3) were close to being tight, then the SDP-optimal X would make the columns of ΦX close to the k-means-optimal centroids. Empirically, this appears to be the case (see Figure 1 for an illustration). Overall, we may interpret ΦX as a denoised version of the original data Φ, and we leverage this strengthened signal to identify good estimates for the k-means-optimal centroids.

What follows is a summary of our relax-and-round procedure for (approximately) solving the k-means problem (1):

Relax-and-round k-means clustering procedure. Given k and the data matrix Φ, do: (i) Compute the squared-distance matrix D defined by D_ij := ‖x_i − x_j‖_2^2. (ii) Solve the k-means semidefinite program (3), resulting in optimizer X_D. (iii) Cluster the columns of the denoised data matrix ΦX_D.

For step (iii), we find there tend to be vectors that appear as columns in ΦX_D with particularly high frequency, and so we are inclined to use these as estimators for the k-means-optimal centroids (see Figure 1, for example). Running Lloyd's algorithm for step (iii) also works well in practice. To obtain theoretical guarantees, we instead find the columns of ΦX_D for which balls of a certain radius centered at these points contain the most columns of ΦX_D (see Theorem 15 for more details). An implementation of our procedure is available on GitHub [Vil16].
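The ball-counting idea behind step (iii) can be sketched as follows (our own simplified version; the radius r is a free parameter here, whereas Theorem 15 prescribes a specific choice):

```python
import numpy as np

def round_centers(Y, k, r):
    """Greedy rounding: Y is m x N (denoised data); pick k columns by ball density.

    Repeatedly select the column whose radius-r ball contains the most
    remaining columns, take it as a center estimate, and discard that ball.
    """
    Y = Y.copy()
    centers = []
    for _ in range(k):
        # Pairwise distances between remaining columns.
        d = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)
        counts = (d <= r).sum(axis=1)
        best = counts.argmax()
        centers.append(Y[:, best].copy())
        Y = Y[:, d[best] > r]          # discard the chosen ball
    return np.array(centers)

# Toy usage: three tight blobs; the greedy rounding recovers one point per blob.
rng = np.random.default_rng(1)
truth = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
cols = np.repeat(truth, 20, axis=0) + 0.1 * rng.normal(size=(60, 2))
est = round_centers(cols.T, k=3, r=1.0)
# Each estimate lies within 1 unit of some true center.
assert all(min(np.linalg.norm(e - t) for t in truth) < 1.0 for e in est)
```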

Our contribution. We study performance guarantees for the k-means semidefinite relaxation (3) when the point cloud is drawn from a subgaussian mixture model. Adapting ideas from Guédon and Vershynin [GV14], we obtain approximation guarantees comparable with the state of the art for learning mixtures of Gaussians, despite the fact that our algorithm is a generic k-means solver and uses no model assumptions. To the best of our knowledge, no convex relaxation has been used before to provide theoretical guarantees for clustering mixtures of Gaussians. We also provide conditional lower bounds on how well a k-means solver can approximate the centers of Gaussians.

Organization of this paper. In Section 2, we present a summary of our results and give a high-level explanation of our proof techniques. We also illustrate the performance of our relax-and-round algorithm on the MNIST dataset of handwritten digits. Our theoretical results consist of an approximation theorem for the SDP (proved in Section 3), a denoising consequence of the approximation (explained in Section 4), and a rounding step (presented in Section 5). We also study the fundamental limits for estimating Gaussian centers by k-means clustering (see Section 2.2).

## 2 Summary of results

This paper has two main results. First, we present a relax-and-round algorithm for k-means clustering that well-approximates the centers of sufficiently separated subgaussians. Second, we provide a conditional result on the minimum separation necessary for Gaussian center approximation by k-means clustering. The first result establishes that the k-means SDP (3) performs well with noisy data (despite not being tight), and the second result helps to illustrate how sharp our analysis is. This section discusses these results, and then applies our algorithm to the MNIST dataset [LC10].

### 2.1 Performance guarantee for the k-means SDP

Our relax-and-round performance guarantee consists of three steps.

Step 1: Approximation. We adapt an approach used by Guédon and Vershynin to provide approximation guarantees for a certain semidefinite program under the stochastic block model for graph clustering [GV14].

Given the points x_{a,i} drawn independently from the mixture, consider the squared-distance matrix D and the corresponding minimizer of the SDP (3). We first construct a "reference" matrix R such that the SDP (3) is tight when D is replaced by R, with the optimizer corresponding to the planted clusters. To this end, let X_D denote the minimizer of (3), and let X_R denote the minimizer of (3) when D is replaced by the reference matrix R defined as

 (R_ab)_ij := ξ + Δ_ab^2/2 + max{0, Δ_ab^2/2 + 2⟨r_{a,i} − r_{b,j}, γ_a − γ_b⟩}   (4)

where r_{a,i} := x_{a,i} − γ_a, Δ_ab := ‖γ_a − γ_b‖_2, and ξ is a parameter to be chosen later. Indeed, this choice of reference is quite technical, an artifact of the entries in D being statistically dependent. Despite its lack of beauty, our choice of reference enjoys the following important property:

###### Lemma 1.

Let 1_{A_t} ∈ R^N denote the indicator function for the indices corresponding to points drawn from the t-th subgaussian. If R_ij > ξ whenever i and j correspond to different subgaussians, then X_R = ∑_{t=1}^k (1/n_t) 1_{A_t} 1_{A_t}^⊤.

###### Proof.

Let X be feasible for the SDP (3). Replacing D with R in (3), we may use the SDP constraints X ≥ 0 and X1 = 1 to obtain the bound

 Tr(RX) = ∑_{i=1}^N ∑_{j=1}^N R_ij X_ij ≥ ∑_{i=1}^N ∑_{j=1}^N ξX_ij = ∑_{i=1}^N ξ ∑_{j=1}^N X_ij = Nξ = Tr(RX_R).

Furthermore, since R_ij > ξ whenever i and j correspond to different subgaussians, and since X ≥ 0, we have that equality occurs precisely for the X such that X_ij equals zero whenever i and j correspond to different subgaussians. The other constraints on X then force X to have the claimed form (i.e., X_R is the unique minimizer). ∎

Now that we know that X_R is the solution we want, it remains to demonstrate regularity in the sense that X_D ≈ X_R provided the subgaussian centers are sufficiently separated. For this, we use the following scheme:

• If then is small (Lemma 8).

• If (in some specific sense) then (Lemmas 9 and 10).

• If the centers are separated by , then .

What follows is the result of this analysis:

###### Theorem 2.

Fix ϵ > 0. There exist universal constants c_1, c_2, c_3 such that if

 α = n_max/n_min ≲ k ≲ m   and   N > max{c_1 m, c_2 log(2/η), log(c_3/η)},

then ‖X_D − X_R‖_F ≤ ϵ with probability 1 − η, provided

 Δ_min^2 ≥ C_ϵ k^2 α σ_max^2,

where Δ_min := min_{a≠b} ‖γ_a − γ_b‖_2 is the minimal cluster center separation.

See Section 3 for the proof. Note that if we remove the assumption , we obtain the result .

Step 2: Denoising. Suppose we solve the SDP (3) for an instance of the subgaussian mixture model where the separation Δ_min is sufficiently large. Then Theorem 2 ensures that the solution X_D is close to the ground truth X_R. At this point, it remains to convert X_D into an estimate for the centers γ_a. Let Φ denote the m×N matrix whose i-th column is x_i. Then ΦX_R is an m×N matrix whose columns corresponding to the a-th cluster all equal ~γ_a, the centroid of the a-th cluster, which converges to γ_a as n → ∞ (and does not change as i varies, for a fixed a), and so one might expect ΦX_D to have its columns be close to the ~γ_a's. In fact, we can interpret the columns of ΦX_D as a denoised version of the points x_i (see Figure 1).

To illustrate this idea, assume the points come from N(γ_a, σ^2 I_m) in R^m for each a, with n points per cluster. Then we have

 E[(1/N) ∑_{a=1}^k ∑_{i=1}^n ‖x_{a,i} − γ_a‖_2^2] = mσ^2.   (5)
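A quick Monte Carlo check of (5) (our own script; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m, sigma = 4, 5000, 6, 1.5
centers = rng.normal(size=(k, m)) * 10          # arbitrary well-separated centers

# Draw n points from each spherical Gaussian N(gamma_a, sigma^2 I_m).
points = centers[:, None, :] + sigma * rng.normal(size=(k, n, m))

# Average squared deviation from the true centers, as in (5).
avg_sq_dev = ((points - centers[:, None, :]) ** 2).sum(axis=2).mean()
expected = m * sigma ** 2                        # = 13.5 here
assert abs(avg_sq_dev - expected) < 0.5
```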

Letting c_{a,i} denote the (a,i)-th column of ΦX_D (i.e., the i-th estimate of the a-th center), in Section 4 we obtain the following denoising result:

###### Corollary 3.

If Δ_min is sufficiently large (as in Theorem 2), then

 (1/N) ∑_{a=1}^k ∑_{i=1}^n ‖c_{a,i} − ~γ_a‖_2^2 ≲ K^2 σ^2

with high probability as N → ∞.

Note that Corollary 3 guarantees denoising in the regime . This is a corollary of a more technical result (Theorem 11), which guarantees denoising for certain configurations of subgaussians (e.g., when the γ_a's are vertices of a regular simplex) in the regime .

At this point, we comment that one might expect this level of denoising from principal component analysis (PCA) when the mixture of subgaussians is sufficiently nice. To see this, suppose we have k spherical Gaussians of equal entrywise variance σ^2 centered at the vertices of a regular simplex. Then in the large-sample limit, we expect PCA to approach the (k−1)-dimensional affine subspace that contains the centers. Projecting onto this affine subspace will not change the variance of any Gaussian in any of the principal components, and so one expects the mean squared deviation of the projected points from their respective Gaussian centers to be (k−1)σ^2.
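This back-of-the-envelope PCA calculation is easy to simulate (our own sketch; the centers below are scaled standard basis vectors rather than an exact regular simplex, which does not affect the (k−1)σ² prediction):

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n, sigma = 3, 20, 4000, 1.0

# Well-separated "simplex-like" centers: scaled standard basis vectors.
centers = 30.0 * np.eye(m)[:k]
labels = np.repeat(np.arange(k), n)
X = centers[labels] + sigma * rng.normal(size=(k * n, m))

# PCA: project onto the affine subspace spanned by the top k-1 components.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[: k - 1]                       # top (k-1) principal directions
X_proj = mean + (X - mean) @ P.T @ P

# Mean squared deviation of projected points from their true centers.
mse = ((X_proj - centers[labels]) ** 2).sum(axis=1).mean()
assert abs(mse - (k - 1) * sigma**2) < 0.5   # prediction: (k-1) * sigma^2 = 2
```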

By contrast, we find that in practice, the SDP denoises substantially more than PCA does. For example, Figures 1 and 3 illustrate cases in which PCA would not change the data, since the data already lies in low-dimensional space, and yet the SDP considerably enhances the signal. In fact, we observe empirically that the matrix X_D has low rank and that ΦX_D has repeated columns. This doesn't come as a complete surprise, considering SDP optimizers are known to exhibit low rank [Sha82, Bar95, Pat98]. Still, we observe that the optimizer tends to have rank k when clustering points from the mixture model. This is not predicted by existing bounds, and we did not leverage this feature in our analysis, though it certainly warrants further investigation.

Step 3: Rounding. In Section 5, we present a rounding scheme that provides a clustering of the original data from the denoised results of the SDP (Theorem 15). In general, the cost of rounding is a factor of k in the average squared deviation of our estimates. Under the same hypothesis as Corollary 3, we have that there exists a permutation π on {1, …, k} such that

 (1/k) ∑_{i=1}^k ‖v_i − ~γ_{π(i)}‖_2^2 ≲ kK^2σ^2,   (6)

where v_i is what our algorithm chooses as the i-th center estimate. Much like the denoising portion, we also have a more technical result that allows one to replace the right-hand side of (6) with K^2σ^2 for sufficiently nice configurations of subgaussians. As such, we can estimate Gaussian centers with small mean squared error provided the centers are sufficiently separated. This contrasts with the seminal work of Dasgupta [Das99], which gives comparable mean squared error when the pairwise distances scale with the dimension m. As such, our guarantee replaces dimensionality dependence with model order dependence, which is an improvement to the state of the art in the regime k ≲ m. In the next section, we indicate that model order dependence cannot be completely removed when using k-means to estimate the centers.

Before concluding this section, we want to clarify the nature of our approximation guarantee (6). Since centroids correspond to a partition of Euclidean space, our guarantee says something about how "close" our k-means partition is to the "true" partition. By contrast, the usual approximation guarantees for relax-and-round algorithms compare values of the objective function (e.g., the k-means value of the algorithm's output is within a factor of 2 of the minimum). Also, the latter sort of optimal value–based approximation guarantee cannot be used to produce the sort of optimizer-based guarantee we want. To illustrate this, imagine a relax-and-round algorithm for k-means that produces a near-optimal partition with k = 2 for data coming from a single spherical Gaussian. We expect every subspace of co-dimension 1 through the center to separate the data into a near-optimal partition, but these partitions are very different from each other when the dimension m is large, and so a guarantee of the form (6) will not hold.

### 2.2 Fundamental limits of k-means clustering

In Section 5, we provide a rounding scheme that, when applied to the output of the k-means SDP, produces estimates of the subgaussian centers. But how good is our guarantee? Observe the following two issues: (i) The amount of tolerable noise and our bound on the error both depend on k. (ii) Our bound on the error does not vanish as N → ∞.

In this section, we give a conditional result showing that these issues are actually artifacts of k-means; that is, both would arise even if one were to estimate the Gaussian centers with the k-means-optimal centroids (though these centroids might be computationally difficult to obtain). The main trick in our argument is that, in some cases, so-called "Voronoi means" appear to serve as a good proxy for the k-means-optimal centroids. This trick is useful because the Voronoi means are much easier to analyze. We start by providing some motivation for the Voronoi means.

Given X = {x_i}_{i=1}^N ⊆ R^m, let {A_t^{(X)}}_{t=1}^k denote any minimizer of the k-means objective

 ∑_{t=1}^k ∑_{i∈A_t} ‖x_i − (1/|A_t|) ∑_{j∈A_t} x_j‖_2^2,

and define the k-means-optimal centroids by

 c_t^{(X)} := (1/|A_t^{(X)}|) ∑_{j∈A_t^{(X)}} x_j.

(Note that the k-means minimizer is unique for generic X.) Given Γ = {γ_a}_{a=1}^k, then for each a, consider the Voronoi cell

 V_a^{(Γ)} := {x ∈ R^m : ‖x − γ_a‖_2 < ‖x − γ_b‖_2  ∀ b ≠ a}.

Given a probability distribution D over R^m, define the Voronoi means by

 μ_t^{(Γ,D)} := E_{X∼D}[X | X ∈ V_t^{(Γ)}].

Finally, we say Γ ⊆ R^m is a stable isogon if

• ,

• the symmetry group G ≤ O(m) of Γ acts transitively on Γ, and

• for each γ ∈ Γ, the stabilizer G_γ := {Q ∈ G : Qγ = γ} has the property that

 {x ∈ R^m : Qx = x  ∀ Q ∈ G_γ} = span{γ}.

(Examples of stable isogons include regular and quasi-regular polyhedra, as well as highly symmetric frames [BW13].) With this background, we formulate the following conjecture:

###### Conjecture 4 (Voronoi Means Conjecture).

Draw N points X = {x_i}_{i=1}^N independently from D, a mixture of k equally weighted spherical Gaussians of equal variance centered at the points of a stable isogon Γ. Then

 min_{π:[k]→[k] permutation} max_{t∈{1,…,k}} ‖c_t^{(X)} − μ_{π(t)}^{(Γ,D)}‖_2

converges to zero in probability as N → ∞, i.e., the k-means-optimal centroids converge to the Voronoi means.

Our conditional result will only require the Voronoi Means Conjecture in the special case where the γ_a's form an orthoplex (see Lemma 6). Figure 2 provides some numerical evidence in favor of this conjecture (specifically in the orthoplex case). We note that the hypothesis that Γ be a stable isogon appears somewhat necessary. For example, simulations suggest that the conjecture does not hold when Γ is three equally spaced points on a line (in this case, the k-means-optimal centroids appear to be more separated than the Voronoi means). The following theorem provides some insight as to why the stable isogon hypothesis is reasonable:
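The Voronoi means themselves are easy to estimate by Monte Carlo. The following sketch (ours; it does not reproduce the paper's Figure 2 experiment, and σ is chosen arbitrarily) checks the orthoplex case: each estimated Voronoi mean is aligned with its center's axis, as Theorem 5 predicts, with negligible off-axis component:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
k = 2 * m                                  # orthoplex: +-e_1, ..., +-e_m
centers = np.vstack([np.eye(m), -np.eye(m)])
sigma, n = 0.7, 20000

# Sample from the equally weighted Gaussian mixture.
comp = rng.integers(0, k, size=n)
X = centers[comp] + sigma * rng.normal(size=(n, m))

# Assign each sample to its Voronoi cell (nearest center).
d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
cell = d2.argmin(axis=1)

# Monte Carlo estimates of the Voronoi means.
mu = np.array([X[cell == t].mean(axis=0) for t in range(k)])

# Each Voronoi mean is (approximately) a scalar multiple of its center.
for t in range(k):
    on_axis = mu[t] @ centers[t]           # component along the center
    off_axis = mu[t] - on_axis * centers[t]
    assert on_axis > 0.5
    assert np.linalg.norm(off_axis) < 0.05
```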

###### Theorem 5.

Let D be a mixture of equally weighted spherical Gaussians of equal variance centered at the points of a stable isogon Γ = {γ_a}_{a=1}^k. Then there exists c ∈ R such that μ_a^{(Γ,D)} = cγ_a for each a ∈ {1, …, k}.

See Section 6 for the proof. To interpret Theorem 5, consider k-means optimization over the distribution D instead of a large sample drawn from D. This optimization amounts to finding k points C = {c_t}_{t=1}^k in R^m that minimize

 ∑_{t=1}^k E_{X∼D}[‖X − c_t‖_2^2 | X ∈ V_t^{(C)}] P_{X∼D}(X ∈ V_t^{(C)})   (7)

Intuitively, the optimal C is a good proxy for the k-means-optimal centroids when N is large (and one might make this rigorous using the plug-in principle with the Glivenko–Cantelli theorem). What Theorem 5 provides is that, when Γ is a stable isogon, the Voronoi means have the same Voronoi cells as do the γ_a's. As such, if one were to initialize Lloyd's algorithm at the Gaussian centers to solve (7), the algorithm converges to the Voronoi means in one step. Overall, one should interpret Theorem 5 as a statement about how the Voronoi means locally minimize (7), whereas the Voronoi Means Conjecture is a statement about global minimization.

As indicated earlier, we will use the Voronoi Means Conjecture in the special case where Γ is an orthoplex:

###### Lemma 6.

The standard orthoplex Γ = {±e_1, …, ±e_{k/2}} ⊆ R^m, given by the first k/2 columns of I_m and of −I_m, is a stable isogon.

###### Proof.

First, we have that , implying (si1). Next, every γ ∈ Γ can be reached from any γ′ ∈ Γ with the appropriate signed transposition, which permutes Γ and is therefore in the symmetry group G; as such, we conclude (si2). For (si3), pick γ ∈ Γ and let i denote the index of its nonzero entry. Consider the matrix Q := 2e_ie_i^⊤ − I_m, where e_i denotes the i-th identity basis element. Then Q ∈ G_γ, and the eigenspace of Q with eigenvalue 1 is span{e_i} = span{γ}, and so we are done. ∎

What follows is the main result of this subsection:

###### Theorem 7.

Let k be even, and let Γ = {γ_t}_{t=1}^k denote the standard orthoplex of dimension k/2. Then for every σ > 0, either

 σ ≲ Δ_min/√(log k)   or   min_{t∈{1,…,k}} ‖μ_t^{(Γ,D)} − γ_t‖_2 ≳ σ√(log k),

where D denotes the mixture of equally weighted spherical Gaussians of entrywise variance σ^2 centered at the members of Γ.

See Section 7 for the proof. In words, Theorem 7 establishes that one must accept k-dependence in either the data's noise or the estimate's error. It would be interesting to investigate whether other choices of stable isogons lead to stronger k-dependence.

### 2.3 Numerical example: Clustering the MNIST dataset

In this section, we apply our clustering algorithm to the MNIST handwritten digits dataset [LC10]. This dataset consists of 70,000 grayscale images of size 28×28, each reshaped as a vector; 55,000 of them form the training set, 10,000 the test set, and the remaining 5,000 the validation set.

Clustering the raw data gives poor results (due to 4's and 9's being similar, for example), so we first learn meaningful features, and then cluster the data in feature space. To simplify feature extraction, we used the first example from the TensorFlow tutorial [AAB15]. This consists of a one-layer neural network x ↦ σ(Wx + b), where σ is the softmax function, W is a matrix to learn, and b is a vector to learn. As the tutorial shows, the neural network is trained for 1,000 iterations, each iteration using batches of 100 random points from the training set.

Training the neural network amounts to finding W and b that fit the training set well. After selecting these parameters, we run the trained neural network on the first 1,000 elements of the test set, obtaining the feature vectors σ(Wx_i + b), where each is a vector in R^10 representing the probabilities of being each digit. Since each output is a probability vector, its entries sum to 1, and so the feature space is actually 9-dimensional.

For this experiment, we cluster with two different algorithms: (i) MATLAB's built-in implementation of k-means++, and (ii) our relax-and-round algorithm based on the k-means semidefinite relaxation (3). (The results of the latter alternative are illustrated in Figure 3.)

Since each run of k-means++ uses a random initialization that impacts the partition, we ran this algorithm 100 times. In fact, the k-means value of the output varied quite a bit: the all-time low was , the all-time high was , and the median was ; the all-time low was reached in out of the trials. Since our relax-and-round alternative has no randomness, the outcome is deterministic, and its k-means value was , i.e., identical to the all-time low from k-means++. By comparison, the k-means value of the planted solution (i.e., clustering according to the hidden digit label) was 103.5430, and the value of the SDP (which serves as a lower bound on the optimal k-means value) was . As such, not only did our relax-and-round alternative produce the best clustering that k-means++ could find, it also provided a certificate that no clustering has a k-means value that is 1.5% better.

Recalling the nature of our approximation guarantees, we also want to know how well the relax-and-round algorithm's clustering captures the ground truth. To evaluate this, we determined a labeling of the clusters for which the resulting classification exhibited a minimal misclassification rate. (This amounts to minimizing a linear objective over all permutation matrices, which can be relaxed to a generically tight linear program over doubly stochastic matrices.) For k-means++, the all-time low misclassification rate was (again, accomplished by of the trials), the all-time high was , and the median was . As one might expect, the relax-and-round output had a misclassification rate of .
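For small k, the optimal relabeling can also be found by brute force rather than by the linear program mentioned above; the following sketch (ours) computes the minimal misclassification rate over all permutations of cluster labels:

```python
import itertools
import numpy as np

def misclassification_rate(pred, truth, k):
    """Minimum error rate over all relabelings of the k predicted clusters.

    Brute force over permutations; fine for small k. (The paper instead
    relaxes this assignment problem to a linear program over doubly
    stochastic matrices, which is generically tight.)
    """
    # confusion[s, t] = number of points with predicted label s, true label t
    confusion = np.zeros((k, k), dtype=int)
    for p, t in zip(pred, truth):
        confusion[p, t] += 1
    best_agree = max(
        sum(confusion[s, pi[s]] for s in range(k))
        for pi in itertools.permutations(range(k))
    )
    return 1 - best_agree / len(pred)

# Toy usage: labels agree up to the relabeling 0->2, 1->0, 2->1, plus one error.
truth = [2, 2, 0, 0, 1, 1]
pred  = [0, 0, 1, 1, 2, 0]
assert abs(misclassification_rate(pred, truth, 3) - 1/6) < 1e-12
```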

## 3 Proof of Theorem 2

By the following lemma, it suffices to bound Tr((X_D − X_R)(R_Ω + R_{Ω^c})):

###### Lemma 8.

‖X_D − X_R‖_F^2 ≲ (n_min Δ_min^2)^{-1} Tr((X_D − X_R)(R_Ω + R_{Ω^c})).

###### Proof.

First, by Lemma 1, we have ‖X_R‖_F^2 = k. We also claim that ‖X_D‖_F^2 ≤ k. To see this, first note that X_D1 = 1 and X_D ≥ 0, and so each entry of X_Dv can be interpreted as a convex combination of the entries of v. Let v be an eigenvector of X_D with eigenvalue λ, and let i index the largest entry of v (this entry is positive without loss of generality). Then λv_i = (X_Dv)_i ≤ v_i, implying λ ≤ 1. Since the eigenvalues of X_D lie in [0, 1] and sum to Tr(X_D) = k, we may conclude that ‖X_D‖_F^2 = ∑λ^2 ≤ ∑λ = k. As such,

 ‖X_D − X_R‖_F^2 = ‖X_D‖_F^2 + ‖X_R‖_F^2 − 2Tr(X_DX_R)
 ≤ 2k − 2Tr(X_DX_R)
 = 2k + 2Tr((X_R − X_D)X_R) − 2‖X_R‖_F^2
 = 2Tr((X_R − X_D)X_R).   (8)

We will bound (8) in two different ways, and a convex combination of these bounds will give the result. For both bounds, we let Ω denote the indices in the diagonal blocks, Ω^c the indices in the off-diagonal blocks, and Ω_t the indices in the diagonal block for the cluster A_t. In particular, M_Ω denotes the matrix that equals M on the diagonal blocks and is zero on the off-diagonal blocks. For the first bound, we use that X_R = ∑_{t=1}^k (1/n_t)(11^⊤)_{Ω_t}, and that Tr((X_R − X_D)(11^⊤)_{Ω_t}) is non-negative for each t. Recalling that R_Ω = ξ(11^⊤)_Ω, we have

 2Tr((X_R − X_D)X_R) = ∑_{t=1}^k (2/n_t) Tr((X_R − X_D)(11^⊤)_{Ω_t})
 ≥ (2/n_max) Tr((X_R − X_D)(11^⊤)_Ω)
 = −(2/(ξn_max)) Tr((X_D − X_R)R_Ω)   (9)

For the second bound, we first write X_R = (1/n_min)(11^⊤ − (11^⊤)_{Ω^c} − ∑_{t=1}^k (1 − n_min/n_t)(11^⊤)_{Ω_t}). Since (X_D − X_R)1 = 0, we then have

 2Tr((X_R − X_D)X_R)
 = (2/n_min) Tr((X_D − X_R)((11^⊤)_{Ω^c} + ∑_{t=1}^k (1 − n_min/n_t)(11^⊤)_{Ω_t} − 11^⊤))
 = (2/n_min) Tr((X_D − X_R)((11^⊤)_{Ω^c} + ∑_{t=1}^k (1 − n_min/n_t)(11^⊤)_{Ω_t}))
 ≤ (2/n_min) Tr((X_D − X_R)(11^⊤)_{Ω^c})
 = (2/n_min) Tr(X_D(11^⊤)_{Ω^c}),

where the last and second-to-last steps use that (X_R)_{Ω^c} = 0 and that Tr((X_R − X_D)(11^⊤)_{Ω_t}) ≥ 0 for each t, respectively. Next, the entries of R_{Ω^c} are at least ξ + Δ_min^2/2, and X_D ≥ 0, and so we may continue:

 2Tr((X_R − X_D)X_R) ≤ (2/(n_min(ξ + Δ_min^2/2))) Tr(X_DR_{Ω^c}) = (2/(n_min(ξ + Δ_min^2/2))) Tr((X_D − X_R)R_{Ω^c}),   (10)

where again, the last step uses the fact that (X_R)_{Ω^c} = 0. At this point, we have bounds of the form x ≤ ay_1 and x ≤ by_2 (explicitly, (9) and (10)), and we seek a bound of the form x ≤ c(y_1 + y_2). As such, we take the convex combination with weights a^{-1}/(a^{-1} + b^{-1}) and b^{-1}/(a^{-1} + b^{-1}), so that

 x ≤ (a^{-1}/(a^{-1} + b^{-1}))·ay_1 + (b^{-1}/(a^{-1} + b^{-1}))·by_2 = (1/(a^{-1} + b^{-1}))(y_1 + y_2).

Taking a := −2/(ξn_max) and b := 2/(n_min(ξ + Δ_min^2/2)) and combining with (8) then gives

 ‖X_D − X_R‖_F^2 ≤ 2Tr((X_R − X_D)X_R) ≤ ((ξ/2)(n_min − n_max) + (n_min/4)Δ_min^2)^{-1} Tr((X_D − X_R)(R_Ω + R_{Ω^c})),

and choosing ξ sufficiently small and simplifying yields the result. ∎

We will bound Tr((X_D − X_R)(R_Ω + R_{Ω^c})) in terms of the following: For each real symmetric N×N matrix M, let F(M) denote the value of the following program:

 F(M) = maximum    |Tr(MX)|   (11)
 subject to  Tr(X) = k,  X1 = 1,  X ≥ 0,  X ⪰ 0

Put ˜D := P_{1^⊥}DP_{1^⊥} and ˜R := P_{1^⊥}RP_{1^⊥}, where P_{1^⊥} := I − (1/N)11^⊤ denotes the orthogonal projection onto the orthogonal complement of the all-ones vector. Then Tr((X_D − X_R)(R_Ω + R_{Ω^c})) ≤ 2F(˜D − ˜R).

###### Proof.

Since X_D and X_R are both feasible in (11), we have

 −Tr(˜DX_D) + Tr(˜RX_D) ≤ |Tr((˜D − ˜R)X_D)| ≤ F(˜D − ˜R),
 Tr(˜DX_R) − Tr(˜RX_R) ≤ |Tr((˜D − ˜R)X_R)| ≤ F(˜D − ˜R),

and adding, followed by the reverse triangle inequality, gives

 2F(˜D − ˜R) ≥ (Tr(˜DX_R) − Tr(˜DX_D)) + (Tr(˜RX_D) − Tr(˜RX_R)).   (12)

Write ˜X_D := X_D − (1/N)11^⊤. Note that X_D1 = 1 implies P_{1^⊥}X_DP_{1^⊥} = ˜X_D, and so

 Tr(˜DX_D) = Tr(D˜X_D) = Tr(D(X_D − (1/N)11^⊤)) = Tr(DX_D) − (1/N)1^⊤D1.

Similarly, Tr(˜DX_R) = Tr(DX_R) − (1/N)1^⊤D1, and so

 Tr(˜DX_R) − Tr(˜DX_D) = Tr(DX_R) − Tr(DX_D) ≥ 0,

where the last step follows from the optimality of X_D. Similarly, Tr(˜RX_D) − Tr(˜RX_R) = Tr(RX_D) − Tr(RX_R), and so (12) implies the result. ∎

Now it suffices to bound F(˜D − ˜R). For an n_1 × n_2 matrix X, consider the matrix norm

 ‖X‖_{1,∞} := ∑_{i=1}^{n_1} max_{1≤j≤n_2} |X_{i,j}| = ∑_{i=1}^{n_1} ‖X_{i,·}‖_∞.
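In code, this mixed norm is a one-liner (numpy, ours):

```python
import numpy as np

def norm_1_inf(X):
    """The mixed norm ||X||_{1,inf}: sum over rows of the row-wise max |entry|."""
    return np.abs(X).max(axis=1).sum()

X = np.array([[1.0, -3.0], [0.5, 0.25]])
assert norm_1_inf(X) == 3.5   # |-3| + 0.5
```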

The following lemma will be useful:

F(M) ≤ min{‖M‖_{1,∞}, min{k, r}‖M‖_{2→2}}, where r denotes the rank of M.

###### Proof.

The first bound follows from the classical version of Hölder's inequality (recalling that X ≥ 0 and X1 = 1 by design):

 |Tr(MX)| ≤ ∑_{i=1}^N ∑_{j=1}^N |M_{i,j}X_{i,j}| ≤ ∑_{i=1}^N ‖M_{i,·}‖_∞ (∑_{j=1}^N |X_{i,j}|) = ∑_{i=1}^N ‖M_{i,·}‖_∞ = ‖M‖_{1,∞}.

The second bound is a consequence of von Neumann's trace inequality: if the singular values of M and X are α_1 ≥ ⋯ ≥ α_N and β_1 ≥ ⋯ ≥ β_N, respectively, then

 |Tr(MX)| ≤ ∑_{i=1}^N α_iβ_i.

Since X is feasible in (11), we have Tr(X) = k and 0 ⪯ X ⪯ I, so that β_i ∈ [0, 1] and ∑_{i=1}^N β_i = k. Using that α_i ≤ ‖M‖_{2→2} and that α_i = 0 for i > r, we get

 |Tr(MX)| ≤ ‖M‖_{2→2} ∑_{i=1}^r β_i ≤ min{k, r}‖M‖_{2→2}. ∎
###### Proof of Theorem 2.

Write r_{a,i} := x_{a,i} − γ_a. Then

 (D_ab)_ij = ‖x_{a,i} − x_{b,j}‖_2^2 = ‖(r_{a,i} + γ_a) − (r_{b,j} + γ_b)‖_2^2 = ‖r_{a,i} − r_{b,j}‖_2^2 + 2⟨r_{a,i} − r_{b,j}, γ_a − γ_b⟩ + ‖γ_a − γ_b‖_2^2.

Furthermore,

 ‖r_{a,i} − r_{b,j}‖_2^2 = ‖r_{a,i}‖_2^2 − 2⟨r_{a,i}, r_{b,j}⟩ + ‖r_{b,j}‖_2^2 = ((μ1^⊤ − 2G^⊤G + 1μ^⊤)_{ab})_{ij},

where G is the m×N matrix whose (a,i)-th column is r_{a,i}, and μ is the column vector whose (a,i)-th entry is ‖r_{a,i}‖_2^2. Recall that

 (R_ab)_ij = ξ + Δ_ab^2/2 + max{0, Δ_ab^2/2 + 2⟨r_{a,i} − r_{b,j}, γ_a − γ_b⟩}

Then D = R + μ1^⊤ − 2G^⊤G + 1μ^⊤ − ξ11^⊤ + F, where

 (F_ab)_ij := { Δ_ab^2/2 + 2⟨r_{a,i} − r_{b,j}, γ_a − γ_b⟩   if 2⟨r_{a,i} − r_{b,j}, γ_a − γ_b⟩ ≤ −Δ_ab^2/2;  0 otherwise.

Considering Lemma 10 and that ˜D − ˜R = P_{1^⊥}(−2G^⊤G + F)P_{1^⊥} (the rank-one terms vanish under the projection), we will bound

 F(˜D − ˜R) ≤ min{k, m}‖P_{1^⊥}G^⊤GP_{1^⊥}‖_{2→2} + (1/n_min)‖P_{1^⊥}FP_{1^⊥}‖_{1,∞}.   (13)

For the first term:

 ‖P_{1^⊥}G^⊤GP_{1^⊥}‖_{2→2} ≤ ‖G^⊤G‖_{2→2} = ‖G^⊤‖_{2→2}^2.

Note that if the rows , of come from a distribution with second moment matrix , then has the same distribution as , where