1 Introduction
Learning Mixture with Missing Values. Consider a mixture of distributions over with mixing weights , where and . That is to say, each sample (in ) is drawn from
with probability
for . In this work, we are interested in the problem of clustering data into groups so that subsequently we can learn the distribution ’s.There has been a very large body of works providing algorithms for clustering mixture models. A line of works shows that spectral algorithms can recover the underlying clustering as long as each cluster has bounded covariance and the cluster centers are sufficiently separated. Other lines of works study mixtures with the cluster means in general position, but those works require certain stringent assumptions on the distribution –Gaussians, product distributions, etc. Such assumptions are crucial in most of the works and the works are unlikely to extend to general distributions.
Challenge. In addition, in many practical applications, the data sample could have missing values especially when the data is high dimensional. To the best of our knowledge, none of the prior works in clustering mixtures consider this challenge. In this work, our primary interest is in overcoming this limitation.
Matrix Estimation. The goal is to find an estimate of an unknown matrix based on a noisy observation of with potentially missing values. Without any structural assumptions on , there is no hope for restoring faithfully as each entry in can take arbitrary values. One prominent example of such structural assumptions is the lowrank assumption; that is to say, the underlying matrix has relatively small rank compared to its size (). There has been numerous works on matrix estimation with lowrank assumptions, especially in the context of matrix completion where the objective is to fill in the missing values of the observation.
Matrix estimation is the protagonist of this work as it provides a disciplined way for coping with observations with missing values. In particular, we transform the question of clustering observations from a mixture distribution with missing values to that of matrix estimation as follows: first, stack the
samples (row vectors in
) to form a data matrix and run a matrix estimation algorithm to obtain ; next, cluster the ‘trimmed’ samples to recover the hidden mixture structure. The main claim of this work is that the samples can be clustered from a mixture even when there are missing values in observation, so long as the cluster means are reasonably separated from each other.Challenge. In order to argue that the above mentioned procedure yields desirable clustering, there are two challenges. One, the entries across columns in a given row are correlated while the entirety of matrix estimation literature considers entries of observed matrix to be independent. Two, to control the estimation error for each row of the matrix separately (recall that the rows of the matrix of our interest represent sample instances). However, the traditional measures such as only provides a control over the average estimation error. Specifically, we wish to bound rowwise error of estimation, for , instead of the error averaged over the rows. The maximum row norm of matrix estimation is a perfect fit for the purpose.
In this work, our interest is overcoming these two challenges in the matrix estimation literature to both advance the stateofart for matrix estimation as well as achieve the goal of learning generic mixture distribution from observations with missing values.
1.1 Our Contribution
As the main contribution of our work, we resolve the above mentioned challenges for both matrix estimation and learning mixtures.
In the context of matrix estimation: (a) We extend the noise model by allowing correlation across the columns, while the independence across the rows is still assumed; (b) we analyze the effect of missing values in matrix estimation to conclude that it is dominated by a subgaussian distribution when the parameter matrix is sufficiently incoherent; and (c) we provide an upper bound on maximum row norm of (hard or soft) singular value thresholding matrix estimation, which is a stronger guarantee than the traditional mean squared error (=expectation of squared Frobenius norm) bound.
In the context of mixture learning, we introduce and analyze a spectral clustering algorithm for mixture models with missing values in data. Specifically, by simply viewing the matrix obtained by stacking observations (with missing values) as rows of a matrix and applying matrix estimation to this matrix equipped with above mentioned generalizations, we argue that all the row indices (i.e. observations) can be clustered correctly so that we can subsequently learn mixture parameters from it.
In summary, by extending the analysis of matrix estimation from literature and by connecting mixture distribution learning to matrix estimation, we are able to provide a method for learning (clustering, to be precise) generic mixture distribution when observations have missing values.
Informal version of results. We obtain a vanishing upper bound on the normalized max row norm if the fraction of observed entries is not too small. Let be the fraction of observed entries in the matrix, , and . Then (see (9) for more details)
If we assume and , then .
For comparison, is required to obtain vanishing Frobenius norm error, and the minimax rate for Frobenius norm error is [KLT11]. We do not know whether this discrepancy implies achieving vanishing row norm is strictly harder than achieving vanishing Frobenius norm error (worstcase error vs averagecase error), or it implies the suboptimality of our result.
For mixture learning, we show that perfect clustering is possible with the minimum mean separation of with high probability.
1.2 Related Work
Mixture model. Learning mixtures has a long history, dating back to Pearson [Pea94]. Most works on this problem is based on the assumption that the cluster centers are sufficiently separated.
Since an algorithmic question of how to provably learn the true parameters of a Gaussian mixture based on a polynomial number of random samples is posed in the seminal work of Dasgupta [Das99], many algorithms were suggested and analyzed [SK01, VW02, AM05, BV08, KK10, AS12, BWY17, DTZ16, KSV08, KMV10, MV10, BS10, HK13, BCMV14, ABG14, GHK15].
One prominent line of works is based on the clusteringandthenlearn approach. Vempala and Wang [VW02] suggests a two step procedure of PCA on data matrix followed by distancebased clustering to learn mixtures of spherical Gaussians, which naturally extends to isotropic and logconcave distributions. This result is generalized in subsequent works [AM05, BV08, KK10, AS12]. For example, the work of Kumar and Kannan [KK10], further improved by the subsequent work of Awasthi and Sheffet [AS12] shows that a variant of spectral clustering (PCA followed by Lloyd’s algorithm) can recover the hidden components as long as the cluster centers are separated by and each cluster has bounded covariance. This is the best known result in spectral clustering.
Some known results can improve this minimum separation requirement to [SK01, DS07]
. Also, there are other lines of influential works, based on the methodofmoments or EM algorithms, which study mixtures with the cluster means in general position
[DS07, BWY17, DTZ16, KSV08, KMV10, MV10, BS10, HK13, BCMV14, ABG14, GHK15]. However, the assumptions, such as Gaussianity or independence, is crucial in the algorithm and analysis, hence, it is unlikely to generalize beyond such assumptions for a more general class of distributions.Recently, Regev and Vijayaraghavan [RV17] show efficient learning of Gaussian mixture based on a polynomial number of samples is possible with separation of via an iterative method. There are other works based on SumofSquare proofs and robust estimation ideas [DKK16, KS17, HL18], which operate under similar minimum separation requirements. These works break the barrier for mean separation by utilizing certificates for the behavior of higherorder (higher than 2) moments at the expense of increased sample and time complexity.
Nevertheless, none of the aforementioned works consider solving mixtures with missing values. To the best of our knowledge, our work is the first to addresses this challenge.
Matrix estimation. There has been a plethora of works on matrix completion and estimation. Early works utilizing spectral analysis can be found in several literature in engineering [AFK01, AM07], and they are followed by a large number of works leading to [KMO10a, KMO10b].
On the other hand, there has been attempts to solve matrix completion problems under appropriate model assumptions [Faz02, Sre04, RV07]. Possibly, the concurrent success of compressed sensing [CRT06, Don06] has led to rapid advances in this line of works. Beginning with the pioneering work of Candès and Recht [CR09], the technique of matrix completion by solving a convex optimization problem is introduced [CCS10, CP10, CT10] and gain substantial popularity [DPVDBW14, K11, KLT11, MHT10, NW11, RT11, C15].
This type of method has the advantage of exactly recovering the matrix entries in the noiseless setting, as long as the underlying matrix has low rank and a certain ‘incoherence’ condition is satisfied. However, we cannot expect exact recovery in the noisy setting where observed entries are corrupted with noise. Typically, the recovery guarantee is given in terms of the Frobenius norm (or socalled mean squared error, which is the expectation of squared Frobenius norm), which is the matrix analogue of norm.
2 Model and Problem Statement
2.1 Notation
For a matrx
, we write the Singular Value Decomposition (SVD) of
as , where with the singular values of , in the descending order. The spectral norm of a matrix is given as .For a positive integer , we let . For each , we let denote the th row of the matrix , viewed as a rowvector. For each , we let denote the th column of the matrix , viewed as a columnvector.
When
is viewed as a random matrix, let
. Then, is the th row of . We shall use as the column representation of , i.e. . We will utilize to represent mean of mixture component . And we shall utilize notation to represent in row form, for .2.2 Generative Model
We start by describing a generic data generative model. The mixture model with missing values will be a special instance of this model.
Generic Model. Consider a random matrix . The rows of are independent random vectors in . Let be the mean of row of and let denote the mean matrix. Note that is possibly unknown, but is deterministic in our model.
The data is observed through a ‘mask’. Precisely, let where be a mask matrix with being such that
Here, is a symbol to denote the value is unknown. We shall assume that is a Bernoulli() mask matrix for , i.e. with probability and (i.e. ) with probability ; all s are independent of each other. The generative model is illustrated with a hierarchical diagram in Figure 1.
Mixture model as a special instance. Consider a mixture of distributions over with mixing weights , where and . Let each is drawn as per this mixture distribution. Then the matrix has rows, each of which is from the set of size , where is the mean of random vector when drawn per mixture , . That is, not only has rank , the rows themselves are one of the different possibility.
2.3 Model Assumptions
We assume . We assume the masking matrix , the parameter matrix and the distribution for random matrix satisfy the following conditions.

(i.i.d. Bernoulli mask) For all and ,

(boundedness) such that .

(subgaussianity) There exists such that the following inequality holds for all :
The subgaussian assumption (4) implies upper bounds on the central moments. In particular, the second central moment (=covariance) satisfies (c.f. Lemma G.2)
Remarks. We remark on the assumptions made here. The Bernoulli mask is standard in the literature. The boundedness assumption or similar is required to constrain the ‘model complexity’ for estimation. The subGaussianity assumption is not very stringent (or, is standard) in the pertinent literature of matrix estimation as well as mixture learning. The significant SNR assumption is natural as well. For example, suppose , i.e., is a rank1 matrix with each entry of it being . That is,
. On the other hand, random matrix with Gaussian noise of variance
will have spectral norm with high probability. Therefore, requiring singular values of to be is simply requiring it to be above ‘noise’ by (at least) factor. Indeed, removing this benign need of additional factor remains important open direction beyond this work.2.4 Problem Statement
Here are two problems of our interest in this work. [matrix estimation with dependent noise] Can we achieve a control over the worst case estimation error (in the sense of the maximum row norm) in matrix estimation? If so, is it still possible to obtain a ‘good’ matrix estimator even when the noise is dependent across the columns?
[clustering mixture with missing values] Can we cluster general (=nonisotropic) subgaussian mixture when a ‘reasonable’ separation condition is satisfied?
3 Effect of Missing Values
In this section, we argue that the data matrix can be viewed as the sum of scaled parameter matrix and a perturbation. Specifically, we define
as the perturbation matrix. Then has independent, mean zero, subgaussian rows.
First of all, it is easy to verify (Lemma A.1). Next, we identify the covariance of . Given a generative model for with sampling probability as described in section 2.2, we refer to
as the covariance matrix of . We show in Lemma A.1 that
The main message in this section is the following proposition about an upper bound on . For every ,
where
(1) 
Here, is an absolute constant (see section G.1).
4 Matrix Estimation: Bounding Max Row Norm
In this section, we argue that the singular value thresholding (soft and hard) algorithm known in the literature has the desired property — that is, it bounds the max row norm. We do so by providing novel analysis of the algorithm summarized as the main result in this section. This, in effect, resolves Problem 1 stated in Section 2.4.
Singular Value Thresholding Algorithm.
Given input matrix , the algorithm produces estimated matrix as follows:

Let .

Take the singular value decomposition of .

Output or as defined in Eqs. (12) or (13) using for a sufficiently large constant ^{1}^{1}1 Later, we will see in Corollary C.3 that having in the range
Constants and Parameters.
In this work, we utilize various constants consistently throughout. To begin with, we utilize
(2)  
(3)  
(4) 
The constant utilized in the algorithm description needs to be sufficiently large. Specifically, we assume
(5) 
while stating our results in the main body of this paper, where is an absolute constant that appears in Lemma H.1.
Result.
We state a simplified version of the main result about the bound on the max row norm for the singular value thresholding algorithm (soft and hard) under the setup described in sections 2.22.3. For a complete exposition of our analysis, including the choice of used in algorithm and the complete version of the main theorem, please see Appendices B through E.
Suppose that with as described in Algorithm. Then for appropriate constants (precise choice differ for softSVT and hardSVT) and ,
(6)  
(7) 
with probability at least . This theorem is obtained as a Corollary of a general version of the Theorem (Theorem E.1), which is stated and proved in Appendix E. The proof of Theorem E.1 utilizes technical lemmas stated and proved in Appendix C and intermediate results stated and proved in Appendix D.
4.1 Interpretation of Theorem 4
Eq. (6) and (7) suggest that the estimation error with respect to the maximum row norm is bounded above by sum of the four terms (two in (6) and two in (7)). Note that every term decreases as .
Our analysis takes divide and conquer strategy to obtain a deviation inequality for the error with aid of probabilistic machinery, e.g., the union bound. Each term but the third represents a threshold required for highprobability conditioning at certain stage in our analysis.
For example, the first term, , is introduced to ensure the stability of top singular subspace (see in (50) and Proposition D.2). This term quickly becomes insignificant as grows; specifically, provided that .
The second term, is introduced to assure with high probability. Observe that this term is as long as .
The third term, , is the essential component in our upper bound. We observed is subgaussian with in Proposition 3. Therefore, is a dimensional subgaussian random vector with . The typical norm of the projection of such a random vector onto an dimensional subspace is with high probability. At an extra expense of , we obtain a high probability upper bound on the maximum norm of the projection among independent random vectors.
However, the preceding argument requires the subspace to be independent of the random vector. In our setup, the top singular subspace is dependent on for all . The last term, , is introduced as an amendment to decouple the randomness in a target data vector and the randomness in the dimensional subspace to project on. We take ‘leaveoneout’ style approach for decoupling (see in (33) and Proposition D.1). We suspect this analysis is suboptimal and there could be room for improvement. At any rate, we observe that
(8) 
provided that .
In short, when is sufficiently large,
(9) 
We obtain a vanishing upper bound on the normalized max row norm if the fraction of observed entries is not too small.
Now suppose that and . Then
Also, in such a setting, the condition can be simplified to
Assuming , is required. Lastly, we remark (8) reduces to
(10) 
A Simpler Version of Theorem 4 Assuming .
When is sufficiently large, we can prove a simpler (and sharper) version of Theorem 4. The following theorem is obtained as a corollary of Theorem E.1.1.
Suppose that with as described in Algorithm. Then for appropriate constants (precise choice differ for softSVT and hardSVT) and ,
with probability at least .
Why Do We Need Both Theorems 4 and 4.1?
The upper bounds stated in each theorems carry adhoc terms: in Theorem 4; and in Theorem 4.1. Both terms become when is sufficiently large. However, neither of those terms is superior to the other over the entire range of and . Specifically, Theorem 4 ensures a vanishing upper bound on the normalized max norm error for a wider range of , whereas Theorem 4.1 provides a sharper (and more intuitive) upper bound when .
4.2 More on Singular Value Thresholding
Details on Singular Value Thresholding.
Here we describe the well known soft and hard Singular Value Thresholding (SVT) operation for completeness. To that end, let the SVD of and be
Since for , we want our estimator to have a similar lowrank structue. However, the rank of is not known a priori. Therefore, instead of limiting the number of components to retain, many practical algorithms choose a thresholding value, say, and keep the components of with . There are two types of singular value thresholding (SVT): soft SVT and hard SVT.
The soft SVD of at level is
(12) 
and the hard SVD of is
(13) 
Induced Thresholding Operator.
The soft and hard thresholding per Eqs. (12) and (13) suggests the following operators . Specifically, given matrix and threshold , consider and defined as
(14)  
(15) 
where denotes the projection onto ; specifically, when as a row vector representation, . If we take the column vector representation, it would be natural to consider ^{2}^{2}2These are just two different representations of the same projection operation, and we shall switch between the row vector representation (left multiplication) and the column vector representation (right multiplication) for convenience of explanation; however it will be clear from the context whether we are using row or column representation.. Now, observe that for all ,
That is, the soft and hard thresholding are simply the corresponding operators applied on the rows of .
5 Mixture Clustering with Missing Values
In this section, we describe the algorithm for clustering observations (=data vectors) generated from mixture distribution. We argue that algorithm clusters the observations correctly, i.e. all observations in any learned cluster are from the same mixture component even when the observations have missing values as long as sufficiently many data points are observed. This answers the Problem 2.4 as stated in Section 2.4.
Some definitions.
We recap some definitions here.

Weight: .

Cluster mean: . We write .

Covariance:

norm:
Recall that and for all from our subgaussian assumption, and does not grow as the dimension increases.
Notations.
Let denote an oracle map such that denotes the index of the component which the th row of is drawn from. In other words, for all .
Distancebased Clustering Algorithm.
As defined in Section 2.2, we observe random vectors with missing values denoted as the rows of matrix . The goal of the algorithm is to cluster these row indices into disjoint clusters, i.e. sets so that , for all . The algorithm does not know the number of mixture components, . The ideal output would be where and contains precisely the row that correspond to mixture component (of course, up to relabeling of the mixture components).
We propose a clustering algorithm that performs in two steps: one, matrix estimation on to produce , and two, cluster rows of based on distance. The details are as follows.

Clustering: initialize .

Set the cutoff value
where , and are estimated upper bounds for (when are not known a priori).

Choose one and construct its neighbor

Update .

Repeat (b)  (c) when ; are left.

For which belongs to multiple ’s:

Find .

for .

Update , for all .


Result.
Let . We state the following intermediate result about clustering algorithm. Let as in algorithm described above. If and
(16) 
where are the same constants as in Theorem 4 (precise choice differ for softSVT and hardSVT), then
with probability at least . The proof of Theorem 5 can be found in section F. The following is an immediate Corollary of Theorem 5.
Under the same setup as in Theorem 5, the distancebased clustering algorithm correctly group the indices with probability at least for both soft and hard thresholding.
Minimum Requirement for and .
Suppose that . Now we consider is fixed in (16); then it follows that
(17) 
Since , it is clear that and are required. If we rewrite the asymptotic inequalities for , they read as
(18) 
We suspect the second inequality is suboptimal due to our analysis.