1 Introduction
Consider the following generative model for sampling from a mixture of product distributions over discrete domains. We use to denote the number of components in the mixture, to denote the size of the discrete output alphabet in each coordinate, and to denote the total number of coordinates. Each sample belongs to one of components, and conditioned on its component the dimensional discrete sample is drawn from some distribution . Precisely, the model is represented by the nonnegative weights of the components that sum to one, and the distributions . We use an
dimensional binary random vector
to represent a sample . For , the th coordinate is an dimensional binary random vector such thatwhere for some is the standard coordinate basis vector.
When a sample is drawn, the type of the sample is drawn from such that it has type
with probability
. Conditioned on this type, the sample is distributed according to , such that ’s are independent, hence it is a product distribution, and distributed according towhere is the th entry of the vector . Note that using the binary encoding, , and . Also, we let represent the distribution in the th coordinate such that . Then, the discrete distribution can be represented by the matrix and the weights .
This mixture distribution (of wise discrete distributions over product spaces) captures as special cases the models used in several problems in domains such as crowdsourcing [DS79], genetics [SRH07], and recommendation systems [TM10]. For example, in the crowdsourcing application, this model is same as the popular Dawid and Skene [DS79] model: represents answer of the th worker to a multiple choice question (or task) of type . Given the ground truth label , each of the worker is assumed to answer independently. The goal is to find out the “quality” of the workers (i.e. learn ) and/or to learn the type of each question (clustering).
We are interested in the following two closely related problems:

Learn mixture parameters and accurately and efficiently.

Cluster the samples accurately and efficiently?
Historically, however, different algorithms have been proposed depending on which question is addressed. Also, for each of the problems, distinct measures of performances have been used to evaluate the proposed solution. In this paper, we propose an efficient method to address both questions.
The first question of estimating the underlying parameters of the mixture components has been addressed in [KMR94, FM99, FOS08], where the error of a given algorithm is measured as the KLdivergence between the true distribution and the estimated distribution. More precisely, a mixture learning algorithm is said to be an accurate learning algorithm, if it outputs a mixture of product distribution such that the following holds with probability at least :
where are any given constants, and denote the random vectors distributed according to the true and the estimated mixture distribution, respectively. Furthermore, the algorithm is said to efficient if its time complexity is polynomial in , , , , and .
This Probably Approximately Correct (PAC) style framework was first introduced by Kearns et al. [KMR94], where they provided the first analytical result for a simpler problem of learning mixtures of Hamming balls, which is a special case of our model with . However, the running time of the proposed algorithm is superpolynomial and also assumes that one can obtain the exact probability of a sample . Freund and Mansour [FM99] were the first to addressed the sample complexity, but for the restrictive case of and . For this case, their method has running time and sample complexity . Feldman, O’Donnell, and Servedio in [FOS08] generalized approach of [FM99] to arbitrary number of types and arbitrary number of output labels . For general , their algorithm requires running time scaling as . Hence, the proposed algorithm is an efficient learning algorithm only for finite values of and .
A breakthrough in Feldman et al.’s result is that their result holds for all problem instances, with no dependence on the minimum weight or the condition number , where is the
th singular value of
, and is a diagonal matrix with the weights in the diagonals. However, this comes at a cost of running time scaling exponentially in both and , which is unacceptable in practice for any value of beyond two. Further, the running time is exponential for all problem instances, even when the problem parameters are wellbehaved, with finite condition number.In this paper, we alleviate this issue by proposing an efficient algorithm for wellbehave mixture distributions. In particular, we give an algorithm with polynomial running time, and prove that it gives accurate estimate for any problem instance that satisfy the following two conditions: ) the weight is strictly positive for all ; and ) the condition number is bounded as per hypotheses in Theorem 3.3.
The existence of an efficient learning algorithm for all problem instances and parameters still remains an open problem, and it is conjectured in [FOS08] that “solving the mixture learning problem for any would require a major breakthrough in learning theory”.
General and  
WAM[FOS08], Algorithm 1  Algorithm 1  
General cond. number  WAM [FOS08]  Open 
The second question finding the clusters has been addressed in [CHRZ07, CR08]. Chaudhuri et al. in [CHRZ07] introduced an iterative clustering algorithm but their method is restricted to the case of a mixture of two product distributions with binary outputs, i.e. and . Chaudhuri and Rao in [CR08] proposed a spectral method for general . However, for the algorithm to correctly recover cluster of each sample w.h.p, the underlying mixture distribution should satisfy a certain ‘spreading’ condition. Moreover, the algorithm need to know the parameters characterizing the ‘spread’ of the distribution, which typically is not available apriori. Although it is possible to estimate the mixture distribution, once the samples are clustered, Chaudhuri et al. provides no guarantees for estimating the distribution. As is the case for the first problem, for clustering also, we provide an efficient algorithm for general , under the assumption that the condition number of to be bounded. This condition is not directly comparable with the spreading condition assumed in previous work. Our algorithm first estimates the mixture parameters and then uses the distance based clustering method of [AK01].
Our method for estimating the mixture parameters is based on the moment matching technique from [AHK12, AGMS12]. Typically, second and third (and sometimes fourth) moments of the true distribution are estimated using the given samples. Then, using the spectral decomposition of the second moment one develops certain whitening operators that reduce the higherorder moment tensors to orthogonal tensors. Such higher order tensors are then decomposed using a powermethod based method [AGH12] to obtain the required distribution parameters.
While such a technique is generic and applies to several popular models [HK13, AGH12], for many of the models the moments themselves constitute the “correct” intermediate quantity that can be used for whitening and tensor decomposition. However, because there are dependencies in the wise model (for example, to are correlated), the higherorder moments are “incomplete” versions of the intermediate quantities that we require (see (1), (2)). Hence, we need to complete these moments so as to use them for estimating distribution parameters .
Completion of the “incomplete” second moment, can be posed as a lowrank matrix completion problem where the blockdiagonal elements are missing. For this problem, we propose an alternating minimization based method and, borrowing techniques from the recent work of [JNS13], we prove that alternating minimization is able to complete the second moment exactly. We would like to note that our alternating minimization result also solves a generalization of the lowrank+diagonal decomposition problem of [SCPW12]. Moreover, unlike tracenorm based method of [SCPW12]
, which in practice is computationally expensive, our method is efficient, requires only one Singular Value Decomposition (SVD) step, and is robust to noise as well.
We reduce the completion of the “incomplete” third moment to a simple least squares problem that is robust as well. Using techniques from our second moment completion method, we can analyze an alternating minimization method also for the third moment case as well. However, for the mixture problem we can exploit the structure to reduce the problem to an efficient least squares problem with closed form solution.
Next, we present our method (see Algorithm 1) that combines the estimates from the above mentioned steps to estimate the distribution parameters (see Theorem 3.2, Theorem 3.3). After estimating the model parameters , and , we also show that the KLdivergence measure and the clustering error measure can also be shown to be small. In fact the excess error vanishes as the number of samples grow (see Corollary 3.4, Corollary 3.5).
2 Related Work
Learning mixtures of distributions is an important problem with several applications such as clustering, crowdsourcing, community detection etc. One of the most well studied problems in this domain is that of learning a mixture of Gaussians. There is a long list of interesting recent results, and discussing the literature in detail is out side of the scope of this paper. Our approach is inspired by both spectral and momentmatching based techniques that have been successfully applied in learning a mixture of Gaussians [VW04, AK01, MV10, HK13].
Another popular mixture distribution arises in topic models, where each word is selected from a sized dictionary. Several recent results show that such a model can also be learned efficiently using spectral as well as moments based methods [RSS12, AHK12, AGM12]. However, there is a crucial difference between the general mixture of product distribution that we consider and the topic model distribution. Given a topic (or question) , each of the words in the topic model have exactly the same probability. That is, for all . In contrast, for our problem, , in general.
Learning mixtures of discrete distribution over product spaces has several practical applications such as crowdsourcing, recommendation systems, etc. However, as discussed in the previous section, most of the existing results for this problem are designed for the case of small alphabet size or the number of mixture components . For several practical problems [KOS13], can be large and hence existing methods either do not apply or are very inefficient. In this work, we propose first provably efficient method for learning mixture of discrete distributions for general and .
Our method is based on tensor decomposition methods for moment matching that have recently been made popular for learning mixture distributions. For example, [HK13] provided a method to learn mixture of Gaussians without any separation assumption. Similarly, [AHK12]
introduced a method for learning mixture of HMMs, and also for topic models. Using similar techniques, another interesting result has been obtained for the problem of independent component analysis (ICA)
[AGMS12, GR12, HK13].Typically, tensor decomposition methods proceed in two steps. First, obtain a whitening operator using the second moment estimates. Then, use this whitening operator to construct a tensor with orthogonal decomposition, which reveals the true parameters of the distribution. However, in a mixture of way distribution that we consider, the second or the third moment do not reveal all the “required” entries, making it difficult to find the standard whitening operator. We handle this problem by posing it as a matrix completion problem and using an alternating minimization method to complete the second moment. Our proof for the alternating minimization method closely follows the analysis of [JNS13]. However, [JNS13] handled a matrix completion problem where the entries are missing uniformly at random, while in our case the block diagonal elements are missing.
2.1 Notation
Typically, we denote a matrix or a tensor by an uppercase letter (e.g. ) while a vector is denoted by a smallcase letter (e.g. ). denotes the th column of matrix . denotes the th entry of matrix and denotes the th entry of the third order tensor . denotes the transpose of matrix , i.e., . denotes the set of first integers. denotes the th standard basis vector.
If , then () denotes the th block of , i.e., to th rows of . The operator denotes the outer product. For example, denote a rankone tensor such that . For a symmetric thirdorder tensor , define an dimensional operation with respect to a matrix as
denotes the spectral norm of a tensor . That is, . denotes the Frobenius norm of , i.e., . We use to denote the singular value decomposition (SVD) of , where denotes the th singular value of . Also, wlog, assume that .
3 Main results
In this section, we present our main results for estimating the mixture weights and the probability matrix of the mixture distribution. Our estimation method is based on the momentmatching technique that has been popularized by several recent results [AHK12, HKZ12, HK13, AGH12]. However, our method differs from the existing methods in the following crucial aspects: we propose a matrix completion approach to estimate the second moments from samples (Algorithm 2); and a least squares approach with an appropriate change of basis to estimate the third moments from samples (Algorithm 3). These approaches provide robust algorithms to estimating the moments and might be of independent interest to a broad range of applications in the domain of learning mixture distributions.
The key step in our method is estimation of the following two quantities:
(1)  
(2) 
where is a diagonal matrix s.t. .
Now, as is standard in the moment based methods, we exploit spectral structure of to recover the latent parameters and . The following theorem presents a method for estimating , assuming are estimated exactly:
Theorem 3.1.
Let be as defined in (1), (2). Also, let
be the eigenvalue decomposition of
. Now, define . Let ,be the eigenvectors and eigenvalues obtained by the orthogonal tensor decomposition of
(see [AGH12]), i.e., . Then,where is a diagonal matrix with .
The above theorem reduces the problem of estimation of mixture parameters to that of estimating and . Typically, in moment based methods, tensors corresponding to and can be estimated directly using the second moment or third moment of the distribution, which can be estimated efficiently using the provided data samples. In our problem, however, the blockdiagonal entries of and cannot be directly computed from these sample moments. For example, the expected value of a diagonal entry at th coordinate is , where as the corresponding entry for is .
To recover these unknown blockdiagonal entries of , we use an alternating minimization algorithm. Our algorithm writes in a bilinear form and solves for each factor of the bilinear form using the computed offdiagonal blocks of . We then prove that this algorithm exactly recovers the missing entries when we are given the exact second moment. For estimating , we reduce the problem of estimating unknown blockdiagonal entries of to a least squares problem that can be solved efficiently.
Concretely, to get a consistent estimate of , we pose it as a matrix completion problem, where we use the offblockdiagonal entries of the second moment, which we know are consistent, to estimate the missing entries. Precisely, let
be the indices of the offblockdiagonal entries, and define a masking operator as:
(3) 
Now, using the fact that has rank at most , we find a rank estimate that explains the offblockdiagonal entries using an alternating minimization algorithm defined in Section 4.
(4) 
where is the set of observed samples, and is the number of iterations. We use the first half of the samples to estimate and the rest to estimate the thirdorder tensor.
Similarly for the tensor , the sample third moment does not converge to . However, the offblock diagonal entries do converge to the corresponding entries of . That is, let
be the indices of the offblockdiagonal entries, and define the following masking operator:
(5) 
Then, we have consistent estimates for from the sample third moment.
Now, in the case of , we do not explicitly compute . Instead, we estimate a dimensional tensor (cf. Theorem 3.1), using a least squares formulation that uses only offdiagonal blocks of . That is,
where is the singular value decomposition of the rank matrix . After estimation of , similar to Theorem 3.1, we use the whitening and tensor decomposition to estimate . See Algorithm 1 for a pseudocode of our approach.
Remark: Note that we use a new set of samples to estimate the third moment. This subsampling helps us in our analysis, as it ensures independence of the samples from the output of the alternating minimization step (4).
The next theorem shows that the moment matching approach (Algorithm 1) is consistent. Let and denote the estimates obtained using Algorithm 1. Also, let denote the blockincoherence of as defined in (7).
Theorem 3.2.
Assume that the sample second and the third moments are exact, i.e.,
and . Also, let for the MatrixAltMinprocedure and let , for a global constant . Then, there exists a permutation over such that, for all ,
We now provide a finite sample version of the above theorem.
Theorem 3.3 (Finite sample bound).
There exists positive constants , , , and a permutation on such that if then for any and for a large enough sample size:
the following holds for all , with probability at least :
Further, Algorithm 1 runs in time .
Note that, the estimated ’s and ’s using Algorithm 1 do not necessarily define a valid
probability measure: they can take negative values and might not sum to one. We can process the estimates further to get a valid probability distribution, and show that the estimated mixture distribution is close in KullbackLeibler divergence to the original one. Let
. We first setand set mixture weights . Similarly, let and set
for all , , and , and normalize it to get valid distributions . Let denote a random vector in obtained by first selecting a random type with probability and then drawing from a random vector according to .
Corollary 3.4 (KLdivergence bound).
Under the hypotheses of Theorem 3.3, there exists a positive constant such that if , then Algorithm 1 with the above postprocessing produces a mixture distribution that, with probability at least , satisfies : .
Moreover, we can show that the “type” of each data point can also be recovered accurately.
4 Algorithm
(6) 
In this section, we describe the proposed approach in detail and provide finite sample performance guarantees for each components: MatrixAltMin and TensorLS. These results are crucial in proving the finite sample bound in Theorem 3.3. As mentioned in the previous section, the algorithm first estimates using the alternating minimization procedure. Recall that the second moment of the data given by cannot estimate the blockdiagonal entries of . That is, even in the case of infinite samples, we only have consistency in the offblockdiagonal entries: . However, to apply the “whitening” operator to the third order tensor (see Theorem 3.1) we need to estimate .
In general it is not possible to estimate from as one can fill any entries in the blockdiagonal entries. Fortunately, we can avoid such a case since is guaranteed to be of rank . However, even a lowrank assumption is not enough to recover back . For example, if , then and one cannot recover back . Hence, we make an additional standard assumption that is blockincoherent, where a symmetric rank matrix with singular value decomposition is blockincoherent if the operator norm of all blocks of are upper bounded by
(7) 
where is an sub matrix of which is defined by the block from the th row to the th row. For a given matrix , the smallest value of that satisfy the above condition is referred to as the blockincoherence of .
Now, assuming that satisfies two assumptions, and is block incoherent, we provide an alternating minimization method that provably recovers . In particular, we model explicitly using a bilinear form with variables and . We iteratively solve for for fixed , and use QR decomposition to orthonormalize to get . Note that the QRdecomposition is not required for our method but we use it only for ease of analysis. Below, we give the precise recovery guarantee for the alternating minimization method (Algorithm 2).
Theorem 4.1 (Matrix completion using alternating minimization).
For an symmetric rank matrix with blockincoherence , we observe offblockdiagonal entries corrupted by noise:
Let denote the output after iterations of MatrixAltMin. If , the noise is bounded by , and each column of the noise is bounded by , , then after iterations of MatrixAltMin, the estimate satisfies:
for any . Further, is incoherent with .
For estimating , the noise in the offblockdiagonal entries are due to insufficient sample size. We can precisely bound how large the sampling noise is in the following lemma.
Lemma 4.2.
The above theorem shows that can be recovered exactly from infinite many samples, if . Furthermore, using Lemma 4.2, can be recovered approximately, with sample size . Now, recovering recovers the leftsingular space of , i.e., range(). However, we still need to recover and the rightsingular space of , i.e., range().
To this end, we can estimate the tensor , “whiten” the tensor using (recall that, ), and then use tensor decomposition techniques to solve for . However, we show that estimating is not necessary, we can directly estimate the “whitened” tensor by solving a system of linear equations. In particular, we design an operator such that , where
(8) 
Moreover, we show that is nearlyisometric. Hence, we can efficiently estimate , using the following system of equations: