 # Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

We study the problem of learning a distribution from samples, when the underlying distribution is a mixture of product distributions over discrete domains. This problem is motivated by several practical applications such as crowd-sourcing, recommendation systems, and learning Boolean functions. The existing solutions either heavily rely on the fact that the number of components in the mixtures is finite or have sample/time complexity that is exponential in the number of components. In this paper, we introduce a polynomial time/sample complexity method for learning a mixture of r discrete product distributions over {1, 2, ..., ℓ}^n, for general ℓ and r. We show that our approach is statistically consistent and further provide finite sample guarantees. We use techniques from the recent work on tensor decompositions for higher-order moment matching. A crucial step in these moment matching methods is to construct a certain matrix and a certain tensor with low-rank spectral decompositions. These tensors are typically estimated directly from the samples. The main challenge in learning mixtures of discrete product distributions is that these low-rank tensors cannot be obtained directly from the sample moments. Instead, we reduce the tensor estimation problem to: a) estimating a low-rank matrix using only off-diagonal block elements; and b) estimating a tensor using a small number of linear measurements. Leveraging on recent developments in matrix completion, we give an alternating minimization based method to estimate the low-rank matrix, and formulate the tensor completion problem as a least-squares problem.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider the following generative model for sampling from a mixture of product distributions over discrete domains. We use to denote the number of components in the mixture, to denote the size of the discrete output alphabet in each coordinate, and to denote the total number of coordinates. Each sample belongs to one of components, and conditioned on its component the dimensional discrete sample is drawn from some distribution . Precisely, the model is represented by the non-negative weights of the components that sum to one, and the distributions . We use an

dimensional binary random vector

to represent a sample . For , the -th coordinate is an dimensional binary random vector such that

 xi=ej if and only if yi=j,

where for some is the standard coordinate basis vector.

When a sample is drawn, the type of the sample is drawn from such that it has type

with probability

. Conditioned on this type, the sample is distributed according to , such that ’s are independent, hence it is a product distribution, and distributed according to

 (πq)(i,j)=P(yi=j|y belong to % component q),

where is the -th entry of the vector . Note that using the binary encoding, , and . Also, we let represent the distribution in the -th coordinate such that . Then, the discrete distribution can be represented by the matrix and the weights .

This mixture distribution (of -wise discrete distributions over product spaces) captures as special cases the models used in several problems in domains such as crowdsourcing [DS79], genetics [SRH07], and recommendation systems [TM10]. For example, in the crowdsourcing application, this model is same as the popular Dawid and Skene [DS79] model: represents answer of the -th worker to a multiple choice question (or task) of type . Given the ground truth label , each of the worker is assumed to answer independently. The goal is to find out the “quality” of the workers (i.e. learn ) and/or to learn the type of each question (clustering).

We are interested in the following two closely related problems:

• Learn mixture parameters and accurately and efficiently.

• Cluster the samples accurately and efficiently?

Historically, however, different algorithms have been proposed depending on which question is addressed. Also, for each of the problems, distinct measures of performances have been used to evaluate the proposed solution. In this paper, we propose an efficient method to address both questions.

The first question of estimating the underlying parameters of the mixture components has been addressed in [KMR94, FM99, FOS08], where the error of a given algorithm is measured as the KL-divergence between the true distribution and the estimated distribution. More precisely, a mixture learning algorithm is said to be an accurate learning algorithm, if it outputs a mixture of product distribution such that the following holds with probability at least :

 DKL(X||ˆX)≡∑xP(X=x)log(P(X=x)/P(ˆX=x))≤ε,−5pt

where are any given constants, and denote the random vectors distributed according to the true and the estimated mixture distribution, respectively. Furthermore, the algorithm is said to efficient if its time complexity is polynomial in , , , , and .

This Probably Approximately Correct (PAC) style framework was first introduced by Kearns et al. [KMR94], where they provided the first analytical result for a simpler problem of learning mixtures of Hamming balls, which is a special case of our model with . However, the running time of the proposed algorithm is super-polynomial and also assumes that one can obtain the exact probability of a sample . Freund and Mansour [FM99] were the first to addressed the sample complexity, but for the restrictive case of and . For this case, their method has running time and sample complexity . Feldman, O’Donnell, and Servedio in [FOS08] generalized approach of [FM99] to arbitrary number of types and arbitrary number of output labels . For general , their algorithm requires running time scaling as . Hence, the proposed algorithm is an efficient learning algorithm only for finite values of and .

A breakthrough in Feldman et al.’s result is that their result holds for all problem instances, with no dependence on the minimum weight or the condition number , where is the

-th singular value of

, and is a diagonal matrix with the weights in the diagonals. However, this comes at a cost of running time scaling exponentially in both and , which is unacceptable in practice for any value of beyond two. Further, the running time is exponential for all problem instances, even when the problem parameters are well-behaved, with finite condition number.

In this paper, we alleviate this issue by proposing an efficient algorithm for well-behave mixture distributions. In particular, we give an algorithm with polynomial running time, and prove that it gives -accurate estimate for any problem instance that satisfy the following two conditions: ) the weight is strictly positive for all ; and ) the condition number is bounded as per hypotheses in Theorem 3.3.

The existence of an efficient learning algorithm for all problem instances and parameters still remains an open problem, and it is conjectured in [FOS08] that “solving the mixture learning problem for any would require a major breakthrough in learning theory”.

The second question finding the clusters has been addressed in [CHRZ07, CR08]. Chaudhuri et al. in [CHRZ07] introduced an iterative clustering algorithm but their method is restricted to the case of a mixture of two product distributions with binary outputs, i.e. and . Chaudhuri and Rao in [CR08] proposed a spectral method for general . However, for the algorithm to correctly recover cluster of each sample w.h.p, the underlying mixture distribution should satisfy a certain ‘spreading’ condition. Moreover, the algorithm need to know the parameters characterizing the ‘spread’ of the distribution, which typically is not available apriori. Although it is possible to estimate the mixture distribution, once the samples are clustered, Chaudhuri et al. provides no guarantees for estimating the distribution. As is the case for the first problem, for clustering also, we provide an efficient algorithm for general , under the assumption that the condition number of to be bounded. This condition is not directly comparable with the spreading condition assumed in previous work. Our algorithm first estimates the mixture parameters and then uses the distance based clustering method of [AK01].

Our method for estimating the mixture parameters is based on the moment matching technique from [AHK12, AGMS12]. Typically, second and third (and sometimes fourth) moments of the true distribution are estimated using the given samples. Then, using the spectral decomposition of the second moment one develops certain whitening operators that reduce the higher-order moment tensors to orthogonal tensors. Such higher order tensors are then decomposed using a power-method based method [AGH12] to obtain the required distribution parameters.

While such a technique is generic and applies to several popular models [HK13, AGH12], for many of the models the moments themselves constitute the “correct” intermediate quantity that can be used for whitening and tensor decomposition. However, because there are dependencies in the -wise model (for example, to are correlated), the higher-order moments are “incomplete” versions of the intermediate quantities that we require (see (1), (2)). Hence, we need to complete these moments so as to use them for estimating distribution parameters .

Completion of the “incomplete” second moment, can be posed as a low-rank matrix completion problem where the block-diagonal elements are missing. For this problem, we propose an alternating minimization based method and, borrowing techniques from the recent work of [JNS13], we prove that alternating minimization is able to complete the second moment exactly. We would like to note that our alternating minimization result also solves a generalization of the low-rank+diagonal decomposition problem of [SCPW12]. Moreover, unlike trace-norm based method of [SCPW12]

, which in practice is computationally expensive, our method is efficient, requires only one Singular Value Decomposition (SVD) step, and is robust to noise as well.

We reduce the completion of the “incomplete” third moment to a simple least squares problem that is robust as well. Using techniques from our second moment completion method, we can analyze an alternating minimization method also for the third moment case as well. However, for the mixture problem we can exploit the structure to reduce the problem to an efficient least squares problem with closed form solution.

Next, we present our method (see Algorithm 1) that combines the estimates from the above mentioned steps to estimate the distribution parameters (see Theorem 3.2, Theorem 3.3). After estimating the model parameters , and , we also show that the KL-divergence measure and the clustering error measure can also be shown to be small. In fact the excess error vanishes as the number of samples grow (see Corollary 3.4, Corollary 3.5).

## 2 Related Work

Learning mixtures of distributions is an important problem with several applications such as clustering, crowdsourcing, community detection etc. One of the most well studied problems in this domain is that of learning a mixture of Gaussians. There is a long list of interesting recent results, and discussing the literature in detail is out side of the scope of this paper. Our approach is inspired by both spectral and moment-matching based techniques that have been successfully applied in learning a mixture of Gaussians [VW04, AK01, MV10, HK13].

Another popular mixture distribution arises in topic models, where each word is selected from a -sized dictionary. Several recent results show that such a model can also be learned efficiently using spectral as well as moments based methods [RSS12, AHK12, AGM12]. However, there is a crucial difference between the general mixture of product distribution that we consider and the topic model distribution. Given a topic (or question) , each of the words in the topic model have exactly the same probability. That is, for all . In contrast, for our problem, , in general.

Learning mixtures of discrete distribution over product spaces has several practical applications such as crowdsourcing, recommendation systems, etc. However, as discussed in the previous section, most of the existing results for this problem are designed for the case of small alphabet size or the number of mixture components . For several practical problems [KOS13], can be large and hence existing methods either do not apply or are very inefficient. In this work, we propose first provably efficient method for learning mixture of discrete distributions for general and .

Our method is based on tensor decomposition methods for moment matching that have recently been made popular for learning mixture distributions. For example, [HK13] provided a method to learn mixture of Gaussians without any separation assumption. Similarly, [AHK12]

introduced a method for learning mixture of HMMs, and also for topic models. Using similar techniques, another interesting result has been obtained for the problem of independent component analysis (ICA)

[AGMS12, GR12, HK13].

Typically, tensor decomposition methods proceed in two steps. First, obtain a whitening operator using the second moment estimates. Then, use this whitening operator to construct a tensor with orthogonal decomposition, which reveals the true parameters of the distribution. However, in a mixture of -way distribution that we consider, the second or the third moment do not reveal all the “required” entries, making it difficult to find the standard whitening operator. We handle this problem by posing it as a matrix completion problem and using an alternating minimization method to complete the second moment. Our proof for the alternating minimization method closely follows the analysis of [JNS13]. However, [JNS13] handled a matrix completion problem where the entries are missing uniformly at random, while in our case the block diagonal elements are missing.

### 2.1 Notation

Typically, we denote a matrix or a tensor by an upper-case letter (e.g. ) while a vector is denoted by a small-case letter (e.g. ). denotes the -th column of matrix . denotes the -th entry of matrix and denotes the -th entry of the third order tensor . denotes the transpose of matrix , i.e., . denotes the set of first integers. denotes the -th standard basis vector.

If , then () denotes the -th block of , i.e., to -th rows of . The operator denotes the outer product. For example, denote a rank-one tensor such that . For a symmetric third-order tensor , define an dimensional operation with respect to a matrix as

 T[R,R,R]≡∑i1,i2,i3∈[d]Ti1,i2,i3Ri1,j1Ri2,j2Ri3,j3(ej1⊗ej2⊗ej3).−5pt

denotes the spectral norm of a tensor . That is, . denotes the Frobenius norm of , i.e., . We use to denote the singular value decomposition (SVD) of , where denotes the -th singular value of . Also, wlog, assume that .

## 3 Main results

In this section, we present our main results for estimating the mixture weights and the probability matrix of the mixture distribution. Our estimation method is based on the moment-matching technique that has been popularized by several recent results [AHK12, HKZ12, HK13, AGH12]. However, our method differs from the existing methods in the following crucial aspects: we propose a matrix completion approach to estimate the second moments from samples (Algorithm 2); and a least squares approach with an appropriate change of basis to estimate the third moments from samples (Algorithm 3). These approaches provide robust algorithms to estimating the moments and might be of independent interest to a broad range of applications in the domain of learning mixture distributions.

The key step in our method is estimation of the following two quantities:

 M2 ≡ ∑q∈[r]wq(πq⊗πq)=ΠWΠT∈Rℓn×ℓn, (1) M3 ≡ ∑q∈[r]wq(πq⊗πq⊗πq)∈Rℓn×ℓn×ℓn, (2)

where is a diagonal matrix s.t. .

Now, as is standard in the moment based methods, we exploit spectral structure of to recover the latent parameters and . The following theorem presents a method for estimating , assuming are estimated exactly:

###### Theorem 3.1.

Let be as defined in (1), (2). Also, let

be the eigenvalue decomposition of

. Now, define . Let ,

be the eigenvectors and eigenvalues obtained by the orthogonal tensor decomposition of

(see [AGH12]), i.e., . Then,

 Π=UM2Σ1/2M2VGΛG, and W=(ΛG)−2,

where is a diagonal matrix with .

The above theorem reduces the problem of estimation of mixture parameters to that of estimating and . Typically, in moment based methods, tensors corresponding to and can be estimated directly using the second moment or third moment of the distribution, which can be estimated efficiently using the provided data samples. In our problem, however, the block-diagonal entries of and cannot be directly computed from these sample moments. For example, the expected value of a diagonal entry at -th coordinate is , where as the corresponding entry for is .

To recover these unknown block-diagonal entries of , we use an alternating minimization algorithm. Our algorithm writes in a bi-linear form and solves for each factor of the bi-linear form using the computed off-diagonal blocks of . We then prove that this algorithm exactly recovers the missing entries when we are given the exact second moment. For estimating , we reduce the problem of estimating unknown block-diagonal entries of to a least squares problem that can be solved efficiently.

Concretely, to get a consistent estimate of , we pose it as a matrix completion problem, where we use the off-block-diagonal entries of the second moment, which we know are consistent, to estimate the missing entries. Precisely, let

 Ω2≡{(i,j)⊆[ℓn]×[ℓn]|⌈iℓ⌉≠⌈jℓ⌉},−5pt

be the indices of the off-block-diagonal entries, and define a masking operator as:

 PΩ2(A)i,j ≡ {Ai,j, if (i,j)∈Ω2,0, otherwise . (3)

Now, using the fact that has rank at most , we find a rank- estimate that explains the off-block-diagonal entries using an alternating minimization algorithm defined in Section 4.

 ˆM2 ≡ \sc{MatrixAltMin}⎛⎝2|S|∑t∈[|S|/2]xtxTt,Ω2,r,T⎞⎠, (4)

where is the set of observed samples, and is the number of iterations. We use the first half of the samples to estimate and the rest to estimate the third-order tensor.

Similarly for the tensor , the sample third moment does not converge to . However, the off-block diagonal entries do converge to the corresponding entries of . That is, let

 Ω3≡{(i,j,k)⊆[ℓn]×[ℓn]×[ℓn]|⌈iℓ⌉≠⌈jℓ⌉≠⌈kℓ⌉≠⌈iℓ⌉},−5pt

be the indices of the off-block-diagonal entries, and define the following masking operator:

 PΩ3(A)i,j,k ≡ {Ai,j,k, if (i,j,k)∈Ω3,0, otherwise . (5)

Then, we have consistent estimates for from the sample third moment.

Now, in the case of , we do not explicitly compute . Instead, we estimate a dimensional tensor (cf. Theorem 3.1), using a least squares formulation that uses only off-diagonal blocks of . That is,

 ˆG ≡ \sc TensorLS(2|S||S|∑t=1+|S|/2xt⊗xt⊗xt,Ω3,ˆUM2,ˆΣM2),

where is the singular value decomposition of the rank- matrix . After estimation of , similar to Theorem 3.1, we use the whitening and tensor decomposition to estimate . See Algorithm 1 for a pseudo-code of our approach.

Remark: Note that we use a new set of samples to estimate the third moment. This sub-sampling helps us in our analysis, as it ensures independence of the samples from the output of the alternating minimization step (4).

The next theorem shows that the moment matching approach (Algorithm 1) is consistent. Let and denote the estimates obtained using Algorithm 1. Also, let denote the block-incoherence of as defined in (7).

###### Theorem 3.2.

Assume that the sample second and the third moments are exact, i.e.,
and . Also, let for the MatrixAltMinprocedure and let , for a global constant . Then, there exists a permutation over such that, for all ,

 πq=ˆπP(q) and wq=^wP(q).

We now provide a finite sample version of the above theorem.

###### Theorem 3.3 (Finite sample bound).

There exists positive constants , , , and a permutation on such that if then for any and for a large enough sample size:

 |S| ≥ C2μ6r6wminσ1(M2)6n3σr(M2)9log(n/δ)ε2M,

the following holds for all , with probability at least :

 |^wP(q)−wq| ≤ εM, ∥ˆπP(q)−πq∥ ≤ εM√rwmaxσ1(M2)wmin.

Further, Algorithm 1 runs in time .

Note that, the estimated ’s and ’s using Algorithm 1 do not necessarily define a valid

probability measure: they can take negative values and might not sum to one. We can process the estimates further to get a valid probability distribution, and show that the estimated mixture distribution is close in Kullback-Leibler divergence to the original one. Let

. We first set

 ~w′q = {^wq if ^wq≥εw,εw if ^wq<εw,

and set mixture weights . Similarly, let and set

 ~π′(j)q,p =

for all , , and , and normalize it to get valid distributions . Let denote a random vector in obtained by first selecting a random type with probability and then drawing from a random vector according to .

###### Corollary 3.4 (KL-divergence bound).

Under the hypotheses of Theorem 3.3, there exists a positive constant such that if , then Algorithm 1 with the above post-processing produces a -mixture distribution that, with probability at least , satisfies : .

Moreover, we can show that the “type” of each data point can also be recovered accurately.

###### Corollary 3.5 (Clustering bound).

Define:

 ~ε ≡ maxi,j∈[r]{∥πi−πj∥2−2∥Π∥F√2log(r/δ)(∥πi−πj∥+2√2log(r/δ))r1/2}.

Under the hypotheses of Theorem 3.3, there exists a positive numerical constant such that if and , then with probability at least , the distance based clustering algorithm of [AK01] computes a correct clustering of the samples.

## 4 Algorithm

In this section, we describe the proposed approach in detail and provide finite sample performance guarantees for each components: MatrixAltMin and TensorLS. These results are crucial in proving the finite sample bound in Theorem 3.3. As mentioned in the previous section, the algorithm first estimates using the alternating minimization procedure. Recall that the second moment of the data given by cannot estimate the block-diagonal entries of . That is, even in the case of infinite samples, we only have consistency in the off-block-diagonal entries: . However, to apply the “whitening” operator to the third order tensor (see Theorem 3.1) we need to estimate .

In general it is not possible to estimate from as one can fill any entries in the block-diagonal entries. Fortunately, we can avoid such a case since is guaranteed to be of rank . However, even a low-rank assumption is not enough to recover back . For example, if , then and one cannot recover back . Hence, we make an additional standard assumption that is -block-incoherent, where a symmetric rank- matrix with singular value decomposition is -block-incoherent if the operator norm of all blocks of are upper bounded by

 ∥∥U(i)∥∥2 ≤ μ√rn, for all i∈[n], (7)

where is an sub matrix of which is defined by the block from the -th row to the -th row. For a given matrix , the smallest value of that satisfy the above condition is referred to as the block-incoherence of .

Now, assuming that satisfies two assumptions, and is -block incoherent, we provide an alternating minimization method that provably recovers . In particular, we model explicitly using a bi-linear form with variables and . We iteratively solve for for fixed , and use QR decomposition to orthonormalize to get . Note that the QR-decomposition is not required for our method but we use it only for ease of analysis. Below, we give the precise recovery guarantee for the alternating minimization method (Algorithm 2).

###### Theorem 4.1 (Matrix completion using alternating minimization).

For an symmetric rank- matrix with block-incoherence , we observe off-block-diagonal entries corrupted by noise:

 ˆMij = {Mij+Eij if ⌈iℓ⌉≠⌈jℓ⌉,0 otherwise.

Let denote the output after iterations of MatrixAltMin. If , the noise is bounded by , and each column of the noise is bounded by , , then after iterations of MatrixAltMin, the estimate satisfies:

 ∥M−ˆM(τ)∥2 ≤ ε+9∥M∥F√rσr(M)∥PΩ2(E)∥2,

for any . Further, is -incoherent with .

For estimating , the noise in the off-block-diagonal entries are due to insufficient sample size. We can precisely bound how large the sampling noise is in the following lemma.

###### Lemma 4.2.

Let

be the sample co-variance matrix. Also, let

. Then,

 ∥E∥2≤8√n2log(nℓ/δ)|S|.−5pt

Moreover, , for all

The above theorem shows that can be recovered exactly from infinite many samples, if . Furthermore, using Lemma 4.2, can be recovered approximately, with sample size . Now, recovering recovers the left-singular space of , i.e., range(). However, we still need to recover and the right-singular space of , i.e., range().

To this end, we can estimate the tensor , “whiten” the tensor using (recall that, ), and then use tensor decomposition techniques to solve for . However, we show that estimating is not necessary, we can directly estimate the “whitened” tensor by solving a system of linear equations. In particular, we design an operator such that , where

 ˜G≡∑q∈[r]1√wq(R3eq⊗R3eq⊗R3eq), and R3≡ˆΣ−1/2M2ˆU