A Spectral Algorithm for Latent Dirichlet Allocation

04/30/2012 ∙ by Dean P. Foster, et al. ∙ 0

The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on k× k matrices, where k is the number of latent factors (e.g. the number of topics), rather than in the d-dimensional observed space (typically d ≫ k).



1 Introduction

There is general agreement that there are multiple unobserved or latent factors affecting observed data. Mixture models offer a powerful framework to incorporate the effects of these latent variables. A family of mixture models, popularly known as topic models, has generated broad interest on both theoretical and practical fronts.

Topic models incorporate latent variables, the topics, to explain the observed co-occurrences of words in documents. They posit that each document has a mixture of active topics (possibly sparse) and that each active topic determines the occurrence of words in the document. Usually, a Dirichlet prior is assigned to the distribution of topics in documents, giving rise to the so-called latent Dirichlet allocation (LDA) (Blei et al., 2003). These models possess a rich representational power since they allow for the words in each document to be generated from more than one topic (i.e., the model permits documents to be about multiple topics). This increased representational power comes at the cost of a more challenging unsupervised estimation problem, when only the words are observed and the corresponding topics are hidden.

In practice, the most common estimation procedures are based on finding maximum likelihood (ML) estimates, through either local search or sampling based methods, e.g., Expectation-Maximization (EM) (Redner and Walker, 1984), Gibbs sampling (Asuncion et al., 2011), and variational approaches (Hoffman et al., 2010). Another body of tools is based on matrix factorization (Hofmann, 1999; Lee and Seung, 1999). For document modeling, typically, the goal is to form a sparse decomposition of a term by document matrix (which represents the word counts in each document) into two parts: one which specifies the active topics in each document and the other which specifies the distributions of words under each topic.

This work provides an alternative approach to parameter recovery based on the method of moments (Lindsay, 1989; Lindsay and Basak, 1993), which attempts to match the observed moments with those posited by the model. Our approach does this efficiently through a spectral decomposition of the observed moments through two singular value decompositions. This method is simple and efficient to implement, based on only low order moments (third or fourth order), and is guaranteed to recover the parameters of a wide class of mixture models, including the LDA model. We exploit exchangeability of the observed variables and, more generally, the availability of multiple views drawn independently from the same hidden component.

1.1 Summary of Contributions

We present an approach known as Excess Correlation Analysis (ECA) based on the knowledge of low order moments between the observed variables, assumed to be exchangeable (or, more generally, drawn from a multi-view mixture model). ECA differs from Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) in that it is based on two singular value decompositions: the first SVD whitens the data (based on the correlation between two variables) and the second SVD utilizes higher order moments (based on third or fourth order) to find directions which exhibit moments that are in excess of those suggested by a Gaussian distribution. Both SVDs are performed on matrices of size $k \times k$, where $k$ is the number of latent factors, making the algorithm scalable (typically the dimension of the observed space $d \gg k$).

The method is applicable to a wide class of mixture models including exchangeable and multi-view models. We first consider the class of exchangeable variables with independent latent factors, such as a latent Poisson mixture model (a natural Poisson model for generating the sentences in a document, analogous to LDA’s multinomial model for generating the words in a document). We establish that a spectral decomposition, based on third or fourth order central moments, recovers the parameters for this model class. We then consider latent Dirichlet allocation and show that a spectral decomposition of a modified third order moment (exactly) recovers both the probability distributions over words for each topic and the Dirichlet prior. Note that to obtain third order moments, it suffices for documents to contain just three words. Finally, we present extensions to multi-view models, where multiple views drawn independently from the same latent factor are available. This includes the case of both pure topic models (where only one active topic is present in each document) and discrete hidden Markov models. For this setting, we establish that ECA correctly recovers the parameters and is simpler than the eigenvector decomposition methods of Anandkumar et al. (2012).

Finally, “plug-in” moment estimates can be used with sampled data. Section 5 provides a sample complexity analysis of the method, showing that estimating the third order moments is not as difficult as it might naively seem, since we only need a $k \times k$ matrix to be accurate.

Some preliminary experiments that illustrate the efficacy of the proposed algorithm are given in the appendix.

1.2 Related Work

For the case of a single topic per document, the work of Papadimitriou et al. (2000) provides the first guarantees of recovering the topic distributions (i.e., the distributions over words corresponding to each topic), albeit with a rather stringent separation condition (where the words in each topic are essentially non overlapping). Understanding what separation conditions (or lack thereof) permit efficient learning is a natural question; in the clustering literature, a line of work has focussed on understanding the relation between the separation of the mixture components and the complexity of learning. For clustering, the first learnability result (Dasgupta, 1999) was under a somewhat strong separation condition; a subsequent line of results relaxed (Arora and Kannan, 2001; Dasgupta and Schulman, 2007; Vempala and Wang, 2002; Kannan et al., 2005; Achlioptas and McSherry, 2005; Chaudhuri and Rao, 2008; Brubaker and Vempala, 2008; Chaudhuri et al., 2009) or removed these conditions (Kalai et al., 2010; Belkin and Sinha, 2010; Moitra and Valiant, 2010); roughly speaking, the less stringent the separation condition assumed, the more difficult the learning problem is, both computationally and statistically. For the topic modeling problem in which only a single topic is present per document, Anandkumar et al. (2012) provides an algorithm for learning topics with no separation (only a certain full rank assumption is utilized).

For the case of latent Dirichlet allocation (where multiple topics are present in each document), the recent work of Arora et al. (2012) provides the first provable result under a certain natural separation condition. The notion of separation utilized is based on the existence of “anchor words” for topics: essentially, each topic contains words that appear (with reasonable probability) only in that topic (this is a milder assumption than that in Papadimitriou et al. (2000)). Under this assumption, Arora et al. (2012) provide the first provably correct algorithm for learning the topic distributions. Their work also justifies the use of non-negative matrix factorization (NMF) as a procedure for this problem (the original motivation for NMF was as a topic modeling algorithm, though, prior to this work, formal guarantees as such were rather limited). Furthermore, Arora et al. (2012) provide results for certain correlated topic models.

Our approach makes further progress on this problem by providing an algorithm which requires no separation condition. The underlying approach we take is a certain diagonalization technique of the observed moments. We know of at least three different settings which utilize this idea for parameter estimation.

Chang (1996) utilizes eigenvector methods for discrete Markov models of evolution, where the models involve multinomial distributions. The idea has been extended to other discrete mixture models such as discrete hidden Markov models (HMMs) and mixture models with single active topics (see Mossel and Roch (2006); Hsu et al. (2009); Anandkumar et al. (2012)). A key idea in Chang (1996) is the ability to handle multinomial distributions, which comes at the cost of being able to handle only certain single latent factor/topic models (where the latent factor is in only one of $k$ states, such as in HMMs). For these single topic models, the work in Anandkumar et al. (2012) shows how this method is quite general in that the noise model is essentially irrelevant, making it applicable to both discrete models like HMMs and certain Gaussian mixture models.

The second setting is the body of algebraic methods used for the problem of blind source separation (Cardoso and Comon, 1996). These approaches rely on tensor decomposition approaches (see Comon and Jutten (2010)) tailored to independent source separation with additive noise (usually Gaussian). Much of the literature focuses on understanding the effects of measurement noise (without assuming knowledge of their statistics) on the tensor decomposition, which often requires more sophisticated algebraic tools.

Frieze et al. (1996) also utilize these ideas for learning the columns of a linear transformation (in a noiseless setting). This work provides a different efficient algorithm, based on a certain ascent algorithm (rather than a joint diagonalization approach, as in Cardoso and Comon (1996)).

The underlying insight that our method exploits is that we have exchangeable (or multi-view) variables, e.g., we have multiple words (or sentences) in a document, which are drawn independently from the same hidden state. This allows us to borrow from both the ideas in Chang (1996) and in Cardoso and Comon (1996). In particular, we show that the “topic” modeling problem exhibits a rather simple algebraic solution, where only two SVDs suffice for parameter estimation. Moreover, this approach also simplifies the algorithms in Mossel and Roch (2006); Hsu et al. (2009); Anandkumar et al. (2012), in that the eigenvector methods are no longer necessary (e.g., the approach leads to methods for parameter estimation in HMMs with only two SVDs rather than using eigenvector approaches, as in previous work).

Furthermore, the exchangeability assumption permits us to have arbitrary noise models (rather than additive Gaussian noise, which is not appropriate for multinomial and other discrete distributions). A key technical contribution is that we show how the basic diagonalization approach can be adapted to Dirichlet models, through a rather careful construction. This construction bridges the gap between the single topic models (as in Chang (1996); Anandkumar et al. (2012)) and the independent factor model.

More generally, the multi-view approach has been exploited in previous works for semi-supervised learning and for learning mixtures of well-separated distributions (e.g., as in Ando and Zhang (2007); Kakade and Foster (2007); Chaudhuri and Rao (2008); Chaudhuri et al. (2009)). These previous works essentially use variants of canonical correlation analysis (Hotelling, 1935) between two views. This work shows that having a third view of the data permits rather simple estimation procedures with guaranteed parameter recovery.

2 The Exchangeable and Multi-view Models

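We have a random vector $h = (h_1, h_2, \dots, h_k)^\top \in \mathbb{R}^k$. This vector specifies the latent factors (i.e., the hidden state), where $h_i$ specifies the value taken by the $i$-th factor. Denote the variance of $h_i$ as

$$\sigma_i^2 := \mathbb{E}[(h_i - \mathbb{E}[h_i])^2],$$

which we assume to be strictly positive, for each $i$, and denote the higher $\ell$-th central moments of $h_i$ as:

$$\mu_{i,\ell} := \mathbb{E}[(h_i - \mathbb{E}[h_i])^\ell], \qquad \ell = 3, 4, \dots$$

At most, we only use the first four moments in our analysis.

Suppose we also have a sequence of exchangeable random vectors $x_1, x_2, x_3, x_4, \dots \in \mathbb{R}^d$; these are considered to be the observed variables. Assume throughout that $d \ge k$; that $x_1, x_2, \dots$ are conditionally independent given $h$; and there exists a matrix $O \in \mathbb{R}^{d \times k}$ such that

$$\mathbb{E}[x_v \mid h] = O h$$

for each $v \in \{1, 2, 3, \dots\}$. Throughout, we make the following assumption.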

Assumption 2.1.

$O$ is full rank.

This is a mild assumption, which allows for identifiability of the columns of $O$. The goal is to estimate the matrix $O$, sometimes referred to as the topic matrix.

Importantly, we make no assumptions on the noise model. In particular, we do not assume that the noise is additive (or that the noise is independent of $h$).

2.1 Independent Latent Factors

Here, suppose that $h$ has a product distribution, i.e., each component $h_i$ is independent from the rest. Two important examples of this setting are as follows:

(Multiple) mixtures of Gaussians: Suppose $x_v = O h + \eta$, where $\eta$ is Gaussian noise and $h$ is a binary vector (under a product distribution). Here, the $i$-th column $O_i$ can be considered to be the mean of the $i$-th Gaussian component. This is a somewhat different model than the classic mixture of $k$ Gaussians, as the model now permits any number of Gaussians to be responsible for generating the hidden state (i.e., $h$ is permitted to be any of the $2^k$ vectors on the hypercube, while in the classic mixture problem, only one component is responsible). However, this model imposes the independent factor constraint. We may also allow $\eta$ to be heteroskedastic (i.e., the noise may depend on $h$, provided the linearity assumption $\mathbb{E}[x_v \mid h] = O h$ holds).

(Multiple) mixtures of Poissons: Suppose $[O h]_j$ specifies the Poisson rate of counts for $[x_v]_j$. For example, $x_v$ could be a vector of word counts in the $v$-th sentence of a document (where $x_1, x_2, \dots$ are word counts of a sequence of sentences). Here, $O$ would be a matrix with positive entries, and $h_i$ would scale the rate at which topic $i$ generates words in a sentence (as specified by the $i$-th column of $O$). The linearity assumption is satisfied as $\mathbb{E}[x_v \mid h] = O h$ (note the noise is not additive in this case). Here, multiple topics may be responsible for generating the words in each sentence. This model provides a natural variant of LDA, where the distribution over $h$ is a product distribution (while in LDA, $h$ is a probability vector).
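As a rough illustration of the Poisson example, the following minimal sketch draws a few exchangeable count vectors whose conditional mean is linear in the hidden state. The Gamma law for the factors, the uniform range for the rates, and all variable names are assumptions made for illustration; the text only requires independent factors and positive rates.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_sentences = 10, 3, 4

# Assumed for illustration: positive rate matrix, column i holds topic i's word rates.
O = rng.uniform(0.1, 2.0, size=(d, k))
# Assumed for illustration: independent, non-negative factors (product distribution).
h = rng.gamma(shape=2.0, scale=0.5, size=k)

rate = O @ h                                   # conditional mean E[x_v | h] = O h
X = rng.poisson(rate, size=(n_sentences, d))   # rows are exchangeable count vectors
```

The noise here is not additive: the variance of each count equals its rate, so it depends on the hidden state, yet the linearity assumption still holds.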

2.2 The Dirichlet Model

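Now suppose the hidden state $h$ is a distribution itself, with a density specified by the Dirichlet distribution with parameter $\alpha \in \mathbb{R}^k_{>0}$ ($\alpha$ is a strictly positive real vector). We often think of $h$ as a distribution over topics. Precisely, the density of $h \in \Delta^{k-1}$ (where the probability simplex $\Delta^{k-1}$ denotes the set of possible distributions over $k$ outcomes) is specified by:

$$p_\alpha(h) := \frac{1}{Z(\alpha)} \prod_{i=1}^k h_i^{\alpha_i - 1}$$

where

$$Z(\alpha) := \frac{\prod_{i=1}^k \Gamma(\alpha_i)}{\Gamma(\alpha_0)}$$

and

$$\alpha_0 := \alpha_1 + \alpha_2 + \dots + \alpha_k.$$

Intuitively, $\alpha_0$ (the sum of the “pseudo-counts”) is a crude measure of the uniformity of the distribution. As $\alpha_0 \to 0$, the distribution degenerates to one over pure topics (i.e., the limiting density is one in which, with probability $1$, precisely one coordinate of $h$ is $1$ and the rest are $0$).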

Latent Dirichlet Allocation:

LDA makes the further assumption that each random variable $x_1, x_2, \dots$ takes on discrete values out of $d$ outcomes (e.g., $x_v$ represents what the $v$-th word in a document is, so $d$ represents the number of words in the language). Each column of $O$ represents a distribution over the outcomes (e.g., these are the topic probabilities). The sampling procedure is specified as follows: First, $h$ is sampled according to the Dirichlet distribution. Then, for each $v$, independently sample $i \in \{1, 2, \dots, k\}$ according to $h$, and, finally, sample $x_v$ according to the $i$-th column of $O$. Observe this model falls into our setting: represent $x_v$ with a “hot” encoding where $[x_v]_j = 1$ if and only if the $v$-th outcome is the $j$-th word in the vocabulary. Hence, $\Pr([x_v]_j = 1 \mid h) = [O h]_j$ and $\mathbb{E}[x_v \mid h] = O h$. (Again, the noise model is not additive.)
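The sampling procedure above can be sketched in a few lines. This is an illustrative sampler only: the function name `sample_lda_document`, the vocabulary size, and the parameter values are hypothetical choices, not taken from the paper.

```python
import numpy as np

def sample_lda_document(O, alpha, n_words, rng):
    """Sample one document; returns (h, topics, X) with one-hot word vectors in X."""
    d, k = O.shape
    h = rng.dirichlet(alpha)                     # topic proportions for this document
    topics = rng.choice(k, size=n_words, p=h)    # latent topic of each word
    words = np.array([rng.choice(d, p=O[:, t]) for t in topics])
    X = np.zeros((n_words, d))
    X[np.arange(n_words), words] = 1.0           # row v is the one-hot vector x_v
    return h, topics, X

rng = np.random.default_rng(0)
d, k = 8, 3
O = rng.dirichlet(np.ones(d), size=k).T          # d x k, each column a word distribution
alpha = np.array([0.5, 1.0, 1.5])                # hypothetical Dirichlet parameter
h, topics, X = sample_lda_document(O, alpha, n_words=3, rng=rng)
```

Three words per document are drawn here because third order moments need only three exchangeable observations.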

2.3 The Multi-View Model

The multi-view setting can be considered an extension of the exchangeable model. Here, the random vectors $x_1, x_2, \dots$ are of dimensions $d_1, d_2, \dots$. Instead of a single $O$ matrix, suppose for each $v$ there exists an $O_v \in \mathbb{R}^{d_v \times k}$ such that

$$\mathbb{E}[x_v \mid h] = O_v h.$$

Throughout, we make the following assumption.

Assumption 2.2.

$O_v$ is full rank for each $v$.

Even though the variables are no longer exchangeable, the setting shares much of the statistical structure of the exchangeable one; furthermore, it allows for significantly richer models. For example, Anandkumar et al. (2012) consider a special case of this multi-view model (where there is only one topic present in $h$) for the purposes of learning hidden Markov models.

A simple factorial HMM: Here, suppose we have a time series of random hidden vectors and observations (we slightly abuse notation as is a vector). Assume that each factor . The model parameters and evolution are specified as follows: We have an initial (product) distribution over the first . The “factorial” assumption we make is that each factor evolves independently; in particular, for each component , there are (time independent) transition probabilities and . Also suppose that (where, again, does not depend on the time).

To learn this model, consider the first three observations . We can embed this three timestep model into the multiview model using a single hidden state, namely , and, with an appropriate construction (of and means shifts of to make the linearity assumption hold). Furthermore, if we recover we can recover and the transition model. See Anandkumar et al. (2012) for further discussion of this idea (for the single topic case).

3 Identifiability

The underlying question here is: what may we hope to recover about $O$ with only knowledge of the distribution on $x_1, x_2, \dots$? At best, we could only recover the columns of $O$ up to permutation. At the other extreme, suppose no a priori knowledge of the distribution of $h$ is assumed (e.g., it may not even be a product distribution). Here, at best, we can only recover the range of $O$. In particular, suppose $h$ is distributed according to a multivariate Gaussian; then clearly the columns of $O$ are not identifiable. To see this, transform $O$ to $OA$ (where $A$ is any invertible $k \times k$ matrix) and transform the distribution on $h$ (by $A^{-1}$); after this transformation, the distribution over $x_v$ is unaltered and the distribution on $h$ is still a multivariate Gaussian. Hence, $O$ and $OA$ are indistinguishable from any observable statistics. (These issues are well understood in the setting of independent source separation, for additive noise models without exchangeable variables. See Comon and Jutten (2010).)

Thus, for the columns of $O$ to be identifiable, the distribution on $h$ must have some non-Gaussian statistical properties. We consider three cases. In the independent factor model, we consider the cases when $h$ is skewed and when $h$ has excess kurtosis. We also consider the case that $h$ is Dirichlet distributed.

4 Excess Correlation Analysis (ECA)

We now present exact and efficient algorithms for recovering $O$. The algorithm is based on two singular value decompositions: the first SVD whitens the data (based on the correlation between two variables) and the second SVD is carried out on higher order moments (based on third or fourth order). We start with the case of independent factors, as these algorithms make the basic diagonalization approach clear.

As discussed in the Introduction, these approaches can be seen as extensions of the methodologies in Chang (1996); Cardoso and Comon (1996). Furthermore, as we shall see, the Dirichlet distribution bridges between the single topic models (as in Chang (1996); Anandkumar et al. (2012)) and the independent factor model.

Throughout, we use $A^+$ to denote the pseudo-inverse:

$$A^+ = (A^\top A)^{-1} A^\top \qquad (1)$$

for a matrix $A$ with linearly independent columns (this allows us to appropriately invert non-square matrices).
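As a quick numerical sanity check, the left pseudo-inverse $(A^\top A)^{-1} A^\top$ of a tall matrix with independent columns can be compared against NumPy's SVD-based `np.linalg.pinv`. The matrix name `W` and its size are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3))           # tall matrix; columns independent with probability 1

# Explicit formula, computed with a linear solve instead of forming the inverse.
W_pinv = np.linalg.solve(W.T @ W, W.T)    # (W^T W)^{-1} W^T

left_inverse_ok = np.allclose(W_pinv @ W, np.eye(3))     # W^+ W = I
matches_numpy = np.allclose(W_pinv, np.linalg.pinv(W))   # agrees with the SVD-based pinv
projector = W @ W_pinv                                   # orthogonal projector onto Range(W)
```

The projector `W @ W_pinv` is the object that makes the reconstruction step of ECA work: it leaves any vector in the range of `W` unchanged.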

4.1 Independent and Skewed Latent Factors

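  Input: vector $\theta \in \mathbb{R}^k$; the moments $\mathrm{Pairs}$ and $\mathrm{Triples}(\eta)$
  1. Dimensionality Reduction: Find a matrix $U \in \mathbb{R}^{d \times k}$ such that

    $\mathrm{Range}(U) = \mathrm{Range}(\mathrm{Pairs})$.

    (See Remark 1 for a fast procedure.)

  2. Whiten: Find $V \in \mathbb{R}^{k \times k}$ so $V^\top (U^\top \mathrm{Pairs}\, U) V$ is the $k \times k$ identity matrix. Set:

    $W = U V$

  3. SVD: Let $\Lambda$ be the set of (left) singular vectors, with unique singular values, of

    $W^\top \mathrm{Triples}(W\theta)\, W$

  4. Reconstruct: Return the set $\widehat{O}$:

    $\widehat{O} = \{ (W^+)^\top \lambda : \lambda \in \Lambda \}$

    where $W^+$ is the pseudo-inverse (see Eq 1).

Algorithm 1 ECA, with skewed factors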

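Denote the mean as $\mu := \mathbb{E}[x_1]$, and denote the pairwise and threeway correlations as:

$$\mathrm{Pairs} := \mathbb{E}[(x_1 - \mu)(x_2 - \mu)^\top], \qquad \mathrm{Triples} := \mathbb{E}[(x_1 - \mu) \otimes (x_2 - \mu) \otimes (x_3 - \mu)].$$

The dimensions of $\mathrm{Pairs}$ and $\mathrm{Triples}$ are $d^2$ and $d^3$, respectively. It is convenient to project $\mathrm{Triples}$ to a $d \times d$ matrix as follows:

$$\mathrm{Triples}(\eta) := \mathbb{E}[(x_1 - \mu)(x_2 - \mu)^\top \langle \eta, x_3 - \mu \rangle].$$

Roughly speaking, we can think of $\mathrm{Triples}(\eta)$ as a reweighing of a cross covariance (by $\langle \eta, x_3 - \mu \rangle$).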

In addition to the columns of $O$ being identifiable only up to permutation, the scale of each column of $O$ is also not identifiable. To see this, observe the model over $x_v$ is unaltered if we both rescale any column $O_i$ and appropriately rescale the variable $h_i$. Without further assumptions, we can only hope to recover a certain canonical form of $O$, defined as follows:

Definition 1 (The Canonical $O$).

We say $O$ is in a canonical form if, for each $i$, $\sigma_i^2 = 1$. In particular, the transformation $O \leftarrow O\,\mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k)$ (and a rescaling of $h$) places $O$ in canonical form, and the distribution over $x_v$ is unaltered. Observe the canonical $O$ is only specified up to the sign of each column (any sign change of a column does not alter the variance of $h_i$).

Recall $\mu_{i,3}$ is the central third moment. Denote the skewness of $h_i$ as:

$$\gamma_i := \frac{\mu_{i,3}}{\sigma_i^3}.$$

The first result considers the case when the skewness is non-zero.

Theorem 4.1 (Independent and skewed factors).

We have that:

  • (No False Positives) For all $\theta \in \mathbb{R}^k$, Algorithm 1 returns a subset of the columns of $O$, in a canonical form.

  • (Exact Recovery) Assume $\gamma_i$ is nonzero for each $i$. Suppose $\theta \in \mathbb{R}^k$ is a random vector uniformly sampled over the sphere $S^{k-1}$. With probability $1$, Algorithm 1 returns all columns of $O$, in a canonical form.

The proof of this theorem is a consequence of the following lemma:

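Lemma 4.1.

We have:

$$\mathrm{Pairs} = O\,\mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_k^2)\,O^\top, \qquad \mathrm{Triples}(\eta) = O\,\mathrm{diag}(O^\top \eta)\,\mathrm{diag}(\mu_{1,3}, \mu_{2,3}, \dots, \mu_{k,3})\,O^\top.$$

The proof of this Lemma is provided in the Appendix.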

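Proof of Theorem 4.1.

The analysis is with respect to $O$ in its canonical form (so $\sigma_i = 1$ and $\mu_{i,3} = \gamma_i$). By the full rank assumption, $U^\top \mathrm{Pairs}\, U$, which is a $k \times k$ matrix, is full rank; hence, the whitening step is possible. By construction:

$$I = W^\top \mathrm{Pairs}\, W = W^\top O O^\top W = M M^\top,$$

where $M := W^\top O$. Hence, $M$ is a $k \times k$ orthogonal matrix.

Observe:

$$W^\top \mathrm{Triples}(W\theta)\, W = W^\top O\,\mathrm{diag}(O^\top W \theta)\,\mathrm{diag}(\gamma_1, \dots, \gamma_k)\,O^\top W = M\,\mathrm{diag}(M^\top \theta)\,\mathrm{diag}(\gamma_1, \dots, \gamma_k)\,M^\top.$$

Since $M$ is an orthogonal matrix, the above is a (not necessarily unique) singular value decomposition of $W^\top \mathrm{Triples}(W\theta)\, W$. Denote the standard basis as $e_1, e_2, \dots, e_k$. Observe that $M e_1, M e_2, \dots, M e_k$ are singular vectors. In other words, $W^\top O_i$ are singular vectors, where $O_i$ is the $i$-th column of $O$.

An SVD uniquely determines all singular vectors (up to sign) which have unique singular values. The diagonal of the matrix $\mathrm{diag}(M^\top \theta)\,\mathrm{diag}(\gamma_1, \dots, \gamma_k)$ is the vector $\mathrm{diag}(\gamma_1, \dots, \gamma_k)\,M^\top \theta$. Also, since $M$ is a rotation matrix, the distribution of $M^\top \theta$ is also uniform on the sphere. Thus, if $\theta$ is uniformly sampled over the sphere, then every singular value will be nonzero (and distinct) with probability $1$. Finally, for the reconstruction, we have

$$(W^+)^\top M e_i = (W^+)^\top W^\top O_i = O_i,$$

since $(W^+)^\top W^\top = W W^+$ is a projection operator (and the range of $W$ and $O$ are identical). ∎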
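The four steps of Algorithm 1 can be sketched with NumPy as follows. This is a minimal illustration run on exact (population) moments, built under the assumption that they take the forms $\mathrm{Pairs} = O\,\mathrm{diag}(\sigma^2)\,O^\top$ and $\mathrm{Triples}(\eta) = O\,\mathrm{diag}(O^\top\eta)\,\mathrm{diag}(\mu_{\cdot,3})\,O^\top$; the function name `eca_skewed`, the QR orthonormalization of the projection, and the synthetic parameters are choices made for this sketch rather than part of the paper.

```python
import numpy as np

def eca_skewed(pairs, triples, k, rng):
    """Sketch of ECA with skewed factors; `triples` maps eta to a d x d matrix."""
    d = pairs.shape[0]
    # 1. Dimensionality reduction: random projection, orthonormalized for stability.
    U, _ = np.linalg.qr(pairs @ rng.standard_normal((d, k)))
    # 2. Whiten: choose V with V^T (U^T pairs U) V = I, then set W = U V.
    evals, evecs = np.linalg.eigh(U.T @ pairs @ U)
    W = U @ (evecs / np.sqrt(evals))
    # 3. SVD of the whitened third moment, reweighted along a random direction.
    theta = rng.standard_normal(k)
    theta /= np.linalg.norm(theta)
    left, _, _ = np.linalg.svd(W.T @ triples(W @ theta) @ W)
    # 4. Reconstruct: map each singular vector back through (W^+)^T.
    return np.linalg.pinv(W).T @ left

rng = np.random.default_rng(0)
d, k = 10, 3
O = rng.standard_normal((d, k))
sigma = np.array([1.0, 0.5, 2.0])      # factor standard deviations (synthetic)
mu3 = np.array([0.8, -0.3, 4.0])       # third central moments, all nonzero (synthetic)
pairs = O @ np.diag(sigma**2) @ O.T
triples = lambda eta: O @ np.diag((O.T @ eta) * mu3) @ O.T
O_hat = eca_skewed(pairs, triples, k, rng)
```

Under these assumptions the columns of `O_hat` should agree with those of `O * sigma` (the canonical form) up to sign and ordering. With sampled data the moments would be replaced by empirical estimates, and the recovery is then only approximate.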

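Remark 1 (Finding $U$ efficiently).

Suppose $\Theta \in \mathbb{R}^{d \times k}$ is a random matrix with entries sampled independently from a standard normal. Set $U = \mathrm{Pairs}\,\Theta$. Then, with probability $1$, $\mathrm{Range}(U) = \mathrm{Range}(\mathrm{Pairs})$.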
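A small numerical check of this random projection idea, assuming the pairwise moment matrix has rank $k$: multiplying it by a $d \times k$ Gaussian matrix yields a matrix whose columns span the same range. The stand-in matrix `pairs` and the sizes below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 12, 3
O = rng.standard_normal((d, k))
pairs = O @ np.diag([1.0, 2.0, 0.5]) @ O.T    # rank-k stand-in for the pairwise moments

Theta = rng.standard_normal((d, k))           # random Gaussian test matrix
U = pairs @ Theta                             # d x k, spans Range(pairs) with probability 1

Q, _ = np.linalg.qr(U)                        # orthonormal basis for Range(U)
residual = pairs - Q @ (Q.T @ pairs)          # part of pairs lying outside Range(U)
```

The appeal of this step is cost: it needs one product of a $d \times d$ matrix with a $d \times k$ matrix rather than a full decomposition in the observed space.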

Remark 2 (No false positives).

Note that if the skewness is $0$ for some $i$ then ECA will not recover the corresponding column. However, the algorithm does succeed for those directions in which the skewness is non-zero. This guarantee also provides the practical freedom to run the algorithm with multiple different directions $\theta$, since we need only to find unique singular vectors (which may be easier to determine by running the algorithm with different choices for $\theta$).

Remark 3 (Estimating the skewness).

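It is straightforward to estimate the skewness corresponding to any column of $O$. Suppose $\lambda$ is some unique singular vector (up to sign) found in step 3 of ECA (which was used to construct some column $O_i$), then:

$$\gamma_i = \lambda^\top W^\top \mathrm{Triples}(W\lambda)\, W \lambda$$

is the corresponding skewness for $O_i$. This follows from the proof, since $\lambda$ corresponds to some singular vector $M e_i$ and:

$$\lambda^\top W^\top \mathrm{Triples}(W\lambda)\, W \lambda = e_i^\top M^\top M\,\mathrm{diag}(M^\top M e_i)\,\mathrm{diag}(\gamma_1, \dots, \gamma_k)\,M^\top M e_i = \gamma_i,$$

using that $M$ is an orthogonal matrix.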

4.2 Independent and Kurtotic Latent Factors

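  Input: vectors $\theta, \theta' \in \mathbb{R}^k$; the moments $\mathrm{Pairs}$ and $\mathrm{Fourth}(\eta, \eta')$
  1. Dimensionality Reduction: Find a matrix $U \in \mathbb{R}^{d \times k}$ such that

    $\mathrm{Range}(U) = \mathrm{Range}(\mathrm{Pairs})$.

  2. Whiten: Find $V \in \mathbb{R}^{k \times k}$ so $V^\top (U^\top \mathrm{Pairs}\, U) V$ is the $k \times k$ identity matrix. Set:

    $W = U V$

  3. SVD: Let $\Lambda$ be the set of (left) singular vectors, with unique singular values, of

    $W^\top \mathrm{Fourth}(W\theta, W\theta')\, W$

  4. Reconstruct: Return the set $\widehat{O}$:

    $\widehat{O} = \{ (W^+)^\top \lambda : \lambda \in \Lambda \}$

    where $W^+$ is the pseudo-inverse (see Eq 1).

Algorithm 2 ECA, with kurtotic factors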

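Define the following matrix:

$$\mathrm{Fourth}(\eta, \eta') := \mathbb{E}[(x_1 - \mu)(x_2 - \mu)^\top \langle \eta, x_3 - \mu \rangle \langle \eta', x_4 - \mu \rangle] - (\eta^\top \mathrm{Pairs}\, \eta')\,\mathrm{Pairs} - (\mathrm{Pairs}\, \eta)(\mathrm{Pairs}\, \eta')^\top - (\mathrm{Pairs}\, \eta')(\mathrm{Pairs}\, \eta)^\top.$$

This is a subspace of the fourth moment tensor.

Recall $\mu_{i,4}$ is the central fourth moment. Denote the excess kurtosis of $h_i$ as:

$$\kappa_i := \frac{\mu_{i,4}}{\sigma_i^4} - 3.$$

For Gaussian distributions, recall the kurtosis is $3$, and so the excess kurtosis is $0$. This function is also common in the source separation approaches (Hyvärinen et al., 2001). (Footnote 3: Their algebraic methods require more effort due to the additive noise and the lack of exchangeability. Here, the exchangeability assumption simplifies the approach and allows us to address models with non-additive noise, as in the Poisson count model discussed in Section 2.)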

In settings where the latent factors are not skewed, we may hope that they are differentiated from a Gaussian distribution due to their fourth order moments. Here, Algorithm 2 is applicable:

Theorem 4.2 (Independent and kurtotic factors).

We have that:

  • (No False Positives) For all $\theta, \theta' \in \mathbb{R}^k$, Algorithm 2 returns a subset of the columns of $O$, in a canonical form.

  • (Exact Recovery) Assume $\kappa_i$ is nonzero for each $i$. Suppose $\theta, \theta' \in \mathbb{R}^k$ are random vectors uniformly and independently sampled over the sphere $S^{k-1}$. With probability $1$, Algorithm 2 returns all the columns of $O$, in a canonical form.

Remark 4 (Using both skewed and kurtotic ECA).

Note that both algorithms never incorrectly return columns. Hence, if for every $i$, either the skewness or the excess kurtosis is nonzero, then by running both algorithms we will recover $O$.

The proof of this theorem is a consequence of the following lemma:

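Lemma 4.2.

We have:

$$\mathrm{Fourth}(\eta, \eta') = O\,\mathrm{diag}(O^\top \eta)\,\mathrm{diag}(O^\top \eta')\,\mathrm{diag}(\mu_{1,4} - 3\sigma_1^4, \dots, \mu_{k,4} - 3\sigma_k^4)\,O^\top.$$

The proof of this Lemma is provided in the Appendix.

Proof of Theorem 4.2.

The distinction from the argument in Theorem 4.1 is that:

$$W^\top \mathrm{Fourth}(W\theta, W\theta')\, W = M\,\mathrm{diag}(M^\top \theta)\,\mathrm{diag}(M^\top \theta')\,\mathrm{diag}(\kappa_1, \dots, \kappa_k)\,M^\top.$$

The remainder of the argument follows that of the proof of Theorem 4.1. ∎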
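The kurtotic variant differs from the skewed one only in the matrix passed to the second SVD. The sketch below runs it on exact moments, assuming $\mathrm{Pairs} = O\,\mathrm{diag}(\sigma^2)\,O^\top$ and $\mathrm{Fourth}(\eta, \eta') = O\,\mathrm{diag}(O^\top\eta)\,\mathrm{diag}(O^\top\eta')\,\mathrm{diag}(\mu_{\cdot,4} - 3\sigma^4)\,O^\top$; the function name `eca_kurtotic` and all parameter values are illustrative.

```python
import numpy as np

def eca_kurtotic(pairs, fourth, k, rng):
    """Sketch of ECA with kurtotic factors; `fourth` maps (eta, eta2) to a d x d matrix."""
    d = pairs.shape[0]
    U, _ = np.linalg.qr(pairs @ rng.standard_normal((d, k)))   # basis for Range(pairs)
    evals, evecs = np.linalg.eigh(U.T @ pairs @ U)
    W = U @ (evecs / np.sqrt(evals))                           # W^T pairs W = I
    # Two independent Gaussian directions; their scale does not affect singular vectors.
    theta, theta2 = rng.standard_normal((2, k))
    left, _, _ = np.linalg.svd(W.T @ fourth(W @ theta, W @ theta2) @ W)
    return np.linalg.pinv(W).T @ left                          # apply (W^+)^T

rng = np.random.default_rng(3)
d, k = 9, 3
O = rng.standard_normal((d, k))
sigma = np.array([1.0, 2.0, 0.5])      # factor standard deviations (synthetic)
mu4 = np.array([1.5, 80.0, 0.4])       # fourth central moments; excess kurtosis is nonzero
excess = mu4 - 3 * sigma**4
pairs = O @ np.diag(sigma**2) @ O.T
fourth = lambda eta, eta2: O @ np.diag((O.T @ eta) * (O.T @ eta2) * excess) @ O.T
O_hat = eca_kurtotic(pairs, fourth, k, rng)
```

As with the skewed case, the output should match `O * sigma` up to the sign and order of the columns.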

4.3 Latent Dirichlet Allocation

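Now let us turn to the case where $h$ has a Dirichlet density, where each $h_i$ is not sampled independently. Even though the distribution on $h$ is the product of $h_1^{\alpha_1 - 1}, \dots, h_k^{\alpha_k - 1}$, the $h_i$’s are not independent due to the constraint that $h$ lives on the simplex. These dependencies suggest a modification for the moments to be used in ECA, which we now provide.

Suppose $\alpha_0$ is known. Recall that $\alpha_0 := \alpha_1 + \alpha_2 + \dots + \alpha_k$ (the sum of the “pseudo-counts”). Knowledge of $\alpha_0$ is significantly weaker than having full knowledge of the entire parameter vector $\alpha$. A common practice is to specify the entire parameter vector $\alpha$ in a homogeneous manner, with each component being identical (see Steyvers and Griffiths (2006)). Here, we need only specify the sum, which allows for arbitrary inhomogeneity in the prior.

Denote the mean as

$$\mu := \mathbb{E}[x_1].$$

Define a modified second moment as

$$\mathrm{Pairs}_{\alpha_0} := \mathbb{E}[x_1 x_2^\top] - \frac{\alpha_0}{\alpha_0 + 1}\,\mu \mu^\top$$

and a modified third moment as

$$\mathrm{Triples}_{\alpha_0}(\eta) := \mathbb{E}[x_1 x_2^\top \langle \eta, x_3 \rangle] - \frac{\alpha_0}{\alpha_0 + 2}\Big(\mathbb{E}[x_1 x_2^\top]\,\eta \mu^\top + \mu \eta^\top \mathbb{E}[x_1 x_2^\top] + \langle \eta, \mu \rangle\,\mathbb{E}[x_1 x_2^\top]\Big) + \frac{2\alpha_0^2}{(\alpha_0 + 2)(\alpha_0 + 1)}\,\langle \eta, \mu \rangle\,\mu \mu^\top.$$

Remark 5 (Central vs Non-Central Moments).

In the limit as $\alpha_0 \to 0$, the Dirichlet model degenerates so that, with probability $1$, only one coordinate of $h$ equals $1$ and the rest are $0$ (e.g., each document is about $1$ topic). Here, we limit to non-central moments:

$$\lim_{\alpha_0 \to 0} \mathrm{Pairs}_{\alpha_0} = \mathbb{E}[x_1 x_2^\top], \qquad \lim_{\alpha_0 \to 0} \mathrm{Triples}_{\alpha_0}(\eta) = \mathbb{E}[x_1 x_2^\top \langle \eta, x_3 \rangle].$$

In the other extreme, as $\alpha_0 \to \infty$, the behavior limits to the central moments:

$$\lim_{\alpha_0 \to \infty} \mathrm{Pairs}_{\alpha_0} = \mathbb{E}[(x_1 - \mu)(x_2 - \mu)^\top], \qquad \lim_{\alpha_0 \to \infty} \mathrm{Triples}_{\alpha_0}(\eta) = \mathbb{E}[(x_1 - \mu)(x_2 - \mu)^\top \langle \eta, x_3 - \mu \rangle]$$

(to prove the latter claim, expand the central moment and use that, by exchangeability, $\mathbb{E}[x_1 x_2^\top] = \mathbb{E}[x_2 x_3^\top] = \mathbb{E}[x_1 x_3^\top]$).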

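  Input: a vector $\theta \in \mathbb{R}^k$; the moments $\mathrm{Pairs}_{\alpha_0}$ and $\mathrm{Triples}_{\alpha_0}(\eta)$
  1. Dimensionality Reduction: Find a matrix $U \in \mathbb{R}^{d \times k}$ such that

    $\mathrm{Range}(U) = \mathrm{Range}(\mathrm{Pairs}_{\alpha_0})$.

    (See Remark 1 for a fast procedure.)

  2. Whiten: Find $V \in \mathbb{R}^{k \times k}$ so $V^\top (U^\top \mathrm{Pairs}_{\alpha_0}\, U) V$ is the $k \times k$ identity matrix. Set:

    $W = U V$

  3. SVD: Let $\Lambda$ be the set of (left) singular vectors, with unique singular values, of

    $W^\top \mathrm{Triples}_{\alpha_0}(W\theta)\, W$

  4. Reconstruct and Normalize: Return the set $\widehat{O}$:

    $\widehat{O} = \left\{ \dfrac{(W^+)^\top \lambda}{\vec{1}^\top (W^+)^\top \lambda} : \lambda \in \Lambda \right\}$

    where $\vec{1} \in \mathbb{R}^d$ is a vector of all ones and $W^+$ is the pseudo-inverse (see Eq 1).

Algorithm 3 ECA for latent Dirichlet allocation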

Our main result here shows that ECA recovers both the topic matrix $O$, up to a permutation of the columns (where each column represents a probability distribution over words for a given topic), and the parameter vector $\alpha$, using only knowledge of $\alpha_0$ (which, as discussed earlier, is a significantly less restrictive assumption than tuning the entire parameter vector). Also, as discussed in Remark 8, the method applies to cases where $x_v$ is not a multinomial distribution.

Theorem 4.3 (Latent Dirichlet Allocation).

We have that:

  • (No False Positives) For all $\theta \in \mathbb{R}^k$, Algorithm 3 returns a subset of the columns of $O$.

  • (Topic Recovery) Suppose $\theta \in \mathbb{R}^k$ is a random vector uniformly sampled over the sphere $S^{k-1}$. With probability $1$, Algorithm 3 returns all columns of $O$.

  • (Parameter Recovery) We have that:

    $\alpha = \alpha_0 (\alpha_0 + 1)\, O^+\, \mathrm{Pairs}_{\alpha_0}\, (O^+)^\top \vec{1}$

    where $\vec{1} \in \mathbb{R}^k$ is a vector of all ones.

The proof is a consequence of the following lemma:

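Lemma 4.3.

We have:

$$\mathrm{Pairs}_{\alpha_0} = \frac{1}{(\alpha_0 + 1)\alpha_0}\, O\,\mathrm{diag}(\alpha)\,O^\top$$

and

$$\mathrm{Triples}_{\alpha_0}(\eta) = \frac{2}{(\alpha_0 + 2)(\alpha_0 + 1)\alpha_0}\, O\,\mathrm{diag}(O^\top \eta)\,\mathrm{diag}(\alpha)\,O^\top.$$

The proof of this Lemma is provided in the Appendix.

Proof of Theorem 4.3.

Note that with the following rescaling of columns:

$$\tilde{O} := \frac{1}{\sqrt{(\alpha_0 + 1)\alpha_0}}\, O\,\mathrm{diag}(\sqrt{\alpha_1}, \sqrt{\alpha_2}, \dots, \sqrt{\alpha_k})$$

we have that $\mathrm{Pairs}_{\alpha_0} = \tilde{O}\tilde{O}^\top$, i.e., $\tilde{O}$ plays the role of $O$ in canonical form (the variance of each $h_i$ is $1$). The remainder of the proof is identical to that of Theorem 4.1. The only modification is that we simply normalize the output of Algorithm 1. Finally, observe that the claim for estimating $\alpha$ holds due to the functional form of $\mathrm{Pairs}_{\alpha_0}$. ∎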
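The LDA variant can be sketched on exact moments as below, assuming the modified moments satisfy $\mathrm{Pairs}_{\alpha_0} = O\,\mathrm{diag}(\alpha)\,O^\top / ((\alpha_0+1)\alpha_0)$ and $\mathrm{Triples}_{\alpha_0}(\eta) = 2\,O\,\mathrm{diag}(O^\top\eta)\,\mathrm{diag}(\alpha)\,O^\top / ((\alpha_0+2)(\alpha_0+1)\alpha_0)$. The function name `eca_lda`, the vocabulary size, and the Dirichlet parameter are hypothetical choices for illustration.

```python
import numpy as np

def eca_lda(pairs_a0, triples_a0, k, alpha0, rng):
    """Sketch of ECA for LDA on exact moments; returns (O_hat, alpha_hat)."""
    d = pairs_a0.shape[0]
    U, _ = np.linalg.qr(pairs_a0 @ rng.standard_normal((d, k)))
    evals, evecs = np.linalg.eigh(U.T @ pairs_a0 @ U)
    W = U @ (evecs / np.sqrt(evals))                 # whitening: W^T pairs_a0 W = I
    theta = rng.standard_normal(k)
    theta /= np.linalg.norm(theta)
    left, _, _ = np.linalg.svd(W.T @ triples_a0(W @ theta) @ W)
    cols = np.linalg.pinv(W).T @ left
    O_hat = cols / cols.sum(axis=0)                  # normalize columns to sum to one
    O_pinv = np.linalg.pinv(O_hat)
    alpha_hat = alpha0 * (alpha0 + 1) * (O_pinv @ pairs_a0 @ O_pinv.T @ np.ones(k))
    return O_hat, alpha_hat

rng = np.random.default_rng(4)
d, k = 8, 3
O = rng.dirichlet(np.ones(d), size=k).T              # columns are word distributions
alpha = np.array([0.3, 0.5, 1.2])                    # hypothetical Dirichlet parameter
alpha0 = alpha.sum()
pairs_a0 = O @ np.diag(alpha) @ O.T / ((alpha0 + 1) * alpha0)
scale3 = 2.0 / ((alpha0 + 2) * (alpha0 + 1) * alpha0)
triples_a0 = lambda eta: scale3 * O @ np.diag((O.T @ eta) * alpha) @ O.T
O_hat, alpha_hat = eca_lda(pairs_a0, triples_a0, k, alpha0, rng)
```

Dividing each reconstructed column by its sum removes both the unknown scale and the sign ambiguity of the singular vectors, so the output should match the true topic columns up to ordering only, with `alpha_hat` listed in the same order.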

Remark 6 (Limiting behaviors).

ECA seamlessly blends between the single topic model of Anandkumar et al. (2012) and the skewness based ECA, Algorithm 1. In the single topic case, Anandkumar et al. (2012) provide eigenvector based algorithms. This work shows that two SVDs suffice for parameter recovery.

Remark 7 (Skewed and Kurtotic ECA for LDA).

We conjecture that the fourth moments can be utilized in the Dirichlet case such that the resulting algorithm limits to the kurtotic based ECA, when $\alpha_0 \to \infty$. Furthermore, the mixture of Poissons model discussed in Section 2 provides a natural alternative to the LDA model in this regime.

Remark 8 (The Dirichlet model, more generally).

It is not necessary that we have a multinomial distribution on $x_v$, so long as $\mathbb{E}[x_v \mid h] = O h$. In some applications, it might be natural for the observations to come from a different distribution (say $x_v$ may represent pixel intensities in an image or some other real valued quantity). For this case, where $h$ has a Dirichlet prior (and where $x_v$ may not be multinomial), ECA still correctly recovers the columns of $O$. Furthermore, we need not normalize; the set $\{ (W^+)^\top \lambda : \lambda \in \Lambda \}$ recovers $O$ in a canonical form.

4.4 The Multi-View Extension

  Input: vector ; the moments and
  1. Project views and : Find matrices and such that is invertible. Set:

    (See Remark 10 for a fast procedure.)

  2. Symmetrize: Reduce to a single view:

  3. Estimate with ECA: Call Algorithm 1, with , , and .

Algorithm 4 ECA; the multi-view case

Rather than being identical for each , suppose for each there exists an such that

For , define

We use the notation to stress that is a sized matrix.

Lemma 4.4.

For ,

The proof for Lemma 4.4 is analogous to those in Appendix A.

These functional forms make deriving an SVD based algorithm more subtle. Using the methods in Anandkumar et al. (2012), eigenvector based methods are straightforward to derive. However, SVD based algorithms are preferred due to their greater simplicity. The following lemma shows how the symmetrization step in the algorithm makes this possible.

Lemma 4.5.

For and defined in Algorithm 4, we have:

Proof.

Without loss of generality, suppose are in canonical form (for each , ). Hence, . Hence, and are invertible. Note that:

which proves the first claim. The proof of the second claim is analogous. ∎

Again, we say that all are in a canonical form if, for each , .

Theorem 4.4 (The multi-view case).

We have:

  • (No False Positives) For all , Algorithm 4 returns a subset of , in a canonical form.

  • (Exact Recovery) Assume that is nonzero for each . Suppose is a random vector uniformly sampled over the sphere . With probability , Algorithm 4 returns all columns of , in a canonical form.

Proof of Theorem 4.4.

The proof is identical to that of Theorem 4.1. ∎

Remark 9 (Simpler algorithms for HMMs).

Mossel and Roch (2006); Anandkumar et al. (2012) provide eigenvector based algorithms for HMM parameter estimation. These results show that we can achieve parameter estimation with only two SVDs (see Anandkumar et al. (2012) for the reduction of an HMM to the multi-view setting). The key idea is the symmetrization that reduces the problem to a single view.

Remark 10 (Finding and ).

Suppose are random matrices with entries sampled independently from a standard normal. Set and . With probability , and , and the invertibility condition will be satisfied (provided that and are full rank).
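As an illustration of Remark 10, the following NumPy sketch draws Gaussian matrices and checks the invertibility condition numerically. The way the projections are formed from a cross moment here (`pairs_12 @ theta.T`), the synthetic rank-`k` cross moment, and all names (`random_projections`, `pairs_12`, `o1`, `o2`) are assumptions made for the example; they are not claimed to be the exact construction of the remark.

```python
import numpy as np

def random_projections(pairs_12, k, rng):
    """Build d x k projection matrices from a cross moment and Gaussian matrices.

    Hypothetical helper: it multiplies the cross moment by random k x d
    Gaussian matrices so that the results inherit its column and row spaces.
    """
    d1, d2 = pairs_12.shape
    theta_1 = rng.standard_normal((k, d2))
    theta_2 = rng.standard_normal((k, d1))
    v1 = pairs_12 @ theta_1.T      # d1 x k, columns lie in range(pairs_12)
    v2 = pairs_12.T @ theta_2.T    # d2 x k, columns lie in range(pairs_12.T)
    return v1, v2

rng = np.random.default_rng(0)
d, k = 30, 4

# Synthetic full-rank view matrices with columns that are distributions.
o1 = rng.dirichlet(np.ones(d), size=k).T
o2 = rng.dirichlet(np.ones(d), size=k).T
weights = np.array([0.4, 0.3, 0.2, 0.1])
pairs_12 = o1 @ np.diag(weights) @ o2.T    # rank-k cross moment (assumed form)

v1, v2 = random_projections(pairs_12, k, rng)
projected = v1.T @ pairs_12 @ v2           # k x k matrix that must be invertible
rank_projected = np.linalg.matrix_rank(projected)

# Relative residual of v1 after projecting onto range(o1); near zero if
# range(v1) is contained in range(o1).
coef = np.linalg.lstsq(o1, v1, rcond=None)[0]
range_residual = np.linalg.norm(o1 @ coef - v1) / np.linalg.norm(v1)
```

Only `k` Gaussian directions per view are needed, so the cost is dominated by two products with the `d x d` cross moment rather than by a decomposition in the observed space.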

5 Sample Complexity

  Input: an integer ; an integer ; vector ; the sum
  1. Find Empirical Averages: With independent samples (of documents), compute the empirical first, second, and third moments. Then compute the empirical moments and .

  2. Whiten: Let where is the matrix of the orthonormal left singular vectors of , corresponding to the largest singular values, and is the corresponding diagonal matrix of the largest singular values.

  3. SVD: Let be the set of (left) singular vectors of

  4. Reconstruct and Scale: Return the set where

    (See Remark 11 for a procedure which explicitly normalizes .)

Algorithm 5 Empirical ECA for LDA
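The whitening step of Algorithm 5 can be sketched as follows. This is a minimal illustration that assumes the usual whitening construction, in which the top `k` left singular vectors are scaled by the inverse square roots of their singular values; the synthetic matrix `pairs` and the helper name `whiten` are hypothetical and stand in for the empirical second-moment matrix used by the algorithm.

```python
import numpy as np

def whiten(pairs, k):
    """Return a d x k matrix W such that W.T @ pairs @ W is close to I_k.

    Assumes `pairs` is symmetric positive semidefinite with rank at least k.
    """
    u, s, _ = np.linalg.svd(pairs)
    u_k, s_k = u[:, :k], s[:k]
    # Broadcasting divides column j of u_k by sqrt(s_k[j]).
    return u_k / np.sqrt(s_k)

rng = np.random.default_rng(0)
d, k = 50, 3

# Synthetic rank-k second moment built from k topic vectors (illustrative only).
topics = rng.dirichlet(np.ones(d), size=k).T     # d x k, columns sum to 1
weights = np.array([0.5, 0.3, 0.2])
pairs = topics @ np.diag(weights) @ topics.T     # symmetric, rank k

w = whiten(pairs, k)
whitened = w.T @ pairs @ w                       # k x k, approximately identity
```

After whitening, the remaining SVD in step 3 acts on a `k x k` matrix, which is where the computational savings over working in the `d`-dimensional observed space come from.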

Let us now provide an efficient algorithm utilizing samples from documents, rather than exact statistics. The following theorem shows that the empirical version of ECA returns accurate estimates of the topics. Furthermore, each run of the algorithm succeeds with probability greater than so the algorithm may be repeatedly run. Primarily for theoretical analysis, Algorithm 5 uses a rescaling procedure (rather than explicitly normalizing the topics, which would involve some thresholding procedure; see Remark 11).
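To make the use of samples concrete, here is a hedged sketch of the raw empirical averages from documents of three words each. It assumes each word is represented by its vocabulary index (equivalently, a one-hot vector) and computes only the plain first, second, and contracted third moments; the further adjusted moments that Algorithm 5 forms from these are not reproduced here. The function name `empirical_moments` and the array layout are assumptions made for the example.

```python
import numpy as np

def empirical_moments(docs, d, theta):
    """Raw empirical moments from an (n, 3) integer array of word indices.

    Returns the first moment (length d), the pair moment (d x d), and the
    third moment contracted with `theta` along the third word (d x d).
    """
    n = docs.shape[0]
    x1, x2, x3 = docs[:, 0], docs[:, 1], docs[:, 2]

    # Average of one-hot vectors for the first word.
    mean = np.bincount(x1, minlength=d) / n

    # Average of outer products e_{x1} e_{x2}^T.
    pairs = np.zeros((d, d))
    np.add.at(pairs, (x1, x2), 1.0)
    pairs /= n

    # Average of e_{x1} e_{x2}^T weighted by <theta, e_{x3}> = theta[x3].
    triples = np.zeros((d, d))
    np.add.at(triples, (x1, x2), theta[x3])
    triples /= n
    return mean, pairs, triples

rng = np.random.default_rng(0)
d, n = 20, 5000
docs = rng.integers(0, d, size=(n, 3))      # placeholder documents, not LDA samples
theta = rng.standard_normal(d)
mean, pairs, triples = empirical_moments(docs, d, theta)
```

`np.add.at` is used instead of fancy-index assignment so that repeated word pairs accumulate correctly.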

Theorem 5.1 (Sample Complexity for LDA).

Fix . Let and let denote the smallest (non-zero) singular value of . Suppose that we obtain independent samples of in the LDA model. With probability greater than , the following holds: for sampled uniformly over the sphere , with probability greater than , Algorithm 5 returns a set such that there exists a permutation of (a permutation of the columns) so that for all

where is a universal constant.
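Because the guarantee holds only up to a permutation of the columns, evaluating an estimate against known topics requires a matching step. The sketch below searches all permutations by brute force, which is adequate only for small `k`; the function name `best_permutation_error` and the toy matrices are hypothetical.

```python
from itertools import permutations

import numpy as np

def best_permutation_error(true_topics, estimates):
    """Find the column permutation minimizing the worst per-column l2 error.

    Returns (perm, err) where estimates[:, perm[i]] is matched to
    true_topics[:, i] and err is the largest column-wise l2 distance.
    """
    k = true_topics.shape[1]
    best_perm, best_err = None, np.inf
    for perm in permutations(range(k)):
        diff = true_topics - estimates[:, list(perm)]
        err = np.linalg.norm(diff, axis=0).max()
        if err < best_err:
            best_perm, best_err = perm, err
    return best_perm, best_err

rng = np.random.default_rng(0)
true_topics = np.eye(4)[:, :3]                      # 3 well-separated toy topics
shuffled = true_topics[:, [2, 0, 1]]                # columns returned in another order
estimates = shuffled + 1e-3 * rng.standard_normal(shuffled.shape)

perm, err = best_permutation_error(true_topics, estimates)
```

For larger `k`, an assignment solver would replace the exhaustive search; the brute-force loop is kept here only because it has no dependencies beyond NumPy.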

Remark 11 (Normalizing and accuracy).

An alternative procedure would be to just explicitly normalize . If large, to do this robustly, one should first set to the smallest elements and then normalize. The reason for clipping the smallest elements is related to obtaining low error.
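The clip-then-normalize procedure of Remark 11 might look as follows. This is a sketch under assumptions: the number of entries to retain, `keep`, is a hypothetical parameter (the remark does not fix it here), and negative entries among the retained ones are also set to zero.

```python
import numpy as np

def clip_and_normalize(v, keep):
    """Zero out all but the `keep` largest entries of v, then rescale to sum to 1."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    largest = np.argsort(v)[-keep:]            # indices of the `keep` largest entries
    out[largest] = np.clip(v[largest], 0.0, None)
    total = out.sum()
    if total <= 0.0:
        raise ValueError("no positive mass left after clipping")
    return out / total

# A noisy estimate of a topic whose mass sits on the first three words.
noisy = np.array([0.5, 0.3, 0.2, 0.01, -0.01, 0.0])
topic = clip_and_normalize(noisy, keep=3)
```

Clipping before normalizing prevents many small noisy coordinates from contributing to the normalizing constant, which matters most when the vocabulary is large.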

Our theorem currently guarantees norm accuracy of each column. Another natural error measure for probability distributions is the error (the total variation error). Ideally, we would like the error to be small with a number of samples that does not depend on the dimension (e.g., the size of the vocabulary). Unfortunately, in general, this is not possible. For example, in the simplest case where (i.e., every document is about the same topic), the problem amounts to estimating the distribution over words for this topic; in other words, we must estimate a distribution over , which may require samples to obtain some fixed target -error. However, this situation occurs only when the target distribution is close to uniform. If instead, for each topic, most of the probability mass is contained within the most frequent words (for that topic), then it is possible to translate our error guarantee into an guarantee (in terms of ).
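One way to see how such a translation can work is the Cauchy-Schwarz inequality: on a set of `s` coordinates, the sum of absolute errors is at most `sqrt(s)` times the Euclidean error, and the remaining coordinates contribute a separate tail term. The sketch below checks this numerically; the split into the `s` most frequent words and a tail, and the name `l1_bound_from_l2`, are assumptions made for illustration rather than the argument used in the analysis.

```python
import numpy as np

def l1_bound_from_l2(true_topic, estimate, s):
    """Upper bound on the l1 error: sqrt(s) * l2 error + l1 error on the tail.

    The head is the set of the s most frequent words under `true_topic`.
    """
    diff = estimate - true_topic
    head = np.argsort(true_topic)[-s:]
    tail_mask = np.ones(true_topic.size, dtype=bool)
    tail_mask[head] = False
    return np.sqrt(s) * np.linalg.norm(diff) + np.abs(diff[tail_mask]).sum()

rng = np.random.default_rng(0)
d, s = 1000, 20

# A topic with 90% of its mass on s words and the rest spread thinly.
true_topic = np.full(d, 0.1 / (d - s))
true_topic[:s] = 0.9 / s
estimate = true_topic + 1e-4 * rng.standard_normal(d)

l1_error = np.abs(estimate - true_topic).sum()
bound = l1_bound_from_l2(true_topic, estimate, s)
```

The first term scales with `sqrt(s)` rather than with the vocabulary size, which is the sense in which concentration of mass on a few words helps.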

6 Discussion: Sparsity

Note that sparsity considerations have not entered into our analysis. Often, in high dimensional statistics, notions of sparsity are desired as this generally decreases the sample size requirements (often at an increased computational burden).

Here, while these results have no explicit dependence on the sparsity level, sparsity is helpful in that it does implicitly affect the skewness (and the whitening), which determines the sample complexity. As the model becomes less sparse, the skewness tends to . In particular, for the case of LDA, as note that error increases (see Theorem 5.1).

Perhaps surprisingly, the sparsity level has no direct impact on the computational requirements of a “plug-in” empirical algorithm (beyond the linear time requirement of reading the data in order to construct the empirical statistics).

Acknowledgements

We thank Kamalika Chaudhuri, Adam Kalai, Percy Liang, Chris Meek, David Sontag, and Tong Zhang for many invaluable insights. We also give warm thanks to Rong Ge for sharing early insights and their preliminary results (in Arora et al. (2012)) into this problem with us.

References

  • Achlioptas and McSherry (2005) D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, 2005.
  • Anandkumar et al. (2012) A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.
  • Ando and Zhang (2007) R. Ando and T. Zhang. Two-view feature generation model for semi-supervised learning. In ICML, 2007.
  • Arora and Kannan (2001) S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In STOC, 2001.
  • Arora et al. (2012) S. Arora, R. Ge, and A. Moitra. Learning topic models — going beyond svd. arXiv:1204.1956, Apr 2012.
  • Asuncion et al. (2011) A. Asuncion, P. Smyth, M. Welling, D. Newman, I. Porteous, and S. Triglia. Distributed Gibbs sampling for latent variable models. In Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge Univ Pr, 2011.
  • Belkin and Sinha (2010) M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, 2010.
  • Blei et al. (2003) David M. Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
  • Brubaker and Vempala (2008) S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In FOCS, 2008.
  • Cardoso and Comon (1996) Jean-François Cardoso and Pierre Comon. Independent component analysis, a survey of some algebraic methods. In IEEE International Symposium on Circuits and Systems, pages 93–96, 1996.
  • Chang (1996) J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51–73, 1996.
  • Chaudhuri and Rao (2008) K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In COLT, 2008.
  • Chaudhuri et al. (2009) K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, 2009.
  • Comon and Jutten (2010) P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press. Elsevier, 2010.
  • Dasgupta (1999) S. Dasgupta. Learning mixtures of Gaussians. In FOCS, 1999.
  • Dasgupta and Gupta (2003) S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.
  • Dasgupta and Schulman (2007) S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203–226, 2007.
  • Frieze et al. (1996) Alan M. Frieze, Mark Jerrum, and Ravi Kannan. Learning linear transformations. In FOCS, 1996.
  • Hoffman et al. (2010) M.D. Hoffman, D.M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, 2010.
  • Hofmann (1999) Thomas Hofmann. Probabilistic latent semantic analysis. In UAI, 1999.
  • Hotelling (1935) H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 26(2):139–142, 1935.
  • Hsu et al. (2009) D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
  • Hyvärinen et al. (2001) Aapo Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience, 2001.
  • Kakade and Foster (2007) Sham M. Kakade and Dean P. Foster. Multi-view regression via canonical correlation analysis. In Nader H. Bshouty and Claudio Gentile, editors, COLT, volume 4539 of Lecture Notes in Computer Science, pages 82–96. Springer, 2007.
  • Kalai et al. (2010) A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, 2010.
  • Kannan et al. (2005) R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In COLT, 2005.
  • Lee and Seung (1999) Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 1999.
  • Lindsay (1989) B. G. Lindsay. Moment matrices: applications in mixtures. Annals of Statistics, 17(2):722–740, 1989.
  • Lindsay and Basak (1993) B. G. Lindsay and P. Basak. Multivariate normal mixtures: a fast consistent method. Journal of the American Statistical Association, 88(422):468–476, 1993.
  • Moitra and Valiant (2010) A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS, 2010.
  • Mossel and Roch (2006) E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583–614, 2006.
  • Papadimitriou et al. (2000) Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2), 2000.
  • Redner and Walker (1984) R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
  • Stewart and Sun (1990) G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
  • Steyvers and Griffiths (2006) Mark Steyvers and Tom Griffiths. Probabilistic topic models. In T. Landauer, D. Mcnamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2006. URL http://cocosci.berkeley.edu/tom/papers/SteyversGriffiths.pdf.
  • Vempala and Wang (2002) S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In FOCS, 2002.

Appendix A Analysis with Independent Factors

Lemma A.1 (Hidden state moments).

Let . For any vectors ,

and

Proof.

Let , , and be vectors. Since the are independent and have mean zero, we have:

and