Model Selection for Topic Models via Spectral Decomposition

10/23/2014
by Dehua Cheng, et al.
University of Southern California

Topic models have achieved significant success in analyzing large-scale text corpora. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following recent advances in topic model inference via tensor decomposition, we make a first attempt to provide theoretical analysis of model selection in latent Dirichlet allocation. Under mild conditions, we derive upper and lower bounds on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are accurate and tight. Furthermore, using the Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis for other latent models.


1 Introduction

Recently, topic models such as latent Dirichlet allocation (LDA) [BNJ03] and its variants [TJBB06] have proven extremely successful in modeling large, complex text corpora. These models assume that the words in a document are generated from a mixture of latent topics represented by multinomial distributions over a given dictionary. Therefore, the major inference problem becomes recovering the latent topics from the text corpus. Popular inference algorithms for LDA include variational inference [BNJ03, TKW07, WPB11, HBWP13], sampling methods [GS04, PNI08], and, more recently, tensor decomposition [AGM12, AFH12, AGH12]. However, all of them require that the number of topics be given as input.

It is known that model selection, i.e., choosing the appropriate number of topics, plays a vital role in successfully applying LDA models [TMN14, KRS14]. For example, [TMN14] has shown that a large value of K leads to severe deterioration in the learning rate, and [KRS14] points out that an incorrect number of mixture components can result in unpredictable error when estimating the parameters of a mixture model via spectral methods. Moreover, as K increases, the computational cost of inference for the LDA model grows significantly.

*Dehua Cheng and Xinran He contributed equally to this article.

Unfortunately, it is extremely challenging to choose the number of topics for the LDA model. In practice, [Tad12] approximates the marginal likelihood via Laplace's method, while [AEF10, GS04] compute the likelihood via MCMC. Moreover, [Tad12] proposes another model selection method based on an analysis of residuals; however, it only provides rough measures of evidence in favor of a larger K. Other model selection criteria, such as AIC [Aka74], BIC [S78], and cross validation, can also be applied. Though they achieve practical success [AEF10], they only have asymptotic model selection consistency. Moreover, they require multiple runs of the learning algorithm over a wide range of K, which limits their practicality on large-scale datasets. Bayesian nonparametrics, such as Hierarchical Dirichlet Processes (HDP) [TJBB06], provide alternatives for selecting K in a principled way. However, it has been shown in a recent paper [MH13] that HDP is inconsistent for estimating the number of topics for LDA even with an infinite amount of data.

In this paper, we provide theoretical analysis of the number of topics for latent topic models using spectral decomposition methods. By the results of Anandkumar et al. [AGH12], the second-order moment of the LDA model has a special structure: it is a weighted sum of the outer products of the topic vectors. We show that a spectral decomposition of the empirical second-order moment, with proper thresholding on the singular values, can lead to the correct number of topics. Under mild assumptions, our analysis provides both a lower bound and an upper bound on the number of topics K in the LDA model. To the best of our knowledge, this is the first work to analyze the number of topics with provable guarantees by utilizing results from the tensor decomposition approach.

Our main contributions are:

  • For LDA, we analyze the empirical second-order moment and derive an upper bound on its variance in terms of the corpus statistics, i.e., the number of documents, the length of each document, and the number of unique words. Essentially, our results provide a computable guideline for the convergence of the second-order moment. This contribution is valuable in itself, e.g., for determining the correct down-sampling rate on a large-scale dataset.

  • We analyze the spectral structure of the true second-order moment for LDA. That is, we provide the spectral information on the covariance of the Dirichlet design matrix.

  • Based on the results on the empirical and true second-order moments for LDA, we derive three inequalities regarding the number of topics K, which in turn provide both upper and lower bounds on K in terms of known parameters or constants. We also present a simulation study for our theoretical results.

  • We show that our results and techniques can be generalized to other mixture models. Results on Gaussian mixture models are presented as an example.

The rest of the paper is organized as follows: In Section 2, we present our main results on how to analyze the number of topics in the LDA model. We carry out experiments on synthetic datasets to demonstrate the validity and tightness of our bounds in Section 3. We conclude the paper and show how our methodology generalizes to other mixture models in Section 4.

2 Analyze the Number of Topics in LDA

Latent Dirichlet Allocation (LDA) [BNJ03] is a powerful generative model for topic modeling. It has been applied to a wide variety of applications and also serves as a building block in other powerful models. Most existing methods follow the empirical Bayes approach for parameter estimation [BNJ03, TKW07, GS04, PNI08]. Recently, the method of moments has been explored, leading to a series of interesting work and new insights into the LDA model. It has been shown in [AFH12, AGH12] that the latent topics can be recovered directly from a properly constructed third-order moment (which can be estimated directly from the data) by orthogonal tensor decomposition. Following this line of work, we observe that the low-order moments are also useful for discovering the number of topics in the LDA model. In this section, we investigate the structure of both the empirical and the true second-order moment, and show that they lead to effective bounds on the number of topics.

Notation   Definition
D (d)      Number (index) of documents
N (n)      Number (index) of words in a document
V (i)      Number (index) of unique words
K (k)      Number (index) of latent topics
μ_k        Multinomial parameters for the k-th topic
Φ          Collection of all topics
w_d        Collection of all words in the d-th document
x_{dn}     n-th word in the d-th document
θ_d        Topic mixing for the d-th document
z_{dn}     Topic assignment for word x_{dn}
α          Hyperparameter for the document-topic distribution
β          Hyperparameter for generating topics
Table 1: Notation for LDA

2.1 Notation and Problem Formulation

As introduced in [BNJ03], the full generative process for the d-th document in the LDA model is as follows:

  1. Generate the topic mixing θ_d ∼ Dirichlet(α).

  2. For each word n in document d:

    1. Generate a topic z_{dn} ∼ Multi(θ_d), where Multi(·) denotes the multinomial distribution.

    2. Generate a word x_{dn} ∼ Multi(μ_{z_{dn}}), where μ_{z_{dn}} is the multinomial parameter associated with topic z_{dn}.

The notation is summarized in Table 1. Each word x_{dn} is represented by a natural basis vector, i.e., x_{dn} = e_i means that the n-th word in the d-th document is the i-th word in the dictionary.
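To make the generative process concrete, the following sketch samples a small synthetic corpus from this model with NumPy. The variable names mirror Table 1; the sizes and hyperparameter values are illustrative only and are not the settings used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes, not the paper's experimental settings.
    D, N, V, K = 200, 50, 1000, 5
    alpha, beta = 0.1, 0.01                            # symmetric Dirichlet hyperparameters

    # Topics: each mu_k is a multinomial distribution over the V dictionary words.
    mu = rng.dirichlet(beta * np.ones(V), size=K)      # shape (K, V)

    docs = np.zeros((D, N), dtype=int)                 # word indices x_{dn}
    for d in range(D):
        theta_d = rng.dirichlet(alpha * np.ones(K))    # topic mixing theta_d ~ Dirichlet(alpha)
        z_d = rng.choice(K, size=N, p=theta_d)         # topic assignments z_{dn}
        for n, z in enumerate(z_d):
            docs[d, n] = rng.choice(V, p=mu[z])        # word x_{dn} drawn from topic z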

In [AGH12], the authors proposed a method of moments for learning the LDA model, in which an empirical first-order moment M̂_1 and an empirical second-order moment M̂_2 are constructed directly from the word co-occurrence statistics of the corpus, where the outer product is defined as a ⊗ a = a a^⊤ for any column vector a. We then define the first-order and second-order moments as the expectations of the empirical moments, i.e., M_1 = E[M̂_1] and M_2 = E[M̂_2], respectively. Furthermore, it has been shown that M_2 equals a weighted sum of the outer products of the topic parameters μ_k [AGH12], i.e.,

    M_2 = Σ_{k=1}^K [α_k / (α_0(α_0 + 1))] μ_k μ_k^⊤,   where α_0 = Σ_{k=1}^K α_k.

This implies that the rank of M_2 is exactly the number of topics K. Another interesting observation from this derivation is that since M_2 is the sum of K rank-one matrices and all the topics are linearly independent almost surely under our full generative model, the K-th largest singular value satisfies σ_K(M_2) > 0 while the (K+1)-th largest singular value satisfies σ_{K+1}(M_2) = 0. Therefore, the number of non-zero singular values of M_2 is exactly the number of topics, which provides a direct way to estimate K in the noiseless scenario. However, in practice we only have access to the estimate M̂_2 as an approximation to the true second-order moment M_2. As a result, the rank of M̂_2 may not be K and σ_{K+1}(M̂_2) may be larger than zero. To overcome this obstacle, we need to study (1) the spectral structure of M_2, and (2) the relationship between M_2 and its estimator M̂_2.
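As a numerical illustration of this rank-K structure, the sketch below builds an empirical second-order moment from synthetic LDA data and inspects its singular values. The estimator follows the style of [AGH12], averaging within-document word pairs and subtracting an α_0/(α_0 + 1) correction; it is a sketch under that assumption and may differ in details from the exact estimator analyzed in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    D, N, V, K = 5000, 100, 200, 5                        # illustrative values only
    alpha, beta = 0.1, 0.01
    alpha0 = K * alpha                                    # sum of the symmetric Dirichlet parameters

    mu = rng.dirichlet(beta * np.ones(V), size=K)         # topics, shape (K, V)
    theta = rng.dirichlet(alpha * np.ones(K), size=D)     # topic mixings, shape (D, K)
    # Word counts per document; given theta_d, each word follows the mixture theta_d^T mu.
    counts = np.stack([rng.multinomial(N, theta[d] @ mu) for d in range(D)])   # (D, V)

    # Empirical moments (AGH12-style sketch; the paper's exact estimator may differ).
    M1_hat = counts.sum(axis=0) / (D * N)
    cooc = counts.T @ counts - np.diag(counts.sum(axis=0))    # sum_d (c_d c_d^T - diag(c_d))
    M2_hat = cooc / (D * N * (N - 1)) - (alpha0 / (alpha0 + 1)) * np.outer(M1_hat, M1_hat)

    sv = np.linalg.svd(M2_hat, compute_uv=False)
    print(sv[:K + 2])    # the top K values should stand out from the (K + 1)-th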

2.2 Solution Outline

The second-order moment can be estimated directly from the observations, without inferring the topic mixings or estimating the model parameters. Our idea is that when the sample size becomes large enough, M̂_2 approximates M_2 well. That is, σ_{K+1}(M̂_2) is very close to zero while σ_K(M̂_2) is bounded away from zero. Then, by picking a proper threshold τ satisfying σ_{K+1}(M̂_2) < τ < σ_K(M̂_2), we can obtain the value of K by simply counting the number of singular values of M̂_2 greater than τ. We work along two directions to achieve this goal: (1) examine the convergence rate of the singular values of M̂_2; (2) investigate the relationship between the spectral structure of M_2 and the model parameters. Next we provide the analysis results for both directions.
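In code, the counting rule amounts to the small helper below; tau stands in for a threshold in the admissible range described above, and the computable choices of tau derived in the following subsections are not reproduced here.

    import numpy as np

    def estimate_num_topics(M2_hat: np.ndarray, tau: float) -> int:
        """Count the singular values of the empirical second-order moment above tau.

        tau is assumed to satisfy sigma_{K+1}(M2_hat) < tau < sigma_K(M2_hat).
        """
        sv = np.linalg.svd(M2_hat, compute_uv=False)
        return int(np.sum(sv > tau))

    # Example: K_hat = estimate_num_topics(M2_hat, tau) for some computed M2_hat and threshold tau.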

2.3 Convergence of M̂_2

Without loss of generality, we assume that both the topic mixings θ_d and the topics μ_k are generated from symmetric Dirichlet distributions, i.e., θ_d ∼ Dirichlet(α) and μ_k ∼ Dirichlet(β) with scalar parameters α and β. We also assume that all documents have the same length N for simplicity. Since M̂_2 is an unbiased estimator of M_2 by definition, we can bound the difference between the singular values of M̂_2 and those of M_2 by bounding its variance as follows:

Theorem 2.1.

For the LDA model, with probability at least

, we have

where , represents higher-order terms.

In particular, when , we have

(1)
Proof.

Let ‖·‖_2 and ‖·‖_F denote the spectral and Frobenius norms, respectively, and let λ_i(A) denote the i-th largest eigenvalue of a matrix A. We establish the result through the following chain of inequalities:

Step (i) follows directly from the fact that M_2 is positive semi-definite and M̂_2 is symmetric; the detailed proof is deferred to Lemma A.1 in the Appendix. Steps (ii) and (iii) are well-known results on matrix norms and matrix perturbation theory [HJ], and Lemma 2.2 provides an upper bound on the Frobenius norm of M̂_2 − M_2. Combining these inequalities yields the stated bound. ∎
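Steps (ii) and (iii) rest on Weyl's perturbation inequality for singular values, |σ_i(A + E) − σ_i(A)| ≤ ‖E‖_2 ≤ ‖E‖_F, a standard fact that can also be checked numerically. The sketch below is only an illustration of this inequality on random matrices, not part of the paper's proof.

    import numpy as np

    rng = np.random.default_rng(1)
    V = 200
    A = rng.standard_normal((V, V)); A = A @ A.T               # symmetric PSD "true" moment
    E = 0.01 * rng.standard_normal((V, V)); E = (E + E.T) / 2  # symmetric perturbation

    sv_A = np.linalg.svd(A, compute_uv=False)
    sv_AE = np.linalg.svd(A + E, compute_uv=False)

    gap = np.max(np.abs(sv_AE - sv_A))           # max_i |sigma_i(A + E) - sigma_i(A)|
    spec = np.linalg.norm(E, 2)                  # spectral norm of the perturbation
    frob = np.linalg.norm(E, 'fro')              # Frobenius norm of the perturbation
    assert gap <= spec + 1e-9 <= frob + 1e-9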

Lemma 2.2.

For the LDA model, with probability at least , we have .

Proof.

We first compute the expectation and then use Markov inequality to complete the proof. The square of Frobenius norm is . Since we have , . The expectation of can be calculated as

The remaining task is to calculate the conditional variance of and , which is discussed in Lemma 2.3.

Then by Markov inequality, for any , we have

By setting , with probability at least , we have

Lemma 2.3.

For the LDA model, the following holds

and

for and represents higher-order terms.

We make a few relaxations and introduce notation (keeping the dominant terms and absorbing the rest into the higher-order term to obtain an upper bound on the variance). To be rigorous, we make the following assumptions on the scale of the statistics and parameters: , , , , , , and . The calculation of the variance is provided in Appendix D.

It is interesting to examine the roles of N, D, and V in the bound. The dependence on the document length N weakens as N grows: even if there are only two words in each document, M̂_2 would still converge to M_2 given enough documents; a similar observation is made in [AGH12]. The number of documents D and the dictionary size V have a similar influence on the bound.

To apply the results above, we simply ignore the higher-order terms. However, because these terms grow as N, D, or V decreases, one should pay extra attention when these quantities are far from the asymptotic regime. As shown in our simulation studies, our bound yields convincing results when they are on the scale of hundreds or above, which is commonly the case in real-world applications.

2.4 Spectral Structure of M_2

The spectral structure of M_2 depends on both the topics and the Dirichlet hyperparameters. We use the following theorem to characterize the spectral structure of M_2.

Theorem 2.4.

Assume that , , and and

  • With probability at least we have

    (2)
  • With probability at least , we have

    (3)
Proof.

We have , where is a matrix and is a diagonal matrix. The first singular values of are also the first singular values of . And we have

and

To estimate the singular values of , we need to utilize the fact that . The random variables in the same column of are dependent on each other; thus, powerful results from random matrix theory cannot be applied directly. To decouple the dependency, we design a diagonal matrix whose diagonal elements are drawn independently from . In this way, is a matrix with independent elements, i.e., each element is an i.i.d. random variable following .

We denote each row of as ; then . In order to apply the matrix Chernoff bound [Tro12], we need to bound the spectral norm of , i.e., . Because is a rank-one matrix, we have . By Lemma C.3 (see Appendix) and the union bound, with probability greater than , we have

We also have and . Applying the matrix Chernoff bound to , with probability greater than

we have

And with probability greater than

we have

By definition, for , it follows

Therefore, we have

and

Since and are the maximum and minimum of a set of random variables following , we can bound them by Lemma C.4 with coefficient . Proper choices of the coefficients (provided in Appendix A.1) lead to the conclusions of Theorem 2.4. ∎

With certain assumptions on the model parameters, we can fully utilize the bounds above. If we assume that and , then and . Therefore, σ_K(M_2) decreases rapidly as K increases. This fact makes it increasingly difficult to distinguish the topics with small singular values from noise. Note that σ_1(M_2) also decreases as K increases, but at a slower rate.
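This behavior can be inspected numerically. The sketch below builds the population second-order moment from the [AGH12] formula quoted in Section 2.1 for randomly drawn topics and prints its extreme singular values as K grows; the values of V, α, and β are illustrative and are not the paper's settings.

    import numpy as np

    rng = np.random.default_rng(2)
    V, alpha, beta = 1000, 0.1, 0.01            # illustrative values only

    for K in (5, 10, 20, 40):
        alpha0 = K * alpha
        mu = rng.dirichlet(beta * np.ones(V), size=K)    # topics, shape (K, V)
        # Population second-order moment for a symmetric Dirichlet prior:
        # M2 = sum_k [alpha / (alpha0 * (alpha0 + 1))] mu_k mu_k^T.
        M2 = (alpha / (alpha0 * (alpha0 + 1))) * mu.T @ mu
        sv = np.linalg.svd(M2, compute_uv=False)
        print(f"K={K:3d}  sigma_1={sv[0]:.3e}  sigma_K={sv[K - 1]:.3e}")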

2.5 Analysis of the Number of Topics

Figure 1: Experimental results on synthetic data under the LDA model. Panels (a-c) show the convergence of the empirical second-order moment; panels (d-f) show the bounds on the spectral structure of M_2.

The convergence of M̂_2 and the spectral structure of M_2 provide upper and lower bounds on the singular values of the empirical second-order moment M̂_2. We can infer the number of topics K by the following steps:

First, by setting the threshold τ according to the convergence bound in Theorem 2.1, thresholding provides a lower bound on K, since with high probability every spurious topic has a singular value smaller than τ. (Strictly speaking, there is no one-to-one correspondence between topics and the singular values of the second-order moments; here we refer to the correspondence in terms of the total number of topics.)

Second, if we set the threshold sufficiently small, thresholding provides an upper bound on K, since with high probability every true topic has a singular value greater than the threshold. However, such a threshold is not computable, since it depends on the true number of topics K.

Instead, we can directly utilize the upper bound on σ_1(M_2) to provide an upper bound for K. Theorem 2.4 gives an inequality whose left-hand side, σ_1(M̂_2), is determined by the observed corpus, and whose right-hand side is a function of K. Since the right-hand side decreases as K increases (see the discussion in Section 2.4), solving the inequality leads to an upper bound on K, as sketched below.
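The overall procedure can be summarized as follows. Both tau and sigma1_upper_bound below are placeholders for the computable bounds from Theorem 2.1 and Theorem 2.4, whose explicit expressions are not reproduced in this sketch; the function only illustrates how the lower and upper bounds on K are extracted once those quantities are available.

    import numpy as np
    from typing import Callable, Tuple

    def bound_num_topics(M2_hat: np.ndarray,
                         tau: float,
                         sigma1_upper_bound: Callable[[int], float],
                         K_max: int) -> Tuple[int, int]:
        """Return (K_lower, K_upper) for the number of topics.

        tau                -- threshold from the convergence bound (Theorem 2.1); placeholder.
        sigma1_upper_bound -- K -> upper bound on sigma_1 implied by Theorem 2.4,
                              assumed decreasing in K; placeholder.
        """
        sv = np.linalg.svd(M2_hat, compute_uv=False)
        K_lower = int(np.sum(sv > tau))          # count singular values above the threshold
        sigma1_hat = sv[0]
        K_upper = K_max
        for K in range(1, K_max + 1):            # largest K with sigma_1(M2_hat) <= bound(K)
            if sigma1_hat > sigma1_upper_bound(K):
                K_upper = K - 1
                break
        return K_lower, K_upper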

3 Experimental Results

We validate our theoretical results by conducting experiments on the synthetic datasets generated according to the LDA model. For each experiment setting, we report the results by averaging over five random runs.

In the first set of experiments, we test the convergence of the second-order moment as a function of the corpus statistics. The parameter setting is as follows: , and . We vary the dictionary size V, the document length N, or the number of documents D while keeping the other two fixed. The detailed settings are summarized as follows:

  (a) Fix D and V, vary the length of documents N from to .

  (b) Fix N and V, vary the number of documents D from to .

  (c) Fix N and D, vary the size of the dictionary V from to .

Figure 1 (a-c) shows the matrix norms of M̂_2 − M_2 and the K-th and (K+1)-th largest singular values of M̂_2. The results match our theoretical analysis nicely in that the bound of Theorem 2.1 serves as an accurate upper bound on the Frobenius norm of M̂_2 − M_2. When the amount of data is large enough, the red line goes below the purple line, which indicates that with enough data, thresholding with τ provides a tight lower bound on the number of topics.

In the second experiment, we evaluate our bounds on the spectral structure of M_2 from Theorem 2.4. Similarly, we vary one parameter at a time while keeping the others fixed. The detailed settings are as follows:

  (a) Fix , , and , vary from to .

  (b) Fix , , and , vary the number of topics K from to .

  (c) Fix , , and , vary the size of the dictionary V from to .

The results in Figure 1 (d-f) match well with our theoretical analysis.

In the last experiment, we calculate the upper and lower bounds on K when varying the number of documents D or the length of documents N. The results are presented in Figure 2. As we can see, the lower bound indeed converges to the true number of topics. However, the upper bound converges to a value other than the ground truth, partly because the upper bound involves both the spectral bound of Theorem 2.4 and the convergence bound, and the former does not change as the size of the dataset increases. The experimental results demonstrate that our upper and lower bounds can effectively narrow down the range of possible K.

Figure 2: The upper and lower bounds on the number of topics K for LDA, based on the discussion in Section 2.5.

4 Discussion and Conclusions

So far we have shown that for the LDA model, by investigating the convergence of the empirical moment M̂_2 and the spectral structure of the expected moment M_2, the singular values of the empirical moment provide useful information on the number of topics. This line of research provides an interesting direction for analyzing mixture models in general [HK13]. Next we show how to generalize our methodology with an example of Gaussian Mixture Models (GMM).

4.1 Generalization

Our analysis can be easily generalized to other mixture models whose low-order moments have the same structure, namely a weighted sum of the outer products of the mixture components. Convergence analysis of the empirical moment leads to the lower bound on the number of mixture components, while solving an inequality on the first singular value provides an upper bound. In order to derive the convergence bound, the variance of the empirical moment needs to be computed. Moreover, we need to explore the spectral structure of the true moment to provide upper and lower bounds on the first and the K-th singular values, respectively.

As an example, we next show how to conduct the analysis on the Gaussian Mixture Model [Bis06] with spherical mixture components.

GMM assumes that the data points are generated from a mixture of multivariate Gaussian components. That is, for a dataset generated from a spherical Gaussian mixture with K components, we assume that

where is the mixture probability, is the component assignment for the i-th data point, and is a -dimensional spherical Gaussian distribution with . We further assume and for a Bayesian version of GMM. Note that we assume that the following parameters are known: .

The problem of how to correctly choose the number of mixture components has been extensively studied, e.g., via traditional methods (cross validation, AIC, and BIC [LV10]), penalized likelihood methods [THK13], and variational approaches [CB01]. Similar to the LDA model, we show that analyzing the empirical moments provides an alternative approach to bound the number of mixture components.

We define the empirical second-order moment as and the second-order moment as its expectation, namely . Then, by a similar analysis, we have the following theorem for GMM:

Theorem 4.1.

Let , then

  (1) Let be the number of singular values of such that , where

    then with probability at least , we have

  (2) Let be the maximal integer such that

    Then with probability at least , we have

The proof of Theorem 4.1 is similar to that of Theorem 2.4; the detailed proof is given in Appendix B due to space limitations. As our purpose is to demonstrate the methodology, we omit comparisons with the excellent existing works on GMM, such as [SR09].
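To illustrate the analogous moment structure numerically, the sketch below uses the standard identity for spherical mixtures, E[x x^⊤] = Σ_k w_k μ_k μ_k^⊤ + σ² I (as in [HK13]): subtracting σ² I leaves a matrix whose expectation has rank K, so its leading singular values again reveal the number of components. The symbols w_k, μ_k, σ and all parameter values are illustrative assumptions, and the paper's exact empirical moment for the Bayesian GMM may differ.

    import numpy as np

    rng = np.random.default_rng(3)
    n, dim, K, sigma = 20000, 50, 4, 1.0             # illustrative values only

    w = np.full(K, 1.0 / K)                          # mixture probabilities w_k
    centers = rng.normal(0.0, 3.0, size=(K, dim))    # component means mu_k

    z = rng.choice(K, size=n, p=w)                   # component assignments
    X = centers[z] + sigma * rng.standard_normal((n, dim))

    # Empirical second-order moment with the known variance sigma^2 subtracted,
    # so its expectation is sum_k w_k mu_k mu_k^T, a rank-K matrix.
    M2_hat = X.T @ X / n - sigma**2 * np.eye(dim)

    sv = np.linalg.svd(M2_hat, compute_uv=False)
    print(sv[:K + 2])    # the top K singular values should dominate the rest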

4.2 Conclusion

In this paper, we provide theoretical analysis for model selection in LDA. Specifically, we present both an upper bound and a lower bound on the number of topics based on the connection between the second-order moment and the latent topics. The lower bound is obtained by bounding the difference between the estimated second-order moment M̂_2 and the true moment M_2, while the upper bound is obtained by analyzing the largest singular value of M_2. Furthermore, our analysis can be easily generalized to other latent models, such as Gaussian mixture models.

One major limitation of our approach is that all of our analysis assumes that the data are generated exactly according to the LDA model. As a result, the results may not hold when applied to real-world datasets.

For future work, we will examine effective ways to improve the theoretical results, for example by bounding higher-order moments of the estimation error or by replacing Markov's inequality with tighter concentration inequalities. Moreover, we could bound the spectral norm of M̂_2 − M_2 directly instead of its Frobenius norm, which would potentially yield tighter bounds.

5 Acknowledgment

We thank Fei Sha and David Kale for helpful discussions and suggestions. The research was sponsored by NSF research grant IIS-1254206 and the U.S. Defense Advanced Research Projects Agency (DARPA) under the Social Media in Strategic Communication (SMISC) program, Agreement Number W911NF-12-1-0034. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the U.S. Government.

References

  • [AEF10] Edoardo M. Airoldi, Elena A. Erosheva, Stephen E. Fienberg, Cyrille Joutard, Tanzy Love, and Suyash Shringarpure. Reconceptualizing the classification of PNAS articles. Proceedings of the National Academy of Sciences, 107(49):20899–20904, 2010.
  • [AFH12] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, pages 926–934, 2012.
  • [AGH12] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012.
  • [AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In FOCS, pages 1–10, 2012.
  • [Aka74] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
  • [Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
  • [BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
  • [CB01] Adrian Corduneanu and Christopher M. Bishop. Variational Bayesian model selection for mixture distributions. In AISTATS, pages 27–34, 2001.
  • [GS04] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, April 2004.
  • [HBWP13] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. JMLR, 14(1):1303–1347, 2013.
  • [HJ] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
  • [HK13] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In ITCS, 2013.
  • [KRS14] Alex Kulesza, N Raj Rao, and Satinder Singh. Low-rank spectral learning. In ICML, 2014.
  • [LM00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
  • [LV10] Olga Lukociene and Jeroen K. Vermunt. Determining the number of components in mixture models for hierarchical data. In Advances in Data Analysis, Data Handling and Business Intelligence, pages 241–249. 2010.
  • [MH13] Jeffrey W. Miller and Matthew T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In NIPS, pages 199–206, 2013.
  • [PNI08] Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD, pages 569–577, 2008.
  • [S78] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
  • [SR09] Russell J. Steele and Adrian E. Raftery. Performance of Bayesian model selection criteria for Gaussian mixture models. Technical Report 559, Department of Statistics, University of Washington, 2009.
  • [Tad12] Matt Taddy. On estimation and selection for topic models. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), pages 1184–1193, 2012.
  • [THK13] Tao Huang, Heng Peng, and Kun Zhang. Model selection for Gaussian mixture models. arXiv preprint arXiv:1301.3558, 2013.
  • [TJBB06] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
  • [TKW07] Yee Whye Teh, Kenichi Kurihara, and Max Welling. Collapsed variational inference for HDP. In NIPS, 2007.
  • [TMN14] Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. Understanding the limiting factors of topic modeling via posterior contraction analysis. In ICML, 2014.
  • [Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012.
  • [Ver10] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
  • [WPB11] Chong Wang, John W Paisley, and David M Blei. Online variational inference for the hierarchical dirichlet process. In AISTATS, pages 752–760, 2011.

Appendix A Theoretical results for LDA

A.1 Coefficient Setting for Theorem 2.4

Bound of

We have that with probability greater than

we have

We can choose and as follows to simplify the formula of the bound

  • Choose , first probability term is less than .

  • Choose , third probability term is less than .

  • Choose as

    second probability term is less than .

As a result, with probability greater than , we have

As an alternative, we can choose and as follows to simplify the formula of the bound

  • Choose , first probability term is less than .

  • Choose , third probability term is less than .

  • Choose , second probability term is less than .

As a result, with probability greater than

we have

Bound of

We have that with probability greater than

we have

We can choose and as follows to simplify the formula of the bound

  • Choose , first probability term is less than .

  • Choose , third probability term is less than .

  • Choose as

    second probability term is less than .

As a result, with probability greater than , we have

As an alternative, we can choose and as follows to simplify the formula of the bound

  • Choose , third probability term is less than .

  • Choose , first probability term is less than .

  • Choose , second probability term is less than .

As a result, with probability greater than

we have

A.2 Lemma for Theorem 2.1

Lemma A.1.

With M̂_2 and M_2 as previously defined, we have that

Proof.

Because M_2 is a symmetric positive semidefinite matrix, we have

and because M̂_2 is a symmetric matrix, we have

for some permutation .

Because , we have .

Let be the smallest index such that ; then for , we have

By the fact that